michal.i/o

❯

❯

❯

Decoder Transformer Inference

Decoder Transformer Inference

Dec 22, 20243 min read

inference
llm

Transformer Inference Arithmetic | kipply’s blog

Prefill vs generation, compute bound vs memory bound

KV Cache

Grouped Query Attention

Less Keys and Values, to reduce Key Value Cache

Paged Attention

Fast LLM Serving with vLLM and PagedAttention - YouTube

Avoid padding out samples with fixed size blocks, use proper paging, split up samples across smaller pages

Sliding Window Attention ⇒ Rolling Buffer KV Cache

Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral - YouTube
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models - YouTube

Cross Layer KV Cache Sharing

[2405.12981] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

[2405.05254] You Only Cache Once: Decoder-Decoder Architectures for Language Models

[2410.17897v3] Value Residual Learning For Alleviating Attention Concentration In Transformers

StreamingLLM

See Long Context Transformers

Prompt / Prefix Caching

Cache KV cache for common prefixes

Low Precision and Compression

Use bf16, float8, int8 or smaller dtypes like int4 to save memory

GitHub - NVIDIA/kvpress: LLM KV cache compression made easy

Other

[2410.03960] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Speculative Decoding

Dynamic Speculative Decoding
How Speculative Decoding Boosts vLLM Performance by up to 2.8x | vLLM Blog
[2403.09919] Recurrent Drafter for Fast Speculative Decoding in Large Language Models

Chunked Prefill (for batched inference)

Split up the prefill compute into chunks to reduce the units of work so that large prefills do not interfere with other request doing generation in the same batch

[2403.02310] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Disaggregated Prefill and Decoding

Do prefill on larger GPU then decode on smaller cheaper GPUs

[2401.09670] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference - YouTube

Prefix Caching

RadixAttention

Continuous Batching / Inflight Batching

Accelerating LLM Inference with vLLM - YouTube

Flash Attention

Flash Decoding

Flash-Decoding for long-context inference | PyTorch

Mixture of Experts

Reduce compute, skipping experts

See Mixture of Experts

Quantization

Structured Decoding / Generation

[2412.10418v1] Constrained Decoding with Speculative Lookaheads

LLM Routing

Predict best (speed / cost / expert) model for given request and route based on that

Metrics

Time to First Token

Inter Token Latency

Links

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA - YouTube
Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
Optimizing AI Inference at Character.AI
[2402.16363] LLM Inference Unveiled: Survey and Roofline Model Insights
[2407.12391] LLM Inference Serving: Survey of Recent Advances and Opportunities
Optimizing AI Inference at Character.AI (Part Deux)

KV Cache
Grouped Query Attention
Paged Attention
Sliding Window Attention <span>⇒</span> Rolling Buffer KV Cache
Cross Layer KV Cache Sharing
StreamingLLM
Prompt / Prefix Caching
Low Precision and Compression
Other
Speculative Decoding
Chunked Prefill (for batched inference)
Disaggregated Prefill and Decoding
Prefix Caching
RadixAttention
Continuous Batching / Inflight Batching
Flash Attention
Flash Decoding
Mixture of Experts
Quantization
Structured Decoding / Generation
LLM Routing
Metrics
Time to First Token
Inter Token Latency
Links

Backlinks

No backlinks found

Graph View

Created with Quartz v4.4.0 © 2024