LLM Inference

Prefill vs generation: prefill is compute bound, token-by-token generation (decode) is memory bandwidth bound
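
A rough back-of-the-envelope sketch (hypothetical 7B model, weights only, ~2 FLOPs per parameter per token, ignoring attention and KV-cache traffic) of why that is:

    # Rough arithmetic-intensity sketch: hypothetical 7B-parameter model, ignoring
    # attention FLOPs and KV-cache traffic; numbers are illustrative only.
    params = 7e9
    bytes_per_param = 2          # fp16 / bf16 weights
    weight_bytes = params * bytes_per_param

    def flops(tokens):
        # ~2 FLOPs per parameter per token for the matmuls
        return 2 * params * tokens

    for tokens, phase in [(2048, "prefill"), (1, "decode step")]:
        # every forward pass has to stream all weights from HBM at least once
        intensity = flops(tokens) / weight_bytes   # FLOPs per byte moved
        print(f"{phase:12s}: ~{intensity:6.1f} FLOPs/byte")
    # prefill amortises the weight reads over many tokens (compute bound);
    # a single decode step does ~1 FLOP/byte and is limited by memory bandwidth.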

KV Cache

Grouped Query Attention

Fewer key and value heads than query heads, to reduce the size of the KV cache
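
A minimal numpy sketch with made-up sizes: several query heads share one key/value head, so the cache only has to store n_kv_heads heads instead of n_heads.

    import numpy as np

    # Minimal grouped-query attention sketch (made-up sizes, single new token).
    n_heads, n_kv_heads, d_head, seq = 8, 2, 64, 16   # 4 query heads per KV head
    group = n_heads // n_kv_heads

    rng = np.random.default_rng(0)
    q = rng.standard_normal((n_heads, 1, d_head))       # queries for the new token
    k = rng.standard_normal((n_kv_heads, seq, d_head))  # cached keys (only 2 heads)
    v = rng.standard_normal((n_kv_heads, seq, d_head))  # cached values (only 2 heads)

    # Each query head attends to the KV head of its group: head h -> kv head h // group
    k_full = np.repeat(k, group, axis=0)   # (n_heads, seq, d_head), each KV head duplicated
    v_full = np.repeat(v, group, axis=0)   # for the query heads in its group

    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)      # (n_heads, 1, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                 # softmax over seq
    out = weights @ v_full                                         # (n_heads, 1, d_head)

    # The KV cache stores n_kv_heads=2 heads instead of n_heads=8: a 4x reduction.
    print(out.shape, k.nbytes + v.nbytes, "bytes of KV vs", group * (k.nbytes + v.nbytes))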

Paged Attention

Avoid padding each sample's KV cache out to a fixed maximum length; use OS-style paging instead, splitting each sample's cache across small fixed-size pages
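
A toy sketch of the bookkeeping idea (not vLLM's actual implementation): the KV cache lives in fixed-size physical pages, each sequence keeps a block table mapping its logical positions to pages, and pages are allocated on demand instead of reserving a contiguous max-length buffer per request.

    # Toy paged KV-cache bookkeeping (not vLLM's real implementation).
    BLOCK_SIZE = 16          # tokens per page

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))   # physical page pool
            self.block_tables = {}                        # seq_id -> [physical block ids]
            self.lengths = {}                             # seq_id -> tokens written

        def append_token(self, seq_id):
            """Reserve space for one more token of seq_id, allocating a page if needed."""
            table = self.block_tables.setdefault(seq_id, [])
            length = self.lengths.get(seq_id, 0)
            if length % BLOCK_SIZE == 0:                  # current page full (or none yet)
                table.append(self.free_blocks.pop())      # grab a free physical page
            self.lengths[seq_id] = length + 1
            block, offset = table[length // BLOCK_SIZE], length % BLOCK_SIZE
            return block, offset                          # where to write this token's K/V

        def free(self, seq_id):
            """Return all of a finished sequence's pages to the pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.lengths.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=64)
    for _ in range(40):                    # a 40-token sequence uses ceil(40/16) = 3 pages
        cache.append_token("req-0")
    print(cache.block_tables["req-0"])     # 3 physical pages, e.g. [63, 62, 61]
    cache.free("req-0")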

Sliding Window Attention / Rolling Buffer KV Cache
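
With a window of size W only the last W tokens' keys and values are ever attended to, so the cache can be a fixed-size ring buffer indexed modulo W. A minimal sketch (made-up class and sizes):

    import numpy as np

    # Minimal rolling-buffer KV cache for sliding-window attention (window W).
    class RollingKVCache:
        def __init__(self, window, d_head):
            self.window = window
            self.k = np.zeros((window, d_head))   # ring buffers, fixed size no matter
            self.v = np.zeros((window, d_head))   # how long the sequence grows
            self.pos = 0                          # total tokens seen so far

        def append(self, k_t, v_t):
            slot = self.pos % self.window         # overwrite the oldest entry
            self.k[slot], self.v[slot] = k_t, v_t
            self.pos += 1

        def get(self):
            """K/V for the last min(pos, window) tokens. Buffer order is rotated,
            which is fine for attention as long as positions were already encoded
            (e.g. RoPE) before caching: the softmax-weighted sum is order-invariant."""
            n = min(self.pos, self.window)
            return self.k[:n], self.v[:n]

    cache = RollingKVCache(window=4, d_head=8)
    for t in range(10):                            # 10 tokens, but memory stays at 4 slots
        cache.append(np.full(8, t), np.full(8, t))
    k, v = cache.get()
    print(k.shape, k[:, 0])                        # (4, 8) holding tokens 8, 9, 6, 7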

Cross Layer KV Cache Sharing

[2405.05254] You Only Cache Once: Decoder-Decoder Architectures for Language Models

StreamingLLM

See Long Context Transformers

Prompt / Prefix Caching

Reuse the KV cache for common prompt prefixes (e.g. shared system prompts) across requests
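
A toy sketch of the matching idea (the cache keys, helper names, and token values are made up; real engines match at KV-block granularity and reference-count the blocks):

    # Toy prompt/prefix cache: reuse computed KV states for a shared prompt prefix.
    prefix_cache = {}   # tuple(prompt tokens) -> opaque KV state

    def run_prefill(tokens):
        return {"kv_for": tuple(tokens)}   # stand-in for the real prefill pass

    def longest_cached_prefix(tokens):
        """Longest shared prefix (in tokens) between `tokens` and any cached prompt."""
        best = 0
        for cached in prefix_cache:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def prefill_with_cache(tokens):
        hit = longest_cached_prefix(tokens)
        print(f"reuse KV for {hit} tokens, prefill the remaining {len(tokens) - hit}")
        kv = run_prefill(tokens)           # a real engine only computes the uncached suffix
        prefix_cache[tuple(tokens)] = kv
        return kv

    system_prompt = list(range(100))               # e.g. a shared system prompt
    prefill_with_cache(system_prompt + [1, 2, 3])  # reuse 0, prefill 103
    prefill_with_cache(system_prompt + [4, 5])     # reuse 100, prefill only 2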

Low Precision and Compression

Use bf16, float8, int8, or smaller dtypes like int4 to save memory and memory bandwidth
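
A minimal sketch of symmetric absmax int8 quantization with a per-tensor scale; real schemes are usually per-channel or per-group, and int4 additionally needs bit packing.

    import numpy as np

    # Minimal symmetric absmax int8 quantization of a weight matrix (per-tensor scale).
    w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)

    scale = np.abs(w).max() / 127.0                 # map the largest magnitude to 127
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dequant = w_int8.astype(np.float32) * scale   # used at matmul time

    print(f"memory: {w.nbytes/2**20:.0f} MiB fp32 -> {w_int8.nbytes/2**20:.0f} MiB int8")
    print(f"max abs error: {np.abs(w - w_dequant).max():.4f}")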

Other

Speculative Decoding
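
A greedy-only sketch of the idea, with hypothetical draft_next / target_argmax stand-ins for the two models: the small draft proposes k tokens, the large target checks them all in one forward pass, and the longest agreeing prefix is accepted. The real algorithm uses rejection sampling so the output distribution matches the target model exactly.

    # Greedy speculative decoding sketch. `draft_next` and `target_argmax` are
    # hypothetical stand-ins for a small draft model and the large target model.

    def draft_next(tokens):            # cheap draft model: next-token guess
        return (tokens[-1] + 1) % 50   # toy rule standing in for a small LM

    def target_argmax(tokens):         # expensive target model's greedy choice
        return (tokens[-1] + 1) % 50   # toy rule; happens to agree with the draft

    def speculative_step(tokens, k=4):
        # 1. draft proposes k tokens autoregressively (cheap)
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. target scores all k positions in one batched forward pass (a loop here
        #    stands in for that single pass); accept the longest agreeing prefix
        accepted, ctx = [], list(tokens)
        for t in proposal:
            if target_argmax(ctx) != t:
                accepted.append(target_argmax(ctx))   # take the target's token, then stop
                break
            accepted.append(t)
            ctx.append(t)
        return accepted

    print(speculative_step([0], k=4))   # up to 4 tokens per target forward pass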

Chunked Prefill (for batched inference)

Split the prefill compute into chunks to reduce the unit of work, so that large prefills do not interfere with other requests doing generation in the same batch
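
A scheduling sketch under assumed names and budget (not a real engine's API): each engine step has a token budget, a long prefill is split into chunks that fit the budget, and the rest of the budget goes to decode tokens from other requests.

    # Chunked-prefill scheduling sketch (names and budget are assumptions, not a real API).
    TOKEN_BUDGET = 512          # max tokens processed per engine step

    def schedule_step(prefill_queue, decode_requests):
        """Pick work for one engine step: a slice of pending prefill plus decode tokens."""
        budget = max(0, TOKEN_BUDGET - len(decode_requests))  # decodes cost ~1 token each
        batch = [("decode", r) for r in decode_requests]
        if prefill_queue and budget:
            req_id, remaining_tokens = prefill_queue[0]
            chunk = min(len(remaining_tokens), budget)   # only a chunk of the long prefill
            batch.append(("prefill", req_id, remaining_tokens[:chunk]))
            if chunk == len(remaining_tokens):
                prefill_queue.pop(0)                     # prefill finished
            else:
                prefill_queue[0] = (req_id, remaining_tokens[chunk:])
        return batch

    # A 2000-token prompt is spread over several steps instead of stalling the
    # 8 decoding requests for one huge step.
    prefills = [("req-A", list(range(2000)))]
    decodes = [f"req-{i}" for i in range(8)]
    step = 0
    while prefills:
        schedule_step(prefills, decodes)
        step += 1
    print(f"prefill of req-A finished after {step} steps, decodes ran every step")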

Disaggregated Prefill and Decoding

Do prefill on larger GPUs, then decode on smaller, cheaper GPUs (prefill is compute bound, decode is memory bound)

Prefix Caching

RadixAttention

Keep cached prefixes in a radix tree so requests that share a prefix reuse its KV cache (from SGLang)

Continuous Batching / Inflight Batching
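
A toy event-loop sketch: finished sequences free their batch slots immediately and waiting requests join at the next step, instead of the whole batch waiting for its slowest member.

    import random

    # Toy continuous / in-flight batching loop: requests join and leave the running
    # batch every step instead of waiting for a whole static batch to finish.
    random.seed(0)
    MAX_BATCH = 4
    waiting = [{"id": i, "remaining": random.randint(2, 6)} for i in range(8)]
    running = []
    step = 0

    while waiting or running:
        # admit new requests into any free batch slots (in-flight insertion)
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.pop(0))
        # one decode step for every running request
        for req in running:
            req["remaining"] -= 1
        finished = [r["id"] for r in running if r["remaining"] == 0]
        running = [r for r in running if r["remaining"] > 0]   # free their slots now
        step += 1
        if finished:
            print(f"step {step}: finished {finished}, batch size now {len(running)}")
    print(f"all 8 requests served in {step} steps")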

Flash Attention

Flash Decoding

Mixture of Experts

Reduce compute by routing each token to only a few experts and skipping the rest (see the routing sketch below)

See Mixture of Experts
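
A minimal top-k routing sketch (random weights, made-up sizes): the router picks k of E experts per token, so only k expert MLPs run per token instead of all E.

    import numpy as np

    # Minimal top-k mixture-of-experts routing sketch (random weights, made-up sizes).
    rng = np.random.default_rng(0)
    d_model, n_experts, top_k, n_tokens = 64, 8, 2, 5

    router_w = rng.standard_normal((d_model, n_experts))
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # toy expert MLPs
    x = rng.standard_normal((n_tokens, d_model))

    logits = x @ router_w                                      # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]              # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(n_tokens):
        gate = logits[t, top[t]]
        gate = np.exp(gate - gate.max()); gate /= gate.sum()   # softmax over the chosen k
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ experts[e])                  # only k of the 8 experts run

    print("experts used per token:", top.tolist())             # 2 of 8 -> ~4x less expert compute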

Quantization

Structured Decoding / Generation
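
A toy sketch of the mechanism with a made-up vocabulary and constraint: logits for tokens that would violate the target structure are masked to -inf before sampling, so the output is guaranteed to be valid. Real libraries compile a JSON schema or grammar into a per-step token mask or state machine.

    import numpy as np

    # Toy constrained (structured) decoding: mask logits so only tokens allowed by
    # the target structure can be chosen. Vocabulary and constraint are made up.
    vocab = ["yes", "no", "maybe", "{", "}", '"', "hello"]

    def allowed_tokens(generated):
        # constraint: the answer must be exactly one token, either "yes" or "no"
        return {"yes", "no"} if not generated else set()

    def constrained_pick(logits, generated):
        mask = np.full_like(logits, -np.inf)
        allowed = [i for i, tok in enumerate(vocab) if tok in allowed_tokens(generated)]
        mask[allowed] = 0.0
        masked = logits + mask
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()                 # disallowed tokens now have probability 0
        return int(np.argmax(probs))         # greedy here; sampling works the same way

    logits = np.array([0.1, 0.2, 3.0, 0.0, 0.0, 0.0, 2.5])      # model prefers "maybe"
    print(vocab[constrained_pick(logits, generated=[])])        # only "yes"/"no" allowed -> "no"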

LLM Routing

Predict the best model (by speed, cost, or domain expertise) for a given request and route it accordingly

Metrics

Time to First Token

Inter Token Latency
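
A small sketch of computing both from per-token arrival timestamps (hypothetical numbers):

    # Compute time-to-first-token (TTFT) and inter-token latency (ITL) from the
    # timestamps at which each output token arrived (hypothetical numbers, seconds).
    request_sent = 0.00
    token_times = [0.42, 0.47, 0.53, 0.58, 0.64]   # arrival time of each generated token

    ttft = token_times[0] - request_sent                          # dominated by prefill
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # per-token decode gaps
    mean_itl = sum(itls) / len(itls)
    print(f"TTFT: {ttft*1000:.0f} ms")
    print(f"mean ITL: {mean_itl*1000:.0f} ms  ->  {1/mean_itl:.0f} tok/s decode")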