LLM Inference

Server Inference

Prefill vs generation: prefill processes the whole prompt in parallel and is compute bound; generation emits one token at a time and is memory-bandwidth bound
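
A minimal sketch of the two phases, assuming a hypothetical `model` callable that takes token ids plus an optional KV cache and returns `(logits, kv_cache)`; not any real library's API.

```python
def generate(model, prompt_ids, max_new_tokens):
    # Prefill: one forward pass over the whole prompt. All prompt tokens are
    # processed in parallel -> large matmuls -> compute bound.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_token = int(logits[-1].argmax())
    output = [next_token]

    # Generation (decode): one token per step, reusing the KV cache. Each step
    # reads the whole cache but does little math -> memory-bandwidth bound.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output
```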

KV Cache

Grouped Query Attention

Fewer key/value heads than query heads, shrinking the KV cache
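
A toy PyTorch example: 8 query heads share 2 KV heads, so only a quarter of the K/V tensors need to be cached compared to full multi-head attention (shapes are illustrative).

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2          # grouping factor = 4

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # only these get cached
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head so it serves its group of consecutive query heads.
group = n_q_heads // n_kv_heads
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # (1, 8, 16, 64)
```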

Paged Attention

Avoid reserving a contiguous, padded-out KV buffer per sample; use OS-style paging instead, splitting each sequence's KV cache across small fixed-size pages
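
A minimal sketch of a paged KV-cache allocator with a per-request block table; class and method names are illustrative, not vLLM's actual API.

```python
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request_id -> list of physical page ids

    def append_token(self, request_id, num_tokens_so_far):
        table = self.block_tables.setdefault(request_id, [])
        # Allocate a new page only when the current one is full; no padding to
        # the longest sequence in the batch, and pages need not be contiguous.
        if num_tokens_so_far % PAGE_SIZE == 0:
            table.append(self.free_pages.pop())
        page = table[num_tokens_so_far // PAGE_SIZE]
        slot = num_tokens_so_far % PAGE_SIZE
        return page, slot  # where this token's K/V vectors are written

    def free(self, request_id):
        self.free_pages.extend(self.block_tables.pop(request_id, []))
```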

Sliding Window Attention / Rolling Buffer KV Cache

Cross Layer KV Cache Sharing

[2405.05254] You Only Cache Once: Decoder-Decoder Architectures for Language Models

StreamingLLM

See Long Context Transformers

Prefix Caching / Prompt Caching

Reuse the KV cache for common prompt prefixes, e.g. a shared system prompt (sketch below)

Radix Cache

Auto Prefix Caching

Cache Aware Request Routing
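
A minimal sketch of the lookup behind prefix caching: keep KV states keyed by token prefix and reuse the longest stored prefix of a new prompt, so only the remaining suffix needs a prefill pass (illustrative, not any serving engine's real API). A radix tree over token ids, as in RadixAttention, makes this longest-prefix match efficient and lets overlapping prefixes share storage.

```python
class PrefixCache:
    def __init__(self):
        self.cache = {}  # tuple of token ids -> cached KV state for that prefix

    def store(self, token_ids, kv_state):
        self.cache[tuple(token_ids)] = kv_state

    def longest_match(self, token_ids):
        # Return the cached KV for the longest stored prefix of this prompt;
        # only the remaining suffix then needs to be prefilled.
        for end in range(len(token_ids), 0, -1):
            hit = self.cache.get(tuple(token_ids[:end]))
            if hit is not None:
                return end, hit
        return 0, None
```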

Low Precision and Compression

Use bf16, float8, int8 or smaller dtypes like int4 to save memory
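
A minimal sketch of symmetric per-tensor int8 quantization in PyTorch, applicable to weights or KV-cache entries; real systems usually use per-channel or per-group scales.

```python
import torch

def quantize_int8(x):
    scale = x.abs().max() / 127.0          # one scale per tensor (per-channel is common too)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale     # approximate reconstruction at compute time

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print((dequantize(q, scale) - w).abs().max())  # small error, 2-4x less memory than bf16/fp32
```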

Other

Speculative Decoding

Chunked Prefill (for batched inference)

Split the prefill compute into chunks to reduce the size of each unit of work, so that long prefills do not stall other requests doing generation in the same batch
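
A minimal sketch, reusing the hypothetical `model(tokens, kv_cache)` callable from above: the prompt is fed in fixed-size chunks so each scheduler step stays small enough to batch alongside other requests' decode steps (`CHUNK` is an illustrative knob).

```python
CHUNK = 512

def chunked_prefill(model, prompt_ids):
    kv_cache = None
    for start in range(0, len(prompt_ids), CHUNK):
        chunk = prompt_ids[start:start + CHUNK]
        # Each call is one unit of work; decode tokens from other requests
        # can be scheduled in the same batch between (or alongside) chunks.
        logits, kv_cache = model(chunk, kv_cache=kv_cache)
    return logits, kv_cache
```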

Disaggregated Prefill and Decoding

Run prefill on larger, compute-heavy GPUs, then hand the KV cache to smaller, cheaper GPUs for decoding
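
A minimal sketch with hypothetical `prefill_model` / `decode_model` callables standing in for workers on the two GPU tiers; in a real deployment the KV cache is shipped between machines (e.g. over NVLink or RDMA) rather than passed as a Python object.

```python
def serve_request(prefill_model, decode_model, prompt_ids, max_new_tokens):
    # Prefill tier: large, compute-heavy GPUs.
    logits, kv_cache = prefill_model(prompt_ids, kv_cache=None)
    token = int(logits[-1].argmax())
    output = [token]
    # KV cache transfer to the decode tier would happen here.
    # Decode tier: smaller GPUs that mainly need memory bandwidth.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = decode_model([token], kv_cache=kv_cache)
        token = int(logits[-1].argmax())
        output.append(token)
    return output
```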

RadixAttention

Continuous Batching / Inflight Batching

Ring Attention

Flash Attention

Flash Decoding

Mixture of Experts

Reduce compute by routing each token through only a few experts and skipping the rest (sketch below)

See Mixture of Experts
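
A toy sketch of top-k expert routing, assuming `experts` is a list of small PyTorch MLP modules and `router` is a `torch.nn.Linear(d_model, num_experts)`; not any specific library's MoE implementation. With 8 experts and k=2, each token pays for 2 expert MLPs instead of 8.

```python
import torch

def moe_layer(x, experts, router, k=2):
    # x: (tokens, d_model)
    scores = router(x)                                  # (tokens, num_experts)
    weights, idx = scores.softmax(-1).topk(k, dim=-1)   # pick k experts per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = (idx == e)                                # tokens that chose expert e
        token_ids, slot = mask.nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
    return out
```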

Multi LoRA Serving

Quantization

Structured Decoding / Generation

Jump Forward Decoding for Known Structured Tokens


LLM Routing

Predict the best model for a given request (trading off speed, cost, and expertise) and route the request to it
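
A minimal sketch, assuming a hypothetical `difficulty_model` callable (e.g. a small classifier or a length/keyword heuristic) that scores how hard a request is; the model names and the 0.5 threshold are illustrative.

```python
MODELS = [
    {"name": "small-fast", "cost": 1},
    {"name": "large-accurate", "cost": 10},
]

def route(request_text, difficulty_model):
    # Send easy requests to the cheap model, hard ones to the expensive model.
    difficulty = difficulty_model(request_text)  # expected in [0, 1]
    return MODELS[1]["name"] if difficulty > 0.5 else MODELS[0]["name"]
```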


Metrics

Time to First Token

Inter Token Latency
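
A minimal sketch of computing both metrics from a request's start time and per-token arrival timestamps (in seconds).

```python
def latency_metrics(request_start, token_times):
    ttft = token_times[0] - request_start                  # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0           # mean Inter Token Latency
    return ttft, itl
```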