Codebook KV Cache

January 25, 2025 1 min read

learn a codebook to compress KV cache vectors, so each cached vector is stored as a code index rather than as floats (potentially share the codebook with the token embedding table, with extra “latent” tokens)
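
A minimal sketch of the basic idea, assuming plain k-means as the way to learn the codebook (the note does not say how it would be trained). The data is random and stands in for cached key vectors; `kmeans`, `nearest`, and all sizes are illustrative choices, not part of the original idea.

```python
import numpy as np

def nearest(x, centroids):
    """Index of the closest centroid (squared L2) for each row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, nearest-centroid index per row)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)

rng = np.random.default_rng(0)
# Toy stand-in for cached key vectors (assumption: 512 entries, head dim 32).
keys = rng.standard_normal((512, 32)).astype(np.float32)

codebook, assign = kmeans(keys, k=64)
codes = assign.astype(np.uint8)   # 64 codebook entries fit in one byte per vector
approx_keys = codebook[codes]     # decode: look the vectors back up

print("raw bytes:", keys.nbytes, "compressed bytes:", codes.nbytes + codebook.nbytes)
```

The cache itself holds only `codes`; the codebook is a fixed cost that does not grow with sequence length.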
use LOPQ-style (Locally Optimized Product Quantization) coarse and fine-grained encoding
- keep a mapping from KV cache entries to coarse codebook buckets, then at query time pick the buckets closest to the query and feed only their contents into a limited “nearest neighbor” attention
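
The bucket lookup above can be sketched as follows. This is a rough illustration under several assumptions: buckets are ranked by the dot product between the query and each coarse centroid, the coarse codebook is just a random sample of the keys, and keys and values are kept in full precision for clarity instead of being decoded from codes. `bucket_attention` and `n_probe` are hypothetical names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bucket_attention(q, keys, values, centroids, bucket_of, n_probe=2):
    """Attend only over cache entries whose coarse bucket is among the n_probe best for q."""
    # Rank buckets by centroid-query dot product (an assumed proxy for attention score).
    top_buckets = np.argsort(centroids @ q)[-n_probe:]
    idx = np.flatnonzero(np.isin(bucket_of, top_buckets))
    if idx.size == 0:  # every probed bucket is empty: fall back to the full cache
        idx = np.arange(len(keys))
    weights = softmax(keys[idx] @ q / np.sqrt(q.shape[0]))
    return weights @ values[idx], idx

rng = np.random.default_rng(0)
n, d, n_buckets = 256, 32, 16
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
q = rng.standard_normal(d)

# Toy coarse codebook: random cache keys as centroids, nearest-centroid assignment.
centroids = keys[rng.choice(n, size=n_buckets, replace=False)]
dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
bucket_of = dists.argmin(axis=1)  # the "mapping from cache entries to coarse codebook"

out, used = bucket_attention(q, keys, values, centroids, bucket_of, n_probe=2)
print(f"attended over {used.size} of {n} entries")
```

Setting `n_probe` to the number of buckets recovers ordinary full attention, so `n_probe` is the knob that trades accuracy for speed.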
pros:
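- tiny KV cache storage size (each entry is a few integer codes instead of a float vector)
- fast lookup / limited attention (attention runs only over the entries in the selected buckets)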
Residuals with smaller codebook: encode the error left over after coarse quantization using a second, smaller codebook
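
A hedged sketch of that two-stage encoding, reading “smaller” as “fewer entries”. It uses one global residual codebook trained with k-means; actual LOPQ fits per-bucket rotations and product quantizers, which are omitted here. The data is random, and `fit_codebook` and the codebook sizes (64 coarse, 16 residual) are arbitrary assumptions.

```python
import numpy as np

def nearest(x, centroids):
    """Index of the closest centroid (squared L2) for each row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_codebook(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
vectors = rng.standard_normal((512, 32))  # toy stand-in for KV cache vectors

# Stage 1: coarse codebook.
coarse = fit_codebook(vectors, k=64)
coarse_codes = nearest(vectors, coarse)

# Stage 2: smaller codebook fitted on what the coarse stage missed.
residuals = vectors - coarse[coarse_codes]
fine = fit_codebook(residuals, k=16, seed=1)
fine_codes = nearest(residuals, fine)

# Each vector is now stored as the pair (coarse_codes[i], fine_codes[i]).
recon_coarse = coarse[coarse_codes]
recon_both = recon_coarse + fine[fine_codes]
err_coarse = np.mean((vectors - recon_coarse) ** 2)
err_both = np.mean((vectors - recon_both) ** 2)
print(f"coarse only: {err_coarse:.4f}, coarse + residual: {err_both:.4f}")
```

The second code adds only a few bits per entry (4 bits for 16 entries here) while lowering the reconstruction error relative to the coarse code alone.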