Codebook KV Cache
- learn a codebook to compress KV cache vectors (potentially sharing the codebook with the token embedding table, extended with extra “latent” tokens)
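
A minimal sketch of the codebook idea, assuming plain k-means as the learning method and squared L2 distance for assignment (the note does not fix either choice). The names `learn_codebook`, `encode`, and `decode` are hypothetical, and the random `keys` array is only a toy stand-in for cached key vectors.

```python
import numpy as np

def encode(vectors, codebook):
    """Index of the nearest codebook entry (squared L2) for each vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).astype(np.uint8)  # assumes at most 256 codes

def decode(codes, codebook):
    """Replace each integer code with its codebook vector."""
    return codebook[codes]

def learn_codebook(vectors, num_codes, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns (num_codes, dim) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen data points.
    codebook = vectors[rng.choice(len(vectors), num_codes, replace=False)].copy()
    for _ in range(iters):
        codes = encode(vectors, codebook)
        for c in range(num_codes):
            members = vectors[codes == c]
            if len(members):  # empty clusters keep their old centroid
                codebook[c] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(0)
keys = rng.normal(size=(512, 16)).astype(np.float32)  # toy cached keys (assumption)

codebook = learn_codebook(keys, num_codes=64)
codes = encode(keys, codebook)         # what would be stored: one uint8 per vector
approx_keys = decode(codes, codebook)  # what attention would read back
```

Sharing the codebook with the token embedding table would amount to initialising or tying `codebook` to that table plus some extra rows; that variant is not shown here.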
- use LOPQ-style coarse and fine-grained encoding
  - keep a mapping from KV cache entries to coarse codebook entries, then at query time look up the nearest buckets and feed their contents into a limited “nearest neighbor” attention
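
The bucket lookup above can be sketched as follows. This covers only the coarse half of the scheme: keys and values are kept in full precision here, whereas the idea in the note would decode them from fine codes. Ranking buckets by dot product with the query, the `num_probe` parameter, and the function names are all assumptions made for illustration.

```python
import numpy as np

def build_buckets(coarse_codes, num_buckets):
    """Inverted index: bucket id -> array of cache positions stored in it."""
    return [np.flatnonzero(coarse_codes == b) for b in range(num_buckets)]

def limited_attention(query, keys, values, coarse_codebook, buckets, num_probe=2):
    """Softmax attention over only the entries in the best-scoring buckets."""
    # Rank coarse centroids by dot-product score with the query (assumption).
    bucket_scores = coarse_codebook @ query
    probe = np.argsort(-bucket_scores)[:num_probe]
    candidates = np.concatenate([buckets[b] for b in probe])
    if candidates.size == 0:
        return np.zeros(values.shape[1], dtype=values.dtype), candidates
    logits = keys[candidates] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[candidates], candidates

rng = np.random.default_rng(0)
dim, n, num_buckets = 16, 256, 8
keys = rng.normal(size=(n, dim))
values = rng.normal(size=(n, dim))

# Toy coarse codebook: a few cached keys picked at random
# (a trained one would come from k-means or similar).
coarse_codebook = keys[rng.choice(n, num_buckets, replace=False)]
d2 = ((keys[:, None, :] - coarse_codebook[None, :, :]) ** 2).sum(-1)
coarse_codes = d2.argmin(axis=1)
buckets = build_buckets(coarse_codes, num_buckets)

query = rng.normal(size=dim)
out, used = limited_attention(query, keys, values, coarse_codebook, buckets)
```

`used` holds the cache positions that were actually attended to, so `used.size / n` gives the fraction of the cache touched for this query.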
- pros:
  - tiny KV cache storage (integer codes instead of float vectors)
  - fast lookup / attention limited to the retrieved buckets
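
A rough back-of-the-envelope for the storage claim, assuming fp16 for the uncompressed cache and one 1-byte code per head vector. The model shape below is hypothetical, and the codebook's own storage (a fixed cost, independent of sequence length) is left out.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed K and V storage per token (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def code_bytes_per_token(num_layers, num_kv_heads, codes_per_vector=1, bytes_per_code=1):
    """K and V stored as integer codes, one group of codes per head vector."""
    return 2 * num_layers * num_kv_heads * codes_per_vector * bytes_per_code

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
dense = kv_bytes_per_token(32, 8, 128)  # 131072 bytes per token
coded = code_bytes_per_token(32, 8)     # 512 bytes per token
ratio = dense / coded                   # 256.0
```

Adding coarse/fine or residual codes raises `codes_per_vector` and lowers the ratio proportionally.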
- quantize the residuals (vector minus its codebook entry) with a second, smaller codebook
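
One way to read the residual idea is as a two-stage residual quantizer: encode each vector with the main codebook, then encode what is left over with a smaller one. The sketch below assumes k-means for both stages; the codebook sizes (64 and 16) and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def nearest(vectors, codebook):
    """Index of the nearest codebook entry (squared L2) for each vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (k, dim) centroids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)].copy()
    for _ in range(iters):
        codes = nearest(vectors, centroids)
        for c in range(k):
            members = vectors[codes == c]
            if len(members):  # empty clusters keep their old centroid
                centroids[c] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
vectors = rng.normal(size=(512, 16))  # toy stand-in for KV cache vectors

# Stage 1: main codebook.
main_cb = kmeans(vectors, k=64)
main_codes = nearest(vectors, main_cb)
residuals = vectors - main_cb[main_codes]

# Stage 2: smaller codebook fitted to the residuals.
res_cb = kmeans(residuals, k=16, seed=1)
res_codes = nearest(residuals, res_cb)

# Reconstruction is the sum of the two codebook entries.
recon_1 = main_cb[main_codes]
recon_2 = recon_1 + res_cb[res_codes]
err_1 = np.mean((vectors - recon_1) ** 2)  # main codebook only
err_2 = np.mean((vectors - recon_2) ** 2)  # main + residual codebook
```

Each cache entry then costs two small integers (`main_codes[i]`, `res_codes[i]`), and comparing `err_1` with `err_2` shows how much the second stage recovers on a given dataset.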