Indexing and Retrieval
Full Text Search
BM25
TF-IDF
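A minimal BM25 ranking sketch using the `rank_bm25` package (toy corpus and whitespace tokenization are illustrative):

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "the quick brown fox",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

scores = bm25.get_scores("cat on a mat".split())  # one BM25 score per doc
print(corpus[scores.argmax()], scores)            # scores is a numpy array
```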
Vector Search
Indexes
HNSW
IVF-Flat
LOPQ (Locally Optimized Product Quantization)
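A sketch of building the HNSW and IVF-Flat indexes above with FAISS (assumes `pip install faiss-cpu`; dimensions and parameters are illustrative):

```python
import faiss
import numpy as np

d = 128                                           # embedding dimension
xb = np.random.rand(10_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")       # query vectors

# HNSW: graph-based index, no training step needed.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = links per node (M)
hnsw.add(xb)
D, I = hnsw.search(xq, 10)         # top-10 distances and ids per query

# IVF-Flat: k-means coarse quantizer partitions the space into nlist cells.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # nlist = 100
ivf.train(xb)                      # learn the coarse centroids
ivf.add(xb)
ivf.nprobe = 8                     # cells to scan at query time
D, I = ivf.search(xq, 10)
# PQ variants (e.g. faiss.IndexIVFPQ) additionally compress stored vectors.
```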
Chunking
Fixed-Size Window
Sentence Splitting
Late Chunking
Use a long-context text embedding model, then pool the output token embeddings over smaller windows. This gives you an embedding of a small section of the document that is contextualized by the full document (see the sketch after the references below)
- Late Chunking in Long-Context Embedding Models
- [2409.04701] Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
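A sketch of late chunking, assuming a long-context embedder loaded via `transformers` (the model name and the naive character-offset chunk mapping are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-base-en"  # any long-context embedding model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

document = "..."                 # full document text
chunks = document.split("\n\n")  # naive paragraph chunking

# Encode the whole document once so every token embedding sees global context.
enc = tok(document, return_tensors="pt", return_offsets_mapping=True,
          truncation=True, max_length=8192)
offsets = enc.pop("offset_mapping")[0]              # (seq_len, 2) char spans
with torch.no_grad():
    token_embs = model(**enc).last_hidden_state[0]  # (seq_len, dim)

# Late chunking: mean-pool the contextualized token embeddings per chunk span.
chunk_embs, start = [], 0
for chunk in chunks:
    end = start + len(chunk)
    in_span = ((offsets[:, 0] >= start) & (offsets[:, 1] <= end)
               & (offsets[:, 1] > offsets[:, 0]))   # drop zero-width specials
    chunk_embs.append(token_embs[in_span].mean(dim=0))
    start = end + 2                                 # skip the "\n\n" separator
```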
Contextual Retrieval
Use an LLM to enrich each chunk with information from the whole document before embedding/indexing
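A minimal sketch in the spirit of Anthropic's contextual retrieval, using the OpenAI client (model name and prompt wording are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """<document>
{document}
</document>
Here is a chunk from the document:
<chunk>
{chunk}
</chunk>
Write 1-2 sentences situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only the context."""

def contextualize(document: str, chunk: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROMPT.format(document=document, chunk=chunk)}],
    )
    # Prepend the generated context to the chunk before embedding/indexing.
    return resp.choices[0].message.content + "\n\n" + chunk
```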
Embedding Models
Contextual Document Embeddings
Matryoshka Embeddings
Gecko - Versatile Text Embeddings Distilled from Large Language Models
Binary Embeddings
Ranking / Reranking
Bi-encoders
Cross-encoders
ColBERT (Late Interaction)
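A minimal sketch of ColBERT-style late interaction (MaxSim) scoring, assuming per-token query/document embeddings that are already L2-normalized:

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """For each query token, take its max cosine similarity over all
    document tokens, then sum over query tokens.
    q: (num_query_tokens, dim), d: (num_doc_tokens, dim)."""
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()
```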
SPLADE
- [2109.10086] SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
- SPLADE for Sparse Vector Search Explained | Pinecone
DRAGON
Query Understanding / Rewriting / Expansion
Index Expansion / Document Augmentation
- Extract keywords, summaries, and facts using an LLM
- Augment with synonyms using an LLM (sketch below)
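A sketch of LLM-based document augmentation before indexing (client, model, and prompt are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

def augment_for_index(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   "Extract 10 search keywords (with synonyms), 3 key facts, "
                   "and a 2-sentence summary of this document, as plain text:\n\n"
                   + doc}],
    )
    # Index the original text plus the generated keywords/facts/summary so
    # lexical and semantic search can match on the expanded terms.
    return doc + "\n\n" + resp.choices[0].message.content
```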
Retrieval Eval
nDCG
mAP
MRR
Recall@K
Precision@K
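Reference sketches of these metrics for a single query with binary relevance (`ranked` is the list of retrieved doc ids, best first; `relevant` is the set of ground-truth ids; MRR and mAP average the per-query values over all queries):

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):  # MRR averages this over queries
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):  # mAP averages this over queries
    hits, ap = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            ap += hits / i
    return ap / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```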
Function Calling
Structured Generation
Prompt Engineering
In-context learning
Show input/output examples for "few-shot" learning
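A tiny few-shot prompt sketch (task and examples are illustrative):

```python
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and it charges fast.
Sentiment: positive

Review: Stopped working after two weeks.
Sentiment: negative

Review: {review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(review="Great screen, terrible speakers.")
```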
Chain of Thought
DSPy
Hacks
Repeat Prompt
By repeating the input twice, you let the decoder attend to every token of the first copy while processing the second, approximating "bidirectional" attention despite the causal mask
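Sketch of the trick (template is illustrative):

```python
# The second copy of the passage can attend to every token of the first
# copy, approximating bidirectional attention under a causal mask.
PROMPT = """Here is a passage:
{passage}

Here is the passage again:
{passage}

Question: {question}
Answer:"""

prompt = PROMPT.format(passage="...", question="...")
```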
Move Question to Top
With causal attention it's better to ask the question first, then show the context, so that later tokens already know what they should be looking for.
It can also help to place the question at both the start and the end
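Sketch of both placements (template is illustrative):

```python
# Question first so causal attention over the context knows what to look
# for; optionally repeated at the end as a reminder.
PROMPT = """Question: {question}

Context:
{context}

Question (again): {question}
Answer:"""
```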
Short Output Formats
Speed up generation by asking the LLM for a terser output format, e.g. CSV instead of JSON, since it uses far fewer tokens
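An illustrative comparison, the same records as JSON and as CSV (CSV avoids repeating key names on every row, so the model generates far fewer tokens):

```python
json_output = """[
  {"name": "Ada", "role": "engineer", "team": "infra"},
  {"name": "Bo", "role": "designer", "team": "web"},
  {"name": "Cy", "role": "manager", "team": "infra"}
]"""

csv_output = """name,role,team
Ada,engineer,infra
Bo,designer,web
Cy,manager,infra"""
```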