Indexing and Retrieval

BM25

TF-IDF

Indexes

HNSW

IVF-Flat

LOPQ (Locally Optimized Product Quantization)

Chunking

Fixed-Size Window

Sentence Splitting

Late Chunking

Use a long-context text embedding model, then pool the output token embeddings over a smaller window. This gives you an embedding of a small section of the document that is contextualized by the full document.

[Figure: Berlin's Wikipedia article side by side with its chunked text]

[Figure: flowchart comparing naive chunking and late chunking, with steps and embeddings labeled]
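
A minimal late-chunking sketch, assuming a Hugging Face long-context embedding model (BAAI/bge-m3 is just an illustrative choice) and chunk boundaries already expressed as character spans:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed long-context embedding model; any model exposing token-level
# hidden states would work the same way.
MODEL = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk_embeddings(document: str, chunk_char_spans):
    """One embedding per (start, end) character span, each pooled from token
    embeddings that were contextualized by the full document."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]              # (num_tokens, 2) char offsets
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (num_tokens, dim)

    chunk_embs = []
    for start, end in chunk_char_spans:
        # Keep tokens whose character span overlaps this chunk, then mean-pool.
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        chunk_embs.append(token_embs[mask].mean(dim=0))
    return torch.stack(chunk_embs)
```

The key difference from naive chunking is that the document is encoded once, so each pooled chunk vector has seen the surrounding text; only the pooling happens per chunk.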

Contextual Retrieval

Use an LLM to enrich each chunk with information from the whole document before embedding and indexing it.
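
A minimal sketch of the idea, assuming an OpenAI-style chat client; the model name and prompt wording are illustrative, not a fixed recipe:

```python
# Contextual retrieval: ask an LLM to write a short context blurb for each chunk,
# then prepend it to the chunk text before embedding/indexing.
from openai import OpenAI

client = OpenAI()

PROMPT = """<document>
{document}
</document>
Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only that sentence."""

def contextualize(document: str, chunk: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(document=document, chunk=chunk)}],
    )
    context = resp.choices[0].message.content.strip()
    return f"{context}\n{chunk}"  # this enriched text is what gets embedded/indexed
```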

Embedding Models

Contextual Document Embeddings

Matryoshka Embeddings

Gecko - Versatile Text Embeddings Distilled from Large Language Models

Binary Embeddings

Ranking / Reranking

Biencoders

CrossEncoders

ColBERT (Late Interaction)

SPLADE

DRAGON

Query Understanding / Rewriting / Expansion

Index Expansion / Document Augmentation

  • Extract keywords, summaries, and facts using an LLM
  • Augment with synonyms using an LLM (see the sketch below)
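
A rough sketch of that augmentation step, again assuming an OpenAI-style client; the prompt and field names are illustrative:

```python
# Document augmentation at index time: have an LLM extract keywords, a summary,
# and synonyms, and index them alongside the original text.
import json
from openai import OpenAI

client = OpenAI()

def augment_document(text: str, model: str = "gpt-4o-mini") -> dict:
    prompt = (
        "Return JSON with keys 'keywords', 'summary', and 'synonyms' "
        "(synonyms for the important terms) for the following document:\n\n" + text
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    fields = json.loads(resp.choices[0].message.content)
    # Index the original text plus the generated fields, e.g. as extra BM25
    # fields or metadata for hybrid search.
    return {"text": text, **fields}
```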

Retrieval Eval

nDCG

mAP

MRR

Recall@K

Precision@K
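
Toy reference implementations of these metrics over a ranked list of document ids (binary relevance assumed, except for nDCG which uses graded gains):

```python
import math

def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & set(relevant)) / max(len(relevant), 1)

def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & set(relevant)) / k

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant result (averaged over queries for MRR).
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gains, k):
    # gains: dict mapping doc id -> graded relevance
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```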

Function Calling

Function Calling (with LLMs)

Structured Generation

Prompt Engineering

In-context learning

Show input/output examples for “few-shot” learning.
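
A minimal sketch of assembling a few-shot prompt; the task and examples are made up:

```python
# Few-shot in-context learning: show input/output examples before the new input.
EXAMPLES = [
    ("The food was amazing!", "positive"),
    ("Terrible service, never again.", "negative"),
]

def few_shot_prompt(new_input: str) -> str:
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in EXAMPLES)
    return (
        "Classify the sentiment of the input.\n\n"
        f"{shots}\n\nInput: {new_input}\nOutput:"
    )
```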

Chain of Thought

DSPy

Hacks

Repeat Prompt

By repeating the input twice, you let the decoder attend “bidirectionally”: tokens in the second copy can attend over the entire first copy, instead of being limited to purely causal attention in a single pass.
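
A sketch of how the hack is applied; the template wording is an assumption:

```python
# Repeat-prompt hack: include the input twice so tokens in the second copy can
# attend to the whole first copy.
def repeated_prompt(document: str, question: str) -> str:
    return (
        f"Document:\n{document}\n\n"
        f"Here is the document again:\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )
```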

Move Question to Top

With causal attention it’s better to ask the question first and then show the context, so that the context tokens are processed already knowing what to look for.

It can also help to put the question at both the start and the end.
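
A question-first template that also repeats the question at the end, per the note above (wording is an assumption):

```python
# Question-first prompting: ask, show context, then restate the question.
def question_first_prompt(context: str, question: str) -> str:
    return (
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        f"Now answer the question: {question}\nAnswer:"
    )
```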

Short Output Formats

Speed up generation by asking the LLM for a compact output format, e.g. CSV instead of JSON, since it uses far fewer output tokens.
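
A rough way to see the savings, counting tokens with tiktoken over made-up records:

```python
# CSV vs. JSON output size: JSON repeats every key per record, CSV names the
# columns once, so the CSV form tokenizes to far fewer tokens.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
records = [{"name": f"item{i}", "price": i * 10, "in_stock": True} for i in range(50)]

as_json = json.dumps(records)
as_csv = "name,price,in_stock\n" + "\n".join(
    f"{r['name']},{r['price']},{r['in_stock']}" for r in records
)
print(len(enc.encode(as_json)), "tokens as JSON")
print(len(enc.encode(as_csv)), "tokens as CSV")
```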

Ask the LLM to rewrite the prompt

Document Parsing

Multimodal RAG

ColPali and ColQwen

End to End LLMs with Retrieval

Retrieval Augmented Models

kNN-LM

REALM

RETRO

RETRO++

RAG Evaluation

LLM as Judge

Inference Optimizations

Prefix Caching

Batch Processing