Indexing and Retrieval
Full Text Search
BM25
TF-IDF
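A minimal BM25 ranking sketch using the `rank_bm25` package (toy corpus and whitespace tokenization are illustrative):

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "the quick brown fox",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

scores = bm25.get_scores("cat on a mat".split())  # one BM25 score per doc
print(corpus[scores.argmax()], scores)            # scores is a numpy array
```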
Vector Search
Indexes
HNSW
IVF-Flat
LOPQ (Locally Optimized Product Quantization)
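A sketch of building the HNSW and IVF-Flat indexes above with FAISS (assumes `pip install faiss-cpu`; dimensions and parameters are illustrative):

```python
import faiss
import numpy as np

d = 128                                           # embedding dimension
xb = np.random.rand(10_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")       # query vectors

# HNSW: graph-based index, no training step needed.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = links per node (M)
hnsw.add(xb)
D, I = hnsw.search(xq, 10)         # top-10 distances and ids per query

# IVF-Flat: k-means coarse quantizer partitions the space into nlist cells.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # nlist = 100
ivf.train(xb)                      # learn the coarse centroids
ivf.add(xb)
ivf.nprobe = 8                     # cells to scan at query time
D, I = ivf.search(xq, 10)
# PQ variants (e.g. faiss.IndexIVFPQ) additionally compress stored vectors.
```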
Chunking
Fixed-Size Window
Sentence Splitting
Late Chunking
Use a long-context text embedding model, then pool the output token embeddings over smaller windows. This gives you an embedding of a small section of the document that is contextualized by the full document (see the sketch after the references below)
- Late Chunking in Long-Context Embedding Models
- [2409.04701] Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
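A sketch of late chunking, assuming a long-context embedder loaded via `transformers` (the model name and the naive character-offset chunk mapping are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-base-en"  # any long-context embedding model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

document = "..."                 # full document text
chunks = document.split("\n\n")  # naive paragraph chunking

# Encode the whole document once so every token embedding sees global context.
enc = tok(document, return_tensors="pt", return_offsets_mapping=True,
          truncation=True, max_length=8192)
offsets = enc.pop("offset_mapping")[0]              # (seq_len, 2) char spans
with torch.no_grad():
    token_embs = model(**enc).last_hidden_state[0]  # (seq_len, dim)

# Late chunking: mean-pool the contextualized token embeddings per chunk span.
chunk_embs, start = [], 0
for chunk in chunks:
    end = start + len(chunk)
    in_span = ((offsets[:, 0] >= start) & (offsets[:, 1] <= end)
               & (offsets[:, 1] > offsets[:, 0]))   # drop zero-width specials
    chunk_embs.append(token_embs[in_span].mean(dim=0))
    start = end + 2                                 # skip the "\n\n" separator
```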
Contextual Retrieval
Use an LLM to enrich each chunk with information from the whole document before embedding/indexing
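A minimal sketch in the spirit of Anthropic's contextual retrieval, using the OpenAI client (model name and prompt wording are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """<document>
{document}
</document>
Here is a chunk from the document:
<chunk>
{chunk}
</chunk>
Write 1-2 sentences situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only the context."""

def contextualize(document: str, chunk: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROMPT.format(document=document, chunk=chunk)}],
    )
    # Prepend the generated context to the chunk before embedding/indexing.
    return resp.choices[0].message.content + "\n\n" + chunk
```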
Embedding Models
Contextual Document Embeddings
Matryoshka Embeddings
Gecko - Versatile Text Embeddings Distilled from Large Language Models
Binary Embeddings
Ranking / Reranking
Bi-encoders
Cross-encoders
ColBERT (Late Interaction)
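A minimal sketch of ColBERT-style late interaction (MaxSim) scoring, assuming per-token query/document embeddings that are already L2-normalized:

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """For each query token, take its max cosine similarity over all
    document tokens, then sum over query tokens.
    q: (num_query_tokens, dim), d: (num_doc_tokens, dim)."""
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()
```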
SPLADE
- [2109.10086] SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
- SPLADE for Sparse Vector Search Explained | Pinecone
DRAGON
Query Understanding / Rewriting / Expansion
Index Expansion / Document Augmentation
- Extract keywords, summaries, and facts using an LLM
- Augment with synonyms using an LLM (sketch below)
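A sketch of LLM-based document augmentation before indexing (client, model, and prompt are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

def augment_for_index(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   "Extract 10 search keywords (with synonyms), 3 key facts, "
                   "and a 2-sentence summary of this document, as plain text:\n\n"
                   + doc}],
    )
    # Index the original text plus the generated keywords/facts/summary so
    # lexical and semantic search can match on the expanded terms.
    return doc + "\n\n" + resp.choices[0].message.content
```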
Retrieval Eval
nDCG
mAP
MRR
Recall@K
Precision@K
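Reference sketches of these metrics for a single query with binary relevance (`ranked` is the list of retrieved doc ids, best first; `relevant` is the set of ground-truth ids; MRR and mAP average the per-query values over all queries):

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):  # MRR averages this over queries
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):  # mAP averages this over queries
    hits, ap = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            ap += hits / i
    return ap / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```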
Function Calling
Structured Generation
Prompt Engineering
In-context learning
Show input/output examples for "few-shot" learning
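A tiny few-shot prompt sketch (task and examples are illustrative):

```python
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and it charges fast.
Sentiment: positive

Review: Stopped working after two weeks.
Sentiment: negative

Review: {review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(review="Great screen, terrible speakers.")
```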
Chain of Thought
DSPy
Hacks
Repeat Prompt
By repeating the input twice, you let the decoder attend to every token of the first copy while processing the second, approximating "bidirectional" attention despite the causal mask
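Sketch of the trick (template is illustrative):

```python
# The second copy of the passage can attend to every token of the first
# copy, approximating bidirectional attention under a causal mask.
PROMPT = """Here is a passage:
{passage}

Here is the passage again:
{passage}

Question: {question}
Answer:"""

prompt = PROMPT.format(passage="...", question="...")
```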
Move Question to Top
With causal attention it's better to ask the question first, then show the context, so that later tokens already know what they should be looking for.
It can also help to place the question at both the start and the end
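Sketch of both placements (template is illustrative):

```python
# Question first so causal attention over the context knows what to look
# for; optionally repeated at the end as a reminder.
PROMPT = """Question: {question}

Context:
{context}

Question (again): {question}
Answer:"""
```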
Short Output Formats
Speed up generation by asking the LLM for a terser output format, e.g. CSV instead of JSON, since it uses far fewer tokens
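An illustrative comparison, the same records as JSON and as CSV (CSV avoids repeating key names on every row, so the model generates far fewer tokens):

```python
json_output = """[
  {"name": "Ada", "role": "engineer", "team": "infra"},
  {"name": "Bo", "role": "designer", "team": "web"},
  {"name": "Cy", "role": "manager", "team": "infra"}
]"""

csv_output = """name,role,team
Ada,engineer,infra
Bo,designer,web
Cy,manager,infra"""
```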