Encoder-Decoder

T5

Decoder

GPT

Why decoder-only beats encoder-decoder (Stanford CS25: V4 I Hyung Won Chung of OpenAI - YouTube):

| | Encoder-decoder | Decoder-only |
| --- | --- | --- |
| Additional cross-attention | Separate cross-attention | Self-attention serving both roles |
| Parameter sharing | Separate parameters for input and target | Shared parameters |
| Target-to-input attention pattern | Only attends to the last layer of the encoder's output | Within-layer (i.e. layer 1 attends to layer 1) |
| Input attention | Bidirectional | Unidirectional* |
  1. *In generative applications, causal attention lets us cache previous steps (KV cache), since each token only attends to positions up to its own (see the sketch after this list).
    1. With bidirectional encoder attention, the full sequence must be recomputed at each generation step, because earlier tokens' representations change when new tokens are added.
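
A minimal sketch of the KV-cache point above, in plain PyTorch. The function name `step_decode` and the weight matrices are illustrative assumptions, not from the lecture; the point is only that under causal attention the cached keys/values of earlier tokens never change, so each step computes just the new token's projections and appends them.

```python
# Minimal single-head causal attention step with a KV cache.
# `step_decode` and the weight names are illustrative, not a real API.
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def step_decode(x_new, k_cache, v_cache):
    """x_new: (1, d) embedding of the newly generated token."""
    q = x_new @ Wq                              # query for the new token only
    k_cache = torch.cat([k_cache, x_new @ Wk])  # append this token's key
    v_cache = torch.cat([v_cache, x_new @ Wv])  # append this token's value
    # Causality comes for free: the cache holds only positions <= current step,
    # and the cached K/V never need to be recomputed.
    attn = F.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache     # (1, d) output + updated cache

# Usage: decode 5 tokens, reusing the cache instead of re-running the full sequence.
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(5):
    x_new = torch.randn(1, d)                   # stand-in for the next token's embedding
    out, k_cache, v_cache = step_decode(x_new, k_cache, v_cache)
```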

from Stanford CS 25 - Google Slides

Decoder Improvements

  1. Pre-norm (normalize before attention/FFN rather than after) - more stable training
  2. RMSNorm - cheaper than LayerNorm (no mean subtraction or bias)
  3. RoPE (rotary position embeddings)
  4. Grouped Query Attention - smaller KV cache
  5. Mixture of Experts - more parameters without a proportional increase in FLOPs per token
  6. SwiGLU / other GLU variants in the feed-forward layer (items 1, 2, and 6 are combined in the sketch below)
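
A rough sketch (PyTorch; the class names and dimensions are made up for illustration, not taken from a specific model) of how pre-norm, RMSNorm, and a SwiGLU feed-forward layer fit together in the FFN half of a decoder block:

```python
# Illustrative pre-norm decoder-block FFN with RMSNorm and SwiGLU
# (class names and dimensions are assumptions, not from a specific model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # Scale by the root mean square only: no mean subtraction, no bias.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        # GLU variant: SiLU-gated projection multiplied elementwise with an "up" projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d, hidden = 512, 1376
norm, ffn = RMSNorm(d), SwiGLU(d, hidden)
x = torch.randn(2, 16, d)   # (batch, seq, dim)
x = x + ffn(norm(x))        # pre-norm: normalize before the sublayer, then add the residual
```

The same pre-norm residual pattern, x = x + sublayer(norm(x)), wraps the attention sublayer as well.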

Encoder

BERT, ViT (MAE)


Transformers and LLMs (Large Language Models)

llm

Components

Tokenization

Positional Encoding

RoPE

Embedding Tables

Attention

Flash Attention

Grouped Query Attention

Flex Attention

Normalization

RMS Norm

Why not BatchNorm

BatchNorm can't be used in causal models because its statistics are computed across the batch (and sequence), so a token's normalized activation depends on other tokens, including future positions, which leaks information the model should not see. LayerNorm/RMSNorm normalize each token independently and avoid this.
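
A small PyTorch sanity check of this point (illustrative only, not from the notes): perturbing a later token changes BatchNorm's output for an earlier token, but leaves LayerNorm's output for that token unchanged.

```python
# Illustrative check: BatchNorm1d mixes information across positions/batch,
# while LayerNorm normalizes each token independently.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, seq = 8, 4
x = torch.randn(1, seq, d)
x2 = x.clone()
x2[0, -1] += 10.0                      # perturb only the *last* (future) token

bn = nn.BatchNorm1d(d)                 # training mode: stats over batch and sequence
ln = nn.LayerNorm(d)                   # per-token statistics

# BatchNorm1d expects (batch, channels, seq); LayerNorm expects (..., channels).
bn_out, bn_out2 = bn(x.transpose(1, 2)), bn(x2.transpose(1, 2))
ln_out, ln_out2 = ln(x), ln(x2)

print(torch.allclose(bn_out[..., 0], bn_out2[..., 0]))   # False: token 0 changed -> leakage
print(torch.allclose(ln_out[:, 0], ln_out2[:, 0]))       # True: token 0 unchanged -> no leakage
```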

Residual

Feed Forward

Activation Functions

Classifier

MoE - Mixture of Experts

Tricks

Speculative Decoding

KV Cache

Context Length


Other

Chat

Alignment

Instruction Tuning

RLHF

DPO

Training

Fine Tuning

Distillation / Pruning / Compression

Inference

GGML / LLAMA.CPP

Introduction to ggml

Open Models

Serving

Quantization

Tutorials

Visualization

Decoding

Inference Benchmarks

Evaluation

Evaluate LLMs and RAG a practical example using Langchain and Hugging Face

Applications

Text to SQL

Articles

Lectures

Code