Encoder-Decoder

T5

Decoder

GPT

Why decoder-only beats encoder-decoder (Stanford CS25: V4 I Hyung Won Chung of OpenAI - YouTube):

| | Encoder-decoder | Decoder-only |
| --- | --- | --- |
| Additional cross-attention | Separate cross-attention | Self-attention serving both roles |
| Parameter sharing | Separate parameters for input and target | Shared parameters |
| Target-to-input attention pattern | Only attends to the last layer of the encoder's output | Within-layer (i.e. layer 1 attends to layer 1) |
| Input attention | Bidirectional | Unidirectional* |
  1. *In generative applications, causal attention lets us cache previous steps (KV cache), since each token only attends to positions up to its own (see the sketch after this list).
    1. With bidirectional encoder attention, the full sequence must be recomputed at each generation step, because earlier tokens' representations change when new tokens are added.
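
A minimal sketch of the KV-cache point above, in plain PyTorch. The function name `step_decode` and the weight matrices are illustrative assumptions, not from the lecture; the point is only that under causal attention the cached keys/values of earlier tokens never change, so each step computes just the new token's projections and appends them.

```python
# Minimal single-head causal attention step with a KV cache.
# `step_decode` and the weight names are illustrative, not a real API.
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def step_decode(x_new, k_cache, v_cache):
    """x_new: (1, d) embedding of the newly generated token."""
    q = x_new @ Wq                              # query for the new token only
    k_cache = torch.cat([k_cache, x_new @ Wk])  # append this token's key
    v_cache = torch.cat([v_cache, x_new @ Wv])  # append this token's value
    # Causality comes for free: the cache holds only positions <= current step,
    # and the cached K/V never need to be recomputed.
    attn = F.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache     # (1, d) output + updated cache

# Usage: decode 5 tokens, reusing the cache instead of re-running the full sequence.
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(5):
    x_new = torch.randn(1, d)                   # stand-in for the next token's embedding
    out, k_cache, v_cache = step_decode(x_new, k_cache, v_cache)
```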

from Stanford CS 25 - Google Slides

Decoder Improvements

  1. Pre-norm (normalize before attention/FFN rather than after) - more stable training
  2. RMSNorm - cheaper than LayerNorm (no mean subtraction or bias)
  3. RoPE (rotary position embeddings)
  4. Grouped Query Attention - smaller KV cache
  5. Mixture of Experts - more parameters without a proportional increase in FLOPs per token
  6. SwiGLU / other GLU variants in the feed-forward layer (items 1, 2, and 6 are combined in the sketch below)
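
A rough sketch (PyTorch; the class names and dimensions are made up for illustration, not taken from a specific model) of how pre-norm, RMSNorm, and a SwiGLU feed-forward layer fit together in the FFN half of a decoder block:

```python
# Illustrative pre-norm decoder-block FFN with RMSNorm and SwiGLU
# (class names and dimensions are assumptions, not from a specific model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # Scale by the root mean square only: no mean subtraction, no bias.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        # GLU variant: SiLU-gated projection multiplied elementwise with an "up" projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d, hidden = 512, 1376
norm, ffn = RMSNorm(d), SwiGLU(d, hidden)
x = torch.randn(2, 16, d)   # (batch, seq, dim)
x = x + ffn(norm(x))        # pre-norm: normalize before the sublayer, then add the residual
```

The same pre-norm residual pattern, x = x + sublayer(norm(x)), wraps the attention sublayer as well.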

Encoder

BERT, ViT (MAE)


Transformers and LLMs (Large Language Models)

llm

Components

Tokenization

Positional Encoding

RoPE

Embedding Tables

Attention

Flash Attention

Grouped Query Attention

Flex Attention

Normalization

RMS Norm

Why not BatchNorm

BatchNorm can't be used in causal models because its statistics are computed across the batch (and sequence), so a token's normalized activation depends on other tokens, including future positions, which leaks information the model should not see. LayerNorm/RMSNorm normalize each token independently and avoid this.
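
A small PyTorch sanity check of this point (illustrative only, not from the notes): perturbing a later token changes BatchNorm's output for an earlier token, but leaves LayerNorm's output for that token unchanged.

```python
# Illustrative check: BatchNorm1d mixes information across positions/batch,
# while LayerNorm normalizes each token independently.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, seq = 8, 4
x = torch.randn(1, seq, d)
x2 = x.clone()
x2[0, -1] += 10.0                      # perturb only the *last* (future) token

bn = nn.BatchNorm1d(d)                 # training mode: stats over batch and sequence
ln = nn.LayerNorm(d)                   # per-token statistics

# BatchNorm1d expects (batch, channels, seq); LayerNorm expects (..., channels).
bn_out, bn_out2 = bn(x.transpose(1, 2)), bn(x2.transpose(1, 2))
ln_out, ln_out2 = ln(x), ln(x2)

print(torch.allclose(bn_out[..., 0], bn_out2[..., 0]))   # False: token 0 changed -> leakage
print(torch.allclose(ln_out[:, 0], ln_out2[:, 0]))       # True: token 0 unchanged -> no leakage
```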

Residual

Feed Forward

Activation Functions

Classifier

MoE - Mixture of Experts

Tricks

Speculative Decoding

KV Cache

Context Length


Other

Chat

Alignment

Instruction Tuning

RLHF

DPO

Training

Fine Tuning

Distillation / Pruning / Compression

Inference

GGML / LLAMA.CPP

Introduction to ggml

Open Models

Serving

Quantization

Tutorials

Visualization

Decoding

Inference Benchmarks

Evaluation

Evaluate LLMs and RAG a practical example using Langchain and Hugging Face

Applications

Text to SQL

Articles

Lectures

Code