Transformers

Types

Encoder-Decoder

Attention Is All You Need Transformer

T5

Decoder

GPT

Why decoder-only beats encoder-decoder (Stanford CS25: V4 I Hyung Won Chung of OpenAI - YouTube):

| | Encoder-decoder | Decoder-only |
| --- | --- | --- |
| Additional cross attention | Separate cross attention | Self-attention serving both roles |
| Parameter sharing | Separate parameters for input and target | Shared parameters |
| Target-to-input attention pattern | Only attends to the last layer of the encoder’s output | Within-layer (i.e. layer 1 attends to layer 1) |
| Input attention | Bidirectional | Unidirectional* |
  1. In generative use, causal attention lets us cache previous steps (the KV cache), since each token only attends to positions up to its own step - see the sketch below
    1. With bidirectional (encoder-style) attention, the full sequence has to be recomputed at every step, because new tokens change the representations of earlier ones

from Stanford CS 25 - Google Slides
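A minimal sketch of the KV-cache point above (my own illustration, assuming PyTorch and a single attention head; the class name `CachedSelfAttention` is hypothetical): with causal attention, the keys and values computed for past tokens never change, so each generation step only processes the newest token and appends to the cache.

```python
# Minimal sketch, not the lecture's code: single-head causal self-attention with a KV cache.
# Because each token attends only to itself and earlier positions, the K/V computed at a
# step never change afterwards, so we append them to a cache instead of recomputing them.
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_cache = None  # (batch, tokens_so_far, d_model)
        self.v_cache = None

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        # x_new: (batch, 1, d_model) -- only the newest token is processed.
        q, k, v = self.q_proj(x_new), self.k_proj(x_new), self.v_proj(x_new)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        # No causal mask needed here: the query is the latest position and the
        # cache only contains past (and current) tokens.
        scores = q @ self.k_cache.transpose(-1, -2) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ self.v_cache

# Feed tokens one at a time. A bidirectional encoder cannot do this: each new token
# changes the representations of earlier ones, so the full prefix must be recomputed.
attn = CachedSelfAttention(d_model=16)
for _ in range(5):
    y = attn(torch.randn(1, 1, 16))
print(y.shape)  # torch.Size([1, 1, 16])
```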

Decoder Improvements

  1. Pre-norm (normalization before attention/FFN instead of after)
  2. RMSNorm - cheaper than LayerNorm (no mean subtraction, no bias)
  3. RoPE position embeddings
  4. Grouped Query Attention - smaller KV cache
  5. Mixture of Experts - more parameters at roughly the same FLOPs per token
  6. SwiGLU / other GLU variants (RMSNorm and SwiGLU are sketched below)
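
Rough sketches of two of the items above (RMSNorm and a SwiGLU feed-forward), assuming PyTorch; class names and sizes are illustrative, not taken from any particular model.

```python
# Illustrative PyTorch sketches, not from any specific model; sizes are arbitrary.
import torch
import torch.nn.functional as F

class RMSNorm(torch.nn.Module):
    """Scale by the root-mean-square only: no mean subtraction and no bias,
    which makes it cheaper than LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(torch.nn.Module):
    """GLU-variant feed-forward block: the up projection is gated by SiLU(gate projection)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden, bias=False)
        self.w_down = torch.nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 64)                      # (batch, seq, dim)
print(SwiGLU(64, 256)(RMSNorm(64)(x)).shape)   # torch.Size([2, 8, 64])
```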

Encoder

BERT

Masking and Next Sentence Prediction Tasks

Class Token vs Pooling

ModernBERT

ViT (MAE)


Transformers and LLMs (Large Language Models)

#ml/nlp/llm

Components

Tokenization

Positional Encoding

RoPE

ALiBi

Learned Positional Encoding

Sinusoidal Positional Encoding

Embedding Tables

Tied Embeddings

Attention

Flash Attention

Grouped Query Attention

Flex Attention

Sliding Window Attention

Normalization

RMS Norm

Why not BatchNorm

Can’t use BatchNorm in causal models because its normalization statistics are computed across the batch (and typically the sequence) dimension, so a token’s normalized value depends on other positions, including future ones - i.e. information leakage (see the sketch below)
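
A small sketch of the leakage (my own illustration, assuming PyTorch): perturbing only the last position changes BatchNorm’s output at position 0, because its statistics span batch and sequence; LayerNorm (and RMSNorm), which normalize each position independently, are unaffected.

```python
# Sketch of the leakage, assuming PyTorch; shapes and values are arbitrary.
import torch

torch.manual_seed(0)
x = torch.randn(4, 10, 8)            # (batch, seq, dim)
x_future = x.clone()
x_future[:, -1, :] += 100.0          # perturb only the LAST position

# BatchNorm (training mode) normalizes each channel over batch *and* sequence,
# so the output at position 0 changes when a future position changes: leakage.
bn = torch.nn.BatchNorm1d(8)
out_a = bn(x.transpose(1, 2)).transpose(1, 2)
out_b = bn(x_future.transpose(1, 2)).transpose(1, 2)
print("BatchNorm: position 0 affected by future token?",
      not torch.allclose(out_a[:, 0], out_b[:, 0]))   # True

# LayerNorm normalizes each position independently: no leakage.
ln = torch.nn.LayerNorm(8)
print("LayerNorm: position 0 affected by future token?",
      not torch.allclose(ln(x)[:, 0], ln(x_future)[:, 0]))  # False
```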

QK Norm

Residual

Residual Stream

Feed Forward

SwiGLU

Activation Functions

ReLU^2

Classifier

MoE - Mixture of Experts

Tricks

Speculative Decoding

KV Cache

Context Length

Long Context Transformers

Transformer Variants

nGPT - Normalized Transformer

TokenFormer

2024-11-03 - TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Sigmoid Attention

Diff Transformer

Block Transformer

YOCO - You Only Cache Once


Topics and Trends

“Reasoning” / Test Time Compute

Test Time Compute, LLM Reasoning, Inference Time Scaling

Multimodal Transformers

Vision Language Models

Vision Language Action Models

“Omni” Models

Chat

Alignment

Instruction Tuning

RLHF

DPO

Training

Fine Tuning

Parameter Efficient Fine Tuning (PEFT)

LoRA - Low Rank Adaptation

Transformer^2

Distillation / Pruning / Compression

Inference

Decoder Transformer Inference (LLM Serving)

GGML / LLAMA.CPP

Introduction to ggml
llama.cpp guide - Running LLMs locally, on any hardware, from scratch

Open Models

Serving

Server Inference

Quantization

Visualization

Sampling and Decoding

Inference Benchmarks

Evaluation

Evaluate LLMs and RAG a practical example using Langchain and Hugging Face

Applications

Text to SQL

Articles

Lectures

Code

Tutorials

