Transformers

Types

Encoder-Decoder

Attention Is All You Need Transformer

T5

Decoder

GPT

Why decoder-only beats encoder-decoder (Stanford CS25: V4 I Hyung Won Chung of OpenAI - YouTube):

| | Encoder-decoder | Decoder-only |
| --- | --- | --- |
| Additional cross attention | Separate cross attention | Self-attention serving both roles |
| Parameter sharing | Separate parameters for input and target | Shared parameters |
| Target-to-input attention pattern | Only attends to the last layer of the encoder’s output | Within-layer (i.e. layer 1 attends to layer 1) |
| Input attention | Bidirectional | Unidirectional* |
  1. In generative use, causal attention lets us cache previous steps (the KV cache), since each token only attends to positions up to its own step - see the sketch below
    1. With bidirectional (encoder-style) attention, the full sequence has to be recomputed at every step, because new tokens change the representations of earlier ones

from Stanford CS 25 - Google Slides
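A minimal sketch of the KV-cache point above (my own illustration, assuming PyTorch and a single attention head; the class name `CachedSelfAttention` is hypothetical): with causal attention, the keys and values computed for past tokens never change, so each generation step only processes the newest token and appends to the cache.

```python
# Minimal sketch, not the lecture's code: single-head causal self-attention with a KV cache.
# Because each token attends only to itself and earlier positions, the K/V computed at a
# step never change afterwards, so we append them to a cache instead of recomputing them.
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_cache = None  # (batch, tokens_so_far, d_model)
        self.v_cache = None

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        # x_new: (batch, 1, d_model) -- only the newest token is processed.
        q, k, v = self.q_proj(x_new), self.k_proj(x_new), self.v_proj(x_new)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        # No causal mask needed here: the query is the latest position and the
        # cache only contains past (and current) tokens.
        scores = q @ self.k_cache.transpose(-1, -2) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ self.v_cache

# Feed tokens one at a time. A bidirectional encoder cannot do this: each new token
# changes the representations of earlier ones, so the full prefix must be recomputed.
attn = CachedSelfAttention(d_model=16)
for _ in range(5):
    y = attn(torch.randn(1, 1, 16))
print(y.shape)  # torch.Size([1, 1, 16])
```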

Decoder Improvements

  1. Pre-norm (normalization before attention/FFN instead of after)
  2. RMSNorm - cheaper than LayerNorm (no mean subtraction, no bias)
  3. RoPE position embeddings
  4. Grouped Query Attention - smaller KV cache
  5. Mixture of Experts - more parameters at roughly the same FLOPs per token
  6. SwiGLU / other GLU variants (RMSNorm and SwiGLU are sketched below)
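
Rough sketches of two of the items above (RMSNorm and a SwiGLU feed-forward), assuming PyTorch; class names and sizes are illustrative, not taken from any particular model.

```python
# Illustrative PyTorch sketches, not from any specific model; sizes are arbitrary.
import torch
import torch.nn.functional as F

class RMSNorm(torch.nn.Module):
    """Scale by the root-mean-square only: no mean subtraction and no bias,
    which makes it cheaper than LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(torch.nn.Module):
    """GLU-variant feed-forward block: the up projection is gated by SiLU(gate projection)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden, bias=False)
        self.w_down = torch.nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 64)                      # (batch, seq, dim)
print(SwiGLU(64, 256)(RMSNorm(64)(x)).shape)   # torch.Size([2, 8, 64])
```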

Encoder

BERT

Masking and Next Sentence Prediction Tasks

Class Token vs Pooling

ModernBERT

ViT (MAE)


Transformers and LLMs (Large Language Models)

#ml/nlp/llm

Components

Tokenization

Positional Encoding

RoPE

ALiBi

Learned Positional Encoding

Sinusoidal Positional Encoding

Embedding Tables

Tied Embeddings

Attention

Flash Attention

Grouped Query Attention

Flex Attention

Sliding Window Attention

Normalization

RMS Norm

Why not BatchNorm

Can’t use BatchNorm in causal models because its normalization statistics are computed across the batch (and typically the sequence) dimension, so a token’s normalized value depends on other positions, including future ones - i.e. information leakage (see the sketch below)
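
A small sketch of the leakage (my own illustration, assuming PyTorch): perturbing only the last position changes BatchNorm’s output at position 0, because its statistics span batch and sequence; LayerNorm (and RMSNorm), which normalize each position independently, are unaffected.

```python
# Sketch of the leakage, assuming PyTorch; shapes and values are arbitrary.
import torch

torch.manual_seed(0)
x = torch.randn(4, 10, 8)            # (batch, seq, dim)
x_future = x.clone()
x_future[:, -1, :] += 100.0          # perturb only the LAST position

# BatchNorm (training mode) normalizes each channel over batch *and* sequence,
# so the output at position 0 changes when a future position changes: leakage.
bn = torch.nn.BatchNorm1d(8)
out_a = bn(x.transpose(1, 2)).transpose(1, 2)
out_b = bn(x_future.transpose(1, 2)).transpose(1, 2)
print("BatchNorm: position 0 affected by future token?",
      not torch.allclose(out_a[:, 0], out_b[:, 0]))   # True

# LayerNorm normalizes each position independently: no leakage.
ln = torch.nn.LayerNorm(8)
print("LayerNorm: position 0 affected by future token?",
      not torch.allclose(ln(x)[:, 0], ln(x_future)[:, 0]))  # False
```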

QK Norm

Residual

Residual Stream

Feed Forward

SwiGLU

Activation Functions

ReLU^2

Classifier

MoE - Mixture of Experts

Tricks

Speculative Decoding

KV Cache

Context Length

Long Context Transformers

Transformer Variants

nGPT - Normalized Transformer

TokenFormer

2024-11-03 - TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Sigmoid Attention

Diff Transformer

Block Transformer

YOCO - You Only Cache Once


Topics and Trends

“Reasoning” / Test Time Compute

Test Time Compute, LLM Reasoning, Inference Time Scaling

Multimodal Transformers

Vision Language Models

Vision Language Action Models

“Omni” Models

Chat

Alignment

Instruction Tuning

RLHF

DPO

Training

Fine Tuning

Parameter Efficient Fine Tuning (PEFT)

LoRA - Low Rank Adaptation

Transformer^2

Distillation / Pruning / Compression

Inference

Decoder Transformer Inference (LLM Serving)

GGML / LLAMA.CPP

Introduction to ggml
llama.cpp guide - Running LLMs locally, on any hardware, from scratch

Open Models

Serving

Server Inference

Quantization

Visualization

Sampling and Decoding

Inference Benchmarks

Evaluation

Evaluate LLMs and RAG a practical example using Langchain and Hugging Face

Applications

Text to SQL

Articles

Lectures

Code

Tutorials

