Normalization Method | Normalization Axis | Key Advantages | Key Disadvantages | Best Use Case |
---|---|---|---|---|
Batch Normalization | Per feature/channel, across the mini-batch (and spatial dims) | - Accelerates convergence - Provides regularization via batch noise - Effective for large batches | - Unreliable with small batches (noisy statistics) - Not ideal for RNNs or variable-length sequences | CNNs, fully connected layers in vision tasks |
Instance Normalization | Per instance, per channel | - Effective for tasks requiring instance-level statistics - Good for style transfer | - Ignores global batch statistics - Less effective for tasks requiring batch-wide consistency | Style transfer, GANs |
Layer Normalization | Across feature dimension per input | - Independent of batch size - Works well with RNNs and transformers - Effective for sequential data | - Computationally more intensive than BatchNorm - Not as effective for convolutional layers | Transformers, RNNs, attention models |
RMS Normalization | RMS across feature dimension | - Efficient variant of LayerNorm - Reduces computation by skipping mean subtraction - Suitable for large-scale models | - Lack of mean subtraction can cause instability in certain tasks | Transformers, large models where efficiency is critical |
Group Normalization | Across groups of feature channels | - Works well with small mini-batches - Effective for visual recognition tasks - Less sensitive to batch size | - Needs manual tuning of the number of groups - Not as effective as BatchNorm with large batches | Vision models (ResNets, object detection) |
Weight Normalization | Reparameterization of weights | - Simplifies optimization - Improves training speed - Less sensitive to batch size | - Needs to be combined with other methods for optimal performance | Reinforcement learning, generative models |
Spectral Normalization | Weight matrices, divided by their largest singular value (spectral norm) | - Ensures a bounded Lipschitz constant - Stabilizes GAN training - Reduces exploding gradients | - Adds computational overhead (power iteration) - Specialized for GANs, less generalizable | GANs, particularly in the discriminator |
Batch-Instance Norm | Weighted mix of BatchNorm and InstanceNorm | - Flexibility in balancing global and instance-level statistics - Adaptable to various vision tasks | - Additional complexity with two normalizations - Needs fine-tuning for optimal balance | Vision tasks like style transfer and object detection |
Switchable Norm | Weighted combination of multiple normalization methods | - Adapts to the specific needs of each layer - Can combine strengths of BatchNorm, InstanceNorm, and LayerNorm | - Computationally expensive - Requires extra parameters for weighting | General deep learning models where adaptation is key |
Batch Renormalization | Per feature, across the mini-batch, corrected toward running statistics | - Works with small mini-batches - More stable for online learning or non-i.i.d. data | - Additional complexity - Slightly slower than standard BatchNorm | Small-batch training, online learning, RL |
Mean-Only Batch Norm | Subtracts batch mean (ignores variance) | - Computationally efficient - Prevents mean-shift - Simplifies training | - Skips variance normalization, less robust for complex data | Large-scale models where variance normalization isn’t crucial |
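The minimal PyTorch sketches below illustrate, for each method in the table, over which axes the statistics are computed. All tensor shapes, `eps`/momentum values, and class names are illustrative assumptions, not reference implementations. First, Batch Normalization: per-channel statistics over the batch and spatial dimensions, checked here against `torch.nn.BatchNorm2d`.

```python
import torch
import torch.nn as nn

# BatchNorm2d: per-channel mean/var computed over (N, H, W) of the mini-batch.
x = torch.randn(8, 16, 32, 32)            # assumed shape: (batch, channels, H, W)
bn = nn.BatchNorm2d(16, affine=False)     # affine=False so we compare raw statistics
y = bn(x)                                 # training mode -> uses batch statistics

# Manual equivalent: statistics over every dim except the channel dim.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```

At inference, `BatchNorm2d` switches to its running mean/variance instead of batch statistics, which is why small or non-i.i.d. training batches hurt it.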
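Instance Normalization computes the same kind of statistics per sample and per channel, over the spatial dimensions only; a sketch assuming `torch.nn.InstanceNorm2d` and a 4-D activation tensor:

```python
import torch
import torch.nn as nn

# InstanceNorm2d: mean/var computed per sample and per channel, over (H, W) only.
x = torch.randn(8, 16, 32, 32)               # assumed shape: (batch, channels, H, W)
inorm = nn.InstanceNorm2d(16, affine=False)
y = inorm(x)

# Manual equivalent: statistics over the spatial dims of each (sample, channel) slice.
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + inorm.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```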
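Layer Normalization drops the batch dimension entirely and normalizes each token/sample over its feature dimension; a sketch assuming a transformer-style `(batch, seq_len, d_model)` tensor:

```python
import torch
import torch.nn as nn

# LayerNorm: mean/var computed over the feature dimension of each token/sample.
x = torch.randn(8, 128, 512)              # assumed shape: (batch, seq_len, d_model)
ln = nn.LayerNorm(512, elementwise_affine=False)
y = ln(x)

# Manual equivalent: statistics over the last dim only; no batch dependence at all.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```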
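RMS Normalization keeps LayerNorm's per-token axis but skips mean subtraction, rescaling by the root mean square of the features. `torch.nn.RMSNorm` exists only in recent PyTorch releases, so here is a hand-rolled sketch (class name and `eps` value are assumptions):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch (Zhang & Sennrich, 2019): rescale by the root mean
    square of the features, with a learned gain but no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(8, 128, 512)     # assumed shape: (batch, seq_len, d_model)
y = RMSNorm(512)(x)
print(y.shape)                   # torch.Size([8, 128, 512])
```

Skipping the mean reduces both compute and parameters (no bias term), which is why RMSNorm is common in large transformer models.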
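Group Normalization splits the channels into groups and normalizes each group per sample, so the statistics never depend on batch size; a sketch assuming `torch.nn.GroupNorm` with 4 groups over 16 channels (both numbers arbitrary):

```python
import torch
import torch.nn as nn

# GroupNorm: channels are split into groups; mean/var are computed per sample
# over each group's channels and spatial positions. Batch size never enters.
x = torch.randn(2, 16, 32, 32)                 # works even with a tiny batch
gn = nn.GroupNorm(num_groups=4, num_channels=16, affine=False)
y = gn(x)

# Manual equivalent: reshape to (N, groups, C//groups * H * W), normalize per group.
xg = x.reshape(2, 4, -1)
mean = xg.mean(dim=-1, keepdim=True)
var = xg.var(dim=-1, unbiased=False, keepdim=True)
y_manual = ((xg - mean) / torch.sqrt(var + gn.eps)).reshape_as(x)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```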
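Weight Normalization acts on parameters rather than activations: the weight is reparameterized as w = g · v/‖v‖ so that magnitude and direction are learned separately. A sketch using the long-standing `torch.nn.utils.weight_norm` hook (newer PyTorch versions also offer `torch.nn.utils.parametrizations.weight_norm`); the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Weight normalization reparameterizes w = g * v / ||v||: the magnitude g and the
# direction v become the trainable parameters instead of w itself.
layer = nn.utils.weight_norm(nn.Linear(64, 32))    # adds layer.weight_g, layer.weight_v

# .weight is recomputed from g and v before every forward pass.
w = layer.weight_g * layer.weight_v / layer.weight_v.norm(dim=1, keepdim=True)
print(torch.allclose(layer.weight, w, atol=1e-6))  # True

y = layer(torch.randn(8, 64))
print(y.shape)                                     # torch.Size([8, 32])
```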
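Spectral Normalization also acts on the weights, dividing them by a power-iteration estimate of the largest singular value so that the layer's Lipschitz constant stays bounded; a sketch with `torch.nn.utils.spectral_norm` on an arbitrary linear layer:

```python
import torch
import torch.nn as nn

# Spectral normalization divides the weight by a power-iteration estimate of its
# largest singular value, bounding the layer's Lipschitz constant.
layer = nn.utils.spectral_norm(nn.Linear(64, 32))

# Each forward pass in training mode runs one power-iteration step.
for _ in range(20):                       # a few steps let the estimate converge
    _ = layer(torch.randn(8, 64))

# The top singular value of the effective weight is now approximately 1.
sigma_max = torch.linalg.svdvals(layer.weight.detach())[0]
print(sigma_max)                          # ~1.0
```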
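Batch-Instance Normalization has no built-in PyTorch module, so the sketch below hand-rolls the idea from Nam & Kim (2018): a learned per-channel gate `rho` interpolates between a BatchNorm branch and an InstanceNorm branch. The class name and the initial gate value of 0.5 are assumptions:

```python
import torch
import torch.nn as nn

class BatchInstanceNorm2d(nn.Module):
    """Sketch of Batch-Instance Norm: a learned gate rho mixes, per channel,
    the BatchNorm and InstanceNorm outputs before the usual affine transform."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))  # mixing gate
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # BatchNorm branch: statistics over (N, H, W) per channel.
        mu_b = x.mean(dim=(0, 2, 3), keepdim=True)
        var_b = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_b = (x - mu_b) / torch.sqrt(var_b + self.eps)
        # InstanceNorm branch: statistics over (H, W) per sample and channel.
        mu_i = x.mean(dim=(2, 3), keepdim=True)
        var_i = x.var(dim=(2, 3), unbiased=False, keepdim=True)
        x_i = (x - mu_i) / torch.sqrt(var_i + self.eps)
        # Convex combination of the two branches, then the affine transform.
        rho = self.rho.clamp(0.0, 1.0)
        return self.gamma * (rho * x_b + (1 - rho) * x_i) + self.beta

y = BatchInstanceNorm2d(16)(torch.randn(8, 16, 32, 32))
print(y.shape)  # torch.Size([8, 16, 32, 32])
```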
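Switchable Normalization is likewise hand-rolled here: softmax weights decide, per layer, how much of the InstanceNorm, LayerNorm, and BatchNorm statistics to use. This is a simplified sketch of Luo et al. (2018); the published method additionally shares intermediate statistics for efficiency:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    """Sketch of Switchable Normalization: a learned, softmax-weighted mix of
    InstanceNorm, LayerNorm, and BatchNorm statistics."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.mean_w = nn.Parameter(torch.ones(3))   # weights over {IN, LN, BN} means
        self.var_w = nn.Parameter(torch.ones(3))    # weights over {IN, LN, BN} variances
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Candidate statistics over different axes of the (N, C, H, W) tensor.
        mu_in = x.mean(dim=(2, 3), keepdim=True)        # per sample, per channel
        var_in = x.var(dim=(2, 3), unbiased=False, keepdim=True)
        mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)     # per sample, all channels
        var_ln = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
        mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)     # per channel, whole batch
        var_bn = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        wm = F.softmax(self.mean_w, dim=0)
        wv = F.softmax(self.var_w, dim=0)
        mu = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

y = SwitchableNorm2d(16)(torch.randn(8, 16, 32, 32))
print(y.shape)  # torch.Size([8, 16, 32, 32])
```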
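Batch Renormalization keeps BatchNorm's axes but adds the correction terms r and d, which pull the batch statistics toward the running statistics and are treated as constants (no gradient). A simplified 1-D sketch; `momentum`, `r_max`, and `d_max` are plausible defaults, not values from the source:

```python
import torch
import torch.nn as nn

class BatchRenorm1d(nn.Module):
    """Sketch of Batch Renormalization (Ioffe, 2017): normalize with batch
    statistics, then correct toward the running statistics via clipped,
    stop-gradient factors r and d."""
    def __init__(self, features, eps=1e-5, momentum=0.01, r_max=3.0, d_max=5.0):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.register_buffer("running_mean", torch.zeros(features))
        self.register_buffer("running_std", torch.ones(features))
        self.eps, self.momentum, self.r_max, self.d_max = eps, momentum, r_max, d_max

    def forward(self, x):                                   # x: (batch, features)
        if self.training:
            mu = x.mean(dim=0)
            std = torch.sqrt(x.var(dim=0, unbiased=False) + self.eps)
            with torch.no_grad():                           # r and d are stop-gradient
                r = (std / self.running_std).clamp(1 / self.r_max, self.r_max)
                d = ((mu - self.running_mean) / self.running_std).clamp(-self.d_max, self.d_max)
                self.running_mean += self.momentum * (mu - self.running_mean)
                self.running_std += self.momentum * (std - self.running_std)
            x_hat = (x - mu) / std * r + d
        else:                                               # inference: running stats only
            x_hat = (x - self.running_mean) / self.running_std
        return self.gamma * x_hat + self.beta

y = BatchRenorm1d(32)(torch.randn(4, 32))   # usable even with a very small batch
print(y.shape)                              # torch.Size([4, 32])
```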
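Mean-only BatchNorm subtracts the per-feature batch mean (or the running mean at inference) and skips the variance division altogether; it is typically paired with weight normalization (Salimans & Kingma, 2016). A minimal 1-D sketch, with the class name and momentum chosen for illustration:

```python
import torch
import torch.nn as nn

class MeanOnlyBatchNorm(nn.Module):
    """Sketch of mean-only BatchNorm: subtract the per-feature batch mean
    (running mean at inference) but skip the variance scaling."""
    def __init__(self, features: int, momentum: float = 0.1):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(features))        # learned bias
        self.register_buffer("running_mean", torch.zeros(features))
        self.momentum = momentum

    def forward(self, x):                                      # x: (batch, features)
        if self.training:
            mu = x.mean(dim=0)
            with torch.no_grad():
                self.running_mean += self.momentum * (mu - self.running_mean)
        else:
            mu = self.running_mean
        return x - mu + self.beta                              # no division by std

y = MeanOnlyBatchNorm(32)(torch.randn(8, 32))
print(y.shape)  # torch.Size([8, 32])
```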
Further reading:

- BatchNorm
- Group Normalization
- Weight Normalization
- LayerNorm
- RMSNorm
  - [1910.07467] Root Mean Square Layer Normalization
  - [2305.14858] Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers