Normalization Method | Normalization Axis | Key Advantages | Key Disadvantages | Best Use Case |
---|---|---|---|---|
Batch Normalization | Per feature/channel, across the mini-batch (and spatial dims) | - Accelerates convergence - Provides regularization via batch noise - Effective for large batches | - Unreliable with small batches (noisy statistics) - Not ideal for RNNs or variable-length sequences | CNNs, fully connected layers in vision tasks |
Instance Normalization | Per instance, per channel | - Effective for tasks requiring instance-level statistics - Good for style transfer | - Ignores global batch statistics - Less effective for tasks requiring batch-wide consistency | Style transfer, GANs |
Layer Normalization | Across feature dimension per input | - Independent of batch size - Works well with RNNs and transformers - Effective for sequential data | - Computationally more intensive than BatchNorm - Not as effective for convolutional layers | Transformers, RNNs, attention models |
RMS Normalization | RMS across feature dimension | - Efficient variant of LayerNorm - Reduces computation by skipping mean subtraction - Suitable for large-scale models | - Lack of mean subtraction can cause instability in certain tasks | Transformers, large models where efficiency is critical |
Group Normalization | Across groups of feature channels | - Works well with small mini-batches - Effective for visual recognition tasks - Less sensitive to batch size | - Needs manual tuning of the number of groups - Not as effective as BatchNorm with large batches | Vision models (ResNets, object detection) |
Weight Normalization | Reparameterization of weights | - Simplifies optimization - Improves training speed - Less sensitive to batch size | - Needs to be combined with other methods for optimal performance | Reinforcement learning, generative models |
Spectral Normalization | Weight matrices, divided by their largest singular value (spectral norm) | - Ensures a bounded Lipschitz constant - Stabilizes GAN training - Reduces exploding gradients | - Adds computational overhead (power iteration) - Specialized for GANs, less generalizable | GANs, particularly in the discriminator |
Batch-Instance Norm | Weighted mix of BatchNorm and InstanceNorm | - Flexibility in balancing global and instance-level statistics - Adaptable to various vision tasks | - Additional complexity with two normalizations - Needs fine-tuning for optimal balance | Vision tasks like style transfer and object detection |
Switchable Norm | Weighted combination of multiple normalization methods | - Adapts to the specific needs of each layer - Can combine strengths of BatchNorm, InstanceNorm, and LayerNorm | - Computationally expensive - Requires extra parameters for weighting | General deep learning models where adaptation is key |
Batch Renormalization | Per feature, across the mini-batch, corrected toward running statistics | - Works with small mini-batches - More stable for online learning or non-i.i.d. data | - Additional complexity - Slightly slower than standard BatchNorm | Small-batch training, online learning, RL |
Mean-Only Batch Norm | Subtracts batch mean (ignores variance) | - Computationally efficient - Prevents mean-shift - Simplifies training | - Skips variance normalization, less robust for complex data | Large-scale models where variance normalization isn’t crucial |
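The minimal PyTorch sketches below illustrate, for each method in the table, over which axes the statistics are computed. All tensor shapes, `eps`/momentum values, and class names are illustrative assumptions, not reference implementations. First, Batch Normalization: per-channel statistics over the batch and spatial dimensions, checked here against `torch.nn.BatchNorm2d`.

```python
import torch
import torch.nn as nn

# BatchNorm2d: per-channel mean/var computed over (N, H, W) of the mini-batch.
x = torch.randn(8, 16, 32, 32)            # assumed shape: (batch, channels, H, W)
bn = nn.BatchNorm2d(16, affine=False)     # affine=False so we compare raw statistics
y = bn(x)                                 # training mode -> uses batch statistics

# Manual equivalent: statistics over every dim except the channel dim.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```

At inference, `BatchNorm2d` switches to its running mean/variance instead of batch statistics, which is why small or non-i.i.d. training batches hurt it.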
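Instance Normalization computes the same kind of statistics per sample and per channel, over the spatial dimensions only; a sketch assuming `torch.nn.InstanceNorm2d` and a 4-D activation tensor:

```python
import torch
import torch.nn as nn

# InstanceNorm2d: mean/var computed per sample and per channel, over (H, W) only.
x = torch.randn(8, 16, 32, 32)               # assumed shape: (batch, channels, H, W)
inorm = nn.InstanceNorm2d(16, affine=False)
y = inorm(x)

# Manual equivalent: statistics over the spatial dims of each (sample, channel) slice.
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + inorm.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```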
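Layer Normalization drops the batch dimension entirely and normalizes each token/sample over its feature dimension; a sketch assuming a transformer-style `(batch, seq_len, d_model)` tensor:

```python
import torch
import torch.nn as nn

# LayerNorm: mean/var computed over the feature dimension of each token/sample.
x = torch.randn(8, 128, 512)              # assumed shape: (batch, seq_len, d_model)
ln = nn.LayerNorm(512, elementwise_affine=False)
y = ln(x)

# Manual equivalent: statistics over the last dim only; no batch dependence at all.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```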
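RMS Normalization keeps LayerNorm's per-token axis but skips mean subtraction, rescaling by the root mean square of the features. `torch.nn.RMSNorm` exists only in recent PyTorch releases, so here is a hand-rolled sketch (class name and `eps` value are assumptions):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch (Zhang & Sennrich, 2019): rescale by the root mean
    square of the features, with a learned gain but no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(8, 128, 512)     # assumed shape: (batch, seq_len, d_model)
y = RMSNorm(512)(x)
print(y.shape)                   # torch.Size([8, 128, 512])
```

Skipping the mean reduces both compute and parameters (no bias term), which is why RMSNorm is common in large transformer models.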
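Group Normalization splits the channels into groups and normalizes each group per sample, so the statistics never depend on batch size; a sketch assuming `torch.nn.GroupNorm` with 4 groups over 16 channels (both numbers arbitrary):

```python
import torch
import torch.nn as nn

# GroupNorm: channels are split into groups; mean/var are computed per sample
# over each group's channels and spatial positions. Batch size never enters.
x = torch.randn(2, 16, 32, 32)                 # works even with a tiny batch
gn = nn.GroupNorm(num_groups=4, num_channels=16, affine=False)
y = gn(x)

# Manual equivalent: reshape to (N, groups, C//groups * H * W), normalize per group.
xg = x.reshape(2, 4, -1)
mean = xg.mean(dim=-1, keepdim=True)
var = xg.var(dim=-1, unbiased=False, keepdim=True)
y_manual = ((xg - mean) / torch.sqrt(var + gn.eps)).reshape_as(x)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```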
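Weight Normalization acts on parameters rather than activations: the weight is reparameterized as w = g · v/‖v‖ so that magnitude and direction are learned separately. A sketch using the long-standing `torch.nn.utils.weight_norm` hook (newer PyTorch versions also offer `torch.nn.utils.parametrizations.weight_norm`); the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Weight normalization reparameterizes w = g * v / ||v||: the magnitude g and the
# direction v become the trainable parameters instead of w itself.
layer = nn.utils.weight_norm(nn.Linear(64, 32))    # adds layer.weight_g, layer.weight_v

# .weight is recomputed from g and v before every forward pass.
w = layer.weight_g * layer.weight_v / layer.weight_v.norm(dim=1, keepdim=True)
print(torch.allclose(layer.weight, w, atol=1e-6))  # True

y = layer(torch.randn(8, 64))
print(y.shape)                                     # torch.Size([8, 32])
```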
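Spectral Normalization also acts on the weights, dividing them by a power-iteration estimate of the largest singular value so that the layer's Lipschitz constant stays bounded; a sketch with `torch.nn.utils.spectral_norm` on an arbitrary linear layer:

```python
import torch
import torch.nn as nn

# Spectral normalization divides the weight by a power-iteration estimate of its
# largest singular value, bounding the layer's Lipschitz constant.
layer = nn.utils.spectral_norm(nn.Linear(64, 32))

# Each forward pass in training mode runs one power-iteration step.
for _ in range(20):                       # a few steps let the estimate converge
    _ = layer(torch.randn(8, 64))

# The top singular value of the effective weight is now approximately 1.
sigma_max = torch.linalg.svdvals(layer.weight.detach())[0]
print(sigma_max)                          # ~1.0
```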
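Batch-Instance Normalization has no built-in PyTorch module, so the sketch below hand-rolls the idea from Nam & Kim (2018): a learned per-channel gate `rho` interpolates between a BatchNorm branch and an InstanceNorm branch. The class name and the initial gate value of 0.5 are assumptions:

```python
import torch
import torch.nn as nn

class BatchInstanceNorm2d(nn.Module):
    """Sketch of Batch-Instance Norm: a learned gate rho mixes, per channel,
    the BatchNorm and InstanceNorm outputs before the usual affine transform."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))  # mixing gate
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # BatchNorm branch: statistics over (N, H, W) per channel.
        mu_b = x.mean(dim=(0, 2, 3), keepdim=True)
        var_b = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_b = (x - mu_b) / torch.sqrt(var_b + self.eps)
        # InstanceNorm branch: statistics over (H, W) per sample and channel.
        mu_i = x.mean(dim=(2, 3), keepdim=True)
        var_i = x.var(dim=(2, 3), unbiased=False, keepdim=True)
        x_i = (x - mu_i) / torch.sqrt(var_i + self.eps)
        # Convex combination of the two branches, then the affine transform.
        rho = self.rho.clamp(0.0, 1.0)
        return self.gamma * (rho * x_b + (1 - rho) * x_i) + self.beta

y = BatchInstanceNorm2d(16)(torch.randn(8, 16, 32, 32))
print(y.shape)  # torch.Size([8, 16, 32, 32])
```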
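Switchable Normalization is likewise hand-rolled here: softmax weights decide, per layer, how much of the InstanceNorm, LayerNorm, and BatchNorm statistics to use. This is a simplified sketch of Luo et al. (2018); the published method additionally shares intermediate statistics for efficiency:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    """Sketch of Switchable Normalization: a learned, softmax-weighted mix of
    InstanceNorm, LayerNorm, and BatchNorm statistics."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.mean_w = nn.Parameter(torch.ones(3))   # weights over {IN, LN, BN} means
        self.var_w = nn.Parameter(torch.ones(3))    # weights over {IN, LN, BN} variances
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Candidate statistics over different axes of the (N, C, H, W) tensor.
        mu_in = x.mean(dim=(2, 3), keepdim=True)        # per sample, per channel
        var_in = x.var(dim=(2, 3), unbiased=False, keepdim=True)
        mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)     # per sample, all channels
        var_ln = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
        mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)     # per channel, whole batch
        var_bn = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        wm = F.softmax(self.mean_w, dim=0)
        wv = F.softmax(self.var_w, dim=0)
        mu = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

y = SwitchableNorm2d(16)(torch.randn(8, 16, 32, 32))
print(y.shape)  # torch.Size([8, 16, 32, 32])
```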
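Batch Renormalization keeps BatchNorm's axes but adds the correction terms r and d, which pull the batch statistics toward the running statistics and are treated as constants (no gradient). A simplified 1-D sketch; `momentum`, `r_max`, and `d_max` are plausible defaults, not values from the source:

```python
import torch
import torch.nn as nn

class BatchRenorm1d(nn.Module):
    """Sketch of Batch Renormalization (Ioffe, 2017): normalize with batch
    statistics, then correct toward the running statistics via clipped,
    stop-gradient factors r and d."""
    def __init__(self, features, eps=1e-5, momentum=0.01, r_max=3.0, d_max=5.0):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.register_buffer("running_mean", torch.zeros(features))
        self.register_buffer("running_std", torch.ones(features))
        self.eps, self.momentum, self.r_max, self.d_max = eps, momentum, r_max, d_max

    def forward(self, x):                                   # x: (batch, features)
        if self.training:
            mu = x.mean(dim=0)
            std = torch.sqrt(x.var(dim=0, unbiased=False) + self.eps)
            with torch.no_grad():                           # r and d are stop-gradient
                r = (std / self.running_std).clamp(1 / self.r_max, self.r_max)
                d = ((mu - self.running_mean) / self.running_std).clamp(-self.d_max, self.d_max)
                self.running_mean += self.momentum * (mu - self.running_mean)
                self.running_std += self.momentum * (std - self.running_std)
            x_hat = (x - mu) / std * r + d
        else:                                               # inference: running stats only
            x_hat = (x - self.running_mean) / self.running_std
        return self.gamma * x_hat + self.beta

y = BatchRenorm1d(32)(torch.randn(4, 32))   # usable even with a very small batch
print(y.shape)                              # torch.Size([4, 32])
```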
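Mean-only BatchNorm subtracts the per-feature batch mean (or the running mean at inference) and skips the variance division altogether; it is typically paired with weight normalization (Salimans & Kingma, 2016). A minimal 1-D sketch, with the class name and momentum chosen for illustration:

```python
import torch
import torch.nn as nn

class MeanOnlyBatchNorm(nn.Module):
    """Sketch of mean-only BatchNorm: subtract the per-feature batch mean
    (running mean at inference) but skip the variance scaling."""
    def __init__(self, features: int, momentum: float = 0.1):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(features))        # learned bias
        self.register_buffer("running_mean", torch.zeros(features))
        self.momentum = momentum

    def forward(self, x):                                      # x: (batch, features)
        if self.training:
            mu = x.mean(dim=0)
            with torch.no_grad():
                self.running_mean += self.momentum * (mu - self.running_mean)
        else:
            mu = self.running_mean
        return x - mu + self.beta                              # no division by std

y = MeanOnlyBatchNorm(32)(torch.randn(8, 32))
print(y.shape)  # torch.Size([8, 32])
```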
Further reading:

- BatchNorm
- Group Normalization
- Weight Normalization
- LayerNorm
- RMSNorm
  - [1910.07467] Root Mean Square Layer Normalization
  - [2305.14858] Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers