What is Normalization? Step-by-Step Explanation
Complete step-by-step explanation of normalization in transformer models: how normalization stabilizes training and improves model performance.
Table of Contents
- The Problem Normalization Solves
- What is Normalization?
- How Layer Normalization Works: Step-by-Step
- Complete Example: Normalizing a Vector
- Why Normalization Matters
- Pre-Norm vs Post-Norm Architecture
- Visual Representation
- Key Takeaways
4.1 The Problem Normalization Solves
The Challenge
During training, activations can become unstable:
Problem 1: Varying Activations
Layer 1 output: [0.1, 0.2, 0.3, ...] (small values)
Layer 2 output: [10.5, 20.3, 15.8, ...] (large values)
Layer 3 output: [0.01, 0.02, 0.03, ...] (very small values)
Problem 2: Internal Covariate Shift
- Activations change distribution as weights update
- Later layers struggle to adapt to changing inputs
- Training becomes slower and less stable
Problem 3: Gradient Problems
Large activations → Large gradients → Exploding gradients
Small activations → Small gradients → Vanishing gradients
The Solution: Normalization
Normalization standardizes activations to have consistent statistics (mean zero, variance one), making training stable and efficient.
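The effect can be seen in a toy sketch (hypothetical, pure Python): repeatedly scaling a vector, standing in for stacked layers, makes its magnitude explode, while re-normalizing after each "layer" keeps it in a stable range.

```python
def norm(x, eps=1e-5):
    """Standardize a vector to mean 0, variance 1 (plus a small eps)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / (var + eps) ** 0.5 for xi in x]

x_raw = [0.1, 0.2, 0.3, 0.4]
x_normed = list(x_raw)
for _ in range(10):                        # ten "layers", each multiplies by 3
    x_raw = [3.0 * v for v in x_raw]
    x_normed = norm([3.0 * v for v in x_normed])

print(max(abs(v) for v in x_raw))      # exploded: 0.4 * 3**10 ≈ 23619.6
print(max(abs(v) for v in x_normed))   # stable: ≈ 1.34
```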
4.2 What is Normalization?
Simple Definition
Normalization is a technique that transforms activations to have:
- Mean of zero (centered)
- Variance of one (standardized scale)
Think of it like statistical standardization (z-scoring):
- Converts any distribution to a standard form
- Makes values comparable across different scales
- Helps the model learn faster and more reliably
Visual Analogy
Imagine weights on a scale:
Before Normalization:
Bronze weight: 1 kg
Silver weight: 100 kg
Gold weight: 0.001 kg
→ Hard to compare!
After Normalization:
All weights standardized to mean 0, variance 1
→ Easy to compare and work with!
Types of Normalization
In transformers, we use Layer Normalization:
- Layer Normalization: Normalizes across features (dimensions) for each sample
- Batch Normalization: Normalizes across samples in a batch (not used in transformers)
- Instance Normalization: Normalizes each channel of each sample independently (common in vision models)
Why Layer Normalization?
- Works well with variable sequence lengths
- Doesn't depend on batch size
- Suitable for autoregressive models
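The difference in axes can be illustrated with a small pure-Python sketch (hypothetical example data): LayerNorm takes statistics per sample, across its features, while BatchNorm takes them per feature, across the batch.

```python
batch = [
    [1.0, 2.0, 3.0, 4.0],      # sample 1
    [10.0, 20.0, 30.0, 40.0],  # sample 2
]

def mean(values):
    return sum(values) / len(values)

# LayerNorm: one mean per sample, computed across its features (rows)
layer_means = [mean(row) for row in batch]        # [2.5, 25.0]

# BatchNorm: one mean per feature, computed across the batch (columns)
batch_means = [mean(col) for col in zip(*batch)]  # [5.5, 11.0, 16.5, 22.0]

print(layer_means, batch_means)
```

Because LayerNorm's statistics never mix information across samples, it behaves identically for any batch size, which is why it suits autoregressive transformers.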
4.3 How Layer Normalization Works: Step-by-Step
High-Level Overview
Step 1: Compute mean of activations
Step 2: Compute variance of activations
Step 3: Normalize (subtract mean, divide by std)
Step 4: Scale and shift (learnable parameters)
Detailed Step-by-Step
Step 1: Compute Mean
Calculate the average value across all dimensions:
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
Example:
Input vector:
x = [1.0, 2.0, 3.0, 4.0]
d = 4 (number of dimensions)
Compute mean:
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4
= 10.0 / 4
= 2.5
Meaning: The center of the distribution is at 2.5
Step 2: Compute Variance
Measure how spread out the values are:
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
Example:
Using the same input:
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
Compute variance:
σ² = [(1.0 - 2.5)² + (2.0 - 2.5)² + (3.0 - 2.5)² + (4.0 - 2.5)²] / 4
= [(-1.5)² + (-0.5)² + (0.5)² + (1.5)²] / 4
= [2.25 + 0.25 + 0.25 + 2.25] / 4
= 5.0 / 4
= 1.25
Compute standard deviation:
σ = √σ² = √1.25 ≈ 1.118
Meaning: Values are spread out with standard deviation of 1.118
Step 3: Normalize
Subtract mean and divide by standard deviation:
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
Where:
- \epsilon = a small constant (default: 1e-5) to prevent division by zero
Example:
Using the same input:
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
σ ≈ 1.118
ε = 0.00001
Normalize each element:
x̂₁ = (1.0 - 2.5) / √(1.25 + 0.00001) ≈ -1.341
x̂₂ = (2.0 - 2.5) / √(1.25 + 0.00001) ≈ -0.447
x̂₃ = (3.0 - 2.5) / √(1.25 + 0.00001) ≈ 0.447
x̂₄ = (4.0 - 2.5) / √(1.25 + 0.00001) ≈ 1.341
Result:
x̂ = [-1.341, -0.447, 0.447, 1.341]
Check:
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
Meaning: Values are now standardized!
Step 4: Scale and Shift
Apply learnable parameters:
\text{LayerNorm}(x) = \gamma \odot \hat{x} + \beta
Where:
- \gamma = learnable scale parameter (initialized to 1)
- \beta = learnable shift parameter (initialized to 0)
- \odot = element-wise multiplication
Example:
Normalized vector:
x̂ = [-1.341, -0.447, 0.447, 1.341]
Learnable parameters (initialized):
γ = [1.0, 1.0, 1.0, 1.0] (scale)
β = [0.0, 0.0, 0.0, 0.0] (shift)
Apply scale and shift:
Output = γ ⊙ x̂ + β
= [1.0, 1.0, 1.0, 1.0] ⊙ [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341]
Initially, the scale-and-shift step is the identity, so the output is just the normalized vector.
During training, γ and β learn optimal scale and shift.
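The four steps above can be collected into a minimal pure-Python LayerNorm (a teaching sketch, not the repo's actual implementation; `gamma` and `beta` are plain lists here, whereas a real model stores them as learnable tensors):

```python
def layer_norm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d                                        # Step 1: mean
    var = sum((xi - mu) ** 2 for xi in x) / d              # Step 2: variance
    x_hat = [(xi - mu) / (var + eps) ** 0.5 for xi in x]   # Step 3: normalize
    return [g * xh + b                                     # Step 4: scale & shift
            for g, xh, b in zip(gamma, x_hat, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 2) for v in out])  # ≈ [-1.34, -0.45, 0.45, 1.34]
```

With γ initialized to ones and β to zeros, the output matches the hand-computed normalized vector from Step 3, as expected.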
4.4 Complete Example: Normalizing a Vector
Input
Word embedding after attention: [0.146, 0.108, 0.192, 0.155, ..., 0.11]
Dimension: 512
Step-by-Step Processing
Step 1: Compute Mean
Input:
x = [0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
Compute mean:
μ = (0.146 + 0.108 + 0.192 + ... + 0.11) / 512
≈ 0.135
Visualization:
Values: [0.146, 0.108, 0.192, ..., 0.11]
└────────────────────────────────┘
Mean: 0.135 (center point)
Step 2: Compute Variance
Compute variance:
σ² = [(0.146 - 0.135)² + (0.108 - 0.135)² + (0.192 - 0.135)² + ... + (0.11 - 0.135)²] / 512
≈ 0.0023
Compute standard deviation:
σ = √0.0023 ≈ 0.048
Visualization:
Values: [0.146, 0.108, 0.192, ..., 0.11]
Spread: └───────── σ ≈ 0.048 ──────────┘
Step 3: Normalize
Normalize each element:
x̂₁ = (0.146 - 0.135) / √(0.0023 + 0.00001) ≈ 0.229
x̂₂ = (0.108 - 0.135) / √(0.0023 + 0.00001) ≈ -0.562
x̂₃ = (0.192 - 0.135) / √(0.0023 + 0.00001) ≈ 1.188
...
x̂₅₁₂ = (0.11 - 0.135) / √(0.0023 + 0.00001) ≈ -0.521
Result:
x̂ = [0.229, -0.562, 1.188, ..., -0.521]
Properties:
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
Step 4: Scale and Shift
Apply learnable parameters:
γ = [1.0, 1.0, ..., 1.0] (512 values, may change during training)
β = [0.0, 0.0, ..., 0.0] (512 values, may change during training)
Output:
Output = γ ⊙ x̂ + β
= [0.229, -0.562, 1.188, ..., -0.521]
After training, γ and β adapt to optimal values!
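The mean-0, variance-1 property holds for any input vector, whatever its original scale. A quick sanity check in pure Python (using a seeded pseudo-random 512-dimensional vector as a stand-in for the embedding above, since only the first few of its entries are given):

```python
import random

random.seed(0)
x = [random.uniform(0.0, 0.3) for _ in range(512)]  # roughly the example's value range

mu = sum(x) / len(x)
var = sum((xi - mu) ** 2 for xi in x) / len(x)
x_hat = [(xi - mu) / (var + 1e-5) ** 0.5 for xi in x]

new_mu = sum(x_hat) / len(x_hat)
new_var = sum(v ** 2 for v in x_hat) / len(x_hat)
print(new_mu, new_var)  # mean ≈ 0; variance just under 1 (epsilon slightly shrinks it)
```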
4.5 Why Normalization Matters
Benefit 1: Stable Training
Without Normalization:
Layer 1: activations = [0.1, 0.2, ...]
Layer 2: activations = [50.0, 100.0, ...] ← Exploding!
Layer 3: activations = [0.001, 0.002, ...] ← Vanishing!
With Normalization:
Layer 1: activations = [0.1, -0.2, ...] (normalized)
Layer 2: activations = [0.3, -0.1, ...] (normalized)
Layer 3: activations = [0.2, 0.4, ...] (normalized)
→ Consistent scale throughout!
Benefit 2: Better Gradient Flow
Normalization helps gradients flow better:
Without Normalization:
Gradient 1: 0.0001 (too small, vanishing)
Gradient 2: 1000.0 (too large, exploding)
Gradient 3: 0.001 (too small)
With Normalization:
Gradient 1: 0.01 (reasonable)
Gradient 2: 0.02 (reasonable)
Gradient 3: 0.015 (reasonable)
→ Stable gradients!
Benefit 3: Faster Convergence
Normalized activations allow:
- Higher learning rates
- Faster weight updates
- Quicker convergence to good solutions
Analogy:
- Without normalization: Walking on rough terrain (slow progress)
- With normalization: Walking on smooth path (fast progress)
Benefit 4: Regularization Effect
Normalization can also act as a mild regularizer:
- Reduces internal covariate shift
- Smooths the optimization landscape
- Can help reduce overfitting in some settings
4.6 Pre-Norm vs Post-Norm Architecture
Post-Norm (Original Transformer)
Order:
Input → Attention → Add (residual) → LayerNorm → Output
Equation:
x_out = LayerNorm(x + Attention(x))
Problems:
- Can be unstable with many layers
- Gradient flow can be difficult
- Harder to train deep networks
Pre-Norm (Modern Approach)
Order:
Input → LayerNorm → Attention → Add (residual) → Output
Equation:
x_out = x + Attention(LayerNorm(x))
Benefits:
- More stable training
- Better gradient flow
- Easier to train deep networks
Visual Comparison:
Post-Norm:
Input
↓
┌──────────────┐
│ Attention │
└──────┬───────┘
↓
┌──────────────┐
│ LayerNorm │ ← Normalization after
└──────┬───────┘
↓
Output
Pre-Norm:
Input
↓
┌──────────────┐
│ LayerNorm │ ← Normalization before
└──────┬───────┘
↓
┌──────────────┐
│ Attention │
└──────┬───────┘
↓
Output
Our Model Uses Pre-Norm!
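The two orderings can be sketched side by side in framework-free Python (`sublayer` is a hypothetical stand-in for attention or the feed-forward network; `norm` is LayerNorm with γ = 1, β = 0):

```python
def norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / (var + eps) ** 0.5 for xi in x]

def post_norm_block(x, sublayer):
    # Original Transformer: normalize AFTER the residual addition
    return norm([xi + si for xi, si in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Modern variant: normalize BEFORE the sublayer; the residual path stays untouched
    return [xi + si for xi, si in zip(x, sublayer(norm(x)))]

def double(v):
    return [2.0 * t for t in v]  # toy stand-in for attention

x = [1.0, 2.0, 3.0, 4.0]
print(post_norm_block(x, double))
print(pre_norm_block(x, double))
```

Note the key structural difference: in pre-norm, the input `x` flows to the output through an identity path with no normalization in the way, which is what gives deep pre-norm stacks their better gradient flow.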
4.7 Visual Representation
Normalization Process
Input Vector
│
│ [1.0, 2.0, 3.0, 4.0]
↓
┌─────────────────────────────┐
│ Step 1: Compute Mean │
│ μ = 2.5 │
└──────────┬──────────────────┘
│
↓
┌─────────────────────────────┐
│ Step 2: Compute Variance │
│ σ² = 1.25, σ ≈ 1.118 │
└──────────┬──────────────────┘
│
↓
┌────────────────────────────────┐
│ Step 3: Normalize │
│ x̂ = (x - μ) / σ │
│ [-1.341, -0.447, 0.447, 1.341] │
└──────────┬─────────────────────┘
│
↓
┌─────────────────────────────┐
│ Step 4: Scale and Shift │
│ Output = γ ⊙ x̂ + β │
└──────────┬──────────────────┘
│
↓
Output Vector
Distribution Transformation
Before Normalization:
Distribution:
│
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
│
0.0├─────────────────────────
0 1 2 3 4 5
Mean: 2.5, Std: 1.118
After Normalization:
Distribution:
│
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
│
0.0├─────────────────────────
-2 -1 0 1 2 3
Mean: 0, Std: 1
Standardized!
Gradient Flow Visualization
Without Normalization:
Gradient Magnitude:
│
1000│ ●
│
100│
│
10│
│
1│ ●
│
0.1│ ●
│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Unstable, varying magnitudes)
With Normalization:
Gradient Magnitude:
│
1000│
│
100│
│
10│
│ ● ● ● ● ●
1│
│
0.1│
│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Stable, consistent magnitudes)
4.8 Key Takeaways: Normalization
✅ Normalization standardizes activations to mean 0, variance 1
✅ Stabilizes training by preventing exploding/vanishing gradients
✅ Enables faster convergence and higher learning rates
✅ Pre-norm architecture is preferred for deep networks
✅ Learnable parameters (γ, β) allow optimal scaling
Complete Mathematical Formula
Layer Normalization Formula
For input \mathbf{x} \in \mathbb{R}^d:
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
Where:
- \epsilon = small constant (default: 1e-5) to prevent division by zero
- \gamma = learnable scale parameter (initialized to 1)
- \beta = learnable shift parameter (initialized to 0)
- \odot = element-wise multiplication
- d = number of dimensions
In Transformer Block
Pre-Norm Architecture:
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
Normalization happens before attention and feed-forward!
This document provides a step-by-step explanation of normalization, the critical component that stabilizes training and enables efficient learning in transformer models.