# What is Normalization? Step-by-Step Explanation

Complete step-by-step explanation of normalization in transformer models: how normalization stabilizes training and improves model performance.

## Table of Contents

1. [The Problem Normalization Solves](#41-the-problem-normalization-solves)
2. [What is Normalization?](#42-what-is-normalization)
3. [How Layer Normalization Works: Step-by-Step](#43-how-layer-normalization-works-step-by-step)
4. [Complete Example: Normalizing a Vector](#44-complete-example-normalizing-a-vector)
5. [Why Normalization Matters](#45-why-normalization-matters)
6. [Pre-Norm vs Post-Norm Architecture](#46-pre-norm-vs-post-norm-architecture)
7. [Visual Representation](#47-visual-representation)
8. [Key Takeaways](#48-key-takeaways)

---

## 4.1 The Problem Normalization Solves

### The Challenge

**During training, activations can become unstable:**

**Problem 1: Varying Activations**

```
Layer 1 output: [0.1, 0.2, 0.3, ...]     (small values)
Layer 2 output: [10.5, 20.3, 15.8, ...]  (large values)
Layer 3 output: [0.01, 0.02, 0.03, ...]  (very small values)
```

**Problem 2: Internal Covariate Shift**

- Activations change distribution as weights update
- Later layers struggle to adapt to changing inputs
- Training becomes slower and less stable

**Problem 3: Gradient Problems**

```
Large activations → Large gradients → Exploding gradients
Small activations → Small gradients → Vanishing gradients
```

### The Solution: Normalization

**Normalization standardizes activations to have consistent statistics (mean zero, variance one), making training stable and efficient.**

---

## 4.2 What is Normalization?

### Simple Definition

**Normalization** is a technique that transforms activations to have:

- **Mean of zero** (centered)
- **Variance of one** (standardized scale)

**Think of it like standardization:**

- Converts any distribution to a standard form
- Makes values comparable across different scales
- Helps the model learn faster and more reliably

### Visual Analogy

**Imagine weights on a scale:**

**Before Normalization:**

```
Bronze weight: 1 kg
Silver weight: 100 kg
Gold weight: 0.001 kg
→ Hard to compare!
```

**After Normalization:**

```
All weights standardized to mean 0, variance 1
→ Easy to compare and work with!
```
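To make "mean 0, variance 1" concrete, here is a minimal sketch (assuming NumPy is available) that standardizes the three example weights from the analogy:

```python
import numpy as np

# Hypothetical values from the analogy above (in kg)
weights = np.array([1.0, 100.0, 0.001])

# Standardize: subtract the mean, divide by the standard deviation
standardized = (weights - weights.mean()) / weights.std()

print(standardized)         # values on a comparable scale
print(standardized.mean())  # ≈ 0
print(standardized.std())   # ≈ 1
```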
### Types of Normalization

**In transformers, we use Layer Normalization:**

- **Layer Normalization:** Normalizes across features (dimensions) for each sample
- **Batch Normalization:** Normalizes across samples in a batch (not used in transformers)
- **Instance Normalization:** Normalizes each channel of each sample independently

**Why Layer Normalization?**

- Works well with variable sequence lengths
- Doesn't depend on batch size
- Suitable for autoregressive models

---

## 4.3 How Layer Normalization Works: Step-by-Step

### High-Level Overview

```
Step 1: Compute mean of activations
Step 2: Compute variance of activations
Step 3: Normalize (subtract mean, divide by std)
Step 4: Scale and shift (learnable parameters)
```

### Detailed Step-by-Step

#### Step 1: Compute Mean

**Calculate the average value across all dimensions:**

```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```

**Example:**

**Input vector:**

```
x = [1.0, 2.0, 3.0, 4.0]
d = 4 (number of dimensions)
```

**Compute mean:**

```
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4
  = 10.0 / 4
  = 2.5
```

**Meaning:** The center of the distribution is at 2.5

#### Step 2: Compute Variance

**Measure how spread out the values are:**

```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```

**Example:**

**Using the same input:**

```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
```

**Compute variance:**

```
σ² = [(1.0 - 2.5)² + (2.0 - 2.5)² + (3.0 - 2.5)² + (4.0 - 2.5)²] / 4
   = [(-1.5)² + (-0.5)² + (0.5)² + (1.5)²] / 4
   = [2.25 + 0.25 + 0.25 + 2.25] / 4
   = 5.0 / 4
   = 1.25
```

**Compute standard deviation:**

```
σ = √σ² = √1.25 ≈ 1.118
```

**Meaning:** Values are spread out with a standard deviation of 1.118

#### Step 3: Normalize

**Subtract the mean and divide by the standard deviation:**

```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```

**Where:**

- $\epsilon$ is a small constant (default: 1e-5) to prevent division by zero

**Example:**

**Using the same input:**

```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
σ² = 1.25 (σ ≈ 1.118)
ε = 0.00001
```

**Normalize each element:**

```
x̂₁ = (1.0 - 2.5) / √(1.25 + 0.00001) ≈ -1.342
x̂₂ = (2.0 - 2.5) / √(1.25 + 0.00001) ≈ -0.447
x̂₃ = (3.0 - 2.5) / √(1.25 + 0.00001) ≈ 0.447
x̂₄ = (4.0 - 2.5) / √(1.25 + 0.00001) ≈ 1.342
```

**Result:**

```
x̂ = [-1.342, -0.447, 0.447, 1.342]
```

**Check:**

- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓

**Meaning:** Values are now standardized!

#### Step 4: Scale and Shift

**Apply learnable parameters:**

```math
\text{LayerNorm}(x) = \gamma \odot \hat{x} + \beta
```

**Where:**

- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication

**Example:**

**Normalized vector:**

```
x̂ = [-1.342, -0.447, 0.447, 1.342]
```

**Learnable parameters (initialized):**

```
γ = [1.0, 1.0, 1.0, 1.0] (scale)
β = [0.0, 0.0, 0.0, 0.0] (shift)
```

**Apply scale and shift:**

```
Output = γ ⊙ x̂ + β
       = [1.0, 1.0, 1.0, 1.0] ⊙ [-1.342, -0.447, 0.447, 1.342] + [0.0, 0.0, 0.0, 0.0]
       = [-1.342, -0.447, 0.447, 1.342] + [0.0, 0.0, 0.0, 0.0]
       = [-1.342, -0.447, 0.447, 1.342]
```

**With γ = 1 and β = 0, the scale-and-shift step is initially the identity.**

**During training, γ and β learn the optimal scale and shift.**
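The four steps map directly onto a few lines of array code. Here is a minimal sketch (assuming NumPy; the function name `layer_norm` is just illustrative) that reproduces the worked example above:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over one vector (Steps 1-4)."""
    mu = x.mean()                          # Step 1: mean
    var = x.var()                          # Step 2: variance (1/d · Σ(xᵢ - μ)²)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalize
    return gamma * x_hat + beta            # Step 4: scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.ones(4)    # initialized to 1
beta = np.zeros(4)    # initialized to 0

out = layer_norm(x, gamma, beta)
print(out)  # ≈ [-1.342, -0.447, 0.447, 1.342]
```

Because γ starts at 1 and β at 0, the output equals the normalized vector, exactly as in Step 4.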
---

## 4.4 Complete Example: Normalizing a Vector

### Input

```
Word embedding after attention: [0.146, 0.108, 0.192, 0.155, ..., 0.11]
Dimension: 512
```

### Step-by-Step Processing

#### Step 1: Compute Mean

**Input:**

```
x = [0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```

**Compute mean:**

```
μ = (0.146 + 0.108 + 0.192 + ... + 0.11) / 512
  ≈ 0.135
```

**Visualization:**

```
Values: [0.146, 0.108, 0.192, ..., 0.11]
        └────────────────────────────────┘
              Mean: 0.135 (center point)
```

#### Step 2: Compute Variance

**Compute variance:**

```
σ² = [(0.146 - 0.135)² + (0.108 - 0.135)² + (0.192 - 0.135)² + ... + (0.11 - 0.135)²] / 512
   ≈ 0.0023
```

**Compute standard deviation:**

```
σ = √0.0023 ≈ 0.048
```

**Visualization:**

```
Values: [0.146, 0.108, 0.192, ..., 0.11]
Spread: └───────── σ ≈ 0.048 ──────────┘
```

#### Step 3: Normalize

**Normalize each element, dividing by √(σ² + ε) ≈ 0.048:**

```
x̂₁   = (0.146 - 0.135) / 0.048 ≈ 0.229
x̂₂   = (0.108 - 0.135) / 0.048 ≈ -0.562
x̂₃   = (0.192 - 0.135) / 0.048 ≈ 1.188
...
x̂₅₁₂ = (0.11 - 0.135) / 0.048 ≈ -0.521
```

**Result:**

```
x̂ = [0.229, -0.562, 1.188, ..., -0.521]
```

**Properties:**

- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓

#### Step 4: Scale and Shift

**Apply learnable parameters:**

```
γ = [1.0, 1.0, ..., 1.0] (512 values, may change during training)
β = [0.0, 0.0, ..., 0.0] (512 values, may change during training)
```

**Output:**

```
Output = γ ⊙ x̂ + β
       = [0.229, -0.562, 1.188, ..., -0.521]
```

**After training, γ and β adapt to optimal values!**
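The same computation can be checked against a standard library implementation. Here is a minimal sketch (assuming PyTorch; the random vector is only a stand-in for the real 512-dimensional embedding, which is not listed in full above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for the post-attention embedding: 512 values with mean ≈ 0.135, std ≈ 0.05
x = torch.randn(512) * 0.05 + 0.135

layer_norm = nn.LayerNorm(512, eps=1e-5)  # γ initialized to 1, β initialized to 0
out = layer_norm(x)

print(out.mean())  # ≈ 0
print(out.std())   # ≈ 1
```

With the default `elementwise_affine=True`, γ starts at 1 and β at 0, so the output is simply the standardized vector.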
---

## 4.5 Why Normalization Matters

### Benefit 1: Stable Training

**Without Normalization:**

```
Layer 1: activations = [0.1, 0.2, ...]
Layer 2: activations = [50.0, 100.0, ...]   ← Exploding!
Layer 3: activations = [0.001, 0.002, ...]  ← Vanishing!
```

**With Normalization:**

```
Layer 1: activations = [0.1, -0.2, ...] (normalized)
Layer 2: activations = [0.3, -0.1, ...] (normalized)
Layer 3: activations = [0.2, 0.4, ...]  (normalized)
→ Consistent scale throughout!
```

### Benefit 2: Better Gradient Flow

**Normalization helps gradients flow better:**

**Without Normalization:**

```
Gradient 1: 0.0001 (too small, vanishing)
Gradient 2: 1000.0 (too large, exploding)
Gradient 3: 0.001  (too small)
```

**With Normalization:**

```
Gradient 1: 0.01  (reasonable)
Gradient 2: 0.02  (reasonable)
Gradient 3: 0.015 (reasonable)
→ Stable gradients!
```

### Benefit 3: Faster Convergence

**Normalized activations allow:**

- Higher learning rates
- Faster weight updates
- Quicker convergence to good solutions

**Analogy:**

- **Without normalization:** Walking on rough terrain (slow progress)
- **With normalization:** Walking on a smooth path (fast progress)

### Benefit 4: Regularization Effect

**Normalization also acts as a mild form of regularization:**

- Reduces internal covariate shift
- Makes optimization easier
- Can help reduce overfitting

---

## 4.6 Pre-Norm vs Post-Norm Architecture

### Post-Norm (Original Transformer)

**Order:**

```
Input → Attention → LayerNorm → Output
```

**Equation:**

```
x_out = LayerNorm(x + Attention(x))
```

**Problems:**

- Can be unstable with many layers
- Gradient flow can be difficult
- Harder to train deep networks

### Pre-Norm (Modern Approach)

**Order:**

```
Input → LayerNorm → Attention → Output
```

**Equation:**

```
x_out = x + Attention(LayerNorm(x))
```

**Benefits:**

- More stable training
- Better gradient flow
- Easier to train deep networks

**Visual Comparison:**

**Post-Norm:**

```
     Input
       ↓
┌──────────────┐
│  Attention   │
└──────┬───────┘
       ↓
┌──────────────┐
│  LayerNorm   │  ← Normalization after
└──────┬───────┘
       ↓
     Output
```

**Pre-Norm:**

```
     Input
       ↓
┌──────────────┐
│  LayerNorm   │  ← Normalization before
└──────┬───────┘
       ↓
┌──────────────┐
│  Attention   │
└──────┬───────┘
       ↓
     Output
```

**Our Model Uses Pre-Norm!**

---

## 4.7 Visual Representation

### Normalization Process

```
     Input Vector
          │  [1.0, 2.0, 3.0, 4.0]
          ↓
┌─────────────────────────────┐
│ Step 1: Compute Mean        │
│ μ = 2.5                     │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ Step 2: Compute Variance    │
│ σ² = 1.25, σ ≈ 1.118        │
└──────────┬──────────────────┘
           ↓
┌────────────────────────────────┐
│ Step 3: Normalize              │
│ x̂ = (x - μ) / σ                │
│ [-1.342, -0.447, 0.447, 1.342] │
└──────────┬─────────────────────┘
           ↓
┌─────────────────────────────┐
│ Step 4: Scale and Shift     │
│ Output = γ ⊙ x̂ + β          │
└──────────┬──────────────────┘
           ↓
     Output Vector
```

### Distribution Transformation

**Before Normalization:**

```
Distribution:
      │
 0.4  │          ●
      │        ●   ●
 0.3  │       ●     ●
      │      ●       ●
 0.2  │     ●         ●
      │    ●           ●
 0.1  │   ●             ●
      │
 0.0  ├─────────────────────────
          0   1   2   3   4   5

Mean: 2.5, Std: 1.118
```

**After Normalization:**

```
Distribution:
      │
 0.4  │          ●
      │        ●   ●
 0.3  │       ●     ●
      │      ●       ●
 0.2  │     ●         ●
      │    ●           ●
 0.1  │   ●             ●
      │
 0.0  ├─────────────────────────
         -2  -1   0   1   2   3

Mean: 0, Std: 1
```

**Standardized!**

### Gradient Flow Visualization

**Without Normalization:**

```
Gradient Magnitude:
       │
 1000  │        ●
  100  │
   10  │
    1  │   ●
  0.1  │              ●
 0.01  │
       └────────────────────────
 Layer     1    2    3    4    5

(Unstable, varying magnitudes)
```

**With Normalization:**

```
Gradient Magnitude:
       │
 1000  │
  100  │
   10  │
       │   ●    ●    ●    ●    ●
    1  │
  0.1  │
 0.01  │
       └────────────────────────
 Layer     1    2    3    4    5

(Stable, consistent magnitudes)
```

---

## 4.8 Key Takeaways

✅ **Normalization standardizes activations to mean 0, variance 1**

✅ **Stabilizes training by preventing exploding/vanishing gradients**

✅ **Enables faster convergence and higher learning rates**

✅ **Pre-norm architecture is preferred for deep networks**

✅ **Learnable parameters (γ, β) allow optimal scaling**

---

## Complete Mathematical Formula

### Layer Normalization Formula

For input $\mathbf{x} \in \mathbb{R}^d$:

```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```

```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```

```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```

```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
```

**Where:**

- $\epsilon$ = small constant (default: 1e-5) to prevent division by zero
- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication
- $d$ = number of dimensions
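Translated into code, the formula is only a few lines. Below is a minimal sketch of a from-scratch module (assuming PyTorch; `MyLayerNorm` is an illustrative name, not a library class):

```python
import torch
import torch.nn as nn

class MyLayerNorm(nn.Module):
    """From-scratch layer normalization following the formula above."""

    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(d))   # learnable shift, initialized to 0

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)                  # μ over the feature dimension
        var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)   # σ² = 1/d · Σ(xᵢ - μ)²
        x_hat = (x - mu) / torch.sqrt(var + self.eps)      # normalize
        return self.gamma * x_hat + self.beta              # scale and shift

# Should agree with the built-in implementation up to numerical precision
x = torch.randn(2, 512)
print(torch.allclose(MyLayerNorm(512)(x), nn.LayerNorm(512)(x), atol=1e-6))  # expected: True
```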
### In Transformer Block

**Pre-Norm Architecture:**

```math
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
```

```math
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
```

```math
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
```

The feed-forward sub-layer is wrapped in exactly the same way: $\mathbf{x}_{out} = \mathbf{x} + \text{FFN}(\text{LayerNorm}(\mathbf{x}))$.

**Normalization happens before attention and feed-forward!**

---

*This document provides a step-by-step explanation of normalization, the critical component that stabilizes training and enables efficient learning in transformer models.*
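As a closing illustration, here is a minimal sketch of layer normalization inside a pre-norm transformer block (assuming PyTorch; `PreNormBlock` and the head/feed-forward sizes are illustrative choices, not values taken from this document):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm block: normalize first, then attention/FFN, then the residual add."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x_out = x_in + Attention(LayerNorm(x_in))
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Same pattern for the feed-forward sub-layer
        x = x + self.ffn(self.ln2(x))
        return x

x = torch.randn(2, 10, 512)      # (batch, sequence, d_model)
print(PreNormBlock()(x).shape)   # torch.Size([2, 10, 512])
```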