sheepOp/docs/NORMALIZATION_EXPLAINED.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00

# What is Normalization? Step-by-Step Explanation
Complete step-by-step explanation of normalization in transformer models: how normalization stabilizes training and improves model performance.
## Table of Contents
1. [The Problem Normalization Solves](#41-the-problem-normalization-solves)
2. [What is Normalization?](#42-what-is-normalization)
3. [How Layer Normalization Works: Step-by-Step](#43-how-layer-normalization-works-step-by-step)
4. [Complete Example: Normalizing a Vector](#44-complete-example-normalizing-a-vector)
5. [Why Normalization Matters](#45-why-normalization-matters)
6. [Pre-Norm vs Post-Norm Architecture](#46-pre-norm-vs-post-norm-architecture)
7. [Visual Representation](#47-visual-representation)
8. [Key Takeaways](#48-key-takeaways)
---
## 4.1 The Problem Normalization Solves
### The Challenge
**During training, activations can become unstable:**
**Problem 1: Varying Activations**
```
Layer 1 output: [0.1, 0.2, 0.3, ...] (small values)
Layer 2 output: [10.5, 20.3, 15.8, ...] (large values)
Layer 3 output: [0.01, 0.02, 0.03, ...] (very small values)
```
**Problem 2: Internal Covariate Shift**
- Activations change distribution as weights update
- Later layers struggle to adapt to changing inputs
- Training becomes slower and less stable
**Problem 3: Gradient Problems**
```
Large activations → Large gradients → Exploding gradients
Small activations → Small gradients → Vanishing gradients
```
### The Solution: Normalization
**Normalization standardizes activations to have consistent statistics (mean zero, variance one), making training stable and efficient.**
---
## 4.2 What is Normalization?
### Simple Definition
**Normalization** is a technique that transforms activations to have:
- **Mean of zero** (centered)
- **Variance of one** (standardized scale)
**Think of it like standardization:**
- Converts any distribution to a standard form
- Makes values comparable across different scales
- Helps the model learn faster and more reliably
### Visual Analogy
**Imagine weights on a scale:**
**Before Normalization:**
```
Bronze weight: 1 kg
Silver weight: 100 kg
Gold weight: 0.001 kg
→ Hard to compare!
```
**After Normalization:**
```
All weights standardized to mean 0, variance 1
→ Easy to compare and work with!
```
### Types of Normalization
**In transformers, we use Layer Normalization:**
- **Layer Normalization:** Normalizes across features (dimensions) for each sample
- **Batch Normalization:** Normalizes across samples in a batch (not used in transformers)
- **Instance Normalization:** Normalizes each sample independently
**Why Layer Normalization?**
- Works well with variable sequence lengths
- Doesn't depend on batch size
- Suitable for autoregressive models
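The axis difference between the two main variants can be shown in a few lines of NumPy (a sketch with a toy 3×4 batch; ε is omitted here for clarity):

```python
import numpy as np

# Toy batch: 3 samples, 4 features each
x = np.array([[ 1.0,  2.0,  3.0,  4.0],
              [10.0, 20.0, 30.0, 40.0],
              [ 0.1,  0.2,  0.3,  0.4]])

# Layer norm: statistics per SAMPLE, across the feature axis
layer_normed = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

# Batch norm: statistics per FEATURE, across the batch axis
batch_normed = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)

# Each ROW of layer_normed has mean 0; each COLUMN of batch_normed has mean 0
print(layer_normed.mean(axis=-1))
print(batch_normed.mean(axis=0))
```

Because layer norm only looks along the feature axis, its statistics are identical for a batch of 1 and a batch of 1000, which is why it suits variable-length autoregressive decoding.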
---
## 4.3 How Layer Normalization Works: Step-by-Step
### High-Level Overview
```
Step 1: Compute mean of activations
Step 2: Compute variance of activations
Step 3: Normalize (subtract mean, divide by std)
Step 4: Scale and shift (learnable parameters)
```
### Detailed Step-by-Step
#### Step 1: Compute Mean
**Calculate the average value across all dimensions:**
```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```
**Example:**
**Input vector:**
```
x = [1.0, 2.0, 3.0, 4.0]
d = 4 (number of dimensions)
```
**Compute mean:**
```
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4
= 10.0 / 4
= 2.5
```
**Meaning:** The center of the distribution is at 2.5
#### Step 2: Compute Variance
**Measure how spread out the values are:**
```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```
**Example:**
**Using the same input:**
```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
```
**Compute variance:**
```
σ² = [(1.0 - 2.5)² + (2.0 - 2.5)² + (3.0 - 2.5)² + (4.0 - 2.5)²] / 4
= [(-1.5)² + (-0.5)² + (0.5)² + (1.5)²] / 4
= [2.25 + 0.25 + 0.25 + 2.25] / 4
= 5.0 / 4
= 1.25
```
**Compute standard deviation:**
```
σ = √σ² = √1.25 ≈ 1.118
```
**Meaning:** Values are spread out with standard deviation of 1.118
#### Step 3: Normalize
**Subtract mean and divide by standard deviation:**
```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```
**Where:**
- $\epsilon$ is a small constant (default: 1e-5) to prevent division by zero
**Example:**
**Using the same input:**
```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
σ ≈ 1.118
ε = 0.00001
```
**Normalize each element (denominator: √(σ² + ε) = √(1.25 + 0.00001) ≈ 1.118):**
```
x̂₁ = (1.0 - 2.5) / √(1.25 + 0.00001) ≈ -1.341
x̂₂ = (2.0 - 2.5) / √(1.25 + 0.00001) ≈ -0.447
x̂₃ = (3.0 - 2.5) / √(1.25 + 0.00001) ≈ 0.447
x̂₄ = (4.0 - 2.5) / √(1.25 + 0.00001) ≈ 1.341
```
**Result:**
```
x̂ = [-1.341, -0.447, 0.447, 1.341]
```
**Check:**
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
**Meaning:** Values are now standardized!
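The arithmetic above can be checked in a few lines of NumPy (a sketch; note that `np.var` computes the population variance used in the formula):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
eps = 1e-5

mu = x.mean()                          # 2.5
var = x.var()                          # 1.25 (population variance)
x_hat = (x - mu) / np.sqrt(var + eps)

print(x_hat)  # ≈ [-1.34, -0.45, 0.45, 1.34]
```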
#### Step 4: Scale and Shift
**Apply learnable parameters:**
```math
\text{LayerNorm}(x) = \gamma \odot \hat{x} + \beta
```
**Where:**
- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication
**Example:**
**Normalized vector:**
```
x̂ = [-1.341, -0.447, 0.447, 1.341]
```
**Learnable parameters (initialized):**
```
γ = [1.0, 1.0, 1.0, 1.0] (scale)
β = [0.0, 0.0, 0.0, 0.0] (shift)
```
**Apply scale and shift:**
```
Output = γ ⊙ x̂ + β
= [1.0, 1.0, 1.0, 1.0] ⊙ [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341]
```
**At initialization, the scale-and-shift step is the identity, so LayerNorm outputs the normalized vector unchanged.**
**During training, γ and β learn the optimal scale and shift for each dimension.**
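Putting the four steps together, a minimal single-vector Layer Normalization might look like this (a NumPy sketch, not the repository's actual implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()                           # Step 1: compute mean
    var = x.var()                           # Step 2: compute variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 3: normalize
    return gamma * x_hat + beta             # Step 4: scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.ones_like(x)    # scale, initialized to 1
beta = np.zeros_like(x)    # shift, initialized to 0

out = layer_norm(x, gamma, beta)
print(out)  # with γ=1, β=0 this is just the normalized vector x̂
```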
---
## 4.4 Complete Example: Normalizing a Vector
### Input
```
Word embedding after attention: [0.146, 0.108, 0.192, 0.155, ..., 0.11]
Dimension: 512
```
### Step-by-Step Processing
#### Step 1: Compute Mean
**Input:**
```
x = [0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```
**Compute mean:**
```
μ = (0.146 + 0.108 + 0.192 + ... + 0.11) / 512
≈ 0.135
```
**Visualization:**
```
Values: [0.146, 0.108, 0.192, ..., 0.11]
└────────────────────────────────┘
Mean: 0.135 (center point)
```
#### Step 2: Compute Variance
**Compute variance:**
```
σ² = [(0.146 - 0.135)² + (0.108 - 0.135)² + (0.192 - 0.135)² + ... + (0.11 - 0.135)²] / 512
≈ 0.0023
```
**Compute standard deviation:**
```
σ = √0.0023 ≈ 0.048
```
**Visualization:**
```
Values: [0.146, 0.108, 0.192, ..., 0.11]
Spread: └───────── σ ≈ 0.048 ──────────┘
```
#### Step 3: Normalize
**Normalize each element (denominator: √(σ² + ε) = √(0.0023 + 0.00001) ≈ 0.048):**
```
x̂₁ = (0.146 - 0.135) / √(0.0023 + 0.00001) ≈ 0.229
x̂₂ = (0.108 - 0.135) / √(0.0023 + 0.00001) ≈ -0.562
x̂₃ = (0.192 - 0.135) / √(0.0023 + 0.00001) ≈ 1.188
...
x̂₅₁₂ = (0.11 - 0.135) / √(0.0023 + 0.00001) ≈ -0.521
```
**Result:**
```
x̂ = [0.229, -0.562, 1.188, ..., -0.521]
```
**Properties:**
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
#### Step 4: Scale and Shift
**Apply learnable parameters:**
```
γ = [1.0, 1.0, ..., 1.0] (512 values, may change during training)
β = [0.0, 0.0, ..., 0.0] (512 values, may change during training)
```
**Output:**
```
Output = γ ⊙ x̂ + β
= [0.229, -0.562, 1.188, ..., -0.521]
```
**After training, γ and β adapt to optimal values!**
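The same properties hold at full model width. A sketch with a random 512-dimensional vector whose raw statistics roughly match the example above (the values are generated for illustration, not taken from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.135 + 0.048 * rng.normal(size=512)   # mean ≈ 0.135, std ≈ 0.048

mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var + 1e-5)

print(x_hat.mean(), x_hat.std())  # ≈ 0 and ≈ 1
```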
---
## 4.5 Why Normalization Matters
### Benefit 1: Stable Training
**Without Normalization:**
```
Layer 1: activations = [0.1, 0.2, ...]
Layer 2: activations = [50.0, 100.0, ...] ← Exploding!
Layer 3: activations = [0.001, 0.002, ...] ← Vanishing!
```
**With Normalization:**
```
Layer 1: activations = [0.1, -0.2, ...] (normalized)
Layer 2: activations = [0.3, -0.1, ...] (normalized)
Layer 3: activations = [0.2, 0.4, ...] (normalized)
→ Consistent scale throughout!
```
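This stabilizing effect is easy to demonstrate: stacking unscaled random linear layers makes activation magnitudes explode, while re-normalizing after every layer keeps them at unit scale (a NumPy sketch with deliberately bad hypothetical weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

raw, normed = x.copy(), x.copy()
for _ in range(10):
    W = rng.normal(size=(d, d))        # unscaled weights: magnitude grows ~√d per layer
    raw = W @ raw                      # no normalization: activations explode
    normed = layer_norm(W @ normed)    # normalization resets the scale to ~1 each layer

print(f"without norm: std ≈ {raw.std():.3g}, with norm: std ≈ {normed.std():.3g}")
```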
### Benefit 2: Better Gradient Flow
**Normalization helps gradients flow better:**
**Without Normalization:**
```
Gradient 1: 0.0001 (too small, vanishing)
Gradient 2: 1000.0 (too large, exploding)
Gradient 3: 0.001 (too small)
```
**With Normalization:**
```
Gradient 1: 0.01 (reasonable)
Gradient 2: 0.02 (reasonable)
Gradient 3: 0.015 (reasonable)
→ Stable gradients!
```
### Benefit 3: Faster Convergence
**Normalized activations allow:**
- Higher learning rates
- Faster weight updates
- Quicker convergence to good solutions
**Analogy:**
- **Without normalization:** Walking on rough terrain (slow progress)
- **With normalization:** Walking on smooth path (fast progress)
### Benefit 4: Regularization Effect
**Normalization acts as a form of regularization:**
- Reduces internal covariate shift
- Makes optimization easier
- Helps prevent overfitting
---
## 4.6 Pre-Norm vs Post-Norm Architecture
### Post-Norm (Original Transformer)
**Order:**
```
Input → Attention → LayerNorm → Output
```
**Equation:**
```
x_out = LayerNorm(x + Attention(x))
```
**Problems:**
- Can be unstable with many layers
- Gradient flow can be difficult
- Harder to train deep networks
### Pre-Norm (Modern Approach)
**Order:**
```
Input → LayerNorm → Attention → Output
```
**Equation:**
```
x_out = x + Attention(LayerNorm(x))
```
**Benefits:**
- More stable training
- Better gradient flow
- Easier to train deep networks
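The two orderings can be sketched side by side (NumPy, with a fixed linear map standing in for the attention sublayer; the names and weights here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))   # toy sublayer weights

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sublayer(x):
    # stand-in for Attention(x) or the feed-forward network
    return W @ x

def post_norm_block(x):
    # original Transformer: normalize AFTER the residual addition
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # modern ordering: normalize only the sublayer input;
    # the residual path carries x through untouched
    return x + sublayer(layer_norm(x))

x = rng.normal(size=d)
y_post = post_norm_block(x)
y_pre = pre_norm_block(x)
```

The key structural difference: in the pre-norm block the identity path from input to output never passes through a LayerNorm, which is what keeps gradient flow clean in deep stacks.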
**Visual Comparison:**
**Post-Norm:**
```
Input
┌──────────────┐
│ Attention │
└──────┬───────┘
┌──────────────┐
│ LayerNorm │ ← Normalization after
└──────┬───────┘
Output
```
**Pre-Norm:**
```
Input
┌──────────────┐
│ LayerNorm │ ← Normalization before
└──────┬───────┘
┌──────────────┐
│ Attention │
└──────┬───────┘
Output
```
**Our Model Uses Pre-Norm!**
---
## 4.7 Visual Representation
### Normalization Process
```
Input Vector
│ [1.0, 2.0, 3.0, 4.0]
┌─────────────────────────────┐
│ Step 1: Compute Mean │
│ μ = 2.5 │
└──────────┬──────────────────┘
┌─────────────────────────────┐
│ Step 2: Compute Variance │
│ σ² = 1.25, σ ≈ 1.118 │
└──────────┬──────────────────┘
┌────────────────────────────────┐
│ Step 3: Normalize │
│ x̂ = (x - μ) / σ               │
│ [-1.341, -0.447, 0.447, 1.341] │
└──────────┬─────────────────────┘
┌─────────────────────────────┐
│ Step 4: Scale and Shift │
│ Output = γ ⊙ x̂ + β │
└──────────┬──────────────────┘
Output Vector
```
### Distribution Transformation
**Before Normalization:**
```
Distribution:
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
0.0├─────────────────────────
0 1 2 3 4 5
Mean: 2.5, Std: 1.118
```
**After Normalization:**
```
Distribution:
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
0.0├─────────────────────────
-2 -1 0 1 2 3
Mean: 0, Std: 1
```
**Standardized!**
### Gradient Flow Visualization
**Without Normalization:**
```
Gradient Magnitude:
1000│ ●
100│
10│
1│ ●
0.1│ ●
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Unstable, varying magnitudes)
```
**With Normalization:**
```
Gradient Magnitude:
1000│
100│
10│
│ ● ● ● ● ●
1│
0.1│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Stable, consistent magnitudes)
```
---
## 4.8 Key Takeaways
- **Normalization standardizes activations to mean 0, variance 1**
- **Stabilizes training by preventing exploding/vanishing gradients**
- **Enables faster convergence and higher learning rates**
- **Pre-norm architecture is preferred for deep networks**
- **Learnable parameters (γ, β) allow optimal scaling**
---
## Complete Mathematical Formula
### Layer Normalization Formula
For input $\mathbf{x} \in \mathbb{R}^d$:
```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```
```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```
```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```
```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
```
**Where:**
- $\epsilon$ = small constant (default: 1e-5) to prevent division by zero
- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication
- $d$ = number of dimensions
### In Transformer Block
**Pre-Norm Architecture:**
```math
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
```
```math
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
```
```math
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
```
**Normalization happens before attention and feed-forward!**
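The feed-forward sublayer that follows attention in each block uses the same pre-norm pattern (sketched here in the same notation):

```math
\mathbf{x}_{final} = \mathbf{x}_{out} + \text{FFN}(\text{LayerNorm}(\mathbf{x}_{out}))
```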
---
*This document provides a step-by-step explanation of normalization, the critical component that stabilizes training and enables efficient learning in transformer models.*