What is Normalization? Step-by-Step Explanation
Complete step-by-step explanation of normalization in transformer models: how normalization stabilizes training and improves model performance.
Table of Contents
- The Problem Normalization Solves
- What is Normalization?
- How Layer Normalization Works: Step-by-Step
- Complete Example: Normalizing a Vector
- Why Normalization Matters
- Pre-Norm vs Post-Norm Architecture
- Visual Representation
- Key Takeaways
4.1 The Problem Normalization Solves
The Challenge
During training, activations can become unstable:
Problem 1: Varying Activations
Layer 1 output: [0.1, 0.2, 0.3, ...] (small values)
Layer 2 output: [10.5, 20.3, 15.8, ...] (large values)
Layer 3 output: [0.01, 0.02, 0.03, ...] (very small values)
Problem 2: Internal Covariate Shift
- Activations change distribution as weights update
- Later layers struggle to adapt to changing inputs
- Training becomes slower and less stable
Problem 3: Gradient Problems
Large activations → Large gradients → Exploding gradients
Small activations → Small gradients → Vanishing gradients
The Solution: Normalization
Normalization standardizes activations to have consistent statistics (mean zero, variance one), making training stable and efficient.
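The effect can be seen in a toy sketch (hypothetical, pure Python): repeatedly scaling a vector, standing in for stacked layers, makes its magnitude explode, while re-normalizing after each "layer" keeps it in a stable range.

```python
def norm(x, eps=1e-5):
    """Standardize a vector to mean 0, variance 1 (plus a small eps)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / (var + eps) ** 0.5 for xi in x]

x_raw = [0.1, 0.2, 0.3, 0.4]
x_normed = list(x_raw)
for _ in range(10):                        # ten "layers", each multiplies by 3
    x_raw = [3.0 * v for v in x_raw]
    x_normed = norm([3.0 * v for v in x_normed])

print(max(abs(v) for v in x_raw))      # exploded: 0.4 * 3**10 ≈ 23619.6
print(max(abs(v) for v in x_normed))   # stable: ≈ 1.34
```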
4.2 What is Normalization?
Simple Definition
Normalization is a technique that transforms activations to have:
- Mean of zero (centered)
- Variance of one (standardized scale)
Think of it like statistical standardization (z-scoring):
- Converts any distribution to a standard form
- Makes values comparable across different scales
- Helps the model learn faster and more reliably
Visual Analogy
Imagine weights on a scale:
Before Normalization:
Bronze weight: 1 kg
Silver weight: 100 kg
Gold weight: 0.001 kg
→ Hard to compare!
After Normalization:
All weights standardized to mean 0, variance 1
→ Easy to compare and work with!
Types of Normalization
In transformers, we use Layer Normalization:
- Layer Normalization: Normalizes across features (dimensions) for each sample
- Batch Normalization: Normalizes across samples in a batch (not used in transformers)
- Instance Normalization: Normalizes each channel of each sample independently (common in vision models)
Why Layer Normalization?
- Works well with variable sequence lengths
- Doesn't depend on batch size
- Suitable for autoregressive models
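The difference in axes can be illustrated with a small pure-Python sketch (hypothetical example data): LayerNorm takes statistics per sample, across its features, while BatchNorm takes them per feature, across the batch.

```python
batch = [
    [1.0, 2.0, 3.0, 4.0],      # sample 1
    [10.0, 20.0, 30.0, 40.0],  # sample 2
]

def mean(values):
    return sum(values) / len(values)

# LayerNorm: one mean per sample, computed across its features (rows)
layer_means = [mean(row) for row in batch]        # [2.5, 25.0]

# BatchNorm: one mean per feature, computed across the batch (columns)
batch_means = [mean(col) for col in zip(*batch)]  # [5.5, 11.0, 16.5, 22.0]

print(layer_means, batch_means)
```

Because LayerNorm's statistics never mix information across samples, it behaves identically for any batch size, which is why it suits autoregressive transformers.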
4.3 How Layer Normalization Works: Step-by-Step
High-Level Overview
Step 1: Compute mean of activations
Step 2: Compute variance of activations
Step 3: Normalize (subtract mean, divide by std)
Step 4: Scale and shift (learnable parameters)
Detailed Step-by-Step
Step 1: Compute Mean
Calculate the average value across all dimensions:
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
Example:
Input vector:
x = [1.0, 2.0, 3.0, 4.0]
d = 4 (number of dimensions)
Compute mean:
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4
= 10.0 / 4
= 2.5
Meaning: The center of the distribution is at 2.5
Step 2: Compute Variance
Measure how spread out the values are:
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
Example:
Using the same input:
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
Compute variance:
σ² = [(1.0 - 2.5)² + (2.0 - 2.5)² + (3.0 - 2.5)² + (4.0 - 2.5)²] / 4
= [(-1.5)² + (-0.5)² + (0.5)² + (1.5)²] / 4
= [2.25 + 0.25 + 0.25 + 2.25] / 4
= 5.0 / 4
= 1.25
Compute standard deviation:
σ = √σ² = √1.25 ≈ 1.118
Meaning: Values are spread out with standard deviation of 1.118
Step 3: Normalize
Subtract mean and divide by standard deviation:
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
Where:
- \epsilon = a small constant (default: 1e-5) to prevent division by zero
Example:
Using the same input:
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
σ ≈ 1.118
ε = 0.00001
Normalize each element:
x̂₁ = (1.0 - 2.5) / √(1.25 + 0.00001) ≈ -1.341
x̂₂ = (2.0 - 2.5) / √(1.25 + 0.00001) ≈ -0.447
x̂₃ = (3.0 - 2.5) / √(1.25 + 0.00001) ≈ 0.447
x̂₄ = (4.0 - 2.5) / √(1.25 + 0.00001) ≈ 1.341
Result:
x̂ = [-1.341, -0.447, 0.447, 1.341]
Check:
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
Meaning: Values are now standardized!
Step 4: Scale and Shift
Apply learnable parameters:
\text{LayerNorm}(x) = \gamma \odot \hat{x} + \beta
Where:
- \gamma = learnable scale parameter (initialized to 1)
- \beta = learnable shift parameter (initialized to 0)
- \odot = element-wise multiplication
Example:
Normalized vector:
x̂ = [-1.341, -0.447, 0.447, 1.341]
Learnable parameters (initialized):
γ = [1.0, 1.0, 1.0, 1.0] (scale)
β = [0.0, 0.0, 0.0, 0.0] (shift)
Apply scale and shift:
Output = γ ⊙ x̂ + β
= [1.0, 1.0, 1.0, 1.0] ⊙ [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341]
Initially, the scale-and-shift step is the identity, so the output is just the normalized vector.
During training, γ and β learn optimal scale and shift.
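The four steps above can be collected into a minimal pure-Python LayerNorm (a teaching sketch, not the repo's actual implementation; `gamma` and `beta` are plain lists here, whereas a real model stores them as learnable tensors):

```python
def layer_norm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d                                        # Step 1: mean
    var = sum((xi - mu) ** 2 for xi in x) / d              # Step 2: variance
    x_hat = [(xi - mu) / (var + eps) ** 0.5 for xi in x]   # Step 3: normalize
    return [g * xh + b                                     # Step 4: scale & shift
            for g, xh, b in zip(gamma, x_hat, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 2) for v in out])  # ≈ [-1.34, -0.45, 0.45, 1.34]
```

With γ initialized to ones and β to zeros, the output matches the hand-computed normalized vector from Step 3, as expected.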
4.4 Complete Example: Normalizing a Vector
Input
Word embedding after attention: [0.146, 0.108, 0.192, 0.155, ..., 0.11]
Dimension: 512
Step-by-Step Processing
Step 1: Compute Mean
Input:
x = [0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
Compute mean:
μ = (0.146 + 0.108 + 0.192 + ... + 0.11) / 512
≈ 0.135
Visualization:
Values: [0.146, 0.108, 0.192, ..., 0.11]
└────────────────────────────────┘
Mean: 0.135 (center point)
Step 2: Compute Variance
Compute variance:
σ² = [(0.146 - 0.135)² + (0.108 - 0.135)² + (0.192 - 0.135)² + ... + (0.11 - 0.135)²] / 512
≈ 0.0023
Compute standard deviation:
σ = √0.0023 ≈ 0.048
Visualization:
Values: [0.146, 0.108, 0.192, ..., 0.11]
Spread: └───────── σ ≈ 0.048 ──────────┘
Step 3: Normalize
Normalize each element:
x̂₁ = (0.146 - 0.135) / √(0.0023 + 0.00001) ≈ 0.229
x̂₂ = (0.108 - 0.135) / √(0.0023 + 0.00001) ≈ -0.562
x̂₃ = (0.192 - 0.135) / √(0.0023 + 0.00001) ≈ 1.188
...
x̂₅₁₂ = (0.11 - 0.135) / √(0.0023 + 0.00001) ≈ -0.521
Result:
x̂ = [0.229, -0.562, 1.188, ..., -0.521]
Properties:
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
Step 4: Scale and Shift
Apply learnable parameters:
γ = [1.0, 1.0, ..., 1.0] (512 values, may change during training)
β = [0.0, 0.0, ..., 0.0] (512 values, may change during training)
Output:
Output = γ ⊙ x̂ + β
= [0.229, -0.562, 1.188, ..., -0.521]
After training, γ and β adapt to optimal values!
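The mean-0, variance-1 property holds for any input vector, whatever its original scale. A quick sanity check in pure Python (using a seeded pseudo-random 512-dimensional vector as a stand-in for the embedding above, since only the first few of its entries are given):

```python
import random

random.seed(0)
x = [random.uniform(0.0, 0.3) for _ in range(512)]  # roughly the example's value range

mu = sum(x) / len(x)
var = sum((xi - mu) ** 2 for xi in x) / len(x)
x_hat = [(xi - mu) / (var + 1e-5) ** 0.5 for xi in x]

new_mu = sum(x_hat) / len(x_hat)
new_var = sum(v ** 2 for v in x_hat) / len(x_hat)
print(new_mu, new_var)  # mean ≈ 0; variance just under 1 (epsilon slightly shrinks it)
```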
4.5 Why Normalization Matters
Benefit 1: Stable Training
Without Normalization:
Layer 1: activations = [0.1, 0.2, ...]
Layer 2: activations = [50.0, 100.0, ...] ← Exploding!
Layer 3: activations = [0.001, 0.002, ...] ← Vanishing!
With Normalization:
Layer 1: activations = [0.1, -0.2, ...] (normalized)
Layer 2: activations = [0.3, -0.1, ...] (normalized)
Layer 3: activations = [0.2, 0.4, ...] (normalized)
→ Consistent scale throughout!
Benefit 2: Better Gradient Flow
Normalization helps gradients flow better:
Without Normalization:
Gradient 1: 0.0001 (too small, vanishing)
Gradient 2: 1000.0 (too large, exploding)
Gradient 3: 0.001 (too small)
With Normalization:
Gradient 1: 0.01 (reasonable)
Gradient 2: 0.02 (reasonable)
Gradient 3: 0.015 (reasonable)
→ Stable gradients!
Benefit 3: Faster Convergence
Normalized activations allow:
- Higher learning rates
- Faster weight updates
- Quicker convergence to good solutions
Analogy:
- Without normalization: Walking on rough terrain (slow progress)
- With normalization: Walking on smooth path (fast progress)
Benefit 4: Regularization Effect
Normalization can also act as a mild regularizer:
- Reduces internal covariate shift
- Smooths the optimization landscape
- Can help reduce overfitting in some settings
4.6 Pre-Norm vs Post-Norm Architecture
Post-Norm (Original Transformer)
Order:
Input → Attention → Add (residual) → LayerNorm → Output
Equation:
x_out = LayerNorm(x + Attention(x))
Problems:
- Can be unstable with many layers
- Gradient flow can be difficult
- Harder to train deep networks
Pre-Norm (Modern Approach)
Order:
Input → LayerNorm → Attention → Add (residual) → Output
Equation:
x_out = x + Attention(LayerNorm(x))
Benefits:
- More stable training
- Better gradient flow
- Easier to train deep networks
Visual Comparison:
Post-Norm:
Input
↓
┌──────────────┐
│ Attention │
└──────┬───────┘
↓
┌──────────────┐
│ LayerNorm │ ← Normalization after
└──────┬───────┘
↓
Output
Pre-Norm:
Input
↓
┌──────────────┐
│ LayerNorm │ ← Normalization before
└──────┬───────┘
↓
┌──────────────┐
│ Attention │
└──────┬───────┘
↓
Output
Our Model Uses Pre-Norm!
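The two orderings can be sketched side by side in framework-free Python (`sublayer` is a hypothetical stand-in for attention or the feed-forward network; `norm` is LayerNorm with γ = 1, β = 0):

```python
def norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / (var + eps) ** 0.5 for xi in x]

def post_norm_block(x, sublayer):
    # Original Transformer: normalize AFTER the residual addition
    return norm([xi + si for xi, si in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Modern variant: normalize BEFORE the sublayer; the residual path stays untouched
    return [xi + si for xi, si in zip(x, sublayer(norm(x)))]

def double(v):
    return [2.0 * t for t in v]  # toy stand-in for attention

x = [1.0, 2.0, 3.0, 4.0]
print(post_norm_block(x, double))
print(pre_norm_block(x, double))
```

Note the key structural difference: in pre-norm, the input `x` flows to the output through an identity path with no normalization in the way, which is what gives deep pre-norm stacks their better gradient flow.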
4.7 Visual Representation
Normalization Process
Input Vector
│
│ [1.0, 2.0, 3.0, 4.0]
↓
┌─────────────────────────────┐
│ Step 1: Compute Mean │
│ μ = 2.5 │
└──────────┬──────────────────┘
│
↓
┌─────────────────────────────┐
│ Step 2: Compute Variance │
│ σ² = 1.25, σ ≈ 1.118 │
└──────────┬──────────────────┘
│
↓
┌────────────────────────────────┐
│ Step 3: Normalize │
│ x̂ = (x - μ) / σ │
│ [-1.341, -0.447, 0.447, 1.341] │
└──────────┬─────────────────────┘
│
↓
┌─────────────────────────────┐
│ Step 4: Scale and Shift │
│ Output = γ ⊙ x̂ + β │
└──────────┬──────────────────┘
│
↓
Output Vector
Distribution Transformation
Before Normalization:
Distribution:
│
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
│
0.0├─────────────────────────
0 1 2 3 4 5
Mean: 2.5, Std: 1.118
After Normalization:
Distribution:
│
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
│
0.0├─────────────────────────
-2 -1 0 1 2 3
Mean: 0, Std: 1
Standardized!
Gradient Flow Visualization
Without Normalization:
Gradient Magnitude:
│
1000│ ●
│
100│
│
10│
│
1│ ●
│
0.1│ ●
│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Unstable, varying magnitudes)
With Normalization:
Gradient Magnitude:
│
1000│
│
100│
│
10│
│ ● ● ● ● ●
1│
│
0.1│
│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Stable, consistent magnitudes)
4.8 Key Takeaways: Normalization
✅ Normalization standardizes activations to mean 0, variance 1
✅ Stabilizes training by preventing exploding/vanishing gradients
✅ Enables faster convergence and higher learning rates
✅ Pre-norm architecture is preferred for deep networks
✅ Learnable parameters (γ, β) allow optimal scaling
Complete Mathematical Formula
Layer Normalization Formula
For input \mathbf{x} \in \mathbb{R}^d:
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
Where:
- \epsilon = small constant (default: 1e-5) to prevent division by zero
- \gamma = learnable scale parameter (initialized to 1)
- \beta = learnable shift parameter (initialized to 0)
- \odot = element-wise multiplication
- d = number of dimensions
In Transformer Block
Pre-Norm Architecture:
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
Normalization happens before attention and feed-forward!
This document provides a step-by-step explanation of normalization, the critical component that stabilizes training and enables efficient learning in transformer models.