Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00 (60 changed files, 25153 additions, 0 deletions).

# What is Normalization? Step-by-Step Explanation
Complete step-by-step explanation of normalization in transformer models: how normalization stabilizes training and improves model performance.
## Table of Contents
1. [The Problem Normalization Solves](#41-the-problem-normalization-solves)
2. [What is Normalization?](#42-what-is-normalization)
3. [How Layer Normalization Works: Step-by-Step](#43-how-layer-normalization-works-step-by-step)
4. [Complete Example: Normalizing a Vector](#44-complete-example-normalizing-a-vector)
5. [Why Normalization Matters](#45-why-normalization-matters)
6. [Pre-Norm vs Post-Norm Architecture](#46-pre-norm-vs-post-norm-architecture)
7. [Visual Representation](#47-visual-representation)
8. [Key Takeaways](#48-key-takeaways)
---
## 4.1 The Problem Normalization Solves
### The Challenge
**During training, activations can become unstable:**
**Problem 1: Varying Activations**
```
Layer 1 output: [0.1, 0.2, 0.3, ...] (small values)
Layer 2 output: [10.5, 20.3, 15.8, ...] (large values)
Layer 3 output: [0.01, 0.02, 0.03, ...] (very small values)
```
**Problem 2: Internal Covariate Shift**
- Activations change distribution as weights update
- Later layers struggle to adapt to changing inputs
- Training becomes slower and less stable
**Problem 3: Gradient Problems**
```
Large activations → Large gradients → Exploding gradients
Small activations → Small gradients → Vanishing gradients
```
### The Solution: Normalization
**Normalization standardizes activations to have consistent statistics (mean zero, variance one), making training stable and efficient.**
---
## 4.2 What is Normalization?
### Simple Definition
**Normalization** is a technique that transforms activations to have:
- **Mean of zero** (centered)
- **Variance of one** (standardized scale)
**Think of it like standardization:**
- Converts any distribution to a standard form
- Makes values comparable across different scales
- Helps the model learn faster and more reliably
### Visual Analogy
**Imagine weights on a scale:**
**Before Normalization:**
```
Bronze weight: 1 kg
Silver weight: 100 kg
Gold weight: 0.001 kg
→ Hard to compare!
```
**After Normalization:**
```
All weights standardized to mean 0, variance 1
→ Easy to compare and work with!
```
### Types of Normalization
**In transformers, we use Layer Normalization:**
- **Layer Normalization:** Normalizes across features (dimensions) for each sample
- **Batch Normalization:** Normalizes across samples in a batch (not used in transformers)
- **Instance Normalization:** Normalizes each sample independently
**Why Layer Normalization?**
- Works well with variable sequence lengths
- Doesn't depend on batch size
- Suitable for autoregressive models
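The difference between layer and batch normalization is just a choice of axes. A minimal numpy sketch (the shapes and seed here are illustrative assumptions, not values from the model code):

```python
import numpy as np

np.random.seed(0)
# Toy activations: batch of 2 sequences, 3 tokens each, d_model = 4.
x = np.random.randn(2, 3, 4)

# Layer norm: statistics over the feature dimension, separately for each token.
# Independent of batch size and sequence length.
mu_ln = x.mean(axis=-1, keepdims=True)        # shape (2, 3, 1)
var_ln = x.var(axis=-1, keepdims=True)
x_ln = (x - mu_ln) / np.sqrt(var_ln + 1e-5)

# Batch norm: statistics over batch and sequence, separately for each feature.
# Depends on the batch, which is awkward for autoregressive decoding.
mu_bn = x.mean(axis=(0, 1), keepdims=True)    # shape (1, 1, 4)
var_bn = x.var(axis=(0, 1), keepdims=True)
x_bn = (x - mu_bn) / np.sqrt(var_bn + 1e-5)

# Every token's feature vector now has mean ≈ 0 under layer norm.
print(np.allclose(x_ln.mean(axis=-1), 0.0, atol=1e-6))  # True
```

Because the layer-norm statistics come from a single token's features, the result is identical whether that token arrives alone at inference time or inside a large training batch.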
---
## 4.3 How Layer Normalization Works: Step-by-Step
### High-Level Overview
```
Step 1: Compute mean of activations
Step 2: Compute variance of activations
Step 3: Normalize (subtract mean, divide by std)
Step 4: Scale and shift (learnable parameters)
```
### Detailed Step-by-Step
#### Step 1: Compute Mean
**Calculate the average value across all dimensions:**
```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```
**Example:**
**Input vector:**
```
x = [1.0, 2.0, 3.0, 4.0]
d = 4 (number of dimensions)
```
**Compute mean:**
```
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4
= 10.0 / 4
= 2.5
```
**Meaning:** The center of the distribution is at 2.5
#### Step 2: Compute Variance
**Measure how spread out the values are:**
```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```
**Example:**
**Using the same input:**
```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
```
**Compute variance:**
```
σ² = [(1.0 - 2.5)² + (2.0 - 2.5)² + (3.0 - 2.5)² + (4.0 - 2.5)²] / 4
= [(-1.5)² + (-0.5)² + (0.5)² + (1.5)²] / 4
= [2.25 + 0.25 + 0.25 + 2.25] / 4
= 5.0 / 4
= 1.25
```
**Compute standard deviation:**
```
σ = √σ² = √1.25 ≈ 1.118
```
**Meaning:** The values spread around the mean with a standard deviation of ≈ 1.118
#### Step 3: Normalize
**Subtract mean and divide by standard deviation:**
```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```
**Where:**
- $\epsilon$ is a small constant (default: 1e-5) to prevent division by zero
**Example:**
**Using the same input:**
```
x = [1.0, 2.0, 3.0, 4.0]
μ = 2.5
σ ≈ 1.118
ε = 0.00001
```
**Normalize each element:**
```
x̂₁ = (1.0 - 2.5) / √(1.25 + 0.00001) ≈ -1.341
x̂₂ = (2.0 - 2.5) / √(1.25 + 0.00001) ≈ -0.447
x̂₃ = (3.0 - 2.5) / √(1.25 + 0.00001) ≈ 0.447
x̂₄ = (4.0 - 2.5) / √(1.25 + 0.00001) ≈ 1.341
```
**Result:**
```
x̂ = [-1.341, -0.447, 0.447, 1.341]
```
**Check:**
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
**Meaning:** Values are now standardized!
#### Step 4: Scale and Shift
**Apply learnable parameters:**
```math
\text{LayerNorm}(x) = \gamma \odot \hat{x} + \beta
```
**Where:**
- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication
**Example:**
**Normalized vector:**
```
x̂ = [-1.341, -0.447, 0.447, 1.341]
```
**Learnable parameters (initialized):**
```
γ = [1.0, 1.0, 1.0, 1.0] (scale)
β = [0.0, 0.0, 0.0, 0.0] (shift)
```
**Apply scale and shift:**
```
Output = γ ⊙ x̂ + β
= [1.0, 1.0, 1.0, 1.0] ⊙ [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341] + [0.0, 0.0, 0.0, 0.0]
= [-1.341, -0.447, 0.447, 1.341]
```
**At initialization, the scale-and-shift step is the identity, so the output is just the normalized vector.**
**During training, γ and β learn the optimal scale and shift.**
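The four steps above fit in a few lines. Here is a numpy sketch that reproduces the worked example (values match the text up to rounding):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()                          # Step 1: compute mean
    var = x.var()                          # Step 2: compute variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalize
    return gamma * x_hat + beta            # Step 4: scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.ones(4)   # learnable scale, initialized to 1
beta = np.zeros(4)   # learnable shift, initialized to 0

out = layer_norm(x, gamma, beta)
print(out)  # ≈ [-1.341, -0.447, 0.447, 1.341]
```

With γ = 1 and β = 0, the output is exactly the normalized vector from Step 3, as the text notes.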
---
## 4.4 Complete Example: Normalizing a Vector
### Input
```
Word embedding after attention: [0.146, 0.108, 0.192, 0.155, ..., 0.11]
Dimension: 512
```
### Step-by-Step Processing
#### Step 1: Compute Mean
**Input:**
```
x = [0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```
**Compute mean:**
```
μ = (0.146 + 0.108 + 0.192 + ... + 0.11) / 512
≈ 0.135
```
**Visualization:**
```
Values: [0.146, 0.108, 0.192, ..., 0.11]
└────────────────────────────────┘
Mean: 0.135 (center point)
```
#### Step 2: Compute Variance
**Compute variance:**
```
σ² = [(0.146 - 0.135)² + (0.108 - 0.135)² + (0.192 - 0.135)² + ... + (0.11 - 0.135)²] / 512
≈ 0.0023
```
**Compute standard deviation:**
```
σ = √0.0023 ≈ 0.048
```
**Visualization:**
```
Values: [0.146, 0.108, 0.192, ..., 0.11]
Spread: └───────── σ ≈ 0.048 ──────────┘
```
#### Step 3: Normalize
**Normalize each element:**
```
x̂₁ = (0.146 - 0.135) / √(0.0023 + 0.00001) ≈ 0.229
x̂₂ = (0.108 - 0.135) / √(0.0023 + 0.00001) ≈ -0.562
x̂₃ = (0.192 - 0.135) / √(0.0023 + 0.00001) ≈ 1.188
...
x̂₅₁₂ = (0.11 - 0.135) / √(0.0023 + 0.00001) ≈ -0.521
```
**Result:**
```
x̂ = [0.229, -0.562, 1.188, ..., -0.521]
```
**Properties:**
- Mean ≈ 0 ✓
- Standard deviation ≈ 1 ✓
#### Step 4: Scale and Shift
**Apply learnable parameters:**
```
γ = [1.0, 1.0, ..., 1.0] (512 values, may change during training)
β = [0.0, 0.0, ..., 0.0] (512 values, may change during training)
```
**Output:**
```
Output = γ ⊙ x̂ + β
= [0.229, -0.562, 1.188, ..., -0.521]
```
**After training, γ and β adapt to optimal values!**
---
## 4.5 Why Normalization Matters
### Benefit 1: Stable Training
**Without Normalization:**
```
Layer 1: activations = [0.1, 0.2, ...]
Layer 2: activations = [50.0, 100.0, ...] ← Exploding!
Layer 3: activations = [0.001, 0.002, ...] ← Vanishing!
```
**With Normalization:**
```
Layer 1: activations = [0.1, -0.2, ...] (normalized)
Layer 2: activations = [0.3, -0.1, ...] (normalized)
Layer 3: activations = [0.2, 0.4, ...] (normalized)
→ Consistent scale throughout!
```
### Benefit 2: Better Gradient Flow
**Normalization helps gradients flow better:**
**Without Normalization:**
```
Gradient 1: 0.0001 (too small, vanishing)
Gradient 2: 1000.0 (too large, exploding)
Gradient 3: 0.001 (too small)
```
**With Normalization:**
```
Gradient 1: 0.01 (reasonable)
Gradient 2: 0.02 (reasonable)
Gradient 3: 0.015 (reasonable)
→ Stable gradients!
```
### Benefit 3: Faster Convergence
**Normalized activations allow:**
- Higher learning rates
- Faster weight updates
- Quicker convergence to good solutions
**Analogy:**
- **Without normalization:** Walking on rough terrain (slow progress)
- **With normalization:** Walking on smooth path (fast progress)
### Benefit 4: Regularization Effect
**Normalization also has a mild regularization effect:**
- Reduces internal covariate shift
- Makes optimization easier
- Can help reduce overfitting
---
## 4.6 Pre-Norm vs Post-Norm Architecture
### Post-Norm (Original Transformer)
**Order:**
```
Input → Attention → Add (residual) → LayerNorm → Output
```
**Equation:**
```
x_out = LayerNorm(x + Attention(x))
```
**Problems:**
- Can be unstable with many layers
- Gradient flow can be difficult
- Harder to train deep networks
### Pre-Norm (Modern Approach)
**Order:**
```
Input → LayerNorm → Attention → Add (residual) → Output
```
**Equation:**
```
x_out = x + Attention(LayerNorm(x))
```
**Benefits:**
- More stable training
- Better gradient flow
- Easier to train deep networks
**Visual Comparison:**
**Post-Norm:**
```
Input
┌──────────────┐
│ Attention │
└──────┬───────┘
┌──────────────┐
│ LayerNorm │ ← Normalization after
└──────┬───────┘
Output
```
**Pre-Norm:**
```
Input
┌──────────────┐
│ LayerNorm │ ← Normalization before
└──────┬───────┘
┌──────────────┐
│ Attention │
└──────┬───────┘
Output
```
**Our Model Uses Pre-Norm!**
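In code, the two variants differ only in where the normalization sits relative to the residual connection. A minimal numpy sketch with a random projection standing in for the attention sub-layer (an illustration, not the model's actual attention code):

```python
import numpy as np

def norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # stand-in for an attention sub-layer

def sublayer(x):
    return W @ x

x = rng.standard_normal(8)

# Post-norm (original Transformer): residual add first, then normalize.
post = norm(x + sublayer(x))

# Pre-norm (modern approach): normalize the input; the residual path stays
# untouched, giving gradients a clean identity path through the whole stack.
pre = x + sublayer(norm(x))

print(post.round(3))
print(pre.round(3))
```

Note that in the pre-norm variant the output is not itself normalized; the un-normalized residual path is exactly what makes deep stacks easier to train.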
---
## 4.7 Visual Representation
### Normalization Process
```
Input Vector
│ [1.0, 2.0, 3.0, 4.0]
┌─────────────────────────────┐
│ Step 1: Compute Mean │
│ μ = 2.5 │
└──────────┬──────────────────┘
┌─────────────────────────────┐
│ Step 2: Compute Variance │
│ σ² = 1.25, σ ≈ 1.118 │
└──────────┬──────────────────┘
┌────────────────────────────────┐
│ Step 3: Normalize │
│ x̂ = (x - μ) / √(σ² + ε)       │
│ [-1.341, -0.447, 0.447, 1.341] │
└──────────┬─────────────────────┘
┌─────────────────────────────┐
│ Step 4: Scale and Shift │
│ Output = γ ⊙ x̂ + β │
└──────────┬──────────────────┘
Output Vector
```
### Distribution Transformation
**Before Normalization:**
```
Distribution:
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
0.0├─────────────────────────
0 1 2 3 4 5
Mean: 2.5, Std: 1.118
```
**After Normalization:**
```
Distribution:
0.4│ ●
│ ● ●
0.3│ ● ●
│ ● ●
0.2│ ● ●
│● ●
0.1│ ●
0.0├─────────────────────────
-2 -1 0 1 2 3
Mean: 0, Std: 1
```
**Standardized!**
### Gradient Flow Visualization
**Without Normalization:**
```
Gradient Magnitude:
1000│ ●
100│
10│
1│ ●
0.1│ ●
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Unstable, varying magnitudes)
```
**With Normalization:**
```
Gradient Magnitude:
1000│
100│
10│
│ ● ● ● ● ●
1│
0.1│
0.01│
└──────────────────────── Layer
1 2 3 4 5
(Stable, consistent magnitudes)
```
---
## 4.8 Key Takeaways
- **Normalization standardizes activations to mean 0, variance 1**
- **Stabilizes training by preventing exploding/vanishing gradients**
- **Enables faster convergence and higher learning rates**
- **Pre-norm architecture is preferred for deep networks**
- **Learnable parameters (γ, β) allow optimal scaling**
---
## Complete Mathematical Formula
### Layer Normalization Formula
For input $\mathbf{x} \in \mathbb{R}^d$:
```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```
```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```
```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```
```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
```
**Where:**
- $\epsilon$ = small constant (default: 1e-5) to prevent division by zero
- $\gamma$ = learnable scale parameter (initialized to 1)
- $\beta$ = learnable shift parameter (initialized to 0)
- $\odot$ = element-wise multiplication
- $d$ = number of dimensions
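Put together, the formula maps directly onto a small module. A from-scratch numpy sketch (an illustration of the math above, not the repository's implementation):

```python
import numpy as np

class LayerNorm:
    def __init__(self, d, eps=1e-5):
        self.gamma = np.ones(d)   # learnable scale, initialized to 1
        self.beta = np.zeros(d)   # learnable shift, initialized to 0
        self.eps = eps

    def __call__(self, x):
        mu = x.mean(axis=-1, keepdims=True)            # mean over features
        var = x.var(axis=-1, keepdims=True)            # variance over features
        x_hat = (x - mu) / np.sqrt(var + self.eps)     # normalize
        return self.gamma * x_hat + self.beta          # scale and shift

np.random.seed(0)
ln = LayerNorm(512)
x = np.random.randn(4, 512)   # a batch of 4 vectors
y = ln(x)
# At initialization (gamma = 1, beta = 0), each row of y has mean ≈ 0, std ≈ 1.
```

In a real model, `gamma` and `beta` would be registered as trainable parameters and updated by the optimizer along with the rest of the weights.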
### In Transformer Block
**Pre-Norm Architecture:**
```math
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
```
```math
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
```
```math
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
```
**The feed-forward sub-layer follows the same pattern: normalization happens before both attention and feed-forward.**
---
*This document provides a step-by-step explanation of normalization, the critical component that stabilizes training and enables efficient learning in transformer models.*