# What is Scheduling? Step-by-Step Explanation
A complete, step-by-step explanation of learning rate scheduling: how adjusting the learning rate during training improves convergence and final model performance.
## Table of Contents
1. [What is Scheduling?](#81-what-is-scheduling)
2. [Why Do We Need Scheduling?](#82-why-do-we-need-scheduling)
3. [Fixed Learning Rate](#83-fixed-learning-rate)
4. [Cosine Annealing](#84-cosine-annealing)
5. [Other Scheduling Strategies](#85-other-scheduling-strategies)
6. [Why Scheduling Matters](#86-why-scheduling-matters)
7. [Complete Mathematical Formulation](#87-complete-mathematical-formulation)
8. [Exercise: Schedule Calculation](#88-exercise-schedule-calculation)
9. [Key Takeaways](#89-key-takeaways)
---
## 8.1 What is Scheduling?
### Simple Definition
**Scheduling** (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.
### Visual Analogy
**Think of scheduling like adjusting speed while driving:**
```
Fixed Learning Rate:
┌──────────────────────────┐
│ Speed: 60 mph (constant) │
└──────────────────────────┘
→ Hard to stop precisely!
Scheduled Learning Rate:
┌──────────────────────────┐
│ Speed: 60 → 40 → 20 → 10 │
└──────────────────────────┘
→ Smooth deceleration!
```
**Scheduling adjusts speed (learning rate) as you approach the destination (convergence)!**
### What Scheduling Does
**Scheduling:**
1. **Starts** with a high learning rate (fast initial learning)
2. **Gradually reduces** the learning rate (precise fine-tuning)
3. **Converges** to a good final solution
**Result:** Better convergence and performance!
---
## 8.2 Why Do We Need Scheduling?
### The Problem with Fixed Learning Rate
**High Learning Rate:**
```
Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots minimum ✗
→ Bounces around ✗
→ Poor convergence ✗
```
**Low Learning Rate:**
```
Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗
```
**Can't have both!**
### The Solution: Scheduling
**Adaptive Learning Rate:**
```
Start: 0.001 (fast learning)
Middle: 0.0005 (moderate)
End: 0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!
```
### Benefits of Scheduling
**1. Faster Convergence**
- High initial rate = Fast progress
- Lower later rate = Precise convergence
**2. Better Final Performance**
- Fine-tuning at end = Better solution
- Avoids overshooting = More stable
**3. More Stable Training**
- Gradual reduction = Smooth optimization
- Less oscillation = More reliable
---
## 8.3 Fixed Learning Rate
### What is Fixed Learning Rate?
**Learning rate stays constant throughout training:**
```math
\eta_t = \eta_0 \quad \text{for all } t
```
**Where:**
- $\eta_0$ = initial learning rate
- $t$ = training step
### Example
**Fixed Rate:**
```
Step 0: η = 0.001
Step 100: η = 0.001
Step 1000: η = 0.001
Step 10000: η = 0.001
```
**Constant throughout!**
### Visualization
```
Learning Rate
0.001│─────────────────────────────────────
└───────────────────────────────────── Steps
```
### Problems
**1. Too High:**
- Overshoots minimum
- Oscillates around solution
- Never converges precisely
**2. Too Low:**
- Very slow training
- Takes forever to converge
- May get stuck
**Solution:** Use scheduling!
---
## 8.4 Cosine Annealing
### What is Cosine Annealing?
**Cosine Annealing** reduces the learning rate following a cosine curve from maximum to minimum.
### Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Where:**
- $\eta_t$ = learning rate at step $t$
- $\eta_{min}$ = minimum learning rate (default: 0)
- $\eta_{max}$ = initial/maximum learning rate
- $T_{max}$ = total number of steps
- $t$ = current step
### How It Works
**Step 1: Calculate Cosine Value**
```math
\cos\left(\frac{\pi t}{T_{max}}\right)
```
**Step 2: Shift to [0, 1] Range**
```math
\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Step 3: Scale to Learning Rate Range**
```math
\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}
```
### Example Calculation
**Given:**
- $\eta_{max} = 0.001$
- $\eta_{min} = 0$
- $T_{max} = 10000$
**At step $t = 0$:**
```math
\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001
```
**At step $t = 2500$:**
```math
\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854
```
**At step $t = 5000$:**
```math
\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005
```
**At step $t = 7500$:**
```math
\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146
```
**At step $t = 10000$:**
```math
\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
```
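The hand calculations above can be reproduced in a few lines of code. Below is a minimal Python sketch of the formula (the function name `cosine_annealing_lr` and its parameters are illustrative, not taken from the repository); in practice a library scheduler such as PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` applies the same rule for you.
```python
import math

def cosine_annealing_lr(t, eta_max, eta_min=0.0, t_max=10000):
    """Cosine annealing: eta_min + (eta_max - eta_min) * (1 + cos(pi * t / t_max)) / 2."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

# Reproduce the worked example above (eta_max = 0.001, eta_min = 0, T_max = 10000).
for step in (0, 2500, 5000, 7500, 10000):
    print(f"step {step:>5}: lr = {cosine_annealing_lr(step, eta_max=0.001):.6f}")
# step     0: lr = 0.001000
# step  2500: lr = 0.000854
# step  5000: lr = 0.000500
# step  7500: lr = 0.000146
# step 10000: lr = 0.000000
```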
### Visualization
```
Learning Rate
0.001 │●───────────────\
│ \
│ \
0.0005│ \
│ \
│ \
│ \
│ \
│ \
│ \
0│ ●─────
└───────────────────────────────────── Steps
0 2500 5000 7500 10000
```
**Smooth cosine curve!**
### Why Cosine Annealing?
**Benefits:**
1. **Smooth decay:** No abrupt changes
2. **Gradual reduction:** Better fine-tuning
3. **Works well:** Commonly used in practice
4. **High initial rate:** Fast learning
5. **Low final rate:** Precise convergence
---
## 8.5 Other Scheduling Strategies
### 1. Step Decay
**Reduce learning rate at fixed intervals:**
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
**Where:**
- $\gamma$ = decay factor (e.g., 0.1)
- $s$ = step size (e.g., every 1000 steps)
**Example:**
```
Step 0-999: η = 0.001
Step 1000-1999: η = 0.0001 (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)
```
**Visualization:**
```
Learning Rate
0.001 │───────┐
│ │
│ └───────┐
0.0001│ │
│ └───────┐
│ │
└───────────────────────── Steps
```
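A minimal Python sketch of step decay, with illustrative names (PyTorch's `torch.optim.lr_scheduler.StepLR` implements the same idea):
```python
def step_decay_lr(t, eta0=0.001, gamma=0.1, step_size=1000):
    """Step decay: eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)

# Matches the plateaus in the example above:
print(step_decay_lr(0), step_decay_lr(1500), step_decay_lr(2500))  # ≈ 0.001, 0.0001, 1e-05
```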
### 2. Exponential Decay
**Continuous exponential reduction:**
```math
\eta_t = \eta_0 \times \gamma^t
```
**Where:**
- $\gamma$ = decay rate (e.g., 0.9995)
**Visualization:**
```
Learning Rate
0.001│●──────────────\
│ \
│ \
│ \
│ \
│ \
│ \
│ \
└──────────────────────── Steps
```
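And a corresponding sketch of exponential decay (again with illustrative names; PyTorch's `ExponentialLR` applies the same multiplicative decay once per scheduler step):
```python
def exponential_decay_lr(t, eta0=0.001, gamma=0.9995):
    """Exponential decay: eta0 * gamma ** t (each step multiplies the rate by gamma)."""
    return eta0 * gamma ** t

# With gamma = 0.9995 the rate roughly halves every ~1386 steps (ln 2 / -ln 0.9995).
print(exponential_decay_lr(0), exponential_decay_lr(1386))  # ≈ 0.001, ≈ 0.0005
```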
### 3. Warmup Scheduling
**Start with a low rate, ramp it up, then decay:**
**Warmup Phase:**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**After Warmup:**
```math
\eta_t = \text{Cosine Annealing or other schedule}
```
**Visualization:**
```
Learning Rate
0.001│      ╱───────\
     │     ╱         \
     │    ╱           \
     │   ╱             \
     │  ╱               \
     │ ╱                 \
     │╱                   \
     └───────────────────── Steps
```
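A sketch of the warmup phase by itself (the name `linear_warmup_lr` is illustrative); a combined warmup-plus-cosine version is sketched in Section 8.7:
```python
def linear_warmup_lr(t, eta_max=0.001, t_warmup=500):
    """Linear warmup: ramp the rate from 0 up to eta_max over the first t_warmup steps."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    return eta_max  # after warmup, hand off to a decay schedule (e.g. cosine annealing)
```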
### 4. One Cycle Learning Rate
**One cycle: increase then decrease:**
```math
\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}
```
**Where:**
- $T_1$ = number of steps in the increasing phase
- $T_2$ = number of steps in the decreasing phase
**Visualization:**
```
Learning Rate
0.001│      ╱─────\
     │     ╱       \
     │    ╱         \
     │   ╱           \
     │  ╱             \
     │ ╱               \
     │╱                 \
     └─────────────────── Steps
```
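A sketch of the piecewise one-cycle formula above (illustrative names; PyTorch's `OneCycleLR` offers a more elaborate built-in variant):
```python
def one_cycle_lr(t, eta_min=0.0001, eta_max=0.001, t_up=1000, t_down=4000):
    """One cycle: linear increase over t_up steps (T1), then linear decrease over t_down steps (T2)."""
    if t <= t_up:
        return eta_min + (eta_max - eta_min) * t / t_up
    # Clamp so the rate never falls below eta_min once the cycle is finished.
    return max(eta_min, eta_max - (eta_max - eta_min) * (t - t_up) / t_down)
```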
---
## 8.6 Why Scheduling Matters
### Benefit 1: Better Convergence
**Without Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(slow convergence at end)
```
**With Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(faster convergence, better final loss)
```
### Benefit 2: More Stable Training
**Fixed High Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
(oscillating, unstable)
```
**Scheduled Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(smooth, stable)
```
### Benefit 3: Better Final Performance
**Comparison:**
```
Fixed LR: Final Loss = 2.15
Scheduled LR: Final Loss = 2.05
→ ≈5% improvement!
```
---
## 8.7 Complete Mathematical Formulation
### General Scheduling Formula
```math
\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)
```
**Where $f$ is the scheduling function**
### Cosine Annealing (Complete)
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Boundary Conditions:**
- At $t = 0$: $\eta_0 = \eta_{max}$
- At $t = T_{max}$: $\eta_{T_{max}} = \eta_{min}$
### Step Decay
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
### Exponential Decay
```math
\eta_t = \eta_0 \times \gamma^t
```
### Warmup + Cosine Annealing
**Warmup Phase ($t \leq T_{warmup}$):**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**Annealing Phase ($t > T_{warmup}$):**
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
```
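Putting the two phases together, a minimal sketch of warmup followed by cosine annealing (illustrative function and parameter names):
```python
import math

def warmup_cosine_lr(t, eta_max=0.001, eta_min=0.0, t_warmup=500, t_max=10000):
    """Linear warmup to eta_max, then cosine annealing down to eta_min by step t_max."""
    if t < t_warmup:
        return eta_max * t / t_warmup                  # warmup phase
    progress = (t - t_warmup) / (t_max - t_warmup)     # runs from 0 to 1 over the annealing phase
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * progress)) / 2
```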
---
## 8.8 Exercise: Schedule Calculation
### Problem
**Given Cosine Annealing schedule:**
- $\eta_{max} = 0.002$
- $\eta_{min} = 0.0001$
- $T_{max} = 5000$ steps
**Calculate the learning rate at:**
1. Step $t = 0$
2. Step $t = 1250$
3. Step $t = 2500$
4. Step $t = 3750$
5. Step $t = 5000$
### Step-by-Step Solution
#### General Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Substitute values:**
```math
\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
```math
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
#### Step 1: t = 0
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002
```
**Answer:** $\eta_0 = 0.002$
#### Step 2: t = 1250
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.707}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times 0.8535 = 0.0001 + 0.001621 = 0.001721
```
**Answer:** $\eta_{1250} \approx 0.001721$
#### Step 3: t = 2500
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105
```
**Answer:** $\eta_{2500} = 0.00105$
#### Step 4: t = 3750
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378
```
**Answer:** $\eta_{3750} \approx 0.000378$
#### Step 5: t = 5000
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001
```
**Answer:** $\eta_{5000} = 0.0001$
### Summary Table
| Step | Cosine Value | Scale Factor | Learning Rate |
|------|--------------|--------------|---------------|
| 0 | 1.0 | 1.0 | 0.002 |
| 1250 | 0.707 | 0.854 | 0.001721 |
| 2500 | 0.0 | 0.5 | 0.00105 |
| 3750 | -0.707 | 0.146 | 0.000378 |
| 5000 | -1.0 | 0.0 | 0.0001 |
**Smooth decay from 0.002 to 0.0001!**
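The table can be double-checked programmatically; here is a small verification sketch (illustrative code, not from the repository). The last digit at step 1250 (0.001722 vs. 0.001721 above) differs only because the hand calculation rounds $\cos(\pi/4)$ to 0.707.
```python
import math

def cosine_annealing_lr(t, eta_max=0.002, eta_min=0.0001, t_max=5000):
    """Cosine annealing with the exercise's eta_max, eta_min and T_max."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for step in (0, 1250, 2500, 3750, 5000):
    print(f"step {step:>4}: lr = {cosine_annealing_lr(step):.6f}")
# step    0: lr = 0.002000
# step 1250: lr = 0.001722
# step 2500: lr = 0.001050
# step 3750: lr = 0.000378
# step 5000: lr = 0.000100
```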
---
## 8.9 Key Takeaways
### Scheduling
- **Scheduling adjusts the learning rate during training**
- **Starts high (fast learning), ends low (fine-tuning)**
- **Improves convergence and final performance**
### Cosine Annealing
- **Smooth, cosine-shaped decay**
- **Gradual reduction from the maximum rate to the minimum**
- **Works well in practice for transformers**
### Why It Matters
- **Faster convergence**
- **More stable training**
- **Better final performance**
- **Essential for optimal training**
---
*This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies with mathematical formulations and solved exercises.*