# What is Scheduling? Step-by-Step Explanation
A complete, step-by-step explanation of learning rate scheduling: how adjusting the learning rate during training improves convergence and final model performance.
## Table of Contents
1. [What is Scheduling?](#81-what-is-scheduling)
2. [Why Do We Need Scheduling?](#82-why-do-we-need-scheduling)
3. [Fixed Learning Rate](#83-fixed-learning-rate)
4. [Cosine Annealing](#84-cosine-annealing)
5. [Other Scheduling Strategies](#85-other-scheduling-strategies)
6. [Why Scheduling Matters](#86-why-scheduling-matters)
7. [Complete Mathematical Formulation](#87-complete-mathematical-formulation)
8. [Exercise: Schedule Calculation](#88-exercise-schedule-calculation)
9. [Key Takeaways](#89-key-takeaways)
---
## 8.1 What is Scheduling?
### Simple Definition
**Scheduling** (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.
### Visual Analogy
**Think of scheduling like adjusting speed while driving:**
```
Fixed Learning Rate:
┌──────────────────────────┐
│ Speed: 60 mph (constant) │
└──────────────────────────┘
→ Hard to stop precisely!
Scheduled Learning Rate:
┌──────────────────────────┐
│ Speed: 60 → 40 → 20 → 10 │
└──────────────────────────┘
→ Smooth deceleration!
```
**Scheduling adjusts speed (learning rate) as you approach the destination (convergence)!**
### What Scheduling Does
**Scheduling:**
1. **Starts** with a high learning rate (fast initial learning)
2. **Gradually reduces** the learning rate (precise fine-tuning)
3. **Converges** to a good final solution
**Result:** Better convergence and performance!
---
## 8.2 Why Do We Need Scheduling?
### The Problem with Fixed Learning Rate
**High Learning Rate:**
```
Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots minimum ✗
→ Bounces around ✗
→ Poor convergence ✗
```
**Low Learning Rate:**
```
Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗
```
**Can't have both!**
### The Solution: Scheduling
**Adaptive Learning Rate:**
```
Start: 0.001 (fast learning)
Middle: 0.0005 (moderate)
End: 0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!
```
### Benefits of Scheduling
**1. Faster Convergence**
- High initial rate = Fast progress
- Lower later rate = Precise convergence
**2. Better Final Performance**
- Fine-tuning at end = Better solution
- Avoids overshooting = More stable
**3. More Stable Training**
- Gradual reduction = Smooth optimization
- Less oscillation = More reliable
---
## 8.3 Fixed Learning Rate
### What is Fixed Learning Rate?
**Learning rate stays constant throughout training:**
```math
\eta_t = \eta_0 \quad \text{for all } t
```
**Where:**
- $\eta_0$ = initial learning rate
- $t$ = training step
### Example
**Fixed Rate:**
```
Step 0: η = 0.001
Step 100: η = 0.001
Step 1000: η = 0.001
Step 10000: η = 0.001
```
**Constant throughout!**
### Visualization
```
Learning Rate
0.001│─────────────────────────────────────
└───────────────────────────────────── Steps
```
### Problems
**1. Too High:**
- Overshoots minimum
- Oscillates around solution
- Never converges precisely
**2. Too Low:**
- Very slow training
- Takes forever to converge
- May get stuck
**Solution:** Use scheduling!
---
## 8.4 Cosine Annealing
### What is Cosine Annealing?
**Cosine Annealing** reduces the learning rate following a cosine curve from maximum to minimum.
### Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Where:**
- $\eta_t$ = learning rate at step $t$
- $\eta_{min}$ = minimum learning rate (default: 0)
- $\eta_{max}$ = initial/maximum learning rate
- $T_{max}$ = total number of steps
- $t$ = current step
### How It Works
**Step 1: Calculate Cosine Value**
```math
\cos\left(\frac{\pi t}{T_{max}}\right)
```
**Step 2: Shift to [0, 1] Range**
```math
\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Step 3: Scale to Learning Rate Range**
```math
\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}
```
### Example Calculation
**Given:**
- $\eta_{max} = 0.001$
- $\eta_{min} = 0$
- $T_{max} = 10000$
**At step $t = 0$:**
```math
\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001
```
**At step $t = 2500$:**
```math
\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854
```
**At step $t = 5000$:**
```math
\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005
```
**At step $t = 7500$:**
```math
\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146
```
**At step $t = 10000$:**
```math
\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
```
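The hand calculations above can be reproduced in a few lines of code. Below is a minimal Python sketch of the formula (the function name `cosine_annealing_lr` and its parameters are illustrative, not taken from the repository); in practice a library scheduler such as PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` applies the same rule for you.
```python
import math

def cosine_annealing_lr(t, eta_max, eta_min=0.0, t_max=10000):
    """Cosine annealing: eta_min + (eta_max - eta_min) * (1 + cos(pi * t / t_max)) / 2."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

# Reproduce the worked example above (eta_max = 0.001, eta_min = 0, T_max = 10000).
for step in (0, 2500, 5000, 7500, 10000):
    print(f"step {step:>5}: lr = {cosine_annealing_lr(step, eta_max=0.001):.6f}")
# step     0: lr = 0.001000
# step  2500: lr = 0.000854
# step  5000: lr = 0.000500
# step  7500: lr = 0.000146
# step 10000: lr = 0.000000
```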
### Visualization
```
Learning Rate
0.001 │●───────────────\
│ \
│ \
0.0005│ \
│ \
│ \
│ \
│ \
│ \
│ \
0│ ●─────
└───────────────────────────────────── Steps
0 2500 5000 7500 10000
```
**Smooth cosine curve!**
### Why Cosine Annealing?
**Benefits:**
1. **Smooth decay:** No abrupt changes
2. **Gradual reduction:** Better fine-tuning
3. **Works well:** Commonly used in practice
4. **High initial rate:** Fast learning
5. **Low final rate:** Precise convergence
---
## 8.5 Other Scheduling Strategies
### 1. Step Decay
**Reduce learning rate at fixed intervals:**
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
**Where:**
- $\gamma$ = decay factor (e.g., 0.1)
- $s$ = step size (e.g., every 1000 steps)
**Example:**
```
Step 0-999: η = 0.001
Step 1000-1999: η = 0.0001 (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)
```
**Visualization:**
```
Learning Rate
0.001 │───────┐
│ │
│ └───────┐
0.0001│ │
│ └───────┐
│ │
└───────────────────────── Steps
```
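A minimal Python sketch of step decay, with illustrative names (PyTorch's `torch.optim.lr_scheduler.StepLR` implements the same idea):
```python
def step_decay_lr(t, eta0=0.001, gamma=0.1, step_size=1000):
    """Step decay: eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)

# Matches the plateaus in the example above:
print(step_decay_lr(0), step_decay_lr(1500), step_decay_lr(2500))  # ≈ 0.001, 0.0001, 1e-05
```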
### 2. Exponential Decay
**Continuous exponential reduction:**
```math
\eta_t = \eta_0 \times \gamma^t
```
**Where:**
- $\gamma$ = decay rate (e.g., 0.9995)
**Visualization:**
```
Learning Rate
0.001│●──────────────\
│ \
│ \
│ \
│ \
│ \
│ \
│ \
└──────────────────────── Steps
```
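And a corresponding sketch of exponential decay (again with illustrative names; PyTorch's `ExponentialLR` applies the same multiplicative decay once per scheduler step):
```python
def exponential_decay_lr(t, eta0=0.001, gamma=0.9995):
    """Exponential decay: eta0 * gamma ** t (each step multiplies the rate by gamma)."""
    return eta0 * gamma ** t

# With gamma = 0.9995 the rate roughly halves every ~1386 steps (ln 2 / -ln 0.9995).
print(exponential_decay_lr(0), exponential_decay_lr(1386))  # ≈ 0.001, ≈ 0.0005
```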
### 3. Warmup Scheduling
**Start with a low rate, ramp it up, then decay:**
**Warmup Phase:**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**After Warmup:**
```math
\eta_t = \text{Cosine Annealing or other schedule}
```
**Visualization:**
```
Learning Rate
0.001│      ╱───────\
     │     ╱         \
     │    ╱           \
     │   ╱             \
     │  ╱               \
     │ ╱                 \
     │╱                   \
     └───────────────────── Steps
```
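A sketch of the warmup phase by itself (the name `linear_warmup_lr` is illustrative); a combined warmup-plus-cosine version is sketched in Section 8.7:
```python
def linear_warmup_lr(t, eta_max=0.001, t_warmup=500):
    """Linear warmup: ramp the rate from 0 up to eta_max over the first t_warmup steps."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    return eta_max  # after warmup, hand off to a decay schedule (e.g. cosine annealing)
```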
### 4. One Cycle Learning Rate
**One cycle: increase then decrease:**
```math
\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}
```
**Where:**
- $T_1$ = number of steps in the increasing phase
- $T_2$ = number of steps in the decreasing phase
**Visualization:**
```
Learning Rate
0.001│      ╱─────\
     │     ╱       \
     │    ╱         \
     │   ╱           \
     │  ╱             \
     │ ╱               \
     │╱                 \
     └─────────────────── Steps
```
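A sketch of the piecewise one-cycle formula above (illustrative names; PyTorch's `OneCycleLR` offers a more elaborate built-in variant):
```python
def one_cycle_lr(t, eta_min=0.0001, eta_max=0.001, t_up=1000, t_down=4000):
    """One cycle: linear increase over t_up steps (T1), then linear decrease over t_down steps (T2)."""
    if t <= t_up:
        return eta_min + (eta_max - eta_min) * t / t_up
    # Clamp so the rate never falls below eta_min once the cycle is finished.
    return max(eta_min, eta_max - (eta_max - eta_min) * (t - t_up) / t_down)
```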
---
## 8.6 Why Scheduling Matters
### Benefit 1: Better Convergence
**Without Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(slow convergence at end)
```
**With Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(faster convergence, better final loss)
```
### Benefit 2: More Stable Training
**Fixed High Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
(oscillating, unstable)
```
**Scheduled Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(smooth, stable)
```
### Benefit 3: Better Final Performance
**Comparison:**
```
Fixed LR: Final Loss = 2.15
Scheduled LR: Final Loss = 2.05
→ ≈5% improvement!
```
---
## 8.7 Complete Mathematical Formulation
### General Scheduling Formula
```math
\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)
```
**Where $f$ is the scheduling function**
### Cosine Annealing (Complete)
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Boundary Conditions:**
- At $t = 0$: $\eta_0 = \eta_{max}$
- At $t = T_{max}$: $\eta_{T_{max}} = \eta_{min}$
### Step Decay
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
### Exponential Decay
```math
\eta_t = \eta_0 \times \gamma^t
```
### Warmup + Cosine Annealing
**Warmup Phase ($t \leq T_{warmup}$):**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**Annealing Phase ($t > T_{warmup}$):**
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
```
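Putting the two phases together, a minimal sketch of warmup followed by cosine annealing (illustrative function and parameter names):
```python
import math

def warmup_cosine_lr(t, eta_max=0.001, eta_min=0.0, t_warmup=500, t_max=10000):
    """Linear warmup to eta_max, then cosine annealing down to eta_min by step t_max."""
    if t < t_warmup:
        return eta_max * t / t_warmup                  # warmup phase
    progress = (t - t_warmup) / (t_max - t_warmup)     # runs from 0 to 1 over the annealing phase
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * progress)) / 2
```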
---
## 8.8 Exercise: Schedule Calculation
### Problem
**Given Cosine Annealing schedule:**
- $\eta_{max} = 0.002$
- $\eta_{min} = 0.0001$
- $T_{max} = 5000$ steps
**Calculate the learning rate at:**
1. Step $t = 0$
2. Step $t = 1250$
3. Step $t = 2500$
4. Step $t = 3750$
5. Step $t = 5000$
### Step-by-Step Solution
#### General Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Substitute values:**
```math
\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
```math
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
#### Step 1: t = 0
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002
```
**Answer:** $\eta_0 = 0.002$
#### Step 2: t = 1250
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.707}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times 0.8535 = 0.0001 + 0.001621 = 0.001721
```
**Answer:** $\eta_{1250} \approx 0.001721$
#### Step 3: t = 2500
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105
```
**Answer:** $\eta_{2500} = 0.00105$
#### Step 4: t = 3750
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378
```
**Answer:** $\eta_{3750} \approx 0.000378$
#### Step 5: t = 5000
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001
```
**Answer:** $\eta_{5000} = 0.0001$
### Summary Table
| Step | Cosine Value | Scale Factor | Learning Rate |
|------|--------------|--------------|---------------|
| 0 | 1.0 | 1.0 | 0.002 |
| 1250 | 0.707 | 0.854 | 0.001721 |
| 2500 | 0.0 | 0.5 | 0.00105 |
| 3750 | -0.707 | 0.146 | 0.000378 |
| 5000 | -1.0 | 0.0 | 0.0001 |
**Smooth decay from 0.002 to 0.0001!**
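The table can be double-checked programmatically; here is a small verification sketch (illustrative code, not from the repository). The last digit at step 1250 (0.001722 vs. 0.001721 above) differs only because the hand calculation rounds $\cos(\pi/4)$ to 0.707.
```python
import math

def cosine_annealing_lr(t, eta_max=0.002, eta_min=0.0001, t_max=5000):
    """Cosine annealing with the exercise's eta_max, eta_min and T_max."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for step in (0, 1250, 2500, 3750, 5000):
    print(f"step {step:>4}: lr = {cosine_annealing_lr(step):.6f}")
# step    0: lr = 0.002000
# step 1250: lr = 0.001722
# step 2500: lr = 0.001050
# step 3750: lr = 0.000378
# step 5000: lr = 0.000100
```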
---
## 8.9 Key Takeaways
### Scheduling
- **Scheduling adjusts the learning rate during training**
- **Starts high (fast learning), ends low (fine-tuning)**
- **Improves convergence and final performance**
### Cosine Annealing
- **Smooth, cosine-shaped decay**
- **Gradual reduction from the maximum rate to the minimum**
- **Works well in practice for transformers**
### Why It Matters
- **Faster convergence**
- **More stable training**
- **Better final performance**
- **Essential for optimal training**
---
*This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies with mathematical formulations and solved exercises.*