sheepOp/docs/SCHEDULING_EXPLAINED.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00

# What is Scheduling? Step-by-Step Explanation
Complete step-by-step explanation of learning rate scheduling: how scheduling adjusts learning rates during training to improve convergence.
## Table of Contents
1. [What is Scheduling?](#81-what-is-scheduling)
2. [Why Do We Need Scheduling?](#82-why-do-we-need-scheduling)
3. [Fixed Learning Rate](#83-fixed-learning-rate)
4. [Cosine Annealing](#84-cosine-annealing)
5. [Other Scheduling Strategies](#85-other-scheduling-strategies)
6. [Why Scheduling Matters](#86-why-scheduling-matters)
7. [Complete Mathematical Formulation](#87-complete-mathematical-formulation)
8. [Exercise: Schedule Calculation](#88-exercise-schedule-calculation)
9. [Key Takeaways](#89-key-takeaways)
---
## 8.1 What is Scheduling?
### Simple Definition
**Scheduling** (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.
### Visual Analogy
**Think of scheduling like adjusting speed while driving:**
```
Fixed Learning Rate:
┌──────────────────────────┐
│ Speed: 60 mph (constant) │
└──────────────────────────┘
→ Hard to stop precisely!
Scheduled Learning Rate:
┌──────────────────────────┐
│ Speed: 60 → 40 → 20 → 10 │
└──────────────────────────┘
→ Smooth deceleration!
```
**Scheduling adjusts speed (learning rate) as you approach the destination (convergence)!**
### What Scheduling Does
**Scheduling:**
1. **Starts** with higher learning rate (fast learning)
2. **Gradually reduces** learning rate (precise fine-tuning)
3. **Converges** to a good (low-loss) solution
**Result:** Better convergence and performance!
---
## 8.2 Why Do We Need Scheduling?
### The Problem with Fixed Learning Rate
**High Learning Rate:**
```
Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots minimum ✗
→ Bounces around ✗
→ Poor convergence ✗
```
**Low Learning Rate:**
```
Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗
```
**Can't have both!**
### The Solution: Scheduling
**Scheduled Learning Rate:**
```
Start: 0.001 (fast learning)
Middle: 0.0005 (moderate)
End: 0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!
```
### Benefits of Scheduling
**1. Faster Convergence**
- High initial rate = Fast progress
- Lower later rate = Precise convergence
**2. Better Final Performance**
- Fine-tuning at end = Better solution
- Avoids overshooting = More stable
**3. More Stable Training**
- Gradual reduction = Smooth optimization
- Less oscillation = More reliable
---
## 8.3 Fixed Learning Rate
### What is Fixed Learning Rate?
**Learning rate stays constant throughout training:**
```math
\eta_t = \eta_0 \quad \text{for all } t
```
**Where:**
- $\eta_0$ = initial learning rate
- $t$ = training step
### Example
**Fixed Rate:**
```
Step 0: η = 0.001
Step 100: η = 0.001
Step 1000: η = 0.001
Step 10000: η = 0.001
```
**Constant throughout!**
### Visualization
```
Learning Rate
0.001│─────────────────────────────────────
     └───────────────────────────────────── Steps
```
### Problems
**1. Too High:**
- Overshoots minimum
- Oscillates around solution
- Never converges precisely
**2. Too Low:**
- Very slow training
- Takes forever to converge
- May get stuck
**Solution:** Use scheduling!
---
## 8.4 Cosine Annealing
### What is Cosine Annealing?
**Cosine Annealing** reduces the learning rate following a cosine curve from maximum to minimum.
### Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Where:**
- $\eta_t$ = learning rate at step $t$
- $\eta_{min}$ = minimum learning rate (default: 0)
- $\eta_{max}$ = initial/maximum learning rate
- $T_{max}$ = total number of steps
- $t$ = current step
### How It Works
**Step 1: Calculate Cosine Value**
```math
\cos\left(\frac{\pi t}{T_{max}}\right)
```
**Step 2: Shift to [0, 1] Range**
```math
\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Step 3: Scale to Learning Rate Range**
```math
\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}
```
### Example Calculation
**Given:**
- $\eta_{max} = 0.001$
- $\eta_{min} = 0$
- $T_{max} = 10000$
**At step $t = 0$:**
```math
\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001
```
**At step $t = 2500$:**
```math
\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854
```
**At step $t = 5000$:**
```math
\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005
```
**At step $t = 7500$:**
```math
\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146
```
**At step $t = 10000$:**
```math
\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
```
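The three steps and the worked values above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `cosine_annealing_lr` and its parameter names are illustrative choices, not taken from the SheepOp code.

```python
import math


def cosine_annealing_lr(t: int, t_max: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Learning rate at step t under cosine annealing (hypothetical helper)."""
    # Step 1: cosine value, moving from +1 at t = 0 to -1 at t = t_max.
    cosine = math.cos(math.pi * t / t_max)
    # Step 2: shift and rescale from [-1, 1] to [0, 1].
    scale = (1.0 + cosine) / 2.0
    # Step 3: map the [0, 1] scale onto [eta_min, eta_max].
    return eta_min + (eta_max - eta_min) * scale


# Same settings as the example: eta_max = 0.001, eta_min = 0, T_max = 10000.
for step in (0, 2500, 5000, 7500, 10000):
    print(step, round(cosine_annealing_lr(step, t_max=10000, eta_max=0.001), 6))
```

The printed values can be compared against the hand calculations for each step.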
### Visualization
```
Learning Rate
 0.001 │●●●
       │   ●●●
       │      ●
       │       ●●
       │         ●
0.0005 │          ●
       │           ●
       │            ●●
       │              ●
       │               ●●●
     0 │                  ●●●
       └───────────────────── Steps
        0    2500 5000 7500 10000
```
**Smooth cosine curve!**
### Why Cosine Annealing?
**Benefits:**
1. **Smooth decay:** No abrupt changes
2. **Gradual reduction:** Better fine-tuning
3. **Works well:** Commonly used in practice
4. **High initial rate:** Fast learning
5. **Low final rate:** Precise convergence
---
## 8.5 Other Scheduling Strategies
### 1. Step Decay
**Reduce learning rate at fixed intervals:**
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
**Where:**
- $\gamma$ = decay factor (e.g., 0.1)
- $s$ = step size (e.g., every 1000 steps)
**Example:**
```
Step 0-999: η = 0.001
Step 1000-1999: η = 0.0001 (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)
```
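The step decay formula can be sketched in Python as follows. The function name `step_decay_lr` and the parameter name `step_size` (standing for $s$) are assumptions made for this illustration; the values match the example above.

```python
def step_decay_lr(t: int, eta_0: float, gamma: float, step_size: int) -> float:
    """Step decay: multiply eta_0 by gamma once per completed interval of step_size steps."""
    # Integer division implements the floor in the formula.
    return eta_0 * gamma ** (t // step_size)


# Same settings as the example: eta_0 = 0.001, gamma = 0.1, s = 1000.
for step in (0, 999, 1000, 1999, 2000):
    print(step, step_decay_lr(step, eta_0=0.001, gamma=0.1, step_size=1000))
```

The rate only changes when `t` crosses a multiple of `step_size`, which is what produces the staircase shape. Printed values may show small floating-point noise in the last digits.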
**Visualization:**
```
Learning Rate
0.001   │───────┐
        │       │
0.0001  │       └───────┐
        │               │
0.00001 │               └───────
        └──────────────────────── Steps
         0      1000    2000
```
### 2. Exponential Decay
**Multiply the learning rate by a constant factor $\gamma < 1$ at every step:**
```math
\eta_t = \eta_0 \times \gamma^t
```
**Where:**
- $\gamma$ = decay rate (e.g., 0.9995)
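To get a feel for how fast a given $\gamma$ decays, it helps to ask how many steps it takes for the rate to halve. The sketch below solves $\gamma^t = 0.5$ for $t$; the function names are hypothetical, and $\eta_0 = 0.001$ is reused from the earlier examples as an assumption.

```python
import math


def exponential_decay_lr(t: int, eta_0: float, gamma: float) -> float:
    """Exponential decay: eta_0 multiplied by gamma once per step."""
    return eta_0 * gamma ** t


def steps_to_halve(gamma: float) -> float:
    """Number of steps after which the rate has halved (solves gamma**t = 0.5)."""
    return math.log(0.5) / math.log(gamma)


print(steps_to_halve(0.9995))
print(exponential_decay_lr(1386, eta_0=0.001, gamma=0.9995))
```

With $\gamma = 0.9995$ the rate halves roughly every 1,386 steps, so a value that looks very close to 1 still shrinks the rate substantially over a long run.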
**Visualization:**
```
Learning Rate
0.001│●
     │ ●
     │  ●
     │   ●●
     │     ●●
     │       ●●●
     │          ●●●●●
     │               ●●●●●●●●
     └──────────────────────── Steps
```
### 3. Warmup Scheduling
**Start with low rate, increase, then decrease:**
**Warmup Phase:**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**After Warmup:**
```math
\eta_t = \text{Cosine Annealing or other schedule}
```
**Visualization:**
```
Learning Rate
0.001│     ╱───────╲
     │    ╱         ╲
     │   ╱           ╲
     │  ╱             ╲
     │ ╱               ╲
     │╱                 ╲
     └───────────────────── Steps
```
### 4. One Cycle Learning Rate
**One cycle: increase then decrease:**
```math
\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}
```
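**Where:**
- $T_1$ = number of steps in the increasing phase
- $T_2$ = number of steps in the decreasing phase (training ends at $t = T_1 + T_2$)

**Visualization:**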
```
Learning Rate
0.001│      ╱╲
     │     ╱  ╲
     │    ╱    ╲
     │   ╱      ╲
     │  ╱        ╲
     │ ╱          ╲
     │╱            ╲
     └─────────────────── Steps
```
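The piecewise-linear one cycle formula above translates directly into code. In this sketch, `t_1` and `t_2` are read as the lengths of the increasing and decreasing phases, and the numeric values in the usage lines are hypothetical.

```python
def one_cycle_lr(t: int, t_1: int, t_2: int, eta_max: float, eta_min: float) -> float:
    """Piecewise-linear one cycle schedule: ramp up for t_1 steps, then down for t_2 steps."""
    if t <= t_1:
        # Increasing phase: eta_min -> eta_max.
        return eta_min + (eta_max - eta_min) * t / t_1
    # Decreasing phase: eta_max -> eta_min.
    return eta_max - (eta_max - eta_min) * (t - t_1) / t_2


# Hypothetical settings: 300 steps up, 700 steps down.
for step in (0, 150, 300, 650, 1000):
    print(step, one_cycle_lr(step, t_1=300, t_2=700, eta_max=0.001, eta_min=0.0001))
```

Library implementations can differ from this linear form. For example, PyTorch's `torch.optim.lr_scheduler.OneCycleLR` uses cosine-shaped phases by default (`anneal_strategy='cos'`).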
---
## 8.6 Why Scheduling Matters
### Benefit 1: Better Convergence
**Without Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(slow convergence at end)
```
**With Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(faster convergence, better final loss)
```
### Benefit 2: More Stable Training
**Fixed High Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
(oscillating, unstable)
```
**Scheduled Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(smooth, stable)
```
### Benefit 3: Better Final Performance
**Comparison:**
```
Fixed LR: Final Loss = 2.15
Scheduled LR: Final Loss = 2.05
→ about 5% lower final loss (illustrative numbers)
```
---
## 8.7 Complete Mathematical Formulation
### General Scheduling Formula
```math
\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)
```
**Where $f$ is the scheduling function**
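In code, a schedule is just a function from the step index to a learning rate, evaluated once per training step. The sketch below shows one way to apply such a function. The list of dictionaries is a hypothetical stand-in for an optimizer's parameter groups (PyTorch optimizers expose a similar `param_groups` list with an `"lr"` key), and `apply_schedule` and `example_schedule` are names invented for this illustration.

```python
from typing import Callable, Dict, List

# A schedule is any function mapping the step index t to a learning rate.
Schedule = Callable[[int], float]


def apply_schedule(param_groups: List[Dict[str, float]], schedule: Schedule, step: int) -> float:
    """Evaluate the schedule at `step` and write the result into every parameter group."""
    lr = schedule(step)
    for group in param_groups:
        group["lr"] = lr
    return lr


def example_schedule(t: int) -> float:
    """Exponential decay with eta_0 = 0.001 and gamma = 0.9995 (illustrative values)."""
    return 0.001 * 0.9995 ** t


# Hypothetical stand-in for an optimizer's parameter groups.
param_groups = [{"lr": 0.001}, {"lr": 0.001}]
history = [apply_schedule(param_groups, example_schedule, step) for step in range(3)]
print(history)
```

Swapping in a different schedule only requires passing a different function, which is why the general form $f(t, \dots)$ is a convenient abstraction.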
### Cosine Annealing (Complete)
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Boundary Conditions:**
- At $t = 0$: $\eta_0 = \eta_{max}$
- At $t = T_{max}$: $\eta_{T_{max}} = \eta_{min}$
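A third checkpoint that follows from the same formula is the midpoint. Since $\cos(\pi/2) = 0$, the learning rate halfway through training is the average of the two extremes:

```math
t = \frac{T_{max}}{2}: \quad \eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos(\pi/2)}{2} = \frac{\eta_{max} + \eta_{min}}{2}
```

This agrees with the Section 8.4 example, where $\eta_{max} = 0.001$ and $\eta_{min} = 0$ give $\eta_{5000} = 0.0005$.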
### Step Decay
```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```
### Exponential Decay
```math
\eta_t = \eta_0 \times \gamma^t
```
### Warmup + Cosine Annealing
**Warmup Phase ($t \leq T_{warmup}$):**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```
**Annealing Phase ($t > T_{warmup}$):**
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
```
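The two phases can be combined into a single function. This is a minimal sketch that assumes $0 < T_{warmup} < T_{max}$; the function name `warmup_cosine_lr` and the numeric settings in the usage lines are hypothetical.

```python
import math


def warmup_cosine_lr(t: int, t_warmup: int, t_max: int,
                     eta_max: float, eta_min: float = 0.0) -> float:
    """Linear warmup to eta_max, then cosine annealing down to eta_min."""
    if t <= t_warmup:
        # Warmup phase: rate grows linearly from 0 to eta_max.
        return eta_max * t / t_warmup
    # Annealing phase: progress runs from 0 to 1 over the remaining steps.
    progress = (t - t_warmup) / (t_max - t_warmup)
    return eta_min + (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress)) / 2.0


# Hypothetical settings: 1000 warmup steps out of 10000 total.
for step in (0, 500, 1000, 5500, 10000):
    print(step, round(warmup_cosine_lr(step, t_warmup=1000, t_max=10000, eta_max=0.001), 6))
```

Both branches give $\eta_{max}$ at $t = T_{warmup}$, so the schedule is continuous where the phases meet.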
---
## 8.8 Exercise: Schedule Calculation
### Problem
**Given Cosine Annealing schedule:**
- $\eta_{max} = 0.002$
- $\eta_{min} = 0.0001$
- $T_{max} = 5000$ steps
**Calculate the learning rate at:**
1. Step $t = 0$
2. Step $t = 1250$
3. Step $t = 2500$
4. Step $t = 3750$
5. Step $t = 5000$
### Step-by-Step Solution
#### General Formula
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```
**Substitute values:**
```math
\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
```math
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```
#### Step 1: t = 0
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
```
```math
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002
```
**Answer:** $\eta_0 = 0.002$
#### Step 2: t = 1250
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.707}{2}
```
```math
\eta_{1250} = 0.0001 + 0.0019 \times 0.8535 \approx 0.0001 + 0.001622 = 0.001722
```
**Answer:** $\eta_{1250} \approx 0.001722$
#### Step 3: t = 2500
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
```
```math
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105
```
**Answer:** $\eta_{2500} = 0.00105$
#### Step 4: t = 3750
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
```
```math
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378
```
**Answer:** $\eta_{3750} \approx 0.000378$
#### Step 5: t = 5000
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
```
```math
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001
```
**Answer:** $\eta_{5000} = 0.0001$
### Summary Table
| Step | Cosine Value | Scale Factor | Learning Rate |
|------|--------------|--------------|---------------|
| 0 | 1.0 | 1.0 | 0.002 |
| 1250 | 0.707 | 0.854 | 0.001722 |
| 2500 | 0.0 | 0.5 | 0.00105 |
| 3750 | -0.707 | 0.146 | 0.000378 |
| 5000 | -1.0 | 0.0 | 0.0001 |
**Smooth decay from 0.002 to 0.0001!**
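The exercise can be checked numerically. The short script below is a sketch using only the standard library: it recomputes the cosine value, scale factor, and learning rate for each step with the exercise's settings. The variable names are illustrative.

```python
import math

# Exercise settings.
eta_max, eta_min, t_max = 0.002, 0.0001, 5000

rows = []
for t in (0, 1250, 2500, 3750, 5000):
    cosine = math.cos(math.pi * t / t_max)      # cosine value
    scale = (1.0 + cosine) / 2.0                # scale factor in [0, 1]
    lr = eta_min + (eta_max - eta_min) * scale  # learning rate
    rows.append((t, cosine, scale, lr))

for t, cosine, scale, lr in rows:
    print(f"{t:>5} | {cosine:+.3f} | {scale:.3f} | {lr:.6f}")
```

Each printed row corresponds to one row of the summary table, in the same column order.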
---
## 8.9 Key Takeaways
### Scheduling
- **Scheduling adjusts the learning rate during training**
- **Starts high (fast learning), ends low (fine-tuning)**
- **Improves convergence and final performance**
### Cosine Annealing
- **Smooth cosine-based decay**
- **Gradual reduction from max to min**
- **Works well for transformers**
### Why Important
- **Faster convergence**
- **More stable training**
- **Better final performance**
- **Essential for optimal training**
---
*This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies with mathematical formulations and solved exercises.*