# What is Scheduling? Step-by-Step Explanation

Complete step-by-step explanation of learning rate scheduling: how scheduling adjusts learning rates during training to improve convergence.

## Table of Contents

1. [What is Scheduling?](#81-what-is-scheduling)
2. [Why Do We Need Scheduling?](#82-why-do-we-need-scheduling)
3. [Fixed Learning Rate](#83-fixed-learning-rate)
4. [Cosine Annealing](#84-cosine-annealing)
5. [Other Scheduling Strategies](#85-other-scheduling-strategies)
6. [Why Scheduling Matters](#86-why-scheduling-matters)
7. [Complete Mathematical Formulation](#87-complete-mathematical-formulation)
8. [Exercise: Schedule Calculation](#88-exercise-schedule-calculation)
9. [Key Takeaways](#89-key-takeaways)

---

## 8.1 What is Scheduling?

### Simple Definition

**Scheduling** (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.

### Visual Analogy

**Think of scheduling like adjusting speed while driving:**

```
Fixed Learning Rate:
┌──────────────────────────┐
│ Speed: 60 mph (constant) │
└──────────────────────────┘
→ Hard to stop precisely!

Scheduled Learning Rate:
┌──────────────────────────┐
│ Speed: 60 → 40 → 20 → 10 │
└──────────────────────────┘
→ Smooth deceleration!
```

**Scheduling adjusts your speed (learning rate) as you approach the destination (convergence)!**

### What Scheduling Does

**Scheduling:**

1. **Starts** with a higher learning rate (fast learning)
2. **Gradually reduces** the learning rate (precise fine-tuning)
3. **Converges** to an optimal solution

**Result:** Better convergence and performance!

---

## 8.2 Why Do We Need Scheduling?

### The Problem with Fixed Learning Rate

**High Learning Rate:**
```
Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots the minimum ✗
→ Bounces around ✗
→ Poor convergence ✗
```

**Low Learning Rate:**
```
Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗
```

**With a single fixed rate, you can't have both!**

### The Solution: Scheduling

**Adaptive Learning Rate:**
```
Start:  0.001  (fast learning)
Middle: 0.0005 (moderate)
End:    0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!
```

### Benefits of Scheduling

**1. Faster Convergence**
- High initial rate = fast progress
- Lower later rate = precise convergence

**2. Better Final Performance**
- Fine-tuning at the end = better solution
- Avoids overshooting = more stable

**3. More Stable Training**
- Gradual reduction = smooth optimization
- Less oscillation = more reliable

---

## 8.3 Fixed Learning Rate

### What is Fixed Learning Rate?

**The learning rate stays constant throughout training:**

```math
\eta_t = \eta_0 \quad \text{for all } t
```

**Where:**
- $\eta_0$ = initial learning rate
- $t$ = training step

### Example

**Fixed Rate:**
```
Step 0:     η = 0.001
Step 100:   η = 0.001
Step 1000:  η = 0.001
Step 10000: η = 0.001
```

**Constant throughout!**
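
In code, a fixed rate is simply the optimizer's default behavior. A minimal sketch, assuming PyTorch; the `torch.nn.Linear` model and squared-output loss are placeholders, not code from this repo:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# Fixed learning rate: eta_t = eta_0 = 0.001 at every step.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(3):
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, optimizer.param_groups[0]["lr"])  # always 0.001
```
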

### Visualization

```
Learning Rate
      │
0.001 │─────────────────────────────────────
      │
      │
      │
      └───────────────────────────────────── Steps
```

### Problems

**1. Too High:**
- Overshoots the minimum
- Oscillates around the solution
- Never converges precisely

**2. Too Low:**
- Very slow training
- Takes forever to converge
- May get stuck

**Solution:** Use scheduling!

---

## 8.4 Cosine Annealing

### What is Cosine Annealing?

**Cosine Annealing** reduces the learning rate following a cosine curve from a maximum value down to a minimum value.

### Formula

```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```

**Where:**
- $\eta_t$ = learning rate at step $t$
- $\eta_{min}$ = minimum learning rate (default: 0)
- $\eta_{max}$ = initial/maximum learning rate
- $T_{max}$ = total number of steps
- $t$ = current step

### How It Works

**Step 1: Calculate Cosine Value**
```math
\cos\left(\frac{\pi t}{T_{max}}\right)
```

**Step 2: Shift to [0, 1] Range**
```math
\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```

**Step 3: Scale to Learning Rate Range**
```math
\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}
```
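
The three steps translate directly into a few lines of Python. A minimal standalone sketch of the formula (PyTorch's built-in `CosineAnnealingLR` implements the same curve, but this version makes the arithmetic explicit):

```python
import math

def cosine_annealing_lr(t: int, eta_max: float, eta_min: float = 0.0,
                        t_max: int = 10_000) -> float:
    """Learning rate at step t under cosine annealing."""
    cosine = math.cos(math.pi * t / t_max)        # Step 1: cosine value
    scale = (1.0 + cosine) / 2.0                  # Step 2: shift to [0, 1]
    return eta_min + (eta_max - eta_min) * scale  # Step 3: scale to LR range
```
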

### Example Calculation

**Given:**
- $\eta_{max} = 0.001$
- $\eta_{min} = 0$
- $T_{max} = 10000$

**At step $t = 0$:**
```math
\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001
```

**At step $t = 2500$:**
```math
\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854
```

**At step $t = 5000$:**
```math
\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005
```

**At step $t = 7500$:**
```math
\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146
```

**At step $t = 10000$:**
```math
\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
```
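
These worked values can be reproduced with the `cosine_annealing_lr` sketch above:

```python
for t in (0, 2500, 5000, 7500, 10000):
    print(t, round(cosine_annealing_lr(t, eta_max=0.001), 6))
# 0      0.001
# 2500   0.000854
# 5000   0.0005
# 7500   0.000146
# 10000  0.0
```
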

### Visualization

```
Learning Rate
       │
0.001  │●────────\
       │          \
       │           \
0.0005 │            \
       │             \
       │              \
       │               \
     0 │                \────●
       └───────────────────────────────────── Steps
        0     2500    5000    7500    10000
```

**Smooth cosine curve!**

### Why Cosine Annealing?

**Benefits:**
1. **Smooth decay:** No abrupt changes
2. **Gradual reduction:** Better fine-tuning
3. **Works well in practice:** Commonly used, especially for training transformers
4. **High initial rate:** Fast learning
5. **Low final rate:** Precise convergence

---

## 8.5 Other Scheduling Strategies

### 1. Step Decay

**Reduce the learning rate at fixed intervals:**

```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```

**Where:**
- $\gamma$ = decay factor (e.g., 0.1)
- $s$ = step size (e.g., every 1000 steps)

**Example:**
```
Step 0-999:     η = 0.001
Step 1000-1999: η = 0.0001  (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)
```

**Visualization:**
```
Learning Rate
      │
0.001 │───────┐
      │       │
      │       └───────┐
0.0001│               │
      │               └───────┐
      │                       │
      └───────────────────────── Steps
```
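
A minimal sketch of step decay as a standalone function (PyTorch's `StepLR` scheduler provides the same behavior):

```python
def step_decay_lr(t: int, eta0: float = 1e-3, gamma: float = 0.1,
                  step_size: int = 1000) -> float:
    """eta_t = eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)

# Matches the example above:
print(step_decay_lr(999))   # 0.001
print(step_decay_lr(1000))  # ~0.0001
print(step_decay_lr(2000))  # ~0.00001
```
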

### 2. Exponential Decay

**Continuous exponential reduction:**

```math
\eta_t = \eta_0 \times \gamma^t
```

**Where:**
- $\gamma$ = per-step decay rate (e.g., 0.9995)

**Visualization:**
```
Learning Rate
      │
0.001 │●
      │ \
      │  \
      │   \
      │    \_
      │      \__
      │         \____
      │              \________
      └──────────────────────── Steps
```
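
The same idea as a one-line function (PyTorch's `ExponentialLR` applies the factor γ once per `scheduler.step()` call, so the granularity is up to you):

```python
def exponential_decay_lr(t: int, eta0: float = 1e-3,
                         gamma: float = 0.9995) -> float:
    """eta_t = eta0 * gamma ** t (smooth, continuous decay)."""
    return eta0 * gamma ** t

print(exponential_decay_lr(0))     # 0.001
print(exponential_decay_lr(5000))  # ~0.000082 (0.9995**5000 ≈ 0.082)
```
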

### 3. Warmup Scheduling

**Start with a low rate, increase it, then decrease:**

**Warmup Phase:**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```

**After Warmup:**

Once $t > T_{warmup}$, follow cosine annealing (or another decay schedule); a combined sketch follows the visualization below.

**Visualization:**
```
Learning Rate
      │
0.001 │    ╱───────\
      │   ╱         \
      │  ╱           \
      │ ╱             \
      │╱               \
      └───────────────────── Steps
```
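
A minimal sketch combining linear warmup with the cosine decay from §8.4 (parameter names here are illustrative):

```python
import math

def warmup_cosine_lr(t: int, eta_max: float, t_warmup: int, t_max: int,
                     eta_min: float = 0.0) -> float:
    """Linear warmup to eta_max, then cosine annealing down to eta_min."""
    if t < t_warmup:
        return eta_max * t / t_warmup               # warmup phase
    progress = (t - t_warmup) / (t_max - t_warmup)  # in [0, 1]
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * progress)) / 2
```
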

### 4. One Cycle Learning Rate

**One cycle: increase, then decrease:**

```math
\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}
```

**Where $T_1$ is the length of the ramp-up phase and $T_2$ is the length of the ramp-down phase.**

**Visualization:**
```
Learning Rate
      │
0.001 │    ╱─────\
      │   ╱       \
      │  ╱         \
      │ ╱           \
      │╱             \
      └─────────────────── Steps
```
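
A direct translation of the piecewise formula into a sketch (PyTorch also ships `torch.optim.lr_scheduler.OneCycleLR`, which builds on this basic up-then-down shape):

```python
def one_cycle_lr(t: int, eta_max: float, eta_min: float,
                 t_up: int, t_down: int) -> float:
    """Linear ramp up over t_up steps (T_1), then down over t_down steps (T_2)."""
    if t <= t_up:
        return eta_min + (eta_max - eta_min) * t / t_up
    return eta_max - (eta_max - eta_min) * (t - t_up) / t_down
```
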
---

## 8.6 Why Scheduling Matters

### Benefit 1: Better Convergence

**Without Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(slow convergence at the end)
```

**With Scheduling:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(faster convergence, better final loss)
```

### Benefit 2: More Stable Training

**Fixed High Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
(oscillating, unstable)
```

**Scheduled Rate:**
```
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(smooth, stable)
```

### Benefit 3: Better Final Performance

**Comparison (illustrative numbers):**
```
Fixed LR:     Final Loss = 2.15
Scheduled LR: Final Loss = 2.05

→ Roughly 5% lower final loss!
```

---

## 8.7 Complete Mathematical Formulation

### General Scheduling Formula

```math
\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)
```

**where $f$ is the scheduling function.**

### Cosine Annealing (Complete)

```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```

**Boundary Conditions:**
- At $t = 0$: $\eta_0 = \eta_{max}$
- At $t = T_{max}$: $\eta_{T_{max}} = \eta_{min}$

### Step Decay

```math
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
```

### Exponential Decay

```math
\eta_t = \eta_0 \times \gamma^t
```

### Warmup + Cosine Annealing

**Warmup Phase ($t \leq T_{warmup}$):**
```math
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
```

**Annealing Phase ($t > T_{warmup}$):**
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
```
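
To drive an actual optimizer with this warmup-plus-cosine formulation, one common pattern is `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base rate by a step-dependent factor. A hedged sketch with placeholder constants and model:

```python
import math
import torch

ETA_MAX, ETA_MIN = 1e-3, 1e-5    # placeholder rates
T_WARMUP, T_MAX = 1_000, 10_000  # placeholder step counts

def lr_factor(t: int) -> float:
    """Multiplier applied to the base LR (ETA_MAX) at step t."""
    if t < T_WARMUP:
        return t / T_WARMUP
    progress = (t - T_WARMUP) / (T_MAX - T_WARMUP)
    cosine = (1 + math.cos(math.pi * progress)) / 2
    floor = ETA_MIN / ETA_MAX
    return floor + (1 - floor) * cosine

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=ETA_MAX)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# In the training loop, call optimizer.step() and then scheduler.step()
# once per step so the schedule advances with training.
```
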
---

## 8.8 Exercise: Schedule Calculation

### Problem

**Given a Cosine Annealing schedule with:**

- $\eta_{max} = 0.002$
- $\eta_{min} = 0.0001$
- $T_{max} = 5000$ steps

**Calculate the learning rate at:**
1. Step $t = 0$
2. Step $t = 1250$
3. Step $t = 2500$
4. Step $t = 3750$
5. Step $t = 5000$

### Step-by-Step Solution

#### General Formula

```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
```

**Substitute the given values:**
```math
\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```

```math
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
```

#### Step 1: t = 0

```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
```

```math
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
```

```math
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002
```

**Answer:** $\eta_0 = 0.002$

#### Step 2: t = 1250

```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
```

```math
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.707}{2}
```

```math
\eta_{1250} = 0.0001 + 0.0019 \times 0.8536 = 0.0001 + 0.001622 = 0.001722
```

**Answer:** $\eta_{1250} \approx 0.001722$

#### Step 3: t = 2500

```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
```

```math
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
```

```math
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105
```

**Answer:** $\eta_{2500} = 0.00105$

#### Step 4: t = 3750

```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
```

```math
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
```

```math
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378
```

**Answer:** $\eta_{3750} \approx 0.000378$

#### Step 5: t = 5000

```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
```

```math
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
```

```math
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001
```

**Answer:** $\eta_{5000} = 0.0001$

### Summary Table

| Step | Cosine Value | Scale Factor | Learning Rate |
|------|--------------|--------------|---------------|
| 0    | 1.0          | 1.0          | 0.002         |
| 1250 | 0.707        | 0.854        | 0.001722      |
| 2500 | 0.0          | 0.5          | 0.00105       |
| 3750 | -0.707       | 0.146        | 0.000378      |
| 5000 | -1.0         | 0.0          | 0.0001        |

**Smooth decay from 0.002 down to 0.0001!**
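
The whole table can be double-checked numerically with the cosine helper from §8.4 (repeated here so the snippet is self-contained):

```python
import math

def cosine_annealing_lr(t, eta_max, eta_min, t_max):
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for t in (0, 1250, 2500, 3750, 5000):
    print(t, round(cosine_annealing_lr(t, 0.002, 0.0001, 5000), 6))
# 0     0.002
# 1250  0.001722
# 2500  0.00105
# 3750  0.000378
# 5000  0.0001
```
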

---

## 8.9 Key Takeaways

### Scheduling

✅ **Scheduling adjusts the learning rate during training**
✅ **Starts high (fast learning), ends low (fine-tuning)**
✅ **Improves convergence and final performance**

### Cosine Annealing

✅ **Smooth cosine-based decay**
✅ **Gradual reduction from max to min**
✅ **Works well for transformers**

### Why Important

✅ **Faster convergence**
✅ **More stable training**
✅ **Better final performance**
✅ **Essential for optimal training**

---

*This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies, with mathematical formulations and solved exercises.*