What is Scheduling? Step-by-Step Explanation
Complete step-by-step explanation of learning rate scheduling: how scheduling adjusts learning rates during training to improve convergence.
Table of Contents
- What is Scheduling?
- Why Do We Need Scheduling?
- Fixed Learning Rate
- Cosine Annealing
- Other Scheduling Strategies
- Why Scheduling Matters
- Complete Mathematical Formulation
- Exercise: Schedule Calculation
- Key Takeaways
8.1 What is Scheduling?
Simple Definition
Scheduling (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.
Visual Analogy
Think of scheduling like adjusting speed while driving:
Fixed Learning Rate:
┌──────────────────────────┐
│ Speed: 60 mph (constant) │
└──────────────────────────┘
→ Hard to stop precisely!
Scheduled Learning Rate:
┌──────────────────────────┐
│ Speed: 60 → 40 → 20 → 10 │
└──────────────────────────┘
→ Smooth deceleration!
Scheduling adjusts speed (learning rate) as you approach the destination (convergence)!
What Scheduling Does
Scheduling:
- Starts with higher learning rate (fast learning)
- Gradually reduces learning rate (precise fine-tuning)
- Converges to optimal solution
Result: Better convergence and performance!
8.2 Why Do We Need Scheduling?
The Problem with Fixed Learning Rate
High Learning Rate:
Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots minimum ✗
→ Bounces around ✗
→ Poor convergence ✗
Low Learning Rate:
Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗
Can't have both!
The Solution: Scheduling
Adaptive Learning Rate:
Start: 0.001 (fast learning)
Middle: 0.0005 (moderate)
End: 0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!
Benefits of Scheduling
1. Faster Convergence
- High initial rate = Fast progress
- Lower later rate = Precise convergence
2. Better Final Performance
- Fine-tuning at end = Better solution
- Avoids overshooting = More stable
3. More Stable Training
- Gradual reduction = Smooth optimization
- Less oscillation = More reliable
8.3 Fixed Learning Rate
What is Fixed Learning Rate?
Learning rate stays constant throughout training:
\eta_t = \eta_0 \quad \text{for all } t
Where:
- \eta_0 = initial learning rate
- t = training step
Example
Fixed Rate:
Step 0: η = 0.001
Step 100: η = 0.001
Step 1000: η = 0.001
Step 10000: η = 0.001
Constant throughout!
Visualization
Learning Rate
│
0.001│─────────────────────────────────────
│
│
│
│
└───────────────────────────────────── Steps
Problems
1. Too High:
- Overshoots minimum
- Oscillates around solution
- Never converges precisely
2. Too Low:
- Very slow training
- Takes forever to converge
- May get stuck
Solution: Use scheduling!
8.4 Cosine Annealing
What is Cosine Annealing?
Cosine Annealing reduces the learning rate following a cosine curve from maximum to minimum.
Formula
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
Where:
- \eta_t = learning rate at step t
- \eta_{min} = minimum learning rate (default: 0)
- \eta_{max} = initial/maximum learning rate
- T_{max} = total number of steps
- t = current step
How It Works
Step 1: Calculate Cosine Value
\cos\left(\frac{\pi t}{T_{max}}\right)
Step 2: Shift to [0, 1] Range
\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
Step 3: Scale to Learning Rate Range
\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}
Example Calculation
Given:
- \eta_{max} = 0.001
- \eta_{min} = 0
- T_{max} = 10000
At step t = 0:
\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001
At step t = 2500:
\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854
At step t = 5000:
\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005
At step t = 7500:
\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146
At step t = 10000:
\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
Visualization
Learning Rate
│
0.001 │●───────────────\
│ \
│ \
0.0005│ \
│ \
│ \
│ \
│ \
│ \
│ \
0│ ●─────
└───────────────────────────────────── Steps
0 2500 5000 7500 10000
Smooth cosine curve!
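The three steps above can be collected into a small helper function. This is an illustrative, framework-free sketch of the formula, reproducing the worked example:

```python
import math

def cosine_annealing(t, eta_max, eta_min, t_max):
    """Learning rate at step t under cosine annealing (eta_max -> eta_min)."""
    cosine = math.cos(math.pi * t / t_max)        # Step 1: cosine value
    scale = (1 + cosine) / 2                      # Step 2: shift to [0, 1]
    return eta_min + (eta_max - eta_min) * scale  # Step 3: scale to LR range

# Worked example: eta_max = 0.001, eta_min = 0, T_max = 10000
for t in (0, 2500, 5000, 7500, 10000):
    print(t, cosine_annealing(t, 0.001, 0.0, 10000))
```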
Why Cosine Annealing?
Benefits:
- Smooth decay: No abrupt changes
- Gradual reduction: Better fine-tuning
- Works well: Commonly used in practice
- High initial rate: Fast learning
- Low final rate: Precise convergence
8.5 Other Scheduling Strategies
1. Step Decay
Reduce learning rate at fixed intervals:
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
Where:
- \gamma = decay factor (e.g., 0.1)
- s = step size (e.g., every 1000 steps)
Example:
Step 0-999: η = 0.001
Step 1000-1999: η = 0.0001 (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)
Visualization:
Learning Rate
│
0.001 │───────┐
│ │
│ └───────┐
0.0001│ │
│ └───────┐
│ │
└───────────────────────── Steps
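The step-decay formula translates to a one-line helper (an illustrative sketch, using the example values \gamma = 0.1 and s = 1000 from above):

```python
def step_decay(t, eta0, gamma, step_size):
    """Learning rate at step t: multiply by gamma every step_size steps."""
    return eta0 * gamma ** (t // step_size)

# Example values from above: eta0 = 0.001, gamma = 0.1, s = 1000
print(step_decay(500, 0.001, 0.1, 1000))   # within steps 0-999
print(step_decay(1500, 0.001, 0.1, 1000))  # within steps 1000-1999
```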
2. Exponential Decay
Continuous exponential reduction:
\eta_t = \eta_0 \times \gamma^t
Where:
- \gamma = decay rate (e.g., 0.9995)
Visualization:
Learning Rate
│
0.001│●──────────────\
│ \
│ \
│ \
│ \
│ \
│ \
│ \
└──────────────────────── Steps
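Exponential decay is equally compact (an illustrative sketch; with \gamma = 0.9995 the rate roughly halves every ~1386 steps, since 0.9995^1386 ≈ 0.5):

```python
def exp_decay(t, eta0, gamma):
    """Learning rate at step t under continuous exponential decay."""
    return eta0 * gamma ** t

print(exp_decay(0, 0.001, 0.9995))     # initial rate
print(exp_decay(1386, 0.001, 0.9995))  # roughly half the initial rate
```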
3. Warmup Scheduling
Start with low rate, increase, then decrease:
Warmup Phase:
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
After Warmup:
\eta_t = \text{Cosine Annealing or other schedule}
Visualization:
Learning Rate
│
0.001│ ╱───────\
│ ╱ \
│ ╱ \
│ ╱ \
│ ╱ \
│ ╱ \
│╱ \
└───────────────────── Steps
4. One Cycle Learning Rate
One cycle: increase then decrease:
\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}
Visualization:
Learning Rate
│
0.001│ ╱─────\
│ ╱ \
│ ╱ \
│ ╱ \
│ ╱ \
│ ╱ \
│╱ \
└─────────────────── Steps
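The two-phase formula can be transcribed directly (an illustrative sketch; here T_1 is the ramp-up length and T_2 the ramp-down length):

```python
def one_cycle(t, eta_max, eta_min, t1, t2):
    """One-cycle schedule: linear ramp up over t1 steps, then down over t2."""
    if t <= t1:
        return eta_min + (eta_max - eta_min) * t / t1
    return eta_max - (eta_max - eta_min) * (t - t1) / t2
```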
8.6 Why Scheduling Matters
Benefit 1: Better Convergence
Without Scheduling:
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(slow convergence at end)
With Scheduling:
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
(faster convergence, better final loss)
Benefit 2: More Stable Training
Fixed High Rate:
Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
(oscillating, unstable)
Scheduled Rate:
Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
(smooth, stable)
Benefit 3: Better Final Performance
Comparison:
Fixed LR: Final Loss = 2.15
Scheduled LR: Final Loss = 2.05
→ 5% improvement!
8.7 Complete Mathematical Formulation
General Scheduling Formula
\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)
Where f is the scheduling function
Cosine Annealing (Complete)
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
Boundary Conditions:
- At t = 0: \eta_0 = \eta_{max}
- At t = T_{max}: \eta_{T_{max}} = \eta_{min}
Step Decay
\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}
Exponential Decay
\eta_t = \eta_0 \times \gamma^t
Warmup + Cosine Annealing
Warmup Phase (t \leq T_{warmup}):
\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}
Annealing Phase (t > T_{warmup}):
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
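The two phases combine into a single function (an illustrative sketch of the formulation above; note both branches agree at t = T_{warmup}, giving \eta_{max}):

```python
import math

def warmup_cosine(t, eta_max, eta_min, t_warmup, t_max):
    """Linear warmup to eta_max, then cosine annealing down to eta_min."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    progress = (t - t_warmup) / (t_max - t_warmup)  # in [0, 1] after warmup
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * progress)) / 2
```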
8.8 Exercise: Schedule Calculation
Problem
Given Cosine Annealing schedule:
- \eta_{max} = 0.002
- \eta_{min} = 0.0001
- T_{max} = 5000 steps
Calculate the learning rate at:
- Step t = 0
- Step t = 1250
- Step t = 2500
- Step t = 3750
- Step t = 5000
Step-by-Step Solution
General Formula
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}
Substitute values:
\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
Step 1: t = 0
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002
Answer: \eta_0 = 0.002
Step 2: t = 1250
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.707}{2}
\eta_{1250} = 0.0001 + 0.0019 \times 0.8535 \approx 0.0001 + 0.001622 = 0.001722
Answer: \eta_{1250} \approx 0.001722
Step 3: t = 2500
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105
Answer: \eta_{2500} = 0.00105
Step 4: t = 3750
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378
Answer: \eta_{3750} \approx 0.000378
Step 5: t = 5000
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001
Answer: \eta_{5000} = 0.0001
Summary Table
| Step | Cosine Value | Scale Factor | Learning Rate |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.002 |
| 1250 | 0.707 | 0.854 | 0.001722 |
| 2500 | 0.0 | 0.5 | 0.00105 |
| 3750 | -0.707 | 0.146 | 0.000378 |
| 5000 | -1.0 | 0.0 | 0.0001 |
Smooth decay from 0.002 to 0.0001!
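The table can be checked numerically; this sketch simply re-evaluates the formula at each requested step:

```python
import math

def lr_at(t, eta_max=0.002, eta_min=0.0001, t_max=5000):
    """Cosine annealing learning rate for the exercise's parameters."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for t in (0, 1250, 2500, 3750, 5000):
    print(f"step {t:>4}: lr = {lr_at(t):.6f}")
# step    0: lr = 0.002000
# step 1250: lr = 0.001722
# step 2500: lr = 0.001050
# step 3750: lr = 0.000378
# step 5000: lr = 0.000100
```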
8.9 Key Takeaways
Scheduling
✅ Scheduling adjusts learning rate during training
✅ Starts high (fast learning), ends low (fine-tuning)
✅ Improves convergence and final performance
Cosine Annealing
✅ Smooth cosine-based decay
✅ Gradual reduction from max to min
✅ Works well for transformers
Why Important
✅ Faster convergence
✅ More stable training
✅ Better final performance
✅ Essential for optimal training
This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies with mathematical formulations and solved exercises.