What is Scheduling? Step-by-Step Explanation

A step-by-step explanation of learning rate scheduling: how adjusting the learning rate during training improves convergence and final model performance.

Table of Contents

  1. What is Scheduling?
  2. Why Do We Need Scheduling?
  3. Fixed Learning Rate
  4. Cosine Annealing
  5. Other Scheduling Strategies
  6. Why Scheduling Matters
  7. Complete Mathematical Formulation
  8. Exercise: Schedule Calculation
  9. Key Takeaways

8.1 What is Scheduling?

Simple Definition

Scheduling (learning rate scheduling) is the process of adjusting the learning rate during training to improve convergence and final model performance.

Visual Analogy

Think of scheduling like adjusting speed while driving:

Fixed Learning Rate:
    ┌──────────────────────────┐
    │ Speed: 60 mph (constant) │
    └──────────────────────────┘
    → Hard to stop precisely!

Scheduled Learning Rate:
    ┌──────────────────────────┐
    │ Speed: 60 → 40 → 20 → 10 │
    └──────────────────────────┘
    → Smooth deceleration!

Scheduling adjusts speed (learning rate) as you approach the destination (convergence)!

What Scheduling Does

Scheduling:

  1. Starts with higher learning rate (fast learning)
  2. Gradually reduces learning rate (precise fine-tuning)
  3. Converges to optimal solution

Result: Better convergence and performance!


8.2 Why Do We Need Scheduling?

The Problem with Fixed Learning Rate

High Learning Rate:

Learning Rate: 0.001 (constant)
→ Fast initial learning ✓
→ But overshoots minimum ✗
→ Bounces around ✗
→ Poor convergence ✗

Low Learning Rate:

Learning Rate: 0.0001 (constant)
→ Stable convergence ✓
→ But very slow learning ✗
→ Takes forever to converge ✗

Can't have both!

The Solution: Scheduling

Adaptive Learning Rate:

Start: 0.001 (fast learning)
Middle: 0.0005 (moderate)
End: 0.0001 (fine-tuning)
→ Fast initial learning ✓
→ Stable convergence ✓
→ Best of both worlds!

Benefits of Scheduling

1. Faster Convergence

  • High initial rate = Fast progress
  • Lower later rate = Precise convergence

2. Better Final Performance

  • Fine-tuning at end = Better solution
  • Avoids overshooting = More stable

3. More Stable Training

  • Gradual reduction = Smooth optimization
  • Less oscillation = More reliable

8.3 Fixed Learning Rate

What is Fixed Learning Rate?

Learning rate stays constant throughout training:

\eta_t = \eta_0 \quad \text{for all } t

Where:

  • \eta_0 = initial learning rate
  • t = training step

Example

Fixed Rate:

Step 0:    η = 0.001
Step 100:  η = 0.001
Step 1000: η = 0.001
Step 10000: η = 0.001

Constant throughout!

Visualization

Learning Rate
     │
0.001│─────────────────────────────────────
     │
     │
     │
     │
     └───────────────────────────────────── Steps
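
As a concrete sketch (assuming a PyTorch-style training loop, which this document doesn't prescribe), a fixed learning rate is simply an optimizer with no scheduler attached:

```python
import torch

model = torch.nn.Linear(16, 16)  # placeholder model for illustration
# Fixed rate: eta_t = eta_0 = 0.001 for every step; nothing ever changes it.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(3):
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, opt.param_groups[0]["lr"])  # prints 0.001 each time
```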

Problems

1. Too High:

  • Overshoots minimum
  • Oscillates around solution
  • Never converges precisely

2. Too Low:

  • Very slow training
  • Takes forever to converge
  • May get stuck

Solution: Use scheduling!


8.4 Cosine Annealing

What is Cosine Annealing?

Cosine Annealing reduces the learning rate following a cosine curve from maximum to minimum.

Formula

\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}

Where:

  • \eta_t = learning rate at step t
  • \eta_{min} = minimum learning rate (default: 0)
  • \eta_{max} = initial/maximum learning rate
  • T_{max} = total number of steps
  • t = current step

How It Works

Step 1: Calculate Cosine Value

\cos\left(\frac{\pi t}{T_{max}}\right)

Step 2: Shift to [0, 1] Range

\frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}

Step 3: Scale to Learning Rate Range

\eta_{min} + (\eta_{max} - \eta_{min}) \times \text{scale}

Example Calculation

Given:

  • \eta_{max} = 0.001
  • \eta_{min} = 0
  • T_{max} = 10000

At step t = 0:

\eta_0 = 0 + (0.001 - 0) \times \frac{1 + \cos(0)}{2} = 0.001 \times 1 = 0.001

At step t = 2500:

\eta_{2500} = 0 + 0.001 \times \frac{1 + \cos(\pi/4)}{2} = 0.001 \times \frac{1 + 0.707}{2} \approx 0.000854

At step t = 5000:

\eta_{5000} = 0 + 0.001 \times \frac{1 + \cos(\pi/2)}{2} = 0.001 \times \frac{1 + 0}{2} = 0.0005

At step t = 7500:

\eta_{7500} = 0 + 0.001 \times \frac{1 + \cos(3\pi/4)}{2} = 0.001 \times \frac{1 + (-0.707)}{2} \approx 0.000146

At step t = 10000:

\eta_{10000} = 0 + 0.001 \times \frac{1 + \cos(\pi)}{2} = 0.001 \times \frac{1 + (-1)}{2} = 0
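
A minimal pure-Python implementation of the formula (the function name cosine_lr is my own, not from the repo) reproduces these five values:

```python
import math

def cosine_lr(t: int, eta_max: float, eta_min: float, t_max: int) -> float:
    """Cosine annealing: eta_min + (eta_max - eta_min) * (1 + cos(pi*t/t_max)) / 2."""
    scale = (1 + math.cos(math.pi * t / t_max)) / 2  # Steps 1-2: cosine shifted into [0, 1]
    return eta_min + (eta_max - eta_min) * scale     # Step 3: scale to the lr range

for t in (0, 2500, 5000, 7500, 10000):
    print(t, round(cosine_lr(t, eta_max=0.001, eta_min=0.0, t_max=10000), 6))
# 0.001, 0.000854, 0.0005, 0.000146, 0.0
```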

Visualization

Learning Rate
      │
0.001 │●───────────────\
      │                 \
      │                  \
0.0005│                   \
      │                    \
      │                     \
      │                      \
      │                       \
      │                        \
      │                         \
     0│                          ●─────
      └───────────────────────────────────── Steps
        0    2500  5000  7500  10000

Smooth cosine curve!

Why Cosine Annealing?

Benefits:

  1. Smooth decay: No abrupt changes
  2. Gradual reduction: Better fine-tuning
  3. Works well: Commonly used in practice
  4. High initial rate: Fast learning
  5. Low final rate: Precise convergence

8.5 Other Scheduling Strategies

1. Step Decay

Reduce learning rate at fixed intervals:

\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}

Where:

  • \gamma = decay factor (e.g., 0.1)
  • s = step size (e.g., every 1000 steps)

Example:

Step 0-999:    η = 0.001
Step 1000-1999: η = 0.0001  (×0.1)
Step 2000-2999: η = 0.00001 (×0.1)

Visualization:

Learning Rate
      │
0.001 │───────┐
      │       │
      │       └───────┐
0.0001│               │
      │               └───────┐
      │                       │
      └───────────────────────── Steps
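
A one-line sketch of this rule (the helper name step_decay_lr is hypothetical):

```python
def step_decay_lr(t: int, eta0: float, gamma: float, s: int) -> float:
    """Step decay: eta0 * gamma ** floor(t / s)."""
    return eta0 * gamma ** (t // s)

for t in (0, 999, 1000, 1999, 2000):
    print(t, step_decay_lr(t, eta0=0.001, gamma=0.1, s=1000))
# 0.001 for steps 0-999, 0.0001 for 1000-1999, then ~1e-05
# (floating point may show tiny rounding in the printed values)
```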

2. Exponential Decay

Continuous exponential reduction:

\eta_t = \eta_0 \times \gamma^t

Where:

  • \gamma = decay rate (e.g., 0.9995)

Visualization:

Learning Rate
     │
0.001│●──────────────\
     │                \
     │                 \
     │                  \
     │                   \
     │                    \
     │                     \
     │                      \
     └──────────────────────── Steps
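
And the exponential version (again a hypothetical helper name):

```python
def exp_decay_lr(t: int, eta0: float, gamma: float) -> float:
    """Exponential decay: eta0 * gamma ** t."""
    return eta0 * gamma ** t

for t in (0, 1000, 5000):
    print(t, round(exp_decay_lr(t, eta0=0.001, gamma=0.9995), 7))
# 0 -> 0.001, 1000 -> ~0.0006065, 5000 -> ~0.000082
```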

3. Warmup Scheduling

Start with low rate, increase, then decrease:

Warmup Phase:

\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}

After Warmup:

\eta_t = \text{Cosine Annealing or other schedule}

Visualization:

Learning Rate
     │
0.001│      ╱───────\
     │              \
     │               \
     │                \
     │                 \
     │                  \
     │╱                   \
     └───────────────────── Steps
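
A minimal sketch combining the two phases (the function name warmup_cosine_lr is my own; the formulas match section 8.7 below):

```python
import math

def warmup_cosine_lr(t: int, eta_max: float, eta_min: float,
                     t_warmup: int, t_max: int) -> float:
    if t <= t_warmup:
        # Linear warmup: 0 -> eta_max over t_warmup steps.
        return eta_max * t / t_warmup
    # Cosine annealing over the remaining t_max - t_warmup steps.
    progress = (t - t_warmup) / (t_max - t_warmup)
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * progress)) / 2

for t in (0, 500, 1000, 5500, 10000):
    print(t, round(warmup_cosine_lr(t, 0.001, 0.0, 1000, 10000), 6))
# 0.0, 0.0005, 0.001, 0.0005, 0.0
```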

4. One Cycle Learning Rate

The learning rate rises linearly to a peak over T_1 steps, then falls linearly back down over T_2 steps:

\eta_t = \begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{t}{T_1} & t \leq T_1 \\
\eta_{max} - (\eta_{max} - \eta_{min}) \times \frac{t - T_1}{T_2} & t > T_1
\end{cases}

Where:

  • T_1 = length of the ramp-up phase (in steps)
  • T_2 = length of the ramp-down phase (training ends at t = T_1 + T_2)

Visualization:

Learning Rate
     │
0.001│      ╱─────\
     │            \
     │             \
     │              \
     │               \
     │                \
     │╱                 \
     └─────────────────── Steps
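
If the training loop is in PyTorch (a reasonable guess for this repo, but an assumption on my part), each strategy above has a built-in counterpart in torch.optim.lr_scheduler:

```python
import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(8, 8)  # placeholder model for illustration
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Constructed side by side only for illustration; real code attaches ONE:
step_sched = lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)            # step decay
exp_sched  = lr_scheduler.ExponentialLR(opt, gamma=0.9995)                  # exponential decay
cos_sched  = lr_scheduler.CosineAnnealingLR(opt, T_max=10000, eta_min=0.0)  # cosine annealing
one_cycle  = lr_scheduler.OneCycleLR(opt, max_lr=1e-3, total_steps=10000)   # one-cycle

# The chosen scheduler is stepped once per training step:
#   loss.backward(); opt.step(); scheduler.step()
```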

8.6 Why Scheduling Matters

Benefit 1: Better Convergence

Without Scheduling:

Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
      (slow convergence at end)

With Scheduling:

Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.1 → 2.05 → ...
      (faster convergence, better final loss)

Benefit 2: More Stable Training

Fixed High Rate:

Loss: 3.0 → 2.5 → 2.3 → 2.4 → 2.3 → 2.4 → ...
      (oscillating, unstable)

Scheduled Rate:

Loss: 3.0 → 2.5 → 2.3 → 2.2 → 2.15 → 2.12 → ...
      (smooth, stable)

Benefit 3: Better Final Performance

Comparison:

Fixed LR:      Final Loss = 2.15
Scheduled LR:  Final Loss = 2.05

→ ≈5% improvement in final loss!

8.7 Complete Mathematical Formulation

General Scheduling Formula

\eta_t = f(t, \eta_0, \eta_{min}, T_{max}, ...)

Where f is the scheduling function

Cosine Annealing (Complete)

\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}

Boundary Conditions:

  • At t = 0: \eta_0 = \eta_{max}
  • At t = T_{max}: \eta_{T_{max}} = \eta_{min}

Step Decay

\eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}

Exponential Decay

\eta_t = \eta_0 \times \gamma^t

Warmup + Cosine Annealing

Warmup Phase (t \leq T_{warmup}):

\eta_t = \eta_{max} \times \frac{t}{T_{warmup}}

Annealing Phase (t > T_{warmup}):

\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi (t - T_{warmup})}{T_{max} - T_{warmup}}\right)}{2}
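
In PyTorch terms (same assumption as above), this two-phase schedule can be composed from built-ins via SequentialLR:

```python
import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(8, 8)  # placeholder
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr plays the role of eta_max

T_warmup, T_max = 1000, 10000
# start_factor must be > 0, so this warmup starts near (not exactly at) zero.
warmup = lr_scheduler.LinearLR(opt, start_factor=1e-3, end_factor=1.0,
                               total_iters=T_warmup)
anneal = lr_scheduler.CosineAnnealingLR(opt, T_max=T_max - T_warmup, eta_min=0.0)
sched = lr_scheduler.SequentialLR(opt, schedulers=[warmup, anneal],
                                  milestones=[T_warmup])

for step in range(T_max):
    # forward / backward / opt.step() would go here
    sched.step()
```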

8.8 Exercise: Schedule Calculation

Problem

Given Cosine Annealing schedule:

  • \eta_{max} = 0.002
  • \eta_{min} = 0.0001
  • T_{max} = 5000 steps

Calculate the learning rate at:

  1. Step t = 0
  2. Step t = 1250
  3. Step t = 2500
  4. Step t = 3750
  5. Step t = 5000

Step-by-Step Solution

General Formula

\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \times \frac{1 + \cos\left(\frac{\pi t}{T_{max}}\right)}{2}

Substitute values:

\eta_t = 0.0001 + (0.002 - 0.0001) \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}
\eta_t = 0.0001 + 0.0019 \times \frac{1 + \cos\left(\frac{\pi t}{5000}\right)}{2}

Step 1: t = 0

\eta_0 = 0.0001 + 0.0019 \times \frac{1 + \cos(0)}{2}
\eta_0 = 0.0001 + 0.0019 \times \frac{1 + 1}{2}
\eta_0 = 0.0001 + 0.0019 \times 1 = 0.0001 + 0.0019 = 0.002

Answer: \eta_0 = 0.002

Step 2: t = 1250

\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/4)}{2}
\eta_{1250} = 0.0001 + 0.0019 \times \frac{1 + 0.7071}{2}
\eta_{1250} = 0.0001 + 0.0019 \times 0.8536 = 0.0001 + 0.001622 = 0.001722

Answer: \eta_{1250} \approx 0.001722

Step 3: t = 2500

\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi/2)}{2}
\eta_{2500} = 0.0001 + 0.0019 \times \frac{1 + 0}{2}
\eta_{2500} = 0.0001 + 0.0019 \times 0.5 = 0.0001 + 0.00095 = 0.00105

Answer: \eta_{2500} = 0.00105

Step 4: t = 3750

\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + \cos(3\pi/4)}{2}
\eta_{3750} = 0.0001 + 0.0019 \times \frac{1 + (-0.707)}{2}
\eta_{3750} = 0.0001 + 0.0019 \times 0.1465 = 0.0001 + 0.000278 = 0.000378

Answer: \eta_{3750} \approx 0.000378

Step 5: t = 5000

\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + \cos(\pi)}{2}
\eta_{5000} = 0.0001 + 0.0019 \times \frac{1 + (-1)}{2}
\eta_{5000} = 0.0001 + 0.0019 \times 0 = 0.0001 + 0 = 0.0001

Answer: \eta_{5000} = 0.0001

Summary Table

| Step | Cosine Value | Scale Factor | Learning Rate |
|------|--------------|--------------|---------------|
| 0    | 1.0          | 1.0          | 0.002         |
| 1250 | 0.707        | 0.854        | 0.001722      |
| 2500 | 0.0          | 0.5          | 0.00105       |
| 3750 | -0.707       | 0.146        | 0.000378      |
| 5000 | -1.0         | 0.0          | 0.0001        |

Smooth decay from 0.002 to 0.0001!
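
A quick Python check of the table (formula inlined; values match up to rounding):

```python
import math

eta_max, eta_min, t_max = 0.002, 0.0001, 5000
for t in (0, 1250, 2500, 3750, 5000):
    scale = (1 + math.cos(math.pi * t / t_max)) / 2
    print(t, round(eta_min + (eta_max - eta_min) * scale, 6))
# 0.002, 0.001722, 0.00105, 0.000378, 0.0001
```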


8.9 Key Takeaways

Scheduling

  • Scheduling adjusts the learning rate during training
  • Starts high (fast learning), ends low (fine-tuning)
  • Improves convergence and final performance

Cosine Annealing

  • Smooth cosine-based decay
  • Gradual reduction from max to min
  • Works well for transformers

Why Important

  • Faster convergence
  • More stable training
  • Better final performance
  • Essential for optimal training


This document provides a comprehensive explanation of learning rate scheduling, including cosine annealing and other strategies with mathematical formulations and solved exercises.