What is Optimization? Step-by-Step Explanation
Complete step-by-step explanation of optimization in neural networks: how optimizers update weights to minimize loss.
Table of Contents
- What is Optimization?
- The Optimization Problem
- Gradient Descent
- AdamW Optimizer
- Why Optimization Matters
- Complete Mathematical Formulation
- Exercise: Optimizer Step-by-Step
- Key Takeaways
7.1 What is Optimization?
Simple Definition
Optimization is the process of finding the best set of weights (parameters) that minimize the loss function and make the model's predictions as accurate as possible.
Visual Analogy
Think of optimization like finding the lowest point in a valley:
```text
Loss Landscape:

High Loss
    │  ● (current position)
    │   ╲
    │    ╲
    │     ╲
    │      ╲
    │       ▼ (goal)
Low Loss ───┴──────────────
```
- Goal: find the bottom of the valley (minimum loss)
- Optimizer: your guide down the mountain
What Optimization Does
Optimization:
- Measures how wrong the model is (loss)
- Calculates direction to improve (gradients)
- Updates weights to reduce loss
- Repeats until convergence
Result: Model learns to make accurate predictions!
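This loop maps directly onto a few lines of framework code. Here is a minimal sketch in PyTorch; the tiny model, random data, and hyperparameters are placeholder assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Placeholder model and data (assumptions, not from this repo).
model = nn.Linear(10, 2)                       # tiny model with learnable weights θ
x = torch.randn(32, 10)                        # fake inputs
y = torch.randint(0, 2, (32,))                 # fake targets
loss_fn = nn.CrossEntropyLoss()                # ℓ(y, f(x; θ))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):                        # repeat until convergence
    logits = model(x)                          # forward pass: f(x; θ)
    loss = loss_fn(logits, y)                  # 1. measure how wrong (loss)
    optimizer.zero_grad()                      # clear old gradients
    loss.backward()                            # 2. direction to improve (gradients)
    optimizer.step()                           # 3. update weights to reduce loss
```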
Optimization Process Flow
```mermaid
graph TB
Start[Training Start] --> Init[Initialize Weights<br/>Random Values]
Init --> Loop[Training Loop]
Loop --> Forward[Forward Pass<br/>Model Prediction]
Forward --> Loss["Compute Loss<br/>L = loss(pred, target)"]
Loss --> Check{Converged?}
Check -->|Yes| End[Training Complete]
Check -->|No| Gradient["Compute Gradients<br/>∇L = ∂L/∂θ"]
Gradient --> Optimize[Optimizer<br/>Update Weights]
Optimize --> Update["New Weights<br/>θ = θ - update"]
Update --> Loop
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Optimize fill:#fff4e1
style Check fill:#ffe1f5
```
7.2 The Optimization Problem
The Objective
We want to minimize:
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i; \theta))
Where:
- \theta = all model parameters (weights)
- L = total loss
- \ell = loss function (e.g., cross-entropy)
- y_i = correct answer
- f(x_i; \theta) = model prediction
- N = number of examples
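In code, the objective is simply the mean of the per-example losses. A small NumPy sketch with made-up probabilities (illustrative values, not real model output):

```python
import numpy as np

# Hypothetical model probabilities f(x_i; θ) for N = 4 examples, 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.25, 0.5, 0.25]])
targets = np.array([0, 1, 2, 1])                 # correct answers y_i

# Cross-entropy per example: ℓ_i = -log p(correct class)
per_example = -np.log(probs[np.arange(len(targets)), targets])
L = per_example.mean()                           # L(θ) = (1/N) Σ ℓ_i
print(L)                                         # ≈ 0.547
```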
The Challenge
Problem: Loss function is complex and high-dimensional
Solution: Use iterative optimization algorithms
Process:
```text
Initialize weights randomly
Repeat:
    1. Compute loss
    2. Compute gradients
    3. Update weights
Until convergence
```
Optimization Problem Flowchart
```mermaid
graph LR
subgraph "Optimization Problem"
A["Loss Function<br/>L(θ)"] --> B["Find Minimum<br/>min L(θ)"]
B --> C["Optimal Weights<br/>θ*"]
end
subgraph "Solution Approach"
D["Initialize θ"] --> E[Iterative Updates]
E --> F[Compute Loss]
F --> G[Compute Gradient]
G --> H[Update Weights]
H --> I{Converged?}
I -->|No| E
I -->|Yes| C
end
A -.-> F
style A fill:#ffcccc
style B fill:#ffffcc
style C fill:#ccffcc
style E fill:#cce5ff
```
7.3 Gradient Descent
What is Gradient Descent?
Gradient Descent is a basic optimization algorithm that updates weights by moving in the direction of steepest descent.
How It Works
Step 1: Compute Gradient
\nabla_\theta L = \frac{\partial L}{\partial \theta}
Gradient tells us:
- Direction: Which way to go
- Magnitude: How steep the slope
Step 2: Update Weights
\theta_{t+1} = \theta_t - \eta \nabla_\theta L
Where:
- \theta_t = current weights
- \eta = learning rate (step size)
- \nabla_\theta L = gradient
Meaning: Move weights in direction opposite to gradient
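For intuition, here is the update rule applied to the toy loss L(\theta) = (\theta - 1)^2, the same function used in the exercise in section 7.7 (starting weight and learning rate are arbitrary choices):

```python
theta, eta = 2.0, 0.1           # starting weight and learning rate

for t in range(25):
    grad = 2 * (theta - 1)      # ∇L for L(θ) = (θ - 1)²
    theta = theta - eta * grad  # move opposite to the gradient
print(theta)                    # ≈ 1.004, approaching the minimum at θ = 1.0
```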
Visual Example
```text
Loss Landscape (2D):

        gradient
        direction
            ↓
   ● ───────┼──────→  lower loss
            │
```
Move in direction of negative gradient!
Gradient Descent Flowchart
```mermaid
graph TB
subgraph "Gradient Descent Algorithm"
Start["Start: Initialize θ₀"] --> Loop["For each iteration t"]
Loop --> Forward[Forward Pass<br/>Compute Predictions]
Forward --> Loss["Compute Loss<br/>L(θₜ)"]
Loss --> Grad["Compute Gradient<br/>g = ∇L(θₜ)"]
Grad --> Direction["Determine Direction<br/>-g points downhill"]
Direction --> Step["Take Step<br/>η × g"]
Step --> Update["Update Weights<br/>θₜ₊₁ = θₜ - ηg"]
Update --> Check{"Converged?<br/>|g| < ε"}
Check -->|No| Loop
Check -->|Yes| End["Found Minimum<br/>θ*"]
end
subgraph "Gradient Information"
GradInfo["Gradient g contains:<br/>- Direction: Which way to go<br/>- Magnitude: How steep"]
end
Grad -.-> GradInfo
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Grad fill:#fff4e1
style Check fill:#ffe1f5
style Update fill:#ccffcc
```
Types of Gradient Descent
1. Batch Gradient Descent:
- Uses all training examples
- Most accurate gradients
- Slow for large datasets
2. Stochastic Gradient Descent (SGD):
- Uses one example at a time
- Fast but noisy
- Can bounce around
3. Mini-Batch Gradient Descent:
- Uses small batch of examples
- Balance of speed and accuracy
- Most commonly used
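The three variants differ only in how many examples feed each gradient estimate. A NumPy sketch on synthetic data (shapes, learning rate, and the squared-error loss are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy inputs
y = rng.normal(size=1000)                      # toy targets

def gradient(theta, Xb, yb):
    # Gradient of mean squared error over the given batch.
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta, eta = np.zeros(5), 0.01

# Batch GD: gradient over ALL 1000 examples -- one accurate but slow update.
theta -= eta * gradient(theta, X, y)

# SGD: a single example -- fast but noisy.
i = rng.integers(len(y))
theta -= eta * gradient(theta, X[i:i+1], y[i:i+1])

# Mini-batch GD: a small random slice -- the usual compromise.
idx = rng.choice(len(y), size=64, replace=False)
theta -= eta * gradient(theta, X[idx], y[idx])
```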
Gradient Descent Types Comparison
```mermaid
graph TB
subgraph "Batch Gradient Descent"
B1[All Training Data] --> B2[Compute Gradient<br/>on Full Dataset]
B2 --> B3[Single Update<br/>Most Accurate]
B3 --> B4["Slow: O(N)"]
end
subgraph "Stochastic Gradient Descent"
S1[Single Example] --> S2[Compute Gradient<br/>on One Sample]
S2 --> S3[Many Updates<br/>Fast but Noisy]
S3 --> S4["Fast: O(1)"]
end
subgraph "Mini-Batch Gradient Descent"
M1[Small Batch<br/>32-256 samples] --> M2[Compute Gradient<br/>on Batch]
M2 --> M3[Balanced Updates<br/>Good Accuracy]
M3 --> M4["Fast: O(batch_size)"]
end
style B3 fill:#ccffcc
style S3 fill:#ffcccc
style M3 fill:#fff4e1
```
7.4 AdamW Optimizer
What is AdamW?
AdamW (Adam with decoupled Weight Decay) is an advanced optimizer that combines:
- Adaptive learning rates (like Adam)
- Weight decay (regularization)
Why AdamW?
- Per-parameter learning rates
- Handles sparse gradients well
- Works great for transformers
How AdamW Works
Step 1: Compute Gradient
g_t = \nabla_\theta L(\theta_t)
Step 2: Update Momentum
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
Where:
- \beta_1 = 0.9 (momentum decay)
- m_t = first moment estimate
Meaning: Moving average of gradients
Step 3: Update Variance
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Where:
- \beta_2 = 0.999 (variance decay)
- v_t = second moment estimate
Meaning: Moving average of squared gradients
Step 4: Bias Correction
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Why? Corrects bias in early iterations
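Concretely, at t = 1 with m_0 = 0, the raw momentum is only one tenth of the gradient, and the correction restores its scale:
m_1 = 0.9 \times 0 + (1 - 0.9) g_1 = 0.1 g_1
\hat{m}_1 = \frac{0.1 g_1}{1 - 0.9^1} = g_1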
Step 5: Update Weights
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
Where:
- \eta = learning rate
- \epsilon = 10^{-8} (small constant)
- \lambda = weight decay coefficient
Key Points:
- \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} = adaptive learning rate per parameter
- \lambda \theta_t = weight decay (regularization)
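The five steps translate almost line for line into code. A minimal NumPy sketch of a single AdamW update (an illustration of the formulas above, not a production optimizer):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    """One AdamW update for a parameter array; returns the new state."""
    m = beta1 * m + (1 - beta1) * grad                   # Step 2: momentum
    v = beta2 * v + (1 - beta2) * grad**2                # Step 3: variance
    m_hat = m / (1 - beta1**t)                           # Step 4: bias correction
    v_hat = v / (1 - beta2**t)
    step = m_hat / (np.sqrt(v_hat) + eps) + lam * theta  # adaptive step + decay
    return theta - eta * step, m, v                      # Step 5: weight update

# Usage: t starts at 1 so the bias correction is well-defined.
theta = np.array([2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
grad = 2 * (theta - 1)                                   # Step 1: gradient of (θ-1)²
theta, m, v = adamw_step(theta, grad, m, v, t=1, eta=0.1)
print(theta)                                             # [1.898], as in section 7.7
```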
AdamW Optimizer Flowchart
```mermaid
graph TB
subgraph "AdamW Optimization Process"
Start["Start: Initialize<br/>θ₀, m₀=0, v₀=0"] --> Loop["For each iteration t"]
Loop --> Forward["Forward Pass<br/>Compute Loss L(θₜ)"]
Forward --> Grad["Step 1: Compute Gradient<br/>gₜ = ∇L(θₜ)"]
Grad --> Mom["Step 2: Update Momentum<br/>mₜ = β₁mₜ₋₁ + (1-β₁)gₜ"]
Mom --> Var["Step 3: Update Variance<br/>vₜ = β₂vₜ₋₁ + (1-β₂)gₜ²"]
Var --> Bias["Step 4: Bias Correction<br/>m̂ₜ = mₜ/(1-β₁ᵗ)<br/>v̂ₜ = vₜ/(1-β₂ᵗ)"]
Bias --> Adapt["Step 5: Adaptive LR<br/>LR = η/(√v̂ₜ + ε)"]
Adapt --> Decay["Step 6: Weight Decay<br/>ηλθₜ"]
Decay --> Update["Step 7: Update Weights<br/>θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
Update --> Check{Converged?}
Check -->|No| Loop
Check -->|Yes| End["Optimal Weights θ*"]
end
subgraph "Key Components"
C1["Momentum mₜ<br/>Moving avg of gradients"]
C2["Variance vₜ<br/>Moving avg of g²"]
C3["Adaptive LR<br/>Per-parameter learning rate"]
C4["Weight Decay<br/>Regularization"]
end
Mom -.-> C1
Var -.-> C2
Adapt -.-> C3
Decay -.-> C4
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Grad fill:#fff4e1
style Adapt fill:#ccffcc
style Update fill:#ccffcc
style Check fill:#ffe1f5
```
AdamW Update: Detailed Data Flow
```mermaid
graph LR
subgraph "Input"
I1["Gradient gₜ"]
I2["Previous Momentum mₜ₋₁"]
I3["Previous Variance vₜ₋₁"]
I4["Current Weights θₜ"]
end
subgraph "Momentum Update"
M1["Multiply: β₁mₜ₋₁"] --> M2["Combine: β₁mₜ₋₁ + (1-β₁)gₜ"]
I2 --> M1
I1 --> M2
end
subgraph "Variance Update"
V1["Square: gₜ²"] --> V2["Combine: β₂vₜ₋₁ + (1-β₂)gₜ²"]
I3 --> V2
I1 --> V1
end
subgraph "Bias Correction"
M2 --> BC1["m̂ₜ = mₜ/(1-β₁ᵗ)"]
V2 --> BC2["v̂ₜ = vₜ/(1-β₂ᵗ)"]
end
subgraph "Adaptive Learning Rate"
BC2 --> ALR["LR = η/(√v̂ₜ + ε)"]
end
subgraph "Weight Update"
BC1 --> WU1["Adaptive Step: LR × m̂ₜ"]
ALR --> WU1
I4 --> WU2["Decay Step: ηλθₜ"]
WU1 --> WU3["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
WU2 --> WU3
end
style M2 fill:#e1f5ff
style V2 fill:#e1f5ff
style BC1 fill:#fff4e1
style BC2 fill:#fff4e1
style ALR fill:#ccffcc
style WU3 fill:#ccffcc
```
Why AdamW is Better
Compared to SGD:
```text
SGD:
  same learning rate for all parameters
  → slow convergence
  → manual tuning needed

AdamW:
  adaptive learning rate per parameter
  → faster convergence
  → less manual tuning
```
Benefits:
- Adaptive: Each parameter gets its own learning rate
- Robust: Works well with noisy gradients
- Efficient: Converges faster than SGD
- Regularized: Weight decay prevents overfitting
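The adaptive scaling is easy to see numerically. With two parameters whose gradients differ by four orders of magnitude, SGD's steps differ by the same factor, while a bias-corrected Adam-style step is roughly the same size for both (toy numbers; weight decay omitted for clarity):

```python
import numpy as np

grad = np.array([100.0, 0.01])   # two parameters, very different gradient scales
eta = 0.1

sgd_step = eta * grad            # SGD: [10.0, 0.001] -- step ∝ gradient

# First Adam-style step from zero state: after bias correction,
# m̂₁ = g₁ and v̂₁ = g₁², so the step is η·g/|g| ≈ η for every parameter.
m_hat, v_hat = grad, grad**2
adam_step = eta * m_hat / (np.sqrt(v_hat) + 1e-8)   # ≈ [0.1, 0.1]
print(sgd_step, adam_step)
```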
SGD vs AdamW Comparison
```mermaid
graph TB
subgraph "Stochastic Gradient Descent"
SGD1["Gradient gₜ"] --> SGD2["Fixed Learning Rate η"]
SGD2 --> SGD3["Update: θₜ₊₁ = θₜ - ηgₜ"]
SGD3 --> SGD4["All params same LR"]
SGD4 --> SGD5["Slow Convergence<br/>Manual Tuning"]
end
subgraph "AdamW Optimizer"
AD1["Gradient gₜ"] --> AD2["Momentum mₜ"]
AD1 --> AD3["Variance vₜ"]
AD2 --> AD4[Bias Correction]
AD3 --> AD4
AD4 --> AD5["Adaptive LR per param"]
AD5 --> AD6["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
AD6 --> AD7["Fast Convergence<br/>Less Tuning"]
end
subgraph "Comparison"
Comp1["Same Model<br/>Same Data"]
Comp1 --> Comp2["SGD: Loss = 2.5<br/>After 100 epochs"]
Comp1 --> Comp3["AdamW: Loss = 1.8<br/>After 100 epochs"]
Comp3 --> Comp4[AdamW is Better!]
end
SGD5 -.-> Comp2
AD7 -.-> Comp3
style SGD5 fill:#ffcccc
style AD7 fill:#ccffcc
style Comp4 fill:#e1ffe1
```
7.5 Why Optimization Matters
Reason 1: Without Optimization
Random weights:
```text
Weights:     random values
Loss:        very high
Predictions: random
Model:       useless
```
Reason 2: With Optimization
Learned weights:
```text
Weights:     optimized values
Loss:        low
Predictions: accurate
Model:       useful
```
Reason 3: Determines Learning Speed
Good optimizer:
- Fast convergence
- Stable training
- Good final performance
Poor optimizer:
- Slow convergence
- Unstable training
- Poor final performance
Reason 4: Affects Final Performance
Same model, different optimizers:
```text
SGD:   loss = 2.5 (after 100 epochs)
AdamW: loss = 1.8 (after 100 epochs)
```
Better optimizer = Better model!
Optimization Impact Visualization
```mermaid
graph LR
subgraph "Without Optimization"
WO1[Random Weights] --> WO2["High Loss<br/>L ≈ 8-10"]
WO2 --> WO3[Random Predictions]
WO3 --> WO4[Model Useless]
end
subgraph "With Optimization"
W1[Random Weights] --> W2[Optimization Loop]
W2 --> W3[Update Weights]
W3 --> W4["Low Loss<br/>L ≈ 1-2"]
W4 --> W5[Accurate Predictions]
W5 --> W6[Model Useful]
end
subgraph "Optimizer Quality"
O1["Poor Optimizer<br/>SGD, Bad LR"] --> O2["Slow Convergence<br/>Loss = 2.5"]
O3["Good Optimizer<br/>AdamW, Proper LR"] --> O4["Fast Convergence<br/>Loss = 1.8"]
end
W2 -.-> O1
W2 -.-> O3
style WO4 fill:#ffcccc
style W6 fill:#ccffcc
style O2 fill:#ffcccc
style O4 fill:#ccffcc
```
7.6 Complete Mathematical Formulation
Optimization Problem
\theta^* = \arg\min_{\theta} L(\theta)
Where \theta^* is the optimal set of weights
Gradient Descent Update
\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)
AdamW Update (Complete)
For each parameter \theta_i:
Gradient:
g_{t,i} = \frac{\partial L}{\partial \theta_{t,i}}
Momentum:
m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i}
Variance:
v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2
Bias Correction:
\hat{m}_{t,i} = \frac{m_{t,i}}{1 - \beta_1^t}
\hat{v}_{t,i} = \frac{v_{t,i}}{1 - \beta_2^t}
Update:
\theta_{t+1,i} = \theta_{t,i} - \eta \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}} + \epsilon} + \lambda \theta_{t,i} \right)
Where:
- \beta_1 = 0.9
- \beta_2 = 0.999
- \epsilon = 10^{-8}
- \lambda = weight decay (e.g., 0.01)
Complete Mathematical Flow
```mermaid
graph TB
subgraph "Optimization Problem"
OP1["Loss Function L(θ)"] --> OP2["Find: θ* = argmin L(θ)"]
end
subgraph "Gradient Computation"
GC1[Forward Pass] --> GC2[Compute Loss L]
GC2 --> GC3[Backpropagation]
GC3 --> GC4["Gradient gᵢ = ∂L/∂θᵢ"]
end
subgraph "AdamW Update Steps"
GC4 --> AU1["Momentum: mᵢ = β₁m + (1-β₁)gᵢ"]
AU1 --> AU2["Variance: vᵢ = β₂v + (1-β₂)gᵢ²"]
AU2 --> AU3["Bias Correction:<br/>m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)"]
AU3 --> AU4["Adaptive LR: η/(√v̂ + ε)"]
AU4 --> AU5["Update: θᵢ = θᵢ - LR×m̂ - ηλθᵢ"]
end
subgraph "Convergence Check"
AU5 --> CC1{Converged?}
CC1 -->|No| GC1
CC1 -->|Yes| CC2["Optimal Weights θ*"]
end
OP2 -.-> GC1
CC2 -.-> OP2
style OP2 fill:#ffffcc
style GC4 fill:#fff4e1
style AU5 fill:#ccffcc
style CC2 fill:#e1ffe1
```
7.7 Exercise: Optimizer Step-by-Step
Problem
Given:
- Current weight: \theta_0 = 2.0
- Loss function: L(\theta) = (\theta - 1)^2
- Learning rate: \eta = 0.1
- Optimizer: AdamW with \beta_1 = 0.9, \beta_2 = 0.999, \lambda = 0.01
- Initial moments: m_0 = 0, v_0 = 0
Calculate the weight update for step 1.
Step-by-Step Solution
Step 1: Compute Gradient
Loss function:
L(\theta) = (\theta - 1)^2
Gradient:
g_1 = \frac{\partial L}{\partial \theta} = 2(\theta - 1)
At \theta_0 = 2.0:
g_1 = 2(2.0 - 1) = 2(1.0) = 2.0
Step 2: Update Momentum
m_1 = \beta_1 m_0 + (1 - \beta_1) g_1
m_1 = 0.9 \times 0 + (1 - 0.9) \times 2.0 = 0 + 0.1 \times 2.0 = 0.2
Step 3: Update Variance
v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2
v_1 = 0.999 \times 0 + (1 - 0.999) \times (2.0)^2 = 0 + 0.001 \times 4.0 = 0.004
Step 4: Bias Correction
\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.2}{1 - 0.9} = \frac{0.2}{0.1} = 2.0
\hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.004}{1 - 0.999} = \frac{0.004}{0.001} = 4.0
Step 5: Compute Update
\Delta \theta_1 = \eta \left( \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} + \lambda \theta_0 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{\sqrt{4.0} + 10^{-8}} + 0.01 \times 2.0 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0 + 0.00000001} + 0.02 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0} + 0.02 \right) = 0.1 (1.0 + 0.02) = 0.1 \times 1.02 = 0.102
Step 6: Update Weight
\theta_1 = \theta_0 - \Delta \theta_1 = 2.0 - 0.102 = 1.898
Answer
After one step:
- Old weight: \theta_0 = 2.0
- New weight: \theta_1 = 1.898
- Change: \theta_1 - \theta_0 = -0.102
The weight moved closer to the optimal value (1.0)!
Verification
Check loss:
- Old loss: L(2.0) = (2.0 - 1)^2 = 1.0
- New loss: L(1.898) = (1.898 - 1)^2 \approx 0.806
Loss decreased! ✓
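The entire hand calculation can be checked with a few lines of Python (the same arithmetic as the `adamw_step` sketch in section 7.4):

```python
theta, eta = 2.0, 0.1
b1, b2, eps, lam = 0.9, 0.999, 1e-8, 0.01
m = v = 0.0

g = 2 * (theta - 1)                      # Step 1: g₁ = 2.0
m = b1 * m + (1 - b1) * g                # Step 2: m₁ = 0.2
v = b2 * v + (1 - b2) * g**2             # Step 3: v₁ = 0.004
m_hat = m / (1 - b1**1)                  # Step 4: m̂₁ = 2.0
v_hat = v / (1 - b2**1)                  #         v̂₁ = 4.0
delta = eta * (m_hat / (v_hat**0.5 + eps) + lam * theta)  # Step 5: ≈ 0.102
theta -= delta                           # Step 6: θ₁ ≈ 1.898

print(theta, (theta - 1)**2)             # ≈ 1.898, loss ≈ 0.806
```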
Exercise Solution Flowchart
```mermaid
graph TB
subgraph "Given Values"
G1["θ₀ = 2.0"] --> Start
G2["m₀ = 0"] --> Start
G3["v₀ = 0"] --> Start
G4["η = 0.1, β₁ = 0.9<br/>β₂ = 0.999, λ = 0.01"] --> Start
end
Start[Start] --> Step1["Step 1: Compute Gradient<br/>L(θ) = (θ-1)²<br/>g₁ = 2(θ₀-1) = 2.0"]
Step1 --> Step2["Step 2: Update Momentum<br/>m₁ = 0.9×0 + 0.1×2.0<br/>m₁ = 0.2"]
Step2 --> Step3["Step 3: Update Variance<br/>v₁ = 0.999×0 + 0.001×4.0<br/>v₁ = 0.004"]
Step3 --> Step4["Step 4: Bias Correction<br/>m̂₁ = 0.2/(1-0.9) = 2.0<br/>v̂₁ = 0.004/(1-0.999) = 4.0"]
Step4 --> Step5["Step 5: Compute Update<br/>Δθ₁ = 0.1×(2.0/√4.0 + 0.01×2.0)<br/>Δθ₁ = 0.102"]
Step5 --> Step6["Step 6: Update Weight<br/>θ₁ = 2.0 - 0.102<br/>θ₁ = 1.898"]
Step6 --> Verify["Verification:<br/>L(2.0) = 1.0 → L(1.898) = 0.806<br/>Loss Decreased!"]
Verify --> End["Result: θ₁ = 1.898<br/>Closer to optimum θ* = 1.0"]
style Start fill:#e1f5ff
style Step1 fill:#fff4e1
style Step2 fill:#fff4e1
style Step3 fill:#fff4e1
style Step4 fill:#fff4e1
style Step5 fill:#fff4e1
style Step6 fill:#ccffcc
style Verify fill:#e1ffe1
style End fill:#e1ffe1
```
7.8 Key Takeaways
Optimization
✅ Optimization finds best weights to minimize loss
✅ Uses gradients to determine update direction
✅ Iterative process: compute → update → repeat
Gradient Descent
✅ Basic algorithm: move opposite to gradient
✅ Learning rate controls step size
✅ Can be slow for complex problems
AdamW
✅ Advanced optimizer with adaptive learning rates
✅ Each parameter gets its own learning rate
✅ Combines momentum and variance estimates
✅ Includes weight decay for regularization
✅ Works great for transformers
Why Important
✅ Determines how fast model learns
✅ Affects final model performance
✅ Essential for training neural networks
This document provides a comprehensive explanation of optimization in neural networks, including gradient descent and AdamW optimizer with mathematical formulations and solved exercises.