
What is Optimization? Step-by-Step Explanation

Complete step-by-step explanation of optimization in neural networks: how optimizers update weights to minimize loss.

Table of Contents

  1. What is Optimization?
  2. The Optimization Problem
  3. Gradient Descent
  4. AdamW Optimizer
  5. Why Optimization Matters
  6. Complete Mathematical Formulation
  7. Exercise: Optimizer Step-by-Step
  8. Key Takeaways

7.1 What is Optimization?

Simple Definition

Optimization is the process of finding the best set of weights (parameters) that minimize the loss function and make the model's predictions as accurate as possible.

Visual Analogy

Think of optimization like finding the lowest point in a valley:

Loss Landscape:

    High Loss
        │
        │   ●  (current position)
        │    ╲
        │     ╲
        │      ╲
        │       ╲
        │        ▼  (goal: minimum)
    Low Loss ───────────────

Goal: Find the bottom of the valley (minimum loss)

Optimizer: Your guide down the mountain

What Optimization Does

Optimization:

  1. Measures how wrong the model is (loss)
  2. Calculates direction to improve (gradients)
  3. Updates weights to reduce loss
  4. Repeats until convergence

Result: Model learns to make accurate predictions!
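The four-step loop above can be sketched in a few lines of Python. The toy loss L(θ) = (θ − 1)² and all names here are illustrative, not code from the SheepOp repository.

```python
def loss(theta):
    return (theta - 1.0) ** 2        # 1. measure how wrong the model is

def gradient(theta):
    return 2.0 * (theta - 1.0)       # 2. direction of steepest increase

theta = 5.0                          # start from an arbitrary weight
lr = 0.1                             # learning rate (step size)

for step in range(200):              # 3-4. update weights, repeat
    g = gradient(theta)
    theta -= lr * g
    if abs(g) < 1e-6:                # stop once converged
        break

print(round(theta, 4))               # theta has moved to ~1.0, the minimum
```

Each pass shrinks the distance to the minimum by a constant factor, so convergence is fast on this toy problem.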

Optimization Process Flow

graph TB
    Start[Training Start] --> Init[Initialize Weights<br/>Random Values]
    Init --> Loop[Training Loop]
    
    Loop --> Forward[Forward Pass<br/>Model Prediction]
    Forward --> Loss["Compute Loss<br/>L = loss(pred, target)"]
    Loss --> Check{Converged?}
    
    Check -->|Yes| End[Training Complete]
    Check -->|No| Gradient["Compute Gradients<br/>∇L = ∂L/∂θ"]
    
    Gradient --> Optimize[Optimizer<br/>Update Weights]
    Optimize --> Update["New Weights<br/>θ = θ - update"]
    Update --> Loop
    
    style Start fill:#e1f5ff
    style End fill:#e1ffe1
    style Optimize fill:#fff4e1
    style Check fill:#ffe1f5

7.2 The Optimization Problem

The Objective

We want to minimize:

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i; \theta))

Where:

  • \theta = all model parameters (weights)
  • L = total loss
  • \ell = loss function (e.g., cross-entropy)
  • y_i = correct answer
  • f(x_i; \theta) = model prediction
  • N = number of examples
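As a tiny numeric sketch of this average, take cross-entropy as the per-example loss ℓ; the probabilities below are made up for illustration.

```python
import math

def cross_entropy(p_correct):
    # per-example loss: -log of the probability assigned to the correct answer
    return -math.log(p_correct)

# model's predicted probability for the correct token on N = 4 examples
probs = [0.9, 0.6, 0.8, 0.3]

# L(theta) = (1/N) * sum of per-example losses
L = sum(cross_entropy(p) for p in probs) / len(probs)
print(round(L, 3))                   # ≈ 0.511
```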

The Challenge

Problem: The loss function is non-convex and extremely high-dimensional (one dimension per parameter), so it cannot be minimized analytically.

Solution: Use iterative optimization algorithms.

Process:

Initialize weights randomly
Repeat:
  1. Compute loss
  2. Compute gradients
  3. Update weights
Until convergence

Optimization Problem Flowchart

graph LR
    subgraph "Optimization Problem"
        A["Loss Function<br/>L(θ)"] --> B["Find Minimum<br/>min L(θ)"]
        B --> C["Optimal Weights<br/>θ*"]
    end
    
    subgraph "Solution Approach"
        D["Initialize θ"] --> E[Iterative Updates]
        E --> F[Compute Loss]
        F --> G[Compute Gradient]
        G --> H[Update Weights]
        H --> I{Converged?}
        I -->|No| E
        I -->|Yes| C
    end
    
    A -.-> F
    
    style A fill:#ffcccc
    style B fill:#ffffcc
    style C fill:#ccffcc
    style E fill:#cce5ff

7.3 Gradient Descent

What is Gradient Descent?

Gradient Descent is a basic optimization algorithm that updates weights by moving in the direction of steepest descent.

How It Works

Step 1: Compute Gradient

\nabla_\theta L = \frac{\partial L}{\partial \theta}

Gradient tells us:

  • Direction: the direction of steepest increase (so we step the opposite way)
  • Magnitude: how steep the slope is

Step 2: Update Weights

\theta_{t+1} = \theta_t - \eta \nabla_\theta L

Where:

  • \theta_t = current weights
  • \eta = learning rate (step size)
  • \nabla_\theta L = gradient

Meaning: Move weights in direction opposite to gradient
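Because the update rule is element-wise, it vectorizes directly over all parameters. A minimal NumPy sketch on a quadratic loss (all names illustrative):

```python
import numpy as np

# L(theta) = sum((theta - target)^2), minimized at theta = target
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
eta = 0.1                                # learning rate

for _ in range(200):
    grad = 2.0 * (theta - target)        # dL/dtheta, one entry per parameter
    theta -= eta * grad                  # theta_{t+1} = theta_t - eta * grad

print(np.allclose(theta, target, atol=1e-4))   # True
```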

Visual Example

Loss Landscape:

          Negative gradient
          direction
    ● ─────────────────────→  Lower Loss

Move in the direction of the negative gradient!

Gradient Descent Flowchart

graph TB
    subgraph "Gradient Descent Algorithm"
        Start["Start: Initialize θ₀"] --> Loop["For each iteration t"]
        
        Loop --> Forward[Forward Pass<br/>Compute Predictions]
        Forward --> Loss["Compute Loss<br/>L(θₜ)"]
        Loss --> Grad["Compute Gradient<br/>g = ∇L(θₜ)"]
        
        Grad --> Direction["Determine Direction<br/>-g points to minimum"]
        Direction --> Step["Take Step<br/>η × g"]
        Step --> Update["Update Weights<br/>θₜ₊₁ = θₜ - ηg"]
        
        Update --> Check{"Converged?<br/>|g| < ε"}
        Check -->|No| Loop
        Check -->|Yes| End["Found Minimum<br/>θ*"]
    end
    
    subgraph "Gradient Information"
        GradInfo["Gradient g contains:<br/>- Direction: Which way to go<br/>- Magnitude: How steep"]
    end
    
    Grad -.-> GradInfo
    
    style Start fill:#e1f5ff
    style End fill:#e1ffe1
    style Grad fill:#fff4e1
    style Check fill:#ffe1f5
    style Update fill:#ccffcc

Types of Gradient Descent

1. Batch Gradient Descent:

  • Uses all training examples
  • Most accurate gradients
  • Slow for large datasets

2. Stochastic Gradient Descent (SGD):

  • Uses one example at a time
  • Fast but noisy
  • Can bounce around

3. Mini-Batch Gradient Descent:

  • Uses small batch of examples
  • Balance of speed and accuracy
  • Most commonly used
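The mini-batch variant estimates the gradient from a small random subset at each step. A toy sketch fitting a single parameter to the mean of some data (all numbers illustrative):

```python
import random

data = [2.0, 4.0, 6.0, 8.0] * 64          # 256 "training examples", mean = 5.0
theta, eta, batch_size = 0.0, 0.05, 32

random.seed(0)
for step in range(500):
    batch = random.sample(data, batch_size)          # draw a mini-batch
    # gradient of mean((theta - x)^2) over the batch
    grad = sum(2.0 * (theta - x) for x in batch) / batch_size
    theta -= eta * grad                              # noisy but cheap update

print(round(theta, 1))                    # hovers near the data mean, 5.0
```

The per-step gradient is noisy, but cheap; averaged over many steps it points the same way as the full-batch gradient.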

Gradient Descent Types Comparison

graph TB
    subgraph "Batch Gradient Descent"
        B1[All Training Data] --> B2[Compute Gradient<br/>on Full Dataset]
        B2 --> B3[Single Update<br/>Most Accurate]
        B3 --> B4["Slow: O(N)"]
    end
    
    subgraph "Stochastic Gradient Descent"
        S1[Single Example] --> S2[Compute Gradient<br/>on One Sample]
        S2 --> S3[Many Updates<br/>Fast but Noisy]
        S3 --> S4["Fast: O(1)"]
    end
    
    subgraph "Mini-Batch Gradient Descent"
        M1[Small Batch<br/>32-256 samples] --> M2[Compute Gradient<br/>on Batch]
        M2 --> M3[Balanced Updates<br/>Good Accuracy]
        M3 --> M4["Fast: O(batch_size)"]
    end
    
    style B3 fill:#ccffcc
    style S3 fill:#ffcccc
    style M3 fill:#fff4e1

7.4 AdamW Optimizer

What is AdamW?

AdamW (Adam with Weight Decay) is an advanced optimizer that combines:

  • Adaptive learning rates (like Adam)
  • Weight decay (regularization)

Why AdamW?

  • Per-parameter learning rates
  • Handles sparse gradients well
  • Works great for transformers

How AdamW Works

Step 1: Compute Gradient

g_t = \nabla_\theta L(\theta_t)

Step 2: Update Momentum

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

Where:

  • \beta_1 = 0.9 (momentum decay)
  • m_t = first moment estimate

Meaning: Moving average of gradients

Step 3: Update Variance

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

Where:

  • \beta_2 = 0.999 (variance decay)
  • v_t = second moment estimate

Meaning: Moving average of squared gradients

Step 4: Bias Correction

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Why? Corrects bias in early iterations

Step 5: Update Weights

\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)

Where:

  • \eta = learning rate
  • \epsilon = 10^{-8} (small constant)
  • \lambda = weight decay coefficient

Key Points:

  • \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} = adaptive learning rate per parameter
  • \lambda \theta_t = weight decay (regularization)
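Steps 1-5 can be written out directly for a single scalar parameter. This is a plain-Python sketch of the formulas above, not the repository's optimizer code, and the hyperparameter values are illustrative.

```python
import math

def adamw_step(theta, g, m, v, t,
               lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter; t is 1-based."""
    m = b1 * m + (1 - b1) * g            # Step 2: momentum (first moment)
    v = b2 * v + (1 - b2) * g * g        # Step 3: variance (second moment)
    m_hat = m / (1 - b1 ** t)            # Step 4: bias correction
    v_hat = v / (1 - b2 ** t)
    # Step 5: adaptive step plus decoupled weight decay
    theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# First step from theta = 2.0 with gradient g = 2.0:
theta, m, v = adamw_step(theta=2.0, g=2.0, m=0.0, v=0.0, t=1)
print(round(theta, 3))                   # 1.898
```

The same numbers are worked through by hand in the exercise of Section 7.7.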

AdamW Optimizer Flowchart

graph TB
    subgraph "AdamW Optimization Process"
        Start["Start: Initialize<br/>θ₀, m₀=0, v₀=0"] --> Loop["For each iteration t"]
        
        Loop --> Forward["Forward Pass<br/>Compute Loss L(θₜ)"]
        Forward --> Grad["Step 1: Compute Gradient<br/>gₜ = ∇L(θₜ)"]
        
        Grad --> Mom["Step 2: Update Momentum<br/>mₜ = β₁mₜ₋₁ + (1-β₁)gₜ"]
        Mom --> Var["Step 3: Update Variance<br/>vₜ = β₂vₜ₋₁ + (1-β₂)gₜ²"]
        
        Var --> Bias["Step 4: Bias Correction<br/>m̂ₜ = mₜ/(1-β₁ᵗ)<br/>v̂ₜ = vₜ/(1-β₂ᵗ)"]
        
        Bias --> Adapt["Step 5: Adaptive LR<br/>LR = η/(√v̂ₜ + ε)"]
        
        Adapt --> Decay["Step 6: Weight Decay<br/>ηλθₜ"]
        
        Decay --> Update["Step 7: Update Weights<br/>θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
        
        Update --> Check{Converged?}
        Check -->|No| Loop
        Check -->|Yes| End["Optimal Weights θ*"]
    end
    
    subgraph "Key Components"
        C1["Momentum mₜ<br/>Moving avg of gradients"]
        C2["Variance vₜ<br/>Moving avg of g²"]
        C3["Adaptive LR<br/>Per-parameter learning rate"]
        C4["Weight Decay<br/>Regularization"]
    end
    
    Mom -.-> C1
    Var -.-> C2
    Adapt -.-> C3
    Decay -.-> C4
    
    style Start fill:#e1f5ff
    style End fill:#e1ffe1
    style Grad fill:#fff4e1
    style Adapt fill:#ccffcc
    style Update fill:#ccffcc
    style Check fill:#ffe1f5

AdamW Detailed Subgraph

graph LR
    subgraph "Input"
        I1["Gradient gₜ"]
        I2["Previous Momentum mₜ₋₁"]
        I3["Previous Variance vₜ₋₁"]
        I4["Current Weights θₜ"]
    end
    
    subgraph "Momentum Update"
        M1["Multiply: β₁mₜ₋₁"] --> M2["Combine: β₁mₜ₋₁ + (1-β₁)gₜ"]
        I2 --> M1
        I1 --> M2
    end
    
    subgraph "Variance Update"
        V1["Square: gₜ²"] --> V2["Combine: β₂vₜ₋₁ + (1-β₂)gₜ²"]
        I3 --> V2
        I1 --> V1
    end
    
    subgraph "Bias Correction"
        M2 --> BC1["m̂ₜ = mₜ/(1-β₁ᵗ)"]
        V2 --> BC2["v̂ₜ = vₜ/(1-β₂ᵗ)"]
    end
    
    subgraph "Adaptive Learning Rate"
        BC2 --> ALR["LR = η/(√v̂ₜ + ε)"]
    end
    
    subgraph "Weight Update"
        BC1 --> WU1["Adaptive Step: LR × m̂ₜ"]
        ALR --> WU1
        I4 --> WU2["Decay Step: ηλθₜ"]
        WU1 --> WU3["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
        WU2 --> WU3
    end
    
    style M2 fill:#e1f5ff
    style V2 fill:#e1f5ff
    style BC1 fill:#fff4e1
    style BC2 fill:#fff4e1
    style ALR fill:#ccffcc
    style WU3 fill:#ccffcc

Why AdamW is Better

Compared to SGD:

SGD:

Same learning rate for all parameters
→ Slow convergence
→ Manual tuning needed

AdamW:

Adaptive learning rate per parameter
→ Faster convergence
→ Less manual tuning

Benefits:

  1. Adaptive: Each parameter gets its own learning rate
  2. Robust: Works well with noisy gradients
  3. Efficient: Typically converges faster than SGD
  4. Regularized: Weight decay prevents overfitting

SGD vs AdamW Comparison

graph TB
    subgraph "Stochastic Gradient Descent"
        SGD1["Gradient gₜ"] --> SGD2["Fixed Learning Rate η"]
        SGD2 --> SGD3["Update: θₜ₊₁ = θₜ - ηgₜ"]
        SGD3 --> SGD4["All params same LR"]
        SGD4 --> SGD5["Slow Convergence<br/>Manual Tuning"]
    end
    
    subgraph "AdamW Optimizer"
        AD1["Gradient gₜ"] --> AD2["Momentum mₜ"]
        AD1 --> AD3["Variance vₜ"]
        AD2 --> AD4[Bias Correction]
        AD3 --> AD4
        AD4 --> AD5["Adaptive LR per param"]
        AD5 --> AD6["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - ηλθₜ"]
        AD6 --> AD7["Fast Convergence<br/>Less Tuning"]
    end
    
    subgraph "Comparison"
        Comp1["Same Model<br/>Same Data"]
        Comp1 --> Comp2["SGD: Loss = 2.5<br/>After 100 epochs"]
        Comp1 --> Comp3["AdamW: Loss = 1.8<br/>After 100 epochs"]
        Comp3 --> Comp4[AdamW is Better!]
    end
    
    SGD5 -.-> Comp2
    AD7 -.-> Comp3
    
    style SGD5 fill:#ffcccc
    style AD7 fill:#ccffcc
    style Comp4 fill:#e1ffe1

7.5 Why Optimization Matters

Reason 1: Without Optimization

Random weights:

Weights: Random values
Loss: Very high
Predictions: Random
Model: Useless

Reason 2: With Optimization

Learned weights:

Weights: Optimized values
Loss: Low
Predictions: Accurate
Model: Useful

Reason 3: Determines Learning Speed

Good optimizer:

  • Fast convergence
  • Stable training
  • Good final performance

Poor optimizer:

  • Slow convergence
  • Unstable training
  • Poor final performance

Reason 4: Affects Final Performance

Same model, same data, different optimizers (illustrative numbers):

SGD:      Loss = 2.5 (after 100 epochs)
AdamW:    Loss = 1.8 (after 100 epochs)

Better optimizer = Better model!

Optimization Impact Visualization

graph LR
    subgraph "Without Optimization"
        WO1[Random Weights] --> WO2["High Loss<br/>L ≈ 8-10"]
        WO2 --> WO3[Random Predictions]
        WO3 --> WO4[Model Useless]
    end
    
    subgraph "With Optimization"
        W1[Random Weights] --> W2[Optimization Loop]
        W2 --> W3[Update Weights]
        W3 --> W4["Low Loss<br/>L ≈ 1-2"]
        W4 --> W5[Accurate Predictions]
        W5 --> W6[Model Useful]
    end
    
    subgraph "Optimizer Quality"
        O1["Poor Optimizer<br/>SGD, Bad LR"] --> O2["Slow Convergence<br/>Loss = 2.5"]
        O3["Good Optimizer<br/>AdamW, Proper LR"] --> O4["Fast Convergence<br/>Loss = 1.8"]
    end
    
    W2 -.-> O1
    W2 -.-> O3
    
    style WO4 fill:#ffcccc
    style W6 fill:#ccffcc
    style O2 fill:#ffcccc
    style O4 fill:#ccffcc

7.6 Complete Mathematical Formulation

Optimization Problem

\theta^* = \arg\min_{\theta} L(\theta)

Where \theta^* is the optimal set of weights

Gradient Descent Update

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)

AdamW Update (Complete)

For each parameter \theta_i:

Gradient:

g_{t,i} = \frac{\partial L}{\partial \theta_{t,i}}

Momentum:

m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i}

Variance:

v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2

Bias Correction:

\hat{m}_{t,i} = \frac{m_{t,i}}{1 - \beta_1^t}
\hat{v}_{t,i} = \frac{v_{t,i}}{1 - \beta_2^t}

Update:

\theta_{t+1,i} = \theta_{t,i} - \eta \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}} + \epsilon} + \lambda \theta_{t,i} \right)

Where:

  • \beta_1 = 0.9
  • \beta_2 = 0.999
  • \epsilon = 10^{-8}
  • \lambda = weight decay (e.g., 0.01)
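Since every update is per-parameter, the whole procedure vectorizes. A NumPy sketch running many AdamW steps on a toy quadratic loss (all values illustrative):

```python
import numpy as np

target = np.array([1.0, -3.0, 2.0])          # minimum of the toy loss
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
lr, b1, b2, eps, wd = 0.05, 0.9, 0.999, 1e-8, 0.01

for t in range(1, 1001):                     # t is 1-based for bias correction
    g = 2.0 * (theta - target)               # gradient of sum((theta-target)^2)
    m = b1 * m + (1 - b1) * g                # momentum
    v = b2 * v + (1 - b2) * g ** 2           # variance
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)

print(np.round(theta, 1))    # settles near target, up to Adam's small
                             # oscillation and a slight weight-decay bias
```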

Complete Mathematical Flow

graph TB
    subgraph "Optimization Problem"
        OP1["Loss Function L(θ)"] --> OP2["Find: θ* = argmin L(θ)"]
    end
    
    subgraph "Gradient Computation"
        GC1[Forward Pass] --> GC2[Compute Loss L]
        GC2 --> GC3[Backpropagation]
        GC3 --> GC4["Gradient gᵢ = ∂L/∂θᵢ"]
    end
    
    subgraph "AdamW Update Steps"
        GC4 --> AU1["Momentum: mᵢ = β₁m + (1-β₁)gᵢ"]
        AU1 --> AU2["Variance: vᵢ = β₂v + (1-β₂)gᵢ²"]
        AU2 --> AU3["Bias Correction:<br/>m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)"]
        AU3 --> AU4["Adaptive LR: η/(√v̂ + ε)"]
        AU4 --> AU5["Update: θᵢ = θᵢ - LR×m̂ - ηλθᵢ"]
    end
    
    subgraph "Convergence Check"
        AU5 --> CC1{Converged?}
        CC1 -->|No| GC1
        CC1 -->|Yes| CC2["Optimal Weights θ*"]
    end
    
    OP2 -.-> GC1
    CC2 -.-> OP2
    
    style OP2 fill:#ffffcc
    style GC4 fill:#fff4e1
    style AU5 fill:#ccffcc
    style CC2 fill:#e1ffe1

7.7 Exercise: Optimizer Step-by-Step

Problem

Given:

  • Current weight: \theta_0 = 2.0
  • Loss function: L(\theta) = (\theta - 1)^2
  • Learning rate: \eta = 0.1
  • Use AdamW with \beta_1 = 0.9, \beta_2 = 0.999, \lambda = 0.01
  • Initial moments: m_0 = 0, v_0 = 0

Calculate the weight update for step 1.

Step-by-Step Solution

Step 1: Compute Gradient

Loss function:

L(\theta) = (\theta - 1)^2

Gradient:

g_1 = \frac{\partial L}{\partial \theta} = 2(\theta - 1)

At \theta_0 = 2.0:

g_1 = 2(2.0 - 1) = 2(1.0) = 2.0

Step 2: Update Momentum

m_1 = \beta_1 m_0 + (1 - \beta_1) g_1
m_1 = 0.9 \times 0 + (1 - 0.9) \times 2.0 = 0 + 0.1 \times 2.0 = 0.2

Step 3: Update Variance

v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2
v_1 = 0.999 \times 0 + (1 - 0.999) \times (2.0)^2 = 0 + 0.001 \times 4.0 = 0.004

Step 4: Bias Correction

\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.2}{1 - 0.9} = \frac{0.2}{0.1} = 2.0
\hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.004}{1 - 0.999} = \frac{0.004}{0.001} = 4.0

Step 5: Compute Update

\Delta \theta_1 = \eta \left( \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} + \lambda \theta_0 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{\sqrt{4.0} + 10^{-8}} + 0.01 \times 2.0 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0 + 0.00000001} + 0.02 \right)
\Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0} + 0.02 \right) = 0.1 (1.0 + 0.02) = 0.1 \times 1.02 = 0.102

Step 6: Update Weight

\theta_1 = \theta_0 - \Delta \theta_1 = 2.0 - 0.102 = 1.898

Answer

After one step:

  • Old weight: \theta_0 = 2.0
  • New weight: \theta_1 = 1.898
  • Change: \theta_1 - \theta_0 = -0.102 (the update \Delta \theta_1 = 0.102 is subtracted)

The weight moved closer to the optimal value (1.0)!

Verification

Check loss:

  • Old loss: L(2.0) = (2.0 - 1)^2 = 1.0
  • New loss: L(1.898) = (1.898 - 1)^2 = 0.806

Loss decreased! ✓
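The hand computation above can be double-checked in a few lines of Python, mirroring Steps 1-6 (the variables are just the exercise's quantities):

```python
b1, b2, eps, lr, wd = 0.9, 0.999, 1e-8, 0.1, 0.01
theta0 = 2.0

g1 = 2.0 * (theta0 - 1.0)                    # Step 1: gradient = 2.0
m1 = b1 * 0.0 + (1 - b1) * g1                # Step 2: momentum = 0.2
v1 = b2 * 0.0 + (1 - b2) * g1 ** 2           # Step 3: variance = 0.004
m_hat = m1 / (1 - b1 ** 1)                   # Step 4: 2.0
v_hat = v1 / (1 - b2 ** 1)                   # Step 4: 4.0
delta = lr * (m_hat / (v_hat ** 0.5 + eps) + wd * theta0)   # Step 5: ~0.102
theta1 = theta0 - delta                      # Step 6: ~1.898

old_loss = (theta0 - 1.0) ** 2               # 1.0
new_loss = (theta1 - 1.0) ** 2               # ~0.806, so the loss decreased
print(round(theta1, 3), round(new_loss, 3))
```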

Exercise Solution Flowchart

graph TB
    subgraph "Given Values"
        G1["θ₀ = 2.0"] --> Start
        G2["m₀ = 0"] --> Start
        G3["v₀ = 0"] --> Start
        G4["η = 0.1, β₁ = 0.9<br/>β₂ = 0.999, λ = 0.01"] --> Start
    end
    
    Start[Start] --> Step1["Step 1: Compute Gradient<br/>L(θ) = (θ-1)²<br/>g₁ = 2(θ₀-1) = 2.0"]
    
    Step1 --> Step2["Step 2: Update Momentum<br/>m₁ = 0.9×0 + 0.1×2.0<br/>m₁ = 0.2"]
    
    Step2 --> Step3["Step 3: Update Variance<br/>v₁ = 0.999×0 + 0.001×4.0<br/>v₁ = 0.004"]
    
    Step3 --> Step4["Step 4: Bias Correction<br/>m̂₁ = 0.2/(1-0.9) = 2.0<br/>v̂₁ = 0.004/(1-0.999) = 4.0"]
    
    Step4 --> Step5["Step 5: Compute Update<br/>Δθ₁ = 0.1×(2.0/√4.0 + 0.01×2.0)<br/>Δθ₁ = 0.102"]
    
    Step5 --> Step6["Step 6: Update Weight<br/>θ₁ = 2.0 - 0.102<br/>θ₁ = 1.898"]
    
    Step6 --> Verify["Verification:<br/>L(2.0) = 1.0 → L(1.898) = 0.806<br/>Loss Decreased!"]
    
    Verify --> End["Result: θ₁ = 1.898<br/>Closer to optimum θ* = 1.0"]
    
    style Start fill:#e1f5ff
    style Step1 fill:#fff4e1
    style Step2 fill:#fff4e1
    style Step3 fill:#fff4e1
    style Step4 fill:#fff4e1
    style Step5 fill:#fff4e1
    style Step6 fill:#ccffcc
    style Verify fill:#e1ffe1
    style End fill:#e1ffe1

7.8 Key Takeaways

Optimization

Optimization finds best weights to minimize loss
Uses gradients to determine update direction
Iterative process: compute → update → repeat

Gradient Descent

Basic algorithm: move opposite to gradient
Learning rate controls step size
Can be slow for complex problems

AdamW

Advanced optimizer with adaptive learning rates
Each parameter gets its own learning rate
Combines momentum and variance estimates
Includes weight decay for regularization
Works great for transformers

Why Important

Determines how fast model learns
Affects final model performance
Essential for training neural networks


This document provides a comprehensive explanation of optimization in neural networks, including gradient descent and AdamW optimizer with mathematical formulations and solved exercises.