# What is Optimization? Step-by-Step Explanation Complete step-by-step explanation of optimization in neural networks: how optimizers update weights to minimize loss. ## Table of Contents 1. [What is Optimization?](#71-what-is-optimization) 2. [The Optimization Problem](#72-the-optimization-problem) 3. [Gradient Descent](#73-gradient-descent) 4. [AdamW Optimizer](#74-adamw-optimizer) 5. [Why Optimization Matters](#75-why-optimization-matters) 6. [Complete Mathematical Formulation](#76-complete-mathematical-formulation) 7. [Exercise: Optimizer Step-by-Step](#77-exercise-optimizer-step-by-step) 8. [Key Takeaways](#78-key-takeaways) --- ## 7.1 What is Optimization? ### Simple Definition **Optimization** is the process of finding the best set of weights (parameters) that minimize the loss function and make the model's predictions as accurate as possible. ### Visual Analogy **Think of optimization like finding the lowest point in a valley:** ``` Loss Landscape: High Loss │ │ ● (current position) │ ╱│╲ │ ╱ │ ╲ │ ╱ │ ╲ │╱ │ ╲ │ ▼ │ │ (goal)│ │ │ Low Loss ─────┘ ``` **Goal:** Find the bottom of the valley (minimum loss) **Optimizer:** Your guide down the mountain ### What Optimization Does **Optimization:** 1. **Measures** how wrong the model is (loss) 2. **Calculates** direction to improve (gradients) 3. **Updates** weights to reduce loss 4. **Repeats** until convergence **Result:** Model learns to make accurate predictions! ### Optimization Process Flow ```mermaid graph TB Start[Training Start] --> Init[Initialize Weights
Random Values] Init --> Loop[Training Loop] Loop --> Forward[Forward Pass
Model Prediction] Forward --> Loss["Compute Loss
L = loss(pred, target)"] Loss --> Check{Converged?} Check -->|Yes| End[Training Complete] Check -->|No| Gradient["Compute Gradients
∇L = ∂L/∂θ"] Gradient --> Optimize[Optimizer
Update Weights] Optimize --> Update["New Weights
θ = θ - update"] Update --> Loop style Start fill:#e1f5ff style End fill:#e1ffe1 style Optimize fill:#fff4e1 style Check fill:#ffe1f5 ``` --- ## 7.2 The Optimization Problem ### The Objective **We want to minimize:** ```math L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i; \theta)) ``` **Where:** - $\theta$ = all model parameters (weights) - $L$ = total loss - $\ell$ = loss function (e.g., cross-entropy) - $y_i$ = correct answer - $f(x_i; \theta)$ = model prediction - $N$ = number of examples ### The Challenge **Problem:** Loss function is complex and high-dimensional **Solution:** Use iterative optimization algorithms **Process:** ``` Initialize weights randomly Repeat: 1. Compute loss 2. Compute gradients 3. Update weights Until convergence ``` ### Optimization Problem Flowchart ```mermaid graph LR subgraph "Optimization Problem" A["Loss Function
L(θ)"] --> B["Find Minimum
min L(θ)"] B --> C["Optimal Weights
θ*"] end subgraph "Solution Approach" D["Initialize θ"] --> E[Iterative Updates] E --> F[Compute Loss] F --> G[Compute Gradient] G --> H[Update Weights] H --> I{Converged?} I -->|No| E I -->|Yes| C end A -.-> F C -.-> C style A fill:#ffcccc style B fill:#ffffcc style C fill:#ccffcc style E fill:#cce5ff ``` --- ## 7.3 Gradient Descent ### What is Gradient Descent? **Gradient Descent** is a basic optimization algorithm that updates weights by moving in the direction of steepest descent. ### How It Works **Step 1: Compute Gradient** ```math \nabla_\theta L = \frac{\partial L}{\partial \theta} ``` **Gradient tells us:** - Direction: Which way to go - Magnitude: How steep the slope **Step 2: Update Weights** ```math \theta_{t+1} = \theta_t - \eta \nabla_\theta L ``` **Where:** - $\theta_t$ = current weights - $\eta$ = learning rate (step size) - $\nabla_\theta L$ = gradient **Meaning:** Move weights in direction opposite to gradient ### Visual Example ``` Loss Landscape (2D): Gradient Direction ↓ ● ──────┼───── → Lower Loss │ │ ``` **Move in direction of negative gradient!** ### Gradient Descent Flowchart ```mermaid graph TB subgraph "Gradient Descent Algorithm" Start["Start: Initialize θ₀"] --> Loop["For each iteration t"] Loop --> Forward[Forward Pass
Compute Predictions] Forward --> Loss["Compute Loss
L(θₜ)"] Loss --> Grad["Compute Gradient
g = ∇L(θₜ)"] Grad --> Direction["Determine Direction
-g points to minimum"] Direction --> Step["Take Step
η × g"] Step --> Update["Update Weights
θₜ₊₁ = θₜ - ηg"] Update --> Check{"Converged?
|g| < ε"} Check -->|No| Loop Check -->|Yes| End["Found Minimum
θ*"] end subgraph "Gradient Information" GradInfo["Gradient g contains:
- Direction: Which way to go
- Magnitude: How steep"] end Grad -.-> GradInfo style Start fill:#e1f5ff style End fill:#e1ffe1 style Grad fill:#fff4e1 style Check fill:#ffe1f5 style Update fill:#ccffcc ``` ### Types of Gradient Descent **1. Batch Gradient Descent:** - Uses all training examples - Most accurate gradients - Slow for large datasets **2. Stochastic Gradient Descent (SGD):** - Uses one example at a time - Fast but noisy - Can bounce around **3. Mini-Batch Gradient Descent:** - Uses small batch of examples - Balance of speed and accuracy - Most commonly used ### Gradient Descent Types Comparison ```mermaid graph TB subgraph "Batch Gradient Descent" B1[All Training Data] --> B2[Compute Gradient
on Full Dataset] B2 --> B3[Single Update
Most Accurate] B3 --> B4["Slow: O(N)"] end subgraph "Stochastic Gradient Descent" S1[Single Example] --> S2[Compute Gradient
on One Sample] S2 --> S3[Many Updates
Fast but Noisy] S3 --> S4["Fast: O(1)"] end subgraph "Mini-Batch Gradient Descent" M1[Small Batch
32-256 samples] --> M2[Compute Gradient
on Batch] M2 --> M3[Balanced Updates
Good Accuracy] M3 --> M4["Fast: O(batch_size)"] end style B3 fill:#ccffcc style S3 fill:#ffcccc style M3 fill:#fff4e1 ``` --- ## 7.4 AdamW Optimizer ### What is AdamW? **AdamW** (Adam with Weight Decay) is an advanced optimizer that combines: - **Adaptive learning rates** (like Adam) - **Weight decay** (regularization) **Why AdamW?** - Per-parameter learning rates - Handles sparse gradients well - Works great for transformers ### How AdamW Works **Step 1: Compute Gradient** ```math g_t = \nabla_\theta L(\theta_t) ``` **Step 2: Update Momentum** ```math m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ``` **Where:** - $\beta_1 = 0.9$ (momentum decay) - $m_t$ = first moment estimate **Meaning:** Moving average of gradients **Step 3: Update Variance** ```math v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ``` **Where:** - $\beta_2 = 0.999$ (variance decay) - $v_t$ = second moment estimate **Meaning:** Moving average of squared gradients **Step 4: Bias Correction** ```math \hat{m}_t = \frac{m_t}{1 - \beta_1^t} ``` ```math \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ``` **Why?** Corrects bias in early iterations **Step 5: Update Weights** ```math \theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right) ``` **Where:** - $\eta$ = learning rate - $\epsilon = 10^{-8}$ (small constant) - $\lambda$ = weight decay coefficient **Key Points:** - $\frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$ = adaptive learning rate per parameter - $\lambda \theta_t$ = weight decay (regularization) ### AdamW Optimizer Flowchart ```mermaid graph TB subgraph "AdamW Optimization Process" Start["Start: Initialize
θ₀, m₀=0, v₀=0"] --> Loop["For each iteration t"] Loop --> Forward["Forward Pass
Compute Loss L(θₜ)"] Forward --> Grad["Step 1: Compute Gradient
gₜ = ∇L(θₜ)"] Grad --> Mom["Step 2: Update Momentum
mₜ = β₁mₜ₋₁ + (1-β₁)gₜ"] Mom --> Var["Step 3: Update Variance
vₜ = β₂vₜ₋₁ + (1-β₂)gₜ²"] Var --> Bias["Step 4: Bias Correction
m̂ₜ = mₜ/(1-β₁ᵗ)
v̂ₜ = vₜ/(1-β₂ᵗ)"] Bias --> Adapt["Step 5: Adaptive LR
LR = η/(√v̂ₜ + ε)"] Adapt --> Decay["Step 6: Weight Decay
λθₜ"] Decay --> Update["Step 7: Update Weights
θₜ₊₁ = θₜ - LR×m̂ₜ - λθₜ"] Update --> Check{Converged?} Check -->|No| Loop Check -->|Yes| End["Optimal Weights θ*"] end subgraph "Key Components" C1["Momentum mₜ
Moving avg of gradients"] C2["Variance vₜ
Moving avg of g²"] C3["Adaptive LR
Per-parameter learning rate"] C4["Weight Decay
Regularization"] end Mom -.-> C1 Var -.-> C2 Adapt -.-> C3 Decay -.-> C4 style Start fill:#e1f5ff style End fill:#e1ffe1 style Grad fill:#fff4e1 style Adapt fill:#ccffcc style Update fill:#ccffcc style Check fill:#ffe1f5 ``` ### AdamW Detailed Subgraph ```mermaid graph LR subgraph "Input" I1["Gradient gₜ"] I2["Previous Momentum mₜ₋₁"] I3["Previous Variance vₜ₋₁"] I4["Current Weights θₜ"] end subgraph "Momentum Update" M1["Multiply: β₁mₜ₋₁"] --> M2["Combine: β₁mₜ₋₁ + (1-β₁)gₜ"] I2 --> M1 I1 --> M2 end subgraph "Variance Update" V1["Square: gₜ²"] --> V2["Combine: β₂vₜ₋₁ + (1-β₂)gₜ²"] I3 --> V2 I1 --> V1 end subgraph "Bias Correction" M2 --> BC1["m̂ₜ = mₜ/(1-β₁ᵗ)"] V2 --> BC2["v̂ₜ = vₜ/(1-β₂ᵗ)"] end subgraph "Adaptive Learning Rate" BC2 --> ALR["LR = η/(√v̂ₜ + ε)"] end subgraph "Weight Update" BC1 --> WU1["Adaptive Step: LR × m̂ₜ"] ALR --> WU1 I4 --> WU2["Decay Step: λθₜ"] WU1 --> WU3["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - λθₜ"] WU2 --> WU3 end style M2 fill:#e1f5ff style V2 fill:#e1f5ff style BC1 fill:#fff4e1 style BC2 fill:#fff4e1 style ALR fill:#ccffcc style WU3 fill:#ccffcc ``` ### Why AdamW is Better **Compared to SGD:** **SGD:** ``` Same learning rate for all parameters → Slow convergence → Manual tuning needed ``` **AdamW:** ``` Adaptive learning rate per parameter → Faster convergence → Less manual tuning ``` **Benefits:** 1. **Adaptive:** Each parameter gets its own learning rate 2. **Robust:** Works well with noisy gradients 3. **Efficient:** Converges faster than SGD 4. **Regularized:** Weight decay prevents overfitting ### SGD vs AdamW Comparison ```mermaid graph TB subgraph "Stochastic Gradient Descent" SGD1["Gradient gₜ"] --> SGD2["Fixed Learning Rate η"] SGD2 --> SGD3["Update: θₜ₊₁ = θₜ - ηgₜ"] SGD3 --> SGD4["All params same LR"] SGD4 --> SGD5["Slow Convergence
Manual Tuning"] end subgraph "AdamW Optimizer" AD1["Gradient gₜ"] --> AD2["Momentum mₜ"] AD1 --> AD3["Variance vₜ"] AD2 --> AD4[Bias Correction] AD3 --> AD4 AD4 --> AD5["Adaptive LR per param"] AD5 --> AD6["Update: θₜ₊₁ = θₜ - LR×m̂ₜ - λθₜ"] AD6 --> AD7["Fast Convergence
Less Tuning"] end subgraph "Comparison" Comp1["Same Model
Same Data"] Comp1 --> Comp2["SGD: Loss = 2.5
After 100 epochs"] Comp1 --> Comp3["AdamW: Loss = 1.8
After 100 epochs"] Comp3 --> Comp4[AdamW is Better!] end SGD5 -.-> Comp2 AD7 -.-> Comp3 style SGD5 fill:#ffcccc style AD7 fill:#ccffcc style Comp4 fill:#e1ffe1 ``` --- ## 7.5 Why Optimization Matters ### Reason 1: Without Optimization **Random weights:** ``` Weights: Random values Loss: Very high Predictions: Random Model: Useless ``` ### Reason 2: With Optimization **Learned weights:** ``` Weights: Optimized values Loss: Low Predictions: Accurate Model: Useful ``` ### Reason 3: Determines Learning Speed **Good optimizer:** - Fast convergence - Stable training - Good final performance **Poor optimizer:** - Slow convergence - Unstable training - Poor final performance ### Reason 4: Affects Final Performance **Same model, different optimizers:** ``` SGD: Loss = 2.5 (after 100 epochs) AdamW: Loss = 1.8 (after 100 epochs) ``` **Better optimizer = Better model!** ### Optimization Impact Visualization ```mermaid graph LR subgraph "Without Optimization" WO1[Random Weights] --> WO2["High Loss
L ≈ 8-10"] WO2 --> WO3[Random Predictions] WO3 --> WO4[Model Useless] end subgraph "With Optimization" W1[Random Weights] --> W2[Optimization Loop] W2 --> W3[Update Weights] W3 --> W4["Low Loss
L ≈ 1-2"] W4 --> W5[Accurate Predictions] W5 --> W6[Model Useful] end subgraph "Optimizer Quality" O1["Poor Optimizer
SGD, Bad LR"] --> O2["Slow Convergence
Loss = 2.5"] O3["Good Optimizer
AdamW, Proper LR"] --> O4["Fast Convergence
Loss = 1.8"] end W2 -.-> O1 W2 -.-> O3 style WO4 fill:#ffcccc style W6 fill:#ccffcc style O2 fill:#ffcccc style O4 fill:#ccffcc ``` --- ## 7.6 Complete Mathematical Formulation ### Optimization Problem ```math \theta^* = \arg\min_{\theta} L(\theta) ``` **Where $\theta^*$ is the optimal set of weights** ### Gradient Descent Update ```math \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) ``` ### AdamW Update (Complete) **For each parameter $\theta_i$:** **Gradient:** ```math g_{t,i} = \frac{\partial L}{\partial \theta_{t,i}} ``` **Momentum:** ```math m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) g_{t,i} ``` **Variance:** ```math v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2 ``` **Bias Correction:** ```math \hat{m}_{t,i} = \frac{m_{t,i}}{1 - \beta_1^t} ``` ```math \hat{v}_{t,i} = \frac{v_{t,i}}{1 - \beta_2^t} ``` **Update:** ```math \theta_{t+1,i} = \theta_{t,i} - \eta \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}} + \epsilon} + \lambda \theta_{t,i} \right) ``` **Where:** - $\beta_1 = 0.9$ - $\beta_2 = 0.999$ - $\epsilon = 10^{-8}$ - $\lambda$ = weight decay (e.g., 0.01) ### Complete Mathematical Flow ```mermaid graph TB subgraph "Optimization Problem" OP1["Loss Function L(θ)"] --> OP2["Find: θ* = argmin L(θ)"] end subgraph "Gradient Computation" GC1[Forward Pass] --> GC2[Compute Loss L] GC2 --> GC3[Backpropagation] GC3 --> GC4["Gradient gᵢ = ∂L/∂θᵢ"] end subgraph "AdamW Update Steps" GC4 --> AU1["Momentum: mᵢ = β₁m + (1-β₁)gᵢ"] AU1 --> AU2["Variance: vᵢ = β₂v + (1-β₂)gᵢ²"] AU2 --> AU3["Bias Correction:
m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)"] AU3 --> AU4["Adaptive LR: η/(√v̂ + ε)"] AU4 --> AU5["Update: θᵢ = θᵢ - LR×m̂ - λθᵢ"] end subgraph "Convergence Check" AU5 --> CC1{Converged?} CC1 -->|No| GC1 CC1 -->|Yes| CC2["Optimal Weights θ*"] end OP2 -.-> GC1 CC2 -.-> OP2 style OP2 fill:#ffffcc style GC4 fill:#fff4e1 style AU5 fill:#ccffcc style CC2 fill:#e1ffe1 ``` --- ## 7.7 Exercise: Optimizer Step-by-Step ### Problem **Given:** - Current weight: $\theta_0 = 2.0$ - Loss function: $L(\theta) = (\theta - 1)^2$ - Learning rate: $\eta = 0.1$ - Use AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\lambda = 0.01$ - Initial moments: $m_0 = 0$, $v_0 = 0$ **Calculate the weight update for step 1.** ### Step-by-Step Solution #### Step 1: Compute Gradient **Loss function:** ```math L(\theta) = (\theta - 1)^2 ``` **Gradient:** ```math g_1 = \frac{\partial L}{\partial \theta} = 2(\theta - 1) ``` **At $\theta_0 = 2.0$:** ```math g_1 = 2(2.0 - 1) = 2(1.0) = 2.0 ``` #### Step 2: Update Momentum ```math m_1 = \beta_1 m_0 + (1 - \beta_1) g_1 ``` ```math m_1 = 0.9 \times 0 + (1 - 0.9) \times 2.0 = 0 + 0.1 \times 2.0 = 0.2 ``` #### Step 3: Update Variance ```math v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2 ``` ```math v_1 = 0.999 \times 0 + (1 - 0.999) \times (2.0)^2 = 0 + 0.001 \times 4.0 = 0.004 ``` #### Step 4: Bias Correction ```math \hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.2}{1 - 0.9} = \frac{0.2}{0.1} = 2.0 ``` ```math \hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.004}{1 - 0.999} = \frac{0.004}{0.001} = 4.0 ``` #### Step 5: Compute Update ```math \Delta \theta_1 = \eta \left( \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} + \lambda \theta_0 \right) ``` ```math \Delta \theta_1 = 0.1 \left( \frac{2.0}{\sqrt{4.0} + 10^{-8}} + 0.01 \times 2.0 \right) ``` ```math \Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0 + 0.00000001} + 0.02 \right) ``` ```math \Delta \theta_1 = 0.1 \left( \frac{2.0}{2.0} + 0.02 \right) = 0.1 (1.0 + 0.02) = 0.1 \times 1.02 = 0.102 ``` #### Step 6: Update Weight ```math \theta_1 = \theta_0 - \Delta \theta_1 = 2.0 - 0.102 = 1.898 ``` ### Answer **After one step:** - Old weight: $\theta_0 = 2.0$ - New weight: $\theta_1 = 1.898$ - Update: $\Delta \theta_1 = -0.102$ **The weight moved closer to the optimal value (1.0)!** ### Verification **Check loss:** - Old loss: $L(2.0) = (2.0 - 1)^2 = 1.0$ - New loss: $L(1.898) = (1.898 - 1)^2 = 0.806$ **Loss decreased! ✓** ### Exercise Solution Flowchart ```mermaid graph TB subgraph "Given Values" G1["θ₀ = 2.0"] --> Start G2["m₀ = 0"] --> Start G3["v₀ = 0"] --> Start G4["η = 0.1, β₁ = 0.9
β₂ = 0.999, λ = 0.01"] --> Start end Start[Start] --> Step1["Step 1: Compute Gradient
L(θ) = (θ-1)²
g₁ = 2(θ₀-1) = 2.0"] Step1 --> Step2["Step 2: Update Momentum
m₁ = 0.9×0 + 0.1×2.0
m₁ = 0.2"] Step2 --> Step3["Step 3: Update Variance
v₁ = 0.999×0 + 0.001×4.0
v₁ = 0.004"] Step3 --> Step4["Step 4: Bias Correction
m̂₁ = 0.2/(1-0.9) = 2.0
v̂₁ = 0.004/(1-0.999) = 4.0"] Step4 --> Step5["Step 5: Compute Update
Δθ₁ = 0.1×(2.0/√4.0 + 0.01×2.0)
Δθ₁ = 0.102"] Step5 --> Step6["Step 6: Update Weight
θ₁ = 2.0 - 0.102
θ₁ = 1.898"] Step6 --> Verify["Verification:
L(2.0) = 1.0 → L(1.898) = 0.806
Loss Decreased!"] Verify --> End["Result: θ₁ = 1.898
Closer to optimum θ* = 1.0"] style Start fill:#e1f5ff style Step1 fill:#fff4e1 style Step2 fill:#fff4e1 style Step3 fill:#fff4e1 style Step4 fill:#fff4e1 style Step5 fill:#fff4e1 style Step6 fill:#ccffcc style Verify fill:#e1ffe1 style End fill:#e1ffe1 ``` --- ## 7.8 Key Takeaways ### Optimization ✅ **Optimization finds best weights to minimize loss** ✅ **Uses gradients to determine update direction** ✅ **Iterative process: compute → update → repeat** ### Gradient Descent ✅ **Basic algorithm: move opposite to gradient** ✅ **Learning rate controls step size** ✅ **Can be slow for complex problems** ### AdamW ✅ **Advanced optimizer with adaptive learning rates** ✅ **Each parameter gets its own learning rate** ✅ **Combines momentum and variance estimates** ✅ **Includes weight decay for regularization** ✅ **Works great for transformers** ### Why Important ✅ **Determines how fast model learns** ✅ **Affects final model performance** ✅ **Essential for training neural networks** --- *This document provides a comprehensive explanation of optimization in neural networks, including gradient descent and AdamW optimizer with mathematical formulations and solved exercises.*