# What is Feed-Forward? Step-by-Step Explanation

A complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.

## Table of Contents

1. [The Problem Feed-Forward Solves](#31-the-problem-feed-forward-solves)
2. [What is Feed-Forward?](#32-what-is-feed-forward)
3. [How Feed-Forward Works: Step-by-Step](#33-how-feed-forward-works-step-by-step)
4. [Complete Example: Feed-Forward on "Hello"](#34-complete-example-feed-forward-on-hello)
5. [Why Feed-Forward Matters](#35-why-feed-forward-matters)
6. [Complete Feed-Forward Formula](#36-complete-feed-forward-formula)
7. [Visual Representation](#37-visual-representation)
8. [Why Expand and Compress?](#38-why-expand-and-compress)
9. [Key Takeaways](#39-key-takeaways)

---

## 3.1 The Problem Feed-Forward Solves

### The Challenge

**Attention provides context, but we need to process and transform that information.**

Think of it like cooking:

- **Attention:** Gathers the ingredients (context)
- **Feed-Forward:** Cooks and transforms the ingredients (processing)

### The Solution: Feed-Forward Network

**Feed-Forward applies a complex transformation to each position independently.**

---

## 3.2 What is Feed-Forward?

### Simple Definition

A **Feed-Forward Network (FFN)** is a two-layer neural network that:

1. **Expands** the input to a larger dimension
2. **Applies** a nonlinear transformation
3. **Compresses** back to the original dimension

### Visual Analogy

**Think of it like a funnel:**

```
Input (512 dimensions)
        ↓
 ┌─────────────┐
 │   EXPAND    │
 │ 512 → 2048  │
 └──────┬──────┘
        ↓
 ┌─────────────┐
 │  TRANSFORM  │
 │   (GELU)    │
 └──────┬──────┘
        ↓
 ┌─────────────┐
 │  COMPRESS   │
 │ 2048 → 512  │
 └──────┬──────┘
        ↓
Output (512 dimensions)
```

---

## 3.3 How Feed-Forward Works: Step-by-Step

### High-Level Overview

```
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
```

### Detailed Step-by-Step

#### Step 1: Expansion (First Linear Layer)

**Input:** Vector of size 512
**Output:** Vector of size 2048

**Mathematical Operation:**

```
H = X × W₁ + b₁
```

**Example:**

**Input X:**

```
[0.10, -0.20, 0.30, ..., 0.05]  (512 numbers)
```

**Weight Matrix W₁:**

```
Shape: [512, 2048]
Each column transforms the input into one output dimension
```

**Process:**

```
H[0]    = X[0]×W₁[0,0]    + X[1]×W₁[1,0]    + ... + X[511]×W₁[511,0]    + b₁[0]
H[1]    = X[0]×W₁[0,1]    + X[1]×W₁[1,1]    + ... + X[511]×W₁[511,1]    + b₁[1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047] + b₁[2047]
```

**Result:**

```
H = [0.12, -0.08, 0.25, ..., 0.18]  (2048 numbers)
```

**Why Expand?**

- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"

#### Step 2: Nonlinear Activation (GELU)

**Apply GELU to each element.**

**GELU Function:**

```
GELU(x) = x × Φ(x)

where Φ(x) is the cumulative distribution function of the standard normal distribution
```

**Simplified Understanding:**

- Large positive values → pass through almost unchanged
- Values near zero → scaled down by roughly half
- Negative values → pushed toward zero (the more negative, the stronger the suppression)

**Example:**

**Input H:**

```
H = [0.12, -0.08, 0.25, ..., 0.18]
```

**Apply GELU element-wise:**

```
GELU(0.12)  ≈ 0.12 × 0.548  ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25)  ≈ 0.25 × 0.599  ≈ 0.150
...
GELU(0.18)  ≈ 0.18 × 0.572  ≈ 0.103
```

**Result:**

```
H' = [0.066, -0.037, 0.150, ..., 0.103]  (2048 numbers)
```

**Why Nonlinear?**

- Without a nonlinearity, two stacked linear layers collapse into a single linear layer
- Nonlinearity enables complex function approximation
- Essential for learning intricate patterns
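To make the activation concrete, here is a minimal Python sketch (standard library only) of the exact GELU formula above, computing Φ from the error function. It reproduces the worked values; the function and variable names are just illustrative.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

# Reproduce the worked example values from Step 2 above.
for v in [0.12, -0.08, 0.25, 0.18]:
    print(f"GELU({v:+.2f}) = {gelu(v):+.3f}")
# GELU(+0.12) = +0.066
# GELU(-0.08) = -0.037
# GELU(+0.25) = +0.150
# GELU(+0.18) = +0.103
```

In practice, frameworks often substitute a tanh-based approximation of Φ for speed, but the exact form is what the formula above describes.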
#### Step 3: Compression (Second Linear Layer)

**Input:** Vector of size 2048
**Output:** Vector of size 512

**Mathematical Operation:**

```
O = H' × W₂ + b₂
```

**Process:**

```
O[0]   = H'[0]×W₂[0,0]   + H'[1]×W₂[1,0]   + ... + H'[2047]×W₂[2047,0]   + b₂[0]
O[1]   = H'[0]×W₂[0,1]   + H'[1]×W₂[1,1]   + ... + H'[2047]×W₂[2047,1]   + b₂[1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511] + b₂[511]
```

**Result:**

```
O = [0.15, -0.10, 0.22, ..., 0.12]  (512 numbers)
```

**Why Compress?**

- Projects back to the original dimension
- Maintains a consistent size throughout the model
- Combines the expanded features into a compact representation

---

## 3.4 Complete Example: Feed-Forward on "Hello"

### Input

```
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
```

### Step-by-Step Processing

#### Step 1: Expansion

**Input X:**

```
[0.146, 0.108, 0.192, ..., 0.11]  (512 numbers)
```

**Weight Matrix W₁:**

```
Shape: [512, 2048]
Values: Learned during training
```

**Compute:**

```
H = X × W₁ + b₁
```

**Result:**

```
H = [0.21, -0.15, 0.28, ..., 0.19]  (2048 numbers)
```

**Visualization:**

```
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                   │    W₁    │
                   └──────────┘
```

#### Step 2: Activation

**Input H:**

```
[0.21, -0.15, 0.28, ..., 0.19]  (2048 numbers)
```

**Apply GELU element-wise:**

```
GELU(0.21)  ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28)  ≈ 0.171
...
GELU(0.19)  ≈ 0.109
```

**Result:**

```
H' = [0.122, -0.066, 0.171, ..., 0.109]  (2048 numbers)
```

**Visualization:**

```
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘
```

#### Step 3: Compression

**Input H':**

```
[0.122, -0.066, 0.171, ..., 0.109]  (2048 numbers)
```

**Weight Matrix W₂:**

```
Shape: [2048, 512]
Values: Learned during training
```

**Compute:**

```
O = H' × W₂ + b₂
```

**Result:**

```
O = [0.18, -0.12, 0.24, ..., 0.14]  (512 numbers)
```

**Visualization:**

```
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │    W₂    │
                    └──────────┘
```

#### Final Output

```
Output: [0.18, -0.12, 0.24, ..., 0.14]  (512 numbers)
```

**Meaning:** A transformed representation that captures the processed features
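The entire walkthrough can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weights are random stand-ins for trained parameters, GELU uses the common tanh approximation, and the names (`ffn`, `W1`, `b1`, ...) are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # the dimensions used throughout this walkthrough

# Random stand-ins for the trained parameters W1, b1, W2, b2.
W1 = rng.normal(0.0, 0.02, (d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0.0, 0.02, (d_ff, d_model))
b2 = np.zeros(d_model)

def gelu(x):
    # Common tanh approximation of GELU(x) = x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x):
    h = x @ W1 + b1     # Step 1: expand, 512 -> 2048
    h = gelu(h)         # Step 2: nonlinear activation, element-wise
    return h @ W2 + b2  # Step 3: compress, 2048 -> 512

# One position ("Hello"): a 512-dim vector in, a 512-dim vector out.
x = rng.normal(0.0, 1.0, d_model)
print(ffn(x).shape)  # (512,)

# A whole sequence: each row (position) is transformed independently.
seq = rng.normal(0.0, 1.0, (2, d_model))  # e.g. ["Hello", "World"]
out = ffn(seq)                            # same weights applied to every row
print(out.shape)                          # (2, 512)
print(np.allclose(out[0], ffn(seq[0])))   # True: position-wise processing
```

The final `allclose` check previews Benefit 3 below: the same FFN is applied to every position, with no interaction between positions.

---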
## 3.5 Why Feed-Forward Matters

### Benefit 1: Feature Transformation

**Before FFN:**

```
Input: Raw attention output
Information: Contextual relationships
```

**After FFN:**

```
Output: Transformed features
Information: Processed and refined understanding
```

### Benefit 2: Non-Linear Processing

**Linear operations** on their own are limited: any stack of purely linear layers collapses into a single linear transformation.

**Non-linear operations** (like GELU in the FFN) are what enable the model to learn complex functions.

**Analogy:**

- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, and complex shapes

### Benefit 3: Position-Wise Processing

**The FFN processes each position independently:**

```
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
```

**Each word gets its own transformation** (using the same shared weights, as the `allclose` check in the sketch above demonstrates).

---

## 3.6 Complete Feed-Forward Formula

### Mathematical Expression

```
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
```

**Breaking it down:**

**Part 1: First Linear Transformation**

```
H = X × W₁ + b₁
```

- Expands from 512 to 2048 dimensions

**Part 2: Non-Linear Activation**

```
H' = GELU(H)
```

- Applies the non-linear transformation

**Part 3: Second Linear Transformation**

```
O = H' × W₂ + b₂
```

- Compresses from 2048 back to 512 dimensions

**Complete:**

```
FFN(X) = O
```

---

## 3.7 Visual Representation

### Feed-Forward Pipeline

```
Input Vector (512D)
        │
        │ [0.146, 0.108, 0.192, ..., 0.11]
        ↓
┌─────────────────────────────┐
│       Linear Layer 1        │
│   (512 → 2048 expansion)    │
│                             │
│      H = X × W₁ + b₁        │
└──────────┬──────────────────┘
           │
           │ [0.21, -0.15, 0.28, ..., 0.19]  (2048D)
           ↓
┌─────────────────────────────┐
│       GELU Activation       │
│ (Non-linear transformation) │
│                             │
│        H' = GELU(H)         │
└──────────┬──────────────────┘
           │
           │ [0.122, -0.066, 0.171, ..., 0.109]  (2048D)
           ↓
┌─────────────────────────────┐
│       Linear Layer 2        │
│  (2048 → 512 compression)   │
│                             │
│      O = H' × W₂ + b₂       │
└──────────┬──────────────────┘
           │
           │ [0.18, -0.12, 0.24, ..., 0.14]  (512D)
           ↓
Output Vector (512D)
```

### Dimension Flow

```
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
```

**Like a funnel:** Expand → Transform → Compress

---

## 3.8 Why Expand and Compress?

### The Expansion-Compression Strategy

**Why not stay at 512 dimensions?**

**Answer:** Expansion provides "working space."

**Analogy:**

- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy the results back to the small paper (512D)

**Benefits:**

1. **More capacity:** 2048 dimensions = more parameters to learn (the sketch at the end of this document counts them)
2. **Better transformations:** More space = more complex functions
3. **Feature refinement:** The transformation happens in the expanded space

**Why compress back?**

**Answer:** To maintain a consistent size throughout the model.

- All layers use 512 dimensions
- A consistent size enables stacking layers
- Easier to manage and optimize

---

## 3.9 Key Takeaways: Feed-Forward

✅ **FFN transforms features through expansion and compression**

✅ **Expands to a larger dimension for processing**

✅ **Applies a non-linear transformation (GELU)**

✅ **Compresses back to the original dimension**

✅ **Processes each position independently**

---

*This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.*
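---

As a closing footnote to Section 3.8, a short arithmetic sketch of how many parameters the expansion buys, using the 512/2048 sizes from this document:

```python
d_model, d_ff = 512, 2048

params_W1 = d_model * d_ff  # 1,048,576
params_b1 = d_ff            # 2,048
params_W2 = d_ff * d_model  # 1,048,576
params_b2 = d_model         # 512

total = params_W1 + params_b1 + params_W2 + params_b2
print(f"FFN parameters per layer: {total:,}")  # 2,099,712 (~2.1M)
```

For comparison, a single 512 → 512 linear layer would have only 512 × 512 + 512 = 262,656 parameters, roughly an eighth as many: the "more room to work" of Section 3.8 made concrete.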