What is Feed-Forward? Step-by-Step Explanation
Complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.
Table of Contents
- The Problem Feed-Forward Solves
- What is Feed-Forward?
- How Feed-Forward Works: Step-by-Step
- Complete Example: Feed-Forward on "Hello"
- Why Feed-Forward Matters
- Complete Feed-Forward Formula
- Visual Representation
- Why Expand and Compress?
- Key Takeaways
3.1 The Problem Feed-Forward Solves
The Challenge
Attention provides context, but we need to process and transform that information.
Think of it like cooking:
- Attention: Gathers ingredients (context)
- Feed-Forward: Cooks and transforms ingredients (processing)
The Solution: Feed-Forward Network
Feed-Forward applies complex transformations to each position independently.
3.2 What is Feed-Forward?
Simple Definition
A Feed-Forward Network (FFN) is a two-layer neural network that:
- Expands the input to a larger dimension
- Applies a nonlinear transformation
- Compresses back to original dimension
Visual Analogy
Think of it as a pipe that widens, does its work in the wide section, then narrows again:
Input (512 dimensions)
↓
┌─────────────┐
│ EXPAND │
│ 512 → 2048 │
└──────┬──────┘
↓
┌─────────────┐
│ TRANSFORM │
│ (GELU) │
└──────┬──────┘
↓
┌─────────────┐
│ COMPRESS │
│ 2048 → 512 │
└──────┬──────┘
↓
Output (512 dimensions)
3.3 How Feed-Forward Works: Step-by-Step
High-Level Overview
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
Detailed Step-by-Step
Step 1: Expansion (First Linear Layer)
Input: Vector of size 512
Output: Vector of size 2048
Mathematical Operation:
H = X × W₁ + b₁
Example:
Input X:
[0.10, -0.20, 0.30, ..., 0.05] (512 numbers)
Weight Matrix W₁:
Shape: [512, 2048]
Each column of W₁ produces one output dimension (a dot product with X)
Process:
H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0] + b₁[0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1] + b₁[1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047] + b₁[2047]
Result:
H = [0.12, -0.08, 0.25, ..., 0.18] (2048 numbers)
Why Expand?
- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"
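The expansion step can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the helper name `linear`, the toy sizes (4 and 16 standing in for 512 and 2048), and the random initial weights are all assumptions made for the example.

```python
import random

def linear(x, W, b):
    """Return x × W + b for a vector x, a matrix W (one row per input), and a bias b."""
    n_in, n_out = len(W), len(W[0])
    assert len(x) == n_in and len(b) == n_out
    # Output j is the dot product of x with column j of W, plus bias j.
    return [sum(x[i] * W[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]

random.seed(0)
d_model, d_ff = 4, 16  # toy stand-ins for 512 and 2048 (assumed sizes)
X = [0.10, -0.20, 0.30, 0.05]
W1 = [[random.gauss(0, 0.02) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff

H = linear(X, W1, b1)
print(len(X), "->", len(H))
```

The same helper performs the compression step when it is given a weight matrix with more rows than columns.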
Step 2: Nonlinear Activation (GELU)
Apply GELU to each element:
GELU Function:
GELU(x) = x × Φ(x)
Where Φ(x) is the cumulative distribution function of the standard normal distribution
Simplified Understanding:
- Large positive values → pass through almost unchanged (Φ(x) ≈ 1)
- Values near zero → roughly halved (Φ(x) ≈ 0.5)
- Large negative values → pushed toward 0 (Φ(x) ≈ 0)
Example:
Input H:
H = [0.12, -0.08, 0.25, ..., 0.18]
Apply GELU element-wise:
GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.571 ≈ 0.103
Result:
H' = [0.066, -0.037, 0.150, ..., 0.103] (2048 numbers)
Why Nonlinear?
- Without it, the two linear layers collapse into one linear map: (X × W₁) × W₂ = X × (W₁ × W₂)
- Nonlinearity enables complex function approximation
- Essential for learning patterns that are not straight-line relationships
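The GELU arithmetic above can be reproduced with the standard library. The sketch below assumes the exact form GELU(x) = x × Φ(x), with Φ computed from the error function; the function names `phi` and `gelu` are illustrative. Some implementations use a tanh-based approximation instead, which gives slightly different values.

```python
import math

def phi(x):
    """Standard normal CDF: Φ(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """Exact GELU: x * Φ(x)."""
    return x * phi(x)

# Reproduce the element-wise example: each value is scaled by Φ of itself.
for h in [0.12, -0.08, 0.25, 0.18]:
    print(f"GELU({h:+.2f}) = {h:+.2f} × {phi(h):.3f} = {gelu(h):+.3f}")
```

Evaluating `gelu` at larger magnitudes, such as 5.0 and -5.0, shows the two extremes: the positive input comes back almost unchanged and the negative one is mapped very close to zero.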
Step 3: Compression (Second Linear Layer)
Input: Vector of size 2048
Output: Vector of size 512
Mathematical Operation:
O = H' × W₂ + b₂
Process:
O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0] + b₂[0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1] + b₂[1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511] + b₂[511]
Result:
O = [0.15, -0.10, 0.22, ..., 0.12] (512 numbers)
Why Compress?
- Project back to original dimension
- Maintains consistent size throughout model
- Combines expanded features into compact representation
3.4 Complete Example: Feed-Forward on "Hello"
Input
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
Step-by-Step Processing
Step 1: Expansion
Input X:
[0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
Weight Matrix W₁:
Shape: [512, 2048]
Values: Learned during training
Compute:
H = X × W₁ + b₁
Result:
H = [0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
Visualization:
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
│ W₁ │
└──────────┘
Step 2: Activation
Input H:
[0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
Apply GELU element-wise:
GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109
Result:
H' = [0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
Visualization:
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
│ GELU │
└──────────┘
Step 3: Compression
Input H':
[0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
Weight Matrix W₂:
Shape: [2048, 512]
Values: Learned during training
Compute:
O = H' × W₂ + b₂
Result:
O = [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
Visualization:
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
│ W₂ │
└──────────┘
Final Output
Output: [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
Meaning: Transformed representation that captures processed features
3.5 Why Feed-Forward Matters
Benefit 1: Feature Transformation
Before FFN:
Input: Raw attention output
Information: Contextual relationships
After FFN:
Output: Transformed features
Information: Processed and refined understanding
Benefit 2: Non-Linear Processing
Linear operations (like the weighted sum of value vectors inside attention) can only do limited transformations: stacking several of them is still a single linear map.
Non-linear operations (like GELU in FFN) enable complex function learning.
Analogy:
- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, complex shapes
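The "straight lines only" limitation can be checked numerically. The sketch below is illustrative only: the helpers `matvec` and `matmul` and the toy sizes (3 and 8) are assumptions. It shows that an expand-then-compress pair with no activation in between gives the same result as one small matrix.

```python
import random

def matvec(x, W):
    """x × W for a row vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def matmul(A, B):
    """A × B for two matrices stored as lists of rows."""
    return [matvec(row, B) for row in A]

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # expand 3 -> 8
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(8)]  # compress 8 -> 3
x = [0.5, -1.0, 2.0]

two_layers = matvec(matvec(x, W1), W2)  # expand, then compress, no activation
one_layer = matvec(x, matmul(W1, W2))   # a single 3×3 matrix applied once

# Any difference between the two results is floating-point rounding only.
max_gap = max(abs(a - b) for a, b in zip(two_layers, one_layer))
print(max_gap < 1e-9)
```

Placing a nonlinearity such as GELU between the two products breaks this equivalence, which is why the expanded layer adds real expressive power.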
Benefit 3: Position-Wise Processing
FFN processes each position independently:
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
Each position gets its own transformation, computed with the same weights (W₁, b₁, W₂, b₂) at every position.
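Position-wise processing can be demonstrated with a small sketch. The toy sizes, the random weights, and the vector for "World" are made up for the example, and the helper names are illustrative rather than taken from the repository.

```python
import math
import random

def gelu(x):
    """Exact GELU, x * Φ(x), written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """x × W + b for a single vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Apply the feed-forward network to ONE position's vector."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

random.seed(0)
d_model, d_ff = 4, 16  # toy sizes standing in for 512 and 2048
W1 = [[random.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

hello = [0.146, 0.108, 0.192, 0.11]   # position 0
world = [0.050, -0.300, 0.210, 0.40]  # position 1 (made-up values)

# The same weights are applied to every position, one position at a time.
out = [ffn(tok, W1, b1, W2, b2) for tok in [hello, world]]

# Replacing the vector at position 1 leaves the output at position 0 untouched.
out_changed = [ffn(tok, W1, b1, W2, b2) for tok in [hello, [9.0, 9.0, 9.0, 9.0]]]
print(out[0] == out_changed[0])
```

Because no position reads from any other, mixing information between words is left entirely to attention.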
3.6 Complete Feed-Forward Formula
Mathematical Expression
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
Breaking it down:
Part 1: First Linear Transformation
H = X × W₁ + b₁
- Expands from 512 to 2048 dimensions
Part 2: Non-Linear Activation
H' = GELU(H)
- Applies non-linear transformation
Part 3: Second Linear Transformation
O = H' × W₂ + b₂
- Compresses from 2048 back to 512 dimensions
Complete:
FFN(X) = O
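For reference, the same formula can be written with explicit shapes. The block below is a notational sketch: the symbol n for the number of positions in the sequence is an assumption introduced here, and the biases are understood to be added to every row.

```latex
\[
\mathrm{FFN}(X) = \mathrm{GELU}(X W_1 + b_1)\, W_2 + b_2,
\qquad
\mathrm{GELU}(x) = x\,\Phi(x)
\]
\[
X \in \mathbb{R}^{n \times 512},\quad
W_1 \in \mathbb{R}^{512 \times 2048},\quad
b_1 \in \mathbb{R}^{2048},\quad
W_2 \in \mathbb{R}^{2048 \times 512},\quad
b_2 \in \mathbb{R}^{512}
\]
\[
\underbrace{X}_{n \times 512}
\;\xrightarrow{\;W_1,\ b_1\;}\;
\underbrace{H}_{n \times 2048}
\;\xrightarrow{\;\mathrm{GELU}\;}\;
\underbrace{H'}_{n \times 2048}
\;\xrightarrow{\;W_2,\ b_2\;}\;
\underbrace{O}_{n \times 512}
\]
```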
3.7 Visual Representation
Feed-Forward Pipeline
Input Vector (512D)
│
│ [0.146, 0.108, 0.192, ..., 0.11]
↓
┌─────────────────────────────┐
│ Linear Layer 1 │
│ (512 → 2048 expansion) │
│ │
│ H = X × W₁ + b₁ │
└──────────┬──────────────────┘
│
│ [0.21, -0.15, 0.28, ..., 0.19] (2048D)
↓
┌─────────────────────────────┐
│ GELU Activation │
│ (Non-linear transformation) │
│ │
│ H' = GELU(H) │
└──────────┬──────────────────┘
│
│ [0.122, -0.066, 0.171, ..., 0.109] (2048D)
↓
┌─────────────────────────────┐
│ Linear Layer 2 │
│ (2048 → 512 compression) │
│ │
│ O = H' × W₂ + b₂ │
└──────────┬──────────────────┘
│
│ [0.18, -0.12, 0.24, ..., 0.14] (512D)
↓
Output Vector (512D)
Dimension Flow
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
Wide in the middle, narrow at both ends: Expand → Transform → Compress
3.8 Why Expand and Compress?
The Expansion-Compression Strategy
Why not stay at 512 dimensions?
Answer: Expansion provides "working space"
Analogy:
- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy results back to small paper (512D)
Benefits:
- More capacity: 2048 dimensions = more parameters to learn
- Better transformations: More space = more complex functions
- Feature refinement: Transformation happens in expanded space
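The "more capacity" point can be made concrete by counting parameters. The helper below is a hypothetical illustration: it counts the weights and biases of the two linear layers and compares the 512 → 2048 → 512 design with a variant that stays at 512 throughout.

```python
def ffn_param_count(d_model, d_ff):
    """Count weights and biases in a two-layer feed-forward block."""
    expand = d_model * d_ff + d_ff       # W1 plus b1
    compress = d_ff * d_model + d_model  # W2 plus b2
    return expand + compress

print(ffn_param_count(512, 2048))  # 2099712
print(ffn_param_count(512, 512))   # 525312
```

With the 4× expansion, the block holds roughly four times as many learnable parameters as the same two layers would without it.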
Why compress back?
Answer: Maintain consistent size throughout the model
- Every layer takes in and puts out 512 dimensions
- Consistent size enables stacking layers
- Easier to manage and optimize
3.9 Key Takeaways: Feed-Forward
✅ FFN transforms features through expansion and compression
✅ Expands to larger dimension for processing
✅ Applies non-linear transformation (GELU)
✅ Compresses back to original dimension
✅ Processes each position independently
This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.