sheepOp/docs/FEED_FORWARD_EXPLAINED.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
2025-11-06 22:07:41 -05:00

# What is Feed-Forward? Step-by-Step Explanation
Complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.
## Table of Contents
1. [The Problem Feed-Forward Solves](#31-the-problem-feed-forward-solves)
2. [What is Feed-Forward?](#32-what-is-feed-forward)
3. [How Feed-Forward Works: Step-by-Step](#33-how-feed-forward-works-step-by-step)
4. [Complete Example: Feed-Forward on "Hello"](#34-complete-example-feed-forward-on-hello)
5. [Why Feed-Forward Matters](#35-why-feed-forward-matters)
6. [Complete Feed-Forward Formula](#36-complete-feed-forward-formula)
7. [Visual Representation](#37-visual-representation)
8. [Why Expand and Compress?](#38-why-expand-and-compress)
9. [Key Takeaways](#39-key-takeaways-feed-forward)
---
## 3.1 The Problem Feed-Forward Solves
### The Challenge
**Attention provides context, but we need to process and transform that information.**
Think of it like cooking:
- **Attention:** Gathers ingredients (context)
- **Feed-Forward:** Cooks and transforms ingredients (processing)
### The Solution: Feed-Forward Network
**Feed-Forward applies complex transformations to each position independently.**
---
## 3.2 What is Feed-Forward?
### Simple Definition
A **Feed-Forward Network (FFN)** is a two-layer neural network that:
1. **Expands** the input to a larger dimension
2. **Applies** a nonlinear transformation
3. **Compresses** back to original dimension
### Visual Analogy
**Think of it like a funnel:**
```
Input (512 dimensions)
       │
       ▼
┌─────────────┐
│   EXPAND    │
│ 512 → 2048  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  TRANSFORM  │
│   (GELU)    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  COMPRESS   │
│ 2048 → 512  │
└──────┬──────┘
       │
       ▼
Output (512 dimensions)
```
---
## 3.3 How Feed-Forward Works: Step-by-Step
### High-Level Overview
```
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
```
### Detailed Step-by-Step
#### Step 1: Expansion (First Linear Layer)
**Input:** Vector of size 512
**Output:** Vector of size 2048
**Mathematical Operation:**
```
H = X × W₁ + b₁
```
**Example:**
**Input X:**
```
[0.10, -0.20, 0.30, ..., 0.05] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Each column transforms input to one output dimension
```
**Process:**
```
H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0] + b₁[0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1] + b₁[1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047] + b₁[2047]
```
**Result:**
```
H = [0.12, -0.08, 0.25, ..., 0.18] (2048 numbers)
```
**Why Expand?**
- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"
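The expansion step is a matrix-vector product plus a bias. Below is a minimal pure-Python sketch of it, using toy sizes (3 → 6 instead of 512 → 2048) and made-up weights. The helper name `linear` and all numbers are assumptions for illustration, not taken from the SheepOp code.

```python
def linear(x, W, b):
    """Compute x × W + b for a vector x, W of shape [d_in, d_out], b of length d_out."""
    d_in, d_out = len(W), len(W[0])
    assert len(x) == d_in and len(b) == d_out
    # Output j is the dot product of x with column j of W, plus bias j.
    return [sum(x[i] * W[i][j] for i in range(d_in)) + b[j] for j in range(d_out)]

# Toy sizes (assumed for readability): 3 -> 6 instead of 512 -> 2048.
x = [0.10, -0.20, 0.30]
W1 = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 1.0, 1.0, 0.5],
    [0.0, 0.0, 1.0, 0.0, 1.0, 0.5],
]
b1 = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1]

h = linear(x, W1, b1)
print(len(x), "->", len(h))
print([round(v, 2) for v in h])
```

Each of the 6 outputs mixes all 3 inputs through its own column of weights, which is what "more room to work" means in practice.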
#### Step 2: Nonlinear Activation (GELU)
**Apply GELU to each element:**
**GELU Function:**
```
GELU(x) = x × Φ(x)
Where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution
```
**Simplified Understanding:**
- Large positive values → pass through almost unchanged (Φ(x) ≈ 1)
- Values near zero → roughly halved (Φ(0) = 0.5)
- Large negative values → pushed toward 0 (Φ(x) ≈ 0)
**Example:**
**Input H:**
```
H = [0.12, -0.08, 0.25, ..., 0.18]
```
**Apply GELU element-wise:**
```
GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.571 ≈ 0.103
```
**Result:**
```
H' = [0.066, -0.037, 0.150, ..., 0.103] (2048 numbers)
```
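The GELU values in this step can be checked against the exact definition. Below is a small sketch using only Python's standard library. It relies on the identity Φ(x) = 0.5 × (1 + erf(x / √2)); the function names `phi` and `gelu` are illustrative choices.

```python
import math

def phi(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """Exact GELU: x × Φ(x)."""
    return x * phi(x)

for v in (0.12, -0.08, 0.25, 0.18):
    print(f"GELU({v:+.2f}) = {v:+.2f} × {phi(v):.3f} = {gelu(v):+.3f}")
```

Many libraries also offer a faster tanh-based approximation of GELU, so values computed elsewhere may differ in the last digits.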
**Why Nonlinear?**
- Without an activation between them, two linear layers collapse into a single linear layer
- Nonlinearity enables complex function approximation
- Essential for learning patterns
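To see why the nonlinearity is needed, consider what happens without it: `(X × W₁ + b₁) × W₂ + b₂` equals `X × (W₁ × W₂) + (b₁ × W₂ + b₂)`, which is just one linear layer. The sketch below checks this numerically on toy matrices. All helper names and values are assumptions for illustration.

```python
def vec_mat(x, W):
    """Vector-matrix product x × W, with W of shape [len(x), d_out]."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def mat_mat(A, B):
    """Matrix-matrix product A × B, computed row by row."""
    return [vec_mat(row, B) for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Toy layers (assumed values): 2 -> 3 -> 2, with no activation in between.
W1 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
b1 = [0.5, 0.0, -0.5]
W2 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
b2 = [0.1, 0.2]

x = [0.3, -0.7]

# Two linear layers applied one after the other.
two_layers = add(vec_mat(add(vec_mat(x, W1), b1), W2), b2)

# The single equivalent layer: W = W1 × W2, b = b1 × W2 + b2.
W = mat_mat(W1, W2)
b = add(vec_mat(b1, W2), b2)
one_layer = add(vec_mat(x, W), b)

print(all(abs(p - q) < 1e-12 for p, q in zip(two_layers, one_layer)))
```

Placing GELU between the two layers breaks this equivalence, so the expanded dimensions can contribute something a single 512 → 512 layer could not.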
#### Step 3: Compression (Second Linear Layer)
**Input:** Vector of size 2048
**Output:** Vector of size 512
**Mathematical Operation:**
```
O = H' × W₂ + b₂
```
**Process:**
```
O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0] + b₂[0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1] + b₂[1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511] + b₂[511]
```
**Result:**
```
O = [0.15, -0.10, 0.22, ..., 0.12] (512 numbers)
```
**Why Compress?**
- Project back to original dimension
- Maintains consistent size throughout model
- Combines expanded features into compact representation
---
## 3.4 Complete Example: Feed-Forward on "Hello"
### Input
```
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
```
### Step-by-Step Processing
#### Step 1: Expansion
**Input X:**
```
[0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Values: Learned during training
```
**Compute:**
```
H = X × W₁ + b₁
```
**Result:**
```
H = [0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Visualization:**
```
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                   │    W₁    │
                   └──────────┘
```
#### Step 2: Activation
**Input H:**
```
[0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Apply GELU element-wise:**
```
GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109
```
**Result:**
```
H' = [0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘
```
#### Step 3: Compression
**Input H':**
```
[0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Weight Matrix W₂:**
```
Shape: [2048, 512]
Values: Learned during training
```
**Compute:**
```
O = H' × W₂ + b₂
```
**Result:**
```
O = [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │    W₂    │
                    └──────────┘
```
#### Final Output
```
Output: [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Meaning:** Transformed representation that captures processed features
---
## 3.5 Why Feed-Forward Matters
### Benefit 1: Feature Transformation
**Before FFN:**
```
Input: Raw attention output
Information: Contextual relationships
```
**After FFN:**
```
Output: Transformed features
Information: Processed and refined understanding
```
### Benefit 2: Non-Linear Processing
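**Linear operations** (such as a matrix multiplication, or the weighted averaging of value vectors that attention performs once its weights are computed) can only do limited transformations.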
**Non-linear operations** (like GELU in FFN) enable complex function learning.
**Analogy:**
- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, complex shapes
### Benefit 3: Position-Wise Processing
**FFN processes each position independently, using the same weights at every position:**
```
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
```
**Each word is transformed on its own, without looking at the other positions!**
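Position-wise processing can be sketched as a loop that applies one FFN, with one shared set of weights, to each position's vector separately. The toy sizes (2 → 4 → 2), the weight values, and the helper names below are assumptions for illustration only.

```python
import math

def gelu(x):
    """Exact GELU: x × Φ(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Compute x × W + b for a single vector x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Feed-forward on one position: expand, activate, compress."""
    return linear([gelu(v) for v in linear(x, W1, b1)], W2, b2)

def ffn_sequence(seq, params):
    """Apply the same FFN (same params) to every position independently."""
    return [ffn(x, *params) for x in seq]

# Hypothetical toy weights: 2 -> 4 -> 2.
W1 = [[0.5, -0.3, 0.8, 0.1], [0.2, 0.7, -0.6, 0.4]]
b1 = [0.0, 0.1, 0.0, -0.1]
W2 = [[0.3, -0.2], [0.1, 0.5], [-0.4, 0.2], [0.6, 0.1]]
b2 = [0.05, -0.05]
params = (W1, b1, W2, b2)

seq = [[0.146, 0.108], [0.30, -0.25]]  # position 0 ("Hello"), position 1 ("World")
out = ffn_sequence(seq, params)

# Replacing the vector at position 1 leaves the output at position 0 untouched.
changed = ffn_sequence([seq[0], [9.0, 9.0]], params)
print(out[0] == changed[0])  # True
```

Mixing information between positions is the job of attention. The FFN only refines what each position already holds.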
---
## 3.6 Complete Feed-Forward Formula
### Mathematical Expression
```
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
```
**Breaking it down:**
**Part 1: First Linear Transformation**
```
H = X × W₁ + b₁
```
- Expands from 512 to 2048 dimensions
**Part 2: Non-Linear Activation**
```
H' = GELU(H)
```
- Applies non-linear transformation
**Part 3: Second Linear Transformation**
```
O = H' × W₂ + b₂
```
- Compresses from 2048 back to 512 dimensions
**Complete:**
```
FFN(X) = O
```
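The three parts of the formula map directly onto three lines of code. The sketch below is a minimal pure-Python version, assuming small random weights, zero biases, and toy sizes (8 → 32 → 8) so that it runs quickly. The names `init_linear` and `ffn` are hypothetical and not taken from the SheepOp implementation.

```python
import math
import random

def gelu(x):
    """Exact GELU: x × Φ(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Compute x × W + b for a single vector x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def init_linear(d_in, d_out, rng):
    """Small random weights and zero bias (an arbitrary illustrative init)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return W, [0.0] * d_out

def ffn(x, W1, b1, W2, b2):
    h = linear(x, W1, b1)           # Part 1: H  = X × W₁ + b₁
    h_act = [gelu(v) for v in h]    # Part 2: H' = GELU(H)
    return linear(h_act, W2, b2)    # Part 3: O  = H' × W₂ + b₂

rng = random.Random(0)
d_model, d_ff = 8, 32               # toy sizes; the text uses 512 and 2048
W1, b1 = init_linear(d_model, d_ff, rng)
W2, b2 = init_linear(d_ff, d_model, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
out = ffn(x, W1, b1, W2, b2)
print(len(x), "->", d_ff, "->", len(out))
```

In a real model these loops would be replaced by batched tensor operations, but the computation is the same.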
---
## 3.7 Visual Representation
### Feed-Forward Pipeline
```
Input Vector (512D)
           │ [0.146, 0.108, 0.192, ..., 0.11]
           ▼
┌─────────────────────────────┐
│       Linear Layer 1        │
│   (512 → 2048 expansion)    │
│                             │
│       H = X × W₁ + b₁       │
└──────────┬──────────────────┘
           │ [0.21, -0.15, 0.28, ..., 0.19] (2048D)
           ▼
┌─────────────────────────────┐
│       GELU Activation       │
│ (Non-linear transformation) │
│                             │
│        H' = GELU(H)         │
└──────────┬──────────────────┘
           │ [0.122, -0.066, 0.171, ..., 0.109] (2048D)
           ▼
┌─────────────────────────────┐
│       Linear Layer 2        │
│  (2048 → 512 compression)   │
│                             │
│      O = H' × W₂ + b₂       │
└──────────┬──────────────────┘
           │ [0.18, -0.12, 0.24, ..., 0.14] (512D)
           ▼
Output Vector (512D)
```
### Dimension Flow
```
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
```
**Like a funnel:** Expand → Transform → Compress
---
## 3.8 Why Expand and Compress?
### The Expansion-Compression Strategy
**Why not stay at 512 dimensions?**
**Answer:** Expansion provides "working space"
**Analogy:**
- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy results back to small paper (512D)
**Benefits:**
1. **More capacity:** 2048 dimensions = more parameters to learn
2. **Better transformations:** More space = more complex functions
3. **Feature refinement:** Transformation happens in expanded space
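The "more capacity" point can be made concrete by counting parameters. The short sketch below does the arithmetic for the sizes used in this document; the function name `ffn_param_count` is an illustrative choice.

```python
def ffn_param_count(d_model, d_ff):
    """Count weights and biases in a two-layer FFN (d_model -> d_ff -> d_model)."""
    first_layer = d_model * d_ff + d_ff        # W₁ and b₁
    second_layer = d_ff * d_model + d_model    # W₂ and b₂
    return first_layer + second_layer

print(ffn_param_count(512, 2048))  # 2099712 (expand to 2048)
print(ffn_param_count(512, 512))   # 525312  (stay at 512)
```

With the 4x expansion, the FFN has roughly four times as many learnable parameters as it would without expanding.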
**Why compress back?**
**Answer:** Maintain consistent size throughout the model
- All layers use 512 dimensions
- Consistent size enables stacking layers
- Easier to manage and optimize
---
## 3.9 Key Takeaways: Feed-Forward
- **FFN transforms features through expansion and compression**
- **Expands to larger dimension for processing**
- **Applies non-linear transformation (GELU)**
- **Compresses back to original dimension**
- **Processes each position independently**
---
*This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.*