Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/

docs/FEED_FORWARD_EXPLAINED.md
# What is Feed-Forward? Step-by-Step Explanation

Complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.

## Table of Contents

1. [The Problem Feed-Forward Solves](#31-the-problem-feed-forward-solves)
2. [What is Feed-Forward?](#32-what-is-feed-forward)
3. [How Feed-Forward Works: Step-by-Step](#33-how-feed-forward-works-step-by-step)
4. [Complete Example: Feed-Forward on "Hello"](#34-complete-example-feed-forward-on-hello)
5. [Why Feed-Forward Matters](#35-why-feed-forward-matters)
6. [Complete Feed-Forward Formula](#36-complete-feed-forward-formula)
7. [Visual Representation](#37-visual-representation)
8. [Why Expand and Compress?](#38-why-expand-and-compress)
9. [Key Takeaways](#39-key-takeaways-feed-forward)

---

## 3.1 The Problem Feed-Forward Solves

### The Challenge

**Attention provides context, but we need to process and transform that information.**

Think of it like cooking:

- **Attention:** Gathers ingredients (context)
- **Feed-Forward:** Cooks and transforms ingredients (processing)

### The Solution: Feed-Forward Network

**Feed-Forward applies the same learned non-linear transformation to each position independently.**

---

## 3.2 What is Feed-Forward?

### Simple Definition

A **Feed-Forward Network (FFN)** is a two-layer neural network that:

1. **Expands** the input to a larger dimension
2. **Applies** a nonlinear transformation
3. **Compresses** back to the original dimension

### Visual Analogy

**Think of it as widening the vector, working on it, then narrowing it again:**

```
Input (512 dimensions)
       ↓
┌─────────────┐
│   EXPAND    │
│ 512 → 2048  │
└──────┬──────┘
       ↓
┌─────────────┐
│  TRANSFORM  │
│   (GELU)    │
└──────┬──────┘
       ↓
┌─────────────┐
│  COMPRESS   │
│ 2048 → 512  │
└──────┬──────┘
       ↓
Output (512 dimensions)
```

---

## 3.3 How Feed-Forward Works: Step-by-Step

### High-Level Overview

```
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
```

### Detailed Step-by-Step

#### Step 1: Expansion (First Linear Layer)

**Input:** Vector of size 512
**Output:** Vector of size 2048

**Mathematical Operation:**
```
H = X × W₁ + b₁
```

**Example:**

**Input X:**
```
[0.10, -0.20, 0.30, ..., 0.05] (512 numbers)
```

**Weight Matrix W₁:**
```
Shape: [512, 2048]
Each column holds the weights that produce one output dimension
```

**Process:**
```
H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0] + b₁[0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1] + b₁[1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047] + b₁[2047]
```

**Result:**
```
H = [0.12, -0.08, 0.25, ..., 0.18] (2048 numbers)
```

**Why Expand?**
- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"

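
The expansion step can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the helper name `linear` and the toy sizes (4 inputs and 8 outputs instead of 512 and 2048) are assumptions chosen for readability, and the weights are random rather than learned.

```python
import random

def linear(x, W, b):
    """Compute x × W + b for a single vector x.

    x: list of d_in floats
    W: d_in rows, each a list of d_out floats (shape [d_in, d_out])
    b: list of d_out floats
    """
    d_in, d_out = len(W), len(W[0])
    assert len(x) == d_in and len(b) == d_out
    # Output j is the dot product of x with column j of W, plus bias j.
    return [sum(x[i] * W[i][j] for i in range(d_in)) + b[j] for j in range(d_out)]

# Toy sizes standing in for 512 -> 2048 (assumption for illustration).
random.seed(0)
d_model, d_ff = 4, 8
W1 = [[random.uniform(-0.1, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff

x = [0.10, -0.20, 0.30, 0.05]
h = linear(x, W1, b1)
print(len(x), "->", len(h))
```

The same function covers the compression step later on: only the shape of the weight matrix changes.
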
#### Step 2: Nonlinear Activation (GELU)

**Apply GELU to each element:**

**GELU Function:**
```
GELU(x) = x × Φ(x)

Where Φ(x) is the cumulative distribution function of the standard normal distribution
```

**Simplified Understanding:**
- Large positive values → pass through almost unchanged
- Values near zero → scaled by roughly one half, since Φ(0) = 0.5
- Large negative values → pushed close to 0

**Example:**

**Input H:**
```
H = [0.12, -0.08, 0.25, ..., 0.18]
```

**Apply GELU element-wise:**

```
GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.571 ≈ 0.103
```

**Result:**
```
H' = [0.066, -0.037, 0.150, ..., 0.103] (2048 numbers)
```

**Why Nonlinear?**
- Without it, the two linear layers would collapse into a single linear layer
- Nonlinearity enables complex function approximation
- Essential for learning patterns

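
The GELU values above can be reproduced with the standard library. The sketch below assumes the exact definition `x × Φ(x)` and writes Φ using `math.erf`. The `gelu_tanh` helper is the widely used tanh approximation, included only for comparison; this document does not say which variant the model uses.

```python
import math

def gelu(x):
    """Exact GELU: x × Φ(x), with Φ the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    """Common tanh approximation of GELU (shown for comparison only)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# The four sample entries of H from the example above.
for v in [0.12, -0.08, 0.25, 0.18]:
    print(f"GELU({v:+.2f}) = {gelu(v):+.3f}   tanh approx = {gelu_tanh(v):+.3f}")

# Far from zero: large positives pass through, large negatives are pushed toward 0.
print(round(gelu(3.0), 3), round(gelu(-3.0), 3))
```

For inputs of the size used in this example, the two variants agree to about three decimal places.
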
#### Step 3: Compression (Second Linear Layer)

**Input:** Vector of size 2048
**Output:** Vector of size 512

**Mathematical Operation:**
```
O = H' × W₂ + b₂
```

**Process:**
```
O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0] + b₂[0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1] + b₂[1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511] + b₂[511]
```

**Result:**
```
O = [0.15, -0.10, 0.22, ..., 0.12] (512 numbers)
```

**Why Compress?**
- Projects back to the original dimension
- Maintains a consistent size throughout the model
- Combines the expanded features into a compact representation

---

## 3.4 Complete Example: Feed-Forward on "Hello"

### Input

```
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
```

### Step-by-Step Processing

#### Step 1: Expansion

**Input X:**
```
[0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```

**Weight Matrix W₁:**
```
Shape: [512, 2048]
Values: Learned during training
```

**Compute:**
```
H = X × W₁ + b₁
```

**Result:**
```
H = [0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```

**Visualization:**
```
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                   │    W₁    │
                   └──────────┘
```

#### Step 2: Activation

**Input H:**
```
[0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```

**Apply GELU element-wise:**

```
GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109
```

**Result:**
```
H' = [0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```

**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘
```

#### Step 3: Compression

**Input H':**
```
[0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```

**Weight Matrix W₂:**
```
Shape: [2048, 512]
Values: Learned during training
```

**Compute:**
```
O = H' × W₂ + b₂
```

**Result:**
```
O = [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```

**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │    W₂    │
                    └──────────┘
```

#### Final Output

```
Output: [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```

**Meaning:** Transformed representation that captures processed features

---

## 3.5 Why Feed-Forward Matters

### Benefit 1: Feature Transformation

**Before FFN:**
```
Input: Raw attention output
Information: Contextual relationships
```

**After FFN:**
```
Output: Transformed features
Information: Processed and refined understanding
```

### Benefit 2: Non-Linear Processing

**Linear layers** alone can only represent linear maps, and stacking them adds nothing: two linear layers in a row are equivalent to one.
**Non-linear operations** (like GELU in the FFN) placed between the linear layers enable complex function learning.

**Analogy:**
- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, complex shapes

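
The limitation of purely linear processing can be checked directly. The sketch below uses tiny hand-picked matrices (illustrative values, not learned weights) and omits biases for brevity. The helper names `matvec` and `matmul` are assumptions introduced for this example.

```python
import math

def matvec(x, W):
    """x × W for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def matmul(A, B):
    """A × B for matrices given as lists of rows."""
    return [matvec(row, B) for row in A]

def gelu(v):
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

# Tiny hand-picked weights: 2 -> 3 -> 2.
W1 = [[1.0, -2.0, 0.5],
      [0.0,  1.0, 1.5]]
W2 = [[ 1.0, 0.0],
      [-1.0, 2.0],
      [ 0.5, 1.0]]
x = [1.0, 2.0]

# Without an activation, two layers collapse into the single matrix W1 × W2.
two_layers = matvec(matvec(x, W1), W2)
one_layer = matvec(x, matmul(W1, W2))
print(two_layers, one_layer)

# With GELU in between, the result no longer matches that single linear map.
with_gelu = matvec([gelu(h) for h in matvec(x, W1)], W2)
print(with_gelu)
```

The first two printed vectors should be identical, while the third differs. This is why the activation is needed: the extra 2048-dimensional layer would otherwise add no expressive power.
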
### Benefit 3: Position-Wise Processing

**FFN processes each position independently:**

```
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
```

**Each position is transformed on its own, using the same weights W₁ and W₂ at every position.**

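
Position-wise processing can be demonstrated with a toy sequence of two positions. This is a sketch under stated assumptions: the 2 → 4 → 2 sizes and the weight values are made up for illustration, and `ffn` is a hypothetical helper name. The same weights are applied at every position, and changing the vector at position 1 should leave the output at position 0 untouched.

```python
import math

def gelu(v):
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, W, b):
    return [sum(x[i] * W[i][j] for i in range(len(W))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN for one position: expand, apply GELU, compress."""
    return linear([gelu(h) for h in linear(x, W1, b1)], W2, b2)

# Toy weights, 2 -> 4 -> 2 (illustrative values, not learned).
W1 = [[0.5, -0.3, 0.8, 0.1], [0.2, 0.7, -0.6, 0.4]]
b1 = [0.0, 0.1, 0.0, -0.1]
W2 = [[0.3, -0.2], [0.6, 0.1], [-0.4, 0.5], [0.2, 0.9]]
b2 = [0.05, -0.05]

# A two-position sequence, standing in for "Hello" and "World".
sequence = [[0.146, 0.108], [0.300, -0.200]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]  # same weights at every position

# Replace the vector at position 1 and run the FFN again.
changed = [sequence[0], [9.0, 9.0]]
outputs_changed = [ffn(x, W1, b1, W2, b2) for x in changed]

print(outputs[0] == outputs_changed[0])  # position 0 is unaffected
print(outputs[1] != outputs_changed[1])  # position 1 changed
```
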
---

## 3.6 Complete Feed-Forward Formula

### Mathematical Expression

```
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
```

**Breaking it down:**

**Part 1: First Linear Transformation**
```
H = X × W₁ + b₁
```
- Expands from 512 to 2048 dimensions

**Part 2: Non-Linear Activation**
```
H' = GELU(H)
```
- Applies non-linear transformation

**Part 3: Second Linear Transformation**
```
O = H' × W₂ + b₂
```
- Compresses from 2048 back to 512 dimensions

**Complete:**
```
FFN(X) = O
```

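
The three parts can be put together in a few lines of plain Python. This is a minimal sketch rather than the project's actual module: the names `feed_forward` and `random_matrix` are hypothetical, the weights are random instead of learned, and the sizes 8 and 32 are toy values that keep the same 4× ratio as 512 and 2048.

```python
import math
import random

def gelu(v):
    """Exact GELU: v × Φ(v)."""
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, W, b):
    """x × W + b, with W of shape [len(x), len(b)]."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    h = linear(x, W1, b1)           # Part 1: H  = X × W₁ + b₁
    h_act = [gelu(v) for v in h]    # Part 2: H' = GELU(H)
    return linear(h_act, W2, b2)    # Part 3: O  = H' × W₂ + b₂

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Toy sizes with the same 4x ratio as 512 -> 2048 (assumption for illustration).
d_model, d_ff = 8, 32
rng = random.Random(0)
W1, b1 = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, d_model, rng), [0.0] * d_model

x = [rng.uniform(-0.3, 0.3) for _ in range(d_model)]
out = feed_forward(x, W1, b1, W2, b2)
print(len(x), len(out))  # input and output share the same dimension
```
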
---

## 3.7 Visual Representation

### Feed-Forward Pipeline

```
Input Vector (512D)
           │
           │ [0.146, 0.108, 0.192, ..., 0.11]
           ↓
┌─────────────────────────────┐
│ Linear Layer 1              │
│ (512 → 2048 expansion)      │
│                             │
│ H = X × W₁ + b₁             │
└──────────┬──────────────────┘
           │
           │ [0.21, -0.15, 0.28, ..., 0.19] (2048D)
           ↓
┌─────────────────────────────┐
│ GELU Activation             │
│ (Non-linear transformation) │
│                             │
│ H' = GELU(H)                │
└──────────┬──────────────────┘
           │
           │ [0.122, -0.066, 0.171, ..., 0.109] (2048D)
           ↓
┌─────────────────────────────┐
│ Linear Layer 2              │
│ (2048 → 512 compression)    │
│                             │
│ O = H' × W₂ + b₂            │
└──────────┬──────────────────┘
           │
           │ [0.18, -0.12, 0.24, ..., 0.14] (512D)
           ↓
Output Vector (512D)
```

### Dimension Flow

```
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
```

**In short:** Expand → Transform → Compress

---

## 3.8 Why Expand and Compress?

### The Expansion-Compression Strategy

**Why not stay at 512 dimensions?**

**Answer:** Expansion provides "working space"

**Analogy:**
- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy results back to small paper (512D)

**Benefits:**
1. **More capacity:** A 2048-dimensional hidden layer means larger weight matrices, so more parameters to learn
2. **Better transformations:** More space = more complex functions
3. **Feature refinement:** Transformation happens in expanded space

**Why compress back?**

**Answer:** Maintain consistent size throughout the model

- All layers use 512 dimensions
- Consistent size enables stacking layers
- Easier to manage and optimize

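
The "more capacity" point can be made concrete by counting parameters. The short calculation below assumes the sizes used throughout this document (512 and 2048) and includes both bias vectors, as in the formula `GELU(X × W₁ + b₁) × W₂ + b₂`. The `same_width` figure is a hypothetical comparison: a two-layer network that stays at 512 dimensions.

```python
d_model, d_ff = 512, 2048

expand_params = d_model * d_ff + d_ff        # W₁ plus b₁
compress_params = d_ff * d_model + d_model   # W₂ plus b₂
total = expand_params + compress_params

# Hypothetical alternative: two 512 -> 512 layers with no expansion.
same_width = 2 * (d_model * d_model + d_model)

print(expand_params)    # 1050624
print(compress_params)  # 1049088
print(total)            # 2099712
print(same_width)       # 525312
```
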
---

## 3.9 Key Takeaways: Feed-Forward

- ✅ **FFN transforms features through expansion and compression**
- ✅ **Expands to larger dimension for processing**
- ✅ **Applies non-linear transformation (GELU)**
- ✅ **Compresses back to original dimension**
- ✅ **Processes each position independently**

---

*This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.*