Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00
60 changed files with 25153 additions and 0 deletions

# What is Feed-Forward? Step-by-Step Explanation
Complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.
## Table of Contents
1. [The Problem Feed-Forward Solves](#31-the-problem-feed-forward-solves)
2. [What is Feed-Forward?](#32-what-is-feed-forward)
3. [How Feed-Forward Works: Step-by-Step](#33-how-feed-forward-works-step-by-step)
4. [Complete Example: Feed-Forward on "Hello"](#34-complete-example-feed-forward-on-hello)
5. [Why Feed-Forward Matters](#35-why-feed-forward-matters)
6. [Complete Feed-Forward Formula](#36-complete-feed-forward-formula)
7. [Visual Representation](#37-visual-representation)
8. [Why Expand and Compress?](#38-why-expand-and-compress)
9. [Key Takeaways](#39-key-takeaways-feed-forward)
---
## 3.1 The Problem Feed-Forward Solves
### The Challenge
**Attention provides context, but we need to process and transform that information.**
Think of it like cooking:
- **Attention:** Gathers ingredients (context)
- **Feed-Forward:** Cooks and transforms ingredients (processing)
### The Solution: Feed-Forward Network
**Feed-Forward applies complex transformations to each position independently.**
---
## 3.2 What is Feed-Forward?
### Simple Definition
A **Feed-Forward Network (FFN)** is a two-layer neural network that:
1. **Expands** the input to a larger dimension
2. **Applies** a nonlinear transformation
3. **Compresses** back to the original dimension
### Visual Analogy
**Think of it like a funnel:**
```
Input (512 dimensions)
       │
┌─────────────┐
│   EXPAND    │
│  512 → 2048 │
└──────┬──────┘
       │
┌─────────────┐
│  TRANSFORM  │
│   (GELU)    │
└──────┬──────┘
       │
┌─────────────┐
│  COMPRESS   │
│  2048 → 512 │
└──────┬──────┘
       │
Output (512 dimensions)
```
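The funnel can be sketched in a few lines of NumPy. This is a minimal shape-level sketch, assuming random (untrained) weights, the 512/2048 sizes used throughout this document, and the common tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (close to the exact x * Φ(x))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))  # expand:   512 -> 2048
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))  # compress: 2048 -> 512

x = rng.normal(size=d_model)     # one input vector
h = gelu(x @ W1)                 # expand, then transform
o = h @ W2                       # compress back
print(h.shape, o.shape)          # (2048,) (512,)
```

The shapes trace the funnel exactly: the hidden vector lives in 2048 dimensions, the output returns to 512.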
---
## 3.3 How Feed-Forward Works: Step-by-Step
### High-Level Overview
```
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
```
### Detailed Step-by-Step
#### Step 1: Expansion (First Linear Layer)
**Input:** Vector of size 512
**Output:** Vector of size 2048
**Mathematical Operation:**
```
H = X × W₁ + b₁
```
**Example:**
**Input X:**
```
[0.10, -0.20, 0.30, ..., 0.05] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Each column transforms input to one output dimension
```
**Process:**
```
H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047]
```
**Result:**
```
H = [0.12, -0.08, 0.25, ..., 0.18] (2048 numbers)
```
**Why Expand?**
- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"
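The element-wise sums written out above are exactly a matrix-vector product. A tiny NumPy sketch (with made-up 4 → 6 dimensions standing in for 512 → 2048) confirms the equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=4)         # stand-in for the 512-dim input
W1 = rng.normal(size=(4, 6))   # stand-in for the [512, 2048] weight matrix
b1 = rng.normal(size=6)

# H[j] = X[0]*W1[0,j] + X[1]*W1[1,j] + ... + b1[j], as spelled out above
H_manual = np.array([sum(X[i] * W1[i, j] for i in range(4)) + b1[j]
                     for j in range(6)])
H_matmul = X @ W1 + b1         # the same computation as one matrix product

assert np.allclose(H_manual, H_matmul)
print(H_matmul.shape)          # (6,)
```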
#### Step 2: Nonlinear Activation (GELU)
**Apply GELU to each element:**
**GELU Function:**
```
GELU(x) = x × Φ(x)
Where Φ(x) is the cumulative distribution function of the standard normal distribution
```
**Simplified Understanding:**
- Large positive values → pass through almost unchanged
- Values near zero → scaled down by roughly half
- Negative values → suppressed toward zero (GELU never drops below about −0.17)
**Example:**
**Input H:**
```
H = [0.12, -0.08, 0.25, ..., 0.18]
```
**Apply GELU element-wise:**
```
GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.572 ≈ 0.103
```
**Result:**
```
H' = [0.066, -0.037, 0.150, ..., 0.103] (2048 numbers)
```
**Why Nonlinear?**
- Linear transformations alone can only do so much: two stacked linear layers collapse into a single linear map
- Nonlinearity enables complex function approximation
- Essential for learning patterns beyond straight-line relationships
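The worked values above can be checked directly with the exact GELU, using only the standard library's error function:

```python
import math

def gelu(x: float) -> float:
    # exact GELU: x * Φ(x), where Φ is the standard normal CDF
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

# reproduce the values from the example above
for v in (0.12, -0.08, 0.25, 0.18):
    print(f"GELU({v}) ≈ {gelu(v):.3f}")
# GELU(0.12) ≈ 0.066
# GELU(-0.08) ≈ -0.037
# GELU(0.25) ≈ 0.150
# GELU(0.18) ≈ 0.103
```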
#### Step 3: Compression (Second Linear Layer)
**Input:** Vector of size 2048
**Output:** Vector of size 512
**Mathematical Operation:**
```
O = H' × W₂ + b₂
```
**Process:**
```
O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511]
```
**Result:**
```
O = [0.15, -0.10, 0.22, ..., 0.12] (512 numbers)
```
**Why Compress?**
- Project back to original dimension
- Maintains consistent size throughout model
- Combines expanded features into compact representation
---
## 3.4 Complete Example: Feed-Forward on "Hello"
### Input
```
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
```
### Step-by-Step Processing
#### Step 1: Expansion
**Input X:**
```
[0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Values: Learned during training
```
**Compute:**
```
H = X × W₁
```
**Result:**
```
H = [0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Visualization:**
```
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                   │    W₁    │
                   └──────────┘
```
#### Step 2: Activation
**Input H:**
```
[0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Apply GELU element-wise:**
```
GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109
```
**Result:**
```
H' = [0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘
```
#### Step 3: Compression
**Input H':**
```
[0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Weight Matrix W₂:**
```
Shape: [2048, 512]
Values: Learned during training
```
**Compute:**
```
O = H' × W₂
```
**Result:**
```
O = [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │    W₂    │
                    └──────────┘
```
#### Final Output
```
Output: [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Meaning:** Transformed representation that captures processed features
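The three steps above chain into one forward pass. A hedged NumPy sketch of the full computation FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂, with randomly initialized weights (so the numbers will differ from the illustrative values above):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, close to the exact x * Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    h = x @ W1 + b1        # Step 1: expand (512 -> 2048)
    h = gelu(h)            # Step 2: nonlinear activation
    return h @ W2 + b2     # Step 3: compress (2048 -> 512)

d_model, d_ff = 512, 2048
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model)); b2 = np.zeros(d_model)

x = rng.normal(size=d_model)   # stand-in for the attention output for "Hello"
out = ffn(x, W1, b1, W2, b2)
print(out.shape)               # (512,)
```

The output has the same size as the input, which is what lets FFN blocks slot into a stack of identical layers.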
---
## 3.5 Why Feed-Forward Matters
### Benefit 1: Feature Transformation
**Before FFN:**
```
Input: Raw attention output
Information: Contextual relationships
```
**After FFN:**
```
Output: Transformed features
Information: Processed and refined understanding
```
### Benefit 2: Non-Linear Processing
**Linear operations** (matrix multiplications on their own) can only produce limited transformations.
**Non-linear operations** (like GELU in FFN) enable complex function learning.
**Analogy:**
- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, complex shapes
### Benefit 3: Position-Wise Processing
**FFN processes each position independently:**
```
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
```
**Each word gets its own transformation!**
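This independence is easy to verify: applying the FFN to a whole sequence matrix gives exactly the same result as applying it to each position on its own. A small NumPy sketch, with made-up tiny sizes and a ReLU stand-in for GELU:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_ff = 8, 32                    # tiny stand-ins for 512 / 2048
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2  # ReLU stand-in for GELU, no biases

seq = rng.normal(size=(2, d_model))      # row 0: "Hello", row 1: "World"
whole_seq = ffn(seq)                               # whole sequence at once
one_by_one = np.stack([ffn(seq[0]), ffn(seq[1])])  # each position separately

# identical results: the FFN never mixes information across positions
assert np.allclose(whole_seq, one_by_one)
```

Mixing information across positions is attention's job; the FFN only transforms each position's vector.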
---
## 3.6 Complete Feed-Forward Formula
### Mathematical Expression
```
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
```
**Breaking it down:**
**Part 1: First Linear Transformation**
```
H = X × W₁ + b₁
```
- Expands from 512 to 2048 dimensions
**Part 2: Non-Linear Activation**
```
H' = GELU(H)
```
- Applies non-linear transformation
**Part 3: Second Linear Transformation**
```
O = H' × W₂ + b₂
```
- Compresses from 2048 back to 512 dimensions
**Complete:**
```
FFN(X) = O
```
---
## 3.7 Visual Representation
### Feed-Forward Pipeline
```
Input Vector (512D)
               │  [0.146, 0.108, 0.192, ..., 0.11]
┌─────────────────────────────┐
│       Linear Layer 1        │
│   (512 → 2048 expansion)    │
│                             │
│         H = X × W₁          │
└──────────────┬──────────────┘
               │  [0.21, -0.15, 0.28, ..., 0.19]  (2048D)
┌─────────────────────────────┐
│       GELU Activation       │
│ (Non-linear transformation) │
│                             │
│        H' = GELU(H)         │
└──────────────┬──────────────┘
               │  [0.122, -0.066, 0.171, ..., 0.109]  (2048D)
┌─────────────────────────────┐
│       Linear Layer 2        │
│  (2048 → 512 compression)   │
│                             │
│        O = H' × W₂          │
└──────────────┬──────────────┘
               │  [0.18, -0.12, 0.24, ..., 0.14]  (512D)
Output Vector (512D)
```
### Dimension Flow
```
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
```
**Like a funnel:** Expand → Transform → Compress
---
## 3.8 Why Expand and Compress?
### The Expansion-Compression Strategy
**Why not stay at 512 dimensions?**
**Answer:** Expansion provides "working space"
**Analogy:**
- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy results back to small paper (512D)
**Benefits:**
1. **More capacity:** 2048 dimensions = more parameters to learn
2. **Better transformations:** More space = more complex functions
3. **Feature refinement:** Transformation happens in expanded space
**Why compress back?**
**Answer:** Maintain consistent size throughout the model
- All layers use 512 dimensions
- Consistent size enables stacking layers
- Easier to manage and optimize
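"More parameters to learn" can be made concrete: with the 512/2048 sizes assumed in this document, the two weight matrices plus biases of a single FFN add up to roughly 2.1M parameters:

```python
d_model, d_ff = 512, 2048

# W1 [512, 2048] + b1 [2048] + W2 [2048, 512] + b2 [512]
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
print(ffn_params)  # 2099712
```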
---
## 3.9 Key Takeaways: Feed-Forward
- **FFN transforms features through expansion and compression**
- **Expands to a larger dimension for processing**
- **Applies a non-linear transformation (GELU)**
- **Compresses back to the original dimension**
- **Processes each position independently**
---
*This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.*