Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00
60 changed files with 25153 additions and 0 deletions

# What is Feed-Forward? Step-by-Step Explanation
Complete step-by-step explanation of feed-forward networks in transformer models: how models transform and refine features.
## Table of Contents
1. [The Problem Feed-Forward Solves](#31-the-problem-feed-forward-solves)
2. [What is Feed-Forward?](#32-what-is-feed-forward)
3. [How Feed-Forward Works: Step-by-Step](#33-how-feed-forward-works-step-by-step)
4. [Complete Example: Feed-Forward on "Hello"](#34-complete-example-feed-forward-on-hello)
5. [Why Feed-Forward Matters](#35-why-feed-forward-matters)
6. [Complete Feed-Forward Formula](#36-complete-feed-forward-formula)
7. [Visual Representation](#37-visual-representation)
8. [Why Expand and Compress?](#38-why-expand-and-compress)
9. [Key Takeaways](#39-key-takeaways-feed-forward)
---
## 3.1 The Problem Feed-Forward Solves
### The Challenge
**Attention provides context, but we need to process and transform that information.**
Think of it like cooking:
- **Attention:** Gathers ingredients (context)
- **Feed-Forward:** Cooks and transforms ingredients (processing)
### The Solution: Feed-Forward Network
**Feed-Forward applies complex transformations to each position independently.**
---
## 3.2 What is Feed-Forward?
### Simple Definition
A **Feed-Forward Network (FFN)** is a two-layer neural network that:
1. **Expands** the input to a larger dimension
2. **Applies** a nonlinear transformation
3. **Compresses** back to the original dimension
### Visual Analogy
**Think of it like a funnel:**
```
Input (512 dimensions)
       │
┌─────────────┐
│   EXPAND    │
│  512 → 2048 │
└──────┬──────┘
       │
┌─────────────┐
│  TRANSFORM  │
│   (GELU)    │
└──────┬──────┘
       │
┌─────────────┐
│  COMPRESS   │
│  2048 → 512 │
└──────┬──────┘
       │
Output (512 dimensions)
```
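The funnel can be sketched in a few lines of NumPy. This is a minimal shape-level sketch, assuming random (untrained) weights, the 512/2048 sizes used throughout this document, and the common tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (close to the exact x * Φ(x))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))  # expand:   512 -> 2048
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))  # compress: 2048 -> 512

x = rng.normal(size=d_model)     # one input vector
h = gelu(x @ W1)                 # expand, then transform
o = h @ W2                       # compress back
print(h.shape, o.shape)          # (2048,) (512,)
```

The shapes trace the funnel exactly: the hidden vector lives in 2048 dimensions, the output returns to 512.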
---
## 3.3 How Feed-Forward Works: Step-by-Step
### High-Level Overview
```
Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)
```
### Detailed Step-by-Step
#### Step 1: Expansion (First Linear Layer)
**Input:** Vector of size 512
**Output:** Vector of size 2048
**Mathematical Operation:**
```
H = X × W₁ + b₁
```
**Example:**
**Input X:**
```
[0.10, -0.20, 0.30, ..., 0.05] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Each column transforms input to one output dimension
```
**Process:**
```
H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047]
```
**Result:**
```
H = [0.12, -0.08, 0.25, ..., 0.18] (2048 numbers)
```
**Why Expand?**
- More dimensions = more capacity for complex transformations
- Allows the model to learn intricate patterns
- Think of it as "more room to work"
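The element-wise sums written out above are exactly a matrix-vector product. A tiny NumPy sketch (with made-up 4 → 6 dimensions standing in for 512 → 2048) confirms the equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=4)         # stand-in for the 512-dim input
W1 = rng.normal(size=(4, 6))   # stand-in for the [512, 2048] weight matrix
b1 = rng.normal(size=6)

# H[j] = X[0]*W1[0,j] + X[1]*W1[1,j] + ... + b1[j], as spelled out above
H_manual = np.array([sum(X[i] * W1[i, j] for i in range(4)) + b1[j]
                     for j in range(6)])
H_matmul = X @ W1 + b1         # the same computation as one matrix product

assert np.allclose(H_manual, H_matmul)
print(H_matmul.shape)          # (6,)
```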
#### Step 2: Nonlinear Activation (GELU)
**Apply GELU to each element:**
**GELU Function:**
```
GELU(x) = x × Φ(x)
Where Φ(x) is the cumulative distribution function of the standard normal distribution
```
**Simplified Understanding:**
- Large positive values → pass through almost unchanged
- Values near zero → scaled down by roughly half
- Negative values → suppressed toward zero (GELU never drops below about −0.17)
**Example:**
**Input H:**
```
H = [0.12, -0.08, 0.25, ..., 0.18]
```
**Apply GELU element-wise:**
```
GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.572 ≈ 0.103
```
**Result:**
```
H' = [0.066, -0.037, 0.150, ..., 0.103] (2048 numbers)
```
**Why Nonlinear?**
- Linear transformations alone can only do so much: two stacked linear layers collapse into a single linear map
- Nonlinearity enables complex function approximation
- Essential for learning patterns beyond straight-line relationships
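The worked values above can be checked directly with the exact GELU, using only the standard library's error function:

```python
import math

def gelu(x: float) -> float:
    # exact GELU: x * Φ(x), where Φ is the standard normal CDF
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

# reproduce the values from the example above
for v in (0.12, -0.08, 0.25, 0.18):
    print(f"GELU({v}) ≈ {gelu(v):.3f}")
# GELU(0.12) ≈ 0.066
# GELU(-0.08) ≈ -0.037
# GELU(0.25) ≈ 0.150
# GELU(0.18) ≈ 0.103
```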
#### Step 3: Compression (Second Linear Layer)
**Input:** Vector of size 2048
**Output:** Vector of size 512
**Mathematical Operation:**
```
O = H' × W₂ + b₂
```
**Process:**
```
O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511]
```
**Result:**
```
O = [0.15, -0.10, 0.22, ..., 0.12] (512 numbers)
```
**Why Compress?**
- Project back to original dimension
- Maintains consistent size throughout model
- Combines expanded features into compact representation
---
## 3.4 Complete Example: Feed-Forward on "Hello"
### Input
```
Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512
```
### Step-by-Step Processing
#### Step 1: Expansion
**Input X:**
```
[0.146, 0.108, 0.192, ..., 0.11] (512 numbers)
```
**Weight Matrix W₁:**
```
Shape: [512, 2048]
Values: Learned during training
```
**Compute:**
```
H = X × W₁
```
**Result:**
```
H = [0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Visualization:**
```
512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                   │    W₁    │
                   └──────────┘
```
#### Step 2: Activation
**Input H:**
```
[0.21, -0.15, 0.28, ..., 0.19] (2048 numbers)
```
**Apply GELU element-wise:**
```
GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109
```
**Result:**
```
H' = [0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘
```
#### Step 3: Compression
**Input H':**
```
[0.122, -0.066, 0.171, ..., 0.109] (2048 numbers)
```
**Weight Matrix W₂:**
```
Shape: [2048, 512]
Values: Learned during training
```
**Compute:**
```
O = H' × W₂
```
**Result:**
```
O = [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Visualization:**
```
2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │    W₂    │
                    └──────────┘
```
#### Final Output
```
Output: [0.18, -0.12, 0.24, ..., 0.14] (512 numbers)
```
**Meaning:** Transformed representation that captures processed features
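The three steps above chain into one forward pass. A hedged NumPy sketch of the full computation FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂, with randomly initialized weights (so the numbers will differ from the illustrative values above):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, close to the exact x * Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    h = x @ W1 + b1        # Step 1: expand (512 -> 2048)
    h = gelu(h)            # Step 2: nonlinear activation
    return h @ W2 + b2     # Step 3: compress (2048 -> 512)

d_model, d_ff = 512, 2048
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model)); b2 = np.zeros(d_model)

x = rng.normal(size=d_model)   # stand-in for the attention output for "Hello"
out = ffn(x, W1, b1, W2, b2)
print(out.shape)               # (512,)
```

The output has the same size as the input, which is what lets FFN blocks slot into a stack of identical layers.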
---
## 3.5 Why Feed-Forward Matters
### Benefit 1: Feature Transformation
**Before FFN:**
```
Input: Raw attention output
Information: Contextual relationships
```
**After FFN:**
```
Output: Transformed features
Information: Processed and refined understanding
```
### Benefit 2: Non-Linear Processing
**Linear operations** (matrix multiplications on their own) can only produce limited transformations.
**Non-linear operations** (like GELU in FFN) enable complex function learning.
**Analogy:**
- Linear: Can only draw straight lines
- Non-linear: Can draw curves, circles, complex shapes
### Benefit 3: Position-Wise Processing
**FFN processes each position independently:**
```
Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation
```
**Each word gets its own transformation!**
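This independence is easy to verify: applying the FFN to a whole sequence matrix gives exactly the same result as applying it to each position on its own. A small NumPy sketch, with made-up tiny sizes and a ReLU stand-in for GELU:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_ff = 8, 32                    # tiny stand-ins for 512 / 2048
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2  # ReLU stand-in for GELU, no biases

seq = rng.normal(size=(2, d_model))      # row 0: "Hello", row 1: "World"
whole_seq = ffn(seq)                               # whole sequence at once
one_by_one = np.stack([ffn(seq[0]), ffn(seq[1])])  # each position separately

# identical results: the FFN never mixes information across positions
assert np.allclose(whole_seq, one_by_one)
```

Mixing information across positions is attention's job; the FFN only transforms each position's vector.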
---
## 3.6 Complete Feed-Forward Formula
### Mathematical Expression
```
FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂
```
**Breaking it down:**
**Part 1: First Linear Transformation**
```
H = X × W₁ + b₁
```
- Expands from 512 to 2048 dimensions
**Part 2: Non-Linear Activation**
```
H' = GELU(H)
```
- Applies non-linear transformation
**Part 3: Second Linear Transformation**
```
O = H' × W₂ + b₂
```
- Compresses from 2048 back to 512 dimensions
**Complete:**
```
FFN(X) = O
```
---
## 3.7 Visual Representation
### Feed-Forward Pipeline
```
Input Vector (512D)
               │  [0.146, 0.108, 0.192, ..., 0.11]
┌─────────────────────────────┐
│       Linear Layer 1        │
│   (512 → 2048 expansion)    │
│                             │
│         H = X × W₁          │
└──────────────┬──────────────┘
               │  [0.21, -0.15, 0.28, ..., 0.19]  (2048D)
┌─────────────────────────────┐
│       GELU Activation       │
│ (Non-linear transformation) │
│                             │
│        H' = GELU(H)         │
└──────────────┬──────────────┘
               │  [0.122, -0.066, 0.171, ..., 0.109]  (2048D)
┌─────────────────────────────┐
│       Linear Layer 2        │
│  (2048 → 512 compression)   │
│                             │
│        O = H' × W₂          │
└──────────────┬──────────────┘
               │  [0.18, -0.12, 0.24, ..., 0.14]  (512D)
Output Vector (512D)
```
### Dimension Flow
```
512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512
```
**Like a funnel:** Expand → Transform → Compress
---
## 3.8 Why Expand and Compress?
### The Expansion-Compression Strategy
**Why not stay at 512 dimensions?**
**Answer:** Expansion provides "working space"
**Analogy:**
- Think of doing math on paper
- Small paper (512D) = limited space
- Large paper (2048D) = room to work
- Then copy results back to small paper (512D)
**Benefits:**
1. **More capacity:** 2048 dimensions = more parameters to learn
2. **Better transformations:** More space = more complex functions
3. **Feature refinement:** Transformation happens in expanded space
**Why compress back?**
**Answer:** Maintain consistent size throughout the model
- All layers use 512 dimensions
- Consistent size enables stacking layers
- Easier to manage and optimize
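"More parameters to learn" can be made concrete: with the 512/2048 sizes assumed in this document, the two weight matrices plus biases of a single FFN add up to roughly 2.1M parameters:

```python
d_model, d_ff = 512, 2048

# W1 [512, 2048] + b1 [2048] + W2 [2048, 512] + b2 [512]
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
print(ffn_params)  # 2099712
```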
---
## 3.9 Key Takeaways: Feed-Forward
- **FFN transforms features through expansion and compression**
- **Expands to a larger dimension for processing**
- **Applies a non-linear transformation (GELU)**
- **Compresses back to the original dimension**
- **Processes each position independently**
---
*This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.*