# What is Attention? Step-by-Step Explanation

A complete, step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words.

## Table of Contents

1. [The Problem Attention Solves](#21-the-problem-attention-solves)
2. [What is Attention?](#22-what-is-attention)
3. [How Attention Works: Step-by-Step](#23-how-attention-works-step-by-step)
4. [Complete Example: Attention in "Hello World"](#24-complete-example-attention-in-hello-world)
5. [Why Attention Matters](#25-why-attention-matters)
6. [Multi-Head Attention](#26-multi-head-attention)
7. [Visual Representation of Attention](#27-visual-representation-of-attention)
8. [Key Takeaways](#28-key-takeaways-attention)

---

## 2.1 The Problem Attention Solves

### The Challenge

**In a sentence, words depend on each other:**

```
"He saw the cat with binoculars"
```

Two possible readings:
1. He used binoculars to see the cat
2. The cat he saw had binoculars

**Context matters!** The model needs to understand which words relate to each other.

### The Solution: Attention

**Attention allows the model to "look" at other words when processing each word.**

---
|
||||
|
||||
## 2.2 What is Attention?
|
||||
|
||||
### Simple Definition
|
||||
|
||||
**Attention** is a mechanism that determines **how much each word should consider other words** when processing information.
|
||||
|
||||
### Intuitive Analogy
|
||||
|
||||
**Think of reading a sentence:**
|
||||
|
||||
When you read "cat" in:
|
||||
```
|
||||
"The cat sat on the mat"
|
||||
```
|
||||
|
||||
You might:
|
||||
- Pay attention to "sat" (what the cat did)
|
||||
- Pay attention to "mat" (where the cat is)
|
||||
- Pay less attention to "the" (just a word)
|
||||
|
||||
**Attention does the same thing mathematically!**
|
||||
|
||||
---

## 2.3 How Attention Works: Step-by-Step

### High-Level Overview

```
Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention
```

### Detailed Step-by-Step

#### Step 1: Create Query, Key, Value (Q, K, V)

**For each word, create three representations:**

**Query (Q):** "What am I looking for?"
**Key (K):** "What am I offering?"
**Value (V):** "What information do I contain?"

**Example with "Hello World":**

```
Word: "Hello"
Query: [0.2, -0.1, 0.3, ...]    ← What should I look for?
Key:   [0.1, 0.2, -0.1, ...]    ← What do I represent?
Value: [0.15, 0.1, 0.2, ...]    ← What information do I have?

Word: "World"
Query: [0.18, 0.15, 0.25, ...]
Key:   [0.12, 0.19, -0.08, ...]
Value: [0.14, 0.12, 0.18, ...]
```

**How Q, K, V are created:**
```
Q = Word × W_Q   (learned matrix)
K = Word × W_K   (learned matrix)
V = Word × W_V   (learned matrix)
```
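
In code, the three projections are plain matrix-vector multiplications against learned weight matrices. Here is a minimal pure-Python sketch; the 3×3 weight matrices below are invented purely for illustration (they are not the matrices that produced the example vectors above, which come from a higher-dimensional model):

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical learned projection matrices for a 3-dimensional toy model.
# In a real transformer these are trained parameters.
W_Q = [[0.5, 0.1, 0.0], [0.0, 0.4, 0.2], [0.3, 0.0, 0.6]]
W_K = [[0.2, 0.0, 0.1], [0.1, 0.3, 0.0], [0.0, 0.2, 0.5]]
W_V = [[0.4, 0.2, 0.0], [0.0, 0.3, 0.1], [0.1, 0.0, 0.4]]

embedding = [0.10, -0.20, 0.30]  # toy embedding for "Hello"

q = matvec(W_Q, embedding)  # "What am I looking for?"
k = matvec(W_K, embedding)  # "What am I offering?"
v = matvec(W_V, embedding)  # "What information do I contain?"

print(q, k, v)
```

The same embedding yields three different views of the word, because each weight matrix is free to emphasize different dimensions.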

#### Step 2: Compute Similarity Scores

**Compare each query with all keys:**

```
Score[i, j] = How much should word i attend to word j?
```

**Mathematical Formula:**
```
Score[i, j] = (Query[i] · Key[j]) / √d_k
```

**Example (using only the first three dimensions, so d_k = 3):**

**Query for "Hello":** `[0.2, -0.1, 0.3]`
**Key for "Hello":** `[0.1, 0.2, -0.1]`
**Key for "World":** `[0.12, 0.19, -0.08]`

**Calculate similarity:**

```
Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
                        = (0.02 - 0.02 - 0.03) / 1.732
                        = -0.03 / 1.732
                        ≈ -0.017

Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
                        = (0.024 - 0.019 - 0.024) / 1.732
                        = -0.019 / 1.732
                        ≈ -0.011
```

**Result:** Similarity scores measure how strongly two words are related.
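
The arithmetic above can be checked with a few lines of Python (a sketch using the truncated 3-dimensional vectors):

```python
import math

def score(query, key):
    """Scaled dot-product score: (q · k) / sqrt(d_k)."""
    d_k = len(key)
    return sum(qi * ki for qi, ki in zip(query, key)) / math.sqrt(d_k)

q_hello = [0.2, -0.1, 0.3]
k_hello = [0.1, 0.2, -0.1]
k_world = [0.12, 0.19, -0.08]

print(round(score(q_hello, k_hello), 3))  # -0.017
print(round(score(q_hello, k_world), 3))  # -0.011
```

Dividing by √d_k keeps the scores from growing with the vector dimension, which would otherwise push the softmax in the next step toward extreme 0/1 weights.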

#### Step 3: Convert Scores to Attention Weights

**Use softmax to convert scores to probabilities:**

```
Attention[i, j] = exp(Score[i, j]) / Σ_k exp(Score[i, k])
```

**Example:**

**Raw scores:**
```
Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011
```

**Compute exponentials:**
```
exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972
```

**Compute attention weights:**
```
Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)
```

**Meaning:** "Hello" attends 49.9% to itself and 50.1% to "World"
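
The softmax step is small enough to verify directly (a sketch in pure Python):

```python
import math

def softmax(scores):
    """Convert raw scores into attention weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([-0.017, -0.011])
print(round(weights[0], 4), round(weights[1], 4))  # ≈ 0.4985 and 0.5015
```

Because softmax only cares about *differences* between scores, the two near-equal scores produce a near-even 49.9% / 50.1% split.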

#### Step 4: Weighted Combination

**Combine values using attention weights:**

```
Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"]
                + Attention["Hello", "World"] × Value["World"]
```

**Example:**

```
Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]

Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
                = [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
                = [0.145, 0.110, 0.190]
```

**Result:** A new representation that combines information from both words!
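
The weighted combination is just an element-wise blend of the value vectors (a sketch, continuing the same toy numbers):

```python
def weighted_combination(weights, values):
    """Blend value vectors according to attention weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

v_hello = [0.15, 0.10, 0.20]
v_world = [0.14, 0.12, 0.18]

output = weighted_combination([0.499, 0.501], [v_hello, v_world])
print([round(x, 3) for x in output])  # ≈ [0.145, 0.110, 0.190]
```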

---

## 2.4 Complete Example: Attention in "Hello World"

### Input

```
Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"
```

### Step-by-Step Processing

#### Step 1: Embeddings

```
E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]
```

#### Step 2: Create Q, K, V

```
Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]

Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]
```

#### Step 3: Compute Attention Scores

```
Score Matrix (2×2), illustrative values over the full-dimensional vectors:

           "Hello"   "World"
"Hello"     0.69      0.20
"World"     0.40      0.81
```

**Interpretation:**
- "Hello" scores itself (0.69) higher than "World" (0.20)
- "World" scores itself (0.81) higher than "Hello" (0.40)

#### Step 4: Apply Softmax

```
Attention Matrix:

           "Hello"   "World"
"Hello"     0.62      0.38
"World"     0.40      0.60
```

**Interpretation:**
- "Hello" gives 62% attention to itself, 38% to "World"
- "World" gives 40% attention to "Hello", 60% to itself

#### Step 5: Weighted Combination

```
Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
                = 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
                = [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
                = [0.146, 0.108, 0.192]

Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
                = 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
                = [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
                = [0.144, 0.112, 0.188]
```

**Result:** Each word now contains information from both words!
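
All of the steps above can be combined into one small function. A self-contained sketch in pure Python, run on the truncated 3-dimensional toy vectors (so the outputs only approximate the full-dimensional numbers used in this section):

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over a short sequence."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Step 2: scaled dot-product score against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # Step 3: softmax -> attention weights
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 4: weighted combination of the value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

Q = [[0.20, -0.10, 0.30], [0.18, 0.15, 0.25]]   # "Hello", "World"
K = [[0.10, 0.20, -0.10], [0.12, 0.19, -0.08]]
V = [[0.15, 0.10, 0.20], [0.14, 0.12, 0.18]]

for word, out in zip(["Hello", "World"], attention(Q, K, V)):
    print(word, [round(x, 3) for x in out])
```

Each output row is a mixture of both value vectors, weighted by how strongly that word's query matched each key.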

---

## 2.5 Why Attention Matters

### Benefit 1: Context Understanding

**Without attention:**
```
"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of their relationship
```

**With attention:**
```
"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: The model understands they are related
```

### Benefit 2: Long-Range Dependencies

**Attention can connect distant words:**

```
"The cat that I saw yesterday sat on the mat"
```

- "cat" can attend to "yesterday" (even though they are far apart)
- The model understands it is the cat from yesterday

### Benefit 3: Selective Focus

**Attention focuses on relevant information:**

```
"He saw the cat with binoculars"
```

- "saw" attends strongly to "binoculars" (how he saw)
- "saw" attends strongly to "cat" (what he saw)
- Each word focuses on what is relevant to it

---

## 2.6 Multi-Head Attention

### What is Multi-Head Attention?

**Multiple attention "heads" look at different aspects of the input in parallel:**

```
Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (spatial relationships)
...
Head 8: Focuses on yet another aspect
```

(Heads are not assigned these roles by hand; each head learns its own specialization during training. The labels above are illustrative.)

### Visual Representation

```
Input: "Hello World"

Head 1 (Syntax):
"Hello" → attends to "World" (grammatical relationship)

Head 2 (Semantics):
"Hello" → attends to "World" (greeting relationship)

Head 3 (Position):
"Hello" → attends more to itself (being first)

... (other heads)

Final: Combine all heads → Richer representation
```

### Why Multiple Heads?

**Different heads capture different relationships:**

- **Head 1:** Grammatical relationships
- **Head 2:** Semantic relationships
- **Head 3:** Positional relationships
- **Head 4:** Other patterns...

**Together:** A more comprehensive understanding!
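
The mechanics can be sketched in code: split the model dimension across heads, run ordinary single-head attention on each slice, then concatenate the per-head outputs. This is a simplified illustration; a real transformer also applies learned per-head projections and a final output projection, which are omitted here:

```python
import math

def single_head(queries, keys, values):
    """Scaled dot-product attention for one head (as in Section 2.3)."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def multi_head(queries, keys, values, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d = len(queries[0])
    assert d % num_heads == 0, "model dimension must divide evenly across heads"
    hd = d // num_heads  # dimension handled by each head
    outputs = [[] for _ in queries]
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        head_out = single_head([q[sl] for q in queries],
                               [k[sl] for k in keys],
                               [v[sl] for v in values])
        for i, o in enumerate(head_out):
            outputs[i].extend(o)  # concatenate this head's output
    return outputs

# Toy 4-dimensional vectors for "Hello" and "World", split over 2 heads
Q = [[0.20, -0.10, 0.30, 0.10], [0.18, 0.15, 0.25, 0.00]]
K = [[0.10, 0.20, -0.10, 0.20], [0.12, 0.19, -0.08, 0.10]]
V = [[0.15, 0.10, 0.20, 0.05], [0.14, 0.12, 0.18, 0.07]]

print(multi_head(Q, K, V, num_heads=2))
```

Because each head sees only its own slice, the heads compute independent attention patterns; concatenation then gives every word a representation informed by all of them.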

---

## 2.7 Visual Representation of Attention

### Attention Heatmap

```
Attention Weights for "Hello World"

              Position 0    Position 1
              ("Hello")     ("World")
             ┌─────────┐   ┌─────────┐
Position 0   │  0.62   │   │  0.38   │
("Hello")    └─────────┘   └─────────┘
             ┌─────────┐   ┌─────────┐
Position 1   │  0.40   │   │  0.60   │
("World")    └─────────┘   └─────────┘
```

**Reading:**
- Row 0: "Hello" attends 62% to itself, 38% to "World"
- Row 1: "World" attends 40% to "Hello", 60% to itself

### Attention Flow Diagram

```
"Hello" ──── 0.38 ────→ "World"
"Hello" ←─── 0.40 ───── "World"
"Hello" ──── 0.62 ────→ "Hello"   (self-attention)
"World" ──── 0.60 ────→ "World"   (self-attention)
```

**Meaning:** Information flows between words in proportion to the attention weights.

---

## 2.8 Key Takeaways: Attention

✅ **Attention determines which words to focus on**
✅ **It calculates similarity between words**
✅ **It creates weighted combinations of information**
✅ **It enables understanding of relationships between words**
✅ **Multiple heads capture different aspects**

---

*This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.*