Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00

# What is Generation? Step-by-Step Explanation
Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.
## Table of Contents
1. [What is Generation?](#91-what-is-generation)
2. [Autoregressive Generation](#92-autoregressive-generation)
3. [Sampling Strategies](#93-sampling-strategies)
4. [Temperature](#94-temperature)
5. [Top-k Sampling](#95-top-k-sampling)
6. [Top-p (Nucleus) Sampling](#96-top-p-nucleus-sampling)
7. [Step-by-Step Generation Process](#97-step-by-step-generation-process)
8. [Exercise: Complete Generation Example](#98-exercise-complete-generation-example)
9. [Key Takeaways](#99-key-takeaways)
---
## 9.1 What is Generation?
### Simple Definition
**Generation** (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.
### Visual Analogy
**Think of generation like writing a story:**
```
Prompt: "Once upon a time"
Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...
Final: "Once upon a time there was a princess..."
```
**The model predicts the next word, one at a time!**
### What Generation Does
**Generation:**
1. **Takes** a prompt (starting text)
2. **Predicts** next token probabilities
3. **Samples** a token from distribution
4. **Appends** token to sequence
5. **Repeats** until complete
**Result:** Generated text continuation!
---
## 9.2 Autoregressive Generation
### What is Autoregressive?
**Autoregressive** means the model uses its own previous outputs as inputs for the next prediction.
### How It Works
**Step 1: Initial Prompt**
```
Prompt: "Hello"
Sequence: ["Hello"]
```
**Step 2: First Prediction**
```
Input: ["Hello"]
Model output: Probabilities for next token
"World": 0.4
"there": 0.3
"friend": 0.2
...
```
**Step 3: Sample Token**
```
Sample: "World" (selected)
Sequence: ["Hello", "World"]
```
**Step 4: Second Prediction**
```
Input: ["Hello", "World"]
Model output: Probabilities for next token
"!": 0.5
".": 0.3
",": 0.1
...
```
**Step 5: Continue**
```
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
```
### Mathematical Formulation
**For prompt $\mathbf{P} = [p_1, ..., p_k]$:**
**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```
**For each step $t \geq k+1$:**
1. **Forward pass:**
```math
\mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
```
2. **Get next token probabilities:**
```math
\mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
```
3. **Sample token:**
```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```
4. **Append token:**
```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```
**Repeat until stop condition!**
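The loop above can be sketched in plain Python. `toy_model` is a hypothetical stand-in for a real forward pass: it returns fixed logits over a tiny 4-token vocabulary, purely for illustration.

```python
import math
import random

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(tokens):
    # Hypothetical stand-in for a real forward pass: fixed logits over a
    # 4-token vocabulary, independent of the input, for illustration only
    return [2.0, 1.0, 0.5, -1.0]

def generate(prompt_tokens, max_new_tokens, stop_token=3, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)          # T_0 = P
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)        # forward pass on T_{t-1}
        probs = softmax(logits)           # p_t = softmax(l_t)
        next_token = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)         # T_t = [T_{t-1}, t_t]
        if next_token == stop_token:      # stop condition
            break
    return tokens

print(generate([0], max_new_tokens=5))
```

A real model would condition the logits on the whole sequence so far; everything else in the loop is exactly the predict → sample → append cycle described above.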
---
## 9.3 Sampling Strategies
### Deterministic vs Stochastic
**Deterministic (Greedy):**
```
Always pick highest probability:
"World": 0.4 ← Highest
"there": 0.3
"friend": 0.2
→ Always picks "World"
→ Same output every time
```
**Stochastic (Sampling):**
```
Sample from distribution:
"World": 0.4 (40% chance)
"there": 0.3 (30% chance)
"friend": 0.2 (20% chance)
→ Different output each time
→ More diverse generations
```
### Why Sampling?
**Greedy (Deterministic):**
- Same output every time
- Can be repetitive
- Less creative
**Sampling:**
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks
---
## 9.4 Temperature
### What is Temperature?
**Temperature** controls the randomness of sampling by scaling the logits before applying softmax.
### Formula
```math
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
```
**Where:**
- $\mathbf{l}_t$ = logits (raw scores)
- $T$ = temperature
- $\mathbf{p}_t$ = probabilities
### How Temperature Works
**T = 0.5 (Low Temperature - More Deterministic):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
```
**T = 1.0 (Standard Temperature):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
```
**T = 2.0 (High Temperature - More Random):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
```
### Visual Comparison
```
Probability of the most likely token
1.0│
0.8│ T=0.5: ●
0.6│ T=1.0: ●
0.4│ T=2.0: ●
0.2│
0.0├───────────────────────── Temperature
```
**Lower T = Sharper distribution = More deterministic**
**Higher T = Flatter distribution = More random**
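The effect is easy to check numerically. A minimal plain-Python sketch applies the three temperatures to the example logits `[2.0, 1.0, 0.5]`:

```python
import math

def softmax(logits):
    # Max-subtraction keeps exp() from overflowing; the result is unchanged
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, [round(q, 2) for q in apply_temperature(logits, T)])
# → 0.5 [0.84, 0.11, 0.04]
# → 1.0 [0.63, 0.23, 0.14]
# → 2.0 [0.48, 0.29, 0.23]
```

As T approaches 0 the distribution collapses onto the top logit (greedy decoding); as T grows it approaches uniform.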
### When to Use Different Temperatures
**Low Temperature (T < 1.0):**
- Factual tasks
- Reproducible outputs
- When you want consistent results
**Standard Temperature (T = 1.0):**
- Default setting
- Balanced behavior
- Good for most tasks
**High Temperature (T > 1.0):**
- Creative writing
- Diverse outputs
- When you want variety
---
## 9.5 Top-k Sampling
### What is Top-k?
**Top-k sampling** limits the sampling to only the top k most likely tokens.
### How It Works
**Step 1: Get Probabilities**
```
All tokens:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
```
**Step 2: Select Top-k (e.g., k=3)**
```
Top 3:
"World": 0.4
"there": 0.3
"friend": 0.2
```
**Step 3: Remove Others**
```
Set others to 0:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.0
"cat": 0.0
"dog": 0.0
...
```
**Step 4: Renormalize**
```
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
"World": 0.4/0.9 = 0.44
"there": 0.3/0.9 = 0.33
"friend": 0.2/0.9 = 0.22
```
**Step 5: Sample from Top-k**
```
Sample from these 3 tokens only
```
### Mathematical Formulation
**Given probabilities $\mathbf{p}_t$ and top-k:**
```math
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
```
### Why Top-k?
**Benefits:**
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations
**Example:**
```
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
```
---
## 9.6 Top-p (Nucleus) Sampling
### What is Top-p?
**Top-p (nucleus) sampling** keeps the smallest set of tokens whose cumulative probability is at least p.
### How It Works
**Step 1: Sort Probabilities**
```
Sorted (descending):
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
```
**Step 2: Compute Cumulative Probabilities**
```
Cumulative:
"World": 0.4
"there": 0.7 (0.4 + 0.3)
"friend": 0.9 (0.7 + 0.2)
"hello": 0.95 (0.9 + 0.05)
"cat": 0.98 (0.95 + 0.03)
...
```
**Step 3: Find Nucleus (e.g., p=0.9)**
```
Find smallest set where sum ≥ 0.9:
"World": 0.4
"there": 0.3
"friend": 0.2
Cumulative: 0.9 ✓
→ Keep these 3 tokens
```
**Step 4: Remove Others**
```
Keep:
"World": 0.4
"there": 0.3
"friend": 0.2
Others: 0.0
```
**Step 5: Renormalize and Sample**
```
Renormalize and sample
```
### Mathematical Formulation
**Given probabilities $\mathbf{p}_t$ and top-p:**
**Find smallest set S:**
```math
S = \underset{S'}{\arg\min} \left\{ |S'| \;:\; \sum_{v \in S'} \mathbf{p}_t[v] \geq p \right\}
```
**Then:**
```math
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
```
### Why Top-p?
**Benefits:**
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k
**Example:**
```
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
```
---
## 9.7 Step-by-Step Generation Process
### Complete Process
**Given prompt: "Hello"**
#### Step 1: Encode Prompt
```
Prompt: "Hello"
Token IDs: [72]
```
#### Step 2: Forward Pass
```
Input: [72]
Model processes through layers
Output: Logits for all tokens
Token 72: 5.2
Token 87: 4.8 ← "World"
Token 101: 3.2 ← "there"
Token 108: 2.1 ← "friend"
...
```
#### Step 3: Apply Temperature
```
Temperature: T = 1.0
Scaled logits: Same as above
```
#### Step 4: Apply Top-k (Optional)
```
Top-k: k = 50
Keep top 50 tokens, remove others
```
#### Step 5: Apply Top-p (Optional)
```
Top-p: p = 0.95
Keep the smallest set of tokens whose cumulative prob ≥ 0.95
```
#### Step 6: Compute Probabilities
```
Apply softmax:
"World": 0.4
"there": 0.3
"friend": 0.2
...
```
#### Step 7: Sample Token
```
Sample from distribution:
Selected: "World" (token 87)
```
#### Step 8: Append Token
```
Sequence: [72, 87]
Text: "Hello World"
```
#### Step 9: Repeat
```
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
```
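Steps 3-7 can be combined into a single sampling function. A plain-Python sketch (parameter names are illustrative; filtering is done on probabilities here, which is equivalent to masking logits to -∞ before softmax):

```python
import math
import random

def _renormalize(probs):
    total = sum(probs)
    return [q / total for q in probs]

def sample_next_token(logits, temperature=1.0, k=None, p=None, rng=random):
    # Step 3: temperature scaling, then softmax (max-subtracted for stability)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = _renormalize([math.exp(x - m) for x in scaled])
    # Step 4 (optional): top-k keeps the k largest entries, zeros the rest
    if k is not None and k < len(probs):
        threshold = sorted(probs, reverse=True)[k - 1]
        probs = _renormalize([q if q >= threshold else 0.0 for q in probs])
    # Step 5 (optional): top-p keeps the smallest prefix with mass >= p
    if p is not None:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        kept, cumulative = [0.0] * len(probs), 0.0
        for i in order:
            kept[i] = probs[i]
            cumulative += probs[i]
            if cumulative >= p:
                break
        probs = _renormalize(kept)
    # Steps 6-7: one multinomial draw from the filtered distribution
    return rng.choices(range(len(probs)), weights=probs)[0]

# With logits [10, 8, 5, 2], k=3 then p=0.9 leaves only tokens 0 and 1
print(sample_next_token([10.0, 8.0, 5.0, 2.0], k=3, p=0.9, rng=random.Random(0)))
```

In a full generation loop this function would be called once per step on the logits of the last position, and the sampled token appended to the sequence.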
---
## 9.8 Exercise: Complete Generation Example
### Problem
**Given:**
- Prompt: "The"
- Model logits for next token: `[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...]` (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9
**Generate the next token step-by-step.**
### Step-by-Step Solution
#### Step 1: Initial Setup
**Prompt:**
```
"The"
Token IDs: [32] (assuming "The" = token 32)
```
**Logits:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
```
#### Step 2: Apply Temperature
**Temperature: T = 1.0**
**Scaled logits (divide by T):**
```
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
```
**No change (T=1.0 is identity)**
#### Step 3: Apply Top-k Filtering
**Top-k: k = 3**
**Select top 3 tokens:**
```
Top 3:
"cat": 10.0
"dog": 8.0
"car": 5.0
```
**Set others to -∞:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
```
#### Step 4: Apply Top-p Filtering
**First, compute probabilities from top-k tokens:**
**Apply softmax:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84
P("cat") = 22026.47 / 25155.84 ≈ 0.875
P("dog") = 2980.96 / 25155.84 ≈ 0.119
P("car") = 148.41 / 25155.84 ≈ 0.006
```
**Cumulative probabilities:**
```
"cat": 0.875
"dog": 0.994 (0.875 + 0.119)
"car": 1.000 (0.994 + 0.006)
```
**Find smallest set where sum ≥ 0.9:**
```
"cat": 0.875 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓
→ Keep "cat" and "dog"
→ Remove "car"
```
**Result:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
```
#### Step 5: Compute Final Probabilities
**Apply softmax to remaining tokens:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43
P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
```
#### Step 6: Sample Token
**Sample from distribution:**
```
Random number: 0.75
Cumulative:
"cat": 0.881 ← 0.75 falls here
"dog": 1.000
→ Selected: "cat"
```
### Answer
**Generated token: "cat"**
**Final sequence:**
```
Prompt: "The"
Generated: "cat"
Full text: "The cat"
```
### Summary
| Step | Operation | Result |
|------|-----------|--------|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |
**The model generated "cat" following "The"!**
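The exercise arithmetic can be checked numerically. A sketch in plain Python (token strings stand in for vocabulary indices):

```python
import math

# Logits surviving top-k (k = 3) in the exercise
logits = {"cat": 10.0, "dog": 8.0, "car": 5.0}
exps = {t: math.exp(x) for t, x in logits.items()}
total = sum(exps.values())
probs = {t: e / total for t, e in exps.items()}
print({t: round(q, 2) for t, q in probs.items()})
# → {'cat': 0.88, 'dog': 0.12, 'car': 0.01}

# Nucleus p = 0.9: "cat" alone (≈0.876) falls short, so "dog" joins too
nucleus = {t: probs[t] for t in ("cat", "dog")}
mass = sum(nucleus.values())
final = {t: q / mass for t, q in nucleus.items()}
print({t: round(q, 2) for t, q in final.items()})
# → {'cat': 0.88, 'dog': 0.12}
```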
---
## 9.9 Key Takeaways
### Generation
✅ **Generation produces text one token at a time**
✅ **Autoregressive: uses previous outputs as inputs**
✅ **Iterative process: predict → sample → append → repeat**
### Sampling Strategies
✅ **Temperature: Controls randomness (lower = deterministic, higher = random)**
✅ **Top-k: Limits to top k tokens**
✅ **Top-p: Keeps smallest set with cumulative probability ≥ p**
✅ **Combined: Often use temperature + top-k or top-p**
### Why Important
✅ **Enables text generation from trained models**
✅ **Different strategies produce different outputs**
✅ **Essential for language model deployment**
---
## Mathematical Summary
### Generation Process
**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```
**For each step $t$:**
```math
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
```
```math
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
```
```math
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
```
```math
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
```
```math
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
```
```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```
```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```
---
*This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.*