Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00

# What is Generation? Step-by-Step Explanation
Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.
## Table of Contents
1. [What is Generation?](#91-what-is-generation)
2. [Autoregressive Generation](#92-autoregressive-generation)
3. [Sampling Strategies](#93-sampling-strategies)
4. [Temperature](#94-temperature)
5. [Top-k Sampling](#95-top-k-sampling)
6. [Top-p (Nucleus) Sampling](#96-top-p-nucleus-sampling)
7. [Step-by-Step Generation Process](#97-step-by-step-generation-process)
8. [Exercise: Complete Generation Example](#98-exercise-complete-generation-example)
9. [Key Takeaways](#99-key-takeaways)
---
## 9.1 What is Generation?
### Simple Definition
**Generation** (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.
### Visual Analogy
**Think of generation like writing a story:**
```
Prompt: "Once upon a time"
Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...
Final: "Once upon a time there was a princess..."
```
**The model predicts the next word, one at a time!**
### What Generation Does
**Generation:**
1. **Takes** a prompt (starting text)
2. **Predicts** next token probabilities
3. **Samples** a token from distribution
4. **Appends** token to sequence
5. **Repeats** until complete
**Result:** Generated text continuation!
---
## 9.2 Autoregressive Generation
### What is Autoregressive?
**Autoregressive** means the model uses its own previous outputs as inputs for the next prediction.
### How It Works
**Step 1: Initial Prompt**
```
Prompt: "Hello"
Sequence: ["Hello"]
```
**Step 2: First Prediction**
```
Input: ["Hello"]
Model output: Probabilities for next token
"World": 0.4
"there": 0.3
"friend": 0.2
...
```
**Step 3: Sample Token**
```
Sample: "World" (selected)
Sequence: ["Hello", "World"]
```
**Step 4: Second Prediction**
```
Input: ["Hello", "World"]
Model output: Probabilities for next token
"!": 0.5
".": 0.3
",": 0.1
...
```
**Step 5: Continue**
```
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
```
### Mathematical Formulation
**For prompt $\mathbf{P} = [p_1, ..., p_k]$:**
**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```
**For each step $t \geq k+1$:**
1. **Forward pass:**
```math
\mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
```
2. **Get next token probabilities:**
```math
\mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
```
3. **Sample token:**
```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```
4. **Append token:**
```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```
**Repeat until stop condition!**
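The loop above can be sketched in plain Python. `toy_model` is a hypothetical stand-in for a real forward pass: it returns fixed logits over a tiny 4-token vocabulary, purely for illustration.

```python
import math
import random

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(tokens):
    # Hypothetical stand-in for a real forward pass: fixed logits over a
    # 4-token vocabulary, independent of the input, for illustration only
    return [2.0, 1.0, 0.5, -1.0]

def generate(prompt_tokens, max_new_tokens, stop_token=3, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)          # T_0 = P
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)        # forward pass on T_{t-1}
        probs = softmax(logits)           # p_t = softmax(l_t)
        next_token = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)         # T_t = [T_{t-1}, t_t]
        if next_token == stop_token:      # stop condition
            break
    return tokens

print(generate([0], max_new_tokens=5))
```

A real model would condition the logits on the whole sequence so far; everything else in the loop is exactly the predict → sample → append cycle described above.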
---
## 9.3 Sampling Strategies
### Deterministic vs Stochastic
**Deterministic (Greedy):**
```
Always pick highest probability:
"World": 0.4 ← Highest
"there": 0.3
"friend": 0.2
→ Always picks "World"
→ Same output every time
```
**Stochastic (Sampling):**
```
Sample from distribution:
"World": 0.4 (40% chance)
"there": 0.3 (30% chance)
"friend": 0.2 (20% chance)
→ Different output each time
→ More diverse generations
```
### Why Sampling?
**Greedy (Deterministic):**
- Same output every time
- Can be repetitive
- Less creative
**Sampling:**
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks
---
## 9.4 Temperature
### What is Temperature?
**Temperature** controls the randomness of sampling by scaling the logits before applying softmax.
### Formula
```math
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
```
**Where:**
- $\mathbf{l}_t$ = logits (raw scores)
- $T$ = temperature
- $\mathbf{p}_t$ = probabilities
### How Temperature Works
**T = 0.5 (Low Temperature - More Deterministic):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
```
**T = 1.0 (Standard Temperature):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
```
**T = 2.0 (High Temperature - More Random):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
```
### Visual Comparison
```
Probability of the most likely token
1.0│
0.8│ T=0.5: ●
0.6│ T=1.0: ●
0.4│ T=2.0: ●
0.2│
0.0├───────────────────────── Temperature
```
**Lower T = Sharper distribution = More deterministic**
**Higher T = Flatter distribution = More random**
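The effect is easy to check numerically. A minimal plain-Python sketch applies the three temperatures to the example logits `[2.0, 1.0, 0.5]`:

```python
import math

def softmax(logits):
    # Max-subtraction keeps exp() from overflowing; the result is unchanged
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, [round(q, 2) for q in apply_temperature(logits, T)])
# → 0.5 [0.84, 0.11, 0.04]
# → 1.0 [0.63, 0.23, 0.14]
# → 2.0 [0.48, 0.29, 0.23]
```

As T approaches 0 the distribution collapses onto the top logit (greedy decoding); as T grows it approaches uniform.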
### When to Use Different Temperatures
**Low Temperature (T < 1.0):**
- Factual tasks
- Reproducible outputs
- When you want consistent results
**Standard Temperature (T = 1.0):**
- Default setting
- Balanced behavior
- Good for most tasks
**High Temperature (T > 1.0):**
- Creative writing
- Diverse outputs
- When you want variety
---
## 9.5 Top-k Sampling
### What is Top-k?
**Top-k sampling** limits the sampling to only the top k most likely tokens.
### How It Works
**Step 1: Get Probabilities**
```
All tokens:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
```
**Step 2: Select Top-k (e.g., k=3)**
```
Top 3:
"World": 0.4
"there": 0.3
"friend": 0.2
```
**Step 3: Remove Others**
```
Set others to 0:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.0
"cat": 0.0
"dog": 0.0
...
```
**Step 4: Renormalize**
```
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
"World": 0.4/0.9 = 0.44
"there": 0.3/0.9 = 0.33
"friend": 0.2/0.9 = 0.22
```
**Step 5: Sample from Top-k**
```
Sample from these 3 tokens only
```
### Mathematical Formulation
**Given probabilities $\mathbf{p}_t$ and top-k:**
```math
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
```
### Why Top-k?
**Benefits:**
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations
**Example:**
```
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
```
---
## 9.6 Top-p (Nucleus) Sampling
### What is Top-p?
**Top-p (nucleus) sampling** keeps the smallest set of tokens whose cumulative probability is at least p.
### How It Works
**Step 1: Sort Probabilities**
```
Sorted (descending):
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
```
**Step 2: Compute Cumulative Probabilities**
```
Cumulative:
"World": 0.4
"there": 0.7 (0.4 + 0.3)
"friend": 0.9 (0.7 + 0.2)
"hello": 0.95 (0.9 + 0.05)
"cat": 0.98 (0.95 + 0.03)
...
```
**Step 3: Find Nucleus (e.g., p=0.9)**
```
Find smallest set where sum ≥ 0.9:
"World": 0.4
"there": 0.3
"friend": 0.2
Cumulative: 0.9 ✓
→ Keep these 3 tokens
```
**Step 4: Remove Others**
```
Keep:
"World": 0.4
"there": 0.3
"friend": 0.2
Others: 0.0
```
**Step 5: Renormalize and Sample**
```
Renormalize and sample
```
### Mathematical Formulation
**Given probabilities $\mathbf{p}_t$ and top-p:**
**Find smallest set S:**
```math
S = \underset{S'}{\arg\min} \left\{ |S'| \;:\; \sum_{v \in S'} \mathbf{p}_t[v] \geq p \right\}
```
**Then:**
```math
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
```
### Why Top-p?
**Benefits:**
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k
**Example:**
```
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
```
---
## 9.7 Step-by-Step Generation Process
### Complete Process
**Given prompt: "Hello"**
#### Step 1: Encode Prompt
```
Prompt: "Hello"
Token IDs: [72]
```
#### Step 2: Forward Pass
```
Input: [72]
Model processes through layers
Output: Logits for all tokens
Token 72: 5.2
Token 87: 4.8 ← "World"
Token 101: 3.2 ← "there"
Token 108: 2.1 ← "friend"
...
```
#### Step 3: Apply Temperature
```
Temperature: T = 1.0
Scaled logits: Same as above
```
#### Step 4: Apply Top-k (Optional)
```
Top-k: k = 50
Keep top 50 tokens, remove others
```
#### Step 5: Apply Top-p (Optional)
```
Top-p: p = 0.95
Keep the smallest set of tokens whose cumulative prob ≥ 0.95
```
#### Step 6: Compute Probabilities
```
Apply softmax:
"World": 0.4
"there": 0.3
"friend": 0.2
...
```
#### Step 7: Sample Token
```
Sample from distribution:
Selected: "World" (token 87)
```
#### Step 8: Append Token
```
Sequence: [72, 87]
Text: "Hello World"
```
#### Step 9: Repeat
```
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
```
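Steps 3-7 can be combined into a single sampling function. A plain-Python sketch (parameter names are illustrative; filtering is done on probabilities here, which is equivalent to masking logits to -∞ before softmax):

```python
import math
import random

def _renormalize(probs):
    total = sum(probs)
    return [q / total for q in probs]

def sample_next_token(logits, temperature=1.0, k=None, p=None, rng=random):
    # Step 3: temperature scaling, then softmax (max-subtracted for stability)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = _renormalize([math.exp(x - m) for x in scaled])
    # Step 4 (optional): top-k keeps the k largest entries, zeros the rest
    if k is not None and k < len(probs):
        threshold = sorted(probs, reverse=True)[k - 1]
        probs = _renormalize([q if q >= threshold else 0.0 for q in probs])
    # Step 5 (optional): top-p keeps the smallest prefix with mass >= p
    if p is not None:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        kept, cumulative = [0.0] * len(probs), 0.0
        for i in order:
            kept[i] = probs[i]
            cumulative += probs[i]
            if cumulative >= p:
                break
        probs = _renormalize(kept)
    # Steps 6-7: one multinomial draw from the filtered distribution
    return rng.choices(range(len(probs)), weights=probs)[0]

# With logits [10, 8, 5, 2], k=3 then p=0.9 leaves only tokens 0 and 1
print(sample_next_token([10.0, 8.0, 5.0, 2.0], k=3, p=0.9, rng=random.Random(0)))
```

In a full generation loop this function would be called once per step on the logits of the last position, and the sampled token appended to the sequence.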
---
## 9.8 Exercise: Complete Generation Example
### Problem
**Given:**
- Prompt: "The"
- Model logits for next token: `[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...]` (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9
**Generate the next token step-by-step.**
### Step-by-Step Solution
#### Step 1: Initial Setup
**Prompt:**
```
"The"
Token IDs: [32] (assuming "The" = token 32)
```
**Logits:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
```
#### Step 2: Apply Temperature
**Temperature: T = 1.0**
**Scaled logits (divide by T):**
```
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
```
**No change (T=1.0 is identity)**
#### Step 3: Apply Top-k Filtering
**Top-k: k = 3**
**Select top 3 tokens:**
```
Top 3:
"cat": 10.0
"dog": 8.0
"car": 5.0
```
**Set others to -∞:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
```
#### Step 4: Apply Top-p Filtering
**First, compute probabilities from top-k tokens:**
**Apply softmax:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84
P("cat") = 22026.47 / 25155.84 ≈ 0.875
P("dog") = 2980.96 / 25155.84 ≈ 0.119
P("car") = 148.41 / 25155.84 ≈ 0.006
```
**Cumulative probabilities:**
```
"cat": 0.875
"dog": 0.994 (0.875 + 0.119)
"car": 1.000 (0.994 + 0.006)
```
**Find smallest set where sum ≥ 0.9:**
```
"cat": 0.875 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓
→ Keep "cat" and "dog"
→ Remove "car"
```
**Result:**
```
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
```
#### Step 5: Compute Final Probabilities
**Apply softmax to remaining tokens:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43
P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
```
#### Step 6: Sample Token
**Sample from distribution:**
```
Random number: 0.75
Cumulative:
"cat": 0.881 ← 0.75 falls here
"dog": 1.000
→ Selected: "cat"
```
### Answer
**Generated token: "cat"**
**Final sequence:**
```
Prompt: "The"
Generated: "cat"
Full text: "The cat"
```
### Summary
| Step | Operation | Result |
|------|-----------|--------|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |
**The model generated "cat" following "The"!**
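The exercise arithmetic can be checked numerically. A sketch in plain Python (token strings stand in for vocabulary indices):

```python
import math

# Logits surviving top-k (k = 3) in the exercise
logits = {"cat": 10.0, "dog": 8.0, "car": 5.0}
exps = {t: math.exp(x) for t, x in logits.items()}
total = sum(exps.values())
probs = {t: e / total for t, e in exps.items()}
print({t: round(q, 2) for t, q in probs.items()})
# → {'cat': 0.88, 'dog': 0.12, 'car': 0.01}

# Nucleus p = 0.9: "cat" alone (≈0.876) falls short, so "dog" joins too
nucleus = {t: probs[t] for t in ("cat", "dog")}
mass = sum(nucleus.values())
final = {t: q / mass for t, q in nucleus.items()}
print({t: round(q, 2) for t, q in final.items()})
# → {'cat': 0.88, 'dog': 0.12}
```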
---
## 9.9 Key Takeaways
### Generation
✅ **Generation produces text one token at a time**
✅ **Autoregressive: uses previous outputs as inputs**
✅ **Iterative process: predict → sample → append → repeat**
### Sampling Strategies
✅ **Temperature: Controls randomness (lower = deterministic, higher = random)**
✅ **Top-k: Limits to top k tokens**
✅ **Top-p: Keeps smallest set with cumulative probability ≥ p**
✅ **Combined: Often use temperature + top-k or top-p**
### Why Important
✅ **Enables text generation from trained models**
✅ **Different strategies produce different outputs**
✅ **Essential for language model deployment**
---
## Mathematical Summary
### Generation Process
**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```
**For each step $t$:**
```math
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
```
```math
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
```
```math
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
```
```math
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
```
```math
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
```
```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```
```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```
---
*This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.*