# What is Generation? Step-by-Step Explanation

Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.

## Table of Contents

1. [What is Generation?](#91-what-is-generation)
2. [Autoregressive Generation](#92-autoregressive-generation)
3. [Sampling Strategies](#93-sampling-strategies)
4. [Temperature](#94-temperature)
5. [Top-k Sampling](#95-top-k-sampling)
6. [Top-p (Nucleus) Sampling](#96-top-p-nucleus-sampling)
7. [Step-by-Step Generation Process](#97-step-by-step-generation-process)
8. [Exercise: Complete Generation Example](#98-exercise-complete-generation-example)
9. [Key Takeaways](#99-key-takeaways)

---

## 9.1 What is Generation?

### Simple Definition

**Generation** (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.

### Visual Analogy

**Think of generation like writing a story:**

```
Prompt: "Once upon a time"

Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...

Final: "Once upon a time there was a princess..."
```

**The model predicts the next token (here, a word), one at a time!**

### What Generation Does

**Generation:**

1. **Takes** a prompt (starting text)
2. **Predicts** next-token probabilities
3. **Samples** a token from that distribution
4. **Appends** the token to the sequence
5. **Repeats** until a stop condition is met (maximum length or a stop token)

**Result:** Generated text continuation!

---

## 9.2 Autoregressive Generation

### What is Autoregressive?

**Autoregressive** means the model uses its own previous outputs as inputs for the next prediction.

### How It Works

**Step 1: Initial Prompt**

```
Prompt: "Hello"
Sequence: ["Hello"]
```

**Step 2: First Prediction**

```
Input: ["Hello"]
Model output: Probabilities for next token
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  ...
```

**Step 3: Sample Token**

```
Sample: "World" (selected)
Sequence: ["Hello", "World"]
```

**Step 4: Second Prediction**

```
Input: ["Hello", "World"]
Model output: Probabilities for next token
  "!": 0.5
  ".": 0.3
  ",": 0.1
  ...
```

**Step 5: Continue**

```
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
```
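
The five steps above can be sketched as a short loop. This is a minimal illustration, not a real model: `TOY_MODEL`, `next_token_probs`, `generate`, and the `<eos>` stop token are hypothetical names, and the 0.1 of probability assigned to `<eos>` is an assumption made so that each distribution from the walkthrough sums to 1.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
# The numbers mirror the walkthrough; a real model would compute them from
# the whole sequence with a neural network.
TOY_MODEL = {
    "Hello": {"World": 0.4, "there": 0.3, "friend": 0.2, "<eos>": 0.1},
    "World": {"!": 0.5, ".": 0.3, ",": 0.1, "<eos>": 0.1},
}
DEFAULT = {"<eos>": 1.0}  # assumption: any other token ends the sequence


def next_token_probs(sequence):
    """Return a {token: probability} dict for the next position."""
    return TOY_MODEL.get(sequence[-1], DEFAULT)


def generate(prompt, max_new_tokens=10, rng=None):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = rng or random.Random()
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(sequence)                    # predict
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]  # sample
        if token == "<eos>":                                  # stop token
            break
        sequence.append(token)                                # append
    return sequence


print(generate(["Hello"], rng=random.Random(0)))
```

Each sampled token is fed back in as part of the next input, which is the defining property of autoregressive generation.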

### Mathematical Formulation

**For prompt $\mathbf{P} = [p_1, ..., p_k]$:**

**Initialization:**

```math
\mathbf{T}_0 = \mathbf{P}
```

Here $\mathbf{T}_t$ is the token sequence after step $t$ (not to be confused with the temperature $T$ in Section 9.4), and $t_t$ is the token sampled at step $t$.

**For each step $t \geq 1$:**

1. **Forward pass:**

```math
\mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
```

2. **Get next token probabilities** (the index $[:, -1, :]$ selects the logits at the last sequence position):

```math
\mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
```

3. **Sample token:**

```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```

4. **Append token:**

```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```

**Repeat until a stop condition is met (maximum length or a stop token).**

---

## 9.3 Sampling Strategies

### Deterministic vs Stochastic

**Deterministic (Greedy):**

```
Always pick highest probability:
  "World": 0.4 ← Highest
  "there": 0.3
  "friend": 0.2

→ Always picks "World"
→ Same output every time
```

**Stochastic (Sampling):**

```
Sample from distribution:
  "World": 0.4 (40% chance)
  "there": 0.3 (30% chance)
  "friend": 0.2 (20% chance)

→ Output can differ from run to run
→ More diverse generations
```
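
The contrast between the two strategies can be sketched in a few lines. This is an illustrative sketch only: the function names `greedy` and `sample` are hypothetical, and the `<other>` entry is an assumed bucket for the 0.1 of probability mass the example leaves unspecified.

```python
import random

# Next-token distribution from the example; "<other>" is a hypothetical
# bucket for the remaining 0.1 of probability mass.
probs = {"World": 0.4, "there": 0.3, "friend": 0.2, "<other>": 0.1}


def greedy(probs):
    """Deterministic: always return the most probable token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Stochastic: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(42)
print([greedy(probs) for _ in range(5)])                 # same token every time
draws = [sample(probs, rng) for _ in range(1000)]
print({t: draws.count(t) / len(draws) for t in probs})   # close to probs
```

Over many draws the empirical frequencies approach the model's probabilities, while greedy decoding never leaves the top token.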

### Why Sampling?

**Greedy (Deterministic):**

- Same output every time
- Can be repetitive
- Less creative

**Sampling:**

- Outputs can differ between runs
- More diverse
- More creative
- Better for creative tasks

---

## 9.4 Temperature

### What is Temperature?

**Temperature** controls the randomness of sampling by scaling the logits before applying softmax.

### Formula

```math
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
```

**Where:**

- $\mathbf{l}_t$ = logits (raw scores)
- $T$ = temperature ($T > 0$)
- $\mathbf{p}_t$ = probabilities

### How Temperature Works

**T = 0.5 (Low Temperature - More Deterministic):**

```
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
```

**T = 1.0 (Standard Temperature):**

```
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
```

**T = 2.0 (High Temperature - More Random):**

```
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
```

### Visual Comparison

```
Logits: [2.0, 1.0, 0.5] for tokens "World", "there", "friend"
(probabilities rounded to two decimals)

Temperature   "World"   "there"   "friend"
T = 0.5       0.84      0.11      0.04
T = 1.0       0.63      0.23      0.14
T = 2.0       0.48      0.29      0.23
```

**Lower T = Sharper distribution = More deterministic**
**Higher T = Flatter distribution = More random**
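
The effect of temperature can be checked numerically. The sketch below recomputes the softmax of the logits `[2.0, 1.0, 0.5]` at T = 0.5, 1.0, and 2.0; the function name `softmax_with_temperature` is a hypothetical helper, and subtracting the maximum before exponentiating is a standard stability trick that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T = {t}:", [round(p, 2) for p in probs])
```

Lowering the temperature moves probability toward the largest logit, and raising it spreads probability more evenly across tokens.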

### When to Use Different Temperatures

**Low Temperature (T < 1.0):**

- Factual tasks
- More consistent outputs (sampling approaches greedy decoding as $T \to 0$)
- When you want consistent results

**Standard Temperature (T = 1.0):**

- Default setting
- Uses the model's distribution unchanged (balanced behavior)
- A reasonable starting point for most tasks

**High Temperature (T > 1.0):**

- Creative writing
- Diverse outputs
- When you want variety

---

## 9.5 Top-k Sampling

### What is Top-k?

**Top-k sampling** limits the sampling to only the top k most likely tokens.

### How It Works

**Step 1: Get Probabilities**

```
All tokens:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.05
  "cat": 0.03
  "dog": 0.02
  ...
```

**Step 2: Select Top-k (e.g., k=3)**

```
Top 3:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
```

**Step 3: Remove Others**

```
Set others to 0:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.0
  "cat": 0.0
  "dog": 0.0
  ...
```

**Step 4: Renormalize**

```
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
  "World": 0.4/0.9 = 0.44
  "there": 0.3/0.9 = 0.33
  "friend": 0.2/0.9 = 0.22
```

**Step 5: Sample from Top-k**

```
Sample from these 3 tokens only
```

### Mathematical Formulation

**Given probabilities $\mathbf{p}_t$ and the set $\text{top-k}$ of the $k$ highest-probability tokens:**

```math
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
```
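
The formula above can be sketched directly on a probability dictionary. This is a minimal illustration using the six-token example from the steps; `top_k_filter` is a hypothetical helper name, and a production implementation would usually work on logit tensors instead.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; zero out the rest."""
    if k <= 0:
        raise ValueError("k must be positive")
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: (probs[t] / total if t in kept else 0.0) for t in probs}


probs = {"World": 0.4, "there": 0.3, "friend": 0.2,
         "hello": 0.05, "cat": 0.03, "dog": 0.02}
filtered = top_k_filter(probs, k=3)
print({t: round(p, 2) for t, p in filtered.items()})
```

With `k=3` the three surviving tokens are rescaled by their combined mass of 0.9, matching Step 4.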

### Why Top-k?

**Benefits:**

- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Often yields more coherent generations

**Example:**

```
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from the 50 most likely tokens
→ Very unlikely tokens can never be picked
```

---

## 9.6 Top-p (Nucleus) Sampling

### What is Top-p?

**Top-p (nucleus) sampling** keeps the smallest set of tokens whose cumulative probability is at least p.

### How It Works

**Step 1: Sort Probabilities**

```
Sorted (descending):
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.05
  "cat": 0.03
  "dog": 0.02
  ...
```

**Step 2: Compute Cumulative Probabilities**

```
Cumulative:
  "World": 0.4
  "there": 0.7 (0.4 + 0.3)
  "friend": 0.9 (0.7 + 0.2)
  "hello": 0.95 (0.9 + 0.05)
  "cat": 0.98 (0.95 + 0.03)
  ...
```

**Step 3: Find Nucleus (e.g., p=0.9)**

```
Find smallest set where sum ≥ 0.9:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
Cumulative: 0.9 ✓

→ Keep these 3 tokens
```

**Step 4: Remove Others**

```
Keep:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
Others: 0.0
```

**Step 5: Renormalize and Sample**

```
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
  "World": 0.4/0.9 = 0.44
  "there": 0.3/0.9 = 0.33
  "friend": 0.2/0.9 = 0.22
Sample from these 3 tokens only
```

### Mathematical Formulation

**Given probabilities $\mathbf{p}_t$ over the vocabulary $V$ and a threshold $p$:**

**Find the smallest set $S$ (taking tokens in order of decreasing probability):**

```math
S = \arg\min_{S' \subseteq V} |S'| \quad \text{subject to} \quad \sum_{v \in S'} \mathbf{p}_t[v] \geq p
```

**Then:**

```math
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
```
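
A minimal sketch of the nucleus construction, again on a probability dictionary. `top_p_filter` and its `eps` parameter are hypothetical. The small tolerance is an assumption added because sums such as 0.4 + 0.3 + 0.2 can land just below 0.9 in floating point, which would otherwise pull in a fourth token.

```python
def top_p_filter(probs, p, eps=1e-9):
    """Keep the smallest high-probability prefix whose mass reaches p, then renormalize."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = [], 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p - eps:   # eps guards against float rounding at the boundary
            break
    return {t: (probs[t] / cumulative if t in kept else 0.0) for t in probs}


probs = {"World": 0.4, "there": 0.3, "friend": 0.2,
         "hello": 0.05, "cat": 0.03, "dog": 0.02}
nucleus = top_p_filter(probs, p=0.9)
print({t: round(q, 2) for t, q in nucleus.items()})
```

Unlike top-k, the number of surviving tokens is not fixed: a lower `p` or a sharper distribution leaves fewer candidates.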

### Why Top-p?

**Benefits:**

- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k

**Example:**

```
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
```

---

## 9.7 Step-by-Step Generation Process

### Complete Process

**Given prompt: "Hello"**

#### Step 1: Encode Prompt

```
Prompt: "Hello"
Token IDs: [72] (illustrative ID)
```

#### Step 2: Forward Pass

```
Input: [72]
Model processes through layers
Output: Logits for all tokens
  Token 72: 5.2
  Token 87: 4.8 ← "World"
  Token 101: 3.2 ← "there"
  Token 108: 2.1 ← "friend"
  ...
```

#### Step 3: Apply Temperature

```
Temperature: T = 1.0
Scaled logits: Same as above
```

#### Step 4: Apply Top-k (Optional)

```
Top-k: k = 50
Keep the 50 highest-scoring tokens, set the others to -∞
```

#### Step 5: Apply Top-p (Optional)

```
Top-p: p = 0.95
Keep the smallest set of tokens whose cumulative probability ≥ 0.95
```

#### Step 6: Compute Probabilities

```
Apply softmax to the remaining logits (illustrative values):
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  ...
```

#### Step 7: Sample Token

```
Sample from distribution:
Selected: "World" (token 87)
```

#### Step 8: Append Token

```
Sequence: [72, 87]
Text: "Hello World"
```

#### Step 9: Repeat

```
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
```
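
Steps 3 to 7 can be combined into one sampling function. The sketch below is one possible arrangement, not a reference implementation: `sample_next_token` is a hypothetical name, it assumes `top_k` is at least 1 when given, and it uses only the four logits listed in Step 2 (the elided ones are left out). Dropping a token from the candidate list has the same effect as setting its logit to -∞.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token ID from {token_id: logit} using temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Temperature: scale logits, then rank candidates from most to least likely.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    ranked = sorted(scaled, key=scaled.get, reverse=True)
    # Top-k: keep only the k highest-scoring candidates.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Softmax over the surviving candidates (max subtracted for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[tok] - peak) for tok in ranked]
    total = sum(weights)
    # Top-p: keep the shortest prefix whose cumulative probability reaches p.
    if top_p is not None:
        cumulative, cutoff = 0.0, len(ranked)
        for i, w in enumerate(weights):
            cumulative += w / total
            if cumulative >= top_p:
                cutoff = i + 1
                break
        ranked, weights = ranked[:cutoff], weights[:cutoff]
    # random.choices normalizes the remaining weights implicitly.
    return rng.choices(ranked, weights=weights, k=1)[0]


# Logits from Step 2; the token IDs are the illustrative ones used there.
logits = {72: 5.2, 87: 4.8, 101: 3.2, 108: 2.1}
sequence = [72]
sequence.append(sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.95,
                                  rng=random.Random(0)))
print(sequence)
```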

---

## 9.8 Exercise: Complete Generation Example

### Problem

**Given:**

- Prompt: "The"
- Model logits for next token: `[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...]` (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9

**Generate the next token step-by-step.**

### Step-by-Step Solution

#### Step 1: Initial Setup

**Prompt:**

```
"The"
Token IDs: [32] (assuming "The" = token 32)
```

**Logits:**

```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
```

#### Step 2: Apply Temperature

**Temperature: T = 1.0**

**Scaled logits (divide by T):**

```
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
```

**No change (T=1.0 is identity)**

#### Step 3: Apply Top-k Filtering

**Top-k: k = 3**

**Select top 3 tokens:**

```
Top 3:
  "cat": 10.0
  "dog": 8.0
  "car": 5.0
```

**Set others to -∞:**

```
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
```

#### Step 4: Apply Top-p Filtering

**First, compute probabilities from top-k tokens:**

**Apply softmax:**

```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84

P("cat") = 22026.47 / 25155.84 ≈ 0.876
P("dog") = 2980.96 / 25155.84 ≈ 0.118
P("car") = 148.41 / 25155.84 ≈ 0.006
```

**Cumulative probabilities:**

```
"cat": 0.876
"dog": 0.994 (0.876 + 0.118)
"car": 1.000 (0.994 + 0.006)
```

**Find smallest set where sum ≥ 0.9:**

```
"cat": 0.876 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓

→ Keep "cat" and "dog"
→ Remove "car"
```

**Result:**

```
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
```

#### Step 5: Compute Final Probabilities

**Apply softmax to remaining tokens:**

```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43

P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
```

#### Step 6: Sample Token

**Sample from distribution:**

```
Random number: 0.75

Cumulative:
  "cat": 0.881 ← 0.75 falls here
  "dog": 1.000

→ Selected: "cat"
```

### Answer

**Generated token: "cat"**

**Final sequence:**

```
Prompt: "The"
Generated: "cat"
Full text: "The cat"
```
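
The solution can be reproduced with a short script. It is a sketch under the exercise's own assumptions: only the six listed logits are used (the elided ones are ignored), and the "random number" 0.75 from Step 6 is hard-coded so the result is repeatable. All variable names are illustrative.

```python
import math

tokens = ["cat", "dog", "car", "house", "tree", "book"]
logits = [10.0, 8.0, 5.0, 2.0, 1.0, 0.5]   # only the six logits listed in the problem
temperature, k, p = 1.0, 3, 0.9


def softmax(values):
    peak = max(values)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Steps 2-3: temperature scaling, then keep the indices of the k largest logits.
scaled = [x / temperature for x in logits]
order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]

# Step 4: top-p over the top-k distribution.
topk_probs = softmax([scaled[i] for i in order])
nucleus, cumulative = [], 0.0
for i, prob in zip(order, topk_probs):
    nucleus.append(i)
    cumulative += prob
    if cumulative >= p:
        break

# Step 5: final probabilities over the nucleus.
final = dict(zip((tokens[i] for i in nucleus), softmax([scaled[i] for i in nucleus])))
print({t: round(q, 3) for t, q in final.items()})   # {'cat': 0.881, 'dog': 0.119}

# Step 6: inverse-CDF sampling with the random number used in the solution.
u, running = 0.75, 0.0
for token, q in final.items():
    running += q
    if u < running:
        print("selected:", token)                   # selected: cat
        break
```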

### Summary

| Step | Operation | Result |
|------|-----------|--------|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep smallest set with cumulative ≥ 0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |

**The model generated "cat" following "The"!**

---

## 9.9 Key Takeaways

### Generation

✅ **Generation produces text one token at a time**
✅ **Autoregressive: uses previous outputs as inputs**
✅ **Iterative process: predict → sample → append → repeat**

### Sampling Strategies

✅ **Temperature: Controls randomness (lower = more deterministic, higher = more random)**
✅ **Top-k: Limits sampling to the k most likely tokens**
✅ **Top-p: Keeps smallest set with cumulative probability ≥ p**
✅ **Combined: Often use temperature + top-k or top-p**

### Why Important

✅ **Enables text generation from trained models**
✅ **Different strategies produce different outputs**
✅ **Essential for language model deployment**

---

## Mathematical Summary

### Generation Process

**Initialization:**

```math
\mathbf{T}_0 = \mathbf{P}
```

**For each step $t \geq 1$:**

```math
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
```

```math
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
```

```math
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
```

```math
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
```

```math
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
```

```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```

```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```

---

*This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.*