What is Generation? Step-by-Step Explanation
Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.
Table of Contents
- What is Generation?
- Autoregressive Generation
- Sampling Strategies
- Temperature
- Top-k Sampling
- Top-p (Nucleus) Sampling
- Step-by-Step Generation Process
- Exercise: Complete Generation Example
- Key Takeaways
9.1 What is Generation?
Simple Definition
Generation (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.
Visual Analogy
Think of generation like writing a story:
Prompt: "Once upon a time"
Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...
Final: "Once upon a time there was a princess..."
The model predicts the next word, one at a time!
What Generation Does
Generation:
- Takes a prompt (starting text)
- Predicts next token probabilities
- Samples a token from distribution
- Appends token to sequence
- Repeats until complete
Result: Generated text continuation!
9.2 Autoregressive Generation
What is Autoregressive?
Autoregressive means the model uses its own previous outputs as inputs for the next prediction.
How It Works
Step 1: Initial Prompt
Prompt: "Hello"
Sequence: ["Hello"]
Step 2: First Prediction
Input: ["Hello"]
Model output: Probabilities for next token
"World": 0.4
"there": 0.3
"friend": 0.2
...
Step 3: Sample Token
Sample: "World" (selected)
Sequence: ["Hello", "World"]
Step 4: Second Prediction
Input: ["Hello", "World"]
Model output: Probabilities for next token
"!": 0.5
".": 0.3
",": 0.1
...
Step 5: Continue
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
Mathematical Formulation
For a prompt \mathbf{P} = [p_1, ..., p_k]:
Initialization:
\mathbf{T}_0 = \mathbf{P}
For each step t = 1, 2, ...:
- Forward pass: \mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
- Get next-token probabilities: \mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
- Sample a token: t_t \sim \text{Categorical}(\mathbf{p}_t)
- Append the token: \mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
Repeat until the stop condition is met!
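The loop above can be sketched in a few lines of Python. Here `toy_model` is a hypothetical stand-in that returns fixed next-token logits; a real transformer would compute them from the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["Hello", "World", "!", ".", "there"]

def toy_model(token_ids):
    # Hypothetical stand-in for a transformer forward pass: returns
    # fixed next-token logits regardless of context (illustration only).
    return np.array([0.1, 2.0, 1.0, 0.5, 1.5])

def generate(prompt_ids, max_new_tokens=3):
    seq = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(seq)                         # forward pass
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                            # softmax
        next_id = int(rng.choice(len(VOCAB), p=probs))  # sample
        seq.append(next_id)                             # append and repeat
    return seq

print(" ".join(VOCAB[i] for i in generate([0])))  # starts with "Hello", then sampled tokens
```

Note the autoregressive structure: each iteration feeds the growing `seq` (the model's own previous outputs) back into the model.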
9.3 Sampling Strategies
Deterministic vs Stochastic
Deterministic (Greedy):
Always pick highest probability:
"World": 0.4 ← Highest
"there": 0.3
"friend": 0.2
→ Always picks "World"
→ Same output every time
Stochastic (Sampling):
Sample from distribution:
"World": 0.4 (40% chance)
"there": 0.3 (30% chance)
"friend": 0.2 (20% chance)
→ Different output each time
→ More diverse generations
Why Sampling?
Greedy (Deterministic):
- Same output every time
- Can be repetitive
- Less creative
Sampling:
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks
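The difference can be shown with a fixed next-token distribution (a minimal sketch; the token names and probabilities reuse the example above):

```python
import numpy as np

tokens = ["World", "there", "friend", "hello"]
probs = np.array([0.4, 0.3, 0.2, 0.1])

# Greedy decoding: always take the argmax -> identical output every run.
greedy = tokens[int(np.argmax(probs))]
print(greedy)  # World

# Stochastic sampling: draw from the distribution -> output varies per call.
rng = np.random.default_rng(42)
samples = [tokens[int(rng.choice(len(tokens), p=probs))] for _ in range(5)]
print(samples)  # a mix, roughly matching the probabilities
```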
9.4 Temperature
What is Temperature?
Temperature controls the randomness of sampling by scaling the logits before applying softmax.
Formula
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
Where:
- \mathbf{l}_t = logits (raw scores)
- T = temperature
- \mathbf{p}_t = probabilities
How Temperature Works
T = 0.5 (Low Temperature - More Deterministic):
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
T = 1.0 (Standard Temperature):
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
T = 2.0 (High Temperature - More Random):
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
Visual Comparison
For the same logits, the probability of the most likely token shifts with T:
T = 0.5 → top token ≈ 0.84 (sharp peak)
T = 1.0 → top token ≈ 0.63 (moderate)
T = 2.0 → top token ≈ 0.48 (flattened toward uniform)
Lower T = Sharper distribution = More deterministic
Higher T = Flatter distribution = More random
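These distributions can be checked numerically with a small softmax-with-temperature helper (a sketch, not tied to any particular library):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T  # scale logits by 1/T
    z -= z.max()                             # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 2))
# T=0.5 gives the sharpest distribution [0.84 0.11 0.04],
# T=2.0 the flattest [0.48 0.29 0.23]
```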
When to Use Different Temperatures
Low Temperature (T < 1.0):
- Factual tasks
- Reproducible outputs
- When you want consistent results
Standard Temperature (T = 1.0):
- Default setting
- Balanced behavior
- Good for most tasks
High Temperature (T > 1.0):
- Creative writing
- Diverse outputs
- When you want variety
9.5 Top-k Sampling
What is Top-k?
Top-k sampling limits the sampling to only the top k most likely tokens.
How It Works
Step 1: Get Probabilities
All tokens:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Select Top-k (e.g., k=3)
Top 3:
"World": 0.4
"there": 0.3
"friend": 0.2
Step 3: Remove Others
Set others to 0:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.0
"cat": 0.0
"dog": 0.0
...
Step 4: Renormalize
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
"World": 0.4/0.9 = 0.44
"there": 0.3/0.9 = 0.33
"friend": 0.2/0.9 = 0.22
Step 5: Sample from Top-k
Sample from these 3 tokens only
Mathematical Formulation
Given probabilities \mathbf{p}_t and top-k:
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
Why Top-k?
Benefits:
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations
Example:
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
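The five steps above amount to a few lines of array code. A sketch, reusing the example distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, zero the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]        # indices of the top-k tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]         # everything else stays 0
    return filtered / filtered.sum()     # renormalize over the survivors

probs = [0.4, 0.3, 0.2, 0.05, 0.03, 0.02]
print(np.round(top_k_filter(probs, k=3), 2))
# top 3 renormalized to 0.44, 0.33, 0.22; the rest are 0
```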
9.6 Top-p (Nucleus) Sampling
What is Top-p?
Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability is at least p.
How It Works
Step 1: Sort Probabilities
Sorted (descending):
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Compute Cumulative Probabilities
Cumulative:
"World": 0.4
"there": 0.7 (0.4 + 0.3)
"friend": 0.9 (0.7 + 0.2)
"hello": 0.95 (0.9 + 0.05)
"cat": 0.98 (0.95 + 0.03)
...
Step 3: Find Nucleus (e.g., p=0.9)
Find smallest set where sum ≥ 0.9:
"World": 0.4
"there": 0.3
"friend": 0.2
Cumulative: 0.9 ✓
→ Keep these 3 tokens
Step 4: Remove Others
Keep:
"World": 0.4
"there": 0.3
"friend": 0.2
Others: 0.0
Step 5: Renormalize and Sample
Renormalize and sample
Mathematical Formulation
Given probabilities \mathbf{p}_t and top-p:
Find smallest set S:
S = \arg\min \{ |S'| : \sum_{v \in S'} \mathbf{p}_t[v] \geq p \}
Then:
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
Why Top-p?
Benefits:
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k
Example:
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
9.7 Step-by-Step Generation Process
Complete Process
Given prompt: "Hello"
Step 1: Encode Prompt
Prompt: "Hello"
Token IDs: [72]
Step 2: Forward Pass
Input: [72]
Model processes through layers
Output: Logits for all tokens
Token 72: 5.2
Token 87: 4.8 ← "World"
Token 101: 3.2 ← "there"
Token 108: 2.1 ← "friend"
...
Step 3: Apply Temperature
Temperature: T = 1.0
Scaled logits: Same as above
Step 4: Apply Top-k (Optional)
Top-k: k = 50
Keep top 50 tokens, remove others
Step 5: Apply Top-p (Optional)
Top-p: p = 0.95
Keep tokens with cumulative prob ≥ 0.95
Step 6: Compute Probabilities
Apply softmax:
"World": 0.4
"there": 0.3
"friend": 0.2
...
Step 7: Sample Token
Sample from distribution:
Selected: "World" (token 87)
Step 8: Append Token
Sequence: [72, 87]
Text: "Hello World"
Step 9: Repeat
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
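Putting steps 3–7 together, one decoding step might look like this sketch (temperature first, then top-k on logits, then top-p on the resulting probabilities; parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """One decoding step: scale, filter, softmax, sample."""
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:                      # mask all but the k largest logits
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())      # softmax; exp(-inf) -> 0
    probs /= probs.sum()
    if top_p is not None:                      # nucleus filtering on probabilities
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs = probs * mask / (probs * mask).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [5.2, 4.8, 3.2, 2.1]  # toy logits, as in Step 2 above
print(sample_next_token(logits, temperature=1.0, top_k=3, top_p=0.95))
```

A full generation loop would call this once per new token, appending each result to the sequence before the next forward pass.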
9.8 Exercise: Complete Generation Example
Problem
Given:
- Prompt: "The"
- Model logits for next token:
[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...] (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9
Generate the next token step-by-step.
Step-by-Step Solution
Step 1: Initial Setup
Prompt:
"The"
Token IDs: [32] (assuming "The" = token 32)
Logits:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
Step 2: Apply Temperature
Temperature: T = 1.0
Scaled logits (divide by T):
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
No change (T=1.0 is identity)
Step 3: Apply Top-k Filtering
Top-k: k = 3
Select top 3 tokens:
Top 3:
"cat": 10.0
"dog": 8.0
"car": 5.0
Set others to -∞:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
Step 4: Apply Top-p Filtering
First, compute probabilities from top-k tokens:
Apply softmax:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84
P("cat") = 22026.47 / 25155.84 ≈ 0.875
P("dog") = 2980.96 / 25155.84 ≈ 0.119
P("car") = 148.41 / 25155.84 ≈ 0.006
Cumulative probabilities:
"cat": 0.875
"dog": 0.994 (0.875 + 0.119)
"car": 1.000 (0.994 + 0.006)
Find smallest set where sum ≥ 0.9:
"cat": 0.875 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓
→ Keep "cat" and "dog"
→ Remove "car"
Result:
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
Step 5: Compute Final Probabilities
Apply softmax to remaining tokens:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43
P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
Step 6: Sample Token
Sample from distribution:
Random number: 0.75
Cumulative:
"cat": 0.881 ← 0.75 falls here
"dog": 1.000
→ Selected: "cat"
Answer
Generated token: "cat"
Final sequence:
Prompt: "The"
Generated: "cat"
Full text: "The cat"
Summary
| Step | Operation | Result |
|---|---|---|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |
The model generated "cat" following "The"!
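The exercise arithmetic can be double-checked in a couple of lines:

```python
import math

# Softmax over the two tokens that survive top-k then top-p filtering.
e_cat, e_dog = math.exp(10.0), math.exp(8.0)
total = e_cat + e_dog                    # ≈ 25007.42, as in Step 5
p_cat, p_dog = e_cat / total, e_dog / total
print(round(p_cat, 3), round(p_dog, 3))  # 0.881 0.119
```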
9.9 Key Takeaways
Generation
✅ Generation produces text one token at a time
✅ Autoregressive: uses previous outputs as inputs
✅ Iterative process: predict → sample → append → repeat
Sampling Strategies
✅ Temperature: Controls randomness (lower = deterministic, higher = random)
✅ Top-k: Limits to top k tokens
✅ Top-p: Keeps smallest set with cumulative probability ≥ p
✅ Combined: Often use temperature + top-k or top-p
Why Important
✅ Enables text generation from trained models
✅ Different strategies produce different outputs
✅ Essential for language model deployment
Mathematical Summary
Generation Process
Initialization:
\mathbf{T}_0 = \mathbf{P}
For each step t:
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
t_t \sim \text{Categorical}(\mathbf{p}_t)
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.