sheepOp/docs/GENERATION_EXPLAINED.md

# What is Generation? Step-by-Step Explanation

Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.

## Table of Contents

1. [What is Generation?](#91-what-is-generation)
2. [Autoregressive Generation](#92-autoregressive-generation)
3. [Sampling Strategies](#93-sampling-strategies)
4. [Temperature](#94-temperature)
5. [Top-k Sampling](#95-top-k-sampling)
6. [Top-p (Nucleus) Sampling](#96-top-p-nucleus-sampling)
7. [Step-by-Step Generation Process](#97-step-by-step-generation-process)
8. [Exercise: Complete Generation Example](#98-exercise-complete-generation-example)
9. [Key Takeaways](#99-key-takeaways)

---

## 9.1 What is Generation?

### Simple Definition

**Generation** (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.

### Visual Analogy

**Think of generation like writing a story:**

```
Prompt: "Once upon a time"

Model generates:
  "Once upon a time" → "there"
  "Once upon a time there" → "was"
  "Once upon a time there was" → "a"
  "Once upon a time there was a" → "princess"
  ...

Final: "Once upon a time there was a princess..."
```

**Model predicts next word, one at a time!**

### What Generation Does

**Generation:**
1. **Takes** a prompt (starting text)
2. **Predicts** next token probabilities
3. **Samples** a token from distribution
4. **Appends** token to sequence
5. **Repeats** until complete

**Result:** Generated text continuation!

---

## 9.2 Autoregressive Generation

### What is Autoregressive?

**Autoregressive** means the model uses its own previous outputs as inputs for the next prediction.

### How It Works

**Step 1: Initial Prompt**
```
Prompt: "Hello"
Sequence: ["Hello"]
```

**Step 2: First Prediction**
```
Input: ["Hello"]
Model output: Probabilities for next token
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  ...
```

**Step 3: Sample Token**
```
Sample: "World" (selected)
Sequence: ["Hello", "World"]
```

**Step 4: Second Prediction**
```
Input: ["Hello", "World"]
Model output: Probabilities for next token
  "!": 0.5
  ".": 0.3
  ",": 0.1
  ...
```

**Step 5: Continue**
```
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
```

### Mathematical Formulation

**For prompt $\mathbf{P} = [p_1, ..., p_k]$:**

**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```

**For each step $t \geq k+1$:**

1. **Forward pass:**
   ```math
   \mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
   ```

2. **Get next token probabilities:**
   ```math
   \mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
   ```

3. **Sample token:**
   ```math
   t_t \sim \text{Categorical}(\mathbf{p}_t)
   ```

4. **Append token:**
   ```math
   \mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
   ```

**Repeat until stop condition!**

---

## 9.3 Sampling Strategies

### Deterministic vs Stochastic

**Deterministic (Greedy):**
```
Always pick highest probability:
  "World": 0.4 ← Highest
  "there": 0.3
  "friend": 0.2

→ Always picks "World"
→ Same output every time
```

**Stochastic (Sampling):**
```
Sample from distribution:
  "World": 0.4 (40% chance)
  "there": 0.3 (30% chance)
  "friend": 0.2 (20% chance)

→ Different output each time
→ More diverse generations
```

### Why Sampling?

**Greedy (Deterministic):**
- Same output every time
- Can be repetitive
- Less creative

**Sampling:**
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks

---

## 9.4 Temperature

### What is Temperature?

**Temperature** controls the randomness of sampling by scaling the logits before applying softmax.

### Formula

```math
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
```

**Where:**
- $\mathbf{l}_t$ = logits (raw scores)
- $T$ = temperature
- $\mathbf{p}_t$ = probabilities

### How Temperature Works

**T = 0.5 (Low Temperature - More Deterministic):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.88, 0.11, 0.01]
→ Sharp distribution (one token dominates)
→ More deterministic
```

**T = 1.0 (Standard Temperature):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.66, 0.24, 0.10]
→ Moderate distribution
→ Balanced
```

**T = 2.0 (High Temperature - More Random):**
```
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.52, 0.31, 0.17]
→ Flat distribution (more uniform)
→ More random
```

### Visual Comparison

```
Probability
    │
 1.0│  T=0.5: ●
    │
 0.8│
    │
 0.6│  T=1.0:       ●
    │
 0.4│
    │
 0.2│  T=2.0:                ●
    │
 0.0├───────────────────────── Token
    "World"  "there"  "friend"
```

**Lower T = Sharper distribution = More deterministic**
**Higher T = Flatter distribution = More random**

### When to Use Different Temperatures

**Low Temperature (T < 1.0):**
- Factual tasks
- Reproducible outputs
- When you want consistent results

**Standard Temperature (T = 1.0):**
- Default setting
- Balanced behavior
- Good for most tasks

**High Temperature (T > 1.0):**
- Creative writing
- Diverse outputs
- When you want variety

---

## 9.5 Top-k Sampling

### What is Top-k?

**Top-k sampling** limits the sampling to only the top k most likely tokens.

### How It Works

**Step 1: Get Probabilities**
```
All tokens:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.05
  "cat": 0.03
  "dog": 0.02
  ...
```

**Step 2: Select Top-k (e.g., k=3)**
```
Top 3:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
```

**Step 3: Remove Others**
```
Set others to 0:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.0
  "cat": 0.0
  "dog": 0.0
  ...
```

**Step 4: Renormalize**
```
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
  "World": 0.4/0.9 = 0.44
  "there": 0.3/0.9 = 0.33
  "friend": 0.2/0.9 = 0.22
```

**Step 5: Sample from Top-k**
```
Sample from these 3 tokens only
```

### Mathematical Formulation

**Given probabilities $\mathbf{p}_t$ and top-k:**

```math
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
```

### Why Top-k?

**Benefits:**
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations

**Example:**
```
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
```

---

## 9.6 Top-p (Nucleus) Sampling

### What is Top-p?

**Top-p (nucleus) sampling** keeps the smallest set of tokens whose cumulative probability is at least p.

### How It Works

**Step 1: Sort Probabilities**
```
Sorted (descending):
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  "hello": 0.05
  "cat": 0.03
  "dog": 0.02
  ...
```

**Step 2: Compute Cumulative Probabilities**
```
Cumulative:
  "World": 0.4
  "there": 0.7  (0.4 + 0.3)
  "friend": 0.9  (0.7 + 0.2)
  "hello": 0.95  (0.9 + 0.05)
  "cat": 0.98   (0.95 + 0.03)
  ...
```

**Step 3: Find Nucleus (e.g., p=0.9)**
```
Find smallest set where sum ≥ 0.9:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  Cumulative: 0.9 ✓

→ Keep these 3 tokens
```

**Step 4: Remove Others**
```
Keep:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  Others: 0.0
```

**Step 5: Renormalize and Sample**
```
Renormalize and sample
```

### Mathematical Formulation

**Given probabilities $\mathbf{p}_t$ and top-p:**

**Find smallest set S:**
```math
S = \arg\min \{ |S'| : \sum_{v \in S'} \mathbf{p}_t[v] \geq p \}
```

**Then:**
```math
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
```

### Why Top-p?

**Benefits:**
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k

**Example:**
```
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
```

---

## 9.7 Step-by-Step Generation Process

### Complete Process

**Given prompt: "Hello"**

#### Step 1: Encode Prompt

```
Prompt: "Hello"
Token IDs: [72]
```

#### Step 2: Forward Pass

```
Input: [72]
Model processes through layers
Output: Logits for all tokens
  Token 72: 5.2
  Token 87: 4.8 ← "World"
  Token 101: 3.2 ← "there"
  Token 108: 2.1 ← "friend"
  ...
```

#### Step 3: Apply Temperature

```
Temperature: T = 1.0
Scaled logits: Same as above
```

#### Step 4: Apply Top-k (Optional)

```
Top-k: k = 50
Keep top 50 tokens, remove others
```

#### Step 5: Apply Top-p (Optional)

```
Top-p: p = 0.95
Keep tokens with cumulative prob ≥ 0.95
```

#### Step 6: Compute Probabilities

```
Apply softmax:
  "World": 0.4
  "there": 0.3
  "friend": 0.2
  ...
```

#### Step 7: Sample Token

```
Sample from distribution:
Selected: "World" (token 87)
```

#### Step 8: Append Token

```
Sequence: [72, 87]
Text: "Hello World"
```

#### Step 9: Repeat

```
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
```

---

## 9.8 Exercise: Complete Generation Example

### Problem

**Given:**
- Prompt: "The"
- Model logits for next token: `[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...]` (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9

**Generate the next token step-by-step.**

### Step-by-Step Solution

#### Step 1: Initial Setup

**Prompt:**
```
"The"
Token IDs: [32] (assuming "The" = token 32)
```

**Logits:**
```
Token "cat":   10.0
Token "dog":   8.0
Token "car":   5.0
Token "house": 2.0
Token "tree":  1.0
Token "book":  0.5
...
```

#### Step 2: Apply Temperature

**Temperature: T = 1.0**

**Scaled logits (divide by T):**
```
Token "cat":   10.0 / 1.0 = 10.0
Token "dog":   8.0 / 1.0 = 8.0
Token "car":   5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree":  1.0 / 1.0 = 1.0
Token "book":  0.5 / 1.0 = 0.5
```

**No change (T=1.0 is identity)**

#### Step 3: Apply Top-k Filtering

**Top-k: k = 3**

**Select top 3 tokens:**
```
Top 3:
  "cat":   10.0
  "dog":   8.0
  "car":   5.0
```

**Set others to -∞:**
```
Token "cat":   10.0
Token "dog":   8.0
Token "car":   5.0
Token "house": -∞
Token "tree":  -∞
Token "book":  -∞
```

#### Step 4: Apply Top-p Filtering

**First, compute probabilities from top-k tokens:**

**Apply softmax:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84

P("cat") = 22026.47 / 25155.84 ≈ 0.875
P("dog") = 2980.96 / 25155.84 ≈ 0.119
P("car") = 148.41 / 25155.84 ≈ 0.006
```

**Cumulative probabilities:**
```
"cat":   0.875
"dog":   0.994  (0.875 + 0.119)
"car":   1.000  (0.994 + 0.006)
```

**Find smallest set where sum ≥ 0.9:**
```
"cat": 0.875 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓

→ Keep "cat" and "dog"
→ Remove "car"
```

**Result:**
```
Token "cat":   10.0
Token "dog":   8.0
Token "car":   -∞  (removed)
```

#### Step 5: Compute Final Probabilities

**Apply softmax to remaining tokens:**
```
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43

P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
```

#### Step 6: Sample Token

**Sample from distribution:**
```
Random number: 0.75

Cumulative:
  "cat": 0.881 ← 0.75 falls here
  "dog": 1.000

→ Selected: "cat"
```

### Answer

**Generated token: "cat"**

**Final sequence:**
```
Prompt: "The"
Generated: "cat"
Full text: "The cat"
```

### Summary

| Step | Operation | Result |
|------|-----------|--------|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |

**The model generated "cat" following "The"!**

---

## 9.9 Key Takeaways

### Generation

✅ **Generation produces text one token at a time**
✅ **Autoregressive: uses previous outputs as inputs**
✅ **Iterative process: predict → sample → append → repeat**

### Sampling Strategies

✅ **Temperature: Controls randomness (lower = deterministic, higher = random)**
✅ **Top-k: Limits to top k tokens**
✅ **Top-p: Keeps smallest set with cumulative probability ≥ p**
✅ **Combined: Often use temperature + top-k or top-p**

### Why Important

✅ **Enables text generation from trained models**
✅ **Different strategies produce different outputs**
✅ **Essential for language model deployment**

---

## Mathematical Summary

### Generation Process

**Initialization:**
```math
\mathbf{T}_0 = \mathbf{P}
```

**For each step $t$:**
```math
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
```

```math
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
```

```math
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
```

```math
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
```

```math
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
```

```math
t_t \sim \text{Categorical}(\mathbf{p}_t)
```

```math
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
```

---

*This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.*