What is Generation? Step-by-Step Explanation
Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.
Table of Contents
- What is Generation?
- Autoregressive Generation
- Sampling Strategies
- Temperature
- Top-k Sampling
- Top-p (Nucleus) Sampling
- Step-by-Step Generation Process
- Exercise: Complete Generation Example
- Key Takeaways
9.1 What is Generation?
Simple Definition
Generation (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.
Visual Analogy
Think of generation like writing a story:
Prompt: "Once upon a time"
Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...
Final: "Once upon a time there was a princess..."
The model predicts the next word, one at a time!
What Generation Does
Generation:
- Takes a prompt (starting text)
- Predicts next token probabilities
- Samples a token from distribution
- Appends token to sequence
- Repeats until complete
Result: Generated text continuation!
9.2 Autoregressive Generation
What is Autoregressive?
Autoregressive means the model uses its own previous outputs as inputs for the next prediction.
How It Works
Step 1: Initial Prompt
Prompt: "Hello"
Sequence: ["Hello"]
Step 2: First Prediction
Input: ["Hello"]
Model output: Probabilities for next token
"World": 0.4
"there": 0.3
"friend": 0.2
...
Step 3: Sample Token
Sample: "World" (selected)
Sequence: ["Hello", "World"]
Step 4: Second Prediction
Input: ["Hello", "World"]
Model output: Probabilities for next token
"!": 0.5
".": 0.3
",": 0.1
...
Step 5: Continue
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
Mathematical Formulation
For a prompt \mathbf{P} = [p_1, ..., p_k]:
Initialization:
\mathbf{T}_0 = \mathbf{P}
For each step t = 1, 2, ...:
- Forward pass: \mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
- Get next-token probabilities: \mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])
- Sample a token: t_t \sim \text{Categorical}(\mathbf{p}_t)
- Append the token: \mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
Repeat until the stop condition is met!
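The loop above can be sketched in a few lines of Python. Here `toy_model` is a hypothetical stand-in that returns fixed next-token logits; a real transformer would compute them from the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["Hello", "World", "!", ".", "there"]

def toy_model(token_ids):
    # Hypothetical stand-in for a transformer forward pass: returns
    # fixed next-token logits regardless of context (illustration only).
    return np.array([0.1, 2.0, 1.0, 0.5, 1.5])

def generate(prompt_ids, max_new_tokens=3):
    seq = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(seq)                         # forward pass
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                            # softmax
        next_id = int(rng.choice(len(VOCAB), p=probs))  # sample
        seq.append(next_id)                             # append and repeat
    return seq

print(" ".join(VOCAB[i] for i in generate([0])))  # starts with "Hello", then sampled tokens
```

Note the autoregressive structure: each iteration feeds the growing `seq` (the model's own previous outputs) back into the model.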
9.3 Sampling Strategies
Deterministic vs Stochastic
Deterministic (Greedy):
Always pick highest probability:
"World": 0.4 ← Highest
"there": 0.3
"friend": 0.2
→ Always picks "World"
→ Same output every time
Stochastic (Sampling):
Sample from distribution:
"World": 0.4 (40% chance)
"there": 0.3 (30% chance)
"friend": 0.2 (20% chance)
→ Different output each time
→ More diverse generations
Why Sampling?
Greedy (Deterministic):
- Same output every time
- Can be repetitive
- Less creative
Sampling:
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks
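The difference can be shown with a fixed next-token distribution (a minimal sketch; the token names and probabilities reuse the example above):

```python
import numpy as np

tokens = ["World", "there", "friend", "hello"]
probs = np.array([0.4, 0.3, 0.2, 0.1])

# Greedy decoding: always take the argmax -> identical output every run.
greedy = tokens[int(np.argmax(probs))]
print(greedy)  # World

# Stochastic sampling: draw from the distribution -> output varies per call.
rng = np.random.default_rng(42)
samples = [tokens[int(rng.choice(len(tokens), p=probs))] for _ in range(5)]
print(samples)  # a mix, roughly matching the probabilities
```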
9.4 Temperature
What is Temperature?
Temperature controls the randomness of sampling by scaling the logits before applying softmax.
Formula
\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)
Where:
- \mathbf{l}_t = logits (raw scores)
- T = temperature
- \mathbf{p}_t = probabilities
How Temperature Works
T = 0.5 (Low Temperature - More Deterministic):
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
T = 1.0 (Standard Temperature):
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
T = 2.0 (High Temperature - More Random):
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
Visual Comparison
For the same logits, the probability of the most likely token shifts with T:
T = 0.5 → top token ≈ 0.84 (sharp peak)
T = 1.0 → top token ≈ 0.63 (moderate)
T = 2.0 → top token ≈ 0.48 (flattened toward uniform)
Lower T = Sharper distribution = More deterministic
Higher T = Flatter distribution = More random
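These distributions can be checked numerically with a small softmax-with-temperature helper (a sketch, not tied to any particular library):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T  # scale logits by 1/T
    z -= z.max()                             # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 2))
# T=0.5 gives the sharpest distribution [0.84 0.11 0.04],
# T=2.0 the flattest [0.48 0.29 0.23]
```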
When to Use Different Temperatures
Low Temperature (T < 1.0):
- Factual tasks
- Reproducible outputs
- When you want consistent results
Standard Temperature (T = 1.0):
- Default setting
- Balanced behavior
- Good for most tasks
High Temperature (T > 1.0):
- Creative writing
- Diverse outputs
- When you want variety
9.5 Top-k Sampling
What is Top-k?
Top-k sampling limits the sampling to only the top k most likely tokens.
How It Works
Step 1: Get Probabilities
All tokens:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Select Top-k (e.g., k=3)
Top 3:
"World": 0.4
"there": 0.3
"friend": 0.2
Step 3: Remove Others
Set others to 0:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.0
"cat": 0.0
"dog": 0.0
...
Step 4: Renormalize
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
"World": 0.4/0.9 = 0.44
"there": 0.3/0.9 = 0.33
"friend": 0.2/0.9 = 0.22
Step 5: Sample from Top-k
Sample from these 3 tokens only
Mathematical Formulation
Given probabilities \mathbf{p}_t and top-k:
\mathbf{p}_t^{topk}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in \text{top-k}} \mathbf{p}_t[u]} & \text{if } v \in \text{top-k} \\
0 & \text{otherwise}
\end{cases}
Why Top-k?
Benefits:
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations
Example:
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
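The five steps above amount to a few lines of array code. A sketch, reusing the example distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, zero the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]        # indices of the top-k tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]         # everything else stays 0
    return filtered / filtered.sum()     # renormalize over the survivors

probs = [0.4, 0.3, 0.2, 0.05, 0.03, 0.02]
print(np.round(top_k_filter(probs, k=3), 2))
# top 3 renormalized to 0.44, 0.33, 0.22; the rest are 0
```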
9.6 Top-p (Nucleus) Sampling
What is Top-p?
Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability is at least p.
How It Works
Step 1: Sort Probabilities
Sorted (descending):
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Compute Cumulative Probabilities
Cumulative:
"World": 0.4
"there": 0.7 (0.4 + 0.3)
"friend": 0.9 (0.7 + 0.2)
"hello": 0.95 (0.9 + 0.05)
"cat": 0.98 (0.95 + 0.03)
...
Step 3: Find Nucleus (e.g., p=0.9)
Find smallest set where sum ≥ 0.9:
"World": 0.4
"there": 0.3
"friend": 0.2
Cumulative: 0.9 ✓
→ Keep these 3 tokens
Step 4: Remove Others
Keep:
"World": 0.4
"there": 0.3
"friend": 0.2
Others: 0.0
Step 5: Renormalize and Sample
Renormalize and sample
Mathematical Formulation
Given probabilities \mathbf{p}_t and top-p:
Find smallest set S:
S = \arg\min \{ |S'| : \sum_{v \in S'} \mathbf{p}_t[v] \geq p \}
Then:
\mathbf{p}_t^{topp}[v] = \begin{cases}
\frac{\mathbf{p}_t[v]}{\sum_{u \in S} \mathbf{p}_t[u]} & \text{if } v \in S \\
0 & \text{otherwise}
\end{cases}
Why Top-p?
Benefits:
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k
Example:
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
9.7 Step-by-Step Generation Process
Complete Process
Given prompt: "Hello"
Step 1: Encode Prompt
Prompt: "Hello"
Token IDs: [72]
Step 2: Forward Pass
Input: [72]
Model processes through layers
Output: Logits for all tokens
Token 72: 5.2
Token 87: 4.8 ← "World"
Token 101: 3.2 ← "there"
Token 108: 2.1 ← "friend"
...
Step 3: Apply Temperature
Temperature: T = 1.0
Scaled logits: Same as above
Step 4: Apply Top-k (Optional)
Top-k: k = 50
Keep top 50 tokens, remove others
Step 5: Apply Top-p (Optional)
Top-p: p = 0.95
Keep tokens with cumulative prob ≥ 0.95
Step 6: Compute Probabilities
Apply softmax:
"World": 0.4
"there": 0.3
"friend": 0.2
...
Step 7: Sample Token
Sample from distribution:
Selected: "World" (token 87)
Step 8: Append Token
Sequence: [72, 87]
Text: "Hello World"
Step 9: Repeat
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
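Putting steps 3–7 together, one decoding step might look like this sketch (temperature first, then top-k on logits, then top-p on the resulting probabilities; parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """One decoding step: scale, filter, softmax, sample."""
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:                      # mask all but the k largest logits
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())      # softmax; exp(-inf) -> 0
    probs /= probs.sum()
    if top_p is not None:                      # nucleus filtering on probabilities
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs = probs * mask / (probs * mask).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [5.2, 4.8, 3.2, 2.1]  # toy logits, as in Step 2 above
print(sample_next_token(logits, temperature=1.0, top_k=3, top_p=0.95))
```

A full generation loop would call this once per new token, appending each result to the sequence before the next forward pass.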
9.8 Exercise: Complete Generation Example
Problem
Given:
- Prompt: "The"
- Model logits for next token:
[10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...] (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9
Generate the next token step-by-step.
Step-by-Step Solution
Step 1: Initial Setup
Prompt:
"The"
Token IDs: [32] (assuming "The" = token 32)
Logits:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
Step 2: Apply Temperature
Temperature: T = 1.0
Scaled logits (divide by T):
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
No change (T=1.0 is identity)
Step 3: Apply Top-k Filtering
Top-k: k = 3
Select top 3 tokens:
Top 3:
"cat": 10.0
"dog": 8.0
"car": 5.0
Set others to -∞:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
Step 4: Apply Top-p Filtering
First, compute probabilities from top-k tokens:
Apply softmax:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84
P("cat") = 22026.47 / 25155.84 ≈ 0.875
P("dog") = 2980.96 / 25155.84 ≈ 0.119
P("car") = 148.41 / 25155.84 ≈ 0.006
Cumulative probabilities:
"cat": 0.875
"dog": 0.994 (0.875 + 0.119)
"car": 1.000 (0.994 + 0.006)
Find smallest set where sum ≥ 0.9:
"cat": 0.875 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓
→ Keep "cat" and "dog"
→ Remove "car"
Result:
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
Step 5: Compute Final Probabilities
Apply softmax to remaining tokens:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43
P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
Step 6: Sample Token
Sample from distribution:
Random number: 0.75
Cumulative:
"cat": 0.881 ← 0.75 falls here
"dog": 1.000
→ Selected: "cat"
Answer
Generated token: "cat"
Final sequence:
Prompt: "The"
Generated: "cat"
Full text: "The cat"
Summary
| Step | Operation | Result |
|---|---|---|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |
The model generated "cat" following "The"!
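The exercise arithmetic can be double-checked in a couple of lines:

```python
import math

# Softmax over the two tokens that survive top-k then top-p filtering.
e_cat, e_dog = math.exp(10.0), math.exp(8.0)
total = e_cat + e_dog                    # ≈ 25007.42, as in Step 5
p_cat, p_dog = e_cat / total, e_dog / total
print(round(p_cat, 3), round(p_dog, 3))  # 0.881 0.119
```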
9.9 Key Takeaways
Generation
✅ Generation produces text one token at a time
✅ Autoregressive: uses previous outputs as inputs
✅ Iterative process: predict → sample → append → repeat
Sampling Strategies
✅ Temperature: Controls randomness (lower = deterministic, higher = random)
✅ Top-k: Limits to top k tokens
✅ Top-p: Keeps smallest set with cumulative probability ≥ p
✅ Combined: Often use temperature + top-k or top-p
Why Important
✅ Enables text generation from trained models
✅ Different strategies produce different outputs
✅ Essential for language model deployment
Mathematical Summary
Generation Process
Initialization:
\mathbf{T}_0 = \mathbf{P}
For each step t:
\mathbf{l}_t = \text{Model}(\mathbf{T}_{t-1})[:, -1, :]
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T} \quad \text{(temperature)}
\mathbf{l}_t'' = \text{Top-k}(\mathbf{l}_t') \quad \text{(optional)}
\mathbf{l}_t''' = \text{Top-p}(\mathbf{l}_t'') \quad \text{(optional)}
\mathbf{p}_t = \text{softmax}(\mathbf{l}_t''')
t_t \sim \text{Categorical}(\mathbf{p}_t)
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.