# What is Attention? Step-by-Step Explanation
Complete step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words.
## Table of Contents
1. [The Problem Attention Solves](#21-the-problem-attention-solves)
2. [What is Attention?](#22-what-is-attention)
3. [How Attention Works: Step-by-Step](#23-how-attention-works-step-by-step)
4. [Complete Example: Attention in "Hello World"](#24-complete-example-attention-in-hello-world)
5. [Why Attention Matters](#25-why-attention-matters)
6. [Multi-Head Attention](#26-multi-head-attention)
7. [Visual Representation of Attention](#27-visual-representation-of-attention)
8. [Key Takeaways](#28-key-takeaways-attention)
---
## 2.1 The Problem Attention Solves
### The Challenge
**In a sentence, words depend on each other:**
```
"He saw the cat with binoculars"
```
Two possible meanings:
1. He used binoculars to see the cat
2. The cat has binoculars
**Context matters!** The model needs to understand which words relate to each other.
### The Solution: Attention
**Attention allows the model to "look" at other words when processing each word.**
---
## 2.2 What is Attention?
### Simple Definition
**Attention** is a mechanism that determines **how much each word should consider other words** when processing information.
### Intuitive Analogy
**Think of reading a sentence:**
When you read "cat" in:
```
"The cat sat on the mat"
```
You might:
- Pay attention to "sat" (what the cat did)
- Pay attention to "mat" (where the cat is)
- Pay less attention to "the" (just a function word)
**Attention does the same thing mathematically!**
---
## 2.3 How Attention Works: Step-by-Step
### High-Level Overview
```
Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention
```
### Detailed Step-by-Step
#### Step 1: Create Query, Key, Value (Q, K, V)
**For each word, create three representations:**
**Query (Q):** "What am I looking for?"
**Key (K):** "What am I offering?"
**Value (V):** "What information do I contain?"
**Example with "Hello World":**
```
Word: "Hello"
Query: [0.2, -0.1, 0.3, ...] ← What should I look for?
Key: [0.1, 0.2, -0.1, ...] ← What do I represent?
Value: [0.15, 0.1, 0.2, ...] ← What information do I have?
Word: "World"
Query: [0.18, 0.15, 0.25, ...]
Key: [0.12, 0.19, -0.08, ...]
Value: [0.14, 0.12, 0.18, ...]
```
**How Q, K, V are created:**
```
Q = Embedding × W_Q   (W_Q is a learned matrix)
K = Embedding × W_K   (W_K is a learned matrix)
V = Embedding × W_V   (W_V is a learned matrix)
```
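Here is a minimal NumPy sketch of these three projections. The shapes, values, and variable names are illustrative, not taken from the SheepOp code:

```
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                  # embedding size, head size (illustrative)
x = rng.normal(size=(2, d_model))    # embeddings for 2 words: "Hello", "World"

# Learned projection matrices (random stand-ins here; learned during training)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = x @ W_Q   # "What am I looking for?"         shape (2, d_k)
K = x @ W_K   # "What am I offering?"            shape (2, d_k)
V = x @ W_V   # "What information do I contain?" shape (2, d_k)
```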
#### Step 2: Compute Similarity Scores
**Compare each query with all keys:**
```
Score[i, j] = How much should word i attend to word j?
```
**Mathematical Formula:**
```
Score[i, j] = (Query[i] · Key[j]) / √d_k
```
**Example:**
**Query for "Hello":** `[0.2, -0.1, 0.3]`
**Key for "Hello":** `[0.1, 0.2, -0.1]`
**Key for "World":** `[0.12, 0.19, -0.08]`
**Calculate similarity:**
```
Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
= (0.02 - 0.02 - 0.03) / 1.732
= -0.03 / 1.732
≈ -0.017
Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
= (0.024 - 0.019 - 0.024) / 1.732
= -0.019 / 1.732
≈ -0.011
```
**Result:** Similarity scores tell us how related words are
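You can reproduce both scores with a few lines of NumPy, using the example vectors above:

```
import numpy as np

q_hello = np.array([0.2, -0.1, 0.3])
k_hello = np.array([0.1, 0.2, -0.1])
k_world = np.array([0.12, 0.19, -0.08])
d_k = 3  # vector dimension

print(q_hello @ k_hello / np.sqrt(d_k))  # ≈ -0.017
print(q_hello @ k_world / np.sqrt(d_k))  # ≈ -0.011
```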
#### Step 3: Convert Scores to Attention Weights
**Use softmax to convert scores to probabilities:**
```
Attention[i, j] = exp(Score[i, j]) / Σ_k exp(Score[i, k])
```
**Example:**
**Raw Scores:**
```
Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011
```
**Compute exponentials:**
```
exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972
```
**Compute attention weights:**
```
Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)
```
**Meaning:** "Hello" attends 49.9% to itself and 50.1% to "World"
#### Step 4: Weighted Combination
**Combine values using attention weights:**
```
Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"]
+ Attention["Hello", "World"] × Value["World"]
```
**Example:**
```
Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]
Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
= [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
= [0.145, 0.110, 0.190]
```
**Result:** New representation that combines information from both words!
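Putting the four steps together: below is a minimal single-head attention sketch in NumPy. It illustrates the mechanism (it is not SheepOp's actual implementation) and reproduces the output vector computed above:

```
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Step 2: scaled similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 3: softmax per row
    return weights @ V                              # Step 4: weighted sum of values

# Step 1's Q, K, V from the example (3-dimensional for readability)
Q = np.array([[0.2, -0.1, 0.3], [0.18, 0.15, 0.25]])
K = np.array([[0.1,  0.2, -0.1], [0.12, 0.19, -0.08]])
V = np.array([[0.15, 0.1,  0.2], [0.14, 0.12,  0.18]])

print(attention(Q, K, V)[0])  # ≈ [0.145 0.110 0.190], as computed above
```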
---
## 2.4 Complete Example: Attention in "Hello World"
### Input
```
Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"
```
### Step-by-Step Processing
#### Step 1: Embeddings
```
E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]
```
#### Step 2: Create Q, K, V
```
Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]
Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]
```
#### Step 3: Compute Attention Scores
```
Score Matrix (2×2, illustrative round values):

          "Hello"  "World"
"Hello"     0.5      0.0
"World"     0.0      0.4
```
**Interpretation:**
- "Hello" attends to itself (0.5) more than to "World" (0.0)
- "World" attends to itself (0.4) more than to "Hello" (0.0)
#### Step 4: Apply Softmax
```
Attention Matrix (after softmax):

          "Hello"  "World"
"Hello"    0.62     0.38
"World"    0.40     0.60
```
**Interpretation:**
- "Hello" gives 62% attention to itself, 38% to "World"
- "World" gives 40% attention to "Hello", 60% to itself
#### Step 5: Weighted Combination
```
Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
= 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
= [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
= [0.146, 0.108, 0.192]
Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
= 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
= [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
= [0.144, 0.112, 0.188]
```
**Result:** Each word now contains information from both words!
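As a quick check, Step 5 is a single matrix product: the attention matrix times the stacked value vectors. A short NumPy verification:

```
import numpy as np

attn = np.array([[0.62, 0.38],
                 [0.40, 0.60]])
V = np.array([[0.15, 0.10, 0.20],    # V["Hello"]
              [0.14, 0.12, 0.18]])   # V["World"]

print(attn @ V)
# [[0.1462 0.1076 0.1924]   → Output["Hello"]
#  [0.144  0.112  0.188 ]]  → Output["World"]
```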
---
## 2.5 Why Attention Matters
### Benefit 1: Context Understanding
**Without Attention:**
```
"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of relationship
```
**With Attention:**
```
"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: Understands they're related
```
### Benefit 2: Long-Range Dependencies
**Attention can connect distant words:**
```
"The cat that I saw yesterday sat on the mat"
```
- "cat" can attend to "yesterday" (even though far apart)
- Model understands the cat from yesterday
### Benefit 3: Selective Focus
**Attention focuses on relevant information:**
```
"He saw the cat with binoculars"
```
- "saw" attends strongly to "binoculars" (how he saw)
- "cat" attends strongly to "sat" (what it did)
- Each word focuses on what's relevant to it
---
## 2.6 Multi-Head Attention
### What is Multi-Head Attention?
**Multiple attention "heads" run in parallel, and each can focus on a different aspect.** In practice each head's role emerges during training; the labels below are illustrative:
```
Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (spatial relationships)
...
Head 8: Focuses on another aspect
```
### Visual Representation
```
Input: "Hello World"
Head 1 (Syntax):
"Hello" → attends to "World" (subject-object relationship)
Head 2 (Semantics):
"Hello" → attends to "World" (greeting relationship)
Head 3 (Position):
"Hello" → attends more to itself (being first)
... (other heads)
Final: Combine all heads → Richer representation
```
### Why Multiple Heads?
**Different heads capture different relationships:**
- **Head 1:** Grammatical relationships
- **Head 2:** Semantic relationships
- **Head 3:** Positional relationships
- **Head 4:** Other patterns...
**Together:** Comprehensive understanding!
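Below is a minimal NumPy sketch of multi-head attention (shapes and names are illustrative, not SheepOp's code): each head has its own projections and its own attention pattern, and the head outputs are concatenated and projected back to the model dimension:

```
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model = 4, 8
d_head = d_model // n_heads

x = rng.normal(size=(2, d_model))  # embeddings for 2 words

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    # Each head learns its own W_Q, W_K, W_V (random stand-ins here)
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    weights = softmax(Q @ K.T / np.sqrt(d_head))  # each head's own attention
    heads.append(weights @ V)

W_O = rng.normal(size=(d_model, d_model))         # final output projection
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)  # (2, 8): one enriched vector per word
```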
---
## 2.7 Visual Representation of Attention
### Attention Heatmap
```
Attention Weights for "Hello World"

              Position 0    Position 1
              ("Hello")     ("World")
             ┌─────────┐   ┌─────────┐
Position 0   │  0.62   │   │  0.38   │
("Hello")    └─────────┘   └─────────┘
             ┌─────────┐   ┌─────────┐
Position 1   │  0.40   │   │  0.60   │
("World")    └─────────┘   └─────────┘
```
**Reading:**
- Row 0: "Hello" attends 62% to itself, 38% to "World"
- Row 1: "World" attends 40% to "Hello", 60% to itself
### Attention Flow Diagram
```
"Hello" ──── 0.38 ────→ "World"
↑ ↑
│ │
0.62 0.60
│ │
└──────────────────────┘
(self-attention)
```
**Meaning:** Information flows between words based on attention weights.
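To render such a heatmap programmatically, a small matplotlib sketch (an assumed helper, not part of this repo) is enough:

```
import matplotlib.pyplot as plt
import numpy as np

attn = np.array([[0.62, 0.38],
                 [0.40, 0.60]])
words = ["Hello", "World"]

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(2))
ax.set_xticklabels(words)   # columns: attended-to words (keys)
ax.set_yticks(range(2))
ax.set_yticklabels(words)   # rows: attending words (queries)
for i in range(2):
    for j in range(2):
        ax.text(j, i, f"{attn[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im)
plt.show()
```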
---
## 2.8 Key Takeaways: Attention
- **Attention determines which words to focus on**
- **Calculates similarity between words**
- **Creates weighted combinations of information**
- **Enables understanding of relationships**
- **Multiple heads capture different aspects**
---
*This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.*