# What is Attention? Step-by-Step Explanation

A complete, step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words.

## Table of Contents

1. [The Problem Attention Solves](#21-the-problem-attention-solves)
2. [What is Attention?](#22-what-is-attention)
3. [How Attention Works: Step-by-Step](#23-how-attention-works-step-by-step)
4. [Complete Example: Attention in "Hello World"](#24-complete-example-attention-in-hello-world)
5. [Why Attention Matters](#25-why-attention-matters)
6. [Multi-Head Attention](#26-multi-head-attention)
7. [Visual Representation of Attention](#27-visual-representation-of-attention)
8. [Key Takeaways](#28-key-takeaways-attention)

---

## 2.1 The Problem Attention Solves

### The Challenge

**In a sentence, words depend on each other:**

```
"He saw the cat with binoculars"
```

Two possible readings:
1. He used binoculars to see the cat
2. The cat he saw had binoculars

**Context matters!** The model needs to understand which words relate to each other.

### The Solution: Attention

**Attention allows the model to "look" at other words when processing each word.**

---
|
||||
|
||||
## 2.2 What is Attention?
|
||||
|
||||
### Simple Definition
|
||||
|
||||
**Attention** is a mechanism that determines **how much each word should consider other words** when processing information.
|
||||
|
||||
### Intuitive Analogy
|
||||
|
||||
**Think of reading a sentence:**
|
||||
|
||||
When you read "cat" in:
|
||||
```
|
||||
"The cat sat on the mat"
|
||||
```
|
||||
|
||||
You might:
|
||||
- Pay attention to "sat" (what the cat did)
|
||||
- Pay attention to "mat" (where the cat is)
|
||||
- Pay less attention to "the" (just a word)
|
||||
|
||||
**Attention does the same thing mathematically!**
|
||||
|
||||
---

## 2.3 How Attention Works: Step-by-Step

### High-Level Overview

```
Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention
```

### Detailed Step-by-Step

#### Step 1: Create Query, Key, Value (Q, K, V)

**For each word, create three representations:**

**Query (Q):** "What am I looking for?"
**Key (K):** "What am I offering?"
**Value (V):** "What information do I contain?"

**Example with "Hello World":**

```
Word: "Hello"
Query: [0.2, -0.1, 0.3, ...]    ← What should I look for?
Key:   [0.1, 0.2, -0.1, ...]    ← What do I represent?
Value: [0.15, 0.1, 0.2, ...]    ← What information do I have?

Word: "World"
Query: [0.18, 0.15, 0.25, ...]
Key:   [0.12, 0.19, -0.08, ...]
Value: [0.14, 0.12, 0.18, ...]
```

**How Q, K, V are created:**
```
Q = Word × W_Q   (learned matrix)
K = Word × W_K   (learned matrix)
V = Word × W_V   (learned matrix)
```
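
In code, the three projections are plain matrix-vector multiplications against learned weight matrices. Here is a minimal pure-Python sketch; the 3×3 weight matrices below are invented purely for illustration (they are not the matrices that produced the example vectors above, which come from a higher-dimensional model):

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical learned projection matrices for a 3-dimensional toy model.
# In a real transformer these are trained parameters.
W_Q = [[0.5, 0.1, 0.0], [0.0, 0.4, 0.2], [0.3, 0.0, 0.6]]
W_K = [[0.2, 0.0, 0.1], [0.1, 0.3, 0.0], [0.0, 0.2, 0.5]]
W_V = [[0.4, 0.2, 0.0], [0.0, 0.3, 0.1], [0.1, 0.0, 0.4]]

embedding = [0.10, -0.20, 0.30]  # toy embedding for "Hello"

q = matvec(W_Q, embedding)  # "What am I looking for?"
k = matvec(W_K, embedding)  # "What am I offering?"
v = matvec(W_V, embedding)  # "What information do I contain?"

print(q, k, v)
```

The same embedding yields three different views of the word, because each weight matrix is free to emphasize different dimensions.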

#### Step 2: Compute Similarity Scores

**Compare each query with all keys:**

```
Score[i, j] = How much should word i attend to word j?
```

**Mathematical Formula:**
```
Score[i, j] = (Query[i] · Key[j]) / √d_k
```

**Example (using only the first three dimensions, so d_k = 3):**

**Query for "Hello":** `[0.2, -0.1, 0.3]`
**Key for "Hello":** `[0.1, 0.2, -0.1]`
**Key for "World":** `[0.12, 0.19, -0.08]`

**Calculate similarity:**

```
Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
                        = (0.02 - 0.02 - 0.03) / 1.732
                        = -0.03 / 1.732
                        ≈ -0.017

Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
                        = (0.024 - 0.019 - 0.024) / 1.732
                        = -0.019 / 1.732
                        ≈ -0.011
```

**Result:** Similarity scores measure how strongly two words are related.
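
The arithmetic above can be checked with a few lines of Python (a sketch using the truncated 3-dimensional vectors):

```python
import math

def score(query, key):
    """Scaled dot-product score: (q · k) / sqrt(d_k)."""
    d_k = len(key)
    return sum(qi * ki for qi, ki in zip(query, key)) / math.sqrt(d_k)

q_hello = [0.2, -0.1, 0.3]
k_hello = [0.1, 0.2, -0.1]
k_world = [0.12, 0.19, -0.08]

print(round(score(q_hello, k_hello), 3))  # -0.017
print(round(score(q_hello, k_world), 3))  # -0.011
```

Dividing by √d_k keeps the scores from growing with the vector dimension, which would otherwise push the softmax in the next step toward extreme 0/1 weights.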

#### Step 3: Convert Scores to Attention Weights

**Use softmax to convert scores to probabilities:**

```
Attention[i, j] = exp(Score[i, j]) / Σ_k exp(Score[i, k])
```

**Example:**

**Raw scores:**
```
Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011
```

**Compute exponentials:**
```
exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972
```

**Compute attention weights:**
```
Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)
```

**Meaning:** "Hello" attends 49.9% to itself and 50.1% to "World"
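
The softmax step is small enough to verify directly (a sketch in pure Python):

```python
import math

def softmax(scores):
    """Convert raw scores into attention weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([-0.017, -0.011])
print(round(weights[0], 4), round(weights[1], 4))  # ≈ 0.4985 and 0.5015
```

Because softmax only cares about *differences* between scores, the two near-equal scores produce a near-even 49.9% / 50.1% split.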

#### Step 4: Weighted Combination

**Combine values using attention weights:**

```
Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"]
                + Attention["Hello", "World"] × Value["World"]
```

**Example:**

```
Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]

Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
                = [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
                = [0.145, 0.110, 0.190]
```

**Result:** A new representation that combines information from both words!
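
The weighted combination is just an element-wise blend of the value vectors (a sketch, continuing the same toy numbers):

```python
def weighted_combination(weights, values):
    """Blend value vectors according to attention weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

v_hello = [0.15, 0.10, 0.20]
v_world = [0.14, 0.12, 0.18]

output = weighted_combination([0.499, 0.501], [v_hello, v_world])
print([round(x, 3) for x in output])  # ≈ [0.145, 0.110, 0.190]
```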

---

## 2.4 Complete Example: Attention in "Hello World"

### Input

```
Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"
```

### Step-by-Step Processing

#### Step 1: Embeddings

```
E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]
```

#### Step 2: Create Q, K, V

```
Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]

Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]
```

#### Step 3: Compute Attention Scores

```
Score Matrix (2×2), illustrative values over the full-dimensional vectors:

           "Hello"   "World"
"Hello"     0.69      0.20
"World"     0.40      0.81
```

**Interpretation:**
- "Hello" scores itself (0.69) higher than "World" (0.20)
- "World" scores itself (0.81) higher than "Hello" (0.40)

#### Step 4: Apply Softmax

```
Attention Matrix:

           "Hello"   "World"
"Hello"     0.62      0.38
"World"     0.40      0.60
```

**Interpretation:**
- "Hello" gives 62% attention to itself, 38% to "World"
- "World" gives 40% attention to "Hello", 60% to itself

#### Step 5: Weighted Combination

```
Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
                = 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
                = [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
                = [0.146, 0.108, 0.192]

Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
                = 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
                = [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
                = [0.144, 0.112, 0.188]
```

**Result:** Each word now contains information from both words!
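
All of the steps above can be combined into one small function. A self-contained sketch in pure Python, run on the truncated 3-dimensional toy vectors (so the outputs only approximate the full-dimensional numbers used in this section):

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over a short sequence."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Step 2: scaled dot-product score against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # Step 3: softmax -> attention weights
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 4: weighted combination of the value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

Q = [[0.20, -0.10, 0.30], [0.18, 0.15, 0.25]]   # "Hello", "World"
K = [[0.10, 0.20, -0.10], [0.12, 0.19, -0.08]]
V = [[0.15, 0.10, 0.20], [0.14, 0.12, 0.18]]

for word, out in zip(["Hello", "World"], attention(Q, K, V)):
    print(word, [round(x, 3) for x in out])
```

Each output row is a mixture of both value vectors, weighted by how strongly that word's query matched each key.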

---

## 2.5 Why Attention Matters

### Benefit 1: Context Understanding

**Without attention:**
```
"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of their relationship
```

**With attention:**
```
"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: The model understands they are related
```

### Benefit 2: Long-Range Dependencies

**Attention can connect distant words:**

```
"The cat that I saw yesterday sat on the mat"
```

- "cat" can attend to "yesterday" (even though they are far apart)
- The model understands it is the cat from yesterday

### Benefit 3: Selective Focus

**Attention focuses on relevant information:**

```
"He saw the cat with binoculars"
```

- "saw" attends strongly to "binoculars" (how he saw)
- "saw" attends strongly to "cat" (what he saw)
- Each word focuses on what is relevant to it

---

## 2.6 Multi-Head Attention

### What is Multi-Head Attention?

**Multiple attention "heads" look at different aspects of the input in parallel:**

```
Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (spatial relationships)
...
Head 8: Focuses on yet another aspect
```

(Heads are not assigned these roles by hand; each head learns its own specialization during training. The labels above are illustrative.)

### Visual Representation

```
Input: "Hello World"

Head 1 (Syntax):
"Hello" → attends to "World" (grammatical relationship)

Head 2 (Semantics):
"Hello" → attends to "World" (greeting relationship)

Head 3 (Position):
"Hello" → attends more to itself (being first)

... (other heads)

Final: Combine all heads → Richer representation
```

### Why Multiple Heads?

**Different heads capture different relationships:**

- **Head 1:** Grammatical relationships
- **Head 2:** Semantic relationships
- **Head 3:** Positional relationships
- **Head 4:** Other patterns...

**Together:** A more comprehensive understanding!
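
The mechanics can be sketched in code: split the model dimension across heads, run ordinary single-head attention on each slice, then concatenate the per-head outputs. This is a simplified illustration; a real transformer also applies learned per-head projections and a final output projection, which are omitted here:

```python
import math

def single_head(queries, keys, values):
    """Scaled dot-product attention for one head (as in Section 2.3)."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def multi_head(queries, keys, values, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d = len(queries[0])
    assert d % num_heads == 0, "model dimension must divide evenly across heads"
    hd = d // num_heads  # dimension handled by each head
    outputs = [[] for _ in queries]
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        head_out = single_head([q[sl] for q in queries],
                               [k[sl] for k in keys],
                               [v[sl] for v in values])
        for i, o in enumerate(head_out):
            outputs[i].extend(o)  # concatenate this head's output
    return outputs

# Toy 4-dimensional vectors for "Hello" and "World", split over 2 heads
Q = [[0.20, -0.10, 0.30, 0.10], [0.18, 0.15, 0.25, 0.00]]
K = [[0.10, 0.20, -0.10, 0.20], [0.12, 0.19, -0.08, 0.10]]
V = [[0.15, 0.10, 0.20, 0.05], [0.14, 0.12, 0.18, 0.07]]

print(multi_head(Q, K, V, num_heads=2))
```

Because each head sees only its own slice, the heads compute independent attention patterns; concatenation then gives every word a representation informed by all of them.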

---

## 2.7 Visual Representation of Attention

### Attention Heatmap

```
Attention Weights for "Hello World"

              Position 0    Position 1
              ("Hello")     ("World")
             ┌─────────┐   ┌─────────┐
Position 0   │  0.62   │   │  0.38   │
("Hello")    └─────────┘   └─────────┘
             ┌─────────┐   ┌─────────┐
Position 1   │  0.40   │   │  0.60   │
("World")    └─────────┘   └─────────┘
```

**Reading:**
- Row 0: "Hello" attends 62% to itself, 38% to "World"
- Row 1: "World" attends 40% to "Hello", 60% to itself

### Attention Flow Diagram

```
"Hello" ──── 0.38 ────→ "World"
"Hello" ←─── 0.40 ───── "World"
"Hello" ──── 0.62 ────→ "Hello"   (self-attention)
"World" ──── 0.60 ────→ "World"   (self-attention)
```

**Meaning:** Information flows between words in proportion to the attention weights.

---

## 2.8 Key Takeaways: Attention

✅ **Attention determines which words to focus on**
✅ **It calculates similarity between words**
✅ **It creates weighted combinations of information**
✅ **It enables understanding of relationships between words**
✅ **Multiple heads capture different aspects**

---

*This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.*