What is Attention? Step-by-Step Explanation
Complete step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words.
Table of Contents
- The Problem Attention Solves
- What is Attention?
- How Attention Works: Step-by-Step
- Complete Example: Attention in "Hello World"
- Why Attention Matters
- Multi-Head Attention
- Visual Representation of Attention
- Key Takeaways
2.1 The Problem Attention Solves
The Challenge
In a sentence, words depend on each other:
"He saw the cat with binoculars"
Two possible meanings:
- He used binoculars to see the cat
- The cat has binoculars
Context matters! The model needs to understand which words relate to each other.
The Solution: Attention
Attention allows the model to "look" at other words when processing each word.
2.2 What is Attention?
Simple Definition
Attention is a mechanism that determines how much each word should consider other words when processing information.
Intuitive Analogy
Think of reading a sentence:
When you read "cat" in:
"The cat sat on the mat"
You might:
- Pay attention to "sat" (what the cat did)
- Pay attention to "mat" (where the cat is)
- Pay less attention to "the" (a function word carrying little meaning)
Attention does the same thing mathematically!
2.3 How Attention Works: Step-by-Step
High-Level Overview
Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention
Detailed Step-by-Step
Step 1: Create Query, Key, Value (Q, K, V)
For each word, create three representations:
Query (Q): "What am I looking for?"
Key (K): "What am I offering?"
Value (V): "What information do I contain?"
Example with "Hello World":
Word: "Hello"
Query: [0.2, -0.1, 0.3, ...] ← What should I look for?
Key: [0.1, 0.2, -0.1, ...] ← What do I represent?
Value: [0.15, 0.1, 0.2, ...] ← What information do I have?
Word: "World"
Query: [0.18, 0.15, 0.25, ...]
Key: [0.12, 0.19, -0.08, ...]
Value: [0.14, 0.12, 0.18, ...]
How Q, K, V are created (each from the word's embedding vector):
Q = Embedding × W_Q (learned weight matrix)
K = Embedding × W_K (learned weight matrix)
V = Embedding × W_V (learned weight matrix)
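The projection step above can be sketched in a few lines of NumPy. The dimensions are toy-sized (3 instead of hundreds), and the weight matrices here are random placeholders; in a real model, W_Q, W_K, and W_V are learned during training.

```python
import numpy as np

d_model = 3  # toy embedding size; real models use hundreds or thousands

rng = np.random.default_rng(0)

# Learned projection matrices -- random placeholders for illustration only.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# One embedding per word, stacked as rows: shape (num_words, d_model).
embeddings = np.array([
    [0.10, -0.20, 0.30],   # "Hello"
    [0.15, -0.18, 0.28],   # "World"
])

# Every word's embedding is multiplied by the same three matrices.
Q = embeddings @ W_Q
K = embeddings @ W_K
V = embeddings @ W_V

print(Q.shape, K.shape, V.shape)  # each (2, 3): one Q, K, V row per word
```

Note that the three projections share nothing except the input: the model is free to learn very different "questions" (Q), "labels" (K), and "contents" (V) for the same word.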
Step 2: Compute Similarity Scores
Compare each query with all keys:
Score[i, j] = How much should word i attend to word j?
Mathematical Formula:
Score[i, j] = (Query[i] · Key[j]) / √d_k
Example:
Query for "Hello": [0.2, -0.1, 0.3]
Key for "Hello": [0.1, 0.2, -0.1]
Key for "World": [0.12, 0.19, -0.08]
Calculate similarity:
Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
= (0.02 - 0.02 - 0.03) / 1.732
= -0.03 / 1.732
≈ -0.017
Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
= (0.024 - 0.019 - 0.024) / 1.732
= -0.019 / 1.732
≈ -0.011
Result: Similarity scores tell us how related words are
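The hand-calculation above can be checked directly. This snippet reproduces the two scores using the same toy 3-dimensional vectors from the text:

```python
import numpy as np

d_k = 3  # key dimension in this toy example

q_hello = np.array([0.2, -0.1, 0.3])      # Query for "Hello"
k_hello = np.array([0.1, 0.2, -0.1])      # Key for "Hello"
k_world = np.array([0.12, 0.19, -0.08])   # Key for "World"

# Scaled dot-product: Score[i, j] = (Q[i] . K[j]) / sqrt(d_k)
score_hh = np.dot(q_hello, k_hello) / np.sqrt(d_k)
score_hw = np.dot(q_hello, k_world) / np.sqrt(d_k)

print(round(score_hh, 3))  # -0.017
print(round(score_hw, 3))  # -0.011
```

The division by √d_k keeps the scores from growing with the vector dimension, which would otherwise push the softmax in the next step toward extreme 0/1 weights.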
Step 3: Convert Scores to Attention Weights
Use softmax to convert scores to probabilities:
Attention[i, j] = exp(Score[i, j]) / Σ exp(Score[i, k])
Example:
Raw Scores:
Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011
Compute exponentials:
exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972
Compute attention weights:
Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)
Meaning: "Hello" attends 49.9% to itself and 50.1% to "World"
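The softmax step can be verified with two lines of NumPy, starting from the raw scores computed above:

```python
import numpy as np

# Raw scores for "Hello" attending to ["Hello", "World"]
scores = np.array([-0.017, -0.011])

# Softmax: exponentiate, then normalise so the weights sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)  # roughly [0.499, 0.501], matching the worked example
```

Because the two scores are nearly equal, the weights come out close to 50/50: softmax turns differences between scores into a probability-like split, and tiny differences produce nearly uniform attention.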
Step 4: Weighted Combination
Combine values using attention weights:
Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"]
+ Attention["Hello", "World"] × Value["World"]
Example:
Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]
Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
= [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
= [0.145, 0.110, 0.190]
Result: New representation that combines information from both words!
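The weighted combination is a single matrix-vector product once the values are stacked as rows:

```python
import numpy as np

# Attention weights for "Hello" (from the softmax step)
attn = np.array([0.499, 0.501])

# Value vectors for "Hello" and "World", stacked as rows
values = np.array([
    [0.15, 0.10, 0.20],   # Value["Hello"]
    [0.14, 0.12, 0.18],   # Value["World"]
])

# Output is the attention-weighted sum of the value rows.
output_hello = attn @ values

print(output_hello)  # approximately [0.145, 0.110, 0.190]
```

This matches the hand-computed result: "Hello"'s new representation is a blend of its own value vector and "World"'s, in proportion to the attention weights.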
2.4 Complete Example: Attention in "Hello World"
Input
Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"
Step-by-Step Processing
Step 1: Embeddings
E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]
Step 2: Create Q, K, V
Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]
Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]
Step 3: Compute Attention Scores
Score Matrix (2×2, illustrative values chosen for readability -- not derived from the toy vectors above):
            "Hello"   "World"
"Hello"       0.5       0.0
"World"       0.2       0.6
Interpretation:
- "Hello" attends to itself (0.5) more than "World" (0.0)
- "World" attends to itself (0.6) more than "Hello" (0.2)
Step 4: Apply Softmax
Attention Matrix:
"Hello" "World"
"Hello" 0.62 0.38
"World" 0.40 0.60
Interpretation:
- "Hello" gives 62% attention to itself, 38% to "World"
- "World" gives 40% attention to "Hello", 60% to itself
Step 5: Weighted Combination
Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
= 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
= [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
= [0.146, 0.108, 0.192]
Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
= 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
= [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
= [0.144, 0.112, 0.188]
Result: Each word now contains information from both words!
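All four steps combine into one short function. The sketch below runs full scaled dot-product attention on the toy Q, K, V vectors from Step 2. (The score matrix in this section uses illustrative numbers for readability; running the actual formula on these very similar toy vectors produces near-uniform weights, so the outputs below differ slightly from the hand-worked ones.)

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Step 2: similarity scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 3: softmax per row
    return weights @ V, weights                     # Step 4: weighted values

# Toy 3-dimensional Q, K, V for ["Hello", "World"]
Q = np.array([[0.20, -0.10, 0.30], [0.18, 0.15, 0.25]])
K = np.array([[0.10, 0.20, -0.10], [0.12, 0.19, -0.08]])
V = np.array([[0.15, 0.10, 0.20], [0.14, 0.12, 0.18]])

output, weights = attention(Q, K, V)
print(weights.round(3))  # 2x2 attention matrix; each row sums to 1
print(output.round(3))   # one new 3-d vector per word
```

Every row of the attention matrix is a probability distribution over all words, and every output row mixes information from both value vectors, which is exactly the "each word now contains information from both words" result above.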
2.5 Why Attention Matters
Benefit 1: Context Understanding
Without Attention:
"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of relationship
With Attention:
"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: Understands they're related
Benefit 2: Long-Range Dependencies
Attention can connect distant words:
"The cat that I saw yesterday sat on the mat"
- "cat" can attend to "yesterday" (even though far apart)
- Model understands the cat from yesterday
Benefit 3: Selective Focus
Attention focuses on relevant information:
"He saw the cat with binoculars"
- "saw" attends strongly to "binoculars" (how he saw)
- "cat" attends strongly to "sat" (what it did)
- Each word focuses on what's relevant to it
2.6 Multi-Head Attention
What is Multi-Head Attention?
Multiple attention "heads" process the input in parallel, each with its own learned Q, K, V projections. Heads pick up their specialties implicitly during training, but intuitively the division of labour might look like:
Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (word-order relationships)
...
Head 8: Focuses on yet another aspect
Visual Representation
Input: "Hello World"
Head 1 (Syntax):
"Hello" → attends to "World" (subject-object relationship)
Head 2 (Semantics):
"Hello" → attends to "World" (greeting relationship)
Head 3 (Position):
"Hello" → attends more to itself (being first)
... (other heads)
Final: Combine all heads → Richer representation
Why Multiple Heads?
Different heads capture different relationships:
- Head 1: Grammatical relationships
- Head 2: Semantic relationships
- Head 3: Positional relationships
- Head 4: Other patterns...
Together: Comprehensive understanding!
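The mechanics can be sketched directly: each head runs the same attention routine with its own smaller projections, and the results are concatenated and mixed by a final output projection. The weights here are random placeholders (a real model learns them), and the sizes are toy-scale.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Minimal multi-head self-attention sketch (random weights, no biases)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (smaller) Q/K/V projections.
        W_Q = rng.normal(size=(d_model, d_head))
        W_K = rng.normal(size=(d_model, d_head))
        W_V = rng.normal(size=(d_model, d_head))
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    # Concatenate all heads, then mix them with a final output projection.
    W_O = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))            # 2 words, d_model = 8
out = multi_head_attention(x, num_heads=4, rng=rng)
print(out.shape)  # (2, 8): same shape as the input, heads mixed together
```

A key design choice: each head works in a smaller subspace (d_head = d_model / num_heads), so the total computation stays roughly the same as single-head attention while each head is free to attend to different relationships.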
2.7 Visual Representation of Attention
Attention Heatmap
Attention Weights for "Hello World"
                Position 0    Position 1
                ("Hello")     ("World")
              ┌─────────┐   ┌─────────┐
Position 0    │  0.62   │   │  0.38   │
("Hello")     └─────────┘   └─────────┘
              ┌─────────┐   ┌─────────┐
Position 1    │  0.40   │   │  0.60   │
("World")     └─────────┘   └─────────┘
Reading:
- Row 0: "Hello" attends 62% to itself, 38% to "World"
- Row 1: "World" attends 40% to "Hello", 60% to itself
Attention Flow Diagram
            ──── 0.38 ────→
  "Hello"                     "World"
            ←─── 0.40 ────
   ↺ 0.62                      ↺ 0.60
(self-attention)           (self-attention)
Meaning: Information flows between words based on attention weights.
2.8 Key Takeaways: Attention
✅ Attention determines which words to focus on
✅ Calculates similarity between words
✅ Creates weighted combinations of information
✅ Enables understanding of relationships
✅ Multiple heads capture different aspects
This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.