# What is Attention? Step-by-Step Explanation Complete step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words. ## Table of Contents 1. [The Problem Attention Solves](#21-the-problem-attention-solves) 2. [What is Attention?](#22-what-is-attention) 3. [How Attention Works: Step-by-Step](#23-how-attention-works-step-by-step) 4. [Complete Example: Attention in "Hello World"](#24-complete-example-attention-in-hello-world) 5. [Why Attention Matters](#25-why-attention-matters) 6. [Multi-Head Attention](#26-multi-head-attention) 7. [Visual Representation of Attention](#27-visual-representation-of-attention) 8. [Key Takeaways](#28-key-takeaways) --- ## 2.1 The Problem Attention Solves ### The Challenge **In a sentence, words depend on each other:** ``` "He saw the cat with binoculars" ``` Two possible meanings: 1. He used binoculars to see the cat 2. The cat has binoculars **Context matters!** The model needs to understand which words relate to each other. ### The Solution: Attention **Attention allows the model to "look" at other words when processing each word.** --- ## 2.2 What is Attention? ### Simple Definition **Attention** is a mechanism that determines **how much each word should consider other words** when processing information. ### Intuitive Analogy **Think of reading a sentence:** When you read "cat" in: ``` "The cat sat on the mat" ``` You might: - Pay attention to "sat" (what the cat did) - Pay attention to "mat" (where the cat is) - Pay less attention to "the" (just a word) **Attention does the same thing mathematically!** --- ## 2.3 How Attention Works: Step-by-Step ### High-Level Overview ``` Step 1: Create Query, Key, Value for each word Step 2: Compare queries and keys (find similarities) Step 3: Calculate attention weights (how much to attend) Step 4: Combine values weighted by attention ``` ### Detailed Step-by-Step #### Step 1: Create Query, Key, Value (Q, K, V) **For each word, create three representations:** **Query (Q):** "What am I looking for?" **Key (K):** "What am I offering?" **Value (V):** "What information do I contain?" **Example with "Hello World":** ``` Word: "Hello" Query: [0.2, -0.1, 0.3, ...] ← What should I look for? Key: [0.1, 0.2, -0.1, ...] ← What do I represent? Value: [0.15, 0.1, 0.2, ...] ← What information do I have? Word: "World" Query: [0.18, 0.15, 0.25, ...] Key: [0.12, 0.19, -0.08, ...] Value: [0.14, 0.12, 0.18, ...] ``` **How Q, K, V are created:** ``` Q = Word × W_Q (learned matrix) K = Word × W_K (learned matrix) V = Word × W_V (learned matrix) ``` #### Step 2: Compute Similarity Scores **Compare each query with all keys:** ``` Score[i, j] = How much should word i attend to word j? ``` **Mathematical Formula:** ``` Score[i, j] = (Query[i] · Key[j]) / √d_k ``` **Example:** **Query for "Hello":** `[0.2, -0.1, 0.3]` **Key for "Hello":** `[0.1, 0.2, -0.1]` **Key for "World":** `[0.12, 0.19, -0.08]` **Calculate similarity:** ``` Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3 = (0.02 - 0.02 - 0.03) / 1.732 = -0.03 / 1.732 ≈ -0.017 Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3 = (0.024 - 0.019 - 0.024) / 1.732 = -0.019 / 1.732 ≈ -0.011 ``` **Result:** Similarity scores tell us how related words are #### Step 3: Convert Scores to Attention Weights **Use softmax to convert scores to probabilities:** ``` Attention[i, j] = exp(Score[i, j]) / Σ exp(Score[i, k]) ``` **Example:** **Raw Scores:** ``` Score["Hello", "Hello"] = -0.017 Score["Hello", "World"] = -0.011 ``` **Compute exponentials:** ``` exp(-0.017) ≈ 0.983 exp(-0.011) ≈ 0.989 Sum = 0.983 + 0.989 = 1.972 ``` **Compute attention weights:** ``` Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%) Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%) ``` **Meaning:** "Hello" attends 49.9% to itself and 50.1% to "World" #### Step 4: Weighted Combination **Combine values using attention weights:** ``` Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"] + Attention["Hello", "World"] × Value["World"] ``` **Example:** ``` Value["Hello"] = [0.15, 0.1, 0.2] Value["World"] = [0.14, 0.12, 0.18] Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18] = [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090] = [0.145, 0.110, 0.190] ``` **Result:** New representation that combines information from both words! --- ## 2.4 Complete Example: Attention in "Hello World" ### Input ``` Words: ["Hello", "World"] Position 0: "Hello" Position 1: "World" ``` ### Step-by-Step Processing #### Step 1: Embeddings ``` E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05] E["World"] = [0.15, -0.18, 0.28, ..., 0.10] ``` #### Step 2: Create Q, K, V ``` Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...] K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...] V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...] Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...] K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...] V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...] ``` #### Step 3: Compute Attention Scores ``` Score Matrix (2×2): "Hello" "World" "Hello" 0.5 0.3 "World" 0.4 0.6 ``` **Interpretation:** - "Hello" attends to itself (0.5) more than "World" (0.3) - "World" attends to itself (0.6) more than "Hello" (0.4) #### Step 4: Apply Softmax ``` Attention Matrix: "Hello" "World" "Hello" 0.62 0.38 "World" 0.40 0.60 ``` **Interpretation:** - "Hello" gives 62% attention to itself, 38% to "World" - "World" gives 40% attention to "Hello", 60% to itself #### Step 5: Weighted Combination ``` Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"] = 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18] = [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068] = [0.146, 0.108, 0.192] Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"] = 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18] = [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108] = [0.144, 0.112, 0.188] ``` **Result:** Each word now contains information from both words! --- ## 2.5 Why Attention Matters ### Benefit 1: Context Understanding **Without Attention:** ``` "Hello" is processed in isolation "World" is processed in isolation Result: No understanding of relationship ``` **With Attention:** ``` "Hello" considers "World" (38% attention) "World" considers "Hello" (40% attention) Result: Understands they're related ``` ### Benefit 2: Long-Range Dependencies **Attention can connect distant words:** ``` "The cat that I saw yesterday sat on the mat" ``` - "cat" can attend to "yesterday" (even though far apart) - Model understands the cat from yesterday ### Benefit 3: Selective Focus **Attention focuses on relevant information:** ``` "He saw the cat with binoculars" ``` - "saw" attends strongly to "binoculars" (how he saw) - "cat" attends strongly to "sat" (what it did) - Each word focuses on what's relevant to it --- ## 2.6 Multi-Head Attention ### What is Multi-Head Attention? **Multiple attention "heads" look at different aspects:** ``` Head 1: Focuses on syntax (grammar relationships) Head 2: Focuses on semantics (meaning relationships) Head 3: Focuses on position (spatial relationships) ... Head 8: Focuses on another aspect ``` ### Visual Representation ``` Input: "Hello World" Head 1 (Syntax): "Hello" → attends to "World" (subject-object relationship) Head 2 (Semantics): "Hello" → attends to "World" (greeting relationship) Head 3 (Position): "Hello" → attends more to itself (being first) ... (other heads) Final: Combine all heads → Richer representation ``` ### Why Multiple Heads? **Different heads capture different relationships:** - **Head 1:** Grammatical relationships - **Head 2:** Semantic relationships - **Head 3:** Positional relationships - **Head 4:** Other patterns... **Together:** Comprehensive understanding! --- ## 2.7 Visual Representation of Attention ### Attention Heatmap ``` Attention Weights for "Hello World" Position 0 Position 1 ("Hello") ("World") ┌─────────┐ ┌─────────┐ Position 0 │ 0.62 │ │ 0.38 │ ("Hello") └─────────┘ └─────────┘ ┌─────────┐ ┌─────────┐ Position 1 │ 0.40 │ │ 0.60 │ ("World") └─────────┘ └─────────┘ ``` **Reading:** - Row 0: "Hello" attends 62% to itself, 38% to "World" - Row 1: "World" attends 40% to "Hello", 60% to itself ### Attention Flow Diagram ``` "Hello" ──── 0.38 ────→ "World" ↑ ↑ │ │ 0.62 0.60 │ │ └──────────────────────┘ (self-attention) ``` **Meaning:** Information flows between words based on attention weights. --- ## 2.8 Key Takeaways: Attention ✅ **Attention determines which words to focus on** ✅ **Calculates similarity between words** ✅ **Creates weighted combinations of information** ✅ **Enables understanding of relationships** ✅ **Multiple heads capture different aspects** --- *This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.*