sheepOp/docs/ATTENTION_EXPLAINED.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00


What is Attention? Step-by-Step Explanation

A step-by-step explanation of attention, the mechanism transformer models use to understand how words in a sequence relate to each other.

Table of Contents

  2.1 The Problem Attention Solves
  2.2 What is Attention?
  2.3 How Attention Works: Step-by-Step
  2.4 Complete Example: Attention in "Hello World"
  2.5 Why Attention Matters
  2.6 Multi-Head Attention
  2.7 Visual Representation of Attention
  2.8 Key Takeaways

2.1 The Problem Attention Solves

The Challenge

In a sentence, words depend on each other:

"He saw the cat with binoculars"

Two possible meanings:

  1. He used binoculars to see the cat
  2. The cat he saw had binoculars

Context matters! The model needs to understand which words relate to each other.

The Solution: Attention

Attention allows the model to "look" at other words when processing each word.


2.2 What is Attention?

Simple Definition

Attention is a mechanism that determines how much each word should consider other words when processing information.

Intuitive Analogy

Think of reading a sentence:

When you read "cat" in:

"The cat sat on the mat"

You might:

  • Pay attention to "sat" (what the cat did)
  • Pay attention to "mat" (where the cat is)
  • Pay less attention to "the" (just a word)

Attention does the same thing mathematically!


2.3 How Attention Works: Step-by-Step

High-Level Overview

Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention

Detailed Step-by-Step

Step 1: Create Query, Key, Value (Q, K, V)

For each word, create three representations:

Query (Q): "What am I looking for?"
Key (K): "What am I offering?"
Value (V): "What information do I contain?"

Example with "Hello World":

Word: "Hello"
    Query: [0.2, -0.1, 0.3, ...]  ← What should I look for?
    Key:   [0.1, 0.2, -0.1, ...]  ← What do I represent?
    Value: [0.15, 0.1, 0.2, ...]  ← What information do I have?

Word: "World"
    Query: [0.18, 0.15, 0.25, ...]
    Key:   [0.12, 0.19, -0.08, ...]
    Value: [0.14, 0.12, 0.18, ...]

How Q, K, V are created:

Q = E × W_Q  (learned weight matrix)
K = E × W_K  (learned weight matrix)
V = E × W_V  (learned weight matrix)

where E is the word's embedding vector.
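
Step 1 can be sketched in a few lines of NumPy. The embedding and the projection matrices below are toy values (randomly initialized here, whereas a real model learns them during training):

```python
import numpy as np

d_model = 3
rng = np.random.default_rng(0)

# Learned projection matrices in a real model; random toy values here
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

hello = np.array([0.10, -0.20, 0.30])  # toy embedding for "Hello"

Q = hello @ W_Q  # "What am I looking for?"
K = hello @ W_K  # "What am I offering?"
V = hello @ W_V  # "What information do I contain?"
```

Each projection keeps the same dimensionality here for simplicity; in practice each head usually projects down to a smaller dimension.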

Step 2: Compute Similarity Scores

Compare each query with all keys:

Score[i, j] = How much should word i attend to word j?

Mathematical Formula:

Score[i, j] = (Query[i] · Key[j]) / √d_k

Example:

Query for "Hello": [0.2, -0.1, 0.3]
Key for "Hello": [0.1, 0.2, -0.1]
Key for "World": [0.12, 0.19, -0.08]

Calculate similarity:

Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
                        = (0.02 - 0.02 - 0.03) / 1.732
                        = -0.03 / 1.732
                        ≈ -0.017

Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
                        = (0.024 - 0.019 - 0.024) / 1.732
                        = -0.019 / 1.732
                        ≈ -0.011

Result: Similarity scores tell us how related words are
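
The two scores above can be checked directly with the formula, using the same toy 3-dimensional vectors:

```python
import numpy as np

q_hello = np.array([0.20, -0.10, 0.30])   # Query for "Hello"
k_hello = np.array([0.10,  0.20, -0.10])  # Key for "Hello"
k_world = np.array([0.12,  0.19, -0.08])  # Key for "World"
d_k = 3

# Scaled dot product: (Q · K) / sqrt(d_k)
score_hh = (q_hello @ k_hello) / np.sqrt(d_k)  # ≈ -0.017
score_hw = (q_hello @ k_world) / np.sqrt(d_k)  # ≈ -0.011
```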

Step 3: Convert Scores to Attention Weights

Use softmax to convert scores to probabilities:

Attention[i, j] = exp(Score[i, j]) / Σ exp(Score[i, k])

Example:

Raw Scores:

Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011

Compute exponentials:

exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972

Compute attention weights:

Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)

Meaning: "Hello" attends 49.9% to itself and 50.1% to "World"
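
The softmax computation above is a one-liner in NumPy:

```python
import numpy as np

scores = np.array([-0.017, -0.011])              # raw scores from Step 2
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> [~0.499, ~0.501]
```

Softmax guarantees the weights are positive and sum to 1, so they can be read as "percent of attention".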

Step 4: Weighted Combination

Combine values using attention weights:

Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"] 
                + Attention["Hello", "World"] × Value["World"]

Example:

Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]

Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
                = [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
                = [0.145, 0.110, 0.190]

Result: New representation that combines information from both words!
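
The weighted combination is just a matrix-vector product of the attention weights with the stacked values:

```python
import numpy as np

attn = np.array([0.499, 0.501])      # attention weights for "Hello"
V = np.array([[0.15, 0.10, 0.20],    # Value["Hello"]
              [0.14, 0.12, 0.18]])   # Value["World"]

output_hello = attn @ V              # ≈ [0.145, 0.110, 0.190]
```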


2.4 Complete Example: Attention in "Hello World"

Input

Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"

Step-by-Step Processing

Step 1: Embeddings

E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]

Step 2: Create Q, K, V

Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]

Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]

Step 3: Compute Attention Scores

Score Matrix (2×2), illustrative values:

         "Hello"  "World"
"Hello"    0.5      0.0
"World"    0.0      0.4

Interpretation:

  • "Hello" attends to itself (0.5) more than to "World" (0.0)
  • "World" attends to itself (0.4) more than to "Hello" (0.0)

Step 4: Apply Softmax

Attention Matrix:

         "Hello"  "World"
"Hello"   0.62    0.38
"World"   0.40    0.60

Interpretation:

  • "Hello" gives 62% attention to itself, 38% to "World"
  • "World" gives 40% attention to "Hello", 60% to itself

Step 5: Weighted Combination

Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
                = 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
                = [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
                = [0.146, 0.108, 0.192]

Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
                = 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
                = [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
                = [0.144, 0.112, 0.188]

Result: Each word now contains information from both words!
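
The whole pipeline (scores, softmax, weighted combination) fits in a short NumPy function. The Q/K/V matrices below are the toy vectors from the examples in Section 2.3; a trained model would produce them with its learned W_Q, W_K, W_V matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

# Rows = words ("Hello", "World"); toy 3-dim values from the examples above
Q = np.array([[0.20, -0.10, 0.30],
              [0.18,  0.15, 0.25]])
K = np.array([[0.10,  0.20, -0.10],
              [0.12,  0.19, -0.08]])
V = np.array([[0.15,  0.10,  0.20],
              [0.14,  0.12,  0.18]])

out, w = attention(Q, K, V)  # out: new representations, w: attention matrix
```

Each row of `w` sums to 1, and each row of `out` mixes information from both words, exactly as in the worked example.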


2.5 Why Attention Matters

Benefit 1: Context Understanding

Without Attention:

"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of relationship

With Attention:

"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: Understands they're related

Benefit 2: Long-Range Dependencies

Attention can connect distant words:

"The cat that I saw yesterday sat on the mat"
  • "cat" can attend to "yesterday" (even though far apart)
  • Model understands the cat from yesterday

Benefit 3: Selective Focus

Attention focuses on relevant information:

"He saw the cat with binoculars"
  • "saw" attends strongly to "binoculars" (how he saw)
  • "cat" attends strongly to "saw" (what happened to it)
  • Each word focuses on what's relevant to it

2.6 Multi-Head Attention

What is Multi-Head Attention?

Multiple attention "heads" look at different aspects:

Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (spatial relationships)
...
Head 8: Focuses on another aspect

Visual Representation

Input: "Hello World"

Head 1 (Syntax):
    "Hello" → attends to "World" (subject-object relationship)

Head 2 (Semantics):
    "Hello" → attends to "World" (greeting relationship)

Head 3 (Position):
    "Hello" → attends more to itself (being first)

... (other heads)

Final: Combine all heads → Richer representation

Why Multiple Heads?

Different heads capture different relationships:

  • Head 1: Grammatical relationships
  • Head 2: Semantic relationships
  • Head 3: Positional relationships
  • Head 4: Other patterns...

Together: Comprehensive understanding!
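
A minimal multi-head sketch, assuming toy random weights: each head runs the same attention computation with its own projections, and the results are concatenated. (Real implementations also apply a final learned output projection W_O after concatenation, omitted here.)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    """Run each head's attention independently, then concatenate.
    `heads` is a list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 8, 4, 2
X = rng.normal(size=(2, d_model))  # embeddings for "Hello", "World"
# One (W_Q, W_K, W_V) tuple per head, random toy values
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head_attention(X, heads)  # shape (2, n_heads * d_head)
```

Because every head has its own projections, each one can learn to score a different kind of relationship before the results are merged.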


2.7 Visual Representation of Attention

Attention Heatmap

Attention Weights for "Hello World"

         Position 0    Position 1
         ("Hello")     ("World")
           ┌─────────┐  ┌─────────┐
Position 0 │  0.62   │  │  0.38   │
("Hello")  └─────────┘  └─────────┘
           ┌─────────┐  ┌─────────┐
Position 1 │  0.40   │  │  0.60   │
("World")  └─────────┘  └─────────┘

Reading:

  • Row 0: "Hello" attends 62% to itself, 38% to "World"
  • Row 1: "World" attends 40% to "Hello", 60% to itself

Attention Flow Diagram

"Hello" ──── 0.38 ────→ "World"
"Hello" ←─── 0.40 ───── "World"

"Hello" ↻ 0.62 (self-attention)
"World" ↻ 0.60 (self-attention)

Meaning: Information flows between words based on attention weights.


2.8 Key Takeaways: Attention

  • Attention determines which words to focus on
  • It calculates similarity between words
  • It creates weighted combinations of information
  • It enables understanding of relationships between words
  • Multiple heads capture different aspects


This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.