What is Attention? Step-by-Step Explanation
Complete step-by-step explanation of attention mechanisms in transformer models: how models understand relationships between words.
Table of Contents
- The Problem Attention Solves
- What is Attention?
- How Attention Works: Step-by-Step
- Complete Example: Attention in "Hello World"
- Why Attention Matters
- Multi-Head Attention
- Visual Representation of Attention
- Key Takeaways
2.1 The Problem Attention Solves
The Challenge
In a sentence, words depend on each other:
"He saw the cat with binoculars"
Two possible meanings:
- He used binoculars to see the cat
- The cat has binoculars
Context matters! The model needs to understand which words relate to each other.
The Solution: Attention
Attention allows the model to "look" at other words when processing each word.
2.2 What is Attention?
Simple Definition
Attention is a mechanism that determines how much each word should consider other words when processing information.
Intuitive Analogy
Think of reading a sentence:
When you read "cat" in:
"The cat sat on the mat"
You might:
- Pay attention to "sat" (what the cat did)
- Pay attention to "mat" (where the cat is)
- Pay less attention to "the" (a function word carrying little meaning)
Attention does the same thing mathematically!
2.3 How Attention Works: Step-by-Step
High-Level Overview
Step 1: Create Query, Key, Value for each word
Step 2: Compare queries and keys (find similarities)
Step 3: Calculate attention weights (how much to attend)
Step 4: Combine values weighted by attention
Detailed Step-by-Step
Step 1: Create Query, Key, Value (Q, K, V)
For each word, create three representations:
Query (Q): "What am I looking for?"
Key (K): "What am I offering?"
Value (V): "What information do I contain?"
Example with "Hello World":
Word: "Hello"
Query: [0.2, -0.1, 0.3, ...] ← What should I look for?
Key: [0.1, 0.2, -0.1, ...] ← What do I represent?
Value: [0.15, 0.1, 0.2, ...] ← What information do I have?
Word: "World"
Query: [0.18, 0.15, 0.25, ...]
Key: [0.12, 0.19, -0.08, ...]
Value: [0.14, 0.12, 0.18, ...]
How Q, K, V are created (each from the word's embedding vector):
Q = Embedding × W_Q (learned weight matrix)
K = Embedding × W_K (learned weight matrix)
V = Embedding × W_V (learned weight matrix)
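The projection step above can be sketched in a few lines of NumPy. The dimensions are toy-sized (3 instead of hundreds), and the weight matrices here are random placeholders; in a real model, W_Q, W_K, and W_V are learned during training.

```python
import numpy as np

d_model = 3  # toy embedding size; real models use hundreds or thousands

rng = np.random.default_rng(0)

# Learned projection matrices -- random placeholders for illustration only.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# One embedding per word, stacked as rows: shape (num_words, d_model).
embeddings = np.array([
    [0.10, -0.20, 0.30],   # "Hello"
    [0.15, -0.18, 0.28],   # "World"
])

# Every word's embedding is multiplied by the same three matrices.
Q = embeddings @ W_Q
K = embeddings @ W_K
V = embeddings @ W_V

print(Q.shape, K.shape, V.shape)  # each (2, 3): one Q, K, V row per word
```

Note that the three projections share nothing except the input: the model is free to learn very different "questions" (Q), "labels" (K), and "contents" (V) for the same word.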
Step 2: Compute Similarity Scores
Compare each query with all keys:
Score[i, j] = How much should word i attend to word j?
Mathematical Formula:
Score[i, j] = (Query[i] · Key[j]) / √d_k
Example:
Query for "Hello": [0.2, -0.1, 0.3]
Key for "Hello": [0.1, 0.2, -0.1]
Key for "World": [0.12, 0.19, -0.08]
Calculate similarity:
Score["Hello", "Hello"] = (0.2×0.1 + (-0.1)×0.2 + 0.3×(-0.1)) / √3
= (0.02 - 0.02 - 0.03) / 1.732
= -0.03 / 1.732
≈ -0.017
Score["Hello", "World"] = (0.2×0.12 + (-0.1)×0.19 + 0.3×(-0.08)) / √3
= (0.024 - 0.019 - 0.024) / 1.732
= -0.019 / 1.732
≈ -0.011
Result: Similarity scores tell us how related words are
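The hand-calculation above can be checked directly. This snippet reproduces the two scores using the same toy 3-dimensional vectors from the text:

```python
import numpy as np

d_k = 3  # key dimension in this toy example

q_hello = np.array([0.2, -0.1, 0.3])      # Query for "Hello"
k_hello = np.array([0.1, 0.2, -0.1])      # Key for "Hello"
k_world = np.array([0.12, 0.19, -0.08])   # Key for "World"

# Scaled dot-product: Score[i, j] = (Q[i] . K[j]) / sqrt(d_k)
score_hh = np.dot(q_hello, k_hello) / np.sqrt(d_k)
score_hw = np.dot(q_hello, k_world) / np.sqrt(d_k)

print(round(score_hh, 3))  # -0.017
print(round(score_hw, 3))  # -0.011
```

The division by √d_k keeps the scores from growing with the vector dimension, which would otherwise push the softmax in the next step toward extreme 0/1 weights.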
Step 3: Convert Scores to Attention Weights
Use softmax to convert scores to probabilities:
Attention[i, j] = exp(Score[i, j]) / Σ exp(Score[i, k])
Example:
Raw Scores:
Score["Hello", "Hello"] = -0.017
Score["Hello", "World"] = -0.011
Compute exponentials:
exp(-0.017) ≈ 0.983
exp(-0.011) ≈ 0.989
Sum = 0.983 + 0.989 = 1.972
Compute attention weights:
Attention["Hello", "Hello"] = 0.983 / 1.972 ≈ 0.499 (49.9%)
Attention["Hello", "World"] = 0.989 / 1.972 ≈ 0.501 (50.1%)
Meaning: "Hello" attends 49.9% to itself and 50.1% to "World"
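The softmax step can be verified with two lines of NumPy, starting from the raw scores computed above:

```python
import numpy as np

# Raw scores for "Hello" attending to ["Hello", "World"]
scores = np.array([-0.017, -0.011])

# Softmax: exponentiate, then normalise so the weights sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)  # roughly [0.499, 0.501], matching the worked example
```

Because the two scores are nearly equal, the weights come out close to 50/50: softmax turns differences between scores into a probability-like split, and tiny differences produce nearly uniform attention.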
Step 4: Weighted Combination
Combine values using attention weights:
Output["Hello"] = Attention["Hello", "Hello"] × Value["Hello"]
+ Attention["Hello", "World"] × Value["World"]
Example:
Value["Hello"] = [0.15, 0.1, 0.2]
Value["World"] = [0.14, 0.12, 0.18]
Output["Hello"] = 0.499 × [0.15, 0.1, 0.2] + 0.501 × [0.14, 0.12, 0.18]
= [0.075, 0.050, 0.100] + [0.070, 0.060, 0.090]
= [0.145, 0.110, 0.190]
Result: New representation that combines information from both words!
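The weighted combination is a single matrix-vector product once the values are stacked as rows:

```python
import numpy as np

# Attention weights for "Hello" (from the softmax step)
attn = np.array([0.499, 0.501])

# Value vectors for "Hello" and "World", stacked as rows
values = np.array([
    [0.15, 0.10, 0.20],   # Value["Hello"]
    [0.14, 0.12, 0.18],   # Value["World"]
])

# Output is the attention-weighted sum of the value rows.
output_hello = attn @ values

print(output_hello)  # approximately [0.145, 0.110, 0.190]
```

This matches the hand-computed result: "Hello"'s new representation is a blend of its own value vector and "World"'s, in proportion to the attention weights.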
2.4 Complete Example: Attention in "Hello World"
Input
Words: ["Hello", "World"]
Position 0: "Hello"
Position 1: "World"
Step-by-Step Processing
Step 1: Embeddings
E["Hello"] = [0.10, -0.20, 0.30, ..., 0.05]
E["World"] = [0.15, -0.18, 0.28, ..., 0.10]
Step 2: Create Q, K, V
Q["Hello"] = E["Hello"] × W_Q = [0.2, -0.1, 0.3, ...]
K["Hello"] = E["Hello"] × W_K = [0.1, 0.2, -0.1, ...]
V["Hello"] = E["Hello"] × W_V = [0.15, 0.1, 0.2, ...]
Q["World"] = E["World"] × W_Q = [0.18, 0.15, 0.25, ...]
K["World"] = E["World"] × W_K = [0.12, 0.19, -0.08, ...]
V["World"] = E["World"] × W_V = [0.14, 0.12, 0.18, ...]
Step 3: Compute Attention Scores
Score Matrix (2×2, illustrative values chosen for readability -- not derived from the toy vectors above):
            "Hello"   "World"
"Hello"       0.5       0.0
"World"       0.2       0.6
Interpretation:
- "Hello" attends to itself (0.5) more than "World" (0.0)
- "World" attends to itself (0.6) more than "Hello" (0.2)
Step 4: Apply Softmax
Attention Matrix:
"Hello" "World"
"Hello" 0.62 0.38
"World" 0.40 0.60
Interpretation:
- "Hello" gives 62% attention to itself, 38% to "World"
- "World" gives 40% attention to "Hello", 60% to itself
Step 5: Weighted Combination
Output["Hello"] = 0.62 × V["Hello"] + 0.38 × V["World"]
= 0.62 × [0.15, 0.1, 0.2] + 0.38 × [0.14, 0.12, 0.18]
= [0.093, 0.062, 0.124] + [0.053, 0.046, 0.068]
= [0.146, 0.108, 0.192]
Output["World"] = 0.40 × V["Hello"] + 0.60 × V["World"]
= 0.40 × [0.15, 0.1, 0.2] + 0.60 × [0.14, 0.12, 0.18]
= [0.060, 0.040, 0.080] + [0.084, 0.072, 0.108]
= [0.144, 0.112, 0.188]
Result: Each word now contains information from both words!
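All four steps combine into one short function. The sketch below runs full scaled dot-product attention on the toy Q, K, V vectors from Step 2. (The score matrix in this section uses illustrative numbers for readability; running the actual formula on these very similar toy vectors produces near-uniform weights, so the outputs below differ slightly from the hand-worked ones.)

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Step 2: similarity scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 3: softmax per row
    return weights @ V, weights                     # Step 4: weighted values

# Toy 3-dimensional Q, K, V for ["Hello", "World"]
Q = np.array([[0.20, -0.10, 0.30], [0.18, 0.15, 0.25]])
K = np.array([[0.10, 0.20, -0.10], [0.12, 0.19, -0.08]])
V = np.array([[0.15, 0.10, 0.20], [0.14, 0.12, 0.18]])

output, weights = attention(Q, K, V)
print(weights.round(3))  # 2x2 attention matrix; each row sums to 1
print(output.round(3))   # one new 3-d vector per word
```

Every row of the attention matrix is a probability distribution over all words, and every output row mixes information from both value vectors, which is exactly the "each word now contains information from both words" result above.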
2.5 Why Attention Matters
Benefit 1: Context Understanding
Without Attention:
"Hello" is processed in isolation
"World" is processed in isolation
Result: No understanding of relationship
With Attention:
"Hello" considers "World" (38% attention)
"World" considers "Hello" (40% attention)
Result: Understands they're related
Benefit 2: Long-Range Dependencies
Attention can connect distant words:
"The cat that I saw yesterday sat on the mat"
- "cat" can attend to "yesterday" (even though far apart)
- Model understands the cat from yesterday
Benefit 3: Selective Focus
Attention focuses on relevant information:
"He saw the cat with binoculars"
- "saw" attends strongly to "binoculars" (how he saw)
- "cat" attends strongly to "sat" (what it did)
- Each word focuses on what's relevant to it
2.6 Multi-Head Attention
What is Multi-Head Attention?
Multiple attention "heads" process the input in parallel, each with its own learned Q, K, V projections. Heads pick up their specialties implicitly during training, but intuitively the division of labour might look like:
Head 1: Focuses on syntax (grammar relationships)
Head 2: Focuses on semantics (meaning relationships)
Head 3: Focuses on position (word-order relationships)
...
Head 8: Focuses on yet another aspect
Visual Representation
Input: "Hello World"
Head 1 (Syntax):
"Hello" → attends to "World" (subject-object relationship)
Head 2 (Semantics):
"Hello" → attends to "World" (greeting relationship)
Head 3 (Position):
"Hello" → attends more to itself (being first)
... (other heads)
Final: Combine all heads → Richer representation
Why Multiple Heads?
Different heads capture different relationships:
- Head 1: Grammatical relationships
- Head 2: Semantic relationships
- Head 3: Positional relationships
- Head 4: Other patterns...
Together: Comprehensive understanding!
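The mechanics can be sketched directly: each head runs the same attention routine with its own smaller projections, and the results are concatenated and mixed by a final output projection. The weights here are random placeholders (a real model learns them), and the sizes are toy-scale.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Minimal multi-head self-attention sketch (random weights, no biases)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (smaller) Q/K/V projections.
        W_Q = rng.normal(size=(d_model, d_head))
        W_K = rng.normal(size=(d_model, d_head))
        W_V = rng.normal(size=(d_model, d_head))
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    # Concatenate all heads, then mix them with a final output projection.
    W_O = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))            # 2 words, d_model = 8
out = multi_head_attention(x, num_heads=4, rng=rng)
print(out.shape)  # (2, 8): same shape as the input, heads mixed together
```

A key design choice: each head works in a smaller subspace (d_head = d_model / num_heads), so the total computation stays roughly the same as single-head attention while each head is free to attend to different relationships.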
2.7 Visual Representation of Attention
Attention Heatmap
Attention Weights for "Hello World"
                Position 0    Position 1
                ("Hello")     ("World")
              ┌─────────┐   ┌─────────┐
Position 0    │  0.62   │   │  0.38   │
("Hello")     └─────────┘   └─────────┘
              ┌─────────┐   ┌─────────┐
Position 1    │  0.40   │   │  0.60   │
("World")     └─────────┘   └─────────┘
Reading:
- Row 0: "Hello" attends 62% to itself, 38% to "World"
- Row 1: "World" attends 40% to "Hello", 60% to itself
Attention Flow Diagram
            ──── 0.38 ────→
  "Hello"                     "World"
            ←─── 0.40 ────
   ↺ 0.62                      ↺ 0.60
(self-attention)           (self-attention)
Meaning: Information flows between words based on attention weights.
2.8 Key Takeaways: Attention
✅ Attention determines which words to focus on
✅ Calculates similarity between words
✅ Creates weighted combinations of information
✅ Enables understanding of relationships
✅ Multiple heads capture different aspects
This document provides a step-by-step explanation of attention mechanisms, the core component that enables transformers to understand relationships between words.