
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/

2025-11-06 22:07:41 -05:00


What are Embeddings? Step-by-Step Explanation

Complete step-by-step explanation of embeddings in transformer models: how words become numbers that capture meaning.

The Problem Embeddings Solve
What is an Embedding?
How Embeddings Work
Step-by-Step Example: Embedding "Hello"
Why Embeddings Matter
Complete Example: Embedding Multiple Words
Visual Representation
Key Takeaways

1.1 The Problem Embeddings Solve

The Challenge

Computers understand numbers, not words.

Your model receives:

Input: "Hello" (a word, not a number)

But neural networks need:

Input: Numbers (like [0.1, -0.2, 0.3, ...])

The Solution: Embeddings

Embeddings convert words (or tokens) into numbers (vectors) that capture meaning.

1.2 What is an Embedding?

Simple Definition

An embedding is a numerical representation of a word or token that captures its semantic meaning.

Think of it like this:

Each word gets a unique "address" in a high-dimensional space
Words with similar meanings end up close together
Words with unrelated meanings end up far apart

Visual Analogy

Imagine a map where:

Words are cities
Similar words are nearby cities
Different words are distant cities

          Semantic Space (2D visualization)

    "cat"    "dog"
      ●        ●
      
                "car"    "vehicle"
                  ●         ●
      
      "king"                  "queen"
        ●                         ●

In this model, embeddings use 512 dimensions (not 2); the number varies between models, but the concept is the same.

1.3 How Embeddings Work

Step 1: Vocabulary Mapping

Create a mapping from words to numbers:

Vocabulary:
"Hello" → Token ID: 72
"World" → Token ID: 87
"the"   → Token ID: 32
...

Result: Each word has a unique ID number
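The mapping above can be sketched as a plain dictionary. This is a minimal illustration using the example IDs from this section; the names `vocab` and `token_id` are hypothetical and not taken from the project code.

```python
# Toy vocabulary using the example IDs shown above.
# `vocab` and `token_id` are illustrative names, not project APIs.
vocab = {"Hello": 72, "World": 87, "the": 32}

def token_id(word: str) -> int:
    """Return the ID of a known word; raises KeyError for unknown words."""
    return vocab[word]

ids = [token_id(w) for w in ["Hello", "World"]]
print(ids)  # [72, 87]
```

A real tokenizer would also need a policy for words that are not in the vocabulary; this sketch simply raises an error.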

Step 2: Embedding Matrix

Create a matrix where each row represents a word:

Embedding Matrix E:

        Dimension 0  Dimension 1  Dimension 2  ...  Dimension 511
Token 0  [  0.05   ,   -0.10   ,    0.20   , ...,     0.15   ]
Token 1  [ -0.08   ,    0.12   ,   -0.05   , ...,     0.08   ]
Token 2  [  0.10   ,   -0.15   ,    0.25   , ...,     0.12   ]
...
Token 72 [  0.10   ,   -0.20   ,    0.30   , ...,     0.05   ]  ← "Hello"
...
Token 87 [  0.15   ,   -0.18   ,    0.28   , ...,     0.10   ]  ← "World"

Key Points:

Each row is a 512-dimensional vector
Each row represents one token/word
The values are learned during training
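Before training, the matrix has to start from some initial values. The sketch below fills it with small random numbers; the vocabulary size of 128 and the standard deviation of 0.02 are assumptions chosen for illustration, not values taken from the project.

```python
import random

VOCAB_SIZE = 128  # assumed toy size, large enough to hold IDs 72 and 87
EMBED_DIM = 512   # embedding dimension used throughout this document

def init_embedding_matrix(vocab_size, embed_dim, seed=0):
    """Build a vocab_size x embed_dim matrix of small random values."""
    rng = random.Random(seed)
    # The 0.02 standard deviation is an assumed initialization scale.
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

E = init_embedding_matrix(VOCAB_SIZE, EMBED_DIM)
print(len(E), len(E[0]))  # 128 512
```

Training would then adjust these values so that the rows become useful representations.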

Step 3: Lookup Operation

When you need an embedding, look it up:

Input: Token ID = 72 ("Hello")
    ↓
Lookup: E[72]
    ↓
Output: [0.10, -0.20, 0.30, ..., 0.05]  (512 numbers)
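The lookup is just row selection. It gives the same result as multiplying a one-hot vector by the matrix, which is why it can be treated as an ordinary layer. The sketch below uses a tiny 4 x 3 stand-in matrix (an assumption to keep the example readable) instead of 512 dimensions.

```python
# Tiny stand-in matrix: 4 tokens, 3 dimensions (the document uses 512).
E = [
    [0.05, -0.10, 0.20],
    [-0.08, 0.12, -0.05],
    [0.10, -0.15, 0.25],
    [0.10, -0.20, 0.30],
]

def lookup(matrix, token_id):
    """Embedding lookup: return the row stored for token_id."""
    return matrix[token_id]

def one_hot_matmul(matrix, token_id):
    """Same result, computed as a one-hot row vector times the matrix."""
    rows, dim = len(matrix), len(matrix[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(rows)]
    return [sum(one_hot[i] * matrix[i][d] for i in range(rows))
            for d in range(dim)]

print(lookup(E, 3))  # [0.1, -0.2, 0.3]
assert lookup(E, 3) == one_hot_matmul(E, 3)
```

Indexing is preferred in practice because it skips the multiplications by zero.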

1.4 Step-by-Step Example: Embedding "Hello"

Input

Word: "Hello"
Token ID: 72

Process

Step 1: Get Token ID

"Hello" → Lookup in vocabulary → 72

Step 2: Lookup Embedding

E[72] = [0.10, -0.20, 0.30, 0.15, -0.05, ..., 0.05]

Step 3: Result

Embedding vector: [0.10, -0.20, 0.30, ..., 0.05]
Dimension: 512 numbers
Meaning: Numerical representation of "Hello"

What These Numbers Mean

Individual numbers don't mean much by themselves, but together they represent:

Semantic meaning (what the word means)
Relationships to other words (which words it tends to appear alongside)
Syntactic information (grammatical role)

Key Insight: The model learns these values during training to capture meaning.

1.5 Why Embeddings Matter

Benefit 1: Continuous Space

Before Embeddings:

"Hello" = 72
"World" = 87
Distance: |72 - 87| = 15 (meaningless!)

After Embeddings:

"Hello" = [0.10, -0.20, 0.30, ...]
"World" = [0.15, -0.18, 0.28, ...]
Distance: Can be measured meaningfully (for example with cosine similarity or Euclidean distance)

Benefit 2: Semantic Relationships

Similar words have similar embeddings:

"cat"    ≈ [0.8, 0.2, 0.1, ...]
"dog"    ≈ [0.7, 0.3, 0.1, ...]  ← Similar to "cat"
"car"    ≈ [0.1, 0.9, 0.8, ...]  ← Different from "cat"

Smaller distance in embedding space ≈ greater semantic similarity
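One common way to compare two embeddings is cosine similarity. The sketch below applies it to the first three values of the illustrative "cat", "dog", and "car" vectors above; the function name `cosine_similarity` is our own, and the vectors are toy values rather than trained embeddings.

```python
import math

# First three values of the illustrative vectors shown above.
cat = [0.8, 0.2, 0.1]
dog = [0.7, 0.3, 0.1]
car = [0.1, 0.9, 0.8]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine_similarity(cat, dog), 2))  # 0.99
print(round(cosine_similarity(cat, car), 2))  # 0.34
```

With these toy values, "cat" scores much closer to "dog" than to "car", matching the intuition in the text.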

Benefit 3: Mathematical Operations

You can do math with embeddings:

"king" - "man" + "woman" ≈ "queen"

This works (approximately, in well-trained embeddings) because consistent relationships such as gender tend to appear as consistent directions in the space.
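The arithmetic can be illustrated with hand-picked 2-D vectors. Everything in this sketch is hypothetical: the vectors are invented so that one axis loosely tracks "royalty" and the other "maleness", and the helper `analogy` is our own name. Trained embeddings only show this behavior approximately.

```python
import math

# Hand-picked toy vectors (hypothetical): axis 0 ~ royalty, axis 1 ~ maleness.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],
}

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention, since the result vector usually stays close to one of them.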

1.6 Complete Example: Embedding Multiple Words

Input Sentence

"Hello World"

Step-by-Step Processing

Step 1: Tokenize

"Hello" → Token ID: 72
"World" → Token ID: 87

Step 2: Lookup Embeddings

E[72] = [0.10, -0.20, 0.30, ..., 0.05]  (512 numbers)
E[87] = [0.15, -0.18, 0.28, ..., 0.10]  (512 numbers)

Step 3: Stack Together

Stacked Embeddings (one row per token in the sentence):
[
  [0.10, -0.20, 0.30, ..., 0.05],  ← "Hello"
  [0.15, -0.18, 0.28, ..., 0.10]   ← "World"
]
Shape: [2, 512]  (sequence length, embedding dimension)

Result: Each word becomes a 512-dimensional vector
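The three steps can be combined into one small function. This is a sketch under stated assumptions: whitespace splitting stands in for the real tokenizer, the matrix holds random placeholder values instead of trained ones, and `embed_sentence` is a hypothetical name.

```python
import random

EMBED_DIM = 512
vocab = {"Hello": 72, "World": 87}  # example IDs from this document

# Placeholder matrix with 128 rows (assumed size); values are random, not trained.
rng = random.Random(0)
E = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in range(128)]

def embed_sentence(sentence):
    """Tokenize on whitespace, map words to IDs, and stack their rows."""
    token_ids = [vocab[word] for word in sentence.split()]
    return [E[t] for t in token_ids]

X = embed_sentence("Hello World")
print(len(X), len(X[0]))  # 2 512
```

The output has one row per token, so its shape is the sequence length by the embedding dimension.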

1.7 Visual Representation

Embedding Space Visualization

2D Projection of 512-Dimensional Embedding Space:

     0.3 │                    "World"
         │                      ●
     0.2 │        "Hello"
         │           ●
     0.1 │
         │
     0.0 ├───────────────────────────
         │
    -0.1 │
         │
    -0.2 │
         │
    -0.3 │

Reality: Embeddings exist in 512-dimensional space, but we can visualize them by projecting down to 2D or 3D (for example with PCA or t-SNE).

Similarity Visualization

Word Similarities (distance in embedding space):

"cat"  ──── 0.15 distance ──── "dog"   (similar)
"cat"  ──── 2.5 distance  ──── "car"   (different)
"king" ──── 0.8 distance  ──── "queen" (related)

1.8 Key Takeaways: Embeddings

✅ Embeddings convert words to numbers
✅ Each word becomes a vector (list of numbers)
✅ Similar words have similar vectors
✅ Embeddings enable mathematical operations on words
✅ Embedding values are learned during training to capture meaning

This document provides a step-by-step explanation of embeddings, the fundamental component that converts words into numerical representations in transformer models.
