# Tokenization Explained - Mathematical Formulation
Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.
## Table of Contents
1. [Introduction to Tokenization](#1-introduction-to-tokenization)
2. [UTF-8 Encoding](#2-utf-8-encoding)
3. [Byte Pair Encoding Algorithm](#3-byte-pair-encoding-algorithm)
4. [Vocabulary Construction](#4-vocabulary-construction)
5. [Encoding Process](#5-encoding-process)
6. [Decoding Process](#6-decoding-process)
7. [Regex Pattern Splitting](#7-regex-pattern-splitting)
8. [Special Tokens](#8-special-tokens)
9. [Complete Tokenization Pipeline](#9-complete-tokenization-pipeline)
10. [Tokenization Challenges and Solutions](#10-tokenization-challenges-and-solutions)
---
## 1. Introduction to Tokenization
### 1.1 What is Tokenization?
**Tokenization** is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.
**Mathematical Definition:**
Given a text string $s \in \Sigma^*$ where $\Sigma$ is the character alphabet, tokenization maps:
```math
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
```
```math
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
```
where:
- $s$ = input text string
- $c_i$ = individual characters
- $\mathbf{t}$ = sequence of token IDs
- $n$ = number of characters
- $m$ = number of tokens (typically $m \leq n$)
### 1.2 Why Tokenization?
**Problem:** Neural networks require numerical inputs, not raw text.
**Solution:** Convert text to token sequences:
```mermaid
graph LR
A["Raw Text<br/>'Hello world'"] --> B[Tokenizer]
B --> C["Token IDs<br/>[15496, 1917]"]
C --> D[Embedding Layer]
D --> E["Embeddings<br/>ℝ^(n×d)"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e1ffe1
style E fill:#ffe1f5
```
### 1.3 Tokenization Approaches
**Three Main Approaches:**
1. **Character-Level**: Each character becomes a token
- Vocabulary size: ~100-200
- Sequence length: Very long
2. **Word-Level**: Each word becomes a token
- Vocabulary size: ~50,000-100,000
- Sequence length: Moderate
3. **Subword-Level (BPE)**: Sequences of bytes/characters become tokens
- Vocabulary size: ~30,000-100,000
- Sequence length: Efficient
```mermaid
graph TB
subgraph "Character-Level"
C1["'Hello'"] --> C2["['H','e','l','l','o']"]
C2 --> C3["5 tokens"]
end
subgraph "Word-Level"
W1["'Hello world'"] --> W2["['Hello', 'world']"]
W2 --> W3["2 tokens"]
end
subgraph "Subword-Level (BPE)"
B1["'Hello world'"] --> B2["['Hello', ' world']"]
B2 --> B3["2 tokens"]
end
style C3 fill:#ffcccc
style W3 fill:#ccffcc
style B3 fill:#ccffcc
```
---
## 2. UTF-8 Encoding
### 2.1 Unicode Code Points
**Unicode** defines a mapping from characters to integers (code points):
```math
f: \mathcal{C} \rightarrow \{0, 1, \ldots, 0x10FFFF\}
```
where $\mathcal{C}$ is the set of all Unicode characters.
**Example:**
```math
f('H') = 72, \quad f('ε') = 949, \quad f('中') = 20013
```
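As a quick check (plain Python, nothing SheepOp-specific), `ord` computes the code point $f(c)$ and `chr` inverts it:

```python
# ord() returns the Unicode code point of a character; chr() is its inverse.
for ch in ["H", "ε", "中"]:
    print(ch, ord(ch))        # H 72, ε 949, 中 20013

assert chr(20013) == "中"
```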
### 2.2 UTF-8 Encoding Function
**UTF-8** encodes Unicode code points into variable-length byte sequences:
```math
\text{UTF-8}: \{0, \ldots, 0x10FFFF\} \rightarrow \{0, \ldots, 255\}^*
```
**Encoding Rules:**
For a code point $c$:
```math
\text{UTF-8}(c) = \begin{cases}
[b_0] & \text{if } c < 128 \\
[b_0, b_1] & \text{if } c < 2048 \\
[b_0, b_1, b_2] & \text{if } c < 65536 \\
[b_0, b_1, b_2, b_3] & \text{if } c \leq 0x10FFFF
\end{cases}
```
where $b_i \in \{0, \ldots, 255\}$ are bytes.
### 2.3 UTF-8 Encoding Process
```mermaid
graph TB
A["Unicode String<br/>'Hello 世界'"] --> B[Extract Code Points]
B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"]
C --> D[UTF-8 Encode]
D --> E["Bytes<br/>[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]
style A fill:#e1f5ff
style E fill:#e1ffe1
```
**Mathematical Formulation:**
For a string $s = c_1 c_2 \ldots c_n$:
```math
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
```
where $\bigoplus$ denotes byte concatenation.
**Example:**
```math
\text{bytes}("Hi") = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
```
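The same example can be reproduced with Python's built-in `str.encode` (illustrative only, independent of the SheepOp code base):

```python
text = "Hello 世界"

# Code points f(c_i) for each character
print([ord(c) for c in text])
# [72, 101, 108, 108, 111, 32, 19990, 30028]

# UTF-8 bytes: ASCII characters stay 1 byte, the CJK characters take 3 bytes each
raw = text.encode("utf-8")
print(list(raw))
# [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]

# UTF-8 decoding inverts the encoding exactly
assert raw.decode("utf-8") == text
```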
---
## 3. Byte Pair Encoding Algorithm
### 3.1 BPE Overview
**Byte Pair Encoding (BPE)** is a data compression algorithm that iteratively merges the most frequent consecutive pairs.
**Goal:** Create an efficient vocabulary by merging frequent byte/character pairs.
### 3.2 BPE Training Algorithm
**Initial State:**
```math
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
```
```math
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
```
**Iterative Merging:**
For iteration $k = 1, 2, \ldots, K$:
**Step 1: Calculate Pair Frequencies**
```math
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
```
```math
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
```
**Step 2: Find Most Frequent Pair**
```math
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
```
**Step 3: Create New Token**
```math
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
```
```math
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
```
**Step 4: Apply Merge**
```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
```
where:
```math
\text{Merge}(T, (a, b), \text{new\_id})_i = \begin{cases}
\text{new\_id} & \text{if } T_i = a \land T_{i+1} = b \\
T_i & \text{otherwise}
\end{cases}
```
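A minimal Python sketch of this training loop (function names such as `count_pairs`, `merge`, and `train_bpe` are illustrative, not SheepOp's actual API):

```python
from collections import Counter

def count_pairs(tokens):
    """stats(i, j): number of adjacent occurrences of the pair (i, j)."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn the merge table M: (i, j) -> new_id; return it with the final tokens."""
    tokens = list(text.encode("utf-8"))      # tokens^(0) = bytes(text)
    merges = {}
    for k in range(num_merges):
        stats = count_pairs(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)     # (i*, j*) = argmax stats
        new_id = 256 + k                     # first merged token gets id 256
        merges[pair] = new_id
        tokens = merge(tokens, pair, new_id)
    return merges, tokens
```

Note that `merge` scans left to right, so overlapping occurrences (e.g., in `"aaa"`) consume pairs greedily from the left.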
### 3.3 BPE Training Flowchart
```mermaid
graph TB
Start[Start Training] --> Init["Initialize<br/>V⁽⁰⁾ = {0...255}<br/>tokens⁽⁰⁾ = bytes(text)"]
Init --> Iter{More Merges?}
Iter -->|Yes| Stats["Calculate Pair Frequencies<br/>stats⁽ᵏ⁾(i,j)"]
Stats --> Find["Find Most Frequent Pair<br/>(i*, j*) = argmax stats"]
Find --> Create["Create New Token<br/>id = 256 + k - 1"]
Create --> Merge["Merge All Occurrences<br/>tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"]
Merge --> Update["Update Vocabulary<br/>V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
Update --> Iter
Iter -->|No| End[Training Complete]
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Stats fill:#fff4e1
style Merge fill:#ffe1f5
```
### 3.4 BPE Example
**Example:** Training on text "aaab"
**Iteration 0:**
```math
V^{(0)} = \{0, 1, \ldots, 255\}
```
```math
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
```
**Calculate frequencies:**
```math
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
```
**Iteration 1:**
```math
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
```
```math
\text{tokens}^{(1)} = [256, 97, 98]
```
```math
V^{(1)} = V^{(0)} \cup \{256\}
```
**Iteration 2:**
```math
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
```
**Tie-break:** both pairs occur once; choose $(256, 97)$ (the first maximal pair), so $\text{new\_id} = 257$.
```math
\text{tokens}^{(2)} = [257, 98]
```
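Running the `train_bpe` sketch from Section 3.2 on `"aaab"` reproduces this trace; with that sketch, ties are resolved in favor of the pair encountered first, which matches the choice above:

```python
merges, tokens = train_bpe("aaab", num_merges=2)
print(merges)   # {(97, 97): 256, (256, 97): 257}
print(tokens)   # [257, 98]
```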
---
## 4. Vocabulary Construction
### 4.1 Vocabulary Structure
**Final Vocabulary:**
```math
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
```
where:
- $V_{\text{bytes}} = \{0, 1, \ldots, 255\}$ (256 tokens)
- $V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\}$ (K merged tokens)
- $V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}$
**Vocabulary Size:**
```math
|V| = 256 + K + |V_{\text{special}}|
```
### 4.2 Merge Dictionary
**Merge Dictionary:**
```math
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
```
```math
M(i, j) = \begin{cases}
\text{merged\_id} & \text{if } (i, j) \text{ was merged} \\
\text{undefined} & \text{otherwise}
\end{cases}
```
**Vocabulary Mapping:**
```math
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
```
```math
\text{vocab}(\text{id}) = \begin{cases}
[\text{id}] & \text{if } \text{id} < 256 \\
\text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id}
\end{cases}
```
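The recursive definition of $\text{vocab}(\text{id})$ unrolls into a simple loop as long as merges are stored in creation order (which the training sketch in Section 3 guarantees). A minimal sketch, again with illustrative names:

```python
def build_vocab(merges):
    """vocab: id -> byte sequence, expanding each merged token from its parts."""
    vocab = {i: bytes([i]) for i in range(256)}       # V_bytes: raw byte tokens
    for (i, j), new_id in merges.items():             # iterate in merge-creation order
        vocab[new_id] = vocab[i] + vocab[j]           # vocab(id) = vocab(i) ⊕ vocab(j)
    return vocab
```

For the `"aaab"` example above, `build_vocab(merges)[257]` is `b"aaa"`.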
### 4.3 Vocabulary Construction Diagram
```mermaid
graph TB
subgraph "Initial Vocabulary"
A1["256 Byte Tokens<br/>{0, 1, ..., 255}"]
end
subgraph "BPE Merges"
B1["Merge 1: (101, 32) → 256"]
B2["Merge 2: (256, 108) → 257"]
B3["Merge K: ... → 256+K-1"]
end
subgraph "Special Tokens"
C1["<pad> → 50256"]
C2["<unk> → 50257"]
C3["<bos> → 50258"]
C4["<eos> → 50259"]
end
A1 --> D[Final Vocabulary]
B1 --> D
B2 --> D
B3 --> D
C1 --> D
C2 --> D
C3 --> D
C4 --> D
D --> E["|V| = 256 + K + 4"]
style A1 fill:#e1f5ff
style D fill:#e1ffe1
style E fill:#fff4e1
```
---
## 5. Encoding Process
### 5.1 Encoding Algorithm
**Input:** Text string $s$
**Step 1: Regex Splitting**
```math
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
```
**Step 2: Convert to Bytes**
For each chunk $c \in \text{chunks}$:
```math
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
```
**Step 3: Apply BPE Merges**
Initialize: $\text{tokens} = \text{bytes}(c)$
While merges possible:
```math
\text{Find earliest merge: } (i^*, j^*) = \arg\min_{\substack{(i, j) \text{ adjacent in tokens} \\ (i, j) \in M}} M(i, j)
```
```math
\text{Apply merge: } \text{tokens} = \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
```
**Step 4: Combine Results**
```math
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
```
### 5.2 Encoding Function
**Mathematical Definition:**
```math
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
```
where $\bigoplus$ denotes token sequence concatenation.
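A sketch of this encoding loop in Python. Here `merges` is a pair-to-id table like the one learned in Section 3, and `split_pattern` is a compiled regex such as the one shown in Section 7; both names are illustrative rather than SheepOp's actual interface:

```python
def bpe_encode(chunk_bytes, merges):
    """Repeatedly apply the earliest-learned merge present in the chunk."""
    tokens = list(chunk_bytes)
    while len(tokens) >= 2:
        pairs = set(zip(tokens, tokens[1:]))
        # Earliest merge = the applicable pair with the smallest merged id
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                                   # nothing left to merge
        new_id, out, i = merges[pair], [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def encode(text, merges, split_pattern):
    """encode(s) = concatenation over regex chunks of BPE(UTF-8(chunk))."""
    ids = []
    for chunk in split_pattern.findall(text):
        ids.extend(bpe_encode(chunk.encode("utf-8"), merges))
    return ids
```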
### 5.3 Encoding Flowchart
```mermaid
graph TB
A["Input Text<br/>'Hello world'"] --> B[Regex Split]
B --> C["Chunks<br/>['Hello', ' world']"]
C --> D[For Each Chunk]
D --> E["UTF-8 Encode<br/>bytes = [72, 101, 108, 108, 111]"]
E --> F["Initialize Tokens<br/>tokens = bytes"]
F --> G{Merge Possible?}
G -->|Yes| H["Find Earliest Merge<br/>(i*, j*)"]
H --> I["Apply Merge<br/>tokens = Merge(tokens, (i*,j*), id)"]
I --> G
G -->|No| J[Add to Result]
J --> K{More Chunks?}
K -->|Yes| D
K -->|No| L["Final Token IDs<br/>[15496, 1917]"]
style A fill:#e1f5ff
style L fill:#e1ffe1
style G fill:#ffe1f5
```
### 5.4 Encoding Example
**Input:** "Hello world"
**Step 1: Regex Split**
```math
\text{chunks} = ["Hello", " world"]
```
**Step 2: UTF-8 Encoding**
```math
\text{UTF-8}("Hello") = [72, 101, 108, 108, 111]
```
```math
\text{UTF-8}(" world") = [32, 119, 111, 114, 108, 100]
```
**Step 3: BPE Merging**
Assume merges: $M(72, 101) = 256, M(256, 108) = 257, M(257, 108) = 258, M(258, 111) = 259$
```math
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
```
```math
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
```
```math
\xrightarrow{M(258,111)} [259]
```
**Final:** $[259, 1917]$ (assuming the chunk `" world"` likewise merges down to a single token, shown here with the illustrative id 1917)
---
## 6. Decoding Process
### 6.1 Decoding Algorithm
**Input:** Token IDs $\mathbf{t} = [t_1, t_2, \ldots, t_n]$
**Step 1: Handle Special Tokens**
```math
\text{Stop at EOS: } \mathbf{t}' = \mathbf{t}[:i] \text{ where } t_i = \text{<eos>}
```
**Step 2: Lookup Bytes**
For each token $t_i$:
```math
\text{bytes}_i = \text{vocab}(t_i)
```
**Step 3: Concatenate Bytes**
```math
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
```
**Step 4: UTF-8 Decode**
```math
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
```
### 6.2 Decoding Function
**Mathematical Definition:**
```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```
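A matching decode sketch, assuming a `vocab` dict that maps each token id to its byte sequence (e.g., as built in Section 4):

```python
def decode(ids, vocab, eos_id=None):
    """decode(t) = UTF-8 decode of the concatenated per-token byte sequences."""
    if eos_id is not None and eos_id in ids:
        ids = ids[: ids.index(eos_id)]                   # stop at the first <eos>
    all_bytes = b"".join(vocab[t] for t in ids)          # Steps 2-3: lookup and concatenate
    return all_bytes.decode("utf-8", errors="replace")   # Step 4: UTF-8 decode
```

`errors="replace"` guards against id sequences whose concatenated bytes are not valid UTF-8, which can happen when a multi-byte character is split across generated tokens.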
### 6.3 Decoding Flowchart
```mermaid
graph TB
A["Token IDs<br/>[15496, 1917]"] --> B{Special Tokens?}
B -->|EOS Found| C[Stop at EOS]
B -->|No EOS| D[Process All Tokens]
C --> D
D --> E[For Each Token ID]
E --> F["Lookup Bytes<br/>vocab[t_i]"]
F --> G["Concatenate<br/>all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."]
G --> H["UTF-8 Decode<br/>text = decode(all_bytes)"]
H --> I["Output Text<br/>'Hello world'"]
style A fill:#e1f5ff
style I fill:#e1ffe1
style F fill:#fff4e1
```
### 6.4 Decoding Example
**Input:** $[259, 1917]$
**Step 1: Lookup**
```math
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
```
```math
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
```
```math
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
```
```math
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
```
```math
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
```
```math
= [72, 101, 108, 108, 111]
```
```math
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
```
**Step 2: Concatenate**
```math
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
```
```math
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```
**Step 3: Decode**
```math
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = "Hello world"
```
---
## 7. Regex Pattern Splitting
### 7.1 GPT-4 Style Pattern
**Pattern Components:**
```math
P = P_{\text{contractions}} \cup P_{\text{letters}} \cup P_{\text{numbers}} \cup P_{\text{punctuation}} \cup P_{\text{whitespace}}
```
**Pattern Definition (simplified GPT-4 style, written as a single regex):**

`'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++|\r?\n|\s+`

**Components:**
1. **Contractions:** $P_{\text{contractions}}$ = `'(?i:[sdmt]|ll|ve|re)`
   - Matches: `'s`, `'d`, `'m`, `'t`, `'ll`, `'ve`, `'re` (case-insensitive)
2. **Letters:** $P_{\text{letters}}$ = `[^\r\n\p{L}\p{N}]?+\p{L}+`
   - An optional leading non-letter, non-digit character (typically a space) followed by one or more letters
3. **Numbers:** $P_{\text{numbers}}$ = `\p{N}{1,3}`
   - **Limit:** 1-3 digits only (prevents long number tokens)
4. **Punctuation:** $P_{\text{punctuation}}$ = ` ?[^\s\p{L}\p{N}]++`
   - An optional leading space followed by one or more punctuation/symbol characters
5. **Whitespace:** $P_{\text{whitespace}}$ = `\r?\n` and `\s+`
   - Newlines and runs of whitespace
### 7.2 Regex Splitting Function
```math
\text{RegexSplit}(s, P) = \{m_1, m_2, \ldots, m_k : m_i \in \text{Match}(s, P)\}
```
where matches are found left-to-right, non-overlapping.
### 7.3 Regex Splitting Diagram
```mermaid
graph TB
A["Input Text<br/>'Hello world 123'"] --> B[Apply Regex Pattern]
B --> C1["Chunk 1: 'Hello'<br/>P_letters"]
B --> C2["Chunk 2: ' world'<br/>P_letters"]
B --> C3["Chunk 3: ' '<br/>P_whitespace"]
B --> C4["Chunk 4: '123'<br/>P_numbers"]
C1 --> D["Chunks List<br/>['Hello', ' world', ' ', '123']"]
C2 --> D
C3 --> D
C4 --> D
style A fill:#e1f5ff
style D fill:#e1ffe1
style B fill:#fff4e1
```
**Example:**
```math
\text{RegexSplit}("Hello world 123", P) = ["Hello", " world", " ", "123"]
```
Each chunk is produced by exactly one pattern component, so the space before the digits forms its own chunk: $P_{\text{numbers}}$ does not allow a leading space.
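A quick way to verify this split, using the third-party `regex` module (which, unlike the standard `re`, supports `\p{...}` classes and possessive quantifiers). The pattern is the simplified one from Section 7.1:

```python
import regex  # pip install regex

SPLIT_PATTERN = regex.compile(
    r"'(?i:[sdmt]|ll|ve|re)"         # contractions
    r"|[^\r\n\p{L}\p{N}]?+\p{L}+"    # optional non-letter prefix + letters
    r"|\p{N}{1,3}"                   # 1-3 digits
    r"| ?[^\s\p{L}\p{N}]++"          # optional space + punctuation
    r"|\r?\n"                        # newlines
    r"|\s+"                          # remaining whitespace
)

print(SPLIT_PATTERN.findall("Hello world 123"))
# ['Hello', ' world', ' ', '123']
```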
---
## 8. Special Tokens
### 8.1 Special Token Set
**Special Tokens:**
```math
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
```
**Token IDs:**
```math
\text{id}(\text{<pad>}) = 0, \quad \text{id}(\text{<unk>}) = 1, \quad \text{id}(\text{<bos>}) = 2, \quad \text{id}(\text{<eos>}) = 3
```
### 8.2 Special Token Functions
**Padding:**
```math
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{<pad>}]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
```
**Unknown Token:**
```math
\text{encode}(c) = \begin{cases}
[\text{id}(c)] & \text{if } c \in V \\
[\text{id}(\text{<unk>})] & \text{if } c \notin V
\end{cases}
```
**EOS Handling:**
```math
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \quad \text{where } i = \min\{j : t_j = \text{<eos>}\}
```
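Hedged sketches of the padding and EOS helpers, using the ids listed in Section 8.1 (helper names are illustrative):

```python
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3   # ids as assigned in Section 8.1

def pad(ids, max_length, pad_id=PAD_ID):
    """Right-pad with <pad> up to max_length; longer sequences are left unchanged."""
    return ids + [pad_id] * max(max_length - len(ids), 0)

def truncate_at_eos(ids, eos_id=EOS_ID):
    """Keep only the tokens before the first <eos>."""
    return ids[: ids.index(eos_id)] if eos_id in ids else ids

print(pad([1, 2, 3], 5))              # [1, 2, 3, 0, 0]
print(truncate_at_eos([1, 2, 3, 7]))  # [1, 2]  (3 is <eos>)
```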
### 8.3 Special Token Flowchart
```mermaid
graph TB
subgraph "Special Tokens"
A1["<pad> → 0<br/>Padding"]
A2["<unk> → 1<br/>Unknown"]
A3["<bos> → 2<br/>Beginning"]
A4["<eos> → 3<br/>End"]
end
subgraph "Usage"
B1["Padding:<br/>[1,2,3] → [1,2,3,0,0]"]
B2["Unknown:<br/>unknown_char → 1"]
B3["EOS Stop:<br/>[1,2,3] → stop at 3"]
end
A1 --> B1
A2 --> B2
A4 --> B3
style A1 fill:#e1f5ff
style A2 fill:#ffe1f5
style A3 fill:#e1ffe1
style A4 fill:#fff4e1
```
---
## 9. Complete Tokenization Pipeline
### 9.1 Full Pipeline
**Complete Tokenization Process:**
```mermaid
graph TB
Start[Input Text] --> Split[Regex Split]
Split --> UTF8[UTF-8 Encode]
UTF8 --> BPE[BPE Merge]
BPE --> Special{Special Tokens?}
Special -->|Yes| Handle[Handle Special]
Special -->|No| Combine[Combine Tokens]
Handle --> Combine
Combine --> Output[Token IDs]
Output --> Embed[Embedding Layer]
style Start fill:#e1f5ff
style Output fill:#e1ffe1
style Embed fill:#ffe1f5
```
### 9.2 Mathematical Formulation
**Complete Encoding:**
```math
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```
**Complete Decoding:**
```math
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{t \in \text{RemoveSpecial}(\mathbf{t})} \text{vocab}(t)\right)
```
### 9.3 Pipeline Example
**Input:** "Hello world!"
**Step 1: Regex Split**
```math
\text{chunks} = ["Hello", " world", "!"]
```
**Step 2: UTF-8 Encode**
```math
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
```
**Step 3: BPE Merge**
```math
\text{tokens} = [[15496], [1917], [0]]
```
**Step 4: Combine**
```math
\text{final} = [15496, 1917, 0]
```
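Chaining the sketches from Sections 3-7 gives an end-to-end round trip. All names here (`train_bpe`, `build_vocab`, `encode`, `decode`, `SPLIT_PATTERN`) are the illustrative helpers defined earlier, and `corpus.txt` is a placeholder for whatever training text is used:

```python
# Train a small merge table, rebuild the vocabulary, then round-trip a string.
with open("corpus.txt", encoding="utf-8") as f:
    merges, _ = train_bpe(f.read(), num_merges=1000)
vocab = build_vocab(merges)

ids = encode("Hello world!", merges, SPLIT_PATTERN)
assert decode(ids, vocab) == "Hello world!"   # byte-level BPE round-trips losslessly
```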
---
## 10. Tokenization Challenges and Solutions
### 10.1 Challenge: Long Tokens
**Problem:** Some tokens become very long (e.g., "defaultstyle" as a single token).
**Mathematical Impact:**
Without constraints on merging, a frequent long string collapses into a single token regardless of its length:
```math
|\text{Tokenize}(s)| = 1 \quad \text{even when } |s| > 10 \text{ characters}
```
**Solution:** Better regex splitting prevents over-merging.
### 10.2 Challenge: Number Tokenization
**Problem:** Numbers are tokenized inconsistently (sometimes as 1 token, sometimes as 2-3).
**Solution:** Limit number matches to 1-3 digits: $P_{\text{numbers}}$ = `\p{N}{1,3}`
**Impact:** a number with $d$ digits is split into $\lceil d/3 \rceil$ chunks before BPE:
```math
\text{TokenCount}(n) = \begin{cases}
1 & \text{if } n < 1000 \\
\lceil \text{digits}(n) / 3 \rceil & \text{if } n \geq 1000
\end{cases}
```
### 10.3 Challenge: Python Code Efficiency
**Problem:** Each space is separate token (GPT-2 issue).
**Solution:** Match runs of whitespace as a single chunk: $P_{\text{whitespace}}$ = `\s+`
**Efficiency Gain:**
```math
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
```
```math
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(grouped spaces)}
```
### 10.4 Challenge: Trailing Whitespace
**Problem:** Trailing spaces cause poor tokenization.
**Detection:**
```math
\text{HasTrailingSpace}(s) = \begin{cases}
\text{True} & \text{if } s[-1] = ' ' \\
\text{False} & \text{otherwise}
\end{cases}
```
**Warning:**
```math
\text{Tokenize}(s) = \begin{cases}
\text{encode}(s) + \text{warning} & \text{if } \text{HasTrailingSpace}(s) \\
\text{encode}(s) & \text{otherwise}
\end{cases}
```
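A simple way to surface this warning, layered on the illustrative `encode` from Section 5 (a sketch, not SheepOp's actual behavior):

```python
import warnings

def encode_checked(text, merges, split_pattern):
    """Encode as usual, but warn when the input ends in whitespace."""
    if text and text[-1].isspace():
        warnings.warn("Input ends with whitespace; this often tokenizes poorly.")
    return encode(text, merges, split_pattern)
```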
### 10.5 Challenge: Multilingual Support
**Problem:** Non-English languages tokenize inefficiently.
**Solution:** UTF-8 byte-level encoding handles all languages:
```math
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
```
**Efficiency:**
```math
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-English}})|}{|\text{Tokenize}(s_{\text{English}})|} \approx 1.5\text{–}2.0
```
### 10.6 Challenge: Untrained Tokens
**Problem:** Some tokens never appear in training data.
**Solution:** Fallback handling:
```math
\text{Decode}(t) = \begin{cases}
\text{vocab}(t) & \text{if } t \in V \\
\text{vocab}(\text{<unk>}) & \text{if } t \notin V
\end{cases}
```
Because every byte token $t < 256$ is in $V$, such tokens always decode to their raw byte even if they never occurred in the training data.
---
## Summary
### Key Formulas
**Encoding:**
```math
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```
**Decoding:**
```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```
**BPE Merge:**
```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
```
**Vocabulary Size:**
```math
|V| = 256 + K + |V_{\text{special}}|
```
### Key Takeaways
1. **UTF-8 Encoding**: Handles all Unicode characters consistently
2. **BPE Algorithm**: Creates efficient vocabulary through iterative merging
3. **Regex Splitting**: Prevents over-merging with pattern-based chunking
4. **Special Tokens**: Control flow and handle edge cases
5. **Byte-Level**: Works at byte level for universal language support