# Tokenization Explained - Mathematical Formulation
Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.
## Table of Contents
1. [Introduction to Tokenization](#1-introduction-to-tokenization)
2. [UTF-8 Encoding](#2-utf-8-encoding)
3. [Byte Pair Encoding Algorithm](#3-byte-pair-encoding-algorithm)
4. [Vocabulary Construction](#4-vocabulary-construction)
5. [Encoding Process](#5-encoding-process)
6. [Decoding Process](#6-decoding-process)
7. [Regex Pattern Splitting](#7-regex-pattern-splitting)
8. [Special Tokens](#8-special-tokens)
9. [Complete Tokenization Pipeline](#9-complete-tokenization-pipeline)
10. [Tokenization Challenges and Solutions](#10-tokenization-challenges-and-solutions)
---
## 1. Introduction to Tokenization
### 1.1 What is Tokenization?
**Tokenization** is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.
**Mathematical Definition:**
Given a text string $s \in \Sigma^*$ where $\Sigma$ is the character alphabet, tokenization maps:
```math
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
```
```math
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
```
where:
- $s$ = input text string
- $c_i$ = individual characters
- $\mathbf{t}$ = sequence of token IDs
- $n$ = number of characters
- $m$ = number of tokens (typically $m \leq n$)
### 1.2 Why Tokenization?
**Problem:** Neural networks require numerical inputs, not raw text.
**Solution:** Convert text to token sequences:
```mermaid
graph LR
A["Raw Text
'Hello world'"] --> B[Tokenizer]
B --> C["Token IDs
[15496, 1917]"]
C --> D[Embedding Layer]
D --> E["Embeddings
ℝ^n×d"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e1ffe1
style E fill:#ffe1f5
```
### 1.3 Tokenization Approaches
**Three Main Approaches:**
1. **Character-Level**: Each character becomes a token
- Vocabulary size: ~100-200
- Sequence length: Very long
2. **Word-Level**: Each word becomes a token
- Vocabulary size: ~50,000-100,000
- Sequence length: Moderate
3. **Subword-Level (BPE)**: Sequences of bytes/characters become tokens
- Vocabulary size: ~30,000-100,000
- Sequence length: Efficient
```mermaid
graph TB
subgraph "Character-Level"
C1["'Hello'"] --> C2["['H','e','l','l','o']"]
C2 --> C3["5 tokens"]
end
subgraph "Word-Level"
W1["'Hello world'"] --> W2["['Hello', 'world']"]
W2 --> W3["2 tokens"]
end
subgraph "Subword-Level (BPE)"
B1["'Hello world'"] --> B2["['Hello', ' world']"]
B2 --> B3["2 tokens"]
end
style C3 fill:#ffcccc
style W3 fill:#ccffcc
style B3 fill:#ccffcc
```
---
## 2. UTF-8 Encoding
### 2.1 Unicode Code Points
**Unicode** defines a mapping from characters to integers (code points):
```math
f: \mathcal{C} \rightarrow \{0, 1, \ldots, \text{0x10FFFF}\}
```
where $\mathcal{C}$ is the set of all Unicode characters.
**Example:**
```math
f('H') = 72, \quad f('ε') = 949, \quad f('中') = 20013
```
### 2.2 UTF-8 Encoding Function
**UTF-8** encodes Unicode code points into variable-length byte sequences:
```math
\text{UTF-8}: \{0, \ldots, \text{0x10FFFF}\} \rightarrow \{0, \ldots, 255\}^*
```
**Encoding Rules:**
For a code point $c$:
```math
\text{UTF-8}(c) = \begin{cases}
[b_0] & \text{if } c < 128 \\
[b_0, b_1] & \text{if } c < 2048 \\
[b_0, b_1, b_2] & \text{if } c < 65536 \\
[b_0, b_1, b_2, b_3] & \text{if } c \leq \text{0x10FFFF}
\end{cases}
```
where $b_i \in \{0, \ldots, 255\}$ are bytes.
### 2.3 UTF-8 Encoding Process
```mermaid
graph TB
A["Unicode String
'Hello 世界'"] --> B[Extract Code Points]
B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"]
C --> D[UTF-8 Encode]
D --> E["Bytes
[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]
style A fill:#e1f5ff
style E fill:#e1ffe1
```
**Mathematical Formulation:**
For a string $s = c_1 c_2 \ldots c_n$:
```math
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
```
where $\bigoplus$ denotes byte concatenation.
**Example:**
```math
\text{bytes}("Hi") = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
```
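Python's built-in `ord` and `str.encode` implement exactly the maps $f$ and $\text{UTF-8}$ above; a quick sanity check of the examples in this section:
```python
# Code points via ord() — the map f from Section 2.1.
assert ord('H') == 72 and ord('ε') == 949 and ord('中') == 20013

# UTF-8 encoding — variable-length byte sequences per code point.
assert list("Hi".encode("utf-8")) == [72, 105]
assert list("Hello 世界".encode("utf-8")) == [
    72, 101, 108, 108, 111, 32,  # 'Hello ' — one byte per ASCII character
    228, 184, 150,               # '世' — three bytes
    231, 149, 140,               # '界' — three bytes
]
```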
---
## 3. Byte Pair Encoding Algorithm
### 3.1 BPE Overview
**Byte Pair Encoding (BPE)** is a data compression algorithm adapted for tokenization: it iteratively merges the most frequent pair of consecutive tokens into a single new token.
**Goal:** Create an efficient vocabulary by merging frequent byte/character pairs.
### 3.2 BPE Training Algorithm
**Initial State:**
```math
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
```
```math
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
```
**Iterative Merging:**
For iteration $k = 1, 2, \ldots, K$:
**Step 1: Calculate Pair Frequencies**
```math
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
```
```math
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
```
**Step 2: Find Most Frequent Pair**
```math
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
```
**Step 3: Create New Token**
```math
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
```
```math
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
```
**Step 4: Apply Merge**
```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
```
where $\text{Merge}(T, (a, b), \text{new\_id})$ scans $T$ left to right and replaces every non-overlapping occurrence of the adjacent pair $(a, b)$ with the single token $\text{new\_id}$, shortening the sequence by one element per replacement.
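The training loop above fits in a few lines of Python. This is a minimal illustrative sketch (names such as `train_bpe` are ours, not the SheepOp implementation):
```python
from collections import Counter

def count_pairs(tokens: list[int]) -> Counter:
    """stats(i, j): frequency of each consecutive token pair."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    """Learn up to `num_merges` merge rules; IDs 0-255 are raw bytes."""
    tokens = list(text.encode("utf-8"))      # tokens(0) = bytes(s)
    merges: dict[tuple[int, int], int] = {}
    for k in range(num_merges):
        stats = count_pairs(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)     # (i*, j*) = argmax stats
        new_id = 256 + k                     # k is 0-indexed here
        merges[pair] = new_id
        tokens = merge(tokens, pair, new_id)
    return merges
```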
### 3.3 BPE Training Flowchart
```mermaid
graph TB
Start[Start Training] --> Init["Initialize
V⁽⁰⁾ = {0...255}
tokens⁽⁰⁾ = bytes(text)"]
Init --> Iter{More Merges?}
Iter -->|Yes| Stats["Calculate Pair Frequencies
stats⁽ᵏ⁾(i,j)"]
Stats --> Find["Find Most Frequent Pair
(i*, j*) = argmax stats"]
Find --> Create["Create New Token
id = 256 + k"]
Create --> Merge["Merge All Occurrences
tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"]
Merge --> Update["Update Vocabulary
V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
Update --> Iter
Iter -->|No| End[Training Complete]
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Stats fill:#fff4e1
style Merge fill:#ffe1f5
```
### 3.4 BPE Example
**Example:** Training on text "aaab"
**Iteration 0:**
```math
V^{(0)} = \{0, 1, \ldots, 255\}
```
```math
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
```
**Calculate frequencies:**
```math
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
```
**Iteration 1:**
```math
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
```
```math
\text{tokens}^{(1)} = [256, 97, 98]
```
```math
V^{(1)} = V^{(0)} \cup \{256\}
```
**Iteration 2:**
```math
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
```
**Tie-break (both pairs occur once; choose the first):** $(256, 97)$, $\text{new\_id} = 257$
```math
\text{tokens}^{(2)} = [257, 98]
```
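Running the `train_bpe` sketch from Section 3.2 reproduces this trace; Python's `max` resolves the tie in iteration 2 by taking the first pair encountered, $(256, 97)$:
```python
merges = train_bpe("aaab", num_merges=2)
print(merges)  # {(97, 97): 256, (256, 97): 257}
```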
---
## 4. Vocabulary Construction
### 4.1 Vocabulary Structure
**Final Vocabulary:**
```math
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
```
where:
- $V_{\text{bytes}} = \{0, 1, \ldots, 255\}$ (256 tokens)
- $V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\}$ (K merged tokens)
- $V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}$
**Vocabulary Size:**
```math
|V| = 256 + K + |V_{\text{special}}|
```
### 4.2 Merge Dictionary
**Merge Dictionary:**
```math
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
```
```math
M(i, j) = \begin{cases}
\text{merged\_id} & \text{if } (i, j) \text{ was merged} \\
\text{undefined} & \text{otherwise}
\end{cases}
```
**Vocabulary Mapping:**
```math
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
```
```math
\text{vocab}(\text{id}) = \begin{cases}
[\text{id}] & \text{if } \text{id} < 256 \\
\text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id}
\end{cases}
```
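Because each merged ID only references IDs created before it, the recursion for $\text{vocab}$ unrolls into a single forward pass over the merge rules. A sketch, reusing the `merges` dict from the Section 3.2 code:
```python
def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
    """vocab(id): the byte sequence each token ID expands to."""
    vocab = {i: bytes([i]) for i in range(256)}  # base byte tokens
    # merges preserves insertion (training) order, so vocab[i] and
    # vocab[j] already exist when (i, j) -> merged_id is processed.
    for (i, j), merged_id in merges.items():
        vocab[merged_id] = vocab[i] + vocab[j]   # vocab(i) ⊕ vocab(j)
    return vocab
```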
### 4.3 Vocabulary Construction Diagram
```mermaid
graph TB
subgraph "Initial Vocabulary"
A1["256 Byte Tokens
{0, 1, ..., 255}"]
end
subgraph "BPE Merges"
B1["Merge 1: (101, 32) → 256"]
B2["Merge 2: (256, 108) → 257"]
B3["Merge K: ... → 256+K-1"]
end
subgraph "Special Tokens"
C1[" → 50256"]
C2[" → 50257"]
C3[" → 50258"]
C4[" → 50259"]
end
A1 --> D[Final Vocabulary]
B1 --> D
B2 --> D
B3 --> D
C1 --> D
C2 --> D
C3 --> D
C4 --> D
D --> E["|V| = 256 + K + 4"]
style A1 fill:#e1f5ff
style D fill:#e1ffe1
style E fill:#fff4e1
```
---
## 5. Encoding Process
### 5.1 Encoding Algorithm
**Input:** Text string $s$
**Step 1: Regex Splitting**
```math
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
```
**Step 2: Convert to Bytes**
For each chunk $c \in \text{chunks}$:
```math
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
```
**Step 3: Apply BPE Merges**
Initialize: $\text{tokens} = \text{bytes}(c)$
While merges possible:
```math
\text{Find earliest-learned merge among adjacent pairs: } (i^*, j^*) = \arg\min_{\substack{(i,j) \in \text{pairs}(\text{tokens}) \\ (i,j) \in \text{dom}(M)}} \text{merge\_index}(M(i, j))
```
```math
\text{Apply merge: } \text{tokens} = \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
```
**Step 4: Combine Results**
```math
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
```
### 5.2 Encoding Function
**Mathematical Definition:**
```math
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
```
where $\bigoplus$ denotes token sequence concatenation.
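A sketch of the per-chunk BPE step. Since merge IDs were assigned in training order, the ID itself serves as the merge index, so the "earliest merge" is simply the candidate pair with the smallest ID (`merge` is the helper from the Section 3.2 sketch):
```python
def bpe_encode(chunk_bytes: list[int],
               merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges to one chunk, earliest-learned merge first."""
    tokens = list(chunk_bytes)
    while len(tokens) >= 2:
        candidates = [p for p in set(zip(tokens, tokens[1:])) if p in merges]
        if not candidates:
            break  # no learned merge applies to any adjacent pair
        pair = min(candidates, key=lambda p: merges[p])  # argmin merge index
        tokens = merge(tokens, pair, merges[pair])
    return tokens
```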
### 5.3 Encoding Flowchart
```mermaid
graph TB
A["Input Text
'Hello world'"] --> B[Regex Split]
B --> C["Chunks
['Hello', ' world']"]
C --> D[For Each Chunk]
D --> E["UTF-8 Encode
bytes = [72, 101, 108, 108, 111]"]
E --> F["Initialize Tokens
tokens = bytes"]
F --> G{Merge Possible?}
G -->|Yes| H["Find Earliest Merge
(i*, j*)"]
H --> I["Apply Merge
tokens = Merge(tokens, (i*,j*), id)"]
I --> G
G -->|No| J[Add to Result]
J --> K{More Chunks?}
K -->|Yes| D
K -->|No| L["Final Token IDs
[15496, 1917]"]
style A fill:#e1f5ff
style L fill:#e1ffe1
style G fill:#ffe1f5
```
### 5.4 Encoding Example
**Input:** "Hello world"
**Step 1: Regex Split**
```math
\text{chunks} = ["Hello", " world"]
```
**Step 2: UTF-8 Encoding**
```math
\text{UTF-8}("Hello") = [72, 101, 108, 108, 111]
```
```math
\text{UTF-8}(" world") = [32, 119, 111, 114, 108, 100]
```
**Step 3: BPE Merging**
Assume merges: $M(72, 101) = 256, M(256, 108) = 257, M(257, 108) = 258, M(258, 111) = 259$
```math
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
```
```math
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
```
```math
\xrightarrow{M(258,111)} [259]
```
**Final:** $[259, 1917]$, assuming " world" likewise merges to the single token 1917.
---
## 6. Decoding Process
### 6.1 Decoding Algorithm
**Input:** Token IDs $\mathbf{t} = [t_1, t_2, \ldots, t_n]$
**Step 1: Handle Special Tokens**
```math
\text{Stop at EOS: } \mathbf{t}' = \mathbf{t}[:i] \text{ where } i = \min\{j : t_j = \text{id}(\text{<eos>})\}
```
**Step 2: Lookup Bytes**
For each token $t_i$:
```math
\text{bytes}_i = \text{vocab}(t_i)
```
**Step 3: Concatenate Bytes**
```math
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
```
**Step 4: UTF-8 Decode**
```math
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
```
### 6.2 Decoding Function
**Mathematical Definition:**
```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```
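The decode side, given the `vocab` table from Section 4.2; the `eos_id` parameter and the `errors="replace"` policy are illustrative choices, not fixed by the math above:
```python
def decode(token_ids: list[int],
           vocab: dict[int, bytes],
           eos_id: int | None = None) -> str:
    """Token IDs -> bytes -> text; stops at the first EOS token if given."""
    if eos_id is not None and eos_id in token_ids:
        token_ids = token_ids[:token_ids.index(eos_id)]  # t[:i]
    all_bytes = b"".join(vocab[t] for t in token_ids)    # ⊕ vocab(t_i)
    return all_bytes.decode("utf-8", errors="replace")   # UTF-8⁻¹
```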
### 6.3 Decoding Flowchart
```mermaid
graph TB
A["Token IDs
[15496, 1917]"] --> B{Special Tokens?}
B -->|EOS Found| C[Stop at EOS]
B -->|No EOS| D[Process All Tokens]
C --> D
D --> E[For Each Token ID]
E --> F["Lookup Bytes
vocab[t_i]"]
F --> G["Concatenate
all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."]
G --> H["UTF-8 Decode
text = decode(all_bytes)"]
H --> I["Output Text
'Hello world'"]
style A fill:#e1f5ff
style I fill:#e1ffe1
style F fill:#fff4e1
```
### 6.4 Decoding Example
**Input:** $[259, 1917]$
**Step 1: Lookup**
```math
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
```
```math
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
```
```math
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
```
```math
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
```
```math
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
```
```math
= [72, 101, 108, 108, 111]
```
```math
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
```
**Step 2: Concatenate**
```math
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
```
```math
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```
**Step 3: Decode**
```math
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = "Hello world"
```
---
## 7. Regex Pattern Splitting
### 7.1 GPT-4 Style Pattern
**Pattern Components:**
```math
P = P_{\text{contractions}} \cup P_{\text{letters}} \cup P_{\text{numbers}} \cup P_{\text{punctuation}} \cup P_{\text{whitespace}}
```
**Pattern Definition** (the components joined left to right as a single alternation):
```
'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++|\r?\n|\s+
```
**Components:**
1. **Contractions:** $P_{\text{contractions}}$ = `'(?i:[sdmt]|ll|ve|re)`
   - Matches: `'s`, `'d`, `'m`, `'t`, `'ll`, `'ve`, `'re` (case-insensitive)
2. **Letters:** $P_{\text{letters}}$ = `[^\r\n\p{L}\p{N}]?+\p{L}+`
   - One optional leading character that is not a letter, digit, or newline (typically a space) + one or more letters
3. **Numbers:** $P_{\text{numbers}}$ = `\p{N}{1,3}`
   - **Limit:** 1-3 digits only (prevents long number tokens)
4. **Punctuation:** $P_{\text{punctuation}}$ = ` ?[^\s\p{L}\p{N}]++`
   - Optional space + one or more punctuation characters
5. **Whitespace:** $P_{\text{whitespace}}$ = `\r?\n` or `\s+`
   - Newlines and runs of whitespace
### 7.2 Regex Splitting Function
```math
\text{RegexSplit}(s, P) = [m_1, m_2, \ldots, m_k], \quad m_i \in \text{Match}(s, P)
```
where matches are found left-to-right, non-overlapping.
### 7.3 Regex Splitting Diagram
```mermaid
graph TB
A["Input Text
'Hello world 123'"] --> B[Apply Regex Pattern]
B --> C1["Chunk 1: 'Hello'
P_letters"]
B --> C2["Chunk 2: ' world'
P_whitespace + P_letters"]
B --> C3["Chunk 3: ' 123'
P_whitespace + P_numbers"]
C1 --> D["Chunks List
['Hello', ' world', ' 123']"]
C2 --> D
C3 --> D
style A fill:#e1f5ff
style D fill:#e1ffe1
style B fill:#fff4e1
```
**Example:**
```math
\text{RegexSplit}("Hello world 123", P) = ["Hello", " world", " 123"]
```
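The third-party `regex` module (unlike the standard-library `re`) supports the `\p{L}`/`\p{N}` Unicode classes and the possessive quantifiers `?+`/`++` used in this pattern, so the example can be checked directly. Note that the single space before "123" becomes its own chunk, because $P_{\text{numbers}}$ has no optional leading space:
```python
import regex  # pip install regex

PATTERN = (r"'(?i:[sdmt]|ll|ve|re)"
           r"|[^\r\n\p{L}\p{N}]?+\p{L}+"
           r"|\p{N}{1,3}"
           r"| ?[^\s\p{L}\p{N}]++"
           r"|\r?\n"
           r"|\s+")

print(regex.findall(PATTERN, "Hello world 123"))
# ['Hello', ' world', ' ', '123']
```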
---
## 8. Special Tokens
### 8.1 Special Token Set
**Special Tokens:**
```math
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
```
**Token IDs:**
```math
\text{id}(\text{<pad>}) = 0, \quad \text{id}(\text{<unk>}) = 1, \quad \text{id}(\text{<bos>}) = 2, \quad \text{id}(\text{<eos>}) = 3
```
### 8.2 Special Token Functions
**Padding:**
```math
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{id}(\text{<pad>})]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
```
**Unknown Token:**
```math
\text{encode}(c) = \begin{cases}
[\text{id}(c)] & \text{if } c \in V \\
[\text{id}(\text{<unk>})] & \text{if } c \notin V
\end{cases}
```
**EOS Handling:**
```math
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \text{ where } i = \min\{j : t_j = \text{id}(\text{<eos>})\}
```
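Helper sketches for the formulas above, assuming the ID assignments from Section 8.1:
```python
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3  # assignments from Section 8.1

def pad(token_ids: list[int], max_length: int) -> list[int]:
    """Right-pad with <pad> up to max_length (never truncates)."""
    return token_ids + [PAD_ID] * max(max_length - len(token_ids), 0)

def strip_at_eos(token_ids: list[int]) -> list[int]:
    """Keep everything before the first <eos> token."""
    return token_ids[:token_ids.index(EOS_ID)] if EOS_ID in token_ids else token_ids

assert pad([1, 2, 3], 5) == [1, 2, 3, 0, 0]
```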
### 8.3 Special Token Flowchart
```mermaid
graph TB
subgraph "Special Tokens"
A1[" → 0
Padding"]
A2[" → 1
Unknown"]
A3[" → 2
Beginning"]
A4[" → 3
End"]
end
subgraph "Usage"
B1["Padding:
[1,2,3] → [1,2,3,0,0]"]
B2["Unknown:
unknown_char → 1"]
B3["EOS Stop:
[1,2,3] → stop at 3"]
end
A1 --> B1
A2 --> B2
A4 --> B3
style A1 fill:#e1f5ff
style A2 fill:#ffe1f5
style A3 fill:#e1ffe1
style A4 fill:#fff4e1
```
---
## 9. Complete Tokenization Pipeline
### 9.1 Full Pipeline
**Complete Tokenization Process:**
```mermaid
graph TB
Start[Input Text] --> Split[Regex Split]
Split --> UTF8[UTF-8 Encode]
UTF8 --> BPE[BPE Merge]
BPE --> Special{Special Tokens?}
Special -->|Yes| Handle[Handle Special]
Special -->|No| Combine[Combine Tokens]
Handle --> Combine
Combine --> Output[Token IDs]
Output --> Embed[Embedding Layer]
style Start fill:#e1f5ff
style Output fill:#e1ffe1
style Embed fill:#ffe1f5
```
### 9.2 Mathematical Formulation
**Complete Encoding:**
```math
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```
**Complete Decoding:**
```math
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{t_i \in \text{RemoveSpecial}(\mathbf{t})} \text{vocab}(t_i)\right)
```
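Assembled from the earlier sketches (`PATTERN` from Section 7, `bpe_encode` from Section 5, `strip_at_eos` and the special-token IDs from Section 8), the whole pipeline is a few lines; again an illustrative sketch, not the SheepOp source:
```python
import regex

def tokenize(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Regex split -> UTF-8 bytes -> BPE merges, chunk by chunk."""
    token_ids: list[int] = []
    for chunk in regex.findall(PATTERN, text):
        token_ids.extend(bpe_encode(list(chunk.encode("utf-8")), merges))
    return token_ids

def detokenize(token_ids: list[int], vocab: dict[int, bytes]) -> str:
    """Drop special tokens, look up bytes, UTF-8 decode."""
    ids = strip_at_eos(token_ids)
    ids = [t for t in ids if t not in (PAD_ID, BOS_ID)]  # RemoveSpecial
    return b"".join(vocab[t] for t in ids).decode("utf-8", errors="replace")
```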
### 9.3 Pipeline Example
**Input:** "Hello world!"
**Step 1: Regex Split**
```math
\text{chunks} = ["Hello", " world", "!"]
```
**Step 2: UTF-8 Encode**
```math
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
```
**Step 3: BPE Merge**
```math
\text{tokens} = [[15496], [1917], [0]]
```
**Step 4: Combine**
```math
\text{final} = [15496, 1917, 0]
```
---
## 10. Tokenization Challenges and Solutions
### 10.1 Challenge: Long Tokens
**Problem:** Some tokens become very long (e.g., "defaultstyle" as single token).
**Impact:** with unrestricted merging, a frequent long string collapses into a single token:
```math
\text{TokenCount}(s) = 1 \quad \text{even when } |s| > 10 \text{ characters}
```
**Solution:** Better regex splitting prevents over-merging.
### 10.2 Challenge: Number Tokenization
**Problem:** Numbers tokenized arbitrarily (sometimes 1 token, sometimes 2-3).
**Solution:** Limit number merging to 1-3 digits:
$P_{\text{numbers}}$ = `\p{N}{1,3}`
**Impact:**
```math
\text{TokenCount}(n) = \begin{cases}
1 & \text{if } n \text{ has at most 3 digits} \\
\lceil d(n)/3 \rceil & \text{otherwise, where } d(n) \text{ is the number of digits}
\end{cases}
```
### 10.3 Challenge: Python Code Efficiency
**Problem:** Each space is separate token (GPT-2 issue).
**Solution:** Merge multiple spaces:
$P_{\text{whitespace}}$ = `\s+` (matches runs of spaces)
**Efficiency Gain:**
```math
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
```
```math
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(grouped spaces)}
```
### 10.4 Challenge: Trailing Whitespace
**Problem:** Trailing spaces cause poor tokenization.
**Detection:**
```math
\text{HasTrailingSpace}(s) = \begin{cases}
\text{True} & \text{if } s[-1] = ' ' \\
\text{False} & \text{otherwise}
\end{cases}
```
**Warning:**
```math
\text{Tokenize}(s) = \begin{cases}
\text{encode}(s) + \text{warning} & \text{if } \text{HasTrailingSpace}(s) \\
\text{encode}(s) & \text{otherwise}
\end{cases}
```
### 10.5 Challenge: Multilingual Support
**Problem:** Non-English languages tokenize inefficiently.
**Solution:** UTF-8 byte-level encoding handles all languages:
```math
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
```
**Efficiency:**
```math
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-English}})|}{|\text{Tokenize}(s_{\text{English}})|} \approx 1.5\text{–}2.0
```
### 10.6 Challenge: Untrained Tokens
**Problem:** Some tokens never appear in training data.
**Solution:** Fallback handling:
```math
\text{Decode}(t) = \begin{cases}
\text{vocab}(t) & \text{if } t \in V \\
\text{fallback\_bytes} & \text{if } t \notin V \text{ and } t < 256 \\
\text{<unk>} & \text{otherwise}
\end{cases}
```
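Because tokens expand to raw bytes, a decoded byte string is not guaranteed to be valid UTF-8; Python's `errors="replace"` policy yields the standard fallback, the replacement character U+FFFD:
```python
# 0xE4 starts a 3-byte UTF-8 sequence; alone it is invalid.
print(bytes([228]).decode("utf-8", errors="replace"))  # '�'
```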
---
## Summary
### Key Formulas
**Encoding:**
```math
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```
**Decoding:**
```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```
**BPE Merge:**
```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
```
**Vocabulary Size:**
```math
|V| = 256 + K + |V_{\text{special}}|
```
### Key Takeaways
1. **UTF-8 Encoding**: Handles all Unicode characters consistently
2. **BPE Algorithm**: Creates efficient vocabulary through iterative merging
3. **Regex Splitting**: Prevents over-merging with pattern-based chunking
4. **Special Tokens**: Control flow and handle edge cases
5. **Byte-Level**: Works at byte level for universal language support