Tokenization Explained - Mathematical Formulation
Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.
Table of Contents
- Introduction to Tokenization
- UTF-8 Encoding
- Byte Pair Encoding Algorithm
- Vocabulary Construction
- Encoding Process
- Decoding Process
- Regex Pattern Splitting
- Special Tokens
- Complete Tokenization Pipeline
- Tokenization Challenges and Solutions
1. Introduction to Tokenization
1.1 What is Tokenization?
Tokenization is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.
Mathematical Definition:
Given a text string s \in \Sigma^* where \Sigma is the character alphabet, tokenization maps:
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
where:
- s = input text string
- c_i = individual characters
- \mathbf{t} = sequence of token IDs
- n = number of characters
- m = number of tokens (typically m \leq n)
1.2 Why Tokenization?
Problem: Neural networks require numerical inputs, not raw text.
Solution: Convert text to token sequences:
graph LR
A["Raw Text<br/>'Hello world'"] --> B[Tokenizer]
B --> C["Token IDs<br/>[15496, 1917]"]
C --> D[Embedding Layer]
D --> E["Embeddings<br/>ℝ^n×d"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e1ffe1
style E fill:#ffe1f5
1.3 Tokenization Approaches
Three Main Approaches:
1. Character-Level: Each character becomes a token
   - Vocabulary size: ~100-200
   - Sequence length: Very long
2. Word-Level: Each word becomes a token
   - Vocabulary size: ~50,000-100,000
   - Sequence length: Moderate
3. Subword-Level (BPE): Sequences of bytes/characters become tokens
   - Vocabulary size: ~30,000-100,000
   - Sequence length: Efficient
graph TB
subgraph "Character-Level"
C1["'Hello'"] --> C2["['H','e','l','l','o']"]
C2 --> C3["5 tokens"]
end
subgraph "Word-Level"
W1["'Hello world'"] --> W2["['Hello', 'world']"]
W2 --> W3["2 tokens"]
end
subgraph "Subword-Level (BPE)"
B1["'Hello world'"] --> B2["['Hello', ' world']"]
B2 --> B3["2 tokens"]
end
style C3 fill:#ffcccc
style W3 fill:#ccffcc
style B3 fill:#ccffcc
2. UTF-8 Encoding
2.1 Unicode Code Points
Unicode defines a mapping from characters to integers (code points):
f: \mathcal{C} \rightarrow \{0, 1, \ldots, 0x10FFFF\}
where \mathcal{C} is the set of all Unicode characters.
Example:
f('H') = 72, \quad f('ε') = 949, \quad f('中') = 20013
2.2 UTF-8 Encoding Function
UTF-8 encodes Unicode code points into variable-length byte sequences:
\text{UTF-8}: \{0, \ldots, 0x10FFFF\} \rightarrow \{0, \ldots, 255\}^*
Encoding Rules:
For a code point c:
\text{UTF-8}(c) = \begin{cases}
[b_0] & \text{if } c < 128 \\
[b_0, b_1] & \text{if } c < 2048 \\
[b_0, b_1, b_2] & \text{if } c < 65536 \\
[b_0, b_1, b_2, b_3] & \text{if } c \leq 0x10FFFF
\end{cases}
where b_i \in \{0, \ldots, 255\} are bytes.
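The four byte-length cases can be checked directly with Python's built-in str.encode (a quick sanity check, not part of the SheepOp code; the helper name is illustrative):

```python
# Number of UTF-8 bytes per code point, matching the four cases above.
def utf8_byte_length(code_point: int) -> int:
    return len(chr(code_point).encode("utf-8"))

# 1-byte ASCII, 2-byte Greek, 3-byte CJK, 4-byte emoji
lengths = [utf8_byte_length(cp) for cp in (72, 949, 20013, 0x1F600)]
# lengths -> [1, 2, 3, 4]
```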
2.3 UTF-8 Encoding Process
graph TB
A["Unicode String<br/>'Hello 世界'"] --> B[Extract Code Points]
B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"]
C --> D[UTF-8 Encode]
D --> E["Bytes<br/>[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]
style A fill:#e1f5ff
style E fill:#e1ffe1
Mathematical Formulation:
For a string s = c_1 c_2 \ldots c_n:
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
where \bigoplus denotes byte concatenation.
Example:
\text{bytes}("Hi") = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
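The "Hello 世界" example from the diagram above can be reproduced with two lines of standard-library Python:

```python
# bytes(s) = concatenation of UTF-8(f(c_i)) over all characters c_i
text = "Hello 世界"
code_points = [ord(c) for c in text]      # f(c_i): Unicode code points
utf8_bytes = list(text.encode("utf-8"))   # variable-length byte sequence
# code_points -> [72, 101, 108, 108, 111, 32, 19990, 30028]
# utf8_bytes  -> [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
```

Note that the two CJK characters contribute one code point each but three bytes each, which is why m code points can expand to more than m bytes.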
3. Byte Pair Encoding Algorithm
3.1 BPE Overview
Byte Pair Encoding (BPE) is a data-compression algorithm, adapted for tokenization, that iteratively merges the most frequent pair of adjacent tokens into a new token.
Goal: Create an efficient vocabulary by merging frequent byte/character pairs.
3.2 BPE Training Algorithm
Initial State:
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
Iterative Merging:
For iteration k = 1, 2, \ldots, K:
Step 1: Calculate Pair Frequencies
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
Step 2: Find Most Frequent Pair
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
Step 3: Create New Token
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
Step 4: Apply Merge
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
where:
\text{Merge}(T, (a, b), \text{new\_id})_i = \begin{cases}
\text{new\_id} & \text{if } T_i = a \land T_{i+1} = b \\
T_i & \text{otherwise}
\end{cases}
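CountPairs and Merge translate directly into Python (a sketch; the function names are illustrative, not necessarily those in the SheepOp source):

```python
def get_stats(tokens):
    """CountPairs: frequency of each consecutive token pair."""
    stats = {}
    for a, b in zip(tokens, tokens[1:]):
        stats[(a, b)] = stats.get((a, b), 0) + 1
    return stats

def merge(tokens, pair, new_id):
    """Merge: replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2          # consume both halves of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The `i += 2` step makes the scan left-to-right and non-overlapping, which is the behavior the piecewise Merge definition above intends.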
3.3 BPE Training Flowchart
graph TB
Start[Start Training] --> Init["Initialize<br/>V⁽⁰⁾ = {0...255}<br/>tokens⁽⁰⁾ = bytes(text)"]
Init --> Iter{More Merges?}
Iter -->|Yes| Stats["Calculate Pair Frequencies<br/>stats⁽ᵏ⁾(i,j)"]
Stats --> Find["Find Most Frequent Pair<br/>(i*, j*) = argmax stats"]
Find --> Create["Create New Token<br/>id = 256 + k - 1"]
Create --> Merge["Merge All Occurrences<br/>tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"]
Merge --> Update["Update Vocabulary<br/>V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
Update --> Iter
Iter -->|No| End[Training Complete]
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Stats fill:#fff4e1
style Merge fill:#ffe1f5
3.4 BPE Example
Example: Training on text "aaab"
Iteration 0:
V^{(0)} = \{0, 1, \ldots, 255\}
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
Calculate frequencies:
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
Iteration 1:
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
\text{tokens}^{(1)} = [256, 97, 98]
V^{(1)} = V^{(0)} \cup \{256\}
Iteration 2:
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
Both pairs tie at frequency 1; choose one:
(i^*, j^*) = (256, 97), \quad \text{new\_id} = 257
\text{tokens}^{(2)} = [257, 98]
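The "aaab" walkthrough can be reproduced with a minimal training loop (a sketch with the helper logic inlined; ties are broken by first occurrence here, matching the "choose one" step above):

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Minimal BPE training loop over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for k in range(num_merges):
        stats = Counter(zip(tokens, tokens[1:]))  # CountPairs
        if not stats:
            break
        pair = max(stats, key=stats.get)          # (i*, j*) = argmax stats
        new_id = 256 + k                          # new ids: 256, 257, ...
        merges[pair] = new_id
        out, i = [], 0                            # Merge(tokens, pair, new_id)
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges
```

Running `train_bpe("aaab", 2)` reproduces the two iterations above: tokens `[257, 98]` and merges `{(97, 97): 256, (256, 97): 257}`.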
4. Vocabulary Construction
4.1 Vocabulary Structure
Final Vocabulary:
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
where:
- V_{\text{bytes}} = \{0, 1, \ldots, 255\} (256 byte tokens)
- V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\} (K merged tokens)
- V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
Vocabulary Size:
|V| = 256 + K + |V_{\text{special}}|
4.2 Merge Dictionary
Merge Dictionary:
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
M(i, j) = \begin{cases}
\text{merged\_id} & \text{if } (i, j) \text{ was merged} \\
\text{undefined} & \text{otherwise}
\end{cases}
Vocabulary Mapping:
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
\text{vocab}(\text{id}) = \begin{cases}
[\text{id}] & \text{if } \text{id} < 256 \\
\text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id}
\end{cases}
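The recursive vocab(id) definition unrolls into a simple loop, provided the merges are visited in creation order so that both children already exist (a sketch; names are illustrative):

```python
def build_vocab(merges):
    """vocab(id): map each token id to its byte sequence via the recursion above."""
    vocab = {i: bytes([i]) for i in range(256)}   # base case: id < 256
    # merges must be iterated in creation order (dicts preserve insertion order)
    for (a, b), idx in merges.items():
        vocab[idx] = vocab[a] + vocab[b]          # vocab(i) ⊕ vocab(j)
    return vocab
```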
4.3 Vocabulary Construction Diagram
graph TB
subgraph "Initial Vocabulary"
A1["256 Byte Tokens<br/>{0, 1, ..., 255}"]
end
subgraph "BPE Merges"
B1["Merge 1: (101, 32) → 256"]
B2["Merge 2: (256, 108) → 257"]
B3["Merge K: ... → 256+K-1"]
end
subgraph "Special Tokens"
C1["<pad> → 0"]
C2["<unk> → 1"]
C3["<bos> → 2"]
C4["<eos> → 3"]
end
A1 --> D[Final Vocabulary]
B1 --> D
B2 --> D
B3 --> D
C1 --> D
C2 --> D
C3 --> D
C4 --> D
D --> E["|V| = 256 + K + 4"]
style A1 fill:#e1f5ff
style D fill:#e1ffe1
style E fill:#fff4e1
5. Encoding Process
5.1 Encoding Algorithm
Input: Text string s
Step 1: Regex Splitting
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
Step 2: Convert to Bytes
For each chunk c \in \text{chunks}:
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
Step 3: Apply BPE Merges
Initialize: \text{tokens} = \text{bytes}(c)
While merges possible:
\text{Find earliest merge: } (i^*, j^*) = \arg\min_{\substack{(i,j) \in \text{pairs}(\text{tokens}) \\ (i,j) \in \text{dom}(M)}} \text{merge\_index}(i, j)
\text{Apply merge: } \text{tokens} = \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
Step 4: Combine Results
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
5.2 Encoding Function
Mathematical Definition:
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
where \bigoplus denotes token sequence concatenation.
5.3 Encoding Flowchart
graph TB
A["Input Text<br/>'Hello world'"] --> B[Regex Split]
B --> C["Chunks<br/>['Hello', ' world']"]
C --> D[For Each Chunk]
D --> E["UTF-8 Encode<br/>bytes = [72, 101, 108, 108, 111]"]
E --> F["Initialize Tokens<br/>tokens = bytes"]
F --> G{Merge Possible?}
G -->|Yes| H["Find Earliest Merge<br/>(i*, j*)"]
H --> I["Apply Merge<br/>tokens = Merge(tokens, (i*,j*), id)"]
I --> G
G -->|No| J[Add to Result]
J --> K{More Chunks?}
K -->|Yes| D
K -->|No| L["Final Token IDs<br/>[15496, 1917]"]
style A fill:#e1f5ff
style L fill:#e1ffe1
style G fill:#ffe1f5
5.4 Encoding Example
Input: "Hello world"
Step 1: Regex Split
\text{chunks} = ["Hello", " world"]
Step 2: UTF-8 Encoding
\text{UTF-8}("Hello") = [72, 101, 108, 108, 111]
\text{UTF-8}(" world") = [32, 119, 111, 114, 108, 100]
Step 3: BPE Merging
Assume merges: M(72, 101) = 256, M(256, 108) = 257, M(257, 108) = 258, M(258, 111) = 259
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
\xrightarrow{M(258,111)} [259]
Final: [259, 1917] (assuming the merges for " world" produce token 1917)
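Under the merge table assumed above, the Step 3 loop looks like this in Python (a sketch for a single regex chunk; the function name is illustrative, and regex splitting is handled outside it):

```python
def bpe_encode(text, merges):
    """Apply merges to one chunk's bytes, earliest-created merge first."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) > 1:
        pairs = set(zip(tokens, tokens[1:]))
        # pick the mergeable pair with the lowest merge id (earliest rank)
        pair = min((p for p in pairs if p in merges),
                   key=merges.get, default=None)
        if pair is None:
            break                      # no merge possible -> done
        new_id = merges[pair]
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

With the assumed merges, `bpe_encode("Hello", merges)` collapses [72, 101, 108, 108, 111] down to [259] exactly as in the derivation above.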
6. Decoding Process
6.1 Decoding Algorithm
Input: Token IDs \mathbf{t} = [t_1, t_2, \ldots, t_n]
Step 1: Handle Special Tokens
\text{Stop at EOS: } \mathbf{t}' = \mathbf{t}[:i] \text{ where } t_i = \text{<eos>}
Step 2: Lookup Bytes
For each token t_i:
\text{bytes}_i = \text{vocab}(t_i)
Step 3: Concatenate Bytes
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
Step 4: UTF-8 Decode
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
6.2 Decoding Function
Mathematical Definition:
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
6.3 Decoding Flowchart
graph TB
A["Token IDs<br/>[15496, 1917]"] --> B{Special Tokens?}
B -->|EOS Found| C[Stop at EOS]
B -->|No EOS| D[Process All Tokens]
C --> D
D --> E[For Each Token ID]
E --> F["Lookup Bytes<br/>vocab[t_i]"]
F --> G["Concatenate<br/>all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."]
G --> H["UTF-8 Decode<br/>text = decode(all_bytes)"]
H --> I["Output Text<br/>'Hello world'"]
style A fill:#e1f5ff
style I fill:#e1ffe1
style F fill:#fff4e1
6.4 Decoding Example
Input: [259, 1917]
Step 1: Lookup
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
= [72, 101, 108, 108, 111]
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
Step 2: Concatenate
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
Step 3: Decode
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = "Hello world"
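Decoding is the simple direction: one vocabulary lookup per token, a byte concatenation, and one UTF-8 decode (a sketch; `errors="replace"` is an assumed safety choice for token sequences that do not form valid UTF-8):

```python
def bpe_decode(token_ids, vocab):
    """decode(t) = UTF-8^{-1} of the concatenated per-token byte sequences."""
    data = b"".join(vocab[t] for t in token_ids)
    # "replace" substitutes U+FFFD for byte runs that are not valid UTF-8
    return data.decode("utf-8", errors="replace")
```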
7. Regex Pattern Splitting
7.1 GPT-4 Style Pattern
Pattern Components:
P = P_{\text{contractions}} \cup P_{\text{letters}} \cup P_{\text{numbers}} \cup P_{\text{punctuation}} \cup P_{\text{whitespace}}
Pattern Definition:
`'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++|\r?\n|\s+`
(alternatives are tried left to right)
Components:
1. Contractions: P_{\text{contractions}} = \text{'(?i:[sdmt]|ll|ve|re)}
   - Matches 's, 'd, 'm, 't, 'll, 've, 're (case-insensitive)
2. Letters: P_{\text{letters}} = \text{[^\r\n\p{L}\p{N}]?+\p{L}+}
   - Optional leading non-letter/non-digit character (e.g. a space) + one or more letters
3. Numbers: P_{\text{numbers}} = \text{\p{N}{1,3}}
   - Limit: 1-3 digits only (prevents long number tokens)
4. Punctuation: P_{\text{punctuation}} = \text{ ?[^\s\p{L}\p{N}]++}
   - Optional space + one or more punctuation characters
5. Whitespace: P_{\text{whitespace}} = \text{\r?\n} \cup \text{\s+}
   - Newlines and runs of spaces
7.2 Regex Splitting Function
\text{RegexSplit}(s, P) = \{m_1, m_2, \ldots, m_k : m_i \in \text{Match}(s, P)\}
where matches are found left-to-right, non-overlapping.
7.3 Regex Splitting Diagram
graph TB
A["Input Text<br/>'Hello world 123'"] --> B[Apply Regex Pattern]
B --> C1["Chunk 1: 'Hello'<br/>P_letters"]
B --> C2["Chunk 2: ' world'<br/>P_whitespace + P_letters"]
B --> C3["Chunk 3: ' 123'<br/>P_whitespace + P_numbers"]
C1 --> D["Chunks List<br/>['Hello', ' world', ' 123']"]
C2 --> D
C3 --> D
style A fill:#e1f5ff
style D fill:#e1ffe1
style B fill:#fff4e1
Example:
\text{RegexSplit}("Hello world 123", P) = ["Hello", " world", " 123"]
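Python's built-in re module lacks the \p{L}/\p{N} classes and possessive quantifiers (?+, ++) used above, so the following is a simplified ASCII-only approximation of the pattern (the full pattern needs the third-party regex module):

```python
import re

# Simplified ASCII approximation of the GPT-4-style split pattern.
PATTERN = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"          # contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # optional prefix char + letters
    r"|[0-9]{1,3}"                    # 1-3 digit number groups
    r"| ?[^\sA-Za-z0-9]+"             # optional space + punctuation
    r"|\r?\n"                         # newlines
    r"|\s+"                           # whitespace runs
)

chunks = PATTERN.findall("Hello world 123")
# chunks -> ['Hello', ' world', ' ', '123']
```

Note the standalone `' '` chunk before `'123'`: this simplified pattern splits the space from the digits, while the diagram above groups them as one chunk; the exact grouping depends on the whitespace-handling variant of the pattern.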
8. Special Tokens
8.1 Special Token Set
Special Tokens:
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
Token IDs:
\text{id}(\text{<pad>}) = 0, \quad \text{id}(\text{<unk>}) = 1, \quad \text{id}(\text{<bos>}) = 2, \quad \text{id}(\text{<eos>}) = 3
8.2 Special Token Functions
Padding:
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{<pad>}]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
Unknown Token:
\text{encode}(c) = \begin{cases}
[\text{id}(c)] & \text{if } c \in V \\
[\text{id}(\text{<unk>})] & \text{if } c \notin V
\end{cases}
(With byte-level BPE, every byte is already in V, so \text{<unk>} is rarely triggered in practice.)
EOS Handling:
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \text{ where } t_i = \text{<eos>}
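Both operations translate to a few lines of Python, using the ids from 8.1 as defaults (a sketch; function names are illustrative):

```python
def pad(ids, max_length, pad_id=0):
    """pad(t, max_length): right-pad with <pad> up to max_length."""
    return ids + [pad_id] * max(max_length - len(ids), 0)

def truncate_at_eos(ids, eos_id=3):
    """Drop everything from the first <eos> onward before decoding."""
    return ids[: ids.index(eos_id)] if eos_id in ids else ids
```

The max(..., 0) mirrors the exponent in the padding formula: sequences already at or beyond max_length are returned unchanged rather than truncated.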
8.3 Special Token Flowchart
graph TB
subgraph "Special Tokens"
A1["<pad> → 0<br/>Padding"]
A2["<unk> → 1<br/>Unknown"]
A3["<bos> → 2<br/>Beginning"]
A4["<eos> → 3<br/>End"]
end
subgraph "Usage"
B1["Padding:<br/>[1,2,3] → [1,2,3,0,0]"]
B2["Unknown:<br/>unknown_char → 1"]
B3["EOS Stop:<br/>[1,2,3] → stop at 3"]
end
A1 --> B1
A2 --> B2
A4 --> B3
style A1 fill:#e1f5ff
style A2 fill:#ffe1f5
style A3 fill:#e1ffe1
style A4 fill:#fff4e1
9. Complete Tokenization Pipeline
9.1 Full Pipeline
Complete Tokenization Process:
graph TB
Start[Input Text] --> Split[Regex Split]
Split --> UTF8[UTF-8 Encode]
UTF8 --> BPE[BPE Merge]
BPE --> Special{Special Tokens?}
Special -->|Yes| Handle[Handle Special]
Special -->|No| Combine[Combine Tokens]
Handle --> Combine
Combine --> Output[Token IDs]
Output --> Embed[Embedding Layer]
style Start fill:#e1f5ff
style Output fill:#e1ffe1
style Embed fill:#ffe1f5
9.2 Mathematical Formulation
Complete Encoding:
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
Complete Decoding:
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(\text{RemoveSpecial}(t_i))\right)
9.3 Pipeline Example
Input: "Hello world!"
Step 1: Regex Split
\text{chunks} = ["Hello", " world", "!"]
Step 2: UTF-8 Encode
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
Step 3: BPE Merge
\text{tokens} = [[15496], [1917], [33]]
Step 4: Combine
\text{final} = [15496, 1917, 33]
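The whole pipeline fits in one small class combining the pieces from sections 3-6 (a self-contained toy, not the SheepOp implementation: regex splitting and special tokens are omitted for brevity, and all names are illustrative):

```python
from collections import Counter

class ToyBPE:
    """End-to-end sketch: train -> encode -> decode."""

    def __init__(self):
        self.merges = {}                                  # (i, j) -> new_id
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> bytes

    @staticmethod
    def _merge(tokens, pair, new_id):
        # Replace every non-overlapping occurrence of `pair` with `new_id`
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def train(self, text, num_merges):
        tokens = list(text.encode("utf-8"))
        for k in range(num_merges):
            stats = Counter(zip(tokens, tokens[1:]))
            if not stats:
                break
            pair = max(stats, key=stats.get)              # most frequent pair
            new_id = 256 + k
            self.merges[pair] = new_id
            self.vocab[new_id] = self.vocab[pair[0]] + self.vocab[pair[1]]
            tokens = self._merge(tokens, pair, new_id)

    def encode(self, text):
        tokens = list(text.encode("utf-8"))
        while len(tokens) > 1:
            pairs = set(zip(tokens, tokens[1:]))
            # apply the earliest-created applicable merge first
            pair = min((p for p in pairs if p in self.merges),
                       key=self.merges.get, default=None)
            if pair is None:
                break
            tokens = self._merge(tokens, pair, self.merges[pair])
        return tokens

    def decode(self, token_ids):
        return b"".join(self.vocab[t] for t in token_ids).decode("utf-8")
```

Because encoding replays the training merges in creation order, decode(encode(s)) round-trips any string built from bytes seen during training.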
10. Tokenization Challenges and Solutions
10.1 Challenge: Long Tokens
Problem: Some tokens become very long (e.g., "defaultstyle" as single token).
Mathematical Impact: over-merging can collapse a long string into a single token:
\text{LongToken}(s) = \begin{cases}
1 \text{ token} & \text{if } s \text{ was fully merged during training} \\
\text{multiple tokens} & \text{otherwise}
\end{cases}
Solution: Better regex splitting prevents over-merging.
10.2 Challenge: Number Tokenization
Problem: Numbers tokenized arbitrarily (sometimes 1 token, sometimes 2-3).
Solution: Limit number merging to 1-3 digits:
P_{\text{numbers}} = \text{\p{N}{1,3}}
Impact:
\text{TokenCount}(n) = \begin{cases}
1 & \text{if } n < 1000 \\
2-3 & \text{if } n \geq 1000
\end{cases}
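The effect of the \p{N}{1,3} limit is easy to see with a plain regex (ASCII digits here for portability with Python's built-in re):

```python
import re

# A number is split into groups of at most three digits before BPE runs,
# so long numbers can never collapse into one opaque token.
groups = re.findall(r"\d{1,3}", "1234567")
# groups -> ['123', '456', '7']
```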
10.3 Challenge: Python Code Efficiency
Problem: Each space is separate token (GPT-2 issue).
Solution: Merge multiple spaces:
P_{\text{whitespace}} = \text{\s+} \quad \text{(matches multiple spaces)}
Efficiency Gain:
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(grouped spaces)}
10.4 Challenge: Trailing Whitespace
Problem: Trailing spaces cause poor tokenization.
Detection:
\text{HasTrailingSpace}(s) = \begin{cases}
\text{True} & \text{if } s[-1] = ' ' \\
\text{False} & \text{otherwise}
\end{cases}
Warning:
\text{Tokenize}(s) = \begin{cases}
\text{encode}(s) + \text{warning} & \text{if } \text{HasTrailingSpace}(s) \\
\text{encode}(s) & \text{otherwise}
\end{cases}
10.5 Challenge: Multilingual Support
Problem: Non-English languages tokenize inefficiently.
Solution: UTF-8 byte-level encoding handles all languages:
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
Efficiency:
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-english}})|}{|\text{Tokenize}(s_{\text{english}})|} \approx 1.5-2.0
10.6 Challenge: Untrained Tokens
Problem: Some tokens never appear in training data.
Solution: Fallback handling for out-of-range ids:
\text{Decode}(t) = \begin{cases}
\text{vocab}(t) & \text{if } t \in V \\
\text{<unk>} & \text{if } t \notin V
\end{cases}
(Every id below 256 is in V and decodes to its raw byte, so only ids beyond the vocabulary need the fallback.)
Summary
Key Formulas
Encoding:
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
Decoding:
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
BPE Merge:
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
Vocabulary Size:
|V| = 256 + K + |V_{\text{special}}|
Key Takeaways
- UTF-8 Encoding: Handles all Unicode characters consistently
- BPE Algorithm: Creates efficient vocabulary through iterative merging
- Regex Splitting: Prevents over-merging with pattern-based chunking
- Special Tokens: Control flow and handle edge cases
- Byte-Level: Works at byte level for universal language support