# Tokenization Explained - Mathematical Formulation

Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.

## Table of Contents

1. [Introduction to Tokenization](#1-introduction-to-tokenization)
2. [UTF-8 Encoding](#2-utf-8-encoding)
3. [Byte Pair Encoding Algorithm](#3-byte-pair-encoding-algorithm)
4. [Vocabulary Construction](#4-vocabulary-construction)
5. [Encoding Process](#5-encoding-process)
6. [Decoding Process](#6-decoding-process)
7. [Regex Pattern Splitting](#7-regex-pattern-splitting)
8. [Special Tokens](#8-special-tokens)
9. [Complete Tokenization Pipeline](#9-complete-tokenization-pipeline)
10. [Tokenization Challenges and Solutions](#10-tokenization-challenges-and-solutions)

---
## 1. Introduction to Tokenization

### 1.1 What is Tokenization?

**Tokenization** is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.

**Mathematical Definition:**

Given a text string $s \in \Sigma^*$ where $\Sigma$ is the character alphabet, tokenization maps:

```math
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
```

```math
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
```

where:

- $s$ = input text string
- $c_i$ = individual characters
- $\mathbf{t}$ = sequence of token IDs
- $n$ = number of characters
- $m$ = number of tokens (typically $m \leq n$)

### 1.2 Why Tokenization?

**Problem:** Neural networks require numerical inputs, not raw text.

**Solution:** Convert text to token sequences:

```mermaid
graph LR
A["Raw Text<br/>'Hello world'"] --> B[Tokenizer]
B --> C["Token IDs<br/>[15496, 1917]"]
C --> D[Embedding Layer]
D --> E["Embeddings<br/>ℝ^n×d"]

style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e1ffe1
style E fill:#ffe1f5
```
### 1.3 Tokenization Approaches

**Three Main Approaches:**

1. **Character-Level**: Each character becomes a token
   - Vocabulary size: ~100-200
   - Sequence length: Very long

2. **Word-Level**: Each word becomes a token
   - Vocabulary size: ~50,000-100,000
   - Sequence length: Moderate, but unseen words fall outside the vocabulary

3. **Subword-Level (BPE)**: Frequent byte/character sequences become tokens
   - Vocabulary size: ~30,000-100,000
   - Sequence length: Short, with no out-of-vocabulary words at the byte level

A toy comparison of the first two granularities is sketched in code after the diagram below.

```mermaid
graph TB
subgraph "Character-Level"
C1["'Hello'"] --> C2["['H','e','l','l','o']"]
C2 --> C3["5 tokens"]
end

subgraph "Word-Level"
W1["'Hello world'"] --> W2["['Hello', 'world']"]
W2 --> W3["2 tokens"]
end

subgraph "Subword-Level (BPE)"
B1["'Hello world'"] --> B2["['Hello', ' world']"]
B2 --> B3["2 tokens"]
end

style C3 fill:#ffcccc
style W3 fill:#ccffcc
style B3 fill:#ccffcc
```
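As a quick illustration of the granularity trade-off (plain Python, not part of the SheepOp tokenizer), character-level and naive word-level splits of the same string give very different token counts:

```python
text = "Hello world"

char_tokens = list(text)       # character-level split
word_tokens = text.split(" ")  # naive word-level split

print(len(char_tokens))  # 11 tokens
print(len(word_tokens))  # 2 tokens
```

BPE (Sections 3-5) sits between these extremes: sequences about as short as word-level splitting, but with a fixed-size vocabulary and no unknown words.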
---

## 2. UTF-8 Encoding

### 2.1 Unicode Code Points

**Unicode** defines a mapping from characters to integers (code points):

```math
f: \mathcal{C} \rightarrow \{0, 1, \ldots, \text{0x10FFFF}\}
```

where $\mathcal{C}$ is the set of all Unicode characters.

**Example:**

```math
f(\text{'H'}) = 72, \quad f(\text{'ε'}) = 949, \quad f(\text{'中'}) = 20013
```

### 2.2 UTF-8 Encoding Function

**UTF-8** encodes Unicode code points into variable-length byte sequences:

```math
\text{UTF-8}: \{0, \ldots, \text{0x10FFFF}\} \rightarrow \{0, \ldots, 255\}^*
```

**Encoding Rules:**

For a code point $c$:

```math
\text{UTF-8}(c) = \begin{cases}
[b_0] & \text{if } c < 128 \\
[b_0, b_1] & \text{if } c < 2048 \\
[b_0, b_1, b_2] & \text{if } c < 65536 \\
[b_0, b_1, b_2, b_3] & \text{if } c \leq \text{0x10FFFF}
\end{cases}
```

where $b_i \in \{0, \ldots, 255\}$ are bytes.

### 2.3 UTF-8 Encoding Process

```mermaid
graph TB
A["Unicode String<br/>'Hello 世界'"] --> B[Extract Code Points]
B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"]
C --> D[UTF-8 Encode]
D --> E["Bytes<br/>[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]

style A fill:#e1f5ff
style E fill:#e1ffe1
```

**Mathematical Formulation:**

For a string $s = c_1 c_2 \ldots c_n$:

```math
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
```

where $\bigoplus$ denotes byte concatenation.

**Example:**

```math
\text{bytes}(\text{"Hi"}) = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
```
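The diagram above can be reproduced directly in Python; this is a plain standard-library illustration, independent of the SheepOp code:

```python
text = "Hello 世界"

# Unicode code points f(c) for each character.
print([ord(c) for c in text])
# [72, 101, 108, 108, 111, 32, 19990, 30028]

# UTF-8 byte sequence bytes(s): ASCII stays one byte, CJK expands to three.
print(list(text.encode("utf-8")))
# [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
```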
---

## 3. Byte Pair Encoding Algorithm

### 3.1 BPE Overview

**Byte Pair Encoding (BPE)** is a data compression algorithm that iteratively merges the most frequent pair of consecutive tokens into a new token.

**Goal:** Create an efficient vocabulary by merging frequent byte/character pairs.

### 3.2 BPE Training Algorithm

**Initial State:**

```math
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
```

```math
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
```

**Iterative Merging:**

For iteration $k = 1, 2, \ldots, K$:

**Step 1: Calculate Pair Frequencies**

```math
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
```

```math
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
```

**Step 2: Find Most Frequent Pair**

```math
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
```

**Step 3: Create New Token**

```math
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
```

```math
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
```

**Step 4: Apply Merge**

```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
```

where:

```math
\text{Merge}(T, (a, b), \text{new\_id})_i = \begin{cases}
\text{new\_id} & \text{if } T_i = a \land T_{i+1} = b \\
T_i & \text{otherwise}
\end{cases}
```

Each matched pair becomes a single token: the second element $T_{i+1}$ is consumed, and overlapping occurrences are resolved left to right, so the sequence strictly shortens, $|\text{tokens}^{(k)}| < |\text{tokens}^{(k-1)}|$. A runnable sketch of these two primitives follows.
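The two primitives above map directly onto a few lines of Python. This is a minimal sketch (the names `count_pairs` and `merge` are ours, not necessarily those used in the SheepOp code):

```python
from collections import Counter

def count_pairs(tokens: list[int]) -> Counter:
    """stats(i, j): frequency of each consecutive token pair."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2  # both elements of the matched pair are consumed
        else:
            out.append(tokens[i])
            i += 1
    return out
```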
### 3.3 BPE Training Flowchart

```mermaid
graph TB
Start[Start Training] --> Init["Initialize<br/>V⁽⁰⁾ = {0...255}<br/>tokens⁽⁰⁾ = bytes(text)"]
Init --> Iter{More Merges?}

Iter -->|Yes| Stats["Calculate Pair Frequencies<br/>stats⁽ᵏ⁾(i,j)"]
Stats --> Find["Find Most Frequent Pair<br/>(i*, j*) = argmax stats"]
Find --> Create["Create New Token<br/>id = 256 + k - 1"]
Create --> Merge["Merge All Occurrences<br/>tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"]
Merge --> Update["Update Vocabulary<br/>V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
Update --> Iter

Iter -->|No| End[Training Complete]

style Start fill:#e1f5ff
style End fill:#e1ffe1
style Stats fill:#fff4e1
style Merge fill:#ffe1f5
```
### 3.4 BPE Example

**Example:** Training on text "aaab"

**Iteration 0 (initial state):**

```math
V^{(0)} = \{0, 1, \ldots, 255\}
```

```math
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
```

**Iteration 1:**

Calculate frequencies over $\text{tokens}^{(0)}$:

```math
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
```

Merge the most frequent pair:

```math
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
```

```math
\text{tokens}^{(1)} = [256, 97, 98]
```

```math
V^{(1)} = V^{(0)} \cup \{256\}
```

**Iteration 2:**

```math
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
```

Both pairs occur once, so the tie is broken by taking the first candidate: $(256, 97)$, with $\text{new\_id} = 257$.

```math
\text{tokens}^{(2)} = [257, 98]
```
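Putting the primitives from Section 3.2 together gives a complete, if minimal, trainer; running it on "aaab" reproduces the trace above. `train_bpe` is our own sketch name, and it reuses `count_pairs` and `merge` from the earlier snippet:

```python
def train_bpe(text: str, num_merges: int):
    """Minimal BPE trainer: returns the merge table M and the final token list."""
    tokens = list(text.encode("utf-8"))      # tokens^(0) = bytes(s)
    merges: dict[tuple[int, int], int] = {}  # M: (i*, j*) -> new token ID
    for k in range(num_merges):
        stats = count_pairs(tokens)
        if not stats:
            break                            # sequence too short to merge further
        pair = max(stats, key=stats.get)     # most frequent pair; ties -> first seen
        new_id = 256 + k                     # 0-based k here, so IDs run 256, 257, ...
        merges[pair] = new_id
        tokens = merge(tokens, pair, new_id)
    return merges, tokens

merges, tokens = train_bpe("aaab", num_merges=2)
print(merges)  # {(97, 97): 256, (256, 97): 257}
print(tokens)  # [257, 98]
```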
---

## 4. Vocabulary Construction

### 4.1 Vocabulary Structure

**Final Vocabulary:**

```math
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
```

where:

- $V_{\text{bytes}} = \{0, 1, \ldots, 255\}$ (256 tokens)
- $V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\}$ ($K$ merged tokens)
- $V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}$

**Vocabulary Size:**

```math
|V| = 256 + K + |V_{\text{special}}|
```

### 4.2 Merge Dictionary

**Merge Dictionary:**

```math
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
```

```math
M(i, j) = \begin{cases}
\text{merged\_id} & \text{if } (i, j) \text{ was merged} \\
\text{undefined} & \text{otherwise}
\end{cases}
```

**Vocabulary Mapping:**

```math
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
```

```math
\text{vocab}(\text{id}) = \begin{cases}
[\text{id}] & \text{if } \text{id} < 256 \\
\text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id}
\end{cases}
```
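Because every merged ID is defined in terms of two earlier IDs, the whole mapping can be materialized in a single pass over the merge table in creation order. A minimal sketch (`build_vocab` is our own name; it assumes the `merges` dict from the trainer sketch above, whose insertion order matches creation order):

```python
def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
    """vocab(id): the byte string each token ID expands to."""
    vocab = {i: bytes([i]) for i in range(256)}  # base case: id < 256
    for (i, j), new_id in merges.items():        # creation order: i, j already defined
        vocab[new_id] = vocab[i] + vocab[j]      # vocab(i) ⊕ vocab(j)
    return vocab

vocab = build_vocab({(97, 97): 256, (256, 97): 257})
print(vocab[257])  # b'aaa'
```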
### 4.3 Vocabulary Construction Diagram

```mermaid
graph TB
subgraph "Initial Vocabulary"
A1["256 Byte Tokens<br/>{0, 1, ..., 255}"]
end

subgraph "BPE Merges"
B1["Merge 1: (101, 32) → 256"]
B2["Merge 2: (256, 108) → 257"]
B3["Merge K: ... → 256+K-1"]
end

subgraph "Special Tokens"
C1["<pad> → 50256"]
C2["<unk> → 50257"]
C3["<bos> → 50258"]
C4["<eos> → 50259"]
end

A1 --> D[Final Vocabulary]
B1 --> D
B2 --> D
B3 --> D
C1 --> D
C2 --> D
C3 --> D
C4 --> D

D --> E["|V| = 256 + K + 4"]

style A1 fill:#e1f5ff
style D fill:#e1ffe1
style E fill:#fff4e1
```

The special-token IDs shown assume $K = 50000$ merges: the last merge ID is $256 + K - 1 = 50255$, and the four special tokens follow immediately after (see Section 8.1).
---

## 5. Encoding Process

### 5.1 Encoding Algorithm

**Input:** Text string $s$

**Step 1: Regex Splitting**

```math
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
```

**Step 2: Convert to Bytes**

For each chunk $c \in \text{chunks}$:

```math
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
```

**Step 3: Apply BPE Merges**

Initialize: $\text{tokens} = \text{bytes}(c)$

While the sequence contains a mergeable pair, apply the merge that was learned earliest:

```math
(i^*, j^*) = \arg\min_{(i,j) \in \text{Pairs}(\text{tokens}) \cap \text{dom}(M)} M(i, j)
```

```math
\text{tokens} \leftarrow \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
```

where $\text{Pairs}(\text{tokens})$ is the set of adjacent pairs in the current sequence and $\text{dom}(M)$ the set of learned merges. Merge IDs were assigned in training order, so $M(i, j)$ itself serves as the merge's rank; applying merges in this order reproduces the training-time segmentation.

**Step 4: Combine Results**

```math
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
```

### 5.2 Encoding Function

**Mathematical Definition:**

```math
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
```

where $\bigoplus$ denotes token sequence concatenation.
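Step 3 is the only subtle part: among all adjacent pairs currently present, pick the one with the lowest merge rank. A minimal sketch, reusing `merge` from the Section 3.2 snippet (the name `bpe_encode` is ours):

```python
def bpe_encode(chunk_bytes: bytes, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges to one chunk, earliest-learned (lowest ID) first."""
    tokens = list(chunk_bytes)
    while len(tokens) >= 2:
        pairs = set(zip(tokens, tokens[1:]))
        # Pair with the lowest merge ID; pairs with no learned merge rank as +inf.
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no learnable pair remains
        tokens = merge(tokens, best, merges[best])
    return tokens

print(bpe_encode("aaab".encode("utf-8"), {(97, 97): 256, (256, 97): 257}))
# [257, 98]
```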
### 5.3 Encoding Flowchart

```mermaid
graph TB
A["Input Text<br/>'Hello world'"] --> B[Regex Split]
B --> C["Chunks<br/>['Hello', ' world']"]

C --> D[For Each Chunk]
D --> E["UTF-8 Encode<br/>bytes = [72, 101, 108, 108, 111]"]
E --> F["Initialize Tokens<br/>tokens = bytes"]

F --> G{Merge Possible?}
G -->|Yes| H["Find Earliest Merge<br/>(i*, j*)"]
H --> I["Apply Merge<br/>tokens = Merge(tokens, (i*,j*), id)"]
I --> G

G -->|No| J[Add to Result]
J --> K{More Chunks?}
K -->|Yes| D
K -->|No| L["Final Token IDs<br/>[15496, 1917]"]

style A fill:#e1f5ff
style L fill:#e1ffe1
style G fill:#ffe1f5
```

### 5.4 Encoding Example

**Input:** "Hello world"

**Step 1: Regex Split**

```math
\text{chunks} = [\text{"Hello"}, \text{" world"}]
```

**Step 2: UTF-8 Encoding**

```math
\text{UTF-8}(\text{"Hello"}) = [72, 101, 108, 108, 111]
```

```math
\text{UTF-8}(\text{" world"}) = [32, 119, 111, 114, 108, 100]
```

**Step 3: BPE Merging**

Assume merges: $M(72, 101) = 256$, $M(256, 108) = 257$, $M(257, 108) = 258$, $M(258, 111) = 259$

```math
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
```

```math
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
```

```math
\xrightarrow{M(258,111)} [259]
```

**Final:** $[259, 1917]$, assuming analogous merges collapse " world" to the single token 1917.
---

## 6. Decoding Process

### 6.1 Decoding Algorithm

**Input:** Token IDs $\mathbf{t} = [t_1, t_2, \ldots, t_n]$

**Step 1: Handle Special Tokens**

Truncate at the first end-of-sequence token:

```math
\mathbf{t}' = \mathbf{t}[:i] \quad \text{where } i \text{ is the first index with } t_i = \text{id}(\text{<eos>})
```

**Step 2: Lookup Bytes**

For each token $t_i$:

```math
\text{bytes}_i = \text{vocab}(t_i)
```

**Step 3: Concatenate Bytes**

```math
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
```

**Step 4: UTF-8 Decode**

```math
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
```

### 6.2 Decoding Function

**Mathematical Definition:**

```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```
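All four steps fit in one short function. A minimal sketch (our own names, reusing the `vocab` table from the Section 4.2 sketch); note `errors="replace"`, since a generated token sequence can end partway through a multi-byte UTF-8 character:

```python
def decode(token_ids: list[int], vocab: dict[int, bytes],
           eos_id: int | None = None) -> str:
    """decode(t): truncate at <eos>, look up bytes, concatenate, UTF-8 decode."""
    if eos_id is not None and eos_id in token_ids:
        token_ids = token_ids[: token_ids.index(eos_id)]  # stop at first <eos>
    all_bytes = b"".join(vocab[t] for t in token_ids)
    return all_bytes.decode("utf-8", errors="replace")    # tolerate partial chars

toy_vocab = {**{i: bytes([i]) for i in range(256)}, 256: b"aa", 257: b"aaa"}
print(decode([257, 98], toy_vocab))  # aaab
```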
### 6.3 Decoding Flowchart

```mermaid
graph TB
A["Token IDs<br/>[15496, 1917]"] --> B{Special Tokens?}
B -->|EOS Found| C[Stop at EOS]
B -->|No EOS| D[Process All Tokens]
C --> D

D --> E[For Each Token ID]
E --> F["Lookup Bytes<br/>vocab[t_i]"]
F --> G["Concatenate<br/>all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."]

G --> H["UTF-8 Decode<br/>text = decode(all_bytes)"]
H --> I["Output Text<br/>'Hello world'"]

style A fill:#e1f5ff
style I fill:#e1ffe1
style F fill:#fff4e1
```
### 6.4 Decoding Example

**Input:** $[259, 1917]$ (using the merges assumed in Section 5.4)

**Step 1: Lookup**

```math
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
```

```math
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
```

```math
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
```

```math
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
```

```math
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
```

```math
= [72, 101, 108, 108, 111]
```

```math
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
```

**Step 2: Concatenate**

```math
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
```

```math
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```

**Step 3: Decode**

```math
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = \text{"Hello world"}
```

---
## 7. Regex Pattern Splitting

### 7.1 GPT-4 Style Pattern

**Pattern Components:**

```math
P = P_{\text{contractions}} \cup P_{\text{letters}} \cup P_{\text{numbers}} \cup P_{\text{punctuation}} \cup P_{\text{whitespace}}
```

**Pattern Definition** (the components joined as regex alternatives; shown as a raw pattern rather than math, since it is full of regex metacharacters):

`'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++|\r?\n|\s+`

**Components:**

1. **Contractions:** `'(?i:[sdmt]|ll|ve|re)`
   - Matches: `'s`, `'d`, `'m`, `'t`, `'ll`, `'ve`, `'re` (case-insensitive)

2. **Letters:** `[^\r\n\p{L}\p{N}]?+\p{L}+`
   - Optional leading non-letter, non-digit character (typically a space) + one or more letters

3. **Numbers:** `\p{N}{1,3}`
   - **Limit:** 1-3 digits only (prevents long number tokens)

4. **Punctuation:** ` ?[^\s\p{L}\p{N}]++`
   - Optional space + one or more punctuation characters

5. **Whitespace:** `\r?\n` and `\s+`
   - Newlines and runs of whitespace

### 7.2 Regex Splitting Function

```math
\text{RegexSplit}(s, P) = [m_1, m_2, \ldots, m_k] \quad \text{where } m_i \in \text{Match}(s, P)
```

where matches are found left-to-right, non-overlapping.
### 7.3 Regex Splitting Diagram

Note that with the pattern above, the space before a number is not absorbed into the number chunk (the number branch has no optional leading space), so `' 123'` splits into a whitespace chunk and a digit chunk:

```mermaid
graph TB
A["Input Text<br/>'Hello world 123'"] --> B[Apply Regex Pattern]

B --> C1["Chunk 1: 'Hello'<br/>P_letters"]
B --> C2["Chunk 2: ' world'<br/>P_letters (with leading space)"]
B --> C3["Chunk 3: ' '<br/>P_whitespace"]
B --> C4["Chunk 4: '123'<br/>P_numbers"]

C1 --> D["Chunks List<br/>['Hello', ' world', ' ', '123']"]
C2 --> D
C3 --> D
C4 --> D

style A fill:#e1f5ff
style D fill:#e1ffe1
style B fill:#fff4e1
```

**Example:**

```math
\text{RegexSplit}(\text{"Hello world 123"}, P) = [\text{"Hello"}, \text{" world"}, \text{" "}, \text{"123"}]
```
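This split can be checked with the third-party `regex` module (the standard-library `re` does not support `\p{...}` classes). The pattern string below is assembled from the components in Section 7.1 and is a simplified stand-in for the full GPT-4 pattern:

```python
import regex  # pip install regex; supports \p{...} and possessive quantifiers

PATTERN = (
    r"'(?i:[sdmt]|ll|ve|re)"        # contractions
    r"|[^\r\n\p{L}\p{N}]?+\p{L}+"   # optional leading non-letter + letters
    r"|\p{N}{1,3}"                  # runs of 1-3 digits
    r"| ?[^\s\p{L}\p{N}]++"         # optional space + punctuation
    r"|\r?\n|\s+"                   # newlines and remaining whitespace
)

print(regex.findall(PATTERN, "Hello world 123"))
# ['Hello', ' world', ' ', '123']
```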
---

## 8. Special Tokens

### 8.1 Special Token Set

**Special Tokens:**

```math
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
```

**Token IDs:**

Special tokens are appended after the byte tokens ($0$-$255$) and the $K$ merge tokens, so their IDs never collide with ordinary tokens:

```math
\text{id}(\text{<pad>}) = 256 + K, \quad \text{id}(\text{<unk>}) = 257 + K, \quad \text{id}(\text{<bos>}) = 258 + K, \quad \text{id}(\text{<eos>}) = 259 + K
```

With $K = 50000$ merges this gives IDs 50256-50259, matching the vocabulary diagram in Section 4.3.
### 8.2 Special Token Functions

**Padding:**

```math
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{id}(\text{<pad>})]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
```

**Unknown Token:**

For a single symbol $c$:

```math
\text{encode}(c) = \begin{cases}
[\text{id}(c)] & \text{if } c \in V \\
[\text{id}(\text{<unk>})] & \text{if } c \notin V
\end{cases}
```

In byte-level BPE every byte is in $V$, so `<unk>` only acts as a safety fallback.

**EOS Handling:**

```math
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \quad \text{where } i \text{ is the first index with } t_i = \text{id}(\text{<eos>})
```
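The padding formula translates directly; a minimal sketch (our own names, with a purely illustrative `pad_id`):

```python
def pad(token_ids: list[int], max_length: int, pad_id: int) -> list[int]:
    """Right-pad with <pad> up to max_length; longer sequences pass through."""
    return token_ids + [pad_id] * max(max_length - len(token_ids), 0)

print(pad([10, 20, 30], max_length=5, pad_id=50256))
# [10, 20, 30, 50256, 50256]
```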
### 8.3 Special Token Flowchart

```mermaid
graph TB
subgraph "Special Tokens"
A1["<pad> → 256+K<br/>Padding"]
A2["<unk> → 257+K<br/>Unknown"]
A3["<bos> → 258+K<br/>Beginning"]
A4["<eos> → 259+K<br/>End"]
end

subgraph "Usage"
B1["Padding:<br/>[1,2,3] → [1,2,3,<pad>,<pad>]"]
B2["Unknown:<br/>unknown symbol → <unk>"]
B3["EOS Stop:<br/>decoding stops at <eos>"]
end

A1 --> B1
A2 --> B2
A4 --> B3

style A1 fill:#e1f5ff
style A2 fill:#ffe1f5
style A3 fill:#e1ffe1
style A4 fill:#fff4e1
```

---
## 9. Complete Tokenization Pipeline

### 9.1 Full Pipeline

**Complete Tokenization Process:**

```mermaid
graph TB
Start[Input Text] --> Split[Regex Split]
Split --> UTF8[UTF-8 Encode]
UTF8 --> BPE[BPE Merge]
BPE --> Special{Special Tokens?}
Special -->|Yes| Handle[Handle Special]
Special -->|No| Combine[Combine Tokens]
Handle --> Combine
Combine --> Output[Token IDs]

Output --> Embed[Embedding Layer]

style Start fill:#e1f5ff
style Output fill:#e1ffe1
style Embed fill:#ffe1f5
```

### 9.2 Mathematical Formulation

**Complete Encoding:**

```math
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```

**Complete Decoding:**

```math
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{t_i \in \text{RemoveSpecial}(\mathbf{t})} \text{vocab}(t_i)\right)
```

where $\text{RemoveSpecial}$ filters special-token IDs out of the sequence before the byte lookup.
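Chaining the earlier sketches gives the whole encoding pipeline in a few lines; `PATTERN`, `bpe_encode`, and the `merges` table are the illustrative names defined in Sections 3-7, not necessarily the SheepOp API:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Full pipeline: regex split -> UTF-8 bytes -> BPE per chunk -> concatenate."""
    ids: list[int] = []
    for chunk in regex.findall(PATTERN, text):
        ids.extend(bpe_encode(chunk.encode("utf-8"), merges))
    return ids
```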
### 9.3 Pipeline Example

**Input:** "Hello world!"

**Step 1: Regex Split**

```math
\text{chunks} = [\text{"Hello"}, \text{" world"}, \text{"!"}]
```

**Step 2: UTF-8 Encode**

```math
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
```

**Step 3: BPE Merge**

```math
\text{tokens} = [[15496], [1917], [0]]
```

(Token IDs here come from a fully trained vocabulary, unlike the toy merges of Section 5.4.)

**Step 4: Combine**

```math
\text{final} = [15496, 1917, 0]
```

---
## 10. Tokenization Challenges and Solutions

### 10.1 Challenge: Long Tokens

**Problem:** Unconstrained BPE can merge long recurring strings (e.g., "defaultstyle") into single tokens. Such tokens appear in few contexts during training, so their embeddings are poorly trained, and they make segmentation brittle.

**Solution:** Better regex splitting prevents over-merging: merges never cross chunk boundaries, so tokens stay within word-like units.
### 10.2 Challenge: Number Tokenization

**Problem:** Without constraints, numbers are tokenized arbitrarily (sometimes 1 token, sometimes 2-3, depending on which digit sequences happened to be merged).

**Solution:** Limit number chunks to 1-3 digits in the split pattern, using `\p{N}{1,3}` (Section 7.1).

**Impact:** a number with $d$ digits always splits into a predictable number of chunks:

```math
\text{NumChunks}(d) = \lceil d / 3 \rceil
```

so anything up to 999 is a single chunk, and longer numbers split into groups of at most three digits. A quick check follows.
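A one-liner with the `regex` module (as in the Section 7 sketch) confirms the grouping:

```python
import regex

print(regex.findall(r"\p{N}{1,3}", "1234567"))
# ['123', '456', '7']  -> ceil(7 / 3) = 3 chunks
```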
### 10.3 Challenge: Python Code Efficiency

**Problem:** In GPT-2's tokenizer, every space of indentation becomes its own token, which wastes context length on code.

**Solution:** The whitespace branch `\s+` matches a whole run of spaces as a single chunk, so BPE can merge common indentation runs into single tokens.

**Efficiency Gain:**

```math
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
```

```math
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(assuming runs of four spaces merge into one token)}
```
### 10.4 Challenge: Trailing Whitespace

**Problem:** A prompt that ends in a space separates that space from the word that would normally follow it, creating an unusual token boundary and often degrading completions.

**Detection:**

```math
\text{HasTrailingSpace}(s) = \begin{cases}
\text{True} & \text{if } s[-1] = \text{' '} \\
\text{False} & \text{otherwise}
\end{cases}
```

**Warning:** encoding proceeds normally, but the tokenizer emits a warning so the caller can strip the trailing space:

```math
\text{Tokenize}(s) = \begin{cases}
\text{encode}(s) \text{ (with warning)} & \text{if } \text{HasTrailingSpace}(s) \\
\text{encode}(s) & \text{otherwise}
\end{cases}
```
### 10.5 Challenge: Multilingual Support

**Problem:** Non-English languages tokenize inefficiently (more tokens per character).

**Solution:** UTF-8 byte-level encoding handles all languages:

```math
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
```

**Efficiency:**

```math
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-english}})|}{|\text{Tokenize}(s_{\text{english}})|} \approx 1.5\text{-}2.0
```
### 10.6 Challenge: Untrained Tokens

**Problem:** Some token IDs never appear in the training data, so their embeddings are never updated.

**Solution:** Fallback handling at decode time:

```math
\text{Decode}(t) = \begin{cases}
\text{vocab}(t) & \text{if } t \in V \\
\text{vocab}(\text{<unk>}) & \text{otherwise}
\end{cases}
```

Every ID below 256 is a raw byte token, so it is always in $V$ and always decodable; only IDs outside the vocabulary fall back to `<unk>`.

---
## Summary

### Key Formulas

**Encoding:**

```math
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```

**Decoding:**

```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```

**BPE Merge:**

```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
```

**Vocabulary Size:**

```math
|V| = 256 + K + |V_{\text{special}}|
```

### Key Takeaways

1. **UTF-8 Encoding**: Handles all Unicode characters consistently
2. **BPE Algorithm**: Creates an efficient vocabulary through iterative merging
3. **Regex Splitting**: Prevents over-merging with pattern-based chunking
4. **Special Tokens**: Control sequence boundaries and handle edge cases
5. **Byte-Level**: Working on raw bytes gives universal language support