# Tokenization Explained - Mathematical Formulation

Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.

## Table of Contents

1. [Introduction to Tokenization](#1-introduction-to-tokenization)
2. [UTF-8 Encoding](#2-utf-8-encoding)
3. [Byte Pair Encoding Algorithm](#3-byte-pair-encoding-algorithm)
4. [Vocabulary Construction](#4-vocabulary-construction)
5. [Encoding Process](#5-encoding-process)
6. [Decoding Process](#6-decoding-process)
7. [Regex Pattern Splitting](#7-regex-pattern-splitting)
8. [Special Tokens](#8-special-tokens)
9. [Complete Tokenization Pipeline](#9-complete-tokenization-pipeline)
10. [Tokenization Challenges and Solutions](#10-tokenization-challenges-and-solutions)

---

## 1. Introduction to Tokenization

### 1.1 What is Tokenization?

**Tokenization** is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.

**Mathematical Definition:**

Given a text string $s \in \Sigma^*$ where $\Sigma$ is the character alphabet, tokenization maps:

```math
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
```

```math
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
```

where:
- $s$ = input text string
- $c_i$ = individual characters
- $\mathbf{t}$ = sequence of token IDs
- $n$ = number of characters
- $m$ = number of tokens (typically $m \leq n$)

### 1.2 Why Tokenization?

**Problem:** Neural networks require numerical inputs, not raw text.

**Solution:** Convert text to token sequences:

```mermaid
graph LR
    A["Raw Text
'Hello world'"] --> B[Tokenizer] B --> C["Token IDs
[15496, 1917]"] C --> D[Embedding Layer] D --> E["Embeddings
ℝ^n×d"]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#e1ffe1
    style E fill:#ffe1f5
```

### 1.3 Tokenization Approaches

**Three Main Approaches:**

1. **Character-Level**: Each character becomes a token
   - Vocabulary size: ~100-200
   - Sequence length: Very long
2. **Word-Level**: Each word becomes a token
   - Vocabulary size: ~50,000-100,000
   - Sequence length: Moderate
3. **Subword-Level (BPE)**: Frequent sequences of bytes/characters become tokens
   - Vocabulary size: ~30,000-100,000
   - Sequence length: Efficient

```mermaid
graph TB
    subgraph "Character-Level"
        C1["'Hello'"] --> C2["['H','e','l','l','o']"]
        C2 --> C3["5 tokens"]
    end
    subgraph "Word-Level"
        W1["'Hello world'"] --> W2["['Hello', 'world']"]
        W2 --> W3["2 tokens"]
    end
    subgraph "Subword-Level (BPE)"
        B1["'Hello world'"] --> B2["['Hello', ' world']"]
        B2 --> B3["2 tokens"]
    end
    style C3 fill:#ffcccc
    style W3 fill:#ccffcc
    style B3 fill:#ccffcc
```

---

## 2. UTF-8 Encoding

### 2.1 Unicode Code Points

**Unicode** defines a mapping from characters to integers (code points):

```math
f: \mathcal{C} \rightarrow \{0, 1, \ldots, \text{0x10FFFF}\}
```

where $\mathcal{C}$ is the set of all Unicode characters.

**Example:**

```math
f('H') = 72, \quad f('ε') = 949, \quad f('中') = 20013
```

### 2.2 UTF-8 Encoding Function

**UTF-8** encodes Unicode code points into variable-length byte sequences:

```math
\text{UTF-8}: \{0, \ldots, \text{0x10FFFF}\} \rightarrow \{0, \ldots, 255\}^*
```

**Encoding Rules:**

For a code point $c$:

```math
\text{UTF-8}(c) = \begin{cases} [b_0] & \text{if } c < 128 \\ [b_0, b_1] & \text{if } c < 2048 \\ [b_0, b_1, b_2] & \text{if } c < 65536 \\ [b_0, b_1, b_2, b_3] & \text{if } c \leq \text{0x10FFFF} \end{cases}
```

where $b_i \in \{0, \ldots, 255\}$ are bytes.

### 2.3 UTF-8 Encoding Process

```mermaid
graph TB
    A["Unicode String
'Hello 世界'"] --> B[Extract Code Points] B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"] C --> D[UTF-8 Encode] D --> E["Bytes
[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]
    style A fill:#e1f5ff
    style E fill:#e1ffe1
```

**Mathematical Formulation:**

For a string $s = c_1 c_2 \ldots c_n$:

```math
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
```

where $\bigoplus$ denotes byte concatenation.

**Example:**

```math
\text{bytes}("Hi") = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
```

---

## 3. Byte Pair Encoding Algorithm

### 3.1 BPE Overview

**Byte Pair Encoding (BPE)** is a data compression algorithm that iteratively replaces the most frequent pair of consecutive tokens with a new token.

**Goal:** Create an efficient vocabulary by merging frequent byte/character pairs.

### 3.2 BPE Training Algorithm

**Initial State:**

```math
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
```

```math
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
```

**Iterative Merging:**

For iteration $k = 1, 2, \ldots, K$:

**Step 1: Calculate Pair Frequencies**

```math
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
```

```math
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
```

**Step 2: Find Most Frequent Pair**

```math
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
```

**Step 3: Create New Token**

```math
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
```

```math
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
```

**Step 4: Apply Merge**

```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
```

where $\text{Merge}(T, (a, b), \text{new\_id})$ scans $T$ from left to right and replaces every adjacent pair $T_p = a, T_{p+1} = b$ with the single token $\text{new\_id}$; both elements of the pair are consumed, so the sequence shrinks by one per merged occurrence.

### 3.3 BPE Training Flowchart

```mermaid
graph TB
    Start[Start Training] --> Init["Initialize
V⁽⁰⁾ = {0...255}
tokens⁽⁰⁾ = bytes(text)"] Init --> Iter{More Merges?} Iter -->|Yes| Stats["Calculate Pair Frequencies
stats⁽ᵏ⁾(i,j)"] Stats --> Find["Find Most Frequent Pair
(i*, j*) = argmax stats"] Find --> Create["Create New Token
id = 256 + k - 1"] Create --> Merge["Merge All Occurrences
tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"] Merge --> Update["Update Vocabulary
V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
    Update --> Iter
    Iter -->|No| End[Training Complete]
    style Start fill:#e1f5ff
    style End fill:#e1ffe1
    style Stats fill:#fff4e1
    style Merge fill:#ffe1f5
```

### 3.4 BPE Example

**Example:** Training on text "aaab"

**Initial state:**

```math
V^{(0)} = \{0, 1, \ldots, 255\}
```

```math
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
```

**Iteration 1:**

**Calculate frequencies:**

```math
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
```

```math
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
```

```math
\text{tokens}^{(1)} = [256, 97, 98]
```

```math
V^{(1)} = V^{(0)} \cup \{256\}
```

**Iteration 2:**

```math
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
```

**Choose one (tie):** \((256, 97)\), \(\text{new\_id} = 257\)

```math
\text{tokens}^{(2)} = [257, 98]
```

---

## 4. Vocabulary Construction

### 4.1 Vocabulary Structure

**Final Vocabulary:**

```math
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
```

where:
- $V_{\text{bytes}} = \{0, 1, \ldots, 255\}$ (256 tokens)
- $V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\}$ ($K$ merged tokens)
- $V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}$

**Vocabulary Size:**

```math
|V| = 256 + K + |V_{\text{special}}|
```

### 4.2 Merge Dictionary

**Merge Dictionary:**

```math
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
```

```math
M(i, j) = \begin{cases} \text{merged\_id} & \text{if } (i, j) \text{ was merged} \\ \text{undefined} & \text{otherwise} \end{cases}
```

**Vocabulary Mapping:**

```math
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
```

```math
\text{vocab}(\text{id}) = \begin{cases} [\text{id}] & \text{if } \text{id} < 256 \\ \text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id} \end{cases}
```

### 4.3 Vocabulary Construction Diagram

```mermaid
graph TB
    subgraph "Initial Vocabulary"
        A1["256 Byte Tokens
{0, 1, ..., 255}"]
    end
    subgraph "BPE Merges"
        B1["Merge 1: (101, 32) → 256"]
        B2["Merge 2: (256, 108) → 257"]
        B3["Merge K: ... → 256+K-1"]
    end
    subgraph "Special Tokens"
        C1["&lt;pad&gt; → 256+K"]
        C2["&lt;unk&gt; → 256+K+1"]
        C3["&lt;bos&gt; → 256+K+2"]
        C4["&lt;eos&gt; → 256+K+3"]
    end
    A1 --> D[Final Vocabulary]
    B1 --> D
    B2 --> D
    B3 --> D
    C1 --> D
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E["|V| = 256 + K + 4"]
    style A1 fill:#e1f5ff
    style D fill:#e1ffe1
    style E fill:#fff4e1
```

---

## 5. Encoding Process

### 5.1 Encoding Algorithm

**Input:** Text string $s$

**Step 1: Regex Splitting**

```math
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
```

**Step 2: Convert to Bytes**

For each chunk $c \in \text{chunks}$:

```math
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
```

**Step 3: Apply BPE Merges**

Initialize: $\text{tokens} = \text{bytes}(c)$

While merges are possible:

```math
\text{Find earliest-learned merge: } (i^*, j^*) = \arg\min_{(i,j) \in M} \text{merge\_index}(M(i, j))
```

```math
\text{Apply merge: } \text{tokens} = \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
```

**Step 4: Combine Results**

```math
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
```

### 5.2 Encoding Function

**Mathematical Definition:**

```math
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
```

where $\bigoplus$ denotes token sequence concatenation.

### 5.3 Encoding Flowchart

```mermaid
graph TB
    A["Input Text
'Hello world'"] --> B[Regex Split] B --> C["Chunks
['Hello', ' world']"] C --> D[For Each Chunk] D --> E["UTF-8 Encode
bytes = [72, 101, 108, 108, 111]"] E --> F["Initialize Tokens
tokens = bytes"] F --> G{Merge Possible?} G -->|Yes| H["Find Earliest Merge
(i*, j*)"] H --> I["Apply Merge
tokens = Merge(tokens, (i*,j*), id)"] I --> G G -->|No| J[Add to Result] J --> K{More Chunks?} K -->|Yes| D K -->|No| L["Final Token IDs
[15496, 1917]"]
    style A fill:#e1f5ff
    style L fill:#e1ffe1
    style G fill:#ffe1f5
```

### 5.4 Encoding Example

**Input:** "Hello world"

**Step 1: Regex Split**

```math
\text{chunks} = ["Hello", " world"]
```

**Step 2: UTF-8 Encoding**

```math
\text{UTF-8}("Hello") = [72, 101, 108, 108, 111]
```

```math
\text{UTF-8}(" world") = [32, 119, 111, 114, 108, 100]
```

**Step 3: BPE Merging**

Assume merges: $M(72, 101) = 256, M(256, 108) = 257, M(257, 108) = 258, M(258, 111) = 259$

```math
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
```

```math
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
```

```math
\xrightarrow{M(258,111)} [259]
```

Assuming " world" analogously merges down to a single token with ID 1917:

**Final:** $[259, 1917]$

---

## 6. Decoding Process

### 6.1 Decoding Algorithm

**Input:** Token IDs $\mathbf{t} = [t_1, t_2, \ldots, t_n]$

**Step 1: Handle Special Tokens**

```math
\text{Stop at EOS: } \mathbf{t}' = \mathbf{t}[:i] \text{ where } t_i = \text{id}(\text{<eos>})
```

**Step 2: Lookup Bytes**

For each token $t_i$:

```math
\text{bytes}_i = \text{vocab}(t_i)
```

**Step 3: Concatenate Bytes**

```math
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
```

**Step 4: UTF-8 Decode**

```math
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
```

### 6.2 Decoding Function

**Mathematical Definition:**

```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```

### 6.3 Decoding Flowchart

```mermaid
graph TB
    A["Token IDs
[15496, 1917]"] --> B{Special Tokens?} B -->|EOS Found| C[Stop at EOS] B -->|No EOS| D[Process All Tokens] C --> D D --> E[For Each Token ID] E --> F["Lookup Bytes
vocab[t_i]"] F --> G["Concatenate
all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."] G --> H["UTF-8 Decode
text = decode(all_bytes)"] H --> I["Output Text
'Hello world'"]
    style A fill:#e1f5ff
    style I fill:#e1ffe1
    style F fill:#fff4e1
```

### 6.4 Decoding Example

**Input:** $[259, 1917]$

**Step 1: Lookup**

```math
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
```

```math
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
```

```math
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
```

```math
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
```

```math
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
```

```math
= [72, 101, 108, 108, 111]
```

```math
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
```

**Step 2: Concatenate**

```math
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
```

```math
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```

**Step 3: Decode**

```math
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = "Hello world"
```

---

## 7. Regex Pattern Splitting

### 7.1 GPT-4 Style Pattern

**Pattern Components:**

```math
P = P_{\text{contractions}} \mid P_{\text{letters}} \mid P_{\text{numbers}} \mid P_{\text{punctuation}} \mid P_{\text{whitespace}}
```

**Pattern Definition** (regex alternation, matched left to right):

`'(?i:[sdmt]|ll|ve|re)` | `[^\r\n\p{L}\p{N}]?+\p{L}+` | `\p{N}{1,3}` | ` ?[^\s\p{L}\p{N}]++` | `\r?\n` | `\s+`

**Components:**

1. **Contractions:** $P_{\text{contractions}}$ = `'(?i:[sdmt]|ll|ve|re)`
   - Matches: `'s`, `'d`, `'m`, `'t`, `'ll`, `'ve`, `'re` (case-insensitive)
2. **Letters:** $P_{\text{letters}}$ = `[^\r\n\p{L}\p{N}]?+\p{L}+`
   - One optional leading character that is not a letter, digit, or newline (typically a space) + one or more letters
3. **Numbers:** $P_{\text{numbers}}$ = `\p{N}{1,3}`
   - **Limit:** 1-3 digits only (prevents long number tokens)
4. **Punctuation:** $P_{\text{punctuation}}$ = ` ?[^\s\p{L}\p{N}]++`
   - Optional space + one or more punctuation characters
5. 
**Whitespace:** $P_{\text{whitespace}}$ = `\r?\n` | `\s+`
   - Newlines and runs of whitespace

### 7.2 Regex Splitting Function

```math
\text{RegexSplit}(s, P) = [m_1, m_2, \ldots, m_k], \quad m_i \in \text{Match}(s, P)
```

where matches are found left-to-right, non-overlapping, in order.

### 7.3 Regex Splitting Diagram

```mermaid
graph TB
    A["Input Text
'Hello world 123'"] --> B[Apply Regex Pattern] B --> C1["Chunk 1: 'Hello'
P_letters"] B --> C2["Chunk 2: ' world'
P_whitespace + P_letters"] B --> C3["Chunk 3: ' 123'
P_whitespace + P_numbers"] C1 --> D["Chunks List
['Hello', ' world', ' 123']"]
    C2 --> D
    C3 --> D
    style A fill:#e1f5ff
    style D fill:#e1ffe1
    style B fill:#fff4e1
```

**Example:**

```math
\text{RegexSplit}("Hello world 123", P) = ["Hello", " world", " 123"]
```

---

## 8. Special Tokens

### 8.1 Special Token Set

**Special Tokens:**

```math
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
```

**Token IDs** (illustrative; in the final vocabulary of Section 4 the special tokens occupy the last four IDs, $256+K, \ldots, 256+K+3$):

```math
\text{id}(\text{<pad>}) = 0, \quad \text{id}(\text{<unk>}) = 1, \quad \text{id}(\text{<bos>}) = 2, \quad \text{id}(\text{<eos>}) = 3
```

### 8.2 Special Token Functions

**Padding:**

```math
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{<pad>}]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
```

**Unknown Token:** for a single symbol $c$:

```math
\text{encode}(c) = \begin{cases} [\text{id}(c)] & \text{if } c \in V \\ [\text{id}(\text{<unk>})] & \text{if } c \notin V \end{cases}
```

(With byte-level BPE every byte is in $V$, so `<unk>` is rarely needed in practice.)

**EOS Handling:**

```math
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \text{ where } t_i = \text{id}(\text{<eos>})
```

### 8.3 Special Token Flowchart

```mermaid
graph TB
    subgraph "Special Tokens"
        A1["&lt;pad&gt; → 0
Padding"] A2["&lt;unk&gt; → 1
Unknown"] A3["&lt;bos&gt; → 2
Beginning"] A4["&lt;eos&gt; → 3
End"] end subgraph "Usage" B1["Padding:
[1,2,3] → [1,2,3,0,0]"] B2["Unknown:
unknown_char → 1"] B3["EOS Stop:
[1,2,3] → stop at 3"]
    end
    A1 --> B1
    A2 --> B2
    A4 --> B3
    style A1 fill:#e1f5ff
    style A2 fill:#ffe1f5
    style A3 fill:#e1ffe1
    style A4 fill:#fff4e1
```

---

## 9. Complete Tokenization Pipeline

### 9.1 Full Pipeline

**Complete Tokenization Process:**

```mermaid
graph TB
    Start[Input Text] --> Split[Regex Split]
    Split --> UTF8[UTF-8 Encode]
    UTF8 --> BPE[BPE Merge]
    BPE --> Special{Special Tokens?}
    Special -->|Yes| Handle[Handle Special]
    Special -->|No| Combine[Combine Tokens]
    Handle --> Combine
    Combine --> Output[Token IDs]
    Output --> Embed[Embedding Layer]
    style Start fill:#e1f5ff
    style Output fill:#e1ffe1
    style Embed fill:#ffe1f5
```

### 9.2 Mathematical Formulation

**Complete Encoding:**

```math
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```

**Complete Decoding:**

```math
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{t \in \text{RemoveSpecial}(\mathbf{t})} \text{vocab}(t)\right)
```

### 9.3 Pipeline Example

**Input:** "Hello world!"

**Step 1: Regex Split**

```math
\text{chunks} = ["Hello", " world", "!"]
```

**Step 2: UTF-8 Encode**

```math
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
```

**Step 3: BPE Merge** (assuming "Hello" and " world" each merge to a single token; "!" stays the raw byte 33)

```math
\text{tokens} = [[15496], [1917], [33]]
```

**Step 4: Combine**

```math
\text{final} = [15496, 1917, 33]
```

---

## 10. Tokenization Challenges and Solutions

### 10.1 Challenge: Long Tokens

**Problem:** Without chunking, BPE can merge across word boundaries, so a long string (e.g., "defaultstyle") may end up as a single token:

```math
|\text{BPE}(\text{bytes}(s))| = 1 \quad \text{even when } |s| \gg 1
```

**Solution:** Better regex splitting prevents over-merging.

### 10.2 Challenge: Number Tokenization

**Problem:** Numbers tokenized arbitrarily (sometimes 1 token, sometimes 2-3).
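The digit-capping fix described next can be previewed with Python's standard `re` module; `\d` stands in for `\p{N}`, which stdlib `re` does not support:

```python
import re

# \p{N}{1,3} caps digit runs at three characters; \d is used here
# purely for illustration, since stdlib `re` lacks \p{N}.
NUMBER_PATTERN = r"\d{1,3}"

# Greedy left-to-right matching splits a long number into short groups.
chunks = re.findall(NUMBER_PATTERN, "1234567")
print(chunks)  # ['123', '456', '7']
```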
**Solution:** Limit number tokens to 1-3 digits via the pattern `\p{N}{1,3}` (Section 7).

**Impact:** a number with $d$ digits splits into

```math
\text{TokenCount} = \lceil d / 3 \rceil \quad \text{digit-group tokens (assuming each 1-3 digit group is in the vocabulary)}
```

### 10.3 Challenge: Python Code Efficiency

**Problem:** Each space is a separate token (GPT-2 issue), which wastes context on indented code.

**Solution:** Merge runs of whitespace with `\s+`.

**Efficiency Gain:**

```math
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
```

```math
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(e.g., with 4-space groups in the vocabulary)}
```

### 10.4 Challenge: Trailing Whitespace

**Problem:** Trailing spaces cause poor tokenization.

**Detection:**

```math
\text{HasTrailingSpace}(s) = \begin{cases} \text{True} & \text{if } s[-1] = \text{' '} \\ \text{False} & \text{otherwise} \end{cases}
```

**Warning:**

```math
\text{Tokenize}(s) = \begin{cases} \text{encode}(s) + \text{warning} & \text{if } \text{HasTrailingSpace}(s) \\ \text{encode}(s) & \text{otherwise} \end{cases}
```

### 10.5 Challenge: Multilingual Support

**Problem:** Non-English languages tokenize inefficiently.

**Solution:** UTF-8 byte-level encoding handles all languages:

```math
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
```

**Efficiency:**

```math
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-english}})|}{|\text{Tokenize}(s_{\text{english}})|} \approx 1.5\text{-}2.0
```

### 10.6 Challenge: Untrained Tokens

**Problem:** Some tokens never appear in training data.
**Solution:** Fallback handling. Byte tokens ($t < 256$) always decode to their raw byte, so only out-of-vocabulary IDs need a fallback:

```math
\text{Decode}(t) = \begin{cases} \text{vocab}(t) & \text{if } t \in V \\ \text{<unk>} & \text{if } t \notin V \end{cases}
```

---

## Summary

### Key Formulas

**Encoding:**

```math
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
```

**Decoding:**

```math
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
```

**BPE Merge:**

```math
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
```

**Vocabulary Size:**

```math
|V| = 256 + K + |V_{\text{special}}|
```

### Key Takeaways

1. **UTF-8 Encoding**: Handles all Unicode characters consistently
2. **BPE Algorithm**: Creates efficient vocabulary through iterative merging
3. **Regex Splitting**: Prevents over-merging with pattern-based chunking
4. **Special Tokens**: Control flow and handle edge cases
5. **Byte-Level**: Works at byte level for universal language support
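The training and encode/decode formulas above can be sketched end to end in Python. This is a minimal illustration, not the SheepOp implementation: all function names are ours, ties between equally frequent pairs are broken by first occurrence, and regex splitting and special tokens are omitted.

```python
from collections import Counter

def count_pairs(tokens):
    """stats(i, j) = number of adjacent occurrences of the pair (i, j)."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every adjacent occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2  # both members of the pair are consumed
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn merge rules; iteration k creates token id 256 + k - 1."""
    tokens = list(text.encode("utf-8"))  # V^(0) = {0..255}, tokens^(0) = bytes(text)
    merges = {}                          # (i, j) -> merged id
    for k in range(num_merges):
        stats = count_pairs(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most frequent pair
        merges[pair] = 256 + k
        tokens = merge(tokens, pair, 256 + k)
    return merges

def encode(text, merges):
    tokens = list(text.encode("utf-8"))
    while True:
        # earliest-learned applicable merge wins (the arg min over merge_index)
        candidates = [p for p in count_pairs(tokens) if p in merges]
        if not candidates:
            return tokens
        pair = min(candidates, key=merges.get)
        tokens = merge(tokens, pair, merges[pair])

def decode(tokens, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[a] + vocab[b]  # vocab(id) = vocab(i) ⊕ vocab(j)
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

merges = train_bpe("aaab" * 10, num_merges=2)
ids = encode("aaab", merges)
assert decode(ids, merges) == "aaab"  # lossless round trip
```

On the Section 3.4 corpus this reproduces the worked example: the merges `(97, 97) → 256` and `(256, 97) → 257` are learned, and "aaab" encodes to `[257, 98]`.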