Tokenization Explained - Mathematical Formulation
Complete mathematical derivation and step-by-step explanation of tokenization, Byte Pair Encoding (BPE), and UTF-8 encoding used in the SheepOp Language Model.
Table of Contents
- Introduction to Tokenization
- UTF-8 Encoding
- Byte Pair Encoding Algorithm
- Vocabulary Construction
- Encoding Process
- Decoding Process
- Regex Pattern Splitting
- Special Tokens
- Complete Tokenization Pipeline
- Tokenization Challenges and Solutions
1. Introduction to Tokenization
1.1 What is Tokenization?
Tokenization is the process of converting raw text into a sequence of discrete tokens (integers) that can be processed by neural networks.
Mathematical Definition:
Given a text string s \in \Sigma^* where \Sigma is the character alphabet, tokenization maps:
\mathcal{T}: \Sigma^* \rightarrow \mathbb{N}^*
s = c_1 c_2 \ldots c_n \mapsto \mathbf{t} = [t_1, t_2, \ldots, t_m]
where:
- s = input text string
- c_i = individual characters
- \mathbf{t} = sequence of token IDs
- n = number of characters
- m = number of tokens (typically m \leq n)
1.2 Why Tokenization?
Problem: Neural networks require numerical inputs, not raw text.
Solution: Convert text to token sequences:
graph LR
A["Raw Text<br/>'Hello world'"] --> B[Tokenizer]
B --> C["Token IDs<br/>[15496, 1917]"]
C --> D[Embedding Layer]
D --> E["Embeddings<br/>ℝ^n×d"]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e1ffe1
style E fill:#ffe1f5
1.3 Tokenization Approaches
Three Main Approaches:
1. Character-Level: Each character becomes a token
   - Vocabulary size: ~100-200
   - Sequence length: Very long
2. Word-Level: Each word becomes a token
   - Vocabulary size: ~50,000-100,000
   - Sequence length: Moderate
3. Subword-Level (BPE): Sequences of bytes/characters become tokens
   - Vocabulary size: ~30,000-100,000
   - Sequence length: Efficient
graph TB
subgraph "Character-Level"
C1["'Hello'"] --> C2["['H','e','l','l','o']"]
C2 --> C3["5 tokens"]
end
subgraph "Word-Level"
W1["'Hello world'"] --> W2["['Hello', 'world']"]
W2 --> W3["2 tokens"]
end
subgraph "Subword-Level (BPE)"
B1["'Hello world'"] --> B2["['Hello', ' world']"]
B2 --> B3["2 tokens"]
end
style C3 fill:#ffcccc
style W3 fill:#ccffcc
style B3 fill:#ccffcc
2. UTF-8 Encoding
2.1 Unicode Code Points
Unicode defines a mapping from characters to integers (code points):
f: \mathcal{C} \rightarrow \{0, 1, \ldots, 0x10FFFF\}
where \mathcal{C} is the set of all Unicode characters.
Example:
f('H') = 72, \quad f('ε') = 949, \quad f('中') = 20013
2.2 UTF-8 Encoding Function
UTF-8 encodes Unicode code points into variable-length byte sequences:
\text{UTF-8}: \{0, \ldots, 0x10FFFF\} \rightarrow \{0, \ldots, 255\}^*
Encoding Rules:
For a code point c:
\text{UTF-8}(c) = \begin{cases}
[b_0] & \text{if } c < 128 \\
[b_0, b_1] & \text{if } c < 2048 \\
[b_0, b_1, b_2] & \text{if } c < 65536 \\
[b_0, b_1, b_2, b_3] & \text{if } c \leq 0x10FFFF
\end{cases}
where b_i \in \{0, \ldots, 255\} are bytes.
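The four byte-length cases can be checked directly with Python's built-in str.encode (a quick sanity check, not part of the SheepOp code; the helper name is illustrative):

```python
# Number of UTF-8 bytes per code point, matching the four cases above.
def utf8_byte_length(code_point: int) -> int:
    return len(chr(code_point).encode("utf-8"))

# 1-byte ASCII, 2-byte Greek, 3-byte CJK, 4-byte emoji
lengths = [utf8_byte_length(cp) for cp in (72, 949, 20013, 0x1F600)]
# lengths -> [1, 2, 3, 4]
```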
2.3 UTF-8 Encoding Process
graph TB
A["Unicode String<br/>'Hello 世界'"] --> B[Extract Code Points]
B --> C["[72, 101, 108, 108, 111, 32, 19990, 30028]"]
C --> D[UTF-8 Encode]
D --> E["Bytes<br/>[72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]"]
style A fill:#e1f5ff
style E fill:#e1ffe1
Mathematical Formulation:
For a string s = c_1 c_2 \ldots c_n:
\text{bytes}(s) = \bigoplus_{i=1}^n \text{UTF-8}(f(c_i))
where \bigoplus denotes byte concatenation.
Example:
\text{bytes}("Hi") = \text{UTF-8}(72) \oplus \text{UTF-8}(105) = [72] \oplus [105] = [72, 105]
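The "Hello 世界" example from the diagram above can be reproduced with two lines of standard-library Python:

```python
# bytes(s) = concatenation of UTF-8(f(c_i)) over all characters c_i
text = "Hello 世界"
code_points = [ord(c) for c in text]      # f(c_i): Unicode code points
utf8_bytes = list(text.encode("utf-8"))   # variable-length byte sequence
# code_points -> [72, 101, 108, 108, 111, 32, 19990, 30028]
# utf8_bytes  -> [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
```

Note that the two CJK characters contribute one code point each but three bytes each, which is why m code points can expand to more than m bytes.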
3. Byte Pair Encoding Algorithm
3.1 BPE Overview
Byte Pair Encoding (BPE) is a data-compression algorithm, adapted for tokenization, that iteratively merges the most frequent pair of adjacent tokens into a new token.
Goal: Create an efficient vocabulary by merging frequent byte/character pairs.
3.2 BPE Training Algorithm
Initial State:
V^{(0)} = \{0, 1, \ldots, 255\} \quad \text{(all bytes)}
\text{tokens}^{(0)} = \text{bytes}(s) = [b_1, b_2, \ldots, b_n]
Iterative Merging:
For iteration k = 1, 2, \ldots, K:
Step 1: Calculate Pair Frequencies
\text{stats}^{(k)} = \text{CountPairs}(\text{tokens}^{(k-1)})
\text{stats}^{(k)}(i, j) = |\{p : \text{tokens}^{(k-1)}_p = i \land \text{tokens}^{(k-1)}_{p+1} = j\}|
Step 2: Find Most Frequent Pair
(i^*, j^*) = \arg\max_{(i,j)} \text{stats}^{(k)}(i, j)
Step 3: Create New Token
V^{(k)} = V^{(k-1)} \cup \{256 + k - 1\}
\text{merges}^{(k)} = \text{merges}^{(k-1)} \cup \{(i^*, j^*) \mapsto 256 + k - 1\}
Step 4: Apply Merge
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, (i^*, j^*), 256 + k - 1)
where:
\text{Merge}(T, (a, b), \text{new\_id})_i = \begin{cases}
\text{new\_id} & \text{if } T_i = a \land T_{i+1} = b \\
T_i & \text{otherwise}
\end{cases}
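CountPairs and Merge translate directly into Python (a sketch; the function names are illustrative, not necessarily those in the SheepOp source):

```python
def get_stats(tokens):
    """CountPairs: frequency of each consecutive token pair."""
    stats = {}
    for a, b in zip(tokens, tokens[1:]):
        stats[(a, b)] = stats.get((a, b), 0) + 1
    return stats

def merge(tokens, pair, new_id):
    """Merge: replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2          # consume both halves of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The `i += 2` step makes the scan left-to-right and non-overlapping, which is the behavior the piecewise Merge definition above intends.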
3.3 BPE Training Flowchart
graph TB
Start[Start Training] --> Init["Initialize<br/>V⁽⁰⁾ = {0...255}<br/>tokens⁽⁰⁾ = bytes(text)"]
Init --> Iter{More Merges?}
Iter -->|Yes| Stats["Calculate Pair Frequencies<br/>stats⁽ᵏ⁾(i,j)"]
Stats --> Find["Find Most Frequent Pair<br/>(i*, j*) = argmax stats"]
Find --> Create["Create New Token<br/>id = 256 + k - 1"]
Create --> Merge["Merge All Occurrences<br/>tokens⁽ᵏ⁾ = Merge(tokens⁽ᵏ⁻¹⁾, (i*,j*), id)"]
Merge --> Update["Update Vocabulary<br/>V⁽ᵏ⁾ = V⁽ᵏ⁻¹⁾ ∪ {id}"]
Update --> Iter
Iter -->|No| End[Training Complete]
style Start fill:#e1f5ff
style End fill:#e1ffe1
style Stats fill:#fff4e1
style Merge fill:#ffe1f5
3.4 BPE Example
Example: Training on text "aaab"
Iteration 0:
V^{(0)} = \{0, 1, \ldots, 255\}
\text{tokens}^{(0)} = [97, 97, 97, 98] \quad \text{(bytes for 'aaab')}
Calculate frequencies:
\text{stats}^{(1)} = \{(97, 97): 2, (97, 98): 1\}
Iteration 1:
(i^*, j^*) = (97, 97), \quad \text{new\_id} = 256
\text{tokens}^{(1)} = [256, 97, 98]
V^{(1)} = V^{(0)} \cup \{256\}
Iteration 2:
\text{stats}^{(2)} = \{(256, 97): 1, (97, 98): 1\}
Both pairs tie at frequency 1; choose one:
(i^*, j^*) = (256, 97), \quad \text{new\_id} = 257
\text{tokens}^{(2)} = [257, 98]
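The "aaab" walkthrough can be reproduced with a minimal training loop (a sketch with the helper logic inlined; ties are broken by first occurrence here, matching the "choose one" step above):

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Minimal BPE training loop over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for k in range(num_merges):
        stats = Counter(zip(tokens, tokens[1:]))  # CountPairs
        if not stats:
            break
        pair = max(stats, key=stats.get)          # (i*, j*) = argmax stats
        new_id = 256 + k                          # new ids: 256, 257, ...
        merges[pair] = new_id
        out, i = [], 0                            # Merge(tokens, pair, new_id)
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges
```

Running `train_bpe("aaab", 2)` reproduces the two iterations above: tokens `[257, 98]` and merges `{(97, 97): 256, (256, 97): 257}`.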
4. Vocabulary Construction
4.1 Vocabulary Structure
Final Vocabulary:
V = V_{\text{bytes}} \cup V_{\text{merges}} \cup V_{\text{special}}
where:
- V_{\text{bytes}} = \{0, 1, \ldots, 255\} (256 byte tokens)
- V_{\text{merges}} = \{256, 257, \ldots, 256 + K - 1\} (K merged tokens)
- V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
Vocabulary Size:
|V| = 256 + K + |V_{\text{special}}|
4.2 Merge Dictionary
Merge Dictionary:
M: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{N}
M(i, j) = \begin{cases}
\text{merged\_id} & \text{if } (i, j) \text{ was merged} \\
\text{undefined} & \text{otherwise}
\end{cases}
Vocabulary Mapping:
\text{vocab}: \mathbb{N} \rightarrow \{0, \ldots, 255\}^*
\text{vocab}(\text{id}) = \begin{cases}
[\text{id}] & \text{if } \text{id} < 256 \\
\text{vocab}(i) \oplus \text{vocab}(j) & \text{if } M(i, j) = \text{id}
\end{cases}
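The recursive vocab(id) definition unrolls into a simple loop, provided the merges are visited in creation order so that both children already exist (a sketch; names are illustrative):

```python
def build_vocab(merges):
    """vocab(id): map each token id to its byte sequence via the recursion above."""
    vocab = {i: bytes([i]) for i in range(256)}   # base case: id < 256
    # merges must be iterated in creation order (dicts preserve insertion order)
    for (a, b), idx in merges.items():
        vocab[idx] = vocab[a] + vocab[b]          # vocab(i) ⊕ vocab(j)
    return vocab
```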
4.3 Vocabulary Construction Diagram
graph TB
subgraph "Initial Vocabulary"
A1["256 Byte Tokens<br/>{0, 1, ..., 255}"]
end
subgraph "BPE Merges"
B1["Merge 1: (101, 32) → 256"]
B2["Merge 2: (256, 108) → 257"]
B3["Merge K: ... → 256+K-1"]
end
subgraph "Special Tokens"
C1["<pad> → 0"]
C2["<unk> → 1"]
C3["<bos> → 2"]
C4["<eos> → 3"]
end
A1 --> D[Final Vocabulary]
B1 --> D
B2 --> D
B3 --> D
C1 --> D
C2 --> D
C3 --> D
C4 --> D
D --> E["|V| = 256 + K + 4"]
style A1 fill:#e1f5ff
style D fill:#e1ffe1
style E fill:#fff4e1
5. Encoding Process
5.1 Encoding Algorithm
Input: Text string s
Step 1: Regex Splitting
\text{chunks} = \text{RegexSplit}(s, \text{pattern})
Step 2: Convert to Bytes
For each chunk c \in \text{chunks}:
\text{bytes}(c) = \text{UTF-8}(c) = [b_1, b_2, \ldots, b_n]
Step 3: Apply BPE Merges
Initialize: \text{tokens} = \text{bytes}(c)
While merges possible:
\text{Find earliest merge: } (i^*, j^*) = \arg\min_{\substack{(i,j) \in \text{pairs}(\text{tokens}) \\ (i,j) \in \text{dom}(M)}} \text{merge\_index}(i, j)
\text{Apply merge: } \text{tokens} = \text{Merge}(\text{tokens}, (i^*, j^*), M(i^*, j^*))
Step 4: Combine Results
\text{token\_ids} = \bigoplus_{c \in \text{chunks}} \text{BPE}(\text{bytes}(c))
5.2 Encoding Function
Mathematical Definition:
\text{encode}(s) = \bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))
where \bigoplus denotes token sequence concatenation.
5.3 Encoding Flowchart
graph TB
A["Input Text<br/>'Hello world'"] --> B[Regex Split]
B --> C["Chunks<br/>['Hello', ' world']"]
C --> D[For Each Chunk]
D --> E["UTF-8 Encode<br/>bytes = [72, 101, 108, 108, 111]"]
E --> F["Initialize Tokens<br/>tokens = bytes"]
F --> G{Merge Possible?}
G -->|Yes| H["Find Earliest Merge<br/>(i*, j*)"]
H --> I["Apply Merge<br/>tokens = Merge(tokens, (i*,j*), id)"]
I --> G
G -->|No| J[Add to Result]
J --> K{More Chunks?}
K -->|Yes| D
K -->|No| L["Final Token IDs<br/>[15496, 1917]"]
style A fill:#e1f5ff
style L fill:#e1ffe1
style G fill:#ffe1f5
5.4 Encoding Example
Input: "Hello world"
Step 1: Regex Split
\text{chunks} = ["Hello", " world"]
Step 2: UTF-8 Encoding
\text{UTF-8}("Hello") = [72, 101, 108, 108, 111]
\text{UTF-8}(" world") = [32, 119, 111, 114, 108, 100]
Step 3: BPE Merging
Assume merges: M(72, 101) = 256, M(256, 108) = 257, M(257, 108) = 258, M(258, 111) = 259
[72, 101, 108, 108, 111] \xrightarrow{M(72,101)} [256, 108, 108, 111]
\xrightarrow{M(256,108)} [257, 108, 111] \xrightarrow{M(257,108)} [258, 111]
\xrightarrow{M(258,111)} [259]
Final: [259, 1917] (assuming the merges for " world" produce token 1917)
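Under the merge table assumed above, the Step 3 loop looks like this in Python (a sketch for a single regex chunk; the function name is illustrative, and regex splitting is handled outside it):

```python
def bpe_encode(text, merges):
    """Apply merges to one chunk's bytes, earliest-created merge first."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) > 1:
        pairs = set(zip(tokens, tokens[1:]))
        # pick the mergeable pair with the lowest merge id (earliest rank)
        pair = min((p for p in pairs if p in merges),
                   key=merges.get, default=None)
        if pair is None:
            break                      # no merge possible -> done
        new_id = merges[pair]
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

With the assumed merges, `bpe_encode("Hello", merges)` collapses [72, 101, 108, 108, 111] down to [259] exactly as in the derivation above.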
6. Decoding Process
6.1 Decoding Algorithm
Input: Token IDs \mathbf{t} = [t_1, t_2, \ldots, t_n]
Step 1: Handle Special Tokens
\text{Stop at EOS: } \mathbf{t}' = \mathbf{t}[:i] \text{ where } t_i = \text{<eos>}
Step 2: Lookup Bytes
For each token t_i:
\text{bytes}_i = \text{vocab}(t_i)
Step 3: Concatenate Bytes
\text{all\_bytes} = \bigoplus_{i=1}^n \text{bytes}_i
Step 4: UTF-8 Decode
\text{text} = \text{UTF-8}^{-1}(\text{all\_bytes})
6.2 Decoding Function
Mathematical Definition:
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
6.3 Decoding Flowchart
graph TB
A["Token IDs<br/>[15496, 1917]"] --> B{Special Tokens?}
B -->|EOS Found| C[Stop at EOS]
B -->|No EOS| D[Process All Tokens]
C --> D
D --> E[For Each Token ID]
E --> F["Lookup Bytes<br/>vocab[t_i]"]
F --> G["Concatenate<br/>all_bytes = bytes₁ ⊕ bytes₂ ⊕ ..."]
G --> H["UTF-8 Decode<br/>text = decode(all_bytes)"]
H --> I["Output Text<br/>'Hello world'"]
style A fill:#e1f5ff
style I fill:#e1ffe1
style F fill:#fff4e1
6.4 Decoding Example
Input: [259, 1917]
Step 1: Lookup
\text{vocab}(259) = \text{vocab}(258) \oplus \text{vocab}(111)
= (\text{vocab}(257) \oplus \text{vocab}(108)) \oplus [111]
= ((\text{vocab}(256) \oplus \text{vocab}(108)) \oplus [108]) \oplus [111]
= (((\text{vocab}(72) \oplus \text{vocab}(101)) \oplus [108]) \oplus [108]) \oplus [111]
= ([72] \oplus [101]) \oplus [108] \oplus [108] \oplus [111]
= [72, 101, 108, 108, 111]
\text{vocab}(1917) = [32, 119, 111, 114, 108, 100]
Step 2: Concatenate
\text{all\_bytes} = [72, 101, 108, 108, 111] \oplus [32, 119, 111, 114, 108, 100]
= [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
Step 3: Decode
\text{decode}([72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]) = "Hello world"
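Decoding is the simple direction: one vocabulary lookup per token, a byte concatenation, and one UTF-8 decode (a sketch; `errors="replace"` is an assumed safety choice for token sequences that do not form valid UTF-8):

```python
def bpe_decode(token_ids, vocab):
    """decode(t) = UTF-8^{-1} of the concatenated per-token byte sequences."""
    data = b"".join(vocab[t] for t in token_ids)
    # "replace" substitutes U+FFFD for byte runs that are not valid UTF-8
    return data.decode("utf-8", errors="replace")
```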
7. Regex Pattern Splitting
7.1 GPT-4 Style Pattern
Pattern Components:
P = P_{\text{contractions}} \cup P_{\text{letters}} \cup P_{\text{numbers}} \cup P_{\text{punctuation}} \cup P_{\text{whitespace}}
Pattern Definition:
`'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++|\r?\n|\s+`
(alternatives are tried left to right)
Components:
1. Contractions: P_{\text{contractions}} = \text{'(?i:[sdmt]|ll|ve|re)}
   - Matches 's, 'd, 'm, 't, 'll, 've, 're (case-insensitive)
2. Letters: P_{\text{letters}} = \text{[^\r\n\p{L}\p{N}]?+\p{L}+}
   - Optional leading non-letter/non-digit character (e.g. a space) + one or more letters
3. Numbers: P_{\text{numbers}} = \text{\p{N}{1,3}}
   - Limit: 1-3 digits only (prevents long number tokens)
4. Punctuation: P_{\text{punctuation}} = \text{ ?[^\s\p{L}\p{N}]++}
   - Optional space + one or more punctuation characters
5. Whitespace: P_{\text{whitespace}} = \text{\r?\n} \cup \text{\s+}
   - Newlines and runs of spaces
7.2 Regex Splitting Function
\text{RegexSplit}(s, P) = \{m_1, m_2, \ldots, m_k : m_i \in \text{Match}(s, P)\}
where matches are found left-to-right, non-overlapping.
7.3 Regex Splitting Diagram
graph TB
A["Input Text<br/>'Hello world 123'"] --> B[Apply Regex Pattern]
B --> C1["Chunk 1: 'Hello'<br/>P_letters"]
B --> C2["Chunk 2: ' world'<br/>P_whitespace + P_letters"]
B --> C3["Chunk 3: ' 123'<br/>P_whitespace + P_numbers"]
C1 --> D["Chunks List<br/>['Hello', ' world', ' 123']"]
C2 --> D
C3 --> D
style A fill:#e1f5ff
style D fill:#e1ffe1
style B fill:#fff4e1
Example:
\text{RegexSplit}("Hello world 123", P) = ["Hello", " world", " 123"]
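Python's built-in re module lacks the \p{L}/\p{N} classes and possessive quantifiers (?+, ++) used above, so the following is a simplified ASCII-only approximation of the pattern (the full pattern needs the third-party regex module):

```python
import re

# Simplified ASCII approximation of the GPT-4-style split pattern.
PATTERN = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"          # contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # optional prefix char + letters
    r"|[0-9]{1,3}"                    # 1-3 digit number groups
    r"| ?[^\sA-Za-z0-9]+"             # optional space + punctuation
    r"|\r?\n"                         # newlines
    r"|\s+"                           # whitespace runs
)

chunks = PATTERN.findall("Hello world 123")
# chunks -> ['Hello', ' world', ' ', '123']
```

Note the standalone `' '` chunk before `'123'`: this simplified pattern splits the space from the digits, while the diagram above groups them as one chunk; the exact grouping depends on the whitespace-handling variant of the pattern.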
8. Special Tokens
8.1 Special Token Set
Special Tokens:
V_{\text{special}} = \{\text{<pad>}, \text{<unk>}, \text{<bos>}, \text{<eos>}\}
Token IDs:
\text{id}(\text{<pad>}) = 0, \quad \text{id}(\text{<unk>}) = 1, \quad \text{id}(\text{<bos>}) = 2, \quad \text{id}(\text{<eos>}) = 3
8.2 Special Token Functions
Padding:
\text{pad}(\mathbf{t}, \text{max\_length}) = \mathbf{t} \oplus [\text{<pad>}]^{\max(\text{max\_length} - |\mathbf{t}|, 0)}
Unknown Token:
\text{encode}(c) = \begin{cases}
[\text{id}(c)] & \text{if } c \in V \\
[\text{id}(\text{<unk>})] & \text{if } c \notin V
\end{cases}
(With byte-level BPE, every byte is already in V, so \text{<unk>} is rarely triggered in practice.)
EOS Handling:
\text{decode}(\mathbf{t}) = \text{decode}(\mathbf{t}[:i]) \text{ where } t_i = \text{<eos>}
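Both operations translate to a few lines of Python, using the ids from 8.1 as defaults (a sketch; function names are illustrative):

```python
def pad(ids, max_length, pad_id=0):
    """pad(t, max_length): right-pad with <pad> up to max_length."""
    return ids + [pad_id] * max(max_length - len(ids), 0)

def truncate_at_eos(ids, eos_id=3):
    """Drop everything from the first <eos> onward before decoding."""
    return ids[: ids.index(eos_id)] if eos_id in ids else ids
```

The max(..., 0) mirrors the exponent in the padding formula: sequences already at or beyond max_length are returned unchanged rather than truncated.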
8.3 Special Token Flowchart
graph TB
subgraph "Special Tokens"
A1["<pad> → 0<br/>Padding"]
A2["<unk> → 1<br/>Unknown"]
A3["<bos> → 2<br/>Beginning"]
A4["<eos> → 3<br/>End"]
end
subgraph "Usage"
B1["Padding:<br/>[1,2,3] → [1,2,3,0,0]"]
B2["Unknown:<br/>unknown_char → 1"]
B3["EOS Stop:<br/>[1,2,3] → stop at 3"]
end
A1 --> B1
A2 --> B2
A4 --> B3
style A1 fill:#e1f5ff
style A2 fill:#ffe1f5
style A3 fill:#e1ffe1
style A4 fill:#fff4e1
9. Complete Tokenization Pipeline
9.1 Full Pipeline
Complete Tokenization Process:
graph TB
Start[Input Text] --> Split[Regex Split]
Split --> UTF8[UTF-8 Encode]
UTF8 --> BPE[BPE Merge]
BPE --> Special{Special Tokens?}
Special -->|Yes| Handle[Handle Special]
Special -->|No| Combine[Combine Tokens]
Handle --> Combine
Combine --> Output[Token IDs]
Output --> Embed[Embedding Layer]
style Start fill:#e1f5ff
style Output fill:#e1ffe1
style Embed fill:#ffe1f5
9.2 Mathematical Formulation
Complete Encoding:
\text{Tokenize}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
Complete Decoding:
\text{Detokenize}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(\text{RemoveSpecial}(t_i))\right)
9.3 Pipeline Example
Input: "Hello world!"
Step 1: Regex Split
\text{chunks} = ["Hello", " world", "!"]
Step 2: UTF-8 Encode
\text{bytes} = [[72,101,108,108,111], [32,119,111,114,108,100], [33]]
Step 3: BPE Merge
\text{tokens} = [[15496], [1917], [33]]
Step 4: Combine
\text{final} = [15496, 1917, 33]
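The whole pipeline fits in one small class combining the pieces from sections 3-6 (a self-contained toy, not the SheepOp implementation: regex splitting and special tokens are omitted for brevity, and all names are illustrative):

```python
from collections import Counter

class ToyBPE:
    """End-to-end sketch: train -> encode -> decode."""

    def __init__(self):
        self.merges = {}                                  # (i, j) -> new_id
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> bytes

    @staticmethod
    def _merge(tokens, pair, new_id):
        # Replace every non-overlapping occurrence of `pair` with `new_id`
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def train(self, text, num_merges):
        tokens = list(text.encode("utf-8"))
        for k in range(num_merges):
            stats = Counter(zip(tokens, tokens[1:]))
            if not stats:
                break
            pair = max(stats, key=stats.get)              # most frequent pair
            new_id = 256 + k
            self.merges[pair] = new_id
            self.vocab[new_id] = self.vocab[pair[0]] + self.vocab[pair[1]]
            tokens = self._merge(tokens, pair, new_id)

    def encode(self, text):
        tokens = list(text.encode("utf-8"))
        while len(tokens) > 1:
            pairs = set(zip(tokens, tokens[1:]))
            # apply the earliest-created applicable merge first
            pair = min((p for p in pairs if p in self.merges),
                       key=self.merges.get, default=None)
            if pair is None:
                break
            tokens = self._merge(tokens, pair, self.merges[pair])
        return tokens

    def decode(self, token_ids):
        return b"".join(self.vocab[t] for t in token_ids).decode("utf-8")
```

Because encoding replays the training merges in creation order, decode(encode(s)) round-trips any string built from bytes seen during training.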
10. Tokenization Challenges and Solutions
10.1 Challenge: Long Tokens
Problem: Some tokens become very long (e.g., "defaultstyle" as single token).
Mathematical Impact: over-merging can collapse a long string into a single token:
\text{LongToken}(s) = \begin{cases}
1 \text{ token} & \text{if } s \text{ was fully merged during training} \\
\text{multiple tokens} & \text{otherwise}
\end{cases}
Solution: Better regex splitting prevents over-merging.
10.2 Challenge: Number Tokenization
Problem: Numbers tokenized arbitrarily (sometimes 1 token, sometimes 2-3).
Solution: Limit number merging to 1-3 digits:
P_{\text{numbers}} = \text{\p{N}{1,3}}
Impact:
\text{TokenCount}(n) = \begin{cases}
1 & \text{if } n < 1000 \\
2-3 & \text{if } n \geq 1000
\end{cases}
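The effect of the \p{N}{1,3} limit is easy to see with a plain regex (ASCII digits here for portability with Python's built-in re):

```python
import re

# A number is split into groups of at most three digits before BPE runs,
# so long numbers can never collapse into one opaque token.
groups = re.findall(r"\d{1,3}", "1234567")
# groups -> ['123', '456', '7']
```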
10.3 Challenge: Python Code Efficiency
Problem: Each space is separate token (GPT-2 issue).
Solution: Merge multiple spaces:
P_{\text{whitespace}} = \text{\s+} \quad \text{(matches multiple spaces)}
Efficiency Gain:
\text{TokensBefore} = |\text{spaces}| \quad \text{(one token per space)}
\text{TokensAfter} = \lceil |\text{spaces}| / 4 \rceil \quad \text{(grouped spaces)}
10.4 Challenge: Trailing Whitespace
Problem: Trailing spaces cause poor tokenization.
Detection:
\text{HasTrailingSpace}(s) = \begin{cases}
\text{True} & \text{if } s[-1] = ' ' \\
\text{False} & \text{otherwise}
\end{cases}
Warning:
\text{Tokenize}(s) = \begin{cases}
\text{encode}(s) + \text{warning} & \text{if } \text{HasTrailingSpace}(s) \\
\text{encode}(s) & \text{otherwise}
\end{cases}
10.5 Challenge: Multilingual Support
Problem: Non-English languages tokenize inefficiently.
Solution: UTF-8 byte-level encoding handles all languages:
\text{Tokenize}(s) = \text{BPE}(\text{UTF-8}(s)) \quad \forall s \in \Sigma^*
Efficiency:
\text{TokenRatio} = \frac{|\text{Tokenize}(s_{\text{non-english}})|}{|\text{Tokenize}(s_{\text{english}})|} \approx 1.5-2.0
10.6 Challenge: Untrained Tokens
Problem: Some tokens never appear in training data.
Solution: Fallback handling for out-of-range ids:
\text{Decode}(t) = \begin{cases}
\text{vocab}(t) & \text{if } t \in V \\
\text{<unk>} & \text{if } t \notin V
\end{cases}
(Every id below 256 is in V and decodes to its raw byte, so only ids beyond the vocabulary need the fallback.)
Summary
Key Formulas
Encoding:
\text{encode}(s) = \text{HandleSpecial}\left(\bigoplus_{c \in \text{RegexSplit}(s)} \text{BPE}(\text{UTF-8}(c))\right)
Decoding:
\text{decode}(\mathbf{t}) = \text{UTF-8}^{-1}\left(\bigoplus_{i=1}^n \text{vocab}(t_i)\right)
BPE Merge:
\text{tokens}^{(k)} = \text{Merge}(\text{tokens}^{(k-1)}, \arg\max_{(i,j)} \text{stats}^{(k)}(i,j), 256+k-1)
Vocabulary Size:
|V| = 256 + K + |V_{\text{special}}|
Key Takeaways
- UTF-8 Encoding: Handles all Unicode characters consistently
- BPE Algorithm: Creates efficient vocabulary through iterative merging
- Regex Splitting: Prevents over-merging with pattern-based chunking
- Special Tokens: Control flow and handle edge cases
- Byte-Level: Works at byte level for universal language support