- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
5.2 KiB
5.2 KiB
Tokenizer Improvements
Overview
Based on the tokenization challenges discussed in the transcript, we've implemented an improved BPE (Byte Pair Encoding) tokenizer that addresses common issues with tokenization in large language models.
Key Improvements
1. UTF-8 Byte-Level Encoding
- What: Tokenizer works at the byte level (0-255) instead of character level
- Why: Handles all Unicode characters consistently, regardless of language
- Benefit: Better support for non-English languages and special characters
2. GPT-4 Style Regex Pattern
- What: Improved regex pattern for splitting text into chunks
- Improvements:
- Case-insensitive matching for contractions (
's,'t,'ll, etc.) - Better whitespace handling (groups multiple spaces for Python code)
- Limits number merging to 1-3 digits (prevents overly long number tokens)
- Case-insensitive matching for contractions (
- Benefit: More consistent tokenization, especially for code and numbers
3. BPE Training Algorithm
- What: Full Byte Pair Encoding implementation
- Features:
get_stats(): Counts consecutive token pairsmerge(): Replaces frequent pairs with single tokens- Iterative training process
- Benefit: Creates efficient vocabulary that compresses common sequences
4. Special Token Handling
- What: Proper handling of special tokens (
<pad>,<unk>,<bos>,<eos>) - Features:
- Special tokens are excluded from BPE merging
- EOS token stops decoding
- Configurable special token set
- Benefit: Clean separation between data tokens and control tokens
5. Trailing Whitespace Detection
- What: Warns when text ends with trailing whitespace
- Why: Trailing spaces can cause poor tokenization (as seen in GPT-3.5 playground warnings)
- Benefit: Helps developers avoid tokenization issues
6. Better Python Code Handling
- What: Improved whitespace merging for indentation
- Why: GPT-2 had issues with Python code because each space was a separate token
- Benefit: More efficient tokenization of Python code (fewer tokens per file)
7. Number Tokenization Limits
- What: Limits number merging to 1-3 digits
- Why: Prevents creating tokens for very long number sequences
- Benefit: Better arithmetic performance (numbers are more consistently tokenized)
Usage
Basic Usage
from data.example import SimpleTokenizer
# Create tokenizer (uses BPE by default)
tokenizer = SimpleTokenizer(use_bpe=True, vocab_size=50257)
# Encode text
tokens = tokenizer.encode("Hello world!")
print(tokens) # [15496, 1917, 0]
# Decode tokens
text = tokenizer.decode(tokens)
print(text) # "Hello world!"
Training a Custom Tokenizer
from data.example import BPETokenizer
# Create tokenizer
tokenizer = BPETokenizer(vocab_size=50257)
# Train on your corpus
texts = [
"Your training text here...",
"More training text...",
]
tokenizer.train(texts, num_merges=50000, verbose=True)
# Save trained tokenizer
tokenizer.save("merges.json", "vocab.json")
Loading a Pre-trained Tokenizer
from data.example import BPETokenizer
# Load saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("merges.json", "vocab.json")
# Use it
tokens = tokenizer.encode("Hello world!")
Addressing Common Issues
Issue: "Can't spell words well"
- Cause: Long tokens (like "defaultstyle" as single token)
- Fix: Better regex splitting prevents over-merging
Issue: "Bad at arithmetic"
- Cause: Arbitrary number tokenization (sometimes 1 token, sometimes 2-3)
- Fix: Limits number merging to 1-3 digits for consistency
Issue: "Python code inefficient"
- Cause: Each space is separate token (GPT-2 issue)
- Fix: Multiple spaces merge into single tokens
Issue: "Non-English languages worse"
- Cause: Tokenizer trained primarily on English
- Fix: UTF-8 byte-level encoding handles all languages consistently
Issue: "Trailing whitespace warning"
- Cause: Models see very few examples of trailing spaces
- Fix: Warning helps developers detect and fix the issue
Issue: "Solid gold Magikarp" (untrained tokens)
- Cause: Tokenizer creates tokens for strings not in training data
- Fix: Proper validation and fallback handling for unknown tokens
Backward Compatibility
The SimpleTokenizer class maintains backward compatibility:
- If
use_bpe=False, uses character-level tokenization (old behavior) - If
use_bpe=True(default), uses new BPE tokenizer - All existing code continues to work without changes
Technical Details
BPE Algorithm
- Start with byte-level vocabulary (256 tokens)
- Count consecutive token pairs in training data
- Find most frequent pair
- Merge pair into new token
- Repeat until target vocabulary size reached
Encoding Process
- Split text using regex pattern
- Convert each chunk to UTF-8 bytes
- Apply BPE merges (greedy, left-to-right)
- Return token IDs
Decoding Process
- Look up token IDs in vocabulary
- Convert bytes back to UTF-8
- Handle special tokens (EOS stops decoding)
- Return decoded text
References
Based on improvements discussed in:
- GPT-2 paper tokenization section
- GPT-4 tokenizer improvements
- Common tokenization challenges and solutions