# Tokenizer Improvements
## Overview

Based on the tokenization challenges discussed in the transcript, we've implemented an improved BPE (Byte Pair Encoding) tokenizer that addresses common issues with tokenization in large language models.
## Key Improvements
### 1. UTF-8 Byte-Level Encoding

- **What**: The tokenizer works at the byte level (0-255) instead of the character level
- **Why**: Handles all Unicode characters consistently, regardless of language
- **Benefit**: Better support for non-English languages and special characters
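The idea is easy to see in plain Python: `str.encode("utf-8")` already yields the byte-level base vocabulary.

```python
# UTF-8 gives every string a byte representation in 0-255, so the base
# vocabulary covers all languages with just 256 entries.
text = "héllo 世界"
ids = list(text.encode("utf-8"))

# 'é' becomes two bytes and each CJK character three, but every id fits
# in the byte range and the text round-trips losslessly.
assert all(0 <= b <= 255 for b in ids)
assert bytes(ids).decode("utf-8") == text
```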
### 2. GPT-4 Style Regex Pattern

- **What**: Improved regex pattern for splitting text into chunks
- **Improvements**:
  - Case-insensitive matching for contractions (`'s`, `'t`, `'ll`, etc.)
  - Better whitespace handling (groups runs of spaces, which helps with Python code)
  - Limits number merging to 1-3 digits (prevents overly long number tokens)
- **Benefit**: More consistent tokenization, especially for code and numbers
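As a rough illustration of the chunking rules above, here is a simplified ASCII-only variant using the stdlib `re` module. This is a sketch of the splitting idea, not the actual production pattern, which uses Unicode character classes and possessive quantifiers.

```python
import re

# Simplified, ASCII-only sketch of the splitting rules described above.
PATTERN = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"  # contractions, matched case-insensitively
    r"| ?[A-Za-z]+"           # words, optionally with one leading space
    r"|[0-9]{1,3}"            # numeric runs capped at three digits
    r"| ?[^\sA-Za-z0-9]+"     # punctuation runs
    r"|\s+(?!\S)"             # whitespace not followed by a non-space
    r"|\s+"                   # remaining whitespace
)

print(PATTERN.findall("It's 2024!"))
# ['It', "'s", ' ', '202', '4', '!']
```

Note how the contraction is split off case-insensitively and the four-digit year is broken at the three-digit cap.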
### 3. BPE Training Algorithm

- **What**: Full Byte Pair Encoding implementation
- **Features**:
  - `get_stats()`: counts consecutive token pairs
  - `merge()`: replaces frequent pairs with single tokens
  - Iterative training process
- **Benefit**: Creates an efficient vocabulary that compresses common sequences
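The two helpers named above are small; here is a minimal sketch (the signatures are assumed from the description, not copied from the source):

```python
def get_stats(ids):
    """Count how often each consecutive token pair occurs in `ids`."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out
```

Training is then just a loop: count pairs, pick the most frequent one, merge it, repeat.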
### 4. Special Token Handling

- **What**: Proper handling of special tokens (`<pad>`, `<unk>`, `<bos>`, `<eos>`)
- **Features**:
  - Special tokens are excluded from BPE merging
  - The EOS token stops decoding
  - Configurable special token set
- **Benefit**: Clean separation between data tokens and control tokens
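One common way to keep special tokens out of BPE is to split them off before the byte-level pass even begins. The sketch below assumes this approach; the token strings come from the list above, while the id assignments and function name are illustrative:

```python
import re

# Illustrative id assignments; the actual special token set is configurable.
SPECIALS = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3}

def split_specials(text):
    """Split text so special tokens survive as whole chunks.

    Chunks that are special tokens map straight to their ids and never
    enter BPE merging; everything else goes through the normal pipeline.
    """
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    return [part for part in re.split(pattern, text) if part]

print(split_specials("<bos>hello<eos>"))
# ['<bos>', 'hello', '<eos>']
```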
### 5. Trailing Whitespace Detection

- **What**: Warns when text ends with trailing whitespace
- **Why**: Trailing spaces can cause poor tokenization (as seen in GPT-3.5 playground warnings)
- **Benefit**: Helps developers avoid tokenization issues
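The check itself can be as small as this sketch; the function name and warning text are assumptions, not the actual implementation:

```python
import warnings

def warn_on_trailing_whitespace(text):
    """Emit a warning if `text` ends in whitespace before encoding it."""
    if text and text != text.rstrip():
        warnings.warn(
            "Text ends with trailing whitespace, which often tokenizes "
            "poorly; consider stripping it before encoding."
        )
    return text
```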
### 6. Better Python Code Handling

- **What**: Improved whitespace merging for indentation
- **Why**: GPT-2 tokenized Python code poorly because each space was a separate token
- **Benefit**: More efficient tokenization of Python code (fewer tokens per file)
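The grouping idea in miniature (a toy pattern, not the tokenizer's actual regex): runs of spaces become one chunk, so a four-space indent costs one token instead of four.

```python
import re

# Toy contrast with per-character whitespace: " +" swallows whole runs
# of spaces, so indentation collapses into a single chunk.
chunks = re.findall(r" +|\S+", "    return x")
print(chunks)
# ['    ', 'return', ' ', 'x']
```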
### 7. Number Tokenization Limits

- **What**: Limits number merging to 1-3 digits
- **Why**: Prevents creating tokens for very long number sequences
- **Benefit**: Better arithmetic performance (numbers are tokenized more consistently)
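The three-digit cap can be seen in isolation with a minimal pattern (a toy illustration, not the full tokenizer regex):

```python
import re

# Cap numeric runs at three digits; everything else passes through as-is.
chunks = re.findall(r"\d{1,3}|\D+", "total = 1234567")
print(chunks)
# ['total = ', '123', '456', '7']
```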
## Usage

### Basic Usage
```python
from data.example import SimpleTokenizer

# Create tokenizer (uses BPE by default)
tokenizer = SimpleTokenizer(use_bpe=True, vocab_size=50257)

# Encode text
tokens = tokenizer.encode("Hello world!")
print(tokens)  # e.g. [15496, 1917, 0] -- ids depend on the trained vocabulary

# Decode tokens
text = tokenizer.decode(tokens)
print(text)  # "Hello world!"
```
### Training a Custom Tokenizer
```python
from data.example import BPETokenizer

# Create tokenizer
tokenizer = BPETokenizer(vocab_size=50257)

# Train on your corpus
texts = [
    "Your training text here...",
    "More training text...",
]

tokenizer.train(texts, num_merges=50000, verbose=True)

# Save trained tokenizer
tokenizer.save("merges.json", "vocab.json")
```
### Loading a Pre-trained Tokenizer
```python
from data.example import BPETokenizer

# Load saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("merges.json", "vocab.json")

# Use it
tokens = tokenizer.encode("Hello world!")
```
## Addressing Common Issues

### Issue: "Can't spell words well"

- **Cause**: Long tokens (such as "defaultstyle" as a single token)
- **Fix**: Better regex splitting prevents over-merging

### Issue: "Bad at arithmetic"

- **Cause**: Arbitrary number tokenization (sometimes 1 token, sometimes 2-3)
- **Fix**: Limiting number merges to 1-3 digits makes tokenization consistent

### Issue: "Python code inefficient"

- **Cause**: Each space was a separate token (a GPT-2 issue)
- **Fix**: Multiple spaces merge into single tokens

### Issue: "Non-English languages worse"

- **Cause**: Tokenizer trained primarily on English
- **Fix**: UTF-8 byte-level encoding handles all languages consistently

### Issue: "Trailing whitespace warning"

- **Cause**: Models see very few training examples with trailing spaces
- **Fix**: A warning helps developers detect and fix the issue

### Issue: "Solid gold Magikarp" (untrained tokens)

- **Cause**: The tokenizer's vocabulary contains tokens for strings that rarely or never appear in the model's training data
- **Fix**: Proper validation and fallback handling for unknown tokens
## Backward Compatibility

The `SimpleTokenizer` class maintains backward compatibility:

- If `use_bpe=False`, it uses character-level tokenization (the old behavior)
- If `use_bpe=True` (the default), it uses the new BPE tokenizer
- All existing code continues to work without changes
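The compatibility contract can be pictured as a simple dispatch. This is a hypothetical sketch of the behavior described above, not the actual `SimpleTokenizer` source:

```python
class SimpleTokenizer:
    """Hypothetical sketch of the use_bpe dispatch, not the real class."""

    def __init__(self, use_bpe=True, vocab_size=50257):
        self.use_bpe = use_bpe
        self.vocab_size = vocab_size

    def encode(self, text):
        if self.use_bpe:
            return self._bpe_encode(text)   # new default path
        return [ord(c) for c in text]       # old character-level path

    def _bpe_encode(self, text):
        # Stand-in for the real BPE path: byte ids before any merges.
        return list(text.encode("utf-8"))
```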
## Technical Details

### BPE Algorithm

1. Start with a byte-level vocabulary (256 tokens)
2. Count consecutive token pairs in the training data
3. Find the most frequent pair
4. Merge that pair into a new token
5. Repeat until the target vocabulary size is reached
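The five steps above, as one compact loop. This is a toy sketch over byte ids; the assumption that new token ids start at 256 follows from the byte-level base vocabulary:

```python
def train_bpe(ids, num_merges, first_new_id=256):
    """Run up to `num_merges` BPE merges over a list of byte ids."""
    merges = {}
    for i in range(num_merges):
        # Steps 2-3: count consecutive pairs and pick the most frequent.
        stats = {}
        for pair in zip(ids, ids[1:]):
            stats[pair] = stats.get(pair, 0) + 1
        if not stats:
            break
        best = max(stats, key=stats.get)
        # Step 4: merge that pair into a fresh token id.
        new_id = first_new_id + i
        merges[best] = new_id
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == best:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids, merges
```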
### Encoding Process

1. Split text using the regex pattern
2. Convert each chunk to UTF-8 bytes
3. Apply BPE merges (greedy, left-to-right)
4. Return token IDs
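Steps 2-3 for a single regex chunk might look like the sketch below. It assumes the learned merges are stored in training order (Python dicts preserve insertion order), and applies each one left-to-right:

```python
def encode_chunk(chunk, merges):
    """Turn one text chunk into token ids: UTF-8 bytes, then merges."""
    ids = list(chunk.encode("utf-8"))
    # Apply merges in the order they were learned during training.
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```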
### Decoding Process

1. Look up token IDs in the vocabulary
2. Convert bytes back to UTF-8 text
3. Handle special tokens (EOS stops decoding)
4. Return the decoded text
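A minimal decoding pass, assuming `vocab` maps token ids to byte strings; the function name and the `errors="replace"` fallback are assumptions for the sketch:

```python
def decode(ids, vocab, eos_id=None):
    """Map token ids back to text, stopping at the EOS token if given."""
    parts = []
    for token_id in ids:
        if eos_id is not None and token_id == eos_id:
            break  # EOS stops decoding
        parts.append(vocab[token_id])
    # Invalid byte sequences fall back to the replacement character.
    return b"".join(parts).decode("utf-8", errors="replace")
```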
## References

Based on improvements discussed in:

- The GPT-2 paper's section on tokenization
- GPT-4 tokenizer improvements
- Common tokenization challenges and solutions