Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
docs/TOKENIZER_IMPROVEMENTS.md (new file, 165 lines)
# Tokenizer Improvements

## Overview

Based on the tokenization challenges discussed in the transcript, we've implemented an improved BPE (Byte Pair Encoding) tokenizer that addresses common issues with tokenization in large language models.

## Key Improvements

### 1. UTF-8 Byte-Level Encoding
- **What**: The tokenizer works at the byte level (0-255) instead of the character level
- **Why**: Handles all Unicode characters consistently, regardless of language
- **Benefit**: Better support for non-English languages and special characters (see the example below)

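To make the byte-level idea concrete, here is what it looks like in plain Python, independent of the project's classes (the byte values shown are exact UTF-8 output):

```python
# Every string maps onto bytes 0-255, so a 256-entry base vocabulary
# covers all of Unicode with no "unknown character" cases.
text = "héllo 世界"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
# [104, 195, 169, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]

# Decoding reverses the mapping losslessly.
print(bytes(byte_ids).decode("utf-8"))  # héllo 世界
```
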
### 2. GPT-4 Style Regex Pattern
- **What**: Improved regex pattern for splitting text into chunks
- **Improvements**:
  - Case-insensitive matching for contractions (`'s`, `'t`, `'ll`, etc.)
  - Better whitespace handling (runs of spaces are grouped, which helps with Python code)
  - Limits number merging to 1-3 digits (prevents overly long number tokens)
- **Benefit**: More consistent tokenization, especially for code and numbers (see the pattern below)

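For reference, a GPT-4-style split pattern (the form published in Karpathy's minbpe, mirroring tiktoken's `cl100k_base`; the project's exact pattern may differ) looks like this. It needs the third-party `regex` module, since the stdlib `re` lacks `\p{...}` classes and possessive quantifiers:

```python
import regex as re  # pip install regex; stdlib `re` can't handle this pattern

# Contractions (case-insensitive), letter runs with an optional leading symbol,
# 1-3 digit number groups, punctuation runs, and careful whitespace handling.
GPT4_SPLIT_PATTERN = (
    r"'(?i:[sdmt]|ll|ve|re)"
    r"|[^\r\n\p{L}\p{N}]?+\p{L}+"
    r"|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]++[\r\n]*"
    r"|\s*[\r\n]"
    r"|\s+(?!\S)"
    r"|\s+"
)

print(re.findall(GPT4_SPLIT_PATTERN, "I'LL pay 12345 dollars!"))
# ['I', "'LL", ' pay', ' ', '123', '45', ' dollars', '!']
```

Note how the long number is forced into 1-3 digit chunks and the contraction `'LL` matches despite its casing.
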
### 3. BPE Training Algorithm
- **What**: Full Byte Pair Encoding implementation
- **Features**:
  - `get_stats()`: Counts consecutive token pairs
  - `merge()`: Replaces frequent pairs with single tokens
  - Iterative training process
- **Benefit**: Creates an efficient vocabulary that compresses common sequences (the two helpers are sketched below)

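A minimal sketch of the two helpers, assuming the straightforward dict-and-list approach (the project's actual `get_stats`/`merge` may differ in details):

```python
def get_stats(ids):
    """Count occurrences of each consecutive pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out
```
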
### 4. Special Token Handling
- **What**: Proper handling of special tokens (`<pad>`, `<unk>`, `<bos>`, `<eos>`)
- **Features**:
  - Special tokens are excluded from BPE merging
  - The EOS token stops decoding
  - Configurable special token set
- **Benefit**: Clean separation between data tokens and control tokens (one enforcement approach is sketched below)

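One common way to keep special tokens out of BPE merging is to split the input on them before tokenizing; a minimal sketch (the token IDs here are illustrative, not the project's actual assignments):

```python
import re

# Hypothetical id assignments, for illustration only.
SPECIAL_TOKENS = {"<pad>": 50253, "<unk>": 50254, "<bos>": 50255, "<eos>": 50256}

def split_on_specials(text):
    """Split text so special tokens pass through whole, untouched by BPE."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    return [part for part in re.split(pattern, text) if part]

print(split_on_specials("<bos>Hello world!<eos>"))
# ['<bos>', 'Hello world!', '<eos>']
```

Only the middle chunk is fed to the BPE encoder; the special tokens map directly to their reserved IDs.
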
### 5. Trailing Whitespace Detection
- **What**: Warns when text ends with trailing whitespace
- **Why**: Trailing spaces can cause poor tokenization (as seen in GPT-3.5 playground warnings)
- **Benefit**: Helps developers avoid tokenization issues (a minimal check is sketched below)

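The check itself can be as simple as this sketch (the function name is illustrative, not the project's API):

```python
import warnings

def warn_on_trailing_whitespace(text):
    """Warn if a prompt ends in whitespace, which often tokenizes poorly."""
    if text and text != text.rstrip():
        warnings.warn("Text ends with trailing whitespace; "
                      "this may lead to poor tokenization.")

warn_on_trailing_whitespace("Translate this: ")  # emits the warning
```
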
### 6. Better Python Code Handling
- **What**: Improved whitespace merging for indentation
- **Why**: GPT-2 had issues with Python code because each space was a separate token
- **Benefit**: More efficient tokenization of Python code (fewer tokens per file)

### 7. Number Tokenization Limits
- **What**: Limits number merging to 1-3 digits
- **Why**: Prevents creating tokens for very long number sequences
- **Benefit**: Better arithmetic performance (numbers are more consistently tokenized); points 6 and 7 are demonstrated together below

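Points 6 and 7 both fall out of the split pattern shown in section 2; reusing `GPT4_SPLIT_PATTERN` from that sketch:

```python
import regex as re  # reuses GPT4_SPLIT_PATTERN from the section 2 sketch

print(re.findall(GPT4_SPLIT_PATTERN, "        return 1000000"))
# ['       ', ' return', ' ', '100', '000', '0']
# The indentation collapses into one chunk (7 spaces; the 8th attaches to
# 'return'), and the long number splits into 1-3 digit groups.
```
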
## Usage

### Basic Usage

```python
from data.example import SimpleTokenizer

# Create tokenizer (uses BPE by default)
tokenizer = SimpleTokenizer(use_bpe=True, vocab_size=50257)

# Encode text
tokens = tokenizer.encode("Hello world!")
print(tokens)  # [15496, 1917, 0]

# Decode tokens
text = tokenizer.decode(tokens)
print(text)  # "Hello world!"
```

### Training a Custom Tokenizer

```python
from data.example import BPETokenizer

# Create tokenizer
tokenizer = BPETokenizer(vocab_size=50257)

# Train on your corpus
texts = [
    "Your training text here...",
    "More training text...",
]

tokenizer.train(texts, num_merges=50000, verbose=True)

# Save trained tokenizer
tokenizer.save("merges.json", "vocab.json")
```

### Loading a Pre-trained Tokenizer

```python
from data.example import BPETokenizer

# Load saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("merges.json", "vocab.json")

# Use it
tokens = tokenizer.encode("Hello world!")
```

## Addressing Common Issues

### Issue: "Can't spell words well"
- **Cause**: Long tokens (like "defaultstyle" as a single token) hide character-level structure from the model
- **Fix**: Better regex splitting prevents over-merging

### Issue: "Bad at arithmetic"
- **Cause**: Arbitrary number tokenization (sometimes 1 token, sometimes 2-3)
- **Fix**: Limits number merging to 1-3 digits for consistency

### Issue: "Python code inefficient"
- **Cause**: Each space is a separate token (a GPT-2 issue)
- **Fix**: Multiple spaces merge into single tokens

### Issue: "Non-English languages worse"
- **Cause**: Tokenizer trained primarily on English
- **Fix**: UTF-8 byte-level encoding handles all languages consistently

### Issue: "Trailing whitespace warning"
- **Cause**: A trailing space gets split into its own token, which models see in very few training examples
- **Fix**: The warning helps developers detect and fix the issue

### Issue: "Solid gold Magikarp" (untrained tokens)
- **Cause**: Strings common in the tokenizer's training data can be rare or absent in the model's training data, leaving those token embeddings effectively untrained
- **Fix**: Proper validation and fallback handling for unknown tokens

## Backward Compatibility

The `SimpleTokenizer` class maintains backward compatibility:
- If `use_bpe=False`, uses character-level tokenization (old behavior)
- If `use_bpe=True` (default), uses the new BPE tokenizer
- All existing code continues to work without changes (see the comparison below)

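Both modes go through the same constructor shown in the Usage section; a quick comparison, assuming the constructor accepts the same arguments in both modes:

```python
from data.example import SimpleTokenizer

# Old behavior: character-level tokenization
char_tokenizer = SimpleTokenizer(use_bpe=False, vocab_size=50257)

# New default: byte-level BPE
bpe_tokenizer = SimpleTokenizer(use_bpe=True, vocab_size=50257)
```
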
## Technical Details

### BPE Algorithm
1. Start with the byte-level vocabulary (256 tokens)
2. Count consecutive token pairs in the training data
3. Find the most frequent pair
4. Merge that pair into a new token
5. Repeat until the target vocabulary size is reached (a toy version of this loop follows)

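Using the `get_stats`/`merge` helpers sketched in section 3, the loop above reduces to a few lines (toy corpus and vocabulary size chosen purely for illustration):

```python
ids = list("aaabdaaabac".encode("utf-8"))  # step 1: raw bytes

merges = {}                                # (pair) -> new token id
for new_id in range(256, 260):             # toy target: 4 merges
    stats = get_stats(ids)                 # step 2: count pairs
    if not stats:
        break                              # nothing left to merge
    pair = max(stats, key=stats.get)       # step 3: most frequent pair
    ids = merge(ids, pair, new_id)         # step 4: merge it
    merges[pair] = new_id                  # remember the merge order
```
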
### Encoding Process
1. Split text using the regex pattern
2. Convert each chunk to UTF-8 bytes
3. Apply BPE merges (greedy, left-to-right)
4. Return token IDs (step 3 is sketched below)

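A common way to implement step 3, again reusing `get_stats`/`merge` from section 3 and the `merges` dict built during training. This variant replays merges in training-priority order, which standard BPE encoders use; the project's exact strategy may differ:

```python
def encode_chunk(chunk_bytes, merges):
    """Turn one regex chunk's bytes into token ids by replaying merges."""
    ids = list(chunk_bytes)
    while len(ids) >= 2:
        stats = get_stats(ids)
        # Of the pairs present, pick the one learned earliest in training.
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies anymore
        ids = merge(ids, pair, merges[pair])
    return ids

print(encode_chunk("aaab".encode("utf-8"), merges))  # [258] with the toy merges above
```
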
### Decoding Process
1. Look up token IDs in the vocabulary
2. Convert the bytes back to UTF-8
3. Handle special tokens (EOS stops decoding)
4. Return the decoded text (sketched below)

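A minimal sketch of these four steps, assuming a `vocab` dict mapping token id to `bytes`; `errors="replace"` guards against token sequences that split a multi-byte character:

```python
def decode(ids, vocab, eos_id=None):
    """Map token ids back to text, stopping at EOS if one appears."""
    parts = []
    for token_id in ids:
        if eos_id is not None and token_id == eos_id:
            break                      # step 3: EOS stops decoding
        parts.append(vocab[token_id])  # step 1: id -> bytes
    # steps 2 and 4: stitch the bytes together and decode as UTF-8
    return b"".join(parts).decode("utf-8", errors="replace")
```
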
## References

Based on improvements discussed in:
- GPT-2 paper tokenization section
- GPT-4 tokenizer improvements
- Common tokenization challenges and solutions