
Tokenizer Improvements

Overview

Based on the tokenization challenges discussed in the transcript, we've implemented an improved BPE (Byte Pair Encoding) tokenizer that addresses common issues with tokenization in large language models.

Key Improvements

1. UTF-8 Byte-Level Encoding

  • What: Tokenizer works at the byte level (0-255) instead of character level
  • Why: Handles all Unicode characters consistently, regardless of language
  • Benefit: Better support for non-English languages and special characters
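As a minimal illustration of what byte-level means (plain Python, independent of the repo's classes): every string reduces to UTF-8 bytes in the range 0-255, so even non-ASCII characters are representable without an `<unk>` fallback.

```python
# Byte-level view of a string: every character, including non-ASCII,
# reduces to UTF-8 bytes in the range 0-255.
text = "héllo"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # "é" expands to two bytes: [104, 195, 169, 108, 108, 111]
print(bytes(byte_ids).decode("utf-8"))  # round-trips losslessly
```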

2. GPT-4 Style Regex Pattern

  • What: Improved regex pattern for splitting text into chunks
  • Improvements:
    • Case-insensitive matching for contractions ('s, 't, 'll, etc.)
    • Better whitespace handling (groups multiple spaces for Python code)
    • Limits number merging to 1-3 digits (prevents overly long number tokens)
  • Benefit: More consistent tokenization, especially for code and numbers
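The actual GPT-4 pattern relies on the third-party `regex` module (for `\p{L}`/`\p{N}` classes and possessive quantifiers); the simplified, ASCII-only stand-in below uses only the standard library but shows the same three behaviours listed above:

```python
import re

# Simplified, ASCII-only stand-in for a GPT-4-style chunking pattern.
# (The real pattern needs the third-party `regex` module for \p{L}/\p{N}.)
PAT = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"   # contractions, case-insensitive
    r"| ?[A-Za-z]+"            # words, with an optional leading space
    r"| ?[0-9]{1,3}"           # numbers capped at 1-3 digits
    r"| ?[^\sA-Za-z0-9]+"      # punctuation runs
    r"|\s+(?!\S)"              # whitespace not followed by non-space
    r"|\s+"                    # remaining whitespace
)

print(PAT.findall("He's done 1234"))  # ['He', "'s", ' done', ' 123', '4']
```

Note how `1234` splits into `' 123'` and `'4'` because of the 3-digit cap, and how `'s` is caught case-insensitively.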

3. BPE Training Algorithm

  • What: Full Byte Pair Encoding implementation
  • Features:
    • get_stats(): Counts consecutive token pairs
    • merge(): Replaces frequent pairs with single tokens
    • Iterative training process
  • Benefit: Creates efficient vocabulary that compresses common sequences
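A minimal standalone sketch of the two helpers named above (illustrative, not the repo's exact code):

```python
from collections import Counter

def get_stats(ids):
    """Count occurrences of each consecutive token pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = [1, 2, 3, 1, 2]
top = get_stats(ids).most_common(1)[0][0]  # most frequent pair: (1, 2)
print(merge(ids, top, 256))  # [256, 3, 256]
```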

4. Special Token Handling

  • What: Proper handling of special tokens (<pad>, <unk>, <bos>, <eos>)
  • Features:
    • Special tokens are excluded from BPE merging
    • EOS token stops decoding
    • Configurable special token set
  • Benefit: Clean separation between data tokens and control tokens
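One common way to keep special tokens out of BPE merging is to split the input on them before encoding, so each special token maps directly to its reserved ID while only the text in between goes through BPE. A sketch using the token set listed above:

```python
import re

SPECIAL = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3}

def split_on_special(text, special=SPECIAL):
    """Split text so special tokens never pass through BPE merging."""
    # A capturing group makes re.split keep the separators in the output.
    pattern = "(" + "|".join(re.escape(t) for t in special) + ")"
    return [part for part in re.split(pattern, text) if part]

print(split_on_special("<bos>Hello world<eos>"))
# ['<bos>', 'Hello world', '<eos>']
```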

5. Trailing Whitespace Detection

  • What: Warns when text ends with trailing whitespace
  • Why: Trailing spaces can cause poor tokenization (as seen in GPT-3.5 playground warnings)
  • Benefit: Helps developers avoid tokenization issues
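A check like this is cheap to run before encoding (a sketch; `check_trailing_whitespace` is a hypothetical helper name, not necessarily the repo's):

```python
import warnings

def check_trailing_whitespace(text):
    """Return True (and warn) if text ends with whitespace."""
    if text and text[-1].isspace():
        warnings.warn("prompt ends with trailing whitespace; "
                      "this often tokenizes poorly")
        return True
    return False

check_trailing_whitespace("Translate this: ")  # warns
check_trailing_whitespace("Translate this:")   # silent
```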

6. Better Python Code Handling

  • What: Improved whitespace merging for indentation
  • Why: GPT-2 had issues with Python code because each space was a separate token
  • Benefit: More efficient tokenization of Python code (fewer tokens per file)
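The effect is easy to quantify: an 8-space indent costs 8 tokens character-by-character but only 1 when space runs are grouped. A toy comparison:

```python
import re

code = "        return x"  # 8-space indent
# Character-level (GPT-2-era behaviour): every space is its own token.
char_tokens = list(code)
# Grouped behaviour: runs of spaces collapse into single chunks.
grouped = re.findall(r" +|\S+", code)
print(len(char_tokens), len(grouped))  # 16 vs 4
```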

7. Number Tokenization Limits

  • What: Limits number merging to 1-3 digits
  • Why: Prevents creating tokens for very long number sequences
  • Benefit: Better arithmetic performance (numbers are more consistently tokenized)
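Under this rule any long number decomposes into the same deterministic 3-digit chunks rather than corpus-dependent merges. A toy sketch of the chunking behaviour:

```python
import re

def chunk_number(s):
    """Split a digit string into at-most-3-digit chunks, left to right."""
    return re.findall(r"[0-9]{1,3}", s)

print(chunk_number("3141592653"))  # ['314', '159', '265', '3']
```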

Usage

Basic Usage

from data.example import SimpleTokenizer

# Create tokenizer (uses BPE by default)
tokenizer = SimpleTokenizer(use_bpe=True, vocab_size=50257)

# Encode text
tokens = tokenizer.encode("Hello world!")
print(tokens)  # token IDs; exact values depend on the trained vocabulary

# Decode tokens
text = tokenizer.decode(tokens)
print(text)  # "Hello world!"

Training a Custom Tokenizer

from data.example import BPETokenizer

# Create tokenizer
tokenizer = BPETokenizer(vocab_size=50257)

# Train on your corpus
texts = [
    "Your training text here...",
    "More training text...",
]

tokenizer.train(texts, num_merges=50000, verbose=True)

# Save trained tokenizer
tokenizer.save("merges.json", "vocab.json")

Loading a Pre-trained Tokenizer

from data.example import BPETokenizer

# Load saved tokenizer
tokenizer = BPETokenizer()
tokenizer.load("merges.json", "vocab.json")

# Use it
tokens = tokenizer.encode("Hello world!")

Addressing Common Issues

Issue: "Can't spell words well"

  • Cause: Overly long tokens (like "defaultstyle" as a single token) hide individual characters from the model
  • Fix: Better regex splitting prevents over-merging

Issue: "Bad at arithmetic"

  • Cause: Arbitrary number tokenization (sometimes 1 token, sometimes 2-3)
  • Fix: Limits number merging to 1-3 digits for consistency

Issue: "Python code inefficient"

  • Cause: Each space was a separate token (a GPT-2 issue)
  • Fix: Multiple spaces merge into single tokens

Issue: "Non-English languages worse"

  • Cause: Tokenizer trained primarily on English
  • Fix: UTF-8 byte-level encoding handles all languages consistently

Issue: "Trailing whitespace warning"

  • Cause: Models see very few examples of trailing spaces
  • Fix: Warning helps developers detect and fix the issue

Issue: "Solid gold Magikarp" (untrained tokens)

  • Cause: Tokens that appear in the tokenizer's training corpus but rarely or never in the model's training data are left with untrained embeddings
  • Fix: Proper validation and fallback handling for rare or unknown tokens

Backward Compatibility

The SimpleTokenizer class maintains backward compatibility:

  • If use_bpe=False, uses character-level tokenization (old behavior)
  • If use_bpe=True (default), uses new BPE tokenizer
  • All existing code continues to work without changes

Technical Details

BPE Algorithm

  1. Start with byte-level vocabulary (256 tokens)
  2. Count consecutive token pairs in training data
  3. Find most frequent pair
  4. Merge pair into new token
  5. Repeat until target vocabulary size reached
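The five steps above can be condensed into a toy training loop (illustrative only; the classic example string `aaabdaaabac` compresses from 11 byte tokens to 5 after three merges):

```python
from collections import Counter

def train_bpe(ids, num_merges, first_new_id=256):
    """Toy trainer: repeatedly merge the most frequent adjacent pair."""
    merges = {}  # (a, b) -> new token id
    for new_id in range(first_new_id, first_new_id + num_merges):
        stats = Counter(zip(ids, ids[1:]))
        if not stats:
            break
        pair = stats.most_common(1)[0][0]
        merges[pair] = new_id
        # Rebuild the sequence with the pair collapsed into one token.
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids, merges

ids, merges = train_bpe(list(b"aaabdaaabac"), num_merges=3)
print(ids)  # [258, 100, 258, 97, 99]
```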

Encoding Process

  1. Split text using regex pattern
  2. Convert each chunk to UTF-8 bytes
  3. Apply BPE merges (greedy, left-to-right)
  4. Return token IDs
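A sketch of the merge-application step for a single chunk. Here merges are applied in the order they were learned (lowest new-token ID first), which is the standard BPE convention; the toy `merges` table below is hypothetical:

```python
def encode_chunk(text, merges):
    """Encode one regex chunk: bytes, then learned merges in rank order."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the present pair with the lowest (earliest-learned) rank.
        pair = min((p for p in pairs if p in merges), key=merges.get,
                   default=None)
        if pair is None:
            break  # no applicable merges left
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

merges = {(104, 101): 256, (256, 108): 257}  # toy table: "he"->256, "hel"->257
print(encode_chunk("hello", merges))  # [257, 108, 111]
```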

Decoding Process

  1. Look up token IDs in vocabulary
  2. Convert bytes back to UTF-8
  3. Handle special tokens (EOS stops decoding)
  4. Return decoded text
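A minimal decode sketch; `errors="replace"` matters because a token boundary can fall in the middle of a multi-byte UTF-8 character. The `vocab` table below is hypothetical:

```python
def decode_ids(ids, vocab, eos_id=None):
    """Map token IDs back to byte strings, then decode as UTF-8."""
    parts = []
    for i in ids:
        if eos_id is not None and i == eos_id:
            break  # EOS stops decoding
        parts.append(vocab[i])
    # errors="replace" guards against malformed byte sequences at
    # token boundaries instead of raising UnicodeDecodeError.
    return b"".join(parts).decode("utf-8", errors="replace")

vocab = {104: b"h", 256: b"el", 108: b"l", 111: b"o"}
print(decode_ids([104, 256, 108, 111, 3], vocab, eos_id=3))  # hello
```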

References

Based on improvements discussed in:

  • GPT-2 paper tokenization section
  • GPT-4 tokenizer improvements
  • Common tokenization challenges and solutions