- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
3.5 KiB
3.5 KiB
Retraining Guide for Better Model Performance
Current Issues
Your model only trained for 40 global steps across 10 epochs, which means:
- Very little training data (~4 batches per epoch)
- Model hasn't learned language patterns
- Model just repeats input and stops
Retraining Recommendations
1. Increase Training Data
The model needs much more data. Check your current data:
# Check how much data you have
wc -l data/*.txt
Recommendations:
- Minimum: 10,000+ text samples
- Good: 100,000+ text samples
- Better: 1,000,000+ text samples
2. Update Training Configuration
Edit config.json for better training:
{
"training": {
"batch_size": 32,
"max_epochs": 50, // Increase from 10 to 50+
"learning_rate": 1e-4,
"weight_decay": 0.01,
"warmup_steps": 1000,
"max_grad_norm": 1.0,
"gradient_accumulation_steps": 4, // Increase to simulate larger batches
"use_amp": true,
"save_dir": "./checkpoints",
"log_interval": 10, // More frequent logging
"eval_interval": 500 // More frequent evaluation
}
}
3. Add Validation Set
Split your data for validation:
# In train.py, add validation split
from sklearn.model_selection import train_test_split
train_texts, val_texts = train_test_split(texts, test_size=0.1, random_state=42)
4. Improve Training Data Quality
Ensure your training data:
- ✅ Contains complete sentences/paragraphs
- ✅ Has diverse topics and styles
- ✅ Doesn't have excessive padding
- ✅ Uses proper text formatting
5. Monitor Training
Watch for:
- Loss decreasing: Should trend downward
- Perplexity: Should decrease (lower is better)
- Generation quality: Test periodically during training
6. Training Command
# Train with more data
python3 train.py \
--data data/your_training_data.txt \
--config config.json \
--output ./checkpoints \
--device cpu # or cuda/mps
7. Check Training Progress
During training, you should see:
Epoch 1: Train Loss = 8.5 → Epoch 10: Train Loss = 6.0 → Epoch 50: Train Loss = 3.5
If loss stops decreasing, the model has converged.
8. Early Stopping
Consider adding early stopping if validation loss plateaus:
- Stop if validation loss doesn't improve for 5 epochs
- Save the best model based on validation loss
9. Test During Training
After each epoch, test generation:
python3 inference.py \
--checkpoint checkpoints/checkpoint_epoch_X.pt \
--prompt "The future of" \
--optimized
Good training should show:
- ✅ Model generates coherent text
- ✅ Model continues beyond input prompt
- ✅ Model doesn't immediately generate padding tokens
Quick Start Retraining
- Get more training data (most important!)
- Update config.json with more epochs
- Start training:
python3 train.py --data data/your_data.txt --config config.json - Monitor loss - should decrease over time
- Test periodically - check if generation improves
Expected Results
After proper training:
- Loss should decrease from ~8-10 to ~2-4
- Perplexity should decrease from ~3000 to ~10-50
- Model should generate 50+ tokens before stopping
- Generated text should be coherent and diverse
Next Steps
- ✅ Early stopping is now fixed (prevents padding tokens)
- ⏳ Retrain with more data and epochs
- ⏳ Monitor training metrics
- ⏳ Test generation quality during training
Good luck with retraining! 🚀