Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit 3d2da94ce2 by Carlos Gutierrez
2025-11-06 22:07:41 -05:00
60 changed files with 25153 additions and 0 deletions

docs/RETRAINING_GUIDE.md
# Retraining Guide for Better Model Performance
## Current Issues
Your model only trained for **40 global steps** across 10 epochs, which means:
- Very little training data (~4 batches per epoch)
- Model hasn't learned language patterns
- Model just repeats input and stops
## Retraining Recommendations
### 1. **Increase Training Data**
The model needs much more data. Check your current data:
```bash
# Check how much data you have
wc -l data/*.txt
```
**Recommendations:**
- **Minimum**: 10,000+ text samples
- **Good**: 100,000+ text samples
- **Better**: 1,000,000+ text samples
### 2. **Update Training Configuration**
Edit `config.json` for better training:
```json
{
"training": {
"batch_size": 32,
"max_epochs": 50, // Increase from 10 to 50+
"learning_rate": 1e-4,
"weight_decay": 0.01,
"warmup_steps": 1000,
"max_grad_norm": 1.0,
"gradient_accumulation_steps": 4, // Increase to simulate larger batches
"use_amp": true,
"save_dir": "./checkpoints",
"log_interval": 10, // More frequent logging
"eval_interval": 500 // More frequent evaluation
}
}
```
### 3. **Add Validation Set**
Split your data for validation:
```python
# In train.py, add validation split
from sklearn.model_selection import train_test_split
train_texts, val_texts = train_test_split(texts, test_size=0.1, random_state=42)
```
### 4. **Improve Training Data Quality**
Ensure your training data:
- ✅ Contains complete sentences/paragraphs
- ✅ Has diverse topics and styles
- ✅ Doesn't have excessive padding
- ✅ Uses proper text formatting
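The checklist above can be approximated with a small filter over your raw samples. This is a minimal sketch; the `min_words` threshold of 5 is an illustrative assumption, not a fixed rule:

```python
def clean_samples(lines, min_words=5):
    """Keep samples that look like real sentences: non-empty,
    at least `min_words` words, no leading/trailing whitespace."""
    cleaned = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # drop blank lines
        if len(text.split()) < min_words:
            continue  # drop fragments too short to carry language patterns
        cleaned.append(text)
    return cleaned
```

Run this over `data/*.txt` before training so the model doesn't waste capacity on blanks and fragments.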
### 5. **Monitor Training**
Watch for:
- **Loss decreasing**: Should trend downward
- **Perplexity**: Should decrease (lower is better)
- **Generation quality**: Test periodically during training
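Loss and perplexity are two views of the same number: perplexity is `exp(loss)` when the cross-entropy loss is measured in nats. A quick helper for converting the logged training loss:

```python
import math

def perplexity(loss):
    """Perplexity = exp(average cross-entropy loss in nats).
    Lower is better; 1.0 would mean a perfect model."""
    return math.exp(loss)

# e.g. a loss of ~8.0 corresponds to a perplexity of ~3000,
# and a loss of ~3.5 to a perplexity of ~33
```

This is why the expected-results numbers later in this guide (loss ~8-10 down to ~2-4, perplexity ~3000 down to ~10-50) move together.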
### 6. **Training Command**
```bash
# Train with more data
python3 train.py \
--data data/your_training_data.txt \
--config config.json \
--output ./checkpoints \
--device cpu # or cuda/mps
```
### 7. **Check Training Progress**
During training, you should see:
```
Epoch 1: Train Loss = 8.5 → Epoch 10: Train Loss = 6.0 → Epoch 50: Train Loss = 3.5
```
If loss stops decreasing, the model has converged.
### 8. **Early Stopping**
Consider adding early stopping if validation loss plateaus:
- Stop if validation loss doesn't improve for 5 epochs
- Save the best model based on validation loss
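A minimal sketch of this logic (the `patience` of 5 matches the rule above; actually saving the best checkpoint is left to your existing `save_dir` machinery):

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved
    for `patience` consecutive epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss.
        Returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss  # new best: save checkpoint here
            self.epochs_without_improvement = 0
            return False
        self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

Call `step()` at the end of each epoch's validation pass and break out of the training loop when it returns `True`.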
### 9. **Test During Training**
After each epoch, test generation:
```bash
python3 inference.py \
--checkpoint checkpoints/checkpoint_epoch_X.pt \
--prompt "The future of" \
--optimized
```
Good training should show:
- ✅ Model generates coherent text
- ✅ Model continues beyond input prompt
- ✅ Model doesn't immediately generate padding tokens
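These checks can be scripted as a rough health test on the generated token ids. A hedged sketch, assuming a pad token id of 0 and a plain list of token ids (both assumptions; adjust to your tokenizer):

```python
def looks_healthy(generated_ids, prompt_len, pad_id=0, min_new_tokens=10):
    """Heuristic check on one generation: did the model continue
    past the prompt without immediately emitting padding?"""
    new_tokens = generated_ids[prompt_len:]
    if len(new_tokens) < min_new_tokens:
        return False  # stopped almost immediately
    if new_tokens[0] == pad_id:
        return False  # first generated token is padding
    return True
```

Coherence still needs a human eye, but this catches the two mechanical failure modes (early stop, padding) automatically.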
## Quick Start Retraining
1. **Get more training data** (most important!)
2. **Update config.json** with more epochs
3. **Start training**:
```bash
python3 train.py --data data/your_data.txt --config config.json
```
4. **Monitor loss** - should decrease over time
5. **Test periodically** - check if generation improves
## Expected Results
After proper training:
- Loss should decrease from ~8-10 to ~2-4
- Perplexity should decrease from ~3000 to ~10-50
- Model should generate 50+ tokens before stopping
- Generated text should be coherent and diverse
## Next Steps
1. ✅ Early stopping is now fixed (prevents padding tokens)
2. ⏳ **Retrain with more data and epochs**
3. ⏳ Monitor training metrics
4. ⏳ Test generation quality during training
Good luck with retraining! 🚀