
Retraining Guide for Better Model Performance

Current Issues

Your model trained for only 40 global steps across 10 epochs, which means:

  • Very little training data (~4 batches per epoch)
  • Model hasn't learned language patterns
  • Model just repeats input and stops

Retraining Recommendations

1. Increase Training Data

The model needs much more data. Check your current data:

# Check how much data you have
wc -l data/*.txt

Recommendations:

  • Minimum: 10,000+ text samples
  • Good: 100,000+ text samples
  • Better: 1,000,000+ text samples
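
You can check where your corpus falls in these tiers with a short Python sketch (a hypothetical count_samples helper, assuming one sample per non-empty line in data/*.txt):

```python
from pathlib import Path

def count_samples(data_dir="data", pattern="*.txt"):
    # Count non-empty lines across all matching files.
    # Assumes one training sample per line; adjust if your data
    # uses a different record format (e.g. one document per file).
    total = 0
    for path in Path(data_dir).glob(pattern):
        with open(path, encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total
```

This matches the wc -l check above, except it skips blank lines so padding rows don't inflate the count.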

2. Update Training Configuration

Edit config.json for better training:

{
  "training": {
    "batch_size": 32,
    "max_epochs": 50,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "warmup_steps": 1000,
    "max_grad_norm": 1.0,
    "gradient_accumulation_steps": 4,
    "use_amp": true,
    "save_dir": "./checkpoints",
    "log_interval": 10,
    "eval_interval": 500
  }
}

Key changes from the defaults: max_epochs raised from 10 to 50+, gradient_accumulation_steps set to 4 to simulate larger batches, and log_interval/eval_interval lowered for more frequent logging and evaluation. (JSON does not allow comments, so keep annotations like these out of the file itself.)
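
One thing worth verifying after editing the config: the effective batch size the optimizer sees is batch_size × gradient_accumulation_steps. A minimal sketch (hypothetical effective_batch_size helper, assuming the config layout shown above):

```python
import json

def effective_batch_size(config_path="config.json"):
    # Gradient accumulation multiplies the per-step batch size:
    # gradients from N small batches are summed before each optimizer step.
    with open(config_path) as f:
        cfg = json.load(f)["training"]
    return cfg["batch_size"] * cfg.get("gradient_accumulation_steps", 1)
```

With the values above, the model effectively trains on batches of 32 × 4 = 128 samples.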

3. Add Validation Set

Split your data for validation:

# In train.py, add validation split
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(texts, test_size=0.1, random_state=42)
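
If you'd rather not add sklearn as a dependency, a plain-Python split does the same job (hypothetical split_texts helper; the fixed seed keeps the split reproducible across runs):

```python
import random

def split_texts(texts, val_fraction=0.1, seed=42):
    # Shuffle a copy so the caller's list order is untouched, then slice.
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    n_val = max(1, int(len(texts) * val_fraction))
    return texts[n_val:], texts[:n_val]
```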

4. Improve Training Data Quality

Ensure your training data:

  • Contains complete sentences/paragraphs
  • Has diverse topics and styles
  • Doesn't have excessive padding
  • Uses proper text formatting
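
A minimal cleaning pass along these lines might look like this (hypothetical clean_samples helper; the word-count threshold is illustrative, so tune it for your corpus):

```python
def clean_samples(lines, min_words=5):
    # Drop blank lines, very short fragments, and exact duplicates,
    # keeping the first occurrence of each sample in order.
    seen = set()
    out = []
    for line in lines:
        text = line.strip()
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        out.append(text)
    return out
```

Deduplicating matters more than it looks: repeated samples encourage the model to memorize and regurgitate its input, which is exactly the failure mode described above.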

5. Monitor Training

Watch for:

  • Loss decreasing: Should trend downward
  • Perplexity: Should decrease (lower is better)
  • Generation quality: Test periodically during training
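
Perplexity is simply the exponential of the mean cross-entropy loss (in nats), so you can derive it from the loss you already log:

```python
import math

def perplexity(mean_ce_loss):
    # Perplexity = exp(mean cross-entropy loss in nats).
    # Lower is better; a random model over vocab size V scores about V.
    return math.exp(mean_ce_loss)
```

For example, a training loss of 3.5 corresponds to a perplexity of about 33, which lands in the healthy 10-50 range mentioned under Expected Results below.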

6. Training Command

# Train with more data
python3 train.py \
    --data data/your_training_data.txt \
    --config config.json \
    --output ./checkpoints \
    --device cpu  # or cuda/mps

7. Check Training Progress

During training, you should see:

Epoch 1: Train Loss = 8.5 → Epoch 10: Train Loss = 6.0 → Epoch 50: Train Loss = 3.5

If the loss plateaus for several epochs, the model has converged on the current data.

8. Early Stopping

Consider adding early stopping if validation loss plateaus:

  • Stop if validation loss doesn't improve for 5 epochs
  • Save the best model based on validation loss
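
The two bullets above can be captured in a small tracker (a sketch of a hypothetical EarlyStopping class, not something that exists in the current train.py):

```python
class EarlyStopping:
    # Signal a stop when validation loss hasn't improved for `patience` epochs.
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Call step(val_loss) after each evaluation, save a checkpoint whenever best improves, and stop the loop once it returns True.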

9. Test During Training

After each epoch, test generation:

python3 inference.py \
    --checkpoint checkpoints/checkpoint_epoch_X.pt \
    --prompt "The future of" \
    --optimized

Good training should show:

  • Model generates coherent text
  • Model continues beyond input prompt
  • Model doesn't immediately generate padding tokens

Quick Start Retraining

  1. Get more training data (most important!)
  2. Update config.json with more epochs
  3. Start training:
    python3 train.py --data data/your_data.txt --config config.json
    
  4. Monitor loss - should decrease over time
  5. Test periodically - check if generation improves

Expected Results

After proper training:

  • Loss should decrease from ~8-10 to ~2-4
  • Perplexity should decrease from ~3000 to ~10-50
  • Model should generate 50+ tokens before stopping
  • Generated text should be coherent and diverse

Next Steps

  1. The generation early-stop fix is already in place (the model no longer halts on padding tokens immediately)
  2. Retrain with more data and epochs
  3. Monitor training metrics
  4. Test generation quality during training

Good luck with retraining! 🚀