Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00
commit 3d2da94ce2
60 changed files with 25153 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,368 @@
+# SheepOp LLM 🐑➡️🤖
+
+**Author:** Carlos Gutierrez  
+**Email:** carlos.gutierrez@carg.dev  
+**License:** Apache 2.0
+
+A modern language model implementation from scratch, incorporating insights from recent research papers.
+
+---
+
+## Purpose of the Project
+
+SheepOp LLM is a comprehensive transformer-based language model implementation designed for:
+
+- **Research & Education**: Understanding how large language models work from the ground up
+- **Custom Training**: Training models on domain-specific data (PDFs, code, text files)
+- **Production Deployment**: Optimized inference with KV caching and efficient attention mechanisms
+- **Multi-Format Data Processing**: Support for various data types including PDFs, images (OCR), code files, and text
+
+The project provides a complete toolkit for building, training, and deploying transformer language models with modern best practices.
+
+---
+
+## Documentation Index
+
+All detailed documentation is available in the [`docs/`](docs/) folder:
+
+### Core Concepts
+
+- **[Complete Guide](docs/COMPLETE_GUIDE.md)** - Full project documentation with mathematical foundations, architecture, and usage
+- **[Architecture](docs/ARCHITECTURE.md)** - System architecture and design patterns
+- **[Mathematics](docs/MATHEMATICS.md)** - Complete mathematical derivations for all components
+
+### Component Explanations
+
+- **[Embeddings](docs/EMBEDDINGS_EXPLAINED.md)** - What embeddings are and how they work
+- **[Attention](docs/ATTENTION_EXPLAINED.md)** - Attention mechanisms explained step-by-step
+- **[Feed-Forward](docs/FEED_FORWARD_EXPLAINED.md)** - Feed-forward networks explained
+- **[Normalization](docs/NORMALIZATION_EXPLAINED.md)** - Layer normalization explained
+- **[Neural Networks](docs/NEURAL_NETWORK_EXPLAINED.md)** - Neural networks, neurons, and weights explained
+
+### Training & Optimization
+
+- **[Training](docs/TRAINING_EXPLAINED.md)** - What training is, why it needs data, why more data is better, and how to interpret training metrics
+- **[Optimization](docs/OPTIMIZATION_EXPLAINED.md)** - Optimizers (AdamW, gradient descent) explained
+- **[Scheduling](docs/SCHEDULING_EXPLAINED.md)** - Learning rate scheduling explained
+- **[Generation](docs/GENERATION_EXPLAINED.md)** - Text generation and sampling strategies
+
+### Data & Processing
+
+- **[Data Processing](docs/DATA_PROCESSING_EXPLAINED.md)** - How data processing works step-by-step
+- **[Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)** - Working with PDFs, images, code files
+- **[Data Guide](docs/DATA_GUIDE.md)** - General data handling guide
+- **[Database Extraction Guide](docs/DATABASE_EXTRACTION_GUIDE.md)** - Extracting data from databases
+- **[Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)** - Automatically downloading GitHub repositories for code training
+
+### Advanced Topics
+
+- **[Control System Model](docs/CONTROL_SYSTEM_MODEL.md)** - Mathematical control system formulation
+- **[Optimizations](docs/OPTIMIZATIONS.md)** - Performance optimizations
+- **[Retraining Guide](docs/RETRAINING_GUIDE.md)** - How to retrain models
+
+---
+
+## Common Questions
+
+### Getting Started
+
+**Q: How do I get started with this project?**  
+**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Quick Start section
+
+**Q: What do I need to install?**  
+**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Installation section
+
+**Q: How do I train my first model?**  
+**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Usage section
+
+### Understanding Concepts
+
+**Q: What are embeddings?**  
+**A:** See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)
+
+**Q: How does attention work?**  
+**A:** See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+**Q: What is a feed-forward network?**  
+**A:** See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)
+
+**Q: Why do we need normalization?**  
+**A:** See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)
+
+**Q: How do neural networks work?**  
+**A:** See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+**Q: What is a neuron and what are weights?**  
+**A:** See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+### Training Questions
+
+**Q: What is training and why do we need it?**  
+**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+**Q: Why do we need data for training?**  
+**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md) - Why Do We Need Data section
+
+**Q: Why is more data better?**  
+**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md) - Why More Data is Better section
+
+**Q: How does the optimizer work?**  
+**A:** See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Q: What is learning rate scheduling?**  
+**A:** See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)
+
+### Data Questions
+
+**Q: How does data processing work?**  
+**A:** See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)
+
+**Q: Can I train on PDFs?**  
+**A:** See [Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)
+
+**Q: Can I train on images?**  
+**A:** See [Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)
+
+**Q: How do I process different file types?**  
+**A:** See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)
+
+**Q: How do I download code repositories automatically?**  
+**A:** See [Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)
+
+### Generation Questions
+
+**Q: How does text generation work?**  
+**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Q: What is temperature in generation?**  
+**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md) - Temperature section
+
+**Q: What are top-k and top-p sampling?**  
+**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md) - Top-k and Top-p sections
+
+### Mathematical Questions
+
+**Q: What are the mathematical foundations?**  
+**A:** See [Mathematics](docs/MATHEMATICS.md) or [Complete Guide](docs/COMPLETE_GUIDE.md) - Mathematical Foundations section
+
+**Q: How do I understand the complete mathematical model?**  
+**A:** See [Mathematics](docs/MATHEMATICS.md) for step-by-step derivations
+
+**Q: Is there a control system perspective?**  
+**A:** See [Control System Model](docs/CONTROL_SYSTEM_MODEL.md)
+
+### Architecture Questions
+
+**Q: How is the architecture designed?**  
+**A:** See [Architecture](docs/ARCHITECTURE.md)
+
+**Q: What is the complete system flow?**  
+**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Architecture Explained section
+
+### Advanced Questions
+
+**Q: How do I optimize inference?**  
+**A:** See [Optimizations](docs/OPTIMIZATIONS.md)
+
+**Q: How do I retrain a model?**  
+**A:** See [Retraining Guide](docs/RETRAINING_GUIDE.md)
+
+**Q: How do I extract data from databases?**  
+**A:** See [Database Extraction Guide](docs/DATABASE_EXTRACTION_GUIDE.md)
+
+**Q: How do I download GitHub repositories for code training?**  
+**A:** See [Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)
+
+---
+
+## Glossary
+
+### A
+
+**AdamW** - Variant of the Adam optimizer that combines adaptive per-parameter learning rates with decoupled weight decay. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Attention** - Mechanism that determines how much each token should attend to every other token. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+**Autoregressive** - Generation method where the model uses its own previous outputs as inputs. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+### B
+
+**Backpropagation** - Algorithm for computing gradients through the network. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+**Batch** - Small group of examples processed together during training. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+**Bias** - Learnable constant added to the weighted sum in neural networks. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+### C
+
+**Causal Masking** - Prevents tokens from attending to future tokens. See [Complete Guide](docs/COMPLETE_GUIDE.md)
+
+**Cosine Annealing** - Learning rate schedule that follows a cosine curve. See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)
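+
+As a sketch of what such a schedule looks like, a common cosine annealing form is shown below. The symbols `eta_max`, `eta_min`, and `T_max` (peak rate, floor rate, total steps) are assumptions for illustration; the schedule described in Scheduling Explained may add warmup or use different names.
+
+```latex
+\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T_{\max}}\right),
+\qquad 0 \le t \le T_{\max}
+```
+
+At `t = 0` the cosine term is 1, so the rate equals `eta_max`; at `t = T_max / 2` it is the midpoint of the two rates; at `t = T_max` it reaches `eta_min`.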
+
+**Cross-Entropy Loss** - Loss function for classification tasks, including next-token prediction. See [Mathematics](docs/MATHEMATICS.md)
+
+### D
+
+**Data Processing** - Transformation of raw files into training-ready text. See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)
+
+**Decoder** - Part of a transformer that generates output. See [Architecture](docs/ARCHITECTURE.md)
+
+**Dropout** - Regularization technique that randomly sets activations to zero during training. See [Complete Guide](docs/COMPLETE_GUIDE.md)
+
+### E
+
+**Embedding** - Numerical representation of words/tokens. See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)
+
+**Epoch** - One complete pass through the training data. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+**Evaluation** - Process of measuring model performance. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+### F
+
+**Feed-Forward Network (FFN)** - Two-layer neural network that transforms features. See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)
+
+**Forward Pass** - Computing predictions from inputs through the model. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+### G
+
+**GELU** - Gaussian Error Linear Unit activation function. See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)
+
+**Generation** - Process of creating new text from a trained model. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Gradient** - Derivative of loss with respect to parameters. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Gradient Clipping** - Technique to prevent exploding gradients. See [Complete Guide](docs/COMPLETE_GUIDE.md)
+
+**Gradient Descent** - Basic optimization algorithm. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+### H
+
+**Hidden State** - Intermediate representation in the model. See [Architecture](docs/ARCHITECTURE.md)
+
+### L
+
+**Layer Normalization** - Normalization technique that standardizes each token's activations across the feature dimension. See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)
+
+**Learning Rate** - Step size for weight updates. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Logits** - Raw scores before applying softmax. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Loss** - Measure of prediction error. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+### M
+
+**Momentum** - Technique to accelerate gradient descent. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Multi-Head Attention** - Attention mechanism with multiple parallel heads. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+### N
+
+**Neural Network** - Computational model inspired by biological neurons. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+**Neuron** - Basic processing unit in neural networks. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+**Normalization** - Technique to standardize activations. See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)
+
+**Nucleus Sampling (Top-p)** - Sampling strategy that keeps the smallest set of highest-probability tokens whose cumulative probability is at least p. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
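+
+The selection rule in the Nucleus Sampling entry can be written out as follows. This is the standard formulation; `V_p` (the kept set) and `P'` (the renormalized distribution) are illustrative notation introduced here, not taken from the project docs.
+
+```latex
+V_p = \text{the smallest set of tokens, taken in descending order of probability, such that } \sum_{x \in V_p} P(x) \ge p,
+\qquad
+P'(x) = \begin{cases} \dfrac{P(x)}{\sum_{x' \in V_p} P(x')} & x \in V_p \\ 0 & \text{otherwise} \end{cases}
+```
+
+For example, with token probabilities (0.5, 0.3, 0.15, 0.05) and p = 0.8, the first two tokens already sum to 0.8, so only they are kept and their renormalized probabilities are 0.625 and 0.375.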
+
+### O
+
+**Optimization** - Process of finding optimal weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Optimizer** - Algorithm that updates model weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Overfitting** - Model memorizes training data but doesn't generalize. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+### P
+
+**Perplexity** - Measure of model uncertainty, computed as the exponential of the mean cross-entropy loss (exp(loss)). See [Mathematics](docs/MATHEMATICS.md)
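+
+As a rough worked example of the relationship in the Perplexity entry, assume the loss is the mean cross-entropy over `N` tokens, measured in nats (an assumption; the Mathematics document may use different notation):
+
+```latex
+\mathcal{L} = -\frac{1}{N}\sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t}),
+\qquad
+\mathrm{PPL} = \exp(\mathcal{L})
+```
+
+Under that assumption, a loss of 2.0 gives a perplexity of e^2.0 ≈ 7.39, which can be read as the model being about as uncertain as a uniform choice among roughly 7 tokens.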
+
+**Positional Encoding** - Adds position information to embeddings. See [Complete Guide](docs/COMPLETE_GUIDE.md)
+
+**Pre-norm** - Architecture where normalization comes before sublayers. See [Architecture](docs/ARCHITECTURE.md)
+
+**Probability Distribution** - Distribution over possible next tokens. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+### Q
+
+**Query (Q)** - One of the three representations in attention, alongside key and value (what am I looking for?). See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+### R
+
+**Residual Connection** - Skip connection that adds input to output. See [Architecture](docs/ARCHITECTURE.md)
+
+### S
+
+**Sampling** - Process of selecting a token from a probability distribution. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Scheduling** - Adjusting learning rate during training. See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)
+
+**Self-Attention** - Attention mechanism where queries, keys, and values all come from the same input. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+**Softmax** - Function that converts logits to probabilities. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+### T
+
+**Temperature** - Parameter controlling randomness in sampling. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
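+
+To make the effect of temperature concrete, the usual temperature-scaled softmax is shown below, where `z_i` are the logits and `T` is the temperature. This is the textbook form and is assumed, not confirmed, to match the one in Generation Explained.
+
+```latex
+P(x_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
+```
+
+For logits (2, 1, 0), `T = 1` gives probabilities of about (0.665, 0.245, 0.090), while `T = 2` flattens them to about (0.507, 0.307, 0.186): higher temperature spreads probability toward less likely tokens, and lower temperature concentrates it on the top token.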
+
+**Token** - Basic unit of text (word, subword, or character). See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+**Tokenization** - Process of converting text to tokens. See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)
+
+**Top-k Sampling** - Sampling strategy keeping only top k tokens. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Top-p Sampling** - Another name for nucleus sampling. See [Generation Explained](docs/GENERATION_EXPLAINED.md)
+
+**Training** - Process of teaching a model to make predictions. See [Training Explained](docs/TRAINING_EXPLAINED.md)
+
+**Transformer** - Neural network architecture based on attention. See [Architecture](docs/ARCHITECTURE.md)
+
+### V
+
+**Value (V)** - One of the three representations in attention, alongside query and key (what information do I contain?). See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
+
+**Vocabulary** - Set of all possible tokens. See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)
+
+### W
+
+**Weight** - Parameter in neural network that controls connection strength. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+**Weight Decay** - Regularization technique that penalizes large weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)
+
+**Weight Matrix** - Matrix containing all weights for a layer. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
+
+---
+
+## Quick Links
+
+- **Complete Documentation**: [docs/COMPLETE_GUIDE.md](docs/COMPLETE_GUIDE.md)
+- **Mathematical Foundations**: [docs/MATHEMATICS.md](docs/MATHEMATICS.md)
+- **System Architecture**: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
+- **Control System Model**: [docs/CONTROL_SYSTEM_MODEL.md](docs/CONTROL_SYSTEM_MODEL.md)
+
+---
+
+## License
+
+This project is licensed under the **Apache License 2.0**.
+
+See [LICENSE](LICENSE) or [LICENSE.txt](LICENSE.txt) for the full license text.
+
+**Summary:**
+- ✅ Free to use, modify, and distribute
+- ✅ Commercial use allowed
+- ✅ Patent grant included
+- ✅ Private use allowed
+- ⚠️ Must include license and copyright notice
+- ⚠️ Modified files must carry a notice stating that they were changed
+
+---
+
+## Contact
+
+**Carlos Gutierrez**  
+Email: carlos.gutierrez@carg.dev
+
+---
+
+*This README serves as an index to the comprehensive documentation available in the `docs/` folder.*