# SheepOp LLM 🐑➡️🤖

**Author:** Carlos Gutierrez
**Email:** carlos.gutierrez@carg.dev
**License:** Dual license: Apache 2.0 (research and non-commercial use) + commercial license (commercial use)

A modern language model implementation from scratch, incorporating insights from recent research papers.

---

## Purpose of the Project

SheepOp LLM is a comprehensive transformer-based language model implementation designed for:

- **Research & Education**: Understanding how large language models work from the ground up
- **Custom Training**: Training models on domain-specific data (PDFs, code, text files)
- **Production Deployment**: Optimized inference with KV caching and efficient attention mechanisms
- **Multi-Format Data Processing**: Support for various data types including PDFs, images (OCR), code files, and text

The project provides a complete toolkit for building, training, and deploying transformer language models with modern best practices.

---

## Documentation Index

All detailed documentation is available in the [`docs/`](docs/) folder:

### Core Concepts

- **[Complete Guide](docs/COMPLETE_GUIDE.md)** - Full project documentation with mathematical foundations, architecture, and usage
- **[Architecture](docs/ARCHITECTURE.md)** - System architecture and design patterns
- **[Mathematics](docs/MATHEMATICS.md)** - Complete mathematical derivations for all components

### Component Explanations

- **[Embeddings](docs/EMBEDDINGS_EXPLAINED.md)** - What embeddings are and how they work
- **[Attention](docs/ATTENTION_EXPLAINED.md)** - Attention mechanisms explained step-by-step
- **[Feed-Forward](docs/FEED_FORWARD_EXPLAINED.md)** - Feed-forward networks explained
- **[Normalization](docs/NORMALIZATION_EXPLAINED.md)** - Layer normalization explained
- **[Neural Networks](docs/NEURAL_NETWORK_EXPLAINED.md)** - Neural networks, neurons, and weights explained

### Training & Optimization

- **[Training](docs/TRAINING_EXPLAINED.md)** - What training is, why we need data, why more data is better, and how to interpret training metrics
- **[Optimization](docs/OPTIMIZATION_EXPLAINED.md)** - Optimizers (AdamW, gradient descent) explained
- **[Scheduling](docs/SCHEDULING_EXPLAINED.md)** - Learning rate scheduling explained
- **[Generation](docs/GENERATION_EXPLAINED.md)** - Text generation and sampling strategies

### Data & Processing

- **[Data Processing](docs/DATA_PROCESSING_EXPLAINED.md)** - How data processing works step-by-step
- **[Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)** - Working with PDFs, images, and code files
- **[Data Guide](docs/DATA_GUIDE.md)** - General data handling guide
- **[Database Extraction Guide](docs/DATABASE_EXTRACTION_GUIDE.md)** - Extracting data from databases
- **[Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)** - Automatically downloading GitHub repositories for code training

### Advanced Topics

- **[Control System Model](docs/CONTROL_SYSTEM_MODEL.md)** - Mathematical control system formulation
- **[Optimizations](docs/OPTIMIZATIONS.md)** - Performance optimizations
- **[Retraining Guide](docs/RETRAINING_GUIDE.md)** - How to retrain models

---

## Common Questions

### Getting Started

**Q: How do I get started with this project?**
**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Quick Start section

**Q: What do I need to install?**
**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Installation section

**Q: How do I train my first model?**
**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Usage section

### Understanding Concepts

**Q: What are embeddings?**
**A:** See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)

**Q: How does attention work?**
**A:** See [Attention Explained](docs/ATTENTION_EXPLAINED.md) (the core formula is quoted at the end of this subsection)

**Q: What is a feed-forward network?**
**A:** See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)

**Q: Why do we need normalization?**
**A:** See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)

**Q: How do neural networks work?**
**A:** See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

**Q: What is a neuron and what are weights?**
**A:** See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
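For quick reference alongside the attention questions above: the standard scaled dot-product attention formula, which [Attention Explained](docs/ATTENTION_EXPLAINED.md) and [Mathematics](docs/MATHEMATICS.md) walk through step-by-step, is

```latex
% Q = queries, K = keys, V = values; d_k = dimension of each key vector.
% Dividing by sqrt(d_k) keeps the dot products from growing with dimension
% and saturating the softmax.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```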
### Training Questions

**Q: What is training and why do we need it?**
**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md)

**Q: Why do we need data for training?**
**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md) - Why Do We Need Data section

**Q: Why is more data better?**
**A:** See [Training Explained](docs/TRAINING_EXPLAINED.md) - Why More Data is Better section

**Q: How does the optimizer work?**
**A:** See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Q: What is learning rate scheduling?**
**A:** See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)

### Data Questions

**Q: How does data processing work?**
**A:** See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)

**Q: Can I train on PDFs?**
**A:** See [Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)

**Q: Can I train on images?**
**A:** See [Multi-Format Data Guide](docs/MULTI_FORMAT_DATA_GUIDE.md)

**Q: How do I process different file types?**
**A:** See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)

**Q: How do I download code repositories automatically?**
**A:** See [Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)

### Generation Questions

**Q: How does text generation work?**
**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Q: What is temperature in generation?**
**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md) - Temperature section

**Q: What is top-k and top-p sampling?**
**A:** See [Generation Explained](docs/GENERATION_EXPLAINED.md) - Top-k and Top-p sections (a minimal sampling sketch follows this Common Questions section)

### Mathematical Questions

**Q: What are the mathematical foundations?**
**A:** See [Mathematics](docs/MATHEMATICS.md) or [Complete Guide](docs/COMPLETE_GUIDE.md) - Mathematical Foundations section

**Q: How do I understand the complete mathematical model?**
**A:** See [Mathematics](docs/MATHEMATICS.md) for step-by-step derivations

**Q: Is there a control system perspective?**
**A:** See [Control System Model](docs/CONTROL_SYSTEM_MODEL.md)

### Architecture Questions

**Q: How is the architecture designed?**
**A:** See [Architecture](docs/ARCHITECTURE.md)

**Q: What is the complete system flow?**
**A:** See [Complete Guide](docs/COMPLETE_GUIDE.md) - Architecture Explained section

### Advanced Questions

**Q: How do I optimize inference?**
**A:** See [Optimizations](docs/OPTIMIZATIONS.md)

**Q: How do I retrain a model?**
**A:** See [Retraining Guide](docs/RETRAINING_GUIDE.md)

**Q: How do I extract data from databases?**
**A:** See [Database Extraction Guide](docs/DATABASE_EXTRACTION_GUIDE.md)

**Q: How do I download GitHub repositories for code training?**
**A:** See [Repository Download Guide](docs/REPOSITORY_DOWNLOAD_GUIDE.md)
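To make the temperature and top-k/top-p answers above concrete, here is a minimal, self-contained NumPy sketch of next-token sampling. It is illustrative only: `sample_next_token` and its parameters are hypothetical names for this example, not part of SheepOp's API; see [Generation Explained](docs/GENERATION_EXPLAINED.md) for the project's actual implementation.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sketch: pick one token id from raw logits."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature: divide logits before the softmax; <1 sharpens the
    # distribution (more deterministic), >1 flattens it (more random).
    logits = logits / max(temperature, 1e-8)

    # Softmax with the usual max-shift for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Top-k: zero out everything outside the k most probable tokens.
        kth_largest = np.sort(probs)[-min(top_k, probs.size)]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    if top_p is not None:
        # Top-p (nucleus): keep the smallest set of tokens whose
        # cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = np.zeros_like(probs)
        keep[order[:cutoff]] = 1.0
        probs = probs * keep

    probs /= probs.sum()  # renormalize whatever survived the filters
    return int(rng.choice(probs.size, p=probs))
```

A typical call might look like `token_id = sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9)`, where `logits` is the model's raw score vector for the next token.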
---

## Glossary

### A

**AdamW** - Advanced optimizer combining adaptive learning rates with weight decay. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Attention** - Mechanism that determines how much each word should consider other words. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)

**Autoregressive** - Generation method where the model uses its own previous outputs as inputs. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

### B

**Backpropagation** - Algorithm for computing gradients through the network. See [Training Explained](docs/TRAINING_EXPLAINED.md)

**Batch** - Small group of examples processed together during training. See [Training Explained](docs/TRAINING_EXPLAINED.md)

**Bias** - Constant added to the weighted sum in neural networks. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

### C

**Causal Masking** - Prevents tokens from attending to future tokens. See [Complete Guide](docs/COMPLETE_GUIDE.md)

**Cosine Annealing** - Learning rate schedule that follows a cosine curve. See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)

**Cross-Entropy Loss** - Loss function for classification tasks. See [Mathematics](docs/MATHEMATICS.md)

### D

**Data Processing** - Transformation of raw files into training-ready text. See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)

**Decoder** - Part of the transformer that generates output. See [Architecture](docs/ARCHITECTURE.md)

**Dropout** - Regularization technique that randomly sets activations to zero. See [Complete Guide](docs/COMPLETE_GUIDE.md)

### E

**Embedding** - Numerical representation of words/tokens. See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)

**Epoch** - One complete pass through the training data. See [Training Explained](docs/TRAINING_EXPLAINED.md)

**Evaluation** - Process of measuring model performance. See [Training Explained](docs/TRAINING_EXPLAINED.md)

### F

**Feed-Forward Network (FFN)** - Two-layer neural network that transforms features. See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)

**Forward Pass** - Computing predictions from inputs through the model. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

### G

**GELU** - Gaussian Error Linear Unit activation function. See [Feed-Forward Explained](docs/FEED_FORWARD_EXPLAINED.md)

**Generation** - Process of creating new text from a trained model. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Gradient** - Derivative of the loss with respect to the parameters. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Gradient Clipping** - Technique to prevent exploding gradients. See [Complete Guide](docs/COMPLETE_GUIDE.md)

**Gradient Descent** - Basic optimization algorithm. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

### H

**Hidden State** - Intermediate representation in the model. See [Architecture](docs/ARCHITECTURE.md)

### K

**Key (K)** - One of the three representations in attention (what can I be matched against?). See [Attention Explained](docs/ATTENTION_EXPLAINED.md)

### L

**Layer Normalization** - Normalization technique applied per layer. See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)

**Learning Rate** - Step size for weight updates. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Logits** - Raw scores before applying softmax. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Loss** - Measure of prediction error. See [Training Explained](docs/TRAINING_EXPLAINED.md)

### M

**Momentum** - Technique to accelerate gradient descent. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Multi-Head Attention** - Attention mechanism with multiple parallel heads. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)
### N

**Neural Network** - Computational model inspired by biological neurons. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

**Neuron** - Basic processing unit in neural networks. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

**Normalization** - Technique to standardize activations. See [Normalization Explained](docs/NORMALIZATION_EXPLAINED.md)

**Nucleus Sampling (Top-p)** - Sampling strategy that keeps the smallest set of tokens whose cumulative probability reaches p. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

### O

**Optimization** - Process of finding optimal weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Optimizer** - Algorithm that updates model weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Overfitting** - Model memorizes training data but doesn't generalize. See [Training Explained](docs/TRAINING_EXPLAINED.md)

### P

**Perplexity** - Measure of model uncertainty (exp(loss); a worked formula follows this glossary). See [Mathematics](docs/MATHEMATICS.md)

**Positional Encoding** - Adds position information to embeddings. See [Complete Guide](docs/COMPLETE_GUIDE.md)

**Pre-norm** - Architecture where normalization comes before the sublayers. See [Architecture](docs/ARCHITECTURE.md)

**Probability Distribution** - Distribution over possible next tokens. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

### Q

**Query (Q)** - One of the three representations in attention (what am I looking for?). See [Attention Explained](docs/ATTENTION_EXPLAINED.md)

### R

**Residual Connection** - Skip connection that adds a layer's input to its output. See [Architecture](docs/ARCHITECTURE.md)

### S

**Sampling** - Process of selecting a token from the probability distribution. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Scheduling** - Adjusting the learning rate during training. See [Scheduling Explained](docs/SCHEDULING_EXPLAINED.md)

**Self-Attention** - Attention mechanism where queries, keys, and values come from the same input. See [Attention Explained](docs/ATTENTION_EXPLAINED.md)

**Softmax** - Function that converts logits to probabilities. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

### T

**Temperature** - Parameter controlling randomness in sampling. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Token** - Basic unit of text (word or character). See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

**Tokenization** - Process of converting text to tokens. See [Data Processing Explained](docs/DATA_PROCESSING_EXPLAINED.md)

**Top-k Sampling** - Sampling strategy that keeps only the k most probable tokens. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Top-p Sampling** - Another name for nucleus sampling. See [Generation Explained](docs/GENERATION_EXPLAINED.md)

**Training** - Process of teaching the model to make predictions. See [Training Explained](docs/TRAINING_EXPLAINED.md)

**Transformer** - Neural network architecture based on attention. See [Architecture](docs/ARCHITECTURE.md)

### V

**Value (V)** - One of the three representations in attention (what information do I contain?). See [Attention Explained](docs/ATTENTION_EXPLAINED.md)

**Vocabulary** - Set of all possible tokens. See [Embeddings Explained](docs/EMBEDDINGS_EXPLAINED.md)

### W

**Weight** - Parameter in a neural network that controls connection strength. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)

**Weight Decay** - Regularization technique that penalizes large weights. See [Optimization Explained](docs/OPTIMIZATION_EXPLAINED.md)

**Weight Matrix** - Matrix containing all the weights for a layer. See [Neural Network Explained](docs/NEURAL_NETWORK_EXPLAINED.md)
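As promised in the **Perplexity** entry, here is how the glossary's loss-related terms fit together. These are the standard definitions (the full derivations live in [Mathematics](docs/MATHEMATICS.md)):

```latex
% Average next-token cross-entropy loss over N tokens, where
% p_theta(x_i | x_<i) is the softmax probability the model assigns
% to the correct token x_i given the preceding tokens:
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i \mid x_{<i})

% Perplexity is the exponential of that loss; lower is better, and a
% perplexity of k roughly means the model is as uncertain as a uniform
% choice among k tokens:
\mathrm{PPL} = e^{\mathcal{L}}
```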
---

## Quick Links

- **Complete Documentation**: [docs/COMPLETE_GUIDE.md](docs/COMPLETE_GUIDE.md)
- **Mathematical Foundations**: [docs/MATHEMATICS.md](docs/MATHEMATICS.md)
- **System Architecture**: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
- **Control System Model**: [docs/CONTROL_SYSTEM_MODEL.md](docs/CONTROL_SYSTEM_MODEL.md)

---

## License

This project is available under a **dual license**:

### Apache 2.0 License (Research & Non-Commercial Use)

**Free for:**

- ✅ Academic research
- ✅ Educational purposes
- ✅ Personal projects
- ✅ Open source contributions
- ✅ Non-commercial use

**Terms:**

- Free to use, modify, and distribute
- Patent grant included
- Must include license and copyright notice
- Must state changes if modifying

### Commercial License (Commercial Use)

**Requires a commercial license for:**

- ⚠️ Commercial products or services
- ⚠️ SaaS applications
- ⚠️ Revenue-generating applications
- ⚠️ Internal business use (for profit-making entities)
- ⚠️ Any use that generates profit or revenue

**To obtain a commercial license:**

Contact: **carlos.gutierrez@carg.dev**
Subject: Commercial License Inquiry - SheepOp

Please include:

- Intended use case
- Expected usage volume
- Company/Organization name
- Contact information

### Citation Requirement

**IMPORTANT:** If you use this software in academic research or publications, you **MUST cite** this work. This is a condition of use for academic purposes.

**Required Citation Format:**

BibTeX:

```bibtex
@software{sheepop2024,
  title = {SheepOp LLM: Transformer-based Language Model Implementation},
  author = {Gutierrez, Carlos},
  year = {2024},
  url = {https://github.com/[your-username]/sheepOp},
  version = {1.0}
}
```

Text format:

```
Carlos Gutierrez. (2024). SheepOp LLM: Transformer-based Language Model Implementation. https://github.com/[your-username]/sheepOp
```

**Note:** Citation is required for academic use. Failure to cite constitutes a violation of the terms of use.

See [LICENSE](LICENSE) or [LICENSE.txt](LICENSE.txt) for the full license text.

---

## Contact

**Carlos Gutierrez**
Email: carlos.gutierrez@carg.dev

---

*This README serves as an index to the comprehensive documentation available in the `docs/` folder.*