Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00
commit 3d2da94ce2
60 changed files with 25153 additions and 0 deletions
--- a/docs/DATA_GUIDE.md
+++ b/docs/DATA_GUIDE.md
@@ -0,0 +1,225 @@
+# Data Collection Guide
+
+This guide shows you how to get training data from the internet or create your own data.txt file.
+
+## Option 1: Use the Download Script
+
+### Quick Start
+
+```bash
+# Download Shakespeare text (recommended for testing)
+python download_data.py --type shakespeare
+
+# Create a sample data file
+python download_data.py --type sample --output data/my_data.txt --samples 200
+
+# Download Wikipedia article (requires: pip install wikipedia)
+python download_data.py --type wikipedia --title "Artificial Intelligence" --output data/ai_article.txt
+```
+
+### Available Options
+
+**Shakespeare Dataset:**
+```bash
+python download_data.py --type shakespeare
+```
+Downloads classic Shakespeare text - great for testing!
+
+**Create Sample Data:**
+```bash
+python download_data.py --type sample --output data/my_data.txt --samples 100
+```
+Creates a file with sample sentences about ML/AI.
+
+**Wikipedia Article:**
+```bash
+python download_data.py --type wikipedia --title "Machine Learning" --output data/ml_article.txt
+```
+Downloads a Wikipedia article (requires `pip install wikipedia`).
+
+## Option 2: Manual Data Collection
+
+### Method A: Create Your Own data.txt
+
+1. **Create a text file:**
+```bash
+nano data/my_data.txt
+# or
+vim data/my_data.txt
+```
+
+2. **Add your text** (one sentence per line):
+```
+This is my first training sample.
+This is my second training sample.
+Add as many lines as you want.
+```
+
+3. **Save and use:**
+```bash
+python train.py --data data/my_data.txt
+```
+
+### Method B: Download from Public Datasets
+
+**Shakespeare Text:**
+```bash
+curl -o data/shakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
+```
+
+**Book Corpus Sample:**
+```bash
+# Download Project Gutenberg books
+curl -o data/book.txt https://www.gutenberg.org/files/1342/1342-0.txt  # Pride and Prejudice
+```
+
+**News Articles:**
+```bash
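+# Download a news summary dataset (this is a CSV file, not plain text;
+# extract the text column into one sample per line before training)
+curl -o data/news.csv https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv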
+```
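
Because the news file above is a CSV rather than plain text, it needs one extra step before it matches the one-sample-per-line format. The following is a minimal sketch using only the standard library. The function name `csv_column_to_lines` is hypothetical, and the column name you pass in is an assumption to check against the header row of the file you actually downloaded.

```python
import csv
import io


def csv_column_to_lines(csv_text, column):
    """Return one cleaned text sample per CSV row, taken from `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # A missing column or empty cell yields an empty string and is skipped.
        value = row.get(column) or ""
        # Collapse embedded newlines and repeated spaces so each sample stays on one line.
        value = " ".join(value.split())
        if value:
            lines.append(value)
    return lines


# Example usage (paths and column name are assumptions):
#   with open("data/news.csv", encoding="utf-8", newline="") as f:
#       samples = csv_column_to_lines(f.read(), "text")
#   with open("data/news.txt", "w", encoding="utf-8") as f:
#       f.write("\n".join(samples) + "\n")
```

If reading the file as UTF-8 fails, the CSV may use a different encoding, so that may need adjusting in the `open` call.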
+
+### Method C: Scrape Your Own Data
+
+**From Wikipedia (Python):**
+```python
+import wikipedia
+
+page = wikipedia.page("Machine Learning")
+with open("data/ml_article.txt", "w", encoding="utf-8") as f:
+    f.write(page.content)
+```
+
+**From a Website:**
+```python
+import requests
+from bs4 import BeautifulSoup
+
+url = "https://example.com/article"
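+response = requests.get(url, timeout=30)
+response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
+soup = BeautifulSoup(response.text, 'html.parser')
+text = soup.get_text()
+
+with open("data/scraped.txt", "w", encoding="utf-8") as f: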
+    f.write(text)
+```
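
If installing `beautifulsoup4` is not an option, a rough equivalent can be sketched with the standard library's `html.parser`. This variant skips `<script>` and `<style>` contents and groups text by block-level element, so each paragraph or heading ends up on its own line. The class name `BlockTextParser`, the function `html_to_lines`, and the chosen set of block tags are all illustrative assumptions, not part of this project.

```python
from html.parser import HTMLParser


class BlockTextParser(HTMLParser):
    """Collect visible text, one entry per block-level element."""

    SKIPPED = {"script", "style"}
    BLOCKS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Join buffered fragments and normalize whitespace into one line.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if self._skip_depth == 0:
            self._buffer.append(data)


def html_to_lines(html):
    """Return the visible text of an HTML string as a list of lines."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # Emit any text that followed the last block tag
    return parser.lines
```

Writing `"\n".join(html_to_lines(response.text))` to a UTF-8 file would then give one block of page text per line. Navigation menus and footers are still included, so some manual cleanup is likely.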
+
+## Option 3: Use Existing Datasets
+
+### Popular NLP Datasets
+
+**WikiText-2:**
+```bash
+# Download WikiText-2 (if this mirror is unavailable, obtain the archive from another host)
+wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
+unzip wikitext-2-v1.zip
+# Use: wikitext-2/wiki.train.tokens
+```
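
The `wiki.train.tokens` file is not quite in one-sample-per-line form as downloaded. Assuming the usual WikiText layout, where headings look like ` = Title = ` and paragraphs are separated by blank lines, a small filter can keep only the paragraph lines. The helper name `wikitext_paragraphs` is hypothetical.

```python
def wikitext_paragraphs(lines):
    """Yield paragraph lines, dropping blank lines and '= Heading =' lines."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # Blank separator line
        if text.startswith("=") and text.endswith("="):
            continue  # Section heading such as "= = Career = ="
        yield text


# Example usage (paths are assumptions):
#   with open("wikitext-2/wiki.train.tokens", encoding="utf-8") as src, \
#        open("data/wikitext2.txt", "w", encoding="utf-8") as dst:
#       for paragraph in wikitext_paragraphs(src):
#           dst.write(paragraph + "\n")
```

Each yielded line is a whole paragraph, so it may still need splitting into sentences if shorter samples are preferred.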
+
+**OpenWebText Sample:**
+```bash
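+# Placeholder: the URL below is the Tiny Shakespeare file, not OpenWebText.
+# Replace it with the location of an actual OpenWebText sample before use.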
+curl -o data/openwebtext_sample.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
+```
+
+**BookCorpus:**
+```bash
+# Various book sources available
+# Check: https://github.com/soskek/bookcorpus
+```
+
+## Data Format Requirements
+
+Your `data.txt` file should:
+- Have **one text sample per line**
+- Use **UTF-8 encoding**
+- Be **plain text** (no special formatting)
+
+**Example format:**
+```
+This is the first training example.
+This is the second training example.
+Each line becomes one training sample.
+```
+
+**Good:**
+```
+Hello world!
+This is a sentence.
+Machine learning is cool.
+```
+
+**Bad:**
+```
+This is paragraph 1 with multiple sentences. This is sentence 2.
+This is paragraph 2.
+```
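
The format rules above can be checked mechanically before training. The following is a rough sketch, not part of the project's tooling: `check_training_text` is a hypothetical helper, and the multi-sentence test is only a heuristic (it flags sentence-ending punctuation followed by more text), so it will also flag abbreviations such as "e.g. this".

```python
import re

# Sentence-ending punctuation followed by whitespace and further text.
MULTI_SENTENCE = re.compile(r"[.!?]\s+\S")


def check_training_text(raw):
    """Return a list of problems found in the raw bytes of a data file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8 (byte offset {exc.start})"]

    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Assumption: a blank line would become an empty sample.
            problems.append(f"line {number}: blank line")
        elif MULTI_SENTENCE.search(line):
            problems.append(f"line {number}: looks like more than one sentence")
    return problems


# Example usage (path is an assumption):
#   with open("data/my_data.txt", "rb") as f:
#       for problem in check_training_text(f.read()):
#           print(problem)
```

An empty result means the file passed these checks, not that it is guaranteed to train well.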
+
+## Preprocessing Tips
+
+1. **Clean your data:**
+```python
+import re
+
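+with open("raw_data.txt", "r", encoding="utf-8") as f:
+    text = f.read()
+
+# Collapse runs of whitespace (including newlines) into single spaces
+text = re.sub(r'\s+', ' ', text)
+
+# Split after sentence-ending punctuation, keeping the punctuation
+sentences = re.split(r'(?<=[.!?])\s+', text)
+
+# Write one per line
+with open("data/cleaned_data.txt", "w", encoding="utf-8") as f: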
+    for sentence in sentences:
+        if sentence.strip():
+            f.write(sentence.strip() + '\n')
+```
+
+2. **Split long texts:**
+```python
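+# If you have long texts, split them into sentences
+import re
+
+text = "Long paragraph here. Another sentence. More text."
+# Split after sentence-ending punctuation so each sentence keeps its period
+sentences = re.split(r'(?<=[.!?])\s+', text)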
+for sentence in sentences:
+    if sentence.strip():
+        print(sentence.strip())
+```
+
+## Quick Test
+
+1. **Create a small test file:**
+```bash
+mkdir -p data
+cat > data/test.txt << 'EOF'
+Hello world!
+This is a test.
+Language models are cool.
+EOF
+```
+
+2. **Train with it:**
+```bash
+python train.py --data data/test.txt --output ./checkpoints
+```
+
+## Recommended Data Sources
+
+- **Small (for testing):** Shakespeare text, sample_data.txt
+- **Medium (for training):** Wikipedia articles, news articles
+- **Large (for serious training):** WikiText-2, BookCorpus, OpenWebText
+
+## Next Steps
+
+Once you have your data.txt file:
+
+```bash
+# Train your model
+python train.py --data data/your_data.txt --output ./checkpoints
+
+# Or use the sample data
+python train.py --data data/sample_data.txt --output ./checkpoints
+```
+
+Happy training! 🚀
+