Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
This commit is contained in:
Carlos Gutierrez · 2025-11-06 22:07:41 -05:00 · commit 3d2da94ce2
60 changed files with 25,153 additions and 0 deletions

# Database Extraction Guide
This guide shows you how to extract text from your 1TB database for training.
## Quick Start
### SQLite Database
```bash
# Extract from SQLite database
python3 extract_from_database.py \
--type sqlite \
--db-path /path/to/your/database.db \
--table your_table_name \
--column text_column_name \
--output data/database_training.txt \
--limit 1000000 # Limit to 1M samples (or omit for all)
```
### PostgreSQL Database
```bash
# Install PostgreSQL driver first
pip install psycopg2-binary
# Extract with SQL query
python3 extract_from_database.py \
--type sql \
--connection "host=localhost dbname=mydb user=myuser password=mypass" \
--query "SELECT text_column FROM your_table WHERE length(text_column) > 50" \
--output data/database_training.txt \
--limit 1000000
```
### MySQL Database
```bash
# Install MySQL driver first
pip install pymysql
# Extract with SQL query
python3 extract_from_database.py \
--type sql \
--connection "mysql+pymysql://user:pass@localhost/dbname" \
--query "SELECT text_column FROM your_table" \
--output data/database_training.txt
```
### JSON/JSONL Files
```bash
# Extract from JSON Lines file
python3 extract_from_database.py \
--type json \
--json-path /path/to/data.jsonl \
--text-field content \
--output data/database_training.txt \
--limit 1000000
```
## Examples
### Example 1: Extract All Text from SQLite Table
```bash
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table articles \
--column body_text \
--output data/training_data.txt
```
### Example 2: Extract Filtered Data (Longer Texts Only)
```bash
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table articles \
--column body_text \
--where "WHERE length(body_text) > 200" \
--output data/training_data.txt \
--min-length 50
```
### Example 3: Extract from Multiple Tables
```bash
# Extract from table 1
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table articles \
--column content \
--output data/articles.txt
# Extract from table 2
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table comments \
--column text \
--output data/comments.txt
# Combine files
cat data/articles.txt data/comments.txt > data/combined_training.txt
```
### Example 4: PostgreSQL with Complex Query
```bash
python3 extract_from_database.py \
--type sql \
--connection "host=localhost dbname=mydb user=myuser password=mypass" \
--query "SELECT description FROM products WHERE description IS NOT NULL AND length(description) > 100 UNION SELECT review_text FROM reviews WHERE review_text IS NOT NULL" \
--output data/products_and_reviews.txt
```
## Options
### Filtering Options
```bash
# Only extract texts longer than 100 characters
--min-length 100
# Limit total samples
--limit 1000000
# Add WHERE clause (SQLite)
--where "WHERE created_at > '2024-01-01' AND length(text) > 200"
```
### Output Options
```bash
# Custom output path
--output data/my_training_data.txt
# Don't clean/split text (preserve original format)
--no-clean
```
## Performance Tips
1. **Use LIMIT for Testing**: Start with `--limit 10000` to test
2. **Filter in Database**: Use `--where` clause to filter at database level (faster)
3. **Batch Processing**: The script processes in batches automatically
4. **Monitor Progress**: Progress updates every 1000 texts
## Data Format
The output file contains:
- One text sample per line
- Text cleaned and split into sentences (unless `--no-clean` is set)
- Minimum-length filtering applied
- UTF-8 encoding
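The "cleaned and split into sentences" step can be sketched as a whitespace collapse followed by a naive sentence split. This is an assumption about how such a pass typically works, not the script's actual rules:

```python
import re

def clean_and_split(text, min_length=50):
    """Collapse whitespace, split on sentence boundaries, drop short fragments."""
    text = re.sub(r"\s+", " ", text).strip()
    # Naive split: a ., ! or ? followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if len(s) >= min_length]
```

A punctuation-based split like this mishandles abbreviations ("e.g. ", "Dr. "), so real pipelines often use a proper sentence tokenizer instead.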
## Next Steps
After extraction:
```bash
# Check how much data you extracted
wc -l data/database_training.txt
# Train with the extracted data
python3 train.py --data data/database_training.txt --config config.json --device mps
```
## Troubleshooting
### SQLite Database Locked
- Close any applications using the database
- Copy database to a local location first
### Large Database (1TB)
- Use `--limit` to extract in batches
- Use `--where` to filter at database level
- Consider extracting to multiple files and combining
### Memory Issues
- The script processes in batches (streaming)
- Use `--limit` to control size
- Process in chunks if needed
## Example Workflow
```bash
# 1. Extract 1M samples for testing
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table your_table \
--column text_column \
--output data/test_extraction.txt \
--limit 1000000
# 2. Check the data
head -20 data/test_extraction.txt
wc -l data/test_extraction.txt
# 3. If good, extract more (or all)
python3 extract_from_database.py \
--type sqlite \
--db-path /Volumes/YourDisk/database.db \
--table your_table \
--column text_column \
--output data/full_training.txt
# 4. Train with the data
python3 train.py --data data/full_training.txt --config config.json --device mps
```
Good luck extracting your 1TB database! 🚀