Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
docs/DATABASE_EXTRACTION_GUIDE.md · 217 lines · new file
# Database Extraction Guide

This guide shows you how to extract text from your 1TB database for training.

## Quick Start

### SQLite Database

```bash
# Extract from an SQLite database
python3 extract_from_database.py \
    --type sqlite \
    --db-path /path/to/your/database.db \
    --table your_table_name \
    --column text_column_name \
    --output data/database_training.txt \
    --limit 1000000  # Limit to 1M samples (or omit for all)
```
### PostgreSQL Database

```bash
# Install the PostgreSQL driver first
pip install psycopg2-binary

# Extract with a SQL query
python3 extract_from_database.py \
    --type sql \
    --connection "host=localhost dbname=mydb user=myuser password=mypass" \
    --query "SELECT text_column FROM your_table WHERE length(text_column) > 50" \
    --output data/database_training.txt \
    --limit 1000000
```
### MySQL Database

```bash
# Install the MySQL driver first
pip install pymysql

# Extract with a SQL query
python3 extract_from_database.py \
    --type sql \
    --connection "mysql+pymysql://user:pass@localhost/dbname" \
    --query "SELECT text_column FROM your_table" \
    --output data/database_training.txt
```
### JSON/JSONL Files

```bash
# Extract from a JSON Lines file
python3 extract_from_database.py \
    --type json \
    --json-path /path/to/data.jsonl \
    --text-field content \
    --output data/database_training.txt \
    --limit 1000000
```
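A JSONL file holds one JSON object per line, and the extractor pulls the field named by `--text-field` from each object. The core of that step can be sketched as follows (a minimal illustration of the format, not the script's actual implementation; `content` is the assumed field name):

```python
import json

def extract_jsonl(in_path, out_path, text_field="content", min_length=0):
    """Read one JSON object per line, write its text field to the output."""
    written = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines between records
                continue
            record = json.loads(line)
            text = record.get(text_field, "")
            if len(text) >= min_length:
                # one sample per output line, so flatten internal newlines
                dst.write(text.replace("\n", " ") + "\n")
                written += 1
    return written
```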
## Examples

### Example 1: Extract All Text from an SQLite Table

```bash
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table articles \
    --column body_text \
    --output data/training_data.txt
```
### Example 2: Extract Filtered Data (Longer Texts Only)

```bash
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table articles \
    --column body_text \
    --where "WHERE length(body_text) > 200" \
    --output data/training_data.txt \
    --min-length 50
```
### Example 3: Extract from Multiple Tables

```bash
# Extract from table 1
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table articles \
    --column content \
    --output data/articles.txt

# Extract from table 2
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table comments \
    --column text \
    --output data/comments.txt

# Combine the files
cat data/articles.txt data/comments.txt > data/combined_training.txt
```
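Plain `cat` keeps duplicate rows if the same text appears in more than one table. If that matters for your data, you can deduplicate while combining; a minimal sketch (streaming, keeps the first occurrence of each line):

```python
def combine_unique(inputs, out_path):
    """Concatenate text files line by line, dropping exact duplicate lines."""
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as dst:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    key = line.rstrip("\n")
                    if key and key not in seen:  # also drops empty lines
                        seen.add(key)
                        dst.write(key + "\n")
                        kept += 1
    return kept
```

Note the `seen` set holds every unique line in memory, so for outputs approaching disk-scale, `sort -u` on the combined file is the safer route.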
### Example 4: PostgreSQL with a Complex Query

```bash
python3 extract_from_database.py \
    --type sql \
    --connection "host=localhost dbname=mydb user=myuser password=mypass" \
    --query "SELECT description FROM products WHERE description IS NOT NULL AND length(description) > 100 UNION SELECT review_text FROM reviews WHERE review_text IS NOT NULL" \
    --output data/products_and_reviews.txt
```
## Options

### Filtering Options

```bash
# Only extract texts longer than 100 characters
--min-length 100

# Limit the total number of samples
--limit 1000000

# Add a WHERE clause (SQLite)
--where "WHERE created_at > '2024-01-01' AND length(text) > 200"
```
### Output Options

```bash
# Custom output path
--output data/my_training_data.txt

# Don't clean or split the text (preserve the original format)
--no-clean
```
## Performance Tips

1. **Use LIMIT for Testing**: start with `--limit 10000` to verify the pipeline before a full run
2. **Filter in the Database**: a `--where` clause filters at the database level, which is faster than filtering the extracted text afterwards
3. **Batch Processing**: the script processes rows in batches automatically
4. **Monitor Progress**: progress is reported every 1000 texts
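For reference, batched extraction over SQLite looks roughly like this (a sketch of the technique, not the script's actual code; table and column names are placeholders and, since SQL identifiers cannot be bound as query parameters, they are interpolated directly here):

```python
import sqlite3

def iter_texts(db_path, table, column, batch_size=1000):
    """Yield non-empty rows in fixed-size batches so the table never sits in memory."""
    conn = sqlite3.connect(db_path)
    try:
        # table/column are trusted placeholders; identifiers can't be bound with "?"
        cur = conn.execute(f"SELECT {column} FROM {table}")
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for (text,) in rows:
                if text:
                    yield text
    finally:
        conn.close()
```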
## Data Format

The output file has:

- One text sample per line
- Text cleaned and split into sentences
- Minimum-length filtering applied
- UTF-8 encoding
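A quick sanity check of an extracted file against this format can be sketched as follows (the 50-character minimum is an example threshold, and opening with `encoding="utf-8"` makes decoding errors surface on read):

```python
def check_training_file(path, min_length=50):
    """Return (total_lines, short_lines) for a one-sample-per-line text file."""
    total = short = 0
    with open(path, encoding="utf-8") as f:  # raises UnicodeDecodeError on bad bytes
        for line in f:
            total += 1
            if len(line.rstrip("\n")) < min_length:
                short += 1
    return total, short
```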
## Next Steps

After extraction:

```bash
# Check how much data you extracted
wc -l data/database_training.txt

# Train with the extracted data
python3 train.py --data data/database_training.txt --config config.json --device mps
```
## Troubleshooting

### SQLite Database Locked

- Close any applications using the database
- Copy the database to a local location first

### Large Database (1TB)

- Use `--limit` to extract a subset at a time
- Use `--where` to filter at the database level
- Consider extracting to multiple files and combining them

### Memory Issues

- The script processes rows in batches (streaming)
- Use `--limit` to control the output size
- Process the data in chunks if needed
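If a single output file becomes unwieldy, "process in chunks" can be as simple as splitting on line counts; a minimal sketch (coreutils' `split -l` does the same job from the shell):

```python
def split_into_chunks(path, lines_per_chunk, prefix):
    """Write path's lines into numbered chunk files of at most lines_per_chunk each."""
    chunk_paths = []
    dst = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:  # start a new chunk file
                if dst:
                    dst.close()
                chunk_path = f"{prefix}{len(chunk_paths):03d}.txt"
                chunk_paths.append(chunk_path)
                dst = open(chunk_path, "w", encoding="utf-8")
            dst.write(line)
    if dst:
        dst.close()
    return chunk_paths
```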
## Example Workflow

```bash
# 1. Extract 1M samples for testing
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table your_table \
    --column text_column \
    --output data/test_extraction.txt \
    --limit 1000000

# 2. Check the data
head -20 data/test_extraction.txt
wc -l data/test_extraction.txt

# 3. If it looks good, extract more (or all of it)
python3 extract_from_database.py \
    --type sqlite \
    --db-path /Volumes/YourDisk/database.db \
    --table your_table \
    --column text_column \
    --output data/full_training.txt

# 4. Train with the data
python3 train.py --data data/full_training.txt --config config.json --device mps
```

Good luck extracting your 1TB database! 🚀