Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
225
docs/DATA_GUIDE.md
Normal file
@@ -0,0 +1,225 @@
# Data Collection Guide

This guide shows you how to get training data from the internet or create your own data.txt file.

## Option 1: Use the Download Script

### Quick Start

```bash
# Download Shakespeare text (recommended for testing)
python download_data.py --type shakespeare

# Create a sample data file
python download_data.py --type sample --output data/my_data.txt --samples 200

# Download Wikipedia article (requires: pip install wikipedia)
python download_data.py --type wikipedia --title "Artificial Intelligence" --output data/ai_article.txt
```

### Available Options

**Shakespeare Dataset:**
```bash
python download_data.py --type shakespeare
```
Downloads classic Shakespeare text - great for testing!

**Create Sample Data:**
```bash
python download_data.py --type sample --output data/my_data.txt --samples 100
```
Creates a file with sample sentences about ML/AI.

**Wikipedia Article:**
```bash
python download_data.py --type wikipedia --title "Machine Learning" --output data/ml_article.txt
```
Downloads a Wikipedia article (requires `pip install wikipedia`).

## Option 2: Manual Data Collection

### Method A: Create Your Own data.txt

1. **Create a text file:**
   ```bash
   nano data/my_data.txt
   # or
   vim data/my_data.txt
   ```

2. **Add your text** (one sentence per line):
   ```
   This is my first training sample.
   This is my second training sample.
   Add as many lines as you want.
   ```

3. **Save and use:**
   ```bash
   python train.py --data data/my_data.txt
   ```

### Method B: Download from Public Datasets

**Shakespeare Text:**
```bash
curl -o data/shakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```

**Book Corpus Sample:**
```bash
# Download a Project Gutenberg book
curl -o data/book.txt https://www.gutenberg.org/files/1342/1342-0.txt  # Pride and Prejudice
```
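
Project Gutenberg files usually wrap the book in a license header and footer, which you may not want in your training data. The sketch below keeps only the text between the `*** START OF ...` and `*** END OF ...` marker lines. The marker patterns and the function name `strip_gutenberg_boilerplate` are assumptions for illustration; the exact marker wording varies between books, so check your file first.

```python
import re

# Assumed marker patterns; the exact wording differs between Gutenberg files.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END marker lines.

    If either marker is missing, the input is returned unchanged
    (apart from surrounding whitespace).
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() <= start.end():
        return text.strip()
    return text[start.end():end.start()].strip()
```

You could apply this to the contents of `data/book.txt` before splitting the text into one sample per line.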

**News Articles:**
```bash
# Download a news dataset (this file is a CSV, so extract the text column before training)
curl -o data/news.csv https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv
```
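
Because the news file is a CSV rather than plain text, one column has to be pulled out before it matches the one-sample-per-line format. The sketch below does this with the standard `csv` module. The helper name `csv_column_to_lines` and the column name `"text"` are assumptions; inspect the header row of your file to find the right column.

```python
import csv
import io


def csv_column_to_lines(csv_file, column):
    """Yield one whitespace-normalised line per non-empty cell in `column`."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        cell = (row.get(column) or "").strip()
        if cell:
            # Collapse internal newlines so each sample stays on one line
            yield " ".join(cell.split())


# Tiny in-memory example; the column name "text" is an assumption
sample = io.StringIO(
    'headlines,text\n'
    '"Title one","First article\nbody."\n'
    '"Title two","Second body."\n'
)
lines = list(csv_column_to_lines(sample, "text"))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and write each yielded line to your training file.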

### Method C: Scrape Your Own Data

**From Wikipedia (Python):**
```python
import wikipedia

# Fetch the article and save its plain-text content as UTF-8
page = wikipedia.page("Machine Learning")
with open("data/ml_article.txt", "w", encoding="utf-8") as f:
    f.write(page.content)
```

**From a Website** (requires `pip install requests beautifulsoup4`):
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()

# Save the extracted text as UTF-8
with open("data/scraped.txt", "w", encoding="utf-8") as f:
    f.write(text)
```
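
If you would rather avoid extra dependencies, the same idea can be sketched with the standard library's `html.parser`. This is a minimal illustration, not a replacement for BeautifulSoup: the class `VisibleTextExtractor` and the function `html_to_lines` are hypothetical names, and every text node becomes its own line, so a paragraph containing inline tags is split into several lines.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def html_to_lines(html: str) -> list[str]:
    """Return the visible text of an HTML string, one text node per entry."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Joining the result with `"\n".join(...)` gives text that can then be cleaned like any other raw file.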

## Option 3: Use Existing Datasets

### Popular NLP Datasets

**WikiText-2:**
```bash
# Download WikiText-2
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip
# Use: wikitext-2/wiki.train.tokens
```
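
The WikiText token files mix article text with heading lines such as ` = Title = ` and blank separator lines. A minimal sketch for reducing such a file to one paragraph per line might look like this; the function name `wikitext_to_lines` is hypothetical, and the heading test (a line that starts and ends with `=`) is a simple heuristic rather than a full parser.

```python
def wikitext_to_lines(raw: str) -> list[str]:
    """Keep paragraph lines; drop blank lines and '= Heading =' lines."""
    lines = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank separator line
        if stripped.startswith("=") and stripped.endswith("="):
            continue  # section heading such as "= = Gameplay = ="
        lines.append(stripped)
    return lines
```

Read `wikitext-2/wiki.train.tokens` with `encoding="utf-8"`, pass the contents through this function, and write the result one entry per line.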

**OpenWebText Sample:**
```bash
# Download sample
# Note: this URL points to the Tiny Shakespeare file, not OpenWebText;
# it only serves as a small stand-in for testing
curl -o data/openwebtext_sample.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```

**BookCorpus:**
```bash
# Various book sources available
# Check: https://github.com/soskek/bookcorpus
```

## Data Format Requirements

Your `data.txt` file should:
- Have **one text sample per line**
- Use **UTF-8 encoding**
- Be **plain text** (no special formatting)

**Example format:**
```
This is the first training example.
This is the second training example.
Each line becomes one training sample.
```

**Good:**
```
Hello world!
This is a sentence.
Machine learning is cool.
```

**Bad:**
```
This is paragraph 1 with multiple sentences. This is sentence 2.
This is paragraph 2.
```
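
Before training, it can help to check a file against the requirements above. The following is a minimal sketch, not part of the project: `check_data_file` is a hypothetical helper that decodes the file strictly as UTF-8 and reports a few counts, so that empty or unusually long lines stand out.

```python
def check_data_file(path: str) -> dict:
    """Return basic counts for a training file.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    total = 0
    empty = 0
    longest = 0
    with open(path, "r", encoding="utf-8") as f:  # strict decoding
        for line in f:
            total += 1
            sample = line.rstrip("\n")
            if not sample.strip():
                empty += 1
            longest = max(longest, len(sample))
    return {"lines": total, "empty_lines": empty, "longest_line_chars": longest}
```

A very large `longest_line_chars` value often means a whole paragraph or document ended up on a single line.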

## Preprocessing Tips

1. **Clean your data:**
   ```python
   import re

   # Read the raw text as UTF-8
   with open("raw_data.txt", "r", encoding="utf-8") as f:
       text = f.read()

   # Collapse runs of whitespace (including newlines) into single spaces
   text = re.sub(r'\s+', ' ', text)

   # Split into sentences after '.', '!' or '?', keeping the punctuation
   sentences = re.split(r'(?<=[.!?])\s+', text)

   # Write one sentence per line
   with open("data/cleaned_data.txt", "w", encoding="utf-8") as f:
       for sentence in sentences:
           if sentence.strip():
               f.write(sentence.strip() + '\n')
   ```

2. **Split long texts:**
   ```python
   import re

   # If you have long texts, split them into sentences
   text = "Long paragraph here. Another sentence. More text."
   sentences = re.split(r'(?<=[.!?])\s+', text)
   for sentence in sentences:
       if sentence.strip():
           print(sentence.strip())
   ```
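
Scraped or downloaded text often contains repeated or near-empty lines. As one more optional cleaning step, the sketch below removes duplicates while keeping the original order. The function name `dedupe_lines` and the `min_chars` threshold are assumptions for illustration; pick a threshold that suits your data.

```python
def dedupe_lines(lines, min_chars=1):
    """Drop duplicate and too-short lines, preserving first-seen order."""
    seen = set()
    result = []
    for line in lines:
        # Normalise whitespace so "a  b" and "a b" count as the same line
        cleaned = " ".join(line.split())
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

Because the function works on any iterable of strings, you can pass it an open file object and write the returned lines back out.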

## Quick Test

1. **Create a small test file:**
   ```bash
   cat > data/test.txt << EOF
   Hello world!
   This is a test.
   Language models are cool.
   EOF
   ```

2. **Train with it:**
   ```bash
   python train.py --data data/test.txt --output ./checkpoints
   ```

## Recommended Data Sources

- **Small (for testing):** Shakespeare text, sample_data.txt
- **Medium (for training):** Wikipedia articles, news articles
- **Large (for serious training):** WikiText-2, BookCorpus, OpenWebText

## Next Steps

Once you have your data.txt file:

```bash
# Train your model
python train.py --data data/your_data.txt --output ./checkpoints

# Or use the sample data
python train.py --data data/sample_data.txt --output ./checkpoints
```

Happy training! 🚀