Data Collection Guide
This guide shows you how to get training data from the internet or create your own data.txt file.
Option 1: Use the Download Script
Quick Start
# Download Shakespeare text (recommended for testing)
python download_data.py --type shakespeare
# Create a sample data file
python download_data.py --type sample --output data/my_data.txt --samples 200
# Download Wikipedia article (requires: pip install wikipedia)
python download_data.py --type wikipedia --title "Artificial Intelligence" --output data/ai_article.txt
Available Options
Shakespeare Dataset:
python download_data.py --type shakespeare
Downloads classic Shakespeare text - great for testing!
Create Sample Data:
python download_data.py --type sample --output data/my_data.txt --samples 100
Creates a file with sample sentences about ML/AI.
Wikipedia Article:
python download_data.py --type wikipedia --title "Machine Learning" --output data/ml_article.txt
Downloads a Wikipedia article (requires pip install wikipedia).
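If you download more than one of these sources, you can merge them into a single training file. A minimal Python sketch (the file names are just the example names used above, and the blank-line filtering is an assumption about what you want to keep):

```python
from pathlib import Path

def combine_files(paths, out_path):
    """Concatenate several text files into one training file,
    skipping blank lines along the way."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")

# Example (assuming you ran the downloads above):
# combine_files(["data/shakespeare.txt", "data/ai_article.txt"], "data/combined.txt")
```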
Option 2: Manual Data Collection
Method A: Create Your Own data.txt
- Create a text file:
nano data/my_data.txt
# or
vim data/my_data.txt
- Add your text (one sentence per line):
This is my first training sample.
This is my second training sample.
Add as many lines as you want.
- Save and use:
python train.py --data data/my_data.txt
Method B: Download from Public Datasets
Shakespeare Text:
curl -o data/shakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Book Corpus Sample:
# Download Project Gutenberg books
curl -o data/book.txt https://www.gutenberg.org/files/1342/1342-0.txt # Pride and Prejudice
News Articles:
# Download news text (note: this file is a CSV, not plain text)
curl -o data/news.txt https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv
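Because that news file is a CSV, the text column needs to be pulled out into plain lines before training. A hedged sketch with the standard library's csv module (the column name "text" is an assumption about that dataset; check the header row first):

```python
import csv

def csv_column_to_lines(csv_path, out_path, column="text"):
    """Extract one CSV column into a plain text file, one row per line."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        with open(out_path, "w", encoding="utf-8") as out:
            for row in csv.DictReader(f):
                value = (row.get(column) or "").strip()
                if value:
                    out.write(value + "\n")

# Example (path and column name are assumptions):
# csv_column_to_lines("data/news.txt", "data/news_lines.txt", column="text")
```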
Method C: Scrape Your Own Data
From Wikipedia (Python):
import wikipedia
page = wikipedia.page("Machine Learning")
with open("data/ml_article.txt", "w", encoding="utf-8") as f:
    f.write(page.content)
From a Website:
import requests
from bs4 import BeautifulSoup
url = "https://example.com/article"
response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()
with open("data/scraped.txt", "w", encoding="utf-8") as f:
    f.write(text)
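If you would rather avoid the requests/BeautifulSoup dependencies, the standard library can do basic tag stripping on HTML you already have. A minimal sketch using html.parser (it skips script and style contents, which plain get_text-style extraction would otherwise include):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```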
Option 3: Use Existing Datasets
Popular NLP Datasets
WikiText-2:
# Download WikiText-2 (the original S3 link may no longer be available;
# the dataset is also mirrored on Hugging Face as 'wikitext')
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip
# Use: wikitext-2/wiki.train.tokens
OpenWebText Sample:
# Note: the full OpenWebText corpus is very large; as a small stand-in for
# quick tests, this command downloads the Tiny Shakespeare file instead
curl -o data/openwebtext_sample.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
BookCorpus:
# Various book sources available
# Check: https://github.com/soskek/bookcorpus
Data Format Requirements
Your data.txt file should:
- Have one text sample per line
- Use UTF-8 encoding
- Be plain text (no special formatting)
Example format:
This is the first training example.
This is the second training example.
Each line becomes one training sample.
Good:
Hello world!
This is a sentence.
Machine learning is cool.
Bad:
This is paragraph 1 with multiple sentences. This is sentence 2.
This is paragraph 2.
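A few lines of Python can check a file against these requirements before training. A minimal validator sketch (the one-sentence-per-line check is a rough heuristic; abbreviations like "Dr." will trigger false positives):

```python
def check_data_file(path):
    """Return a list of problems found in a data.txt file."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    problems = []
    if not lines:
        problems.append("file is empty")
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {i} is blank")
        elif ". " in line:  # a period mid-line suggests multiple sentences
            problems.append(f"line {i} may contain multiple sentences")
    return problems

# Example: check_data_file("data/my_data.txt")
```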
Preprocessing Tips
- Clean your data:
import re
with open("raw_data.txt", "r", encoding="utf-8") as f:
    text = f.read()
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text)
# Split into sentences (naive: splits on every period)
sentences = text.split('.')
# Write one sentence per line
with open("data/cleaned_data.txt", "w", encoding="utf-8") as f:
    for sentence in sentences:
        if sentence.strip():
            f.write(sentence.strip() + '\n')
- Split long texts:
# If you have long texts, split them into sentences
text = "Long paragraph here. Another sentence. More text."
sentences = text.split('.')
for sentence in sentences:
    if sentence.strip():
        print(sentence.strip())
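Splitting on '.' alone drops the punctuation and misses question marks and exclamation points. A slightly more robust regex-based sketch (still naive about abbreviations like "Dr."):

```python
import re

def split_sentences(text):
    """Split text on sentence-ending punctuation (., !, ?),
    keeping the punctuation attached to each sentence."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]
```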
Quick Test
- Create a small test file:
cat > data/test.txt << EOF
Hello world!
This is a test.
Language models are cool.
EOF
- Train with it:
python train.py --data data/test.txt --output ./checkpoints
Recommended Data Sources
- Small (for testing): Shakespeare text, sample_data.txt
- Medium (for training): Wikipedia articles, news articles
- Large (for serious training): WikiText-2, BookCorpus, OpenWebText
Next Steps
Once you have your data.txt file:
# Train your model
python train.py --data data/your_data.txt --output ./checkpoints
# Or use the sample data
python train.py --data data/sample_data.txt --output ./checkpoints
Happy training! 🚀