# Data Collection Guide

This guide shows you how to get training data from the internet or create your own data.txt file.

## Option 1: Use the Download Script

### Quick Start

```bash
# Download Shakespeare text (recommended for testing)
python download_data.py --type shakespeare

# Create a sample data file
python download_data.py --type sample --output data/my_data.txt --samples 200

# Download Wikipedia article (requires: pip install wikipedia)
python download_data.py --type wikipedia --title "Artificial Intelligence" --output data/ai_article.txt
```

### Available Options

**Shakespeare Dataset:**
```bash
python download_data.py --type shakespeare
```
Downloads classic Shakespeare text - great for testing!

**Create Sample Data:**
```bash
python download_data.py --type sample --output data/my_data.txt --samples 100
```
Creates a file with sample sentences about ML/AI.

**Wikipedia Article:**
```bash
python download_data.py --type wikipedia --title "Machine Learning" --output data/ml_article.txt
```
Downloads a Wikipedia article (requires `pip install wikipedia`).

## Option 2: Manual Data Collection

### Method A: Create Your Own data.txt

1. **Create a text file:**
```bash
nano data/my_data.txt
# or
vim data/my_data.txt
```

2. **Add your text** (one sentence per line):
```
This is my first training sample.
This is my second training sample.
Add as many lines as you want.
```

3. **Save and use:**
```bash
python train.py --data data/my_data.txt
```

### Method B: Download from Public Datasets

**Shakespeare Text:**
```bash
curl -o data/shakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```

**Book Corpus Sample:**
```bash
# Download Project Gutenberg books
curl -o data/book.txt https://www.gutenberg.org/files/1342/1342-0.txt  # Pride and Prejudice
```

**News Articles:**
```bash
# Download news text
curl -o data/news.txt https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv
```

### Method C: Scrape Your Own Data

**From Wikipedia (Python):**
```python
import wikipedia

page = wikipedia.page("Machine Learning")
with open("data/ml_article.txt", "w") as f:
    f.write(page.content)
```

**From a Website:**
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()

with open("data/scraped.txt", "w") as f:
    f.write(text)
```

## Option 3: Use Existing Datasets

### Popular NLP Datasets

**WikiText-2:**
```bash
# Download WikiText-2
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip
# Use: wikitext-2/wiki.train.tokens
```

**OpenWebText Sample:**
```bash
# Download sample
curl -o data/openwebtext_sample.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
```

**BookCorpus:**
```bash
# Various book sources available
# Check: https://github.com/soskek/bookcorpus
```

## Data Format Requirements

Your `data.txt` file should:
- Have **one text sample per line**
- Use **UTF-8 encoding**
- Be **plain text** (no special formatting)

**Example format:**
```
This is the first training example.
This is the second training example.
Each line becomes one training sample.
```

**Good:**
```
Hello world!
This is a sentence.
Machine learning is cool.
```

**Bad:**
```
This is paragraph 1 with multiple sentences. This is sentence 2.
This is paragraph 2.
```

## Preprocessing Tips

1. **Clean your data:**
```python
import re

with open("raw_data.txt", "r") as f:
    text = f.read()

# Remove extra whitespace
text = re.sub(r'\s+', ' ', text)

# Split into sentences
sentences = text.split('.')

# Write one per line
with open("data/cleaned_data.txt", "w") as f:
    for sentence in sentences:
        if sentence.strip():
            f.write(sentence.strip() + '\n')
```

2. **Split long texts:**
```python
# If you have long texts, split them into sentences
text = "Long paragraph here. Another sentence. More text."
sentences = text.split('.')
for sentence in sentences:
    if sentence.strip():
        print(sentence.strip())
```

## Quick Test

1. **Create a small test file:**
```bash
cat > data/test.txt << EOF
Hello world!
This is a test.
Language models are cool.
EOF
```

2. **Train with it:**
```bash
python train.py --data data/test.txt --output ./checkpoints
```

## Recommended Data Sources

- **Small (for testing):** Shakespeare text, sample_data.txt
- **Medium (for training):** Wikipedia articles, news articles
- **Large (for serious training):** WikiText-2, BookCorpus, OpenWebText

## Next Steps

Once you have your data.txt file:

```bash
# Train your model
python train.py --data data/your_data.txt --output ./checkpoints

# Or use the sample data
python train.py --data data/sample_data.txt --output ./checkpoints
```

Happy training! 🚀