- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
196 lines
5.4 KiB
Markdown
196 lines
5.4 KiB
Markdown
# Multi-Format Data Processing Guide
|
|
|
|
## Overview
|
|
|
|
The training script now supports processing multiple file types from your `data/` directory:
|
|
|
|
- **Text files**: `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm`
|
|
- **Code files**: `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, and many more
|
|
- **PDF files**: `.pdf` (requires PyPDF2 or pdfplumber)
|
|
- **Images**: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp` (requires pytesseract for OCR)
|
|
|
|
## Basic Usage
|
|
|
|
Simply point the training script to your data directory:
|
|
|
|
```bash
|
|
python train.py --data /path/to/your/data/directory
|
|
```
|
|
|
|
The script will automatically:
|
|
1. Scan the directory (recursively by default)
|
|
2. Extract text from all supported file types
|
|
3. Process and tokenize the text
|
|
4. Train the model on all extracted content
|
|
|
|
## Installation
|
|
|
|
### Core Dependencies
|
|
|
|
The core dependencies are already in `requirements.txt`. Install them with:
|
|
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
|
|
### Optional Dependencies for PDF and Image Processing
|
|
|
|
If you want to process PDFs or images, install the optional dependencies:
|
|
|
|
```bash
|
|
# For PDF processing (choose one):
|
|
pip install PyPDF2
|
|
# OR
|
|
pip install pdfplumber # Alternative, often better for complex PDFs
|
|
|
|
# For image OCR:
|
|
pip install pytesseract Pillow
|
|
|
|
# Also install Tesseract OCR engine:
|
|
# macOS: brew install tesseract
|
|
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
|
|
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
|
|
```
|
|
|
|
## How It Works
|
|
|
|
### 1. Text Files
|
|
|
|
Text files are read line by line. Each non-empty line becomes a training sample.
|
|
|
|
### 2. Code Files
|
|
|
|
Code files are processed as text. Each line of code becomes a training sample. This allows the model to learn code patterns and syntax.
|
|
|
|
### 3. PDF Files
|
|
|
|
PDFs are processed page by page:
|
|
- Text is extracted from each page
|
|
- Split into lines
|
|
- Filtered to remove very short lines
|
|
- Each line becomes a training sample
|
|
|
|
**Note**: PDF extraction works best with text-based PDFs. Scanned PDFs (images) should use OCR instead.
|
|
|
|
### 4. Image Files
|
|
|
|
Images are processed using OCR (Optical Character Recognition):
|
|
- Images are opened using PIL/Pillow
|
|
- pytesseract extracts text from the image
|
|
- Text is split into lines
|
|
- Each line becomes a training sample
|
|
|
|
**Note**: OCR quality depends on image quality. For best results:
|
|
- Use high-resolution images
|
|
- Ensure good contrast between text and background
|
|
- Avoid images with complex layouts
|
|
|
|
## Configuration Options
|
|
|
|
You can customize the data processing behavior:
|
|
|
|
```python
|
|
from pathlib import Path
|
|
from data import DataProcessor
|
|
|
|
processor = DataProcessor(
|
|
use_ocr=True, # Enable OCR for images
|
|
use_pdf_extraction=True # Enable PDF extraction
|
|
)
|
|
|
|
# Process directory
|
|
texts = processor.process_to_list(
|
|
directory=Path("data/"),
|
|
recursive=True, # Process subdirectories
|
|
min_length=10, # Minimum line length
|
|
max_samples=None, # Limit number of samples (None = all)
|
|
)
|
|
```
|
|
|
|
## Examples
|
|
|
|
### Example 1: Process all files in directory
|
|
|
|
```bash
|
|
python train.py --data /mnt/storage/sheepOp/data
|
|
```
|
|
|
|
### Example 2: Process single file
|
|
|
|
```bash
|
|
python train.py --data /mnt/storage/sheepOp/data/document.pdf
|
|
```
|
|
|
|
### Example 3: Using Python API
|
|
|
|
```python
|
|
from pathlib import Path
|
|
from data import extract_text_from_directory
|
|
|
|
# Extract text from all supported files
|
|
texts = extract_text_from_directory(
|
|
directory=Path("data/"),
|
|
recursive=True,
|
|
use_ocr=True,
|
|
use_pdf_extraction=True,
|
|
min_length=10,
|
|
)
|
|
|
|
print(f"Extracted {len(texts)} text samples")
|
|
```
|
|
|
|
## Supported File Types Summary
|
|
|
|
| Category | Extensions | Requirements |
|
|
|----------|-----------|--------------|
|
|
| Text | `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm` | None |
|
|
| Code | `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, and 30+ more | None |
|
|
| PDF | `.pdf` | PyPDF2 or pdfplumber |
|
|
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp` | pytesseract + Pillow + Tesseract OCR |
|
|
|
|
## Troubleshooting
|
|
|
|
### PDF extraction not working
|
|
|
|
- Install PyPDF2: `pip install PyPDF2`
|
|
- Or install pdfplumber (better for complex PDFs): `pip install pdfplumber`
|
|
- If PDFs are scanned images, use OCR instead
|
|
|
|
### OCR not working
|
|
|
|
1. Install pytesseract: `pip install pytesseract Pillow`
|
|
2. Install Tesseract OCR engine (see installation instructions above)
|
|
3. On some systems, you may need to set the tesseract path:
|
|
```python
|
|
import pytesseract
|
|
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract' # macOS example
|
|
```
|
|
|
|
### No text extracted
|
|
|
|
- Check that files are in supported formats
|
|
- Verify file permissions
|
|
- Check logs for error messages
|
|
- Try processing a single file first to debug
|
|
|
|
## Performance Tips
|
|
|
|
1. **Large directories**: Processing can take time for large directories. Progress is logged every 100 files.
|
|
|
|
2. **Parallel processing**: Consider processing files in parallel if you have many large files.
|
|
|
|
3. **Filtering**: Use `min_length` to filter out very short lines that may not be useful for training.
|
|
|
|
4. **Caching**: For repeated processing, consider saving extracted text to a file first.
|
|
|
|
## Next Steps
|
|
|
|
Once your data is processed:
|
|
|
|
1. The training script will automatically tokenize the text
|
|
2. Create training batches
|
|
3. Train your model
|
|
|
|
For more information on training, see `RETRAINING_GUIDE.md`.
|
|
|