sheepOp/docs/MULTI_FORMAT_DATA_GUIDE.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00


# Multi-Format Data Processing Guide
## Overview
The training script now supports processing multiple file types from your `data/` directory:
- **Text files**: `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm`
- **Code files**: `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, and many more
- **PDF files**: `.pdf` (requires PyPDF2 or pdfplumber)
- **Images**: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp` (requires pytesseract for OCR)
## Basic Usage
Point the training script at your data directory (or at a single file):
```bash
python train.py --data /path/to/your/data/directory
```
The script will automatically:
1. Scan the directory (recursively by default)
2. Extract text from all supported file types
3. Process and tokenize the text
4. Train the model on all extracted content
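
The scanning step can be pictured roughly as follows. This is a minimal sketch rather than the project's actual implementation: the `SUPPORTED_EXTENSIONS` set is an abbreviated subset of the extensions listed in the Overview, and the function name `find_supported_files` is hypothetical.

```python
from pathlib import Path

# Abbreviated, illustrative subset of the extensions listed in the Overview.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".py", ".js", ".pdf", ".png", ".jpg"}


def find_supported_files(directory: Path, recursive: bool = True) -> list[Path]:
    """Return supported files under `directory`, sorted for reproducibility."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        path
        for path in directory.glob(pattern)
        # Compare suffixes case-insensitively so "REPORT.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Sorting the result is a design choice in this sketch: it keeps the sample order stable between runs, which makes debugging easier.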
## Installation
### Core Dependencies
The core dependencies are listed in `requirements.txt`. Install them with:
```bash
pip install -r requirements.txt
```
### Optional Dependencies for PDF and Image Processing
If you want to process PDFs or images, install the optional dependencies:
```bash
# For PDF processing (choose one):
pip install PyPDF2
# OR
pip install pdfplumber # Alternative, often better for complex PDFs
# For image OCR:
pip install pytesseract Pillow
# Also install Tesseract OCR engine:
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
```
## How It Works
### 1. Text Files
Text files are read line by line. Each non-empty line becomes a training sample.
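
As a rough illustration of line-by-line reading, the sketch below turns one file into a list of samples. The function name `read_text_samples` and the choice to replace undecodable bytes are assumptions made for this example, not details taken from the project code.

```python
from pathlib import Path


def read_text_samples(path: Path, min_length: int = 1) -> list[str]:
    """Return the stripped, non-empty lines of a text file as samples."""
    samples = []
    # errors="replace" is an assumption: it keeps reading past undecodable bytes.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            # With min_length=1 this simply skips empty lines.
            if len(line) >= min_length:
                samples.append(line)
    return samples
```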
### 2. Code Files
Code files are processed as text. Each line of code becomes a training sample. This allows the model to learn code patterns and syntax.
### 3. PDF Files
PDFs are processed page by page:
- Text is extracted from each page
- Split into lines
- Filtered to remove very short lines
- Each line becomes a training sample
**Note**: PDF extraction works best with text-based PDFs. Scanned PDFs contain page images rather than embedded text, so they need OCR instead.
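
The page-by-page flow above could look roughly like this. It is a sketch under stated assumptions: it uses `PdfReader` from PyPDF2 2.0 or later, the helper names are hypothetical, and the threshold of 10 characters for "very short" simply mirrors the `min_length=10` used elsewhere in this guide.

```python
def lines_from_page_text(page_text: str, min_length: int = 10) -> list[str]:
    """Split one page's text into lines and drop very short ones."""
    lines = (line.strip() for line in page_text.splitlines())
    return [line for line in lines if len(line) >= min_length]


def extract_pdf_samples(pdf_path: str, min_length: int = 10) -> list[str]:
    """Collect line samples from every page of a text-based PDF."""
    # Imported lazily because PyPDF2 is an optional dependency.
    from PyPDF2 import PdfReader

    samples = []
    for page in PdfReader(pdf_path).pages:
        # Image-only (scanned) pages typically yield no text here.
        samples.extend(lines_from_page_text(page.extract_text() or "", min_length))
    return samples
```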
### 4. Image Files
Images are processed using OCR (Optical Character Recognition):
- Images are opened using PIL/Pillow
- pytesseract extracts text from the image
- Text is split into lines
- Each line becomes a training sample
**Note**: OCR quality depends on image quality. For best results:
- Use high-resolution images
- Ensure good contrast between text and background
- Avoid images with complex layouts
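
The OCR steps above can be sketched as follows. The helper names are hypothetical; `Image.open` and `pytesseract.image_to_string` are the standard Pillow and pytesseract calls, and the Tesseract engine is assumed to be installed and discoverable.

```python
def ocr_text_to_samples(ocr_text: str, min_length: int = 1) -> list[str]:
    """Split raw OCR output into stripped lines, dropping blank ones."""
    lines = (line.strip() for line in ocr_text.splitlines())
    return [line for line in lines if len(line) >= min_length]


def extract_image_samples(image_path: str) -> list[str]:
    """Run OCR on one image and return its lines as samples."""
    # Imported lazily because Pillow and pytesseract are optional dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image)
    return ocr_text_to_samples(text)
```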
## Configuration Options
You can customize the data processing behavior:
```python
from pathlib import Path

from data import DataProcessor

processor = DataProcessor(
    use_ocr=True,             # Enable OCR for images
    use_pdf_extraction=True,  # Enable PDF extraction
)

# Process directory
texts = processor.process_to_list(
    directory=Path("data/"),
    recursive=True,    # Process subdirectories
    min_length=10,     # Minimum line length
    max_samples=None,  # Limit number of samples (None = all)
)
```
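
One way to read `min_length` and `max_samples` is as a filter followed by a cap. The sketch below is an assumption about their semantics, not the `DataProcessor` implementation: `filter_samples` is a hypothetical helper, and the real code may measure length differently or choose which samples to keep in another way.

```python
from itertools import islice
from typing import Iterable, Optional


def filter_samples(
    lines: Iterable[str],
    min_length: int = 10,
    max_samples: Optional[int] = None,
) -> list[str]:
    """Keep lines of at least `min_length` characters, up to `max_samples`."""
    stripped = (line.strip() for line in lines)
    long_enough = (line for line in stripped if len(line) >= min_length)
    # islice with stop=None yields everything, matching "None = all".
    return list(islice(long_enough, max_samples))
```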
## Examples
### Example 1: Process all files in directory
```bash
python train.py --data /mnt/storage/sheepOp/data
```
### Example 2: Process single file
```bash
python train.py --data /mnt/storage/sheepOp/data/document.pdf
```
### Example 3: Using Python API
```python
from pathlib import Path

from data import extract_text_from_directory

# Extract text from all supported files
texts = extract_text_from_directory(
    directory=Path("data/"),
    recursive=True,
    use_ocr=True,
    use_pdf_extraction=True,
    min_length=10,
)

print(f"Extracted {len(texts)} text samples")
```
## Supported File Types Summary
| Category | Extensions | Requirements |
|----------|-----------|--------------|
| Text | `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm` | None |
| Code | `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, and 30+ more | None |
| PDF | `.pdf` | PyPDF2 or pdfplumber |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp` | pytesseract + Pillow + Tesseract OCR |
## Troubleshooting
### PDF extraction not working
- Install PyPDF2: `pip install PyPDF2`
- Or install pdfplumber (better for complex PDFs): `pip install pdfplumber`
- If PDFs are scanned images, use OCR instead
### OCR not working
1. Install pytesseract: `pip install pytesseract Pillow`
2. Install Tesseract OCR engine (see installation instructions above)
3. On some systems, you may need to set the tesseract path:
```python
import pytesseract
# macOS example (Intel Homebrew); Apple Silicon Homebrew typically uses /opt/homebrew/bin/tesseract
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
```
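
Rather than hard-coding a path, you can first check whether the Tesseract binary is on your `PATH`. This is a small sketch using the standard library's `shutil.which`; the helper name `find_tesseract` is hypothetical.

```python
import shutil
from typing import Optional


def find_tesseract(candidates: tuple[str, ...] = ("tesseract",)) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it or set tesseract_cmd manually.")
    else:
        print(f"Tesseract found at {path}")
```

If a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` as shown above.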
### No text extracted
- Check that files are in supported formats
- Verify file permissions
- Check logs for error messages
- Try processing a single file first to debug
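
To check the first two points quickly, a short script can summarize which extensions a directory contains and flag files the current user cannot read. This is an illustrative sketch; `summarize_directory` is a hypothetical helper and not part of the project.

```python
import os
from collections import Counter
from pathlib import Path


def summarize_directory(directory: Path) -> tuple[Counter, list[Path]]:
    """Count files per (lowercased) extension and list unreadable files."""
    suffix_counts: Counter = Counter()
    unreadable: list[Path] = []
    for path in directory.rglob("*"):
        if not path.is_file():
            continue
        suffix_counts[path.suffix.lower() or "(no extension)"] += 1
        # os.access reports whether the current user may read the file.
        if not os.access(path, os.R_OK):
            unreadable.append(path)
    return suffix_counts, unreadable
```

Comparing the reported extensions against the supported file types table shows at a glance whether any files are being skipped.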
## Performance Tips
1. **Large directories**: Processing can take time for large directories. Progress is logged every 100 files.
2. **Parallel processing**: Consider processing files in parallel if you have many large files.
3. **Filtering**: Use `min_length` to filter out very short lines that may not be useful for training.
4. **Caching**: For repeated processing, consider saving extracted text to a file first.
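
Tips 2 and 4 can be combined in a small wrapper. The sketch below is one possible approach, not something the training script provides: all function names are hypothetical, a plain-text reader stands in for the real per-file extractor, and threads are used for simplicity (a process pool may suit CPU-heavy PDF parsing better).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def read_lines(path: Path) -> list[str]:
    """Stand-in extractor: non-empty, stripped lines of a text file."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def extract_parallel(
    paths: Iterable[Path],
    extract: Callable[[Path], list[str]] = read_lines,
    max_workers: int = 4,
) -> list[str]:
    """Run `extract` on each path concurrently and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `paths` in its results.
        per_file = pool.map(extract, paths)
        return [line for lines in per_file for line in lines]


def load_or_build_cache(cache_file: Path, build: Callable[[], list[str]]) -> list[str]:
    """Return cached samples if present; otherwise build and save them."""
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8").splitlines()
    samples = build()
    # One sample per line works because each sample is itself a single line.
    cache_file.write_text("\n".join(samples), encoding="utf-8")
    return samples
```

Remember to delete the cache file when the underlying data changes, since this sketch does no staleness check.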
## Next Steps
Once your data is processed, the training script will automatically:
1. Tokenize the text
2. Create training batches
3. Train your model
For more information on training, see `RETRAINING_GUIDE.md`.