Multi-Format Data Processing Guide
Overview
The training script now supports processing multiple file types from your data/ directory:
- Text files: .txt, .md, .rst, .log, .csv, .json, .jsonl, .xml, .html, .htm
- Code files: .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, .swift, and many more
- PDF files: .pdf (requires PyPDF2 or pdfplumber)
- Images: .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp (requires pytesseract for OCR)
Basic Usage
Simply point the training script to your data directory:
python train.py --data /path/to/your/data/directory
The script will automatically:
- Scan the directory (recursively by default)
- Extract text from all supported file types
- Process and tokenize the text
- Train the model on all extracted content
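The scan step above can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: the names `classify` and `scan` and the extension sets are hypothetical, and the code list is abbreviated to the extensions named in this guide.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical extension groups, copied from the lists in this guide.
TEXT_EXTS = {".txt", ".md", ".rst", ".log", ".csv", ".json", ".jsonl", ".xml", ".html", ".htm"}
CODE_EXTS = {".py", ".js", ".ts", ".java", ".cpp", ".c", ".go", ".rs", ".rb", ".php", ".swift"}
PDF_EXTS = {".pdf"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}


def classify(path: Path) -> Optional[str]:
    """Return the category for a file, or None if its extension is unsupported."""
    ext = path.suffix.lower()  # compare case-insensitively, so ".PDF" also matches
    if ext in TEXT_EXTS:
        return "text"
    if ext in CODE_EXTS:
        return "code"
    if ext in PDF_EXTS:
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    return None


def scan(directory: Path, recursive: bool = True) -> Dict[str, List[Path]]:
    """Group the supported files under `directory` by category."""
    pattern = "**/*" if recursive else "*"
    groups: Dict[str, List[Path]] = {"text": [], "code": [], "pdf": [], "image": []}
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            category = classify(path)
            if category is not None:
                groups[category].append(path)
    return groups
```

Calling `scan(Path("data/"))` before a long run is a quick way to see how many files of each kind would be picked up.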
Installation
Core Dependencies
The core dependencies are already in requirements.txt. Install them with:
pip install -r requirements.txt
Optional Dependencies for PDF and Image Processing
If you want to process PDFs or images, install the optional dependencies:
# For PDF processing (choose one):
pip install PyPDF2
# OR
pip install pdfplumber # Alternative, often better for complex PDFs
# For image OCR:
pip install pytesseract Pillow
# Also install Tesseract OCR engine:
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
How It Works
1. Text Files
Text files are read line by line. Each non-empty line becomes a training sample.
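As an illustration of the line-by-line rule, here is a minimal sketch. The function name `read_text_samples`, the whitespace stripping, and the `errors="replace"` decoding policy are assumptions for this example, not confirmed details of the training script.

```python
from pathlib import Path
from typing import List


def read_text_samples(path: Path) -> List[str]:
    """Return one sample per non-empty line of a text file."""
    samples = []
    # Assumption: undecodable bytes are replaced rather than raising an error.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()  # assumption: surrounding whitespace is dropped
            if line:  # skip blank lines
                samples.append(line)
    return samples
```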
2. Code Files
Code files are processed as text. Each line of code becomes a training sample. This allows the model to learn code patterns and syntax.
3. PDF Files
PDFs are processed page by page:
- Text is extracted from each page
- Split into lines
- Filtered to remove very short lines
- Each line becomes a training sample
Note: PDF extraction works best with text-based PDFs. Scanned PDFs are usually page images with no text layer, so they should go through OCR instead.
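The page-by-page steps above could look something like this sketch, which uses pdfplumber. The helper names `lines_from_pages` and `extract_pdf_samples` are hypothetical, and the default `min_length=10` is borrowed from the configuration example in this guide rather than from the script's internals.

```python
from typing import Iterable, List, Optional


def lines_from_pages(page_texts: Iterable[Optional[str]], min_length: int = 10) -> List[str]:
    """Split each page's text into lines and keep lines of at least `min_length` characters."""
    samples = []
    for text in page_texts:
        if not text:  # a page with no extractable text may yield None or ""
            continue
        for line in text.splitlines():
            line = line.strip()
            if len(line) >= min_length:
                samples.append(line)
    return samples


def extract_pdf_samples(path: str, min_length: int = 10) -> List[str]:
    """Extract training samples from a text-based PDF."""
    import pdfplumber  # optional dependency, imported only when a PDF is processed

    with pdfplumber.open(path) as pdf:
        return lines_from_pages((page.extract_text() for page in pdf.pages), min_length)
```

If `extract_pdf_samples` returns an empty list for a PDF that clearly contains text, the file is probably scanned.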
4. Image Files
Images are processed using OCR (Optical Character Recognition):
- Images are opened using PIL/Pillow
- pytesseract extracts text from the image
- Text is split into lines
- Each line becomes a training sample
Note: OCR quality depends on image quality. For best results:
- Use high-resolution images
- Ensure good contrast between text and background
- Avoid images with complex layouts
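A minimal sketch of the OCR steps above, assuming pytesseract, Pillow, and the Tesseract engine are installed. The helper names are hypothetical, and dropping blank lines is an assumption about how the OCR output is cleaned up.

```python
from typing import List


def ocr_text_to_samples(ocr_text: str) -> List[str]:
    """Split raw OCR output into stripped, non-empty lines."""
    return [line.strip() for line in ocr_text.splitlines() if line.strip()]


def extract_image_samples(path: str) -> List[str]:
    """Run OCR on one image and return one sample per recognized line."""
    # Optional dependencies, imported only when an image is processed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        text = pytesseract.image_to_string(image)
    return ocr_text_to_samples(text)
```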
Configuration Options
You can customize the data processing behavior:
from pathlib import Path
from data import DataProcessor

processor = DataProcessor(
    use_ocr=True,             # Enable OCR for images
    use_pdf_extraction=True,  # Enable PDF extraction
)

# Process directory
texts = processor.process_to_list(
    directory=Path("data/"),
    recursive=True,    # Process subdirectories
    min_length=10,     # Minimum line length
    max_samples=None,  # Limit number of samples (None = all)
)
Examples
Example 1: Process all files in directory
python train.py --data /mnt/storage/sheepOp/data
Example 2: Process single file
python train.py --data /mnt/storage/sheepOp/data/document.pdf
Example 3: Using Python API
from pathlib import Path
from data import extract_text_from_directory

# Extract text from all supported files
texts = extract_text_from_directory(
    directory=Path("data/"),
    recursive=True,
    use_ocr=True,
    use_pdf_extraction=True,
    min_length=10,
)
print(f"Extracted {len(texts)} text samples")
Supported File Types Summary
| Category | Extensions | Requirements |
|---|---|---|
| Text | .txt, .md, .rst, .log, .csv, .json, .jsonl, .xml, .html, .htm | None |
| Code | .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, .swift, and 30+ more | None |
| PDF | .pdf | PyPDF2 or pdfplumber |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp | pytesseract + Pillow + Tesseract OCR |
Troubleshooting
PDF extraction not working
- Install PyPDF2: pip install PyPDF2
- Or install pdfplumber (better for complex PDFs): pip install pdfplumber
- If PDFs are scanned images, use OCR instead
OCR not working
- Install pytesseract: pip install pytesseract Pillow
- Install Tesseract OCR engine (see installation instructions above)
- On some systems, you may need to set the tesseract path:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'  # macOS example
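When PDF or OCR extraction fails without an obvious error, it can help to confirm which optional components are actually visible to Python. The following is a small diagnostic sketch, not part of the project; the function name `check_optional_dependencies` is hypothetical.

```python
import importlib.util
import shutil
from typing import Dict


def check_optional_dependencies() -> Dict[str, bool]:
    """Report which optional PDF/OCR components can be found on this system."""
    modules = {
        "PyPDF2": "PyPDF2",
        "pdfplumber": "pdfplumber",
        "pytesseract": "pytesseract",
        "Pillow": "PIL",  # Pillow is imported under the name PIL
    }
    status = {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }
    # pytesseract is only a wrapper: the tesseract executable must also be on PATH
    # (or configured through pytesseract.pytesseract.tesseract_cmd).
    status["tesseract binary"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, found in check_optional_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If "pytesseract" is found but "tesseract binary" is missing, either install the engine or set the path as described above.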
No text extracted
- Check that files are in supported formats
- Verify file permissions
- Check logs for error messages
- Try processing a single file first to debug
Performance Tips
- Large directories: Processing can take time for large directories. Progress is logged every 100 files.
- Parallel processing: Consider processing files in parallel if you have many large files.
- Filtering: Use min_length to filter out very short lines that may not be useful for training.
- Caching: For repeated processing, consider saving extracted text to a file first.
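The parallel-processing and caching tips can be combined in a sketch like the one below. It is illustrative only: `extract_parallel`, `save_cache`, and `load_cache` are hypothetical helpers, `extract_one` stands for whatever per-file extraction function you use, and the JSON Lines cache layout is an assumption. A thread pool is used here for simplicity; a process pool may suit CPU-heavy extraction better.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def extract_parallel(
    paths: Iterable[Path],
    extract_one: Callable[[Path], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run `extract_one` over `paths` in a thread pool; results keep the input order."""
    samples: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for file_samples in pool.map(extract_one, paths):
            samples.extend(file_samples)
    return samples


def save_cache(samples: Iterable[str], cache_path: Path) -> None:
    """Write one JSON object per line so samples containing newlines survive."""
    with cache_path.open("w", encoding="utf-8") as handle:
        for sample in samples:
            handle.write(json.dumps({"text": sample}, ensure_ascii=False) + "\n")


def load_cache(cache_path: Path) -> List[str]:
    """Read samples back from a cache written by `save_cache`."""
    with cache_path.open("r", encoding="utf-8") as handle:
        return [json.loads(line)["text"] for line in handle if line.strip()]
```

Keep the cache file outside the data directory: .jsonl is one of the supported text extensions, so a cache stored inside it would be picked up as training data on the next run.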
Next Steps
Once your data is processed, the training script will automatically:
- Tokenize the text
- Create training batches
- Train your model
For more information on training, see RETRAINING_GUIDE.md.