sheepOp/docs/MULTI_FORMAT_DATA_GUIDE.md
Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00

Multi-Format Data Processing Guide

Overview

The training script now supports processing multiple file types from your data/ directory:

  • Text files: .txt, .md, .rst, .log, .csv, .json, .jsonl, .xml, .html, .htm
  • Code files: .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, .swift, and 30+ more
  • PDF files: .pdf (requires PyPDF2 or pdfplumber)
  • Images: .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp (requires pytesseract for OCR)

Basic Usage

Simply point the training script to your data directory:

python train.py --data /path/to/your/data/directory

The script will automatically:

  1. Scan the directory (recursively by default)
  2. Extract text from all supported file types
  3. Process and tokenize the text
  4. Train the model on all extracted content
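The scanning step can be pictured with a minimal sketch like the one below. It is not the project's actual implementation: the function name iter_supported_files and the shortened SUPPORTED_EXTENSIONS set are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator

# Illustrative subset of the extensions listed in the overview (assumption:
# the real list lives inside the project and is longer).
SUPPORTED_EXTENSIONS = {".txt", ".md", ".py", ".js", ".pdf", ".png", ".jpg"}


def iter_supported_files(directory: Path, recursive: bool = True) -> Iterator[Path]:
    """Yield files under `directory` whose extension is in SUPPORTED_EXTENSIONS."""
    pattern = "**/*" if recursive else "*"
    # sorted() gives a stable, reproducible processing order.
    for path in sorted(directory.glob(pattern)):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            yield path
```

Matching on the lowercased suffix means files such as REPORT.PDF are picked up as well.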

Installation

Core Dependencies

The core dependencies are already in requirements.txt. Install them with:

pip install -r requirements.txt

Optional Dependencies for PDF and Image Processing

If you want to process PDFs or images, install the optional dependencies:

# For PDF processing (choose one):
pip install PyPDF2
# OR
pip install pdfplumber  # Alternative, often better for complex PDFs

# For image OCR:
pip install pytesseract Pillow

# Also install Tesseract OCR engine:
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
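To confirm which optional pieces are available before training, a small check along these lines may help. This is a sketch only: the helper name check_optional_dependencies is hypothetical and not part of the project.

```python
import importlib.util
import shutil
from typing import Dict


def check_optional_dependencies() -> Dict[str, bool]:
    """Report which optional PDF/OCR dependencies are importable or on PATH."""
    return {
        "PyPDF2": importlib.util.find_spec("PyPDF2") is not None,
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # Pillow is imported under the package name "PIL".
        "Pillow": importlib.util.find_spec("PIL") is not None,
        # The Tesseract engine is a separate executable, not a Python package.
        "tesseract binary": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```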

How It Works

1. Text Files

Text files are read line by line. Each non-empty line becomes a training sample.
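The line-by-line behavior could look roughly like this sketch. The function name read_text_samples, the UTF-8 decoding with errors="replace", and the stripping of surrounding whitespace are all assumptions, not confirmed details of the project.

```python
from pathlib import Path
from typing import List


def read_text_samples(path: Path) -> List[str]:
    """Return each non-empty line of a text file as one training sample."""
    samples = []
    # errors="replace" keeps reading past undecodable bytes (an assumption,
    # not confirmed project behavior).
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:  # skip empty and whitespace-only lines
                samples.append(stripped)
    return samples
```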

2. Code Files

Code files are processed as text. Each line of code becomes a training sample. This allows the model to learn code patterns and syntax.

3. PDF Files

PDFs are processed page by page:

  • Text is extracted from each page
  • The text is split into lines
  • Very short lines are filtered out
  • Each remaining line becomes a training sample

Note: PDF extraction works best with text-based PDFs. Scanned PDFs contain page images rather than text, so process them with OCR instead.
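A minimal sketch of the page-by-page flow is shown below, assuming pdfplumber is preferred and PyPDF2 (2.x or later, which provides PdfReader) is the fallback. The function names and the exact filtering rule (keep lines of at least min_length characters) are illustrative assumptions, not the project's code.

```python
from pathlib import Path
from typing import Iterable, List


def page_text_to_samples(page_texts: Iterable[str], min_length: int = 10) -> List[str]:
    """Split page text into lines and keep lines of at least `min_length` characters."""
    samples = []
    for text in page_texts:
        for line in text.splitlines():
            line = line.strip()
            if len(line) >= min_length:
                samples.append(line)
    return samples


def extract_pdf_samples(path: Path, min_length: int = 10) -> List[str]:
    """Extract text samples from a text-based PDF (hypothetical helper)."""
    try:
        import pdfplumber  # optional dependency

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer.
            pages = [page.extract_text() or "" for page in pdf.pages]
    except ImportError:
        from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

        reader = PdfReader(str(path))
        pages = [page.extract_text() or "" for page in reader.pages]
    return page_text_to_samples(pages, min_length)
```

Keeping the line filtering separate from the library calls makes it possible to test the filtering without a PDF on hand.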

4. Image Files

Images are processed using OCR (Optical Character Recognition):

  • Images are opened using PIL/Pillow
  • pytesseract extracts text from the image
  • Text is split into lines
  • Each line becomes a training sample

Note: OCR quality depends on image quality. For best results:

  • Use high-resolution images
  • Ensure good contrast between text and background
  • Avoid images with complex layouts
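The OCR steps above can be sketched as follows, using Pillow's Image.open and pytesseract.image_to_string. The function names are hypothetical, and the project may post-process the OCR text differently.

```python
from pathlib import Path
from typing import List


def ocr_text_to_samples(ocr_text: str) -> List[str]:
    """Split raw OCR output into non-empty, stripped lines."""
    return [line.strip() for line in ocr_text.splitlines() if line.strip()]


def extract_image_samples(path: Path) -> List[str]:
    """Run OCR on one image and return its lines (hypothetical helper)."""
    # Imported lazily so this module loads without the optional OCR packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        text = pytesseract.image_to_string(image)
    return ocr_text_to_samples(text)
```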

Configuration Options

You can customize the data processing behavior:

from pathlib import Path
from data import DataProcessor

processor = DataProcessor(
    use_ocr=True,           # Enable OCR for images
    use_pdf_extraction=True # Enable PDF extraction
)

# Process directory
texts = processor.process_to_list(
    directory=Path("data/"),
    recursive=True,         # Process subdirectories
    min_length=10,          # Minimum line length
    max_samples=None,       # Limit number of samples (None = all)
)

Examples

Example 1: Process all files in directory

python train.py --data /mnt/storage/sheepOp/data

Example 2: Process single file

python train.py --data /mnt/storage/sheepOp/data/document.pdf

Example 3: Using Python API

from pathlib import Path
from data import extract_text_from_directory

# Extract text from all supported files
texts = extract_text_from_directory(
    directory=Path("data/"),
    recursive=True,
    use_ocr=True,
    use_pdf_extraction=True,
    min_length=10,
)

print(f"Extracted {len(texts)} text samples")

Supported File Types Summary

| Category | Extensions | Requirements |
|----------|------------|--------------|
| Text | .txt, .md, .rst, .log, .csv, .json, .jsonl, .xml, .html, .htm | None |
| Code | .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, .swift, and 30+ more | None |
| PDF | .pdf | PyPDF2 or pdfplumber |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp | pytesseract + Pillow + Tesseract OCR |

Troubleshooting

PDF extraction not working

  • Install PyPDF2: pip install PyPDF2
  • Or install pdfplumber (better for complex PDFs): pip install pdfplumber
  • If PDFs are scanned images, use OCR instead

OCR not working

  1. Install pytesseract: pip install pytesseract Pillow
  2. Install Tesseract OCR engine (see installation instructions above)
  3. On some systems, you may need to set the tesseract path:
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'  # macOS example
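If the location of the executable is not known in advance, a sketch like this one can search for it before setting tesseract_cmd. The helper names and the candidate list are assumptions; adjust the paths for your system.

```python
import shutil
from pathlib import Path
from typing import Optional, Sequence

# Candidate locations are illustrative assumptions (Homebrew and common Linux paths).
DEFAULT_CANDIDATES = (
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
    "/usr/bin/tesseract",
)


def find_tesseract(candidates: Sequence[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return a path to the tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract() -> bool:
    """Point pytesseract at the located executable; return True on success."""
    path = find_tesseract()
    if path is None:
        return False
    try:
        import pytesseract  # optional dependency
    except ImportError:
        return False
    pytesseract.pytesseract.tesseract_cmd = path
    return True
```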
    

No text extracted

  • Check that files are in supported formats
  • Verify file permissions
  • Check logs for error messages
  • Try processing a single file first to debug
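When debugging a single file, a quick pre-check such as the sketch below can rule out the most common causes. The function diagnose_file and its shortened extension set are hypothetical and only mirror the checklist above.

```python
import os
from pathlib import Path
from typing import List

# Hypothetical subset of supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".py", ".pdf", ".png", ".jpg"}


def diagnose_file(path: Path) -> List[str]:
    """Return human-readable reasons why `path` might yield no text."""
    if not path.is_file():
        return [f"{path} does not exist or is not a regular file"]
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    if not os.access(path, os.R_OK):
        problems.append("file is not readable by the current user")
    if path.stat().st_size == 0:
        problems.append("file is empty")
    return problems
```

An empty list means none of these basic checks failed, so the next place to look is the extraction logs.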

Performance Tips

  1. Large directories: Processing can take time for large directories. Progress is logged every 100 files.

  2. Parallel processing: Consider processing files in parallel if you have many large files.

  3. Filtering: Use min_length to filter out very short lines that may not be useful for training.

  4. Caching: For repeated processing, consider saving extracted text to a file first.
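Tips 2 and 4 can be combined in a small sketch like the following. It is an illustration under stated assumptions: extract_parallel, save_samples, and load_samples are hypothetical names, and extract_one stands in for whatever per-file extraction function you use. Threads suit I/O-bound work and OCR that runs in an external process; for CPU-bound parsing in pure Python, a process pool may be a better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def extract_parallel(
    paths: Iterable[Path],
    extract_one: Callable[[Path], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run `extract_one` on each path in a thread pool and merge the results."""
    samples: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths.
        for file_samples in pool.map(extract_one, paths):
            samples.extend(file_samples)
    return samples


def save_samples(samples: List[str], cache_file: Path) -> None:
    """Cache samples on disk, one per line (samples are single lines already)."""
    cache_file.write_text("".join(f"{s}\n" for s in samples), encoding="utf-8")


def load_samples(cache_file: Path) -> List[str]:
    """Load cached samples written by save_samples."""
    return [line for line in cache_file.read_text(encoding="utf-8").splitlines() if line]
```

On later runs, loading the cache file avoids repeating PDF extraction and OCR.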

Next Steps

Once your data is processed, the training script will automatically:

  1. Tokenize the text
  2. Create training batches
  3. Train your model

For more information on training, see RETRAINING_GUIDE.md.