Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit `3d2da94ce2` by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00 (60 changed files, 25153 additions, 0 deletions)

# Data Processing Explained: Step-by-Step Guide
Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text.
## Table of Contents
1. [What is Data Processing?](#1-what-is-data-processing)
2. [Why Do We Need Data Processing?](#2-why-do-we-need-data-processing)
3. [The Data Processing Pipeline](#3-the-data-processing-pipeline)
4. [Step-by-Step: How Each File Type is Processed](#4-step-by-step-how-each-file-type-is-processed)
5. [Data Transformation Stages](#5-data-transformation-stages)
6. [Complete Example: Processing "Hello World.pdf"](#6-complete-example-processing-hello-worldpdf)
7. [Data Quality and Filtering](#7-data-quality-and-filtering)
8. [Common Questions](#8-common-questions)
---
## 1. What is Data Processing?
**Data processing** is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.
### Simple Analogy
Think of data processing like preparing ingredients for cooking:
**Raw Ingredients (Your Files):**
- PDF documents
- Text files
- Images with text
- Code files
**Prepared Ingredients (Processed Data):**
- Clean text lines
- Consistent format
- Ready for training
**The Recipe (Training):**
- The model learns from the prepared ingredients
### In Our Context
**Input:** Mixed file types (PDFs, images, code, text)
**Output:** List of text strings ready for tokenization
**Purpose:** Extract meaningful text that the model can learn from
---
## 2. Why Do We Need Data Processing?
### 2.1 The Problem
Machine learning models (like our transformer) understand **numbers**, not:
- PDF files
- Images
- Raw text files
- Code files
### 2.2 The Solution
We need to:
1. **Extract** text from different file formats
2. **Clean** the text (remove noise, handle encoding)
3. **Standardize** the format (consistent structure)
4. **Prepare** for tokenization (split into manageable pieces)
### 2.3 Benefits
**Unified Format**: All data becomes text lines
**Easy to Process**: Simple format for tokenization
**Flexible**: Works with many file types
**Scalable**: Can process thousands of files automatically
---
## 3. The Data Processing Pipeline
### 3.1 High-Level Overview
```
Raw Files
    ↓
[File Type Detection]
    ↓
[Text Extraction]
    ↓
[Text Cleaning]
    ↓
[Line Splitting]
    ↓
[Filtering]
    ↓
Clean Text Lines
    ↓
[Tokenization] ← Not part of data processing
    ↓
[Training] ← Not part of data processing
```
### 3.2 Detailed Pipeline
```
Step 1: Directory Scan
└─→ Find all files in data/ directory
└─→ Categorize by file type (.pdf, .txt, .png, etc.)
Step 2: File Type Detection
└─→ Check file extension
└─→ Route to appropriate processor
Step 3: Text Extraction
├─→ PDF files → PDF text extraction
├─→ Text files → Read as text
├─→ Image files → OCR (Optical Character Recognition)
└─→ Code files → Read as text
Step 4: Text Cleaning
└─→ Remove extra whitespace
└─→ Handle encoding issues
└─→ Normalize line endings
Step 5: Line Splitting
└─→ Split text into individual lines
└─→ Each line becomes one training sample
Step 6: Filtering
 └─→ Remove empty lines
 └─→ Drop lines shorter than the minimum length
Step 7: Output
└─→ List of text strings
└─→ Ready for tokenization
```
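Put together, the seven steps above can be sketched in a few lines of Python. This is an illustrative outline, not the project's actual implementation: the processor table, the `MIN_LENGTH` threshold, and the function names are assumptions, and the PDF and OCR extractors are omitted.

```python
from pathlib import Path

MIN_LENGTH = 10  # assumed minimum line length (see the filtering examples below)

def extract_text_file(path: Path) -> str:
    # Plain-text formats (.txt, .md, code files) are read directly
    return path.read_text(encoding="utf-8")

# Step 2: route each extension to a processor (PDF and OCR extractors omitted)
PROCESSORS = {
    ".txt": extract_text_file,
    ".md": extract_text_file,
    ".py": extract_text_file,
}

def process_directory(root: Path) -> list[str]:
    samples = []
    for path in sorted(root.rglob("*")):                 # Step 1: recursive scan
        processor = PROCESSORS.get(path.suffix.lower())  # Step 2: type detection
        if processor is None:
            continue                                     # unsupported type: skip
        text = processor(path)                           # Step 3: extraction
        for line in text.splitlines():                   # Step 5: line splitting
            line = line.rstrip()                         # Step 4: cleaning
            if len(line) >= MIN_LENGTH:                  # Step 6: filtering
                samples.append(line)
    return samples                                       # Step 7: output
```

Note that cleaning uses `rstrip()` rather than `strip()`, so code indentation survives.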
---
## 4. Step-by-Step: How Each File Type is Processed
### 4.1 Text Files (.txt, .md, .log, etc.)
**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample
**Example:**
**Input:** `document.txt`
```
Hello world
This is a sentence.
Machine learning is fascinating.
```
**Processing:**
```
Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."
```
**Output:**
```python
[
"Hello world",
"This is a sentence.",
"Machine learning is fascinating."
]
```
**Why this works:** Text files are already in plain text format, so extraction is straightforward.
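A minimal reader for this case might look like the following sketch (the function name is illustrative, not the project's API):

```python
def load_text_file(path):
    """Read a plain-text file and return one sample per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

Calling `load_text_file("document.txt")` on the file above would return the three-element list shown in the output.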
---
### 4.2 Code Files (.py, .js, .java, etc.)
**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample
**Example:**
**Input:** `example.py`
```python
def hello():
    print("Hello")
    return True
```
**Processing:**
```
Line 1: 'def hello():'
Line 2: '    print("Hello")'
Line 3: '    return True'
```
**Output:**
```python
[
    "def hello():",
    '    print("Hello")',
    "    return True"
]
```
**Why this works:** Code files are text files, so they're processed the same way. The model learns code patterns and syntax.
---
### 4.3 PDF Files (.pdf)
**What happens:**
1. PDF file is opened
2. Text is extracted from each page
3. Text is split into lines
4. Lines are filtered for quality
**Example:**
**Input:** `document.pdf` (3 pages)
**Page 1:**
```
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.
```
**Page 2:**
```
Neural Networks
Neural networks are computing systems inspired by biological neural networks.
```
**Page 3:**
```
Conclusion
In conclusion, machine learning has revolutionized technology.
```
**Processing:**
**Step 1: Extract text from each page**
```
Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."
```
**Step 2: Split by newlines**
```
Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."
```
**Step 3: Filter short lines** (this example uses a stricter minimum length, 35 characters, so bare heading lines fall below it)
```
Remove: "Introduction to Machine Learning" (32 chars: heading)
Keep: "Machine learning is a subset of artificial intelligence."
Remove: "Neural Networks" (15 chars: heading)
Keep: "Neural networks are computing systems inspired by biological neural networks."
Remove: "Conclusion" (10 chars: heading)
Keep: "In conclusion, machine learning has revolutionized technology."
```
**Output:**
```python
[
"Machine learning is a subset of artificial intelligence.",
"Neural networks are computing systems inspired by biological neural networks.",
"In conclusion, machine learning has revolutionized technology."
]
```
**Why this works:** PDFs contain text embedded in the file structure. Libraries like PyPDF2 or pdfplumber extract this text, preserving the content but losing formatting.
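The per-page extraction itself would come from a library such as pdfplumber or PyPDF2; the combine, split, and filter steps afterwards are plain string handling. Here is a sketch with the page strings hard-coded, using an assumed 35-character minimum chosen so that the heading lines fall below it:

```python
# Page text as a PDF library (e.g. pdfplumber's page.extract_text()) might return it
pages = [
    "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence.",
    "Neural Networks\nNeural networks are computing systems inspired by biological neural networks.",
    "Conclusion\nIn conclusion, machine learning has revolutionized technology.",
]

combined = "\n".join(pages)        # combine pages into one text block
lines = combined.split("\n")       # split by newlines
MIN_LENGTH = 35                    # assumed stricter threshold for this example
samples = [line for line in lines if len(line) >= MIN_LENGTH]  # drop headings
```

The resulting `samples` list matches the three sentences shown in the output above.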
---
### 4.4 Image Files (.png, .jpg, etc.)
**What happens:**
1. Image file is opened
2. OCR (Optical Character Recognition) reads text from the image
3. Extracted text is split into lines
4. Lines are filtered for quality
**Example:**
**Input:** `screenshot.png` containing:
```
Hello World
This is text in an image.
```
**Processing:**
**Step 1: OCR Processing**
```
Image → OCR Engine → Text
"Hello World\nThis is text in an image."
```
**Step 2: Split by newlines**
```
Line 1: "Hello World"
Line 2: "This is text in an image."
```
**Step 3: Filter short lines**
```
Keep: "Hello World" (11 chars ≥ 10)
Keep: "This is text in an image."
```
**Output:**
```python
[
    "Hello World",
    "This is text in an image."
]
```
**Why this works:** OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.
---
## 5. Data Transformation Stages
### 5.1 Stage 1: File Discovery
**Purpose:** Find all files to process
**Process:**
```
Directory: data/
├── document.pdf
├── code.py
├── screenshot.png
└── notes.txt
Scan recursively:
├── Find: document.pdf
├── Find: code.py
├── Find: screenshot.png
└── Find: notes.txt
Total: 4 files found
```
**Result:** List of file paths to process
---
### 5.2 Stage 2: File Type Classification
**Purpose:** Determine how to process each file
**Process:**
```
File: document.pdf
├── Extension: .pdf
├── Type: PDF
└── Processor: PDF Extractor
File: code.py
├── Extension: .py
├── Type: Code
└── Processor: Text Reader
File: screenshot.png
├── Extension: .png
├── Type: Image
└── Processor: OCR
File: notes.txt
├── Extension: .txt
├── Type: Text
└── Processor: Text Reader
```
**Result:** Each file assigned to appropriate processor
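This classification step is essentially a lookup table from extension to processor. A sketch (the table contents and the function name are illustrative):

```python
from pathlib import Path

# Illustrative extension-to-processor table
FILE_TYPES = {
    ".pdf": "PDF Extractor",
    ".txt": "Text Reader",
    ".py":  "Text Reader",
    ".png": "OCR",
}

def classify(filename: str) -> str:
    ext = Path(filename).suffix.lower()  # ".PDF" and ".pdf" are treated alike
    return FILE_TYPES.get(ext, "Unsupported")
```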
---
### 5.3 Stage 3: Text Extraction
**Purpose:** Get raw text from each file
**Process:**
**PDF File:**
```
document.pdf
→ Open PDF
→ Extract Page 1: "Introduction..."
→ Extract Page 2: "Chapter 1..."
→ Extract Page 3: "Conclusion..."
→ Combine: "Introduction...\nChapter 1...\nConclusion..."
```
**Text File:**
```
notes.txt
→ Open file
→ Read content: "Hello\nWorld\nTest"
```
**Image File:**
```
screenshot.png
→ Open image
→ Run OCR
→ Extract: "Hello World\nThis is text"
```
**Code File:**
```
code.py
→ Open file
→ Read content: "def hello():\n print('Hi')"
```
**Result:** Raw text strings from each file
---
### 5.4 Stage 4: Text Cleaning
**Purpose:** Standardize and clean the extracted text
**Process:**
**Input:**
```
"Hello World\r\n\r\n\r\nThis is a test. "
```
**Step 1: Remove Extra Whitespace** (trailing spaces are stripped)
```
"Hello World\r\n\r\n\r\nThis is a test. "
"Hello World\r\n\r\n\r\nThis is a test."
```
**Step 2: Normalize Line Endings** (Windows-style `\r\n` becomes `\n`)
```
"Hello World\r\n\r\n\r\nThis is a test."
"Hello World\n\n\nThis is a test."
```
**Step 3: Handle Encoding**
```
"Hello World" (UTF-8)
"Hello World" (checked and valid)
```
**Result:** Cleaned text strings
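These cleaning steps map onto a couple of string operations. A minimal sketch (the function name is illustrative; a real pipeline might do more, such as Unicode normalization):

```python
def clean_text(raw: str) -> str:
    """Normalize line endings and strip trailing whitespace from each line."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # normalize endings
    return "\n".join(line.rstrip() for line in text.split("\n"))
```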
---
### 5.5 Stage 5: Line Splitting
**Purpose:** Break text into individual training samples
**Process:**
**Input:**
```
"Hello World\nThis is a test.\nMachine learning is cool."
```
**Split by newlines:**
```
Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."
```
**Result:** List of individual text lines
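In Python this stage is a single call; `str.splitlines()` also copes with `\r\n` endings if cleaning left any behind:

```python
text = "Hello World\nThis is a test.\nMachine learning is cool."
lines = text.splitlines()
# lines is now ["Hello World", "This is a test.", "Machine learning is cool."]
```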
---
### 5.6 Stage 6: Filtering
**Purpose:** Keep only useful text samples
**Process:**
**Input:**
```python
[
"Hello World", # Length: 11
"Hi", # Length: 2 (too short)
"This is a sentence.", # Length: 19
"", # Empty (remove)
"A" # Length: 1 (too short)
]
```
**Filter criteria:**
- Minimum length: 10 characters
- Non-empty strings
**Filtering:**
```
Keep: "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep: "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)
```
**Output:**
```python
[
"Hello World",
"This is a sentence."
]
```
**Result:** Filtered list of quality text samples
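The filter stage reduces to one list comprehension; a sketch using the same 10-character minimum as the example (the function name is illustrative):

```python
def filter_samples(lines, min_length=10):
    """Keep only non-empty lines of at least min_length characters."""
    return [line for line in lines if len(line.strip()) >= min_length]

samples = filter_samples(["Hello World", "Hi", "This is a sentence.", "", "A"])
```

Here `samples` is `["Hello World", "This is a sentence."]`, matching the output above.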
---
## 6. Complete Example: Processing "Hello World.pdf"
Let's trace through processing a complete PDF file step-by-step.
### Input
**File:** `Hello World.pdf`
**Location:** `data/documents/Hello World.pdf`
**Content:** 2 pages with text
### Step-by-Step Processing
#### Step 1: File Discovery
```
Scanning: data/
├── documents/
│ └── Hello World.pdf ← Found
├── images/
└── code/
File found: data/documents/Hello World.pdf
```
#### Step 2: File Type Detection
```
File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor
```
#### Step 3: PDF Text Extraction
**Page 1 Content:**
```
Hello World
This is a simple example document.
It contains multiple sentences.
```
**Page 2 Content:**
```
Second Page
Here is more content.
The end.
```
**Extraction Process:**
```
Open PDF file
Extract Page 1:
Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
Extract Page 2:
Text: "Second Page\nHere is more content.\nThe end."
Combine pages:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
#### Step 4: Text Cleaning
**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
**Process:**
```
Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)
```
**Output:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
#### Step 5: Line Splitting
**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
**Split by newline character (`\n`):**
```
Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."
```
#### Step 6: Filtering
**Input:**
```python
[
"Hello World", # Length: 11
    "This is a simple example document.", # Length: 34
"It contains multiple sentences.", # Length: 31
"Second Page", # Length: 11
"Here is more content.", # Length: 21
"The end." # Length: 8 (too short!)
]
```
**Filter: Minimum length = 10**
```
✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (34 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)
```
#### Step 7: Final Output
**Result:**
```python
[
"Hello World",
"This is a simple example document.",
"It contains multiple sentences.",
"Second Page",
"Here is more content."
]
```
**Statistics:**
- Files processed: 1
- Pages extracted: 2
- Lines extracted: 6
- Lines kept: 5
- Lines filtered: 1
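The whole trace can be reproduced in a few lines; the page strings are hard-coded here in place of the real PDF extraction:

```python
# Text a PDF library might extract from the two pages of "Hello World.pdf"
pages = [
    "Hello World\nThis is a simple example document.\nIt contains multiple sentences.",
    "Second Page\nHere is more content.\nThe end.",
]

lines = "\n".join(pages).split("\n")                # Steps 3-5: combine and split
kept = [line for line in lines if len(line) >= 10]  # Step 6: minimum length 10

stats = {
    "pages_extracted": len(pages),              # 2
    "lines_extracted": len(lines),              # 6
    "lines_kept": len(kept),                    # 5
    "lines_filtered": len(lines) - len(kept),   # 1
}
```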
---
## 7. Data Quality and Filtering
### 7.1 Why Filter?
**Problem:** Not all text is useful for training
**Examples of Low-Quality Text:**
```
✗ "" (empty line)
✗ " " (just whitespace)
✗ "Hi" (too short, no context)
✗ "A" (single character)
✗ "..." (ellipsis, no meaning)
✗ "---" (separator line)
```
**Examples of High-Quality Text:**
```
✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."
```
### 7.2 Filtering Criteria
**Minimum Length Filter:**
**Purpose:** Remove very short lines that don't provide context
**Example:**
```
Minimum length: 10 characters
Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)
Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)
```
**Why 10 characters?**
- Provides enough context for meaningful learning
- Filters out headers, separators, and noise
- Ensures each sample has semantic value
### 7.3 Encoding Handling
**Problem:** Files may have different encodings
**Solution:** Try multiple encodings
**Process:**
```
Try UTF-8 first:
✓ Success → Use UTF-8
✗ Failure → Try Latin-1
✓ Success → Use Latin-1
✗ Failure → Log error and skip file
```
**Example:**
**UTF-8 file:**
```
"Hello 世界" → Reads correctly
```
**Latin-1 file:**
```
"Hello café" → Reads correctly with Latin-1
```
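In Python this fallback chain is a short loop. Latin-1 can decode any byte sequence, so it works as a last resort; the function name is illustrative:

```python
def read_with_fallback(path):
    """Try UTF-8 first, then Latin-1; return None if neither decodes."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    return None  # a real pipeline would log an error and skip the file
```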
### 7.4 Error Handling
**What happens when processing fails?**
**Examples:**
**Corrupted PDF:**
```
File: corrupted.pdf
→ Try to extract text
→ Error: "Cannot read PDF"
→ Log warning: "Failed to process corrupted.pdf"
→ Skip file
→ Continue with next file
```
**Unsupported File Type:**
```
File: presentation.pptx
→ Extension: .pptx
→ Type: Not supported
→ Warning: "Unsupported file type: .pptx"
→ Skip file
→ Continue with next file
```
**Image OCR Failure:**
```
File: blurry_image.png
→ Try OCR
→ OCR returns empty or garbled text
→ Filter removes empty lines
→ No text extracted
→ File processed (no output)
```
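All three failure modes reduce to the same pattern: detect or catch the problem, log a warning, and move on to the next file. A sketch (the extractor table is supplied by the caller; names are illustrative):

```python
import logging

def process_all(paths, extractors):
    """Process each file, logging and skipping failures instead of crashing."""
    results = []
    for path in paths:
        ext = path.rsplit(".", 1)[-1]
        extractor = extractors.get(ext)
        if extractor is None:
            logging.warning("Unsupported file type: .%s", ext)
            continue  # skip file, continue with next
        try:
            results.extend(extractor(path))
        except Exception:
            logging.warning("Failed to process %s", path)
            continue  # skip file, continue with next
    return results
```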
---
## 8. Common Questions
### Q1: Why process PDFs instead of using them directly?
**Answer:**
Models work with numbers (token IDs), not file formats. PDFs have:
- Complex structure (fonts, layouts, metadata)
- Embedded formatting
- Binary data mixed with text
Processing extracts just the text content, which is what the model needs.
### Q2: What if OCR doesn't work well on an image?
**Answer:**
- Low-quality images produce poor OCR results
- The system will extract what it can
- Poor OCR output is filtered out (too short or garbled)
- The file is processed but may contribute little or no text
**Solution:** Use high-quality images with clear text for best results.
### Q3: Why split text into lines?
**Answer:**
- Each line becomes a training sample
- Models predict next tokens in sequences
- Shorter sequences are easier to process
- Allows the model to learn from diverse sentence structures
### Q4: What happens to code formatting?
**Answer:**
- Code is processed as text
- Indentation and structure are preserved
- Each line becomes a sample
- The model learns code patterns and syntax
**Example:**
```python
def hello():
    print("Hi")
```
Becomes:
```
'def hello():'
'    print("Hi")'
```
### Q5: Can I process files in parallel?
**Answer:**
Currently, files are processed sequentially. Future improvements could include:
- Parallel processing of multiple files
- Multi-threaded extraction
- Batch processing for efficiency
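As one possible future direction, per-file extraction parallelizes naturally because files are independent; a sketch using the standard library's `concurrent.futures`, with an `extract` placeholder standing in for the real per-file processor:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(path):
    # Placeholder: a real extractor would dispatch on file type
    return [f"line from {path}"]

def process_parallel(paths, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though work runs concurrently
        for lines in pool.map(extract, paths):
            results.extend(lines)
    return results
```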
### Q6: What if a file is very large?
**Answer:**
- Large files are processed line by line
- Memory usage stays manageable
- Progress is logged every 100 files
- System can handle files of any size (within memory limits)
### Q7: How is data from different file types combined?
**Answer:**
All extracted text is combined into a single list:
```
PDF file → 50 lines extracted
Text file → 30 lines extracted
Code file → 100 lines extracted
Image → 5 lines extracted
Combined: 185 text lines total
```
All lines are treated equally, regardless of source file type.
---
## Summary
### What is Data Processing?
**Data processing** is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.
### Key Steps
1. **Find Files**: Scan directory for all files
2. **Classify**: Determine file type (.pdf, .txt, .png, etc.)
3. **Extract**: Get text content from each file
4. **Clean**: Remove noise and standardize format
5. **Split**: Break into individual lines
6. **Filter**: Keep only quality text samples
### Result
A list of text strings ready for:
- Tokenization (converting to numbers)
- Training (teaching the model)
- Learning (model understanding patterns)
### Example Flow
```
PDF file "document.pdf"
Extract text from pages
Clean and split into lines
Filter by length
["Sentence 1.", "Sentence 2.", "Sentence 3."]
Ready for tokenization and training!
```
---
*This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.*