Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
Commit `3d2da94ce2` by Carlos Gutierrez, 2025-11-06 22:07:41 -05:00 (60 changed files, 25153 additions, 0 deletions)

# Data Processing Explained: Step-by-Step Guide
Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text.
## Table of Contents
1. [What is Data Processing?](#1-what-is-data-processing)
2. [Why Do We Need Data Processing?](#2-why-do-we-need-data-processing)
3. [The Data Processing Pipeline](#3-the-data-processing-pipeline)
4. [Step-by-Step: How Each File Type is Processed](#4-step-by-step-how-each-file-type-is-processed)
5. [Data Transformation Stages](#5-data-transformation-stages)
6. [Complete Example: Processing "Hello World.pdf"](#6-complete-example-processing-hello-worldpdf)
7. [Data Quality and Filtering](#7-data-quality-and-filtering)
8. [Common Questions](#8-common-questions)
---
## 1. What is Data Processing?
**Data processing** is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.
### Simple Analogy
Think of data processing like preparing ingredients for cooking:
**Raw Ingredients (Your Files):**
- PDF documents
- Text files
- Images with text
- Code files
**Prepared Ingredients (Processed Data):**
- Clean text lines
- Consistent format
- Ready for training
**The Recipe (Training):**
- The model learns from the prepared ingredients
### In Our Context
**Input:** Mixed file types (PDFs, images, code, text)
**Output:** List of text strings ready for tokenization
**Purpose:** Extract meaningful text that the model can learn from
---
## 2. Why Do We Need Data Processing?
### 2.1 The Problem
Machine learning models (like our transformer) understand **numbers**, not:
- PDF files
- Images
- Raw text files
- Code files
### 2.2 The Solution
We need to:
1. **Extract** text from different file formats
2. **Clean** the text (remove noise, handle encoding)
3. **Standardize** the format (consistent structure)
4. **Prepare** for tokenization (split into manageable pieces)
### 2.3 Benefits
**Unified Format**: All data becomes text lines
**Easy to Process**: Simple format for tokenization
**Flexible**: Works with many file types
**Scalable**: Can process thousands of files automatically
---
## 3. The Data Processing Pipeline
### 3.1 High-Level Overview
```
Raw Files
    ↓
[File Type Detection]
    ↓
[Text Extraction]
    ↓
[Text Cleaning]
    ↓
[Line Splitting]
    ↓
[Filtering]
    ↓
Clean Text Lines
    ↓
[Tokenization] ← Not part of data processing
    ↓
[Training] ← Not part of data processing
```
### 3.2 Detailed Pipeline
```
Step 1: Directory Scan
└─→ Find all files in data/ directory
└─→ Categorize by file type (.pdf, .txt, .png, etc.)
Step 2: File Type Detection
└─→ Check file extension
└─→ Route to appropriate processor
Step 3: Text Extraction
├─→ PDF files → PDF text extraction
├─→ Text files → Read as text
├─→ Image files → OCR (Optical Character Recognition)
└─→ Code files → Read as text
Step 4: Text Cleaning
└─→ Remove extra whitespace
└─→ Handle encoding issues
└─→ Normalize line endings
Step 5: Line Splitting
└─→ Split text into individual lines
└─→ Each line becomes one training sample
Step 6: Filtering
 └─→ Remove empty lines
 └─→ Drop lines shorter than the minimum length
Step 7: Output
└─→ List of text strings
└─→ Ready for tokenization
```
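Put together, the seven steps above can be sketched in a few lines of Python. This is an illustrative outline, not the project's actual implementation: the processor table, the `MIN_LENGTH` threshold, and the function names are assumptions, and the PDF and OCR extractors are omitted.

```python
from pathlib import Path

MIN_LENGTH = 10  # assumed minimum line length (see the filtering examples below)

def extract_text_file(path: Path) -> str:
    # Plain-text formats (.txt, .md, code files) are read directly
    return path.read_text(encoding="utf-8")

# Step 2: route each extension to a processor (PDF and OCR extractors omitted)
PROCESSORS = {
    ".txt": extract_text_file,
    ".md": extract_text_file,
    ".py": extract_text_file,
}

def process_directory(root: Path) -> list[str]:
    samples = []
    for path in sorted(root.rglob("*")):                 # Step 1: recursive scan
        processor = PROCESSORS.get(path.suffix.lower())  # Step 2: type detection
        if processor is None:
            continue                                     # unsupported type: skip
        text = processor(path)                           # Step 3: extraction
        for line in text.splitlines():                   # Step 5: line splitting
            line = line.rstrip()                         # Step 4: cleaning
            if len(line) >= MIN_LENGTH:                  # Step 6: filtering
                samples.append(line)
    return samples                                       # Step 7: output
```

Note that cleaning uses `rstrip()` rather than `strip()`, so code indentation survives.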
---
## 4. Step-by-Step: How Each File Type is Processed
### 4.1 Text Files (.txt, .md, .log, etc.)
**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample
**Example:**
**Input:** `document.txt`
```
Hello world
This is a sentence.
Machine learning is fascinating.
```
**Processing:**
```
Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."
```
**Output:**
```python
[
"Hello world",
"This is a sentence.",
"Machine learning is fascinating."
]
```
**Why this works:** Text files are already in plain text format, so extraction is straightforward.
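A minimal reader for this case might look like the following sketch (the function name is illustrative, not the project's API):

```python
def load_text_file(path):
    """Read a plain-text file and return one sample per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

Calling `load_text_file("document.txt")` on the file above would return the three-element list shown in the output.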
---
### 4.2 Code Files (.py, .js, .java, etc.)
**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample
**Example:**
**Input:** `example.py`
```python
def hello():
    print("Hello")
    return True
```
**Processing:**
```
Line 1: 'def hello():'
Line 2: '    print("Hello")'
Line 3: '    return True'
```
**Output:**
```python
[
    "def hello():",
    '    print("Hello")',
    "    return True"
]
```
**Why this works:** Code files are text files, so they're processed the same way. The model learns code patterns and syntax.
---
### 4.3 PDF Files (.pdf)
**What happens:**
1. PDF file is opened
2. Text is extracted from each page
3. Text is split into lines
4. Lines are filtered for quality
**Example:**
**Input:** `document.pdf` (3 pages)
**Page 1:**
```
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.
```
**Page 2:**
```
Neural Networks
Neural networks are computing systems inspired by biological neural networks.
```
**Page 3:**
```
Conclusion
In conclusion, machine learning has revolutionized technology.
```
**Processing:**
**Step 1: Extract text from each page**
```
Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."
```
**Step 2: Split by newlines**
```
Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."
```
**Step 3: Filter short lines** (this example uses a stricter minimum length, 35 characters, so bare heading lines fall below it)
```
Remove: "Introduction to Machine Learning" (32 chars: heading)
Keep: "Machine learning is a subset of artificial intelligence."
Remove: "Neural Networks" (15 chars: heading)
Keep: "Neural networks are computing systems inspired by biological neural networks."
Remove: "Conclusion" (10 chars: heading)
Keep: "In conclusion, machine learning has revolutionized technology."
```
**Output:**
```python
[
"Machine learning is a subset of artificial intelligence.",
"Neural networks are computing systems inspired by biological neural networks.",
"In conclusion, machine learning has revolutionized technology."
]
```
**Why this works:** PDFs contain text embedded in the file structure. Libraries like PyPDF2 or pdfplumber extract this text, preserving the content but losing formatting.
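The per-page extraction itself would come from a library such as pdfplumber or PyPDF2; the combine, split, and filter steps afterwards are plain string handling. Here is a sketch with the page strings hard-coded, using an assumed 35-character minimum chosen so that the heading lines fall below it:

```python
# Page text as a PDF library (e.g. pdfplumber's page.extract_text()) might return it
pages = [
    "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence.",
    "Neural Networks\nNeural networks are computing systems inspired by biological neural networks.",
    "Conclusion\nIn conclusion, machine learning has revolutionized technology.",
]

combined = "\n".join(pages)        # combine pages into one text block
lines = combined.split("\n")       # split by newlines
MIN_LENGTH = 35                    # assumed stricter threshold for this example
samples = [line for line in lines if len(line) >= MIN_LENGTH]  # drop headings
```

The resulting `samples` list matches the three sentences shown in the output above.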
---
### 4.4 Image Files (.png, .jpg, etc.)
**What happens:**
1. Image file is opened
2. OCR (Optical Character Recognition) reads text from the image
3. Extracted text is split into lines
4. Lines are filtered for quality
**Example:**
**Input:** `screenshot.png` containing:
```
Hello World
This is text in an image.
```
**Processing:**
**Step 1: OCR Processing**
```
Image → OCR Engine → Text
"Hello World\nThis is text in an image."
```
**Step 2: Split by newlines**
```
Line 1: "Hello World"
Line 2: "This is text in an image."
```
**Step 3: Filter short lines**
```
Keep: "Hello World" (11 chars ≥ 10)
Keep: "This is text in an image."
```
**Output:**
```python
[
    "Hello World",
    "This is text in an image."
]
```
**Why this works:** OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.
---
## 5. Data Transformation Stages
### 5.1 Stage 1: File Discovery
**Purpose:** Find all files to process
**Process:**
```
Directory: data/
├── document.pdf
├── code.py
├── screenshot.png
└── notes.txt
Scan recursively:
├── Find: document.pdf
├── Find: code.py
├── Find: screenshot.png
└── Find: notes.txt
Total: 4 files found
```
**Result:** List of file paths to process
---
### 5.2 Stage 2: File Type Classification
**Purpose:** Determine how to process each file
**Process:**
```
File: document.pdf
├── Extension: .pdf
├── Type: PDF
└── Processor: PDF Extractor
File: code.py
├── Extension: .py
├── Type: Code
└── Processor: Text Reader
File: screenshot.png
├── Extension: .png
├── Type: Image
└── Processor: OCR
File: notes.txt
├── Extension: .txt
├── Type: Text
└── Processor: Text Reader
```
**Result:** Each file assigned to appropriate processor
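This classification step is essentially a lookup table from extension to processor. A sketch (the table contents and the function name are illustrative):

```python
from pathlib import Path

# Illustrative extension-to-processor table
FILE_TYPES = {
    ".pdf": "PDF Extractor",
    ".txt": "Text Reader",
    ".py":  "Text Reader",
    ".png": "OCR",
}

def classify(filename: str) -> str:
    ext = Path(filename).suffix.lower()  # ".PDF" and ".pdf" are treated alike
    return FILE_TYPES.get(ext, "Unsupported")
```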
---
### 5.3 Stage 3: Text Extraction
**Purpose:** Get raw text from each file
**Process:**
**PDF File:**
```
document.pdf
→ Open PDF
→ Extract Page 1: "Introduction..."
→ Extract Page 2: "Chapter 1..."
→ Extract Page 3: "Conclusion..."
→ Combine: "Introduction...\nChapter 1...\nConclusion..."
```
**Text File:**
```
notes.txt
→ Open file
→ Read content: "Hello\nWorld\nTest"
```
**Image File:**
```
screenshot.png
→ Open image
→ Run OCR
→ Extract: "Hello World\nThis is text"
```
**Code File:**
```
code.py
→ Open file
→ Read content: "def hello():\n print('Hi')"
```
**Result:** Raw text strings from each file
---
### 5.4 Stage 4: Text Cleaning
**Purpose:** Standardize and clean the extracted text
**Process:**
**Input:**
```
"Hello World\r\n\r\n\r\nThis is a test. "
```
**Step 1: Remove Extra Whitespace** (trailing spaces are stripped)
```
"Hello World\r\n\r\n\r\nThis is a test. "
"Hello World\r\n\r\n\r\nThis is a test."
```
**Step 2: Normalize Line Endings** (Windows-style `\r\n` becomes `\n`)
```
"Hello World\r\n\r\n\r\nThis is a test."
"Hello World\n\n\nThis is a test."
```
**Step 3: Handle Encoding**
```
"Hello World" (UTF-8)
"Hello World" (checked and valid)
```
**Result:** Cleaned text strings
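These cleaning steps map onto a couple of string operations. A minimal sketch (the function name is illustrative; a real pipeline might do more, such as Unicode normalization):

```python
def clean_text(raw: str) -> str:
    """Normalize line endings and strip trailing whitespace from each line."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # normalize endings
    return "\n".join(line.rstrip() for line in text.split("\n"))
```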
---
### 5.5 Stage 5: Line Splitting
**Purpose:** Break text into individual training samples
**Process:**
**Input:**
```
"Hello World\nThis is a test.\nMachine learning is cool."
```
**Split by newlines:**
```
Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."
```
**Result:** List of individual text lines
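In Python this stage is a single call; `str.splitlines()` also copes with `\r\n` endings if cleaning left any behind:

```python
text = "Hello World\nThis is a test.\nMachine learning is cool."
lines = text.splitlines()
# lines is now ["Hello World", "This is a test.", "Machine learning is cool."]
```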
---
### 5.6 Stage 6: Filtering
**Purpose:** Keep only useful text samples
**Process:**
**Input:**
```python
[
"Hello World", # Length: 11
"Hi", # Length: 2 (too short)
"This is a sentence.", # Length: 19
"", # Empty (remove)
"A" # Length: 1 (too short)
]
```
**Filter criteria:**
- Minimum length: 10 characters
- Non-empty strings
**Filtering:**
```
Keep: "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep: "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)
```
**Output:**
```python
[
"Hello World",
"This is a sentence."
]
```
**Result:** Filtered list of quality text samples
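The filter stage reduces to one list comprehension; a sketch using the same 10-character minimum as the example (the function name is illustrative):

```python
def filter_samples(lines, min_length=10):
    """Keep only non-empty lines of at least min_length characters."""
    return [line for line in lines if len(line.strip()) >= min_length]

samples = filter_samples(["Hello World", "Hi", "This is a sentence.", "", "A"])
```

Here `samples` is `["Hello World", "This is a sentence."]`, matching the output above.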
---
## 6. Complete Example: Processing "Hello World.pdf"
Let's trace through processing a complete PDF file step-by-step.
### Input
**File:** `Hello World.pdf`
**Location:** `data/documents/Hello World.pdf`
**Content:** 2 pages with text
### Step-by-Step Processing
#### Step 1: File Discovery
```
Scanning: data/
├── documents/
│ └── Hello World.pdf ← Found
├── images/
└── code/
File found: data/documents/Hello World.pdf
```
#### Step 2: File Type Detection
```
File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor
```
#### Step 3: PDF Text Extraction
**Page 1 Content:**
```
Hello World
This is a simple example document.
It contains multiple sentences.
```
**Page 2 Content:**
```
Second Page
Here is more content.
The end.
```
**Extraction Process:**
```
Open PDF file
Extract Page 1:
Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
Extract Page 2:
Text: "Second Page\nHere is more content.\nThe end."
Combine pages:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
#### Step 4: Text Cleaning
**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
**Process:**
```
Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)
```
**Output:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
#### Step 5: Line Splitting
**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```
**Split by newline character (`\n`):**
```
Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."
```
#### Step 6: Filtering
**Input:**
```python
[
"Hello World", # Length: 11
    "This is a simple example document.", # Length: 34
"It contains multiple sentences.", # Length: 31
"Second Page", # Length: 11
"Here is more content.", # Length: 21
"The end." # Length: 8 (too short!)
]
```
**Filter: Minimum length = 10**
```
✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (34 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)
```
#### Step 7: Final Output
**Result:**
```python
[
"Hello World",
"This is a simple example document.",
"It contains multiple sentences.",
"Second Page",
"Here is more content."
]
```
**Statistics:**
- Files processed: 1
- Pages extracted: 2
- Lines extracted: 6
- Lines kept: 5
- Lines filtered: 1
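The whole trace can be reproduced in a few lines; the page strings are hard-coded here in place of the real PDF extraction:

```python
# Text a PDF library might extract from the two pages of "Hello World.pdf"
pages = [
    "Hello World\nThis is a simple example document.\nIt contains multiple sentences.",
    "Second Page\nHere is more content.\nThe end.",
]

lines = "\n".join(pages).split("\n")                # Steps 3-5: combine and split
kept = [line for line in lines if len(line) >= 10]  # Step 6: minimum length 10

stats = {
    "pages_extracted": len(pages),              # 2
    "lines_extracted": len(lines),              # 6
    "lines_kept": len(kept),                    # 5
    "lines_filtered": len(lines) - len(kept),   # 1
}
```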
---
## 7. Data Quality and Filtering
### 7.1 Why Filter?
**Problem:** Not all text is useful for training
**Examples of Low-Quality Text:**
```
✗ "" (empty line)
✗ " " (just whitespace)
✗ "Hi" (too short, no context)
✗ "A" (single character)
✗ "..." (ellipsis, no meaning)
✗ "---" (separator line)
```
**Examples of High-Quality Text:**
```
✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."
```
### 7.2 Filtering Criteria
**Minimum Length Filter:**
**Purpose:** Remove very short lines that don't provide context
**Example:**
```
Minimum length: 10 characters
Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)
Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)
```
**Why 10 characters?**
- Provides enough context for meaningful learning
- Filters out headers, separators, and noise
- Ensures each sample has semantic value
### 7.3 Encoding Handling
**Problem:** Files may have different encodings
**Solution:** Try multiple encodings
**Process:**
```
Try UTF-8 first:
✓ Success → Use UTF-8
✗ Failure → Try Latin-1
✓ Success → Use Latin-1
✗ Failure → Log error and skip file
```
**Example:**
**UTF-8 file:**
```
"Hello 世界" → Reads correctly
```
**Latin-1 file:**
```
"Hello café" → Reads correctly with Latin-1
```
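In Python this fallback chain is a short loop. Latin-1 can decode any byte sequence, so it works as a last resort; the function name is illustrative:

```python
def read_with_fallback(path):
    """Try UTF-8 first, then Latin-1; return None if neither decodes."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    return None  # a real pipeline would log an error and skip the file
```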
### 7.4 Error Handling
**What happens when processing fails?**
**Examples:**
**Corrupted PDF:**
```
File: corrupted.pdf
→ Try to extract text
→ Error: "Cannot read PDF"
→ Log warning: "Failed to process corrupted.pdf"
→ Skip file
→ Continue with next file
```
**Unsupported File Type:**
```
File: presentation.pptx
→ Extension: .pptx
→ Type: Not supported
→ Warning: "Unsupported file type: .pptx"
→ Skip file
→ Continue with next file
```
**Image OCR Failure:**
```
File: blurry_image.png
→ Try OCR
→ OCR returns empty or garbled text
→ Filter removes empty lines
→ No text extracted
→ File processed (no output)
```
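All three failure modes reduce to the same pattern: detect or catch the problem, log a warning, and move on to the next file. A sketch (the extractor table is supplied by the caller; names are illustrative):

```python
import logging

def process_all(paths, extractors):
    """Process each file, logging and skipping failures instead of crashing."""
    results = []
    for path in paths:
        ext = path.rsplit(".", 1)[-1]
        extractor = extractors.get(ext)
        if extractor is None:
            logging.warning("Unsupported file type: .%s", ext)
            continue  # skip file, continue with next
        try:
            results.extend(extractor(path))
        except Exception:
            logging.warning("Failed to process %s", path)
            continue  # skip file, continue with next
    return results
```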
---
## 8. Common Questions
### Q1: Why process PDFs instead of using them directly?
**Answer:**
Models work with numbers (token IDs), not file formats. PDFs have:
- Complex structure (fonts, layouts, metadata)
- Embedded formatting
- Binary data mixed with text
Processing extracts just the text content, which is what the model needs.
### Q2: What if OCR doesn't work well on an image?
**Answer:**
- Low-quality images produce poor OCR results
- The system will extract what it can
- Poor OCR output is filtered out (too short or garbled)
- The file is processed but may contribute little or no text
**Solution:** Use high-quality images with clear text for best results.
### Q3: Why split text into lines?
**Answer:**
- Each line becomes a training sample
- Models predict next tokens in sequences
- Shorter sequences are easier to process
- Allows the model to learn from diverse sentence structures
### Q4: What happens to code formatting?
**Answer:**
- Code is processed as text
- Indentation and structure are preserved
- Each line becomes a sample
- The model learns code patterns and syntax
**Example:**
```python
def hello():
    print("Hi")
```
Becomes:
```
'def hello():'
'    print("Hi")'
```
### Q5: Can I process files in parallel?
**Answer:**
Currently, files are processed sequentially. Future improvements could include:
- Parallel processing of multiple files
- Multi-threaded extraction
- Batch processing for efficiency
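As one possible future direction, per-file extraction parallelizes naturally because files are independent; a sketch using the standard library's `concurrent.futures`, with an `extract` placeholder standing in for the real per-file processor:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(path):
    # Placeholder: a real extractor would dispatch on file type
    return [f"line from {path}"]

def process_parallel(paths, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though work runs concurrently
        for lines in pool.map(extract, paths):
            results.extend(lines)
    return results
```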
### Q6: What if a file is very large?
**Answer:**
- Large files are processed line by line
- Memory usage stays manageable
- Progress is logged every 100 files
- System can handle files of any size (within memory limits)
### Q7: How is data from different file types combined?
**Answer:**
All extracted text is combined into a single list:
```
PDF file → 50 lines extracted
Text file → 30 lines extracted
Code file → 100 lines extracted
Image → 5 lines extracted
Combined: 185 text lines total
```
All lines are treated equally, regardless of source file type.
---
## Summary
### What is Data Processing?
**Data processing** is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.
### Key Steps
1. **Find Files**: Scan directory for all files
2. **Classify**: Determine file type (.pdf, .txt, .png, etc.)
3. **Extract**: Get text content from each file
4. **Clean**: Remove noise and standardize format
5. **Split**: Break into individual lines
6. **Filter**: Keep only quality text samples
### Result
A list of text strings ready for:
- Tokenization (converting to numbers)
- Training (teaching the model)
- Learning (model understanding patterns)
### Example Flow
```
PDF file "document.pdf"
Extract text from pages
Clean and split into lines
Filter by length
["Sentence 1.", "Sentence 2.", "Sentence 3."]
Ready for tokenization and training!
```
---
*This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.*