Data Processing Explained: Step-by-Step Guide
Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text.
Table of Contents
- What is Data Processing?
- Why Do We Need Data Processing?
- The Data Processing Pipeline
- Step-by-Step: How Each File Type is Processed
- Data Transformation Stages
- Complete Example: Processing "Hello World.pdf"
- Data Quality and Filtering
- Common Questions
1. What is Data Processing?
Data processing is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.
Simple Analogy
Think of data processing like preparing ingredients for cooking:
Raw Ingredients (Your Files):
- PDF documents
- Text files
- Images with text
- Code files
Prepared Ingredients (Processed Data):
- Clean text lines
- Consistent format
- Ready for training
The Recipe (Training):
- The model learns from the prepared ingredients
In Our Context
Input: Mixed file types (PDFs, images, code, text)
Output: List of text strings ready for tokenization
Purpose: Extract meaningful text that the model can learn from
2. Why Do We Need Data Processing?
2.1 The Problem
Machine learning models (like our transformer) understand numbers, not:
- PDF files
- Images
- Raw text files
- Code files
2.2 The Solution
We need to:
- Extract text from different file formats
- Clean the text (remove noise, handle encoding)
- Standardize the format (consistent structure)
- Prepare for tokenization (split into manageable pieces)
2.3 Benefits
✅ Unified Format: All data becomes text lines
✅ Easy to Process: Simple format for tokenization
✅ Flexible: Works with many file types
✅ Scalable: Can process thousands of files automatically
3. The Data Processing Pipeline
3.1 High-Level Overview
Raw Files
↓
[File Type Detection]
↓
[Text Extraction]
↓
[Text Cleaning]
↓
[Line Splitting]
↓
[Filtering]
↓
Clean Text Lines
↓
[Tokenization] ← Not part of data processing
↓
[Training] ← Not part of data processing
3.2 Detailed Pipeline
Step 1: Directory Scan
└─→ Find all files in data/ directory
└─→ Categorize by file type (.pdf, .txt, .png, etc.)
Step 2: File Type Detection
└─→ Check file extension
└─→ Route to appropriate processor
Step 3: Text Extraction
├─→ PDF files → PDF text extraction
├─→ Text files → Read as text
├─→ Image files → OCR (Optical Character Recognition)
└─→ Code files → Read as text
Step 4: Text Cleaning
└─→ Remove extra whitespace
└─→ Handle encoding issues
└─→ Normalize line endings
Step 5: Line Splitting
└─→ Split text into individual lines
└─→ Each line becomes one training sample
Step 6: Filtering
└─→ Remove empty lines
└─→ Filter by minimum length
└─→ Remove lines that are too short
Step 7: Output
└─→ List of text strings
└─→ Ready for tokenization
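The seven steps above can be sketched in Python. This is a simplified illustration, not the project's actual code: it handles only plain-text extensions (the PDF and OCR branches are omitted), and every name here is made up.

```python
from pathlib import Path

# Extensions routed to the plain-text reader in this sketch.
TEXT_EXTS = {".txt", ".md", ".log", ".py", ".js", ".java"}

def process_directory(data_dir: str, min_length: int = 10) -> list[str]:
    samples = []
    for path in sorted(Path(data_dir).rglob("*")):        # Step 1: directory scan
        if path.suffix.lower() not in TEXT_EXTS:          # Step 2: type detection
            continue
        text = path.read_text(encoding="utf-8", errors="replace")  # Step 3: extraction
        for line in text.splitlines():                    # Step 5: line splitting
            line = line.strip()                           # Step 4: cleaning
            if len(line) >= min_length:                   # Step 6: filtering
                samples.append(line)
    return samples                                        # Step 7: output
```

A full implementation would route `.pdf` and image files to their own extractors before the shared cleaning and filtering steps.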
4. Step-by-Step: How Each File Type is Processed
4.1 Text Files (.txt, .md, .log, etc.)
What happens:
- File is opened
- Content is read line by line
- Each line becomes a separate text sample
Example:
Input: document.txt
Hello world
This is a sentence.
Machine learning is fascinating.
Processing:
Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."
Output:
[
"Hello world",
"This is a sentence.",
"Machine learning is fascinating."
]
Why this works: Text files are already in plain text format, so extraction is straightforward.
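A minimal sketch of this step, assuming UTF-8 content (the function name is illustrative):

```python
def load_text_file(path: str) -> list[str]:
    # Read a plain-text file and return one sample per line.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```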
4.2 Code Files (.py, .js, .java, etc.)
What happens:
- File is opened
- Content is read line by line
- Each line becomes a separate text sample
Example:
Input: example.py
def hello():
    print("Hello")
    return True
Processing:
Line 1: "def hello():"
Line 2: '    print("Hello")'
Line 3: '    return True'
Output:
[
"def hello():",
" print("Hello")",
" return True"
]
Why this works: Code files are text files, so they're processed the same way. The model learns code patterns and syntax.
4.3 PDF Files (.pdf)
What happens:
- PDF file is opened
- Text is extracted from each page
- Text is split into lines
- Lines are filtered for quality
Example:
Input: document.pdf (3 pages)
Page 1:
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.
Page 2:
Neural Networks
Neural networks are computing systems inspired by biological neural networks.
Page 3:
Conclusion
In conclusion, machine learning has revolutionized technology.
Processing:
Step 1: Extract text from each page
Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."
Step 2: Split by newlines
Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."
Step 3: Filter short lines (a 35-character minimum is assumed here, so heading-only lines are dropped)
Remove: "Introduction to Machine Learning" (32 chars < 35)
Keep: "Machine learning is a subset of artificial intelligence."
Remove: "Neural Networks" (15 chars < 35)
Keep: "Neural networks are computing systems inspired by biological neural networks."
Remove: "Conclusion" (10 chars < 35)
Keep: "In conclusion, machine learning has revolutionized technology."
Output:
[
"Machine learning is a subset of artificial intelligence.",
"Neural networks are computing systems inspired by biological neural networks.",
"In conclusion, machine learning has revolutionized technology."
]
Why this works: PDFs contain text embedded in the file structure. Libraries such as pypdf (the successor to PyPDF2) or pdfplumber extract this text, preserving the content but losing most formatting.
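The split-and-filter half of this step can be sketched as follows. The per-page text would come from a PDF library; the pypdf call shown in the comment is an assumption (the project may use a different extractor), and the 35-character minimum is chosen so that heading-only lines like those in the example are filtered out.

```python
def pdf_pages_to_samples(page_texts, min_length=35):
    # page_texts: one string per page, as a PDF library would return.
    combined = "\n".join(page_texts)                       # combine pages
    lines = (ln.strip() for ln in combined.split("\n"))    # split into lines
    return [ln for ln in lines if len(ln) >= min_length]   # drop short lines

# Obtaining page_texts with pypdf is an assumption (requires `pip install pypdf`):
#   from pypdf import PdfReader
#   page_texts = [page.extract_text() for page in PdfReader("document.pdf").pages]
```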
4.4 Image Files (.png, .jpg, etc.)
What happens:
- Image file is opened
- OCR (Optical Character Recognition) reads text from the image
- Extracted text is split into lines
- Lines are filtered for quality
Example:
Input: screenshot.png containing:
Hello World
This is text in an image.
Processing:
Step 1: OCR Processing
Image → OCR Engine → Text
"Hello World\nThis is text in an image."
Step 2: Split by newlines
Line 1: "Hello World"
Line 2: "This is text in an image."
Step 3: Filter short lines (a 12-character minimum is assumed here)
Remove: "Hello World" (11 chars < 12)
Keep: "This is text in an image."
Output:
[
"This is text in an image."
]
Why this works: OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.
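The filtering half of this step can be sketched as below. The OCR call itself (shown in the comment) is an assumption; an engine such as pytesseract would supply the raw string, and the 12-character minimum matches the example above.

```python
def ocr_text_to_samples(ocr_text: str, min_length: int = 12) -> list[str]:
    # ocr_text is the raw string an OCR engine returns, e.g. (assumption,
    # requires Tesseract): pytesseract.image_to_string(Image.open(path))
    lines = (ln.strip() for ln in ocr_text.splitlines())
    return [ln for ln in lines if len(ln) >= min_length]
```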
5. Data Transformation Stages
5.1 Stage 1: File Discovery
Purpose: Find all files to process
Process:
Directory: data/
├── document.pdf
├── code.py
├── screenshot.png
└── notes.txt
Scan recursively:
├── Find: document.pdf
├── Find: code.py
├── Find: screenshot.png
└── Find: notes.txt
Total: 4 files found
Result: List of file paths to process
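A possible sketch of the recursive scan (illustrative, not the project's actual code):

```python
from pathlib import Path

def discover_files(data_dir: str) -> list[Path]:
    # Recursively collect every regular file under data_dir.
    return sorted(p for p in Path(data_dir).rglob("*") if p.is_file())
```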
5.2 Stage 2: File Type Classification
Purpose: Determine how to process each file
Process:
File: document.pdf
├── Extension: .pdf
├── Type: PDF
└── Processor: PDF Extractor
File: code.py
├── Extension: .py
├── Type: Code
└── Processor: Text Reader
File: screenshot.png
├── Extension: .png
├── Type: Image
└── Processor: OCR
File: notes.txt
├── Extension: .txt
├── Type: Text
└── Processor: Text Reader
Result: Each file assigned to appropriate processor
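One way to sketch this routing table in Python (the processor labels are illustrative names, not the project's actual classes):

```python
from pathlib import Path

# Extension → processor mapping (illustrative).
PROCESSORS = {
    ".pdf": "PDF Extractor",
    ".png": "OCR", ".jpg": "OCR", ".jpeg": "OCR",
    ".txt": "Text Reader", ".md": "Text Reader",
    ".py": "Text Reader", ".js": "Text Reader", ".java": "Text Reader",
}

def classify(filename: str) -> str:
    # Look up the extension; unknown types fall through to "Unsupported".
    return PROCESSORS.get(Path(filename).suffix.lower(), "Unsupported")
```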
5.3 Stage 3: Text Extraction
Purpose: Get raw text from each file
Process:
PDF File:
document.pdf
→ Open PDF
→ Extract Page 1: "Introduction..."
→ Extract Page 2: "Chapter 1..."
→ Extract Page 3: "Conclusion..."
→ Combine: "Introduction...\nChapter 1...\nConclusion..."
Text File:
notes.txt
→ Open file
→ Read content: "Hello\nWorld\nTest"
Image File:
screenshot.png
→ Open image
→ Run OCR
→ Extract: "Hello World\nThis is text"
Code File:
code.py
→ Open file
→ Read content: "def hello():\n print('Hi')"
Result: Raw text strings from each file
5.4 Stage 4: Text Cleaning
Purpose: Standardize and clean the extracted text
Process:
Input:
"Hello World\r\n\r\n\r\nThis is a test. "
Step 1: Remove Trailing Whitespace
"Hello World\r\n\r\n\r\nThis is a test. "
↓
"Hello World\r\n\r\n\r\nThis is a test."
Step 2: Normalize Line Endings (\r\n → \n)
"Hello World\r\n\r\n\r\nThis is a test."
↓
"Hello World\n\n\nThis is a test."
Step 3: Handle Encoding
"Hello World" (UTF-8)
↓
"Hello World" (checked and valid)
Result: Cleaned text strings
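A sketch of the whitespace and line-ending steps (illustrative, not the project's code):

```python
def clean_text(raw: str) -> str:
    # Normalize \r\n and bare \r line endings to \n, then strip trailing
    # whitespace from each line without collapsing blank lines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in text.split("\n"))
```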
5.5 Stage 5: Line Splitting
Purpose: Break text into individual training samples
Process:
Input:
"Hello World\nThis is a test.\nMachine learning is cool."
Split by newlines:
Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."
Result: List of individual text lines
5.6 Stage 6: Filtering
Purpose: Keep only useful text samples
Process:
Input:
[
"Hello World", # Length: 11
"Hi", # Length: 2 (too short)
"This is a sentence.", # Length: 19
"", # Empty (remove)
"A" # Length: 1 (too short)
]
Filter criteria:
- Minimum length: 10 characters
- Non-empty strings
Filtering:
Keep: "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep: "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)
Output:
[
"Hello World",
"This is a sentence."
]
Result: Filtered list of quality text samples
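This stage is essentially a one-line filter in Python (sketch):

```python
def filter_samples(lines: list[str], min_length: int = 10) -> list[str]:
    # Keep only non-empty lines with at least min_length characters.
    return [ln for ln in lines if len(ln.strip()) >= min_length]
```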
6. Complete Example: Processing "Hello World.pdf"
Let's trace through processing a complete PDF file step-by-step.
Input
File: Hello World.pdf
Location: data/documents/Hello World.pdf
Content: 2 pages with text
Step-by-Step Processing
Step 1: File Discovery
Scanning: data/
├── documents/
│ └── Hello World.pdf ← Found
├── images/
└── code/
File found: data/documents/Hello World.pdf
Step 2: File Type Detection
File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor
Step 3: PDF Text Extraction
Page 1 Content:
Hello World
This is a simple example document.
It contains multiple sentences.
Page 2 Content:
Second Page
Here is more content.
The end.
Extraction Process:
Open PDF file
↓
Extract Page 1:
Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
↓
Extract Page 2:
Text: "Second Page\nHere is more content.\nThe end."
↓
Combine pages:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
Step 4: Text Cleaning
Input:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
Process:
Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)
Output:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
Step 5: Line Splitting
Input:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
Split by newline character (\n):
Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."
Step 6: Filtering
Input:
[
"Hello World", # Length: 11
"This is a simple example document.", # Length: 36
"It contains multiple sentences.", # Length: 31
"Second Page", # Length: 11
"Here is more content.", # Length: 21
"The end." # Length: 8 (too short!)
]
Filter: Minimum length = 10
✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (36 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)
Step 7: Final Output
Result:
[
"Hello World",
"This is a simple example document.",
"It contains multiple sentences.",
"Second Page",
"Here is more content."
]
Statistics:
- Files processed: 1
- Pages extracted: 2
- Lines extracted: 6
- Lines kept: 5
- Lines filtered: 1
7. Data Quality and Filtering
7.1 Why Filter?
Problem: Not all text is useful for training
Examples of Low-Quality Text:
✗ "" (empty line)
✗ " " (just whitespace)
✗ "Hi" (too short, no context)
✗ "A" (single character)
✗ "..." (ellipsis, no meaning)
✗ "---" (separator line)
Examples of High-Quality Text:
✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."
7.2 Filtering Criteria
Minimum Length Filter:
Purpose: Remove very short lines that don't provide context
Example:
Minimum length: 10 characters
Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)
Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)
Why 10 characters?
- Provides enough context for meaningful learning
- Filters out headers, separators, and noise
- Ensures each sample has semantic value
7.3 Encoding Handling
Problem: Files may have different encodings
Solution: Try multiple encodings
Process:
Try UTF-8 first:
✓ Success → Use UTF-8
✗ Failure → Try Latin-1
✓ Success → Use Latin-1
✗ Failure → Log error and skip file
Example:
UTF-8 file:
"Hello 世界" → Reads correctly
Latin-1 file:
"Hello café" → Reads correctly with Latin-1
7.4 Error Handling
What happens when processing fails?
Examples:
Corrupted PDF:
File: corrupted.pdf
→ Try to extract text
→ Error: "Cannot read PDF"
→ Log warning: "Failed to process corrupted.pdf"
→ Skip file
→ Continue with next file
Unsupported File Type:
File: presentation.pptx
→ Extension: .pptx
→ Type: Not supported
→ Warning: "Unsupported file type: .pptx"
→ Skip file
→ Continue with next file
Image OCR Failure:
File: blurry_image.png
→ Try OCR
→ OCR returns empty or garbled text
→ Filter removes empty lines
→ No text extracted
→ File processed (no output)
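The skip-and-continue pattern behind all three cases can be sketched as follows; the `extract` callback stands in for whichever per-file processor applies.

```python
import logging

def process_file_safely(path: str, extract) -> list[str]:
    # Wrap one file's extraction so that a corrupted or unreadable
    # file is logged and skipped instead of stopping the whole run.
    try:
        return extract(path)
    except Exception as exc:
        logging.warning("Failed to process %s: %s", path, exc)
        return []
```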
8. Common Questions
Q1: Why process PDFs instead of using them directly?
Answer:
Models work with numbers (token IDs), not file formats. PDFs have:
- Complex structure (fonts, layouts, metadata)
- Embedded formatting
- Binary data mixed with text
Processing extracts just the text content, which is what the model needs.
Q2: What if OCR doesn't work well on an image?
Answer:
- Low-quality images produce poor OCR results
- The system will extract what it can
- Poor OCR output is filtered out (too short or garbled)
- The file is processed but may contribute little or no text
Solution: Use high-quality images with clear text for best results.
Q3: Why split text into lines?
Answer:
- Each line becomes a training sample
- Models predict next tokens in sequences
- Shorter sequences are easier to process
- Allows the model to learn from diverse sentence structures
Q4: What happens to code formatting?
Answer:
- Code is processed as text
- Indentation and structure are preserved
- Each line becomes a sample
- The model learns code patterns and syntax
Example:
def hello():
    print("Hi")
Becomes:
"def hello():"
'    print("Hi")'
Q5: Can I process files in parallel?
Answer:
Currently, files are processed sequentially. Future improvements could include:
- Parallel processing of multiple files
- Multi-threaded extraction
- Batch processing for efficiency
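As a sketch of what that could look like (a hypothetical future improvement, not current behavior), using the standard library's thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def process_files_parallel(paths, extract, max_workers=4):
    # Run the per-file extractor concurrently; map() preserves input
    # order, and the per-file line lists are flattened into one list.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = pool.map(extract, paths)
    return [line for lines in per_file for line in lines]
```

Threads help when extraction is I/O-bound (reading files); CPU-bound work like OCR would benefit more from a process pool.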
Q6: What if a file is very large?
Answer:
- Large files are processed line by line
- Memory usage stays manageable
- Progress is logged every 100 files
- System can handle files of any size (within memory limits)
Q7: How is data from different file types combined?
Answer:
All extracted text is combined into a single list:
PDF file → 50 lines extracted
Text file → 30 lines extracted
Code file → 100 lines extracted
Image → 5 lines extracted
Combined: 185 text lines total
All lines are treated equally, regardless of source file type.
Summary
What is Data Processing?
Data processing is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.
Key Steps
- Find Files: Scan directory for all files
- Classify: Determine file type (.pdf, .txt, .png, etc.)
- Extract: Get text content from each file
- Clean: Remove noise and standardize format
- Split: Break into individual lines
- Filter: Keep only quality text samples
Result
A list of text strings ready for:
- Tokenization (converting to numbers)
- Training (teaching the model)
- Learning (model understanding patterns)
Example Flow
PDF file "document.pdf"
↓
Extract text from pages
↓
Clean and split into lines
↓
Filter by length
↓
["Sentence 1.", "Sentence 2.", "Sentence 3."]
↓
Ready for tokenization and training!
This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.