Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00


Data Processing Explained: Step-by-Step Guide

A complete guide to data processing in the SheepOp LLM project, explaining what happens to your data on its way from raw files to training-ready text.

Table of Contents

  1. What is Data Processing?
  2. Why Do We Need Data Processing?
  3. The Data Processing Pipeline
  4. Step-by-Step: How Each File Type is Processed
  5. Data Transformation Stages
  6. Complete Example: Processing "Hello World.pdf"
  7. Data Quality and Filtering
  8. Common Questions

1. What is Data Processing?

Data processing is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.

Simple Analogy

Think of data processing like preparing ingredients for cooking:

Raw Ingredients (Your Files):

  • PDF documents
  • Text files
  • Images with text
  • Code files

Prepared Ingredients (Processed Data):

  • Clean text lines
  • Consistent format
  • Ready for training

The Recipe (Training):

  • The model learns from the prepared ingredients

In Our Context

Input: Mixed file types (PDFs, images, code, text)
Output: List of text strings ready for tokenization
Purpose: Extract meaningful text that the model can learn from


2. Why Do We Need Data Processing?

2.1 The Problem

Machine learning models (like our transformer) understand numbers, not:

  • PDF files
  • Images
  • Raw text files
  • Code files

2.2 The Solution

We need to:

  1. Extract text from different file formats
  2. Clean the text (remove noise, handle encoding)
  3. Standardize the format (consistent structure)
  4. Prepare for tokenization (split into manageable pieces)

2.3 Benefits

Unified Format: All data becomes text lines
Easy to Process: Simple format for tokenization
Flexible: Works with many file types
Scalable: Can process thousands of files automatically


3. The Data Processing Pipeline

3.1 High-Level Overview

Raw Files
    ↓
[File Type Detection]
    ↓
[Text Extraction]
    ↓
[Text Cleaning]
    ↓
[Line Splitting]
    ↓
[Filtering]
    ↓
Clean Text Lines
    ↓
[Tokenization] ← Not part of data processing
    ↓
[Training] ← Not part of data processing

3.2 Detailed Pipeline

Step 1: Directory Scan
    └─→ Find all files in data/ directory
        └─→ Categorize by file type (.pdf, .txt, .png, etc.)

Step 2: File Type Detection
    └─→ Check file extension
        └─→ Route to appropriate processor

Step 3: Text Extraction
    ├─→ PDF files → PDF text extraction
    ├─→ Text files → Read as text
    ├─→ Image files → OCR (Optical Character Recognition)
    └─→ Code files → Read as text

Step 4: Text Cleaning
    └─→ Remove extra whitespace
        └─→ Handle encoding issues
            └─→ Normalize line endings

Step 5: Line Splitting
    └─→ Split text into individual lines
        └─→ Each line becomes one training sample

Step 6: Filtering
    └─→ Remove empty lines
        └─→ Filter by minimum length
            └─→ Remove lines that are too short

Step 7: Output
    └─→ List of text strings
        └─→ Ready for tokenization

4. Step-by-Step: How Each File Type is Processed

4.1 Text Files (.txt, .md, .log, etc.)

What happens:

  1. File is opened
  2. Content is read line by line
  3. Each line becomes a separate text sample

Example:

Input: document.txt

Hello world
This is a sentence.
Machine learning is fascinating.

Processing:

Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."

Output:

[
    "Hello world",
    "This is a sentence.",
    "Machine learning is fascinating."
]

Why this works: Text files are already in plain text format, so extraction is straightforward.


4.2 Code Files (.py, .js, .java, etc.)

What happens:

  1. File is opened
  2. Content is read line by line
  3. Each line becomes a separate text sample

Example:

Input: example.py

def hello():
    print("Hello")
    return True

Processing:

Line 1: "def hello():"
Line 2: '    print("Hello")'
Line 3: "    return True"

Output:

[
    "def hello():",
    '    print("Hello")',
    "    return True"
]

Why this works: Code files are text files, so they're processed the same way. The model learns code patterns and syntax.


4.3 PDF Files (.pdf)

What happens:

  1. PDF file is opened
  2. Text is extracted from each page
  3. Text is split into lines
  4. Lines are filtered for quality

Example:

Input: document.pdf (3 pages)

Page 1:

Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.

Page 2:

Neural Networks
Neural networks are computing systems inspired by biological neural networks.

Page 3:

Conclusion
In conclusion, machine learning has revolutionized technology.

Processing:

Step 1: Extract text from each page

Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."

Step 2: Split by newlines

Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."

Step 3: Filter low-context lines

For illustration, this example drops bare heading lines (short fragments with no sentence punctuation) in addition to lines under the minimum length:

Remove: "Introduction to Machine Learning" (heading, no sentence context)
Keep: "Machine learning is a subset of artificial intelligence."
Remove: "Neural Networks" (heading)
Keep: "Neural networks are computing systems inspired by biological neural networks."
Remove: "Conclusion" (heading)
Keep: "In conclusion, machine learning has revolutionized technology."

Output:

[
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "In conclusion, machine learning has revolutionized technology."
]

Why this works: PDFs contain text embedded in the file structure. Libraries such as pypdf (the successor to PyPDF2) or pdfplumber extract this text, preserving the content but losing the formatting.
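The split-and-filter half of this pipeline can be sketched as a small function. The extraction call in the comment assumes the pypdf package; the function name itself is illustrative:

```python
def pdf_pages_to_lines(page_texts, min_len=10):
    """Join per-page text, split on newlines, strip each line,
    and keep only lines meeting the minimum length."""
    lines = []
    for text in page_texts:
        for line in text.split("\n"):
            line = line.strip()
            if len(line) >= min_len:
                lines.append(line)
    return lines

# Hypothetical extraction step (requires the pypdf package):
#   from pypdf import PdfReader
#   pages = [page.extract_text() or "" for page in PdfReader("doc.pdf").pages]
#   lines = pdf_pages_to_lines(pages)
```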


4.4 Image Files (.png, .jpg, etc.)

What happens:

  1. Image file is opened
  2. OCR (Optical Character Recognition) reads text from the image
  3. Extracted text is split into lines
  4. Lines are filtered for quality

Example:

Input: screenshot.png containing:

Hello World
This is text in an image.

Processing:

Step 1: OCR Processing

Image → OCR Engine → Text
"Hello World\nThis is text in an image."

Step 2: Split by newlines

Line 1: "Hello World"
Line 2: "This is text in an image."

Step 3: Filter short lines

Keep: "Hello World" (11 characters, meets the 10-character minimum)
Keep: "This is text in an image."

Output:

[
    "Hello World",
    "This is text in an image."
]

Why this works: OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.


5. Data Transformation Stages

5.1 Stage 1: File Discovery

Purpose: Find all files to process

Process:

Directory: data/
    ├── document.pdf
    ├── code.py
    ├── screenshot.png
    └── notes.txt

Scan recursively:
    ├── Find: document.pdf
    ├── Find: code.py
    ├── Find: screenshot.png
    └── Find: notes.txt

Total: 4 files found

Result: List of file paths to process


5.2 Stage 2: File Type Classification

Purpose: Determine how to process each file

Process:

File: document.pdf
    ├── Extension: .pdf
    ├── Type: PDF
    └── Processor: PDF Extractor

File: code.py
    ├── Extension: .py
    ├── Type: Code
    └── Processor: Text Reader

File: screenshot.png
    ├── Extension: .png
    ├── Type: Image
    └── Processor: OCR

File: notes.txt
    ├── Extension: .txt
    ├── Type: Text
    └── Processor: Text Reader

Result: Each file assigned to appropriate processor
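This classification stage amounts to a lookup from file extension to processor category. A minimal sketch (the mapping shown is illustrative; the real project may cover more extensions):

```python
from pathlib import Path

# Illustrative extension-to-category map.
FILE_TYPES = {
    ".pdf": "pdf",
    ".txt": "text", ".md": "text", ".log": "text",
    ".py": "code", ".js": "code", ".java": "code",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
}

def classify(path):
    """Return the processor category for a path, or None if unsupported."""
    return FILE_TYPES.get(Path(path).suffix.lower())
```

Lower-casing the suffix means `photo.PNG` and `photo.png` route the same way.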


5.3 Stage 3: Text Extraction

Purpose: Get raw text from each file

Process:

PDF File:

document.pdf
    → Open PDF
    → Extract Page 1: "Introduction..."
    → Extract Page 2: "Chapter 1..."
    → Extract Page 3: "Conclusion..."
    → Combine: "Introduction...\nChapter 1...\nConclusion..."

Text File:

notes.txt
    → Open file
    → Read content: "Hello\nWorld\nTest"

Image File:

screenshot.png
    → Open image
    → Run OCR
    → Extract: "Hello World\nThis is text"

Code File:

code.py
    → Open file
    → Read content: "def hello():\n    print('Hi')"

Result: Raw text strings from each file


5.4 Stage 4: Text Cleaning

Purpose: Standardize and clean the extracted text

Process:

Input:

"Hello   World\n\n\nThis is a test.  "

Step 1: Remove Extra Whitespace

"Hello   World\n\n\nThis is a test.  "
    ↓
"Hello World\n\n\nThis is a test."

Step 2: Normalize Line Endings

"Hello World\n\n\nThis is a test."
    ↓
"Hello World\n\n\nThis is a test."

(No change in this example; Windows-style \r\n endings would be rewritten to \n in this step.)

Step 3: Handle Encoding

"Hello World" (UTF-8)
    ↓
"Hello World" (checked and valid)

Result: Cleaned text strings
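The three cleaning steps above can be sketched as one function (illustrative, not the project's actual implementation):

```python
import re

def clean_text(text):
    """Normalize line endings, collapse runs of spaces and tabs,
    and strip trailing whitespace from every line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize endings
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces/tabs
    return "\n".join(line.rstrip() for line in text.split("\n"))
```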


5.5 Stage 5: Line Splitting

Purpose: Break text into individual training samples

Process:

Input:

"Hello World\nThis is a test.\nMachine learning is cool."

Split by newlines:

Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."

Result: List of individual text lines


5.6 Stage 6: Filtering

Purpose: Keep only useful text samples

Process:

Input:

[
    "Hello World",           # Length: 11
    "Hi",                    # Length: 2 (too short)
    "This is a sentence.",    # Length: 19
    "",                      # Empty (remove)
    "A"                      # Length: 1 (too short)
]

Filter criteria:

  • Minimum length: 10 characters
  • Non-empty strings

Filtering:

Keep: "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep: "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)

Output:

[
    "Hello World",
    "This is a sentence."
]

Result: Filtered list of quality text samples
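The filter criteria above reduce to a one-line predicate. A minimal sketch:

```python
def filter_lines(lines, min_len=10):
    """Keep lines that are non-empty after stripping and at least
    min_len characters long."""
    return [ln for ln in lines if ln.strip() and len(ln) >= min_len]
```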


6. Complete Example: Processing "Hello World.pdf"

Let's trace through processing a complete PDF file step-by-step.

Input

File: Hello World.pdf
Location: data/documents/Hello World.pdf
Content: 2 pages with text

Step-by-Step Processing

Step 1: File Discovery

Scanning: data/
    ├── documents/
    │   └── Hello World.pdf  ← Found
    ├── images/
    └── code/
    
File found: data/documents/Hello World.pdf

Step 2: File Type Detection

File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor

Step 3: PDF Text Extraction

Page 1 Content:

Hello World
This is a simple example document.
It contains multiple sentences.

Page 2 Content:

Second Page
Here is more content.
The end.

Extraction Process:

Open PDF file
    ↓
Extract Page 1:
    Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
    ↓
Extract Page 2:
    Text: "Second Page\nHere is more content.\nThe end."
    ↓
Combine pages:
    "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."

Step 4: Text Cleaning

Input:

"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."

Process:

Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)

Output:

"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."

Step 5: Line Splitting

Input:

"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."

Split by newline character (\n):

Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."

Step 6: Filtering

Input:

[
    "Hello World",                           # Length: 11
    "This is a simple example document.",     # Length: 34
    "It contains multiple sentences.",        # Length: 31
    "Second Page",                           # Length: 11
    "Here is more content.",                 # Length: 21
    "The end."                               # Length: 8 (too short!)
]

Filter: Minimum length = 10

✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (34 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)

Step 7: Final Output

Result:

[
    "Hello World",
    "This is a simple example document.",
    "It contains multiple sentences.",
    "Second Page",
    "Here is more content."
]

Statistics:

  • Files processed: 1
  • Pages extracted: 2
  • Lines extracted: 6
  • Lines kept: 5
  • Lines filtered: 1
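The whole trace condenses to a few lines of Python. Real extraction would come from a PDF library such as pypdf; here the page texts are inlined so the example is self-contained:

```python
# Page texts as they would come out of the PDF extractor.
pages = [
    "Hello World\nThis is a simple example document.\nIt contains multiple sentences.",
    "Second Page\nHere is more content.\nThe end.",
]
combined = "\n".join(pages)                          # Step 3: combine pages
lines = [ln.strip() for ln in combined.split("\n")]  # Step 5: split into lines
kept = [ln for ln in lines if len(ln) >= 10]         # Step 6: length filter
# "The end." (8 characters) is the only line dropped, leaving 5 samples.
```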

7. Data Quality and Filtering

7.1 Why Filter?

Problem: Not all text is useful for training

Examples of Low-Quality Text:

✗ ""                    (empty line)
✗ " "                   (just whitespace)
✗ "Hi"                  (too short, no context)
✗ "A"                   (single character)
✗ "..."                 (ellipsis, no meaning)
✗ "---"                 (separator line)

Examples of High-Quality Text:

✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."

7.2 Filtering Criteria

Minimum Length Filter:

Purpose: Remove very short lines that don't provide context

Example:

Minimum length: 10 characters

Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)

Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)

Why 10 characters?

  • Provides enough context for meaningful learning
  • Filters out headers, separators, and noise
  • Ensures each sample has semantic value

7.3 Encoding Handling

Problem: Files may have different encodings

Solution: Try multiple encodings

Process:

Try UTF-8 first:
    ✓ Success → Use UTF-8
    ✗ Failure → Try Latin-1
        ✓ Success → Use Latin-1
        ✗ Failure → Log error and skip file

Example:

UTF-8 file:

"Hello 世界" → Reads correctly

Latin-1 file:

"Hello café" → Reads correctly with Latin-1

7.4 Error Handling

What happens when processing fails?

Examples:

Corrupted PDF:

File: corrupted.pdf
    → Try to extract text
    → Error: "Cannot read PDF"
    → Log warning: "Failed to process corrupted.pdf"
    → Skip file
    → Continue with next file

Unsupported File Type:

File: presentation.pptx
    → Extension: .pptx
    → Type: Not supported
    → Warning: "Unsupported file type: .pptx"
    → Skip file
    → Continue with next file

Image OCR Failure:

File: blurry_image.png
    → Try OCR
    → OCR returns empty or garbled text
    → Filter removes empty lines
    → No text extracted
    → File processed (no output)
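The skip-and-continue behavior in all three cases follows one pattern: wrap the per-file processor, log the failure, and return an empty result. A minimal sketch (names illustrative):

```python
import logging

logger = logging.getLogger("data_processing")

def safe_process(path, processor):
    """Run a processor over one file; on failure, log a warning and
    return [] so the pipeline continues with the next file."""
    try:
        return processor(path)
    except Exception as exc:
        logger.warning("Failed to process %s: %s", path, exc)
        return []
```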

8. Common Questions

Q1: Why process PDFs instead of using them directly?

Answer:
Models work with numbers (token IDs), not file formats. PDFs have:

  • Complex structure (fonts, layouts, metadata)
  • Embedded formatting
  • Binary data mixed with text

Processing extracts just the text content, which is what the model needs.

Q2: What if OCR doesn't work well on an image?

Answer:

  • Low-quality images produce poor OCR results
  • The system will extract what it can
  • Poor OCR output is filtered out (too short or garbled)
  • The file is processed but may contribute little or no text

Solution: Use high-quality images with clear text for best results.

Q3: Why split text into lines?

Answer:

  • Each line becomes a training sample
  • Models predict next tokens in sequences
  • Shorter sequences are easier to process
  • Allows the model to learn from diverse sentence structures

Q4: What happens to code formatting?

Answer:

  • Code is processed as text
  • Indentation and structure are preserved
  • Each line becomes a sample
  • The model learns code patterns and syntax

Example:

def hello():
    print("Hi")

Becomes:

"def hello():"
'    print("Hi")'

Q5: Can I process files in parallel?

Answer:
Currently, files are processed sequentially. Future improvements could include:

  • Parallel processing of multiple files
  • Multi-threaded extraction
  • Batch processing for efficiency

Q6: What if a file is very large?

Answer:

  • Large files are processed line by line
  • Memory usage stays manageable
  • Progress is logged every 100 files
  • Files of arbitrary size can be handled, since only a portion needs to fit in memory at a time

Q7: How is data from different file types combined?

Answer:
All extracted text is combined into a single list:

PDF file → 50 lines extracted
Text file → 30 lines extracted
Code file → 100 lines extracted
Image → 5 lines extracted

Combined: 185 text lines total

All lines are treated equally, regardless of source file type.


Summary

What is Data Processing?

Data processing is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.

Key Steps

  1. Find Files: Scan directory for all files
  2. Classify: Determine file type (.pdf, .txt, .png, etc.)
  3. Extract: Get text content from each file
  4. Clean: Remove noise and standardize format
  5. Split: Break into individual lines
  6. Filter: Keep only quality text samples

Result

A list of text strings ready for:

  • Tokenization (converting to numbers)
  • Training (teaching the model)
  • Learning (model understanding patterns)

Example Flow

PDF file "document.pdf"
    ↓
Extract text from pages
    ↓
Clean and split into lines
    ↓
Filter by length
    ↓
["Sentence 1.", "Sentence 2.", "Sentence 3."]
    ↓
Ready for tokenization and training!

This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.