# Data Processing Explained: Step-by-Step Guide Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text. ## Table of Contents 1. [What is Data Processing?](#1-what-is-data-processing) 2. [Why Do We Need Data Processing?](#2-why-do-we-need-data-processing) 3. [The Data Processing Pipeline](#3-the-data-processing-pipeline) 4. [Step-by-Step: How Each File Type is Processed](#4-step-by-step-how-each-file-type-is-processed) 5. [Data Transformation Stages](#5-data-transformation-stages) 6. [Complete Example: Processing "Hello World.pdf"](#6-complete-example-processing-hello-worldpdf) 7. [Data Quality and Filtering](#7-data-quality-and-filtering) 8. [Common Questions](#8-common-questions) --- ## 1. What is Data Processing? **Data processing** is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from. ### Simple Analogy Think of data processing like preparing ingredients for cooking: **Raw Ingredients (Your Files):** - PDF documents - Text files - Images with text - Code files **Prepared Ingredients (Processed Data):** - Clean text lines - Consistent format - Ready for training **The Recipe (Training):** - The model learns from the prepared ingredients ### In Our Context **Input:** Mixed file types (PDFs, images, code, text) **Output:** List of text strings ready for tokenization **Purpose:** Extract meaningful text that the model can learn from --- ## 2. Why Do We Need Data Processing? ### 2.1 The Problem Machine learning models (like our transformer) understand **numbers**, not: - PDF files - Images - Raw text files - Code files ### 2.2 The Solution We need to: 1. **Extract** text from different file formats 2. **Clean** the text (remove noise, handle encoding) 3. **Standardize** the format (consistent structure) 4. **Prepare** for tokenization (split into manageable pieces) ### 2.3 Benefits ✅ **Unified Format**: All data becomes text lines ✅ **Easy to Process**: Simple format for tokenization ✅ **Flexible**: Works with many file types ✅ **Scalable**: Can process thousands of files automatically --- ## 3. The Data Processing Pipeline ### 3.1 High-Level Overview ``` Raw Files ↓ [File Type Detection] ↓ [Text Extraction] ↓ [Text Cleaning] ↓ [Line Splitting] ↓ [Filtering] ↓ Clean Text Lines ↓ [Tokenization] ← Not part of data processing ↓ [Training] ← Not part of data processing ``` ### 3.2 Detailed Pipeline ``` Step 1: Directory Scan └─→ Find all files in data/ directory └─→ Categorize by file type (.pdf, .txt, .png, etc.) Step 2: File Type Detection └─→ Check file extension └─→ Route to appropriate processor Step 3: Text Extraction ├─→ PDF files → PDF text extraction ├─→ Text files → Read as text ├─→ Image files → OCR (Optical Character Recognition) └─→ Code files → Read as text Step 4: Text Cleaning └─→ Remove extra whitespace └─→ Handle encoding issues └─→ Normalize line endings Step 5: Line Splitting └─→ Split text into individual lines └─→ Each line becomes one training sample Step 6: Filtering └─→ Remove empty lines └─→ Filter by minimum length └─→ Remove lines that are too short Step 7: Output └─→ List of text strings └─→ Ready for tokenization ``` --- ## 4. Step-by-Step: How Each File Type is Processed ### 4.1 Text Files (.txt, .md, .log, etc.) **What happens:** 1. File is opened 2. Content is read line by line 3. Each line becomes a separate text sample **Example:** **Input:** `document.txt` ``` Hello world This is a sentence. Machine learning is fascinating. ``` **Processing:** ``` Line 1: "Hello world" Line 2: "This is a sentence." Line 3: "Machine learning is fascinating." ``` **Output:** ```python [ "Hello world", "This is a sentence.", "Machine learning is fascinating." ] ``` **Why this works:** Text files are already in plain text format, so extraction is straightforward. --- ### 4.2 Code Files (.py, .js, .java, etc.) **What happens:** 1. File is opened 2. Content is read line by line 3. Each line becomes a separate text sample **Example:** **Input:** `example.py` ```python def hello(): print("Hello") return True ``` **Processing:** ``` Line 1: "def hello():" Line 2: " print("Hello")" Line 3: " return True" ``` **Output:** ```python [ "def hello():", " print("Hello")", " return True" ] ``` **Why this works:** Code files are text files, so they're processed the same way. The model learns code patterns and syntax. --- ### 4.3 PDF Files (.pdf) **What happens:** 1. PDF file is opened 2. Text is extracted from each page 3. Text is split into lines 4. Lines are filtered for quality **Example:** **Input:** `document.pdf` (3 pages) **Page 1:** ``` Introduction to Machine Learning Machine learning is a subset of artificial intelligence. ``` **Page 2:** ``` Neural Networks Neural networks are computing systems inspired by biological neural networks. ``` **Page 3:** ``` Conclusion In conclusion, machine learning has revolutionized technology. ``` **Processing:** **Step 1: Extract text from each page** ``` Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence." Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks." Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology." ``` **Step 2: Split by newlines** ``` Line 1: "Introduction to Machine Learning" Line 2: "Machine learning is a subset of artificial intelligence." Line 3: "Neural Networks" Line 4: "Neural networks are computing systems inspired by biological neural networks." Line 5: "Conclusion" Line 6: "In conclusion, machine learning has revolutionized technology." ``` **Step 3: Filter short lines** ``` Remove: "Introduction to Machine Learning" (too short for context) Keep: "Machine learning is a subset of artificial intelligence." Remove: "Neural Networks" (too short) Keep: "Neural networks are computing systems inspired by biological neural networks." Remove: "Conclusion" (too short) Keep: "In conclusion, machine learning has revolutionized technology." ``` **Output:** ```python [ "Machine learning is a subset of artificial intelligence.", "Neural networks are computing systems inspired by biological neural networks.", "In conclusion, machine learning has revolutionized technology." ] ``` **Why this works:** PDFs contain text embedded in the file structure. Libraries like PyPDF2 or pdfplumber extract this text, preserving the content but losing formatting. --- ### 4.4 Image Files (.png, .jpg, etc.) **What happens:** 1. Image file is opened 2. OCR (Optical Character Recognition) reads text from the image 3. Extracted text is split into lines 4. Lines are filtered for quality **Example:** **Input:** `screenshot.png` containing: ``` Hello World This is text in an image. ``` **Processing:** **Step 1: OCR Processing** ``` Image → OCR Engine → Text "Hello World\nThis is text in an image." ``` **Step 2: Split by newlines** ``` Line 1: "Hello World" Line 2: "This is text in an image." ``` **Step 3: Filter short lines** ``` Remove: "Hello World" (might be too short) Keep: "This is text in an image." ``` **Output:** ```python [ "This is text in an image." ] ``` **Why this works:** OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality. --- ## 5. Data Transformation Stages ### 5.1 Stage 1: File Discovery **Purpose:** Find all files to process **Process:** ``` Directory: data/ ├── document.pdf ├── code.py ├── screenshot.png └── notes.txt Scan recursively: ├── Find: document.pdf ├── Find: code.py ├── Find: screenshot.png └── Find: notes.txt Total: 4 files found ``` **Result:** List of file paths to process --- ### 5.2 Stage 2: File Type Classification **Purpose:** Determine how to process each file **Process:** ``` File: document.pdf ├── Extension: .pdf ├── Type: PDF └── Processor: PDF Extractor File: code.py ├── Extension: .py ├── Type: Code └── Processor: Text Reader File: screenshot.png ├── Extension: .png ├── Type: Image └── Processor: OCR File: notes.txt ├── Extension: .txt ├── Type: Text └── Processor: Text Reader ``` **Result:** Each file assigned to appropriate processor --- ### 5.3 Stage 3: Text Extraction **Purpose:** Get raw text from each file **Process:** **PDF File:** ``` document.pdf → Open PDF → Extract Page 1: "Introduction..." → Extract Page 2: "Chapter 1..." → Extract Page 3: "Conclusion..." → Combine: "Introduction...\nChapter 1...\nConclusion..." ``` **Text File:** ``` notes.txt → Open file → Read content: "Hello\nWorld\nTest" ``` **Image File:** ``` screenshot.png → Open image → Run OCR → Extract: "Hello World\nThis is text" ``` **Code File:** ``` code.py → Open file → Read content: "def hello():\n print('Hi')" ``` **Result:** Raw text strings from each file --- ### 5.4 Stage 4: Text Cleaning **Purpose:** Standardize and clean the extracted text **Process:** **Input:** ``` "Hello World\n\n\nThis is a test. " ``` **Step 1: Remove Extra Whitespace** ``` "Hello World\n\n\nThis is a test. " ↓ "Hello World\n\n\nThis is a test." ``` **Step 2: Normalize Line Endings** ``` "Hello World\n\n\nThis is a test." ↓ "Hello World\n\n\nThis is a test." ``` **Step 3: Handle Encoding** ``` "Hello World" (UTF-8) ↓ "Hello World" (checked and valid) ``` **Result:** Cleaned text strings --- ### 5.5 Stage 5: Line Splitting **Purpose:** Break text into individual training samples **Process:** **Input:** ``` "Hello World\nThis is a test.\nMachine learning is cool." ``` **Split by newlines:** ``` Line 1: "Hello World" Line 2: "This is a test." Line 3: "Machine learning is cool." ``` **Result:** List of individual text lines --- ### 5.6 Stage 6: Filtering **Purpose:** Keep only useful text samples **Process:** **Input:** ```python [ "Hello World", # Length: 11 "Hi", # Length: 2 (too short) "This is a sentence.", # Length: 19 "", # Empty (remove) "A" # Length: 1 (too short) ] ``` **Filter criteria:** - Minimum length: 10 characters - Non-empty strings **Filtering:** ``` Keep: "Hello World" (length 11 ≥ 10) Remove: "Hi" (length 2 < 10) Keep: "This is a sentence." (length 19 ≥ 10) Remove: "" (empty) Remove: "A" (length 1 < 10) ``` **Output:** ```python [ "Hello World", "This is a sentence." ] ``` **Result:** Filtered list of quality text samples --- ## 6. Complete Example: Processing "Hello World.pdf" Let's trace through processing a complete PDF file step-by-step. ### Input **File:** `Hello World.pdf` **Location:** `data/documents/Hello World.pdf` **Content:** 2 pages with text ### Step-by-Step Processing #### Step 1: File Discovery ``` Scanning: data/ ├── documents/ │ └── Hello World.pdf ← Found ├── images/ └── code/ File found: data/documents/Hello World.pdf ``` #### Step 2: File Type Detection ``` File: Hello World.pdf Extension: .pdf Type: PDF Processor: PDF Extractor ``` #### Step 3: PDF Text Extraction **Page 1 Content:** ``` Hello World This is a simple example document. It contains multiple sentences. ``` **Page 2 Content:** ``` Second Page Here is more content. The end. ``` **Extraction Process:** ``` Open PDF file ↓ Extract Page 1: Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences." ↓ Extract Page 2: Text: "Second Page\nHere is more content.\nThe end." ↓ Combine pages: "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end." ``` #### Step 4: Text Cleaning **Input:** ``` "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end." ``` **Process:** ``` Remove extra whitespace: ✓ (already clean) Normalize encoding: ✓ (UTF-8) Handle special characters: ✓ (none found) ``` **Output:** ``` "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end." ``` #### Step 5: Line Splitting **Input:** ``` "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end." ``` **Split by newline character (`\n`):** ``` Line 1: "Hello World" Line 2: "This is a simple example document." Line 3: "It contains multiple sentences." Line 4: "Second Page" Line 5: "Here is more content." Line 6: "The end." ``` #### Step 6: Filtering **Input:** ```python [ "Hello World", # Length: 11 "This is a simple example document.", # Length: 36 "It contains multiple sentences.", # Length: 31 "Second Page", # Length: 11 "Here is more content.", # Length: 21 "The end." # Length: 8 (too short!) ] ``` **Filter: Minimum length = 10** ``` ✓ Keep: "Hello World" (11 ≥ 10) ✓ Keep: "This is a simple example document." (36 ≥ 10) ✓ Keep: "It contains multiple sentences." (31 ≥ 10) ✓ Keep: "Second Page" (11 ≥ 10) ✓ Keep: "Here is more content." (21 ≥ 10) ✗ Remove: "The end." (8 < 10) ``` #### Step 7: Final Output **Result:** ```python [ "Hello World", "This is a simple example document.", "It contains multiple sentences.", "Second Page", "Here is more content." ] ``` **Statistics:** - Files processed: 1 - Pages extracted: 2 - Lines extracted: 6 - Lines kept: 5 - Lines filtered: 1 --- ## 7. Data Quality and Filtering ### 7.1 Why Filter? **Problem:** Not all text is useful for training **Examples of Low-Quality Text:** ``` ✗ "" (empty line) ✗ " " (just whitespace) ✗ "Hi" (too short, no context) ✗ "A" (single character) ✗ "..." (ellipsis, no meaning) ✗ "---" (separator line) ``` **Examples of High-Quality Text:** ``` ✓ "Machine learning is a subset of artificial intelligence." ✓ "The transformer architecture uses self-attention mechanisms." ✓ "Gradient descent optimizes neural network parameters." ``` ### 7.2 Filtering Criteria **Minimum Length Filter:** **Purpose:** Remove very short lines that don't provide context **Example:** ``` Minimum length: 10 characters Keep: ✓ "Hello world" (11 chars) ✓ "This is a test." (15 chars) Remove: ✗ "Hi" (2 chars) ✗ "Test" (4 chars) ✗ "OK" (2 chars) ``` **Why 10 characters?** - Provides enough context for meaningful learning - Filters out headers, separators, and noise - Ensures each sample has semantic value ### 7.3 Encoding Handling **Problem:** Files may have different encodings **Solution:** Try multiple encodings **Process:** ``` Try UTF-8 first: ✓ Success → Use UTF-8 ✗ Failure → Try Latin-1 ✓ Success → Use Latin-1 ✗ Failure → Log error and skip file ``` **Example:** **UTF-8 file:** ``` "Hello 世界" → Reads correctly ``` **Latin-1 file:** ``` "Hello café" → Reads correctly with Latin-1 ``` ### 7.4 Error Handling **What happens when processing fails?** **Examples:** **Corrupted PDF:** ``` File: corrupted.pdf → Try to extract text → Error: "Cannot read PDF" → Log warning: "Failed to process corrupted.pdf" → Skip file → Continue with next file ``` **Unsupported File Type:** ``` File: presentation.pptx → Extension: .pptx → Type: Not supported → Warning: "Unsupported file type: .pptx" → Skip file → Continue with next file ``` **Image OCR Failure:** ``` File: blurry_image.png → Try OCR → OCR returns empty or garbled text → Filter removes empty lines → No text extracted → File processed (no output) ``` --- ## 8. Common Questions ### Q1: Why process PDFs instead of using them directly? **Answer:** Models work with numbers (token IDs), not file formats. PDFs have: - Complex structure (fonts, layouts, metadata) - Embedded formatting - Binary data mixed with text Processing extracts just the text content, which is what the model needs. ### Q2: What if OCR doesn't work well on an image? **Answer:** - Low-quality images produce poor OCR results - The system will extract what it can - Poor OCR output is filtered out (too short or garbled) - The file is processed but may contribute little or no text **Solution:** Use high-quality images with clear text for best results. ### Q3: Why split text into lines? **Answer:** - Each line becomes a training sample - Models predict next tokens in sequences - Shorter sequences are easier to process - Allows the model to learn from diverse sentence structures ### Q4: What happens to code formatting? **Answer:** - Code is processed as text - Indentation and structure are preserved - Each line becomes a sample - The model learns code patterns and syntax **Example:** ```python def hello(): print("Hi") ``` Becomes: ``` "def hello():" " print("Hi")" ``` ### Q5: Can I process files in parallel? **Answer:** Currently, files are processed sequentially. Future improvements could include: - Parallel processing of multiple files - Multi-threaded extraction - Batch processing for efficiency ### Q6: What if a file is very large? **Answer:** - Large files are processed line by line - Memory usage stays manageable - Progress is logged every 100 files - System can handle files of any size (within memory limits) ### Q7: How is data from different file types combined? **Answer:** All extracted text is combined into a single list: ``` PDF file → 50 lines extracted Text file → 30 lines extracted Code file → 100 lines extracted Image → 5 lines extracted Combined: 185 text lines total ``` All lines are treated equally, regardless of source file type. --- ## Summary ### What is Data Processing? **Data processing** is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training. ### Key Steps 1. **Find Files**: Scan directory for all files 2. **Classify**: Determine file type (.pdf, .txt, .png, etc.) 3. **Extract**: Get text content from each file 4. **Clean**: Remove noise and standardize format 5. **Split**: Break into individual lines 6. **Filter**: Keep only quality text samples ### Result A list of text strings ready for: - Tokenization (converting to numbers) - Training (teaching the model) - Learning (model understanding patterns) ### Example Flow ``` PDF file "document.pdf" ↓ Extract text from pages ↓ Clean and split into lines ↓ Filter by length ↓ ["Sentence 1.", "Sentence 2.", "Sentence 3."] ↓ Ready for tokenization and training! ``` --- *This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.*