Initial commit: SheepOp LLM - Transformer-based language model implementation

- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
# Data Processing Explained: Step-by-Step Guide

Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text.

## Table of Contents

1. [What is Data Processing?](#1-what-is-data-processing)
2. [Why Do We Need Data Processing?](#2-why-do-we-need-data-processing)
3. [The Data Processing Pipeline](#3-the-data-processing-pipeline)
4. [Step-by-Step: How Each File Type is Processed](#4-step-by-step-how-each-file-type-is-processed)
5. [Data Transformation Stages](#5-data-transformation-stages)
6. [Complete Example: Processing "Hello World.pdf"](#6-complete-example-processing-hello-worldpdf)
7. [Data Quality and Filtering](#7-data-quality-and-filtering)
8. [Common Questions](#8-common-questions)

---
## 1. What is Data Processing?

**Data processing** is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.

### Simple Analogy

Think of data processing like preparing ingredients for cooking:

**Raw Ingredients (Your Files):**
- PDF documents
- Text files
- Images with text
- Code files

**Prepared Ingredients (Processed Data):**
- Clean text lines
- Consistent format
- Ready for training

**The Recipe (Training):**
- The model learns from the prepared ingredients

### In Our Context

**Input:** Mixed file types (PDFs, images, code, text)
**Output:** List of text strings ready for tokenization
**Purpose:** Extract meaningful text that the model can learn from

---
## 2. Why Do We Need Data Processing?

### 2.1 The Problem

Machine learning models (like our transformer) understand **numbers**, not:
- PDF files
- Images
- Raw text files
- Code files

### 2.2 The Solution

We need to:
1. **Extract** text from different file formats
2. **Clean** the text (remove noise, handle encoding)
3. **Standardize** the format (consistent structure)
4. **Prepare** for tokenization (split into manageable pieces)

### 2.3 Benefits

✅ **Unified Format**: All data becomes text lines
✅ **Easy to Process**: Simple format for tokenization
✅ **Flexible**: Works with many file types
✅ **Scalable**: Can process thousands of files automatically

---
## 3. The Data Processing Pipeline

### 3.1 High-Level Overview

```
Raw Files
    ↓
[File Type Detection]
    ↓
[Text Extraction]
    ↓
[Text Cleaning]
    ↓
[Line Splitting]
    ↓
[Filtering]
    ↓
Clean Text Lines
    ↓
[Tokenization] ← Not part of data processing
    ↓
[Training] ← Not part of data processing
```

### 3.2 Detailed Pipeline

```
Step 1: Directory Scan
  └─→ Find all files in data/ directory
  └─→ Categorize by file type (.pdf, .txt, .png, etc.)

Step 2: File Type Detection
  └─→ Check file extension
  └─→ Route to appropriate processor

Step 3: Text Extraction
  ├─→ PDF files → PDF text extraction
  ├─→ Text files → Read as text
  ├─→ Image files → OCR (Optical Character Recognition)
  └─→ Code files → Read as text

Step 4: Text Cleaning
  └─→ Remove extra whitespace
  └─→ Handle encoding issues
  └─→ Normalize line endings

Step 5: Line Splitting
  └─→ Split text into individual lines
  └─→ Each line becomes one training sample

Step 6: Filtering
  └─→ Remove empty lines
  └─→ Filter by minimum length
  └─→ Remove lines that are too short

Step 7: Output
  └─→ List of text strings
  └─→ Ready for tokenization
```
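The seven steps above can be sketched as a single driver function. This is a minimal illustration, not the project's actual code: `process_directory`, `extract_text`, and `MIN_LINE_LENGTH` are hypothetical names, and the extractor here handles only plain-text files (real code would route PDFs and images to their own processors).

```python
from pathlib import Path

MIN_LINE_LENGTH = 10  # illustrative threshold from the filtering step

def extract_text(path: Path) -> str:
    """Illustrative extractor: handles plain-text files only."""
    return path.read_text(encoding="utf-8", errors="replace")

def process_directory(data_dir: str) -> list[str]:
    samples = []
    for path in sorted(Path(data_dir).rglob("*")):   # Step 1: directory scan
        if not path.is_file():
            continue
        text = extract_text(path)                    # Steps 2-3: detect + extract
        for line in text.splitlines():               # Step 5: line splitting
            line = line.strip()                      # Step 4: cleaning
            if len(line) >= MIN_LINE_LENGTH:         # Step 6: filtering
                samples.append(line)
    return samples                                   # Step 7: output
```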
---
## 4. Step-by-Step: How Each File Type is Processed

### 4.1 Text Files (.txt, .md, .log, etc.)

**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample

**Example:**

**Input:** `document.txt`
```
Hello world
This is a sentence.
Machine learning is fascinating.
```

**Processing:**
```
Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."
```

**Output:**
```python
[
    "Hello world",
    "This is a sentence.",
    "Machine learning is fascinating."
]
```

**Why this works:** Text files are already in plain text format, so extraction is straightforward.
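A minimal sketch of this step (`read_text_lines` is an illustrative name, not the project's actual API):

```python
def read_text_lines(path: str) -> list[str]:
    """Read a plain-text file; each non-empty line becomes one sample."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]
```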
---
### 4.2 Code Files (.py, .js, .java, etc.)

**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample

**Example:**

**Input:** `example.py`
```python
def hello():
    print("Hello")
    return True
```

**Processing:**
```
Line 1: "def hello():"
Line 2: "    print(\"Hello\")"
Line 3: "    return True"
```

**Output:**
```python
[
    "def hello():",
    "    print(\"Hello\")",
    "    return True"
]
```

**Why this works:** Code files are text files, so they're processed the same way. The model learns code patterns and syntax.

---
### 4.3 PDF Files (.pdf)

**What happens:**
1. PDF file is opened
2. Text is extracted from each page
3. Text is split into lines
4. Lines are filtered for quality

**Example:**

**Input:** `document.pdf` (3 pages)

**Page 1:**
```
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.
```

**Page 2:**
```
Neural Networks
Neural networks are computing systems inspired by biological neural networks.
```

**Page 3:**
```
Conclusion
In conclusion, machine learning has revolutionized technology.
```

**Processing:**

**Step 1: Extract text from each page**
```
Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."
```

**Step 2: Split by newlines**
```
Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."
```

**Step 3: Filter short lines** (the threshold is configurable; assume a 20-character minimum for this example)
```
Keep:   "Introduction to Machine Learning" (32 ≥ 20)
Keep:   "Machine learning is a subset of artificial intelligence." (56 ≥ 20)
Remove: "Neural Networks" (15 < 20)
Keep:   "Neural networks are computing systems inspired by biological neural networks." (77 ≥ 20)
Remove: "Conclusion" (10 < 20)
Keep:   "In conclusion, machine learning has revolutionized technology." (62 ≥ 20)
```

**Output:**
```python
[
    "Introduction to Machine Learning",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "In conclusion, machine learning has revolutionized technology."
]
```

**Why this works:** PDFs contain text embedded in the file structure. Libraries like PyPDF2 or pdfplumber extract this text, preserving the content but losing formatting.
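Sketched in code, assuming PyPDF2 as the extraction library (pdfplumber exposes a similar per-page `extract_text`); `pdf_to_lines`, `split_and_filter`, and the 20-character default are illustrative names, not the project's actual API.

```python
def split_and_filter(text: str, min_len: int = 20) -> list[str]:
    # Split extracted text on newlines and drop lines below the threshold.
    return [ln.strip() for ln in text.splitlines() if len(ln.strip()) >= min_len]

def pdf_to_lines(path: str, min_len: int = 20) -> list[str]:
    # PyPDF2 is imported lazily so split_and_filter works without it installed.
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    pages = [(page.extract_text() or "") for page in reader.pages]  # None on image-only pages
    return split_and_filter("\n".join(pages), min_len)
```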
---
### 4.4 Image Files (.png, .jpg, etc.)

**What happens:**
1. Image file is opened
2. OCR (Optical Character Recognition) reads text from the image
3. Extracted text is split into lines
4. Lines are filtered for quality

**Example:**

**Input:** `screenshot.png` containing:
```
Hello World
This is text in an image.
```

**Processing:**

**Step 1: OCR Processing**
```
Image → OCR Engine → Text
"Hello World\nThis is text in an image."
```

**Step 2: Split by newlines**
```
Line 1: "Hello World"
Line 2: "This is text in an image."
```

**Step 3: Filter short lines** (again assuming a 20-character minimum)
```
Remove: "Hello World" (11 < 20)
Keep:   "This is text in an image." (25 ≥ 20)
```

**Output:**
```python
[
    "This is text in an image."
]
```

**Why this works:** OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.
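The OCR path might look like this. It assumes pytesseract with Pillow as the OCR stack (an assumption; the project may use a different engine), and the injectable `ocr` parameter lets you substitute any image-to-text function.

```python
from typing import Callable, Optional

def image_to_lines(path: str, min_len: int = 20,
                   ocr: Optional[Callable[[str], str]] = None) -> list[str]:
    if ocr is None:
        # Default backend, imported lazily: assumes pytesseract + Pillow.
        import pytesseract
        from PIL import Image
        ocr = lambda p: pytesseract.image_to_string(Image.open(p))
    text = ocr(path)
    # Same short-line filter as for other file types.
    return [ln.strip() for ln in text.splitlines() if len(ln.strip()) >= min_len]
```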
---
## 5. Data Transformation Stages

### 5.1 Stage 1: File Discovery

**Purpose:** Find all files to process

**Process:**
```
Directory: data/
├── document.pdf
├── code.py
├── screenshot.png
└── notes.txt

Scan recursively:
├── Find: document.pdf
├── Find: code.py
├── Find: screenshot.png
└── Find: notes.txt

Total: 4 files found
```

**Result:** List of file paths to process
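This stage is a recursive directory walk; a minimal sketch (`discover_files` is an illustrative name):

```python
from pathlib import Path

def discover_files(data_dir: str) -> list[Path]:
    # Recursively collect every regular file under data_dir, in sorted order.
    return sorted(p for p in Path(data_dir).rglob("*") if p.is_file())
```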
---
### 5.2 Stage 2: File Type Classification

**Purpose:** Determine how to process each file

**Process:**
```
File: document.pdf
├── Extension: .pdf
├── Type: PDF
└── Processor: PDF Extractor

File: code.py
├── Extension: .py
├── Type: Code
└── Processor: Text Reader

File: screenshot.png
├── Extension: .png
├── Type: Image
└── Processor: OCR

File: notes.txt
├── Extension: .txt
├── Type: Text
└── Processor: Text Reader
```

**Result:** Each file assigned to appropriate processor
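Classification can be a simple extension-to-processor table. The map below is illustrative only; the real project may support more extensions or detect types differently.

```python
from pathlib import Path

# Hypothetical extension → processor map for illustration.
PROCESSORS = {
    ".pdf": "PDF Extractor",
    ".txt": "Text Reader", ".md": "Text Reader", ".log": "Text Reader",
    ".py": "Text Reader", ".js": "Text Reader", ".java": "Text Reader",
    ".png": "OCR", ".jpg": "OCR", ".jpeg": "OCR",
}

def classify(path: str) -> str:
    # Lower-case the suffix so ".PNG" and ".png" route identically.
    return PROCESSORS.get(Path(path).suffix.lower(), "Unsupported")
```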
---
### 5.3 Stage 3: Text Extraction

**Purpose:** Get raw text from each file

**Process:**

**PDF File:**
```
document.pdf
→ Open PDF
→ Extract Page 1: "Introduction..."
→ Extract Page 2: "Chapter 1..."
→ Extract Page 3: "Conclusion..."
→ Combine: "Introduction...\nChapter 1...\nConclusion..."
```

**Text File:**
```
notes.txt
→ Open file
→ Read content: "Hello\nWorld\nTest"
```

**Image File:**
```
screenshot.png
→ Open image
→ Run OCR
→ Extract: "Hello World\nThis is text"
```

**Code File:**
```
code.py
→ Open file
→ Read content: "def hello():\n    print('Hi')"
```

**Result:** Raw text strings from each file

---
### 5.4 Stage 4: Text Cleaning

**Purpose:** Standardize and clean the extracted text

**Process:**

**Input:**
```
"Hello World\n\n\nThis is a test. "
```

**Step 1: Remove Extra Whitespace** (the trailing space is stripped)
```
"Hello World\n\n\nThis is a test. "
    ↓
"Hello World\n\n\nThis is a test."
```

**Step 2: Normalize Line Endings** (already `\n`, so nothing changes here; `\r\n` would be rewritten)
```
"Hello World\n\n\nThis is a test."
    ↓
"Hello World\n\n\nThis is a test."
```

**Step 3: Handle Encoding**
```
"Hello World" (UTF-8)
    ↓
"Hello World" (checked and valid)
```

**Result:** Cleaned text strings
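The whitespace and line-ending steps, sketched as one function (illustrative, not the project's exact implementation; encoding handling happens at file-read time, see Section 7.3):

```python
import re

def clean_text(text: str) -> str:
    # Step 2: normalize Windows (\r\n) and old-Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: collapse runs of spaces/tabs and strip leading/trailing blanks
    # on each line; blank lines themselves are preserved.
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.split("\n")]
    return "\n".join(lines)
```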
---
### 5.5 Stage 5: Line Splitting

**Purpose:** Break text into individual training samples

**Process:**

**Input:**
```
"Hello World\nThis is a test.\nMachine learning is cool."
```

**Split by newlines:**
```
Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."
```

**Result:** List of individual text lines

---
### 5.6 Stage 6: Filtering

**Purpose:** Keep only useful text samples

**Process:**

**Input:**
```python
[
    "Hello World",          # Length: 11
    "Hi",                   # Length: 2 (too short)
    "This is a sentence.",  # Length: 19
    "",                     # Empty (remove)
    "A"                     # Length: 1 (too short)
]
```

**Filter criteria:**
- Minimum length: 10 characters
- Non-empty strings

**Filtering:**
```
Keep:   "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep:   "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)
```

**Output:**
```python
[
    "Hello World",
    "This is a sentence."
]
```

**Result:** Filtered list of quality text samples
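The filter itself is a one-liner (`filter_lines` is an illustrative name):

```python
def filter_lines(lines: list[str], min_length: int = 10) -> list[str]:
    # Keep non-empty lines of at least min_length characters.
    return [ln for ln in lines if ln and len(ln) >= min_length]
```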
---
## 6. Complete Example: Processing "Hello World.pdf"

Let's trace through processing a complete PDF file step-by-step.

### Input
**File:** `Hello World.pdf`
**Location:** `data/documents/Hello World.pdf`
**Content:** 2 pages with text

### Step-by-Step Processing

#### Step 1: File Discovery

```
Scanning: data/
├── documents/
│   └── Hello World.pdf ← Found
├── images/
└── code/

File found: data/documents/Hello World.pdf
```

#### Step 2: File Type Detection

```
File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor
```

#### Step 3: PDF Text Extraction

**Page 1 Content:**
```
Hello World
This is a simple example document.
It contains multiple sentences.
```

**Page 2 Content:**
```
Second Page
Here is more content.
The end.
```

**Extraction Process:**
```
Open PDF file
    ↓
Extract Page 1:
  Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
    ↓
Extract Page 2:
  Text: "Second Page\nHere is more content.\nThe end."
    ↓
Combine pages:
  "Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

#### Step 4: Text Cleaning

**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

**Process:**
```
Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)
```

**Output:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

#### Step 5: Line Splitting

**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

**Split by newline character (`\n`):**
```
Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."
```

#### Step 6: Filtering

**Input:**
```python
[
    "Hello World",                        # Length: 11
    "This is a simple example document.", # Length: 34
    "It contains multiple sentences.",    # Length: 31
    "Second Page",                        # Length: 11
    "Here is more content.",              # Length: 21
    "The end."                            # Length: 8 (too short!)
]
```

**Filter: Minimum length = 10**
```
✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (34 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)
```

#### Step 7: Final Output

**Result:**
```python
[
    "Hello World",
    "This is a simple example document.",
    "It contains multiple sentences.",
    "Second Page",
    "Here is more content."
]
```

**Statistics:**
- Files processed: 1
- Pages extracted: 2
- Lines extracted: 6
- Lines kept: 5
- Lines filtered: 1

---
## 7. Data Quality and Filtering

### 7.1 Why Filter?

**Problem:** Not all text is useful for training

**Examples of Low-Quality Text:**
```
✗ "" (empty line)
✗ "   " (just whitespace)
✗ "Hi" (too short, no context)
✗ "A" (single character)
✗ "..." (ellipsis, no meaning)
✗ "---" (separator line)
```

**Examples of High-Quality Text:**
```
✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."
```

### 7.2 Filtering Criteria

**Minimum Length Filter:**

**Purpose:** Remove very short lines that don't provide context

**Example:**
```
Minimum length: 10 characters

Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)

Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)
```

**Why 10 characters?**
- Provides enough context for meaningful learning
- Filters out headers, separators, and noise
- Ensures each sample has semantic value
### 7.3 Encoding Handling

**Problem:** Files may have different encodings

**Solution:** Try multiple encodings

**Process:**
```
Try UTF-8 first:
✓ Success → Use UTF-8
✗ Failure → Try Latin-1
  ✓ Success → Use Latin-1
  ✗ Failure → Log error and skip file
```

**Example:**

**UTF-8 file:**
```
"Hello 世界" → Reads correctly
```

**Latin-1 file:**
```
"Hello café" → Reads correctly with Latin-1
```
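The fallback chain above, sketched in code (`read_with_fallback` is an illustrative name):

```python
def read_with_fallback(path: str) -> str:
    # Try UTF-8 first; fall back to Latin-1, which maps every byte to a
    # character and therefore never raises UnicodeDecodeError.
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")  # unreachable in practice
```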
### 7.4 Error Handling

**What happens when processing fails?**

**Examples:**

**Corrupted PDF:**
```
File: corrupted.pdf
→ Try to extract text
→ Error: "Cannot read PDF"
→ Log warning: "Failed to process corrupted.pdf"
→ Skip file
→ Continue with next file
```

**Unsupported File Type:**
```
File: presentation.pptx
→ Extension: .pptx
→ Type: Not supported
→ Warning: "Unsupported file type: .pptx"
→ Skip file
→ Continue with next file
```

**Image OCR Failure:**
```
File: blurry_image.png
→ Try OCR
→ OCR returns empty or garbled text
→ Filter removes empty lines
→ No text extracted
→ File processed (no output)
```

---
## 8. Common Questions

### Q1: Why process PDFs instead of using them directly?

**Answer:**
Models work with numbers (token IDs), not file formats. PDFs have:
- Complex structure (fonts, layouts, metadata)
- Embedded formatting
- Binary data mixed with text

Processing extracts just the text content, which is what the model needs.

### Q2: What if OCR doesn't work well on an image?

**Answer:**
- Low-quality images produce poor OCR results
- The system will extract what it can
- Poor OCR output is filtered out (too short or garbled)
- The file is processed but may contribute little or no text

**Solution:** Use high-quality images with clear text for best results.

### Q3: Why split text into lines?

**Answer:**
- Each line becomes a training sample
- Models predict next tokens in sequences
- Shorter sequences are easier to process
- Allows the model to learn from diverse sentence structures

### Q4: What happens to code formatting?

**Answer:**
- Code is processed as text
- Indentation and structure are preserved
- Each line becomes a sample
- The model learns code patterns and syntax

**Example:**
```python
def hello():
    print("Hi")
```

Becomes:
```
"def hello():"
"    print(\"Hi\")"
```

### Q5: Can I process files in parallel?

**Answer:**
Currently, files are processed sequentially. Future improvements could include:
- Parallel processing of multiple files
- Multi-threaded extraction
- Batch processing for efficiency
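One possible shape for that improvement, using only the standard library. This is a sketch of a future direction, not existing project code; it is thread-based, which helps most for I/O-bound reading, and `process_one` is a stand-in for the real per-file processor.

```python
from concurrent.futures import ThreadPoolExecutor

def process_one(path: str) -> list[str]:
    # Stand-in per-file processor; the real one would route by file type.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return [ln.strip() for ln in f if len(ln.strip()) >= 10]

def process_parallel(paths: list[str], workers: int = 4) -> list[str]:
    samples: list[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the output is deterministic.
        for lines in pool.map(process_one, paths):
            samples.extend(lines)
    return samples
```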
### Q6: What if a file is very large?

**Answer:**
- Large files are processed line by line
- Memory usage stays manageable
- Progress is logged every 100 files
- System can handle files of any size (within memory limits)

### Q7: How is data from different file types combined?

**Answer:**
All extracted text is combined into a single list:

```
PDF file   → 50 lines extracted
Text file  → 30 lines extracted
Code file  → 100 lines extracted
Image      → 5 lines extracted

Combined: 185 text lines total
```

All lines are treated equally, regardless of source file type.
---
## Summary

### What is Data Processing?

**Data processing** is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.

### Key Steps

1. **Find Files**: Scan directory for all files
2. **Classify**: Determine file type (.pdf, .txt, .png, etc.)
3. **Extract**: Get text content from each file
4. **Clean**: Remove noise and standardize format
5. **Split**: Break into individual lines
6. **Filter**: Keep only quality text samples

### Result

A list of text strings ready for:
- Tokenization (converting to numbers)
- Training (teaching the model)
- Learning (model understanding patterns)

### Example Flow

```
PDF file "document.pdf"
    ↓
Extract text from pages
    ↓
Clean and split into lines
    ↓
Filter by length
    ↓
["Sentence 1.", "Sentence 2.", "Sentence 3."]
    ↓
Ready for tokenization and training!
```

---

*This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.*