# Data Processing Explained: Step-by-Step Guide

Complete guide to understanding data processing in the SheepOp LLM project, explaining what happens to your data from raw files to training-ready text.

## Table of Contents

1. [What is Data Processing?](#1-what-is-data-processing)
2. [Why Do We Need Data Processing?](#2-why-do-we-need-data-processing)
3. [The Data Processing Pipeline](#3-the-data-processing-pipeline)
4. [Step-by-Step: How Each File Type is Processed](#4-step-by-step-how-each-file-type-is-processed)
5. [Data Transformation Stages](#5-data-transformation-stages)
6. [Complete Example: Processing "Hello World.pdf"](#6-complete-example-processing-hello-worldpdf)
7. [Data Quality and Filtering](#7-data-quality-and-filtering)
8. [Common Questions](#8-common-questions)

---
## 1. What is Data Processing?

**Data processing** is the transformation of raw, unstructured data into a format that machine learning models can understand and learn from.

### Simple Analogy

Think of data processing like preparing ingredients for cooking:

**Raw Ingredients (Your Files):**
- PDF documents
- Text files
- Images with text
- Code files

**Prepared Ingredients (Processed Data):**
- Clean text lines
- Consistent format
- Ready for training

**The Recipe (Training):**
- The model learns from the prepared ingredients

### In Our Context

**Input:** Mixed file types (PDFs, images, code, text)

**Output:** List of text strings ready for tokenization

**Purpose:** Extract meaningful text that the model can learn from

---
## 2. Why Do We Need Data Processing?

### 2.1 The Problem

Machine learning models (like our transformer) understand **numbers**, not:
- PDF files
- Images
- Raw text files
- Code files

### 2.2 The Solution

We need to:
1. **Extract** text from different file formats
2. **Clean** the text (remove noise, handle encoding)
3. **Standardize** the format (consistent structure)
4. **Prepare** for tokenization (split into manageable pieces)

### 2.3 Benefits

✅ **Unified Format**: All data becomes text lines

✅ **Easy to Process**: Simple format for tokenization

✅ **Flexible**: Works with many file types

✅ **Scalable**: Can process thousands of files automatically

---
## 3. The Data Processing Pipeline

### 3.1 High-Level Overview

```
Raw Files
    ↓
[File Type Detection]
    ↓
[Text Extraction]
    ↓
[Text Cleaning]
    ↓
[Line Splitting]
    ↓
[Filtering]
    ↓
Clean Text Lines
    ↓
[Tokenization] ← Not part of data processing
    ↓
[Training] ← Not part of data processing
```

### 3.2 Detailed Pipeline

```
Step 1: Directory Scan
└─→ Find all files in data/ directory
└─→ Categorize by file type (.pdf, .txt, .png, etc.)

Step 2: File Type Detection
└─→ Check file extension
└─→ Route to appropriate processor

Step 3: Text Extraction
├─→ PDF files → PDF text extraction
├─→ Text files → Read as text
├─→ Image files → OCR (Optical Character Recognition)
└─→ Code files → Read as text

Step 4: Text Cleaning
└─→ Remove extra whitespace
└─→ Handle encoding issues
└─→ Normalize line endings

Step 5: Line Splitting
└─→ Split text into individual lines
└─→ Each line becomes one training sample

Step 6: Filtering
└─→ Remove empty lines
└─→ Filter by minimum length
└─→ Remove lines that are too short

Step 7: Output
└─→ List of text strings
└─→ Ready for tokenization
```
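
The pipeline above can be sketched as a small driver function. This is a minimal illustration, not the project's actual code: the extension set and helper logic are simplified, and the PDF/OCR branches are stubbed out since they depend on external libraries.

```python
from pathlib import Path

def extract_plain_text(path: Path) -> str:
    # Text and code files are read directly as text.
    return path.read_text(encoding="utf-8", errors="replace")

def process_file(path: Path, min_len: int = 10) -> list[str]:
    """Run one file through detection, extraction, splitting, and filtering."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md", ".log", ".py", ".js", ".java"}:
        raw = extract_plain_text(path)
    else:
        # PDFs and images would be routed to a PDF extractor or OCR here.
        return []
    # Split into lines, strip surrounding whitespace, drop short lines.
    lines = [line.strip() for line in raw.splitlines()]
    return [line for line in lines if len(line) >= min_len]
```

Each call returns the clean text lines for one file; a caller would concatenate the lists from all discovered files.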

---

## 4. Step-by-Step: How Each File Type is Processed

### 4.1 Text Files (.txt, .md, .log, etc.)

**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample

**Example:**

**Input:** `document.txt`
```
Hello world
This is a sentence.
Machine learning is fascinating.
```

**Processing:**
```
Line 1: "Hello world"
Line 2: "This is a sentence."
Line 3: "Machine learning is fascinating."
```

**Output:**
```python
[
    "Hello world",
    "This is a sentence.",
    "Machine learning is fascinating."
]
```

**Why this works:** Text files are already in plain text format, so extraction is straightforward.
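
The three steps can be written as one small function. A minimal sketch (the function name is illustrative, not from the project's source):

```python
def read_text_lines(path: str) -> list[str]:
    """Read a plain-text file and return one sample per non-empty line."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # Iterating over the file object reads it line by line.
        return [line.rstrip("\n") for line in f if line.strip()]
```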

---

### 4.2 Code Files (.py, .js, .java, etc.)

**What happens:**
1. File is opened
2. Content is read line by line
3. Each line becomes a separate text sample

**Example:**

**Input:** `example.py`
```python
def hello():
    print("Hello")
    return True
```

**Processing:**
```
Line 1: "def hello():"
Line 2: '    print("Hello")'
Line 3: "    return True"
```

**Output:**
```python
[
    "def hello():",
    '    print("Hello")',
    "    return True"
]
```

**Why this works:** Code files are text files, so they're processed the same way. The model learns code patterns and syntax.

---

### 4.3 PDF Files (.pdf)

**What happens:**
1. PDF file is opened
2. Text is extracted from each page
3. Text is split into lines
4. Lines are filtered for quality

**Example:**

**Input:** `document.pdf` (3 pages)

**Page 1:**
```
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence.
```

**Page 2:**
```
Neural Networks
Neural networks are computing systems inspired by biological neural networks.
```

**Page 3:**
```
Conclusion
In conclusion, machine learning has revolutionized technology.
```

**Processing:**

**Step 1: Extract text from each page**
```
Page 1 text: "Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence."
Page 2 text: "Neural Networks\nNeural networks are computing systems inspired by biological neural networks."
Page 3 text: "Conclusion\nIn conclusion, machine learning has revolutionized technology."
```

**Step 2: Split by newlines**
```
Line 1: "Introduction to Machine Learning"
Line 2: "Machine learning is a subset of artificial intelligence."
Line 3: "Neural Networks"
Line 4: "Neural networks are computing systems inspired by biological neural networks."
Line 5: "Conclusion"
Line 6: "In conclusion, machine learning has revolutionized technology."
```

**Step 3: Filter low-context lines**
```
Remove: "Introduction to Machine Learning" (standalone heading, little sentence context)
Keep:   "Machine learning is a subset of artificial intelligence."
Remove: "Neural Networks" (standalone heading)
Keep:   "Neural networks are computing systems inspired by biological neural networks."
Remove: "Conclusion" (standalone heading)
Keep:   "In conclusion, machine learning has revolutionized technology."
```

Note that a plain minimum-length filter would keep some of these headings; dropping them takes an extra heuristic, such as removing title-case lines with no ending punctuation.

**Output:**
```python
[
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "In conclusion, machine learning has revolutionized technology."
]
```

**Why this works:** PDFs contain text embedded in the file structure. Libraries like PyPDF2 or pdfplumber extract this text, preserving the content but losing formatting.
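
Once a library such as PyPDF2 or pdfplumber has produced one text string per page, the combine, split, and filter steps above are plain string handling. A sketch (the page-extraction call itself is omitted, since it depends on the library you choose, and the function name is illustrative):

```python
def pages_to_samples(page_texts: list[str], min_len: int = 10) -> list[str]:
    """Combine per-page text, split into lines, and drop short lines."""
    combined = "\n".join(page_texts)                    # Step 1: combine pages
    lines = [ln.strip() for ln in combined.split("\n")] # Step 2: split by newlines
    return [ln for ln in lines if len(ln) >= min_len]   # Step 3: filter
```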

---

### 4.4 Image Files (.png, .jpg, etc.)

**What happens:**
1. Image file is opened
2. OCR (Optical Character Recognition) reads text from the image
3. Extracted text is split into lines
4. Lines are filtered for quality

**Example:**

**Input:** `screenshot.png` containing:
```
Hello World
This is text in an image.
```

**Processing:**

**Step 1: OCR Processing**
```
Image → OCR Engine → Text
"Hello World\nThis is text in an image."
```

**Step 2: Split by newlines**
```
Line 1: "Hello World"
Line 2: "This is text in an image."
```

**Step 3: Filter short lines**
```
Remove: "Hello World" (borderline at 11 characters; whether it survives depends on the configured minimum length)
Keep:   "This is text in an image."
```

**Output:**
```python
[
    "This is text in an image."
]
```

**Why this works:** OCR software analyzes the image pixel by pixel, identifies characters, and converts them to text. Accuracy depends on image quality.

---

## 5. Data Transformation Stages

### 5.1 Stage 1: File Discovery

**Purpose:** Find all files to process

**Process:**
```
Directory: data/
├── document.pdf
├── code.py
├── screenshot.png
└── notes.txt

Scan recursively:
├── Find: document.pdf
├── Find: code.py
├── Find: screenshot.png
└── Find: notes.txt

Total: 4 files found
```

**Result:** List of file paths to process
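
A recursive scan like this is a one-liner with `pathlib`. A minimal sketch (the function name is illustrative):

```python
from pathlib import Path

def discover_files(root: str) -> list[Path]:
    """Recursively collect every regular file under `root`."""
    # rglob("*") walks all subdirectories; is_file() skips the directories.
    return sorted(p for p in Path(root).rglob("*") if p.is_file())
```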

---

### 5.2 Stage 2: File Type Classification

**Purpose:** Determine how to process each file

**Process:**
```
File: document.pdf
├── Extension: .pdf
├── Type: PDF
└── Processor: PDF Extractor

File: code.py
├── Extension: .py
├── Type: Code
└── Processor: Text Reader

File: screenshot.png
├── Extension: .png
├── Type: Image
└── Processor: OCR

File: notes.txt
├── Extension: .txt
├── Type: Text
└── Processor: Text Reader
```

**Result:** Each file assigned to appropriate processor
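
Extension-based routing amounts to a lookup table. A sketch, assuming the processor names above (the table contents are illustrative, not exhaustive):

```python
from pathlib import Path

# Map file extensions to the processor that handles them.
PROCESSORS = {
    ".pdf": "PDF Extractor",
    ".txt": "Text Reader",
    ".md":  "Text Reader",
    ".py":  "Text Reader",
    ".png": "OCR",
    ".jpg": "OCR",
}

def classify(filename: str) -> str:
    """Return the processor name for a file, based on its extension."""
    ext = Path(filename).suffix.lower()  # lower() handles .PNG, .Pdf, etc.
    return PROCESSORS.get(ext, "Unsupported")
```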

---

### 5.3 Stage 3: Text Extraction

**Purpose:** Get raw text from each file

**Process:**

**PDF File:**
```
document.pdf
→ Open PDF
→ Extract Page 1: "Introduction..."
→ Extract Page 2: "Chapter 1..."
→ Extract Page 3: "Conclusion..."
→ Combine: "Introduction...\nChapter 1...\nConclusion..."
```

**Text File:**
```
notes.txt
→ Open file
→ Read content: "Hello\nWorld\nTest"
```

**Image File:**
```
screenshot.png
→ Open image
→ Run OCR
→ Extract: "Hello World\nThis is text"
```

**Code File:**
```
code.py
→ Open file
→ Read content: "def hello():\n    print('Hi')"
```

**Result:** Raw text strings from each file

---

### 5.4 Stage 4: Text Cleaning

**Purpose:** Standardize and clean the extracted text

**Process:**

**Input:**
```
"Hello World\r\n\r\n\r\nThis is a test. "
```

**Step 1: Remove Extra Whitespace**
```
"Hello World\r\n\r\n\r\nThis is a test. "
↓
"Hello World\r\n\r\n\r\nThis is a test."
```

**Step 2: Normalize Line Endings**
```
"Hello World\r\n\r\n\r\nThis is a test."
↓
"Hello World\n\n\nThis is a test."
```

**Step 3: Handle Encoding**
```
"Hello World" (UTF-8)
↓
"Hello World" (checked and valid)
```

**Result:** Cleaned text strings
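
Steps 1 and 2 can be expressed in a few lines of string handling. A minimal sketch (the function name is illustrative):

```python
def clean_text(text: str) -> str:
    """Trim trailing whitespace and normalize line endings to '\\n'."""
    # Normalize Windows (\r\n) and old-Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing spaces on each line; leading indentation is preserved
    # so that code samples keep their structure.
    return "\n".join(line.rstrip() for line in text.split("\n"))
```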

---

### 5.5 Stage 5: Line Splitting

**Purpose:** Break text into individual training samples

**Process:**

**Input:**
```
"Hello World\nThis is a test.\nMachine learning is cool."
```

**Split by newlines:**
```
Line 1: "Hello World"
Line 2: "This is a test."
Line 3: "Machine learning is cool."
```

**Result:** List of individual text lines

---

### 5.6 Stage 6: Filtering

**Purpose:** Keep only useful text samples

**Process:**

**Input:**
```python
[
    "Hello World",            # Length: 11
    "Hi",                     # Length: 2 (too short)
    "This is a sentence.",    # Length: 19
    "",                       # Empty (remove)
    "A"                       # Length: 1 (too short)
]
```

**Filter criteria:**
- Minimum length: 10 characters
- Non-empty strings

**Filtering:**
```
Keep:   "Hello World" (length 11 ≥ 10)
Remove: "Hi" (length 2 < 10)
Keep:   "This is a sentence." (length 19 ≥ 10)
Remove: "" (empty)
Remove: "A" (length 1 < 10)
```

**Output:**
```python
[
    "Hello World",
    "This is a sentence."
]
```

**Result:** Filtered list of quality text samples
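
Stages 5 and 6 together reduce to a split followed by a length check. A minimal sketch (the function name is illustrative):

```python
def split_and_filter(text: str, min_len: int = 10) -> list[str]:
    """Split cleaned text into lines and keep only non-trivial samples."""
    # The length check also discards empty strings, since len("") < min_len.
    return [line for line in text.split("\n") if len(line) >= min_len]
```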

---

## 6. Complete Example: Processing "Hello World.pdf"

Let's trace through processing a complete PDF file step-by-step.

### Input

**File:** `Hello World.pdf`

**Location:** `data/documents/Hello World.pdf`

**Content:** 2 pages with text

### Step-by-Step Processing

#### Step 1: File Discovery

```
Scanning: data/
├── documents/
│   └── Hello World.pdf ← Found
├── images/
└── code/

File found: data/documents/Hello World.pdf
```

#### Step 2: File Type Detection

```
File: Hello World.pdf
Extension: .pdf
Type: PDF
Processor: PDF Extractor
```

#### Step 3: PDF Text Extraction

**Page 1 Content:**
```
Hello World
This is a simple example document.
It contains multiple sentences.
```

**Page 2 Content:**
```
Second Page
Here is more content.
The end.
```

**Extraction Process:**
```
Open PDF file
↓
Extract Page 1:
Text: "Hello World\nThis is a simple example document.\nIt contains multiple sentences."
↓
Extract Page 2:
Text: "Second Page\nHere is more content.\nThe end."
↓
Combine pages:
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

#### Step 4: Text Cleaning

**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

**Process:**
```
Remove extra whitespace: ✓ (already clean)
Normalize encoding: ✓ (UTF-8)
Handle special characters: ✓ (none found)
```

**Output:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

#### Step 5: Line Splitting

**Input:**
```
"Hello World\nThis is a simple example document.\nIt contains multiple sentences.\nSecond Page\nHere is more content.\nThe end."
```

**Split by newline character (`\n`):**
```
Line 1: "Hello World"
Line 2: "This is a simple example document."
Line 3: "It contains multiple sentences."
Line 4: "Second Page"
Line 5: "Here is more content."
Line 6: "The end."
```

#### Step 6: Filtering

**Input:**
```python
[
    "Hello World",                          # Length: 11
    "This is a simple example document.",   # Length: 34
    "It contains multiple sentences.",      # Length: 31
    "Second Page",                          # Length: 11
    "Here is more content.",                # Length: 21
    "The end."                              # Length: 8 (too short!)
]
```

**Filter: Minimum length = 10**
```
✓ Keep: "Hello World" (11 ≥ 10)
✓ Keep: "This is a simple example document." (34 ≥ 10)
✓ Keep: "It contains multiple sentences." (31 ≥ 10)
✓ Keep: "Second Page" (11 ≥ 10)
✓ Keep: "Here is more content." (21 ≥ 10)
✗ Remove: "The end." (8 < 10)
```

#### Step 7: Final Output

**Result:**
```python
[
    "Hello World",
    "This is a simple example document.",
    "It contains multiple sentences.",
    "Second Page",
    "Here is more content."
]
```

**Statistics:**
- Files processed: 1
- Pages extracted: 2
- Lines extracted: 6
- Lines kept: 5
- Lines filtered: 1

---

## 7. Data Quality and Filtering

### 7.1 Why Filter?

**Problem:** Not all text is useful for training

**Examples of Low-Quality Text:**
```
✗ "" (empty line)
✗ "   " (just whitespace)
✗ "Hi" (too short, no context)
✗ "A" (single character)
✗ "..." (ellipsis, no meaning)
✗ "---" (separator line)
```

**Examples of High-Quality Text:**
```
✓ "Machine learning is a subset of artificial intelligence."
✓ "The transformer architecture uses self-attention mechanisms."
✓ "Gradient descent optimizes neural network parameters."
```

### 7.2 Filtering Criteria

**Minimum Length Filter:**

**Purpose:** Remove very short lines that don't provide context

**Example:**
```
Minimum length: 10 characters

Keep:
✓ "Hello world" (11 chars)
✓ "This is a test." (15 chars)

Remove:
✗ "Hi" (2 chars)
✗ "Test" (4 chars)
✗ "OK" (2 chars)
```

**Why 10 characters?**
- Provides enough context for meaningful learning
- Filters out headers, separators, and noise
- Ensures each sample has semantic value

### 7.3 Encoding Handling

**Problem:** Files may have different encodings

**Solution:** Try multiple encodings

**Process:**
```
Try UTF-8 first:
✓ Success → Use UTF-8
✗ Failure → Try Latin-1
    ✓ Success → Use Latin-1
    ✗ Failure → Log error and skip file
```

**Example:**

**UTF-8 file:**
```
"Hello 世界" → Reads correctly
```

**Latin-1 file:**
```
"Hello café" → Reads correctly with Latin-1
```
### 7.4 Error Handling

**What happens when processing fails?**

**Examples:**

**Corrupted PDF:**
```
File: corrupted.pdf
→ Try to extract text
→ Error: "Cannot read PDF"
→ Log warning: "Failed to process corrupted.pdf"
→ Skip file
→ Continue with next file
```

**Unsupported File Type:**
```
File: presentation.pptx
→ Extension: .pptx
→ Type: Not supported
→ Warning: "Unsupported file type: .pptx"
→ Skip file
→ Continue with next file
```

**Image OCR Failure:**
```
File: blurry_image.png
→ Try OCR
→ OCR returns empty or garbled text
→ Filter removes empty lines
→ No text extracted
→ File processed (no output)
```
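
The "skip and continue" behavior boils down to one wrapper around whatever extractor runs for a file. A sketch under assumptions (the wrapper name and logger name are illustrative):

```python
import logging

logger = logging.getLogger("data_processing")

def safe_process(path: str, extractor) -> list[str]:
    """Run an extractor, but never let one bad file stop the pipeline."""
    try:
        return extractor(path)
    except Exception as exc:
        # Log the failure and return no samples; the caller moves on.
        logger.warning("Failed to process %s: %s", path, exc)
        return []
```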

---

## 8. Common Questions

### Q1: Why process PDFs instead of using them directly?

**Answer:**
Models work with numbers (token IDs), not file formats. PDFs have:
- Complex structure (fonts, layouts, metadata)
- Embedded formatting
- Binary data mixed with text

Processing extracts just the text content, which is what the model needs.

### Q2: What if OCR doesn't work well on an image?

**Answer:**
- Low-quality images produce poor OCR results
- The system will extract what it can
- Poor OCR output is filtered out (too short or garbled)
- The file is processed but may contribute little or no text

**Solution:** Use high-quality images with clear text for best results.

### Q3: Why split text into lines?

**Answer:**
- Each line becomes a training sample
- Models predict next tokens in sequences
- Shorter sequences are easier to process
- Allows the model to learn from diverse sentence structures

### Q4: What happens to code formatting?

**Answer:**
- Code is processed as text
- Indentation and structure are preserved
- Each line becomes a sample
- The model learns code patterns and syntax

**Example:**
```python
def hello():
    print("Hi")
```

Becomes:
```
"def hello():"
'    print("Hi")'
```

### Q5: Can I process files in parallel?

**Answer:**
Currently, files are processed sequentially. Future improvements could include:
- Parallel processing of multiple files
- Multi-threaded extraction
- Batch processing for efficiency

### Q6: What if a file is very large?

**Answer:**
- Large files are processed line by line
- Memory usage stays manageable
- Progress is logged every 100 files
- The system can handle large files, subject to available memory

### Q7: How is data from different file types combined?

**Answer:**
All extracted text is combined into a single list:

```
PDF file   → 50 lines extracted
Text file  → 30 lines extracted
Code file  → 100 lines extracted
Image      → 5 lines extracted

Combined: 185 text lines total
```

All lines are treated equally, regardless of source file type.

---

## Summary

### What is Data Processing?

**Data processing** is the transformation of raw files (PDFs, images, code, text) into clean text lines that can be tokenized and used for training.

### Key Steps

1. **Find Files**: Scan directory for all files
2. **Classify**: Determine file type (.pdf, .txt, .png, etc.)
3. **Extract**: Get text content from each file
4. **Clean**: Remove noise and standardize format
5. **Split**: Break into individual lines
6. **Filter**: Keep only quality text samples

### Result

A list of text strings ready for:
- Tokenization (converting to numbers)
- Training (teaching the model)
- Learning (model understanding patterns)

### Example Flow

```
PDF file "document.pdf"
↓
Extract text from pages
↓
Clean and split into lines
↓
Filter by length
↓
["Sentence 1.", "Sentence 2.", "Sentence 3."]
↓
Ready for tokenization and training!
```

---

*This document explains what data processing means and how it transforms your raw files into training-ready text, step by step.*