# SheepOp LLM - Complete Architecture Documentation
Complete documentation of the SheepOp language model project: architecture, data flow, training pipeline, and inference system.
## Table of Contents
1. [System Overview](#system-overview)
2. [Data Ingestion Pipeline](#data-ingestion-pipeline)
3. [Training Pipeline](#training-pipeline)
4. [Model Architecture](#model-architecture)
5. [Inference Pipeline](#inference-pipeline)
6. [Complete Workflow](#complete-workflow)
---
## System Overview
```mermaid
graph TB
subgraph "Data Sources"
A[PDF Files] --> DataProcessor
B[Images - PNG/JPG/etc] --> DataProcessor
C[Code Files - .py/.js/etc] --> DataProcessor
D[Text Files - .txt/.md/etc] --> DataProcessor
end
DataProcessor[DataProcessor<br/>Multi-Format Extractor] --> TextList[Text Lines]
TextList --> Tokenizer[SimpleTokenizer<br/>Character-Level]
Tokenizer --> DataLoader[PyTorch DataLoader<br/>Batched Sequences]
DataLoader --> Trainer[Trainer<br/>Training Loop]
subgraph "Training Components"
Trainer --> Model[TransformerModel]
Trainer --> Optimizer[AdamW Optimizer]
Trainer --> Scheduler[CosineAnnealingLR]
Trainer --> Loss[CrossEntropyLoss]
end
Model --> Checkpoint[Model Checkpoints<br/>checkpoints/*.pt]
Checkpoint --> Inference[Inference Script]
Inference --> GeneratedText[Generated Text]
style DataProcessor fill:#e1f5ff
style Model fill:#fff4e1
style Trainer fill:#ffe1f5
style Checkpoint fill:#e1ffe1
```
---
## Data Ingestion Pipeline
### Multi-Format Data Processing Flow
```mermaid
flowchart TD
Start([Start:<br/>train.py --data path]) --> CheckPath{Path Type?}
CheckPath -->|File| SingleFile[Process Single File]
CheckPath -->|Directory| Directory[Process Directory]
SingleFile --> DataProcessor[DataProcessor.process_file]
Directory --> RecursiveScan[Recursive Directory Scan<br/>Find all files]
RecursiveScan --> FileType{File Extension?}
FileType -->|.txt/.md/.json/etc| TextExtract[Read as Text File<br/>Line by line]
FileType -->|.py/.js/.java/etc| CodeExtract[Read as Code File<br/>Line by line]
FileType -->|.pdf| PDFExtract[PDF Extraction<br/>PyPDF2/pdfplumber]
FileType -->|.png/.jpg/.tiff/etc| ImageExtract[OCR Extraction<br/>pytesseract]
FileType -->|Unknown| Fallback[Try Text Fallback]
PDFExtract --> PDFPages[Extract Each Page]
PDFPages --> PDFLines[Split into Lines]
ImageExtract --> OCR[Perform OCR<br/>pytesseract]
OCR --> OCRLines[Split OCR Text into Lines]
TextExtract --> FilterLines[Filter Lines<br/>min_length=10]
CodeExtract --> FilterLines
PDFLines --> FilterLines
OCRLines --> FilterLines
Fallback --> FilterLines
FilterLines --> Combine[List of Text Lines]
Combine --> Validate{Texts Empty?}
Validate -->|Yes| Error[Raise Error:<br/>No data extracted]
Validate -->|No| Success[✅ Success<br/>N text samples loaded]
Success --> TokenizerStep[Next: Tokenization]
Success --> TokenizerStep[Next: Tokenization]
style DataProcessor fill:#e1f5ff
style PDFExtract fill:#ffe1f5
style ImageExtract fill:#fff4e1
style Success fill:#e1ffe1
style Error fill:#ffe1e1
```
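The extension-based dispatch above can be sketched in a few lines of Python. This is an illustrative stand-in for `DataProcessor.process_file`, not the project's actual implementation: the extension sets are abbreviated, and the PDF/OCR branches assume the optional `PyPDF2`, `pytesseract`, and `Pillow` dependencies are installed.

```python
from pathlib import Path
from typing import Iterator

TEXT_EXTS = {".txt", ".md", ".json"}
CODE_EXTS = {".py", ".js", ".java"}

def process_file(path: Path, min_length: int = 10) -> Iterator[str]:
    """Dispatch on file extension and yield filtered text lines (sketch)."""
    ext = path.suffix.lower()
    if ext in TEXT_EXTS or ext in CODE_EXTS:
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    elif ext == ".pdf":
        from PyPDF2 import PdfReader  # optional dependency
        reader = PdfReader(str(path))
        lines = [ln for page in reader.pages
                 for ln in (page.extract_text() or "").splitlines()]
    elif ext in {".png", ".jpg", ".jpeg", ".tiff"}:
        import pytesseract  # optional dependency
        from PIL import Image
        lines = pytesseract.image_to_string(Image.open(path)).splitlines()
    else:  # unknown extension: try the text fallback
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    # Drop short lines, mirroring the min_length=10 filter in the diagram.
    yield from (ln for ln in lines if len(ln.strip()) >= min_length)
```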
### Data Processing Components
```mermaid
classDiagram
class DataProcessor {
+process_file(file_path) Iterator[str]
+process_directory(directory) Iterator[str]
+process_to_list(...) List[str]
-_process_text_file() Iterator[str]
-_process_code_file() Iterator[str]
-_process_pdf() Iterator[str]
-_process_image() Iterator[str]
-_check_dependencies()
}
class SimpleTokenizer {
+vocab: Dict[str, int]
+inv_vocab: Dict[int, str]
+vocab_size: int
+encode(text: str) List[int]
+decode(token_ids: List[int]) str
+save_vocab(path: str)
}
class TextDataset {
+texts: List[str]
+tokenizer: SimpleTokenizer
+max_length: int
+sequences: List[torch.Tensor]
+__getitem__(idx) Dict
+_prepare_sequences() List[Tensor]
}
class DataLoader {
+batch_size: int
+shuffle: bool
+num_workers: int
+collate_fn: Callable
}
DataProcessor --> TextDataset : extracts text
SimpleTokenizer --> TextDataset : tokenizes
TextDataset --> DataLoader : supplies sequences
```
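A character-level tokenizer of this shape is small enough to sketch in full. The following is an illustrative stand-in for `SimpleTokenizer`, assuming the vocabulary is built from the training texts themselves; characters unseen at build time are simply skipped during encoding.

```python
import json
from typing import Dict, List

class SimpleTokenizer:
    """Character-level tokenizer (illustrative sketch)."""

    def __init__(self, texts: List[str]):
        # One vocabulary entry per distinct character in the corpus.
        chars = sorted({ch for text in texts for ch in text})
        self.vocab: Dict[str, int] = {ch: i for i, ch in enumerate(chars)}
        self.inv_vocab: Dict[int, str] = {i: ch for ch, i in self.vocab.items()}
        self.vocab_size: int = len(self.vocab)

    def encode(self, text: str) -> List[int]:
        # Unknown characters are dropped in this sketch.
        return [self.vocab[ch] for ch in text if ch in self.vocab]

    def decode(self, token_ids: List[int]) -> str:
        return "".join(self.inv_vocab[i] for i in token_ids)

    def save_vocab(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vocab, f, ensure_ascii=False)

tok = SimpleTokenizer(["hello world"])
assert tok.decode(tok.encode("hello")) == "hello"
```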
---
## Training Pipeline
### Complete Training Flow
```mermaid
flowchart TD
Start([python train.py<br/>--data path]) --> Args[Parse Arguments<br/>--data, --config, --resume, --device]
Args --> ConfigLoad{Config File<br/>Provided?}
ConfigLoad -->|Yes| LoadConfig[Load config.json]
ConfigLoad -->|No| DefaultConfig[Use Default Config]
LoadConfig --> Config[Config Object<br/>ModelConfig<br/>TrainingConfig<br/>DataConfig<br/>seed=42]
DefaultConfig --> Config
Config --> SetSeed[Set Random Seed<br/>torch.manual_seed<br/>torch.cuda.manual_seed_all<br/>CUDNN deterministic]
SetSeed --> Device[Detect Device<br/>CUDA/MPS/CPU]
Device --> DataIngestion[Data Ingestion Pipeline<br/>Extract text from all files]
DataIngestion --> TextList[List of Text Lines<br/>N samples]
TextList --> CreateTokenizer[Create SimpleTokenizer<br/>Character-level vocab]
CreateTokenizer --> Tokenizer[Tokenizer Ready<br/>vocab_size calculated]
Tokenizer --> CreateDataLoader[Create DataLoader<br/>Batch size<br/>Max length<br/>Shuffle]
CreateDataLoader --> TrainLoader[PyTorch DataLoader<br/>Batched sequences]
TrainLoader --> CheckResume{Resume<br/>Checkpoint?}
CheckResume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>Model state<br/>Optimizer state<br/>Scheduler state<br/>Epoch/Step]
CheckResume -->|No| CreateModel[Create New Model<br/>TransformerModel]
LoadCheckpoint --> CreateModel
CreateModel --> Model[Model Ready<br/>N parameters]
Model --> CreateOptimizer[Create Optimizer<br/>AdamW<br/>lr, weight_decay]
CreateOptimizer --> CreateScheduler[Create Scheduler<br/>CosineAnnealingLR<br/>T_max=total_steps]
CreateScheduler --> CreateTrainer[Create Trainer<br/>Model<br/>DataLoader<br/>Optimizer<br/>Scheduler<br/>Device]
CreateTrainer --> Trainer[Trainer Ready]
Trainer --> TrainingLoop[Training Loop<br/>For each epoch]
TrainingLoop --> EpochLoop[For each batch]
EpochLoop --> Forward[Forward Pass<br/>Model prediction]
Forward --> Loss[Compute Loss<br/>CrossEntropyLoss]
Loss --> Backward[Backward Pass<br/>Compute gradients]
Backward --> GradientAccum{Gradient<br/>Accumulation?}
GradientAccum -->|Not yet| Accumulate[Accumulate gradients]
Accumulate --> EpochLoop
GradientAccum -->|Ready| ClipGrad[Gradient Clipping<br/>max_grad_norm]
ClipGrad --> Update[Update Weights<br/>Optimizer.step]
Update --> UpdateLR[Update Learning Rate<br/>Scheduler.step]
UpdateLR --> ZeroGrad[Zero Gradients]
ZeroGrad --> Log{Log Interval?}
Log -->|Yes| LogMetrics[Log Metrics<br/>Loss, LR<br/>Save to metrics.json]
Log -->|No| EvalCheck{Evaluation<br/>Interval?}
LogMetrics --> EvalCheck
EvalCheck -->|Yes| Evaluate[Evaluate on<br/>Validation Set]
EvalCheck -->|No| SaveCheck{End of<br/>Epoch?}
Evaluate --> SaveCheck
SaveCheck -->|No| EpochLoop
SaveCheck -->|Yes| SaveCheckpoint[Save Checkpoint<br/>Model state<br/>Optimizer state<br/>Scheduler state<br/>Epoch/Step]
SaveCheckpoint --> MoreEpochs{More<br/>Epochs?}
MoreEpochs -->|Yes| TrainingLoop
MoreEpochs -->|No| GeneratePlots[Generate Training Plots<br/>loss_by_epoch.png<br/>training_curve.png]
GeneratePlots --> End([Training Complete!<br/>Checkpoints saved])
style SetSeed fill:#ffe1f5
style DataIngestion fill:#e1f5ff
style Model fill:#fff4e1
style TrainingLoop fill:#ffe1f5
style End fill:#e1ffe1
```
### Seed Initialization Details
```mermaid
sequenceDiagram
participant TrainScript as train.py
participant Config as Config
participant PyTorch as PyTorch
participant CUDA as CUDA Backend
TrainScript->>Config: Load config (seed=42)
TrainScript->>PyTorch: torch.manual_seed(42)
TrainScript->>CUDA: torch.cuda.manual_seed_all(42)
TrainScript->>PyTorch: torch.backends.cudnn.deterministic = True
TrainScript->>PyTorch: torch.backends.cudnn.benchmark = False
Note over TrainScript,CUDA: Seed ensures reproducibility<br/>across runs and devices
```
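In code, this sequence reduces to a small helper. A minimal sketch, assuming the seed value comes from the loaded config:

```python
import torch

def set_seed(seed: int = 42) -> None:
    """Make runs reproducible, matching the sequence diagram above."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
```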
### Training Loop Details
```mermaid
graph LR
subgraph "Single Training Step"
A[Batch Input<br/>input_ids, labels] --> B[Forward Pass<br/>Model forward]
B --> C[Logits<br/>batch_size × seq_len × vocab_size]
C --> D[Compute Loss<br/>CrossEntropyLoss]
D --> E[Backward Pass<br/>Compute gradients]
E --> F{Gradient<br/>Accumulation<br/>Steps reached?}
F -->|No| G[Accumulate Gradients]
F -->|Yes| H[Gradient Clipping]
H --> I[Optimizer Step<br/>Update weights]
I --> J[Scheduler Step<br/>Update LR]
J --> K[Zero Gradients]
K --> L[Log Metrics]
end
G --> A
style B fill:#e1f5ff
style D fill:#ffe1f5
style I fill:#fff4e1
```
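This step diagram translates directly into a loop. The sketch below mirrors it under two assumptions: batches are dicts with `input_ids` and `labels` (as `TextDataset.__getitem__` suggests), and the model returns plain logits (the project's `forward` is typed `Tuple[Tensor, Tensor]`, so a real call would unpack it).

```python
import torch
from torch import nn

def train_one_pass(model, loader, optimizer, scheduler, device,
                   accum_steps: int = 4, max_grad_norm: float = 1.0):
    """One pass over the loader, mirroring the single-step diagram (sketch)."""
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader, start=1):
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        logits = model(input_ids)                        # (B, T, V), assumed
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        (loss / accum_steps).backward()                  # accumulate gradients
        if step % accum_steps == 0:                      # accumulation boundary
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()                             # update weights
            scheduler.step()                             # update learning rate
            optimizer.zero_grad()                        # reset for next window
```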
---
## Model Architecture
### Transformer Model Structure
```mermaid
graph TB
Input[Input Tokens<br/>Token IDs] --> Embed[Token Embedding<br/>vocab_size → d_model]
Embed --> PosEnc[Positional Encoding<br/>Sinusoidal sin/cos]
PosEnc --> Dropout1[Dropout]
Dropout1 --> Layer1[Transformer Block 1]
Layer1 --> Layer2[Transformer Block 2]
Layer2 --> Layer3[Transformer Block 3]
Layer3 --> LayerN[Transformer Block N<br/>num_layers]
LayerN --> LayerNorm[Final Layer Norm]
LayerNorm --> OutputProj[Output Projection<br/>d_model → vocab_size]
OutputProj --> Logits[Logits<br/>batch × seq_len × vocab_size]
subgraph "Transformer Block Details"
TBInput[Input x] --> Attention[Multi-Head<br/>Self-Attention]
Attention --> AddNorm1[Add & Norm<br/>Residual + LayerNorm]
AddNorm1 --> FFN[Feed-Forward<br/>Network]
FFN --> AddNorm2[Add & Norm<br/>Residual + LayerNorm]
AddNorm2 --> TBOutput[Output]
end
AddNorm2 --> TBOutput[Output]
end
style Embed fill:#e1f5ff
style Attention fill:#ffe1f5
style FFN fill:#fff4e1
style Logits fill:#e1ffe1
```
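To make the "Transformer Block Details" subgraph concrete, here is a minimal post-norm block. It substitutes PyTorch's built-in `nn.MultiheadAttention` for the project's custom `MultiHeadAttention`, so treat it as a functional sketch rather than the actual module; `mask` is a `(seq_len, seq_len)` bool tensor where `True` marks disallowed positions.

```python
import torch
from torch import nn

class TransformerBlock(nn.Module):
    """Post-norm block as drawn above: sublayer, residual add, LayerNorm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention, then Add & Norm.
        attn_out, _ = self.attention(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward network, then Add & Norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```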
### Multi-Head Attention Mechanism
```mermaid
graph LR
Input[Input<br/>batch × seq_len × d_model] --> Q[Query<br/>Linear Layer]
Input --> K[Key<br/>Linear Layer]
Input --> V[Value<br/>Linear Layer]
Q --> SplitQ[Split into<br/>num_heads heads]
K --> SplitK[Split into<br/>num_heads heads]
V --> SplitV[Split into<br/>num_heads heads]
SplitQ --> ScaledDot[Scaled Dot-Product<br/>QKᵀ / √d_k]
SplitK --> ScaledDot
ScaledDot --> Mask[Causal Mask<br/>Lower triangular]
Mask --> Softmax[Softmax]
Softmax --> WeightedSum[Weighted Sum<br/>over Values]
SplitV --> WeightedSum
WeightedSum --> AttentionOutput[Attention Output<br/>per head]
AttentionOutput --> Concat[Concat Heads]
Concat --> OutputProj[Output Projection<br/>Linear Layer]
OutputProj --> Output[Output<br/>batch × seq_len × d_model]
style ScaledDot fill:#ffe1f5
style Mask fill:#fff4e1
style Output fill:#e1ffe1
```
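The core of the diagram, scaled scores, causal masking before the softmax, then a weighted sum over the values, fits in one function. A minimal sketch assuming `(batch, num_heads, seq_len, d_k)` shapes:

```python
import math
import torch

def causal_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask.

    q, k, v: (batch, num_heads, seq_len, d_k) tensors (sketch).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (B, H, T, T)
    t = scores.size(-1)
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))  # block future positions
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ v                                   # weighted sum of values
```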
### Complete Model Component Diagram
```mermaid
classDiagram
class TransformerModel {
+vocab_size: int
+d_model: int
+num_layers: int
+num_heads: int
+token_embedding: Embedding
+pos_encoding: PositionalEncoding
+layers: ModuleList[TransformerBlock]
+final_norm: LayerNorm
+output_proj: Linear
+forward(input_ids) Tuple[Tensor, Tensor]
+generate(...) Tensor
+get_num_params() int
}
class TransformerBlock {
+attention: MultiHeadAttention
+ffn: FeedForward
+norm1: LayerNorm
+norm2: LayerNorm
+dropout: Dropout
+forward(x, mask) Tensor
}
class MultiHeadAttention {
+num_heads: int
+d_model: int
+d_k: int
+q_proj: Linear
+k_proj: Linear
+v_proj: Linear
+out_proj: Linear
+forward(q, k, v, mask) Tensor
}
class FeedForward {
+linear1: Linear
+linear2: Linear
+activation: GELU/ReLU
+dropout: Dropout
+forward(x) Tensor
}
class PositionalEncoding {
+d_model: int
+max_len: int
+pe: Tensor
+forward(x) Tensor
}
TransformerModel --> TransformerBlock : contains N layers
TransformerModel --> PositionalEncoding : adds positional info
TransformerBlock --> MultiHeadAttention : self-attention
TransformerBlock --> FeedForward : feed-forward network
```
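Of these components, `PositionalEncoding` is the most self-contained, so a sketch is worth showing. This assumes the standard fixed sinusoidal formulation (sine on even dimensions, cosine on odd) and an even `d_model`:

```python
import math
import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding (sketch of the class above)."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)           # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2)
                             * (-math.log(10000.0) / d_model))  # frequencies
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine
        self.register_buffer("pe", pe)                 # not a learned parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the matching slice of encodings.
        return x + self.pe[: x.size(1)]
```

Registering `pe` as a buffer keeps it out of the optimizer while still moving it with `.to(device)` and saving it inside `model_state_dict`.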
---
## Inference Pipeline
### Text Generation Flow
```mermaid
flowchart TD
Start([python inference.py<br/>--checkpoint path<br/>--prompt text]) --> LoadModel[Load Model from Checkpoint<br/>Load state dict<br/>Set to eval mode]
LoadModel --> CreateTokenizer[Create Tokenizer<br/>SimpleTokenizer]
CreateTokenizer --> EncodePrompt[Encode Prompt<br/>Text → Token IDs]
EncodePrompt --> CheckOptimized{Use Optimized<br/>Inference?}
CheckOptimized -->|Yes| OptimizedGen[OptimizedInference<br/>with KV Caching]
CheckOptimized -->|No| StandardGen[Standard Generation]
StandardGen --> InitGen[Initialize Generation<br/>generated = input_ids]
InitGen --> LoopStart[Generation Loop<br/>For max_length steps]
LoopStart --> Forward[Forward Pass<br/>Model prediction]
Forward --> NextToken[Get Next Token Logits<br/>Last position]
NextToken --> Temperature[Apply Temperature<br/>Scale logits]
Temperature --> TopK{Top-K<br/>Filtering?}
TopK -->|Yes| FilterK[Filter Top-K Tokens]
TopK -->|No| TopP{Top-P<br/>Nucleus Sampling?}
FilterK --> TopP
TopP -->|Yes| FilterP[Filter by Cumulative Prob]
TopP -->|No| Sample[Sample Token<br/>Multinomial]
FilterP --> Sample
Sample --> Append[Append Token<br/>to Generated]
Append --> CheckStop{Stop<br/>Condition?}
CheckStop -->|No| LoopStart
CheckStop -->|Yes| Decode[Decode Tokens<br/>Token IDs → Text]
OptimizedGen --> KVCache[Use KV Cache<br/>Cache previous KV]
KVCache --> LoopStart
Decode --> Output[Generated Text]
Output --> End([End])
style OptimizedGen fill:#e1f5ff
style Forward fill:#ffe1f5
style Sample fill:#fff4e1
style Output fill:#e1ffe1
```
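The temperature, top-k, top-p, multinomial chain in the flowchart can be written as a single helper. A sketch operating on the logits of the last position; each filter is skipped when its argument is `None`:

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Temperature scaling, top-k, then nucleus filtering, then sampling (sketch).

    logits: (vocab_size,) tensor for the last sequence position.
    """
    logits = logits / temperature                        # apply temperature
    if top_k is not None:                                # top-k filtering
        kth = torch.topk(logits, top_k).values[-1]       # k-th largest logit
        logits[logits < kth] = float("-inf")
    if top_p is not None:                                # nucleus (top-p) filtering
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        # Remove tokens whose preceding cumulative mass already exceeds p,
        # which always keeps at least the most likely token.
        remove = cumulative - probs > top_p
        logits[sorted_idx[remove]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # sample one token id
```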
### Optimized Inference with KV Caching
```mermaid
graph TB
subgraph "Standard Generation"
A1[Input Token] --> B1[Forward Pass<br/>Compute Q, K, V]
B1 --> C1[Attention over<br/>Full Sequence]
C1 --> D1[Next Token]
D1 --> E1[Append Token]
E1 --> A1
end
subgraph "Optimized Generation with KV Cache"
A2[Input Token] --> B2{First<br/>Token?}
B2 -->|Yes| C2[Forward Pass<br/>Compute Q, K, V]
B2 -->|No| C2Cache[Compute Q, K, V for<br/>New Token Only<br/>Reuse Cached K, V]
C2 --> D2[Cache K, V]
C2Cache --> D2
D2 --> E2[Attention for<br/>New Token Only]
E2 --> F2[Next Token]
F2 --> G2[Append Token]
G2 --> A2
end
style C2 fill:#e1f5ff
style C2Cache fill:#ffe1f5
style E2 fill:#e1ffe1
```
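A decoding loop built around a KV cache might look like the sketch below. The `past_key_values` interface is an assumption for illustration (the project's `OptimizedInference` may expose caching differently), and greedy `argmax` stands in for the sampling chain to keep the example short.

```python
import torch

@torch.no_grad()
def generate_with_kv_cache(model, input_ids, max_new_tokens: int):
    """Autoregressive decoding with a KV cache (sketch).

    Assumes model(ids, past_key_values=...) returns (logits, past_key_values),
    so after the prompt only the newest token is fed through the network.
    """
    past = None
    ids = input_ids                                  # (1, prompt_len) first pass
    for _ in range(max_new_tokens):
        logits, past = model(ids, past_key_values=past)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        ids = next_id                                # only the new token next time
    return input_ids
```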
---
## Complete Workflow
### End-to-End System Flow
```mermaid
flowchart TB
subgraph "Phase 1: Data Preparation"
A1[Raw Data Files<br/>PDFs, Images, Code, Text] --> A2[DataProcessor<br/>Extract Text]
A2 --> A3[Text Lines<br/>List of Strings]
A3 --> A4[SimpleTokenizer<br/>Build Vocabulary]
A4 --> A5[Tokenize & Chunk<br/>Create Sequences]
A5 --> A6[DataLoader<br/>Batched Data]
end
subgraph "Phase 2: Model Initialization"
B1[Load Config<br/>ModelConfig] --> B2[Set Random Seed<br/>seed=42]
B2 --> B3[Create Model<br/>TransformerModel]
B3 --> B4[Initialize Weights<br/>Normal Distribution]
B4 --> B5[Create Optimizer<br/>AdamW]
B5 --> B6[Create Scheduler<br/>CosineAnnealingLR]
end
subgraph "Phase 3: Training"
C1[Trainer Setup] --> C2[Training Loop<br/>Epochs]
C2 --> C3[Batch Loop]
C3 --> C4[Forward Pass]
C4 --> C5[Compute Loss]
C5 --> C6[Backward Pass]
C6 --> C7[Gradient Clipping]
C7 --> C8[Update Weights]
C8 --> C9[Save Checkpoint]
C9 --> C10{More Epochs?}
C10 -->|Yes| C2
C10 -->|No| C11[Generate Plots<br/>Training Metrics]
end
subgraph "Phase 4: Inference"
D1[Load Checkpoint] --> D2[Load Model State]
D2 --> D3[Encode Prompt]
D3 --> D4[Generate Text<br/>Autoregressive]
D4 --> D5[Decode Tokens]
D5 --> D6[Output Text]
end
A6 --> B1
B6 --> C1
C11 --> D1
style A2 fill:#e1f5ff
style B3 fill:#fff4e1
style C4 fill:#ffe1f5
style D4 fill:#e1ffe1
```
### Checkpoint Structure
```mermaid
graph TB
Checkpoint[Checkpoint File<br/>checkpoint_epoch_N.pt] --> ModelState[model_state_dict<br/>Model weights]
Checkpoint --> OptimizerState[optimizer_state_dict<br/>AdamW state]
Checkpoint --> SchedulerState[scheduler_state_dict<br/>LR scheduler state]
Checkpoint --> ModelConfig[model_config<br/>Model hyperparameters]
Checkpoint --> Epoch[epoch<br/>Current epoch number]
Checkpoint --> GlobalStep[global_step<br/>Training step count]
Checkpoint --> BestValLoss[best_val_loss<br/>Best validation loss]
ModelState --> Resume[Resume Training<br/>Restore model state]
OptimizerState --> Resume
SchedulerState --> Resume
ModelConfig --> Resume
Epoch --> Resume
GlobalStep --> Resume
style Checkpoint fill:#e1f5ff
style Resume fill:#e1ffe1
```
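Saving and restoring a checkpoint with exactly these fields is a single `torch.save`/`torch.load` pair. A minimal sketch with the field names taken from the diagram:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, model_config,
                    epoch, global_step, best_val_loss):
    """Write a checkpoint with the fields shown in the diagram (sketch)."""
    torch.save({
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "scheduler_state_dict": scheduler.state_dict(),
        "model_config": model_config,
        "epoch": epoch,
        "global_step": global_step,
        "best_val_loss": best_val_loss,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler, device="cpu"):
    """Restore training state from a checkpoint (sketch)."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    scheduler.load_state_dict(ckpt["scheduler_state_dict"])
    return ckpt["epoch"], ckpt["global_step"], ckpt["best_val_loss"]
```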
### Configuration Hierarchy
```mermaid
graph TB
Config[Config<br/>Root Configuration] --> ModelConfig[ModelConfig<br/>vocab_size<br/>d_model<br/>num_layers<br/>num_heads<br/>d_ff<br/>max_seq_len<br/>dropout<br/>activation]
Config --> TrainingConfig[TrainingConfig<br/>batch_size<br/>max_epochs<br/>learning_rate<br/>weight_decay<br/>warmup_steps<br/>max_grad_norm<br/>gradient_accumulation_steps<br/>use_amp]
Config --> DataConfig[DataConfig<br/>data_dir<br/>max_length<br/>stride<br/>num_workers]
Config --> Global[Global Settings<br/>device<br/>seed]
ModelConfig --> Model[TransformerModel<br/>Model Architecture]
TrainingConfig --> Trainer[Trainer<br/>Training Parameters]
DataConfig --> DataLoader[DataLoader<br/>Data Parameters]
style Config fill:#e1f5ff
style Model fill:#fff4e1
style Trainer fill:#ffe1f5
style DataLoader fill:#e1ffe1
```
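This hierarchy maps naturally onto dataclasses with JSON round-tripping. A sketch with placeholder defaults (the actual values live in `config.py`), showing only `ModelConfig` and the root `Config` for brevity; `TrainingConfig` and `DataConfig` follow the same pattern:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelConfig:
    vocab_size: int = 256        # placeholder defaults for illustration
    d_model: int = 512
    num_layers: int = 6
    num_heads: int = 8
    d_ff: int = 2048
    max_seq_len: int = 512
    dropout: float = 0.1
    activation: str = "gelu"

@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    device: str = "auto"
    seed: int = 42

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path: str) -> "Config":
        with open(path) as f:
            raw = json.load(f)
        return cls(model=ModelConfig(**raw["model"]),
                   device=raw.get("device", "auto"),
                   seed=raw.get("seed", 42))
```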
---
## Key Components Summary
### 1. **Data Processing**
- **DataProcessor**: Multi-format text extraction (PDFs, images, code, text)
- **SimpleTokenizer**: Character-level tokenization
- **TextDataset**: PyTorch dataset for training
- **DataLoader**: Batched data loading
### 2. **Model Architecture**
- **TransformerModel**: Complete transformer language model
- **TransformerBlock**: Multi-head attention + feed-forward
- **MultiHeadAttention**: Scaled dot-product attention with causal masking
- **FeedForward**: Position-wise feed-forward network
- **PositionalEncoding**: Sinusoidal position embeddings
### 3. **Training**
- **Trainer**: Complete training loop with:
- Gradient accumulation
- Mixed precision training (AMP; see the sketch after this list)
- Gradient clipping
- Learning rate scheduling
- Checkpointing
- Metrics tracking
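A minimal sketch of the AMP step referenced above, using `torch.autocast` and a `GradScaler`; as before, the plain-logits model output is an assumption made for the example.

```python
import torch
import torch.nn.functional as F

def amp_train_step(model, batch, optimizer, scaler, max_grad_norm=1.0):
    """One mixed-precision training step on CUDA (sketch)."""
    input_ids, labels = batch["input_ids"].cuda(), batch["labels"].cuda()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(input_ids)          # assumes the model returns plain logits
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 underflow
    scaler.unscale_(optimizer)             # unscale so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)                 # skips the update if gradients overflowed
    scaler.update()                        # adjust the loss-scale factor
    optimizer.zero_grad()
    return loss.item()

# The scaler is created once, alongside the optimizer:
# scaler = torch.cuda.amp.GradScaler()
```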
### 4. **Inference**
- **Standard Generation**: Autoregressive text generation
- **OptimizedInference**: KV caching for faster generation
- **RetrievalCache**: Caching for RAG systems
### 5. **Configuration**
- **Config System**: Hierarchical configuration (Model, Training, Data)
- **JSON Support**: Save/load configurations
- **Default Values**: Sensible defaults for all parameters
---
## Usage Examples
### Training
```bash
# Basic training
python train.py --data /path/to/data
# With custom config
python train.py --data /path/to/data --config config.json
# Resume from checkpoint
python train.py --data /path/to/data --resume checkpoints/checkpoint_epoch_5.pt
# Specify device
python train.py --data /path/to/data --device cuda
```
### Inference
```bash
# Basic inference
python inference.py --checkpoint checkpoints/best_checkpoint.pt --prompt "Hello world"
# With sampling parameters
python inference.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt "The future of AI" \
--max-length 200 \
--temperature 0.8 \
--top-k 50 \
--top-p 0.95
# Optimized inference
python inference.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt "Hello" \
--optimized
```
---
## File Structure
```
sheepOp/
├── train.py # Main training script
├── inference.py # Inference script
├── config.py # Configuration management
├── config.json # Configuration file
├── data/ # Data module (symlink)
│ └── __init__.py # Tokenizer, DataLoader, DataProcessor
├── models/ # Model definitions
│ ├── transformer.py # Main transformer model
│ ├── blocks.py # Transformer blocks
│ ├── attention.py # Attention mechanisms
│ └── optimized_attention.py # Optimized inference
├── training/ # Training utilities
│ ├── __init__.py # Trainer class
│ └── metrics.py # Training metrics
├── checkpoints/ # Saved model checkpoints
└── requirements.txt # Dependencies
```
---
## Flow Summary
1. **Data Ingestion**: Raw files → Text extraction → Text lines
2. **Tokenization**: Text lines → Token sequences → Batched data
3. **Model Setup**: Config → Model → Optimizer → Scheduler
4. **Training**: Batches → Forward → Loss → Backward → Update → Checkpoint
5. **Inference**: Checkpoint → Model → Prompt → Generate → Output
---
*This documentation provides a complete view of the SheepOp LLM project architecture and workflow.*