# SheepOp LLM - Complete Architecture Documentation

Complete documentation of the SheepOp Language Model project architecture, data flow, training pipeline, and inference system.

## Table of Contents

1. [System Overview](#system-overview)
2. [Data Ingestion Pipeline](#data-ingestion-pipeline)
3. [Training Pipeline](#training-pipeline)
4. [Model Architecture](#model-architecture)
5. [Inference Pipeline](#inference-pipeline)
6. [Complete Workflow](#complete-workflow)

---

## System Overview

```mermaid
graph TB
    subgraph "Data Sources"
        A[PDF Files] --> DataProcessor
        B[Images - PNG/JPG/etc] --> DataProcessor
        C[Code Files - .py/.js/etc] --> DataProcessor
        D[Text Files - .txt/.md/etc] --> DataProcessor
    end

    DataProcessor[DataProcessor<br/>Multi-Format Extractor] --> TextList[Text Lines]
    TextList --> Tokenizer[SimpleTokenizer<br/>Character-Level]
    Tokenizer --> DataLoader[PyTorch DataLoader<br/>Batched Sequences]
    DataLoader --> Trainer[Trainer<br/>Training Loop]

    subgraph "Training Components"
        Trainer --> Model[TransformerModel]
        Trainer --> Optimizer[AdamW Optimizer]
        Trainer --> Scheduler[CosineAnnealingLR]
        Trainer --> Loss[CrossEntropyLoss]
    end

    Model --> Checkpoint[Model Checkpoints<br/>checkpoints/*.pt]
    Checkpoint --> Inference[Inference Script]
    Inference --> GeneratedText[Generated Text]

    style DataProcessor fill:#e1f5ff
    style Model fill:#fff4e1
    style Trainer fill:#ffe1f5
    style Checkpoint fill:#e1ffe1
```
---

## Data Ingestion Pipeline

### Multi-Format Data Processing Flow

```mermaid
flowchart TD
    Start([Start:<br/>train.py --data path]) --> CheckPath{Path Type?}
    CheckPath -->|File| SingleFile[Process Single File]
    CheckPath -->|Directory| Directory[Process Directory]

    SingleFile --> DataProcessor[DataProcessor.process_file]
    Directory --> RecursiveScan[Recursive Directory Scan<br/>Find all files]

    RecursiveScan --> FileType{File Extension?}
    FileType -->|.txt/.md/.json/etc| TextExtract[Read as Text File<br/>Line by line]
    FileType -->|.py/.js/.java/etc| CodeExtract[Read as Code File<br/>Line by line]
    FileType -->|.pdf| PDFExtract[PDF Extraction<br/>PyPDF2/pdfplumber]
    FileType -->|.png/.jpg/.tiff/etc| ImageExtract[OCR Extraction<br/>pytesseract]
    FileType -->|Unknown| Fallback[Try Text Fallback]

    PDFExtract --> PDFPages[Extract Each Page]
    PDFPages --> PDFLines[Split into Lines]

    ImageExtract --> OCR[Perform OCR<br/>pytesseract]
    OCR --> OCRLines[Split OCR Text into Lines]

    TextExtract --> FilterLines[Filter Lines<br/>min_length=10]
    CodeExtract --> FilterLines
    PDFLines --> FilterLines
    OCRLines --> FilterLines
    Fallback --> FilterLines

    FilterLines --> Combine[List of Text Lines]
    Combine --> Validate{Texts Empty?}
    Validate -->|Yes| Error[Raise Error:<br/>No data extracted]
    Validate -->|No| Success[✅ Success<br/>N text samples loaded]
    Success --> TokenizerStep[Next: Tokenization]

    style DataProcessor fill:#e1f5ff
    style PDFExtract fill:#ffe1f5
    style ImageExtract fill:#fff4e1
    style Success fill:#e1ffe1
    style Error fill:#ffe1e1
```

### Data Processing Components

```mermaid
classDiagram
    class DataProcessor {
        +process_file(file_path) Iterator[str]
        +process_directory(directory) Iterator[str]
        +process_to_list(...) List[str]
        -_process_text_file() Iterator[str]
        -_process_code_file() Iterator[str]
        -_process_pdf() Iterator[str]
        -_process_image() Iterator[str]
        -_check_dependencies()
    }

    class SimpleTokenizer {
        +vocab: Dict[str, int]
        +inv_vocab: Dict[int, str]
        +vocab_size: int
        +encode(text: str) List[int]
        +decode(token_ids: List[int]) str
        +save_vocab(path: str)
    }

    class TextDataset {
        +texts: List[str]
        +tokenizer: SimpleTokenizer
        +max_length: int
        +sequences: List[torch.Tensor]
        +__getitem__(idx) Dict
        +_prepare_sequences() List[Tensor]
    }

    class DataLoader {
        +batch_size: int
        +shuffle: bool
        +num_workers: int
        +collate_fn: Callable
    }

    DataProcessor --> TextDataset : extracts text
    SimpleTokenizer --> TextDataset : tokenizes
    TextDataset --> DataLoader : creates dataset
```
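For intuition, here is a minimal sketch of a character-level tokenizer with the same interface as `SimpleTokenizer` (`vocab`, `inv_vocab`, `vocab_size`, `encode`, `decode`, `save_vocab`). The class name `CharTokenizer` and the handling of unknown characters are illustrative; the project's actual internals may differ.

```python
import json
from typing import Dict, List


class CharTokenizer:
    """Illustrative character-level tokenizer mirroring the SimpleTokenizer interface."""

    def __init__(self, texts: List[str]):
        # Build the vocabulary from every unique character in the corpus.
        chars = sorted(set("".join(texts)))
        self.vocab: Dict[str, int] = {ch: i for i, ch in enumerate(chars)}
        self.inv_vocab: Dict[int, str] = {i: ch for ch, i in self.vocab.items()}
        self.vocab_size: int = len(self.vocab)

    def encode(self, text: str) -> List[int]:
        # Characters outside the vocabulary are simply skipped in this sketch.
        return [self.vocab[ch] for ch in text if ch in self.vocab]

    def decode(self, token_ids: List[int]) -> str:
        return "".join(self.inv_vocab[i] for i in token_ids)

    def save_vocab(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vocab, f, ensure_ascii=False)


# Example: round-trip a prompt through the tokenizer.
tok = CharTokenizer(["hello world", "sheep op"])
ids = tok.encode("hello sheep")
assert tok.decode(ids) == "hello sheep"
```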
---

## Training Pipeline

### Complete Training Flow

```mermaid
flowchart TD
    Start([python train.py<br/>--data path]) --> Args[Parse Arguments<br/>--data, --config, --resume, --device]
    Args --> ConfigLoad{Config File<br/>Provided?}
    ConfigLoad -->|Yes| LoadConfig[Load config.json]
    ConfigLoad -->|No| DefaultConfig[Use Default Config]

    LoadConfig --> Config[Config Object<br/>ModelConfig<br/>TrainingConfig<br/>DataConfig<br/>seed=42]
    DefaultConfig --> Config

    Config --> SetSeed[Set Random Seed<br/>torch.manual_seed<br/>torch.cuda.manual_seed_all<br/>CUDNN deterministic]
    SetSeed --> Device[Detect Device<br/>CUDA/MPS/CPU]
    Device --> DataIngestion[Data Ingestion Pipeline<br/>Extract text from all files]
    DataIngestion --> TextList[List of Text Lines<br/>N samples]

    TextList --> CreateTokenizer[Create SimpleTokenizer<br/>Character-level vocab]
    CreateTokenizer --> Tokenizer[Tokenizer Ready<br/>vocab_size calculated]

    Tokenizer --> CreateDataLoader[Create DataLoader<br/>Batch size<br/>Max length<br/>Shuffle]
    CreateDataLoader --> TrainLoader[PyTorch DataLoader<br/>Batched sequences]

    TrainLoader --> CheckResume{Resume<br/>Checkpoint?}
    CheckResume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>Model state<br/>Optimizer state<br/>Scheduler state<br/>Epoch/Step]
    CheckResume -->|No| CreateModel[Create New Model<br/>TransformerModel]
    LoadCheckpoint --> CreateModel

    CreateModel --> Model[Model Ready<br/>N parameters]
    Model --> CreateOptimizer[Create Optimizer<br/>AdamW<br/>lr, weight_decay]
    CreateOptimizer --> CreateScheduler[Create Scheduler<br/>CosineAnnealingLR<br/>T_max=total_steps]
    CreateScheduler --> CreateTrainer[Create Trainer<br/>Model<br/>DataLoader<br/>Optimizer<br/>Scheduler<br/>Device]
    CreateTrainer --> Trainer[Trainer Ready]

    Trainer --> TrainingLoop[Training Loop<br/>For each epoch]
    TrainingLoop --> EpochLoop[For each batch]
    EpochLoop --> Forward[Forward Pass<br/>Model prediction]
    Forward --> Loss[Compute Loss<br/>CrossEntropyLoss]
    Loss --> Backward[Backward Pass<br/>Compute gradients]

    Backward --> GradientAccum{Gradient<br/>Accumulation?}
    GradientAccum -->|Not yet| Accumulate[Accumulate gradients]
    Accumulate --> EpochLoop
    GradientAccum -->|Ready| ClipGrad[Gradient Clipping<br/>max_grad_norm]

    ClipGrad --> Update[Update Weights<br/>Optimizer.step]
    Update --> UpdateLR[Update Learning Rate<br/>Scheduler.step]
    UpdateLR --> ZeroGrad[Zero Gradients]

    ZeroGrad --> Log{Log Interval?}
    Log -->|Yes| LogMetrics[Log Metrics<br/>Loss, LR<br/>Save to metrics.json]
    Log -->|No| EvalCheck{Evaluation<br/>Interval?}
    LogMetrics --> EvalCheck

    EvalCheck -->|Yes| Evaluate[Evaluate on<br/>Validation Set]
    EvalCheck -->|No| SaveCheck{End of<br/>Epoch?}
    Evaluate --> SaveCheck

    SaveCheck -->|No| EpochLoop
    SaveCheck -->|Yes| SaveCheckpoint[Save Checkpoint<br/>Model state<br/>Optimizer state<br/>Scheduler state<br/>Epoch/Step]

    SaveCheckpoint --> MoreEpochs{More<br/>Epochs?}
    MoreEpochs -->|Yes| TrainingLoop
    MoreEpochs -->|No| GeneratePlots[Generate Training Plots<br/>loss_by_epoch.png<br/>training_curve.png]
    GeneratePlots --> End([Training Complete!<br/>Checkpoints saved])

    style SetSeed fill:#ffe1f5
    style DataIngestion fill:#e1f5ff
    style Model fill:#fff4e1
    style TrainingLoop fill:#ffe1f5
    style End fill:#e1ffe1
```
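The "Create DataLoader" step above wraps the tokenized text in a PyTorch `Dataset` of fixed-length next-token sequences. The sketch below shows one plausible way to do that, assuming the tokenizer interface shown earlier; `CharDataset`, the chunking stride, and the `collate` padding scheme are illustrative, not the project's exact `_prepare_sequences` logic.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class CharDataset(Dataset):
    """Illustrative dataset: chunks tokenized lines into fixed-length training sequences."""

    def __init__(self, texts, tokenizer, max_length=128):
        self.sequences = []
        for text in texts:
            ids = tokenizer.encode(text)
            # Slide over the token stream in max_length-sized chunks.
            for i in range(0, max(len(ids) - 1, 0), max_length):
                chunk = torch.tensor(ids[i:i + max_length + 1], dtype=torch.long)
                if len(chunk) > 1:
                    self.sequences.append(chunk)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        # Labels are the inputs shifted by one position (next-token prediction).
        return {"input_ids": seq[:-1], "labels": seq[1:]}


def collate(batch):
    # Pad variable-length chunks to the longest sequence in the batch.
    input_ids = torch.nn.utils.rnn.pad_sequence(
        [b["input_ids"] for b in batch], batch_first=True)
    labels = torch.nn.utils.rnn.pad_sequence(
        [b["labels"] for b in batch], batch_first=True, padding_value=-100)
    return {"input_ids": input_ids, "labels": labels}


# Example wiring (texts and tok are placeholders defined elsewhere):
# train_loader = DataLoader(CharDataset(texts, tok, max_length=128),
#                           batch_size=32, shuffle=True, collate_fn=collate)
```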
across runs and devices ``` ### Training Loop Details ```mermaid graph LR subgraph "Single Training Step" A[Batch Input
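The same sequence written out as a small helper; the function name `set_seed` is illustrative, but the individual calls are exactly those shown in the diagram.

```python
import torch


def set_seed(seed: int = 42) -> None:
    """Illustrative helper matching the sequence diagram above."""
    torch.manual_seed(seed)                      # seed the CPU RNG
    torch.cuda.manual_seed_all(seed)             # seed all CUDA devices
    torch.backends.cudnn.deterministic = True    # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False       # disable autotuning that breaks determinism


set_seed(42)
```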
### Training Loop Details

```mermaid
graph LR
    subgraph "Single Training Step"
        A[Batch Input<br/>input_ids, labels] --> B[Forward Pass<br/>Model forward]
        B --> C[Logits<br/>batch_size × seq_len × vocab_size]
        C --> D[Compute Loss<br/>CrossEntropyLoss]
        D --> E[Backward Pass<br/>Compute gradients]
        E --> F{Gradient<br/>Accumulation<br/>Steps reached?}
        F -->|No| G[Accumulate Gradients]
        F -->|Yes| H[Gradient Clipping]
        H --> I[Optimizer Step<br/>Update weights]
        I --> J[Scheduler Step<br/>Update LR]
        J --> K[Zero Gradients]
        K --> L[Log Metrics]
    end

    G --> A

    style B fill:#e1f5ff
    style D fill:#ffe1f5
    style I fill:#fff4e1
```
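In code, the single step above amounts to roughly the following. This is a minimal sketch: the function name `train_one_epoch` and the default values are illustrative, and it assumes a model whose forward pass returns logits of shape `(batch_size, seq_len, vocab_size)`. The project's Trainer layers AMP, logging, and checkpointing on top of this.

```python
import torch
import torch.nn.functional as F


def train_one_epoch(model, loader, optimizer, scheduler, device,
                    accum_steps=4, max_grad_norm=1.0):
    """Minimal sketch of the single training step diagrammed above."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass: logits of shape (batch_size, seq_len, vocab_size).
        logits = model(input_ids)

        # Cross-entropy over the flattened sequence, scaled so the accumulated
        # gradient matches one large-batch step.
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        (loss / accum_steps).backward()

        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()      # update weights
            scheduler.step()      # update learning rate
            optimizer.zero_grad()
```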
---

## Model Architecture

### Transformer Model Structure

```mermaid
graph TB
    Input[Input Tokens<br/>Token IDs] --> Embed[Token Embedding<br/>vocab_size → d_model]
    Embed --> PosEnc[Positional Encoding<br/>Sinusoidal/Cosine]
    PosEnc --> Dropout1[Dropout]

    Dropout1 --> Layer1[Transformer Block 1]
    Layer1 --> Layer2[Transformer Block 2]
    Layer2 --> Layer3[Transformer Block 3]
    Layer3 --> LayerN[Transformer Block N<br/>num_layers]

    LayerN --> LayerNorm[Final Layer Norm]
    LayerNorm --> OutputProj[Output Projection<br/>d_model → vocab_size]
    OutputProj --> Logits[Logits<br/>batch × seq_len × vocab_size]

    subgraph "Transformer Block Details"
        TBInput[Input x] --> Attention[Multi-Head<br/>Self-Attention]
        Attention --> AddNorm1[Add & Norm<br/>Residual + LayerNorm]
        AddNorm1 --> FFN[Feed-Forward<br/>Network]
        FFN --> AddNorm2[Add & Norm<br/>Residual + LayerNorm]
        AddNorm2 --> TBOutput[Output]
    end

    style Embed fill:#e1f5ff
    style Attention fill:#ffe1f5
    style FFN fill:#fff4e1
    style Logits fill:#e1ffe1
```
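The positional encoding stage is described above as sinusoidal. For reference, a minimal sketch of the standard sin/cos formulation; the class name and buffer layout are illustrative rather than the project's exact `PositionalEncoding`.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding (illustrative sketch)."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                  # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)                  # odd dims: cosine
        self.register_buffer("pe", pe.unsqueeze(0))                   # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the positional table, then dropout.
        return self.dropout(x + self.pe[:, : x.size(1)])
```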
### Multi-Head Attention Mechanism

```mermaid
graph LR
    Input[Input<br/>batch × seq_len × d_model] --> Q[Query<br/>Linear Layer]
    Input --> K[Key<br/>Linear Layer]
    Input --> V[Value<br/>Linear Layer]

    Q --> SplitQ[Split into<br/>num_heads heads]
    K --> SplitK[Split into<br/>num_heads heads]
    V --> SplitV[Split into<br/>num_heads heads]

    SplitQ --> ScaledDot[Scaled Dot-Product<br/>Attention]
    SplitK --> ScaledDot
    SplitV --> ScaledDot

    ScaledDot --> Mask[Causal Mask<br/>Lower triangular]
    Mask --> Softmax[Softmax]
    Softmax --> AttentionOutput[Attention Output<br/>per head]

    AttentionOutput --> Concat[Concat Heads]
    Concat --> OutputProj[Output Projection<br/>Linear Layer]
    OutputProj --> Output[Output<br/>batch × seq_len × d_model]

    style ScaledDot fill:#ffe1f5
    style Mask fill:#fff4e1
    style Output fill:#e1ffe1
```

### Complete Model Component Diagram

```mermaid
classDiagram
    class TransformerModel {
        +vocab_size: int
        +d_model: int
        +num_layers: int
        +num_heads: int
        +token_embedding: Embedding
        +pos_encoding: PositionalEncoding
        +layers: ModuleList[TransformerBlock]
        +final_norm: LayerNorm
        +output_proj: Linear
        +forward(input_ids) Tuple[Tensor, Tensor]
        +generate(...) Tensor
        +get_num_params() int
    }

    class TransformerBlock {
        +attention: MultiHeadAttention
        +ffn: FeedForward
        +norm1: LayerNorm
        +norm2: LayerNorm
        +dropout: Dropout
        +forward(x, mask) Tensor
    }

    class MultiHeadAttention {
        +num_heads: int
        +d_model: int
        +d_k: int
        +q_proj: Linear
        +k_proj: Linear
        +v_proj: Linear
        +out_proj: Linear
        +forward(q, k, v, mask) Tensor
    }

    class FeedForward {
        +linear1: Linear
        +linear2: Linear
        +activation: GELU/ReLU
        +dropout: Dropout
        +forward(x) Tensor
    }

    class PositionalEncoding {
        +d_model: int
        +max_len: int
        +pe: Tensor
        +forward(x) Tensor
    }

    TransformerModel --> TransformerBlock : contains N layers
    TransformerModel --> PositionalEncoding : adds positional info
    TransformerBlock --> MultiHeadAttention : self-attention
    TransformerBlock --> FeedForward : feed-forward network
```
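A compact sketch of the mechanism in the two diagrams above: project to Q/K/V, split into heads, apply scaled dot-product attention under a lower-triangular mask, then concatenate the heads and project back to `d_model`. This is the textbook form, not necessarily line-for-line what `models/attention.py` does.

```python
import math
import torch
import torch.nn as nn


class CausalMultiHeadAttention(nn.Module):
    """Textbook causal multi-head self-attention (illustrative sketch)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_k).
        q = self.q_proj(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores with a lower-triangular (causal) mask.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        attn = torch.softmax(scores, dim=-1)

        # Weighted sum, concatenate heads, and project back to d_model.
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```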
---

## Inference Pipeline

### Text Generation Flow

```mermaid
flowchart TD
    Start([python inference.py<br/>--checkpoint path<br/>--prompt text]) --> LoadModel[Load Model from Checkpoint<br/>Load state dict<br/>Set to eval mode]
    LoadModel --> CreateTokenizer[Create Tokenizer<br/>SimpleTokenizer]
    CreateTokenizer --> EncodePrompt[Encode Prompt<br/>Text → Token IDs]

    EncodePrompt --> CheckOptimized{Use Optimized<br/>Inference?}
    CheckOptimized -->|Yes| OptimizedGen[OptimizedInference<br/>with KV Caching]
    CheckOptimized -->|No| StandardGen[Standard Generation]

    StandardGen --> InitGen[Initialize Generation<br/>generated = input_ids]
    InitGen --> LoopStart[Generation Loop<br/>For max_length steps]
    LoopStart --> Forward[Forward Pass<br/>Model prediction]
    Forward --> NextToken[Get Next Token Logits<br/>Last position]
    NextToken --> Temperature[Apply Temperature<br/>Scale logits]

    Temperature --> TopK{Top-K<br/>Filtering?}
    TopK -->|Yes| FilterK[Filter Top-K Tokens]
    TopK -->|No| TopP{Top-P<br/>Nucleus Sampling?}
    FilterK --> TopP
    TopP -->|Yes| FilterP[Filter by Cumulative Prob]
    TopP -->|No| Sample[Sample Token<br/>Multinomial]
    FilterP --> Sample

    Sample --> Append[Append Token<br/>to Generated]
    Append --> CheckStop{Stop<br/>Condition?}
    CheckStop -->|No| LoopStart
    CheckStop -->|Yes| Decode[Decode Tokens<br/>Token IDs → Text]

    OptimizedGen --> KVCache[Use KV Cache<br/>Cache previous KV]
    KVCache --> LoopStart

    Decode --> Output[Generated Text<br/>Output]
    Output --> End([End])

    style OptimizedGen fill:#e1f5ff
    style Forward fill:#ffe1f5
    style Sample fill:#fff4e1
    style Output fill:#e1ffe1
```
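The sampling stage of the loop above, as a standalone sketch: temperature scaling, optional top-k filtering, optional top-p (nucleus) filtering, then a multinomial draw. The function name and parameter defaults are illustrative; the project's `generate` method may organize this differently.

```python
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """Pick the next token id from last-position logits of shape (vocab_size,)."""
    logits = logits / max(temperature, 1e-8)        # temperature scaling

    if top_k > 0:                                   # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")

    if top_p < 1.0:                                 # nucleus sampling
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        # Drop tokens once the cumulative probability before them exceeds top_p.
        remove = cumulative - probs > top_p
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sampled token id
```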
### Optimized Inference with KV Caching

```mermaid
graph TB
    subgraph "Standard Generation"
        A1[Input Token] --> B1[Forward Pass<br/>Compute Q, K, V]
        B1 --> C1[Attention<br/>Full Sequence]
        C1 --> D1[Next Token]
        D1 --> E1[Append Token]
        E1 --> A1
    end

    subgraph "Optimized Generation with KV Cache"
        A2[Input Token] --> B2{First<br/>Token?}
        B2 -->|Yes| C2[Forward Pass<br/>Compute Q, K, V]
        B2 -->|No| C2Cache[Use Cached K, V<br/>Only compute Q]
        C2 --> D2[Cache K, V]
        D2 --> E2[Attention<br/>Only with New Token]
        C2Cache --> E2
        E2 --> F2[Next Token]
        F2 --> G2[Append Token]
        G2 --> A2
    end

    style C2 fill:#e1f5ff
    style C2Cache fill:#ffe1f5
    style E2 fill:#e1ffe1
```
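The idea in the right-hand subgraph, reduced to a single per-layer cache: keep the keys and values computed for earlier positions and append only the new token's entries at each step, so attention at step t attends over t cached positions instead of recomputing the whole sequence. The snippet below is a generic illustration of one decode step, not the project's `OptimizedInference` class.

```python
import math
import torch


def attend_with_cache(q_new, k_new, v_new, cache):
    """One decode step for a single attention layer.

    q_new/k_new/v_new: (batch, heads, 1, d_k) projections for the newest token.
    cache: dict holding keys/values of all previous positions (empty on the first step).
    """
    if "k" in cache:
        # Append the new token's key/value to the cached history instead of recomputing it.
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    else:
        cache["k"], cache["v"] = k_new, v_new

    # The new query attends over every cached position; no causal mask is needed
    # because the cache only ever contains past tokens.
    scores = q_new @ cache["k"].transpose(-2, -1) / math.sqrt(q_new.size(-1))
    attn = torch.softmax(scores, dim=-1)
    return attn @ cache["v"]        # (batch, heads, 1, d_k)
```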
---

## Complete Workflow

### End-to-End System Flow

```mermaid
flowchart TB
    subgraph "Phase 1: Data Preparation"
        A1[Raw Data Files<br/>PDFs, Images, Code, Text] --> A2[DataProcessor<br/>Extract Text]
        A2 --> A3[Text Lines<br/>List of Strings]
        A3 --> A4[SimpleTokenizer<br/>Build Vocabulary]
        A4 --> A5[Tokenize & Chunk<br/>Create Sequences]
        A5 --> A6[DataLoader<br/>Batched Data]
    end

    subgraph "Phase 2: Model Initialization"
        B1[Load Config<br/>ModelConfig] --> B2[Set Random Seed<br/>seed=42]
        B2 --> B3[Create Model<br/>TransformerModel]
        B3 --> B4[Initialize Weights<br/>Normal Distribution]
        B4 --> B5[Create Optimizer<br/>AdamW]
        B5 --> B6[Create Scheduler<br/>CosineAnnealingLR]
    end

    subgraph "Phase 3: Training"
        C1[Trainer Setup] --> C2[Training Loop<br/>Epochs]
        C2 --> C3[Batch Loop]
        C3 --> C4[Forward Pass]
        C4 --> C5[Compute Loss]
        C5 --> C6[Backward Pass]
        C6 --> C7[Gradient Clipping]
        C7 --> C8[Update Weights]
        C8 --> C9[Save Checkpoint]
        C9 --> C10{More Epochs?}
        C10 -->|Yes| C2
        C10 -->|No| C11[Generate Plots<br/>Training Metrics]
    end

    subgraph "Phase 4: Inference"
        D1[Load Checkpoint] --> D2[Load Model State]
        D2 --> D3[Encode Prompt]
        D3 --> D4[Generate Text<br/>Autoregressive]
        D4 --> D5[Decode Tokens]
        D5 --> D6[Output Text]
    end

    A6 --> B1
    B6 --> C1
    C11 --> D1

    style A2 fill:#e1f5ff
    style B3 fill:#fff4e1
    style C4 fill:#ffe1f5
    style D4 fill:#e1ffe1
```
### Checkpoint Structure

```mermaid
graph TB
    Checkpoint[Checkpoint File<br/>checkpoint_epoch_N.pt] --> ModelState[model_state_dict<br/>Model weights]
    Checkpoint --> OptimizerState[optimizer_state_dict<br/>AdamW state]
    Checkpoint --> SchedulerState[scheduler_state_dict<br/>LR scheduler state]
    Checkpoint --> ModelConfig[model_config<br/>Model hyperparameters]
    Checkpoint --> Epoch[epoch<br/>Current epoch number]
    Checkpoint --> GlobalStep[global_step<br/>Training step count]
    Checkpoint --> BestValLoss[best_val_loss<br/>Best validation loss]

    ModelState --> Resume[Resume Training<br/>Restore model state]
    OptimizerState --> Resume
    SchedulerState --> Resume
    ModelConfig --> Resume
    Epoch --> Resume
    GlobalStep --> Resume

    style Checkpoint fill:#e1f5ff
    style Resume fill:#e1ffe1
```
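Saving and restoring a checkpoint with these keys is plain `torch.save`/`torch.load` on a dict. A minimal sketch; the surrounding variables (`model`, `optimizer`, `scheduler`, `model_config`, `epoch`, `global_step`, `best_val_loss`, `device`) are assumed to exist in the training script.

```python
import torch

# Save: bundle everything needed to resume training into one dict.
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
    "model_config": model_config,   # hyperparameters needed to rebuild the model
    "epoch": epoch,
    "global_step": global_step,
    "best_val_loss": best_val_loss,
}, f"checkpoints/checkpoint_epoch_{epoch}.pt")

# Resume: restore each component from the same dict.
ckpt = torch.load("checkpoints/checkpoint_epoch_5.pt", map_location=device)
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
scheduler.load_state_dict(ckpt["scheduler_state_dict"])
start_epoch = ckpt["epoch"] + 1
global_step = ckpt["global_step"]
```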
### Configuration Hierarchy

```mermaid
graph TB
    Config[Config<br/>Root Configuration] --> ModelConfig[ModelConfig<br/>vocab_size<br/>d_model<br/>num_layers<br/>num_heads<br/>d_ff<br/>max_seq_len<br/>dropout<br/>activation]
    Config --> TrainingConfig[TrainingConfig<br/>batch_size<br/>max_epochs<br/>learning_rate<br/>weight_decay<br/>warmup_steps<br/>max_grad_norm<br/>gradient_accumulation_steps<br/>use_amp]
    Config --> DataConfig[DataConfig<br/>data_dir<br/>max_length<br/>stride<br/>num_workers]
    Config --> Global[Global Settings<br/>device<br/>seed]

    ModelConfig --> Model[TransformerModel<br/>Model Architecture]
    TrainingConfig --> Trainer[Trainer<br/>Training Parameters]
    DataConfig --> DataLoader[DataLoader<br/>Data Parameters]

    style Config fill:#e1f5ff
    style Model fill:#fff4e1
    style Trainer fill:#ffe1f5
    style DataLoader fill:#e1ffe1
```
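The hierarchy above maps naturally onto nested dataclasses serialized to `config.json`. The sketch below shows a subset of the fields; the default values and the `save` helper are illustrative, not the project's actual `config.py`.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ModelConfig:
    vocab_size: int = 128
    d_model: int = 256
    num_layers: int = 4
    num_heads: int = 4
    max_seq_len: int = 512
    dropout: float = 0.1


@dataclass
class TrainingConfig:
    batch_size: int = 32
    max_epochs: int = 10
    learning_rate: float = 3e-4
    max_grad_norm: float = 1.0
    gradient_accumulation_steps: int = 1


@dataclass
class DataConfig:
    data_dir: str = "data"
    max_length: int = 128
    num_workers: int = 0


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
    data: DataConfig = field(default_factory=DataConfig)
    seed: int = 42
    device: str = "auto"

    def save(self, path: str) -> None:
        # Nested dataclasses serialize cleanly to a JSON layout like config.json.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)


Config().save("config_example.json")
```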
---

## Key Components Summary

### 1. **Data Processing**
- **DataProcessor**: Multi-format text extraction (PDFs, images, code, text)
- **SimpleTokenizer**: Character-level tokenization
- **TextDataset**: PyTorch dataset for training
- **DataLoader**: Batched data loading

### 2. **Model Architecture**
- **TransformerModel**: Complete transformer language model
- **TransformerBlock**: Multi-head attention + feed-forward
- **MultiHeadAttention**: Scaled dot-product attention with causal masking
- **FeedForward**: Position-wise feed-forward network
- **PositionalEncoding**: Sinusoidal position embeddings

### 3. **Training**
- **Trainer**: Complete training loop with:
  - Gradient accumulation
  - Mixed precision training (AMP)
  - Gradient clipping
  - Learning rate scheduling
  - Checkpointing
  - Metrics tracking

### 4. **Inference**
- **Standard Generation**: Autoregressive text generation
- **OptimizedInference**: KV caching for faster generation
- **RetrievalCache**: Caching for RAG systems

### 5. **Configuration**
- **Config System**: Hierarchical configuration (Model, Training, Data)
- **JSON Support**: Save/load configurations
- **Default Values**: Sensible defaults for all parameters

---

## Usage Examples

### Training

```bash
# Basic training
python train.py --data /path/to/data

# With custom config
python train.py --data /path/to/data --config config.json

# Resume from checkpoint
python train.py --data /path/to/data --resume checkpoints/checkpoint_epoch_5.pt

# Specify device
python train.py --data /path/to/data --device cuda
```

### Inference

```bash
# Basic inference
python inference.py --checkpoint checkpoints/best_checkpoint.pt --prompt "Hello world"

# With sampling parameters
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "The future of AI" \
    --max-length 200 \
    --temperature 0.8 \
    --top-k 50 \
    --top-p 0.95

# Optimized inference
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "Hello" \
    --optimized
```

---

## File Structure

```
sheepOp/
├── train.py                      # Main training script
├── inference.py                  # Inference script
├── config.py                     # Configuration management
├── config.json                   # Configuration file
├── data/                         # Data module (symlink)
│   └── __init__.py               # Tokenizer, DataLoader, DataProcessor
├── models/                       # Model definitions
│   ├── transformer.py            # Main transformer model
│   ├── blocks.py                 # Transformer blocks
│   ├── attention.py              # Attention mechanisms
│   └── optimized_attention.py    # Optimized inference
├── training/                     # Training utilities
│   ├── __init__.py               # Trainer class
│   └── metrics.py                # Training metrics
├── checkpoints/                  # Saved model checkpoints
└── requirements.txt              # Dependencies
```

---

## Flow Summary

1. **Data Ingestion**: Raw files → Text extraction → Text lines
2. **Tokenization**: Text lines → Token sequences → Batched data
3. **Model Setup**: Config → Model → Optimizer → Scheduler
4. **Training**: Batches → Forward → Loss → Backward → Update → Checkpoint
5. **Inference**: Checkpoint → Model → Prompt → Generate → Output

---

*This documentation provides a complete view of the SheepOp LLM project architecture and workflow.*