# Inference Benchmarking Guide

This guide explains how to use the benchmarking feature to compare optimized vs non-optimized inference performance for research purposes.

## Overview

The benchmarking feature runs inference both with and without optimizations (KV caching, optimized attention) and generates:

- **Performance metrics** (tokens/sec, latency, memory usage)
- **Comparison plots** (visual charts showing improvements)
- **CSV export** (data for further analysis)

## Data Storage Location

**All benchmark data is saved to:** `./inference_benchmarks/` (default)

**You can customize the location:**

```bash
python inference.py --benchmark --benchmark-dir ./research/results
```

**Data files created:**

- `inference_metrics.json` - All raw metrics (JSON format)
- `inference_metrics.csv` - Spreadsheet-friendly data (CSV format)
- `optimization_comparison.png` - Visual comparison charts
- `performance_over_time.png` - Trend analysis over multiple runs

**Note:** All runs accumulate in the same files, so you can run multiple benchmarks and build trends over time.

## Quick Start

### Basic Benchmark

```bash
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "The future of artificial intelligence" \
    --max-length 100 \
    --benchmark
```

This will:

1. Run inference **without** optimizations
2. Run inference **with** optimizations (KV cache)
3. Collect metrics for both runs
4. Generate comparison plots
5. Save all data to `./inference_benchmarks/`

### Custom Benchmark Directory

```bash
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "Your prompt here" \
    --max-length 100 \
    --benchmark \
    --benchmark-dir ./research/results
```

### Running Multiple Prompts for Trends

**Use the batch benchmark script** to run multiple prompts and build trend data:

```bash
# Create a prompts file
cat > prompts.txt << EOF
The future of artificial intelligence
Machine learning is transforming
Deep neural networks enable
Natural language processing requires
EOF

# Run batch benchmarks
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results
```

**Or use command-line prompts:**

```bash
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompts "Prompt 1" "Prompt 2" "Prompt 3" \
    --max-length 100
```

**Results accumulate** in the same files, allowing you to:

- Build trends across multiple prompts
- Analyze performance consistency
- Create comprehensive research reports

## Output Files

After running a benchmark, you'll get:

### 1. JSON Metrics File

**Location:** `inference_benchmarks/inference_metrics.json`

Contains all raw metrics data:

```json
{
  "runs": [
    {
      "run_name": "run_1234567890_optimized",
      "optimized": true,
      "tokens_per_second": 150.5,
      "time_per_token": 6.64,
      "memory_used_mb": 245.3,
      ...
    },
    ...
  ]
}
```
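To post-process the raw metrics programmatically, a minimal sketch (field names per the `{"runs": [...]}` example above):

```python
import json

# Load every accumulated benchmark run
with open("inference_benchmarks/inference_metrics.json") as f:
    runs = json.load(f)["runs"]

# Average throughput per mode
for label, flag in [("optimized", True), ("non-optimized", False)]:
    tps = [r["tokens_per_second"] for r in runs if r["optimized"] == flag]
    if tps:
        print(f"{label}: {sum(tps) / len(tps):.1f} tokens/sec over {len(tps)} runs")
```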
### 2. CSV Export

**Location:** `inference_benchmarks/inference_metrics.csv`

For spreadsheet analysis:

```csv
run_name,timestamp,optimized,prompt_length,generated_length,total_time,tokens_per_second,time_per_token,memory_used_mb,device
run_1234567890_optimized,1234567890.5,true,20,100,0.663,150.8,6.63,245.3,cuda
...
```

### 3. Comparison Plot

**Location:** `inference_benchmarks/optimization_comparison.png`

Shows 4 charts:

- **Tokens per Second** (speed comparison)
- **Time per Token** (latency comparison)
- **Total Generation Time** (overall speed)
- **Memory Usage** (memory efficiency)

### 4. Performance Over Time Plot

**Location:** `inference_benchmarks/performance_over_time.png`

Shows how performance varies across multiple benchmark runs.

## Metrics Collected

### Performance Metrics

- **Tokens per Second**: Generation speed
- **Time per Token**: Latency per token (milliseconds)
- **Total Time**: Complete generation time (seconds)

### Resource Metrics

- **Memory Usage**: GPU memory consumption (MB)
- **Device**: Device used (cuda/cpu/mps)

### Derived Metrics

- **Speedup**: Ratio of optimized to non-optimized speed
- **Memory Reduction**: Percentage reduction in memory usage
## Example Output

```
🔬 BENCHMARK MODE: Comparing optimized vs non-optimized inference
======================================================================

BENCHMARK RUN: run_1234567890
======================================================================

🔴 Running NON-OPTIMIZED inference...
⏱️  Total Time: 1.234 s
📊 Tokens/Second: 81.0
⚡ Time/Token: 12.35 ms
💾 Memory Used: 512.3 MB
📝 Generated: The future of artificial intelligence is bright...

🟢 Running OPTIMIZED inference...
⏱️  Total Time: 0.663 s
📊 Tokens/Second: 150.8
⚡ Time/Token: 6.63 ms
💾 Memory Used: 245.3 MB
📝 Generated: The future of artificial intelligence is bright...

🚀 SPEEDUP: 1.86x faster with optimizations
💾 MEMORY REDUCTION: 52.1%

📊 Generating comparison plots and data...
📊 Comparison plot saved to: ./inference_benchmarks/optimization_comparison.png
📊 Performance over time plot saved to: ./inference_benchmarks/performance_over_time.png
📊 Metrics exported to CSV: ./inference_benchmarks/inference_metrics.csv

✅ Benchmark complete! Results saved to: ./inference_benchmarks
```
## Running Multiple Benchmarks for Trends

### Method 1: Individual Runs (Manual)

```bash
# Run 1
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 1" --benchmark

# Run 2
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 2" --benchmark

# Run 3
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 3" --max-length 200 --benchmark
```

All runs accumulate in the same files:

- `inference_metrics.json` - All runs appended
- `inference_metrics.csv` - All runs in CSV format
- Plots update automatically with new data

### Method 2: Batch Script (Recommended)

**Create a prompts file:**

```bash
cat > research_prompts.txt << EOF
The future of artificial intelligence is bright.
Machine learning models are becoming more efficient.
Deep neural networks can process complex patterns.
Natural language processing enables human-computer interaction.
Transformer architectures revolutionized NLP.
EOF
```

**Run batch benchmarks:**

```bash
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file research_prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results \
    --delay 2.0
```

**Benefits:**

- ✅ Runs all prompts automatically
- ✅ Accumulates data for trend analysis
- ✅ Creates comprehensive performance reports
- ✅ Handles errors gracefully

**After running multiple benchmarks:**

- Check `performance_over_time.png` for trends
- Analyze `inference_metrics.csv` in Excel/Python
- Review the aggregated statistics in the console output

## Research Use Cases

### 1. Performance Analysis

Compare how optimizations affect inference speed:

```bash
python inference.py \
    --checkpoint checkpoints/best.pt \
    --prompt "Your research prompt" \
    --benchmark
```

### 2. Memory Efficiency Study

Analyze memory usage improvements:

```bash
# Check memory reduction
python inference.py --checkpoint checkpoints/best.pt --prompt "Long prompt" --max-length 500 --benchmark
```

### 3. Scalability Testing

Test with different generation lengths (or script the sweep, as shown after this block):

```bash
# Short sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 50 --benchmark

# Medium sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 200 --benchmark

# Long sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 1000 --benchmark
```
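Because every run appends to the same metrics files, the sweep above can be automated so the scaling trend accumulates in one pass. A small driver sketch (an equivalent shell loop works just as well):

```python
import subprocess

# Sweep generation lengths; each run appends to the shared metrics files
for max_len in (50, 100, 200, 500, 1000):
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint", "checkpoints/best.pt",
         "--prompt", "Test",
         "--max-length", str(max_len),
         "--benchmark"],
        check=True,
    )
```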
## Plot Interpretation

### Comparison Plot (`optimization_comparison.png`)

**Top Left - Tokens per Second:**

- Higher is better
- Shows generation speed
- Speedup annotation shows improvement factor

**Top Right - Time per Token:**

- Lower is better
- Shows latency per token
- Important for real-time applications

**Bottom Left - Total Generation Time:**

- Lower is better
- Overall generation time
- Most user-visible metric

**Bottom Right - Memory Usage:**

- Lower is better
- GPU memory consumption
- Memory reduction annotation shows savings

### Performance Over Time Plot (`performance_over_time.png`)

Shows performance trends across multiple benchmark runs:

- **Green line**: Optimized performance
- **Red line**: Non-optimized performance
- Useful for finding performance regressions or improvements

## Reporting Results

### Speedup Calculation

```
Speedup = Optimized Tokens/Second / Non-Optimized Tokens/Second
```

**Example:**

- Optimized: 150 tokens/sec
- Non-optimized: 81 tokens/sec
- Speedup: 150/81 = 1.85x faster

### Memory Reduction Calculation

```
Memory Reduction % = (1 - Optimized Memory / Non-Optimized Memory) × 100
```

**Example:**

- Optimized: 245 MB
- Non-optimized: 512 MB
- Reduction: (1 - 245/512) × 100 = 52.1%

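The same calculations as code, using the numbers from the examples above:

```python
def speedup(optimized_tps: float, baseline_tps: float) -> float:
    """Speedup = optimized tokens/sec divided by non-optimized tokens/sec."""
    return optimized_tps / baseline_tps

def memory_reduction_pct(optimized_mb: float, baseline_mb: float) -> float:
    """Memory reduction % = (1 - optimized / non-optimized) * 100."""
    return (1 - optimized_mb / baseline_mb) * 100

print(f"{speedup(150, 81):.2f}x faster")                     # 1.85x faster
print(f"{memory_reduction_pct(245, 512):.1f}% less memory")  # 52.1% less memory
```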
## Tips for Best Results

1. **Warm Up GPU**: Run a few inference calls before benchmarking to warm up the GPU (see the sketch after this list)
2. **Clear Cache**: The benchmark automatically clears the CUDA cache between runs
3. **Multiple Runs**: Run multiple benchmarks for statistical significance
4. **Consistent Prompts**: Use the same prompt for fair comparison
5. **Device Consistency**: Use the same device for all runs

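For tip 1, a minimal warm-up sketch in PyTorch. The `model.generate(...)` call mirrors the standard path shown in the execution-flow diagram below, but the exact signature here is an assumption:

```python
import torch

def warm_up(model, input_ids, n_iters: int = 3):
    """Run a few throwaway generations so CUDA kernels are compiled/cached
    and GPU clocks have ramped up before any timed run (illustrative)."""
    with torch.no_grad():
        for _ in range(n_iters):
            model.generate(input_ids)  # exact signature depends on your model
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work to finish
```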
## Command Line Options

```bash
python inference.py \
    --checkpoint PATH     # Path to model checkpoint (required)
    --prompt TEXT         # Prompt text (required)
    --max-length INT      # Maximum generation length (default: 100)
    --temperature FLOAT   # Sampling temperature (default: 1.0)
    --top-k INT           # Top-k sampling (default: 50)
    --top-p FLOAT         # Top-p sampling (default: 0.95)
    --device DEVICE       # Device: cuda/cpu/mps (default: cuda)
    --benchmark           # Enable benchmarking mode
    --benchmark-dir DIR   # Benchmark output directory (default: ./inference_benchmarks)
```

## Troubleshooting

### No GPU Memory Stats

If memory stats show as `None`:

- CUDA: Memory tracking should work automatically
- MPS (Apple Silicon): Memory tracking not available
- CPU: Memory tracking not available

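To confirm what your device can report, a quick check using standard PyTorch calls:

```python
import torch

if torch.cuda.is_available():
    # CUDA exposes per-device allocation counters
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
else:
    # CPU / MPS: no equivalent counters, so the benchmark reports None
    print("No CUDA device: memory stats unavailable")
```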
### Plots Not Generated

If plots fail to generate:

- Ensure `matplotlib` is installed: `pip install matplotlib`
- Check file permissions for the output directory

### Inconsistent Results

For consistent results:

- Use the same device for all runs
- Use the same prompt length
- Allow the GPU to warm up
- Close other GPU applications

## Example Research Workflow

```bash
# 1. Run initial benchmark
python inference.py --checkpoint checkpoints/best.pt --prompt "Test prompt" --benchmark

# 2. Review results
ls inference_benchmarks/
cat inference_benchmarks/inference_metrics.json

# 3. Generate plots (already done automatically)
# View: inference_benchmarks/optimization_comparison.png

# 4. Analyze CSV data
# Open: inference_benchmarks/inference_metrics.csv in Excel/Python (see the pandas sketch below)

# 5. Run additional benchmarks
python inference.py --checkpoint checkpoints/best.pt --prompt "Different prompt" --max-length 200 --benchmark

# 6. Compare results
python inference.py --checkpoint checkpoints/best.pt --prompt "Same prompt" --benchmark
```
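For step 4, a short pandas sketch (column names per the CSV example earlier; any spreadsheet tool works just as well):

```python
import pandas as pd

# Load every accumulated run
df = pd.read_csv("inference_benchmarks/inference_metrics.csv")

# Mean throughput and memory per mode (optimized vs non-optimized)
summary = df.groupby("optimized")[["tokens_per_second", "memory_used_mb"]].mean()
print(summary)
```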
## Optimization Architecture & Code Injection

### Overview: Optimization Layers

The optimizations are implemented as layers that wrap the standard inference pipeline:

```mermaid
flowchart TB
    subgraph subGraph0["Standard Inference (Non-Optimized)"]
        A["Input Prompt"]
        B["Tokenize"]
        C["Embedding Layer"]
        D["Transformer Blocks"]
        E["Attention: Recompute All"]
        F["Forward Pass: O(n²)"]
        G["Output Tokens"]
        H["Detokenize"]
        I["Generated Text"]
    end
    subgraph subGraph1["Optimized Inference (With KV Cache)"]
        A2["Input Prompt"]
        B2["Tokenize"]
        C2["Embedding Layer"]
        D2["Transformer Blocks"]
        E2["Optimized Attention"]
        F2["KV Cache Layer"]
        G2["Forward Pass: O(n)"]
        H2["Output Tokens"]
        I2["Detokenize"]
        J2["Generated Text"]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    A2 --> B2
    B2 --> C2
    C2 --> D2
    D2 --> E2
    E2 --> F2
    F2 --> G2
    G2 --> H2
    H2 --> I2
    I2 --> J2
    style E fill:#ffcccc
    style F fill:#ffcccc
    style E2 fill:#ccffcc
    style F2 fill:#ccffcc
```
### Detailed Optimization Flow

```mermaid
flowchart LR
    subgraph subGraph0["Request Flow"]
        Start["Benchmark Request"]
        Mode{"Optimized?"}
        Standard["Standard Path"]
        Optimized["Optimized Path"]
    end
    subgraph subGraph1["Standard Path"]
        S1["Model.generate"]
        S2["Transformer Forward"]
        S3["MultiHeadAttention"]
        S4["Compute Q, K, V"]
        S5["Recompute All KVs"]
        S6["Attention Scores: O(n²)"]
        S7["Generate Token"]
    end
    subgraph subGraph2["Optimized Path"]
        O1["OptimizedInference"]
        O2["Init KV Cache"]
        O3["Transformer Forward"]
        O4["OptimizedMultiHeadAttention"]
        O5["Compute Q, K, V"]
        O6["KV Cache Layer"]
        O7["Append to Cache"]
        O8["Reuse Cached KVs"]
        O9["Attention Scores: O(n)"]
        O10["Generate Token"]
    end
    Start --> Mode
    Mode -- No --> Standard
    Mode -- Yes --> Optimized
    Standard --> S1
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S5 --> S6
    S6 --> S7
    Optimized --> O1
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
    O5 --> O6
    O6 --> O7
    O7 --> O8
    O8 --> O9
    O9 --> O10
    S7 --> Metrics["Collect Metrics"]
    O10 --> Metrics
    style Standard fill:#ffcccc
    style Optimized fill:#ccffcc
    style S5 fill:#ffcccc
    style O8 fill:#ccffcc
```
### Code Injection Points

```mermaid
graph TB
    subgraph "Standard Model Architecture"
        A[TransformerModel] --> B[TransformerBlock]
        B --> C[MultiHeadAttention]
        C --> D[Q, K, V Projections]
        D --> E[Attention Computation]
        E --> F[Output Projection]
        F --> G[Feed Forward]
    end

    subgraph "Optimization Injection Points"
        H[OptimizedInference Wrapper] --> A
        A --> B2[TransformerBlock]
        B2 --> C2[OptimizedMultiHeadAttention]
        C2 --> D2[Q, K, V Projections]
        D2 --> I[KV Cache Injection]
        I --> E2[Optimized Attention]
        E2 --> F2[Output Projection]
        F2 --> G2[Feed Forward]
    end

    subgraph "KV Cache Layer Details"
        I --> J[Cache Check]
        J --> K{Cache Exists?}
        K -->|No| L[Compute K, V]
        K -->|Yes| M[Retrieve from Cache]
        L --> N[Store in Cache]
        M --> O[Append New K, V]
        N --> O
        O --> P[Use Cached KVs]
    end

    style H fill:#90EE90
    style I fill:#90EE90
    style K fill:#FFD700
    style P fill:#90EE90
```
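The cache layer in the diagram reduces to one core operation: keep the key/value tensors from earlier decoding steps and append each new step's K, V instead of recomputing the whole sequence. A minimal PyTorch sketch of the idea (class name and tensor shapes are illustrative, not this repo's actual implementation):

```python
import torch

class KVCache:
    """Per-layer cache of attention keys/values across decoding steps (illustrative)."""

    def __init__(self):
        self.k = None  # (batch, heads, seq_len_so_far, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            # First step: nothing cached yet, store the prompt's K, V
            self.k, self.v = k_new, v_new
        else:
            # Later steps: append the new token's K, V along the sequence axis
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```

This is why the per-step cost drops from O(n²) to O(n): each step computes Q, K, V for one new token and attends over the cached keys/values instead of re-running the full sequence.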
### Benchmark Execution Flow

```mermaid
sequenceDiagram
    participant User
    participant InferenceScript
    participant BenchmarkModule
    participant OptimizedInference
    participant StandardModel
    participant MetricsCollector

    User->>InferenceScript: python inference.py --benchmark
    InferenceScript->>BenchmarkModule: Initialize Metrics
    BenchmarkModule->>MetricsCollector: Create InferenceMetrics

    Note over InferenceScript: Run 1: Non-Optimized
    InferenceScript->>StandardModel: model.generate()
    StandardModel->>StandardModel: Forward Pass (O(n²))
    StandardModel-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=false)

    Note over InferenceScript: Run 2: Optimized
    InferenceScript->>OptimizedInference: get_optimized_inference()
    OptimizedInference->>OptimizedInference: Init KV Cache
    OptimizedInference->>OptimizedInference: generate_with_cache()

    loop For each token
        OptimizedInference->>OptimizedInference: Forward Pass (O(n))
        OptimizedInference->>OptimizedInference: Update KV Cache
    end

    OptimizedInference-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=true)

    MetricsCollector->>MetricsCollector: Calculate Speedup
    MetricsCollector->>MetricsCollector: Generate Plots
    MetricsCollector->>MetricsCollector: Export CSV
    MetricsCollector-->>User: Results & Plots
```
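The timing side of this flow fits in a small harness. A sketch under stated assumptions: `generate_fn` is a zero-argument closure over whichever path (standard or KV-cached) is being measured, and `n_new_tokens` is the number of tokens it generates:

```python
import time
import torch

def timed_run(generate_fn, n_new_tokens: int) -> dict:
    """Time one generation call and derive the metrics shown above (sketch)."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn()  # standard or KV-cached generation
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # stop the clock only after GPU work finishes
    total = time.perf_counter() - start
    return {
        "total_time": total,
        "tokens_per_second": n_new_tokens / total,
        "time_per_token_ms": total / n_new_tokens * 1000,
    }
```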
### Optimization Components Stack

```mermaid
graph TD
    subgraph "Application Layer"
        A[inference.py] --> B[benchmark_inference]
        B --> C[Generate Text]
    end

    subgraph "Optimization Layer"
        C --> D{Optimized?}
        D -->|Yes| E[OptimizedInference]
        D -->|No| F[Standard Model]
        E --> G[KV Cache Manager]
        E --> H[Optimized Attention]
    end

    subgraph "Core Model Layer"
        F --> I[TransformerModel]
        E --> I
        I --> J[TransformerBlock]
        J --> K[MultiHeadAttention]
        H --> K
        K --> L[Attention Computation]
    end

    subgraph "Cache Layer"
        G --> M[KVCache Data Structure]
        M --> N[Keys Cache]
        M --> O[Values Cache]
        N --> P[Retrieve Previous K]
        O --> Q[Retrieve Previous V]
    end

    subgraph "Compute Layer"
        L --> R[Q × K^T]
        P --> R
        Q --> R
        R --> S[Softmax]
        S --> T[Attention Weights]
        T --> U[Output]
    end

    style E fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style M fill:#FFD700
```
### Performance Comparison Schema

```mermaid
flowchart LR
    subgraph subGraph0["Metrics Collection"]
        A["Benchmark Run"]
        B["Non-Optimized Metrics"]
        C["Optimized Metrics"]
        D["Time: T1<br>Memory: M1<br>Speed: S1"]
        E["Time: T2<br>Memory: M2<br>Speed: S2"]
    end
    subgraph Analysis["Analysis"]
        F["Calculate Speedup"]
        G["Speedup = S2/S1"]
        H["Calculate Memory Reduction"]
        I["Reduction = (M1-M2)/M1 × 100%"]
    end
    subgraph Visualization["Visualization"]
        J["Comparison Plot"]
        K["Trend Analysis"]
        L["Performance Over Time"]
    end
    subgraph subGraph3["Data Export"]
        M["JSON Metrics"]
        N["CSV Export"]
    end
    A --> B & C
    B --> D
    C --> E
    D --> F & H
    E --> F & H
    F --> G & K
    H --> I
    G --> J
    I --> J
    K --> L
    J --> M & N
    L --> M & N
    style F fill:#FFD700
    style G fill:#90EE90
    style I fill:#90EE90
```
## Data File Locations Summary

**All benchmark data is saved to:**

```
./inference_benchmarks/
├── inference_metrics.json          # All raw metrics (JSON)
├── inference_metrics.csv           # Spreadsheet data (CSV)
├── optimization_comparison.png     # Comparison charts
└── performance_over_time.png       # Trend analysis
```

**Custom location:**

```bash
--benchmark-dir ./research/results
```

**Data accumulates:** Each benchmark run appends to the same files, building trends over time.

## Next Steps

1. ✅ Run your first benchmark
2. ✅ Review the comparison plots
3. ✅ Analyze CSV data for deeper insights
4. ✅ Run multiple benchmarks for statistical analysis
5. ✅ Use the batch script for trend analysis
6. ✅ Include results in your research paper/presentation

---

**Happy Benchmarking!** 📊🔬