Inference Benchmarking Guide

This guide explains how to use the benchmarking feature to compare optimized vs non-optimized inference performance for research purposes.

Overview

The benchmarking feature runs inference both with and without optimizations (KV caching, optimized attention) and generates:

  • Performance metrics (tokens/sec, latency, memory usage)
  • Comparison plots (visual charts showing improvements)
  • CSV export (data for further analysis)

Data Storage Location

All benchmark data is saved to: ./inference_benchmarks/ (default)

You can customize the location:

python inference.py --benchmark --benchmark-dir ./research/results

Data files created:

  • inference_metrics.json - All raw metrics (JSON format)
  • inference_metrics.csv - Spreadsheet-friendly data (CSV format)
  • optimization_comparison.png - Visual comparison charts
  • performance_over_time.png - Trend analysis over multiple runs

Note: All runs accumulate in the same files, so you can run multiple benchmarks and build trends over time.

Quick Start

Basic Benchmark

python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "The future of artificial intelligence" \
    --max-length 100 \
    --benchmark

This will:

  1. Run inference without optimizations
  2. Run inference with optimizations (KV cache)
  3. Collect metrics for both runs
  4. Generate comparison plots
  5. Save all data to ./inference_benchmarks/

Custom Benchmark Directory

python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "Your prompt here" \
    --max-length 100 \
    --benchmark \
    --benchmark-dir ./research/results

Batch Benchmarking

Use the batch benchmark script to run multiple prompts and create trends:

# Create a prompts file
cat > prompts.txt << EOF
The future of artificial intelligence
Machine learning is transforming
Deep neural networks enable
Natural language processing requires
EOF

# Run batch benchmarks
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results

Or use command-line prompts:

python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompts "Prompt 1" "Prompt 2" "Prompt 3" \
    --max-length 100

Results accumulate in the same files, allowing you to:

  • Build trends across multiple prompts
  • Analyze performance consistency
  • Create comprehensive research reports

Output Files

After running a benchmark, you'll get:

1. JSON Metrics File

Location: inference_benchmarks/inference_metrics.json

Contains all raw metrics data:

{
  "runs": [
    {
      "run_name": "run_1234567890_optimized",
      "optimized": true,
      "tokens_per_second": 150.5,
      "time_per_token": 6.64,
      "memory_used_mb": 245.3,
      ...
    },
    ...
  ]
}
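
As a quick sanity check, here is a minimal sketch that loads the accumulated runs from this file (field names as shown above; anything beyond them is an assumption):

import json
from pathlib import Path

# Load the accumulated benchmark runs (schema as shown above).
path = Path("inference_benchmarks/inference_metrics.json")
with path.open() as f:
    runs = json.load(f)["runs"]

for run in runs:
    mode = "optimized" if run["optimized"] else "baseline"
    print(f"{run['run_name']}: {run['tokens_per_second']:.1f} tok/s ({mode})")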

2. CSV Export

Location: inference_benchmarks/inference_metrics.csv

For spreadsheet analysis:

run_name,timestamp,optimized,prompt_length,generated_length,total_time,tokens_per_second,time_per_token,memory_used_mb,device
run_1234567890_optimized,1234567890.5,true,20,100,0.663,150.8,6.63,245.3,cuda
...
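
For analysis beyond a spreadsheet, a minimal pandas sketch over these columns (pandas itself is an extra dependency, not required by the benchmark):

import pandas as pd

df = pd.read_csv("inference_benchmarks/inference_metrics.csv")
# The CSV stores true/false strings; normalize to booleans first.
df["optimized"] = df["optimized"].astype(str).str.lower() == "true"

# Mean speed and memory per mode across all accumulated runs.
print(df.groupby("optimized")[["tokens_per_second", "memory_used_mb"]].mean())

# Overall average speedup of optimized vs non-optimized runs.
opt = df.loc[df["optimized"], "tokens_per_second"].mean()
base = df.loc[~df["optimized"], "tokens_per_second"].mean()
print(f"Average speedup: {opt / base:.2f}x")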

3. Comparison Plot

Location: inference_benchmarks/optimization_comparison.png

Shows 4 charts:

  • Tokens per Second (speed comparison)
  • Time per Token (latency comparison)
  • Total Generation Time (overall speed)
  • Memory Usage (memory efficiency)

4. Performance Over Time Plot

Location: inference_benchmarks/performance_over_time.png

Shows how performance varies across multiple benchmark runs.

Metrics Collected

Performance Metrics

  • Tokens per Second: Generation speed
  • Time per Token: Latency per token (milliseconds)
  • Total Time: Complete generation time

Resource Metrics

  • Memory Usage: GPU memory consumption (MB)
  • Device: Device used (cuda/cpu/mps)

Derived Metrics

  • Speedup: Ratio of optimized vs non-optimized speed
  • Memory Reduction: Percentage reduction in memory usage

Example Output

🔬 BENCHMARK MODE: Comparing optimized vs non-optimized inference
======================================================================

BENCHMARK RUN: run_1234567890
======================================================================

🔴 Running NON-OPTIMIZED inference...
  ⏱️  Total Time: 1.234 s
  📊 Tokens/Second: 81.0
  ⚡ Time/Token: 12.35 ms
  💾 Memory Used: 512.3 MB
  📝 Generated: The future of artificial intelligence is bright...

🟢 Running OPTIMIZED inference...
  ⏱️  Total Time: 0.663 s
  📊 Tokens/Second: 150.8
  ⚡ Time/Token: 6.63 ms
  💾 Memory Used: 245.3 MB
  📝 Generated: The future of artificial intelligence is bright...

🚀 SPEEDUP: 1.86x faster with optimizations
💾 MEMORY REDUCTION: 52.1%

📊 Generating comparison plots and data...
📊 Comparison plot saved to: ./inference_benchmarks/optimization_comparison.png
📊 Performance over time plot saved to: ./inference_benchmarks/performance_over_time.png
📊 Metrics exported to CSV: ./inference_benchmarks/inference_metrics.csv

✅ Benchmark complete! Results saved to: ./inference_benchmarks

Running Multiple Benchmarks

Method 1: Individual Runs (Manual)

# Run 1
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 1" --benchmark

# Run 2
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 2" --benchmark

# Run 3
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 3" --max-length 200 --benchmark

All runs accumulate in the same files:

  • inference_metrics.json - All runs appended
  • inference_metrics.csv - All runs in CSV format
  • Plots update automatically with new data

Method 2: Batch Script

Create a prompts file:

cat > research_prompts.txt << EOF
The future of artificial intelligence is bright.
Machine learning models are becoming more efficient.
Deep neural networks can process complex patterns.
Natural language processing enables human-computer interaction.
Transformer architectures revolutionized NLP.
EOF

Run batch benchmarks:

python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file research_prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results \
    --delay 2.0

Benefits:

  • Runs all prompts automatically
  • Accumulates data for trend analysis
  • Creates comprehensive performance reports
  • Handles errors gracefully

After running multiple benchmarks:

  • Check performance_over_time.png for trends
  • Analyze inference_metrics.csv in Excel/Python
  • Review aggregated statistics in console output

Research Use Cases

1. Performance Analysis

Compare how optimizations affect inference speed:

python inference.py \
    --checkpoint checkpoints/best.pt \
    --prompt "Your research prompt" \
    --benchmark

2. Memory Efficiency Study

Analyze memory usage improvements:

# Check memory reduction
python inference.py --checkpoint checkpoints/best.pt --prompt "Long prompt" --max-length 500 --benchmark

3. Scalability Testing

Test with different generation lengths:

# Short sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 50 --benchmark

# Medium sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 200 --benchmark

# Long sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 1000 --benchmark

Plot Interpretation

Comparison Plot (optimization_comparison.png)

Top Left - Tokens per Second:

  • Higher is better
  • Shows generation speed
  • Speedup annotation shows improvement factor

Top Right - Time per Token:

  • Lower is better
  • Shows latency per token
  • Important for real-time applications

Bottom Left - Total Generation Time:

  • Lower is better
  • Overall generation time
  • Most user-visible metric

Bottom Right - Memory Usage:

  • Lower is better
  • GPU memory consumption
  • Memory reduction annotation shows savings

Performance Over Time Plot (performance_over_time.png)

Shows performance trends across multiple benchmark runs:

  • Green line: Optimized performance
  • Red line: Non-optimized performance
  • Useful for finding performance regressions or improvements
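
If the built-in plot doesn't show what you need, here is a sketch of a custom trend plot built from the same CSV (assuming matplotlib and pandas are installed):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("inference_benchmarks/inference_metrics.csv")
df["optimized"] = df["optimized"].astype(str).str.lower() == "true"

# One line per mode, in the order the runs were recorded.
for flag, label, color in [(True, "Optimized", "green"), (False, "Non-optimized", "red")]:
    subset = df[df["optimized"] == flag].reset_index(drop=True)
    plt.plot(subset.index, subset["tokens_per_second"], marker="o", color=color, label=label)

plt.xlabel("Run #")
plt.ylabel("Tokens per second")
plt.legend()
plt.savefig("custom_trend.png")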

Reporting Results

Speedup Calculation

Speedup = Optimized Tokens/Second / Non-Optimized Tokens/Second

Example:

  • Optimized: 150 tokens/sec
  • Non-Optimized: 81 tokens/sec
  • Speedup: 150/81 = 1.85x faster

Memory Reduction Calculation

Memory Reduction % = (1 - Optimized Memory / Non-Optimized Memory) × 100

Example:

  • Optimized: 245 MB
  • Non-Optimized: 512 MB
  • Reduction: (1 - 245/512) × 100 = 52.1%
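
Both derived metrics are one-liners; here they are in code with the example numbers above:

def speedup(opt_tps: float, base_tps: float) -> float:
    """Ratio of optimized to non-optimized generation speed."""
    return opt_tps / base_tps

def memory_reduction_pct(opt_mb: float, base_mb: float) -> float:
    """Percentage reduction in memory usage."""
    return (1 - opt_mb / base_mb) * 100

print(f"Speedup: {speedup(150, 81):.2f}x")                          # 1.85x
print(f"Memory reduction: {memory_reduction_pct(245, 512):.1f}%")   # 52.1%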

Tips for Best Results

  1. Warm Up GPU: Run a few inference calls before benchmarking to warm up the GPU (see the sketch after this list)
  2. Clear Cache: The benchmark automatically clears CUDA cache between runs
  3. Multiple Runs: Run multiple benchmarks for statistical significance
  4. Consistent Prompts: Use the same prompt for fair comparison
  5. Device Consistency: Use the same device for all runs
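
A minimal warm-up-then-time sketch (the generate callable is a stand-in for your own generation call, not an API from this repo):

import time
import torch

def timed_generation(generate, n_warmup: int = 3):
    # Warm up: compile kernels and populate caches before measuring.
    for _ in range(n_warmup):
        generate()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # start from a clean allocator state
        torch.cuda.synchronize()      # ensure warm-up work has finished
    start = time.perf_counter()
    result = generate()
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # CUDA is async; wait before stopping the clock
    return result, time.perf_counter() - start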

Command Line Options

python inference.py \
    --checkpoint PATH          # Path to model checkpoint (required)
    --prompt TEXT              # Prompt text (required)
    --max-length INT           # Maximum generation length (default: 100)
    --temperature FLOAT        # Sampling temperature (default: 1.0)
    --top-k INT                # Top-k sampling (default: 50)
    --top-p FLOAT              # Top-p sampling (default: 0.95)
    --device DEVICE            # Device: cuda/cpu/mps (default: cuda)
    --benchmark                # Enable benchmarking mode
    --benchmark-dir DIR        # Benchmark output directory (default: ./inference_benchmarks)

Troubleshooting

No GPU Memory Stats

If memory stats show as None:

  • CUDA: Memory tracking should work automatically
  • MPS (Apple Silicon): Memory tracking not available
  • CPU: Memory tracking not available
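
You can verify what your device supports with standard PyTorch calls (a sketch; the benchmark does its own tracking internally):

import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run inference here ...
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"Peak GPU memory: {peak_mb:.1f} MB")
else:
    # MPS and CPU have no equivalent per-run counters here, so stats stay None.
    print("Memory tracking not available on this device.")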

Plots Not Generated

If plots fail to generate:

  • Ensure matplotlib is installed: pip install matplotlib
  • Check file permissions for output directory

Inconsistent Results

For consistent results:

  • Use same device for all runs
  • Use same prompt length
  • Allow GPU to warm up
  • Close other GPU applications

Example Research Workflow

# 1. Run initial benchmark
python inference.py --checkpoint checkpoints/best.pt --prompt "Test prompt" --benchmark

# 2. Review results
ls inference_benchmarks/
cat inference_benchmarks/inference_metrics.json

# 3. Generate plots (already done automatically)
# View: inference_benchmarks/optimization_comparison.png

# 4. Analyze CSV data
# Open: inference_benchmarks/inference_metrics.csv in Excel/Python

# 5. Run additional benchmarks
python inference.py --checkpoint checkpoints/best.pt --prompt "Different prompt" --max-length 200 --benchmark

# 6. Compare results
python inference.py --checkpoint checkpoints/best.pt --prompt "Same prompt" --benchmark

Optimization Architecture & Code Injection

Overview: Optimization Layers

The optimizations are implemented as layers that wrap the standard inference pipeline:

flowchart TB
 subgraph subGraph0["Standard Inference (Non-Optimized)"]
        B["Tokenize"]
        A["Input Prompt"]
        C["Embedding Layer"]
        D["Transformer Blocks"]
        E["Attention: Recompute All"]
        F["Forward Pass: O(n²)"]
        G["Output Tokens"]
        H["Detokenize"]
        I["Generated Text"]
  end
 subgraph subGraph1["Optimized Inference (With KV Cache)"]
        B2["Tokenize"]
        A2["Input Prompt"]
        C2["Embedding Layer"]
        D2["Transformer Blocks"]
        E2["Optimized Attention"]
        F2["KV Cache Layer"]
        G2["Forward Pass: O(n)"]
        H2["Output Tokens"]
        I2["Detokenize"]
        J2["Generated Text"]
  end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    A2 --> B2
    B2 --> C2
    C2 --> D2
    D2 --> E2
    E2 --> F2
    F2 --> G2
    G2 --> H2
    H2 --> I2
    I2 --> J2
    style E fill:#ffcccc
    style F fill:#ffcccc
    style E2 fill:#ccffcc
    style F2 fill:#ccffcc
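
To make the O(n) vs O(n²) distinction concrete, here is an illustrative single-step cached attention in PyTorch (a sketch, not the repo's actual OptimizedMultiHeadAttention):

import torch
import torch.nn.functional as F

def cached_attention_step(q, k, v, k_cache, v_cache):
    # q, k, v: (batch, heads, 1, head_dim) projections for the NEW token only.
    # k_cache, v_cache: (batch, heads, seq_so_far, head_dim) from earlier steps.
    k_cache = torch.cat([k_cache, k], dim=2)    # append instead of recomputing all KVs
    v_cache = torch.cat([v_cache, v], dim=2)
    scores = q @ k_cache.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)         # (batch, heads, 1, seq): O(n) per step
    return weights @ v_cache, k_cache, v_cache  # output plus updated caches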

Detailed Optimization Flow


flowchart LR
 subgraph subGraph0["Request Flow"]
        Mode{"Optimized?"}
        Start["Benchmark Request"]
        Standard["Standard Path"]
        Optimized["Optimized Path"]
  end
 subgraph subGraph1["Standard Path"]
        S1["Model.generate"]
        S2["Transformer Forward"]
        S3["MultiHeadAttention"]
        S4["Compute Q, K, V"]
        S5["Recompute All KVs"]
        S6["Attention Scores: O(n²)"]
        S7["Generate Token"]
  end
 subgraph subGraph2["Optimized Path"]
        O1["OptimizedInference"]
        O2["Init KV Cache"]
        O3["Transformer Forward"]
        O4["OptimizedMultiHeadAttention"]
        O5["Compute Q, K, V"]
        O6["KV Cache Layer"]
        O7["Append to Cache"]
        O8["Reuse Cached KVs"]
        O9["Attention Scores: O(n)"]
        O10["Generate Token"]
  end
    Start --> Mode
    Mode -- No --> Standard
    Mode -- Yes --> Optimized
    Standard --> S1
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S5 --> S6
    S6 --> S7
    Optimized --> O1
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
    O5 --> O6
    O6 --> O7
    O7 --> O8
    O8 --> O9
    O9 --> O10
    S7 --> Metrics["Collect Metrics"]
    O10 --> Metrics
    style Standard fill:#ffcccc
    style Optimized fill:#ccffcc
    style S5 fill:#ffcccc
    style O8 fill:#ccffcc

Code Injection Points

graph TB
    subgraph "Standard Model Architecture"
        A[TransformerModel] --> B[TransformerBlock]
        B --> C[MultiHeadAttention]
        C --> D[Q, K, V Projections]
        D --> E[Attention Computation]
        E --> F[Output Projection]
        F --> G[Feed Forward]
    end

    subgraph "Optimization Injection Points"
        H[OptimizedInference Wrapper] --> A
        A --> B2[TransformerBlock]
        B2 --> C2[OptimizedMultiHeadAttention]
        C2 --> D2[Q, K, V Projections]
        D2 --> I[KV Cache Injection]
        I --> E2[Optimized Attention]
        E2 --> F2[Output Projection]
        F2 --> G2[Feed Forward]
    end

    subgraph "KV Cache Layer Details"
        I --> J[Cache Check]
        J --> K{Cache Exists?}
        K -->|No| L[Compute K, V]
        K -->|Yes| M[Retrieve from Cache]
        L --> N[Store in Cache]
        M --> O[Append New K, V]
        N --> O
        O --> P[Use Cached KVs]
    end

    style H fill:#90EE90
    style I fill:#90EE90
    style K fill:#FFD700
    style P fill:#90EE90
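
A sketch of the cache data structure the diagram describes (the class name KVCache matches the diagram; the repo's actual implementation may differ):

import torch

class KVCache:
    """Per-layer key/value cache, appended along the sequence axis."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers     # each entry: (batch, heads, seq, head_dim)
        self.values = [None] * num_layers

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        if self.keys[layer] is None:        # first step: store K, V
            self.keys[layer], self.values[layer] = k, v
        else:                               # later steps: append the new K, V
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)
        return self.keys[layer], self.values[layer]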

Benchmark Execution Flow

sequenceDiagram
    participant User
    participant InferenceScript
    participant BenchmarkModule
    participant OptimizedInference
    participant StandardModel
    participant MetricsCollector

    User->>InferenceScript: python inference.py --benchmark
    InferenceScript->>BenchmarkModule: Initialize Metrics
    BenchmarkModule->>MetricsCollector: Create InferenceMetrics

    Note over InferenceScript: Run 1: Non-Optimized
    InferenceScript->>StandardModel: model.generate()
    StandardModel->>StandardModel: Forward Pass (O(n²))
    StandardModel-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=false)

    Note over InferenceScript: Run 2: Optimized
    InferenceScript->>OptimizedInference: get_optimized_inference()
    OptimizedInference->>OptimizedInference: Init KV Cache
    OptimizedInference->>OptimizedInference: generate_with_cache()

    loop For each token
        OptimizedInference->>OptimizedInference: Forward Pass (O(n))
        OptimizedInference->>OptimizedInference: Update KV Cache
    end

    OptimizedInference-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=true)

    MetricsCollector->>MetricsCollector: Calculate Speedup
    MetricsCollector->>MetricsCollector: Generate Plots
    MetricsCollector->>MetricsCollector: Export CSV
    MetricsCollector-->>User: Results & Plots
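
The core of that flow reduces to timing both paths with the same prompt; here is a sketch of one timed run (generate_fn is a placeholder for either path):

import time

def benchmark_run(generate_fn, n_tokens: int, optimized: bool) -> dict:
    start = time.perf_counter()
    generate_fn()                                    # standard or optimized generation
    total = time.perf_counter() - start
    return {
        "optimized": optimized,
        "total_time": total,
        "tokens_per_second": n_tokens / total,
        "time_per_token": total / n_tokens * 1000,   # milliseconds
    }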

Optimization Components Stack

graph TD
    subgraph "Application Layer"
        A[inference.py] --> B[benchmark_inference]
        B --> C[Generate Text]
    end

    subgraph "Optimization Layer"
        C --> D{Optimized?}
        D -->|Yes| E[OptimizedInference]
        D -->|No| F[Standard Model]
        E --> G[KV Cache Manager]
        E --> H[Optimized Attention]
    end

    subgraph "Core Model Layer"
        F --> I[TransformerModel]
        E --> I
        I --> J[TransformerBlock]
        J --> K[MultiHeadAttention]
        H --> K
        K --> L[Attention Computation]
    end

    subgraph "Cache Layer"
        G --> M[KVCache Data Structure]
        M --> N[Keys Cache]
        M --> O[Values Cache]
        N --> P[Retrieve Previous K]
        O --> Q[Retrieve Previous V]
    end

    subgraph "Compute Layer"
        L --> R[Q × K^T]
        P --> R
        Q --> R
        R --> S[Softmax]
        S --> T[Attention Weights]
        T --> U[Output]
    end

    style E fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style M fill:#FFD700

Performance Comparison Schema


flowchart LR
 subgraph subGraph0["Metrics Collection"]
        B["Non-Optimized Metrics"]
        A["Benchmark Run"]
        C["Optimized Metrics"]
        D["Time: T1<br>Memory: M1<br>Speed: S1"]
        E["Time: T2<br>Memory: M2<br>Speed: S2"]
  end
 subgraph Analysis["Analysis"]
        F["Calculate Speedup"]
        G["Speedup = S2/S1"]
        H["Calculate Memory Reduction"]
        I["Reduction = (M1-M2)/M1 × 100%"]
  end
 subgraph Visualization["Visualization"]
        J["Comparison Plot"]
        K["Trend Analysis"]
        L["Performance Over Time"]
  end
 subgraph subGraph3["Data Export"]
        M["JSON Metrics"]
        N["CSV Export"]
  end
    A --> B & C
    B --> D
    C --> E
    D --> F & H
    E --> F & H
    F --> G & K
    H --> I
    G --> J
    I --> J
    K --> L
    J --> M & N
    L --> M & N
    style F fill:#FFD700
    style G fill:#90EE90
    style I fill:#90EE90

Data File Locations Summary

All benchmark data is saved to:

./inference_benchmarks/
├── inference_metrics.json          # All raw metrics (JSON)
├── inference_metrics.csv           # Spreadsheet data (CSV)
├── optimization_comparison.png     # Comparison charts
└── performance_over_time.png       # Trend analysis

Custom location:

--benchmark-dir ./research/results

Data accumulates: Each benchmark run appends to the same files, building trends over time.

Next Steps

  1. Run your first benchmark
  2. Review the comparison plots
  3. Analyze CSV data for deeper insights
  4. Run multiple benchmarks for statistical analysis
  5. Use batch script for trend analysis
  6. Include results in your research paper/presentation

Happy Benchmarking! 📊🔬