Inference Benchmarking Guide

This guide explains how to use the benchmarking feature to compare optimized vs non-optimized inference performance for research purposes.

Overview

The benchmarking feature runs inference both with and without optimizations (KV caching, optimized attention) and generates:

  • Performance metrics (tokens/sec, latency, memory usage)
  • Comparison plots (visual charts showing improvements)
  • CSV export (data for further analysis)

Data Storage Location

All benchmark data is saved to: ./inference_benchmarks/ (default)

You can customize the location:

python inference.py --benchmark --benchmark-dir ./research/results

Data files created:

  • inference_metrics.json - All raw metrics (JSON format)
  • inference_metrics.csv - Spreadsheet-friendly data (CSV format)
  • optimization_comparison.png - Visual comparison charts
  • performance_over_time.png - Trend analysis over multiple runs

Note: All runs accumulate in the same files, so you can run multiple benchmarks and build trends over time.

Quick Start

Basic Benchmark

python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "The future of artificial intelligence" \
    --max-length 100 \
    --benchmark

This will:

  1. Run inference without optimizations
  2. Run inference with optimizations (KV cache)
  3. Collect metrics for both runs
  4. Generate comparison plots
  5. Save all data to ./inference_benchmarks/

Custom Benchmark Directory

python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "Your prompt here" \
    --max-length 100 \
    --benchmark \
    --benchmark-dir ./research/results

Batch Benchmarking

Use the batch benchmark script to run multiple prompts and create trends:

# Create a prompts file
cat > prompts.txt << EOF
The future of artificial intelligence
Machine learning is transforming
Deep neural networks enable
Natural language processing requires
EOF

# Run batch benchmarks
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results

Or use command-line prompts:

python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompts "Prompt 1" "Prompt 2" "Prompt 3" \
    --max-length 100

Results accumulate in the same files, allowing you to:

  • Build trends across multiple prompts
  • Analyze performance consistency
  • Create comprehensive research reports

Output Files

After running a benchmark, you'll get:

1. JSON Metrics File

Location: inference_benchmarks/inference_metrics.json

Contains all raw metrics data:

{
  "runs": [
    {
      "run_name": "run_1234567890_optimized",
      "optimized": true,
      "tokens_per_second": 150.5,
      "time_per_token": 6.64,
      "memory_used_mb": 245.3,
      ...
    },
    ...
  ]
}
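
As a quick sanity check, here is a minimal sketch that loads the accumulated runs from this file (field names as shown above; anything beyond them is an assumption):

import json
from pathlib import Path

# Load the accumulated benchmark runs (schema as shown above).
path = Path("inference_benchmarks/inference_metrics.json")
with path.open() as f:
    runs = json.load(f)["runs"]

for run in runs:
    mode = "optimized" if run["optimized"] else "baseline"
    print(f"{run['run_name']}: {run['tokens_per_second']:.1f} tok/s ({mode})")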

2. CSV Export

Location: inference_benchmarks/inference_metrics.csv

For spreadsheet analysis:

run_name,timestamp,optimized,prompt_length,generated_length,total_time,tokens_per_second,time_per_token,memory_used_mb,device
run_1234567890_optimized,1234567890.5,true,20,100,0.663,150.8,6.63,245.3,cuda
...
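
For analysis beyond a spreadsheet, a minimal pandas sketch over these columns (pandas itself is an extra dependency, not required by the benchmark):

import pandas as pd

df = pd.read_csv("inference_benchmarks/inference_metrics.csv")
# The CSV stores true/false strings; normalize to booleans first.
df["optimized"] = df["optimized"].astype(str).str.lower() == "true"

# Mean speed and memory per mode across all accumulated runs.
print(df.groupby("optimized")[["tokens_per_second", "memory_used_mb"]].mean())

# Overall average speedup of optimized vs non-optimized runs.
opt = df.loc[df["optimized"], "tokens_per_second"].mean()
base = df.loc[~df["optimized"], "tokens_per_second"].mean()
print(f"Average speedup: {opt / base:.2f}x")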

3. Comparison Plot

Location: inference_benchmarks/optimization_comparison.png

Shows 4 charts:

  • Tokens per Second (speed comparison)
  • Time per Token (latency comparison)
  • Total Generation Time (overall speed)
  • Memory Usage (memory efficiency)

4. Performance Over Time Plot

Location: inference_benchmarks/performance_over_time.png

Shows how performance varies across multiple benchmark runs.

Metrics Collected

Performance Metrics

  • Tokens per Second: Generation speed
  • Time per Token: Latency per token (milliseconds)
  • Total Time: Complete generation time

Resource Metrics

  • Memory Usage: GPU memory consumption (MB)
  • Device: Device used (cuda/cpu/mps)

Derived Metrics

  • Speedup: Ratio of optimized vs non-optimized speed
  • Memory Reduction: Percentage reduction in memory usage

Example Output

🔬 BENCHMARK MODE: Comparing optimized vs non-optimized inference
======================================================================

BENCHMARK RUN: run_1234567890
======================================================================

🔴 Running NON-OPTIMIZED inference...
  ⏱️  Total Time: 1.234 s
  📊 Tokens/Second: 81.0
  ⚡ Time/Token: 12.35 ms
  💾 Memory Used: 512.3 MB
  📝 Generated: The future of artificial intelligence is bright...

🟢 Running OPTIMIZED inference...
  ⏱️  Total Time: 0.663 s
  📊 Tokens/Second: 150.8
  ⚡ Time/Token: 6.63 ms
  💾 Memory Used: 245.3 MB
  📝 Generated: The future of artificial intelligence is bright...

🚀 SPEEDUP: 1.86x faster with optimizations
💾 MEMORY REDUCTION: 52.1%

📊 Generating comparison plots and data...
📊 Comparison plot saved to: ./inference_benchmarks/optimization_comparison.png
📊 Performance over time plot saved to: ./inference_benchmarks/performance_over_time.png
📊 Metrics exported to CSV: ./inference_benchmarks/inference_metrics.csv

✅ Benchmark complete! Results saved to: ./inference_benchmarks

Running Multiple Benchmarks

Method 1: Individual Runs (Manual)

# Run 1
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 1" --benchmark

# Run 2
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 2" --benchmark

# Run 3
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 3" --max-length 200 --benchmark

All runs accumulate in the same files:

  • inference_metrics.json - All runs appended
  • inference_metrics.csv - All runs in CSV format
  • Plots update automatically with new data

Method 2: Batch Script

Create a prompts file:

cat > research_prompts.txt << EOF
The future of artificial intelligence is bright.
Machine learning models are becoming more efficient.
Deep neural networks can process complex patterns.
Natural language processing enables human-computer interaction.
Transformer architectures revolutionized NLP.
EOF

Run batch benchmarks:

python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file research_prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results \
    --delay 2.0

Benefits:

  • Runs all prompts automatically
  • Accumulates data for trend analysis
  • Creates comprehensive performance reports
  • Handles errors gracefully

After running multiple benchmarks:

  • Check performance_over_time.png for trends
  • Analyze inference_metrics.csv in Excel/Python
  • Review aggregated statistics in console output

Research Use Cases

1. Performance Analysis

Compare how optimizations affect inference speed:

python inference.py \
    --checkpoint checkpoints/best.pt \
    --prompt "Your research prompt" \
    --benchmark

2. Memory Efficiency Study

Analyze memory usage improvements:

# Check memory reduction
python inference.py --checkpoint checkpoints/best.pt --prompt "Long prompt" --max-length 500 --benchmark

3. Scalability Testing

Test with different generation lengths:

# Short sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 50 --benchmark

# Medium sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 200 --benchmark

# Long sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 1000 --benchmark

Plot Interpretation

Comparison Plot (optimization_comparison.png)

Top Left - Tokens per Second:

  • Higher is better
  • Shows generation speed
  • Speedup annotation shows improvement factor

Top Right - Time per Token:

  • Lower is better
  • Shows latency per token
  • Important for real-time applications

Bottom Left - Total Generation Time:

  • Lower is better
  • Overall generation time
  • Most user-visible metric

Bottom Right - Memory Usage:

  • Lower is better
  • GPU memory consumption
  • Memory reduction annotation shows savings

Performance Over Time Plot (performance_over_time.png)

Shows performance trends across multiple benchmark runs:

  • Green line: Optimized performance
  • Red line: Non-optimized performance
  • Useful for finding performance regressions or improvements
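
If the built-in plot doesn't show what you need, here is a sketch of a custom trend plot built from the same CSV (assuming matplotlib and pandas are installed):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("inference_benchmarks/inference_metrics.csv")
df["optimized"] = df["optimized"].astype(str).str.lower() == "true"

# One line per mode, in the order the runs were recorded.
for flag, label, color in [(True, "Optimized", "green"), (False, "Non-optimized", "red")]:
    subset = df[df["optimized"] == flag].reset_index(drop=True)
    plt.plot(subset.index, subset["tokens_per_second"], marker="o", color=color, label=label)

plt.xlabel("Run #")
plt.ylabel("Tokens per second")
plt.legend()
plt.savefig("custom_trend.png")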

Reporting Results

Speedup Calculation

Speedup = Optimized Tokens/Second / Non-Optimized Tokens/Second

Example:

  • Optimized: 150 tokens/sec
  • Non-Optimized: 81 tokens/sec
  • Speedup: 150/81 = 1.85x faster

Memory Reduction Calculation

Memory Reduction % = (1 - Optimized Memory / Non-Optimized Memory) × 100

Example:

  • Optimized: 245 MB
  • Non-Optimized: 512 MB
  • Reduction: (1 - 245/512) × 100 = 52.1%
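
Both derived metrics are one-liners; here they are in code with the example numbers above:

def speedup(opt_tps: float, base_tps: float) -> float:
    """Ratio of optimized to non-optimized generation speed."""
    return opt_tps / base_tps

def memory_reduction_pct(opt_mb: float, base_mb: float) -> float:
    """Percentage reduction in memory usage."""
    return (1 - opt_mb / base_mb) * 100

print(f"Speedup: {speedup(150, 81):.2f}x")                          # 1.85x
print(f"Memory reduction: {memory_reduction_pct(245, 512):.1f}%")   # 52.1%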

Tips for Best Results

  1. Warm Up GPU: Run a few inference calls before benchmarking to warm up the GPU (see the sketch after this list)
  2. Clear Cache: The benchmark automatically clears CUDA cache between runs
  3. Multiple Runs: Run multiple benchmarks for statistical significance
  4. Consistent Prompts: Use the same prompt for fair comparison
  5. Device Consistency: Use the same device for all runs
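
A minimal warm-up-then-time sketch (the generate callable is a stand-in for your own generation call, not an API from this repo):

import time
import torch

def timed_generation(generate, n_warmup: int = 3):
    # Warm up: compile kernels and populate caches before measuring.
    for _ in range(n_warmup):
        generate()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # start from a clean allocator state
        torch.cuda.synchronize()      # ensure warm-up work has finished
    start = time.perf_counter()
    result = generate()
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # CUDA is async; wait before stopping the clock
    return result, time.perf_counter() - start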

Command Line Options

python inference.py \
    --checkpoint PATH          # Path to model checkpoint (required)
    --prompt TEXT              # Prompt text (required)
    --max-length INT           # Maximum generation length (default: 100)
    --temperature FLOAT        # Sampling temperature (default: 1.0)
    --top-k INT                # Top-k sampling (default: 50)
    --top-p FLOAT              # Top-p sampling (default: 0.95)
    --device DEVICE            # Device: cuda/cpu/mps (default: cuda)
    --benchmark                # Enable benchmarking mode
    --benchmark-dir DIR        # Benchmark output directory (default: ./inference_benchmarks)

Troubleshooting

No GPU Memory Stats

If memory stats show as None:

  • CUDA: Memory tracking should work automatically
  • MPS (Apple Silicon): Memory tracking not available
  • CPU: Memory tracking not available
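
You can verify what your device supports with standard PyTorch calls (a sketch; the benchmark does its own tracking internally):

import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run inference here ...
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"Peak GPU memory: {peak_mb:.1f} MB")
else:
    # MPS and CPU have no equivalent per-run counters here, so stats stay None.
    print("Memory tracking not available on this device.")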

Plots Not Generated

If plots fail to generate:

  • Ensure matplotlib is installed: pip install matplotlib
  • Check file permissions for output directory

Inconsistent Results

For consistent results:

  • Use same device for all runs
  • Use same prompt length
  • Allow GPU to warm up
  • Close other GPU applications

Example Research Workflow

# 1. Run initial benchmark
python inference.py --checkpoint checkpoints/best.pt --prompt "Test prompt" --benchmark

# 2. Review results
ls inference_benchmarks/
cat inference_benchmarks/inference_metrics.json

# 3. Generate plots (already done automatically)
# View: inference_benchmarks/optimization_comparison.png

# 4. Analyze CSV data
# Open: inference_benchmarks/inference_metrics.csv in Excel/Python

# 5. Run additional benchmarks
python inference.py --checkpoint checkpoints/best.pt --prompt "Different prompt" --max-length 200 --benchmark

# 6. Compare results
python inference.py --checkpoint checkpoints/best.pt --prompt "Same prompt" --benchmark

Optimization Architecture & Code Injection

Overview: Optimization Layers

The optimizations are implemented as layers that wrap the standard inference pipeline:

flowchart TB
 subgraph subGraph0["Standard Inference (Non-Optimized)"]
        B["Tokenize"]
        A["Input Prompt"]
        C["Embedding Layer"]
        D["Transformer Blocks"]
        E["Attention: Recompute All"]
        F["Forward Pass: O(n²)"]
        G["Output Tokens"]
        H["Detokenize"]
        I["Generated Text"]
  end
 subgraph subGraph1["Optimized Inference (With KV Cache)"]
        B2["Tokenize"]
        A2["Input Prompt"]
        C2["Embedding Layer"]
        D2["Transformer Blocks"]
        E2["Optimized Attention"]
        F2["KV Cache Layer"]
        G2["Forward Pass: O(n)"]
        H2["Output Tokens"]
        I2["Detokenize"]
        J2["Generated Text"]
  end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    A2 --> B2
    B2 --> C2
    C2 --> D2
    D2 --> E2
    E2 --> F2
    F2 --> G2
    G2 --> H2
    H2 --> I2
    I2 --> J2
    style E fill:#ffcccc
    style F fill:#ffcccc
    style E2 fill:#ccffcc
    style F2 fill:#ccffcc
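
To make the O(n) vs O(n²) distinction concrete, here is an illustrative single-step cached attention in PyTorch (a sketch, not the repo's actual OptimizedMultiHeadAttention):

import torch
import torch.nn.functional as F

def cached_attention_step(q, k, v, k_cache, v_cache):
    # q, k, v: (batch, heads, 1, head_dim) projections for the NEW token only.
    # k_cache, v_cache: (batch, heads, seq_so_far, head_dim) from earlier steps.
    k_cache = torch.cat([k_cache, k], dim=2)    # append instead of recomputing all KVs
    v_cache = torch.cat([v_cache, v], dim=2)
    scores = q @ k_cache.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)         # (batch, heads, 1, seq): O(n) per step
    return weights @ v_cache, k_cache, v_cache  # output plus updated caches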

Detailed Optimization Flow


flowchart LR
 subgraph subGraph0["Request Flow"]
        Mode{"Optimized?"}
        Start["Benchmark Request"]
        Standard["Standard Path"]
        Optimized["Optimized Path"]
  end
 subgraph subGraph1["Standard Path"]
        S1["Model.generate"]
        S2["Transformer Forward"]
        S3["MultiHeadAttention"]
        S4["Compute Q, K, V"]
        S5["Recompute All KVs"]
        S6["Attention Scores: O(n²)"]
        S7["Generate Token"]
  end
 subgraph subGraph2["Optimized Path"]
        O1["OptimizedInference"]
        O2["Init KV Cache"]
        O3["Transformer Forward"]
        O4["OptimizedMultiHeadAttention"]
        O5["Compute Q, K, V"]
        O6["KV Cache Layer"]
        O7["Append to Cache"]
        O8["Reuse Cached KVs"]
        O9["Attention Scores: O(n)"]
        O10["Generate Token"]
  end
    Start --> Mode
    Mode -- No --> Standard
    Mode -- Yes --> Optimized
    Standard --> S1
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S5 --> S6
    S6 --> S7
    Optimized --> O1
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
    O5 --> O6
    O6 --> O7
    O7 --> O8
    O8 --> O9
    O9 --> O10
    S7 --> Metrics["Collect Metrics"]
    O10 --> Metrics
    style Standard fill:#ffcccc
    style Optimized fill:#ccffcc
    style S5 fill:#ffcccc
    style O8 fill:#ccffcc

Code Injection Points

graph TB
    subgraph "Standard Model Architecture"
        A[TransformerModel] --> B[TransformerBlock]
        B --> C[MultiHeadAttention]
        C --> D[Q, K, V Projections]
        D --> E[Attention Computation]
        E --> F[Output Projection]
        F --> G[Feed Forward]
    end

    subgraph "Optimization Injection Points"
        H[OptimizedInference Wrapper] --> A
        A --> B2[TransformerBlock]
        B2 --> C2[OptimizedMultiHeadAttention]
        C2 --> D2[Q, K, V Projections]
        D2 --> I[KV Cache Injection]
        I --> E2[Optimized Attention]
        E2 --> F2[Output Projection]
        F2 --> G2[Feed Forward]
    end

    subgraph "KV Cache Layer Details"
        I --> J[Cache Check]
        J --> K{Cache Exists?}
        K -->|No| L[Compute K, V]
        K -->|Yes| M[Retrieve from Cache]
        L --> N[Store in Cache]
        M --> O[Append New K, V]
        N --> O
        O --> P[Use Cached KVs]
    end

    style H fill:#90EE90
    style I fill:#90EE90
    style K fill:#FFD700
    style P fill:#90EE90
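
A sketch of the cache data structure the diagram describes (the class name KVCache matches the diagram; the repo's actual implementation may differ):

import torch

class KVCache:
    """Per-layer key/value cache, appended along the sequence axis."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers     # each entry: (batch, heads, seq, head_dim)
        self.values = [None] * num_layers

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        if self.keys[layer] is None:        # first step: store K, V
            self.keys[layer], self.values[layer] = k, v
        else:                               # later steps: append the new K, V
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)
        return self.keys[layer], self.values[layer]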

Benchmark Execution Flow

sequenceDiagram
    participant User
    participant InferenceScript
    participant BenchmarkModule
    participant OptimizedInference
    participant StandardModel
    participant MetricsCollector

    User->>InferenceScript: python inference.py --benchmark
    InferenceScript->>BenchmarkModule: Initialize Metrics
    BenchmarkModule->>MetricsCollector: Create InferenceMetrics

    Note over InferenceScript: Run 1: Non-Optimized
    InferenceScript->>StandardModel: model.generate()
    StandardModel->>StandardModel: Forward Pass (O(n²))
    StandardModel-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=false)

    Note over InferenceScript: Run 2: Optimized
    InferenceScript->>OptimizedInference: get_optimized_inference()
    OptimizedInference->>OptimizedInference: Init KV Cache
    OptimizedInference->>OptimizedInference: generate_with_cache()

    loop For each token
        OptimizedInference->>OptimizedInference: Forward Pass (O(n))
        OptimizedInference->>OptimizedInference: Update KV Cache
    end

    OptimizedInference-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=true)

    MetricsCollector->>MetricsCollector: Calculate Speedup
    MetricsCollector->>MetricsCollector: Generate Plots
    MetricsCollector->>MetricsCollector: Export CSV
    MetricsCollector-->>User: Results & Plots
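
The core of that flow reduces to timing both paths with the same prompt; here is a sketch of one timed run (generate_fn is a placeholder for either path):

import time

def benchmark_run(generate_fn, n_tokens: int, optimized: bool) -> dict:
    start = time.perf_counter()
    generate_fn()                                    # standard or optimized generation
    total = time.perf_counter() - start
    return {
        "optimized": optimized,
        "total_time": total,
        "tokens_per_second": n_tokens / total,
        "time_per_token": total / n_tokens * 1000,   # milliseconds
    }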

Optimization Components Stack

graph TD
    subgraph "Application Layer"
        A[inference.py] --> B[benchmark_inference]
        B --> C[Generate Text]
    end

    subgraph "Optimization Layer"
        C --> D{Optimized?}
        D -->|Yes| E[OptimizedInference]
        D -->|No| F[Standard Model]
        E --> G[KV Cache Manager]
        E --> H[Optimized Attention]
    end

    subgraph "Core Model Layer"
        F --> I[TransformerModel]
        E --> I
        I --> J[TransformerBlock]
        J --> K[MultiHeadAttention]
        H --> K
        K --> L[Attention Computation]
    end

    subgraph "Cache Layer"
        G --> M[KVCache Data Structure]
        M --> N[Keys Cache]
        M --> O[Values Cache]
        N --> P[Retrieve Previous K]
        O --> Q[Retrieve Previous V]
    end

    subgraph "Compute Layer"
        L --> R[Q × K^T]
        P --> R
        Q --> R
        R --> S[Softmax]
        S --> T[Attention Weights]
        T --> U[Output]
    end

    style E fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style M fill:#FFD700

Performance Comparison Schema


flowchart LR
 subgraph subGraph0["Metrics Collection"]
        B["Non-Optimized Metrics"]
        A["Benchmark Run"]
        C["Optimized Metrics"]
        D["Time: T1<br>Memory: M1<br>Speed: S1"]
        E["Time: T2<br>Memory: M2<br>Speed: S2"]
  end
 subgraph Analysis["Analysis"]
        F["Calculate Speedup"]
        G["Speedup = S2/S1"]
        H["Calculate Memory Reduction"]
        I["Reduction = (M1-M2)/M1 × 100%"]
  end
 subgraph Visualization["Visualization"]
        J["Comparison Plot"]
        K["Trend Analysis"]
        L["Performance Over Time"]
  end
 subgraph subGraph3["Data Export"]
        M["JSON Metrics"]
        N["CSV Export"]
  end
    A --> B & C
    B --> D
    C --> E
    D --> F & H
    E --> F & H
    F --> G & K
    H --> I
    G --> J
    I --> J
    K --> L
    J --> M & N
    L --> M & N
    style F fill:#FFD700
    style G fill:#90EE90
    style I fill:#90EE90

Data File Locations Summary

All benchmark data is saved to:

./inference_benchmarks/
├── inference_metrics.json          # All raw metrics (JSON)
├── inference_metrics.csv           # Spreadsheet data (CSV)
├── optimization_comparison.png     # Comparison charts
└── performance_over_time.png       # Trend analysis

Custom location:

--benchmark-dir ./research/results

Data accumulates: Each benchmark run appends to the same files, building trends over time.

Next Steps

  1. Run your first benchmark
  2. Review the comparison plots
  3. Analyze CSV data for deeper insights
  4. Run multiple benchmarks for statistical analysis
  5. Use batch script for trend analysis
  6. Include results in your research paper/presentation

Happy Benchmarking! 📊🔬