- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
19 KiB
Inference Benchmarking Guide
This guide explains how to use the benchmarking feature to compare optimized vs non-optimized inference performance for research purposes.
Overview
The benchmarking feature runs inference both with and without optimizations (KV caching, optimized attention) and generates:
- Performance metrics (tokens/sec, latency, memory usage)
- Comparison plots (visual charts showing improvements)
- CSV export (data for further analysis)
Data Storage Location
All benchmark data is saved to: ./inference_benchmarks/ (default)
You can customize the location:
python inference.py --benchmark --benchmark-dir ./research/results
Data files created:
inference_metrics.json- All raw metrics (JSON format)inference_metrics.csv- Spreadsheet-friendly data (CSV format)optimization_comparison.png- Visual comparison chartsperformance_over_time.png- Trend analysis over multiple runs
Note: All runs accumulate in the same files, so you can run multiple benchmarks and build trends over time.
Quick Start
Basic Benchmark
python inference.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt "The future of artificial intelligence" \
--max-length 100 \
--benchmark
This will:
- Run inference without optimizations
- Run inference with optimizations (KV cache)
- Collect metrics for both runs
- Generate comparison plots
- Save all data to
./inference_benchmarks/
Custom Benchmark Directory
python inference.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt "Your prompt here" \
--max-length 100 \
--benchmark \
--benchmark-dir ./research/results
Running Multiple Prompts for Trends
Use the batch benchmark script to run multiple prompts and create trends:
# Create a prompts file
cat > prompts.txt << EOF
The future of artificial intelligence
Machine learning is transforming
Deep neural networks enable
Natural language processing requires
EOF
# Run batch benchmarks
python benchmark_batch.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt-file prompts.txt \
--max-length 100 \
--benchmark-dir ./research/results
Or use command-line prompts:
python benchmark_batch.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompts "Prompt 1" "Prompt 2" "Prompt 3" \
--max-length 100
Results accumulate in the same files, allowing you to:
- Build trends across multiple prompts
- Analyze performance consistency
- Create comprehensive research reports
Output Files
After running a benchmark, you'll get:
1. JSON Metrics File
Location: inference_benchmarks/inference_metrics.json
Contains all raw metrics data:
{
"runs": [
{
"run_name": "run_1234567890_optimized",
"optimized": true,
"tokens_per_second": 150.5,
"time_per_token": 6.64,
"memory_used_mb": 245.3,
...
},
...
]
}
2. CSV Export
Location: inference_benchmarks/inference_metrics.csv
For spreadsheet analysis:
run_name,timestamp,optimized,prompt_length,generated_length,total_time,tokens_per_second,time_per_token,memory_used_mb,device
run_1234567890_optimized,1234567890.5,true,20,100,0.663,150.8,6.63,245.3,cuda
...
3. Comparison Plot
Location: inference_benchmarks/optimization_comparison.png
Shows 4 charts:
- Tokens per Second (speed comparison)
- Time per Token (latency comparison)
- Total Generation Time (overall speed)
- Memory Usage (memory efficiency)
4. Performance Over Time Plot
Location: inference_benchmarks/performance_over_time.png
Shows how performance varies across multiple benchmark runs.
Metrics Collected
Performance Metrics
- Tokens per Second: Generation speed
- Time per Token: Latency per token (milliseconds)
- Total Time: Complete generation time
Resource Metrics
- Memory Usage: GPU memory consumption (MB)
- Device: Device used (cuda/cpu/mps)
Derived Metrics
- Speedup: Ratio of optimized vs non-optimized speed
- Memory Reduction: Percentage reduction in memory usage
Example Output
🔬 BENCHMARK MODE: Comparing optimized vs non-optimized inference
======================================================================
BENCHMARK RUN: run_1234567890
======================================================================
🔴 Running NON-OPTIMIZED inference...
⏱️ Total Time: 1.234 s
📊 Tokens/Second: 81.0
⚡ Time/Token: 12.35 ms
💾 Memory Used: 512.3 MB
📝 Generated: The future of artificial intelligence is bright...
🟢 Running OPTIMIZED inference...
⏱️ Total Time: 0.663 s
📊 Tokens/Second: 150.8
⚡ Time/Token: 6.63 ms
💾 Memory Used: 245.3 MB
📝 Generated: The future of artificial intelligence is bright...
🚀 SPEEDUP: 1.86x faster with optimizations
💾 MEMORY REDUCTION: 52.1%
📊 Generating comparison plots and data...
📊 Comparison plot saved to: ./inference_benchmarks/optimization_comparison.png
📊 Performance over time plot saved to: ./inference_benchmarks/performance_over_time.png
📊 Metrics exported to CSV: ./inference_benchmarks/inference_metrics.csv
✅ Benchmark complete! Results saved to: ./inference_benchmarks
Running Multiple Benchmarks for Trends
Method 1: Individual Runs (Manual)
# Run 1
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 1" --benchmark
# Run 2
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 2" --benchmark
# Run 3
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 3" --max-length 200 --benchmark
All runs accumulate in the same files:
inference_metrics.json- All runs appendedinference_metrics.csv- All runs in CSV format- Plots update automatically with new data
Method 2: Batch Script (Recommended)
Create a prompts file:
cat > research_prompts.txt << EOF
The future of artificial intelligence is bright.
Machine learning models are becoming more efficient.
Deep neural networks can process complex patterns.
Natural language processing enables human-computer interaction.
Transformer architectures revolutionized NLP.
EOF
Run batch benchmarks:
python benchmark_batch.py \
--checkpoint checkpoints/best_checkpoint.pt \
--prompt-file research_prompts.txt \
--max-length 100 \
--benchmark-dir ./research/results \
--delay 2.0
Benefits:
- ✅ Runs all prompts automatically
- ✅ Accumulates data for trend analysis
- ✅ Creates comprehensive performance reports
- ✅ Handles errors gracefully
After running multiple benchmarks:
- Check
performance_over_time.pngfor trends - Analyze
inference_metrics.csvin Excel/Python - Review aggregated statistics in console output
Research Use Cases
1. Performance Analysis
Compare how optimizations affect inference speed:
python inference.py \
--checkpoint checkpoints/best.pt \
--prompt "Your research prompt" \
--benchmark
2. Memory Efficiency Study
Analyze memory usage improvements:
# Check memory reduction
python inference.py --checkpoint checkpoints/best.pt --prompt "Long prompt" --max-length 500 --benchmark
3. Scalability Testing
Test with different generation lengths:
# Short sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 50 --benchmark
# Medium sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 200 --benchmark
# Long sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 1000 --benchmark
Plot Interpretation
Comparison Plot (optimization_comparison.png)
Top Left - Tokens per Second:
- Higher is better
- Shows generation speed
- Speedup annotation shows improvement factor
Top Right - Time per Token:
- Lower is better
- Shows latency per token
- Important for real-time applications
Bottom Left - Total Generation Time:
- Lower is better
- Overall generation time
- Most user-visible metric
Bottom Right - Memory Usage:
- Lower is better
- GPU memory consumption
- Memory reduction annotation shows savings
Performance Over Time Plot (performance_over_time.png)
Shows performance trends across multiple benchmark runs:
- Green line: Optimized performance
- Red line: Non-optimized performance
- Useful for finding performance regressions or improvements
Reporting Results
Speedup Calculation
Speedup = Optimized Tokens/Second / Non-Optimized Tokens/Second
Example:
- Optimized: 150 tokens/sec
- Non-Optimized: 81 tokens/sec
- Speedup: 150/81 = 1.85x faster
Memory Reduction Calculation
Memory Reduction % = (1 - Optimized Memory / Non-Optimized Memory) × 100
Example:
- Optimized: 245 MB
- Non-Optimized: 512 MB
- Reduction: (1 - 245/512) × 100 = 52.1%
Tips for Best Results
- Warm Up GPU: Run a few inference calls before benchmarking to warm up the GPU
- Clear Cache: The benchmark automatically clears CUDA cache between runs
- Multiple Runs: Run multiple benchmarks for statistical significance
- Consistent Prompts: Use the same prompt for fair comparison
- Device Consistency: Use the same device for all runs
Command Line Options
python inference.py \
--checkpoint PATH # Path to model checkpoint (required)
--prompt TEXT # Prompt text (required)
--max-length INT # Maximum generation length (default: 100)
--temperature FLOAT # Sampling temperature (default: 1.0)
--top-k INT # Top-k sampling (default: 50)
--top-p FLOAT # Top-p sampling (default: 0.95)
--device DEVICE # Device: cuda/cpu/mps (default: cuda)
--benchmark # Enable benchmarking mode
--benchmark-dir DIR # Benchmark output directory (default: ./inference_benchmarks)
Troubleshooting
No GPU Memory Stats
If memory stats show as None:
- CUDA: Memory tracking should work automatically
- MPS (Apple Silicon): Memory tracking not available
- CPU: Memory tracking not available
Plots Not Generated
If plots fail to generate:
- Ensure
matplotlibis installed:pip install matplotlib - Check file permissions for output directory
Inconsistent Results
For consistent results:
- Use same device for all runs
- Use same prompt length
- Allow GPU to warm up
- Close other GPU applications
Example Research Workflow
# 1. Run initial benchmark
python inference.py --checkpoint checkpoints/best.pt --prompt "Test prompt" --benchmark
# 2. Review results
ls inference_benchmarks/
cat inference_benchmarks/inference_metrics.json
# 3. Generate plots (already done automatically)
# View: inference_benchmarks/optimization_comparison.png
# 4. Analyze CSV data
# Open: inference_benchmarks/inference_metrics.csv in Excel/Python
# 5. Run additional benchmarks
python inference.py --checkpoint checkpoints/best.pt --prompt "Different prompt" --max-length 200 --benchmark
# 6. Compare results
python inference.py --checkpoint checkpoints/best.pt --prompt "Same prompt" --benchmark
Optimization Architecture & Code Injection
Overview: Optimization Layers
The optimizations are implemented as layers that wrap the standard inference pipeline:
flowchart TB
subgraph subGraph0["Standard Inference (Non-Optimized)"]
B["Tokenize"]
A["Input Prompt"]
C["Embedding Layer"]
D["Transformer Blocks"]
E["Attention: Recompute All"]
F["Forward Pass: O(n²)"]
G["Output Tokens"]
H["Detokenize"]
I["Generated Text"]
end
subgraph subGraph1["Optimized Inference (With KV Cache)"]
B2["Tokenize"]
A2["Input Prompt"]
C2["Embedding Layer"]
D2["Transformer Blocks"]
E2["Optimized Attention"]
F2["KV Cache Layer"]
G2["Forward Pass: O(n)"]
H2["Output Tokens"]
I2["Detokenize"]
J2["Generated Text"]
end
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H --> I
A2 --> B2
B2 --> C2
C2 --> D2
D2 --> E2
E2 --> F2
F2 --> G2
G2 --> H2
H2 --> I2
I2 --> J2
style E fill:#ffcccc
style F fill:#ffcccc
style E2 fill:#ccffcc
style F2 fill:#ccffcc
Detailed Optimization Flow
flowchart LR
subgraph subGraph0["Request Flow"]
Mode{"Optimized?"}
Start["Benchmark Request"]
Standard["Standard Path"]
Optimized["Optimized Path"]
end
subgraph subGraph1["Standard Path"]
S1["Model.generate"]
S2["Transformer Forward"]
S3["MultiHeadAttention"]
S4["Compute Q, K, V"]
S5["Recompute All KVs"]
S6["Attention Scores: O(n²)"]
S7["Generate Token"]
end
subgraph subGraph2["Optimized Path"]
O1["OptimizedInference"]
O2["Init KV Cache"]
O3["Transformer Forward"]
O4["OptimizedMultiHeadAttention"]
O5["Compute Q, K, V"]
O6["KV Cache Layer"]
O7["Append to Cache"]
O8["Reuse Cached KVs"]
O9["Attention Scores: O(n)"]
O10["Generate Token"]
end
Start --> Mode
Mode -- No --> Standard
Mode -- Yes --> Optimized
Standard --> S1
S1 --> S2
S2 --> S3
S3 --> S4
S4 --> S5
S5 --> S6
S6 --> S7
Optimized --> O1
O1 --> O2
O2 --> O3
O3 --> O4
O4 --> O5
O5 --> O6
O6 --> O7
O7 --> O8
O8 --> O9
O9 --> O10
S7 --> Metrics["Collect Metrics"]
O10 --> Metrics
style Standard fill:#ffcccc
style Optimized fill:#ccffcc
style S5 fill:#ffcccc
style O8 fill:#ccffcc
Code Injection Points
graph TB
subgraph "Standard Model Architecture"
A[TransformerModel] --> B[TransformerBlock]
B --> C[MultiHeadAttention]
C --> D[Q, K, V Projections]
D --> E[Attention Computation]
E --> F[Output Projection]
F --> G[Feed Forward]
end
subgraph "Optimization Injection Points"
H[OptimizedInference Wrapper] --> A
A --> B2[TransformerBlock]
B2 --> C2[OptimizedMultiHeadAttention]
C2 --> D2[Q, K, V Projections]
D2 --> I[KV Cache Injection]
I --> E2[Optimized Attention]
E2 --> F2[Output Projection]
F2 --> G2[Feed Forward]
end
subgraph "KV Cache Layer Details"
I --> J[Cache Check]
J --> K{Cache Exists?}
K -->|No| L[Compute K, V]
K -->|Yes| M[Retrieve from Cache]
L --> N[Store in Cache]
M --> O[Append New K, V]
N --> O
O --> P[Use Cached KVs]
end
style H fill:#90EE90
style I fill:#90EE90
style K fill:#FFD700
style P fill:#90EE90
Benchmark Execution Flow
sequenceDiagram
participant User
participant InferenceScript
participant BenchmarkModule
participant OptimizedInference
participant StandardModel
participant MetricsCollector
User->>InferenceScript: python inference.py --benchmark
InferenceScript->>BenchmarkModule: Initialize Metrics
BenchmarkModule->>MetricsCollector: Create InferenceMetrics
Note over InferenceScript: Run 1: Non-Optimized
InferenceScript->>StandardModel: model.generate()
StandardModel->>StandardModel: Forward Pass (O(n²))
StandardModel-->>InferenceScript: Generated Tokens
InferenceScript->>MetricsCollector: Log Run (optimized=false)
Note over InferenceScript: Run 2: Optimized
InferenceScript->>OptimizedInference: get_optimized_inference()
OptimizedInference->>OptimizedInference: Init KV Cache
OptimizedInference->>OptimizedInference: generate_with_cache()
loop For each token
OptimizedInference->>OptimizedInference: Forward Pass (O(n))
OptimizedInference->>OptimizedInference: Update KV Cache
end
OptimizedInference-->>InferenceScript: Generated Tokens
InferenceScript->>MetricsCollector: Log Run (optimized=true)
MetricsCollector->>MetricsCollector: Calculate Speedup
MetricsCollector->>MetricsCollector: Generate Plots
MetricsCollector->>MetricsCollector: Export CSV
MetricsCollector-->>User: Results & Plots
Optimization Components Stack
graph TD
subgraph "Application Layer"
A[inference.py] --> B[benchmark_inference]
B --> C[Generate Text]
end
subgraph "Optimization Layer"
C --> D{Optimized?}
D -->|Yes| E[OptimizedInference]
D -->|No| F[Standard Model]
E --> G[KV Cache Manager]
E --> H[Optimized Attention]
end
subgraph "Core Model Layer"
F --> I[TransformerModel]
E --> I
I --> J[TransformerBlock]
J --> K[MultiHeadAttention]
H --> K
K --> L[Attention Computation]
end
subgraph "Cache Layer"
G --> M[KVCache Data Structure]
M --> N[Keys Cache]
M --> O[Values Cache]
N --> P[Retrieve Previous K]
O --> Q[Retrieve Previous V]
end
subgraph "Compute Layer"
L --> R[Q × K^T]
P --> R
Q --> R
R --> S[Softmax]
S --> T[Attention Weights]
T --> U[Output]
end
style E fill:#90EE90
style G fill:#90EE90
style H fill:#90EE90
style M fill:#FFD700
Performance Comparison Schema
flowchart LR
subgraph subGraph0["Metrics Collection"]
B["Non-Optimized Metrics"]
A["Benchmark Run"]
C["Optimized Metrics"]
D["Time: T1<br>Memory: M1<br>Speed: S1"]
E["Time: T2<br>Memory: M2<br>Speed: S2"]
end
subgraph Analysis["Analysis"]
F["Calculate Speedup"]
G["Speedup = S2/S1"]
H["Calculate Memory Reduction"]
I["Reduction = (M1-M2)/M1 × 100%"]
end
subgraph Visualization["Visualization"]
J["Comparison Plot"]
K["Trend Analysis"]
L["Performance Over Time"]
end
subgraph subGraph3["Data Export"]
M["JSON Metrics"]
N["CSV Export"]
end
A --> B & C
B --> D
C --> E
D --> F & H
E --> F & H
F --> G & K
H --> I
G --> J
I --> J
K --> L
J --> M & N
L --> M & N
style F fill:#FFD700
style G fill:#90EE90
style I fill:#90EE90
Data File Locations Summary
All benchmark data is saved to:
./inference_benchmarks/
├── inference_metrics.json # All raw metrics (JSON)
├── inference_metrics.csv # Spreadsheet data (CSV)
├── optimization_comparison.png # Comparison charts
└── performance_over_time.png # Trend analysis
Custom location:
--benchmark-dir ./research/results
Data accumulates: Each benchmark run appends to the same files, building trends over time.
Next Steps
- ✅ Run your first benchmark
- ✅ Review the comparison plots
- ✅ Analyze CSV data for deeper insights
- ✅ Run multiple benchmarks for statistical analysis
- ✅ Use batch script for trend analysis
- ✅ Include results in your research paper/presentation
Happy Benchmarking! 📊🔬