# Inference Benchmarking Guide

This guide explains how to use the benchmarking feature to compare optimized vs non-optimized inference performance for research purposes.

## Overview

The benchmarking feature runs inference both with and without optimizations (KV caching, optimized attention) and generates:

- **Performance metrics** (tokens/sec, latency, memory usage)
- **Comparison plots** (visual charts showing improvements)
- **CSV export** (data for further analysis)

## Data Storage Location

**All benchmark data is saved to:** `./inference_benchmarks/` (default)

**You can customize the location:**

```bash
python inference.py --benchmark --benchmark-dir ./research/results
```

**Data files created:**

- `inference_metrics.json` - All raw metrics (JSON format)
- `inference_metrics.csv` - Spreadsheet-friendly data (CSV format)
- `optimization_comparison.png` - Visual comparison charts
- `performance_over_time.png` - Trend analysis over multiple runs

**Note:** All runs accumulate in the same files, so you can run multiple benchmarks and build trends over time.

## Quick Start

### Basic Benchmark

```bash
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "The future of artificial intelligence" \
    --max-length 100 \
    --benchmark
```

This will:

1. Run inference **without** optimizations
2. Run inference **with** optimizations (KV cache)
3. Collect metrics for both runs
4. Generate comparison plots
5. Save all data to `./inference_benchmarks/`

### Custom Benchmark Directory

```bash
python inference.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt "Your prompt here" \
    --max-length 100 \
    --benchmark \
    --benchmark-dir ./research/results
```

### Running Multiple Prompts for Trends

**Use the batch benchmark script** to run multiple prompts and build trend data:

```bash
# Create a prompts file
cat > prompts.txt << EOF
The future of artificial intelligence
Machine learning is transforming
Deep neural networks enable
Natural language processing requires
EOF

# Run batch benchmarks
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results
```

**Or use command-line prompts:**

```bash
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompts "Prompt 1" "Prompt 2" "Prompt 3" \
    --max-length 100
```

**Results accumulate** in the same files, allowing you to:

- Build trends across multiple prompts
- Analyze performance consistency
- Create comprehensive research reports

## Output Files

After running a benchmark, you'll get:

### 1. JSON Metrics File

**Location:** `inference_benchmarks/inference_metrics.json`

Contains all raw metrics data:

```json
{
  "runs": [
    {
      "run_name": "run_1234567890_optimized",
      "optimized": true,
      "tokens_per_second": 150.5,
      "time_per_token": 6.64,
      "memory_used_mb": 245.3,
      ...
    },
    ...
  ]
}
```
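To post-process the raw metrics programmatically, a minimal sketch (field names per the `{"runs": [...]}` example above):

```python
import json

# Load every accumulated benchmark run
with open("inference_benchmarks/inference_metrics.json") as f:
    runs = json.load(f)["runs"]

# Average throughput per mode
for label, flag in [("optimized", True), ("non-optimized", False)]:
    tps = [r["tokens_per_second"] for r in runs if r["optimized"] == flag]
    if tps:
        print(f"{label}: {sum(tps) / len(tps):.1f} tokens/sec over {len(tps)} runs")
```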
### 2. CSV Export

**Location:** `inference_benchmarks/inference_metrics.csv`

For spreadsheet analysis:

```csv
run_name,timestamp,optimized,prompt_length,generated_length,total_time,tokens_per_second,time_per_token,memory_used_mb,device
run_1234567890_optimized,1234567890.5,true,20,100,0.663,150.8,6.63,245.3,cuda
...
```

### 3. Comparison Plot

**Location:** `inference_benchmarks/optimization_comparison.png`

Shows 4 charts:

- **Tokens per Second** (speed comparison)
- **Time per Token** (latency comparison)
- **Total Generation Time** (overall speed)
- **Memory Usage** (memory efficiency)

### 4. Performance Over Time Plot

**Location:** `inference_benchmarks/performance_over_time.png`

Shows how performance varies across multiple benchmark runs.

## Metrics Collected

### Performance Metrics

- **Tokens per Second**: Generation speed
- **Time per Token**: Latency per token (milliseconds)
- **Total Time**: Complete generation time (seconds)

### Resource Metrics

- **Memory Usage**: GPU memory consumption (MB)
- **Device**: Device used (cuda/cpu/mps)

### Derived Metrics

- **Speedup**: Ratio of optimized to non-optimized speed
- **Memory Reduction**: Percentage reduction in memory usage
## Example Output

```
🔬 BENCHMARK MODE: Comparing optimized vs non-optimized inference
======================================================================

BENCHMARK RUN: run_1234567890
======================================================================

🔴 Running NON-OPTIMIZED inference...
⏱️  Total Time: 1.234 s
📊 Tokens/Second: 81.0
⚡ Time/Token: 12.35 ms
💾 Memory Used: 512.3 MB
📝 Generated: The future of artificial intelligence is bright...

🟢 Running OPTIMIZED inference...
⏱️  Total Time: 0.663 s
📊 Tokens/Second: 150.8
⚡ Time/Token: 6.63 ms
💾 Memory Used: 245.3 MB
📝 Generated: The future of artificial intelligence is bright...

🚀 SPEEDUP: 1.86x faster with optimizations
💾 MEMORY REDUCTION: 52.1%

📊 Generating comparison plots and data...
📊 Comparison plot saved to: ./inference_benchmarks/optimization_comparison.png
📊 Performance over time plot saved to: ./inference_benchmarks/performance_over_time.png
📊 Metrics exported to CSV: ./inference_benchmarks/inference_metrics.csv

✅ Benchmark complete! Results saved to: ./inference_benchmarks
```
## Running Multiple Benchmarks for Trends

### Method 1: Individual Runs (Manual)

```bash
# Run 1
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 1" --benchmark

# Run 2
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 2" --benchmark

# Run 3
python inference.py --checkpoint checkpoints/best.pt --prompt "Prompt 3" --max-length 200 --benchmark
```

All runs accumulate in the same files:

- `inference_metrics.json` - All runs appended
- `inference_metrics.csv` - All runs in CSV format
- Plots update automatically with new data

### Method 2: Batch Script (Recommended)

**Create a prompts file:**

```bash
cat > research_prompts.txt << EOF
The future of artificial intelligence is bright.
Machine learning models are becoming more efficient.
Deep neural networks can process complex patterns.
Natural language processing enables human-computer interaction.
Transformer architectures revolutionized NLP.
EOF
```

**Run batch benchmarks:**

```bash
python benchmark_batch.py \
    --checkpoint checkpoints/best_checkpoint.pt \
    --prompt-file research_prompts.txt \
    --max-length 100 \
    --benchmark-dir ./research/results \
    --delay 2.0
```

**Benefits:**

- ✅ Runs all prompts automatically
- ✅ Accumulates data for trend analysis
- ✅ Creates comprehensive performance reports
- ✅ Handles errors gracefully

**After running multiple benchmarks:**

- Check `performance_over_time.png` for trends
- Analyze `inference_metrics.csv` in Excel/Python
- Review the aggregated statistics in the console output

## Research Use Cases

### 1. Performance Analysis

Compare how optimizations affect inference speed:

```bash
python inference.py \
    --checkpoint checkpoints/best.pt \
    --prompt "Your research prompt" \
    --benchmark
```

### 2. Memory Efficiency Study

Analyze memory usage improvements:

```bash
# Check memory reduction
python inference.py --checkpoint checkpoints/best.pt --prompt "Long prompt" --max-length 500 --benchmark
```

### 3. Scalability Testing

Test with different generation lengths (or script the sweep, as shown after this block):

```bash
# Short sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 50 --benchmark

# Medium sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 200 --benchmark

# Long sequences
python inference.py --checkpoint checkpoints/best.pt --prompt "Test" --max-length 1000 --benchmark
```
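Because every run appends to the same metrics files, the sweep above can be automated so the scaling trend accumulates in one pass. A small driver sketch (an equivalent shell loop works just as well):

```python
import subprocess

# Sweep generation lengths; each run appends to the shared metrics files
for max_len in (50, 100, 200, 500, 1000):
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint", "checkpoints/best.pt",
         "--prompt", "Test",
         "--max-length", str(max_len),
         "--benchmark"],
        check=True,
    )
```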
## Plot Interpretation

### Comparison Plot (`optimization_comparison.png`)

**Top Left - Tokens per Second:**

- Higher is better
- Shows generation speed
- Speedup annotation shows improvement factor

**Top Right - Time per Token:**

- Lower is better
- Shows latency per token
- Important for real-time applications

**Bottom Left - Total Generation Time:**

- Lower is better
- Overall generation time
- Most user-visible metric

**Bottom Right - Memory Usage:**

- Lower is better
- GPU memory consumption
- Memory reduction annotation shows savings

### Performance Over Time Plot (`performance_over_time.png`)

Shows performance trends across multiple benchmark runs:

- **Green line**: Optimized performance
- **Red line**: Non-optimized performance
- Useful for finding performance regressions or improvements

## Reporting Results

### Speedup Calculation

```
Speedup = Optimized Tokens/Second / Non-Optimized Tokens/Second
```

**Example:**

- Optimized: 150 tokens/sec
- Non-optimized: 81 tokens/sec
- Speedup: 150/81 = 1.85x faster

### Memory Reduction Calculation

```
Memory Reduction % = (1 - Optimized Memory / Non-Optimized Memory) × 100
```

**Example:**

- Optimized: 245 MB
- Non-optimized: 512 MB
- Reduction: (1 - 245/512) × 100 = 52.1%

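The same calculations as code, using the numbers from the examples above:

```python
def speedup(optimized_tps: float, baseline_tps: float) -> float:
    """Speedup = optimized tokens/sec divided by non-optimized tokens/sec."""
    return optimized_tps / baseline_tps

def memory_reduction_pct(optimized_mb: float, baseline_mb: float) -> float:
    """Memory reduction % = (1 - optimized / non-optimized) * 100."""
    return (1 - optimized_mb / baseline_mb) * 100

print(f"{speedup(150, 81):.2f}x faster")                     # 1.85x faster
print(f"{memory_reduction_pct(245, 512):.1f}% less memory")  # 52.1% less memory
```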
## Tips for Best Results

1. **Warm Up GPU**: Run a few inference calls before benchmarking to warm up the GPU (see the sketch after this list)
2. **Clear Cache**: The benchmark automatically clears the CUDA cache between runs
3. **Multiple Runs**: Run multiple benchmarks for statistical significance
4. **Consistent Prompts**: Use the same prompt for fair comparison
5. **Device Consistency**: Use the same device for all runs

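For tip 1, a minimal warm-up sketch in PyTorch. The `model.generate(...)` call mirrors the standard path shown in the execution-flow diagram below, but the exact signature here is an assumption:

```python
import torch

def warm_up(model, input_ids, n_iters: int = 3):
    """Run a few throwaway generations so CUDA kernels are compiled/cached
    and GPU clocks have ramped up before any timed run (illustrative)."""
    with torch.no_grad():
        for _ in range(n_iters):
            model.generate(input_ids)  # exact signature depends on your model
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work to finish
```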
## Command Line Options

```bash
python inference.py \
    --checkpoint PATH     # Path to model checkpoint (required)
    --prompt TEXT         # Prompt text (required)
    --max-length INT      # Maximum generation length (default: 100)
    --temperature FLOAT   # Sampling temperature (default: 1.0)
    --top-k INT           # Top-k sampling (default: 50)
    --top-p FLOAT         # Top-p sampling (default: 0.95)
    --device DEVICE       # Device: cuda/cpu/mps (default: cuda)
    --benchmark           # Enable benchmarking mode
    --benchmark-dir DIR   # Benchmark output directory (default: ./inference_benchmarks)
```

## Troubleshooting

### No GPU Memory Stats

If memory stats show as `None`:

- CUDA: Memory tracking should work automatically
- MPS (Apple Silicon): Memory tracking not available
- CPU: Memory tracking not available

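To confirm what your device can report, a quick check using standard PyTorch calls:

```python
import torch

if torch.cuda.is_available():
    # CUDA exposes per-device allocation counters
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
else:
    # CPU / MPS: no equivalent counters, so the benchmark reports None
    print("No CUDA device: memory stats unavailable")
```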
### Plots Not Generated

If plots fail to generate:

- Ensure `matplotlib` is installed: `pip install matplotlib`
- Check file permissions for the output directory

### Inconsistent Results

For consistent results:

- Use the same device for all runs
- Use the same prompt length
- Allow the GPU to warm up
- Close other GPU applications

## Example Research Workflow

```bash
# 1. Run initial benchmark
python inference.py --checkpoint checkpoints/best.pt --prompt "Test prompt" --benchmark

# 2. Review results
ls inference_benchmarks/
cat inference_benchmarks/inference_metrics.json

# 3. Generate plots (already done automatically)
# View: inference_benchmarks/optimization_comparison.png

# 4. Analyze CSV data
# Open: inference_benchmarks/inference_metrics.csv in Excel/Python (see the pandas sketch below)

# 5. Run additional benchmarks
python inference.py --checkpoint checkpoints/best.pt --prompt "Different prompt" --max-length 200 --benchmark

# 6. Compare results
python inference.py --checkpoint checkpoints/best.pt --prompt "Same prompt" --benchmark
```
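For step 4, a short pandas sketch (column names per the CSV example earlier; any spreadsheet tool works just as well):

```python
import pandas as pd

# Load every accumulated run
df = pd.read_csv("inference_benchmarks/inference_metrics.csv")

# Mean throughput and memory per mode (optimized vs non-optimized)
summary = df.groupby("optimized")[["tokens_per_second", "memory_used_mb"]].mean()
print(summary)
```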
## Optimization Architecture & Code Injection

### Overview: Optimization Layers

The optimizations are implemented as layers that wrap the standard inference pipeline:

```mermaid
flowchart TB
    subgraph subGraph0["Standard Inference (Non-Optimized)"]
        A["Input Prompt"]
        B["Tokenize"]
        C["Embedding Layer"]
        D["Transformer Blocks"]
        E["Attention: Recompute All"]
        F["Forward Pass: O(n²)"]
        G["Output Tokens"]
        H["Detokenize"]
        I["Generated Text"]
    end
    subgraph subGraph1["Optimized Inference (With KV Cache)"]
        A2["Input Prompt"]
        B2["Tokenize"]
        C2["Embedding Layer"]
        D2["Transformer Blocks"]
        E2["Optimized Attention"]
        F2["KV Cache Layer"]
        G2["Forward Pass: O(n)"]
        H2["Output Tokens"]
        I2["Detokenize"]
        J2["Generated Text"]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    A2 --> B2
    B2 --> C2
    C2 --> D2
    D2 --> E2
    E2 --> F2
    F2 --> G2
    G2 --> H2
    H2 --> I2
    I2 --> J2
    style E fill:#ffcccc
    style F fill:#ffcccc
    style E2 fill:#ccffcc
    style F2 fill:#ccffcc
```
### Detailed Optimization Flow

```mermaid
flowchart LR
    subgraph subGraph0["Request Flow"]
        Start["Benchmark Request"]
        Mode{"Optimized?"}
        Standard["Standard Path"]
        Optimized["Optimized Path"]
    end
    subgraph subGraph1["Standard Path"]
        S1["Model.generate"]
        S2["Transformer Forward"]
        S3["MultiHeadAttention"]
        S4["Compute Q, K, V"]
        S5["Recompute All KVs"]
        S6["Attention Scores: O(n²)"]
        S7["Generate Token"]
    end
    subgraph subGraph2["Optimized Path"]
        O1["OptimizedInference"]
        O2["Init KV Cache"]
        O3["Transformer Forward"]
        O4["OptimizedMultiHeadAttention"]
        O5["Compute Q, K, V"]
        O6["KV Cache Layer"]
        O7["Append to Cache"]
        O8["Reuse Cached KVs"]
        O9["Attention Scores: O(n)"]
        O10["Generate Token"]
    end
    Start --> Mode
    Mode -- No --> Standard
    Mode -- Yes --> Optimized
    Standard --> S1
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S5 --> S6
    S6 --> S7
    Optimized --> O1
    O1 --> O2
    O2 --> O3
    O3 --> O4
    O4 --> O5
    O5 --> O6
    O6 --> O7
    O7 --> O8
    O8 --> O9
    O9 --> O10
    S7 --> Metrics["Collect Metrics"]
    O10 --> Metrics
    style Standard fill:#ffcccc
    style Optimized fill:#ccffcc
    style S5 fill:#ffcccc
    style O8 fill:#ccffcc
```
### Code Injection Points

```mermaid
graph TB
    subgraph "Standard Model Architecture"
        A[TransformerModel] --> B[TransformerBlock]
        B --> C[MultiHeadAttention]
        C --> D[Q, K, V Projections]
        D --> E[Attention Computation]
        E --> F[Output Projection]
        F --> G[Feed Forward]
    end

    subgraph "Optimization Injection Points"
        H[OptimizedInference Wrapper] --> A
        A --> B2[TransformerBlock]
        B2 --> C2[OptimizedMultiHeadAttention]
        C2 --> D2[Q, K, V Projections]
        D2 --> I[KV Cache Injection]
        I --> E2[Optimized Attention]
        E2 --> F2[Output Projection]
        F2 --> G2[Feed Forward]
    end

    subgraph "KV Cache Layer Details"
        I --> J[Cache Check]
        J --> K{Cache Exists?}
        K -->|No| L[Compute K, V]
        K -->|Yes| M[Retrieve from Cache]
        L --> N[Store in Cache]
        M --> O[Append New K, V]
        N --> O
        O --> P[Use Cached KVs]
    end

    style H fill:#90EE90
    style I fill:#90EE90
    style K fill:#FFD700
    style P fill:#90EE90
```
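The cache layer in the diagram reduces to one core operation: keep the key/value tensors from earlier decoding steps and append each new step's K, V instead of recomputing the whole sequence. A minimal PyTorch sketch of the idea (class name and tensor shapes are illustrative, not this repo's actual implementation):

```python
import torch

class KVCache:
    """Per-layer cache of attention keys/values across decoding steps (illustrative)."""

    def __init__(self):
        self.k = None  # (batch, heads, seq_len_so_far, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            # First step: nothing cached yet, store the prompt's K, V
            self.k, self.v = k_new, v_new
        else:
            # Later steps: append the new token's K, V along the sequence axis
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```

This is why the per-step cost drops from O(n²) to O(n): each step computes Q, K, V for one new token and attends over the cached keys/values instead of re-running the full sequence.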
### Benchmark Execution Flow

```mermaid
sequenceDiagram
    participant User
    participant InferenceScript
    participant BenchmarkModule
    participant OptimizedInference
    participant StandardModel
    participant MetricsCollector

    User->>InferenceScript: python inference.py --benchmark
    InferenceScript->>BenchmarkModule: Initialize Metrics
    BenchmarkModule->>MetricsCollector: Create InferenceMetrics

    Note over InferenceScript: Run 1: Non-Optimized
    InferenceScript->>StandardModel: model.generate()
    StandardModel->>StandardModel: Forward Pass (O(n²))
    StandardModel-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=false)

    Note over InferenceScript: Run 2: Optimized
    InferenceScript->>OptimizedInference: get_optimized_inference()
    OptimizedInference->>OptimizedInference: Init KV Cache
    OptimizedInference->>OptimizedInference: generate_with_cache()

    loop For each token
        OptimizedInference->>OptimizedInference: Forward Pass (O(n))
        OptimizedInference->>OptimizedInference: Update KV Cache
    end

    OptimizedInference-->>InferenceScript: Generated Tokens
    InferenceScript->>MetricsCollector: Log Run (optimized=true)

    MetricsCollector->>MetricsCollector: Calculate Speedup
    MetricsCollector->>MetricsCollector: Generate Plots
    MetricsCollector->>MetricsCollector: Export CSV
    MetricsCollector-->>User: Results & Plots
```
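The timing side of this flow fits in a small harness. A sketch under stated assumptions: `generate_fn` is a zero-argument closure over whichever path (standard or KV-cached) is being measured, and `n_new_tokens` is the number of tokens it generates:

```python
import time
import torch

def timed_run(generate_fn, n_new_tokens: int) -> dict:
    """Time one generation call and derive the metrics shown above (sketch)."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn()  # standard or KV-cached generation
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # stop the clock only after GPU work finishes
    total = time.perf_counter() - start
    return {
        "total_time": total,
        "tokens_per_second": n_new_tokens / total,
        "time_per_token_ms": total / n_new_tokens * 1000,
    }
```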
### Optimization Components Stack

```mermaid
graph TD
    subgraph "Application Layer"
        A[inference.py] --> B[benchmark_inference]
        B --> C[Generate Text]
    end

    subgraph "Optimization Layer"
        C --> D{Optimized?}
        D -->|Yes| E[OptimizedInference]
        D -->|No| F[Standard Model]
        E --> G[KV Cache Manager]
        E --> H[Optimized Attention]
    end

    subgraph "Core Model Layer"
        F --> I[TransformerModel]
        E --> I
        I --> J[TransformerBlock]
        J --> K[MultiHeadAttention]
        H --> K
        K --> L[Attention Computation]
    end

    subgraph "Cache Layer"
        G --> M[KVCache Data Structure]
        M --> N[Keys Cache]
        M --> O[Values Cache]
        N --> P[Retrieve Previous K]
        O --> Q[Retrieve Previous V]
    end

    subgraph "Compute Layer"
        L --> R[Q × K^T]
        P --> R
        Q --> R
        R --> S[Softmax]
        S --> T[Attention Weights]
        T --> U[Output]
    end

    style E fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style M fill:#FFD700
```
### Performance Comparison Schema

```mermaid
flowchart LR
    subgraph subGraph0["Metrics Collection"]
        A["Benchmark Run"]
        B["Non-Optimized Metrics"]
        C["Optimized Metrics"]
        D["Time: T1<br>Memory: M1<br>Speed: S1"]
        E["Time: T2<br>Memory: M2<br>Speed: S2"]
    end
    subgraph Analysis["Analysis"]
        F["Calculate Speedup"]
        G["Speedup = S2/S1"]
        H["Calculate Memory Reduction"]
        I["Reduction = (M1-M2)/M1 × 100%"]
    end
    subgraph Visualization["Visualization"]
        J["Comparison Plot"]
        K["Trend Analysis"]
        L["Performance Over Time"]
    end
    subgraph subGraph3["Data Export"]
        M["JSON Metrics"]
        N["CSV Export"]
    end
    A --> B & C
    B --> D
    C --> E
    D --> F & H
    E --> F & H
    F --> G & K
    H --> I
    G --> J
    I --> J
    K --> L
    J --> M & N
    L --> M & N
    style F fill:#FFD700
    style G fill:#90EE90
    style I fill:#90EE90
```
## Data File Locations Summary

**All benchmark data is saved to:**

```
./inference_benchmarks/
├── inference_metrics.json          # All raw metrics (JSON)
├── inference_metrics.csv           # Spreadsheet data (CSV)
├── optimization_comparison.png     # Comparison charts
└── performance_over_time.png       # Trend analysis
```

**Custom location:**

```bash
--benchmark-dir ./research/results
```

**Data accumulates:** Each benchmark run appends to the same files, building trends over time.

## Next Steps

1. ✅ Run your first benchmark
2. ✅ Review the comparison plots
3. ✅ Analyze CSV data for deeper insights
4. ✅ Run multiple benchmarks for statistical analysis
5. ✅ Use the batch script for trend analysis
6. ✅ Include results in your research paper/presentation

---

**Happy Benchmarking!** 📊🔬