initial commit

2025-09-21 01:17:26 -04:00
commit cd69096346
150 changed files with 87323 additions and 0 deletions
--- a/pipelineSimulation/README.md
+++ b/pipelineSimulation/README.md
@@ -0,0 +1,38 @@
+# Pipeline Simulation
+
+## Set up and configuration in gem5
+
+To begin the project, a script was created to configure and launch the baseline pipeline in gem5. The simulation was executed on the X86 DerivO3CPU model with a 2 GHz CPU and system clock, 32 KB L1 instruction and data caches, and a unified 1 MB L2 cache. The benchmark program used was memtouch, run in syscall emulation mode. The script defined output directories, cache parameters, and execution limits, ensuring reproducibility of the setup.
+
+## Output after running the command
+
+The initial run produced a baseline performance snapshot. The measured IPC was ~0.05 (CPI ≈ 19.7), indicating extremely low throughput. Nearly 97% of cycles retired no instructions, showing that the pipeline was heavily stalled. Analysis of memory system statistics revealed an L1D miss rate close to 50%, with an average miss latency of ~78,000 ticks. These misses frequently propagated through the pipeline, creating bubbles and stalling progress. In contrast, branch prediction worked effectively, with a misprediction rate below 0.05%.
+
+## Cycle-by-cycle analysis of pipeline stages
+
+Cycle-level tracing (via --debug-flags=O3CPU,Fetch,Decode,Rename,IEW,Commit,Branch) revealed the interactions of the pipeline stages. The traces show frequent stalls at IEW and Commit, triggered by long-latency load misses that blocked dependent instructions. Fetch and Decode also experienced backpressure as the backend filled, illustrating how memory bottlenecks propagate through the pipeline. Overall, the cycle-by-cycle view confirmed that the pipeline’s performance limit was not due to execution width or branch handling, but due to structural and data hazards in the memory system.
+
+|Metric|Value|Interpretation|
+|---|---|---|
+|Total Instructions (simInsts)|25,297,289|Benchmark retired ~25M instructions|
+|Total Cycles (numCycles)|499,384,067|Simulation ran ~499M cycles|
+|IPC|0.051|Very low throughput (pipeline mostly stalled)|
+|CPI|19.74|Each instruction took ~20 cycles on average|
+|Commit Histogram|~97% cycles at 0 commit|Pipeline idle most of the time (waiting on memory)|
+|L1I Miss Rate|~0%|Instruction cache well-behaved|
+|L1D Miss Rate|49.8%|Half of all data accesses missed L1|
+|L1D MPKI|~124|Very high miss intensity (memory-bound workload)|
+|Avg. L1D Miss Latency|~78,000 ticks|Memory stalls extremely long|
+|L2 Hit Ratio|66%|About 1/3 of L2 accesses miss → DRAM access|
+|Branch Pred. Mispredict Rate|0.03%|Branching handled very well (not a bottleneck)|
+|Instruction Mix|73% IntAlu, 27% memory ops|Heavy integer + memory workload|
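+
+As a rough check on the MPKI row, misses per thousand instructions is the L1D miss count divided by the retired instruction count, times 1,000. The miss count used here (3,147,778) is taken from Technical_Analysis_Report.md rather than from this table, so treat the result as a sanity check under that assumption:
+
+$$
+\mathrm{MPKI}_{\mathrm{L1D}} = \frac{3{,}147{,}778}{25{,}297{,}289} \times 1000 \approx 124.4
+$$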
+
+- The baseline run shows severe memory bottlenecks: nearly half of L1D accesses miss, with miss penalties of tens of thousands of ticks (roughly 150 cycles at 500 ticks per cycle).
+
+- This results in IPC ≈ 0.05, with the pipeline committing zero instructions in ~97% of cycles.
+
+- Branch prediction is highly accurate and does not contribute to stalls.
+
+- Instruction mix is dominated by integer arithmetic and memory operations, with stores forming a large share.
+
+- The cycle-by-cycle analysis confirms that structural and data hazards in the memory subsystem are the main performance limiter, not branch or execution resources.
--- a/pipelineSimulation/Technical_Analysis_Report.md
+++ b/pipelineSimulation/Technical_Analysis_Report.md
@@ -0,0 +1,223 @@
+# Pipeline Simulation Technical Analysis Report
+
+## Executive Summary
+
+This report presents a comprehensive analysis of gem5 pipeline simulation experiments conducted using the DerivO3CPU model. The experiments reveal significant performance bottlenecks primarily in the memory subsystem, with IPC values around 0.05-0.08 indicating severe pipeline stalls. The analysis covers baseline performance, cycle-by-cycle pipeline behavior, and identifies key architectural bottlenecks that limit processor throughput.
+
+## Set up and configuration in Gem5
+
+The experiments used gem5's DerivO3CPU out-of-order model, run through the se.py syscall emulation script. The configuration is an 8-wide superscalar core with multiple execution units, a tournament branch predictor, and a two-level cache hierarchy, clocked at 2 GHz. Only the clocks, cache sizes, and instruction limit were set on the command line; all other core and memory parameters are the defaults for that script.
+
+### Configuration Summary
+- **CPU Model**: DerivO3CPU (Out-of-order execution)
+- **Clock Frequency**: 2 GHz (500 ps period)
+- **Pipeline Widths**: 8-wide fetch, decode, issue, commit
+- **ROB Size**: 192 entries
+- **IQ Size**: 64 entries  
+- **LSQ Configuration**: 32 load queue entries, 32 store queue entries
+- **Branch Predictor**: Tournament predictor with 4K BTB entries
+- **L1 Cache**: 32KB I-cache, 32KB D-cache (2-way associative)
+- **L2 Cache**: 1MB unified cache (8-way associative)
+- **Memory**: DDR3-1600 with realistic timing parameters
+
+The configuration represents a modern high-performance processor design with aggressive out-of-order execution capabilities. The 8-wide pipeline allows for significant instruction-level parallelism, while the large ROB and IQ provide substantial instruction window depth for dependency resolution.
+
+## Output after running the command
+
+The experimental results reveal a processor operating far below its theoretical peak performance. The baseline configuration achieved an IPC of approximately 0.051, corresponding to a CPI of 19.74 cycles per instruction. This performance level indicates that the pipeline is experiencing severe stalls, with the processor retiring zero instructions in approximately 97% of execution cycles.
+
+### Performance Metrics Summary
+
+| Configuration | Instructions | Cycles | IPC | CPI | Simulation Time |
+|---------------|--------------|--------|-----|-----|-----------------|
+| o3-baseline | 25,297,289 | 499,384,067 | 0.051 | 19.74 | 0.250s |
+| o3-trace | 368,504 | 4,491,071 | 0.082 | 12.19 | 0.002s |
+| pipeline/o3-baseline | 25,297,289 | 499,384,067 | 0.051 | 19.74 | 0.250s |
+
+The commit distribution histogram reveals the severity of pipeline stalls, with 97.37% of cycles committing zero instructions in the baseline configuration. This indicates that the processor is spending the vast majority of its time waiting for long-latency operations to complete, primarily memory accesses.
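+
+The IPC and CPI columns follow directly from the instruction and cycle counts in the table. As a quick arithmetic check on the o3-baseline and o3-trace rows:
+
+$$
+\mathrm{IPC}_{\text{baseline}} = \frac{25{,}297{,}289}{499{,}384{,}067} \approx 0.0507, \qquad \mathrm{CPI}_{\text{baseline}} = \frac{499{,}384{,}067}{25{,}297{,}289} \approx 19.74
+$$
+
+$$
+\mathrm{IPC}_{\text{trace}} = \frac{368{,}504}{4{,}491{,}071} \approx 0.082, \qquad \mathrm{CPI}_{\text{trace}} = \frac{4{,}491{,}071}{368{,}504} \approx 12.19
+$$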
+
+### Memory System Analysis
+The memory subsystem shows significant performance bottlenecks:
+- **L1D Miss Rate**: 49.8% (3,147,778 misses out of 6,319,345 accesses)
+- **L1I Miss Rate**: 4.97% (1,029 misses out of 20,710 accesses)  
+- **L2 Miss Rate**: 99.98% (3,147,457 misses out of 3,147,514 accesses)
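+- **Average L1D Miss Latency**: ~70,000 ticks (~140 cycles at 500 ticks per cycle)
+
+The extremely high L2 miss rate indicates that nearly all L1 misses result in main memory accesses, creating a severe memory bottleneck. With 1 ps ticks and a 500 ps clock period, the average miss latency of 70,000 ticks is approximately 140 processor cycles. Combined with an L1D miss on roughly one in every eight committed instructions (3,147,778 misses over 25,297,289 instructions), this explains the low IPC observed.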
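+
+The tick-to-cycle conversion can be checked by hand. This assumes a gem5 tick of 1 ps (the simout log reports a global frequency of 1000000000000 ticks per second) and the 500 ps clock period from the configuration summary:
+
+$$
+\frac{500\ \text{ps/cycle}}{1\ \text{ps/tick}} = 500\ \text{ticks/cycle}, \qquad \frac{70{,}000\ \text{ticks}}{500\ \text{ticks/cycle}} = 140\ \text{cycles}
+$$
+
+The same factor applied to the final simulation tick in simout gives the total cycle count: 249,692,033,000 / 500 ≈ 499,384,066 cycles, in line with the cycle count in the performance table.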
+
+### Branch Prediction Performance
+Branch prediction demonstrates excellent accuracy:
+- **Branch Misprediction Rate**: 0.03% (733 mispredicts out of 3,528,004 lookups)
+- **BTB Hit Ratio**: 99.97%
+- **RAS Accuracy**: 99.67%
+
+The branch predictor is highly effective and does not contribute significantly to performance degradation, confirming that memory system bottlenecks are the primary performance limiter.
+
+## Cycle-by-cycle analysis of pipeline stages
+
+The cycle-level analysis reveals the intricate interactions between pipeline stages and identifies the root causes of performance bottlenecks. The DerivO3CPU's out-of-order execution engine attempts to maximize instruction-level parallelism, but structural and data hazards in the memory subsystem create severe pipeline stalls.
+
+### Pipeline Stage Utilization
+
+The commit distribution analysis shows:
+- **0 instructions/cycle**: 97.37% of cycles (485,870,838 cycles)
+- **1 instruction/cycle**: 0.77% of cycles (3,859,564 cycles)  
+- **2 instructions/cycle**: 0.08% of cycles (401,486 cycles)
+- **3+ instructions/cycle**: 1.78% of cycles (remaining cycles)
+
+This distribution confirms that the pipeline is severely underutilized, spending most cycles waiting for memory operations to complete. The average commit rate of 0.0698 instructions per cycle is far below the theoretical maximum of 8 instructions per cycle.
+
+### Memory System Impact on Pipeline
+
+The memory subsystem creates cascading stalls throughout the pipeline:
+1. **Fetch Stage**: Occasionally stalled by instruction cache misses (4.97% miss rate, but only 1,029 misses in total)
+2. **Decode/Rename**: Backpressure from full instruction queues
+3. **Issue/Execute**: Blocked by long-latency memory operations
+4. **Commit**: Severely limited by memory dependency chains
+
+The average L1D miss latency of ~70,000 ticks creates pipeline bubbles that propagate through all stages. When a load instruction misses in the L1D cache, dependent instructions must wait for the memory access to complete, creating a chain reaction of stalls.
+
+### Functional Unit Utilization
+
+The instruction mix analysis reveals:
+- **Integer ALU**: 72.82% of committed instructions
+- **Memory Operations**: 27.18% of committed instructions
+- **Floating Point**: Minimal usage
+- **SIMD Operations**: Minimal usage
+
+The heavy memory operation workload (27.18%) combined with the high miss rate creates a perfect storm for pipeline stalls. Each memory operation that misses creates a long-latency dependency chain that blocks subsequent instruction execution.
+
+## Key Performance Bottlenecks and Analysis
+
+### Primary Bottleneck: Memory System
+
+The memory subsystem represents the dominant performance bottleneck, with several contributing factors:
+
+1. **High L1D Miss Rate (49.8%)**: Nearly half of all data cache accesses miss, requiring L2 cache or main memory access
+2. **Catastrophic L2 Miss Rate (99.98%)**: Almost all L1 misses result in main memory access
+3. **Long Miss Latency (~140 cycles)**: Memory access latency is about two orders of magnitude longer than a processor cycle
+4. **Memory Dependency Chains**: Load instructions create long dependency chains that block dependent instructions
+
+### Secondary Factors
+
+While memory dominates, other factors contribute to performance degradation:
+
+1. **Instruction Cache Misses**: 4.97% miss rate creates occasional fetch stalls
+2. **Pipeline Width Underutilization**: 8-wide pipeline commits less than 0.07 instructions per cycle on average
+3. **ROB/IQ Capacity**: Large instruction windows (192 ROB, 64 IQ) are underutilized due to memory stalls
+
+### Performance Scaling Analysis
+
+The comparison between configurations reveals:
+- **o3-trace**: Higher IPC (0.082 vs 0.051) but shorter simulation (368K vs 25M instructions)
+- **Consistent Bottlenecks**: All configurations show similar memory system behavior
+- **Branch Prediction**: Consistently excellent across all runs
+
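+The trace configuration shows higher IPC, but the two runs are not directly comparable. The trace run retired far fewer instructions, and because the script omits the cache-size flags for that run it used the se.py default cache sizes (its cpuinfo reports a 2048 KB cache, versus 1024 KB for the baseline).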
+
+## Architectural Implications and Recommendations
+
+### Memory System Optimizations
+
+1. **L1D Cache Size Increase**: Current 32KB may be insufficient for the workload
+2. **L2 Cache Size Increase**: 1MB L2 cache shows 99.98% miss rate
+3. **Prefetching**: Implement hardware prefetching to reduce miss rates
+4. **Memory Bandwidth**: Increase memory controller bandwidth and reduce latency
+
+### Pipeline Optimizations
+
+1. **Load-Store Queue Sizing**: Current 32-entry LSQ may limit memory parallelism
+2. **Memory Disambiguation**: Improve load-store dependency detection
+3. **Speculative Execution**: Enhance memory speculation capabilities
+
+### Workload Characteristics
+
+The memtouch benchmark appears to be memory-intensive with poor spatial and temporal locality. This workload choice may not represent typical application behavior, suggesting the need for additional benchmarks to validate architectural decisions.
+
+## Key Insights and Interesting Discoveries
+
+### 🔍 **The "Memory Wall" in Action**
+
+The most striking finding is how **catastrophically** the memory system dominates performance:
+- **99.98% L2 miss rate** - This is essentially saying "the L2 cache doesn't work at all"
+- **97% of cycles commit ZERO instructions** - The processor is essentially idle most of the time
+- **IPC of 0.051 vs theoretical 8.0** - We're getting only **0.6%** of peak performance!
+
+This is a perfect example of the "memory wall" problem that computer architects have been fighting for decades. Despite having a sophisticated 8-wide superscalar processor, memory system limitations hold it to roughly one committed instruction every 20 cycles.
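+
+The 0.6% figure is the measured IPC divided by the 8-instruction-per-cycle commit limit, and the corresponding shortfall from peak is its complement:
+
+$$
+\frac{0.051}{8} \approx 0.0064 = 0.64\%, \qquad 1 - 0.0064 \approx 99.4\%
+$$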
+
+### 🎯 **The Branch Predictor Paradox**
+
+Here's something fascinating: The branch predictor is **incredibly accurate** (99.97% accuracy), yet the processor still performs terribly. This proves that:
+- **Branch prediction isn't the bottleneck** - it's working perfectly
+- **Memory stalls dominate everything** - even perfect branch prediction can't save you from memory latency
+- **Modern branch predictors are very sophisticated** - the Tournament predictor with 4K BTB entries is doing its job
+
+This demonstrates that **optimizing the wrong subsystem yields no performance gains**. The branch predictor could be 100% accurate and performance would remain terrible due to memory stalls.
+
+### 🚀 **The "8-Wide Pipeline Illusion"**
+
+The configuration has an **8-wide superscalar pipeline** (can theoretically execute 8 instructions per cycle), but:
+- **Average commit rate: 0.07 instructions/cycle** 
+- **Peak observed: 8 instructions/cycle in only 0.0003% of cycles**
+- **192-entry ROB and 64-entry IQ are massively underutilized**
+
+This shows that **pipeline width means nothing if you can't feed it with instructions**. The processor has enormous execution resources that sit idle because memory can't provide data fast enough.
+
+### 💡 **The Memtouch Benchmark Revelation**
+
+The workload choice is **brutal** for this architecture:
+- **27% memory operations** with **49.8% miss rate**
+- **Poor spatial/temporal locality** - the benchmark is designed to stress memory systems
+- This creates a "perfect storm" of memory stalls
+
+This suggests the benchmark might be **artificially pessimistic** compared to real applications, but it perfectly exposes memory system bottlenecks that would be hidden by more cache-friendly workloads.
+
+## Practical Implications
+
+### **For Computer Architecture Education:**
+This is a **textbook example** of why memory system design is crucial. You can have the most sophisticated CPU core in the world, but if memory can't keep up, you get terrible performance. This experiment perfectly demonstrates the concept of "balanced system design."
+
+### **For Industry Applications:**
+- **Cache sizes matter enormously** - 32KB L1D is clearly insufficient for this workload
+- **Memory bandwidth is critical** - the processor is starved for data
+- **Prefetching could be transformative** - predicting memory access patterns could dramatically improve performance
+- **Workload characterization is essential** - different applications need different memory system characteristics
+
+### **For Research Directions:**
+- **Memory-centric architectures** - maybe we need to rethink the balance between CPU and memory
+- **Advanced prefetching** - this could be the key to unlocking performance
+- **Workload-aware design** - processors should adapt to application memory access patterns
+- **Memory hierarchy optimization** - the current L1/L2 structure may not be optimal
+
+## The Most Surprising Insight
+
+The most surprising thing is how **dramatically** a single subsystem (memory) can cripple an otherwise sophisticated processor. We have:
+- ✅ Excellent branch prediction (99.97% accuracy)
+- ✅ Large instruction windows (192 ROB, 64 IQ)
+- ✅ Out-of-order execution capabilities
+- ✅ Multiple functional units (6 IntAlu, 2 IntMult, 4 FloatAdd, etc.)
+- ❌ **But terrible memory performance**
+
+This creates a **99.4% performance loss** relative to peak - because of memory stalls, the processor commits roughly one instruction every 20 cycles, far slower than even a simple scalar pipeline.
+
+## Why This Matters
+
+This experiment perfectly demonstrates why modern processors invest so heavily in:
+- **Larger caches** (L3 caches, victim caches, non-inclusive hierarchies)
+- **Sophisticated prefetching** (hardware and software prefetching, stride predictors)
+- **Memory bandwidth** (DDR5, HBM, multiple memory channels)
+- **Memory hierarchy optimization** (NUMA, memory controllers, cache coherence)
+
+The CPU core is no longer the bottleneck - **memory system design is everything** in modern processors. This is why companies like Intel, AMD, and ARM spend enormous resources on memory subsystem optimization rather than just making the CPU core faster. The core is already fast enough - it's waiting for memory most of the time!
+
+## Conclusion
+
+The gem5 pipeline simulation experiments reveal a processor architecture that is fundamentally limited by memory system performance. Despite sophisticated out-of-order execution capabilities, branch prediction, and large instruction windows, the processor achieves only about 0.6-1% of its theoretical peak performance (IPC of 0.051-0.082 against a peak of 8) due to memory subsystem bottlenecks.
+
+The analysis demonstrates the critical importance of memory system design in modern processors. While the CPU core can theoretically execute 8 instructions per cycle, memory system limitations reduce actual performance to less than 0.1 instructions per cycle. This highlights the need for balanced system design where memory subsystem capabilities match processor core capabilities.
+
+**Key Takeaway:** For this workload, the memory system rather than the CPU core sets performance. Future work should focus on memory system optimizations, including larger caches, improved prefetching, and higher memory bandwidth, to unlock the full potential of the out-of-order execution engine.
+
+---
+
+*This analysis is based on gem5 simulation results using the DerivO3CPU model with realistic memory system timing. All performance metrics are derived from detailed cycle-accurate simulation data.*
--- a/pipelineSimulation/o3-baseline/config.ini
+++ b/pipelineSimulation/o3-baseline/config.ini
--- a/pipelineSimulation/o3-baseline/config.json
+++ b/pipelineSimulation/o3-baseline/config.json
--- a/pipelineSimulation/o3-baseline/fs/proc/cpuinfo
+++ b/pipelineSimulation/o3-baseline/fs/proc/cpuinfo
@@ -0,0 +1,19 @@
+processor       : 0
+vendor_id       : Generic
+cpu family      : 0
+model           : 0
+model name      : Generic
+stepping        : 0
+cpu MHz         : 2000.000
+cache size:     : 1024.0K
+physical id     : 0
+siblings        : 1
+core id         : 0
+cpu cores       : 1
+fpu             : yes
+fpu exception   : yes
+cpuid level     : 1
+wp              : yes
+flags           : fpu
+cache alignment : 64
+
--- a/pipelineSimulation/o3-baseline/fs/proc/stat
+++ b/pipelineSimulation/o3-baseline/fs/proc/stat
@@ -0,0 +1,2 @@
+cpu 0 0 0 0 0 0 0
+cpu0 0 0 0 0 0 0 0
--- a/pipelineSimulation/o3-baseline/fs/sys/devices/system/cpu/online
+++ b/pipelineSimulation/o3-baseline/fs/sys/devices/system/cpu/online
@@ -0,0 +1 @@
+0-0
--- a/pipelineSimulation/o3-baseline/fs/sys/devices/system/cpu/possible
+++ b/pipelineSimulation/o3-baseline/fs/sys/devices/system/cpu/possible
@@ -0,0 +1 @@
+0-0
--- a/pipelineSimulation/o3-baseline/simerr
+++ b/pipelineSimulation/o3-baseline/simerr
@@ -0,0 +1,13 @@
+warn: The `get_runtime_isa` function is deprecated. Please migrate away from using this function.
+warn: The se.py script is deprecated. It will be removed in future releases of  gem5.
+warn: The `get_runtime_isa` function is deprecated. Please migrate away from using this function.
+warn: No dot file generated. Please install pydot to generate the dot file and pdf.
+src/mem/dram_interface.cc:690: warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (512 Mbytes)
+src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
+system.remote_gdb: Listening for connections on port 7000
+src/sim/simulate.cc:194: info: Entering event queue @ 0.  Starting simulation...
+src/arch/x86/cpuid.cc:180: warn: x86 cpuid family 0x0000: unimplemented function 13
+src/sim/syscall_emul.cc:74: warn: ignoring syscall set_robust_list(...)
+src/sim/syscall_emul.cc:74: warn: ignoring syscall rseq(...)
+src/sim/mem_state.cc:443: info: Increasing stack size by one page.
+src/sim/syscall_emul.cc:74: warn: ignoring syscall mprotect(...)
--- a/pipelineSimulation/o3-baseline/simout
+++ b/pipelineSimulation/o3-baseline/simout
@@ -0,0 +1,13 @@
+Global frequency set at 1000000000000 ticks per second
+gem5 Simulator System.  https://www.gem5.org
+gem5 is copyrighted software; use the --copyright option for details.
+
+gem5 version 23.0.0.1
+gem5 compiled Aug 28 2025 18:18:37
+gem5 started Sep 20 2025 02:29:30
+gem5 executing on cargdevgpu, pid 2145614
+command line: /home/carlos/projects/gem5/gem5src/gem5/build/X86/gem5.opt --outdir=/home/carlos/projects/gem5/gem5-data/results/pipeline/o3-baseline /home/carlos/projects/gem5/gem5src/gem5/configs/deprecated/example/se.py --cmd=/home/carlos/projects/gem5/gem5-run/memtouch/memtouch --cpu-type=DerivO3CPU --cpu-clock=2GHz --sys-clock=2GHz --caches --l2cache --l1i_size=32kB --l1d_size=32kB --l2_size=1MB --maxinsts=200000000
+
+**** REAL SIMULATION ****
+sum=301989888
+Exiting @ tick 249692033000 because exiting with last active thread context
--- a/pipelineSimulation/o3-baseline/stats.txt
+++ b/pipelineSimulation/o3-baseline/stats.txt
--- a/pipelineSimulation/o3-trace/config.ini
+++ b/pipelineSimulation/o3-trace/config.ini
--- a/pipelineSimulation/o3-trace/config.json
+++ b/pipelineSimulation/o3-trace/config.json
--- a/pipelineSimulation/o3-trace/fs/proc/cpuinfo
+++ b/pipelineSimulation/o3-trace/fs/proc/cpuinfo
@@ -0,0 +1,19 @@
+processor       : 0
+vendor_id       : Generic
+cpu family      : 0
+model           : 0
+model name      : Generic
+stepping        : 0
+cpu MHz         : 2000.000
+cache size:     : 2048.0K
+physical id     : 0
+siblings        : 1
+core id         : 0
+cpu cores       : 1
+fpu             : yes
+fpu exception   : yes
+cpuid level     : 1
+wp              : yes
+flags           : fpu
+cache alignment : 64
+
--- a/pipelineSimulation/o3-trace/fs/proc/stat
+++ b/pipelineSimulation/o3-trace/fs/proc/stat
@@ -0,0 +1,2 @@
+cpu 0 0 0 0 0 0 0
+cpu0 0 0 0 0 0 0 0
--- a/pipelineSimulation/o3-trace/fs/sys/devices/system/cpu/online
+++ b/pipelineSimulation/o3-trace/fs/sys/devices/system/cpu/online
@@ -0,0 +1 @@
+0-0
--- a/pipelineSimulation/o3-trace/fs/sys/devices/system/cpu/possible
+++ b/pipelineSimulation/o3-trace/fs/sys/devices/system/cpu/possible
@@ -0,0 +1 @@
+0-0
--- a/pipelineSimulation/o3-trace/stats.txt
+++ b/pipelineSimulation/o3-trace/stats.txt
--- a/pipelineSimulation/pipeline_sim.sh
+++ b/pipelineSimulation/pipeline_sim.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+set -eu
+
+# --- Paths (match your cache script style) ---
+export GEM5=/home/carlos/projects/gem5/gem5src/gem5
+export RUN=/home/carlos/projects/gem5/gem5-run
+export OUTROOT=/home/carlos/projects/gem5/gem5-data/results
+
+# Workload (reuse your memtouch; swap for any x86 bin you like)
+CMD="$RUN/memtouch/memtouch"
+[ -x "$CMD" ] || CMD="/bin/ls"
+
+# Convenience
+BIN="$GEM5/build/X86/gem5.opt"
+SEPY="$GEM5/configs/deprecated/example/se.py"
+
+# ------------- 1) Baseline O3 run -------------
+OUT="$OUTROOT/pipeline/o3-baseline"
+mkdir -p "$OUT"
+"$BIN" \
+  --outdir="$OUT" \
+  "$SEPY" \
+  --cmd="$CMD" \
+  --cpu-type=DerivO3CPU \
+  --cpu-clock=2GHz --sys-clock=2GHz \
+  --caches --l2cache \
+  --l1i_size=32kB --l1d_size=32kB --l2_size=1MB \
+  --maxinsts=200000000
+
+echo "[baseline] stats: $OUT/stats.txt"
+awk '
+/simInsts/ {I=$2}
+/system\.cpu\.numCycles/ {C=$2}
+END{if(C>0) printf("Baseline IPC = %.3f  (insts=%s cycles=%s)\n", I/C, I, C)}' \
+  "$OUT/stats.txt"
+
+# ------------- 2) Cycle-by-cycle trace -------------
+OUT="$OUTROOT/pipeline/o3-trace"
+mkdir -p "$OUT"
+"$BIN" \
+  --outdir="$OUT" \
+  --debug-flags=O3CPU,Fetch,Decode,Rename,IEW,Commit,Branch,Activity \
+  --debug-file=pipe.trace \
+  "$SEPY" \
+  --cmd="$CMD" \
+  --cpu-type=DerivO3CPU \
+  --cpu-clock=2GHz --sys-clock=2GHz \
+  --caches --l2cache \
+  --maxinsts=5000000
+
+echo "[trace] debug trace: $OUT/pipe.trace"
+echo "[trace] quick peek:"
+grep -E 'Fetch|Decode|Rename|IEW|Commit|Branch' "$OUT/pipe.trace" | head -60
+
+echo "[trace] stage/queue highlights:"
+grep -E 'iq|ROB|LQ|SQ|idleCycles' "$OUT/stats.txt" | sed -n '1,200p'
+