# Pipeline Simulation
## Setup and configuration in gem5
To begin the project, a script was created to configure and launch the baseline pipeline in gem5. The simulation used the X86 DerivO3CPU model with a 2 GHz CPU and system clock, 32 KB L1 instruction and data caches, and a unified 1 MB L2 cache. The benchmark program was memtouch, run in syscall-emulation (SE) mode. The script defined output directories, cache parameters, and execution limits, ensuring a reproducible setup.
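The original script is not reproduced here, but a minimal sketch of such a configuration, following the structure of gem5's standard SE-mode examples, is shown below. The binary path, cache latencies, associativities, and DRAM model are assumptions; port names follow recent gem5 releases and may differ in older checkouts.

```python
# Sketch of an SE-mode gem5 config (run with: gem5.opt this_script.py).
# Cache latencies, associativities, and the memtouch path are assumptions.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = DerivO3CPU()  # out-of-order X86 pipeline model
system.membus = SystemXBar()

# 32 kB split L1 caches (classic cache model)
for name in ("icache", "dcache"):
    setattr(system.cpu, name,
            Cache(size="32kB", assoc=2, tag_latency=2, data_latency=2,
                  response_latency=2, mshrs=4, tgts_per_mshr=20))
system.cpu.icache.cpu_side = system.cpu.icache_port
system.cpu.dcache.cpu_side = system.cpu.dcache_port

# Unified 1 MB L2 behind a local crossbar
system.l2bus = L2XBar()
system.cpu.icache.mem_side = system.l2bus.cpu_side_ports
system.cpu.dcache.mem_side = system.l2bus.cpu_side_ports
system.l2cache = Cache(size="1MB", assoc=8, tag_latency=20, data_latency=20,
                       response_latency=20, mshrs=20, tgts_per_mshr=12)
system.l2cache.cpu_side = system.l2bus.mem_side_ports
system.l2cache.mem_side = system.membus.cpu_side_ports

# X86 interrupt plumbing and memory controller
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports
system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]),
                          port=system.membus.mem_side_ports)

binary = "path/to/memtouch"  # assumed location of the benchmark binary
system.workload = SEWorkload.init_compatible(binary)
process = Process(cmd=[binary])
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print("Exiting @ tick", m5.curTick(), "because", exit_event.getCause())
```

Statistics land in `m5out/stats.txt` by default, which is where the counters discussed below come from.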
## Output after running the command
The initial run produced a baseline performance snapshot. The measured IPC was ~0.05 (CPI ≈ 19.7), indicating extremely low throughput. Nearly 97% of cycles retired no instructions, showing that the pipeline was heavily stalled. Analysis of memory system statistics revealed an L1D miss rate close to 50%, with an average miss latency of ~78,000 ticks. These misses frequently propagated through the pipeline, creating bubbles and stalling progress. In contrast, branch prediction worked effectively, with a misprediction rate below 0.05%.
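The headline ratios follow directly from the raw counters reported in stats.txt; the snippet below recomputes them from the values above (plain Python, no gem5 required):

```python
# Recompute baseline headline metrics from the raw gem5 counters
sim_insts = 25_297_289    # simInsts
num_cycles = 498_254_810  # numCycles

ipc = sim_insts / num_cycles   # instructions per cycle
cpi = num_cycles / sim_insts   # cycles per instruction

print(f"IPC = {ipc:.3f}")  # ~0.051: pipeline mostly stalled
print(f"CPI = {cpi:.1f}")  # ~19.7: ~20 cycles per retired instruction

# MPKI = misses / (instructions / 1000); an L1D MPKI of ~124 therefore
# implies roughly 3.1M data-cache misses over the whole run.
l1d_mpki = 124
approx_l1d_misses = l1d_mpki * sim_insts / 1000
print(f"Approx. L1D misses = {approx_l1d_misses:,.0f}")
```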
## Cycle-by-cycle analysis of pipeline stages
Cycle-level tracing (via `--debug-flags=O3CPU,Fetch,Decode,Rename,IEW,Commit,Branch`) revealed how the pipeline stages interact. The traces show frequent stalls at IEW and Commit, triggered by long-latency load misses that blocked dependent instructions. Fetch and Decode also experienced backpressure as the backend filled, illustrating how memory bottlenecks propagate through the pipeline. Overall, the cycle-by-cycle view confirmed that the pipeline's performance limit was not execution width or branch handling, but structural and data hazards in the memory system.
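Debug traces of this kind are large, so summarizing them programmatically helps. The sketch below counts trace events per pipeline stage; gem5 trace lines have the shape `<tick>: <object>: <message>`, but the sample messages here are illustrative assumptions, not verbatim gem5 output.

```python
import re
from collections import Counter

# Illustrative fragment of an O3 debug trace; the exact messages vary by
# gem5 version, so these sample lines are assumptions for demonstration.
trace = """\
1000: system.cpu.fetch: Fetching cache line for addr 0x4005a0
1500: system.cpu.decode: Decoding instruction [sn:42]
2000: system.cpu.rename: Blocking due to full ROB
2500: system.cpu.iew: Load stalled on outstanding cache miss
3000: system.cpu.commit: Unable to commit, head inst not ready
3500: system.cpu.commit: Unable to commit, head inst not ready
"""

# Count trace events per pipeline stage to see where activity concentrates
stage_counts = Counter()
for line in trace.splitlines():
    m = re.match(r"\s*(\d+): system\.cpu\.(\w+):", line)
    if m:
        stage_counts[m.group(2)] += 1

print(dict(stage_counts))
```

Applied to a full trace, a skew toward repeated Commit "unable to commit" events is exactly the signature of the memory-bound stalls described above.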
|Metric|Value|Interpretation|
|---|---|---|
|Total Instructions (simInsts)|25,297,289|Benchmark retired ~25M instructions|
|Total Cycles (numCycles)|498,254,810|Simulation ran ~498M cycles|
|IPC|0.051|Very low throughput (pipeline mostly stalled)|
|CPI|19.74|Each instruction took ~20 cycles on average|
|Commit Histogram|~97% of cycles at 0 commits|Pipeline idle most of the time (waiting on memory)|
|L1I Miss Rate|~0%|Instruction cache well-behaved|
|L1D Miss Rate|49.8%|Half of all data accesses missed in L1|
|L1D MPKI|~124|Very high miss intensity (memory-bound workload)|
|Avg. L1D Miss Latency|~78,000 ticks|Memory stalls are extremely long|
|L2 Hit Ratio|66%|~1/3 of L2 accesses went on to DRAM|
|Branch Pred. Mispredict Rate|0.03%|Branching handled very well (not a bottleneck)|
|Instruction Mix|73% IntAlu, 27% memory ops|Heavy integer and memory workload|
- The baseline run shows severe memory bottlenecks: nearly half of L1D accesses miss, with miss penalties of tens of thousands of ticks.
- This results in IPC ≈ 0.05, with the pipeline committing zero instructions in ~97% of cycles.
- Branch prediction is highly accurate and does not contribute to stalls.
- The instruction mix is dominated by integer arithmetic and memory operations, with stores forming a large share.
- The cycle-by-cycle analysis confirms that structural and data hazards in the memory subsystem, not branch or execution resources, are the main performance limiter.