This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

Project Overview

This repository focuses on the practical implementation and benchmarking framework:

C Prototype: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
Memory Alignment Comparison: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
Benchmarking Framework: Automated scripts for performance testing and visualization
Performance Analysis: Tools for measuring and visualizing alignment effects across different matrix sizes

Project Structure

.
├── matrix_alignment_prototype.c   # C implementation demonstrating alignment effects
├── Makefile                       # Build configuration for C prototype
├── generate_plots.py              # Python script for performance visualization
├── run_benchmark_sizes.sh         # Automated benchmarking script
├── run_all_tests.sh               # Complete test suite orchestrator
├── benchmark_results.csv          # Collected performance data (generated)
├── requirements.txt               # Python dependencies
└── assets/                        # Generated plots and figures

Building and Running

Prerequisites

GCC compiler with AVX support
Python 3 with matplotlib and numpy
LaTeX distribution (for report compilation)
Make utility

Compiling the C Prototype

make

This will compile matrix_alignment_prototype.c with AVX optimizations enabled.

Running Benchmarks

Run the complete test suite:

./run_all_tests.sh

Or run benchmarks for specific matrix sizes:

./run_benchmark_sizes.sh

For CSV output:

./matrix_alignment_prototype -s 1024 --csv

Generating Visualizations

After running benchmarks, generate plots:

python3 generate_plots.py

Plots will be saved in the assets/ directory.

Key Features

Memory Alignment Demonstration

The prototype demonstrates:

Aligned version: Uses 64-byte cache-line aligned memory with _mm256_load_ps (aligned SIMD loads)
Misaligned version: Uses 16-byte aligned memory with _mm256_loadu_ps (unaligned SIMD loads)
Cache-blocked algorithm: Implements tiled matrix multiplication to improve cache utilization
Performance variability analysis: Measures and visualizes alignment effects across different matrix sizes

Benchmarking Framework

The automated framework includes:

Multiple matrix size testing (512, 1024, 1500, 2048)
CSV data collection for reproducibility
Python visualization generating multiple analysis plots
Execution time, speedup ratio, variability, and GFLOPS metrics
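
As a rough illustration of how two of these metrics are commonly derived from a measured execution time, the sketch below uses the conventional operation count for an n x n matrix multiplication (2 * n^3 floating-point operations) and defines speedup as misaligned time divided by aligned time. Both definitions and the helper names are assumptions for illustration; the prototype's own formulas may differ.

```c
#include <stddef.h>

/* Hypothetical helper: GFLOPS for an n x n matrix multiplication, assuming
 * the conventional count of n^3 multiplies plus n^3 adds. */
static double gemm_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Hypothetical helper: a ratio above 1.0 means the aligned run was faster. */
static double speedup_ratio(double t_misaligned, double t_aligned)
{
    return t_misaligned / t_aligned;
}
```

For example, a 1000 x 1000 multiplication that takes 2 seconds corresponds to 1 GFLOPS under this operation count.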

Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

Peak variability of 6.6% at matrix size 512
Size-dependent performance differences
Architecture-sensitive alignment effects
Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

Technical Details

Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

Aligned allocation: Uses posix_memalign() to allocate memory aligned to 64-byte cache-line boundaries
Misaligned allocation: Simulates C++ default 16-byte alignment by offsetting pointers from cache-line boundaries
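
The two strategies above can be sketched roughly as follows. This is a minimal sketch, not the code in matrix_alignment_prototype.c: the helper names alloc_aligned, alloc_offset, and is_aligned are hypothetical, and the 16-byte offset in the usage note is an assumption about how the misaligned case is produced.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign() in strict modes */
#endif
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Allocate n floats on a 64-byte cache-line boundary; NULL on failure. */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return p;
}

/* Allocate n floats starting `offset` bytes past a cache-line boundary.
 * *base receives the pointer that must later be passed to free(). */
static float *alloc_offset(size_t n, size_t offset, void **base)
{
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + offset) != 0)
        return NULL;
    return (float *)((char *)*base + offset);
}

/* Nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With an offset of 16, the returned pointer is 16-byte aligned but neither 32-byte nor 64-byte aligned, so it must only be read with unaligned loads. Free the original base pointer, not the offset pointer.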

SIMD Optimizations

When compiled with AVX support, the implementation uses:

_mm256_load_ps(): Aligned SIMD loads (require a 32-byte aligned address and fault otherwise)
_mm256_loadu_ps(): Unaligned SIMD loads (work with any alignment; can be slower when a load straddles a cache-line boundary)
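
To make the difference between the two intrinsics concrete, here is a small sketch that sums an array with each load variant. The function names are hypothetical and the reduction is only an illustration, not the prototype's kernel. The AVX path is guarded by __AVX__ (defined by GCC when building with -mavx), with a scalar fallback otherwise.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum n floats using aligned loads. Precondition: p is 32-byte aligned. */
static float sum_aligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(p + i));  /* aligned load */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)  /* scalar tail, or the whole array without AVX */
        total += p[i];
    return total;
}

/* Same reduction with unaligned loads; p may have any alignment. */
static float sum_unaligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));  /* unaligned load */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Passing a pointer that is only 16-byte aligned to sum_aligned would fault in the AVX path, which is why the misaligned variant has to use the unaligned load.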

Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

Tile size: 64x64 elements (16 KiB per tile for floats; each 64-float row is 256 bytes, or four 64-byte cache lines)
Maximizes cache line utilization
Reduces memory bandwidth requirements
Enables better spatial locality
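
A tiled multiplication along these lines might look like the sketch below. It is a scalar illustration under assumptions (row-major square matrices, an i-k-j loop order inside each tile, a hypothetical function name), not the prototype's AVX kernel.

```c
#include <stddef.h>

#define TILE 64  /* tile edge in elements, matching the 64x64 tiles in the text */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n float matrices; zero C before the call. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Innermost loop walks B and C rows contiguously. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Each tile of A, B, and C is reused while it is still cache-resident, which is where the reduction in memory traffic comes from.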

Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

Performance variability is size-dependent and architecture-sensitive
Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
Proper cache-line alignment reduces performance unpredictability

The prototype is based on analysis of the issue rather than on the OpenBLAS code itself. It exhibits similar variability patterns while remaining a standalone, reproducible example.

License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.

Matrix Alignment Prototype - HPC Performance Demonstration