# Matrix Alignment Prototype - HPC Performance Demonstration

This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

## Project Overview

This repository focuses on the practical implementation and benchmarking framework:

- **C Prototype**: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
- **Memory Alignment Comparison**: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
- **Benchmarking Framework**: Automated scripts for performance testing and visualization
- **Performance Analysis**: Tools for measuring and visualizing alignment effects across different matrix sizes

## Project Structure

```text
.
├── matrix_alignment_prototype.c    # C implementation demonstrating alignment effects
├── Makefile                        # Build configuration for C prototype
├── generate_plots.py               # Python script for performance visualization
├── run_benchmark_sizes.sh          # Automated benchmarking script
├── run_all_tests.sh                # Complete test suite orchestrator
├── benchmark_results.csv           # Collected performance data (generated)
├── requirements.txt                # Python dependencies
└── assets/                         # Generated plots and figures
```

## Building and Running

### Prerequisites

- GCC compiler with AVX support
- Python 3 with matplotlib and numpy
- LaTeX distribution (for report compilation)
- Make utility

### Compiling the C Prototype

```bash
make
```

This compiles `matrix_alignment_prototype.c` with AVX optimizations enabled.

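
Since the build depends on AVX, it can help to confirm that both the compiler flags and the host CPU provide it. The following is a minimal sketch and not part of the repository: the helper names `avx_compiled_in`, `avx_cpu_available`, `check_avx_consistency`, and `report_avx` are hypothetical, and it assumes GCC or Clang, relying on the predefined `__AVX__` macro and the `__builtin_cpu_supports` builtin on x86 targets.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: 1 if this file was compiled with AVX enabled
 * (for example via -mavx), otherwise 0. */
static int avx_compiled_in(void) {
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if the CPU running this code reports AVX support.
 * The builtin only exists on x86 targets; elsewhere this sketch returns 0. */
static int avx_cpu_available(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

/* Debug-build guard: a binary built with AVX should only run on a CPU
 * that reports AVX. */
static void check_avx_consistency(void) {
    assert(!avx_compiled_in() || avx_cpu_available());
}

/* Print both results so a mismatch is easy to spot. */
static void report_avx(void) {
    printf("compiled with AVX: %s\n", avx_compiled_in() ? "yes" : "no");
    printf("CPU supports AVX:  %s\n", avx_cpu_available() ? "yes" : "no");
}
```

Calling `report_avx()` from a small `main` and compiling it with the same flags as the prototype shows whether AVX is enabled. Which flags the `Makefile` passes is not stated here, so check it directly.
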
### Running Benchmarks

Run the complete test suite:

```bash
./run_all_tests.sh
```

Or run benchmarks for specific matrix sizes:

```bash
./run_benchmark_sizes.sh
```

For CSV output:

```bash
./matrix_alignment_prototype -s 1024 --csv
```

### Generating Visualizations

After running benchmarks, generate plots:

```bash
python3 generate_plots.py
```

Plots will be saved in the `assets/` directory.

## Key Features

### Memory Alignment Demonstration

The prototype demonstrates:

- **Aligned version**: Uses 64-byte cache-line aligned memory with `_mm256_load_ps` (aligned SIMD loads)
- **Misaligned version**: Uses 16-byte aligned memory with `_mm256_loadu_ps` (unaligned SIMD loads)
- **Cache-blocked algorithm**: Implements tiled matrix multiplication for optimal cache utilization
- **Performance variability analysis**: Measures and visualizes alignment effects across different matrix sizes

### Benchmarking Framework

The automated framework includes:

- Multiple matrix size testing (512, 1024, 1500, 2048)
- CSV data collection for reproducibility
- Python visualization generating multiple analysis plots
- Execution time, speedup ratio, variability, and GFLOPS metrics

## Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

- Peak variability of 6.6% at matrix size 512
- Size-dependent performance differences
- Architecture-sensitive alignment effects
- Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

## Technical Details

### Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

- **Aligned allocation**: Uses `posix_memalign()` to allocate memory aligned to 64-byte cache-line boundaries
- **Misaligned allocation**: Simulates C++ default 16-byte alignment by offsetting pointers from cache-line boundaries

### SIMD Optimizations

When compiled with AVX support, the implementation uses:

- `_mm256_load_ps()`: Aligned SIMD loads (faster, requires
32-byte alignment)
- `_mm256_loadu_ps()`: Unaligned SIMD loads (works with any alignment; can be slower, notably when a load crosses a cache-line boundary)

### Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

- Tile size: 64x64 elements (256 bytes per tile row of floats, 16 KiB per full tile)
- Maximizes cache line utilization
- Reduces memory bandwidth requirements
- Enables better spatial locality

## Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

- Performance variability is size-dependent and architecture-sensitive
- Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
- Proper cache-line alignment reduces performance unpredictability

## Related Work

This prototype is based on analysis of OpenBLAS Issue #3879, which documented performance variability in matrix multiplication operations due to memory alignment. The implementation demonstrates similar variability patterns while providing a standalone, reproducible example.

## License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.
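
## Appendix: Allocation Sketch

The two allocation strategies listed under Technical Details can be sketched as follows. This is an illustration under stated assumptions, not the code in `matrix_alignment_prototype.c`: the helper names `is_aligned`, `alloc_aligned`, and `alloc_offset` are hypothetical, and the 16-byte offset is an assumed way of "offsetting pointers from cache-line boundaries". It needs a POSIX system for `posix_memalign()`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE   64  /* assumed cache-line size in bytes */
#define OFFSET_BYTES 16  /* assumed offset that breaks 32- and 64-byte alignment */

/* Returns 1 if p is a multiple of `alignment`, which must be a power of two. */
static int is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Allocates n floats starting on a 64-byte boundary.
 * Returns NULL on failure; release with free(). */
static float *alloc_aligned(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Allocates n floats starting OFFSET_BYTES past a 64-byte boundary, so the
 * result is 16-byte aligned but not 32- or 64-byte aligned.
 * *base receives the pointer that must later be passed to free(). */
static float *alloc_offset(size_t n, void **base) {
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + OFFSET_BYTES) != 0) {
        *base = NULL;
        return NULL;
    }
    return (float *)((char *)*base + OFFSET_BYTES);
}
```

A pointer from `alloc_offset()` is not 32-byte aligned, so it should only be read with `_mm256_loadu_ps()`. Passing it to `_mm256_load_ps()` can fault.
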