Matrix Alignment Prototype - HPC Performance Demonstration

This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

Project Overview

This repository focuses on the practical implementation and benchmarking framework:

  • C Prototype: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
  • Memory Alignment Comparison: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
  • Benchmarking Framework: Automated scripts for performance testing and visualization
  • Performance Analysis: Tools for measuring and visualizing alignment effects across different matrix sizes

Project Structure

.
├── matrix_alignment_prototype.c   # C implementation demonstrating alignment effects
├── Makefile                        # Build configuration for C prototype
├── generate_plots.py               # Python script for performance visualization
├── run_benchmark_sizes.sh         # Automated benchmarking script
├── run_all_tests.sh               # Complete test suite orchestrator
├── benchmark_results.csv          # Collected performance data (generated)
├── requirements.txt               # Python dependencies
└── assets/                        # Generated plots and figures

Building and Running

Prerequisites

  • GCC compiler with AVX support
  • Python 3 with matplotlib and numpy
  • LaTeX distribution (for report compilation)
  • Make utility

Compiling the C Prototype

make

This will compile matrix_alignment_prototype.c with AVX optimizations enabled.

Running Benchmarks

Run the complete test suite:

./run_all_tests.sh

Or run benchmarks for specific matrix sizes:

./run_benchmark_sizes.sh

For CSV output:

./matrix_alignment_prototype -s 1024 --csv

Generating Visualizations

After running benchmarks, generate plots:

python3 generate_plots.py

Plots will be saved in the assets/ directory.

Key Features

Memory Alignment Demonstration

The prototype demonstrates:

  • Aligned version: Uses 64-byte cache-line aligned memory with _mm256_load_ps (aligned SIMD loads)
  • Misaligned version: Uses 16-byte aligned memory with _mm256_loadu_ps (unaligned SIMD loads)
  • Cache-blocked algorithm: Implements tiled matrix multiplication for optimal cache utilization
  • Performance variability analysis: Measures and visualizes alignment effects across different matrix sizes

Benchmarking Framework

The automated framework includes:

  • Multiple matrix size testing (512, 1024, 1500, 2048)
  • CSV data collection for reproducibility
  • Python visualization generating multiple analysis plots
  • Execution time, speedup ratio, variability, and GFLOPS metrics
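
The metrics listed above can be derived from measured run times roughly as follows. This is a minimal sketch, not the code in matrix_alignment_prototype.c: the function names are hypothetical, and the definitions of speedup (misaligned time divided by aligned time) and variability (relative time difference in percent) are assumptions, since this README does not spell them out. The GFLOPS figure uses the standard 2*n^3 operation count for an n x n matrix multiplication.

```c
/* Throughput of an n x n matrix multiply that took `seconds`:
 * 2*n^3 floating-point operations (one multiply and one add per term). */
static double gflops(double n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}

/* ASSUMED definition: how many times faster the aligned run is. */
static double speedup(double t_misaligned, double t_aligned)
{
    return t_misaligned / t_aligned;
}

/* ASSUMED definition: absolute time difference relative to the
 * aligned run, expressed in percent. */
static double variability_pct(double t_misaligned, double t_aligned)
{
    double diff = t_misaligned - t_aligned;
    if (diff < 0.0)
        diff = -diff;
    return diff / t_aligned * 100.0;
}
```

For example, a 1000x1000 multiplication that completes in 2 seconds corresponds to 1 GFLOPS under this operation count.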

Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

  • Peak variability of 6.6% at matrix size 512
  • Size-dependent performance differences
  • Architecture-sensitive alignment effects
  • Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

Technical Details

Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

  • Aligned allocation: Uses posix_memalign() to allocate memory aligned to 64-byte cache-line boundaries
  • Misaligned allocation: Simulates the 16-byte alignment that default C++ allocation typically provides on x86-64, by offsetting pointers from cache-line boundaries
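
The two allocation strategies above could look something like the sketch below. The helper names (alloc_aligned, alloc_offset16, is_aligned) are hypothetical and are not taken from matrix_alignment_prototype.c. The 16-byte offset is one plausible way to obtain a pointer that is 16-byte aligned but not cache-line aligned; the prototype may do this differently.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign() under strict modes */
#endif
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Allocate n floats starting on a 64-byte cache-line boundary.
 * Returns NULL on failure; release with free(). */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Hypothetical helper: return n floats that are 16-byte aligned but
 * deliberately NOT 64-byte aligned, by stepping 16 bytes past a
 * cache-line boundary. *base receives the pointer to pass to free(). */
static float *alloc_offset16(size_t n, void **base)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float) + 16) != 0)
        return NULL;
    *base = p;
    return (float *)((char *)p + 16);
}

/* Non-zero if address p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Note that the offset pointer must never be passed to free() directly; only the base pointer returned by posix_memalign() is valid for that.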

SIMD Optimizations

When compiled with AVX support, the implementation uses:

  • _mm256_load_ps(): Aligned SIMD load (requires a 32-byte aligned address; a misaligned address can fault)
  • _mm256_loadu_ps(): Unaligned SIMD load (works with any alignment; can be slower, mainly when a load straddles a cache-line boundary)
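
To illustrate the difference between the two load intrinsics, here is a small sketch that sums an array with either one. The function sum_floats and its use_aligned parameter are hypothetical, and the scalar fallback under #ifdef __AVX__ is an assumption about how a build without AVX might be handled, not a description of the prototype's code.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum n floats, where n is assumed to be a multiple of 8.
 * If use_aligned is non-zero, x must be 32-byte aligned. */
static float sum_floats(const float *x, size_t n, int use_aligned)
{
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        /* Same data, two load instructions: aligned vs. unaligned. */
        __m256 v = use_aligned ? _mm256_load_ps(x + i)
                               : _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, v);
    }
    /* Reduce the 8 partial sums held in the vector register. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    return total;
#else
    /* Scalar fallback when the compiler is not targeting AVX. */
    (void)use_aligned;
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
#endif
}
```

With GCC, the AVX path is compiled in when the build passes a flag such as -mavx; both paths return the same sum, so the choice of load affects timing rather than results.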

Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

  • Tile size: 64x64 elements (for floats, 256 bytes per tile row and 16 KiB per tile)
  • Maximizes cache line utilization
  • Reduces memory bandwidth requirements
  • Enables better spatial locality
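
The tiled approach described above can be sketched as follows. This is an illustrative scalar version under stated assumptions (square row-major matrices, a caller-zeroed output, hypothetical function names); the prototype's actual kernel, which combines tiling with AVX loads, may be organized differently.

```c
#include <stddef.h>

#define TILE 64  /* tile edge in elements, matching the 64x64 tile above */

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* C += A * B for row-major n x n matrices, processed tile by tile.
 * The caller is expected to zero-initialise C beforehand. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_size(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_size(jj + TILE, n);
                /* Work on one tile triple; its data is reused while
                 * it is still resident in cache. */
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        /* Innermost loop walks rows of B and C
                         * contiguously, which favours spatial locality. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
        }
    }
}
```

The min_size() clamps let the same routine handle sizes that are not multiples of 64, such as the 1500 case in the benchmark list.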

Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

  • Performance variability is size-dependent and architecture-sensitive
  • Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
  • Proper cache-line alignment reduces performance unpredictability

The implementation reproduces similar variability patterns in a standalone, reproducible example.

License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.
