Carlos Gutierrez ae258223ca Initial commit: Matrix alignment prototype for HPC performance demonstration
- Add C implementation demonstrating memory alignment effects (matrix_alignment_prototype.c)
- Include cache-blocked matrix multiplication with AVX SIMD optimizations
- Add automated benchmarking framework (run_all_tests.sh, run_benchmark_sizes.sh)
- Add Python visualization scripts (generate_plots.py)
- Include Makefile for building with AVX support
- Add benchmark results and generated plots
- Add README with build and usage instructions
- Configure .gitignore for C/Python project files
2025-12-06 21:47:42 -05:00


Matrix Alignment Prototype - HPC Performance Demonstration

This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

Project Overview

This repository focuses on the practical implementation and benchmarking framework:

  • C Prototype: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
  • Memory Alignment Comparison: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
  • Benchmarking Framework: Automated scripts for performance testing and visualization
  • Performance Analysis: Tools for measuring and visualizing alignment effects across different matrix sizes

Project Structure

.
├── matrix_alignment_prototype.c   # C implementation demonstrating alignment effects
├── Makefile                       # Build configuration for C prototype
├── generate_plots.py              # Python script for performance visualization
├── run_benchmark_sizes.sh         # Automated benchmarking script
├── run_all_tests.sh               # Complete test suite orchestrator
├── benchmark_results.csv          # Collected performance data (generated)
├── requirements.txt               # Python dependencies
└── assets/                        # Generated plots and figures

Building and Running

Prerequisites

  • GCC compiler with AVX support
  • Python 3 with matplotlib and numpy
  • LaTeX distribution (for report compilation)
  • Make utility

Compiling the C Prototype

make

This will compile matrix_alignment_prototype.c with AVX optimizations enabled.

Running Benchmarks

Run the complete test suite:

./run_all_tests.sh

Or run benchmarks for specific matrix sizes:

./run_benchmark_sizes.sh

For CSV output:

./matrix_alignment_prototype -s 1024 --csv

Generating Visualizations

After running benchmarks, generate plots:

python3 generate_plots.py

Plots will be saved in the assets/ directory.

Key Features

Memory Alignment Demonstration

The prototype demonstrates:

  • Aligned version: Uses 64-byte cache-line aligned memory with _mm256_load_ps (aligned SIMD loads)
  • Misaligned version: Uses 16-byte aligned memory with _mm256_loadu_ps (unaligned SIMD loads)
  • Cache-blocked algorithm: Implements tiled matrix multiplication so that each tile is reused while it is still in cache
  • Performance variability analysis: Measures and visualizes alignment effects across different matrix sizes

Benchmarking Framework

The automated framework includes:

  • Multiple matrix size testing (512, 1024, 1500, 2048)
  • CSV data collection for reproducibility
  • Python visualization generating multiple analysis plots
  • Execution time, speedup ratio, variability, and GFLOPS metrics
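
The metrics listed above can be derived from two timings per matrix size. The following is a minimal sketch, not the repository's code: the function names are hypothetical, GFLOPS uses the usual 2n^3 operation count for an n x n matrix multiplication, and "variability" is assumed here to mean the relative time difference between the two versions in percent (the prototype may define it differently).

```c
#include <assert.h>
#include <stddef.h>

/* GFLOPS for an n x n matrix multiply, counting 2*n^3 floating-point
 * operations (one multiply and one add per inner-loop step). */
static double gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double ops = 2.0 * (double)n * (double)n * (double)n;
    return ops / seconds / 1e9;
}

/* Speedup of the aligned version over the misaligned one (>1 means faster). */
static double speedup(double t_misaligned, double t_aligned)
{
    assert(t_aligned > 0.0);
    return t_misaligned / t_aligned;
}

/* ASSUMPTION: variability as the absolute relative time difference,
 * in percent, measured against the aligned run. */
static double variability_pct(double t_aligned, double t_misaligned)
{
    assert(t_aligned > 0.0);
    double diff = t_misaligned - t_aligned;
    if (diff < 0.0)
        diff = -diff;
    return 100.0 * diff / t_aligned;
}
```

For example, a 1024 x 1024 multiplication that takes one second corresponds to about 2.15 GFLOPS under this operation count.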

Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

  • Peak variability of 6.6% at matrix size 512
  • Size-dependent performance differences
  • Architecture-sensitive alignment effects
  • Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

Technical Details

Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

  • Aligned allocation: Uses posix_memalign() to allocate memory aligned to 64-byte cache-line boundaries
  • Misaligned allocation: Simulates the 16-byte alignment typical of default allocators (malloc, C++ operator new) on x86-64 by offsetting pointers from cache-line boundaries
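
The two strategies can be sketched as follows. This is an illustration under stated assumptions, not the code in matrix_alignment_prototype.c: the helper names (alloc_aligned, alloc_offset, is_aligned) and the 16-byte offset constant are hypothetical, and a 64-byte cache line is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */
#define MISALIGN   16  /* assumed offset: 16-byte aligned, not 64-byte aligned */

/* Allocate n floats starting on a 64-byte boundary; release with free(). */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Allocate n floats that start 16 bytes past a 64-byte boundary.
 * *base receives the pointer that must later be passed to free(). */
static float *alloc_offset(size_t n, void **base)
{
    assert(base != NULL);
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + MISALIGN) != 0)
        return NULL;
    return (float *)((char *)*base + MISALIGN);
}

/* Nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A pointer returned by alloc_offset is 16-byte aligned but neither 32-byte nor 64-byte aligned, so it is only safe to read with unaligned AVX loads. Note that the offset pointer itself must never be passed to free(); the base pointer is kept for that purpose.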

SIMD Optimizations

When compiled with AVX support, the implementation uses:

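  • _mm256_load_ps(): Aligned SIMD load (requires a 32-byte-aligned address; the underlying aligned-move instruction faults on a misaligned address)
  • _mm256_loadu_ps(): Unaligned SIMD load (works with any alignment; can be slower when a load straddles a cache-line boundary)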
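
To illustrate the difference between the two intrinsics, here is a minimal sketch that sums an array eight floats at a time. It is not taken from the prototype: sum_floats is a hypothetical helper, and the runtime alignment check is only there to show which load is legal for a given pointer. When the compiler is not invoked with AVX enabled (for GCC, -mavx), the sketch falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum n floats, where n is a multiple of 8. The aligned load is used
 * only when x is 32-byte aligned; otherwise the unaligned load is used. */
static float sum_floats(const float *x, size_t n)
{
    assert(n % 8 == 0);
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    if ((uintptr_t)x % 32 == 0) {
        /* x + i stays 32-byte aligned because i advances by 8 floats. */
        for (size_t i = 0; i < n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_load_ps(x + i));
    } else {
        for (size_t i = 0; i < n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    }
    /* Reduce the 8 lanes of the accumulator to a single float. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    return total;
#else
    /* Scalar fallback when AVX is not enabled at compile time. */
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
#endif
}
```

Both branches compute the same result; only the load instruction differs, which is what makes this kind of kernel suitable for isolating alignment effects.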

Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

  • Tile size: 64x64 elements (16 KiB per tile for floats; each 64-float row is 256 bytes, i.e. four 64-byte cache lines)
  • Maximizes cache line utilization
  • Reduces memory bandwidth requirements
  • Enables better spatial locality
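
The tiled approach can be sketched in scalar C as shown below. This is a simplified illustration, not the prototype's kernel: matmul_tiled and min_sz are hypothetical names, matrices are assumed to be square and row-major, and the AVX inner loop is omitted so the blocking structure stays visible.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64  /* tile edge in elements, matching the 64x64 tiles above */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n float matrices, processed tile by tile.
 * The caller is expected to zero-initialize C beforehand. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = min_sz(ii + TILE, n);
                size_t k_end = min_sz(kk + TILE, n);
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Innermost loop walks contiguous rows of B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The i-k-j loop order inside each tile keeps the innermost accesses to B and C contiguous, which is the access pattern that SIMD loads of eight floats would then operate on.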

Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

  • Performance variability is size-dependent and architecture-sensitive
  • Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
  • Proper cache-line alignment reduces performance unpredictability

The implementation shows similar variability patterns while providing a standalone, reproducible example.

License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.