Initial commit: Matrix alignment prototype for HPC performance demonstration

- Add C implementation demonstrating memory alignment effects (matrix_alignment_prototype.c)
- Include cache-blocked matrix multiplication with AVX SIMD optimizations
- Add automated benchmarking framework (run_all_tests.sh, run_benchmark_sizes.sh)
- Add Python visualization scripts (generate_plots.py)
- Include Makefile for building with AVX support
- Add benchmark results and generated plots
- Add README with build and usage instructions
- Configure .gitignore for C/Python project files
Carlos Gutierrez
2025-12-06 21:47:42 -05:00
commit ae258223ca
20 changed files with 1397 additions and 0 deletions

README.md

@@ -0,0 +1,143 @@
# Matrix Alignment Prototype - HPC Performance Demonstration
This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.
## Project Overview
This repository focuses on the practical implementation and benchmarking framework:
- **C Prototype**: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
- **Memory Alignment Comparison**: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
- **Benchmarking Framework**: Automated scripts for performance testing and visualization
- **Performance Analysis**: Tools for measuring and visualizing alignment effects across different matrix sizes
## Project Structure
```text
.
├── matrix_alignment_prototype.c # C implementation demonstrating alignment effects
├── Makefile # Build configuration for C prototype
├── generate_plots.py # Python script for performance visualization
├── run_benchmark_sizes.sh # Automated benchmarking script
├── run_all_tests.sh # Complete test suite orchestrator
├── benchmark_results.csv # Collected performance data (generated)
├── requirements.txt # Python dependencies
└── assets/ # Generated plots and figures
```
## Building and Running
### Prerequisites
- GCC compiler with AVX support
- Python 3 with matplotlib and numpy
- LaTeX distribution (for report compilation)
- Make utility
### Compiling the C Prototype
```bash
make
```
This will compile `matrix_alignment_prototype.c` with AVX optimizations enabled.
### Running Benchmarks
Run the complete test suite:
```bash
./run_all_tests.sh
```
Or run benchmarks for specific matrix sizes:
```bash
./run_benchmark_sizes.sh
```
To run the compiled binary directly for a single matrix size and print the results as CSV:
```bash
./matrix_alignment_prototype -s 1024 --csv
```
### Generating Visualizations
After running benchmarks, generate plots:
```bash
python3 generate_plots.py
```
Plots will be saved in the `assets/` directory.
## Key Features
### Memory Alignment Demonstration
The prototype demonstrates:
- **Aligned version**: Uses 64-byte cache-line aligned memory with `_mm256_load_ps` (aligned SIMD loads)
- **Misaligned version**: Uses 16-byte aligned memory with `_mm256_loadu_ps` (unaligned SIMD loads)
- **Cache-blocked algorithm**: Implements tiled matrix multiplication to improve cache reuse
- **Performance variability analysis**: Measures and visualizes alignment effects across different matrix sizes
### Benchmarking Framework
The automated framework includes:
- Multiple matrix size testing (512, 1024, 1500, 2048)
- CSV data collection for reproducibility
- Python visualization generating multiple analysis plots
- Execution time, speedup ratio, variability, and GFLOPS metrics
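The metrics listed above can be derived from two timings per matrix size. The following is a minimal sketch, not the code used by the benchmark scripts: the function names are hypothetical, the operation count assumes the standard 2·n³ floating-point operations for an n×n matrix multiply, and the variability formula (absolute relative time difference) is an assumption, since this README does not define it.

```c
/* GFLOPS for an n x n matrix multiply: 2*n^3 floating-point operations
 * (one multiply and one add per inner-loop step) divided by the run time. */
static double gflops(double n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}

/* Speedup of the aligned run over the misaligned run.
 * A value above 1 means the aligned version was faster. */
static double speedup(double t_aligned, double t_misaligned)
{
    return t_misaligned / t_aligned;
}

/* ASSUMPTION: variability taken as the absolute relative difference
 * between the two timings, in percent. The real scripts may differ. */
static double variability_pct(double t_aligned, double t_misaligned)
{
    double diff = t_misaligned - t_aligned;
    if (diff < 0.0)
        diff = -diff;
    return 100.0 * diff / t_aligned;
}
```

For example, a 1000x1000 multiply that takes 2 seconds corresponds to 1 GFLOPS under this definition.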
## Results
The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:
- Peak variability of 6.6% at matrix size 512
- Size-dependent performance differences
- Architecture-sensitive alignment effects
- Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)
## Technical Details
### Memory Alignment Implementation
The prototype demonstrates two memory allocation strategies:
- **Aligned allocation**: Uses `posix_memalign()` to allocate memory aligned to 64-byte cache-line boundaries
- **Misaligned allocation**: Simulates C++ default 16-byte alignment by offsetting pointers from cache-line boundaries
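The two strategies above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the prototype's actual code: the helper names (`alloc_aligned`, `alloc_misaligned`, `is_aligned`) are hypothetical, and the 16-byte offset is one way to obtain a pointer that is 16-byte aligned but not cache-line aligned.

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign in <stdlib.h> */
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* bytes; typical x86-64 cache-line size */
#define MISALIGN   16  /* bytes past a cache-line boundary (assumed offset) */

/* Allocate n floats starting exactly on a 64-byte boundary.
 * Returns NULL on failure; release with free(). */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Allocate n floats starting 16 bytes past a 64-byte boundary, so the data
 * is 16-byte aligned but neither 32- nor 64-byte aligned.
 * *base receives the pointer that must later be passed to free(). */
static float *alloc_misaligned(size_t n, void **base)
{
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + MISALIGN) != 0)
        return NULL;
    return (float *)((char *)*base + MISALIGN);
}

/* Return nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Note that the offset pointer cannot be passed to `free()` directly, which is why the sketch hands the original base pointer back to the caller.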
### SIMD Optimizations
When compiled with AVX support, the implementation uses:
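- `_mm256_load_ps()`: Aligned SIMD loads (require 32-byte alignment, so a load never straddles a cache line; an unaligned address can fault)
- `_mm256_loadu_ps()`: Unaligned SIMD loads (work with any alignment; can be slower when a load crosses a cache-line boundary)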
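To make the difference between the two intrinsics concrete, here is a small sketch that sums an array using either load. It is not the prototype's kernel: `sum_floats` and its `use_aligned` parameter are hypothetical, and a scalar fallback is included so the code still builds when AVX is not enabled at compile time.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sum n floats, 8 at a time when AVX is available.
 * use_aligned != 0 selects _mm256_load_ps, which requires every 8-float
 * chunk to start on a 32-byte boundary, so x itself must be 32-byte
 * aligned. use_aligned == 0 selects _mm256_loadu_ps, which accepts any
 * address. */
static float sum_floats(const float *x, size_t n, int use_aligned)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 v = use_aligned ? _mm256_load_ps(x + i)
                               : _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, v);
    }
    /* Reduce the 8 partial sums held in the vector accumulator. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#else
    (void)use_aligned;  /* scalar build: the flag has no effect */
#endif
    /* Scalar tail (or the whole array when AVX is unavailable). */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

Calling the aligned variant on a pointer that is only 16-byte aligned is a programming error, which is why the misaligned configuration has to use the unaligned load.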
### Cache-Blocked Algorithm
The matrix multiplication uses a tiled (cache-blocked) approach:
- Tile size: 64x64 elements (16 KiB per tile for floats; each 64-float tile row spans 256 bytes, or four 64-byte cache lines)
- Maximizes cache line utilization
- Reduces memory bandwidth requirements
- Enables better spatial locality
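A tiled multiplication along these lines can be sketched as follows. This is a plain scalar illustration of the blocking idea, assuming row-major square matrices; the function name `matmul_tiled` and the loop order are assumptions, and the SIMD inner kernel of the actual prototype is omitted.

```c
#include <stddef.h>

#define TILE 64  /* tile edge in elements, matching the 64x64 tile size */

/* C += A * B for row-major n x n matrices, processed tile by tile.
 * The caller must zero C beforehand if a plain product is wanted. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;

                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Unit-stride inner loop over rows of B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Keeping the innermost loop over `j` gives unit-stride access to both `B` and `C`, which is the access pattern that 8-wide AVX loads would operate on.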
## Background
This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:
- Performance variability is size-dependent and architecture-sensitive
- Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
- Proper cache-line alignment reduces performance unpredictability
## Related Work
This prototype is based on analysis of OpenBLAS Issue #3879, which documented performance variability in matrix multiplication operations due to memory alignment. The implementation demonstrates similar variability patterns while providing a standalone, reproducible example.
## License
This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.