# Matrix Alignment Prototype - HPC Performance Demonstration

This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

## Project Overview

This repository focuses on the practical implementation and benchmarking framework:

- **C Prototype**: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
- **Memory Alignment Comparison**: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
- **Benchmarking Framework**: Automated scripts for performance testing and visualization
- **Performance Analysis**: Tools for measuring and visualizing alignment effects across different matrix sizes

## Project Structure

```text
.
├── matrix_alignment_prototype.c   # C implementation demonstrating alignment effects
├── Makefile                       # Build configuration for C prototype
├── generate_plots.py              # Python script for performance visualization
├── run_benchmark_sizes.sh         # Automated benchmarking script
├── run_all_tests.sh               # Complete test suite orchestrator
├── benchmark_results.csv          # Collected performance data (generated)
├── requirements.txt               # Python dependencies
└── assets/                        # Generated plots and figures
```

## Building and Running

### Prerequisites

- GCC compiler with AVX support
- Python 3 with matplotlib and numpy
- LaTeX distribution (for report compilation)
- Make utility

### Compiling the C Prototype

```bash
make
```

This will compile `matrix_alignment_prototype.c` with AVX optimizations enabled.

### Running Benchmarks

Run the complete test suite:

```bash
./run_all_tests.sh
```

Or run benchmarks for specific matrix sizes:

```bash
./run_benchmark_sizes.sh
```

For CSV output:

```bash
./matrix_alignment_prototype -s 1024 --csv
```

### Generating Visualizations

After running benchmarks, generate plots:

```bash
python3 generate_plots.py
```

Plots will be saved in the `assets/` directory.

## Key Features

### Memory Alignment Demonstration

The prototype demonstrates:

- **Aligned version**: Uses 64-byte cache-line aligned memory with `_mm256_load_ps` (aligned SIMD loads)
- **Misaligned version**: Uses 16-byte aligned memory with `_mm256_loadu_ps` (unaligned SIMD loads)
- **Cache-blocked algorithm**: Implements tiled matrix multiplication to improve cache utilization
- **Performance variability analysis**: Measures and visualizes alignment effects across different matrix sizes

### Benchmarking Framework

The automated framework includes:

- Multiple matrix size testing (512, 1024, 1500, 2048)
- CSV data collection for reproducibility
- Python visualization generating multiple analysis plots
- Execution time, speedup ratio, variability, and GFLOPS metrics

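
The metrics listed above could be derived from raw timings roughly as follows. This is a hedged sketch, not the prototype's code: all function names are hypothetical, GFLOPS uses the standard 2n³ operation count for an n x n multiply, and the variability formula in particular is an assumption, since this README does not state how variability is defined.

```c
/* Floating-point operations in an n x n matrix multiply:
   n^3 multiplications plus n^3 additions. */
double matmul_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Throughput in GFLOPS for an n x n multiply that took `seconds`. */
double gflops(double n, double seconds)
{
    return matmul_flops(n) / (seconds * 1e9);
}

/* Speedup of the aligned run over the misaligned run
   (a value above 1 means the aligned run was faster). */
double speedup_ratio(double t_misaligned, double t_aligned)
{
    return t_misaligned / t_aligned;
}

/* ASSUMPTION: variability taken as the absolute time difference,
   in percent of the aligned time. The prototype may define it differently. */
double variability_percent(double t_aligned, double t_misaligned)
{
    double diff = t_misaligned - t_aligned;
    if (diff < 0.0)
        diff = -diff;
    return 100.0 * diff / t_aligned;
}
```

For example, a 1000 x 1000 multiply that takes 2 seconds performs 2 x 10⁹ operations, so `gflops(1000.0, 2.0)` evaluates to 1 GFLOPS.
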
## Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

- Peak variability of 6.6% at matrix size 512
- Size-dependent performance differences
- Architecture-sensitive alignment effects
- Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

## Technical Details

### Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

- **Aligned allocation**: Uses `posix_memalign()` to allocate memory aligned to 64-byte cache-line boundaries
- **Misaligned allocation**: Simulates C++ default 16-byte alignment by offsetting pointers from cache-line boundaries

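
The two strategies can be sketched roughly as follows. This is a minimal illustration, not the code in `matrix_alignment_prototype.c`: the helper names (`alloc_aligned`, `alloc_misaligned`, `is_aligned`) and the 16-byte offset are assumptions made for the example. Only `posix_memalign()` and the 64-byte boundary come from the description above.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* bytes; cache-line size assumed by this README */
#define MISALIGN   16 /* bytes; ASSUMED offset used to simulate 16-byte alignment */

/* Allocate n floats starting on a 64-byte boundary.
   Returns NULL on failure; release with free(). */
float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Allocate n floats that are 16-byte aligned but deliberately not
   64-byte aligned. *base receives the pointer that must be passed to free(). */
float *alloc_misaligned(size_t n, void **base)
{
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + MISALIGN) != 0)
        return NULL;
    return (float *)((char *)*base + MISALIGN);
}

/* Nonzero if the address p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Returning the original base pointer separately matters here: passing the offset pointer to `free()` would be undefined behavior.
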
### SIMD Optimizations

When compiled with AVX support, the implementation uses:

- `_mm256_load_ps()`: Aligned SIMD loads (require a 32-byte aligned address and fault otherwise)
- `_mm256_loadu_ps()`: Unaligned SIMD loads (work with any alignment, but can be slower, mainly when a load straddles a cache-line boundary)

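
The difference between the two intrinsics can be illustrated with a small reduction. This is a sketch under stated assumptions rather than the prototype's kernel: `sum_aligned` and `sum_unaligned` are hypothetical names, and the scalar fallback exists only so the example still builds when AVX is not enabled (for instance, without `-mavx`).

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum n floats using aligned loads. With AVX enabled, p must be
   32-byte aligned: _mm256_load_ps faults on any other address. */
float sum_aligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++) /* scalar tail, or the whole loop without AVX */
        total += p[i];
    return total;
}

/* Same reduction using unaligned loads: p may have any alignment. */
float sum_unaligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Calling `sum_aligned` on a pointer that is only 16-byte aligned would crash in an AVX build, which is why a misaligned buffer has to go through the `_mm256_loadu_ps` path.
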
### Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

- Tile size: 64x64 elements (16 KiB per tile for floats; each 64-float tile row spans 256 bytes, i.e. four 64-byte cache lines)
- Improves cache-line utilization
- Reduces memory bandwidth requirements
- Enables better spatial locality

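
A scalar sketch of the tiling idea is shown here for orientation. It is not the prototype's kernel, which also uses AVX loads: the function name `matmul_blocked`, the row-major layout, and the `ii`/`kk`/`jj` loop order are assumptions chosen for the example. Only the 64x64 tile size comes from the list above.

```c
#include <stddef.h>

#define TILE 64 /* tile edge in elements, matching the 64x64 tiles described above */

static size_t min_sz(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Accumulate C += A * B for row-major n x n float matrices.
   The caller zeroes C first to obtain the plain product. */
void matmul_blocked(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Unit-stride walk over a row of B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The `min_sz` clamps handle sizes that are not a multiple of 64, such as the 1500 case in the benchmark list. The three tiles touched at any moment total 3 x 16 KiB = 48 KiB of floats.
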
## Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

- Performance variability is size-dependent and architecture-sensitive
- Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
- Proper cache-line alignment reduces performance unpredictability

## Related Work

This prototype is based on analysis of OpenBLAS Issue #3879, which documented performance variability in matrix multiplication operations due to memory alignment. The implementation demonstrates similar variability patterns while providing a standalone, reproducible example.

## License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.