Initial commit: Matrix alignment prototype for HPC performance demonstration

- Add C implementation demonstrating memory alignment effects (matrix_alignment_prototype.c)
- Include cache-blocked matrix multiplication with AVX SIMD optimizations
- Add automated benchmarking framework (run_all_tests.sh, run_benchmark_sizes.sh)
- Add Python visualization scripts (generate_plots.py)
- Include Makefile for building with AVX support
- Add benchmark results and generated plots
- Add README with build and usage instructions
- Configure .gitignore for C/Python project files
143
README.md
Normal file
@@ -0,0 +1,143 @@
# Matrix Alignment Prototype - HPC Performance Demonstration

This repository contains a standalone C implementation that demonstrates how memory alignment affects performance in high-performance computing (HPC) workloads. The prototype investigates alignment-related performance variability, specifically the patterns described in OpenBLAS Issue #3879.

## Project Overview

This repository focuses on the practical implementation and benchmarking framework:

- **C Prototype**: Custom implementation demonstrating cache-blocked matrix multiplication with AVX SIMD optimizations
- **Memory Alignment Comparison**: Compares 64-byte cache-line aligned vs 16-byte aligned memory access patterns
- **Benchmarking Framework**: Automated scripts for performance testing and visualization
- **Performance Analysis**: Tools for measuring and visualizing alignment effects across different matrix sizes

## Project Structure

```text
.
├── matrix_alignment_prototype.c   # C implementation demonstrating alignment effects
├── Makefile                       # Build configuration for C prototype
├── generate_plots.py              # Python script for performance visualization
├── run_benchmark_sizes.sh         # Automated benchmarking script
├── run_all_tests.sh               # Complete test suite orchestrator
├── benchmark_results.csv          # Collected performance data (generated)
├── requirements.txt               # Python dependencies
└── assets/                        # Generated plots and figures
```

## Building and Running

### Prerequisites

- GCC compiler with AVX support
- Python 3 with matplotlib and numpy
- LaTeX distribution (for report compilation)
- Make utility

### Compiling the C Prototype

```bash
make
```

This will compile `matrix_alignment_prototype.c` with AVX optimizations enabled.

### Running Benchmarks

Run the complete test suite:

```bash
./run_all_tests.sh
```

Or run benchmarks for specific matrix sizes:

```bash
./run_benchmark_sizes.sh
```

For CSV output:

```bash
./matrix_alignment_prototype -s 1024 --csv
```

### Generating Visualizations

After running benchmarks, generate plots:

```bash
python3 generate_plots.py
```

Plots will be saved in the `assets/` directory.

## Key Features

### Memory Alignment Demonstration

The prototype demonstrates:

- **Aligned version**: Uses 64-byte cache-line aligned memory with `_mm256_load_ps` (aligned SIMD loads)
- **Misaligned version**: Uses memory that is only 16-byte aligned with `_mm256_loadu_ps` (unaligned SIMD loads)
- **Cache-blocked algorithm**: Implements tiled matrix multiplication for better cache utilization
- **Performance variability analysis**: Measures and visualizes alignment effects across different matrix sizes

### Benchmarking Framework

The automated framework includes:

- Multiple matrix size testing (512, 1024, 1500, 2048)
- CSV data collection for reproducibility
- Python visualization generating multiple analysis plots
- Execution time, speedup ratio, variability, and GFLOPS metrics

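The metrics listed above are not defined by formula here. The following is a minimal sketch of how they could be computed, assuming the conventional count of 2n³ floating-point operations for an n x n multiply. The function names and the relative-difference definition of variability are illustrative assumptions and are not taken from `matrix_alignment_prototype.c`.

```c
#include <assert.h>

/* GFLOPS for an n x n by n x n multiply that took `seconds`.
   Assumes the conventional count of 2*n^3 floating-point operations. */
static double gflops(double n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}

/* Speedup of the aligned run over the misaligned run.
   A value above 1 means the aligned run was faster. */
static double speedup(double t_misaligned, double t_aligned)
{
    return t_misaligned / t_aligned;
}

/* Assumed definition of variability: the absolute time difference
   between the two runs, relative to the aligned run, in percent. */
static double variability_pct(double t_misaligned, double t_aligned)
{
    double diff = t_misaligned - t_aligned;
    if (diff < 0.0)
        diff = -diff;
    return 100.0 * diff / t_aligned;
}
```

Under these definitions, a 1000 x 1000 multiply that takes 2 seconds runs at 1 GFLOPS, and run times of 2.5 s and 2.0 s give a variability of 25%.
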
## Results

The implementation demonstrates performance variability patterns consistent with OpenBLAS Issue #3879, showing:

- Peak variability of 6.6% at matrix size 512
- Size-dependent performance differences
- Architecture-sensitive alignment effects
- Reduced variability on modern hardware (Zen 3) compared to older architectures (Zen 2)

## Technical Details

### Memory Alignment Implementation

The prototype demonstrates two memory allocation strategies:

- **Aligned allocation**: Uses `posix_memalign()` to allocate memory aligned to 64-byte cache-line boundaries
- **Misaligned allocation**: Simulates the default 16-byte alignment of C++ allocations by offsetting pointers from cache-line boundaries

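The two strategies above can be sketched as follows. This is a minimal illustration, not the prototype's actual code: the helper names `alloc_aligned`, `alloc_offset`, and `is_aligned` are hypothetical, and the 16-byte offset is an assumption about how the misaligned case is produced.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes posix_memalign() */
#endif
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MISALIGN   16   /* assumed offset past the cache-line boundary */

/* Allocate n floats starting on a 64-byte boundary.
   Returns NULL on failure; release with free(). */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Allocate n floats starting 16 bytes past a 64-byte boundary.
   *base receives the pointer that must later be passed to free(). */
static float *alloc_offset(size_t n, void **base)
{
    *base = NULL;
    if (posix_memalign(base, CACHE_LINE, n * sizeof(float) + MISALIGN) != 0)
        return NULL;
    return (float *)((char *)*base + MISALIGN);
}

/* Returns nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A pointer returned by `alloc_offset` is 16-byte aligned but not 32-byte aligned, so it is only safe to read with unaligned 256-bit loads. The caller frees `*base`, not the offset pointer.
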
### SIMD Optimizations

When compiled with AVX support, the implementation uses:

- `_mm256_load_ps()`: Aligned SIMD loads (require a 32-byte-aligned address and fault otherwise)
- `_mm256_loadu_ps()`: Unaligned SIMD loads (work with any alignment, but can be slower when a load crosses a cache-line boundary)

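As a rough illustration of the two load intrinsics, the sketch below sums an array with each of them. The function names are hypothetical, and the `#ifdef __AVX__` guard with a scalar fallback is an assumption about how "when compiled with AVX support" might be realized; it is not taken from the prototype.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Sum n floats using aligned 256-bit loads.
   With AVX enabled, p must be 32-byte aligned or the load faults. */
static float sum_aligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)   /* scalar tail, or the whole array without AVX */
        total += p[i];
    return total;
}

/* Same reduction using unaligned loads; p may have any alignment. */
static float sum_unaligned(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef __AVX__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Passing a pointer such as `buf + 1` to `sum_aligned` would fault in an AVX build, which is why the misaligned variant has to use `_mm256_loadu_ps`.
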
### Cache-Blocked Algorithm

The matrix multiplication uses a tiled (cache-blocked) approach:

- Tile size: 64x64 elements (16 KiB per tile for floats; each 64-float tile row is 256 bytes, or four 64-byte cache lines)
- Maximizes cache line utilization
- Reduces memory bandwidth requirements by reusing tiles while they are resident in cache
- Enables better spatial locality

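A tiled multiply of this kind can be sketched in plain C as shown below. This is a scalar illustration under stated assumptions (row-major square matrices, accumulation into `C`, the hypothetical name `matmul_tiled`); the prototype's own loop order and SIMD inner kernel may differ.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stddef.h>

#define TILE 64   /* tile edge in elements */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, processed tile by tile.
   n does not have to be a multiple of TILE. */
static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_sz(ii + TILE, n);
                size_t k_end = min_sz(kk + TILE, n);
                size_t j_end = min_sz(jj + TILE, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Unit-stride walk over rows of B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The caller zeroes `C` before the call to obtain a plain product. The innermost loop walks contiguous rows of `B` and `C`, which is the access pattern a vectorized kernel would load eight floats at a time.
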
## Background

This implementation was developed to investigate performance variability patterns described in OpenBLAS Issue #3879, which reported up to 2x performance differences depending on memory alignment. The prototype demonstrates that:

- Performance variability is size-dependent and architecture-sensitive
- Modern CPUs (Zen 3) show reduced alignment sensitivity compared to older architectures (Zen 2)
- Proper cache-line alignment reduces performance unpredictability

## Related Work

This prototype is based on analysis of OpenBLAS Issue #3879, which documented performance variability in matrix multiplication operations due to memory alignment. The implementation demonstrates similar variability patterns while providing a standalone, reproducible example.

## License

This project is for educational and research purposes. Code implementations are provided for demonstration of HPC optimization principles related to memory alignment and SIMD vectorization.