Initial commit: Divide-and-conquer sorting algorithms benchmark
- Implement Merge Sort and Quick Sort algorithms with instrumentation - Add Quick Sort pivot strategies: first, last, median_of_three, random - Create dataset generators for 5 dataset types (sorted, reverse, random, nearly_sorted, duplicates_heavy) - Build comprehensive benchmarking CLI with metrics collection - Add performance measurement (time, memory, comparisons, swaps) - Configure logging with rotating file handlers - Generate plots for time and memory vs size - Include comprehensive test suite with pytest - Add full documentation in README.md
This commit is contained in:
368
README.md
Normal file
368
README.md
Normal file
@@ -0,0 +1,368 @@
|
||||
# Divide-and-Conquer Sorting Algorithms Benchmark
|
||||
|
||||
A comprehensive Python project for benchmarking merge sort and quick sort algorithms across different dataset types and sizes, with detailed performance metrics, logging, and visualization.
|
||||
|
||||
## Project Overview
|
||||
|
||||
This project implements two divide-and-conquer sorting algorithms (Merge Sort and Quick Sort) and provides a benchmarking framework to evaluate their performance across various dataset characteristics:
|
||||
|
||||
- **Merge Sort**: Stable, O(n log n) worst-case time complexity
|
||||
- **Quick Sort**: In-place, O(n log n) average-case with configurable pivot strategies
|
||||
|
||||
The benchmark suite measures:
|
||||
- Wall-clock time (using `time.perf_counter`)
|
||||
- Peak memory usage (using `tracemalloc` and `psutil`)
|
||||
- Comparison and swap counts (when instrumentation is enabled)
|
||||
- Correctness verification (comparing against Python's `sorted()`)
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
.
|
||||
├── src/
|
||||
│ ├── algorithms/
|
||||
│ │ ├── merge_sort.py # Merge sort implementation
|
||||
│ │ └── quick_sort.py # Quick sort with pivot strategies
|
||||
│ └── bench/
|
||||
│ ├── benchmark.py # Main CLI benchmark runner
|
||||
│ ├── datasets.py # Dataset generators
|
||||
│ ├── metrics.py # Performance measurement utilities
|
||||
│ └── logging_setup.py # Logging configuration
|
||||
├── tests/
|
||||
│ └── test_sorts.py # Comprehensive test suite
|
||||
├── scripts/
|
||||
│ └── run_benchmarks.sh # Convenience script to run benchmarks
|
||||
├── results/ # Auto-created: CSV, JSON, logs
|
||||
├── plots/ # Auto-created: PNG visualizations
|
||||
├── pyproject.toml # Project configuration and dependencies
|
||||
├── .gitignore
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Installation
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Python 3.8 or higher
|
||||
- pip (Python package manager)
|
||||
|
||||
### Setup
|
||||
|
||||
1. Clone the repository:
|
||||
```bash
|
||||
git clone <repository-url>
|
||||
cd divide-and-conquer-analysis
|
||||
```
|
||||
|
||||
2. Install dependencies:
|
||||
```bash
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
Or install from `pyproject.toml`:
|
||||
```bash
|
||||
pip install -e ".[dev]" # Includes dev dependencies (mypy, ruff, black)
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Run a Simple Benchmark
|
||||
|
||||
```bash
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms merge,quick \
|
||||
--datasets sorted,reverse,random \
|
||||
--sizes 1000,5000,10000 \
|
||||
--runs 5 \
|
||||
--seed 42 \
|
||||
--instrument \
|
||||
--make-plots
|
||||
```
|
||||
|
||||
### Use the Convenience Script
|
||||
|
||||
```bash
|
||||
./scripts/run_benchmarks.sh
|
||||
```
|
||||
|
||||
## CLI Usage
|
||||
|
||||
The benchmark CLI (`src.bench.benchmark`) supports the following arguments:
|
||||
|
||||
### Required Arguments
|
||||
|
||||
None (all have defaults)
|
||||
|
||||
### Optional Arguments
|
||||
|
||||
- `--algorithms`: Comma-separated list of algorithms to benchmark
|
||||
- Options: `merge`, `quick`
|
||||
- Default: `merge,quick`
|
||||
|
||||
- `--pivot`: Pivot strategy for Quick Sort
|
||||
- Options: `first`, `last`, `median_of_three`, `random`
|
||||
- Default: `random`
|
||||
|
||||
- `--datasets`: Comma-separated list of dataset types
|
||||
- Options: `sorted`, `reverse`, `random`, `nearly_sorted`, `duplicates_heavy`
|
||||
- Default: `sorted,reverse,random,nearly_sorted,duplicates_heavy`
|
||||
|
||||
- `--sizes`: Comma-separated list of dataset sizes
|
||||
- Default: `1000,5000,10000,50000`
|
||||
- Example: `--sizes 1000,5000,10000,50000,100000`
|
||||
|
||||
- `--runs`: Number of runs per experiment (for statistical significance)
|
||||
- Default: `5`
|
||||
|
||||
- `--seed`: Random seed for reproducibility
|
||||
- Default: `42`
|
||||
|
||||
- `--outdir`: Output directory for results
|
||||
- Default: `results`
|
||||
|
||||
- `--log-level`: Logging level
|
||||
- Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`
|
||||
- Default: `INFO`
|
||||
|
||||
- `--instrument`: Enable counting of comparisons and swaps
|
||||
- Flag (no value)
|
||||
|
||||
- `--make-plots`: Generate plots after benchmarking
|
||||
- Flag (no value)
|
||||
|
||||
### Example CLI Commands
|
||||
|
||||
**Basic benchmark with default settings:**
|
||||
```bash
|
||||
python -m src.bench.benchmark
|
||||
```
|
||||
|
||||
**Full benchmark with all options:**
|
||||
```bash
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms merge,quick \
|
||||
--pivot random \
|
||||
--datasets sorted,reverse,random,nearly_sorted,duplicates_heavy \
|
||||
--sizes 1000,5000,10000,50000 \
|
||||
--runs 5 \
|
||||
--seed 42 \
|
||||
--instrument \
|
||||
--outdir results \
|
||||
--log-level INFO \
|
||||
--make-plots
|
||||
```
|
||||
|
||||
**Compare pivot strategies:**
|
||||
```bash
|
||||
for pivot in first last median_of_three random; do
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms quick \
|
||||
--pivot $pivot \
|
||||
--datasets random \
|
||||
--sizes 10000,50000 \
|
||||
--runs 10 \
|
||||
--seed 42
|
||||
done
|
||||
```
|
||||
|
||||
**Quick performance check:**
|
||||
```bash
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms merge,quick \
|
||||
--datasets random \
|
||||
--sizes 10000 \
|
||||
--runs 3 \
|
||||
--make-plots
|
||||
```
|
||||
|
||||
## Output Files
|
||||
|
||||
### Results Directory (`results/`)
|
||||
|
||||
- **`bench_results.csv`**: Detailed results in CSV format
|
||||
- Columns: `algorithm`, `pivot`, `dataset`, `size`, `run`, `time_s`, `peak_mem_bytes`, `comparisons`, `swaps`, `seed`
|
||||
- One row per run
|
||||
|
||||
- **`summary.json`**: Aggregated statistics per (algorithm, dataset, size) combination
|
||||
- Includes: mean, std, best, worst times and memory
|
||||
- Comparison and swap statistics (if instrumentation enabled)
|
||||
|
||||
- **`bench.log`**: Rotating log file (max 10MB, 5 backups)
|
||||
- Contains: system info, run metadata, progress logs, errors
|
||||
|
||||
### Plots Directory (`plots/`)
|
||||
|
||||
- **`time_vs_size.png`**: Line chart showing sorting time vs array size
|
||||
- Separate subplot for each dataset type
|
||||
- One line per algorithm
|
||||
|
||||
- **`memory_vs_size.png`**: Line chart showing memory usage vs array size
|
||||
- Separate subplot for each dataset type
|
||||
- One line per algorithm
|
||||
|
||||
## Reproducing Results
|
||||
|
||||
### Generate a Plot
|
||||
|
||||
After running benchmarks:
|
||||
```bash
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms merge,quick \
|
||||
--datasets random \
|
||||
--sizes 1000,5000,10000,50000 \
|
||||
--runs 5 \
|
||||
--seed 42 \
|
||||
--make-plots
|
||||
```
|
||||
|
||||
Plots will be automatically generated in `plots/` directory.
|
||||
|
||||
### Generate CSV from Scratch
|
||||
|
||||
```bash
|
||||
python -m src.bench.benchmark \
|
||||
--algorithms merge \
|
||||
--datasets sorted,reverse,random \
|
||||
--sizes 1000,5000 \
|
||||
--runs 3 \
|
||||
--seed 42 \
|
||||
--outdir results
|
||||
```
|
||||
|
||||
Check `results/bench_results.csv` for the output.
|
||||
|
||||
## Logging
|
||||
|
||||
Logging is configured via `src.bench.logging_setup.py`:
|
||||
|
||||
- **Console output**: Formatted with timestamp, level, and message
|
||||
- **File output**: Detailed logs with function names and line numbers
|
||||
- **Rotation**: Log files rotate at 10MB, keeping 5 backups
|
||||
- **Metadata**: Logs include Python version, OS, architecture, and git commit (if available)
|
||||
|
||||
### Log Levels
|
||||
|
||||
- `DEBUG`: Detailed diagnostic information
|
||||
- `INFO`: General informational messages (default)
|
||||
- `WARNING`: Warning messages
|
||||
- `ERROR`: Error messages
|
||||
|
||||
### Example Log Output
|
||||
|
||||
```
|
||||
2024-01-15 10:30:00 - __main__ - INFO - ================================================================================
|
||||
2024-01-15 10:30:00 - __main__ - INFO - Benchmark session started
|
||||
2024-01-15 10:30:00 - __main__ - INFO - Python version: 3.10.5
|
||||
2024-01-15 10:30:00 - __main__ - INFO - Platform: macOS-13.0
|
||||
2024-01-15 10:30:00 - __main__ - INFO - Running merge on random size=1000 run=1/5
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
Run the test suite:
|
||||
|
||||
```bash
|
||||
pytest tests/ -v
|
||||
```
|
||||
|
||||
Run with coverage:
|
||||
|
||||
```bash
|
||||
pytest tests/ --cov=src --cov-report=html
|
||||
```
|
||||
|
||||
### Test Coverage
|
||||
|
||||
The test suite includes:
|
||||
|
||||
1. **Unit Tests**:
|
||||
- Empty arrays
|
||||
- Single element arrays
|
||||
- Already sorted arrays
|
||||
- Reverse sorted arrays
|
||||
- Random arrays
|
||||
- Arrays with duplicates
|
||||
- Large arrays
|
||||
- Instrumentation tests
|
||||
|
||||
2. **Property Tests**:
|
||||
- Comparison with Python's `sorted()` on random arrays
|
||||
- Multiple sizes and pivot strategies
|
||||
|
||||
## Code Quality
|
||||
|
||||
### Type Checking
|
||||
|
||||
```bash
|
||||
mypy src/ tests/
|
||||
```
|
||||
|
||||
### Linting
|
||||
|
||||
```bash
|
||||
ruff check src/ tests/
|
||||
```
|
||||
|
||||
### Formatting
|
||||
|
||||
```bash
|
||||
ruff format src/ tests/
|
||||
```
|
||||
|
||||
Or using black:
|
||||
|
||||
```bash
|
||||
black src/ tests/
|
||||
```
|
||||
|
||||
## Algorithm Details
|
||||
|
||||
### Merge Sort
|
||||
|
||||
- **Time Complexity**: O(n log n) worst-case, average-case, best-case
|
||||
- **Space Complexity**: O(n)
|
||||
- **Stability**: Stable
|
||||
- **Implementation**: Recursive divide-and-conquer with merging
|
||||
|
||||
### Quick Sort
|
||||
|
||||
- **Time Complexity**: O(n log n) average-case, O(n²) worst-case
|
||||
- **Space Complexity**: O(log n) average-case (recursion stack)
|
||||
- **Stability**: Not stable (in-place implementation)
|
||||
- **Pivot Strategies**:
|
||||
- `first`: Always use first element (O(n²) on sorted arrays)
|
||||
- `last`: Always use last element (O(n²) on reverse sorted arrays)
|
||||
- `median_of_three`: Use median of first, middle, last
|
||||
- `random`: Random pivot (good average performance)
|
||||
|
||||
## Dataset Types
|
||||
|
||||
1. **sorted**: Array already in ascending order `[0, 1, 2, ..., n-1]`
|
||||
2. **reverse**: Array in descending order `[n-1, n-2, ..., 0]`
|
||||
3. **random**: Random integers from `[0, 10*n)` range
|
||||
4. **nearly_sorted**: Sorted array with ~1% of elements swapped
|
||||
5. **duplicates_heavy**: Array with many duplicate values (only `n/10` distinct values)
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- Benchmarks use `time.perf_counter()` for high-resolution timing
|
||||
- Memory measurement uses both `tracemalloc` and `psutil` for accuracy
|
||||
- Multiple runs per experiment reduce variance
|
||||
- Seeded randomness ensures reproducibility
|
||||
|
||||
## Contributing
|
||||
|
||||
1. Follow Python type hints (checked with mypy)
|
||||
2. Maintain test coverage
|
||||
3. Run linting before committing
|
||||
4. Update README for significant changes
|
||||
|
||||
## License
|
||||
|
||||
[Specify your license here]
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
- Algorithms based on standard divide-and-conquer implementations
|
||||
- Benchmarking framework inspired by best practices in performance testing
|
||||
|
||||
Reference in New Issue
Block a user