Divide-and-Conquer Sorting Algorithms Benchmark
A comprehensive Python project for benchmarking merge sort and quick sort algorithms across different dataset types and sizes, with detailed performance metrics, logging, and visualization.
Project Overview
This project implements two divide-and-conquer sorting algorithms (Merge Sort and Quick Sort) and provides a benchmarking framework to evaluate their performance across various dataset characteristics:
- Merge Sort: Stable, O(n log n) worst-case time complexity
- Quick Sort: In-place, O(n log n) average-case with configurable pivot strategies
The benchmark suite measures:
- Wall-clock time (using
time.perf_counter) - Peak memory usage (using
tracemallocandpsutil) - Comparison and swap counts (when instrumentation is enabled)
- Correctness verification (comparing against Python's
sorted())
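A single measurement can be taken roughly as follows. This is a minimal sketch of the measurement idea, not the project's actual `metrics.py` API; `measure_sort` and its return shape are illustrative:

```python
import time
import tracemalloc

def measure_sort(sort_fn, data):
    """Time one sort call and record its peak traced memory usage."""
    arr = list(data)  # copy so the same input can be reused across runs
    tracemalloc.start()
    start = time.perf_counter()
    result = sort_fn(arr)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # Correctness check against Python's built-in sorted()
    assert result == sorted(data)
    return {"time_s": elapsed, "peak_mem_bytes": peak}

stats = measure_sort(sorted, [3, 1, 2])
```

`psutil` (used alongside `tracemalloc` in the real suite) would additionally report process-level RSS, which captures allocations `tracemalloc` cannot see.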
Project Structure
```text
.
├── src/
│   ├── algorithms/
│   │   ├── merge_sort.py        # Merge sort implementation
│   │   └── quick_sort.py        # Quick sort with pivot strategies
│   └── bench/
│       ├── benchmark.py         # Main CLI benchmark runner
│       ├── datasets.py          # Dataset generators
│       ├── metrics.py           # Performance measurement utilities
│       └── logging_setup.py     # Logging configuration
├── tests/
│   └── test_sorts.py            # Comprehensive test suite
├── scripts/
│   └── run_benchmarks.sh        # Convenience script to run benchmarks
├── results/                     # Auto-created: CSV, JSON, logs
├── plots/                       # Auto-created: PNG visualizations
├── pyproject.toml               # Project configuration and dependencies
├── .gitignore
└── README.md                    # This file
```
Installation
Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
Setup
- Clone the repository:

```bash
git clone <repository-url>
cd divide-and-conquer-analysis
```

- Install dependencies:

```bash
pip install -e .
```

Or install with development dependencies:

```bash
pip install -e ".[dev]"  # Includes dev tools (mypy, ruff, black)
```
Quick Start
Run a Simple Benchmark
```bash
python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets sorted,reverse,random \
    --sizes 1000,5000,10000 \
    --runs 5 \
    --seed 42 \
    --instrument \
    --make-plots
```
Use the Convenience Script
```bash
./scripts/run_benchmarks.sh
```
CLI Usage
The benchmark CLI (`src.bench.benchmark`) supports the following arguments:
Required Arguments
None (all have defaults)
Optional Arguments
- `--algorithms`: Comma-separated list of algorithms to benchmark
  - Options: `merge`, `quick`
  - Default: `merge,quick`
- `--pivot`: Pivot strategy for Quick Sort
  - Options: `first`, `last`, `median_of_three`, `random`
  - Default: `random`
- `--datasets`: Comma-separated list of dataset types
  - Options: `sorted`, `reverse`, `random`, `nearly_sorted`, `duplicates_heavy`
  - Default: `sorted,reverse,random,nearly_sorted,duplicates_heavy`
- `--sizes`: Comma-separated list of dataset sizes
  - Default: `1000,5000,10000,50000`
  - Example: `--sizes 1000,5000,10000,50000,100000`
- `--runs`: Number of runs per experiment (for statistical significance)
  - Default: `5`
- `--seed`: Random seed for reproducibility
  - Default: `42`
- `--outdir`: Output directory for results
  - Default: `results`
- `--log-level`: Logging level
  - Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`
  - Default: `INFO`
- `--instrument`: Enable counting of comparisons and swaps (flag, no value)
- `--make-plots`: Generate plots after benchmarking (flag, no value)
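An argument surface like the one above can be defined with `argparse` roughly as follows. This is a hedged sketch of how the parser might look, not the project's exact code:

```python
import argparse

def build_parser():
    """Sketch of a parser mirroring the documented CLI options."""
    p = argparse.ArgumentParser(prog="src.bench.benchmark")
    p.add_argument("--algorithms", default="merge,quick",
                   help="Comma-separated list: merge,quick")
    p.add_argument("--pivot", default="random",
                   choices=["first", "last", "median_of_three", "random"])
    p.add_argument("--datasets",
                   default="sorted,reverse,random,nearly_sorted,duplicates_heavy")
    p.add_argument("--sizes", default="1000,5000,10000,50000")
    p.add_argument("--runs", type=int, default=5)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--outdir", default="results")
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    p.add_argument("--instrument", action="store_true")
    p.add_argument("--make-plots", action="store_true")
    return p

args = build_parser().parse_args(["--sizes", "1000,5000", "--instrument"])
```

Comma-separated values arrive as plain strings, so the runner would split them (e.g. `args.sizes.split(",")`) before use.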
Example CLI Commands
Basic benchmark with default settings:
```bash
python -m src.bench.benchmark
```
Full benchmark with all options:
```bash
python -m src.bench.benchmark \
    --algorithms merge,quick \
    --pivot random \
    --datasets sorted,reverse,random,nearly_sorted,duplicates_heavy \
    --sizes 1000,5000,10000,50000 \
    --runs 5 \
    --seed 42 \
    --instrument \
    --outdir results \
    --log-level INFO \
    --make-plots
```
Compare pivot strategies:
```bash
for pivot in first last median_of_three random; do
    python -m src.bench.benchmark \
        --algorithms quick \
        --pivot $pivot \
        --datasets random \
        --sizes 10000,50000 \
        --runs 10 \
        --seed 42
done
```
Quick performance check:
```bash
python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets random \
    --sizes 10000 \
    --runs 3 \
    --make-plots
```
Output Files
Results Directory (results/)
- `bench_results.csv`: Detailed results in CSV format
  - Columns: `algorithm`, `pivot`, `dataset`, `size`, `run`, `time_s`, `peak_mem_bytes`, `comparisons`, `swaps`, `seed`
  - One row per run
- `summary.json`: Aggregated statistics per (algorithm, dataset, size) combination
  - Includes mean, std, best, and worst times and memory
  - Comparison and swap statistics (if instrumentation is enabled)
- `bench.log`: Rotating log file (max 10 MB, 5 backups)
  - Contains system info, run metadata, progress logs, and errors
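The kind of aggregation stored in `summary.json` can be reproduced from the CSV rows along these lines. An illustrative sketch only: the column names follow the table above, but the grouping code is not the project's actual implementation, and the inline `rows` stand in for `csv.DictReader` output:

```python
import statistics
from collections import defaultdict

# Rows as they would come from csv.DictReader over bench_results.csv
rows = [
    {"algorithm": "merge", "dataset": "random", "size": "1000", "time_s": "0.004"},
    {"algorithm": "merge", "dataset": "random", "size": "1000", "time_s": "0.006"},
]

# Group run times by (algorithm, dataset, size)
groups = defaultdict(list)
for row in rows:
    key = (row["algorithm"], row["dataset"], int(row["size"]))
    groups[key].append(float(row["time_s"]))

# Aggregate each group into mean / std / best / worst
summary = {
    key: {
        "mean": statistics.mean(times),
        "std": statistics.stdev(times) if len(times) > 1 else 0.0,
        "best": min(times),
        "worst": max(times),
    }
    for key, times in groups.items()
}
```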
Plots Directory (plots/)
- `time_vs_size.png`: Line chart showing sorting time vs. array size
  - Separate subplot for each dataset type
  - One line per algorithm
- `memory_vs_size.png`: Line chart showing memory usage vs. array size
  - Separate subplot for each dataset type
  - One line per algorithm
Reproducing Results
Generate a Plot
After running benchmarks:
```bash
python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets random \
    --sizes 1000,5000,10000,50000 \
    --runs 5 \
    --seed 42 \
    --make-plots
```
Plots are generated automatically in the `plots/` directory.
Generate CSV from Scratch
```bash
python -m src.bench.benchmark \
    --algorithms merge \
    --datasets sorted,reverse,random \
    --sizes 1000,5000 \
    --runs 3 \
    --seed 42 \
    --outdir results
```
Check `results/bench_results.csv` for the output.
Logging
Logging is configured in `src/bench/logging_setup.py`:
- Console output: Formatted with timestamp, level, and message
- File output: Detailed logs with function names and line numbers
- Rotation: Log files rotate at 10MB, keeping 5 backups
- Metadata: Logs include Python version, OS, architecture, and git commit (if available)
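A configuration with these properties could look roughly like this. A sketch under stated assumptions: the function name and formats are illustrative, and the actual `logging_setup` module may differ:

```python
import logging
import logging.handlers

def setup_logging(level="INFO", logfile="results/bench.log"):
    """Attach a console handler and a 10 MB rotating file handler to the root logger."""
    root = logging.getLogger()
    root.setLevel(level)

    # Console output: timestamp, level, and message
    console = logging.StreamHandler()
    console.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
    root.addHandler(console)

    # File output: detailed format with function names and line numbers;
    # rotate at 10 MB, keeping 5 backups
    filehandler = logging.handlers.RotatingFileHandler(
        logfile, maxBytes=10 * 1024 * 1024, backupCount=5)
    filehandler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - "
        "%(funcName)s:%(lineno)d - %(message)s"))
    root.addHandler(filehandler)
```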
Log Levels
- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages (default)
- `WARNING`: Warning messages
- `ERROR`: Error messages
Example Log Output
```text
2024-01-15 10:30:00 - __main__ - INFO - ================================================================================
2024-01-15 10:30:00 - __main__ - INFO - Benchmark session started
2024-01-15 10:30:00 - __main__ - INFO - Python version: 3.10.5
2024-01-15 10:30:00 - __main__ - INFO - Platform: macOS-13.0
2024-01-15 10:30:00 - __main__ - INFO - Running merge on random size=1000 run=1/5
```
Testing
Run the test suite:

```bash
pytest tests/ -v
```

Run with coverage:

```bash
pytest tests/ --cov=src --cov-report=html
```
Test Coverage
The test suite includes:
- Unit Tests:
  - Empty arrays
  - Single-element arrays
  - Already sorted arrays
  - Reverse sorted arrays
  - Random arrays
  - Arrays with duplicates
  - Large arrays
  - Instrumentation tests
- Property Tests:
  - Comparison with Python's `sorted()` on random arrays
  - Multiple sizes and pivot strategies
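A property test of this kind can be written compactly. A sketch of the idea only; the real tests live in `tests/test_sorts.py`, and `check_against_sorted` is an illustrative helper, not the project's API:

```python
import random

def check_against_sorted(sort_fn, trials=50, max_size=200, seed=42):
    """Property check: sort_fn must agree with Python's sorted() on random input."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(max_size + 1)          # includes the empty-array case
        arr = [rng.randrange(10 * n + 1) for _ in range(n)]
        # Pass a copy so in-place algorithms cannot corrupt the reference input
        assert sort_fn(list(arr)) == sorted(arr)

check_against_sorted(sorted)  # sanity check using the builtin as a stand-in
```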
Code Quality
Type Checking
mypy src/ tests/
Linting
ruff check src/ tests/
Formatting
ruff format src/ tests/
Or using black:
black src/ tests/
Algorithm Details
Merge Sort
- Time Complexity: O(n log n) in the best, average, and worst cases
- Space Complexity: O(n)
- Stability: Stable
- Implementation: Recursive divide-and-conquer with merging
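The recursive scheme described above can be sketched as follows. An illustrative implementation, not necessarily identical to `src/algorithms/merge_sort.py`:

```python
def merge_sort(arr):
    """Stable O(n log n) merge sort returning a new sorted list."""
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])      # divide
    right = merge_sort(arr[mid:])
    # Conquer: merge the two sorted halves.
    # Using <= keeps equal elements in their original order (stability).
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```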
Quick Sort
- Time Complexity: O(n log n) average-case, O(n²) worst-case
- Space Complexity: O(log n) average-case (recursion stack)
- Stability: Not stable (in-place implementation)
- Pivot Strategies:
  - `first`: Always use the first element (O(n²) on sorted arrays)
  - `last`: Always use the last element (O(n²) on reverse sorted arrays)
  - `median_of_three`: Use the median of the first, middle, and last elements
  - `random`: Random pivot (good average performance)
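An in-place quick sort with pluggable pivot selection can be sketched like this. A hedged illustration using Lomuto partitioning; the names and the partition scheme are assumptions, not the project's actual `quick_sort.py`:

```python
import random

def _choose_pivot(arr, lo, hi, strategy, rng):
    """Return the index of the pivot for arr[lo..hi] under the given strategy."""
    if strategy == "first":
        return lo
    if strategy == "last":
        return hi
    if strategy == "median_of_three":
        mid = (lo + hi) // 2
        return sorted([(arr[lo], lo), (arr[mid], mid), (arr[hi], hi)])[1][1]
    return rng.randint(lo, hi)  # "random"

def quick_sort(arr, strategy="random", seed=42):
    """In-place quick sort (Lomuto partition) with a configurable pivot strategy."""
    rng = random.Random(seed)

    def _sort(lo, hi):
        if lo >= hi:
            return
        p = _choose_pivot(arr, lo, hi, strategy, rng)
        arr[p], arr[hi] = arr[hi], arr[p]        # move pivot to the end
        pivot, store = arr[hi], lo
        for i in range(lo, hi):                  # partition around the pivot
            if arr[i] < pivot:
                arr[i], arr[store] = arr[store], arr[i]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]
        _sort(lo, store - 1)
        _sort(store + 1, hi)

    _sort(0, len(arr) - 1)
    return arr
```

With `first` or `last`, already-ordered inputs trigger the O(n²) behavior noted above, because every partition is maximally unbalanced.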
Dataset Types
- `sorted`: Array already in ascending order: `[0, 1, 2, ..., n-1]`
- `reverse`: Array in descending order: `[n-1, n-2, ..., 0]`
- `random`: Random integers from the range `[0, 10*n)`
- `nearly_sorted`: Sorted array with ~1% of elements swapped
- `duplicates_heavy`: Array with many duplicate values (only `n/10` distinct values)
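Generators for these dataset types might look like this. A sketch only: `make_dataset` and its signature are illustrative, and the actual `datasets.py` may differ in details:

```python
import random

def make_dataset(kind, n, seed=42):
    """Generate a benchmark input array of the given type and size."""
    rng = random.Random(seed)
    if n <= 0:
        return []
    if kind == "sorted":
        return list(range(n))
    if kind == "reverse":
        return list(range(n - 1, -1, -1))
    if kind == "random":
        return [rng.randrange(10 * n) for _ in range(n)]
    if kind == "nearly_sorted":
        arr = list(range(n))
        for _ in range(max(1, n // 100)):        # swap ~1% of elements
            i, j = rng.randrange(n), rng.randrange(n)
            arr[i], arr[j] = arr[j], arr[i]
        return arr
    if kind == "duplicates_heavy":
        return [rng.randrange(max(1, n // 10)) for _ in range(n)]
    raise ValueError(f"unknown dataset type: {kind}")
```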
Performance Considerations
- Benchmarks use `time.perf_counter()` for high-resolution timing
- Memory measurement uses both `tracemalloc` and `psutil` for accuracy
- Multiple runs per experiment reduce variance
- Seeded randomness ensures reproducibility
Contributing
- Follow Python type hints (checked with mypy)
- Maintain test coverage
- Run linting before committing
- Update README for significant changes
License
[Specify your license here]
Performance Analysis
See ANALYSIS.md for a comprehensive comparison and analysis of the algorithms, including:
- Detailed performance metrics across sorted, reverse sorted, and random datasets
- Execution time and memory usage comparisons
- Operation counts (comparisons and swaps)
- Discussion of discrepancies between theoretical analysis and practical performance
- Explanations for observed performance characteristics
The analysis document includes:
- Performance tables for all dataset types
- Theoretical vs practical performance analysis
- Scalability analysis
- Recommendations for algorithm selection
Acknowledgments
- Algorithms based on standard divide-and-conquer implementations
- Benchmarking framework inspired by best practices in performance testing