Divide-and-Conquer Sorting Algorithms Benchmark

A comprehensive Python project for benchmarking merge sort and quick sort algorithms across different dataset types and sizes, with detailed performance metrics, logging, and visualization.

Project Overview

This project implements two divide-and-conquer sorting algorithms (Merge Sort and Quick Sort) and provides a benchmarking framework to evaluate their performance across various dataset characteristics:

  • Merge Sort: Stable, O(n log n) worst-case time complexity
  • Quick Sort: In-place, O(n log n) average-case with configurable pivot strategies

The benchmark suite measures:

  • Wall-clock time (using time.perf_counter)
  • Peak memory usage (using tracemalloc and psutil)
  • Comparison and swap counts (when instrumentation is enabled)
  • Correctness verification (comparing against Python's sorted())
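
Correctness verification amounts to comparing an implementation's output with Python's built-in sorted(). A minimal sketch of such a check (the function name verify is illustrative, not the project's API):

```python
import random

def verify(sort_fn, n=1000, seed=42):
    """Return True if sort_fn agrees with Python's built-in sorted()."""
    rng = random.Random(seed)
    data = [rng.randint(0, 10 * n) for _ in range(n)]
    # Sort a copy so in-place implementations don't mutate the reference data.
    return sort_fn(list(data)) == sorted(data)
```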

Project Structure

.
├── src/
│   ├── algorithms/
│   │   ├── merge_sort.py      # Merge sort implementation
│   │   └── quick_sort.py       # Quick sort with pivot strategies
│   └── bench/
│       ├── benchmark.py         # Main CLI benchmark runner
│       ├── datasets.py          # Dataset generators
│       ├── metrics.py           # Performance measurement utilities
│       └── logging_setup.py     # Logging configuration
├── tests/
│   └── test_sorts.py            # Comprehensive test suite
├── scripts/
│   └── run_benchmarks.sh        # Convenience script to run benchmarks
├── results/                     # Auto-created: CSV, JSON, logs
├── plots/                       # Auto-created: PNG visualizations
├── pyproject.toml               # Project configuration and dependencies
├── .gitignore
└── README.md                    # This file

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Setup

  1. Clone the repository:
git clone <repository-url>
cd divide-and-conquer-analysis
  2. Install dependencies:
pip install -e .

Or include the development dependencies:

pip install -e ".[dev]"  # Includes dev dependencies (mypy, ruff, black)

Quick Start

Run a Simple Benchmark

python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets sorted,reverse,random \
    --sizes 1000,5000,10000 \
    --runs 5 \
    --seed 42 \
    --instrument \
    --make-plots

Use the Convenience Script

./scripts/run_benchmarks.sh

CLI Usage

The benchmark CLI (src.bench.benchmark) supports the following arguments:

Required Arguments

None (all have defaults)

Optional Arguments

  • --algorithms: Comma-separated list of algorithms to benchmark

    • Options: merge, quick
    • Default: merge,quick
  • --pivot: Pivot strategy for Quick Sort

    • Options: first, last, median_of_three, random
    • Default: random
  • --datasets: Comma-separated list of dataset types

    • Options: sorted, reverse, random, nearly_sorted, duplicates_heavy
    • Default: sorted,reverse,random,nearly_sorted,duplicates_heavy
  • --sizes: Comma-separated list of dataset sizes

    • Default: 1000,5000,10000,50000
    • Example: --sizes 1000,5000,10000,50000,100000
  • --runs: Number of runs per experiment (repeated runs reduce variance)

    • Default: 5
  • --seed: Random seed for reproducibility

    • Default: 42
  • --outdir: Output directory for results

    • Default: results
  • --log-level: Logging level

    • Options: DEBUG, INFO, WARNING, ERROR
    • Default: INFO
  • --instrument: Enable counting of comparisons and swaps

    • Flag (no value)
  • --make-plots: Generate plots after benchmarking

    • Flag (no value)
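
The argument surface above can be reproduced with argparse. The following is a sketch built solely from the options and defaults listed in this README, not the project's actual parser:

```python
import argparse

def build_parser():
    """Sketch of the CLI arguments and defaults documented above."""
    p = argparse.ArgumentParser(prog="src.bench.benchmark")
    p.add_argument("--algorithms", default="merge,quick")
    p.add_argument("--pivot", default="random",
                   choices=["first", "last", "median_of_three", "random"])
    p.add_argument("--datasets",
                   default="sorted,reverse,random,nearly_sorted,duplicates_heavy")
    p.add_argument("--sizes", default="1000,5000,10000,50000")
    p.add_argument("--runs", type=int, default=5)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--outdir", default="results")
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    # Flags take no value; their presence sets the attribute to True.
    p.add_argument("--instrument", action="store_true")
    p.add_argument("--make-plots", action="store_true")
    return p
```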

Example CLI Commands

Basic benchmark with default settings:

python -m src.bench.benchmark

Full benchmark with all options:

python -m src.bench.benchmark \
    --algorithms merge,quick \
    --pivot random \
    --datasets sorted,reverse,random,nearly_sorted,duplicates_heavy \
    --sizes 1000,5000,10000,50000 \
    --runs 5 \
    --seed 42 \
    --instrument \
    --outdir results \
    --log-level INFO \
    --make-plots

Compare pivot strategies:

for pivot in first last median_of_three random; do
    python -m src.bench.benchmark \
        --algorithms quick \
        --pivot $pivot \
        --datasets random \
        --sizes 10000,50000 \
        --runs 10 \
        --seed 42
done

Quick performance check:

python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets random \
    --sizes 10000 \
    --runs 3 \
    --make-plots

Output Files

Results Directory (results/)

  • bench_results.csv: Detailed results in CSV format

    • Columns: algorithm, pivot, dataset, size, run, time_s, peak_mem_bytes, comparisons, swaps, seed
    • One row per run
  • summary.json: Aggregated statistics per (algorithm, dataset, size) combination

    • Includes: mean, standard deviation, best, and worst values for time and memory
    • Comparison and swap statistics (if instrumentation enabled)
  • bench.log: Rotating log file (max 10MB, 5 backups)

    • Contains: system info, run metadata, progress logs, errors

Plots Directory (plots/)

  • time_vs_size.png: Line chart showing sorting time vs array size

    • Separate subplot for each dataset type
    • One line per algorithm
  • memory_vs_size.png: Line chart showing memory usage vs array size

    • Separate subplot for each dataset type
    • One line per algorithm

Reproducing Results

Generate a Plot

After running benchmarks:

python -m src.bench.benchmark \
    --algorithms merge,quick \
    --datasets random \
    --sizes 1000,5000,10000,50000 \
    --runs 5 \
    --seed 42 \
    --make-plots

Plots are generated automatically in the plots/ directory.

Generate CSV from Scratch

python -m src.bench.benchmark \
    --algorithms merge \
    --datasets sorted,reverse,random \
    --sizes 1000,5000 \
    --runs 3 \
    --seed 42 \
    --outdir results

Check results/bench_results.csv for the output.
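
Since bench_results.csv has one row per run, downstream analysis typically starts by grouping runs. A sketch using only the standard library and the column names listed above (mean_times is an illustrative helper, not part of the project):

```python
import csv
from collections import defaultdict
from statistics import mean

def mean_times(path="results/bench_results.csv"):
    """Group per-run times by (algorithm, dataset, size) and average them."""
    times = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["algorithm"], row["dataset"], int(row["size"]))
            times[key].append(float(row["time_s"]))
    return {key: mean(vals) for key, vals in times.items()}
```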

Logging

Logging is configured in src/bench/logging_setup.py:

  • Console output: Formatted with timestamp, level, and message
  • File output: Detailed logs with function names and line numbers
  • Rotation: Log files rotate at 10MB, keeping 5 backups
  • Metadata: Logs include Python version, OS, architecture, and git commit (if available)
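
A rough stdlib equivalent of that configuration (10 MB rotation, 5 backups, console plus file formatters); setup_logging here is a sketch, not the project's actual function:

```python
import logging
from logging.handlers import RotatingFileHandler

def setup_logging(log_path="results/bench.log", level=logging.INFO):
    """Console + rotating file logging, roughly matching the settings above."""
    logger = logging.getLogger("bench")
    logger.setLevel(level)

    console = logging.StreamHandler()
    console.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

    # Rotate at 10 MB, keep 5 backups; file lines add function name and line no.
    file_handler = RotatingFileHandler(
        log_path, maxBytes=10 * 1024 * 1024, backupCount=5)
    file_handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - "
        "%(funcName)s:%(lineno)d - %(message)s"))

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```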

Log Levels

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages (default)
  • WARNING: Warning messages
  • ERROR: Error messages

Example Log Output

2024-01-15 10:30:00 - __main__ - INFO - ================================================================================
2024-01-15 10:30:00 - __main__ - INFO - Benchmark session started
2024-01-15 10:30:00 - __main__ - INFO - Python version: 3.10.5
2024-01-15 10:30:00 - __main__ - INFO - Platform: macOS-13.0
2024-01-15 10:30:00 - __main__ - INFO - Running merge on random size=1000 run=1/5

Testing

Run the test suite:

pytest tests/ -v

Run with coverage:

pytest tests/ --cov=src --cov-report=html

Test Coverage

The test suite includes:

  1. Unit Tests:

    • Empty arrays
    • Single element arrays
    • Already sorted arrays
    • Reverse sorted arrays
    • Random arrays
    • Arrays with duplicates
    • Large arrays
    • Instrumentation tests
  2. Property Tests:

    • Comparison with Python's sorted() on random arrays
    • Multiple sizes and pivot strategies

Code Quality

Type Checking

mypy src/ tests/

Linting

ruff check src/ tests/

Formatting

ruff format src/ tests/

Or using black:

black src/ tests/

Algorithm Details

Merge Sort

  • Time Complexity: O(n log n) in the best, average, and worst cases
  • Space Complexity: O(n)
  • Stability: Stable
  • Implementation: Recursive divide-and-conquer with merging
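
A minimal version of that implementation; the project's merge_sort.py may differ in details such as instrumentation hooks:

```python
def merge_sort(arr):
    """Recursive merge sort; returns a new sorted list."""
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    # Merge step: take from left on ties to preserve stability.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```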

Quick Sort

  • Time Complexity: O(n log n) average-case, O(n²) worst-case
  • Space Complexity: O(log n) average-case (recursion stack)
  • Stability: Not stable (in-place implementation)
  • Pivot Strategies:
    • first: Always use first element (O(n²) on sorted arrays)
    • last: Always use last element (O(n²) on reverse sorted arrays)
    • median_of_three: Use median of first, middle, last
    • random: Random pivot (good average performance)
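
The pivot strategies above can be sketched with a Lomuto-style in-place partition; this is illustrative, and the project's quick_sort.py may use a different partition scheme:

```python
import random

def quick_sort(arr, pivot="random", rng=None):
    """In-place quick sort with a configurable pivot strategy (sketch)."""
    rng = rng or random.Random(42)

    def choose_pivot(lo, hi):
        if pivot == "first":
            return lo
        if pivot == "last":
            return hi
        if pivot == "median_of_three":
            mid = (lo + hi) // 2
            # Index of the median of arr[lo], arr[mid], arr[hi].
            return sorted([(arr[lo], lo), (arr[mid], mid), (arr[hi], hi)])[1][1]
        return rng.randint(lo, hi)  # "random"

    def partition(lo, hi):
        p = choose_pivot(lo, hi)
        arr[p], arr[hi] = arr[hi], arr[p]  # move pivot to the end
        pivot_val, store = arr[hi], lo
        for i in range(lo, hi):
            if arr[i] < pivot_val:
                arr[i], arr[store] = arr[store], arr[i]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]
        return store

    def sort(lo, hi):
        if lo < hi:
            p = partition(lo, hi)
            sort(lo, p - 1)
            sort(p + 1, hi)

    sort(0, len(arr) - 1)
    return arr
```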

Dataset Types

  1. sorted: Array already in ascending order [0, 1, 2, ..., n-1]
  2. reverse: Array in descending order [n-1, n-2, ..., 0]
  3. random: Random integers from [0, 10*n) range
  4. nearly_sorted: Sorted array with ~1% of elements swapped
  5. duplicates_heavy: Array with many duplicate values (only n/10 distinct values)
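
These generators can be sketched directly from the descriptions above (make_dataset is an illustrative name; the project's datasets.py may expose a different interface):

```python
import random

def make_dataset(kind, n, seed=42):
    """Build a test array of the given kind, matching the list above."""
    rng = random.Random(seed)
    if n <= 0:
        return []
    if kind == "sorted":
        return list(range(n))
    if kind == "reverse":
        return list(range(n - 1, -1, -1))
    if kind == "random":
        return [rng.randrange(10 * n) for _ in range(n)]
    if kind == "nearly_sorted":
        data = list(range(n))
        for _ in range(max(1, n // 100)):  # swap ~1% of positions
            i, j = rng.randrange(n), rng.randrange(n)
            data[i], data[j] = data[j], data[i]
        return data
    if kind == "duplicates_heavy":
        distinct = max(1, n // 10)  # only n/10 distinct values
        return [rng.randrange(distinct) for _ in range(n)]
    raise ValueError(f"unknown dataset kind: {kind}")
```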

Performance Considerations

  • Benchmarks use time.perf_counter() for high-resolution timing
  • Memory measurement uses both tracemalloc and psutil for accuracy
  • Multiple runs per experiment reduce variance
  • Seeded randomness ensures reproducibility
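
The timing and tracemalloc parts of a single measurement can be sketched as follows (the psutil-based measurement is omitted here because psutil is a third-party dependency; measure is an illustrative helper, not the project's API):

```python
import time
import tracemalloc

def measure(sort_fn, data):
    """Time one run with perf_counter and capture peak memory via tracemalloc.

    Assumes sort_fn returns the sorted list.
    """
    arr = list(data)  # don't mutate the caller's dataset
    tracemalloc.start()
    start = time.perf_counter()
    result = sort_fn(arr)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert result == sorted(data), "correctness check failed"
    return {"time_s": elapsed, "peak_mem_bytes": peak}
```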

Contributing

  1. Follow Python type hints (checked with mypy)
  2. Maintain test coverage
  3. Run linting before committing
  4. Update README for significant changes

License

[Specify your license here]

Performance Analysis

See ANALYSIS.md for a comprehensive comparison and analysis of the algorithms, including:

  • Detailed performance metrics across sorted, reverse sorted, and random datasets
  • Execution time and memory usage comparisons
  • Operation counts (comparisons and swaps)
  • Discussion of discrepancies between theoretical analysis and practical performance
  • Explanations for observed performance characteristics

The analysis document includes:

  • Performance tables for all dataset types
  • Theoretical vs practical performance analysis
  • Scalability analysis
  • Recommendations for algorithm selection

Acknowledgments

  • Algorithms based on standard divide-and-conquer implementations
  • Benchmarking framework inspired by best practices in performance testing