# Divide-and-Conquer Sorting Algorithms Benchmark
A comprehensive Python project for benchmarking merge sort and quick sort algorithms across different dataset types and sizes, with detailed performance metrics, logging, and visualization.
## Project Overview
This project implements two divide-and-conquer sorting algorithms (Merge Sort and Quick Sort) and provides a benchmarking framework to evaluate their performance across various dataset characteristics:
- **Merge Sort**: Stable, O(n log n) worst-case time complexity
- **Quick Sort**: In-place, O(n log n) average-case with configurable pivot strategies
The benchmark suite measures:
- Wall-clock time (using `time.perf_counter`)
- Peak memory usage (using `tracemalloc` and `psutil`)
- Comparison and swap counts (when instrumentation is enabled)
- Correctness verification (comparing against Python's `sorted()`)
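The per-run measurement can be sketched roughly as follows (function and variable names here are illustrative, not the project's actual API):

```python
import time
import tracemalloc

def measure_run(sort_fn, data):
    """Time one sort call and record its peak Python-level memory.

    Returns (elapsed_seconds, peak_bytes, result).
    """
    arr = list(data)                       # sort a copy; keep the input intact
    tracemalloc.start()
    start = time.perf_counter()
    result = sort_fn(arr)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert result == sorted(data)          # correctness check against sorted()
    return elapsed, peak, result

elapsed, peak, result = measure_run(sorted, [3, 1, 2])
```

The real harness additionally samples process-level memory via `psutil`, which `tracemalloc` alone does not capture.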
## Project Structure
```
.
├── src/
│   ├── algorithms/
│   │   ├── merge_sort.py      # Merge sort implementation
│   │   └── quick_sort.py      # Quick sort with pivot strategies
│   └── bench/
│       ├── benchmark.py       # Main CLI benchmark runner
│       ├── datasets.py        # Dataset generators
│       ├── metrics.py         # Performance measurement utilities
│       └── logging_setup.py   # Logging configuration
├── tests/
│   └── test_sorts.py          # Comprehensive test suite
├── scripts/
│   └── run_benchmarks.sh      # Convenience script to run benchmarks
├── results/                   # Auto-created: CSV, JSON, logs
├── plots/                     # Auto-created: PNG visualizations
├── pyproject.toml             # Project configuration and dependencies
├── .gitignore
└── README.md                  # This file
```
## Installation
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
### Setup
1. Clone the repository:
```bash
git clone <repository-url>
cd divide-and-conquer-analysis
```
2. Install dependencies:
```bash
pip install -e .
```
Or install from `pyproject.toml`:
```bash
pip install -e ".[dev]" # Includes dev dependencies (mypy, ruff, black)
```
## Quick Start
### Run a Simple Benchmark
```bash
python -m src.bench.benchmark \
  --algorithms merge,quick \
  --datasets sorted,reverse,random \
  --sizes 1000,5000,10000 \
  --runs 5 \
  --seed 42 \
  --instrument \
  --make-plots
```
### Use the Convenience Script
```bash
./scripts/run_benchmarks.sh
```
## CLI Usage
The benchmark CLI (`src.bench.benchmark`) supports the following arguments:
### Required Arguments
None (all have defaults)
### Optional Arguments
- `--algorithms`: Comma-separated list of algorithms to benchmark
  - Options: `merge`, `quick`
  - Default: `merge,quick`
- `--pivot`: Pivot strategy for Quick Sort
  - Options: `first`, `last`, `median_of_three`, `random`
  - Default: `random`
- `--datasets`: Comma-separated list of dataset types
  - Options: `sorted`, `reverse`, `random`, `nearly_sorted`, `duplicates_heavy`
  - Default: `sorted,reverse,random,nearly_sorted,duplicates_heavy`
- `--sizes`: Comma-separated list of dataset sizes
  - Default: `1000,5000,10000,50000`
  - Example: `--sizes 1000,5000,10000,50000,100000`
- `--runs`: Number of runs per experiment (for statistical significance)
  - Default: `5`
- `--seed`: Random seed for reproducibility
  - Default: `42`
- `--outdir`: Output directory for results
  - Default: `results`
- `--log-level`: Logging level
  - Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`
  - Default: `INFO`
- `--instrument`: Enable counting of comparisons and swaps
  - Flag (no value)
- `--make-plots`: Generate plots after benchmarking
  - Flag (no value)
### Example CLI Commands
**Basic benchmark with default settings:**
```bash
python -m src.bench.benchmark
```
**Full benchmark with all options:**
```bash
python -m src.bench.benchmark \
  --algorithms merge,quick \
  --pivot random \
  --datasets sorted,reverse,random,nearly_sorted,duplicates_heavy \
  --sizes 1000,5000,10000,50000 \
  --runs 5 \
  --seed 42 \
  --instrument \
  --outdir results \
  --log-level INFO \
  --make-plots
```
**Compare pivot strategies:**
```bash
for pivot in first last median_of_three random; do
  python -m src.bench.benchmark \
    --algorithms quick \
    --pivot "$pivot" \
    --datasets random \
    --sizes 10000,50000 \
    --runs 10 \
    --seed 42
done
```
**Quick performance check:**
```bash
python -m src.bench.benchmark \
  --algorithms merge,quick \
  --datasets random \
  --sizes 10000 \
  --runs 3 \
  --make-plots
```
## Output Files
### Results Directory (`results/`)
- **`bench_results.csv`**: Detailed results in CSV format
  - Columns: `algorithm`, `pivot`, `dataset`, `size`, `run`, `time_s`, `peak_mem_bytes`, `comparisons`, `swaps`, `seed`
  - One row per run
- **`summary.json`**: Aggregated statistics per (algorithm, dataset, size) combination
  - Includes mean, standard deviation, best, and worst times and memory
  - Comparison and swap statistics (if instrumentation is enabled)
- **`bench.log`**: Rotating log file (max 10 MB, 5 backups)
  - Contains system info, run metadata, progress logs, and errors
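Post-processing the CSV needs only the standard library; the rows below mimic what `csv.DictReader` would yield from `bench_results.csv` (the `summarize` helper is illustrative, not part of the project):

```python
import statistics
from collections import defaultdict

def summarize(rows):
    """Group rows by (algorithm, dataset, size) and average time_s."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["algorithm"], row["dataset"], int(row["size"]))
        groups[key].append(float(row["time_s"]))
    return {key: statistics.mean(times) for key, times in groups.items()}

# Rows shaped like csv.DictReader output over bench_results.csv:
rows = [
    {"algorithm": "merge", "dataset": "random", "size": "1000", "time_s": "0.004"},
    {"algorithm": "merge", "dataset": "random", "size": "1000", "time_s": "0.006"},
]
means = summarize(rows)
```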
### Plots Directory (`plots/`)
- **`time_vs_size.png`**: Line chart showing sorting time vs array size
  - Separate subplot for each dataset type
  - One line per algorithm
- **`memory_vs_size.png`**: Line chart showing memory usage vs array size
  - Separate subplot for each dataset type
  - One line per algorithm
## Reproducing Results
### Generate a Plot
After running benchmarks:
```bash
python -m src.bench.benchmark \
  --algorithms merge,quick \
  --datasets random \
  --sizes 1000,5000,10000,50000 \
  --runs 5 \
  --seed 42 \
  --make-plots
```
Plots are generated automatically in the `plots/` directory.
### Generate CSV from Scratch
```bash
python -m src.bench.benchmark \
  --algorithms merge \
  --datasets sorted,reverse,random \
  --sizes 1000,5000 \
  --runs 3 \
  --seed 42 \
  --outdir results
```
Check `results/bench_results.csv` for the output.
## Logging
Logging is configured via `src/bench/logging_setup.py`:
- **Console output**: Formatted with timestamp, level, and message
- **File output**: Detailed logs with function names and line numbers
- **Rotation**: Log files rotate at 10MB, keeping 5 backups
- **Metadata**: Logs include Python version, OS, architecture, and git commit (if available)
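A minimal sketch of such a setup using only the standard-library `logging` package (the handler layout mirrors the description above; `setup_logging` is an illustrative name and the default path is simplified from the project's `results/bench.log`):

```python
import logging
import logging.handlers
import platform
import sys

def setup_logging(log_path="bench.log", level=logging.INFO):
    """Attach a console handler and a 10 MB / 5-backup rotating file handler."""
    logger = logging.getLogger("bench")
    logger.setLevel(level)
    logger.handlers.clear()                # avoid duplicates on repeated setup

    console = logging.StreamHandler()
    console.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

    file_handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=10 * 1024 * 1024, backupCount=5)
    file_handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(funcName)s:%(lineno)d - %(message)s"))

    logger.addHandler(console)
    logger.addHandler(file_handler)
    logger.info("Python version: %s", sys.version.split()[0])
    logger.info("Platform: %s", platform.platform())
    return logger
```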
### Log Levels
- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages (default)
- `WARNING`: Warning messages
- `ERROR`: Error messages
### Example Log Output
```
2024-01-15 10:30:00 - __main__ - INFO - ================================================================================
2024-01-15 10:30:00 - __main__ - INFO - Benchmark session started
2024-01-15 10:30:00 - __main__ - INFO - Python version: 3.10.5
2024-01-15 10:30:00 - __main__ - INFO - Platform: macOS-13.0
2024-01-15 10:30:00 - __main__ - INFO - Running merge on random size=1000 run=1/5
```
## Testing
Run the test suite:
```bash
pytest tests/ -v
```
Run with coverage:
```bash
pytest tests/ --cov=src --cov-report=html
```
### Test Coverage
The test suite includes:
1. **Unit Tests**:
   - Empty arrays
   - Single element arrays
   - Already sorted arrays
   - Reverse sorted arrays
   - Random arrays
   - Arrays with duplicates
   - Large arrays
   - Instrumentation tests
2. **Property Tests**:
   - Comparison with Python's `sorted()` on random arrays
   - Multiple sizes and pivot strategies
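The property check boils down to comparing a sort function's output with `sorted()` on seeded random input. A standalone sketch, with the built-in `sorted` standing in for the project's sort functions:

```python
import random

def check_against_builtin(sort_fn, size, seed=42):
    """Property check: sort_fn must agree with Python's sorted()."""
    rng = random.Random(seed)
    data = [rng.randint(0, 10 * max(size, 1)) for _ in range(size)]
    assert sort_fn(list(data)) == sorted(data)

# In the real suite this would exercise merge_sort and quick_sort
# across sizes and pivot strategies:
for n in (0, 1, 10, 1000):
    check_against_builtin(sorted, n)
```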
## Code Quality
### Type Checking
```bash
mypy src/ tests/
```
### Linting
```bash
ruff check src/ tests/
```
### Formatting
```bash
ruff format src/ tests/
```
Or using black:
```bash
black src/ tests/
```
## Algorithm Details
### Merge Sort
- **Time Complexity**: O(n log n) in the best, average, and worst cases
- **Space Complexity**: O(n)
- **Stability**: Stable
- **Implementation**: Recursive divide-and-conquer with merging
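A minimal top-down version (without the project's instrumentation hooks) can be sketched as:

```python
def merge_sort(arr):
    """Stable top-down merge sort; returns a new sorted list."""
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])

    # Merge step: <= keeps equal elements in original order (stability)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The O(n) auxiliary space comes from the `merged` buffer and the slices made at each level of recursion.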
### Quick Sort
- **Time Complexity**: O(n log n) average-case, O(n²) worst-case
- **Space Complexity**: O(log n) average-case (recursion stack)
- **Stability**: Not stable (in-place implementation)
- **Pivot Strategies**:
  - `first`: Always use first element (O(n²) on sorted arrays)
  - `last`: Always use last element (O(n²) on reverse sorted arrays)
  - `median_of_three`: Use median of first, middle, last
  - `random`: Random pivot (good average performance)
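A compact sketch of quick sort with these pivot strategies, using Lomuto partitioning (illustrative code, not the project's exact implementation):

```python
import random

def choose_pivot(arr, lo, hi, strategy, rng):
    """Return a pivot index in [lo, hi] for the given strategy."""
    if strategy == "first":
        return lo
    if strategy == "last":
        return hi
    if strategy == "random":
        return rng.randint(lo, hi)
    # median_of_three: index whose value is the median of first/middle/last
    mid = (lo + hi) // 2
    return sorted([lo, mid, hi], key=lambda i: arr[i])[1]

def quick_sort(arr, strategy="random", seed=None):
    """In-place quick sort with Lomuto partitioning; returns arr for convenience."""
    rng = random.Random(seed)

    def _sort(lo, hi):
        if lo >= hi:
            return
        p = choose_pivot(arr, lo, hi, strategy, rng)
        arr[p], arr[hi] = arr[hi], arr[p]          # move pivot to the end
        pivot = arr[hi]
        store = lo
        for i in range(lo, hi):
            if arr[i] < pivot:
                arr[i], arr[store] = arr[store], arr[i]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]  # pivot to its final position
        _sort(lo, store - 1)
        _sort(store + 1, hi)

    _sort(0, len(arr) - 1)
    return arr
```

Partitioning swaps elements in place, which is what breaks stability: equal elements can cross each other during a swap.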
## Dataset Types
1. **sorted**: Array already in ascending order `[0, 1, 2, ..., n-1]`
2. **reverse**: Array in descending order `[n-1, n-2, ..., 0]`
3. **random**: Random integers drawn from the range `[0, 10*n)`
4. **nearly_sorted**: Sorted array with ~1% of elements swapped
5. **duplicates_heavy**: Array with many duplicate values (only `n/10` distinct values)
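These generators can be sketched as follows (a hypothetical `make_dataset` helper; the project's actual generators live in `src/bench/datasets.py`):

```python
import random

def make_dataset(kind, n, seed=42):
    """Generate a test array of the given kind, seeded for reproducibility."""
    rng = random.Random(seed)
    if kind == "sorted":
        return list(range(n))
    if kind == "reverse":
        return list(range(n - 1, -1, -1))
    if kind == "random":
        return [rng.randrange(10 * n) for _ in range(n)]
    if kind == "nearly_sorted":
        arr = list(range(n))
        if n > 1:
            for _ in range(max(1, n // 100)):      # ~1% random swaps
                i, j = rng.randrange(n), rng.randrange(n)
                arr[i], arr[j] = arr[j], arr[i]
        return arr
    if kind == "duplicates_heavy":
        return [rng.randrange(max(1, n // 10)) for _ in range(n)]
    raise ValueError(f"unknown dataset kind: {kind}")
```

Seeding a private `random.Random` instance, rather than the module-level RNG, keeps generators independent of any other randomness in the benchmark run.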
## Performance Considerations
- Benchmarks use `time.perf_counter()` for high-resolution timing
- Memory measurement uses both `tracemalloc` and `psutil` for accuracy
- Multiple runs per experiment reduce variance
- Seeded randomness ensures reproducibility
## Contributing
1. Follow Python type hints (checked with mypy)
2. Maintain test coverage
3. Run linting before committing
4. Update README for significant changes
## License
[Specify your license here]
## Performance Analysis
See **[ANALYSIS.md](ANALYSIS.md)** for a comprehensive comparison and analysis of the algorithms, including:
- Detailed performance metrics across sorted, reverse sorted, and random datasets
- Execution time and memory usage comparisons
- Operation counts (comparisons and swaps)
- Discussion of discrepancies between theoretical analysis and practical performance
- Explanations for observed performance characteristics
The analysis document includes:
- Performance tables for all dataset types
- Theoretical vs practical performance analysis
- Scalability analysis
- Recommendations for algorithm selection
## Acknowledgments
- Algorithms based on standard divide-and-conquer implementations
- Benchmarking framework inspired by best practices in performance testing