Files

Carlos Gutierrez 7416dfed38 Adding hash algorithms

2025-11-24 17:11:51 -05:00

8.2 KiB

Raw Blame History

MSCS532 Assignment 7: Hash Tables and Their Practical Applications

Author: Carlos Gutierrez
Email: cgutierrez44833@ucumberlands.edu
Course: MSCS532 – Data Structures and Algorithms
Assignment: Hash Tables and Their Practical Applications

Overview

This assignment provides a comprehensive study of hash tables, including direct-address tables, hash functions, open addressing, and separate chaining. It includes implementations of various hash table data structures, hash function designs (both good and bad examples), theoretical complexity analysis, empirical benchmarking, test coverage, and reproducible visualization assets.

Repository Structure

MSCS532_Assignment7/
├── docs/
│   ├── hash_function_comparison.png      # Hash function performance comparison
│   ├── open_addressing_vs_chaining.png   # Collision resolution strategies comparison
│   ├── load_factor_impact.png            # Performance at different load factors
│   └── collision_analysis.png            # Collision analysis for different hash functions
├── examples/
│   ├── hash_tables_demo.py               # Hash table demonstrations
│   └── generate_plots.py                 # Script to reproduce all plots
├── src/
│   ├── hash_functions.py                 # Various hash function implementations
│   ├── hash_tables.py                    # Hash table data structures
│   ├── benchmark.py                      # Benchmarking utilities
│   └── __init__.py                       # Package initialization
├── tests/
│   ├── test_hash_functions.py            # Tests for hash functions
│   └── test_hash_tables.py               # Tests for hash tables
├── papers/                               # Reference papers (PDFs)
├── requirements.txt                      # Python dependencies
├── README.md                             # Project documentation (this file)
└── REPORT.md                             # Detailed analysis report

Part 1: Hash Functions and Their Impact

Implementation

Good Hash Functions

Division Method: h(k) = k mod m
- Simple and fast
- Requires careful choice of table size (preferably prime)
Multiplication Method: h(k) = floor(m * (kA mod 1))
- Good distribution with proper choice of A
- Default A = (√5 - 1)/2 ≈ 0.618
Universal Hash Functions: h(k) = ((a*k + b) mod p) mod m
- Minimizes collisions for any set of keys
- Requires random parameters a, b and prime p
Polynomial String Hash: Rolling hash for strings
- Better distribution than simple summation
- Base 31 is commonly used
DJB2 Hash: Popular string hash function
- Known for good distribution properties

Bad Hash Functions (Demonstration)

Simple String Hash: Sums character values
- Prone to collisions for similar strings
- Poor distribution
Bad Clustering Hash: Demonstrates clustering behavior
- Causes many collisions and poor performance

API Highlights

Hash Functions:

division_hash(key: int, table_size: int) -> int
multiplication_hash(key: int, table_size: int, A: float) -> int
universal_hash(key: int, table_size: int, a: int, b: int, p: int) -> int
string_hash_polynomial(key: str, table_size: int, base: int) -> int
string_hash_simple(key: str, table_size: int) -> int  # BAD EXAMPLE
bad_hash_clustering(key: int, table_size: int) -> int  # BAD EXAMPLE

Part 2: Hash Table Data Structures

Implementation

Direct-Address Table

File: src/hash_tables.py
Operations: insert, search, delete
Time Complexity: O(1) for all operations
Space Complexity: O(m) where m is the key range
Use Case: When keys are integers in a small known range

Open Addressing

File: src/hash_tables.py
Probe Types:
- Linear Probing: h(k,i) = (h'(k) + i) mod m
- Quadratic Probing: h(k,i) = (h'(k) + c1*i + c2*i²) mod m
- Double Hashing: h(k,i) = (h1(k) + i*h2(k)) mod m
Operations: insert, search, delete
Time Complexity:
- Best/Average: O(1)
- Worst: O(n) due to clustering
Space Complexity: O(n) where n is number of elements

Separate Chaining

File: src/hash_tables.py
Implementation: Each bucket contains a linked list
Operations: insert, search, delete
Time Complexity:
- Best/Average: O(1)
- Worst: O(n) if all keys hash to same bucket
Space Complexity: O(n + m) where m is table size

Theoretical Performance Analysis

Operation	Direct-Address	Open Addressing	Separate Chaining
Search	O(1)	O(1) avg, O(n) worst	O(1) avg, O(n) worst
Insert	O(1)	O(1) avg, O(n) worst	O(1) avg, O(n) worst
Delete	O(1)	O(1) avg, O(n) worst	O(1) avg, O(n) worst
Space	O(m)	O(n)	O(n + m)

Key Insights:

Direct-Address: Fastest but requires keys in known range
Open Addressing: Better cache performance, but clustering can degrade performance
Separate Chaining: More robust to high load factors, easier deletion

Getting Started

Prerequisites

Python 3.10 or later
Recommended to use a virtual environment

Installation

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Running the Examples

python examples/hash_tables_demo.py          # Hash table demonstrations
python examples/generate_plots.py            # Regenerate all figures in docs/

Running Tests

python -m pytest

The test suite verifies correctness for:

All hash function implementations
Direct-address table
Open addressing (all probe types)
Separate chaining

Reproducing the Empirical Study

Activate your environment and install dependencies.
Run python examples/generate_plots.py.
- Benchmarks may take several minutes depending on hardware.
Generated figures will be written to the docs/ directory.

Visualizations

The following visualizations demonstrate key findings from our empirical analysis:

Hash Function Comparison

This visualization compares different hash functions across multiple metrics: collision rates, distribution variance, execution time, and maximum chain lengths. It clearly demonstrates the impact of hash function design on performance.

Open Addressing vs. Separate Chaining

This comprehensive comparison shows insert, search, and delete performance across different data sizes for both open addressing (linear, quadratic, double hashing) and separate chaining methods.

Load Factor Impact

Performance curves demonstrating how operations degrade as load factor increases, highlighting the importance of maintaining appropriate load factors for optimal performance.

Collision Analysis

Detailed comparison of collision behavior, clearly demonstrating the dramatic difference between well-designed and poorly-designed hash functions.

Practical Applications

Hash Tables

Database Systems: Indexing, join operations
Caching: Memoization, LRU caches
Symbol Tables: Compilers, interpreters
Distributed Systems: Consistent hashing for load balancing
Cryptography: Hash-based data structures, digital signatures

Hash Functions

Data Integrity: Checksums, hash-based message authentication
Load Balancing: Consistent hashing in distributed systems
Bloom Filters: Probabilistic data structures
Database Indexing: Hash indexes for fast lookups

Academic Integrity Statement

This project is submitted for academic evaluation in MSCS532 – Data Structures and Algorithms. All code, analysis, and documentation were authored by Carlos Gutierrez for the specific purpose of this assignment.

References

All references are cited in REPORT.md using APA 7th edition format. Reference papers are located in the papers/ directory.

8.2 KiB Raw Blame History Unescape Escape

MSCS532 Assignment 7: Hash Tables and Their Practical Applications

Overview

Repository Structure

Part 1: Hash Functions and Their Impact

Implementation

Good Hash Functions

Bad Hash Functions (Demonstration)

API Highlights

Part 2: Hash Table Data Structures

Implementation

Direct-Address Table

Open Addressing

Separate Chaining

Theoretical Performance Analysis

Getting Started

Prerequisites

Installation

Running the Examples

Running Tests

Reproducing the Empirical Study

Visualizations

Hash Function Comparison

Open Addressing vs. Separate Chaining

Load Factor Impact

Collision Analysis

Practical Applications

Hash Tables

Hash Functions

Academic Integrity Statement

References

8.2 KiB

Raw Blame History