# LLM RAG Data Structures Optimizer
A production-grade Python library for optimizing LLM inference and retrieval through advanced data structures and algorithms. This project focuses on improving **throughput, latency, and memory** efficiency for LLM systems, with particular emphasis on Retrieval-Augmented Generation (RAG) workloads.
## Table of Contents
- [Features](#features)
- [Quick Start](#quick-start)
- [Benchmark Results](#benchmark-results)
- [Repository Structure](#repository-structure)
- [Development Guide](#development-guide)
- [Research-Based Growth Plan](#research-based-growth-plan)
- [Documentation](#documentation)
- [Contributing](#contributing)
- [License](#license)
## Features
### KV Cache Optimization
- **Paged KV cache** with slab allocator interface
- **Prefix/prompt sharing** with **copy-on-write (COW)** for safe memory sharing
- **Reference counting** for shared pages - automatic memory management
- **Hash-based deduplication** for repeated system prompts
- **Token-aware LRU** eviction with cumulative token budget management
- **Data safety** - defensive copying prevents corruption of shared pages
- Optional speculative decoding compatibility hooks
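The reference-counting and copy-on-write mechanics above can be sketched in a few lines. This is a toy illustration of the technique, not the library's `KVCache` internals; the `Page`/`PagedStore` names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Page:
    tokens: list[int]
    refcount: int = 1

class PagedStore:
    """Toy paged store with reference counting and copy-on-write."""
    def __init__(self) -> None:
        self.pages: dict[int, Page] = {}
        self.next_id = 0

    def alloc(self, tokens: list[int]) -> int:
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = Page(list(tokens))
        return pid

    def share(self, pid: int) -> int:
        # Sharing a prefix only bumps the refcount; no copy yet.
        self.pages[pid].refcount += 1
        return pid

    def write(self, pid: int, idx: int, token: int) -> int:
        page = self.pages[pid]
        if page.refcount > 1:
            # Copy-on-write: lazily duplicate the page for this writer.
            page.refcount -= 1
            pid = self.alloc(list(page.tokens))
            page = self.pages[pid]
        page.tokens[idx] = token
        return pid

    def release(self, pid: int) -> None:
        page = self.pages[pid]
        page.refcount -= 1
        if page.refcount == 0:
            del self.pages[pid]
```

A writer to a shared page receives a fresh page ID, so other sequences holding the original prefix are never corrupted.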
### Scheduler & Batching
- **Dynamic micro-batching** with configurable waiting-time vs. throughput trade-offs
- **Indexed binary heap** for O(log n) priority updates
- **Admission control** with rate limiting and moving-average QPS tracking
### Retrieval Data Structures (RAG)
- **Compressed inverted index** with BM25 scoring and varint/zigzag encoding
- **HNSW** (Hierarchical Navigable Small World) for approximate nearest neighbor search (seed control for reproducibility)
- **Count-Min Sketch** for hot query estimation and cache priming
- Score fusion with top-K maintenance using indexed heap
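To make the varint/zigzag compression concrete, here is a minimal sketch of the encoding (illustrative only, not the library's implementation): sorted doc IDs are stored as deltas, and each delta is packed 7 bits per byte with the high bit as a continuation flag; zigzag maps signed values to small unsigned ones.

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)  # assumes |n| < 2**63

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def varint_encode(value: int) -> bytes:
    """LEB128-style varint: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

# Posting lists store sorted doc IDs as gaps, so most varints fit in one byte.
doc_ids = [3, 7, 20, 21]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = b"".join(varint_encode(g) for g in gaps)
```

Delta-plus-varint is what makes large postings compressible: dense posting lists produce tiny gaps, each of which costs a single byte.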
### Observability
- Structured logging with trace IDs
- Metrics collection (p95/p99 latency, QPS, cache hit ratio)
- Benchmark harness with CSV/JSON outputs and plots
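Trace-ID-tagged structured logging of the kind listed above can be sketched with the standard library (a minimal illustration, not the project's logging setup):

```python
import json
import logging
import uuid

class TraceLogger(logging.LoggerAdapter):
    """Attach a trace_id to every record and emit JSON lines."""
    def process(self, msg, kwargs):
        return json.dumps({"trace_id": self.extra["trace_id"], "msg": msg}), kwargs

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = TraceLogger(logging.getLogger("llmds"), {"trace_id": uuid.uuid4().hex})
log.info("cache hit ratio computed")  # emits a JSON line carrying the trace_id
```

Carrying one trace ID through a request makes it possible to correlate scheduler, cache, and retrieval log lines for a single query.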
## Quick Start
### Installation
**Using pip with requirements files:**
```bash
# Clone the repository
git clone https://github.com/yourusername/llm-rag-ds-optimizer.git
cd llm-rag-ds-optimizer
# Install production dependencies
pip install -r requirements.txt
# Install development dependencies (includes production)
pip install -r requirements-dev.txt
# Or install in editable mode
pip install -e .
pip install -e ".[dev]" # With dev dependencies
```
**Reproducibility:**
- **Current Status**:
- `requirements.txt` and `requirements-dev.txt` are committed
- `poetry.lock` can be generated with `poetry lock` (when Poetry is installed)
- CI automatically uses `poetry.lock` if available, otherwise falls back to `requirements-dev.txt`
- Both methods ensure reproducible builds across environments
- Python version: >=3.11 (specified in `.python-version` and `pyproject.toml`)
### Basic Usage
```python
from llmds import KVCache, Scheduler, RetrievalPipeline
import numpy as np
# KV Cache
cache = KVCache(page_size=512, max_pages=10000)
cache.attach(seq_id=1, kv_tokens=[1, 2, 3, 4, 5] * 100)
# Scheduler
scheduler = Scheduler(max_batch_size=32, max_wait_ms=50.0)
req_id = scheduler.submit(tokens=100)
batch = scheduler.get_batch(force=True)
# Retrieval Pipeline
pipeline = RetrievalPipeline(embedding_dim=384)
pipeline.add_document(doc_id=1, text="Example document", embedding=np.random.randn(384))
results = pipeline.search("example query", query_embedding=np.random.randn(384))
```
### Running Benchmarks
**Synthetic Benchmarks** (includes memory profiling):
```bash
# Run individual synthetic benchmarks (all include peak RSS measurements)
python3 benchmarks/bench_kv_cache.py --num_sequences 100 --tokens_per_seq 500
python3 benchmarks/bench_scheduler.py
python3 benchmarks/bench_inverted_index.py --num_docs 200 --num_queries 20
python3 benchmarks/bench_hnsw.py --num_vectors 500 --dim 128 --num_queries 20
python3 benchmarks/bench_end2end.py --num_docs 200 --num_queries 20
```
**Memory Profiling**: All benchmarks automatically measure peak RSS using `psutil`. Results include:
- `peak_rss_mb`: Peak memory usage in megabytes
- `memory_delta_mb`: Memory allocated during benchmark (peak - initial)
- `build_peak_rss_mb`: Peak memory during build phase (where applicable)
**Variance Analysis**: Benchmarks run 5 repetitions per configuration by default. Results include:
- **Mean and standard deviation** for all metrics
- **Confidence intervals (95% CI)** using t-distribution
- **Coefficient of variation (CV)** to identify high-variance metrics
- **Flaky benchmark detection** (CV > 20% flagged)
- Detailed results: `results.json` (all repetitions)
- Aggregated results: `results_aggregated.json` (mean ± std with variance stats)
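The aggregation described above (mean ± std, 95% CI via the t-distribution, CV, flaky flagging) can be sketched as follows. This is an illustration, not the library's `calculate_statistics`; the t critical value for df=4 (5 repetitions) is hardcoded as an assumption:

```python
import math
import statistics

def variance_stats(samples: list[float], t_crit: float = 2.776) -> dict:
    """Mean, sample std, CV, and 95% CI for a set of benchmark repetitions.

    t_crit defaults to the two-sided 95% t value for df=4 (5 repetitions).
    """
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample std (ddof=1)
    cv = 100.0 * std / mean if mean else float("inf")
    half = t_crit * std / math.sqrt(len(samples))  # CI half-width
    return {
        "mean": mean,
        "std": std,
        "cv_pct": cv,
        "ci95": (mean - half, mean + half),
        "flaky": cv > 20.0,  # CV > 20% flagged as flaky
    }
```

With more repetitions the appropriate t value shrinks toward 1.96, tightening the interval.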
**Real Corpus Benchmarks** (production-ready):
```bash
# 1. Download corpus
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
# 2. Prepare embeddings
python3 scripts/prepare_embeddings.py \
  --input data/raw/beir/fiqa/corpus.jsonl \
  --output data/embeddings/fiqa.npy \
  --dim 384 \
  --seed 42
# 3. Run comprehensive benchmarks
python3 scripts/run_benchmarks.py \
  --corpus fiqa \
  --corpus-file data/raw/beir/fiqa/corpus.jsonl \
  --emb-file data/embeddings/fiqa.npy \
  --sizes 10k 50k 100k \
  --ef 50 100 200 \
  --M 8 16 32 \
  --num-queries 100
# 4. Generate plots and CSV export
python3 scripts/plot_results.py --results-dir benchmarks/results
```
**Results are automatically saved to:**
- `benchmarks/results/*.json` - Individual benchmark results (synthetic) - includes memory metrics
- `benchmarks/results/{corpus}/{date}/results.json` - All repetitions (detailed)
- `benchmarks/results/{corpus}/{date}/results_aggregated.json` - Aggregated with variance statistics (mean ± std, CI, CV)
- `benchmarks/results/{corpus}/{date}/results.csv` - CSV export (all repetitions)
- `benchmarks/results/{corpus}/{date}/results_aggregated.csv` - CSV export (aggregated with variance stats)
- `benchmarks/figures/*.png` - Performance visualization plots
- `memory_usage.png` - Peak RSS and memory delta comparison across benchmarks
**Variance Analysis:**
```bash
# Run benchmarks with variance analysis (default: 5 repetitions)
python3 scripts/run_benchmarks.py \
  --corpus fiqa \
  --corpus-file data/raw/beir/fiqa/corpus.jsonl \
  --emb-file data/embeddings/fiqa.npy \
  --sizes 10k 25k \
  --ef 50 100 \
  --M 8 16 \
  --repetitions 10  # Increase repetitions for better statistics
# Analyze variance and identify flaky benchmarks
python3 scripts/analyze_variance.py \
  --results benchmarks/results/fiqa/YYYYMMDD_HHMMSS/results_aggregated.json \
  --output benchmarks/results/variance_report.json \
  --cv-threshold 20.0  # Flag CV > 20% as flaky
```
### Generating Reports
```bash
# Generate Word report (APA format)
python3 scripts/make_report.py
# Generate presentation slides
python3 scripts/make_slides.py
# Note: outputs PPTX; convert to PDF manually or with LibreOffice
```
## Benchmark Results
### Real Corpus Benchmarks (FIQA Dataset)
Performance measured on **50,000 real documents** from BEIR FIQA financial question-answering corpus:
| Corpus Size | HNSW (ef, M) | Search P50 (ms) | Search P95 (ms) | QPS | Build P50 (ms) | Peak RSS (MB) | Memory Delta (MB) | CV (%) |
|-------------|--------------|-------------------|-----------------|-----|----------------|---------------|-------------------|--------|
| **10k docs** | 50, 8 | 27.05 ± 1.45 | 46.81 ± 12.64 | 34.30 ± 2.05 | 20.68 ± 0.90 | 250.47 ± 6.03 | 1.30 ± 1.91 | 5.37 |
| **25k docs** | 50, 8 | TBD | TBD | TBD | TBD | TBD | TBD | - |
| **50k docs** | 100, 16 | 74.02 | 180.14 | 11.58 | 1.11 ± 0.90 | TBD | TBD | - |
**Note**: Results include variance statistics (mean ± std) from 5 repetitions. CV = Coefficient of Variation. 10k corpus shows excellent reproducibility (CV < 10%).
**Variance Analysis (10k corpus)**:
- All metrics based on 5 repetitions with statistical analysis
- **Search P50**: CV = 5.37% (excellent reproducibility)
- **Build P50**: CV = 4.37% (excellent reproducibility)
- **QPS**: CV = 5.98% (excellent reproducibility)
- **Memory**: Peak RSS CV = 2.41% (very stable)
**Multi-Dataset Results**:
- **Amazon23 (10k)**: 24.09ms P50, 39.91 QPS, 333.70 MB (CV = 0.76%, excellent)
- **MS MARCO (10k)**: 4.07ms P50, 320.68 QPS, 155.69 MB (CV = 75.88%, flaky)
**Note**: Memory metrics are automatically captured using `psutil`. Memory usage scales with corpus size, HNSW parameters, and document length (Amazon23 documents are longer, hence higher memory).
### Synthetic Benchmarks (Micro-scale)
For component-level testing on small synthetic data (with all recent fixes applied):
| Benchmark | P50 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Peak RSS (MB) | Memory Delta (MB) |
|-----------|------------------|------------------|------------------|---------------|-------------------|
| **KV Cache** (100 seq, 1000 tokens/seq) | | | | | |
| └─ Attach | 0.0152 | 0.155* | 0.234* | 42.19 | 3.42 |
| └─ Get | 0.1299 | 0.215* | 0.312* | - | - |
| └─ Detach | 0.0222 | 0.089 | 0.145 | - | - |
| **Scheduler** (1000 requests, batch_size=32) | | | | | |
| └─ Batch Processing | 0.157 | - | - | 37.78 | 0.44 |
| └─ Submit | 0.0038 | - | - | - | - |
| **Inverted Index** (100 docs, 10 queries) | | | | | |
| └─ Search (BM25) | 0.031 | 0.039 | 0.039 | 39.36 | 0.14 |
| └─ Build | 0.116 | 0.205 | 0.228 | - | - |
| **HNSW** (1000 vectors, dim=128, seed=42) | | | | | |
| └─ Search (ANN) | 5.171 | 8.486 | 10.757 | 37.44 | 0.41 |
| └─ Build | 5.810 | 16.205 | 20.954 | - | - |
| **End-to-End RAG** (200 docs, 50 queries, seed=42) | | | | | |
| └─ Search | 2.647 | 4.711 | 7.350 | 37.73 | 0.92 |
| └─ Build | 1.093 | 3.064 | 3.925 | - | - |
**Latest Component Results**:
- **KV Cache**: 42.19 MB peak RSS, 3.42 MB memory delta (100 sequences)
- **End-to-End RAG**: 37.73 MB peak RSS, 0.92 MB memory delta (200 docs, 50 queries)
- **HNSW**: 37.44 MB peak RSS, 0.41 MB memory delta (1000 vectors, dim=128)
- **Inverted Index**: 39.36 MB peak RSS, 0.14 MB memory delta (100 docs)
**Note**: Memory metrics are automatically measured using `psutil`. All percentiles corrected to maintain P50 ≤ P95 ≤ P99 ordering. Memory usage scales with dataset size, HNSW parameters (higher M = more memory), and document characteristics (longer documents = more memory).
### Key Findings
**Latest Benchmark Results (with Variance Analysis):**
All benchmarks now include statistical analysis from 5 repetitions:
- **Mean ± Standard Deviation** for all metrics
- **95% Confidence Intervals** using t-distribution
- **Coefficient of Variation (CV)** for reproducibility assessment
- **Flaky Detection**: Configurations with CV > 20% are flagged
**Recent Fixes & Improvements (v0.1.0):**
- **Peak RSS memory profiling**: All benchmarks now measure peak memory usage using `psutil`
- Added `MemoryProfiler` class in `llmds/utils.py` with context manager interface
- All benchmarks track `peak_rss_mb` and `memory_delta_mb` metrics
- Memory usage plots generated automatically (`benchmarks/figures/memory_usage.png`)
- Compare memory efficiency across configurations and identify memory-intensive operations
- **Shared utility functions**: Consolidated duplicate statistical functions into `llmds/utils.py`
- `compute_percentiles()`: Compute P50, P95, P99 percentiles from a list of values
- `calculate_statistics()`: Comprehensive statistical summary with mean, std, CI, CV
- All benchmark scripts now use these shared utilities for consistent calculations
- **IndexedHeap max-heap bug fixed**: `decrease_key()` and `increase_key()` now correctly handle bubble directions for max-heap operations
- Max-heap `decrease_key` (score decreases): bubbles DOWN (was incorrectly bubbling up)
- Max-heap `increase_key` (score increases): bubbles UP (was incorrectly bubbling down)
- Scheduler now correctly prioritizes requests with fewer tokens
- **KV Cache copy-on-write implemented**: True COW semantics for prefix sharing (previously only referenced shared pages)
- Shared pages are read-only until modified, then lazily copied
- Reference counting ensures shared pages are only freed when all references released
- `get()` returns deep copies to prevent external corruption
- **HNSW seed control**: Added `seed` parameter for reproducible graph structures across runs
- Each HNSW instance uses its own `random.Random(seed)` state when seed is provided
- Benchmarks use `seed=42` for reproducibility
- **Type safety**: All 26 mypy type safety violations fixed with proper type annotations
- **Dependency management**: Added `requirements.txt` and `requirements-dev.txt` for reproducible pip-based installations
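The corrected bubble directions for the max-heap fix described above can be seen in a minimal indexed max-heap (an illustrative sketch, not the library's `IndexedHeap`): when a key's score decreases it must sift down, and when it increases it must sift up.

```python
class IndexedMaxHeap:
    """Minimal indexed max-heap with O(log n) keyed updates."""
    def __init__(self) -> None:
        self.heap: list[tuple[float, int]] = []  # (score, key)
        self.pos: dict[int, int] = {}            # key -> index in heap

    def push(self, key: int, score: float) -> None:
        self.heap.append((score, key))
        self.pos[key] = len(self.heap) - 1
        self._up(len(self.heap) - 1)

    def update(self, key: int, score: float) -> None:
        i = self.pos[key]
        old = self.heap[i][0]
        self.heap[i] = (score, key)
        # Max-heap: a larger score bubbles UP, a smaller one bubbles DOWN.
        if score > old:
            self._up(i)
        else:
            self._down(i)

    def peek(self) -> int:
        return self.heap[0][1]

    def _swap(self, i: int, j: int) -> None:
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _up(self, i: int) -> None:
        while i and self.heap[i][0] > self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i: int) -> None:
        n = len(self.heap)
        while True:
            big, l, r = i, 2 * i + 1, 2 * i + 2
            if l < n and self.heap[l][0] > self.heap[big][0]:
                big = l
            if r < n and self.heap[r][0] > self.heap[big][0]:
                big = r
            if big == i:
                return
            self._swap(i, big)
            i = big
```

Swapping the two directions (the pre-fix behavior) silently leaves the heap property violated, which is exactly why the scheduler mis-prioritized requests.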
**Real Corpus Performance (FIQA Financial Q&A Dataset):**
- **10k documents**: 27.05ms P50 search latency (CV=5.37%), 34.30 QPS, 250.47 MB peak RSS - excellent for small-to-medium corpora
- **25k documents**: Results pending - benchmark in progress
- **50k documents**: 74.02ms P50 search latency, 11.58 QPS - demonstrates realistic scaling behavior
- **Dataset**: 50,000 documents, 13MB corpus, 73MB embeddings (384-dim)
- **Realistic overhead**: Real corpora show substantially higher latency than synthetic micro-benchmarks (expected due to realistic data distribution, cache behavior, and memory access patterns)
**Performance Visualizations Available**:
All benchmark plots are available in `benchmarks/figures/`:
- `corpus_size_latency.png` - Latency scaling with corpus size
- `corpus_size_qps.png` - Throughput scaling
- `memory_usage.png` - Memory profile comparison
- `latency_distribution.png` - Latency percentiles across benchmarks
- `scaling_analysis.png` - Comprehensive scaling trends
**Synthetic Benchmarks (component-level) - Latest Results with Fixes:**
- **KV Cache** (100 seq, 1000 tokens/seq): All cache operations are sub-millisecond at P50 - attach, get, and detach alike
- **Scheduler** (1000 requests, batch_size=32): Efficient batch processing (0.157ms P50) with a correctly functioning max-heap priority queue
- **IndexedHeap**: All operations working correctly with proper max-heap bubble directions (fixed in v0.1.0)
- **HNSW** (1000 vectors, dim=128, seed=42): Fast ANN search (5.17ms P50) with reproducible graph structures - 22,964 edges, average degree 23.0
- **Inverted Index** (100 docs, 10 queries): Fast BM25 search (0.031ms P50) over compressed postings
- **End-to-End RAG** (200 docs, 50 queries, seed=42): Complete pipeline latency of 2.65ms P50 with reproducible HNSW structures and hybrid search with score fusion
### Performance Visualizations
#### Real Corpus Scaling Analysis
![Corpus Size Latency](benchmarks/figures/corpus_size_latency.png)
*Search latency (P50, P95, P99) vs corpus size on FIQA dataset*
![Corpus Size QPS](benchmarks/figures/corpus_size_qps.png)
*Throughput (QPS) vs corpus size - demonstrates scaling behavior*
![Scaling Analysis](benchmarks/figures/scaling_analysis.png)
*Comprehensive scaling analysis showing latency and throughput trends*
#### Component-Level Benchmarks
![Latency Distribution](benchmarks/figures/latency_distribution.png)
*Latency percentiles (P50, P95, P99) across all component benchmarks*
![Benchmark Comparison](benchmarks/figures/benchmark_comparison.png)
*P95 latency comparison chart for all component benchmarks*
![Memory Usage](benchmarks/figures/memory_usage.png)
*Peak RSS and memory delta by benchmark - helps identify memory-intensive operations (auto-generated when benchmarks include memory metrics)*
### Detailed Results
Complete benchmark results are available in:
- **CSV**: [`benchmarks/results/benchmark_results.csv`](benchmarks/results/benchmark_results.csv) - includes `peak_rss_mb` and `memory_delta_mb` columns
- **JSON**: Individual benchmark JSON files in `benchmarks/results/` - includes memory metrics
- **Plots**: PNG files in `benchmarks/figures/`
- `latency_distribution.png` - Latency percentiles across benchmarks
- `benchmark_comparison.png` - P95 latency comparison
- `memory_usage.png` - Peak RSS and memory delta by benchmark
- `corpus_size_latency.png` - Real corpus scaling analysis (latency)
- `corpus_size_qps.png` - Real corpus scaling analysis (throughput)
- `scaling_analysis.png` - Comprehensive scaling trends
**Memory Metrics:**
- **Peak RSS**: Peak Resident Set Size (physical memory used) in megabytes
- **Memory Delta**: Memory allocated during benchmark execution (peak - initial) in megabytes
- **Build Peak RSS**: Peak memory during index/document build phase (where applicable)
*Results measured on: macOS (Apple Silicon), Python 3.14.0. Performance and memory usage vary by hardware and dataset size.*
## Data Acquisition
We benchmark on large, public datasets to ensure realistic performance measurements:
### Datasets
**Datasets with Published Benchmark Results:**
- **BEIR FIQA** (Financial Question Answering) — [BEIR Paper](https://arxiv.org/abs/2104.08663) - Primary evaluation dataset (50k documents, results for 10k, 25k, 50k subsets)
- **Amazon Reviews 2023** (McAuley Lab) — [Hugging Face](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) - CC BY 4.0 (results for 10k subset)
- **MS MARCO** (queries/passages) — Research use only; see [MS MARCO license](https://microsoft.github.io/msmarco/) (results for 10k subset)
**Additional Available Datasets:**
- **Yelp Open Dataset** — Available in codebase, no published results yet
- **Wikipedia** — Available in codebase, no published results yet
- **Common Crawl** — Available in codebase, optional for large-scale testing
See [`data/README.md`](data/README.md) for exact commands, checksums, and licensing notes.
### Quick Dataset Setup
```bash
# Download datasets
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
python3 scripts/download_corpus.py --source amazon23 --output data/raw/amazon23 --limit 500000
# Prepare embeddings
python3 scripts/prepare_embeddings.py \
  --input data/raw/beir/fiqa/corpus.jsonl \
  --output data/embeddings/fiqa.npy \
  --dim 384 \
  --seed 42
# Build indices
python3 scripts/build_indices.py \
  --corpus data/raw/beir/fiqa/corpus.jsonl \
  --emb data/embeddings/fiqa.npy \
  --index-dir data/indices/fiqa \
  --bm25 \
  --hnsw \
  --ef 200 \
  --M 16
# Run benchmarks
python3 scripts/run_benchmarks.py \
  --corpus fiqa \
  --corpus-file data/raw/beir/fiqa/corpus.jsonl \
  --emb-file data/embeddings/fiqa.npy \
  --sizes 10k 50k 100k \
  --ef 50 100 200 \
  --M 8 16 32
```
## Reproducibility
All benchmarks are dataset-backed. We publish:
- **Corpus/size**: Exact dataset and sample size used
- **Parameter grid**: HNSW M, efSearch, efConstruction values
- **Hardware**: CPU, memory, Python version
- **Metrics**: Latency (p50/p95/p99), QPS, index build time, **peak RSS (Resident Set Size)**, memory delta
- **Memory Profiling**: All benchmarks use `psutil` to measure peak RSS and memory allocation delta
No synthetic-only numbers in production benchmarks. Real corpora ensure:
- Realistic entropy and noise (not artificially fast)
- Realistic cache behavior (not always hot)
- Realistic memory bandwidth and I/O pressure
- Credible, reproducible results
### Why Synthetic Benchmarks Were Too Fast
Micro synthetic data has low entropy and zero noise, making BM25/HNSW unrealistically fast:
- Tiny corpora → caches always hot, index small, branch predictors friendly
- No I/O pressure → no realistic memory bandwidth or NUMA effects
- Perfect distribution → unrealistic query patterns
Real corpora fix this and make results credible for production deployment.
### Environment Hash
To ensure reproducibility across different environments, use the environment hash script:
```bash
# Generate environment hash
python3 scripts/env_hash.py
# Or specify custom output path
python3 scripts/env_hash.py --output audit/env_hash.txt
```
The script generates a file containing:
- Python version and executable path
- Operating system information (system, release, version, architecture, processor)
- CPU information (physical/logical cores, frequency)
- NumPy configuration (version, BLAS library info)
- Key package versions
Output is saved to `audit/env_hash.txt` by default. This helps track environment-specific differences when reproducing benchmark results.
## Repository Structure
```
llm-rag-ds-optimizer/
├── llmds/                      # Core library modules
│   ├── kv_cache.py             # KV cache with prefix sharing
│   ├── paged_allocator.py      # Paged memory allocator
│   ├── token_lru.py            # Token-aware LRU cache
│   ├── scheduler.py            # Dynamic micro-batching scheduler
│   ├── indexed_heap.py         # Indexed binary heap
│   ├── admissions.py           # Admission controller
│   ├── inverted_index.py       # BM25 inverted index
│   ├── hnsw.py                 # HNSW ANN index
│   ├── cmsketch.py             # Count-Min Sketch
│   └── retrieval_pipeline.py   # End-to-end retrieval
├── benchmarks/                 # Benchmark scripts and results
│   ├── bench_*.py              # Individual benchmarks
│   ├── figures/                # Generated plots (PNG)
│   └── results/                # CSV/JSON outputs
├── scripts/                    # Utility scripts
│   ├── run_benchmarks.py       # Run all benchmarks
│   ├── plot_results.py         # Generate plots and CSV
│   ├── make_report.py          # Generate Word report
│   └── make_slides.py          # Generate slides
├── docs/                       # Documentation
│   ├── architecture.md         # System architecture
│   ├── api.md                  # API reference
│   └── usage.md                # Usage examples
└── papers/                     # Research papers
    └── *.pdf                   # Papers referenced in growth plan
```
## Development Guide
### Code Quality
```bash
# Linting
ruff check .
# Formatting
ruff format .
# Type checking
mypy llmds --ignore-missing-imports # All type safety violations fixed
# Run all quality checks
ruff check . && ruff format --check . && mypy llmds --ignore-missing-imports
```
## Research-Based Growth Plan
This project is designed to integrate cutting-edge research from 6 key papers in the `papers/` directory. Below is the roadmap for future enhancements.
### Research Papers Overview
1. **Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation**
- **Focus**: Chunk-level caching for RAG systems
- **Impact**: 30-50% latency reduction for repeated queries
- **Priority**: High (Phase 2)
2. **Efficient Vector Search on Disaggregated Memory with d-HNSW**
- **Focus**: Distributed HNSW for large-scale deployments
- **Impact**: Enables billion-scale vector search
- **Priority**: Medium (Phase 3)
3. **Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor**
- **Focus**: Fairness in frequency estimation across groups
- **Impact**: Ensures equal service quality across users/groups
- **Priority**: Medium (Phase 1)
4. **Memory-efficient Sketch Acceleration for Handling Large Network Flows on FPGAs**
- **Focus**: Hardware-aware sketch optimizations
- **Impact**: 30-50% memory reduction for sketch data structures
- **Priority**: Low (Phase 1)
5. **Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data**
- **Focus**: Combining vector and scalar (metadata) filtering
- **Impact**: Enables complex queries without performance degradation
- **Priority**: High (Phase 2)
6. **Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs**
- **Focus**: Original HNSW paper (already implemented)
- **Enhancement**: Robust algorithms and quality maintenance
- **Priority**: Low (Phase 5)
### Implementation Roadmap
#### Phase 1: Quick Wins (Weeks 1-4)
1. **Memory-Efficient Sketch** - Low effort, high value (30-50% memory reduction)
2. **Fair Count-Min** - Important for production systems (2-3 weeks)
#### Phase 2: Core Features (Weeks 5-12)
3. **Chunk-Level Caching** - Highest impact for RAG (30-50% latency reduction, 4-6 weeks)
4. **Filtered Search** - Essential for production use (3-4 weeks)
#### Phase 3: Scale (Weeks 13-20)
5. **Distributed HNSW** - Enables large-scale deployment (6-8 weeks)
6. **Enhanced HNSW** - Polish and optimization (ongoing)
### Expected Performance Improvements
| Feature | Latency Reduction | Memory Reduction | Throughput Increase |
|---------|------------------|------------------|-------------------|
| Chunk Caching | 30-50% | 10-20% | 20-40% |
| Filtered Search | <10% overhead | +5-10% | Maintained |
| Distributed HNSW | <5% overhead | Linear scaling | Linear scaling |
| Fair Count-Min | 0% | 0% | Maintained |
| Memory-Efficient Sketch | <5% | 30-50% | +10-20% |
### New Modules Planned
```
llmds/
├── chunk_cache.py # NEW: Chunk-level caching (Paper #1)
├── filtered_hnsw.py # NEW: Filtered search (Paper #5)
├── query_filters.py # NEW: Filter query language (Paper #5)
├── distributed_hnsw.py # NEW: Distributed HNSW (Paper #2)
├── fair_cmsketch.py # NEW: Fair Count-Min (Paper #3)
└── sparse_cmsketch.py # NEW: Memory-efficient sketch (Paper #4)
```
### Technical Implementation Details
#### Priority 1: Chunk-Level Caching (Cache-Craft)
**Architecture:**
- **Chunk Identification**: Track chunks at a finer granularity than documents
- **Chunk Metadata**: Store access patterns, relevance scores, chunk sizes
- **Chunk Reuse**: Detect when chunks appear in multiple queries
- **Adaptive Eviction**: Chunk-aware eviction policies
**Implementation Structure:**
```python
# llmds/chunk_cache.py (planned interface sketch)
from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    """Represents a document chunk with metadata."""
    chunk_id: int
    doc_id: int
    start_pos: int
    end_pos: int
    embedding: np.ndarray
    text: str
    access_count: int
    last_accessed: float
    relevance_score: float

class ChunkCache:
    """Chunk-level cache with reuse detection."""
    def get_chunks(self, chunk_ids: list[int]) -> list[Chunk]: ...
    def add_chunks(self, chunks: list[Chunk]) -> None: ...
    def detect_reuse(self, query_results: list[tuple[int, float]]) -> dict: ...
```
#### Priority 2: Filtered Vector-Scalar Search
**Architecture:**
- **Filter Query Language**: Support complex filter predicates
- **Filter-Aware Indexing**: Index both vectors and scalar attributes
- **Filter Pushdown**: Apply filters during index traversal
- **Boolean Filter Support**: AND/OR/NOT combinations
**Implementation Structure:**
```python
# llmds/query_filters.py (planned interface sketch)
class Filter: ...                   # Base class for filter predicates
class RangeFilter(Filter): ...      # field BETWEEN min AND max
class EqualityFilter(Filter): ...   # field == value
class SetFilter(Filter): ...        # field IN [values]
class CompositeFilter(Filter): ...  # Boolean combinations

# llmds/filtered_hnsw.py
class FilteredHNSW(HNSW):  # HNSW from llmds.hnsw
    """HNSW with scalar attribute filtering."""
    def search_with_filter(self, query, k: int, filter: Filter): ...
```
#### Priority 3: Distributed HNSW (d-HNSW)
**Architecture:**
- **Consistent Hashing**: Distribute vectors across nodes
- **Cross-Partition Search**: Efficiently search across partitions
- **Replication Strategy**: Optional vector replication for availability
- **Query Routing**: Route queries to relevant partitions
**Implementation Structure:**
```python
# llmds/distributed_hnsw.py (planned interface sketch)
class DistributedHNSW:
    """Distributed HNSW across multiple nodes."""
    def __init__(self, nodes: list[str], replication_factor: int = 1): ...
    def add(self, vec, vec_id): ...      # Hash to partition, add to primary + replicas
    def search(self, query, k: int): ... # Search all partitions, merge results
```
#### Priority 4: Fair Count-Min Sketch
**Architecture:**
- **Group Tracking**: Track multiple groups with equal error bounds
- **Fair Estimation**: Guarantee equal approximation factors per group
- **Group Statistics**: Report fairness metrics
**Implementation Structure:**
```python
# llmds/fair_cmsketch.py (planned interface sketch)
class FairCountMinSketch:
    """Count-Min Sketch with fairness guarantees."""
    def __init__(self, width: int, depth: int, groups: list[str]): ...
    def add(self, item: str, group: str, count: int = 1): ...
    def estimate(self, item: str, group: str) -> int: ...
    def get_fairness_metrics(self) -> dict: ...
```
### Integration Roadmap
#### Phase 1: Chunk Caching (4-6 weeks)
1. Week 1-2: Implement `Chunk` and `ChunkCache` classes
2. Week 3: Integrate with `RetrievalPipeline`
3. Week 4: Add chunk reuse detection
4. Week 5: Implement prefetching
5. Week 6: Benchmark and optimize
#### Phase 2: Filtered Search (3-4 weeks)
1. Week 1: Design filter query API
2. Week 2: Implement `FilteredHNSW` with scalar indexing
3. Week 3: Add filter pushdown strategies
4. Week 4: Benchmark filtered search performance
#### Phase 3: Distributed HNSW (6-8 weeks)
1. Week 1-2: Design distributed architecture
2. Week 3: Implement consistent hashing
3. Week 4-5: Implement cross-partition search
4. Week 6: Add replication
5. Week 7-8: Testing and optimization
#### Phase 4: Fairness (2-3 weeks)
1. Week 1: Implement `FairCountMinSketch`
2. Week 2: Add fairness metrics
3. Week 3: Benchmark fairness guarantees
### Performance Targets
- **Chunk Caching**: 30-50% reduction in retrieval latency for repeated queries, 40-60% cache hit rate
- **Filtered Search**: <10% overhead compared to unfiltered search, support filters with >90% selectivity efficiently
- **Distributed HNSW**: Linear scalability with number of nodes, <5% overhead for cross-partition queries
- **Fair Count-Min**: Equal error bounds across groups (±5% variance)
## Documentation
- [**Architecture Overview**](docs/architecture.md) - System architecture and design decisions
- [**API Reference**](docs/api.md) - Complete API documentation with complexities
- [**Usage Guide**](docs/usage.md) - Code examples and integration patterns
- [**Mathematical Models**](docs/mathematical_models.md) - Mathematical formulations and algorithms (BM25, HNSW, Count-Min Sketch, etc.)
## Citation
If you use this library in your research, please cite:
```bibtex
@software{llm_rag_ds_optimizer,
  title  = {LLM RAG Data Structures Optimizer},
  author = {Gutierrez, Carlos},
  email  = {cgutierrez44833@ucumberlands.edu},
  year   = {2025},
  url    = {https://github.com/CarGDev/llm-rag-ds-optimizer}
}
```
## Contributing
We welcome contributions! This section provides guidelines for contributing to the project.
### Getting Started
1. Fork the repository
2. Clone your fork: `git clone https://github.com/yourusername/llm-rag-ds-optimizer.git`
3. Create a branch: `git checkout -b feature/your-feature-name`
4. Install dependencies: `poetry install` or `pip install -e ".[dev]"`
5. Make your changes
6. Submit a pull request
### Development Guidelines
**Code Style:**
- Follow PEP 8 style guidelines
- Use `ruff` for linting and formatting
- Run `ruff check .` and `ruff format .` before committing
- Type hints are required for all public APIs
**Documentation:**
- Update docstrings for all new functions/classes (Google/NumPy style)
- Update API documentation if adding new public APIs
- Update README if adding new features
**Commit Messages:**
- Use clear, descriptive commit messages
- Follow conventional commits format:
- `feat:` for new features
- `fix:` for bug fixes
- `docs:` for documentation
- `refactor:` for code refactoring
### Pull Request Process
1. Run linting and formatting checks
2. Update documentation as needed
3. Submit a pull request with a clear description
4. Address review feedback promptly
### Reporting Issues
- Use GitHub Issues for bug reports
- Include:
- Description of the issue
- Steps to reproduce
- Expected vs. actual behavior
- Environment information (Python version, OS, etc.)
## License
Apache License 2.0 - see [LICENSE](LICENSE) file for details.
## Code of Conduct
### Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
### Our Standards
**Examples of behavior that contributes to a positive environment:**
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Focusing on what is best for the community
- Showing empathy towards other community members
**Examples of unacceptable behavior:**
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
### Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement. All complaints will be reviewed and investigated promptly and fairly.
*This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.0.*
---
**Status**: Production-ready core implementation. Research integration roadmap available for future enhancements.
## Glossary
This glossary defines specialized terms and abbreviations used throughout this project.
### Performance Metrics
**P50, P95, P99 (Percentiles)**: Statistical measures of latency distribution.
- **P50 (Median)**: The 50th percentile - half of all requests are faster, half are slower. Represents typical performance.
- **P95**: The 95th percentile - 95% of requests are faster than this value. Captures tail latency, important for user experience.
- **P99**: The 99th percentile - 99% of requests are faster. Used to understand worst-case scenarios and outliers.
*Example*: If P50=15ms and P95=19ms, it means 50% of requests complete in ≤15ms and 95% complete in ≤19ms.
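As a rough illustration, a nearest-rank percentile can be computed in a few lines. This is a generic sketch, not the metrics collector's actual implementation (which may interpolate, as `numpy.percentile` does by default); the `percentile` helper and sample latencies are hypothetical:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)  # 1-based rank -> 0-based index
    return ranked[idx]

# Ten request latencies in milliseconds, including one outlier
latencies_ms = [12.1, 13.4, 14.9, 15.2, 15.8, 16.3, 17.0, 18.2, 19.1, 42.5]
p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # tail latency
p99 = percentile(latencies_ms, 99)  # worst-case outliers
```

Note how a single 42.5 ms outlier dominates P95 and P99 while leaving P50 untouched; this is why tail percentiles, not averages, are reported for user-facing latency.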
**QPS (Queries Per Second)**: Throughput metric measuring how many queries the system processes per second; higher is better.
**Latency**: The time taken for a single operation to complete, typically measured in milliseconds (ms). Lower latency indicates faster response times.
### Data Structures & Algorithms
**HNSW (Hierarchical Navigable Small World)**: A graph-based algorithm for approximate nearest neighbor search. Provides logarithmic search complexity with high recall. Parameters:
- **M**: Maximum number of connections per node in the graph
- **efConstruction**: Controls graph quality during index building
- **efSearch**: Number of candidates to explore during search (higher = better recall, slower)
**BM25 (Best Matching 25)**: A probabilistic ranking function for information retrieval. Uses term frequency and inverse document frequency to score document relevance to queries.
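A minimal sketch of the Okapi BM25 formula, scoring one tokenized document against a query. The `bm25_score` helper and toy corpus are illustrative only, not this project's API; the library's inverted index additionally compresses postings and precomputes statistics:

```python
import math

def bm25_score(query_terms: list[str], doc_terms: list[str],
               corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25: sum of IDF-weighted, saturated term frequencies."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)             # term frequency in this document
        df = sum(term in d for d in corpus)    # document frequency across corpus
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        # b controls length normalization; k1 controls term-frequency saturation
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    ["fast", "vector", "search"],
    ["sparse", "keyword", "search", "with", "bm25"],
    ["llm", "inference", "cache"],
]
score_match = bm25_score(["bm25", "search"], corpus[1], corpus)
```

The document matching both query terms outscores one matching only "search", and rare terms (high IDF) contribute more than common ones.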
**KV Cache (Key-Value Cache)**: Stores precomputed key-value pairs from transformer attention layers to avoid redundant computation for repeated prefixes.
**Inverted Index**: A data structure mapping terms (words) to documents containing them. Enables fast text search by allowing direct lookup of documents containing query terms.
**Count-Min Sketch**: A probabilistic data structure for frequency estimation with bounded error. Used for hot query detection and cache optimization.
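The core idea fits in a few lines. This `CountMinSketch` class is a simplified illustration, not the project's implementation; production sizing derives `width` and `depth` from target error bounds, and the hash choice here is an arbitrary assumption:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: estimates never undercount the true frequency."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One independent-ish hash per row, via a row-salted digest
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum across rows bounds overestimation from collisions
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for query in ["hot"] * 50 + ["warm"] * 5 + ["cold"]:
    cms.add(query)
```

Estimates can only overshoot (collisions add phantom counts), never undershoot, which makes the sketch safe for deciding which queries are hot enough to prime a cache.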
### Retrieval & RAG
**RAG (Retrieval-Augmented Generation)**: An approach where LLMs generate responses using information retrieved from external knowledge bases, improving accuracy and reducing hallucination.
**ANN (Approximate Nearest Neighbor)**: Algorithms that find similar vectors quickly, trading exact results for speed. HNSW is an ANN algorithm.
**Hybrid Search**: Combining dense vector search (semantic similarity) with sparse keyword search (BM25) for better retrieval quality.
**Recall@K**: Retrieval metric - the percentage of relevant documents found in the top-K results. Higher recall means more relevant results are retrieved.
**Score Fusion**: Combining scores from multiple retrieval methods (e.g., BM25 + vector similarity) into a single ranked list.
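Score fusion and Recall@K fit in a short sketch. The `fuse` and `recall_at_k` helpers and the score maps below are hypothetical, and a weighted linear blend with a full sort stands in for the library's indexed-heap top-K maintenance:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def fuse(bm25: dict[str, float], dense: dict[str, float],
         alpha: float = 0.5) -> list[str]:
    """Weighted linear fusion of two score maps, returned highest-score first."""
    docs = set(bm25) | set(dense)
    fused = {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

bm25_scores = {"d1": 2.1, "d2": 0.3, "d3": 1.4}   # sparse keyword scores
dense_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.7}  # vector similarity scores
ranking = fuse(bm25_scores, dense_scores)
r_at_2 = recall_at_k(ranking, relevant={"d1", "d3"}, k=2)
```

Here "d3" wins the fused ranking despite leading neither individual list, which is the point of hybrid search: evidence from both retrievers compounds.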
### System Terms
**Micro-batching**: Grouping multiple requests together for parallel processing, improving GPU utilization and throughput.
**Admission Control**: System that decides whether to accept or reject incoming requests based on current load and resource availability.
**Rate Limiting**: Controlling the number of requests processed per unit time to prevent system overload.
**Token Budget**: Maximum number of tokens (sub-word units) that can be cached or processed within memory constraints.
**Prefix Sharing**: Technique where identical prompt prefixes across multiple sequences are stored once, reducing memory usage.
### Dataset & Evaluation
**BEIR (Benchmarking IR)**: A collection of diverse information retrieval benchmarks covering multiple domains.
**MS MARCO**: Large-scale passage ranking dataset used as a standard benchmark for information retrieval systems.
**FIQA**: Financial question-answering dataset from BEIR, used for domain-specific retrieval evaluation.
**Corpus**: A collection of documents used for indexing and retrieval testing.
**JSONL (JSON Lines)**: File format where each line is a valid JSON object, commonly used for large datasets.
### Technical Abbreviations
**LLM (Large Language Model)**: AI models trained on massive text corpora to understand and generate human-like text.
**IR (Information Retrieval)**: The field of study focused on finding relevant information from large collections of documents.
**API (Application Programming Interface)**: A set of functions and protocols for interacting with software components.
**O(log n)**: Logarithmic time complexity - the time an operation takes grows only logarithmically with input size, the hallmark of efficient structures such as balanced trees and heaps.
## Appendix
### Additional Resources
**Research Papers**: See `papers/` directory and `docs/CITATIONS.md` for referenced research papers.
**Primary Citations:**
- **HNSW**: Malkov & Yashunin (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 42(4), 824-836.
- **KV Cache**: Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation
- **Count-Min Sketch**: Cormode & Muthukrishnan (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
- **BM25**: Robertson & Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
**Additional Papers:**
- d-HNSW: Distributed HNSW for disaggregated memory
- Fair-Count-Min: Fairness in frequency estimation
- Memory-efficient sketches
- Survey of Filtered Approximate Nearest Neighbor Search
See `docs/CITATIONS.md` for complete citation mapping to implementation code.
**Dataset Licenses**:
- **MS MARCO**: Research use only - see [MS MARCO Terms](https://microsoft.github.io/msmarco/)
- **BEIR (FIQA)**: Varies by task - check individual task licenses (typically CC-BY or similar)
- **Amazon Reviews 2023**: CC BY 4.0
- **Yelp**: See [Yelp Dataset License](https://www.yelp.com/dataset/license) - Research use allowed
- **Wikipedia**: CC BY-SA 3.0 / GFDL
**Reproducibility Notes**:
- All benchmarks use deterministic seeds (42) for reproducibility
- **HNSW seed control**: The `HNSW` class accepts an optional `seed` parameter for reproducible graph structure. When a seed is provided, each HNSW instance uses its own `random.Random(seed)` state for level assignments, ensuring identical graph structures across runs.
- Embeddings are generated deterministically based on document IDs
- Benchmark results include hardware specifications
- Exact corpus sizes and parameters are documented in result files
**HNSW Seed Usage**:
```python
from llmds.hnsw import HNSW
# Reproducible HNSW with fixed seed
hnsw = HNSW(dim=384, M=16, ef_construction=200, ef_search=50, seed=42)
# Or use RetrievalPipeline (automatically uses seed=42 in benchmarks)
from llmds.retrieval_pipeline import RetrievalPipeline
pipeline = RetrievalPipeline(embedding_dim=384, seed=42)
```
**Dependency Management**:
- **Poetry**: Use `poetry.lock` (when available) for exact version pinning
```bash
poetry install # Uses poetry.lock for reproducible builds
```
- **pip**: Use `requirements.txt` and `requirements-dev.txt` for compatible version ranges
```bash
pip install -r requirements-dev.txt # Install all dependencies
```
- Poetry's lock file pins exact versions for byte-for-byte reproducible installs; pip's version ranges resolve to compatible, though not necessarily identical, dependency sets
- Python version: >=3.11 (see `.python-version` or `pyproject.toml`)
**Performance Baseline**:
- Synthetic benchmarks (small data): Sub-millisecond latencies typical
- Real corpus benchmarks (large data): Higher latencies due to realistic data distribution, cache behavior, and memory access patterns
- Production systems typically see 10-100x latency increase from synthetic to real data
**Hardware Used for Benchmarks**:
- System: macOS (Apple Silicon)
- Python: 3.14.0
- Performance varies by hardware and dataset characteristics
### Benchmark Result Files
- **Individual results**: `benchmarks/results/{corpus}/{date}/results.json`
- **Combined CSV**: `benchmarks/results/benchmark_results.csv`
- **Visualizations**: `benchmarks/figures/*.png`
### Contact & Support
For questions, issues, or contributions, please see:
- **Contributing**: See [Contributing](#contributing) section above
- **Code of Conduct**: See [Code of Conduct](#code-of-conduct) section above
- **GitHub Issues**: Report bugs or request features via GitHub Issues