LLM RAG Data Structures Optimizer

A production-grade Python library for optimizing LLM inference and retrieval through advanced data structures and algorithms. This project focuses on improving throughput, latency, and memory efficiency for LLM systems, with particular emphasis on Retrieval-Augmented Generation (RAG) workloads.

Table of Contents

Features

KV Cache Optimization

  • Paged KV cache with slab allocator interface
  • Prefix/prompt sharing with copy-on-write (COW) for safe memory sharing
  • Reference counting for shared pages - automatic memory management
  • Hash-based deduplication for repeated system prompts
  • Token-aware LRU eviction with cumulative token budget management
  • Data safety - defensive copying prevents corruption of shared pages
  • Optional speculative decoding compatibility hooks

Scheduler & Batching

  • Dynamic micro-batching with configurable waiting-time vs. throughput trade-offs
  • Indexed binary heap for O(log n) priority updates
  • Admission control with rate limiting and moving-average QPS tracking

Retrieval Data Structures (RAG)

  • Compressed inverted index with BM25 scoring and varint/zigzag encoding
  • HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search (seed control for reproducibility)
  • Count-Min Sketch for hot query estimation and cache priming
  • Score fusion with top-K maintenance using indexed heap

Observability

  • Structured logging with trace IDs
  • Metrics collection (p95/p99 latency, QPS, cache hit ratio)
  • Benchmark harness with CSV/JSON outputs and plots

Quick Start

Installation

Using pip with requirements files:

# Clone the repository
git clone https://github.com/yourusername/llm-rag-ds-optimizer.git
cd llm-rag-ds-optimizer

# Install production dependencies
pip install -r requirements.txt

# Install development dependencies (includes production)
pip install -r requirements-dev.txt

# Or install in editable mode
pip install -e .
pip install -e ".[dev]"  # With dev dependencies

Reproducibility:

  • pip (alternative using requirements files):
    # Install production dependencies
    pip install -r requirements.txt
    
    # Install development dependencies (includes production)
    pip install -r requirements-dev.txt
    
  • Current Status:
    • requirements.txt and requirements-dev.txt are committed
    • poetry.lock can be generated with poetry lock (when Poetry is installed)
    • CI automatically uses poetry.lock if available, otherwise falls back to requirements-dev.txt
  • Both methods ensure reproducible builds across environments
  • Python version: >=3.11 (specified in .python-version and pyproject.toml)

Basic Usage

from llmds import KVCache, Scheduler, RetrievalPipeline
import numpy as np

# KV Cache
cache = KVCache(page_size=512, max_pages=10000)
cache.attach(seq_id=1, kv_tokens=[1, 2, 3, 4, 5] * 100)

# Scheduler
scheduler = Scheduler(max_batch_size=32, max_wait_ms=50.0)
req_id = scheduler.submit(tokens=100)
batch = scheduler.get_batch(force=True)

# Retrieval Pipeline
pipeline = RetrievalPipeline(embedding_dim=384)
pipeline.add_document(doc_id=1, text="Example document", embedding=np.random.randn(384))
results = pipeline.search("example query", query_embedding=np.random.randn(384))

Running Benchmarks

Synthetic Benchmarks (includes memory profiling):

# Run individual synthetic benchmarks (all include peak RSS measurements)
python3 benchmarks/bench_kv_cache.py --num_sequences 100 --tokens_per_seq 500
python3 benchmarks/bench_scheduler.py
python3 benchmarks/bench_inverted_index.py --num_docs 200 --num_queries 20
python3 benchmarks/bench_hnsw.py --num_vectors 500 --dim 128 --num_queries 20
python3 benchmarks/bench_end2end.py --num_docs 200 --num_queries 20

Memory Profiling: All benchmarks automatically measure peak RSS using psutil. Results include:

  • peak_rss_mb: Peak memory usage in megabytes
  • memory_delta_mb: Memory allocated during benchmark (peak - initial)
  • build_peak_rss_mb: Peak memory during build phase (where applicable)

Variance Analysis: Benchmarks run 5 repetitions per configuration by default. Results include:

  • Mean and standard deviation for all metrics
  • Confidence intervals (95% CI) using t-distribution
  • Coefficient of variation (CV) to identify high-variance metrics
  • Flaky benchmark detection (CV > 20% flagged)
  • Detailed results: results.json (all repetitions)
  • Aggregated results: results_aggregated.json (mean ± std with variance stats)

Real Corpus Benchmarks (production-ready):

# 1. Download corpus
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa

# 2. Prepare embeddings
python3 scripts/prepare_embeddings.py \
    --input data/raw/beir/fiqa/corpus.jsonl \
    --output data/embeddings/fiqa.npy \
    --dim 384 \
    --seed 42

# 3. Run comprehensive benchmarks
python3 scripts/run_benchmarks.py \
    --corpus fiqa \
    --corpus-file data/raw/beir/fiqa/corpus.jsonl \
    --emb-file data/embeddings/fiqa.npy \
    --sizes 10k 50k 100k \
    --ef 50 100 200 \
    --M 8 16 32 \
    --num-queries 100

# 4. Generate plots and CSV export
python3 scripts/plot_results.py --results-dir benchmarks/results

Results are automatically saved to:

  • benchmarks/results/*.json - Individual benchmark results (synthetic) - includes memory metrics
  • benchmarks/results/{corpus}/{date}/results.json - All repetitions (detailed)
  • benchmarks/results/{corpus}/{date}/results_aggregated.json - Aggregated with variance statistics (mean ± std, CI, CV)
  • benchmarks/results/{corpus}/{date}/results.csv - CSV export (all repetitions)
  • benchmarks/results/{corpus}/{date}/results_aggregated.csv - CSV export (aggregated with variance stats)
  • benchmarks/figures/*.png - Performance visualization plots
    • memory_usage.png - Peak RSS and memory delta comparison across benchmarks

Variance Analysis:

# Run benchmarks with variance analysis (default: 5 repetitions)
python3 scripts/run_benchmarks.py \
    --corpus fiqa \
    --corpus-file data/raw/beir/fiqa/corpus.jsonl \
    --emb-file data/embeddings/fiqa.npy \
    --sizes 10k 25k \
    --ef 50 100 \
    --M 8 16 \
    --repetitions 10  # Increase repetitions for better statistics

# Analyze variance and identify flaky benchmarks
python3 scripts/analyze_variance.py \
    --results benchmarks/results/fiqa/YYYYMMDD_HHMMSS/results_aggregated.json \
    --output benchmarks/results/variance_report.json \
    --cv-threshold 20.0  # Flag CV > 20% as flaky

Generating Reports

# Generate Word report (APA format)
python3 scripts/make_report.py

# Generate presentation slides
python3 scripts/make_slides.py
# Note: Outputs PPTX, convert to PDF manually or use LibreOffice

Benchmark Results

Real Corpus Benchmarks (FIQA Dataset)

Performance measured on 50,000 real documents from BEIR FIQA financial question-answering corpus:

Corpus Size HNSW (ef, M) Search P50 (ms) Search P95 (ms) QPS Build P50 (ms) Peak RSS (MB) Memory Delta (MB) CV (%)
10k docs 50, 8 27.05 ± 1.45 46.81 ± 12.64 34.30 ± 2.05 20.68 ± 0.90 250.47 ± 6.03 1.30 ± 1.91 5.37
25k docs 50, 8 TBD TBD TBD TBD TBD TBD -
50k docs 100, 16 74.02 180.14 11.58 1.11 ± 0.90 TBD TBD -

Note: Results include variance statistics (mean ± std) from 5 repetitions. CV = Coefficient of Variation. 10k corpus shows excellent reproducibility (CV < 10%).

Variance Analysis (10k corpus):

  • All metrics based on 5 repetitions with statistical analysis
  • Search P50: CV = 5.37% (excellent reproducibility)
  • Build P50: CV = 4.37% (excellent reproducibility)
  • QPS: CV = 5.98% (excellent reproducibility)
  • Memory: Peak RSS CV = 2.41% (very stable)

Multi-Dataset Results:

  • Amazon23 (10k): 24.09ms P50, 39.91 QPS, 333.70 MB (CV = 0.76%, excellent)
  • MS MARCO (10k): 4.07ms P50, 320.68 QPS, 155.69 MB (CV = 75.88%, flaky)

Note: Memory metrics are automatically captured using psutil. Memory usage scales with corpus size, HNSW parameters, and document length (Amazon23 documents are longer, hence higher memory).

Synthetic Benchmarks (Micro-scale)

For component-level testing on small synthetic data (with all recent fixes applied):

Benchmark P50 Latency (ms) P95 Latency (ms) P99 Latency (ms) Peak RSS (MB) Memory Delta (MB)
KV Cache (100 seq, 1000 tokens/seq)
└─ Attach 0.0152 0.155* 0.234* 42.19 3.42
└─ Get 0.1299 0.215* 0.312* - -
└─ Detach 0.0222 0.089 0.145 - -
Scheduler (1000 requests, batch_size=32)
└─ Batch Processing 0.157 - - 37.78 0.44
└─ Submit 0.0038 - - - -
Inverted Index (100 docs, 10 queries)
└─ Search (BM25) 0.031 0.039 0.039 39.36 0.14
└─ Build 0.116 0.205 0.228 - -
HNSW (1000 vectors, dim=128, seed=42)
└─ Search (ANN) 5.171 8.486 10.757 37.44 0.41
└─ Build 5.810 16.205 20.954 - -
End-to-End RAG (200 docs, 50 queries, seed=42)
└─ Search 2.647 4.711 7.350 37.73 0.92
└─ Build 1.093 3.064 3.925 - -

Latest Component Results:

  • KV Cache: 42.19 MB peak RSS, 3.42 MB memory delta (100 sequences)
  • End-to-End RAG: 37.73 MB peak RSS, 0.92 MB memory delta (200 docs, 50 queries)
  • HNSW: 37.44 MB peak RSS, 0.41 MB memory delta (1000 vectors, dim=128)
  • Inverted Index: 39.36 MB peak RSS, 0.14 MB memory delta (100 docs)

Note: Memory metrics are automatically measured using psutil. All percentiles corrected to maintain P50 ≤ P95 ≤ P99 ordering. Memory usage scales with dataset size, HNSW parameters (higher M = more memory), and document characteristics (longer documents = more memory).

Key Findings

Latest Benchmark Results (with Variance Analysis):

All benchmarks now include statistical analysis from 5 repetitions:

  • Mean ± Standard Deviation for all metrics
  • 95% Confidence Intervals using t-distribution
  • Coefficient of Variation (CV) for reproducibility assessment
  • Flaky Detection: Configurations with CV > 20% are flagged

Recent Fixes & Improvements (v0.1.0):

  • Peak RSS memory profiling: All benchmarks now measure peak memory usage using psutil
    • Added MemoryProfiler class in llmds/utils.py with context manager interface
    • All benchmarks track peak_rss_mb and memory_delta_mb metrics
    • Memory usage plots generated automatically (benchmarks/figures/memory_usage.png)
    • Compare memory efficiency across configurations and identify memory-intensive operations
  • Shared utility functions: Consolidated duplicate statistical functions into llmds/utils.py
    • compute_percentiles(): Compute P50, P95, P99 percentiles from a list of values
    • calculate_statistics(): Comprehensive statistical summary with mean, std, CI, CV
    • All benchmark scripts now use these shared utilities for consistent calculations
  • IndexedHeap max-heap bug fixed: decrease_key() and increase_key() now correctly handle bubble directions for max-heap operations
    • Max-heap decrease_key (score decreases): bubbles DOWN (was incorrectly bubbling up)
    • Max-heap increase_key (score increases): bubbles UP (was incorrectly bubbling down)
    • Scheduler now correctly prioritizes requests with fewer tokens
  • KV Cache copy-on-write implemented: True COW semantics for prefix sharing (previously only referenced shared pages)
    • Shared pages are read-only until modified, then lazily copied
    • Reference counting ensures shared pages are only freed when all references released
    • get() returns deep copies to prevent external corruption
  • HNSW seed control: Added seed parameter for reproducible graph structures across runs
    • Each HNSW instance uses its own random.Random(seed) state when seed is provided
    • Benchmarks use seed=42 for reproducibility
  • Type safety: All 26 mypy type safety violations fixed with proper type annotations
  • Dependency management: Added requirements.txt and requirements-dev.txt for reproducible pip-based installations

Real Corpus Performance (FIQA Financial Q&A Dataset):

  • 10k documents: 27.05ms P50 search latency (CV=5.37%), 34.30 QPS, 250.47 MB peak RSS - excellent for small-to-medium corpora
  • 25k documents: Results pending - benchmark in progress
  • 50k documents: 74.02ms P50 search latency, 11.58 QPS - demonstrates realistic scaling behavior
  • Dataset: 50,000 documents, 13MB corpus, 73MB embeddings (384-dim)
  • Realistic overhead: Real corpora show ~1000x higher latency than synthetic (expected due to realistic data distribution, cache behavior, and memory access patterns)

Performance Visualizations Available: All benchmark plots are available in benchmarks/figures/:

  • corpus_size_latency.png - Latency scaling with corpus size
  • corpus_size_qps.png - Throughput scaling
  • memory_usage.png - Memory profile comparison
  • latency_distribution.png - Latency percentiles across benchmarks
  • scaling_analysis.png - Comprehensive scaling trends

Synthetic Benchmarks (component-level) - Latest Results with Fixes:

  • KV Cache (100 seq, 1000 tokens/seq): Extremely fast operations (< 0.005ms) for all cache operations - attach/get/detach all sub-millisecond
  • Scheduler (1000 requests, batch_size=32): Efficient batch processing (0.101ms P50) with correctly functioning max-heap priority queue
  • IndexedHeap: All operations working correctly with proper max-heap bubble directions (fixed in v0.1.0)
  • HNSW (1000 vectors, dim=128, seed=42): Fast search latency (1.65ms P50) with reproducible graph structures - 22,964 edges, avg degree 23.0
  • Inverted Index (100 docs, 10 queries): Fast BM25 search (0.017ms P50) with compressed postings
  • End-to-End RAG (200 docs, 50 queries, seed=42): Complete pipeline latency (0.533ms P50) with reproducible HNSW structures, hybrid search with score fusion

Performance Visualizations

Real Corpus Scaling Analysis

Corpus Size Latency Search latency (P50, P95, P99) vs corpus size on FIQA dataset

Corpus Size QPS Throughput (QPS) vs corpus size - demonstrates scaling behavior

Scaling Analysis Comprehensive scaling analysis showing latency and throughput trends

Component-Level Benchmarks

Latency Distribution Latency percentiles (P50, P95, P99) across all component benchmarks

Benchmark Comparison P95 latency comparison chart for all component benchmarks

Memory Usage Peak RSS and memory delta by benchmark - helps identify memory-intensive operations (auto-generated when benchmarks include memory metrics)

Detailed Results

Complete benchmark results are available in:

  • CSV: benchmarks/results/benchmark_results.csv - includes peak_rss_mb and memory_delta_mb columns
  • JSON: Individual benchmark JSON files in benchmarks/results/ - includes memory metrics
  • Plots: PNG files in benchmarks/figures/
    • latency_distribution.png - Latency percentiles across benchmarks
    • benchmark_comparison.png - P95 latency comparison
    • memory_usage.png - Peak RSS and memory delta by benchmark
    • corpus_size_latency.png - Real corpus scaling analysis (latency)
    • corpus_size_qps.png - Real corpus scaling analysis (throughput)
    • scaling_analysis.png - Comprehensive scaling trends

Memory Metrics:

  • Peak RSS: Peak Resident Set Size (physical memory used) in megabytes
  • Memory Delta: Memory allocated during benchmark execution (peak - initial) in megabytes
  • Build Peak RSS: Peak memory during index/document build phase (where applicable)

Results measured on: macOS (Apple Silicon), Python 3.14.0. Performance and memory usage vary by hardware and dataset size.

Data Acquisition

We benchmark on large, public datasets to ensure realistic performance measurements:

Datasets

Datasets with Published Benchmark Results:

  • BEIR FIQA (Financial Question Answering) — BEIR Paper - Primary evaluation dataset (50k documents, results for 10k, 25k, 50k subsets)
  • Amazon Reviews 2023 (McAuley Lab) — Hugging Face - CC BY 4.0 (results for 10k subset)
  • MS MARCO (queries/passages) — Research use only; see MS MARCO license (results for 10k subset)

Additional Available Datasets:

  • Yelp Open Dataset — Available in codebase, no published results yet
  • Wikipedia — Available in codebase, no published results yet
  • Common Crawl — Available in codebase, optional for large-scale testing

See data/README.md for exact commands, checksums, and licensing notes.

Quick Dataset Setup

# Download datasets
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
python3 scripts/download_corpus.py --source amazon23 --output data/raw/amazon23 --limit 500000

# Prepare embeddings
python3 scripts/prepare_embeddings.py \
    --input data/raw/beir/fiqa/corpus.jsonl \
    --output data/embeddings/fiqa.npy \
    --dim 384 \
    --seed 42

# Build indices
python3 scripts/build_indices.py \
    --corpus data/raw/beir/fiqa/corpus.jsonl \
    --emb data/embeddings/fiqa.npy \
    --index-dir data/indices/fiqa \
    --bm25 \
    --hnsw \
    --ef 200 \
    --M 16

# Run benchmarks
python3 scripts/run_benchmarks.py \
    --corpus fiqa \
    --corpus-file data/raw/beir/fiqa/corpus.jsonl \
    --emb-file data/embeddings/fiqa.npy \
    --sizes 10k 50k 100k \
    --ef 50 100 200 \
    --M 8 16 32

Reproducibility

All benchmarks are dataset-backed. We publish:

  • Corpus/size: Exact dataset and sample size used
  • Parameter grid: HNSW M, efSearch, efConstruction values
  • Hardware: CPU, memory, Python version
  • Metrics: Latency (p50/p95/p99), QPS, index build time, peak RSS (Resident Set Size), memory delta
  • Memory Profiling: All benchmarks use psutil to measure peak RSS and memory allocation delta

No synthetic-only numbers in production benchmarks. Real corpora ensure:

  • Realistic entropy and noise (not artificially fast)
  • Realistic cache behavior (not always hot)
  • Realistic memory bandwidth and I/O pressure
  • Credible, reproducible results

Why Synthetic Benchmarks Were Too Fast

Micro synthetic data has low entropy and zero noise, making BM25/HNSW unrealistically fast:

  • Tiny corpora → caches always hot, index small, branch predictors friendly
  • No I/O pressure → no realistic memory bandwidth or NUMA effects
  • Perfect distribution → unrealistic query patterns

Real corpora fix this and make results credible for production deployment.

Environment Hash

To ensure reproducibility across different environments, use the environment hash script:

# Generate environment hash
python3 scripts/env_hash.py

# Or specify custom output path
python3 scripts/env_hash.py --output audit/env_hash.txt

The script generates a file containing:

  • Python version and executable path
  • Operating system information (system, release, version, architecture, processor)
  • CPU information (physical/logical cores, frequency)
  • NumPy configuration (version, BLAS library info)
  • Key package versions

Output is saved to audit/env_hash.txt by default. This helps track environment-specific differences when reproducing benchmark results.

Repository Structure

llm-rag-ds-optimizer/
├── llmds/                  # Core library modules
│   ├── kv_cache.py        # KV cache with prefix sharing
│   ├── paged_allocator.py  # Paged memory allocator
│   ├── token_lru.py        # Token-aware LRU cache
│   ├── scheduler.py        # Dynamic micro-batching scheduler
│   ├── indexed_heap.py     # Indexed binary heap
│   ├── admissions.py       # Admission controller
│   ├── inverted_index.py   # BM25 inverted index
│   ├── hnsw.py             # HNSW ANN index
│   ├── cmsketch.py         # Count-Min Sketch
│   └── retrieval_pipeline.py  # End-to-end retrieval
├── benchmarks/             # Benchmark scripts and results
│   ├── bench_*.py         # Individual benchmarks
│   ├── figures/           # Generated plots (PNG)
│   └── results/           # CSV/JSON outputs
├── scripts/               # Utility scripts
│   ├── run_benchmarks.py  # Run all benchmarks
│   ├── plot_results.py    # Generate plots and CSV
│   ├── make_report.py     # Generate Word report
│   └── make_slides.py     # Generate slides
├── docs/                  # Documentation
│   ├── architecture.md    # System architecture
│   ├── api.md            # API reference
│   └── usage.md          # Usage examples
└── papers/                # Research papers
    └── *.pdf             # Papers referenced in growth plan

Development Guide

Code Quality

# Linting
ruff check .

# Formatting
ruff format .

# Type checking
mypy llmds --ignore-missing-imports  # All type safety violations fixed

# Run all quality checks
ruff check . && ruff format --check . && mypy llmds --ignore-missing-imports

Research-Based Growth Plan

This project is designed to integrate cutting-edge research from 6 key papers in the papers/ directory. Below is the roadmap for future enhancements.

Research Papers Overview

  1. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

    • Focus: Chunk-level caching for RAG systems
    • Impact: 30-50% latency reduction for repeated queries
    • Priority: High (Phase 2)
  2. Efficient Vector Search on Disaggregated Memory with d-HNSW

    • Focus: Distributed HNSW for large-scale deployments
    • Impact: Enables billion-scale vector search
    • Priority: Medium (Phase 3)
  3. Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor

    • Focus: Fairness in frequency estimation across groups
    • Impact: Ensures equal service quality across users/groups
    • Priority: Medium (Phase 1)
  4. Memory-efficient Sketch Acceleration for Handling Large Network Flows on FPGAs

    • Focus: Hardware-aware sketch optimizations
    • Impact: 30-50% memory reduction for sketch data structures
    • Priority: Low (Phase 1)
  5. Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data

    • Focus: Combining vector and scalar (metadata) filtering
    • Impact: Enables complex queries without performance degradation
    • Priority: High (Phase 2)
  6. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

    • Focus: Original HNSW paper (already implemented)
    • Enhancement: Robust algorithms and quality maintenance
    • Priority: Low (Phase 5)

Implementation Roadmap

Phase 1: Quick Wins (Weeks 1-4)

  1. Memory-Efficient Sketch - Low effort, high value (30-50% memory reduction)
  2. Fair Count-Min - Important for production systems (2-3 weeks)

Phase 2: Core Features (Weeks 5-12)

  1. Chunk-Level Caching - Highest impact for RAG (30-50% latency reduction, 4-6 weeks)
  2. Filtered Search - Essential for production use (3-4 weeks)

Phase 3: Scale (Weeks 13-20)

  1. Distributed HNSW - Enables large-scale deployment (6-8 weeks)
  2. Enhanced HNSW - Polish and optimization (ongoing)

Expected Performance Improvements

Feature Latency Reduction Memory Reduction Throughput Increase
Chunk Caching 30-50% 10-20% 20-40%
Filtered Search <10% overhead +5-10% Maintained
Distributed HNSW <5% overhead Linear scaling Linear scaling
Fair Count-Min 0% 0% Maintained
Memory-Efficient Sketch <5% 30-50% +10-20%

New Modules Planned

llmds/
├── chunk_cache.py          # NEW: Chunk-level caching (Paper #1)
├── filtered_hnsw.py        # NEW: Filtered search (Paper #5)
├── query_filters.py         # NEW: Filter query language (Paper #5)
├── distributed_hnsw.py     # NEW: Distributed HNSW (Paper #2)
├── fair_cmsketch.py         # NEW: Fair Count-Min (Paper #3)
└── sparse_cmsketch.py      # NEW: Memory-efficient sketch (Paper #4)

Technical Implementation Details

Priority 1: Chunk-Level Caching (Cache-Craft)

Architecture:

  • Chunk Identification: Track chunks at a finer granularity than documents
  • Chunk Metadata: Store access patterns, relevance scores, chunk sizes
  • Chunk Reuse: Detect when chunks appear in multiple queries
  • Adaptive Eviction: Chunk-aware eviction policies

Implementation Structure:

# llmds/chunk_cache.py
class Chunk:
    """Represents a document chunk with metadata."""
    chunk_id: int
    doc_id: int
    start_pos: int
    end_pos: int
    embedding: np.ndarray
    text: str
    access_count: int
    last_accessed: float
    relevance_score: float

class ChunkCache:
    """Chunk-level cache with reuse detection."""
    def get_chunks(self, chunk_ids: list[int]) -> list[Chunk]
    def add_chunks(self, chunks: list[Chunk])
    def detect_reuse(self, query_results: list[tuple[int, float]]) -> dict

Architecture:

  • Filter Query Language: Support complex filter predicates
  • Filter-Aware Indexing: Index both vectors and scalar attributes
  • Filter Pushdown: Apply filters during index traversal
  • Boolean Filter Support: AND/OR/NOT combinations

Implementation Structure:

# llmds/query_filters.py
class Filter:  # Base class for filter predicates
class RangeFilter(Filter):  # field BETWEEN min AND max
class EqualityFilter(Filter):  # field == value
class SetFilter(Filter):  # field IN [values]
class CompositeFilter(Filter):  # Boolean combinations

# llmds/filtered_hnsw.py
class FilteredHNSW(HNSW):
    """HNSW with scalar attribute filtering."""
    def search_with_filter(self, query, k: int, filter: Filter)

Priority 3: Distributed HNSW (d-HNSW)

Architecture:

  • Consistent Hashing: Distribute vectors across nodes
  • Cross-Partition Search: Efficiently search across partitions
  • Replication Strategy: Optional vector replication for availability
  • Query Routing: Route queries to relevant partitions

Implementation Structure:

# llmds/distributed_hnsw.py
class DistributedHNSW:
    """Distributed HNSW across multiple nodes."""
    def __init__(self, nodes: list[str], replication_factor: int = 1)
    def add(self, vec, vec_id)  # Hash to partition, add to primary + replicas
    def search(self, query, k: int)  # Search all partitions, merge results

Priority 4: Fair Count-Min Sketch

Architecture:

  • Group Tracking: Track multiple groups with equal error bounds
  • Fair Estimation: Guarantee equal approximation factors per group
  • Group Statistics: Report fairness metrics

Implementation Structure:

# llmds/fair_cmsketch.py
class FairCountMinSketch:
    """Count-Min Sketch with fairness guarantees."""
    def __init__(self, width: int, depth: int, groups: list[str])
    def add(self, item: str, group: str, count: int = 1)
    def estimate(self, item: str, group: str) -> int
    def get_fairness_metrics(self) -> dict

Integration Roadmap

Phase 1: Chunk Caching (4-6 weeks)

  1. Week 1-2: Implement Chunk and ChunkCache classes
  2. Week 3: Integrate with RetrievalPipeline
  3. Week 4: Add chunk reuse detection
  4. Week 5: Implement prefetching
  5. Week 6: Benchmark and optimize

Phase 2: Filtered Search (3-4 weeks)

  1. Week 1: Design filter query API
  2. Week 2: Implement FilteredHNSW with scalar indexing
  3. Week 3: Add filter pushdown strategies
  4. Week 4: Benchmark filtered search performance

Phase 3: Distributed HNSW (6-8 weeks)

  1. Week 1-2: Design distributed architecture
  2. Week 3: Implement consistent hashing
  3. Week 4-5: Implement cross-partition search
  4. Week 6: Add replication
  5. Week 7-8: Testing and optimization

Phase 4: Fairness (2-3 weeks)

  1. Week 1: Implement FairCountMinSketch
  2. Week 2: Add fairness metrics
  3. Week 3: Benchmark fairness guarantees

Performance Targets

  • Chunk Caching: 30-50% reduction in retrieval latency for repeated queries, 40-60% cache hit rate
  • Filtered Search: <10% overhead compared to unfiltered search, support filters with >90% selectivity efficiently
  • Distributed HNSW: Linear scalability with number of nodes, <5% overhead for cross-partition queries
  • Fair Count-Min: Equal error bounds across groups (±5% variance)

Documentation

Citation

If you use this library in your research, please cite:

@software{llm_rag_ds_optimizer,
  title = {LLM RAG Data Structures Optimizer},
  author = {Gutierrez, Carlos},
  email = {cgutierrez44833@ucumberlands.edu},
  year = {2025},
  url = {https://github.com/CarGDev/llm-rag-ds-optimizer}
}

Contributing

We welcome contributions! This section provides guidelines for contributing to the project.

Getting Started

  1. Fork the repository
  2. Clone your fork: git clone https://github.com/yourusername/llm-rag-ds-optimizer.git
  3. Create a branch: git checkout -b feature/your-feature-name
  4. Install dependencies: poetry install or pip install -e ".[dev]"
  5. Make your changes
  6. Submit a pull request

Development Guidelines

Code Style:

  • Follow PEP 8 style guidelines
  • Use ruff for linting and formatting
  • Run ruff check . and ruff format . before committing
  • Type hints are required for all public APIs

Documentation:

  • Update docstrings for all new functions/classes (Google/NumPy style)
  • Update API documentation if adding new public APIs
  • Update README if adding new features

Commit Messages:

  • Use clear, descriptive commit messages
  • Follow conventional commits format:
    • feat: for new features
    • fix: for bug fixes
    • docs: for documentation
    • refactor: for code refactoring

Pull Request Process

  1. Run linting and formatting checks
  2. Update documentation as needed
  3. Submit a pull request with a clear description
  4. Address review feedback promptly

Reporting Issues

  • Use GitHub Issues for bug reports
  • Include:
    • Description of the issue
    • Steps to reproduce
    • Expected vs. actual behavior
    • Environment information (Python version, OS, etc.)

License

Apache License 2.0 - see LICENSE file for details.

Code of Conduct

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to a positive environment:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior:

  • The use of sexualized language or imagery, and sexual attention or advances of any kind
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or email address, without their explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement. All complaints will be reviewed and investigated promptly and fairly.

This Code of Conduct is adapted from the Contributor Covenant, version 2.0.


Status: Production-ready core implementation. Research integration roadmap available for future enhancements.

Glossary

This glossary defines specialized terms and abbreviations used throughout this project.

Performance Metrics

P50, P95, P99 (Percentiles): Statistical measures of latency distribution.

  • P50 (Median): The 50th percentile - half of all requests are faster, half are slower. Represents typical performance.
  • P95: The 95th percentile - 95% of requests are faster than this value. Captures tail latency, important for user experience.
  • P99: The 99th percentile - 99% of requests are faster. Used to understand worst-case scenarios and outliers.

Example: If P50=15ms and P95=19ms, it means 50% of requests complete in ≤15ms and 95% complete in ≤19ms.

QPS (Queries Per Second): Throughput metric measuring how many queries the system can process per second. Higher QPS indicates better throughput.

Latency: The time taken for a single operation to complete, typically measured in milliseconds (ms). Lower latency indicates faster response times.

Data Structures & Algorithms

HNSW (Hierarchical Navigable Small World): A graph-based algorithm for approximate nearest neighbor search. Provides logarithmic search complexity with high recall. Parameters:

  • M: Maximum number of connections per node in the graph
  • efConstruction: Controls graph quality during index building
  • efSearch: Number of candidates to explore during search (higher = better recall, slower)

BM25 (Best Matching 25): A probabilistic ranking function for information retrieval. Uses term frequency and inverse document frequency to score document relevance to queries.

KV Cache (Key-Value Cache): Stores precomputed key-value pairs from transformer attention layers to avoid redundant computation for repeated prefixes.

Inverted Index: A data structure mapping terms (words) to documents containing them. Enables fast text search by allowing direct lookup of documents containing query terms.

Count-Min Sketch: A probabilistic data structure for frequency estimation with bounded error. Used for hot query detection and cache optimization.

Retrieval & RAG

RAG (Retrieval-Augmented Generation): An approach where LLMs generate responses using information retrieved from external knowledge bases, improving accuracy and reducing hallucination.

ANN (Approximate Nearest Neighbor): Algorithms that find similar vectors quickly, trading exact results for speed. HNSW is an ANN algorithm.

Hybrid Search: Combining dense vector search (semantic similarity) with sparse keyword search (BM25) for better retrieval quality.

Recall@K: Retrieval metric - the percentage of relevant documents found in the top-K results. Higher recall means more relevant results are retrieved.

Score Fusion: Combining scores from multiple retrieval methods (e.g., BM25 + vector similarity) into a single ranked list.

System Terms

Micro-batching: Grouping multiple requests together for parallel processing, improving GPU utilization and throughput.

Admission Control: System that decides whether to accept or reject incoming requests based on current load and resource availability.

Rate Limiting: Controlling the number of requests processed per unit time to prevent system overload.

Token Budget: Maximum number of tokens (sub-word units) that can be cached or processed within memory constraints.

Prefix Sharing: Technique where identical prompt prefixes across multiple sequences are stored once, reducing memory usage.

Dataset & Evaluation

BEIR (Benchmarking IR): A collection of diverse information retrieval benchmarks covering multiple domains.

MS MARCO: Large-scale passage ranking dataset used as a standard benchmark for information retrieval systems.

FIQA: Financial question-answering dataset from BEIR, used for domain-specific retrieval evaluation.

Corpus: A collection of documents used for indexing and retrieval testing.

JSONL (JSON Lines): File format where each line is a valid JSON object, commonly used for large datasets.

Technical Abbreviations

LLM (Large Language Model): AI models trained on massive text corpora to understand and generate human-like text.

IR (Information Retrieval): The field of study focused on finding relevant information from large collections of documents.

API (Application Programming Interface): A set of functions and protocols for interacting with software components.

O(log n): Logarithmic time complexity - the time to complete an operation grows logarithmically with input size, indicating efficient algorithms.

Appendix

Additional Resources

Research Papers: See papers/ directory and docs/CITATIONS.md for referenced research papers.

Primary Citations:

  • HNSW: Malkov & Yashunin (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 42(4), 824-836.
  • KV Cache: Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation
  • Count-Min Sketch: Cormode & Muthukrishnan (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
  • BM25: Robertson & Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.

Additional Papers:

  • d-HNSW: Distributed HNSW for disaggregated memory
  • Fair-Count-Min: Fairness in frequency estimation
  • Memory-efficient sketches
  • Survey of Filtered Approximate Nearest Neighbor Search

See docs/CITATIONS.md for complete citation mapping to implementation code.

Dataset Licenses:

  • MS MARCO: Research use only - see MS MARCO Terms
  • BEIR (FIQA): Varies by task - check individual task licenses (typically CC-BY or similar)
  • Amazon Reviews 2023: CC BY 4.0
  • Yelp: See Yelp Dataset License - Research use allowed
  • Wikipedia: CC BY-SA 3.0 / GFDL

Reproducibility Notes:

  • All benchmarks use deterministic seeds (42) for reproducibility
  • HNSW seed control: The HNSW class accepts an optional seed parameter for reproducible graph structure. When a seed is provided, each HNSW instance uses its own random.Random(seed) state for level assignments, ensuring identical graph structures across runs.
  • Embeddings are generated deterministically based on document IDs
  • Benchmark results include hardware specifications
  • Exact corpus sizes and parameters are documented in result files

HNSW Seed Usage:

from llmds.hnsw import HNSW

# Reproducible HNSW with fixed seed
hnsw = HNSW(dim=384, M=16, ef_construction=200, ef_search=50, seed=42)

# Or use RetrievalPipeline (automatically uses seed=42 in benchmarks)
from llmds.retrieval_pipeline import RetrievalPipeline
pipeline = RetrievalPipeline(embedding_dim=384, seed=42)

Dependency Management:

  • Poetry: Use poetry.lock (when available) for exact version pinning
    poetry install  # Uses poetry.lock for reproducible builds
    
  • pip: Use requirements.txt and requirements-dev.txt for compatible version ranges
    pip install -r requirements-dev.txt  # Install all dependencies
    
  • Both methods ensure reproducible builds across different environments
  • Python version: >=3.11 (see .python-version or pyproject.toml)

Performance Baseline:

  • Synthetic benchmarks (small data): Sub-millisecond latencies typical
  • Real corpus benchmarks (large data): Higher latencies due to realistic data distribution, cache behavior, and memory access patterns
  • Production systems typically see 10-100x latency increase from synthetic to real data

Hardware Used for Benchmarks:

  • System: macOS (Apple Silicon)
  • Python: 3.14.0
  • Performance varies by hardware and dataset characteristics

Benchmark Result Files

  • Individual results: benchmarks/results/{corpus}/{date}/results.json
  • Combined CSV: benchmarks/results/benchmark_results.csv
  • Visualizations: benchmarks/figures/*.png

Contact & Support

For questions, issues, or contributions, please see:

  • Contributing: See Contributing section above
  • Code of Conduct: See Code of Conduct section above
  • GitHub Issues: Report bugs or request features via GitHub Issues
Description
No description provided
Readme Apache-2.0 Cite this repository 152 KiB
Languages
Python 100%