LLM RAG Data Structures Optimizer
A production-grade Python library for optimizing LLM inference and retrieval through advanced data structures and algorithms. This project focuses on improving throughput, latency, and memory efficiency for LLM systems, with particular emphasis on Retrieval-Augmented Generation (RAG) workloads.
Table of Contents
- Features
- Quick Start
- Benchmark Results
- Repository Structure
- Development Guide
- Research-Based Growth Plan
- Documentation
- Contributing
- License
Features
KV Cache Optimization
- Paged KV cache with slab allocator interface
- Prefix/prompt sharing with copy-on-write (COW) for safe memory sharing
- Reference counting for shared pages - automatic memory management
- Hash-based deduplication for repeated system prompts
- Token-aware LRU eviction with cumulative token budget management
- Data safety - defensive copying prevents corruption of shared pages
- Optional speculative decoding compatibility hooks
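The prefix-sharing mechanics above (reference counting plus copy-on-write) can be illustrated with a minimal sketch. The names here (`Page`, `CowPageStore`) are hypothetical, not the library's actual API; a write to a shared page first clones it, so sibling sequences keep the original.

```python
from dataclasses import dataclass

@dataclass
class Page:
    """A fixed-size block of cached KV tokens with a reference count."""
    tokens: list[int]
    refcount: int = 1

class CowPageStore:
    """Minimal copy-on-write page store: shared pages are cloned on the
    first mutation, and freed only when the last reference is released."""

    def __init__(self) -> None:
        self.pages: dict[int, Page] = {}
        self._next_id = 0

    def alloc(self, tokens: list[int]) -> int:
        pid = self._next_id
        self._next_id += 1
        self.pages[pid] = Page(list(tokens))
        return pid

    def share(self, pid: int) -> int:
        """A second sequence reuses the same physical page."""
        self.pages[pid].refcount += 1
        return pid

    def write(self, pid: int, idx: int, token: int) -> int:
        """Mutate a page; if it is shared, copy it first (COW)."""
        page = self.pages[pid]
        if page.refcount > 1:          # shared: clone before mutating
            page.refcount -= 1
            pid = self.alloc(page.tokens)
            page = self.pages[pid]
        page.tokens[idx] = token
        return pid                      # caller keeps the (possibly new) id

    def release(self, pid: int) -> None:
        page = self.pages[pid]
        page.refcount -= 1
        if page.refcount == 0:          # last reference frees the page
            del self.pages[pid]
```

Usage: after `b2 = store.write(b, 0, 9)` on a shared page, the writer holds a private copy while the original page (and its other readers) are untouched.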
Scheduler & Batching
- Dynamic micro-batching with configurable waiting-time vs. throughput trade-offs
- Indexed binary heap for O(log n) priority updates
- Admission control with rate limiting and moving-average QPS tracking
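The indexed-heap idea can be sketched as follows: keeping a key-to-slot map alongside the heap array makes a priority update an O(log n) sift rather than an O(n) search. This is an illustrative max-heap, not the `llmds.indexed_heap` API; note that a grown score bubbles up and a shrunk score bubbles down.

```python
class IndexedMaxHeap:
    """Binary max-heap with a key -> slot map for O(log n) priority updates."""

    def __init__(self) -> None:
        self.heap: list[tuple[float, int]] = []  # (score, key)
        self.pos: dict[int, int] = {}            # key -> index in heap

    def push(self, key: int, score: float) -> None:
        self.heap.append((score, key))
        self.pos[key] = len(self.heap) - 1
        self._up(len(self.heap) - 1)

    def update(self, key: int, score: float) -> None:
        i = self.pos[key]
        old = self.heap[i][0]
        self.heap[i] = (score, key)
        if score > old:
            self._up(i)     # score increased: bubble up
        else:
            self._down(i)   # score decreased: bubble down

    def pop(self) -> tuple[int, float]:
        top = self.heap[0]
        last = self.heap.pop()
        del self.pos[top[1]]
        if self.heap:
            self.heap[0] = last
            self.pos[last[1]] = 0
            self._down(0)
        return top[1], top[0]

    def _swap(self, i: int, j: int) -> None:
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _up(self, i: int) -> None:
        while i > 0 and self.heap[i][0] > self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i: int) -> None:
        n = len(self.heap)
        while True:
            best, l, r = i, 2 * i + 1, 2 * i + 2
            if l < n and self.heap[l][0] > self.heap[best][0]:
                best = l
            if r < n and self.heap[r][0] > self.heap[best][0]:
                best = r
            if best == i:
                return
            self._swap(i, best)
            i = best
```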
Retrieval Data Structures (RAG)
- Compressed inverted index with BM25 scoring and varint/zigzag encoding
- HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search (seed control for reproducibility)
- Count-Min Sketch for hot query estimation and cache priming
- Score fusion with top-K maintenance using indexed heap
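The varint/zigzag compression for postings can be sketched in a few lines: delta-encode a sorted doc-ID list, zigzag-map each delta so small magnitudes stay small, then pack 7 bits per byte with a continuation bit. These helper names are illustrative; the library's actual encoder may differ (the zigzag form assumes |n| < 2^63).

```python
def zigzag(n: int) -> int:
    """Map signed ints to unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) ^ (n >> 63)

def unzigzag(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def varint_encode(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf: bytes, i: int = 0) -> tuple[int, int]:
    """Decode one varint starting at offset i; return (value, next offset)."""
    n, shift = 0, 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def encode_postings(doc_ids: list[int]) -> bytes:
    """Delta + zigzag + varint over a sorted postings list."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += varint_encode(zigzag(d - prev))
        prev = d
    return bytes(out)
```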
Observability
- Structured logging with trace IDs
- Metrics collection (p95/p99 latency, QPS, cache hit ratio)
- Benchmark harness with CSV/JSON outputs and plots
Quick Start
Installation
Using pip with requirements files:
# Clone the repository
git clone https://github.com/yourusername/llm-rag-ds-optimizer.git
cd llm-rag-ds-optimizer
# Install production dependencies
pip install -r requirements.txt
# Install development dependencies (includes production)
pip install -r requirements-dev.txt
# Or install in editable mode
pip install -e .
pip install -e ".[dev]" # With dev dependencies
Reproducibility:
- `requirements.txt` and `requirements-dev.txt` are committed
- `poetry.lock` can be generated with `poetry lock` (when Poetry is installed)
- CI automatically uses `poetry.lock` if available, otherwise falls back to `requirements-dev.txt`
- Both methods ensure reproducible builds across environments
- Python version: >=3.11 (specified in `.python-version` and `pyproject.toml`)
Basic Usage
from llmds import KVCache, Scheduler, RetrievalPipeline
import numpy as np
# KV Cache
cache = KVCache(page_size=512, max_pages=10000)
cache.attach(seq_id=1, kv_tokens=[1, 2, 3, 4, 5] * 100)
# Scheduler
scheduler = Scheduler(max_batch_size=32, max_wait_ms=50.0)
req_id = scheduler.submit(tokens=100)
batch = scheduler.get_batch(force=True)
# Retrieval Pipeline
pipeline = RetrievalPipeline(embedding_dim=384)
pipeline.add_document(doc_id=1, text="Example document", embedding=np.random.randn(384))
results = pipeline.search("example query", query_embedding=np.random.randn(384))
Running Benchmarks
Synthetic Benchmarks (includes memory profiling):
# Run individual synthetic benchmarks (all include peak RSS measurements)
python3 benchmarks/bench_kv_cache.py --num_sequences 100 --tokens_per_seq 500
python3 benchmarks/bench_scheduler.py
python3 benchmarks/bench_inverted_index.py --num_docs 200 --num_queries 20
python3 benchmarks/bench_hnsw.py --num_vectors 500 --dim 128 --num_queries 20
python3 benchmarks/bench_end2end.py --num_docs 200 --num_queries 20
Memory Profiling: All benchmarks automatically measure peak RSS using psutil. Results include:
- `peak_rss_mb`: Peak memory usage in megabytes
- `memory_delta_mb`: Memory allocated during the benchmark (peak - initial)
- `build_peak_rss_mb`: Peak memory during the build phase (where applicable)
Variance Analysis: Benchmarks run 5 repetitions per configuration by default. Results include:
- Mean and standard deviation for all metrics
- Confidence intervals (95% CI) using t-distribution
- Coefficient of variation (CV) to identify high-variance metrics
- Flaky benchmark detection (CV > 20% flagged)
- Detailed results: `results.json` (all repetitions)
- Aggregated results: `results_aggregated.json` (mean ± std with variance stats)
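The per-configuration statistics can be reproduced with a small helper like the one below. This is a sketch, not the library's `calculate_statistics()` utility; the t critical value is hardcoded for df=4 (i.e. the default 5 repetitions), which is an assumption you would replace for other repetition counts.

```python
import statistics

def summarize(samples: list[float], t_crit: float = 2.776) -> dict:
    """Mean, sample std, 95% CI, and CV (%) for a set of repetitions.
    t_crit defaults to the two-sided 95% t value for df=4 (5 repetitions)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)            # sample std (ddof=1)
    half = t_crit * std / len(samples) ** 0.5  # CI half-width
    cv = 100.0 * std / mean if mean else float("inf")
    return {"mean": mean, "std": std,
            "ci95": (mean - half, mean + half),
            "cv_pct": cv,
            "flaky": cv > 20.0}                # CV > 20% flagged as flaky
```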
Real Corpus Benchmarks (production-ready):
# 1. Download corpus
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
# 2. Prepare embeddings
python3 scripts/prepare_embeddings.py \
--input data/raw/beir/fiqa/corpus.jsonl \
--output data/embeddings/fiqa.npy \
--dim 384 \
--seed 42
# 3. Run comprehensive benchmarks
python3 scripts/run_benchmarks.py \
--corpus fiqa \
--corpus-file data/raw/beir/fiqa/corpus.jsonl \
--emb-file data/embeddings/fiqa.npy \
--sizes 10k 50k 100k \
--ef 50 100 200 \
--M 8 16 32 \
--num-queries 100
# 4. Generate plots and CSV export
python3 scripts/plot_results.py --results-dir benchmarks/results
Results are automatically saved to:
- `benchmarks/results/*.json` - Individual benchmark results (synthetic), including memory metrics
- `benchmarks/results/{corpus}/{date}/results.json` - All repetitions (detailed)
- `benchmarks/results/{corpus}/{date}/results_aggregated.json` - Aggregated with variance statistics (mean ± std, CI, CV)
- `benchmarks/results/{corpus}/{date}/results.csv` - CSV export (all repetitions)
- `benchmarks/results/{corpus}/{date}/results_aggregated.csv` - CSV export (aggregated with variance stats)
- `benchmarks/figures/*.png` - Performance visualization plots
- `benchmarks/figures/memory_usage.png` - Peak RSS and memory delta comparison across benchmarks
Variance Analysis:
# Run benchmarks with variance analysis (default: 5 repetitions)
python3 scripts/run_benchmarks.py \
--corpus fiqa \
--corpus-file data/raw/beir/fiqa/corpus.jsonl \
--emb-file data/embeddings/fiqa.npy \
--sizes 10k 25k \
--ef 50 100 \
--M 8 16 \
--repetitions 10 # Increase repetitions for better statistics
# Analyze variance and identify flaky benchmarks
python3 scripts/analyze_variance.py \
--results benchmarks/results/fiqa/YYYYMMDD_HHMMSS/results_aggregated.json \
--output benchmarks/results/variance_report.json \
--cv-threshold 20.0 # Flag CV > 20% as flaky
Generating Reports
# Generate Word report (APA format)
python3 scripts/make_report.py
# Generate presentation slides
python3 scripts/make_slides.py
# Note: Outputs PPTX, convert to PDF manually or use LibreOffice
Benchmark Results
Real Corpus Benchmarks (FIQA Dataset)
Performance measured on 50,000 real documents from BEIR FIQA financial question-answering corpus:
| Corpus Size | HNSW (ef, M) | Search P50 (ms) | Search P95 (ms) | QPS | Build P50 (ms) | Peak RSS (MB) | Memory Delta (MB) | CV (%) |
|---|---|---|---|---|---|---|---|---|
| 10k docs | 50, 8 | 27.05 ± 1.45 | 46.81 ± 12.64 | 34.30 ± 2.05 | 20.68 ± 0.90 | 250.47 ± 6.03 | 1.30 ± 1.91 | 5.37 |
| 25k docs | 50, 8 | TBD | TBD | TBD | TBD | TBD | TBD | - |
| 50k docs | 100, 16 | 74.02 | 180.14 | 11.58 | 1.11 ± 0.90 | TBD | TBD | - |
Note: Results include variance statistics (mean ± std) from 5 repetitions. CV = Coefficient of Variation. 10k corpus shows excellent reproducibility (CV < 10%).
Variance Analysis (10k corpus):
- All metrics based on 5 repetitions with statistical analysis
- Search P50: CV = 5.37% (excellent reproducibility)
- Build P50: CV = 4.37% (excellent reproducibility)
- QPS: CV = 5.98% (excellent reproducibility)
- Memory: Peak RSS CV = 2.41% (very stable)
Multi-Dataset Results:
- Amazon23 (10k): 24.09ms P50, 39.91 QPS, 333.70 MB (CV = 0.76%, excellent)
- MS MARCO (10k): 4.07ms P50, 320.68 QPS, 155.69 MB (CV = 75.88%, flaky)
Note: Memory metrics are automatically captured using psutil. Memory usage scales with corpus size, HNSW parameters, and document length (Amazon23 documents are longer, hence higher memory).
Synthetic Benchmarks (Micro-scale)
For component-level testing on small synthetic data (with all recent fixes applied):
| Benchmark | P50 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Peak RSS (MB) | Memory Delta (MB) |
|---|---|---|---|---|---|
| KV Cache (100 seq, 1000 tokens/seq) | |||||
| └─ Attach | 0.0152 | 0.155* | 0.234* | 42.19 | 3.42 |
| └─ Get | 0.1299 | 0.215* | 0.312* | - | - |
| └─ Detach | 0.0222 | 0.089 | 0.145 | - | - |
| Scheduler (1000 requests, batch_size=32) | |||||
| └─ Batch Processing | 0.157 | - | - | 37.78 | 0.44 |
| └─ Submit | 0.0038 | - | - | - | - |
| Inverted Index (100 docs, 10 queries) | |||||
| └─ Search (BM25) | 0.031 | 0.039 | 0.039 | 39.36 | 0.14 |
| └─ Build | 0.116 | 0.205 | 0.228 | - | - |
| HNSW (1000 vectors, dim=128, seed=42) | |||||
| └─ Search (ANN) | 5.171 | 8.486 | 10.757 | 37.44 | 0.41 |
| └─ Build | 5.810 | 16.205 | 20.954 | - | - |
| End-to-End RAG (200 docs, 50 queries, seed=42) | |||||
| └─ Search | 2.647 | 4.711 | 7.350 | 37.73 | 0.92 |
| └─ Build | 1.093 | 3.064 | 3.925 | - | - |
Latest Component Results:
- KV Cache: 42.19 MB peak RSS, 3.42 MB memory delta (100 sequences)
- End-to-End RAG: 37.73 MB peak RSS, 0.92 MB memory delta (200 docs, 50 queries)
- HNSW: 37.44 MB peak RSS, 0.41 MB memory delta (1000 vectors, dim=128)
- Inverted Index: 39.36 MB peak RSS, 0.14 MB memory delta (100 docs)
Note: Memory metrics are automatically measured using psutil. All percentiles corrected to maintain P50 ≤ P95 ≤ P99 ordering. Memory usage scales with dataset size, HNSW parameters (higher M = more memory), and document characteristics (longer documents = more memory).
Key Findings
Latest Benchmark Results (with Variance Analysis):
All benchmarks now include statistical analysis from 5 repetitions:
- Mean ± Standard Deviation for all metrics
- 95% Confidence Intervals using t-distribution
- Coefficient of Variation (CV) for reproducibility assessment
- Flaky Detection: Configurations with CV > 20% are flagged
Recent Fixes & Improvements (v0.1.0):
- Peak RSS memory profiling: All benchmarks now measure peak memory usage using `psutil`
  - Added `MemoryProfiler` class in `llmds/utils.py` with a context manager interface
  - All benchmarks track `peak_rss_mb` and `memory_delta_mb` metrics
  - Memory usage plots generated automatically (`benchmarks/figures/memory_usage.png`)
  - Compare memory efficiency across configurations and identify memory-intensive operations
- Shared utility functions: Consolidated duplicate statistical functions into `llmds/utils.py`
  - `compute_percentiles()`: Compute P50, P95, P99 percentiles from a list of values
  - `calculate_statistics()`: Comprehensive statistical summary with mean, std, CI, CV
  - All benchmark scripts now use these shared utilities for consistent calculations
- IndexedHeap max-heap bug fixed: `decrease_key()` and `increase_key()` now correctly handle bubble directions for max-heap operations
  - Max-heap `decrease_key` (score decreases): bubbles DOWN (was incorrectly bubbling up)
  - Max-heap `increase_key` (score increases): bubbles UP (was incorrectly bubbling down)
  - Scheduler now correctly prioritizes requests with fewer tokens
- KV Cache copy-on-write implemented: True COW semantics for prefix sharing (previously only referenced shared pages)
  - Shared pages are read-only until modified, then lazily copied
  - Reference counting ensures shared pages are only freed when all references are released
  - `get()` returns deep copies to prevent external corruption
- HNSW seed control: Added `seed` parameter for reproducible graph structures across runs
  - Each HNSW instance uses its own `random.Random(seed)` state when a seed is provided
  - Benchmarks use `seed=42` for reproducibility
- Type safety: All 26 mypy type safety violations fixed with proper type annotations
- Dependency management: Added `requirements.txt` and `requirements-dev.txt` for reproducible pip-based installations
Real Corpus Performance (FIQA Financial Q&A Dataset):
- 10k documents: 27.05ms P50 search latency (CV=5.37%), 34.30 QPS, 250.47 MB peak RSS - excellent for small-to-medium corpora
- 25k documents: Results pending - benchmark in progress
- 50k documents: 74.02ms P50 search latency, 11.58 QPS - demonstrates realistic scaling behavior
- Dataset: 50,000 documents, 13MB corpus, 73MB embeddings (384-dim)
- Realistic overhead: Real corpora show substantially higher latency than micro-scale synthetic runs (expected due to realistic data distribution, cache behavior, and memory access patterns)
Performance Visualizations Available:
All benchmark plots are available in benchmarks/figures/:
- `corpus_size_latency.png` - Latency scaling with corpus size
- `corpus_size_qps.png` - Throughput scaling
- `memory_usage.png` - Memory profile comparison
- `latency_distribution.png` - Latency percentiles across benchmarks
- `scaling_analysis.png` - Comprehensive scaling trends
Synthetic Benchmarks (component-level) - Latest Results with Fixes:
- KV Cache (100 seq, 1000 tokens/seq): Sub-millisecond attach/get/detach operations across the board
- Scheduler (1000 requests, batch_size=32): Efficient batch processing (0.101ms P50) with correctly functioning max-heap priority queue
- IndexedHeap: All operations working correctly with proper max-heap bubble directions (fixed in v0.1.0)
- HNSW (1000 vectors, dim=128, seed=42): Fast search latency (1.65ms P50) with reproducible graph structures - 22,964 edges, avg degree 23.0
- Inverted Index (100 docs, 10 queries): Fast BM25 search (0.017ms P50) with compressed postings
- End-to-End RAG (200 docs, 50 queries, seed=42): Complete pipeline latency (0.533ms P50) with reproducible HNSW structures, hybrid search with score fusion
Performance Visualizations
Real Corpus Scaling Analysis
Search latency (P50, P95, P99) vs corpus size on FIQA dataset
Throughput (QPS) vs corpus size - demonstrates scaling behavior
Comprehensive scaling analysis showing latency and throughput trends
Component-Level Benchmarks
Latency percentiles (P50, P95, P99) across all component benchmarks
P95 latency comparison chart for all component benchmarks
Peak RSS and memory delta by benchmark - helps identify memory-intensive operations (auto-generated when benchmarks include memory metrics)
Detailed Results
Complete benchmark results are available in:
- CSV: `benchmarks/results/benchmark_results.csv` - includes `peak_rss_mb` and `memory_delta_mb` columns
- JSON: Individual benchmark JSON files in `benchmarks/results/` - include memory metrics
- Plots: PNG files in `benchmarks/figures/`
  - `latency_distribution.png` - Latency percentiles across benchmarks
  - `benchmark_comparison.png` - P95 latency comparison
  - `memory_usage.png` - Peak RSS and memory delta by benchmark
  - `corpus_size_latency.png` - Real corpus scaling analysis (latency)
  - `corpus_size_qps.png` - Real corpus scaling analysis (throughput)
  - `scaling_analysis.png` - Comprehensive scaling trends
Memory Metrics:
- Peak RSS: Peak Resident Set Size (physical memory used) in megabytes
- Memory Delta: Memory allocated during benchmark execution (peak - initial) in megabytes
- Build Peak RSS: Peak memory during index/document build phase (where applicable)
Results measured on: macOS (Apple Silicon), Python 3.14.0. Performance and memory usage vary by hardware and dataset size.
Data Acquisition
We benchmark on large, public datasets to ensure realistic performance measurements:
Datasets
Datasets with Published Benchmark Results:
- BEIR FIQA (Financial Question Answering) — BEIR Paper - Primary evaluation dataset (50k documents, results for 10k, 25k, 50k subsets)
- Amazon Reviews 2023 (McAuley Lab) — Hugging Face - CC BY 4.0 (results for 10k subset)
- MS MARCO (queries/passages) — Research use only; see MS MARCO license (results for 10k subset)
Additional Available Datasets:
- Yelp Open Dataset — Available in codebase, no published results yet
- Wikipedia — Available in codebase, no published results yet
- Common Crawl — Available in codebase, optional for large-scale testing
See data/README.md for exact commands, checksums, and licensing notes.
Quick Dataset Setup
# Download datasets
python3 scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
python3 scripts/download_corpus.py --source amazon23 --output data/raw/amazon23 --limit 500000
# Prepare embeddings
python3 scripts/prepare_embeddings.py \
--input data/raw/beir/fiqa/corpus.jsonl \
--output data/embeddings/fiqa.npy \
--dim 384 \
--seed 42
# Build indices
python3 scripts/build_indices.py \
--corpus data/raw/beir/fiqa/corpus.jsonl \
--emb data/embeddings/fiqa.npy \
--index-dir data/indices/fiqa \
--bm25 \
--hnsw \
--ef 200 \
--M 16
# Run benchmarks
python3 scripts/run_benchmarks.py \
--corpus fiqa \
--corpus-file data/raw/beir/fiqa/corpus.jsonl \
--emb-file data/embeddings/fiqa.npy \
--sizes 10k 50k 100k \
--ef 50 100 200 \
--M 8 16 32
Reproducibility
All benchmarks are dataset-backed. We publish:
- Corpus/size: Exact dataset and sample size used
- Parameter grid: HNSW M, efSearch, efConstruction values
- Hardware: CPU, memory, Python version
- Metrics: Latency (p50/p95/p99), QPS, index build time, peak RSS (Resident Set Size), memory delta
- Memory Profiling: All benchmarks use `psutil` to measure peak RSS and memory allocation delta
No synthetic-only numbers in production benchmarks. Real corpora ensure:
- Realistic entropy and noise (not artificially fast)
- Realistic cache behavior (not always hot)
- Realistic memory bandwidth and I/O pressure
- Credible, reproducible results
Why Synthetic Benchmarks Were Too Fast
Micro synthetic data has low entropy and zero noise, making BM25/HNSW unrealistically fast:
- Tiny corpora → caches always hot, index small, branch predictors friendly
- No I/O pressure → no realistic memory bandwidth or NUMA effects
- Perfect distribution → unrealistic query patterns
Real corpora fix this and make results credible for production deployment.
Environment Hash
To ensure reproducibility across different environments, use the environment hash script:
# Generate environment hash
python3 scripts/env_hash.py
# Or specify custom output path
python3 scripts/env_hash.py --output audit/env_hash.txt
The script generates a file containing:
- Python version and executable path
- Operating system information (system, release, version, architecture, processor)
- CPU information (physical/logical cores, frequency)
- NumPy configuration (version, BLAS library info)
- Key package versions
Output is saved to audit/env_hash.txt by default. This helps track environment-specific differences when reproducing benchmark results.
Repository Structure
llm-rag-ds-optimizer/
├── llmds/ # Core library modules
│ ├── kv_cache.py # KV cache with prefix sharing
│ ├── paged_allocator.py # Paged memory allocator
│ ├── token_lru.py # Token-aware LRU cache
│ ├── scheduler.py # Dynamic micro-batching scheduler
│ ├── indexed_heap.py # Indexed binary heap
│ ├── admissions.py # Admission controller
│ ├── inverted_index.py # BM25 inverted index
│ ├── hnsw.py # HNSW ANN index
│ ├── cmsketch.py # Count-Min Sketch
│ └── retrieval_pipeline.py # End-to-end retrieval
├── benchmarks/ # Benchmark scripts and results
│ ├── bench_*.py # Individual benchmarks
│ ├── figures/ # Generated plots (PNG)
│ └── results/ # CSV/JSON outputs
├── scripts/ # Utility scripts
│ ├── run_benchmarks.py # Run all benchmarks
│ ├── plot_results.py # Generate plots and CSV
│ ├── make_report.py # Generate Word report
│ └── make_slides.py # Generate slides
├── docs/ # Documentation
│ ├── architecture.md # System architecture
│ ├── api.md # API reference
│ └── usage.md # Usage examples
└── papers/ # Research papers
└── *.pdf # Papers referenced in growth plan
Development Guide
Code Quality
# Linting
ruff check .
# Formatting
ruff format .
# Type checking
mypy llmds --ignore-missing-imports # All type safety violations fixed
# Run all quality checks
ruff check . && ruff format --check . && mypy llmds --ignore-missing-imports
Research-Based Growth Plan
This project is designed to integrate cutting-edge research from 6 key papers in the papers/ directory. Below is the roadmap for future enhancements.
Research Papers Overview
1. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation
   - Focus: Chunk-level caching for RAG systems
   - Impact: 30-50% latency reduction for repeated queries
   - Priority: High (Phase 2)
2. Efficient Vector Search on Disaggregated Memory with d-HNSW
   - Focus: Distributed HNSW for large-scale deployments
   - Impact: Enables billion-scale vector search
   - Priority: Medium (Phase 3)
3. Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor
   - Focus: Fairness in frequency estimation across groups
   - Impact: Ensures equal service quality across users/groups
   - Priority: Medium (Phase 1)
4. Memory-efficient Sketch Acceleration for Handling Large Network Flows on FPGAs
   - Focus: Hardware-aware sketch optimizations
   - Impact: 30-50% memory reduction for sketch data structures
   - Priority: Low (Phase 1)
5. Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data
   - Focus: Combining vector and scalar (metadata) filtering
   - Impact: Enables complex queries without performance degradation
   - Priority: High (Phase 2)
6. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
   - Focus: Original HNSW paper (already implemented)
   - Enhancement: Robust algorithms and quality maintenance
   - Priority: Low (Phase 5)
Implementation Roadmap
Phase 1: Quick Wins (Weeks 1-4)
- Memory-Efficient Sketch - Low effort, high value (30-50% memory reduction)
- Fair Count-Min - Important for production systems (2-3 weeks)
Phase 2: Core Features (Weeks 5-12)
- Chunk-Level Caching - Highest impact for RAG (30-50% latency reduction, 4-6 weeks)
- Filtered Search - Essential for production use (3-4 weeks)
Phase 3: Scale (Weeks 13-20)
- Distributed HNSW - Enables large-scale deployment (6-8 weeks)
- Enhanced HNSW - Polish and optimization (ongoing)
Expected Performance Improvements
| Feature | Latency Reduction | Memory Reduction | Throughput Increase |
|---|---|---|---|
| Chunk Caching | 30-50% | 10-20% | 20-40% |
| Filtered Search | <10% overhead | +5-10% | Maintained |
| Distributed HNSW | <5% overhead | Linear scaling | Linear scaling |
| Fair Count-Min | 0% | 0% | Maintained |
| Memory-Efficient Sketch | <5% | 30-50% | +10-20% |
New Modules Planned
llmds/
├── chunk_cache.py # NEW: Chunk-level caching (Paper #1)
├── filtered_hnsw.py # NEW: Filtered search (Paper #5)
├── query_filters.py # NEW: Filter query language (Paper #5)
├── distributed_hnsw.py # NEW: Distributed HNSW (Paper #2)
├── fair_cmsketch.py # NEW: Fair Count-Min (Paper #3)
└── sparse_cmsketch.py # NEW: Memory-efficient sketch (Paper #4)
Technical Implementation Details
Priority 1: Chunk-Level Caching (Cache-Craft)
Architecture:
- Chunk Identification: Track chunks at a finer granularity than documents
- Chunk Metadata: Store access patterns, relevance scores, chunk sizes
- Chunk Reuse: Detect when chunks appear in multiple queries
- Adaptive Eviction: Chunk-aware eviction policies
Implementation Structure:
# llmds/chunk_cache.py
from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    """Represents a document chunk with metadata."""
    chunk_id: int
    doc_id: int
    start_pos: int
    end_pos: int
    embedding: np.ndarray
    text: str
    access_count: int
    last_accessed: float
    relevance_score: float

class ChunkCache:
    """Chunk-level cache with reuse detection."""
    def get_chunks(self, chunk_ids: list[int]) -> list[Chunk]: ...
    def add_chunks(self, chunks: list[Chunk]) -> None: ...
    def detect_reuse(self, query_results: list[tuple[int, float]]) -> dict: ...
Priority 2: Filtered Vector-Scalar Search
Architecture:
- Filter Query Language: Support complex filter predicates
- Filter-Aware Indexing: Index both vectors and scalar attributes
- Filter Pushdown: Apply filters during index traversal
- Boolean Filter Support: AND/OR/NOT combinations
Implementation Structure:
# llmds/query_filters.py
class Filter: ...                   # Base class for filter predicates
class RangeFilter(Filter): ...      # field BETWEEN min AND max
class EqualityFilter(Filter): ...   # field == value
class SetFilter(Filter): ...        # field IN [values]
class CompositeFilter(Filter): ...  # Boolean combinations (AND/OR/NOT)

# llmds/filtered_hnsw.py
class FilteredHNSW(HNSW):
    """HNSW with scalar attribute filtering."""
    def search_with_filter(self, query, k: int, filter: Filter): ...
Priority 3: Distributed HNSW (d-HNSW)
Architecture:
- Consistent Hashing: Distribute vectors across nodes
- Cross-Partition Search: Efficiently search across partitions
- Replication Strategy: Optional vector replication for availability
- Query Routing: Route queries to relevant partitions
Implementation Structure:
# llmds/distributed_hnsw.py
class DistributedHNSW:
    """Distributed HNSW across multiple nodes."""
    def __init__(self, nodes: list[str], replication_factor: int = 1): ...
    def add(self, vec, vec_id): ...       # Hash to partition, add to primary + replicas
    def search(self, query, k: int): ...  # Search all partitions, merge results
Priority 4: Fair Count-Min Sketch
Architecture:
- Group Tracking: Track multiple groups with equal error bounds
- Fair Estimation: Guarantee equal approximation factors per group
- Group Statistics: Report fairness metrics
Implementation Structure:
# llmds/fair_cmsketch.py
class FairCountMinSketch:
    """Count-Min Sketch with fairness guarantees."""
    def __init__(self, width: int, depth: int, groups: list[str]): ...
    def add(self, item: str, group: str, count: int = 1): ...
    def estimate(self, item: str, group: str) -> int: ...
    def get_fairness_metrics(self) -> dict: ...
Integration Roadmap
Phase 1: Chunk Caching (4-6 weeks)
- Week 1-2: Implement `Chunk` and `ChunkCache` classes
- Week 3: Integrate with `RetrievalPipeline`
- Week 5: Implement prefetching
- Week 6: Benchmark and optimize
Phase 2: Filtered Search (3-4 weeks)
- Week 1: Design filter query API
- Week 2: Implement `FilteredHNSW` with scalar indexing
- Week 3: Add filter pushdown strategies
- Week 4: Benchmark filtered search performance
Phase 3: Distributed HNSW (6-8 weeks)
- Week 1-2: Design distributed architecture
- Week 3: Implement consistent hashing
- Week 4-5: Implement cross-partition search
- Week 6: Add replication
- Week 7-8: Testing and optimization
Phase 4: Fairness (2-3 weeks)
- Week 1: Implement `FairCountMinSketch`
- Week 3: Benchmark fairness guarantees
Performance Targets
- Chunk Caching: 30-50% reduction in retrieval latency for repeated queries, 40-60% cache hit rate
- Filtered Search: <10% overhead compared to unfiltered search, support filters with >90% selectivity efficiently
- Distributed HNSW: Linear scalability with number of nodes, <5% overhead for cross-partition queries
- Fair Count-Min: Equal error bounds across groups (±5% variance)
Documentation
- Architecture Overview - System architecture and design decisions
- API Reference - Complete API documentation with complexities
- Usage Guide - Code examples and integration patterns
- Mathematical Models - Mathematical formulations and algorithms (BM25, HNSW, Count-Min Sketch, etc.)
Citation
If you use this library in your research, please cite:
@software{llm_rag_ds_optimizer,
title = {LLM RAG Data Structures Optimizer},
author = {Gutierrez, Carlos},
email = {cgutierrez44833@ucumberlands.edu},
year = {2025},
url = {https://github.com/CarGDev/llm-rag-ds-optimizer}
}
Contributing
We welcome contributions! This section provides guidelines for contributing to the project.
Getting Started
- Fork the repository
- Clone your fork: `git clone https://github.com/yourusername/llm-rag-ds-optimizer.git`
- Create a branch: `git checkout -b feature/your-feature-name`
- Install dependencies: `poetry install` or `pip install -e ".[dev]"`
- Make your changes
- Submit a pull request
Development Guidelines
Code Style:
- Follow PEP 8 style guidelines
- Use `ruff` for linting and formatting
- Run `ruff check .` and `ruff format .` before committing
- Type hints are required for all public APIs
Documentation:
- Update docstrings for all new functions/classes (Google/NumPy style)
- Update API documentation if adding new public APIs
- Update README if adding new features
Commit Messages:
- Use clear, descriptive commit messages
- Follow conventional commits format:
  - `feat:` for new features
  - `fix:` for bug fixes
  - `docs:` for documentation
  - `refactor:` for code refactoring
Pull Request Process
- Run linting and formatting checks
- Update documentation as needed
- Submit a pull request with a clear description
- Address review feedback promptly
Reporting Issues
- Use GitHub Issues for bug reports
- Include:
- Description of the issue
- Steps to reproduce
- Expected vs. actual behavior
- Environment information (Python version, OS, etc.)
License
MIT License - see LICENSE file for details.
Code of Conduct
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
Our Standards
Examples of behavior that contributes to a positive environment:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Focusing on what is best for the community
- Showing empathy towards other community members
Examples of unacceptable behavior:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement. All complaints will be reviewed and investigated promptly and fairly.
This Code of Conduct is adapted from the Contributor Covenant, version 2.0.
Status: Production-ready core implementation. Research integration roadmap available for future enhancements.
Glossary
This glossary defines specialized terms and abbreviations used throughout this project.
Performance Metrics
P50, P95, P99 (Percentiles): Statistical measures of latency distribution.
- P50 (Median): The 50th percentile - half of all requests are faster, half are slower. Represents typical performance.
- P95: The 95th percentile - 95% of requests are faster than this value. Captures tail latency, important for user experience.
- P99: The 99th percentile - 99% of requests are faster. Used to understand worst-case scenarios and outliers.
Example: If P50=15ms and P95=19ms, it means 50% of requests complete in ≤15ms and 95% complete in ≤19ms.
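For reference, a nearest-rank percentile can be computed in a few lines. This is one of several common percentile conventions and is illustrative only, not the library's `compute_percentiles()` utility.

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    s = sorted(values)
    k = max(0, -(-len(s) * p // 100) - 1)  # ceil(n*p/100) - 1, clamped at 0
    return s[int(k)]
```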
QPS (Queries Per Second): Throughput metric measuring how many queries the system can process per second. Higher QPS indicates better throughput.
Latency: The time taken for a single operation to complete, typically measured in milliseconds (ms). Lower latency indicates faster response times.
Data Structures & Algorithms
HNSW (Hierarchical Navigable Small World): A graph-based algorithm for approximate nearest neighbor search. Provides logarithmic search complexity with high recall. Parameters:
- M: Maximum number of connections per node in the graph
- efConstruction: Controls graph quality during index building
- efSearch: Number of candidates to explore during search (higher = better recall, slower)
BM25 (Best Matching 25): A probabilistic ranking function for information retrieval. Uses term frequency and inverse document frequency to score document relevance to queries.
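One common form of the BM25 term score looks like this; it is a sketch using the standard k1/b parameterization, not necessarily the exact variant implemented in `llmds/inverted_index.py`.

```python
import math

def bm25_score(tf: int, doc_len: float, avg_len: float,
               df: int, n_docs: int,
               k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document.
    tf: term frequency in the doc; df: number of docs containing the term."""
    # Inverse document frequency: rare terms score higher
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturating term frequency, normalized by document length
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

A document's score for a query is the sum of `bm25_score` over the query's terms.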
KV Cache (Key-Value Cache): Stores precomputed key-value pairs from transformer attention layers to avoid redundant computation for repeated prefixes.
Inverted Index: A data structure mapping terms (words) to documents containing them. Enables fast text search by allowing direct lookup of documents containing query terms.
Count-Min Sketch: A probabilistic data structure for frequency estimation with bounded error. Used for hot query detection and cache optimization.
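A minimal Count-Min Sketch looks roughly like the following. This is a hedged sketch, not the library's implementation: `width` bounds the overestimation error and `depth` bounds the probability of exceeding that error, and estimates never undercount:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 256, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One independent hash per row, derived via a per-row salt.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(h, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # True count <= min over rows; collisions only inflate, never deflate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for _ in range(5):
    cms.add("hot query")
estimate = cms.estimate("hot query")  # >= 5, usually exactly 5
```

Hot-query detection then reduces to checking whether `estimate(q)` crosses a threshold, which can drive cache priming.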
Retrieval & RAG
RAG (Retrieval-Augmented Generation): An approach where LLMs generate responses using information retrieved from external knowledge bases, improving accuracy and reducing hallucination.
ANN (Approximate Nearest Neighbor): Algorithms that find similar vectors quickly, trading exact results for speed. HNSW is an ANN algorithm.
Hybrid Search: Combining dense vector search (semantic similarity) with sparse keyword search (BM25) for better retrieval quality.
Recall@K: Retrieval metric - the fraction of all relevant documents that appear in the top-K results. Higher recall means more of the relevant documents are retrieved.
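Computing Recall@K is a one-liner; the `recall_at_k` helper below is an illustrative definition, not part of the library's API:

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of relevant docs that appear among the top-K retrieved docs."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

relevant = {1, 4, 7}
retrieved = [4, 2, 1, 9, 7, 3]          # ranked result list
r5 = recall_at_k(retrieved, relevant, 5)  # all 3 relevant docs in top-5 -> 1.0
r2 = recall_at_k(retrieved, relevant, 2)  # only doc 4 in top-2 -> 1/3
```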
Score Fusion: Combining scores from multiple retrieval methods (e.g., BM25 + vector similarity) into a single ranked list.
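One common fusion recipe is min-max normalization of each score list followed by a weighted sum. The `fuse` helper below is an illustrative sketch; the weight and normalization scheme are assumptions, and reciprocal rank fusion is an equally common alternative:

```python
def fuse(bm25: dict[str, float], dense: dict[str, float],
         w_sparse: float = 0.5) -> list[tuple[str, float]]:
    """Min-max normalize each score dict, then rank by weighted sum."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                  # avoid division by zero
        return {d: (s - lo) / span for d, s in scores.items()}
    nb, nd = norm(bm25), norm(dense)
    docs = set(nb) | set(nd)                     # union of both result sets
    fused = {d: w_sparse * nb.get(d, 0.0) + (1 - w_sparse) * nd.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Doc "a" ranks first in both retrievers, so it tops the fused list.
ranked = fuse({"a": 12.0, "b": 7.0}, {"a": 0.9, "c": 0.2})
```

The library maintains the fused top-K with its indexed heap rather than a full sort, but the scoring logic is the same idea.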
System Terms
Micro-batching: Grouping multiple requests together for parallel processing, improving GPU utilization and throughput.
Admission Control: System that decides whether to accept or reject incoming requests based on current load and resource availability.
Rate Limiting: Controlling the number of requests processed per unit time to prevent system overload.
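A token bucket is the classic way to implement rate limiting: tokens refill at the target rate, each accepted request spends one, and the bucket capacity caps bursts. The sketch below is illustrative, not the library's admission controller:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Accept the request if a token is available; otherwise shed it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=100.0, capacity=10.0)   # 100 req/s, burst of 10
accepted = sum(bucket.allow() for _ in range(50))  # burst above capacity is shed
```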
Token Budget: Maximum number of tokens (sub-word units) that can be cached or processed within memory constraints.
Prefix Sharing: Technique where identical prompt prefixes across multiple sequences are stored once, reducing memory usage.
Dataset & Evaluation
BEIR (Benchmarking Information Retrieval): A collection of diverse information retrieval benchmarks covering multiple domains.
MS MARCO: Large-scale passage ranking dataset used as a standard benchmark for information retrieval systems.
FIQA: Financial question-answering dataset from BEIR, used for domain-specific retrieval evaluation.
Corpus: A collection of documents used for indexing and retrieval testing.
JSONL (JSON Lines): File format where each line is a valid JSON object, commonly used for large datasets.
Technical Abbreviations
LLM (Large Language Model): AI models trained on massive text corpora to understand and generate human-like text.
IR (Information Retrieval): The field of study focused on finding relevant information from large collections of documents.
API (Application Programming Interface): A set of functions and protocols for interacting with software components.
O(log n): Logarithmic time complexity - the time to complete an operation grows logarithmically with input size, indicating efficient algorithms.
Appendix
Additional Resources
Research Papers: See papers/ directory and docs/CITATIONS.md for referenced research papers.
Primary Citations:
- HNSW: Malkov & Yashunin (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 42(4), 824-836.
- KV Cache: Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation
- Count-Min Sketch: Cormode & Muthukrishnan (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
- BM25: Robertson & Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Additional Papers:
- d-HNSW: Distributed HNSW for disaggregated memory
- Fair-Count-Min: Fairness in frequency estimation
- Memory-efficient sketches
- Survey of Filtered Approximate Nearest Neighbor Search
See docs/CITATIONS.md for complete citation mapping to implementation code.
Dataset Licenses:
- MS MARCO: Research use only - see MS MARCO Terms
- BEIR (FIQA): Varies by task - check individual task licenses (typically CC-BY or similar)
- Amazon Reviews 2023: CC BY 4.0
- Yelp: See Yelp Dataset License - Research use allowed
- Wikipedia: CC BY-SA 3.0 / GFDL
Reproducibility Notes:
- All benchmarks use deterministic seeds (42) for reproducibility
- HNSW seed control: The `HNSW` class accepts an optional `seed` parameter for reproducible graph structure. When a seed is provided, each HNSW instance uses its own `random.Random(seed)` state for level assignments, ensuring identical graph structures across runs.
- Embeddings are generated deterministically based on document IDs
- Benchmark results include hardware specifications
- Exact corpus sizes and parameters are documented in result files
HNSW Seed Usage:
from llmds.hnsw import HNSW
# Reproducible HNSW with fixed seed
hnsw = HNSW(dim=384, M=16, ef_construction=200, ef_search=50, seed=42)
# Or use RetrievalPipeline (automatically uses seed=42 in benchmarks)
from llmds.retrieval_pipeline import RetrievalPipeline
pipeline = RetrievalPipeline(embedding_dim=384, seed=42)
Dependency Management:
- Poetry: Use `poetry.lock` (when available) for exact version pinning: `poetry install` uses the lock file for reproducible builds
- pip: Use `requirements.txt` and `requirements-dev.txt` for compatible version ranges: `pip install -r requirements-dev.txt` installs all dependencies
- Both methods ensure reproducible builds across different environments
- Python version: >=3.11 (see `.python-version` or `pyproject.toml`)
Performance Baseline:
- Synthetic benchmarks (small data): sub-millisecond latencies are typical
- Real corpus benchmarks (large data): higher latencies due to realistic data distributions, cache behavior, and memory access patterns
- Production systems typically see a 10-100x latency increase when moving from synthetic to real data
Hardware Used for Benchmarks:
- System: macOS (Apple Silicon)
- Python: 3.14.0
- Performance varies by hardware and dataset characteristics
Benchmark Result Files
- Individual results: `benchmarks/results/{corpus}/{date}/results.json`
- Combined CSV: `benchmarks/results/benchmark_results.csv`
- Visualizations: `benchmarks/figures/*.png`
Contact & Support
For questions, issues, or contributions, please see:
- Contributing: See Contributing section above
- Code of Conduct: See Code of Conduct section above
- GitHub Issues: Report bugs or request features via GitHub Issues