6.5 KiB
Research Citations and Implementation Mapping
This document maps research papers to their implementations in the codebase.
HNSW (Hierarchical Navigable Small World)
Implementation: llmds/hnsw.py
Primary Citation:
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.
Related Papers:
- Efficient Vector Search on Disaggregated Memory with d-HNSW (for memory-efficient variations)
Techniques Implemented:
- Hierarchical multi-layer graph structure (
_layers) - Greedy search algorithm (
_search_layer) - Level assignment with exponential distribution (
_random_level) - Entry point selection and navigation
- Dynamic connection management (M parameter)
- efConstruction and efSearch parameters for quality/speed trade-offs
Code References:
HNSWclass: Main implementation_random_level(): Level assignment following exponential distribution_search_layer(): Greedy search in a single layeradd(): Vector insertion with connection managementsearch(): Multi-layer search from top to bottom
KV Cache with Prefix Sharing
Implementation: llmds/kv_cache.py, llmds/paged_allocator.py
Primary Citation:
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (specific paper on KV cache optimization for RAG)
Techniques Implemented:
- Paged allocation with fixed-size pages (
PagedAllocator) - Prefix/prompt sharing with copy-on-write semantics (
KVCache._copy_if_shared) - Hash-based deduplication (
_hash_prefix) - Reference counting for shared pages (
_page_refs) - Defensive copying to prevent corruption (
get()returns deep copies)
Code References:
KVCacheclass: Main KV cache implementationPagedAllocatorclass: Page-based memory management_copy_if_shared(): Copy-on-write implementation_hash_prefix(): SHA256-based prefix hashingattach()/detach(): Sequence management with reference counting
Count-Min Sketch
Implementation: llmds/cmsketch.py
Primary Citations:
- Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
- Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor
Techniques Implemented:
- Count-Min Sketch with multiple hash functions (
depthparameter) - Conservative update strategy to reduce overestimation bias
- Error bound calculation (
get_error_bound()) - Hot item detection (
is_hot())
Code References:
CountMinSketchclass: Main sketch implementationadd(): Conservative update algorithmestimate(): Minimum across all hash rowsget_error_bound(): Theoretical error bound calculation- Uses MurmurHash3 for hash functions
BM25 Inverted Index
Implementation: llmds/inverted_index.py
Primary Citation:
- Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Techniques Implemented:
- BM25 scoring formula with k1 and b parameters
- Inverted index structure with compressed postings
- Varint encoding for integer compression (
_encode_varint) - Zigzag encoding for signed integers (
_zigzag_encode) - Term frequency and document frequency tracking
Code References:
InvertedIndexclass: Main inverted index implementation_bm25_score(): BM25 scoring functionadd_document(): Index constructionsearch(): BM25 retrieval
Hybrid Retrieval (Dense + Sparse)
Implementation: llmds/retrieval_pipeline.py
Primary Citation:
- Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data
Techniques Implemented:
- Hybrid dense (HNSW) + sparse (BM25) retrieval
- Score fusion with configurable weights (
fusion_weightparameter) - Top-K maintenance using indexed heap
- Hot query caching using Count-Min Sketch
Code References:
RetrievalPipelineclass: End-to-end hybrid retrievalsearch(): Combines HNSW and BM25 with score fusion- Uses
IndexedHeapfor efficient top-K maintenance
Indexed Heap
Implementation: llmds/indexed_heap.py
Techniques Implemented:
- Indexed binary heap for O(log n) priority updates
- Support for both min-heap and max-heap
- O(1) key lookup via position map (
_posdictionary) - Decrease/increase key operations with correct bubble directions
Code References:
IndexedHeapclass: Indexed heap implementationdecrease_key()/increase_key(): Key update operations_bubble_up()/_bubble_down(): Heap property maintenance
Scheduler and Batching
Implementation: llmds/scheduler.py, llmds/admissions.py
Techniques Implemented:
- Dynamic micro-batching with configurable wait time
- Priority queue using indexed heap
- Admission control with QPS and token rate limiting
- Moving window average for rate tracking
Code References:
Schedulerclass: Batching schedulerAdmissionControllerclass: Rate limiting and admission control- Uses
IndexedHeapfor priority queue
Token-Aware LRU
Implementation: llmds/token_lru.py
Techniques Implemented:
- LRU eviction with token-aware budgeting
- Cumulative token tracking across cache entries
- Eviction based on token count rather than entry count
Code References:
TokenLRUclass: Token-aware LRU cachetotal_tokens(): Cumulative token trackingput(): Token-aware insertion with eviction
How to Cite
Citing This Software
If you use this codebase in your research, please cite:
@software{llm_rag_ds_optimizer,
title = {LLM RAG Data Structures Optimizer},
author = {Gutierrez, Carlos},
email = {cgutierrez44833@ucumberlands.edu},
year = {2025},
url = {https://github.com/CarGDev/llm-rag-ds-optimizer}
}
Citing Related Papers
When using this codebase in research, please also cite the relevant papers:
- HNSW: Cite Malkov & Yashunin (2018) for HNSW algorithm
- KV Cache: Cite Cache-Craft paper for prefix sharing techniques
- Count-Min Sketch: Cite Cormode & Muthukrishnan (2005) for Count-Min Sketch
- BM25: Cite Robertson & Zaragoza (2009) for BM25 scoring
- Hybrid Retrieval: Cite survey paper for hybrid dense+sparse approaches
Additional References
- Papers in
papers/directory contain full citations and implementation details - See
README.mdfor usage examples and performance benchmarks