187 lines
6.5 KiB
Markdown
187 lines
6.5 KiB
Markdown
# Research Citations and Implementation Mapping
|
|
|
|
This document maps research papers to their implementations in the codebase.
|
|
|
|
## HNSW (Hierarchical Navigable Small World)
|
|
|
|
**Implementation:** `llmds/hnsw.py`
|
|
|
|
**Primary Citation:**
|
|
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.
|
|
|
|
**Related Papers:**
|
|
- Efficient Vector Search on Disaggregated Memory with d-HNSW (for memory-efficient variations)
|
|
|
|
**Techniques Implemented:**
|
|
- Hierarchical multi-layer graph structure (`_layers`)
|
|
- Greedy search algorithm (`_search_layer`)
|
|
- Level assignment with exponential distribution (`_random_level`)
|
|
- Entry point selection and navigation
|
|
- Dynamic connection management (M parameter)
|
|
- efConstruction and efSearch parameters for quality/speed trade-offs
|
|
|
|
**Code References:**
|
|
- `HNSW` class: Main implementation
|
|
- `_random_level()`: Level assignment following exponential distribution
|
|
- `_search_layer()`: Greedy search in a single layer
|
|
- `add()`: Vector insertion with connection management
|
|
- `search()`: Multi-layer search from top to bottom
|
|
|
|
## KV Cache with Prefix Sharing
|
|
|
|
**Implementation:** `llmds/kv_cache.py`, `llmds/paged_allocator.py`
|
|
|
|
**Primary Citation:**
|
|
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (specific paper on KV cache optimization for RAG)
|
|
|
|
**Techniques Implemented:**
|
|
- Paged allocation with fixed-size pages (`PagedAllocator`)
|
|
- Prefix/prompt sharing with copy-on-write semantics (`KVCache._copy_if_shared`)
|
|
- Hash-based deduplication (`_hash_prefix`)
|
|
- Reference counting for shared pages (`_page_refs`)
|
|
- Defensive copying to prevent corruption (`get()` returns deep copies)
|
|
|
|
**Code References:**
|
|
- `KVCache` class: Main KV cache implementation
|
|
- `PagedAllocator` class: Page-based memory management
|
|
- `_copy_if_shared()`: Copy-on-write implementation
|
|
- `_hash_prefix()`: SHA256-based prefix hashing
|
|
- `attach()` / `detach()`: Sequence management with reference counting
|
|
|
|
## Count-Min Sketch
|
|
|
|
**Implementation:** `llmds/cmsketch.py`
|
|
|
|
**Primary Citations:**
|
|
- Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
|
|
- Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor
|
|
|
|
**Techniques Implemented:**
|
|
- Count-Min Sketch with multiple hash functions (`depth` parameter)
|
|
- Conservative update strategy to reduce overestimation bias
|
|
- Error bound calculation (`get_error_bound()`)
|
|
- Hot item detection (`is_hot()`)
|
|
|
|
**Code References:**
|
|
- `CountMinSketch` class: Main sketch implementation
|
|
- `add()`: Conservative update algorithm
|
|
- `estimate()`: Minimum across all hash rows
|
|
- `get_error_bound()`: Theoretical error bound calculation
|
|
- Uses MurmurHash3 for hash functions
|
|
|
|
## BM25 Inverted Index
|
|
|
|
**Implementation:** `llmds/inverted_index.py`
|
|
|
|
**Primary Citation:**
|
|
- Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
|
|
|
|
**Techniques Implemented:**
|
|
- BM25 scoring formula with k1 and b parameters
|
|
- Inverted index structure with compressed postings
|
|
- Varint encoding for integer compression (`_encode_varint`)
|
|
- Zigzag encoding for signed integers (`_zigzag_encode`)
|
|
- Term frequency and document frequency tracking
|
|
|
|
**Code References:**
|
|
- `InvertedIndex` class: Main inverted index implementation
|
|
- `_bm25_score()`: BM25 scoring function
|
|
- `add_document()`: Index construction
|
|
- `search()`: BM25 retrieval
|
|
|
|
## Hybrid Retrieval (Dense + Sparse)
|
|
|
|
**Implementation:** `llmds/retrieval_pipeline.py`
|
|
|
|
**Primary Citation:**
|
|
- Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data
|
|
|
|
**Techniques Implemented:**
|
|
- Hybrid dense (HNSW) + sparse (BM25) retrieval
|
|
- Score fusion with configurable weights (`fusion_weight` parameter)
|
|
- Top-K maintenance using indexed heap
|
|
- Hot query caching using Count-Min Sketch
|
|
|
|
**Code References:**
|
|
- `RetrievalPipeline` class: End-to-end hybrid retrieval
|
|
- `search()`: Combines HNSW and BM25 with score fusion
|
|
- Uses `IndexedHeap` for efficient top-K maintenance
|
|
|
|
## Indexed Heap
|
|
|
|
**Implementation:** `llmds/indexed_heap.py`
|
|
|
|
**Techniques Implemented:**
|
|
- Indexed binary heap for O(log n) priority updates
|
|
- Support for both min-heap and max-heap
|
|
- O(1) key lookup via position map (`_pos` dictionary)
|
|
- Decrease/increase key operations with correct bubble directions
|
|
|
|
**Code References:**
|
|
- `IndexedHeap` class: Indexed heap implementation
|
|
- `decrease_key()` / `increase_key()`: Key update operations
|
|
- `_bubble_up()` / `_bubble_down()`: Heap property maintenance
|
|
|
|
## Scheduler and Batching
|
|
|
|
**Implementation:** `llmds/scheduler.py`, `llmds/admissions.py`
|
|
|
|
**Techniques Implemented:**
|
|
- Dynamic micro-batching with configurable wait time
|
|
- Priority queue using indexed heap
|
|
- Admission control with QPS and token rate limiting
|
|
- Moving window average for rate tracking
|
|
|
|
**Code References:**
|
|
- `Scheduler` class: Batching scheduler
|
|
- `AdmissionController` class: Rate limiting and admission control
|
|
- Uses `IndexedHeap` for priority queue
|
|
|
|
## Token-Aware LRU
|
|
|
|
**Implementation:** `llmds/token_lru.py`
|
|
|
|
**Techniques Implemented:**
|
|
- LRU eviction with token-aware budgeting
|
|
- Cumulative token tracking across cache entries
|
|
- Eviction based on token count rather than entry count
|
|
|
|
**Code References:**
|
|
- `TokenLRU` class: Token-aware LRU cache
|
|
- `total_tokens()`: Cumulative token tracking
|
|
- `put()`: Token-aware insertion with eviction
|
|
|
|
---
|
|
|
|
## How to Cite
|
|
|
|
### Citing This Software
|
|
|
|
If you use this codebase in your research, please cite:
|
|
|
|
```bibtex
|
|
@software{llm_rag_ds_optimizer,
|
|
title = {LLM RAG Data Structures Optimizer},
|
|
author = {Gutierrez, Carlos},
|
|
email = {cgutierrez44833@ucumberlands.edu},
|
|
year = {2025},
|
|
url = {https://github.com/CarGDev/llm-rag-ds-optimizer}
|
|
}
|
|
```
|
|
|
|
### Citing Related Papers
|
|
|
|
When using this codebase in research, please also cite the relevant papers:
|
|
|
|
1. **HNSW**: Cite Malkov & Yashunin (2018) for HNSW algorithm
|
|
2. **KV Cache**: Cite Cache-Craft paper for prefix sharing techniques
|
|
3. **Count-Min Sketch**: Cite Cormode & Muthukrishnan (2005) for Count-Min Sketch
|
|
4. **BM25**: Cite Robertson & Zaragoza (2009) for BM25 scoring
|
|
5. **Hybrid Retrieval**: Cite survey paper for hybrid dense+sparse approaches
|
|
|
|
## Additional References
|
|
|
|
- Papers in `papers/` directory contain full citations and implementation details
|
|
- See `README.md` for usage examples and performance benchmarks
|
|
|