Initial commit: LLM-DS optimizer framework with data files excluded
186
docs/CITATIONS.md
Normal file
# Research Citations and Implementation Mapping

This document maps research papers to their implementations in the codebase.

## HNSW (Hierarchical Navigable Small World)

**Implementation:** `llmds/hnsw.py`

**Primary Citation:**
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.

**Related Papers:**
- Efficient Vector Search on Disaggregated Memory with d-HNSW (for memory-efficient variations)

**Techniques Implemented:**
- Hierarchical multi-layer graph structure (`_layers`)
- Greedy search algorithm (`_search_layer`)
- Level assignment with exponential distribution (`_random_level`)
- Entry point selection and navigation
- Dynamic connection management (M parameter)
- efConstruction and efSearch parameters for quality/speed trade-offs

**Code References:**
- `HNSW` class: Main implementation
- `_random_level()`: Level assignment following an exponential distribution
- `_search_layer()`: Greedy search in a single layer
- `add()`: Vector insertion with connection management
- `search()`: Multi-layer search from top to bottom
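The exponential level assignment can be sketched in a few lines (a minimal illustration of the technique, not the `llmds` code; the `m_l` normalization constant follows the HNSW paper's recommended `1/ln(2)`):

```python
import math
import random
from typing import Optional

def random_level(m_l: float = 1.0 / math.log(2.0),
                 rng: Optional[random.Random] = None) -> int:
    """Draw a layer for a new node: P(level >= l) decays exponentially."""
    rng = rng or random
    # 1 - u is in (0, 1], so the log is always defined
    return int(-math.log(1.0 - rng.random()) * m_l)

# With m_l = 1/ln(2), roughly half of all nodes land on level 0,
# and each higher level is about half as populated as the one below.
levels = [random_level(rng=random.Random(i)) for i in range(10_000)]
```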
## KV Cache with Prefix Sharing

**Implementation:** `llmds/kv_cache.py`, `llmds/paged_allocator.py`

**Primary Citation:**
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (specific paper on KV cache optimization for RAG)

**Techniques Implemented:**
- Paged allocation with fixed-size pages (`PagedAllocator`)
- Prefix/prompt sharing with copy-on-write semantics (`KVCache._copy_if_shared`)
- Hash-based deduplication (`_hash_prefix`)
- Reference counting for shared pages (`_page_refs`)
- Defensive copying to prevent corruption (`get()` returns deep copies)

**Code References:**
- `KVCache` class: Main KV cache implementation
- `PagedAllocator` class: Page-based memory management
- `_copy_if_shared()`: Copy-on-write implementation
- `_hash_prefix()`: SHA256-based prefix hashing
- `attach()` / `detach()`: Sequence management with reference counting
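Hash-based prefix deduplication reduces to one idea: identical token prefixes hash to the same key, so their pages can be stored once. A minimal sketch (hypothetical helper, not the `llmds` `_hash_prefix` implementation):

```python
import hashlib

def hash_prefix(tokens: list[int]) -> str:
    """SHA256 over the token sequence; equal prefixes yield equal digests."""
    # Joining with a separator keeps [1, 23] distinct from [12, 3]
    data = b",".join(str(t).encode() for t in tokens)
    return hashlib.sha256(data).hexdigest()

# Two sequences with the same prompt prefix map to one shared entry.
prefix = [101, 2023, 2003]
key_a = hash_prefix(prefix)
key_b = hash_prefix(list(prefix))
```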
## Count-Min Sketch

**Implementation:** `llmds/cmsketch.py`

**Primary Citations:**
- Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
- Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor

**Techniques Implemented:**
- Count-Min Sketch with multiple hash functions (`depth` parameter)
- Conservative update strategy to reduce overestimation bias
- Error bound calculation (`get_error_bound()`)
- Hot item detection (`is_hot()`)

**Code References:**
- `CountMinSketch` class: Main sketch implementation
- `add()`: Conservative update algorithm
- `estimate()`: Minimum across all hash rows
- `get_error_bound()`: Theoretical error bound calculation
- Uses MurmurHash3 for hash functions
## BM25 Inverted Index

**Implementation:** `llmds/inverted_index.py`

**Primary Citation:**
- Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.

**Techniques Implemented:**
- BM25 scoring formula with k1 and b parameters
- Inverted index structure with compressed postings
- Varint encoding for integer compression (`_encode_varint`)
- Zigzag encoding for signed integers (`_zigzag_encode`)
- Term frequency and document frequency tracking

**Code References:**
- `InvertedIndex` class: Main inverted index implementation
- `_bm25_score()`: BM25 scoring function
- `add_document()`: Index construction
- `search()`: BM25 retrieval
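Varint and zigzag encoding, as used for posting-list compression, can be sketched as follows (a generic illustration of the two techniques, not the `llmds` `_encode_varint`/`_zigzag_encode` internals):

```python
def zigzag_encode(n: int) -> int:
    """Map signed ints to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1

def varint_encode(n: int) -> bytes:
    """7 data bits per byte; the high bit flags a continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf: bytes) -> int:
    n, shift = 0, 0
    for b in buf:
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return n
```

Small deltas between sorted doc ids therefore compress to single bytes, which is why postings are usually gap-encoded before varint encoding.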
## Hybrid Retrieval (Dense + Sparse)

**Implementation:** `llmds/retrieval_pipeline.py`

**Primary Citation:**
- Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data

**Techniques Implemented:**
- Hybrid dense (HNSW) + sparse (BM25) retrieval
- Score fusion with configurable weights (`fusion_weight` parameter)
- Top-K maintenance using an indexed heap
- Hot query caching using Count-Min Sketch

**Code References:**
- `RetrievalPipeline` class: End-to-end hybrid retrieval
- `search()`: Combines HNSW and BM25 with score fusion
- Uses `IndexedHeap` for efficient top-K maintenance
## Indexed Heap

**Implementation:** `llmds/indexed_heap.py`

**Techniques Implemented:**
- Indexed binary heap for O(log n) priority updates
- Support for both min-heap and max-heap
- O(1) key lookup via position map (`_pos` dictionary)
- Decrease/increase key operations with correct bubble directions

**Code References:**
- `IndexedHeap` class: Indexed heap implementation
- `decrease_key()` / `increase_key()`: Key update operations
- `_bubble_up()` / `_bubble_down()`: Heap property maintenance
## Scheduler and Batching

**Implementation:** `llmds/scheduler.py`, `llmds/admissions.py`

**Techniques Implemented:**
- Dynamic micro-batching with configurable wait time
- Priority queue using an indexed heap
- Admission control with QPS and token rate limiting
- Moving window average for rate tracking

**Code References:**
- `Scheduler` class: Batching scheduler
- `AdmissionController` class: Rate limiting and admission control
- Uses `IndexedHeap` for the priority queue
## Token-Aware LRU

**Implementation:** `llmds/token_lru.py`

**Techniques Implemented:**
- LRU eviction with token-aware budgeting
- Cumulative token tracking across cache entries
- Eviction based on token count rather than entry count

**Code References:**
- `TokenLRU` class: Token-aware LRU cache
- `total_tokens()`: Cumulative token tracking
- `put()`: Token-aware insertion with eviction
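The idea of evicting by token budget rather than by entry count can be sketched with `OrderedDict` (a minimal illustration under assumed semantics, not the `TokenLRU` API):

```python
from collections import OrderedDict

class TokenBudgetLRU:
    """LRU cache that evicts least-recently-used entries until the
    cumulative token count fits within a fixed budget."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self._entries: OrderedDict[str, tuple[str, int]] = OrderedDict()
        self._tokens = 0

    def total_tokens(self) -> int:
        return self._tokens

    def get(self, key: str):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key: str, value: str, tokens: int) -> None:
        if key in self._entries:
            self._tokens -= self._entries.pop(key)[1]
        self._entries[key] = (value, tokens)
        self._tokens += tokens
        # Evict LRU entries until the token total fits the budget
        # (a single oversized entry is kept rather than thrashing).
        while self._tokens > self.token_budget and len(self._entries) > 1:
            _, (_, freed) = self._entries.popitem(last=False)
            self._tokens -= freed

cache = TokenBudgetLRU(token_budget=100)
cache.put("a", "short answer", 30)
cache.put("b", "long answer", 60)
cache.put("c", "another", 40)   # 130 tokens exceeds budget -> "a" evicted
```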
---

## How to Cite

### Citing This Software

If you use this codebase in your research, please cite:

```bibtex
@software{llm_rag_ds_optimizer,
  title = {LLM RAG Data Structures Optimizer},
  author = {Gutierrez, Carlos},
  email = {cgutierrez44833@ucumberlands.edu},
  year = {2025},
  url = {https://github.com/CarGDev/llm-rag-ds-optimizer}
}
```

### Citing Related Papers

When using this codebase in research, please also cite the relevant papers:

1. **HNSW**: Cite Malkov & Yashunin (2018) for the HNSW algorithm
2. **KV Cache**: Cite the Cache-Craft paper for prefix sharing techniques
3. **Count-Min Sketch**: Cite Cormode & Muthukrishnan (2005) for the Count-Min Sketch
4. **BM25**: Cite Robertson & Zaragoza (2009) for BM25 scoring
5. **Hybrid Retrieval**: Cite the survey paper for hybrid dense+sparse approaches

## Additional References

- Papers in the `papers/` directory contain full citations and implementation details
- See `README.md` for usage examples and performance benchmarks
252
docs/api.md
Normal file
# API Reference

## Core Modules

### `llmds.paged_allocator.PagedAllocator`

Paged memory allocator with slab allocation.

**Methods:**
- `alloc(num_pages: int) -> list[int]`: Allocate pages
- `free(page_ids: list[int]) -> None`: Free pages
- `stats() -> PageStats`: Get allocation statistics
- `defragment() -> None`: Defragment pages

**Complexity:** O(1) alloc/free, O(n) defragment
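A free-list page allocator with O(1) per-page alloc/free can be sketched as follows (a minimal illustration of the technique, not the `PagedAllocator` implementation):

```python
class FreeListAllocator:
    """Fixed pool of page ids; alloc/free are O(1) per page via a free list."""

    def __init__(self, max_pages: int):
        self._free = list(range(max_pages))  # stack of available page ids

    def alloc(self, num_pages: int) -> list[int]:
        if num_pages > len(self._free):
            raise MemoryError("out of pages")
        return [self._free.pop() for _ in range(num_pages)]

    def free(self, page_ids: list[int]) -> None:
        self._free.extend(page_ids)

alloc = FreeListAllocator(max_pages=8)
pages = alloc.alloc(3)
alloc.free(pages)
```

Because freed ids are simply pushed back onto the list, allocation never scans the pool; fragmentation shows up only as a scattered id order, which a defragment pass can compact.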
### `llmds.kv_cache.KVCache`

KV cache with prefix sharing and deduplication. Implements copy-on-write (COW) for safe prefix sharing.

**Parameters:**
- `page_size: int = 512` - Size of each KV cache page in tokens
- `max_pages: int = 10000` - Maximum number of pages to allocate
- `enable_prefix_sharing: bool = True` - Enable prefix sharing optimization

**Methods:**
- `attach(seq_id: int, kv_tokens: list, prefix_tokens: Optional[list] = None) -> None` - Attach KV cache for a sequence. Uses COW for shared pages.
- `detach(seq_id: int) -> None` - Detach and free KV cache, with reference counting for shared pages
- `get(seq_id: int) -> Optional[list]` - Get KV cache (returns a deep copy to prevent external modification)
- `stats() -> dict` - Get cache statistics, including shared page counts and reference counts

**Complexity:** O(1) attach, O(T) get (deep copy of T cached tokens), O(k) detach where k = pages

**Copy-on-Write Semantics:**
- Shared pages (from prefix sharing) are read-only until written
- Writes to shared pages trigger lazy copying (COW)
- Reference counting ensures shared pages are only freed when all references are released
- `get()` returns deep copies to prevent external corruption of shared pages

**Safety:** All shared page operations are protected against data corruption through COW and defensive copying.
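Copy-on-write with reference counting can be modeled in miniature (an illustrative sketch under assumed semantics, not the `KVCache` internals): a write to a page whose refcount exceeds one first copies the page and decrements the shared count.

```python
class CowPages:
    """Pages shared by id; a write to a shared page copies it first."""

    def __init__(self):
        self._data: dict[int, list[int]] = {}   # page_id -> contents
        self._refs: dict[int, int] = {}         # page_id -> reference count
        self._next_id = 0

    def new_page(self, contents: list[int]) -> int:
        pid = self._next_id
        self._next_id += 1
        self._data[pid] = list(contents)
        self._refs[pid] = 1
        return pid

    def share(self, pid: int) -> int:
        self._refs[pid] += 1
        return pid

    def write(self, pid: int, idx: int, value: int) -> int:
        """Return the page id actually written (a fresh copy if shared)."""
        if self._refs[pid] > 1:                 # copy-on-write trigger
            self._refs[pid] -= 1
            pid = self.new_page(self._data[pid])
        self._data[pid][idx] = value
        return pid

pool = CowPages()
a = pool.new_page([1, 2, 3])
b = pool.share(a)          # a second sequence references the same page
b = pool.write(b, 0, 99)   # triggers a copy; page `a` is untouched
```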
### `llmds.utils.Timer`

Simple timer context manager for measuring execution time.

**Usage:**
```python
from llmds.utils import Timer

with Timer() as t:
    # Your code here
    pass

elapsed_seconds = t.elapsed  # Float representing elapsed time
```

**Complexity:** O(1) for all operations
### `llmds.utils.MemoryProfiler`

Memory profiler for measuring peak RSS (Resident Set Size) during benchmarks.

**Methods:**
- `start() -> None`: Start memory profiling
- `sample() -> int`: Sample current RSS and update the peak
- `get_peak_rss_mb() -> float`: Get peak RSS in megabytes
- `get_peak_rss_bytes() -> int`: Get peak RSS in bytes
- `get_current_rss_mb() -> float`: Get current RSS in megabytes
- `get_memory_delta_mb() -> float`: Get memory delta from the initial RSS in megabytes

**Context Manager:**
- `memory_profiler() -> Iterator[MemoryProfiler]`: Context manager for automatic profiling

**Usage:**
```python
from llmds.utils import memory_profiler

with memory_profiler() as profiler:
    # Your code here
    profiler.sample()  # Optional: sample at specific points

peak_rss_mb = profiler.get_peak_rss_mb()
```

**Complexity:** O(1) for all operations
### `llmds.utils.compute_percentiles`

Compute P50, P95, and P99 percentiles from a list of values.

**Parameters:**
- `values: list[float]` - List of numeric values

**Returns:**
- `dict[str, float]` - Dictionary with `p50`, `p95`, `p99` keys

**Usage:**
```python
from llmds.utils import compute_percentiles

latencies = [10.5, 12.3, 11.1, 15.2, 10.8, ...]
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f}ms")
print(f"P95: {percentiles['p95']:.2f}ms")
print(f"P99: {percentiles['p99']:.2f}ms")
```

**Complexity:** O(n log n) where n = len(values)
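Percentiles of this kind are typically computed by sorting once and indexing; a minimal nearest-rank sketch (illustrative, and not necessarily the interpolation scheme `compute_percentiles` uses):

```python
def percentiles(values: list[float], qs=(50, 95, 99)) -> dict[str, float]:
    """Nearest-rank percentiles over a sorted copy: O(n log n)."""
    ordered = sorted(values)
    out = {}
    for q in qs:
        # rank = ceil(q/100 * n), clamped to the last element
        rank = -(-q * len(ordered) // 100)
        idx = min(len(ordered) - 1, max(0, rank - 1))
        out[f"p{q}"] = ordered[idx]
    return out

stats = percentiles([10.5, 12.3, 11.1, 15.2, 10.8])
```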
### `llmds.utils.calculate_statistics`

Calculate a comprehensive statistical summary for a list of values.

**Parameters:**
- `values: list[float]` - List of numeric values
- `confidence_level: float = 0.95` - Confidence level for intervals (e.g., 0.95 for a 95% CI)

**Returns:**
- `dict[str, Any]` - Dictionary containing:
  - `mean`: Mean value
  - `std`: Standard deviation (sample)
  - `min`: Minimum value
  - `max`: Maximum value
  - `p50`, `p95`, `p99`: Percentiles
  - `ci_lower`, `ci_upper`: Confidence interval bounds
  - `cv`: Coefficient of variation (%)
  - `count`: Number of values

**Usage:**
```python
from llmds.utils import calculate_statistics

measurements = [10.5, 12.3, 11.1, 15.2, 10.8, ...]
stats = calculate_statistics(measurements, confidence_level=0.95)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f}")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}]")
print(f"CV: {stats['cv']:.2f}%")
```

**Complexity:** O(n log n) where n = len(values)
### `llmds.token_lru.TokenLRU`

Token-aware LRU cache that evicts entries until a token budget is met.

**Methods:**
- `put(key: K, value: V) -> None`
- `get(key: K) -> Optional[V]`
- `evict_until_budget(target_budget: int) -> list[tuple[K, V]]`
- `total_tokens() -> int`

**Complexity:** O(1) put/get, O(n) evict_until_budget
### `llmds.indexed_heap.IndexedHeap`

Indexed binary heap with decrease/increase-key operations. Supports both min-heap and max-heap modes.

**Parameters:**
- `max_heap: bool = False` - If True, use a max-heap (largest score at top); otherwise a min-heap

**Methods:**
- `push(key_id: int, score: float) -> None` - Add item to heap
- `pop() -> tuple[float, int]` - Remove and return the top element
- `decrease_key(key_id: int, new_score: float) -> None` - Decrease key value (bubbles down for max-heap, up for min-heap)
- `increase_key(key_id: int, new_score: float) -> None` - Increase key value (bubbles up for max-heap, down for min-heap)
- `delete(key_id: int) -> tuple[float, int]` - Remove a specific item
- `get_score(key_id: int) -> Optional[float]` - Get the score for key_id
- `peek() -> Optional[tuple[float, int]]` - View the top element without removing it
- `size() -> int` - Get the number of elements
- `is_empty() -> bool` - Check if the heap is empty

**Complexity:** O(log n) for all operations

**Note:** Fixed max-heap bubble directions (v0.1.0) - `decrease_key` bubbles down and `increase_key` bubbles up for max-heap.
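The core trick — a position map maintained alongside the heap array so that key updates cost O(log n) instead of an O(n) scan — can be sketched as follows (a minimal min-heap illustration, not the `IndexedHeap` API):

```python
class MinIndexedHeap:
    """Min-heap of (score, key) with a key -> slot map for O(log n) updates."""

    def __init__(self):
        self._heap: list[tuple[float, int]] = []
        self._pos: dict[int, int] = {}  # key -> index in self._heap

    def _swap(self, i: int, j: int) -> None:
        self._heap[i], self._heap[j] = self._heap[j], self._heap[i]
        self._pos[self._heap[i][1]] = i
        self._pos[self._heap[j][1]] = j

    def _bubble_up(self, i: int) -> None:
        while i > 0 and self._heap[i] < self._heap[(i - 1) // 2]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _bubble_down(self, i: int) -> None:
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child] < self._heap[smallest]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, key: int, score: float) -> None:
        self._heap.append((score, key))
        self._pos[key] = len(self._heap) - 1
        self._bubble_up(len(self._heap) - 1)

    def decrease_key(self, key: int, new_score: float) -> None:
        i = self._pos[key]              # O(1) lookup instead of a linear scan
        self._heap[i] = (new_score, key)
        self._bubble_up(i)              # a smaller score rises in a min-heap

    def pop(self) -> tuple[float, int]:
        top = self._heap[0]
        self._swap(0, len(self._heap) - 1)
        self._heap.pop()
        del self._pos[top[1]]
        if self._heap:
            self._bubble_down(0)
        return top

h = MinIndexedHeap()
for key, score in [(1, 5.0), (2, 3.0), (3, 9.0)]:
    h.push(key, score)
h.decrease_key(3, 1.0)   # key 3 jumps to the top
```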
### `llmds.scheduler.Scheduler`

Dynamic micro-batching scheduler.

**Methods:**
- `submit(tokens: int, slo_ms: Optional[float] = None) -> int`
- `get_batch(force: bool = False) -> Optional[list[int]]`
- `complete_batch(request_ids: list[int]) -> None`
- `update_priority(request_id: int, new_tokens: int) -> None`

**Complexity:** O(log n) submit, O(k log n) get_batch where k = batch_size
### `llmds.admissions.AdmissionController`

Admission controller with rate limiting.

**Methods:**
- `should_admit(estimated_tokens: int = 0) -> tuple[bool, str]`
- `record_request(tokens: int) -> None`
- `stats() -> dict`: Get admission statistics

**Complexity:** O(1) should_admit
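A QPS limiter over a moving window can be sketched as follows (an illustration of the technique under assumed semantics, not the `AdmissionController` API; timestamps are injectable so the behavior is deterministic):

```python
from collections import deque
from typing import Optional
import time

class WindowRateLimiter:
    """Admit a request only while requests observed in the last
    `window_s` seconds stay below `max_qps * window_s`."""

    def __init__(self, max_qps: float, window_s: float = 1.0):
        self.max_requests = max_qps * window_s
        self.window_s = window_s
        self._times: deque = deque()

    def should_admit(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the window
        while self._times and self._times[0] <= now - self.window_s:
            self._times.popleft()
        return len(self._times) < self.max_requests

    def record_request(self, now: Optional[float] = None) -> None:
        self._times.append(time.monotonic() if now is None else now)

limiter = WindowRateLimiter(max_qps=2, window_s=1.0)
decisions = []
for t in (0.0, 0.1, 0.2, 1.5):
    if limiter.should_admit(now=t):
        limiter.record_request(now=t)
        decisions.append(True)
    else:
        decisions.append(False)
```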
### `llmds.inverted_index.InvertedIndex`

Compressed inverted index with BM25 scoring.

**Methods:**
- `add_document(doc_id: int, text: str) -> None`
- `search(query: str, top_k: int = 10) -> list[tuple[int, float]]`
- `get_term_frequency(term: str, doc_id: int) -> int`
- `get_document_frequency(term: str) -> int`

**Complexity:** O(|doc|) add_document, O(|query| × avg_doc_freq) search
### `llmds.hnsw.HNSW`

Hierarchical Navigable Small World graph for approximate nearest neighbor search.

**Parameters:**
- `dim: int` - Dimension of vectors
- `M: int = 16` - Maximum number of connections per node
- `ef_construction: int = 200` - Size of the candidate set during construction
- `ef_search: int = 50` - Size of the candidate set during search
- `ml: float = 1.0 / log(2.0)` - Normalization factor for level assignment
- `seed: Optional[int] = None` - Random seed for a reproducible graph structure

**Methods:**
- `add(vec: np.ndarray, vec_id: int) -> None` - Add a vector to the index
- `search(query: np.ndarray, k: int) -> list[tuple[int, float]]` - Search for the k nearest neighbors. Returns a list of (vector_id, distance) tuples
- `stats() -> dict` - Get index statistics (num_vectors, num_layers, entry_point, etc.)

**Complexity:** O(log n) search, O(log n × efConstruction) add

**Reproducibility:** When `seed` is provided, each HNSW instance uses its own `random.Random(seed)` state for level assignments, ensuring identical graph structures across runs with the same seed.
### `llmds.cmsketch.CountMinSketch`

Count-Min Sketch for frequency estimation.

**Methods:**
- `add(item: str, count: int = 1) -> None`
- `estimate(item: str) -> int`
- `is_hot(item: str, threshold: int) -> bool`
- `get_error_bound() -> float`

**Complexity:** O(depth) add/estimate
### `llmds.retrieval_pipeline.RetrievalPipeline`

End-to-end retrieval pipeline.

**Methods:**
- `add_document(doc_id: int, text: str, embedding: Optional[np.ndarray] = None) -> None`
- `search(query: str, query_embedding: Optional[np.ndarray] = None, top_k: int = 10, fusion_weight: float = 0.5) -> list[tuple[int, float]]`
- `stats() -> dict`: Get pipeline statistics

**Complexity:** O(log n) search (HNSW) + O(|query| × avg_doc_freq) (BM25)
161
docs/architecture.md
Normal file
# Architecture Overview

## System Architecture

The LLM Data Structures Optimizer is organized into several key subsystems:

### 1. KV Cache System

```
┌─────────────────────────────────────────┐
│                 KVCache                 │
│  ┌───────────────────────────────────┐  │
│  │ Prefix Hash Map                   │  │
│  │ (SHA256-based deduplication)      │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ Sequence → Page Mapping           │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ PagedAllocator                    │  │
│  │ - Fixed-size pages                │  │
│  │ - Free-list management            │  │
│  │ - Defragmentation                 │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

**Key Features:**
- **Copy-on-write (COW)** for prefix sharing - shared pages are read-only until modified, then lazily copied
- **Reference counting** - shared pages are tracked and only freed when all references are released
- **Hash-based deduplication** - identical prefixes are automatically detected and shared
- **Page-level allocation granularity** - efficient memory management with fixed-size pages
- **Defensive copying** - `get()` returns deep copies to prevent external modification of shared data
### 2. Scheduler & Batching

```
┌─────────────────────────────────────────┐
│                Scheduler                │
│  ┌───────────────────────────────────┐  │
│  │ IndexedHeap                       │  │
│  │ (Max-Heap Priority Queue)         │  │
│  │ - O(log n) decrease/increase-key  │  │
│  │ - Priority by remaining tokens    │  │
│  │ - Fixed bubble directions (v0.1.0)│  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ AdmissionController               │  │
│  │ - QPS limiting                    │  │
│  │ - Token rate limiting             │  │
│  │ - Moving window average           │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

**Key Features:**
- Dynamic micro-batching with configurable wait time
- SLO-aware prioritization
- Rate limiting and admission control
### 3. Retrieval Pipeline

```
┌─────────────────────────────────────────┐
│            RetrievalPipeline            │
│  ┌───────────────────────────────────┐  │
│  │ HNSW (Dense Search)               │  │
│  │ - Hierarchical graph              │  │
│  │ - Approximate nearest neighbor    │  │
│  │ - Reproducible via seed parameter │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ InvertedIndex (BM25)              │  │
│  │ - Compressed postings             │  │
│  │ - Varint/zigzag encoding          │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ Score Fusion                      │  │
│  │ - Weighted combination            │  │
│  │ - Top-K heap maintenance          │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ CountMinSketch                    │  │
│  │ - Hot query detection             │  │
│  │ - Cache priming                   │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

**Key Features:**
- Hybrid dense + sparse retrieval
- Score fusion with configurable weights
- Hot query caching
## Data Flow

### KV Cache Flow

1. **Attach Sequence**: Allocate pages, hash the prefix, check for sharing
2. **Get Sequence**: Retrieve pages, reconstruct KV tokens
3. **Detach Sequence**: Free pages, update statistics

### Scheduler Flow

1. **Submit Request**: Add to the priority queue, update admission stats
2. **Get Batch**: Check wait time, pop the top-k requests
3. **Complete Batch**: Remove from the queue, update metrics

### Retrieval Flow

1. **Index Building**: Add documents to HNSW and the inverted index
2. **Query Processing**:
   - Dense search (HNSW)
   - Sparse search (BM25)
   - Score fusion
   - Top-K selection
3. **Caching**: Check the CMS for hot queries, cache results
## Memory Management

### Token Budgeting

- Global token budget manager tracks:
  - KV cache tokens
  - Prompt tokens
  - Context window tokens

### Page Allocation

- Fixed-size pages reduce fragmentation
- Free-list management for O(1) allocation
- Periodic defragmentation for compaction
## Performance Characteristics

### Time Complexities

- **KV Cache**: O(1) attach, O(T) get (T = tokens, deep copy), O(k) detach (k = pages)
- **Indexed Heap**: O(log n) push/pop/update
- **HNSW Search**: O(log n) approximate nearest neighbor
- **BM25 Search**: O(|query| × avg_doc_freq)

### Space Complexities

- **KV Cache**: O(sequences × tokens_per_seq)
- **HNSW**: O(n × M) where M = max connections
- **Inverted Index**: O(|vocab| × avg_postings)

## Trade-offs

### Page Size
- **Small pages**: Better memory utilization, higher per-page overhead
- **Large pages**: Lower overhead, more internal fragmentation

### Batch Size
- **Small batches**: Lower latency, lower throughput
- **Large batches**: Higher throughput, higher latency

### HNSW Parameters
- **M (connections)**: Higher = better recall, more memory
- **efSearch**: Higher = better recall, slower search
537
docs/mathematical_models.md
Normal file
# Mathematical Models

This document describes the mathematical formulations and algorithms used throughout the LLM RAG Data Structures Optimizer.

## Table of Contents

- [BM25 Ranking Function](#bm25-ranking-function)
- [HNSW Distance Metrics](#hnsw-distance-metrics)
- [Count-Min Sketch Error Bounds](#count-min-sketch-error-bounds)
- [Score Fusion](#score-fusion)
- [KV Cache Memory Calculation](#kv-cache-memory-calculation)
- [Token-Aware LRU Eviction](#token-aware-lru-eviction)
- [Admission Control Rate Limiting](#admission-control-rate-limiting)

---

## BM25 Ranking Function

BM25 (Best Matching 25) is a probabilistic ranking function used for information retrieval. It scores documents based on term frequency and inverse document frequency.

### Formula

For a query $Q = \{q_1, q_2, \ldots, q_n\}$ and document $D$, the BM25 score is:

$$
\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}
$$

Where:
- $f(q_i, D)$ = frequency of term $q_i$ in document $D$
- $|D|$ = length of document $D$ (number of terms)
- $\text{avgdl}$ = average document length in the collection
- $k_1$ = term frequency saturation parameter (typically 1.2-2.0)
- $b$ = length normalization parameter (typically 0.75)

### Inverse Document Frequency (IDF)

$$
\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
$$

Where:
- $N$ = total number of documents in the collection
- $n(q_i)$ = number of documents containing term $q_i$

The 0.5 smoothing factor prevents division by zero and handles terms that appear in all documents.

### Implementation Defaults

In our implementation:
- $k_1 = 1.5$ (default)
- $b = 0.75$ (default)

---
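The formulas above translate directly into code; a minimal self-contained scorer (an illustration of the math, not the `llmds` implementation):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n(q): number of documents containing each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query:
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
            f = tf[t]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [["fast", "vector", "search"],
        ["sparse", "keyword", "match"],
        ["graph", "traversal", "order"],
        ["page", "cache", "reuse"]]
scores = bm25_scores(["vector", "search"], docs)
```

Terms absent from a document contribute nothing (their $f(q_i, D)$ is zero), so only the first document scores above zero here.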
## HNSW Distance Metrics

Hierarchical Navigable Small World (HNSW) uses distance metrics to measure similarity between vectors. The default distance metric is **L2 (Euclidean) distance**.

### L2 Distance (Euclidean)

For vectors $\vec{u} = (u_1, u_2, \ldots, u_d)$ and $\vec{v} = (v_1, v_2, \ldots, v_d)$:

$$
d_{\text{L2}}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}
$$

In practice, we often use the squared L2 distance for efficiency (it is monotonic with L2):

$$
d_{\text{L2}}^2(\vec{u}, \vec{v}) = \sum_{i=1}^{d} (u_i - v_i)^2
$$

### Cosine Similarity (Alternative)

For normalized vectors, cosine similarity is often preferred:

$$
\text{cosine}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{||\vec{u}|| \cdot ||\vec{v}||} = \frac{\sum_{i=1}^{d} u_i \cdot v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \cdot \sqrt{\sum_{i=1}^{d} v_i^2}}
$$

For normalized vectors where $||\vec{u}|| = ||\vec{v}|| = 1$, this reduces to the dot product:

$$
\text{cosine}(\vec{u}, \vec{v}) = \vec{u} \cdot \vec{v} = \sum_{i=1}^{d} u_i \cdot v_i
$$

**Note**: Cosine similarity can be converted to a distance: $d_{\text{cosine}} = 1 - \text{cosine}(\vec{u}, \vec{v})$
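Both metrics are a few lines of code (a plain-Python sketch for clarity; a production index would vectorize these with NumPy):

```python
import math

def l2_squared(u: list[float], v: list[float]) -> float:
    """Squared Euclidean distance; preserves nearest-neighbor ordering."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; 0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

u, v = [1.0, 0.0], [0.0, 1.0]
```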
### HNSW Graph Properties

The HNSW graph has logarithmic search complexity:

- **Search complexity**: $O(\log N)$ where $N$ is the number of vectors
- **Construction complexity**: $O(N \log N)$
- **Memory complexity**: $O(N \cdot M)$ where $M$ is the maximum number of connections per node

**Return Format**: The `search()` and `_search_layer()` methods return results as `(node_id, distance)` tuples, where:
- `node_id`: Integer identifier of the vector in the index
- `distance`: Float representing the L2 distance from the query vector

---
## Count-Min Sketch Error Bounds

Count-Min Sketch is a probabilistic data structure for frequency estimation with bounded error.

### Structure

A Count-Min Sketch has width $w$ and depth $d$, creating a $d \times w$ table of counters.

### Update Operation

For item $x$ with count $c$, update all $d$ rows:

$$
\text{CM}[i][h_i(x)] \leftarrow \text{CM}[i][h_i(x)] + c, \quad \forall i \in \{1, 2, \ldots, d\}
$$

Where $h_i(x)$ is a hash function for row $i$.

### Estimation

The estimated frequency is the minimum across all rows:

$$
\hat{f}(x) = \min_{i \in \{1, \ldots, d\}} \text{CM}[i][h_i(x)]
$$

The estimate never undercounts: $\hat{f}(x) \geq f(x)$, since hash collisions can only add to a counter.

### Error Bound

With probability at least $1 - \delta$, the error is bounded by:

$$
\hat{f}(x) - f(x) \leq \epsilon \cdot ||\mathbf{f}||_1
$$

Where:
- $f(x)$ = true frequency of $x$
- $||\mathbf{f}||_1$ = total count of all items (L1 norm)
- $\epsilon = \frac{e}{w}$ (where $e \approx 2.71828$)
- $\delta = e^{-d}$

### Parameter Selection

To achieve error bound $\epsilon$ with probability $1 - \delta$:

$$
w = \left\lceil \frac{e}{\epsilon} \right\rceil
$$

$$
d = \left\lceil \ln \frac{1}{\delta} \right\rceil
$$

**Default parameters** in our implementation:
- $w = 2048$ → $\epsilon \approx 0.0013$
- $d = 4$ → $\delta = e^{-4} \approx 0.018$ (about 1.8% error probability)
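A minimal sketch of the structure (illustrative only; the `llmds` version uses MurmurHash3, while here seeded `blake2b` stands in for the $d$ independent hash functions):

```python
import hashlib

class CountMin:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One independent hash per row, derived by salting with the row id
        h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "big")).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMin()
for _ in range(42):
    cms.add("hot-query")
cms.add("rare-query")
```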
---

## Score Fusion

Hybrid search combines scores from multiple retrieval methods (dense vectors and sparse keywords).

### Weighted Linear Combination

$$
S_{\text{fused}}(d, q) = \alpha \cdot S_{\text{dense}}(d, q) + \beta \cdot S_{\text{sparse}}(d, q)
$$

Where:
- $S_{\text{dense}}(d, q)$ = normalized vector similarity score
- $S_{\text{sparse}}(d, q)$ = normalized BM25 score
- $\alpha + \beta = 1$ (typically $\alpha = 0.7$, $\beta = 0.3$)

### Score Normalization

Before fusion, scores are normalized to the [0, 1] range:

$$
S_{\text{norm}}(d, q) = \frac{S(d, q) - S_{\min}}{S_{\max} - S_{\min}}
$$

Where $S_{\min}$ and $S_{\max}$ are the minimum and maximum scores in the candidate set.

### Reciprocal Rank Fusion (Alternative)

$$
S_{\text{RRF}}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
$$

Where:
- $R$ = set of ranked lists to fuse
- $\text{rank}_r(d)$ = rank of document $d$ in ranked list $r$
- $k$ = smoothing parameter (typically 60)
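Min-max normalization plus the weighted combination, in code (a sketch; `fusion_weight` here plays the role of $\alpha$, mirroring the pipeline's parameter name, and documents missing from one candidate list are assumed to contribute 0 from that side):

```python
def min_max_normalize(scores: dict[int, float]) -> dict[int, float]:
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                       # degenerate candidate set
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def fuse(dense: dict[int, float], sparse: dict[int, float],
         fusion_weight: float = 0.7) -> dict[int, float]:
    """fused = alpha * dense_norm + (1 - alpha) * sparse_norm."""
    dn, sn = min_max_normalize(dense), min_max_normalize(sparse)
    docs = set(dn) | set(sn)
    return {d: fusion_weight * dn.get(d, 0.0)
               + (1 - fusion_weight) * sn.get(d, 0.0) for d in docs}

fused = fuse(dense={1: 0.9, 2: 0.4, 3: 0.1},
             sparse={2: 12.0, 3: 3.0})
```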
---
|
||||
|
||||
## KV Cache Memory Calculation

The KV cache memory usage depends on the number of cached tokens and the model dimensions.

### Per-Sequence Memory

For a sequence with $T$ tokens and a model with hidden dimension $d$:

$$
M_{\text{sequence}} = 2 \cdot T \cdot d \cdot \text{bytes\_per\_element}
$$

Where:

- The factor of 2 accounts for both key and value tensors
- $\text{bytes\_per\_element} = 4$ for float32, $2$ for float16
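As a sanity check, the per-sequence formula in code (a sketch; the helper name is illustrative):

```python
def kv_cache_bytes(num_tokens: int, hidden_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence: the factor 2 covers keys and values."""
    return 2 * num_tokens * hidden_dim * dtype_bytes
```

For instance, a 4096-token sequence at $d = 4096$ in float16 takes $2 \cdot 4096 \cdot 4096 \cdot 2$ bytes = 64 MiB.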
### Paged Allocation

With page capacity $C$ tokens per page and $P$ bytes of KV data per page, a sequence of $T$ tokens occupies $\lceil T/C \rceil$ pages:

$$
M_{\text{paged}} = \left\lceil \frac{T}{C} \right\rceil \cdot (P + \text{page\_overhead})
$$

Where $\text{page\_overhead}$ is the per-page metadata cost. The last page is generally only partially filled, so paged allocation trades a bounded amount of internal fragmentation for simple fixed-size page management.
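Page counting and the resulting internal fragmentation can be sketched as (hypothetical helpers):

```python
import math

def pages_needed(num_tokens: int, tokens_per_page: int) -> int:
    """Pages required to hold num_tokens (ceiling division)."""
    return math.ceil(num_tokens / tokens_per_page)

def wasted_tokens(num_tokens: int, tokens_per_page: int) -> int:
    """Internal fragmentation: unused token slots in the last page."""
    return pages_needed(num_tokens, tokens_per_page) * tokens_per_page - num_tokens
```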
### Prefix Sharing Memory Savings

If $N$ sequences share a prefix of length $L$ (the factor of 2 again covers key and value tensors):

$$
M_{\text{shared}} = 2 \cdot L \cdot d \cdot \text{bytes\_per\_element}
$$

$$
M_{\text{without\_sharing}} = 2 \cdot N \cdot L \cdot d \cdot \text{bytes\_per\_element}
$$

Memory savings:

$$
\text{Savings} = 2 \cdot (N - 1) \cdot L \cdot d \cdot \text{bytes\_per\_element}
$$

Savings ratio:

$$
\text{Savings Ratio} = \frac{N - 1}{N} = 1 - \frac{1}{N}
$$

For large $N$, this approaches 100% savings on shared prefixes.
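The savings formulas, sketched in code (illustrative helper names):

```python
def prefix_sharing_saved_bytes(n_seqs: int, prefix_len: int,
                               hidden_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes saved by storing one shared prefix instead of n_seqs copies
    (factor 2 covers key and value tensors)."""
    per_copy = 2 * prefix_len * hidden_dim * dtype_bytes
    return (n_seqs - 1) * per_copy

def savings_ratio(n_seqs: int) -> float:
    """Fraction of prefix memory saved: 1 - 1/N."""
    return 1.0 - 1.0 / n_seqs
```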
### Copy-on-Write Overhead

With copy-on-write (COW), suppose $K$ of the $N$ sequences modify shared prefix pages. The unmodified shared pages are stored once, while each modifying sequence holds a private copy of the pages it touched:

$$
M_{\text{with\_cow}} = 2 \cdot (L_{\text{shared}} + K \cdot L_{\text{modified}}) \cdot d \cdot \text{bytes\_per\_element}
$$

Where:

- $L_{\text{shared}}$ = length of shared (unmodified) prefix pages
- $L_{\text{modified}}$ = length of modified prefix pages (copied per modifying sequence)

**COW Efficiency:**

- If no sequences modify shared pages ($K = 0$): maximum savings (shared pages stored once)
- If all sequences modify ($K = N$): no savings (each has its own copy)
- Typical case ($K < N$): partial savings based on the modification rate

**Reference Counting:**

Shared pages are freed when the reference count $r$ reaches zero:

$$
r = \sum_{i=1}^{N} \mathbf{1}_{\text{seq}_i \text{ references page}}
$$

Where $\mathbf{1}$ is the indicator function (1 if the sequence references the page, 0 otherwise).
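A minimal sketch of reference counting with copy-on-write (a toy model, not `llmds.paged_allocator`):

```python
class SharedPage:
    """Reference-counted page of tokens."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.refcount = 0

class PageHandle:
    """Per-sequence view of a page; writes trigger a private copy."""
    def __init__(self, page):
        self.page = page
        page.refcount += 1

    def write(self, idx, value):
        if self.page.refcount > 1:           # shared: copy before writing
            self.page.refcount -= 1
            self.page = SharedPage(self.page.tokens)
            self.page.refcount = 1
        self.page.tokens[idx] = value

    def release(self):
        self.page.refcount -= 1              # page is freed when count hits 0
```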
---
## Token-Aware LRU Eviction

Token-aware LRU maintains a cumulative token budget while evicting least recently used items.

### Eviction Criterion

Evict the item $i$ with the minimum value of:

$$
\text{priority}(i) = \frac{\text{access\_count}(i)}{\text{token\_count}(i)}
$$

Or use a recency-weighted variant:

$$
\text{priority}(i) = \frac{\text{last\_access\_time}(i)}{\text{token\_count}(i)}
$$

### Token Budget Constraint

Maintain total tokens below the budget $B$:

$$
\sum_{i \in \text{cache}} \text{token\_count}(i) \leq B
$$

When adding item $j$ with $t_j$ tokens:

1. If $\sum_{i} t_i + t_j \leq B$: add the item
2. Else: evict items until $\sum_{i} t_i + t_j \leq B$
### Eviction Algorithm

```
while total_tokens + new_tokens > budget:
    item = item_with_min_priority()
    total_tokens -= token_count(item)
    evict(item)
```
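A self-contained sketch of this loop, using pure recency as the priority (a toy model, not the `llmds.TokenLRU` API):

```python
from collections import OrderedDict

class TokenBudgetLRU:
    """Token-budget LRU cache: evicts least recently used entries
    until the incoming item fits within the budget."""
    def __init__(self, budget):
        self.budget = budget
        self.items = OrderedDict()       # key -> (value, tokens), LRU first
        self.total = 0

    def put(self, key, value, tokens):
        if key in self.items:
            self.total -= self.items.pop(key)[1]
        while self.items and self.total + tokens > self.budget:
            _, (_, t) = self.items.popitem(last=False)   # evict LRU entry
            self.total -= t
        if tokens <= self.budget:
            self.items[key] = (value, tokens)
            self.total += tokens

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)      # mark as most recently used
        return self.items[key][0]
```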
---
## Admission Control Rate Limiting

The admission controller uses an exponentially weighted moving average (EWMA) to track request rate.

### Moving Average Update

$$
\bar{r}_{t} = \alpha \cdot r_t + (1 - \alpha) \cdot \bar{r}_{t-1}
$$

Where:

- $r_t$ = current request rate
- $\bar{r}_t$ = smoothed average rate
- $\alpha$ = smoothing factor (typically 0.1–0.3)

### Admission Decision

Admit a request if:

$$
\bar{r}_t + \text{margin} \leq R_{\max}
$$

Where:

- $R_{\max}$ = maximum allowed rate (QPS limit)
- $\text{margin}$ = safety margin to account for burstiness
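The update and decision rules can be sketched as follows (an illustrative class, not the `llmds.AdmissionController` API):

```python
class EwmaAdmission:
    """EWMA rate tracker with a safety margin."""
    def __init__(self, max_rate, alpha=0.2, margin=0.0):
        self.max_rate = max_rate
        self.alpha = alpha
        self.margin = margin
        self.avg_rate = 0.0

    def observe(self, rate):
        # EWMA update: new observations get weight alpha
        self.avg_rate = self.alpha * rate + (1 - self.alpha) * self.avg_rate

    def admit(self):
        return self.avg_rate + self.margin <= self.max_rate
```

Larger $\alpha$ makes the controller react faster to bursts at the cost of noisier decisions.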
### Token Bucket (Alternative)

The token bucket algorithm allows bursty traffic:

$$
\text{tokens}(t) = \min(B, \text{tokens}(t-1) + R \cdot \Delta t)
$$

Where:

- $B$ = bucket capacity (burst limit)
- $R$ = token refill rate (sustainable rate)
- $\Delta t$ = time since last update

A request is admitted if $\text{tokens}(t) \geq 1$; on admission, $\text{tokens}(t) \leftarrow \text{tokens}(t) - 1$.
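A minimal token bucket, with time passed in explicitly so it is easy to test (a sketch; a real implementation would read a monotonic clock):

```python
class TokenBucket:
    """Classic token bucket: capacity B bounds bursts, refill rate R
    bounds the sustained request rate."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + self.refill_rate * elapsed)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```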
---
## Indexed Binary Heap

The indexed heap supports O(log n) priority updates via decrease-key and increase-key operations, in both min-heap and max-heap modes.

### Heap Property

- **Min-heap**: `parent_score ≤ child_score`
- **Max-heap**: `parent_score ≥ child_score`

### Decrease/Increase Key Operations

**Min-heap:**

- `decrease_key(new_score < old_score)`: bubbles UP (score decreases → higher priority)
- `increase_key(new_score > old_score)`: bubbles DOWN (score increases → lower priority)

**Max-heap:**

- `decrease_key(new_score < old_score)`: bubbles DOWN (score decreases → lower priority; fixed in v0.1.0)
- `increase_key(new_score > old_score)`: bubbles UP (score increases → higher priority; fixed in v0.1.0)

### Complexity

- Push: O(log n)
- Pop: O(log n)
- Decrease/Increase key: O(log n)
- Delete: O(log n)
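A compact indexed min-heap sketch showing the position map that makes `decrease_key`/`increase_key` O(log n) (illustrative, not the `llmds` class):

```python
class IndexedMinHeap:
    """Array-backed min-heap with a key -> position map."""
    def __init__(self):
        self.heap = []        # list of (score, key)
        self.pos = {}         # key -> index in heap

    def push(self, key, score):
        self.heap.append((score, key))
        self.pos[key] = len(self.heap) - 1
        self._up(len(self.heap) - 1)

    def pop(self):
        score, key = self.heap[0]
        last = self.heap.pop()
        del self.pos[key]
        if self.heap:
            self.heap[0] = last
            self.pos[last[1]] = 0
            self._down(0)
        return key, score

    def update(self, key, score):
        i = self.pos[key]
        old, _ = self.heap[i]
        self.heap[i] = (score, key)
        if score < old:
            self._up(i)       # decrease_key: bubble up
        else:
            self._down(i)     # increase_key: bubble down

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _up(self, i):
        while i > 0 and self.heap[i] < self.heap[(i - 1) // 2]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c] < self.heap[smallest]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest
```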
### Heap Properties

For a min-heap with $n$ elements (0-indexed array layout):

- Parent of node $i$: $\lfloor (i-1)/2 \rfloor$
- Left child of node $i$: $2i + 1$
- Right child of node $i$: $2i + 2$

### Heap Invariant

$$
\text{priority}(\text{parent}(i)) \leq \text{priority}(i), \quad \forall i > 0
$$
---
## Variance Analysis and Statistical Confidence

Benchmark results include variance analysis to assess measurement reliability and identify flaky configurations.

### Statistical Summary

For a set of $n$ measurements $\{x_1, x_2, \ldots, x_n\}$:

**Mean:**

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

**Standard Deviation (Sample):**

$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

**Coefficient of Variation:**

$$
\text{CV} = \frac{s}{\bar{x}} \times 100\%
$$

The CV expresses relative variability as a percentage, making it easier to compare variance across different metrics and scales.
### Confidence Intervals

For small samples ($n < 30$), we use the t-distribution for confidence intervals:

$$
\text{CI} = \bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
$$

Where:

- $t_{\alpha/2, n-1}$ = t-critical value for significance level $\alpha$ and $n-1$ degrees of freedom
- For 95% confidence: $\alpha = 0.05$, so $t_{0.025, n-1}$

For large samples ($n \geq 30$), we approximate with the normal distribution:

$$
\text{CI} = \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}
$$

Where $z_{\alpha/2} = 1.96$ for 95% confidence.
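The small-sample CI can be computed with the standard library (a sketch; the t-critical value for the chosen $n$ must be supplied, e.g. from a t-table or `scipy.stats`):

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(values, t_crit):
    """Mean +/- t * s / sqrt(n). t_crit must match n - 1 degrees of
    freedom (e.g. t = 2.776 for 95% confidence with n = 5)."""
    n = len(values)
    m = mean(values)
    half_width = t_crit * stdev(values) / sqrt(n)
    return m - half_width, m + half_width

lo, hi = confidence_interval([15.2, 15.8, 14.9, 16.1, 15.5], t_crit=2.776)
```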
### Flaky Benchmark Detection

A benchmark configuration is considered **flaky** if:

$$
\text{CV} > \text{threshold}
$$

Where the default threshold is 20% (coefficient of variation > 20%).

**Interpretation:**

- **CV < 10%**: Excellent reproducibility
- **10% ≤ CV < 20%**: Good reproducibility
- **20% ≤ CV < 50%**: Moderate variance (flagged as potentially flaky)
- **CV ≥ 50%**: High variance (likely flaky, investigate)
### Variance Metrics Reported

For each metric (e.g., `search_p50_ms`, `qps`), we report:

- `{metric}_mean`: Mean across repetitions
- `{metric}_std`: Standard deviation
- `{metric}_min`: Minimum value
- `{metric}_max`: Maximum value
- `{metric}_ci_lower`: Lower bound of 95% confidence interval
- `{metric}_ci_upper`: Upper bound of 95% confidence interval
- `{metric}_cv`: Coefficient of variation (%)
### Example

For a benchmark with 5 repetitions producing search P50 latencies:

$$
\{15.2, 15.8, 14.9, 16.1, 15.5\} \text{ ms}
$$

Results:

- Mean: $\bar{x} = 15.5$ ms
- Std: $s \approx 0.47$ ms
- CV: $\frac{0.47}{15.5} \times 100\% \approx 3.1\%$ (excellent)
- 95% CI: $15.5 \pm 0.59$ ms → [14.91, 16.09] ms
### Implementation

These statistical calculations are implemented in `llmds.utils`:

- `compute_percentiles(values)`: Computes P50, P95, P99 percentiles
- `calculate_statistics(values, confidence_level=0.95)`: Computes comprehensive statistics including mean, std, percentiles, confidence intervals, and coefficient of variation

**Usage:**

```python
from llmds.utils import compute_percentiles, calculate_statistics

latencies = [15.2, 15.8, 14.9, 16.1, 15.5]

# Quick percentiles
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f} ms")

# Full statistics
stats = calculate_statistics(latencies)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f} ms")
print(f"CV: {stats['cv']:.2f}%")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}] ms")
```
---
## References

1. **BM25**: Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. *Foundations and Trends in Information Retrieval*, 3(4), 333-389.

2. **HNSW**: Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 42(4), 824-836.

3. **Count-Min Sketch**: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. *Journal of Algorithms*, 55(1), 58-75.

4. **Score Fusion**: Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. *Proceedings of the 32nd International ACM SIGIR Conference*.

---
**Last Updated**: 2025-01-01
# Usage Guide

## Basic Examples

### KV Cache
```python
from llmds import KVCache

# Create cache with prefix sharing (enabled by default)
cache = KVCache(page_size=512, max_pages=10000, enable_prefix_sharing=True)

# Attach KV tokens for a sequence with prefix sharing
prefix = [1, 2, 3]  # Shared system prompt
kv_tokens = prefix + [4, 5, 6] * 100  # 303 tokens
cache.attach(seq_id=1, kv_tokens=kv_tokens, prefix_tokens=prefix)

# Second sequence with the same prefix - will share pages
kv_tokens2 = prefix + [7, 8, 9] * 100
cache.attach(seq_id=2, kv_tokens=kv_tokens2, prefix_tokens=prefix)

# Retrieve (returns a deep copy to prevent corruption)
cached = cache.get(seq_id=1)

# Copy-on-write: if you modify shared pages, they are automatically copied.
# Shared pages are read-only until modified, then lazily copied.

# Detach when done (reference counting handles shared pages)
cache.detach(seq_id=1)
cache.detach(seq_id=2)
```
**Copy-on-Write Behavior:**

- Shared pages (from prefix sharing) are read-only by default
- Writing different data to a shared page triggers lazy copying
- Each sequence gets its own copy of modified pages
- Original shared pages remain unchanged for other sequences
- `get()` always returns deep copies to prevent external corruption
### Scheduler

```python
from llmds import Scheduler

# Create scheduler
scheduler = Scheduler(max_batch_size=32, max_wait_ms=50.0)

# Submit requests
req_id1 = scheduler.submit(tokens=100)
req_id2 = scheduler.submit(tokens=200, slo_ms=100.0)  # SLO deadline

# Get batch (waits for max_wait_ms or until the batch is full)
batch = scheduler.get_batch(force=False)

# Process batch...
# scheduler.complete_batch(batch)
```
### Admission Control

```python
from llmds import AdmissionController

# Create controller
controller = AdmissionController(qps_target=10.0, token_rate_limit=10000)

# Check admission
should_admit, reason = controller.should_admit(estimated_tokens=100)
if should_admit:
    # Process request
    controller.record_request(tokens=100)
else:
    # Reject request
    print(f"Rejected: {reason}")
```
### Retrieval Pipeline

```python
from llmds import RetrievalPipeline
import numpy as np

# Create pipeline with reproducible HNSW structure
pipeline = RetrievalPipeline(embedding_dim=384, seed=42)

# Add documents
for i in range(100):
    text = f"Document {i} content"
    embedding = np.random.randn(384).astype(np.float32)
    embedding = embedding / np.linalg.norm(embedding)
    pipeline.add_document(doc_id=i, text=text, embedding=embedding)

# Search
query = "example query"
query_embedding = np.random.randn(384).astype(np.float32)
query_embedding = query_embedding / np.linalg.norm(query_embedding)

results = pipeline.search(query, query_embedding=query_embedding, top_k=10)
for doc_id, score in results:
    print(f"Doc {doc_id}: {score:.4f}")
```
## Advanced Usage

### Custom Priority Function

```python
from llmds import Scheduler

def custom_priority_fn(req):
    # Prioritize by inverse token count
    return 1.0 / (req.tokens + 1.0)

scheduler = Scheduler(
    max_batch_size=32,
    max_wait_ms=50.0,
    priority_fn=custom_priority_fn
)
```
### Token Budget Management

```python
from llmds import TokenLRU

def token_counter(value):
    return len(str(value))

cache = TokenLRU(token_budget=1000, token_of=token_counter)

# Add items (evicts LRU if budget exceeded)
cache.put("key1", "value with many tokens")
cache.put("key2", "another value")

# Evict until target budget
evicted = cache.evict_until_budget(target_budget=500)
```
### HNSW Parameter Tuning

```python
from llmds import HNSW
import numpy as np

# Tune for better recall (higher memory)
hnsw_high_recall = HNSW(
    dim=384,
    M=32,                 # More connections
    ef_construction=400,  # More candidates during build
    ef_search=100,        # More candidates during search
    seed=42               # Reproducible graph structure
)

# Tune for faster search (lower memory)
hnsw_fast = HNSW(
    dim=384,
    M=8,                  # Fewer connections
    ef_construction=100,
    ef_search=20,         # Fewer candidates
    seed=42               # Reproducible graph structure
)

# Reproducible benchmarks
hnsw_bench = HNSW(dim=128, M=16, ef_construction=200, ef_search=50, seed=42)
# The same seed ensures an identical graph structure across runs
```
## Benchmarking

### Running Benchmarks

```python
from benchmarks.bench_kv_cache import benchmark_kv_cache

results = benchmark_kv_cache(
    num_sequences=1000,
    tokens_per_seq=1000,
    page_size=512
)
print(f"P95 latency: {results['attach_p95_ms']:.2f} ms")
```
### Custom Benchmarks

```python
from llmds.utils import Timer, compute_percentiles, calculate_statistics

latencies = []

for i in range(100):
    with Timer() as t:
        # Your operation here
        pass
    latencies.append(t.elapsed * 1000)  # Convert to milliseconds

# Compute percentiles
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f} ms")
print(f"P95: {percentiles['p95']:.2f} ms")
print(f"P99: {percentiles['p99']:.2f} ms")

# Or compute comprehensive statistics
stats = calculate_statistics(latencies)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f} ms")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}] ms")
print(f"CV: {stats['cv']:.2f}%")
```
## Memory Profiling

All benchmarks automatically measure peak RSS (Resident Set Size) using `psutil`:

```python
from llmds.utils import memory_profiler
import numpy as np

# Memory profiling in your benchmarks
with memory_profiler() as profiler:
    # Allocate memory
    data = np.random.randn(1000, 1000).astype(np.float32)
    profiler.sample()  # Optional: sample at specific points

    # More operations
    result = process_data(data)  # Your function

peak_rss_mb = profiler.get_peak_rss_mb()
memory_delta_mb = profiler.get_memory_delta_mb()

print(f"Peak memory: {peak_rss_mb:.2f} MB")
print(f"Memory allocated: {memory_delta_mb:.2f} MB")
```
**Benchmark Results Include:**

- `peak_rss_mb`: Peak memory usage during the benchmark
- `memory_delta_mb`: Memory allocated during execution (peak - initial)
- `build_peak_rss_mb`: Peak memory during the build/indexing phase (where applicable)

All benchmark scripts automatically include memory profiling - no additional configuration needed.

## Integration Examples

### RAG Pipeline
```python
from llmds import RetrievalPipeline
import numpy as np

# Initialize
pipeline = RetrievalPipeline(embedding_dim=384)

# Index documents
documents = ["doc1", "doc2", "doc3"]
embeddings = [np.random.randn(384).astype(np.float32) for _ in documents]
for doc_id, (text, emb) in enumerate(zip(documents, embeddings)):
    emb = emb / np.linalg.norm(emb)
    pipeline.add_document(doc_id=doc_id, text=text, embedding=emb)

# Query
query_emb = np.random.randn(384).astype(np.float32)
query_emb = query_emb / np.linalg.norm(query_emb)
results = pipeline.search("query", query_embedding=query_emb, top_k=5)
```
### LLM Inference with KV Cache

```python
from llmds import KVCache, Scheduler, TokenLRU

# Setup
kv_cache = KVCache()
scheduler = Scheduler()
token_cache = TokenLRU(token_budget=100000, token_of=lambda x: len(str(x)))

# Process request
seq_id = 1
prompt_tokens = [1, 2, 3, 4, 5]
kv_tokens = generate_kv_cache(prompt_tokens)  # Your function

kv_cache.attach(seq_id=seq_id, kv_tokens=kv_tokens, prefix_tokens=prompt_tokens)

# Use cached KV for generation
cached_kv = kv_cache.get(seq_id)
# ... generate tokens using cached KV ...

# Cleanup
kv_cache.detach(seq_id)
```