Files
llm-rag-ds-optimizer/docs/architecture.md

162 lines
7.2 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Architecture Overview
## System Architecture
The LLM Data Structures Optimizer is organized into several key subsystems:
### 1. KV Cache System
```
┌─────────────────────────────────────────┐
│ KVCache │
│ ┌───────────────────────────────────┐ │
│ │ Prefix Hash Map │ │
│ │ (SHA256-based deduplication) │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ Sequence → Page Mapping │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ PagedAllocator │ │
│ │ - Fixed-size pages │ │
│ │ - Free-list management │ │
│ │ - Defragmentation │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
**Key Features:**
- **Copy-on-write (COW)** for prefix sharing - shared pages are read-only until modified, then lazily copied
- **Reference counting** - shared pages are tracked and only freed when all references are released
- **Hash-based deduplication** - identical prefixes are automatically detected and shared
- **Page-level allocation granularity** - efficient memory management with fixed-size pages
- **Defensive copying** - `get()` returns deep copies to prevent external modification of shared data
### 2. Scheduler & Batching
```
┌─────────────────────────────────────────┐
│ Scheduler │
│ ┌───────────────────────────────────┐ │
│ │ IndexedHeap (Max-Heap Priority Queue) │ │
│ │ - O(log n) decrease/increase-key │ │
│ │ - Priority by remaining tokens │ │
│ │ - Fixed bubble directions (v0.1.0) │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ AdmissionController │ │
│ │ - QPS limiting │ │
│ │ - Token rate limiting │ │
│ │ - Moving window average │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
**Key Features:**
- Dynamic micro-batching with configurable wait time
- SLO-aware prioritization
- Rate limiting and admission control
### 3. Retrieval Pipeline
```
┌─────────────────────────────────────────┐
│ RetrievalPipeline │
│ ┌───────────────────────────────────┐ │
│ │ HNSW (Dense Search) │ │
│ │ - Hierarchical graph │ │
│ │ - Approximate nearest neighbor │ │
│ │ - Reproducible via seed parameter │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ InvertedIndex (BM25) │ │
│ │ - Compressed postings │ │
│ │ - Varint/zigzag encoding │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ Score Fusion │ │
│ │ - Weighted combination │ │
│ │ - Top-K heap maintenance │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ CountMinSketch │ │
│ │ - Hot query detection │ │
│ │ - Cache priming │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
**Key Features:**
- Hybrid dense + sparse retrieval
- Score fusion with configurable weights
- Hot query caching
## Data Flow
### KV Cache Flow
1. **Attach Sequence**: Allocate pages, hash prefix, check for sharing
2. **Get Sequence**: Retrieve pages, reconstruct KV tokens
3. **Detach Sequence**: Free pages, update statistics
### Scheduler Flow
1. **Submit Request**: Add to priority queue, update admission stats
2. **Get Batch**: Check wait time, pop top-k requests
3. **Complete Batch**: Remove from queue, update metrics
### Retrieval Flow
1. **Index Building**: Add documents to HNSW and inverted index
2. **Query Processing**:
- Dense search (HNSW)
- Sparse search (BM25)
- Score fusion
- Top-K selection
3. **Caching**: Check CMS for hot queries, cache results
## Memory Management
### Token Budgeting
- Global token budget manager tracks:
- KV cache tokens
- Prompt tokens
- Context window tokens
### Page Allocation
- Fixed-size pages reduce fragmentation
- Free-list management for O(1) allocation
- Periodic defragmentation for compaction
## Performance Characteristics
### Time Complexities
- **KV Cache**: O(1) attach/get, O(k) detach (k = pages)
- **Indexed Heap**: O(log n) push/pop/update
- **HNSW Search**: O(log n) approximate nearest neighbor
- **BM25 Search**: O(|query| × avg_doc_freq)
### Space Complexities
- **KV Cache**: O(sequences × tokens_per_seq)
- **HNSW**: O(n × M) where M = max connections
- **Inverted Index**: O(|vocab| × avg_postings)
## Trade-offs
### Page Size
- **Small pages**: Better memory utilization, higher overhead
- **Large pages**: Lower overhead, more fragmentation
### Batch Size
- **Small batches**: Lower latency, lower throughput
- **Large batches**: Higher throughput, higher latency
### HNSW Parameters
- **M (connections)**: Higher = better recall, more memory
- **efSearch**: Higher = better recall, slower search