# Architecture Overview

## System Architecture

The LLM Data Structures Optimizer is organized into several key subsystems:

### 1. KV Cache System

```
┌─────────────────────────────────────────┐
│                 KVCache                 │
│ ┌─────────────────────────────────────┐ │
│ │           Prefix Hash Map           │ │
│ │    (SHA256-based deduplication)     │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │       Sequence → Page Mapping       │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │           PagedAllocator            │ │
│ │  - Fixed-size pages                 │ │
│ │  - Free-list management             │ │
│ │  - Defragmentation                  │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- **Copy-on-write (COW)** for prefix sharing - shared pages are read-only until modified, then lazily copied
- **Reference counting** - shared pages are tracked and only freed when all references are released
- **Hash-based deduplication** - identical prefixes are automatically detected and shared
- **Page-level allocation granularity** - efficient memory management with fixed-size pages
- **Defensive copying** - `get()` returns deep copies to prevent external modification of shared data

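The features above can be sketched in a few dozen lines. This is an illustrative model, not the library's actual API: the class and method names (`PrefixKVCache`, `attach`, `write`, `detach`) are hypothetical, pages are deduplicated by SHA-256 hash, shared via reference counts, and copied on write.

```python
import copy
import hashlib

class SharedPage:
    """A fixed-size page of KV tokens with a reference count."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.refcount = 1

class PrefixKVCache:
    """Sketch: hash-based prefix dedup with COW pages (illustrative only)."""
    def __init__(self, page_size=4):
        self.page_size = page_size
        self.prefix_map = {}   # SHA-256 digest -> SharedPage
        self.sequences = {}    # seq_id -> list[SharedPage]

    def _hash(self, chunk):
        return hashlib.sha256(repr(chunk).encode()).hexdigest()

    def attach(self, seq_id, tokens):
        pages = []
        for i in range(0, len(tokens), self.page_size):
            chunk = tuple(tokens[i:i + self.page_size])
            key = self._hash(chunk)
            page = self.prefix_map.get(key)
            if page is not None:
                page.refcount += 1            # identical chunk: share it
            else:
                page = SharedPage(chunk)
                self.prefix_map[key] = page
            pages.append(page)
        self.sequences[seq_id] = pages

    def get(self, seq_id):
        # Defensive copy: callers cannot mutate shared pages in place.
        return [copy.deepcopy(p.tokens) for p in self.sequences[seq_id]]

    def write(self, seq_id, page_idx, tokens):
        # Copy-on-write: a shared page is duplicated before modification.
        page = self.sequences[seq_id][page_idx]
        if page.refcount > 1:
            page.refcount -= 1
            page = SharedPage(page.tokens)
            self.sequences[seq_id][page_idx] = page
        page.tokens = list(tokens)

    def detach(self, seq_id):
        # Freeing a page's storage once refcount hits zero is elided here.
        for page in self.sequences.pop(seq_id):
            page.refcount -= 1
```

Two sequences attached with the same prefix end up pointing at the same `SharedPage` objects; a `write` to one of them triggers the copy, so the other sequence still reads the original tokens.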
### 2. Scheduler & Batching

```
┌─────────────────────────────────────────┐
│                Scheduler                │
│ ┌─────────────────────────────────────┐ │
│ │             IndexedHeap             │ │
│ │      (Max-Heap Priority Queue)      │ │
│ │  - O(log n) decrease/increase-key   │ │
│ │  - Priority by remaining tokens     │ │
│ │  - Fixed bubble directions (v0.1.0) │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │         AdmissionController         │ │
│ │  - QPS limiting                     │ │
│ │  - Token rate limiting              │ │
│ │  - Moving window average            │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- Dynamic micro-batching with configurable wait time
- SLO-aware prioritization
- Rate limiting and admission control

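A minimal sketch of an indexed max-heap of the kind described above (names illustrative, not the project's actual class). The key point, and the reason the diagram notes "fixed bubble directions", is that a generic `update` must sift in both directions, since a priority change may move an entry either up or down:

```python
class IndexedHeap:
    """Max-heap with a key -> position index for O(log n) priority updates."""
    def __init__(self):
        self.items = []   # list of (priority, key)
        self.pos = {}     # key -> index in self.items

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.pos[self.items[i][1]] = i
        self.pos[self.items[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.items[i][0] <= self.items[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.items[child][0] > self.items[largest][0]:
                    largest = child
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def push(self, key, priority):
        self.items.append((priority, key))
        self.pos[key] = len(self.items) - 1
        self._sift_up(len(self.items) - 1)

    def update(self, key, priority):
        # Handles both increase- and decrease-key: sift in both directions.
        i = self.pos[key]
        self.items[i] = (priority, key)
        self._sift_up(i)
        self._sift_down(self.pos[key])

    def pop(self):
        top = self.items[0]
        self._swap(0, len(self.items) - 1)
        self.items.pop()
        del self.pos[top[1]]
        if self.items:
            self._sift_down(0)
        return top[1], top[0]
```

The `pos` map is what distinguishes this from a plain binary heap: without it, changing a request's priority (e.g. as its remaining tokens shrink) would require an O(n) scan to find the entry first.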
### 3. Retrieval Pipeline

```
┌─────────────────────────────────────────┐
│            RetrievalPipeline            │
│ ┌─────────────────────────────────────┐ │
│ │         HNSW (Dense Search)         │ │
│ │  - Hierarchical graph               │ │
│ │  - Approximate nearest neighbor     │ │
│ │  - Reproducible via seed parameter  │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │        InvertedIndex (BM25)         │ │
│ │  - Compressed postings              │ │
│ │  - Varint/zigzag encoding           │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │            Score Fusion             │ │
│ │  - Weighted combination             │ │
│ │  - Top-K heap maintenance           │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │           CountMinSketch            │ │
│ │  - Hot query detection              │ │
│ │  - Cache priming                    │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- Hybrid dense + sparse retrieval
- Score fusion with configurable weights
- Hot query caching

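The "compressed postings" box above refers to a standard trick: store sorted doc IDs as gaps, then encode each gap as a varint (7 payload bits per byte, MSB as continuation flag). Zigzag maps signed deltas to small unsigned values first; strictly speaking, gaps between sorted postings are non-negative, so zigzag matters for signed deltas generally. A generic sketch, not necessarily this project's exact wire format:

```python
def zigzag_encode(n):
    """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    Assumes values fit in 64 bits."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

def varint_encode(value):
    """Encode a non-negative int as 7 bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def varint_decode(data, i=0):
    shift = result = 0
    while True:
        b = data[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs; small gaps compress to single bytes."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += varint_encode(d - prev)
        prev = d
    return bytes(out)

def decode_postings(data):
    ids, i, prev = [], 0, 0
    while i < len(data):
        gap, i = varint_decode(data, i)
        prev += gap
        ids.append(prev)
    return ids
```

For a typical postings list, most gaps fit in one byte, so the list compresses to roughly one byte per document instead of four or eight.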
## Data Flow

### KV Cache Flow

1. **Attach Sequence**: Allocate pages, hash prefix, check for sharing
2. **Get Sequence**: Retrieve pages, reconstruct KV tokens
3. **Detach Sequence**: Free pages, update statistics

### Scheduler Flow

1. **Submit Request**: Add to priority queue, update admission stats
2. **Get Batch**: Check wait time, pop top-k requests
3. **Complete Batch**: Remove from queue, update metrics

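The admission side of step 1 (QPS limiting, token rate limiting, moving window) could look like the following sketch. The class and parameter names are hypothetical; the design choice shown is a sliding window of recent events rather than a token bucket:

```python
import time
from collections import deque

class AdmissionController:
    """Sketch: sliding-window QPS and token-rate limiting (illustrative)."""
    def __init__(self, max_qps, max_tokens_per_sec, window_sec=1.0):
        self.max_qps = max_qps
        self.max_tps = max_tokens_per_sec
        self.window = window_sec
        self.events = deque()  # (timestamp, token_count)

    def _prune(self, now):
        # Drop events that have slid out of the moving window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def admit(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self._prune(now)
        qps = len(self.events) / self.window
        tps = sum(t for _, t in self.events) / self.window
        if qps >= self.max_qps or tps + tokens / self.window > self.max_tps:
            return False   # reject; caller may queue or shed the request
        self.events.append((now, tokens))
        return True
```

Requests rejected here never reach the priority queue, which keeps the heap bounded under overload.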
### Retrieval Flow

1. **Index Building**: Add documents to HNSW and inverted index
2. **Query Processing**:
   - Dense search (HNSW)
   - Sparse search (BM25)
   - Score fusion
   - Top-K selection
3. **Caching**: Check CMS for hot queries, cache results

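The fusion and top-K steps of query processing can be sketched as follows. This assumes both searches return `doc_id -> score` maps with pre-normalized scores, and uses a min-heap of size k so only the current top-k candidates are kept; `alpha` is the configurable dense weight:

```python
import heapq

def fuse_scores(dense, sparse, alpha=0.5, k=10):
    """Weighted fusion of dense and sparse scores with a bounded top-k heap.

    dense, sparse: dict of doc_id -> score (assumed normalized to [0, 1]).
    Returns the top-k (score, doc_id) pairs, highest score first.
    """
    heap = []  # min-heap: smallest retained score sits at heap[0]
    for doc_id in set(dense) | set(sparse):
        fused = (alpha * dense.get(doc_id, 0.0)
                 + (1 - alpha) * sparse.get(doc_id, 0.0))
        if len(heap) < k:
            heapq.heappush(heap, (fused, doc_id))
        elif fused > heap[0][0]:
            # Evict the weakest candidate in O(log k).
            heapq.heapreplace(heap, (fused, doc_id))
    return sorted(heap, key=lambda x: -x[0])
```

Keeping a size-k heap instead of sorting all candidates makes the selection O(n log k) rather than O(n log n), which matters when the candidate union is large.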
## Memory Management

### Token Budgeting

- Global token budget manager tracks:
  - KV cache tokens
  - Prompt tokens
  - Context window tokens

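One plausible shape for such a manager, sketched below with hypothetical names (`TokenBudget`, `reserve`, `release`): a single global cap shared across per-category counters, with reservation failing when the total would be exceeded.

```python
class TokenBudget:
    """Sketch: global token budget shared across tracked categories."""
    def __init__(self, total):
        self.total = total
        self.used = {"kv_cache": 0, "prompt": 0, "context": 0}

    def reserve(self, category, tokens):
        # Reject if the global total would be exceeded; the caller is
        # expected to evict or wait before retrying.
        if sum(self.used.values()) + tokens > self.total:
            return False
        self.used[category] += tokens
        return True

    def release(self, category, tokens):
        self.used[category] = max(0, self.used[category] - tokens)

    def available(self):
        return self.total - sum(self.used.values())
```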
### Page Allocation

- Fixed-size pages reduce fragmentation
- Free-list management for O(1) allocation
- Periodic defragmentation for compaction

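The free-list part of the allocator is simple enough to show in full; the real PagedAllocator also does defragmentation, which this sketch omits. Because pages are fixed-size, the free list is just a stack of page IDs, making both allocation and freeing O(1):

```python
class FreeListAllocator:
    """Sketch: fixed-size pages managed via a free-list stack."""
    def __init__(self, num_pages):
        self.free_list = list(range(num_pages))  # stack of free page IDs

    def alloc(self):
        # O(1): pop the most recently freed page (good cache locality).
        if not self.free_list:
            raise MemoryError("out of pages")  # caller may evict or compact
        return self.free_list.pop()

    def free(self, page_id):
        # O(1): push the page back for reuse.
        self.free_list.append(page_id)
```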
## Performance Characteristics

### Time Complexities

- **KV Cache**: O(1) attach/get, O(k) detach (k = pages)
- **Indexed Heap**: O(log n) push/pop/update
- **HNSW Search**: O(log n) approximate nearest neighbor
- **BM25 Search**: O(|query| × avg_doc_freq)

### Space Complexities

- **KV Cache**: O(sequences × tokens_per_seq)
- **HNSW**: O(n × M) where M = max connections
- **Inverted Index**: O(|vocab| × avg_postings)

## Trade-offs

### Page Size

- **Small pages**: Better memory utilization, higher bookkeeping overhead
- **Large pages**: Lower overhead, more internal fragmentation

### Batch Size

- **Small batches**: Lower latency, lower throughput
- **Large batches**: Higher throughput, higher latency

### HNSW Parameters

- **M (connections)**: Higher = better recall, more memory
- **efSearch**: Higher = better recall, slower search