# Architecture Overview

## System Architecture

The LLM Data Structures Optimizer is organized into several key subsystems:

### 1. KV Cache System

```
┌─────────────────────────────────────────┐
│                 KVCache                 │
│ ┌─────────────────────────────────────┐ │
│ │           Prefix Hash Map           │ │
│ │    (SHA256-based deduplication)     │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │       Sequence → Page Mapping       │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │           PagedAllocator            │ │
│ │  - Fixed-size pages                 │ │
│ │  - Free-list management             │ │
│ │  - Defragmentation                  │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- **Copy-on-write (COW)** for prefix sharing - shared pages are read-only until modified, then lazily copied
- **Reference counting** - shared pages are tracked and only freed when all references are released
- **Hash-based deduplication** - identical prefixes are automatically detected and shared
- **Page-level allocation granularity** - efficient memory management with fixed-size pages
- **Defensive copying** - `get()` returns deep copies to prevent external modification of shared data

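The features above can be sketched in a few dozen lines. This is an illustrative model, not the library's actual API: the class and method names (`PrefixKVCache`, `attach`, `write`, `detach`) are hypothetical, pages are deduplicated by SHA-256 hash, shared via reference counts, and copied on write.

```python
import copy
import hashlib

class SharedPage:
    """A fixed-size page of KV tokens with a reference count."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.refcount = 1

class PrefixKVCache:
    """Sketch: hash-based prefix dedup with COW pages (illustrative only)."""
    def __init__(self, page_size=4):
        self.page_size = page_size
        self.prefix_map = {}   # SHA-256 digest -> SharedPage
        self.sequences = {}    # seq_id -> list[SharedPage]

    def _hash(self, chunk):
        return hashlib.sha256(repr(chunk).encode()).hexdigest()

    def attach(self, seq_id, tokens):
        pages = []
        for i in range(0, len(tokens), self.page_size):
            chunk = tuple(tokens[i:i + self.page_size])
            key = self._hash(chunk)
            page = self.prefix_map.get(key)
            if page is not None:
                page.refcount += 1            # identical chunk: share it
            else:
                page = SharedPage(chunk)
                self.prefix_map[key] = page
            pages.append(page)
        self.sequences[seq_id] = pages

    def get(self, seq_id):
        # Defensive copy: callers cannot mutate shared pages in place.
        return [copy.deepcopy(p.tokens) for p in self.sequences[seq_id]]

    def write(self, seq_id, page_idx, tokens):
        # Copy-on-write: a shared page is duplicated before modification.
        page = self.sequences[seq_id][page_idx]
        if page.refcount > 1:
            page.refcount -= 1
            page = SharedPage(page.tokens)
            self.sequences[seq_id][page_idx] = page
        page.tokens = list(tokens)

    def detach(self, seq_id):
        # Freeing a page's storage once refcount hits zero is elided here.
        for page in self.sequences.pop(seq_id):
            page.refcount -= 1
```

Two sequences attached with the same prefix end up pointing at the same `SharedPage` objects; a `write` to one of them triggers the copy, so the other sequence still reads the original tokens.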
### 2. Scheduler & Batching

```
┌─────────────────────────────────────────┐
│                Scheduler                │
│ ┌─────────────────────────────────────┐ │
│ │             IndexedHeap             │ │
│ │      (Max-Heap Priority Queue)      │ │
│ │  - O(log n) decrease/increase-key   │ │
│ │  - Priority by remaining tokens     │ │
│ │  - Fixed bubble directions (v0.1.0) │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │         AdmissionController         │ │
│ │  - QPS limiting                     │ │
│ │  - Token rate limiting              │ │
│ │  - Moving window average            │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- Dynamic micro-batching with configurable wait time
- SLO-aware prioritization
- Rate limiting and admission control

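A minimal sketch of an indexed max-heap of the kind described above (names illustrative, not the project's actual class). The key point, and the reason the diagram notes "fixed bubble directions", is that a generic `update` must sift in both directions, since a priority change may move an entry either up or down:

```python
class IndexedHeap:
    """Max-heap with a key -> position index for O(log n) priority updates."""
    def __init__(self):
        self.items = []   # list of (priority, key)
        self.pos = {}     # key -> index in self.items

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.pos[self.items[i][1]] = i
        self.pos[self.items[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.items[i][0] <= self.items[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.items[child][0] > self.items[largest][0]:
                    largest = child
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def push(self, key, priority):
        self.items.append((priority, key))
        self.pos[key] = len(self.items) - 1
        self._sift_up(len(self.items) - 1)

    def update(self, key, priority):
        # Handles both increase- and decrease-key: sift in both directions.
        i = self.pos[key]
        self.items[i] = (priority, key)
        self._sift_up(i)
        self._sift_down(self.pos[key])

    def pop(self):
        top = self.items[0]
        self._swap(0, len(self.items) - 1)
        self.items.pop()
        del self.pos[top[1]]
        if self.items:
            self._sift_down(0)
        return top[1], top[0]
```

The `pos` map is what distinguishes this from a plain binary heap: without it, changing a request's priority (e.g. as its remaining tokens shrink) would require an O(n) scan to find the entry first.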
### 3. Retrieval Pipeline

```
┌─────────────────────────────────────────┐
│            RetrievalPipeline            │
│ ┌─────────────────────────────────────┐ │
│ │         HNSW (Dense Search)         │ │
│ │  - Hierarchical graph               │ │
│ │  - Approximate nearest neighbor     │ │
│ │  - Reproducible via seed parameter  │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │        InvertedIndex (BM25)         │ │
│ │  - Compressed postings              │ │
│ │  - Varint/zigzag encoding           │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │            Score Fusion             │ │
│ │  - Weighted combination             │ │
│ │  - Top-K heap maintenance           │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │           CountMinSketch            │ │
│ │  - Hot query detection              │ │
│ │  - Cache priming                    │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

**Key Features:**

- Hybrid dense + sparse retrieval
- Score fusion with configurable weights
- Hot query caching

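The "compressed postings" box above refers to a standard trick: store sorted doc IDs as gaps, then encode each gap as a varint (7 payload bits per byte, MSB as continuation flag). Zigzag maps signed deltas to small unsigned values first; strictly speaking, gaps between sorted postings are non-negative, so zigzag matters for signed deltas generally. A generic sketch, not necessarily this project's exact wire format:

```python
def zigzag_encode(n):
    """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    Assumes values fit in 64 bits."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

def varint_encode(value):
    """Encode a non-negative int as 7 bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def varint_decode(data, i=0):
    shift = result = 0
    while True:
        b = data[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs; small gaps compress to single bytes."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += varint_encode(d - prev)
        prev = d
    return bytes(out)

def decode_postings(data):
    ids, i, prev = [], 0, 0
    while i < len(data):
        gap, i = varint_decode(data, i)
        prev += gap
        ids.append(prev)
    return ids
```

For a typical postings list, most gaps fit in one byte, so the list compresses to roughly one byte per document instead of four or eight.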
## Data Flow

### KV Cache Flow

1. **Attach Sequence**: Allocate pages, hash prefix, check for sharing
2. **Get Sequence**: Retrieve pages, reconstruct KV tokens
3. **Detach Sequence**: Free pages, update statistics

### Scheduler Flow

1. **Submit Request**: Add to priority queue, update admission stats
2. **Get Batch**: Check wait time, pop top-k requests
3. **Complete Batch**: Remove from queue, update metrics

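The admission side of step 1 (QPS limiting, token rate limiting, moving window) could look like the following sketch. The class and parameter names are hypothetical; the design choice shown is a sliding window of recent events rather than a token bucket:

```python
import time
from collections import deque

class AdmissionController:
    """Sketch: sliding-window QPS and token-rate limiting (illustrative)."""
    def __init__(self, max_qps, max_tokens_per_sec, window_sec=1.0):
        self.max_qps = max_qps
        self.max_tps = max_tokens_per_sec
        self.window = window_sec
        self.events = deque()  # (timestamp, token_count)

    def _prune(self, now):
        # Drop events that have slid out of the moving window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def admit(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self._prune(now)
        qps = len(self.events) / self.window
        tps = sum(t for _, t in self.events) / self.window
        if qps >= self.max_qps or tps + tokens / self.window > self.max_tps:
            return False   # reject; caller may queue or shed the request
        self.events.append((now, tokens))
        return True
```

Requests rejected here never reach the priority queue, which keeps the heap bounded under overload.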
### Retrieval Flow

1. **Index Building**: Add documents to HNSW and inverted index
2. **Query Processing**:
   - Dense search (HNSW)
   - Sparse search (BM25)
   - Score fusion
   - Top-K selection
3. **Caching**: Check CMS for hot queries, cache results

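The fusion and top-K steps of query processing can be sketched as follows. This assumes both searches return `doc_id -> score` maps with pre-normalized scores, and uses a min-heap of size k so only the current top-k candidates are kept; `alpha` is the configurable dense weight:

```python
import heapq

def fuse_scores(dense, sparse, alpha=0.5, k=10):
    """Weighted fusion of dense and sparse scores with a bounded top-k heap.

    dense, sparse: dict of doc_id -> score (assumed normalized to [0, 1]).
    Returns the top-k (score, doc_id) pairs, highest score first.
    """
    heap = []  # min-heap: smallest retained score sits at heap[0]
    for doc_id in set(dense) | set(sparse):
        fused = (alpha * dense.get(doc_id, 0.0)
                 + (1 - alpha) * sparse.get(doc_id, 0.0))
        if len(heap) < k:
            heapq.heappush(heap, (fused, doc_id))
        elif fused > heap[0][0]:
            # Evict the weakest candidate in O(log k).
            heapq.heapreplace(heap, (fused, doc_id))
    return sorted(heap, key=lambda x: -x[0])
```

Keeping a size-k heap instead of sorting all candidates makes the selection O(n log k) rather than O(n log n), which matters when the candidate union is large.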
## Memory Management

### Token Budgeting

- Global token budget manager tracks:
  - KV cache tokens
  - Prompt tokens
  - Context window tokens

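One plausible shape for such a manager, sketched below with hypothetical names (`TokenBudget`, `reserve`, `release`): a single global cap shared across per-category counters, with reservation failing when the total would be exceeded.

```python
class TokenBudget:
    """Sketch: global token budget shared across tracked categories."""
    def __init__(self, total):
        self.total = total
        self.used = {"kv_cache": 0, "prompt": 0, "context": 0}

    def reserve(self, category, tokens):
        # Reject if the global total would be exceeded; the caller is
        # expected to evict or wait before retrying.
        if sum(self.used.values()) + tokens > self.total:
            return False
        self.used[category] += tokens
        return True

    def release(self, category, tokens):
        self.used[category] = max(0, self.used[category] - tokens)

    def available(self):
        return self.total - sum(self.used.values())
```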
### Page Allocation

- Fixed-size pages reduce fragmentation
- Free-list management for O(1) allocation
- Periodic defragmentation for compaction

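The free-list part of the allocator is simple enough to show in full; the real PagedAllocator also does defragmentation, which this sketch omits. Because pages are fixed-size, the free list is just a stack of page IDs, making both allocation and freeing O(1):

```python
class FreeListAllocator:
    """Sketch: fixed-size pages managed via a free-list stack."""
    def __init__(self, num_pages):
        self.free_list = list(range(num_pages))  # stack of free page IDs

    def alloc(self):
        # O(1): pop the most recently freed page (good cache locality).
        if not self.free_list:
            raise MemoryError("out of pages")  # caller may evict or compact
        return self.free_list.pop()

    def free(self, page_id):
        # O(1): push the page back for reuse.
        self.free_list.append(page_id)
```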
## Performance Characteristics

### Time Complexities

- **KV Cache**: O(1) attach/get, O(k) detach (k = pages)
- **Indexed Heap**: O(log n) push/pop/update
- **HNSW Search**: O(log n) approximate nearest neighbor
- **BM25 Search**: O(|query| × avg_doc_freq)

### Space Complexities

- **KV Cache**: O(sequences × tokens_per_seq)
- **HNSW**: O(n × M) where M = max connections
- **Inverted Index**: O(|vocab| × avg_postings)

## Trade-offs

### Page Size

- **Small pages**: Better memory utilization, higher bookkeeping overhead
- **Large pages**: Lower overhead, more internal fragmentation

### Batch Size

- **Small batches**: Lower latency, lower throughput
- **Large batches**: Higher throughput, higher latency

### HNSW Parameters

- **M (connections)**: Higher = better recall, more memory
- **efSearch**: Higher = better recall, slower search