
Architecture Overview

System Architecture

The LLM Data Structures Optimizer is organized into several key subsystems:

1. KV Cache System

┌─────────────────────────────────────────┐
│         KVCache                         │
│  ┌───────────────────────────────────┐  │
│  │  Prefix Hash Map                  │  │
│  │  (SHA256-based deduplication)     │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  Sequence → Page Mapping          │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  PagedAllocator                   │  │
│  │  - Fixed-size pages               │  │
│  │  - Free-list management           │  │
│  │  - Defragmentation                │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Copy-on-write (COW) for prefix sharing - shared pages are read-only until modified, then lazily copied
  • Reference counting - shared pages are tracked and only freed when all references are released
  • Hash-based deduplication - identical prefixes are automatically detected and shared
  • Page-level allocation granularity - efficient memory management with fixed-size pages
  • Defensive copying - get() returns deep copies to prevent external modification of shared data
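The COW, reference-counting, and deduplication features above can be sketched as follows. This is a minimal illustration with hypothetical names (`PagedKVStore`, `attach`, `write`), not the project's actual API:

```python
import hashlib

class PagedKVStore:
    """Sketch of COW page sharing with refcounts and hash dedup."""

    def __init__(self):
        self.pages = {}         # page_id -> list of tokens
        self.refcounts = {}     # page_id -> number of sharers
        self.prefix_index = {}  # SHA-256 hex digest -> page_id
        self._next_id = 0

    def _hash(self, tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def attach(self, tokens):
        """Store a token page, sharing it if an identical page exists."""
        key = self._hash(tokens)
        if key in self.prefix_index:          # dedup hit: share the page
            pid = self.prefix_index[key]
            self.refcounts[pid] += 1
            return pid
        pid, self._next_id = self._next_id, self._next_id + 1
        self.pages[pid] = list(tokens)
        self.refcounts[pid] = 1
        self.prefix_index[key] = pid
        return pid

    def write(self, pid, idx, value):
        """Copy-on-write: a shared page is copied before mutation."""
        if self.refcounts[pid] > 1:
            self.refcounts[pid] -= 1
            new_pid, self._next_id = self._next_id, self._next_id + 1
            self.pages[new_pid] = list(self.pages[pid])  # lazy copy
            self.refcounts[new_pid] = 1
            pid = new_pid
        self.pages[pid][idx] = value
        return pid

    def get(self, pid):
        return list(self.pages[pid])  # defensive copy
```

Two attaches of the same tokens yield one shared page with refcount 2; the first write through one handle copies the page, leaving the other sharer's view intact.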

2. Scheduler & Batching

┌─────────────────────────────────────────┐
│         Scheduler                       │
│  ┌───────────────────────────────────┐  │
│  │  IndexedHeap                      │  │
│  │  - Max-heap priority queue        │  │
│  │  - O(log n) increase/decrease-key │  │
│  │  - Priority by remaining tokens   │  │
│  │  - Sift directions fixed (v0.1.0) │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  AdmissionController              │  │
│  │  - QPS limiting                   │  │
│  │  - Token rate limiting            │  │
│  │  - Moving-window average          │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Dynamic micro-batching with configurable wait time
  • SLO-aware prioritization
  • Rate limiting and admission control
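The IndexedHeap's O(log n) key updates come from augmenting a binary max-heap with a key-to-index map, so an entry can be located and re-sifted without a scan. A sketch under that assumption (names are illustrative, not the project's API):

```python
class IndexedHeap:
    """Binary max-heap plus a position map for O(log n) priority updates."""

    def __init__(self):
        self.heap = []  # list of (priority, key)
        self.pos = {}   # key -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] <= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] > self.heap[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def push(self, key, priority):
        self.heap.append((priority, key))
        self.pos[key] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def pop(self):
        top = self.heap[0][1]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.pos[top]
        if self.heap:
            self._sift_down(0)
        return top

    def update(self, key, priority):
        """Increase- or decrease-key: sift in the matching direction."""
        i = self.pos[key]
        old = self.heap[i][0]
        self.heap[i] = (priority, key)
        if priority > old:
            self._sift_up(i)    # increased priority bubbles up
        else:
            self._sift_down(i)  # decreased priority bubbles down
```

Note the sift direction must match the direction of the priority change; getting these reversed is exactly the kind of bug the v0.1.0 fix in the diagram refers to.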

3. Retrieval Pipeline

┌─────────────────────────────────────────┐
│      RetrievalPipeline                  │
│  ┌───────────────────────────────────┐  │
│  │  HNSW (Dense Search)              │  │
│  │  - Hierarchical graph             │  │
│  │  - Approximate nearest neighbor   │  │
│  │  - Reproducible via seed parameter│  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  InvertedIndex (BM25)             │  │
│  │  - Compressed postings            │  │
│  │  - Varint/zigzag encoding         │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  Score Fusion                     │  │
│  │  - Weighted combination           │  │
│  │  - Top-K heap maintenance         │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  CountMinSketch                   │  │
│  │  - Hot query detection            │  │
│  │  - Cache priming                  │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Hybrid dense + sparse retrieval
  • Score fusion with configurable weights
  • Hot query caching
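Weighted score fusion with top-K selection can be sketched in a few lines. The weights and function name are illustrative defaults, not the project's configuration:

```python
import heapq

def fuse_scores(dense, sparse, w_dense=0.7, w_sparse=0.3, k=3):
    """Combine dense and sparse scores per document, then keep the top k.

    dense, sparse: dict of doc_id -> score from each retriever.
    Returns a list of (doc_id, fused_score), highest first.
    """
    combined = {}
    for doc, score in dense.items():
        combined[doc] = combined.get(doc, 0.0) + w_dense * score
    for doc, score in sparse.items():
        combined[doc] = combined.get(doc, 0.0) + w_sparse * score
    # heapq.nlargest maintains a size-k heap internally: O(n log k)
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])
```

A document scored by only one retriever simply contributes its single weighted term, so hybrid and single-source hits rank on the same scale.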

Data Flow

KV Cache Flow

  1. Attach Sequence: Allocate pages, hash prefix, check for sharing
  2. Get Sequence: Retrieve pages, reconstruct KV tokens
  3. Detach Sequence: Free pages, update statistics
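The three-step lifecycle above, reduced to its paging mechanics (class and page size are hypothetical; real pages hold KV tensors, not raw token ids):

```python
class KVCacheFlow:
    """Sketch of the attach / get / detach lifecycle."""

    PAGE = 16  # illustrative fixed page size, in tokens

    def __init__(self):
        self.seq_pages = {}  # seq_id -> list of pages
        self.stats = {"attached": 0, "detached": 0}

    def attach(self, seq_id, tokens):
        # 1. Allocate pages: split the sequence into fixed-size chunks.
        self.seq_pages[seq_id] = [
            tokens[i:i + self.PAGE] for i in range(0, len(tokens), self.PAGE)
        ]
        self.stats["attached"] += 1

    def get(self, seq_id):
        # 2. Reconstruct the flat token sequence from its pages.
        return [tok for page in self.seq_pages[seq_id] for tok in page]

    def detach(self, seq_id):
        # 3. Free the pages and update statistics.
        del self.seq_pages[seq_id]
        self.stats["detached"] += 1
```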

Scheduler Flow

  1. Submit Request: Add to priority queue, update admission stats
  2. Get Batch: Check wait time, pop top-k requests
  3. Complete Batch: Remove from queue, update metrics
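The submit/get-batch steps amount to dynamic micro-batching: release a batch when it is full or when the oldest pending request has waited past the configured limit. A sketch with illustrative knobs (`max_batch`, `max_wait_s` are assumptions, not the project's config names):

```python
import heapq
import time

class MicroBatcher:
    """Sketch of priority-ordered micro-batching with a wait-time cap."""

    def __init__(self, max_batch=4, max_wait_s=0.01):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = []          # (-priority, submit_order, request)
        self.first_submit = None
        self._order = 0

    def submit(self, request, priority):
        if self.first_submit is None:
            self.first_submit = time.monotonic()
        # Negate priority: heapq is a min-heap, we want max-priority first.
        heapq.heappush(self.queue, (-priority, self._order, request))
        self._order += 1

    def get_batch(self):
        waited = (time.monotonic() - self.first_submit
                  if self.first_submit is not None else 0.0)
        # Hold the batch while it is small AND the wait budget remains.
        if len(self.queue) < self.max_batch and waited < self.max_wait_s:
            return []
        batch = [heapq.heappop(self.queue)[2]
                 for _ in range(min(self.max_batch, len(self.queue)))]
        self.first_submit = time.monotonic() if self.queue else None
        return batch
```

Raising `max_wait_s` trades per-request latency for fuller (higher-throughput) batches, which is the batch-size trade-off discussed later in this document.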

Retrieval Flow

  1. Index Building: Add documents to HNSW and inverted index
  2. Query Processing:
    • Dense search (HNSW)
    • Sparse search (BM25)
    • Score fusion
    • Top-K selection
  3. Caching: Check CMS for hot queries, cache results
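Step 3 uses a count-min sketch (CMS) to decide which queries are hot enough to cache. A tiny sketch of that check; the width, depth, and threshold here are illustrative, not the project's defaults:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows, min of counters
    gives an estimate that can only overcount, never undercount."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Each row overestimates due to collisions; take the minimum.
        return min(self.table[row][col] for row, col in self._buckets(item))

def maybe_cache(cms, query, results, cache, hot_threshold=3):
    """Record the query; cache its results once it looks hot."""
    cms.add(query)
    if cms.estimate(query) >= hot_threshold:
        cache[query] = results
```

Because the CMS uses constant memory regardless of how many distinct queries flow through, the cache only spends space on queries that actually repeat.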

Memory Management

Token Budgeting

  • Global token budget manager tracks:
    • KV cache tokens
    • Prompt tokens
    • Context window tokens
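A global budget manager of this shape can be sketched as a single counter shared across the categories above (class and method names are hypothetical):

```python
class TokenBudget:
    """Sketch of a global token budget shared across categories."""

    def __init__(self, total):
        self.total = total
        self.used = {"kv_cache": 0, "prompt": 0, "context": 0}

    def reserve(self, category, n):
        """Reserve n tokens; refuse if the global budget would overflow."""
        if sum(self.used.values()) + n > self.total:
            return False  # caller must evict, defer, or reject
        self.used[category] += n
        return True

    def release(self, category, n):
        self.used[category] = max(0, self.used[category] - n)
```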

Page Allocation

  • Fixed-size pages reduce fragmentation
  • Free-list management for O(1) allocation
  • Periodic defragmentation for compaction
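The O(1) claim follows from keeping free page ids on a stack: allocation pops, freeing pushes. A minimal sketch (the real allocator also tracks per-page metadata and defragmentation, omitted here):

```python
class PagedAllocator:
    """Free-list page allocator: O(1) alloc and free via a stack of ids."""

    def __init__(self, num_pages):
        # Stack ordered so the lowest page ids are handed out first.
        self.free = list(range(num_pages - 1, -1, -1))

    def alloc(self):
        if not self.free:
            raise MemoryError("out of pages")
        return self.free.pop()      # O(1)

    def free_page(self, page_id):
        self.free.append(page_id)   # O(1)
```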

Performance Characteristics

Time Complexities

  • KV Cache: O(1) attach/get, O(k) detach (k = pages)
  • Indexed Heap: O(log n) push/pop/update
  • HNSW Search: O(log n) approximate nearest neighbor
  • BM25 Search: O(|query| × avg_doc_freq)

Space Complexities

  • KV Cache: O(sequences × tokens_per_seq)
  • HNSW: O(n × M) where M = max connections
  • Inverted Index: O(|vocab| × avg_postings)

Trade-offs

Page Size

  • Small pages: Better memory utilization, higher overhead
  • Large pages: Lower overhead, more fragmentation

Batch Size

  • Small batches: Lower latency, lower throughput
  • Large batches: Higher throughput, higher latency

HNSW Parameters

  • M (connections): Higher = better recall, more memory
  • efSearch: Higher = better recall, slower search
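The memory side of the M trade-off can be estimated from the O(n × M) bound above. A back-of-envelope helper (the per-id size and the omission of upper-layer overhead are simplifying assumptions):

```python
def hnsw_edge_memory_bytes(n, M, bytes_per_id=4):
    """Rough HNSW base-layer edge memory: n nodes, up to M neighbor ids
    each, bytes_per_id per stored id. Upper layers add a small geometric
    overhead that this estimate ignores."""
    return n * M * bytes_per_id
```

For example, 1M vectors at M=16 with 4-byte ids is roughly 64 MB of edges, and doubling M doubles that, which is why M is tuned jointly with the recall target rather than maximized.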