
Architecture Overview

System Architecture

The LLM Data Structures Optimizer is organized into several key subsystems:

1. KV Cache System

┌─────────────────────────────────────────┐
│         KVCache                         │
│  ┌───────────────────────────────────┐  │
│  │  Prefix Hash Map                  │  │
│  │  (SHA256-based deduplication)     │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  Sequence → Page Mapping          │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  PagedAllocator                   │  │
│  │  - Fixed-size pages               │  │
│  │  - Free-list management           │  │
│  │  - Defragmentation                │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Copy-on-write (COW) for prefix sharing - shared pages are read-only until modified, then lazily copied
  • Reference counting - shared pages are tracked and only freed when all references are released
  • Hash-based deduplication - identical prefixes are automatically detected and shared
  • Page-level allocation granularity - efficient memory management with fixed-size pages
  • Defensive copying - get() returns deep copies to prevent external modification of shared data
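The COW, reference-counting, and deduplication features above can be sketched as follows. This is a minimal illustration with hypothetical names (`PagedKVStore`, `attach`, `write`), not the project's actual API:

```python
import hashlib

class PagedKVStore:
    """Sketch of COW page sharing with refcounts and hash dedup."""

    def __init__(self):
        self.pages = {}         # page_id -> list of tokens
        self.refcounts = {}     # page_id -> number of sharers
        self.prefix_index = {}  # SHA-256 hex digest -> page_id
        self._next_id = 0

    def _hash(self, tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def attach(self, tokens):
        """Store a token page, sharing it if an identical page exists."""
        key = self._hash(tokens)
        if key in self.prefix_index:          # dedup hit: share the page
            pid = self.prefix_index[key]
            self.refcounts[pid] += 1
            return pid
        pid, self._next_id = self._next_id, self._next_id + 1
        self.pages[pid] = list(tokens)
        self.refcounts[pid] = 1
        self.prefix_index[key] = pid
        return pid

    def write(self, pid, idx, value):
        """Copy-on-write: a shared page is copied before mutation."""
        if self.refcounts[pid] > 1:
            self.refcounts[pid] -= 1
            new_pid, self._next_id = self._next_id, self._next_id + 1
            self.pages[new_pid] = list(self.pages[pid])  # lazy copy
            self.refcounts[new_pid] = 1
            pid = new_pid
        self.pages[pid][idx] = value
        return pid

    def get(self, pid):
        return list(self.pages[pid])  # defensive copy
```

Two attaches of the same tokens yield one shared page with refcount 2; the first write through one handle copies the page, leaving the other sharer's view intact.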

2. Scheduler & Batching

┌─────────────────────────────────────────┐
│         Scheduler                       │
│  ┌───────────────────────────────────┐  │
│  │  IndexedHeap                      │  │
│  │  - Max-heap priority queue        │  │
│  │  - O(log n) increase/decrease-key │  │
│  │  - Priority by remaining tokens   │  │
│  │  - Sift directions fixed (v0.1.0) │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  AdmissionController              │  │
│  │  - QPS limiting                   │  │
│  │  - Token rate limiting            │  │
│  │  - Moving-window average          │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Dynamic micro-batching with configurable wait time
  • SLO-aware prioritization
  • Rate limiting and admission control
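The IndexedHeap's O(log n) key updates come from augmenting a binary max-heap with a key-to-index map, so an entry can be located and re-sifted without a scan. A sketch under that assumption (names are illustrative, not the project's API):

```python
class IndexedHeap:
    """Binary max-heap plus a position map for O(log n) priority updates."""

    def __init__(self):
        self.heap = []  # list of (priority, key)
        self.pos = {}   # key -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] <= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] > self.heap[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def push(self, key, priority):
        self.heap.append((priority, key))
        self.pos[key] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def pop(self):
        top = self.heap[0][1]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.pos[top]
        if self.heap:
            self._sift_down(0)
        return top

    def update(self, key, priority):
        """Increase- or decrease-key: sift in the matching direction."""
        i = self.pos[key]
        old = self.heap[i][0]
        self.heap[i] = (priority, key)
        if priority > old:
            self._sift_up(i)    # increased priority bubbles up
        else:
            self._sift_down(i)  # decreased priority bubbles down
```

Note the sift direction must match the direction of the priority change; getting these reversed is exactly the kind of bug the v0.1.0 fix in the diagram refers to.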

3. Retrieval Pipeline

┌─────────────────────────────────────────┐
│      RetrievalPipeline                  │
│  ┌───────────────────────────────────┐  │
│  │  HNSW (Dense Search)              │  │
│  │  - Hierarchical graph             │  │
│  │  - Approximate nearest neighbor   │  │
│  │  - Reproducible via seed parameter│  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  InvertedIndex (BM25)             │  │
│  │  - Compressed postings            │  │
│  │  - Varint/zigzag encoding         │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  Score Fusion                     │  │
│  │  - Weighted combination           │  │
│  │  - Top-K heap maintenance         │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │  CountMinSketch                   │  │
│  │  - Hot query detection            │  │
│  │  - Cache priming                  │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Key Features:

  • Hybrid dense + sparse retrieval
  • Score fusion with configurable weights
  • Hot query caching
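Weighted score fusion with top-K selection can be sketched in a few lines. The weights and function name are illustrative defaults, not the project's configuration:

```python
import heapq

def fuse_scores(dense, sparse, w_dense=0.7, w_sparse=0.3, k=3):
    """Combine dense and sparse scores per document, then keep the top k.

    dense, sparse: dict of doc_id -> score from each retriever.
    Returns a list of (doc_id, fused_score), highest first.
    """
    combined = {}
    for doc, score in dense.items():
        combined[doc] = combined.get(doc, 0.0) + w_dense * score
    for doc, score in sparse.items():
        combined[doc] = combined.get(doc, 0.0) + w_sparse * score
    # heapq.nlargest maintains a size-k heap internally: O(n log k)
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])
```

A document scored by only one retriever simply contributes its single weighted term, so hybrid and single-source hits rank on the same scale.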

Data Flow

KV Cache Flow

  1. Attach Sequence: Allocate pages, hash prefix, check for sharing
  2. Get Sequence: Retrieve pages, reconstruct KV tokens
  3. Detach Sequence: Free pages, update statistics
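The three-step lifecycle above, reduced to its paging mechanics (class and page size are hypothetical; real pages hold KV tensors, not raw token ids):

```python
class KVCacheFlow:
    """Sketch of the attach / get / detach lifecycle."""

    PAGE = 16  # illustrative fixed page size, in tokens

    def __init__(self):
        self.seq_pages = {}  # seq_id -> list of pages
        self.stats = {"attached": 0, "detached": 0}

    def attach(self, seq_id, tokens):
        # 1. Allocate pages: split the sequence into fixed-size chunks.
        self.seq_pages[seq_id] = [
            tokens[i:i + self.PAGE] for i in range(0, len(tokens), self.PAGE)
        ]
        self.stats["attached"] += 1

    def get(self, seq_id):
        # 2. Reconstruct the flat token sequence from its pages.
        return [tok for page in self.seq_pages[seq_id] for tok in page]

    def detach(self, seq_id):
        # 3. Free the pages and update statistics.
        del self.seq_pages[seq_id]
        self.stats["detached"] += 1
```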

Scheduler Flow

  1. Submit Request: Add to priority queue, update admission stats
  2. Get Batch: Check wait time, pop top-k requests
  3. Complete Batch: Remove from queue, update metrics
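The submit/get-batch steps amount to dynamic micro-batching: release a batch when it is full or when the oldest pending request has waited past the configured limit. A sketch with illustrative knobs (`max_batch`, `max_wait_s` are assumptions, not the project's config names):

```python
import heapq
import time

class MicroBatcher:
    """Sketch of priority-ordered micro-batching with a wait-time cap."""

    def __init__(self, max_batch=4, max_wait_s=0.01):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = []          # (-priority, submit_order, request)
        self.first_submit = None
        self._order = 0

    def submit(self, request, priority):
        if self.first_submit is None:
            self.first_submit = time.monotonic()
        # Negate priority: heapq is a min-heap, we want max-priority first.
        heapq.heappush(self.queue, (-priority, self._order, request))
        self._order += 1

    def get_batch(self):
        waited = (time.monotonic() - self.first_submit
                  if self.first_submit is not None else 0.0)
        # Hold the batch while it is small AND the wait budget remains.
        if len(self.queue) < self.max_batch and waited < self.max_wait_s:
            return []
        batch = [heapq.heappop(self.queue)[2]
                 for _ in range(min(self.max_batch, len(self.queue)))]
        self.first_submit = time.monotonic() if self.queue else None
        return batch
```

Raising `max_wait_s` trades per-request latency for fuller (higher-throughput) batches, which is the batch-size trade-off discussed later in this document.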

Retrieval Flow

  1. Index Building: Add documents to HNSW and inverted index
  2. Query Processing:
    • Dense search (HNSW)
    • Sparse search (BM25)
    • Score fusion
    • Top-K selection
  3. Caching: Check CMS for hot queries, cache results
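Step 3 uses a count-min sketch (CMS) to decide which queries are hot enough to cache. A tiny sketch of that check; the width, depth, and threshold here are illustrative, not the project's defaults:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows, min of counters
    gives an estimate that can only overcount, never undercount."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Each row overestimates due to collisions; take the minimum.
        return min(self.table[row][col] for row, col in self._buckets(item))

def maybe_cache(cms, query, results, cache, hot_threshold=3):
    """Record the query; cache its results once it looks hot."""
    cms.add(query)
    if cms.estimate(query) >= hot_threshold:
        cache[query] = results
```

Because the CMS uses constant memory regardless of how many distinct queries flow through, the cache only spends space on queries that actually repeat.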

Memory Management

Token Budgeting

  • Global token budget manager tracks:
    • KV cache tokens
    • Prompt tokens
    • Context window tokens
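A global budget manager of this shape can be sketched as a single counter shared across the categories above (class and method names are hypothetical):

```python
class TokenBudget:
    """Sketch of a global token budget shared across categories."""

    def __init__(self, total):
        self.total = total
        self.used = {"kv_cache": 0, "prompt": 0, "context": 0}

    def reserve(self, category, n):
        """Reserve n tokens; refuse if the global budget would overflow."""
        if sum(self.used.values()) + n > self.total:
            return False  # caller must evict, defer, or reject
        self.used[category] += n
        return True

    def release(self, category, n):
        self.used[category] = max(0, self.used[category] - n)
```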

Page Allocation

  • Fixed-size pages reduce fragmentation
  • Free-list management for O(1) allocation
  • Periodic defragmentation for compaction
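The O(1) claim follows from keeping free page ids on a stack: allocation pops, freeing pushes. A minimal sketch (the real allocator also tracks per-page metadata and defragmentation, omitted here):

```python
class PagedAllocator:
    """Free-list page allocator: O(1) alloc and free via a stack of ids."""

    def __init__(self, num_pages):
        # Stack ordered so the lowest page ids are handed out first.
        self.free = list(range(num_pages - 1, -1, -1))

    def alloc(self):
        if not self.free:
            raise MemoryError("out of pages")
        return self.free.pop()      # O(1)

    def free_page(self, page_id):
        self.free.append(page_id)   # O(1)
```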

Performance Characteristics

Time Complexities

  • KV Cache: O(1) attach/get, O(k) detach (k = pages)
  • Indexed Heap: O(log n) push/pop/update
  • HNSW Search: O(log n) approximate nearest neighbor
  • BM25 Search: O(|query| × avg_doc_freq)

Space Complexities

  • KV Cache: O(sequences × tokens_per_seq)
  • HNSW: O(n × M) where M = max connections
  • Inverted Index: O(|vocab| × avg_postings)

Trade-offs

Page Size

  • Small pages: Better memory utilization, higher overhead
  • Large pages: Lower overhead, more fragmentation

Batch Size

  • Small batches: Lower latency, lower throughput
  • Large batches: Higher throughput, higher latency

HNSW Parameters

  • M (connections): Higher = better recall, more memory
  • efSearch: Higher = better recall, slower search
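The memory side of the M trade-off can be estimated from the O(n × M) bound above. A back-of-envelope helper (the per-id size and the omission of upper-layer overhead are simplifying assumptions):

```python
def hnsw_edge_memory_bytes(n, M, bytes_per_id=4):
    """Rough HNSW base-layer edge memory: n nodes, up to M neighbor ids
    each, bytes_per_id per stored id. Upper layers add a small geometric
    overhead that this estimate ignores."""
    return n * M * bytes_per_id
```

For example, 1M vectors at M=16 with 4-byte ids is roughly 64 MB of edges, and doubling M doubles that, which is why M is tuned jointly with the recall target rather than maximized.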