Files

8.5 KiB
Raw Permalink Blame History

API Reference

Core Modules

llmds.paged_allocator.PagedAllocator

Paged memory allocator with slab allocation.

Methods:

  • alloc(num_pages: int) -> list[int]: Allocate pages
  • free(page_ids: list[int]) -> None: Free pages
  • stats() -> PageStats: Get allocation statistics
  • defragment() -> None: Defragment pages

Complexity: O(1) alloc/free, O(n) defragment

llmds.kv_cache.KVCache

KV cache with prefix sharing and deduplication. Implements copy-on-write (COW) for safe prefix sharing.

Parameters:

  • page_size: int = 512 - Size of each KV cache page in tokens
  • max_pages: int = 10000 - Maximum number of pages to allocate
  • enable_prefix_sharing: bool = True - Enable prefix sharing optimization

Methods:

  • attach(seq_id: int, kv_tokens: list, prefix_tokens: Optional[list] = None) -> None - Attach KV cache for a sequence. Uses COW for shared pages.
  • detach(seq_id: int) -> None - Detach and free KV cache, with reference counting for shared pages
  • get(seq_id: int) -> Optional[list] - Get KV cache (returns deep copy to prevent external modification)
  • stats() -> dict - Get cache statistics including shared pages count and reference counts

Complexity: O(1) attach/get, O(k) detach where k = pages

Copy-on-Write Semantics:

  • Shared pages (from prefix sharing) are read-only until written
  • Writes to shared pages trigger lazy copying (COW)
  • Reference counting ensures shared pages are only freed when all references are released
  • get() returns deep copies to prevent external corruption of shared pages

Safety: All shared page operations are protected against data corruption through COW and defensive copying.

llmds.utils.Timer

Simple timer context manager for measuring execution time.

Usage:

from llmds.utils import Timer

with Timer() as t:
    # Your code here
    pass
elapsed_seconds = t.elapsed  # Float representing elapsed time

Complexity: O(1) for all operations

llmds.utils.MemoryProfiler

Memory profiler for measuring peak RSS (Resident Set Size) during benchmarks.

Methods:

  • start() -> None: Start memory profiling
  • sample() -> int: Sample current RSS and update peak
  • get_peak_rss_mb() -> float: Get peak RSS in megabytes
  • get_peak_rss_bytes() -> int: Get peak RSS in bytes
  • get_current_rss_mb() -> float: Get current RSS in megabytes
  • get_memory_delta_mb() -> float: Get memory delta from initial RSS in megabytes

Context Manager:

  • memory_profiler() -> Iterator[MemoryProfiler]: Context manager for automatic profiling

Usage:

from llmds.utils import memory_profiler

with memory_profiler() as profiler:
    # Your code here
    profiler.sample()  # Optional: sample at specific points
peak_rss_mb = profiler.get_peak_rss_mb()

Complexity: O(1) for all operations

llmds.utils.compute_percentiles

Compute P50, P95, P99 percentiles from a list of values.

Parameters:

  • values: list[float] - List of numeric values

Returns:

  • dict[str, float] - Dictionary with p50, p95, p99 keys

Usage:

from llmds.utils import compute_percentiles

latencies = [10.5, 12.3, 11.1, 15.2, 10.8, ...]
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f}ms")
print(f"P95: {percentiles['p95']:.2f}ms")
print(f"P99: {percentiles['p99']:.2f}ms")

Complexity: O(n log n) where n = len(values)

llmds.utils.calculate_statistics

Calculate comprehensive statistical summary for a list of values.

Parameters:

  • values: list[float] - List of numeric values
  • confidence_level: float = 0.95 - Confidence level for intervals (e.g., 0.95 for 95% CI)

Returns:

  • dict[str, Any] - Dictionary containing:
    • mean: Mean value
    • std: Standard deviation (sample)
    • min: Minimum value
    • max: Maximum value
    • p50, p95, p99: Percentiles
    • ci_lower, ci_upper: Confidence interval bounds
    • cv: Coefficient of variation (%)
    • count: Number of values

Usage:

from llmds.utils import calculate_statistics

measurements = [10.5, 12.3, 11.1, 15.2, 10.8, ...]
stats = calculate_statistics(measurements, confidence_level=0.95)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f}")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}]")
print(f"CV: {stats['cv']:.2f}%")

Complexity: O(n log n) where n = len(values)

llmds.token_lru.TokenLRU

Token-aware LRU cache with eviction until budget.

Methods:

  • put(key: K, value: V) -> None
  • get(key: K) -> Optional[V]
  • evict_until_budget(target_budget: int) -> list[tuple[K, V]]
  • total_tokens() -> int

Complexity: O(1) put/get, O(n) evict_until_budget

llmds.indexed_heap.IndexedHeap

Indexed binary heap with decrease/increase-key operations. Supports both min-heap and max-heap modes.

Parameters:

  • max_heap: bool = False - If True, use max-heap (largest score at top), otherwise min-heap

Methods:

  • push(key_id: int, score: float) -> None - Add item to heap
  • pop() -> tuple[float, int] - Remove and return top element
  • decrease_key(key_id: int, new_score: float) -> None - Decrease key value (bubbles down for max-heap, up for min-heap)
  • increase_key(key_id: int, new_score: float) -> None - Increase key value (bubbles up for max-heap, down for min-heap)
  • delete(key_id: int) -> tuple[float, int] - Remove specific item
  • get_score(key_id: int) -> Optional[float] - Get score for key_id
  • peek() -> Optional[tuple[float, int]] - View top element without removing
  • size() -> int - Get number of elements
  • is_empty() -> bool - Check if heap is empty

Complexity: O(log n) for all operations

Note: Fixed max-heap bubble directions (v0.1.0) - decrease_key bubbles down and increase_key bubbles up for max-heap.

llmds.scheduler.Scheduler

Dynamic micro-batching scheduler.

Methods:

  • submit(tokens: int, slo_ms: Optional[float] = None) -> int
  • get_batch(force: bool = False) -> Optional[list[int]]
  • complete_batch(request_ids: list[int]) -> None
  • update_priority(request_id: int, new_tokens: int) -> None

Complexity: O(log n) submit, O(k log n) get_batch where k = batch_size

llmds.admissions.AdmissionController

Admission controller with rate limiting.

Methods:

  • should_admit(estimated_tokens: int = 0) -> tuple[bool, str]
  • record_request(tokens: int) -> None
  • stats() -> dict: Get admission statistics

Complexity: O(1) should_admit

llmds.inverted_index.InvertedIndex

Compressed inverted index with BM25 scoring.

Methods:

  • add_document(doc_id: int, text: str) -> None
  • search(query: str, top_k: int = 10) -> list[tuple[int, float]]
  • get_term_frequency(term: str, doc_id: int) -> int
  • get_document_frequency(term: str) -> int

Complexity: O(|doc|) add_document, O(|query| × avg_doc_freq) search

llmds.hnsw.HNSW

Hierarchical Navigable Small World graph for approximate nearest neighbor search.

Parameters:

  • dim: int - Dimension of vectors
  • M: int = 16 - Maximum number of connections per node
  • ef_construction: int = 200 - Size of candidate set during construction
  • ef_search: int = 50 - Size of candidate set during search
  • ml: float = 1.0 / log(2.0) - Normalization factor for level assignment
  • seed: Optional[int] = None - Random seed for reproducible graph structure

Methods:

  • add(vec: np.ndarray, vec_id: int) -> None - Add vector to index
  • search(query: np.ndarray, k: int) -> list[tuple[int, float]] - Search for k nearest neighbors. Returns list of (vector_id, distance) tuples
  • stats() -> dict - Get index statistics (num_vectors, num_layers, entry_point, etc.)

Complexity: O(log n) search, O(log n × efConstruction) add

Reproducibility: When seed is provided, each HNSW instance uses its own random.Random(seed) state for level assignments, ensuring identical graph structures across runs with the same seed.

llmds.cmsketch.CountMinSketch

Count-Min Sketch for frequency estimation.

Methods:

  • add(item: str, count: int = 1) -> None
  • estimate(item: str) -> int
  • is_hot(item: str, threshold: int) -> bool
  • get_error_bound() -> float

Complexity: O(depth) add/estimate

llmds.retrieval_pipeline.RetrievalPipeline

End-to-end retrieval pipeline.

Methods:

  • add_document(doc_id: int, text: str, embedding: Optional[np.ndarray] = None) -> None
  • search(query: str, query_embedding: Optional[np.ndarray] = None, top_k: int = 10, fusion_weight: float = 0.5) -> list[tuple[int, float]]
  • stats() -> dict: Get pipeline statistics

Complexity: O(log n) search (HNSW) + O(|query| × avg_doc_freq) (BM25)