Mathematical Models

This document describes the mathematical formulations and algorithms used throughout the LLM RAG Data Structures Optimizer.

Table of Contents

  • BM25 Ranking Function
  • HNSW Distance Metrics
  • Count-Min Sketch Error Bounds
  • Score Fusion
  • KV Cache Memory Calculation
  • Token-Aware LRU Eviction
  • Admission Control Rate Limiting
  • Indexed Binary Heap
  • Variance Analysis and Statistical Confidence
  • References

BM25 Ranking Function

BM25 (Best Matching 25) is a probabilistic ranking function used for information retrieval. It scores documents based on term frequency and inverse document frequency.

Formula

For a query Q = \{q_1, q_2, \ldots, q_n\} and document D, the BM25 score is:


\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}

Where:

  • f(q_i, D) = frequency of term q_i in document D
  • |D| = length of document D (number of terms)
  • \text{avgdl} = average document length in the collection
  • k_1 = term frequency saturation parameter (typically 1.2-2.0)
  • b = length normalization parameter (typically 0.75)

Inverse Document Frequency (IDF)


\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}

Where:

  • N = total number of documents in the collection
  • n(q_i) = number of documents containing term q_i

The 0.5 smoothing terms keep the IDF finite: they prevent division by zero for terms that appear in no documents and avoid a log of zero for terms that appear in every document.

Implementation Defaults

In our implementation:

  • k_1 = 1.5 (default)
  • b = 0.75 (default)
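
A minimal sketch of the scoring formula above in plain Python (the function name and arguments are illustrative, not the library's API):

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.5, b=0.75):
    """Score one document against a query using the BM25 formula above.

    doc_freq maps each term to n(q_i), the number of documents containing it;
    num_docs is N; avgdl is the average document length in the collection.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        n_q = doc_freq.get(term, 0)
        idf = math.log((num_docs - n_q + 0.5) / (n_q + 0.5))
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score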

HNSW Distance Metrics

Hierarchical Navigable Small World (HNSW) uses distance metrics to measure similarity between vectors. The default distance metric is L2 (Euclidean) distance.

L2 Distance (Euclidean)

For vectors \vec{u} = (u_1, u_2, \ldots, u_d) and \vec{v} = (v_1, v_2, \ldots, v_d):


d_{\text{L2}}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}

In practice, we often use the squared L2 distance for efficiency; because squaring is monotonic, it preserves nearest-neighbor ordering:


d_{\text{L2}}^2(\vec{u}, \vec{v}) = \sum_{i=1}^{d} (u_i - v_i)^2

Cosine Similarity (Alternative)

For normalized vectors, cosine similarity is often preferred:


\text{cosine}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{||\vec{u}|| \cdot ||\vec{v}||} = \frac{\sum_{i=1}^{d} u_i \cdot v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \cdot \sqrt{\sum_{i=1}^{d} v_i^2}}

For normalized vectors where ||\vec{u}|| = ||\vec{v}|| = 1:


\text{cosine}(\vec{u}, \vec{v}) = \vec{u} \cdot \vec{v} = \sum_{i=1}^{d} u_i \cdot v_i

Note: Cosine similarity can be converted to distance: d_{\text{cosine}} = 1 - \text{cosine}(\vec{u}, \vec{v})
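
Both metrics can be computed directly with NumPy; this is an illustrative snippet, not the index's internal code:

import numpy as np

def l2_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Euclidean distance; the squared variant drops the sqrt but preserves ordering.
    return float(np.sqrt(np.sum((u - v) ** 2)))

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Converts a similarity in [-1, 1] to a distance in [0, 2].
    return 1.0 - cosine_similarity(u, v)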

HNSW Graph Properties

The HNSW graph has logarithmic search complexity:

  • Search complexity: O(\log N) where N is the number of vectors
  • Construction complexity: O(N \log N)
  • Memory complexity: O(N \cdot M) where M is the maximum connections per node

Return Format: The search() and _search_layer() methods return results as (node_id, distance) tuples, where:

  • node_id: Integer identifier of the vector in the index
  • distance: Float representing the L2 distance from the query vector

Count-Min Sketch Error Bounds

Count-Min Sketch is a probabilistic data structure for frequency estimation with bounded error.

Structure

A Count-Min Sketch has width w and depth d, creating a d \times w table of counters.

Update Operation

For item x with count c, update all d rows:


\text{CM}[i][h_i(x)] \leftarrow \text{CM}[i][h_i(x)] + c, \quad \forall i \in \{1, 2, \ldots, d\}

Where h_i(x) is a hash function for row i.

Estimation

The estimated frequency is the minimum across all rows:


\hat{f}(x) = \min_{i \in \{1, \ldots, d\}} \text{CM}[i][h_i(x)]

Error Bound

With probability at least 1 - \delta, the error is bounded by:


\hat{f}(x) - f(x) \leq \epsilon \cdot ||\mathbf{f}||_1

Where:

  • f(x) = true frequency of x
  • ||\mathbf{f}||_1 = total count of all items (L1 norm)
  • \epsilon = \frac{e}{w} (where e \approx 2.71828)
  • \delta = \left(\frac{1}{2}\right)^d

Parameter Selection

To achieve error bound \epsilon with probability 1 - \delta:


w = \left\lceil \frac{e}{\epsilon} \right\rceil

d = \left\lceil \ln \frac{1}{\delta} \right\rceil

Default parameters in our implementation:

  • w = 2048 (default), giving \epsilon = e/w \approx 0.0013
  • d = 4 (default), giving \delta = (1/2)^4 = 0.0625 (6.25% error probability)
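
The structure can be sketched in a few lines of Python; the hashing scheme below (per-row keyed BLAKE2b) is an illustrative choice, not necessarily what the library uses:

import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, row: int) -> int:
        # One independent hash function per row, derived by salting with the row index.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> int:
        # Estimates never undercount; taking the minimum across rows bounds the collision error.
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))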

Score Fusion

Hybrid search combines scores from multiple retrieval methods (dense vectors and sparse keywords).

Weighted Linear Combination


S_{\text{fused}}(d, q) = \alpha \cdot S_{\text{dense}}(d, q) + \beta \cdot S_{\text{sparse}}(d, q)

Where:

  • S_{\text{dense}}(d, q) = normalized vector similarity score
  • S_{\text{sparse}}(d, q) = normalized BM25 score
  • \alpha + \beta = 1 (typically \alpha = 0.7, \beta = 0.3)

Score Normalization

Before fusion, scores are normalized to [0, 1] range:


S_{\text{norm}}(d, q) = \frac{S(d, q) - S_{\min}}{S_{\max} - S_{\min}}

Where S_{\min} and S_{\max} are the minimum and maximum scores in the candidate set.
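
Putting normalization and the weighted combination together (a sketch with hypothetical helper names; the real fusion code may differ):

def min_max_normalize(scores: dict) -> dict:
    # Map raw scores to [0, 1]; a constant score list maps to all zeros.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse_scores(dense: dict, sparse: dict, alpha: float = 0.7) -> dict:
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    # A document missing from one candidate set contributes 0 for that component.
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
            for d in doc_ids}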

Reciprocal Rank Fusion (Alternative)


S_{\text{RRF}}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}

Where:

  • R = set of ranked lists to fuse
  • \text{rank}_r(d) = rank of document d in ranked list r
  • k = smoothing parameter (typically 60)
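
A compact sketch of RRF over several ranked lists (function name and types are illustrative):

def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> dict:
    # Each ranked list is an ordered sequence of document ids, best first (rank 1).
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores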

KV Cache Memory Calculation

The KV cache memory usage depends on the number of cached tokens and the model dimensions.

Per-Sequence Memory

For a sequence with T tokens and model with hidden dimension d:


M_{\text{sequence}} = 2 \cdot T \cdot d \cdot \text{bytes\_per\_element}

Where:

  • Factor of 2 accounts for both key and value tensors
  • \text{bytes\_per\_element} = 4 for float32, 2 for float16
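
As a worked example of the formula (the dimensions below are chosen for illustration only):

def kv_cache_bytes(num_tokens: int, hidden_dim: int, bytes_per_element: int = 2) -> int:
    # Factor of 2: one tensor for keys, one for values.
    return 2 * num_tokens * hidden_dim * bytes_per_element

# 4096 tokens, hidden_dim = 4096, float16:
# 2 * 4096 * 4096 * 2 bytes = 67,108,864 bytes = 64 MiB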

Paged Allocation

With pages of size P and a capacity of C tokens per page:


M_{\text{paged}} = \left\lceil \frac{T}{C} \right\rceil \cdot P \cdot \text{page\_overhead}

Where \text{page\_overhead} includes page metadata.

Prefix Sharing Memory Savings

If N sequences share a prefix of length L:


M_{\text{shared}} = L \cdot d \cdot \text{bytes\_per\_element}

M_{\text{without\_sharing}} = N \cdot L \cdot d \cdot \text{bytes\_per\_element}

Memory savings:


\text{Savings} = (N - 1) \cdot L \cdot d \cdot \text{bytes\_per\_element}

Savings ratio:


\text{Savings Ratio} = \frac{N - 1}{N} = 1 - \frac{1}{N}

For large N, this approaches 100% savings on shared prefixes.
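
A quick calculation of these savings using the formulas above (the numbers below are illustrative, not benchmark results):

def prefix_sharing_savings(num_sequences: int, prefix_len: int,
                           hidden_dim: int, bytes_per_element: int = 2):
    per_copy = prefix_len * hidden_dim * bytes_per_element
    saved_bytes = (num_sequences - 1) * per_copy
    savings_ratio = 1.0 - 1.0 / num_sequences
    return saved_bytes, savings_ratio

# 8 sequences sharing a 512-token prefix with hidden_dim = 4096, float16:
# saved = 7 * 512 * 4096 * 2 bytes = 28 MiB, ratio = 1 - 1/8 = 87.5%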

Copy-on-Write Overhead

With copy-on-write (COW), if K sequences modify shared pages:


M_{\text{with\_cow}} = (N - K) \cdot L_{\text{shared}} \cdot d \cdot \text{bytes\_per\_element} + K \cdot L_{\text{modified}} \cdot d \cdot \text{bytes\_per\_element}

Where:

  • L_{\text{shared}} = length of shared (unmodified) prefix pages
  • L_{\text{modified}} = length of modified prefix pages (copied)

COW Efficiency:

  • If no sequences modify shared pages (K = 0): Maximum savings (shared pages stored once)
  • If all sequences modify (K = N): No savings (each has own copy)
  • Typical case (K < N): Partial savings based on modification rate

Reference Counting: Shared pages are freed when reference count r reaches zero:


r = \sum_{i=1}^{N} \mathbf{1}_{\text{seq}_i \text{ references page}}

Where \mathbf{1} is the indicator function (1 if sequence references page, 0 otherwise).


Token-Aware LRU Eviction

Token-aware LRU maintains a cumulative token budget while evicting least recently used items.

Eviction Criterion

Evict item i with minimum value of:


\text{priority}(i) = \frac{\text{access\_count}(i)}{\text{token\_count}(i)}

Or use recency-weighted:


\text{priority}(i) = \frac{\text{last\_access\_time}(i)}{\text{token\_count}(i)}

Token Budget Constraint

Maintain total tokens below budget B:


\sum_{i \in \text{cache}} \text{token\_count}(i) \leq B

When adding item j with t_j tokens:

  1. If \sum_{i} t_i + t_j \leq B: add item
  2. Else: evict items until \sum_{i} t_i + t_j \leq B

Eviction Algorithm

while total_tokens + new_tokens > budget:
    item = item_with_min_priority()   # lowest-priority (least valuable) cached entry
    total_tokens -= token_count(item)
    evict(item)
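
A self-contained sketch of this loop using pure recency ordering via an ordered dict (class and method names are hypothetical, not the library's API; the priority formulas above can be substituted for the ordering):

from collections import OrderedDict

class TokenAwareLRUCache:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.total_tokens = 0
        self._items = OrderedDict()  # key -> (value, token_count), oldest first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key][0]

    def put(self, key, value, token_count: int) -> None:
        if key in self._items:
            self.total_tokens -= self._items.pop(key)[1]
        # Evict least recently used entries until the new item fits the budget.
        while self._items and self.total_tokens + token_count > self.token_budget:
            _, (_, evicted_tokens) = self._items.popitem(last=False)
            self.total_tokens -= evicted_tokens
        self._items[key] = (value, token_count)
        self.total_tokens += token_count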

Admission Control Rate Limiting

The admission controller uses an exponentially weighted moving average (EWMA) to track request rate.

Moving Average Update


\bar{r}_{t} = \alpha \cdot r_t + (1 - \alpha) \cdot \bar{r}_{t-1}

Where:

  • r_t = current request rate
  • \bar{r}_t = smoothed average rate
  • \alpha = smoothing factor (typically 0.1-0.3)

Admission Decision

Admit request if:


\bar{r}_t + \text{margin} \leq R_{\max}

Where:

  • R_{\max} = maximum allowed rate (QPS limit)
  • \text{margin} = safety margin to account for burstiness
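
A minimal sketch of the EWMA tracker and admission check (class and attribute names are illustrative):

class EWMAAdmissionController:
    def __init__(self, max_rate: float, margin: float = 0.0, alpha: float = 0.2):
        self.max_rate = max_rate    # R_max (QPS limit)
        self.margin = margin        # safety margin for burstiness
        self.alpha = alpha          # smoothing factor
        self.avg_rate = 0.0         # r_bar, the smoothed rate

    def observe(self, current_rate: float) -> None:
        # EWMA update: r_bar_t = alpha * r_t + (1 - alpha) * r_bar_{t-1}
        self.avg_rate = self.alpha * current_rate + (1 - self.alpha) * self.avg_rate

    def admit(self) -> bool:
        return self.avg_rate + self.margin <= self.max_rate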

Token Bucket (Alternative)

The token bucket algorithm allows bursty traffic:


\text{tokens}(t) = \min(B, \text{tokens}(t-1) + R \cdot \Delta t)

Where:

  • B = bucket capacity (burst limit)
  • R = token refill rate (sustainable rate)
  • \Delta t = time since last update

Request is admitted if \text{tokens}(t) \geq 1, then \text{tokens}(t) \leftarrow \text{tokens}(t) - 1.
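
The bucket update and admission check fit in a small class (a sketch; the real controller's interface may differ):

import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # B, burst limit
        self.refill_rate = refill_rate    # R, tokens per second
        self.tokens = capacity
        self.last_update = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + self.refill_rate * (now - self.last_update))
        self.last_update = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False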


Indexed Binary Heap

The indexed heap supports O(log n) priority updates with decrease/increase-key operations. Supports both min-heap and max-heap modes.

Heap Property

Min-heap: parent_score ≤ child_score
Max-heap: parent_score ≥ child_score

Decrease/Increase Key Operations

Min-heap:

  • decrease_key(new_score < old_score): Bubbles UP (score decreases → higher priority)
  • increase_key(new_score > old_score): Bubbles DOWN (score increases → lower priority)

Max-heap:

  • decrease_key(new_score < old_score): Bubbles DOWN (score decreases → lower priority) Fixed in v0.1.0
  • increase_key(new_score > old_score): Bubbles UP (score increases → higher priority) Fixed in v0.1.0

Complexity

  • Push: O(log n)
  • Pop: O(log n)
  • Decrease/Increase key: O(log n)
  • Delete: O(log n)

Heap Properties

For a min-heap with n elements:

  • Parent of node i: \lfloor (i-1)/2 \rfloor
  • Left child of node i: 2i + 1
  • Right child of node i: 2i + 2

Heap Invariant


\text{priority}(\text{parent}(i)) \leq \text{priority}(i), \quad \forall i > 0

Complexity

  • Insert: O(\log n)
  • Extract min: O(\log n)
  • Update key: O(\log n) (with index mapping)
  • Decrease key: O(\log n)
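
A compact min-heap sketch with the position map that makes decrease_key O(log n) (illustrative code, not the library's implementation; the max-heap variant flips the comparisons):

class IndexedMinHeap:
    def __init__(self):
        self._heap = []   # list of (score, key) pairs
        self._pos = {}    # key -> index in self._heap

    def push(self, key, score):
        self._heap.append((score, key))
        self._pos[key] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def pop(self):
        score, key = self._heap[0]
        self._swap(0, len(self._heap) - 1)
        self._heap.pop()
        del self._pos[key]
        if self._heap:
            self._sift_down(0)
        return key, score

    def decrease_key(self, key, new_score):
        i = self._pos[key]
        self._heap[i] = (new_score, key)
        self._sift_up(i)  # a lower score means higher priority, so bubble up

    def _swap(self, i, j):
        self._heap[i], self._heap[j] = self._heap[j], self._heap[i]
        self._pos[self._heap[i][1]] = i
        self._pos[self._heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] < self._heap[parent][0]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and self._heap[left][0] < self._heap[smallest][0]:
                smallest = left
            if right < n and self._heap[right][0] < self._heap[smallest][0]:
                smallest = right
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest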

Variance Analysis and Statistical Confidence

Benchmark results include variance analysis to assess measurement reliability and identify flaky configurations.

Statistical Summary

For a set of n measurements \{x_1, x_2, \ldots, x_n\}:

Mean:


\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Standard Deviation (Sample):


s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Coefficient of Variation:


\text{CV} = \frac{s}{\bar{x}} \times 100\%

The CV expresses relative variability as a percentage, making it easier to compare variance across different metrics and scales.

Confidence Intervals

For small samples (n < 30), we use the t-distribution for confidence intervals:


\text{CI} = \bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}

Where:

  • t_{\alpha/2, n-1} = t-critical value for \alpha significance level and n-1 degrees of freedom
  • For 95% confidence: \alpha = 0.05, so t_{0.025, n-1}

For large samples (n \geq 30), we approximate with the normal distribution:


\text{CI} = \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}

Where z_{\alpha/2} = 1.96 for 95% confidence.
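
For reference, the t-based interval can be reproduced with SciPy (an illustrative helper, separate from the llmds.utils functions described below):

import math
from statistics import mean, stdev
from scipy.stats import t as t_dist

def t_confidence_interval(values, confidence: float = 0.95):
    # Two-sided t interval: x_bar +/- t_{alpha/2, n-1} * s / sqrt(n)
    n = len(values)
    x_bar, s = mean(values), stdev(values)
    t_crit = t_dist.ppf(1 - (1 - confidence) / 2, n - 1)
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin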

Flaky Benchmark Detection

A benchmark configuration is considered flaky if:


\text{CV} > \text{threshold}

Where the default threshold is 20% (coefficient of variation > 20%).

Interpretation:

  • CV < 10%: Excellent reproducibility
  • 10% ≤ CV < 20%: Good reproducibility
  • 20% ≤ CV < 50%: Moderate variance (flagged as potentially flaky)
  • CV ≥ 50%: High variance (likely flaky, investigate)

Variance Metrics Reported

For each metric (e.g., search_p50_ms, qps), we report:

  • {metric}_mean: Mean across repetitions
  • {metric}_std: Standard deviation
  • {metric}_min: Minimum value
  • {metric}_max: Maximum value
  • {metric}_ci_lower: Lower bound of 95% confidence interval
  • {metric}_ci_upper: Upper bound of 95% confidence interval
  • {metric}_cv: Coefficient of variation (%)

Example

For a benchmark with 5 repetitions producing search P50 latencies:


\{15.2, 15.8, 14.9, 16.1, 15.5\} \text{ ms}

Results:

  • Mean: \bar{x} = 15.5 ms
  • Std: s \approx 0.47 ms
  • CV: \frac{0.47}{15.5} \times 100\% \approx 3.1\% (excellent)
  • 95% CI: 15.5 \pm 0.59 ms → [14.91, 16.09] ms

Implementation

These statistical calculations are implemented in llmds.utils:

  • compute_percentiles(values): Computes P50, P95, P99 percentiles
  • calculate_statistics(values, confidence_level=0.95): Computes comprehensive statistics including mean, std, percentiles, confidence intervals, and coefficient of variation

Usage:

from llmds.utils import compute_percentiles, calculate_statistics

latencies = [15.2, 15.8, 14.9, 16.1, 15.5]

# Quick percentiles
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f} ms")

# Full statistics
stats = calculate_statistics(latencies)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f} ms")
print(f"CV: {stats['cv']:.2f}%")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}] ms")

References

  1. BM25: Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.

  2. HNSW: Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.

  3. Count-Min Sketch: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.

  4. Score Fusion: Cormack, G. V., Clarke, C. L., & Büttcher, S. (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. Proceedings of the 32nd international ACM SIGIR conference.


Last Updated: 2025-01-01