Mathematical Models
This document describes the mathematical formulations and algorithms used throughout the LLM RAG Data Structures Optimizer.
Table of Contents
- BM25 Ranking Function
- HNSW Distance Metrics
- Count-Min Sketch Error Bounds
- Score Fusion
- KV Cache Memory Calculation
- Token-Aware LRU Eviction
- Admission Control Rate Limiting
BM25 Ranking Function
BM25 (Best Matching 25) is a probabilistic ranking function used for information retrieval. It scores documents based on term frequency and inverse document frequency.
Formula
For a query Q = \{q_1, q_2, \ldots, q_n\} and document D, the BM25 score is:
\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}
Where:
- f(q_i, D) = frequency of term q_i in document D
- |D| = length of document D (number of terms)
- \text{avgdl} = average document length in the collection
- k_1 = term frequency saturation parameter (typically 1.2-2.0)
- b = length normalization parameter (typically 0.75)
Inverse Document Frequency (IDF)
\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
Where:
- N = total number of documents in the collection
- n(q_i) = number of documents containing term q_i
The 0.5 smoothing factor prevents division by zero and handles terms that appear in all documents.
Implementation Defaults
In our implementation:
- k_1 = 1.5 (default)
- b = 0.75 (default)
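The scoring formula above can be sketched directly in pure Python. This is an illustrative reference implementation using the document's defaults, not the optimizer's own code; the function name and tokenized-list input format are assumptions for the example:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)        # n(q_i)
        idf = math.log((n - df + 0.5) / (df + 0.5))     # smoothed IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["fast", "vector", "search"],
          ["sparse", "keyword", "search"],
          ["cache", "eviction"]]
print(bm25_score(["vector", "search"], corpus[0], corpus))
```

Note that with this classic IDF, a term appearing in more than half the collection gets a negative IDF; production systems often clamp it at zero.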
HNSW Distance Metrics
Hierarchical Navigable Small World (HNSW) uses distance metrics to measure similarity between vectors. The default distance metric is L2 (Euclidean) distance.
L2 Distance (Euclidean)
For vectors \vec{u} = (u_1, u_2, \ldots, u_d) and \vec{v} = (v_1, v_2, \ldots, v_d):
d_{\text{L2}}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}
In practice, we often use squared L2 distance for efficiency (monotonic with L2):
d_{\text{L2}}^2(\vec{u}, \vec{v}) = \sum_{i=1}^{d} (u_i - v_i)^2
Cosine Similarity (Alternative)
For normalized vectors, cosine similarity is often preferred:
\text{cosine}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{||\vec{u}|| \cdot ||\vec{v}||} = \frac{\sum_{i=1}^{d} u_i \cdot v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \cdot \sqrt{\sum_{i=1}^{d} v_i^2}}
For normalized vectors where ||\vec{u}|| = ||\vec{v}|| = 1:
\text{cosine}(\vec{u}, \vec{v}) = \vec{u} \cdot \vec{v} = \sum_{i=1}^{d} u_i \cdot v_i
Note: Cosine similarity can be converted to distance: d_{\text{cosine}} = 1 - \text{cosine}(\vec{u}, \vec{v})
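Both metrics above can be written in a few lines of pure Python; this is a minimal sketch for illustration (a real index would use vectorized NumPy operations):

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine similarity; reduces to a plain dot product for unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u, v = [1.0, 0.0], [0.0, 1.0]
print(l2_distance(u, v))            # sqrt(2) for orthogonal unit vectors
print(1 - cosine_similarity(u, v))  # cosine distance = 1.0
```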
HNSW Graph Properties
The HNSW graph has logarithmic search complexity:
- Search complexity: O(\log N), where N is the number of vectors
- Construction complexity: O(N \log N)
- Memory complexity: O(N \cdot M), where M is the maximum number of connections per node
Return Format: The search() and _search_layer() methods return results as (node_id, distance) tuples, where:
- node_id: integer identifier of the vector in the index
- distance: float representing the L2 distance from the query vector
Count-Min Sketch Error Bounds
Count-Min Sketch is a probabilistic data structure for frequency estimation with bounded error.
Structure
A Count-Min Sketch has width w and depth d, creating a d \times w table of counters.
Update Operation
For item x with count c, update all d rows:
\text{CM}[i][h_i(x)] \leftarrow \text{CM}[i][h_i(x)] + c, \quad \forall i \in \{1, 2, \ldots, d\}
Where h_i(x) is a hash function for row i.
Estimation
The estimated frequency is the minimum across all rows:
\hat{f}(x) = \min_{i \in \{1, \ldots, d\}} \text{CM}[i][h_i(x)]
Error Bound
With probability at least 1 - \delta, the error is bounded by:
\hat{f}(x) - f(x) \leq \epsilon \cdot ||\mathbf{f}||_1
Where:
- f(x) = true frequency of x
- ||\mathbf{f}||_1 = total count of all items (L1 norm)
- \epsilon = \frac{e}{w} (where e \approx 2.71828)
- \delta = \left(\frac{1}{2}\right)^d
Parameter Selection
To achieve error bound \epsilon with probability 1 - \delta:
w = \left\lceil \frac{e}{\epsilon} \right\rceil
d = \left\lceil \ln \frac{1}{\delta} \right\rceil
Default parameters in our implementation:
- w = 2048 → \epsilon \approx 0.0013
- d = 4 → \delta = 0.0625 (6.25% error probability)
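The update and estimation rules above fit in a short class. This is a minimal sketch (a seeded `hash`-based row hash stands in for proper pairwise-independent hash functions), not the optimizer's implementation:

```python
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters, estimate = row minimum."""
    def __init__(self, width=2048, depth=4, seed=0):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, i, x):
        # Illustrative row hash; real sketches use pairwise-independent hashes.
        return hash((self.seeds[i], x)) % self.width

    def update(self, x, count=1):
        for i in range(self.depth):
            self.table[i][self._hash(i, x)] += count

    def estimate(self, x):
        # Minimum over rows; never underestimates, may overestimate.
        return min(self.table[i][self._hash(i, x)] for i in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.update("query")
print(cms.estimate("query"))  # 5 (exact here: no other items, so no collisions)
```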
Score Fusion
Hybrid search combines scores from multiple retrieval methods (dense vectors and sparse keywords).
Weighted Linear Combination
S_{\text{fused}}(d, q) = \alpha \cdot S_{\text{dense}}(d, q) + \beta \cdot S_{\text{sparse}}(d, q)
Where:
- S_{\text{dense}}(d, q) = normalized vector similarity score
- S_{\text{sparse}}(d, q) = normalized BM25 score
- \alpha + \beta = 1 (typically \alpha = 0.7, \beta = 0.3)
Score Normalization
Before fusion, scores are normalized to [0, 1] range:
S_{\text{norm}}(d, q) = \frac{S(d, q) - S_{\min}}{S_{\max} - S_{\min}}
Where S_{\min} and S_{\max} are the minimum and maximum scores in the candidate set.
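Min-max normalization followed by the weighted combination can be sketched as follows; the function names and dict-based score representation are illustrative assumptions, not the library's API:

```python
def min_max_normalize(scores):
    """Scale a dict of scores into [0, 1]; a constant set maps to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(dense, sparse, alpha=0.7, beta=0.3):
    """Weighted linear fusion of normalized dense and sparse scores."""
    dense, sparse = min_max_normalize(dense), min_max_normalize(sparse)
    docs = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + beta * sparse.get(d, 0.0)
            for d in docs}

dense = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
sparse = {"d1": 2.0, "d2": 8.0, "d3": 5.0}
print(sorted(fuse(dense, sparse).items(), key=lambda kv: -kv[1]))
```

Here d1 wins despite its weak BM25 score because \alpha = 0.7 favors the dense channel.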
Reciprocal Rank Fusion (Alternative)
S_{\text{RRF}}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
Where:
- R = set of ranked lists to fuse
- \text{rank}_r(d) = rank of document d in ranked list r
- k = smoothing parameter (typically 60)
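RRF needs only the rank positions, not the raw scores, which makes it robust to scale differences between retrievers. A minimal sketch (function name and list-of-lists input are assumptions for the example):

```python
def rrf(ranked_lists, k=60):
    """Reciprocal rank fusion over ranked lists of doc ids (ranks are 1-based)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

dense_ranking = ["d1", "d2", "d3"]
sparse_ranking = ["d2", "d3", "d1"]
scores = rrf([dense_ranking, sparse_ranking])
print(max(scores, key=scores.get))  # d2: ranked 2nd and 1st, the best combined
```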
KV Cache Memory Calculation
The KV cache memory usage depends on the number of cached tokens and the model dimensions.
Per-Sequence Memory
For a sequence with T tokens and model with hidden dimension d:
M_{\text{sequence}} = 2 \cdot T \cdot d \cdot \text{bytes\_per\_element}
Where:
- Factor of 2 accounts for both key and value tensors
- \text{bytes\_per\_element} = 4 for float32, 2 for float16
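As a worked check of the per-sequence formula, with illustrative dimensions (T = 1024 tokens, d = 4096, float16; note this treats d as the total per-token KV width, so a multi-layer transformer would additionally multiply by the layer count unless d already absorbs it):

```python
def kv_bytes(tokens, hidden_dim, bytes_per_element):
    """Per-sequence KV cache size: 2 * T * d * bytes_per_element (K and V)."""
    return 2 * tokens * hidden_dim * bytes_per_element

# 1024 tokens, hidden dimension 4096, float16 (2 bytes per element)
print(kv_bytes(1024, 4096, 2) / 2**20, "MiB")  # 16.0 MiB
```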
Paged Allocation
With per-page size P and page capacity C tokens per page:
M_{\text{paged}} = \left\lceil \frac{T}{C} \right\rceil \cdot P \cdot \text{page\_overhead}
Where \text{page\_overhead} includes page metadata.
Prefix Sharing Memory Savings
If N sequences share a prefix of length L:
M_{\text{shared}} = L \cdot d \cdot \text{bytes\_per\_element}
M_{\text{without\_sharing}} = N \cdot L \cdot d \cdot \text{bytes\_per\_element}
Memory savings:
\text{Savings} = (N - 1) \cdot L \cdot d \cdot \text{bytes\_per\_element}
Savings ratio:
\text{Savings Ratio} = \frac{N - 1}{N} = 1 - \frac{1}{N}
For large N, this approaches 100% savings on shared prefixes.
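A quick numeric check of the savings formula as stated above (which, unlike the per-sequence formula, does not include the key/value factor of 2), with illustrative values of N = 8 sequences sharing a 512-token prefix at d = 4096 in float16:

```python
def prefix_sharing_savings(n_sequences, prefix_len, hidden_dim, bytes_per_element):
    """(N - 1) * L * d * bytes_per_element saved by storing the prefix once."""
    per_copy = prefix_len * hidden_dim * bytes_per_element
    return (n_sequences - 1) * per_copy

saved = prefix_sharing_savings(8, 512, 4096, 2)
print(saved / 2**20, "MiB saved; savings ratio:", 1 - 1 / 8)  # 28.0 MiB; 0.875
```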
Copy-on-Write Overhead
With copy-on-write (COW), if K sequences modify shared pages:
M_{\text{with\_cow}} = L_{\text{shared}} \cdot d \cdot \text{bytes\_per\_element} + K \cdot L_{\text{modified}} \cdot d \cdot \text{bytes\_per\_element}
Where:
- L_{\text{shared}} = length of shared (unmodified) prefix pages
- L_{\text{modified}} = length of modified prefix pages (copied)
COW Efficiency:
- If K = 0 (no sequences modify shared pages): maximum savings, shared pages are stored once
- If K = N (all sequences modify): no savings, each sequence has its own copy
- Typical case (K < N): partial savings, proportional to the modification rate
Reference Counting:
Shared pages are freed when reference count r reaches zero:
r = \sum_{i=1}^{N} \mathbf{1}_{\text{seq}_i \text{ references page}}
Where \mathbf{1} is the indicator function (1 if sequence references page, 0 otherwise).
Token-Aware LRU Eviction
Token-aware LRU maintains a cumulative token budget while evicting least recently used items.
Eviction Criterion
Evict item i with minimum value of:
\text{priority}(i) = \frac{\text{access\_count}(i)}{\text{token\_count}(i)}
Or use recency-weighted:
\text{priority}(i) = \frac{\text{last\_access\_time}(i)}{\text{token\_count}(i)}
Token Budget Constraint
Maintain total tokens below budget B:
\sum_{i \in \text{cache}} \text{token\_count}(i) \leq B
When adding item j with t_j tokens:
- If \sum_{i} t_i + t_j \leq B: add the item
- Else: evict items until \sum_{i} t_i + t_j \leq B holds
Eviction Algorithm
```
while total_tokens + new_tokens > budget:
    item = item_with_min_priority()
    total_tokens -= token_count(item)
    evict(item)
```
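A complete, runnable version of this policy using pure recency as the priority (an `OrderedDict` keeps entries in access order, so the oldest entry is always the eviction candidate); class and method names here are illustrative, not the library's API:

```python
from collections import OrderedDict

class TokenAwareLRU:
    """LRU cache that evicts by recency until total tokens fit the budget."""
    def __init__(self, budget):
        self.budget = budget
        self.items = OrderedDict()   # key -> token_count, oldest first
        self.total_tokens = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return key
        return None

    def put(self, key, token_count):
        if key in self.items:
            self.total_tokens -= self.items.pop(key)
        # Evict least recently used entries until the new item fits.
        while self.items and self.total_tokens + token_count > self.budget:
            _, evicted_tokens = self.items.popitem(last=False)
            self.total_tokens -= evicted_tokens
        if token_count <= self.budget:
            self.items[key] = token_count
            self.total_tokens += token_count

cache = TokenAwareLRU(budget=100)
cache.put("a", 40)
cache.put("b", 40)
cache.put("c", 40)  # evicts "a" to stay within the 100-token budget
print(list(cache.items))  # ['b', 'c']
```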
Admission Control Rate Limiting
The admission controller uses an exponentially weighted moving average (EWMA) to track request rate.
Moving Average Update
\bar{r}_{t} = \alpha \cdot r_t + (1 - \alpha) \cdot \bar{r}_{t-1}
Where:
- r_t = current request rate
- \bar{r}_t = smoothed average rate
- \alpha = smoothing factor (typically 0.1-0.3)
Admission Decision
Admit request if:
\bar{r}_t + \text{margin} \leq R_{\max}
Where:
- R_{\max} = maximum allowed rate (QPS limit)
- \text{margin} = safety margin to account for burstiness
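The EWMA update and admission rule together fit in a small class. This is a sketch under assumed names (the optimizer's actual controller interface may differ):

```python
class EwmaAdmissionController:
    """Tracks request rate with an EWMA; admits while avg + margin <= R_max."""
    def __init__(self, r_max, alpha=0.2, margin=0.0):
        self.r_max, self.alpha, self.margin = r_max, alpha, margin
        self.avg_rate = 0.0

    def observe(self, current_rate):
        # r_bar_t = alpha * r_t + (1 - alpha) * r_bar_{t-1}
        self.avg_rate = self.alpha * current_rate + (1 - self.alpha) * self.avg_rate

    def admit(self):
        return self.avg_rate + self.margin <= self.r_max

ctrl = EwmaAdmissionController(r_max=100.0, alpha=0.2, margin=5.0)
ctrl.observe(50.0)
print(ctrl.admit())   # True: smoothed rate (10.0) is well under the limit
ctrl.observe(1000.0)
print(ctrl.admit())   # False: the burst pushes the average to 208.0
```

Because the EWMA decays old observations geometrically, a single burst raises the average sharply but the controller recovers once the rate drops.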
Token Bucket (Alternative)
The token bucket algorithm allows bursty traffic:
\text{tokens}(t) = \min(B, \text{tokens}(t-1) + R \cdot \Delta t)
Where:
- B = bucket capacity (burst limit)
- R = token refill rate (sustainable rate)
- \Delta t = time since last update
Request is admitted if \text{tokens}(t) \geq 1, then \text{tokens}(t) \leftarrow \text{tokens}(t) - 1.
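The refill-and-spend cycle above can be sketched as a small class; timestamps are passed in explicitly to keep the example deterministic (a real limiter would call a monotonic clock):

```python
class TokenBucket:
    """Token bucket: refill at rate R up to capacity B; admit if >= 1 token."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full: full burst allowance
        self.last_time = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        dt = now - self.last_time
        self.last_time = now
        self.tokens = min(self.capacity, self.tokens + self.refill_rate * dt)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=2, refill_rate=1.0)  # 1 token/s, burst of 2
print([bucket.allow(t) for t in (0.0, 0.0, 0.0)])  # [True, True, False]
print(bucket.allow(1.0))  # True again after a 1-second refill
```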
Indexed Binary Heap
The indexed heap supports O(\log n) priority updates via decrease-key and increase-key operations, and supports both min-heap and max-heap modes.
Heap Property
Min-heap: parent_score ≤ child_score
Max-heap: parent_score ≥ child_score
Decrease/Increase Key Operations
Min-heap:
- decrease_key (new_score < old_score): bubbles UP (score decreases → higher priority)
- increase_key (new_score > old_score): bubbles DOWN (score increases → lower priority)
Max-heap:
- decrease_key (new_score < old_score): bubbles DOWN (score decreases → lower priority). Fixed in v0.1.0
- increase_key (new_score > old_score): bubbles UP (score increases → higher priority). Fixed in v0.1.0
Complexity
- Push: O(log n)
- Pop: O(log n)
- Decrease/Increase key: O(log n)
- Delete: O(log n)
Heap Properties
For a min-heap with n elements:
- Parent of node i: \lfloor (i-1)/2 \rfloor
- Left child of node i: 2i + 1
- Right child of node i: 2i + 2
Heap Invariant
\text{priority}(\text{parent}(i)) \leq \text{priority}(i), \quad \forall i > 0
Complexity
- Insert: O(\log n)
- Extract min: O(\log n)
- Update key: O(\log n) (with index mapping)
- Decrease key: O(\log n)
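The index mapping is what distinguishes this structure from `heapq`: a key-to-position dict lets update_key find an arbitrary element in O(1) before sifting. A minimal min-heap sketch (names illustrative, not the library's API):

```python
class IndexedMinHeap:
    """Binary min-heap with a position map enabling O(log n) key updates."""
    def __init__(self):
        self.heap = []   # list of (score, key)
        self.pos = {}    # key -> index in self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] >= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest, l, r = i, 2 * i + 1, 2 * i + 2
            if l < n and self.heap[l][0] < self.heap[smallest][0]:
                smallest = l
            if r < n and self.heap[r][0] < self.heap[smallest][0]:
                smallest = r
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, key, score):
        self.heap.append((score, key))
        self.pos[key] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def pop(self):
        score, key = self.heap[0]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.pos[key]
        if self.heap:
            self._sift_down(0)
        return key, score

    def update_key(self, key, new_score):
        i = self.pos[key]
        old_score = self.heap[i][0]
        self.heap[i] = (new_score, key)
        if new_score < old_score:
            self._sift_up(i)      # decrease-key bubbles UP in a min-heap
        else:
            self._sift_down(i)    # increase-key bubbles DOWN

h = IndexedMinHeap()
h.push("a", 3.0); h.push("b", 1.0); h.push("c", 2.0)
h.update_key("a", 0.5)   # decrease-key: "a" becomes the minimum
print(h.pop())  # ('a', 0.5)
```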
Variance Analysis and Statistical Confidence
Benchmark results include variance analysis to assess measurement reliability and identify flaky configurations.
Statistical Summary
For a set of n measurements \{x_1, x_2, \ldots, x_n\}:
Mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Standard Deviation (Sample):
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
Coefficient of Variation:
\text{CV} = \frac{s}{\bar{x}} \times 100\%
The CV expresses relative variability as a percentage, making it easier to compare variance across different metrics and scales.
Confidence Intervals
For small samples (n < 30), we use the t-distribution for confidence intervals:
\text{CI} = \bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
Where:
- t_{\alpha/2, n-1} = t-critical value for significance level \alpha and n-1 degrees of freedom
- For 95% confidence: \alpha = 0.05, so the critical value is t_{0.025, n-1}
For large samples (n \geq 30), we approximate with the normal distribution:
\text{CI} = \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}
Where z_{\alpha/2} = 1.96 for 95% confidence.
Flaky Benchmark Detection
A benchmark configuration is considered flaky if:
\text{CV} > \text{threshold}
Where the default threshold is 20% (coefficient of variation > 20%).
Interpretation:
- CV < 10%: Excellent reproducibility
- 10% ≤ CV < 20%: Good reproducibility
- 20% ≤ CV < 50%: Moderate variance (flagged as potentially flaky)
- CV ≥ 50%: High variance (likely flaky, investigate)
Variance Metrics Reported
For each metric (e.g., search_p50_ms, qps), we report:
- {metric}_mean: mean across repetitions
- {metric}_std: standard deviation
- {metric}_min: minimum value
- {metric}_max: maximum value
- {metric}_ci_lower: lower bound of the 95% confidence interval
- {metric}_ci_upper: upper bound of the 95% confidence interval
- {metric}_cv: coefficient of variation (%)
Example
For a benchmark with 5 repetitions producing search P50 latencies:
\{15.2, 15.8, 14.9, 16.1, 15.5\} \text{ ms}
Results:
- Mean:
\bar{x} = 15.5ms - Std:
s = 0.44ms - CV:
\frac{0.44}{15.5} \times 100\% = 2.8\%(excellent) - 95% CI:
15.5 \pm 0.59ms → [14.91, 16.09] ms
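These numbers can be recomputed with the standard library alone; the t-critical value t_{0.025, 4} = 2.776 is hardcoded here from a t-table (a statistics package such as SciPy would normally supply it):

```python
import math
import statistics

latencies = [15.2, 15.8, 14.9, 16.1, 15.5]
n = len(latencies)
mean = statistics.mean(latencies)
std = statistics.stdev(latencies)   # sample standard deviation (n - 1 divisor)
cv = std / mean * 100
t_crit = 2.776                      # t_{0.025, 4}, from a t-table
half_width = t_crit * std / math.sqrt(n)
print(f"mean={mean:.2f} std={std:.2f} cv={cv:.1f}% ci=±{half_width:.2f}")
```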
Implementation
These statistical calculations are implemented in llmds.utils:
- compute_percentiles(values): computes P50, P95, P99 percentiles
- calculate_statistics(values, confidence_level=0.95): computes comprehensive statistics including mean, std, percentiles, confidence intervals, and coefficient of variation
Usage:
```python
from llmds.utils import compute_percentiles, calculate_statistics

latencies = [15.2, 15.8, 14.9, 16.1, 15.5]

# Quick percentiles
percentiles = compute_percentiles(latencies)
print(f"P50: {percentiles['p50']:.2f} ms")

# Full statistics
stats = calculate_statistics(latencies)
print(f"Mean: {stats['mean']:.2f} ± {stats['std']:.2f} ms")
print(f"CV: {stats['cv']:.2f}%")
print(f"95% CI: [{stats['ci_lower']:.2f}, {stats['ci_upper']:.2f}] ms")
```
References
- BM25: Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
- HNSW: Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.
- Count-Min Sketch: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
- Score Fusion: Cormack, G. V., Clarke, C. L., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proceedings of the 32nd International ACM SIGIR Conference.
Last Updated: 2025-01-01