# SheepOp LLM - Complete Mathematical Formulation

Complete mathematical derivation and step-by-step solutions for every component of the SheepOp Language Model.

## Table of Contents

1. [Data Processing and Tokenization](#1-data-processing-and-tokenization)
2. [Token Embedding](#2-token-embedding)
3. [Positional Encoding](#3-positional-encoding)
4. [Multi-Head Self-Attention](#4-multi-head-self-attention)
5. [Feed-Forward Network](#5-feed-forward-network)
6. [Layer Normalization](#6-layer-normalization)
7. [Transformer Block](#7-transformer-block)
8. [Complete Forward Pass](#8-complete-forward-pass)
9. [Loss Computation](#9-loss-computation)
10. [Backpropagation](#10-backpropagation)
11. [AdamW Optimizer Update](#11-adamw-optimizer-update)
12. [Learning Rate Scheduling](#12-learning-rate-scheduling)
13. [Text Generation](#13-text-generation)

---

## 1. Data Processing and Tokenization

### 1.1 Text Extraction

Given a text file with lines, we extract text samples:

**Input:** Raw text files, PDFs, images, code files

**Output:** List of text strings $S = \{s_1, s_2, \ldots, s_N\}$ where each $s_i$ is a text line

**Example:**
```
Input:  "Hello world\nMachine learning is cool."
Output: S = ["Hello world", "Machine learning is cool."]
```

### 1.2 Character-Level Tokenization

**Vocabulary Construction:**

For character-level tokenization, we create a vocabulary $V$ mapping characters to token IDs. The lowest IDs are reserved for special tokens (padding, unknown, and begin/end-of-sequence markers):

```math
V = \{(\text{<pad>}, 0), (\text{<unk>}, 1), (\text{<bos>}, 2), (\text{<eos>}, 3), (\text{space}, 4), (\text{!}, 5), \ldots, (\text{z}, 129)\}
```

Or more formally:

```math
V: \mathcal{C} \rightarrow \mathbb{N}, \quad V(c) = \begin{cases} 0 & \text{if } c = \text{<pad>} \\ 1 & \text{if } c = \text{<unk>} \\ 2 & \text{if } c = \text{<bos>} \\ 3 & \text{if } c = \text{<eos>} \\ 4 & \text{if } c = \text{space} \\ \vdots & \\ 129 & \text{if } c = \text{z} \end{cases}
```

where $\mathcal{C}$ is the set of all characters in the vocabulary.

**Encoding Function:**

For a text string $s = c_1 c_2 \ldots c_n$ where $c_i$ are characters:

```math
\text{encode}(s) = [V[c_1], V[c_2], \ldots, V[c_n]]
```

**Example:**
```
Input: "Hi"
s = ['H', 'i']
V = {'H': 72, 'i': 105}   # ASCII values

encode("Hi") = [72, 105]
```

**Decoding Function:**

```math
\text{decode}([t_1, t_2, \ldots, t_n]) = V^{-1}[t_1] \cdot V^{-1}[t_2] \cdot \ldots \cdot V^{-1}[t_n]
```

where $V^{-1}$ is the inverse mapping from token IDs to characters.

### 1.3 Sequence Chunking

For a token sequence $T = [t_1, t_2, \ldots, t_L]$ and maximum length $M$:

**Chunking:**

```math
\text{chunks} = \left\{ [t_{iS+1}, t_{iS+2}, \ldots, t_{\min(iS+M,\, L)}] \;:\; i \in \{0, 1, \ldots, \lceil L/S \rceil - 1\} \right\}
```

where $S$ is the stride (default $S = M$); the last chunk may be shorter than $M$ and is padded.

**Padding:**

For a chunk $C$ with length $|C| < M$:

```math
\text{padded}(C) = C \oplus [\text{pad\_token}]^{(M - |C|)}
```

**Example:**
```
M = 5, S = 5
T = [72, 105, 44, 32, 119, 111, 114, 108, 100]

Chunk 1: [72, 105, 44, 32, 119]
Chunk 2: [111, 114, 108, 100, <pad>]
```
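The tokenizer and chunker above can be sketched in a few lines of Python. This is an illustrative sketch with a toy ASCII vocabulary and a single padding token, not the SheepOp tokenizer itself:

```python
# Minimal character-level tokenizer and chunker (illustrative sketch, not the SheepOp API).
import string

PAD_ID = 0
chars = [" "] + list(string.printable.strip())        # toy character set
vocab = {c: i + 1 for i, c in enumerate(chars)}       # reserve ID 0 for the pad token
inv_vocab = {i: c for c, i in vocab.items()}

def encode(text):
    return [vocab[c] for c in text]

def decode(ids):
    return "".join(inv_vocab[i] for i in ids if i != PAD_ID)

def chunk(tokens, max_len, stride=None):
    stride = stride or max_len
    chunks = []
    for start in range(0, len(tokens), stride):
        piece = tokens[start:start + max_len]
        piece = piece + [PAD_ID] * (max_len - len(piece))   # right-pad the last chunk
        chunks.append(piece)
    return chunks

ids = encode("Hi, world")
print(chunk(ids, max_len=5))   # two chunks of length 5, the second one padded
```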
---

## 2. Token Embedding

### 2.1 Embedding Matrix

We have an embedding matrix $E \in \mathbb{R}^{V \times d}$ where:

- $V$ = vocabulary size
- $d$ = embedding dimension (d_model)

### 2.2 Embedding Lookup

For input token IDs $\mathbf{t} = [t_1, t_2, \ldots, t_n]$:

```math
\mathbf{X} = E[\mathbf{t}] = \begin{bmatrix} E[t_1] \\ E[t_2] \\ \vdots \\ E[t_n] \end{bmatrix} \in \mathbb{R}^{n \times d}
```

**Example:**
```
V = 128, d = 512
t = [72, 105]

E[72]  = [0.1, -0.2, ..., 0.05]   (512-dim vector)
E[105] = [-0.1, 0.3, ..., 0.02]   (512-dim vector)

X = [[0.1, -0.2, ..., 0.05],
     [-0.1, 0.3, ..., 0.02]]
```

**Batch Processing:**

For batch size $B$:

```math
\mathbf{X} = E[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
```

where $\mathbf{T} \in \mathbb{N}^{B \times n}$ is the batch of token IDs.

---

## 3. Positional Encoding

### 3.1 Sinusoidal Positional Encoding

For position $pos$ and dimension $i$:

```math
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
```

```math
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
```

**Origin of the 10000 Constant:**

The constant $10000$ is a **hyperparameter** introduced in the original "Attention Is All You Need" paper (Vaswani et al., 2017). This value controls the **frequency** (or wavelength) of the sinusoidal functions used for positional encoding.

**What 10000 Controls:**

The term $10000^{2i/d}$ creates a **geometric progression** of frequencies across different dimensions:

- **Lower dimensions** (small $i$): higher frequencies (faster oscillation)
- **Higher dimensions** (large $i$): lower frequencies (slower oscillation)

**Mathematical Interpretation:**

The wavelength $\lambda_i$ for dimension pair $(2i, 2i+1)$ is:

```math
\lambda_i = 2\pi \cdot 10000^{2i/d}
```

This means:

- When $i = 0$: $\lambda_0 = 2\pi \cdot 10000^{0} = 2\pi \approx 6.28$ (short wavelength)
- When $i = d/2 - 1$: $\lambda_{d/2-1} = 2\pi \cdot 10000^{(d-2)/d} \approx 2\pi \cdot 10000$ (long wavelength)

**Why 10000?**

1. **Scale balance**: it provides a good compromise between:
   - Being large enough to create distinguishable patterns across positions
   - Being small enough to prevent numerical issues
2. **Empirical choice**: the authors found this value works well for typical sequence lengths (up to ~5000 tokens)
3. **Frequency range**: for $d = 512$:
   - Lowest angular frequency: $\frac{1}{10000^{512/512}} = \frac{1}{10000} = 0.0001$ radians per position
   - Highest angular frequency: $\frac{1}{10000^{0/512}} = 1$ radian per position
   - This wide range allows the model to capture both local and long-range positional patterns

**What Happens if We Change It?**

- **Smaller values** (e.g., 100): higher frequencies overall → better for short sequences, but may cause aliasing for long sequences
- **Larger values** (e.g., 100000): lower frequencies overall → better for very long sequences, but may lose fine-grained positional information
- **Different values** are sometimes used: some models use 10000, others use 5000 or 20000 depending on their typical sequence lengths

**Example Frequency Analysis:**

For $d = 512$:

```
i = 0:   10000^(0/512)   = 1.0    → wavelength ≈ 6.28 positions
i = 64:  10000^(128/512) = 10     → wavelength ≈ 62.8 positions
i = 128: 10000^(256/512) = 100    → wavelength ≈ 628 positions
i = 256: 10000^(512/512) = 10000  → wavelength ≈ 62,832 positions (limiting value)
```

This creates a **multi-scale representation** where different dimensions encode positional information at different resolutions.
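The table above can be reproduced with a few lines of Python; this is just a numerical check of the wavelength formula, with the frequency expressed in radians per position (the $i = 256$ row is the limiting value quoted above):

```python
# Sketch: angular frequencies and wavelengths of the sinusoidal positional encoding (d = 512).
import math

d = 512
for i in [0, 64, 128, 256]:                   # dimension pairs (2i, 2i+1); 256 is the limiting value
    omega = 1.0 / (10000 ** (2 * i / d))      # angular frequency (radians per position)
    wavelength = 2 * math.pi / omega          # positions per full cycle
    print(f"i={i:3d}  omega={omega:.6f}  wavelength ≈ {wavelength:,.1f} positions")
```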
**Simplified Form:**

```math
PE_{(pos, 2i)} = \sin\left(pos \cdot \exp\left(-\frac{2i \log(10000)}{d}\right)\right)
```

```math
PE_{(pos, 2i+1)} = \cos\left(pos \cdot \exp\left(-\frac{2i \log(10000)}{d}\right)\right)
```

### 3.2 Positional Encoding Matrix

For sequence length $n$ and model dimension $d$:

```math
PE = \begin{bmatrix} PE_{(0,0)} & PE_{(0,1)} & \cdots & PE_{(0,d-1)} \\ PE_{(1,0)} & PE_{(1,1)} & \cdots & PE_{(1,d-1)} \\ \vdots & \vdots & \ddots & \vdots \\ PE_{(n-1,0)} & PE_{(n-1,1)} & \cdots & PE_{(n-1,d-1)} \end{bmatrix} \in \mathbb{R}^{n \times d}
```

### 3.3 Adding Positional Encoding

```math
\mathbf{X}' = \mathbf{X} + PE
```

**Example Calculation:**
```
d = 512

pos = 0, i = 0:
  PE(0,0) = sin(0 / 10000^(0/512)) = sin(0) = 0
  PE(0,1) = cos(0 / 10000^(0/512)) = cos(0) = 1

pos = 0, i = 1:
  PE(0,2) = sin(0 / 10000^(2/512)) = sin(0) = 0
  PE(0,3) = cos(0 / 10000^(2/512)) = cos(0) = 1

pos = 1, i = 0:
  PE(1,0) = sin(1 / 10000^(0/512)) = sin(1) ≈ 0.8415
  PE(1,1) = cos(1 / 10000^(0/512)) = cos(1) ≈ 0.5403
```

### 3.4 Dropout Application

```math
\mathbf{X}'' = \text{Dropout}(\mathbf{X}', p)
```

where $p$ is the dropout probability (typically 0.1).

---

## 4. Multi-Head Self-Attention

### 4.1 Query, Key, Value Projections

For input $\mathbf{X} \in \mathbb{R}^{B \times n \times d}$:

```math
\mathbf{Q} = \mathbf{X} W_Q, \quad \mathbf{K} = \mathbf{X} W_K, \quad \mathbf{V} = \mathbf{X} W_V
```

where:

- $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable weight matrices
- $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{B \times n \times d}$

**Example:**
```
B = 2, n = 5, d = 512
X shape: [2, 5, 512]
W_Q shape: [512, 512]
Q = X @ W_Q → [2, 5, 512]
```

### 4.2 Multi-Head Splitting

For $h$ heads:

```math
d_k = \frac{d}{h}
```

```math
\mathbf{Q}_i = \mathbf{Q}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
```

```math
\mathbf{K}_i = \mathbf{K}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
```

```math
\mathbf{V}_i = \mathbf{V}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
```

**Reshaping:** stacking the $h$ head slices gives

```math
\mathbf{Q} \in \mathbb{R}^{B \times h \times n \times d_k}
```

and likewise for $\mathbf{K}$ and $\mathbf{V}$.

**Example:**
```
d = 512, h = 8, d_k = 64
Q shape: [2, 5, 512]
After reshape: [2, 8, 5, 64]
```

### 4.3 Scaled Dot-Product Attention

**Attention Scores:**

```math
\mathbf{S} = \frac{\mathbf{Q}_i \mathbf{K}_i^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times h \times n \times n}
```

**Example Calculation:**
```
For head i, one example:
Q_i[0,0] = [0.1, -0.2, 0.3, ..., 0.05]   (64-dim)
K_i[0,0] = [0.2, 0.1, -0.1, ..., 0.1]    (64-dim)

Dot product: Q_i[0,0] · K_i[0,0] = 0.1×0.2 + (-0.2)×0.1 + ... = 0.15
Scale: 0.15 / √64 = 0.15 / 8 = 0.01875

Score matrix: S[a,b] = Q_i[a] · K_i[b] / √d_k
```
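Sections 4.1–4.3 can be sketched in PyTorch as follows. This is an illustrative shape check with random weights and inputs, not the SheepOp attention module:

```python
# Sketch: Q/K/V projection, head splitting, and scaled dot-product scores (shapes as in the text).
import torch
import torch.nn as nn

B, n, d, h = 2, 5, 512, 8
d_k = d // h

x = torch.randn(B, n, d)
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

def split_heads(t):                              # [B, n, d] -> [B, h, n, d_k]
    return t.view(B, n, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # [B, h, n, n]
print(scores.shape)                              # torch.Size([2, 8, 5, 5])
```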
### 4.4 Causal Masking

For causal (autoregressive) attention, the mask contains $0$ on and below the diagonal and $-\infty$ above it:

```math
M_{causal} = \begin{bmatrix} 0 & -\infty & -\infty & \cdots \\ 0 & 0 & -\infty & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}
```

```math
\mathbf{S}_{masked} = \mathbf{S} + M_{causal}
```

**Example:**
```
n = 3
M_causal = [[0, -inf, -inf],
            [0, 0, -inf],
            [0, 0, 0]]

S = [[0.2, 0.1, 0.3],
     [0.1, 0.4, 0.2],
     [0.3, 0.2, 0.5]]

S_masked = [[0.2, -inf, -inf],
            [0.1, 0.4, -inf],
            [0.3, 0.2, 0.5]]
```

### 4.5 Softmax Normalization

Softmax is applied row-wise:

```math
\mathbf{A} = \text{softmax}(\mathbf{S}_{masked})
```

**Element-wise:**

```math
A_{ij} = \frac{\exp(S_{masked,ij})}{\sum_{k=1}^{n} \exp(S_{masked,ik})}
```

**Example:**
```
S_masked = [[0.2, -inf, -inf],
            [0.1, 0.4, -inf],
            [0.3, 0.2, 0.5]]

For row 0: exp(0.2) = 1.221, exp(-inf) = 0, exp(-inf) = 0
           sum = 1.221
           A[0,0] = 1.221/1.221 = 1.0
           A[0,1] = 0/1.221 = 0
           A[0,2] = 0/1.221 = 0

For row 1: exp(0.1) = 1.105, exp(0.4) = 1.492, exp(-inf) = 0
           sum = 2.597
           A[1,0] = 1.105/2.597 ≈ 0.426
           A[1,1] = 1.492/2.597 ≈ 0.574
           A[1,2] = 0/2.597 = 0

For row 2: exp(0.3) = 1.350, exp(0.2) = 1.221, exp(0.5) = 1.649
           sum = 4.220
           A[2,0] ≈ 0.320, A[2,1] ≈ 0.289, A[2,2] ≈ 0.391

A = [[1.0,   0.0,   0.0],
     [0.426, 0.574, 0.0],
     [0.320, 0.289, 0.391]]
```

### 4.6 Attention Application

```math
\mathbf{O}_i = \mathbf{A}_i \mathbf{V}_i \in \mathbb{R}^{B \times h \times n \times d_k}
```

**Example:**
```
A[0] = [1.0, 0.0, 0.0]
V[0] = [[0.1, 0.2, ...],
        [0.3, 0.4, ...],
        [0.5, 0.6, ...]]

O[0] = 1.0×[0.1,0.2,...] + 0.0×[0.3,0.4,...] + 0.0×[0.5,0.6,...]
     = [0.1, 0.2, ...]
```

### 4.7 Concatenation and Output Projection

**Concatenate heads:**

```math
\mathbf{O} = \text{Concat}(\mathbf{O}_1, \mathbf{O}_2, \ldots, \mathbf{O}_h) \in \mathbb{R}^{B \times n \times d}
```

**Output projection:**

```math
\text{Attention}(\mathbf{X}) = \mathbf{O} W_O \in \mathbb{R}^{B \times n \times d}
```

where $W_O \in \mathbb{R}^{d \times d}$ is the output projection weight matrix.

---

## 5. Feed-Forward Network

### 5.1 Feed-Forward Computation

The classic formulation uses ReLU:

```math
\text{FFN}(\mathbf{X}) = \text{ReLU}(\mathbf{X} W_1 + \mathbf{b}_1) W_2 + \mathbf{b}_2
```

SheepOp uses GELU activation by default:

```math
\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
```

where $\Phi(x)$ is the standard normal CDF.

**Approximation:**

```math
\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)
```

**Complete FFN (with GELU):**

```math
\mathbf{H} = \mathbf{X} W_1 + \mathbf{b}_1 \in \mathbb{R}^{B \times n \times d_{ff}}
```

```math
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}}
```

```math
\mathbf{H}'' = \text{Dropout}(\mathbf{H}', p)
```

```math
\text{FFN}(\mathbf{X}) = \mathbf{H}'' W_2 + \mathbf{b}_2 \in \mathbb{R}^{B \times n \times d}
```

**Example:**
```
d = 512, d_ff = 2048
X shape: [2, 5, 512]

W1 shape: [512, 2048]
H   = X @ W1 + b1      → [2, 5, 2048]
H'  = GELU(H)          → [2, 5, 2048]
H'' = Dropout(H', 0.1) → [2, 5, 2048]

W2 shape: [2048, 512]
FFN(X) = H'' @ W2 + b2 → [2, 5, 512]
```
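As a sketch, the FFN above is just two linear layers with a GELU and dropout in between; the module below uses the shapes from the example and is illustrative, not the SheepOp implementation:

```python
# Sketch: position-wise feed-forward network with GELU and dropout, as in Section 5.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)    # X W_1 + b_1
        self.w2 = nn.Linear(d_ff, d_model)    # H'' W_2 + b_2
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                     # x: [B, n, d_model]
        return self.w2(self.drop(self.act(self.w1(x))))

ffn = FeedForward()
print(ffn(torch.randn(2, 5, 512)).shape)      # torch.Size([2, 5, 512])
```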
---

## 6. Layer Normalization

### 6.1 Layer Normalization Formula

For input $\mathbf{x} \in \mathbb{R}^d$:

```math
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
```

```math
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```

```math
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
```

```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
```

where:

- $\epsilon$ = small constant (default 1e-5)
- $\gamma$ = learnable scale parameter
- $\beta$ = learnable shift parameter
- $\odot$ = element-wise multiplication

**Example:**
```
x = [1.0, 2.0, 3.0, 4.0]
d = 4

μ  = (1.0 + 2.0 + 3.0 + 4.0) / 4 = 2.5
σ² = ((1-2.5)² + (2-2.5)² + (3-2.5)² + (4-2.5)²) / 4
   = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
σ  = √1.25 ≈ 1.118
ε  = 1e-5

x̂ = [(1-2.5)/√(1.25+1e-5), (2-2.5)/√(1.25+1e-5), ...]
  ≈ [-1.342, -0.447, 0.447, 1.342]

γ = [1.0, 1.0, 1.0, 1.0]   (initialized)
β = [0.0, 0.0, 0.0, 0.0]   (initialized)

LayerNorm(x) = γ ⊙ x̂ + β = x̂
```

---

## 7. Transformer Block

### 7.1 Pre-Norm Architecture

**Self-Attention Block:**

```math
\mathbf{X}_1 = \mathbf{X} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X})), p)
```

**Feed-Forward Block:**

```math
\mathbf{X}_2 = \mathbf{X}_1 + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_1)), p)
```

**Complete Transformer Block:**

```math
\mathbf{X}_{out} = \text{TransformerBlock}(\mathbf{X}_{in})
```

**Step-by-step:**

1. $\mathbf{X}_{norm1} = \text{LayerNorm}(\mathbf{X}_{in})$
2. $\mathbf{X}_{attn} = \text{Attention}(\mathbf{X}_{norm1})$
3. $\mathbf{X}_{attn\_drop} = \text{Dropout}(\mathbf{X}_{attn}, p)$
4. $\mathbf{X}_1 = \mathbf{X}_{in} + \mathbf{X}_{attn\_drop}$ (residual connection)
5. $\mathbf{X}_{norm2} = \text{LayerNorm}(\mathbf{X}_1)$
6. $\mathbf{X}_{ffn} = \text{FFN}(\mathbf{X}_{norm2})$
7. $\mathbf{X}_{ffn\_drop} = \text{Dropout}(\mathbf{X}_{ffn}, p)$
8. $\mathbf{X}_{out} = \mathbf{X}_1 + \mathbf{X}_{ffn\_drop}$ (residual connection)

---

## 8. Complete Forward Pass

### 8.1 Full Model Forward Pass

Given input token IDs $\mathbf{T} \in \mathbb{N}^{B \times n}$:

**Step 1: Token Embedding**

```math
\mathbf{X}_0 = E[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
```

**Step 2: Positional Encoding**

```math
\mathbf{X}_1 = \mathbf{X}_0 + PE \in \mathbb{R}^{B \times n \times d}
```

```math
\mathbf{X}_2 = \text{Dropout}(\mathbf{X}_1, p)
```

**Step 3: Transformer Layers**

For $L$ layers, with $\mathbf{H}_0 = \mathbf{X}_2$:

```math
\mathbf{H}_l = \text{TransformerBlock}_l(\mathbf{H}_{l-1}), \quad l = 1, 2, \ldots, L
```

**Step 4: Final Layer Norm**

```math
\mathbf{X}_{final} = \text{LayerNorm}(\mathbf{H}_L)
```

**Step 5: Output Projection**

```math
\mathbf{L} = \mathbf{X}_{final} W_{out} \in \mathbb{R}^{B \times n \times V}
```

where $W_{out} \in \mathbb{R}^{d \times V}$ is the output projection matrix.

**Output logits:** $\text{logits}[b, t, v]$ is the unnormalized score for token $v$ at position $t$ in batch $b$; applying softmax over $v$ turns these scores into next-token probabilities.

---

## 9. Loss Computation

### 9.1 Cross-Entropy Loss

For logits $\mathbf{L} \in \mathbb{R}^{B \times n \times V}$ and labels $\mathbf{Y} \in \mathbb{N}^{B \times n}$:

**Reshape for loss:**

```math
\mathbf{L}_{flat} = \mathbf{L}.\text{view}(B \cdot n, V) \in \mathbb{R}^{(B \cdot n) \times V}
```

```math
\mathbf{Y}_{flat} = \mathbf{Y}.\text{view}(B \cdot n) \in \mathbb{N}^{B \cdot n}
```

**Softmax probabilities:**

```math
p_i = \frac{\exp(L_{flat}[i, y_i])}{\sum_{v=1}^{V} \exp(L_{flat}[i, v])}
```

**Cross-entropy loss:**

```math
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log(p_i)
```

where $N$ is the number of valid (non-padding) tokens.

**Masked loss (ignoring padding):**

```math
\mathcal{L} = -\frac{1}{N} \sum_{i: y_i \neq \text{pad\_id}} \log(p_i)
```

**Example:**
```
B = 2, n = 3, V = 128
L shape: [2, 3, 128]
Y = [[72, 105, -100],
     [44, 32, 119]]        (-100 marks padding)

L_flat shape: [6, 128]
Y_flat = [72, 105, -100, 44, 32, 119]

For i=0 (y_i=72):
  logits = L_flat[0] = [0.1, -0.2, ..., 0.5, ...]   (128 values)
  p_0 = exp(0.5) / sum(exp(logits)) ≈ 0.8           (illustrative value, assuming 0.5 is the largest logit)
  log(p_0) = log(0.8) ≈ -0.223

For i=2 (y_i=-100):
  Skip (padding token)

Total loss = -1/5 × (log(p_0) + log(p_1) + log(p_3) + log(p_4) + log(p_5))
```
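As a sketch, the masked loss above corresponds to PyTorch's cross-entropy with an `ignore_index`; the `-100` padding label matches the example above and is also PyTorch's default ignore value:

```python
# Sketch: masked cross-entropy over flattened logits, ignoring padding labels (-100).
import torch
import torch.nn.functional as F

B, n, V = 2, 3, 128
logits = torch.randn(B, n, V)
labels = torch.tensor([[72, 105, -100],
                       [44,  32,  119]])        # -100 marks padding positions

loss = F.cross_entropy(logits.view(B * n, V),   # [B·n, V]
                       labels.view(B * n),      # [B·n]
                       ignore_index=-100)       # padding tokens do not contribute
print(loss.item(), torch.exp(loss).item())      # loss and the corresponding perplexity
```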
### 9.2 Perplexity

```math
\text{Perplexity} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log(p_i)\right)
```

**Example:**
```
If L = 2.0, then Perplexity = exp(2.0) ≈ 7.39
```

---

## 10. Backpropagation

### 10.1 Gradient Flow

**Loss gradient:**

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{L}_{flat}} = \frac{\partial}{\partial \mathbf{L}_{flat}} \left(-\frac{1}{N} \sum_{i=1}^{N} \log(p_i)\right)
```

For the softmax cross-entropy this evaluates to the familiar "softmax minus one-hot" form:

```math
\frac{\partial \mathcal{L}}{\partial L_{flat}[i, v]} = \frac{1}{N}\left(p_{i,v} - \mathbb{1}[v = y_i]\right)
```

where $p_{i,v}$ is the softmax probability of token $v$ at position $i$.

**Chain rule through output projection:**

```math
\frac{\partial \mathcal{L}}{\partial W_{out}} = \frac{\partial \mathcal{L}}{\partial \mathbf{L}} \cdot \frac{\partial \mathbf{L}}{\partial W_{out}}
```

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_{final}} = \frac{\partial \mathcal{L}}{\partial \mathbf{L}} \cdot W_{out}^T
```

**Through transformer layers (backward):**

For layer $l$ from $L$ to $1$:

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{l+1}} \cdot \frac{\partial \mathbf{X}_{l+1}}{\partial \mathbf{X}_l}
```

**Residual connection gradient:**

Because $\mathbf{X}_{out} = \mathbf{X}_{in} + f(\mathbf{X}_{in})$, the gradient sums the contribution through the identity (residual) path and through the sub-layer $f$:

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_{in}} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{out}} + \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{out}} \cdot \frac{\partial f(\mathbf{X}_{in})}{\partial \mathbf{X}_{in}}
```

### 10.2 Attention Gradients

**Attention weight gradients:**

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{A}} = \frac{\partial \mathcal{L}}{\partial \mathbf{O}} \cdot \mathbf{V}^T
```

**Query, Key, Value gradients:**

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{Q}} = \frac{\partial \mathcal{L}}{\partial \mathbf{S}} \cdot \mathbf{K} \cdot \frac{1}{\sqrt{d_k}}
```

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{K}} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{S}}\right)^{T} \cdot \mathbf{Q} \cdot \frac{1}{\sqrt{d_k}}
```

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{V}} = \mathbf{A}^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{O}}
```

### 10.3 Gradient Clipping

**Gradient norm:**

```math
||\mathbf{g}|| = \sqrt{\sum_{i} g_i^2}
```

**Clipped gradient:**

```math
\mathbf{g}_{clipped} = \begin{cases} \mathbf{g} & \text{if } ||\mathbf{g}|| \leq \theta \\ \mathbf{g} \cdot \frac{\theta}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \theta \end{cases}
```

where $\theta$ is the max gradient norm (default 1.0).

**Example:**
```
g = [0.5, 0.8, 1.2]
||g|| = √(0.5² + 0.8² + 1.2²) = √(0.25 + 0.64 + 1.44) = √2.33 ≈ 1.526

θ = 1.0
Since ||g|| > θ:
g_clipped = g × (1.0 / 1.526) = [0.328, 0.524, 0.786]
```
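A sketch of the clipping rule on the worked example above; across a full model, `torch.nn.utils.clip_grad_norm_` applies the same rule to the concatenated parameter gradients:

```python
# Sketch: global-norm gradient clipping, reproducing the worked example above.
import torch

g = torch.tensor([0.5, 0.8, 1.2])
theta = 1.0                                # max gradient norm

norm = g.norm()                            # sqrt(0.25 + 0.64 + 1.44) ≈ 1.526
if norm > theta:
    g = g * (theta / norm)
print(norm.item(), g)                      # ≈ 1.526, tensor([0.3276, 0.5242, 0.7863])
```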
---

## 11. AdamW Optimizer Update

### 11.1 AdamW Algorithm

For parameter $\theta_t$ at step $t$:

**First and second moment updates:**

```math
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
```

```math
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
```

**Bias correction:**

```math
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
```

```math
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
```

**Parameter update:**

```math
\theta_t = \theta_{t-1} - \eta_t \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)
```

where:

- $\beta_1 = 0.9$ (momentum decay)
- $\beta_2 = 0.999$ (variance decay)
- $\eta_t$ = learning rate at step $t$
- $\lambda$ = weight decay coefficient (default 0.01)
- $\epsilon = 10^{-8}$ (numerical stability)

### 11.2 Step-by-Step Example

**Initialization:**
```
t = 0
θ₀ = 0.5   (initial parameter value)
m₀ = 0, v₀ = 0
β₁ = 0.9, β₂ = 0.999
η = 0.001  (learning rate)
λ = 0.01   (weight decay)
ε = 1e-8
```

**Step 1:**
```
t = 1
g₁ = 0.3   (gradient)

m₁ = 0.9 × 0 + 0.1 × 0.3 = 0.03
v₁ = 0.999 × 0 + 0.001 × 0.3² = 0.001 × 0.09 = 0.00009

m̂₁ = 0.03 / (1 - 0.9¹) = 0.03 / 0.1 = 0.3
v̂₁ = 0.00009 / (1 - 0.999¹) = 0.00009 / 0.001 = 0.09

θ₁ = 0.5 - 0.001 × (0.3 / (√0.09 + 1e-8) + 0.01 × 0.5)
   = 0.5 - 0.001 × (0.3 / 0.3 + 0.005)
   = 0.5 - 0.001 × 1.005
   = 0.5 - 0.001005
   = 0.498995
```

**Step 2:**
```
t = 2
g₂ = -0.2

m₂ = 0.9 × 0.03 + 0.1 × (-0.2) = 0.027 - 0.02 = 0.007
v₂ = 0.999 × 0.00009 + 0.001 × (-0.2)² = 0.00008991 + 0.00004 = 0.00012991

m̂₂ = 0.007 / (1 - 0.9²) = 0.007 / 0.19 = 0.0368
v̂₂ = 0.00012991 / (1 - 0.999²) = 0.00012991 / 0.001999 ≈ 0.06496

θ₂ = 0.498995 - 0.001 × (0.0368 / (√0.06496 + 1e-8) + 0.01 × 0.498995)
   = 0.498995 - 0.001 × (0.0368 / 0.2549 + 0.00498995)
   = 0.498995 - 0.001 × (0.1444 + 0.00498995)
   = 0.498995 - 0.001 × 0.1494
   = 0.498995 - 0.0001494
   = 0.498846
```

### 11.3 AdamW vs Adam

The key difference is how weight decay is applied.

**Adam (with L2 regularization):** weight decay is folded into the gradient before the moment updates,

```math
g_t \leftarrow g_t + \lambda \theta_{t-1}, \qquad \theta_t = \theta_{t-1} - \eta_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

so the decay term gets rescaled by the adaptive denominator along with the rest of the gradient.

**AdamW:** the decay is applied directly to the parameters, outside the adaptive update:

```math
\theta_t = \theta_{t-1} - \eta_t \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)
```

AdamW decouples weight decay from gradient-based updates, which typically leads to better generalization.

---

## 12. Learning Rate Scheduling

### 12.1 Cosine Annealing Schedule

```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos\left(\pi \cdot \frac{t}{T_{max}}\right)}{2}
```

where:

- $\eta_{max}$ = initial learning rate
- $\eta_{min}$ = minimum learning rate (default 0)
- $T_{max}$ = total number of steps
- $t$ = current step

**Example:**
```
η_max = 0.001
η_min = 0
T_max = 10000

t = 0:     η₀ = 0 + (0.001 - 0) × (1 + cos(0)) / 2 = 0.001 × 1 = 0.001
t = 2500:  η = 0 + 0.001 × (1 + cos(π/4)) / 2 = 0.001 × (1 + 0.707) / 2 ≈ 0.000854
t = 5000:  η = 0 + 0.001 × (1 + cos(π/2)) / 2 = 0.001 × (1 + 0) / 2 = 0.0005
t = 7500:  η = 0 + 0.001 × (1 + cos(3π/4)) / 2 ≈ 0.000146
t = 10000: η = 0 + 0.001 × (1 + cos(π)) / 2 = 0.001 × (1 + (-1)) / 2 = 0
```

### 12.2 Learning Rate Schedule Visualization

The cosine annealing schedule creates a smooth decay from the maximum to the minimum learning rate following a cosine curve.
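A sketch that reproduces the schedule values above (essentially the same formula that `torch.optim.lr_scheduler.CosineAnnealingLR` implements):

```python
# Sketch: cosine annealing learning-rate schedule from Section 12.1.
import math

def cosine_lr(step, eta_max=1e-3, eta_min=0.0, t_max=10_000):
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

for step in (0, 2500, 5000, 7500, 10_000):
    print(step, round(cosine_lr(step), 6))   # 0.001, 0.000854, 0.0005, 0.000146, 0.0
```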
---

## 13. Text Generation

### 13.1 Autoregressive Generation

Given prompt tokens $\mathbf{P} = [p_1, p_2, \ldots, p_k]$:

**Initialization:**

```math
\mathbf{T}_0 = \mathbf{P}
```

**For each generation step $t$ from $k+1$ to $k+n$:**

1. **Forward pass:**
   ```math
   \mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
   ```
2. **Get next-token logits:**
   ```math
   \mathbf{l}_t = \mathbf{L}_t[:, -1, :] \in \mathbb{R}^{B \times V}
   ```
3. **Apply temperature:**
   ```math
   \mathbf{l}_t' = \frac{\mathbf{l}_t}{T}
   ```
   where $T$ is the temperature (default 1.0).
4. **Top-k filtering (optional):**
   ```math
   \mathbf{l}_t''[v] = \begin{cases} \mathbf{l}_t'[v] & \text{if } v \in \text{top-k}(\mathbf{l}_t') \\ -\infty & \text{otherwise} \end{cases}
   ```
5. **Top-p (nucleus) sampling (optional):**
   - Sort tokens by probability
   - Find the smallest set $S$ where $\sum_{v \in S} p(v) \geq p$
   - Set probabilities outside $S$ to 0 and renormalize
6. **Sample token:**
   ```math
   p_t = \text{softmax}(\mathbf{l}_t'') \in \mathbb{R}^V, \qquad t_t \sim \text{Categorical}(p_t)
   ```
7. **Append token:**
   ```math
   \mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
   ```

### 13.2 Generation Example

**Input:**
```
Prompt: "Hello"
P = [72, 101, 108, 108, 111]   ("Hello")
```

**Step 1:**
```
T₀ = [72, 101, 108, 108, 111]
Forward pass → L₁ shape: [1, 5, 128]

l₁ = L₁[0, -1, :] = [0.1, -0.2, ..., 0.8, ...]   (logits for next token)

Apply temperature T=1.0: l₁' = l₁ / 1.0 = l₁
Softmax: p₁ = softmax(l₁) = [0.001, 0.0005, ..., 0.15, ...]

Sample (let's say token 32 = ' '): t₁ = 32
T₁ = [72, 101, 108, 108, 111, 32]
```

**Step 2:**
```
T₁ = [72, 101, 108, 108, 111, 32]
Forward pass → L₂ shape: [1, 6, 128]
l₂ = L₂[0, -1, :]

Continue until max_length is reached...
```

### 13.3 Top-k Sampling

**Example:**
```
V = 128, k = 50
l = [0.5, 0.3, ..., -0.1, ...]      (128 logits)

Sort and keep the 50 largest:
top_k_indices = [0, 5, 12, ..., 87] (50 token IDs)

l' keeps the logits at those indices and sets all others to -inf:
l' = [0.5, -inf, ..., 0.3, ..., -inf, ...]
```

### 13.4 Top-p (Nucleus) Sampling

**Example:**
```
p = 0.95 (threshold)
p_sorted = [0.3, 0.2, 0.15, 0.1, 0.05, 0.03, ...]   (sorted probabilities)
Cumulative: [0.3, 0.5, 0.65, 0.75, 0.8, 0.83, ...]

Find the smallest prefix whose cumulative probability ≥ 0.95,
e.g. the first 20 tokens (cumulative = 0.96 ≥ 0.95).
Keep those 20 tokens, set the rest to 0, and renormalize.
```

---

## Summary

This document provides complete mathematical formulations for:

1. **Data Processing**: Tokenization, chunking, padding
2. **Embeddings**: Token embeddings and positional encodings
3. **Attention**: Multi-head self-attention with scaling and masking
4. **Feed-Forward**: GELU activation and linear transformations
5. **Normalization**: Layer normalization with learnable parameters
6. **Training**: Loss computation, backpropagation, gradient clipping
7. **Optimization**: AdamW update rule with momentum and variance tracking
8. **Scheduling**: Cosine annealing learning rate schedule
9. **Generation**: Autoregressive sampling with temperature, top-k, and top-p

Each section includes:

- Mathematical formulations
- Step-by-step calculations
- Worked examples with numerical values
- Implementation details

All equations are directly implementable in PyTorch and match the actual implementation in the SheepOp codebase.
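As a final illustration, the generation procedure of Section 13 (temperature plus top-k filtering) can be sketched as a single loop; `model` is a stand-in for any module that returns `[B, n, V]` logits and is not the SheepOp API:

```python
# Sketch: autoregressive generation with temperature and top-k filtering (Section 13).
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20, temperature=1.0, top_k=50):
    ids = torch.tensor(prompt_ids, dtype=torch.long).unsqueeze(0)     # [1, k]
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature                   # steps 2-3: last position, temperature
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]       # k-th largest logit
            logits = logits.masked_fill(logits < kth, float("-inf"))  # step 4: top-k filtering
        probs = torch.softmax(logits, dim=-1)                         # step 6: probabilities
        next_id = torch.multinomial(probs, num_samples=1)             # sample one token
        ids = torch.cat([ids, next_id], dim=1)                        # step 7: append
    return ids[0].tolist()
```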