SheepOp LLM - Complete Mathematical Formulation
Complete mathematical derivation and step-by-step solutions for every component of the SheepOp Language Model.
Table of Contents
- Data Processing and Tokenization
- Token Embedding
- Positional Encoding
- Multi-Head Self-Attention
- Feed-Forward Network
- Layer Normalization
- Transformer Block
- Complete Forward Pass
- Loss Computation
- Backpropagation
- AdamW Optimizer Update
- Learning Rate Scheduling
- Text Generation
1. Data Processing and Tokenization
1.1 Text Extraction
Given a text file with lines, we extract text samples:
Input: Raw text files, PDFs, images, code files
Output: List of text strings S = \{s_1, s_2, \ldots, s_N\} where each s_i is a text line
Example:
Input: "Hello world\nMachine learning is cool."
Output: S = ["Hello world", "Machine learning is cool."]
1.2 Character-Level Tokenization
Vocabulary Construction:
For character-level tokenization, we create a vocabulary V mapping characters to token IDs:
V = \{(\text{<pad>}, 0), (\text{<unk>}, 1), (\text{<bos>}, 2), (\text{<eos>}, 3), (\text{space}, 4), (\text{!}, 5), \ldots, (\text{z}, 129)\}
Or more formally:
V: \mathcal{C} \rightarrow \mathbb{N}, \quad V(c) = \begin{cases}
0 & \text{if } c = \text{<pad>} \\
1 & \text{if } c = \text{<unk>} \\
2 & \text{if } c = \text{<bos>} \\
3 & \text{if } c = \text{<eos>} \\
4 & \text{if } c = \text{space} \\
\vdots & \\
129 & \text{if } c = \text{z}
\end{cases}
where \mathcal{C} is the set of all characters in the vocabulary.
Encoding Function:
For a text string s = c_1 c_2 \ldots c_n where c_i are characters:
\text{encode}(s) = [V[c_1], V[c_2], \ldots, V[c_n]]
Example:
Input: "Hi"
s = ['H', 'i']
V = {'H': 72, 'i': 105} # ASCII values
encode("Hi") = [72, 105]
Decoding Function:
\text{decode}([t_1, t_2, ..., t_n]) = V^{-1}[t_1] \cdot V^{-1}[t_2] \cdot \ldots \cdot V^{-1}[t_n]
where V^{-1} is the inverse mapping from token IDs to characters.
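A minimal Python sketch of these mappings; the `CharTokenizer` class, its corpus-based constructor, and the toy corpus are illustrative (the special-token ordering follows Section 1.2, not necessarily the exact SheepOp API):

```python
class CharTokenizer:
    """Minimal character-level tokenizer implementing V and V^{-1} from Section 1.2."""

    SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]  # IDs 0-3

    def __init__(self, corpus: str):
        # Special tokens first, then the corpus characters in sorted order.
        self.token_to_id = {tok: i for i, tok in enumerate(self.SPECIALS)}
        for c in sorted(set(corpus)):
            self.token_to_id.setdefault(c, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, s: str) -> list[int]:
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(c, unk) for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_token.get(i, "<unk>") for i in ids)


tok = CharTokenizer("Hello world")
print(tok.encode("Hello"))               # [5, 7, 8, 8, 9] with this toy corpus
print(tok.decode(tok.encode("Hello")))   # "Hello" round-trips
```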
1.3 Sequence Chunking
For a token sequence T = [t_1, t_2, ..., t_L] and maximum length M:
Chunking:
\text{chunk}_i = [t_{i \cdot S + 1}, t_{i \cdot S + 2}, \ldots, t_{\min(i \cdot S + M,\ L)}], \quad i \in \{0, 1, \ldots, \lceil L / S \rceil - 1\}
where S is the stride (default S = M, i.e., non-overlapping chunks).
Padding:
For a chunk C with length |C| < M:
\text{padded}(C) = C \oplus [\text{pad\_token}]^{(M - |C|)}
Example:
M = 5, S = 5
T = [72, 105, 44, 32, 119, 111, 114, 108, 100]
Chunk 1: [72, 105, 44, 32, 119]
Chunk 2: [111, 114, 108, 100, <pad>]
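The chunking and padding rules can be sketched as follows; `chunk_and_pad` is a hypothetical helper, and `pad_id = 0` matches the <pad> ID from Section 1.2:

```python
def chunk_and_pad(tokens: list[int], max_len: int, stride: int | None = None,
                  pad_id: int = 0) -> list[list[int]]:
    """Split a token sequence into fixed-length chunks, right-padding the last one."""
    stride = stride or max_len  # default S = M: non-overlapping chunks
    chunks = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + max_len]
        chunk += [pad_id] * (max_len - len(chunk))  # pad short final chunk to max_len
        chunks.append(chunk)
    return chunks


T = [72, 105, 44, 32, 119, 111, 114, 108, 100]
print(chunk_and_pad(T, max_len=5))
# [[72, 105, 44, 32, 119], [111, 114, 108, 100, 0]]
```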
2. Token Embedding
2.1 Embedding Matrix
We have an embedding matrix E \in \mathbb{R}^{V \times d} where:
- V = vocabulary size
- d = embedding dimension (d_model)
2.2 Embedding Lookup
For input token IDs \mathbf{t} = [t_1, t_2, ..., t_n]:
\mathbf{X} = E[\mathbf{t}] = \begin{bmatrix} E[t_1] \\ E[t_2] \\ \vdots \\ E[t_n] \end{bmatrix} \in \mathbb{R}^{n \times d}
Example:
V = 128, d = 512
t = [72, 105]
E[72] = [0.1, -0.2, ..., 0.05] (512-dim vector)
E[105] = [-0.1, 0.3, ..., 0.02] (512-dim vector)
X = [[0.1, -0.2, ..., 0.05],
[-0.1, 0.3, ..., 0.02]]
Batch Processing:
For batch size B:
\mathbf{X} = E[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
where \mathbf{T} \in \mathbb{N}^{B \times n} is the batch of token IDs.
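In PyTorch the lookup is a single `nn.Embedding` call; a minimal sketch using the dimensions from the example:

```python
import torch
import torch.nn as nn

V, d = 128, 512
embedding = nn.Embedding(V, d)   # E in R^{V x d}, randomly initialized

T = torch.tensor([[72, 105]])    # batch of token IDs, shape [B=1, n=2]
X = embedding(T)                 # row lookup E[T]
print(X.shape)                   # torch.Size([1, 2, 512])
```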
3. Positional Encoding
3.1 Sinusoidal Positional Encoding
For position pos and dimension i:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
Origin of the 10000 Constant:
The constant 10000 is a hyperparameter introduced in the original "Attention Is All You Need" paper (Vaswani et al., 2017). This value controls the frequency (or wavelength) of the sinusoidal functions used for positional encoding.
What 10000 Controls:
The term 10000^{2i/d} creates a geometric progression of frequencies across different dimensions:
- Lower dimensions (small i): higher frequencies (faster oscillation)
- Higher dimensions (large i): lower frequencies (slower oscillation)
Mathematical Interpretation:
The wavelength \lambda_i for dimension pair (2i, 2i+1) is:
\lambda_i = 2\pi \cdot 10000^{2i/d}
This means:
- When i = 0: \lambda_0 = 2\pi \cdot 10000^{0} = 2\pi \approx 6.28 (short wavelength)
- When i = d/2 - 1: \lambda_{d/2-1} = 2\pi \cdot 10000^{(d-2)/d} \approx 2\pi \cdot 10000 (long wavelength)
Why 10000?
- Scale Balance: It provides a good balance between:
  - Being large enough to create distinguishable patterns across positions
  - Being small enough to prevent numerical issues
- Empirical Choice: The authors found this value works well for typical sequence lengths (up to ~5000 tokens)
- Frequency Range: For d = 512:
  - Lowest frequency: \frac{1}{10000^{510/512}} \approx \frac{1}{10000} = 0.0001 cycles per position
  - Highest frequency: \frac{1}{10000^{0/512}} = 1 cycle per position
  - This covers a wide range, allowing the model to capture both local and long-range positional patterns
What Happens if We Change It?
- Smaller values (e.g., 100): Higher frequencies overall → better for short sequences, but may cause aliasing for long sequences
- Larger values (e.g., 100000): Lower frequencies overall → better for very long sequences, but may lose fine-grained positional information
- Different values are sometimes used: Some models use 10000, others use 5000 or 20000 depending on their typical sequence lengths
Example Frequency Analysis:
For d = 512:
i = 0: 10000^(0/512) = 1.0 → wavelength ≈ 6.28 positions
i = 64: 10000^(128/512) = 10 → wavelength ≈ 62.8 positions
i = 128: 10000^(256/512) = 100 → wavelength ≈ 628 positions
i = 256: 10000^(512/512) = 10000 → wavelength ≈ 62,832 positions
This creates a multi-scale representation where different dimensions encode positional information at different resolutions.
Simplified Form:
PE_{(pos, 2i)} = \sin\left(pos \cdot \exp\left(-\frac{2i \log(10000)}{d}\right)\right)
PE_{(pos, 2i+1)} = \cos\left(pos \cdot \exp\left(-\frac{2i \log(10000)}{d}\right)\right)
3.2 Positional Encoding Matrix
For sequence length n and model dimension d:
PE = \begin{bmatrix}
PE_{(0,0)} & PE_{(0,1)} & \cdots & PE_{(0,d-1)} \\
PE_{(1,0)} & PE_{(1,1)} & \cdots & PE_{(1,d-1)} \\
\vdots & \vdots & \ddots & \vdots \\
PE_{(n-1,0)} & PE_{(n-1,1)} & \cdots & PE_{(n-1,d-1)}
\end{bmatrix} \in \mathbb{R}^{n \times d}
3.3 Adding Positional Encoding
\mathbf{X}' = \mathbf{X} + PE
Example Calculation:
d = 512, pos = 0, i = 0:
PE(0,0) = sin(0 / 10000^(0/512)) = sin(0) = 0
PE(0,1) = cos(0 / 10000^(0/512)) = cos(0) = 1
pos = 0, i = 1:
PE(0,2) = sin(0 / 10000^(2/512)) = sin(0) = 0
PE(0,3) = cos(0 / 10000^(2/512)) = cos(0) = 1
pos = 1, i = 0:
PE(1,0) = sin(1 / 10000^(0/512)) = sin(1) ≈ 0.8415
PE(1,1) = cos(1 / 10000^(0/512)) = cos(1) ≈ 0.5403
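A compact way to build the full PE matrix, using the numerically stable simplified form from Section 3.1; `sinusoidal_pe` is an illustrative helper name:

```python
import math
import torch

def sinusoidal_pe(n: int, d: int) -> torch.Tensor:
    """Build the [n, d] sinusoidal positional encoding matrix from Section 3.1."""
    pe = torch.zeros(n, d)
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)        # [n, 1]
    # exp(-2i * log(10000) / d), i.e. 1 / 10000^{2i/d} for each dimension pair
    div_term = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d))                    # [d/2]
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_pe(4, 512)
print(pe[0, :2], pe[1, :2])  # [0, 1] and [~0.8415, ~0.5403], matching the example
```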
3.4 Dropout Application
\mathbf{X}'' = \text{Dropout}(\mathbf{X}', p)
where p is the dropout probability (typically 0.1).
4. Multi-Head Self-Attention
4.1 Query, Key, Value Projections
For input \mathbf{X} \in \mathbb{R}^{B \times n \times d}:
\mathbf{Q} = \mathbf{X} W_Q, \quad \mathbf{K} = \mathbf{X} W_K, \quad \mathbf{V} = \mathbf{X} W_V
where:
- W_Q, W_K, W_V \in \mathbb{R}^{d \times d} are learnable weight matrices
- \mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{B \times n \times d}
Example:
B = 2, n = 5, d = 512
X shape: [2, 5, 512]
W_Q shape: [512, 512]
Q = X @ W_Q → [2, 5, 512]
4.2 Multi-Head Splitting
For h heads:
d_k = \frac{d}{h}
\mathbf{Q}_i = \mathbf{Q}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
\mathbf{K}_i = \mathbf{K}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
\mathbf{V}_i = \mathbf{V}[:, :, i \cdot d_k : (i+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
Reshaping (stacking all heads into one tensor):
\mathbf{Q} \in \mathbb{R}^{B \times h \times n \times d_k}
Example:
d = 512, h = 8, d_k = 64
Q shape: [2, 5, 512]
After reshape: [2, 8, 5, 64]
4.3 Scaled Dot-Product Attention
Attention Scores:
\mathbf{S} = \frac{\mathbf{Q}_i \mathbf{K}_i^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times h \times n \times n}
Example Calculation:
For head i, one example:
Q_i[0,0] = [0.1, -0.2, 0.3, ..., 0.05] (64-dim)
K_i[0,0] = [0.2, 0.1, -0.1, ..., 0.1] (64-dim)
Dot product: Q_i[0,0] · K_i[0,0] = 0.1×0.2 + (-0.2)×0.1 + ... = 0.15
Scale: 0.15 / √64 = 0.15 / 8 = 0.01875
In general, S[a, b] = (Q_i[a] · K_i[b]) / √d_k for query position a and key position b
4.4 Causal Masking
For causal (autoregressive) attention:
M_{causal} = \begin{bmatrix}
0 & -\infty & -\infty & \cdots \\
0 & 0 & -\infty & \cdots \\
0 & 0 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\mathbf{S}_{masked} = \mathbf{S} + M_{causal}
Example:
n = 3
M_causal = [[0, -inf, -inf],
[0, 0, -inf],
[0, 0, 0]]
S = [[0.2, 0.1, 0.3],
[0.1, 0.4, 0.2],
[0.3, 0.2, 0.5]]
S_masked = [[0.2, -inf, -inf],
[0.1, 0.4, -inf],
[0.3, 0.2, 0.5]]
4.5 Softmax Normalization
\mathbf{A} = \text{softmax}(\mathbf{S}_{masked}), applied row-wise over the key dimension.
Element-wise:
A_{ij} = \frac{\exp(S_{masked,ij})}{\sum_{k=1}^{n} \exp(S_{masked,ik})}
Example:
S_masked = [[0.2, -inf, -inf],
[0.1, 0.4, -inf],
[0.3, 0.2, 0.5]]
For row 0:
exp(0.2) = 1.221, exp(-inf) = 0, exp(-inf) = 0
sum = 1.221
A[0,0] = 1.221/1.221 = 1.0
A[0,1] = 0/1.221 = 0
A[0,2] = 0/1.221 = 0
For row 1:
exp(0.1) = 1.105, exp(0.4) = 1.492, exp(-inf) = 0
sum = 2.597
A[1,0] = 1.105/2.597 ≈ 0.426
A[1,1] = 1.492/2.597 ≈ 0.574
A[1,2] = 0/2.597 = 0
For row 2:
exp(0.3) = 1.350, exp(0.2) = 1.221, exp(0.5) = 1.649
sum = 4.220
A[2,0] = 1.350/4.220 ≈ 0.320
A[2,1] = 1.221/4.220 ≈ 0.289
A[2,2] = 1.649/4.220 ≈ 0.391
A = [[1.0, 0.0, 0.0],
[0.426, 0.574, 0.0],
[0.320, 0.289, 0.391]]
4.6 Attention Application
\mathbf{O}_i = \mathbf{A}_i \mathbf{V}_i \in \mathbb{R}^{B \times h \times n \times d_k}
Example:
A[0] = [1.0, 0.0, 0.0]
V[0] = [[0.1, 0.2, ...],
[0.3, 0.4, ...],
[0.5, 0.6, ...]]
O[0] = 1.0×[0.1,0.2,...] + 0.0×[0.3,0.4,...] + 0.0×[0.5,0.6,...]
= [0.1, 0.2, ...]
4.7 Concatenation and Output Projection
Concatenate heads:
\mathbf{O} = \text{Concat}(\mathbf{O}_1, \mathbf{O}_2, ..., \mathbf{O}_h) \in \mathbb{R}^{B \times n \times d}
Output projection:
\text{Attention}(\mathbf{X}) = \mathbf{O} W_O \in \mathbb{R}^{B \times n \times d}
where W_O \in \mathbb{R}^{d \times d} is the output projection weight matrix.
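Sections 4.3-4.6 combine into a few tensor operations; a minimal sketch (the function name and the use of `torch.triu` for the causal mask are implementation choices, not necessarily SheepOp's exact code):

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (Sections 4.3-4.6).

    Q, K, V: [B, h, n, d_k] tensors, already split into heads.
    """
    d_k, n = Q.size(-1), Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)            # [B, h, n, n]
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))             # future positions -> -inf
    A = F.softmax(scores, dim=-1)                                # attention weights
    return A @ V                                                 # [B, h, n, d_k]

B, h, n, d_k = 2, 8, 5, 64
Q, K, V = (torch.randn(B, h, n, d_k) for _ in range(3))
print(causal_self_attention(Q, K, V).shape)  # torch.Size([2, 8, 5, 64])
```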
5. Feed-Forward Network
5.1 Feed-Forward Computation
The classic formulation uses ReLU:
\text{FFN}(\mathbf{X}) = \text{ReLU}(\mathbf{X} W_1 + \mathbf{b}_1) W_2 + \mathbf{b}_2
This implementation uses the GELU activation by default:
\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
where \Phi(x) is the standard normal CDF.
Approximation:
\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)
Complete FFN:
\mathbf{H} = \mathbf{X} W_1 + \mathbf{b}_1 \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{H}'' = \text{Dropout}(\mathbf{H}', p)
\text{FFN}(\mathbf{X}) = \mathbf{H}'' W_2 + \mathbf{b}_2 \in \mathbb{R}^{B \times n \times d}
Example:
d = 512, d_ff = 2048
X shape: [2, 5, 512]
W1 shape: [512, 2048]
H = X @ W1 → [2, 5, 2048]
H' = GELU(H) → [2, 5, 2048]
H'' = Dropout(H', 0.1) → [2, 5, 2048]
W2 shape: [2048, 512]
FFN(X) = H'' @ W2 → [2, 5, 512]
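A sketch of this sublayer as a PyTorch module, with `nn.Linear` supplying the bias terms b_1 and b_2; the class name is illustrative:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN from Section 5: Linear -> GELU -> Dropout -> Linear."""

    def __init__(self, d: int = 512, d_ff: int = 2048, p: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d, d_ff)   # X W_1 + b_1
        self.w2 = nn.Linear(d_ff, d)   # H'' W_2 + b_2
        self.act = nn.GELU()
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.drop(self.act(self.w1(x))))

ffn = FeedForward()
print(ffn(torch.randn(2, 5, 512)).shape)  # torch.Size([2, 5, 512])
```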
6. Layer Normalization
6.1 Layer Normalization Formula
For input \mathbf{x} \in \mathbb{R}^d:
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \hat{\mathbf{x}} + \beta
where:
- \epsilon = small constant (default 1e-5)
- \gamma = learnable scale parameter
- \beta = learnable shift parameter
- \odot = element-wise multiplication
Example:
x = [1.0, 2.0, 3.0, 4.0]
d = 4
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4 = 2.5
σ² = ((1-2.5)² + (2-2.5)² + (3-2.5)² + (4-2.5)²) / 4
= (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
σ = √1.25 ≈ 1.118
ε = 1e-5
x̂ = [(1-2.5)/√(1.25+1e-5), (2-2.5)/√(1.25+1e-5), ...]
= [-1.342, -0.447, 0.447, 1.342]
γ = [1.0, 1.0, 1.0, 1.0] (initialized)
β = [0.0, 0.0, 0.0, 0.0] (initialized)
LayerNorm(x) = γ ⊙ x̂ + β = x̂
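The worked example can be checked numerically; note that the biased variance (dividing by d, not d-1) matches the formula above:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
eps = 1e-5
mu = x.mean()                       # 2.5
var = x.var(unbiased=False)         # 1.25 (biased variance, as in the formula)
x_hat = (x - mu) / torch.sqrt(var + eps)
print(x_hat)  # tensor([-1.3416, -0.4472,  0.4472,  1.3416])

# Same result via the built-in (gamma = 1, beta = 0 at initialization):
print(torch.nn.functional.layer_norm(x, (4,), eps=eps))
```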
7. Transformer Block
7.1 Pre-Norm Architecture
Self-Attention Block:
\mathbf{X}_1 = \mathbf{X} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X})), p)
Feed-Forward Block:
\mathbf{X}_2 = \mathbf{X}_1 + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_1)), p)
Complete Transformer Block:
\mathbf{X}_{out} = \text{TransformerBlock}(\mathbf{X}_{in})
Step-by-step:
1. \mathbf{X}_{norm1} = \text{LayerNorm}(\mathbf{X}_{in})
2. \mathbf{X}_{attn} = \text{Attention}(\mathbf{X}_{norm1})
3. \mathbf{X}_{attn\_drop} = \text{Dropout}(\mathbf{X}_{attn}, p)
4. \mathbf{X}_1 = \mathbf{X}_{in} + \mathbf{X}_{attn\_drop} (residual connection)
5. \mathbf{X}_{norm2} = \text{LayerNorm}(\mathbf{X}_1)
6. \mathbf{X}_{ffn} = \text{FFN}(\mathbf{X}_{norm2})
7. \mathbf{X}_{ffn\_drop} = \text{Dropout}(\mathbf{X}_{ffn}, p)
8. \mathbf{X}_{out} = \mathbf{X}_1 + \mathbf{X}_{ffn\_drop} (residual connection)
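A sketch of the pre-norm block as a module, taking the attention and FFN sublayers as constructor arguments (an illustrative composition, not necessarily the exact SheepOp module):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block from Section 7.1.

    `attn` and `ffn` are the sublayers of Sections 4 and 5; their
    signatures here (tensor in, tensor out) are assumptions.
    """

    def __init__(self, attn: nn.Module, ffn: nn.Module, d: int = 512, p: float = 0.1):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn, self.ffn = attn, ffn
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.drop(self.attn(self.norm1(x)))  # steps 1-4: attention sublayer
        x = x + self.drop(self.ffn(self.norm2(x)))   # steps 5-8: feed-forward sublayer
        return x
```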
8. Complete Forward Pass
8.1 Full Model Forward Pass
Given input token IDs \mathbf{T} \in \mathbb{N}^{B \times n}:
Step 1: Token Embedding
\mathbf{X}_0 = E[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
Step 2: Positional Encoding
\mathbf{X}_1 = \mathbf{X}_0 + PE \in \mathbb{R}^{B \times n \times d}
\mathbf{X}_2 = \text{Dropout}(\mathbf{X}_1, p)
Step 3: Transformer Layers
For L layers:
\mathbf{X}_{l+2} = \text{TransformerBlock}_l(\mathbf{X}_{l+1}), \quad l = 1, 2, \ldots, L
Step 4: Final Layer Norm
\mathbf{X}_{final} = \text{LayerNorm}(\mathbf{X}_{L+2})
Step 5: Output Projection
\mathbf{L} = \mathbf{X}_{final} W_{out} \in \mathbb{R}^{B \times n \times V}
where W_{out} \in \mathbb{R}^{d \times V} is the output projection matrix.
Output logits:
\text{logits}[b, t, v] = unnormalized score (pre-softmax) for token v being the next token after position t in batch b
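Steps 1-5 chain together as below; this is a skeleton over pre-built components (the argument names are hypothetical, and dropout is shown disabled as in inference):

```python
import torch
import torch.nn as nn

def forward_pass(T, embedding, pe, blocks, final_norm, w_out, p=0.1):
    """Steps 1-5 of Section 8.1, composing the sketches from earlier sections.

    T: [B, n] token IDs; pe: [max_n, d] PE matrix; blocks: list of
    TransformerBlock modules; w_out: [d, V] output projection matrix.
    """
    x = embedding(T)                                    # Step 1: [B, n, d]
    x = x + pe[: T.size(1)]                             # Step 2: add positional encoding
    x = nn.functional.dropout(x, p=p, training=False)   # no-op at inference
    for block in blocks:                                # Step 3: L transformer blocks
        x = block(x)
    x = final_norm(x)                                   # Step 4: final LayerNorm
    return x @ w_out                                    # Step 5: logits [B, n, V]
```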
9. Loss Computation
9.1 Cross-Entropy Loss
For logits \mathbf{L} \in \mathbb{R}^{B \times n \times V} and labels \mathbf{Y} \in \mathbb{N}^{B \times n}:
Reshape for loss:
\mathbf{L}_{flat} = \mathbf{L}.view(B \cdot n, V) \in \mathbb{R}^{(B \cdot n) \times V}
\mathbf{Y}_{flat} = \mathbf{Y}.view(B \cdot n) \in \mathbb{N}^{B \cdot n}
Softmax probabilities:
p_i = \frac{\exp(L_{flat}[i, y_i])}{\sum_{v=1}^{V} \exp(L_{flat}[i, v])}
Cross-entropy loss:
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log(p_i)
where N is the number of valid (non-padding) tokens.
Masked loss (ignoring padding):
\mathcal{L} = -\frac{1}{N} \sum_{i: y_i \neq \text{pad\_id}} \log(p_i)
Example:
B = 2, n = 3, V = 128
L shape: [2, 3, 128]
Y = [[72, 105, -100], [44, 32, 119]] (-100 is padding)
L_flat shape: [6, 128]
Y_flat = [72, 105, -100, 44, 32, 119]
For i=0 (y_i=72):
logits = L_flat[0] = [0.1, -0.2, ..., 0.5, ...] (128 values)
p_0 = exp(L_flat[0, 72]) / sum(exp(L_flat[0])) ≈ 0.8 (illustrative, taking L_flat[0, 72] = 0.5)
log(p_0) = log(0.8) ≈ -0.223
For i=2 (y_i=-100):
Skip (padding token)
Total loss = -1/5 * (log(p_0) + log(p_1) + log(p_3) + log(p_4) + log(p_5))
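PyTorch's `F.cross_entropy` performs exactly this reshape-and-mask computation, and exponentiating the result gives the perplexity of Section 9.2; a sketch with random logits:

```python
import torch
import torch.nn.functional as F

B, n, V = 2, 3, 128
logits = torch.randn(B, n, V)
labels = torch.tensor([[72, 105, -100], [44, 32, 119]])  # -100 marks padding

# cross_entropy flattens to [B*n, V] / [B*n], skips ignore_index targets,
# and averages over the N = 5 valid tokens, as in the masked-loss formula.
loss = F.cross_entropy(logits.view(-1, V), labels.view(-1), ignore_index=-100)
print(loss, torch.exp(loss))  # loss and its perplexity
```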
9.2 Perplexity
\text{Perplexity} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log(p_i)\right)
Example:
If L = 2.0, then Perplexity = exp(2.0) ≈ 7.39
10. Backpropagation
10.1 Gradient Flow
Loss gradient:
\frac{\partial \mathcal{L}}{\partial \mathbf{L}_{flat}} = \frac{\partial}{\partial \mathbf{L}_{flat}} \left(-\frac{1}{N} \sum_{i=1}^{N} \log(p_i)\right)
Chain rule through output projection:
\frac{\partial \mathcal{L}}{\partial W_{out}} = \frac{\partial \mathcal{L}}{\partial \mathbf{L}} \cdot \frac{\partial \mathbf{L}}{\partial W_{out}}
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_{final}} = \frac{\partial \mathcal{L}}{\partial \mathbf{L}} \cdot W_{out}^T
Through transformer layers (backward):
For layer l from L to 1:
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{l+1}} \cdot \frac{\partial \mathbf{X}_{l+1}}{\partial \mathbf{X}_l}
Residual connection gradient:
\frac{\partial \mathcal{L}}{\partial \mathbf{X}_{in}} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{out}} + \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{residual}}
10.2 Attention Gradients
Attention weight gradients:
\frac{\partial \mathcal{L}}{\partial \mathbf{A}} = \frac{\partial \mathcal{L}}{\partial \mathbf{O}} \cdot \mathbf{V}^T
Query, Key, Value gradients:
\frac{\partial \mathcal{L}}{\partial \mathbf{Q}} = \frac{1}{\sqrt{d_k}} \frac{\partial \mathcal{L}}{\partial \mathbf{S}} \cdot \mathbf{K}
\frac{\partial \mathcal{L}}{\partial \mathbf{K}} = \frac{1}{\sqrt{d_k}} \left(\frac{\partial \mathcal{L}}{\partial \mathbf{S}}\right)^T \cdot \mathbf{Q}
\frac{\partial \mathcal{L}}{\partial \mathbf{V}} = \mathbf{A}^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{O}}
10.3 Gradient Clipping
Gradient norm:
||\mathbf{g}|| = \sqrt{\sum_{i} g_i^2}
Clipped gradient:
\mathbf{g}_{clipped} = \begin{cases}
\mathbf{g} & \text{if } ||\mathbf{g}|| \leq \theta \\
\mathbf{g} \cdot \frac{\theta}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \theta
\end{cases}
where \theta is the max gradient norm (default 1.0).
Example:
g = [0.5, 0.8, 1.2]
||g|| = √(0.5² + 0.8² + 1.2²) = √(0.25 + 0.64 + 1.44) = √2.33 ≈ 1.526
θ = 1.0
Since ||g|| > θ:
g_clipped = g × (1.0 / 1.526) = [0.328, 0.524, 0.786]
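The clipping example, verified in code; in a real training loop the built-in `torch.nn.utils.clip_grad_norm_` applies the same rescaling across all parameter gradients at once:

```python
import torch

g = torch.tensor([0.5, 0.8, 1.2])
theta = 1.0
norm = g.norm()                  # sqrt(2.33) ~ 1.526
if norm > theta:
    g = g * (theta / norm)       # rescale so ||g|| = theta
print(g)                         # tensor([0.3276, 0.5242, 0.7863])

# Equivalent built-in, over a whole model:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```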
11. AdamW Optimizer Update
11.1 AdamW Algorithm
For parameter \theta_t at step t:
Momentum update:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Bias correction:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Parameter update:
\theta_t = \theta_{t-1} - \eta_t \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)
where:
- \beta_1 = 0.9 (momentum decay)
- \beta_2 = 0.999 (variance decay)
- \eta_t = learning rate at step t
- \lambda = weight decay coefficient (default 0.01)
- \epsilon = 10^{-8} (numerical stability)
11.2 Step-by-Step Example
Initialization:
t = 0
θ₀ = 0.5 (initial parameter value)
m₀ = 0, v₀ = 0
β₁ = 0.9, β₂ = 0.999
η = 0.001 (learning rate)
λ = 0.01 (weight decay)
ε = 1e-8
Step 1:
t = 1
g₁ = 0.3 (gradient)
m₁ = 0.9 × 0 + 0.1 × 0.3 = 0.03
v₁ = 0.999 × 0 + 0.001 × 0.3² = 0.001 × 0.09 = 0.00009
m̂₁ = 0.03 / (1 - 0.9¹) = 0.03 / 0.1 = 0.3
v̂₁ = 0.00009 / (1 - 0.999¹) = 0.00009 / 0.001 = 0.09
θ₁ = 0.5 - 0.001 × (0.3 / (√0.09 + 1e-8) + 0.01 × 0.5)
= 0.5 - 0.001 × (0.3 / 0.3 + 0.005)
= 0.5 - 0.001 × (1.005)
= 0.5 - 0.001005
= 0.498995
Step 2:
t = 2
g₂ = -0.2
m₂ = 0.9 × 0.03 + 0.1 × (-0.2) = 0.027 - 0.02 = 0.007
v₂ = 0.999 × 0.00009 + 0.001 × (-0.2)² = 0.00008991 + 0.00004 = 0.00012991
m̂₂ = 0.007 / (1 - 0.9²) = 0.007 / 0.19 = 0.0368
v̂₂ = 0.00012991 / (1 - 0.999²) = 0.00012991 / 0.001999 ≈ 0.06496
θ₂ = 0.498995 - 0.001 × (0.0368 / (√0.06496 + 1e-8) + 0.01 × 0.498995)
= 0.498995 - 0.001 × (0.0368 / 0.2549 + 0.00499)
= 0.498995 - 0.001 × (0.1445 + 0.00499)
= 0.498995 - 0.001 × 0.1495
= 0.498995 - 0.0001495
≈ 0.498845
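Both steps of the worked example can be reproduced with a few lines of plain Python:

```python
import math

theta, m, v = 0.5, 0.0, 0.0
beta1, beta2, lr, wd, eps = 0.9, 0.999, 0.001, 0.01, 1e-8

for t, g in enumerate([0.3, -0.2], start=1):
    m = beta1 * m + (1 - beta1) * g          # first-moment update
    v = beta2 * v + (1 - beta2) * g**2       # second-moment update
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay: lambda * theta added outside the adaptive term.
    theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    print(f"t={t}: theta={theta:.6f}")       # 0.498995, then ~0.498845
```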
11.3 AdamW vs Adam
The key difference is the weight decay term:
Adam (with L2 regularization):
Weight decay is folded into the gradient before the moment estimates:
g_t \leftarrow g_t + \lambda \theta_{t-1}
\theta_t = \theta_{t-1} - \eta_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
so the decay term gets rescaled by the adaptive denominator.
AdamW:
\theta_t = \theta_{t-1} - \eta_t \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)
AdamW decouples weight decay from gradient-based updates, leading to better generalization.
12. Learning Rate Scheduling
12.1 Cosine Annealing Schedule
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos(\pi \cdot \frac{t}{T_{max}})}{2}
where:
- \eta_{max} = initial learning rate
- \eta_{min} = minimum learning rate (default 0)
- T_{max} = total number of steps
- t = current step
Example:
η_max = 0.001
η_min = 0
T_max = 10000
t = 0: η₀ = 0 + (0.001 - 0) × (1 + cos(0)) / 2 = 0.001 × 1 = 0.001
t = 2500: η = 0 + 0.001 × (1 + cos(π/4)) / 2 = 0.001 × (1 + 0.707) / 2 ≈ 0.000854
t = 5000: η = 0 + 0.001 × (1 + cos(π/2)) / 2 = 0.001 × (1 + 0) / 2 = 0.0005
t = 7500: η = 0 + 0.001 × (1 + cos(3π/4)) / 2 ≈ 0.000146
t = 10000: η = 0 + 0.001 × (1 + cos(π)) / 2 = 0.001 × (1 + (-1)) / 2 = 0
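The schedule as a function, reproducing the five values above; `cosine_lr` is an illustrative name:

```python
import math

def cosine_lr(t: int, t_max: int, lr_max: float = 0.001, lr_min: float = 0.0) -> float:
    """Cosine annealing schedule from Section 12.1."""
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for t in [0, 2500, 5000, 7500, 10000]:
    print(t, round(cosine_lr(t, 10000), 6))
# 0.001, 0.000854, 0.0005, 0.000146, 0.0
```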
12.2 Learning Rate Schedule Visualization
The cosine annealing schedule creates a smooth decay from maximum to minimum learning rate following a cosine curve.
13. Text Generation
13.1 Autoregressive Generation
Given prompt tokens \mathbf{P} = [p_1, p_2, ..., p_k]:
Initialization:
\mathbf{T}_0 = \mathbf{P}
For each generation step t from k+1 to k+n:
- Forward pass:
\mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})
- Get next token logits:
\mathbf{l}_t = \mathbf{L}_t[:, -1, :] \in \mathbb{R}^{B \times V}
- Apply temperature:
\mathbf{l}_t' = \frac{\mathbf{l}_t}{T}
where T is the temperature (default 1.0).
- Top-k filtering (optional):
\mathbf{l}_t''[v] = \begin{cases}
\mathbf{l}_t'[v] & \text{if } v \in \text{top-k}(\mathbf{l}_t') \\
-\infty & \text{otherwise}
\end{cases}
- Top-p (nucleus) sampling (optional):
  - Sort tokens by probability
  - Find the smallest set S where \sum_{v \in S} p(v) \geq p
  - Set probabilities outside S to 0
- Sample token:
p_t = \text{softmax}(\mathbf{l}_t'') \in \mathbb{R}^V
t_t \sim \text{Categorical}(p_t)
- Append token:
\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]
13.2 Generation Example
Input:
Prompt: "Hello"
P = [72, 101, 108, 108, 111] ("Hello")
Step 1:
T₀ = [72, 101, 108, 108, 111]
Forward pass → L₁ shape: [1, 5, 128]
l₁ = L₁[0, -1, :] = [0.1, -0.2, ..., 0.8, ...] (logits for next token)
Apply temperature T=1.0:
l₁' = l₁ / 1.0 = l₁
Softmax:
p₁ = softmax(l₁) = [0.001, 0.0005, ..., 0.15, ...]
Sample (let's say token 32 = ' '):
t₁ = 32
T₁ = [72, 101, 108, 108, 111, 32]
Step 2:
T₁ = [72, 101, 108, 108, 111, 32]
Forward pass → L₂ shape: [1, 6, 128]
l₂ = L₂[0, -1, :]
Continue until max_length reached...
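A minimal sampling loop implementing temperature and top-k from Section 13.1; the `generate` signature is illustrative, and `model` is assumed to map token IDs [1, n] to logits [1, n, V]:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=None):
    """Autoregressive sampling loop (Section 13.1), illustrative signature."""
    T = prompt_ids                                       # [1, k]
    for _ in range(max_new_tokens):
        logits = model(T)[:, -1, :] / temperature        # next-token logits, [1, V]
        if top_k is not None:
            # Mask everything below the k-th largest logit to -inf.
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)  # sample one token
        T = torch.cat([T, next_tok], dim=1)                 # append and continue
    return T
```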
13.3 Top-k Sampling
Example:
V = 128, k = 50
l = [0.5, 0.3, ..., -0.1, ...] (128 logits)
Sort and get top 50:
top_k_indices = [0, 5, 12, ..., 87] (50 tokens)
l' = [-inf, -inf, ..., 0.5, -inf, ..., 0.3, ...]
(only top-k kept, others set to -inf)
13.4 Top-p (Nucleus) Sampling
Example:
p = 0.95 (threshold)
p_sorted = [0.3, 0.2, 0.15, 0.1, 0.05, 0.03, ...] (sorted probabilities)
Cumulative: [0.3, 0.5, 0.65, 0.75, 0.8, 0.83, ...]
Find where cumulative ≥ 0.95:
At index 20: cumulative = 0.96 ≥ 0.95
Keep first 20 tokens, set others to 0
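A sketch of nucleus filtering that returns masked logits ready for softmax sampling; `top_p_filter` is an illustrative helper (the top-ranked token is always kept, which handles the case where the single best token already exceeds p):

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, p: float = 0.95) -> torch.Tensor:
    """Nucleus filtering (Section 13.4): keep the smallest set of tokens whose
    cumulative probability reaches p; mask everything else to -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = F.softmax(sorted_logits, dim=-1)
    # Cumulative mass of strictly higher-ranked tokens; once it exceeds p,
    # drop the token. The rank-0 token has preceding mass 0, so it survives.
    drop = probs.cumsum(dim=-1) - probs > p
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    # Scatter the filtered logits back to their original vocabulary positions.
    out = torch.full_like(logits, float("-inf"))
    return out.scatter_(dim=-1, index=sorted_idx, src=sorted_logits)
```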
Summary
This document provides complete mathematical formulations for:
- Data Processing: Tokenization, chunking, padding
- Embeddings: Token embeddings and positional encodings
- Attention: Multi-head self-attention with scaling and masking
- Feed-Forward: GELU activation and linear transformations
- Normalization: Layer normalization with learnable parameters
- Training: Loss computation, backpropagation, gradient clipping
- Optimization: AdamW update rule with momentum and variance tracking
- Scheduling: Cosine annealing learning rate schedule
- Generation: Autoregressive sampling with temperature, top-k, and top-p
Each section includes:
- Mathematical formulations
- Step-by-step calculations
- Worked examples with numerical values
- Implementation details
All equations are directly implementable in PyTorch and match the actual implementation in the SheepOp codebase.