# SheepOp LLM - Mathematical Control System Model

A complete control-system formulation of the SheepOp Language Model, treating the entire pipeline as a unified mathematical control system with state-space representations, transfer functions, and step-by-step explanations.

## Table of Contents

1. [System Overview](#1-system-overview)
2. [State-Space Representation](#2-state-space-representation)
3. [Tokenizer as Input Encoder](#3-tokenizer-as-input-encoder)
4. [Seed Control System](#4-seed-control-system)
5. [Embedding Layer Control](#5-embedding-layer-control)
6. [Positional Encoding State](#6-positional-encoding-state)
7. [Self-Attention Control System](#7-self-attention-control-system)
8. [Feed-Forward Control](#8-feed-forward-control)
9. [Layer Normalization Feedback](#9-layer-normalization-feedback)
10. [Complete System Dynamics](#10-complete-system-dynamics)
11. [Training as Optimization Control](#11-training-as-optimization-control)
12. [Inference Control Loop](#12-inference-control-loop)

---

## 1. System Overview

### 1.1 Control System Architecture

The SheepOp LLM can be modeled as a **nonlinear dynamical control system** with:

- **Input**: Character sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$
- **State**: Hidden representations $\mathbf{h}_t$ at each layer and time step
- **Control**: Model parameters $\theta = \{W_Q, W_K, W_V, W_1, W_2, ...\}$
- **Output**: Probability distribution over vocabulary $\mathbf{p}_t \in \mathbb{R}^V$

**System Block Diagram:**

```
Input Sequence → Tokenizer → Embeddings → Positional Encoding →
        ↓
[Transformer Layer 1] → [Transformer Layer 2] → ... → [Transformer Layer L]
        ↓
Output Projection → Logits → Softmax → Output Probabilities
```

### 1.2 Mathematical System Formulation

The complete system can be expressed as:

```math
\mathbf{y}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_t, \theta, \mathbf{s})
```

where:

- $\mathbf{x}_t$ = input at time $t$
- $\mathbf{h}_t$ = hidden state at time $t$
- $\theta$ = system parameters (weights)
- $\mathbf{s}$ = seed for randomness
- $\mathcal{F}$ = complete forward function

---

## 2. State-Space Representation

### 2.1 Discrete-Time State-Space Model

For a transformer with $L$ layers and sequence length $n$:

**State Vector:**

```math
\mathbf{H}_t = \begin{bmatrix} \mathbf{h}_t^{(1)} \\ \mathbf{h}_t^{(2)} \\ \vdots \\ \mathbf{h}_t^{(L)} \end{bmatrix} \in \mathbb{R}^{L \times n \times d}
```

where $\mathbf{h}_t^{(l)} \in \mathbb{R}^{n \times d}$ is the hidden state at layer $l$.

**State Update Equation:**

```math
\mathbf{h}_t^{(l+1)} = f_l(\mathbf{h}_t^{(l)}, \theta_l), \quad l = 0, 1, ..., L-1
```

where $f_l$ is the transformation at layer $l$.

**Output Equation:**

```math
\mathbf{y}_t = g(\mathbf{h}_t^{(L)}, \theta_{out})
```

### 2.2 System Linearity Analysis

The system is **nonlinear** due to:

- Attention mechanism (softmax)
- Activation functions (GELU)
- Layer normalization

However, individual components can be analyzed as **piecewise linear** systems.

---

## 3. Tokenizer as Input Encoder

### 3.1 Tokenizer Control Function

The tokenizer maps a character sequence to a discrete token sequence:

```math
\mathcal{T}: \mathcal{C}^* \rightarrow \mathbb{N}^*
```

**Mathematical Formulation:**

For input sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$:

```math
\mathbf{t} = \mathcal{T}(\mathbf{c}) = [V(c_1), V(c_2), ..., V(c_n)]
```

where $V: \mathcal{C} \rightarrow \mathbb{N}$ is the vocabulary mapping function.
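As an illustration of $V$ and $\mathcal{T}$, a minimal character-level encoder might look like the sketch below. The `ord()`-based mapping is an assumption that is consistent with the example IDs used later in this document (e.g. 'H' → 72), not necessarily the exact SheepOp implementation.

```python
# Minimal sketch of the character-level mapping V and the tokenizer T(c).
# Assumption: regular characters map to their ASCII code points.

def encode(text: str) -> list[int]:
    """T(c) = [V(c_1), ..., V(c_n)]: apply the vocabulary mapping per character."""
    return [ord(c) for c in text]

def decode(token_ids: list[int]) -> str:
    """Inverse mapping V^{-1} for printable tokens."""
    return "".join(chr(t) for t in token_ids)

print(encode("Hello"))           # [72, 101, 108, 108, 111]
print(decode(encode("Hello")))   # Hello
```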
### 3.2 Vocabulary Mapping Function

```math
V(c) = \begin{cases} 0 & \text{if } c = \text{<pad>} \\ 1 & \text{if } c = \text{<unk>} \\ 2 & \text{if } c = \text{<bos>} \\ 3 & \text{if } c = \text{<eos>} \\ v & \text{if } c \in \mathcal{C}_{vocab} \end{cases}
```

**Control Properties:**

- **Deterministic**: Same input always produces the same output
- **Invertible**: $V^{-1}$ exists for every non-special token
- **Injective**: Each character maps to a unique token ID

### 3.3 Tokenizer State Space

The tokenizer maintains internal state:

```math
\Sigma_{\mathcal{T}} = \{V, V^{-1}, \text{padding\_strategy}, \text{max\_length}\}
```

**State Transition:**

```math
\Sigma_{\mathcal{T}}' = \Sigma_{\mathcal{T}} \quad \text{(static during operation)}
```

### 3.4 Step-by-Step Explanation

**Step 1: Character Extraction**

- Input: Raw text string "Hello"
- Process: Extract each character $c \in \{'H', 'e', 'l', 'l', 'o'\}$
- Meaning: Break down text into atomic units

**Step 2: Vocabulary Lookup**

- Process: Apply $V(c)$ to each character
- Example: $V('H') = 72, V('e') = 101, V('l') = 108, V('o') = 111$
- Meaning: Convert characters to numerical indices

**Step 3: Sequence Formation**

- Output: $\mathbf{t} = [72, 101, 108, 108, 111]$
- Meaning: Numerical representation ready for embedding

**Control Impact**: Tokenizer creates the **foundation** for all subsequent processing. Any error here propagates through the entire system.

---

## 4. Seed Control System

### 4.1 Seed as System Initialization

The seed $s \in \mathbb{N}$ controls **randomness** throughout the system:

```math
\mathcal{R}(\mathbf{x}, s) = \text{deterministic\_random}(\mathbf{x}, s)
```

### 4.2 Seed Propagation Function

**Initialization:**

```math
\text{seed\_torch}(s): \text{torch.manual\_seed}(s) \\
\text{seed\_cuda}(s): \text{torch.cuda.manual\_seed\_all}(s) \\
\text{seed\_cudnn}(s): \text{torch.backends.cudnn.deterministic} = \text{True}
```

**Mathematical Model:**

```math
\mathbb{P}(\mathbf{W} \mid s) = \begin{cases} \delta(\mathbf{W} - \mathbf{W}_s) & \text{if deterministic} \\ \text{some distribution} & \text{if stochastic} \end{cases}
```

where $\delta$ is the Dirac delta and $\mathbf{W}_s$ is the weight initialization given seed $s$.

### 4.3 Seed Control Equation

For weight initialization:

```math
\mathbf{W}_0 = \mathcal{I}(\mathbf{s}, \text{init\_method})
```

where $\mathcal{I}$ is the initialization function.

**Example - Normal Initialization:**

```math
\mathbf{W}_0 \sim \mathcal{N}(0, \sigma^2) \quad \text{with random state } r(s)
```

```math
W_{ij} = \sigma \cdot \Phi^{-1}(U_{ij}(s))
```

where:

- $\mathcal{N}(0, \sigma^2)$ = normal distribution
- $\Phi^{-1}$ = inverse CDF
- $U_{ij}(s)$ = uniform random number from seed $s$
- $\sigma = 0.02$ (typical value)

### 4.4 Step-by-Step Explanation

**Step 1: Seed Input**

- Input: $s = 42$
- Meaning: Provides reproducibility guarantee

**Step 2: RNG State Initialization**

- Process: Set all random number generators to a state based on $s$
- Meaning: Ensures deterministic behavior

**Step 3: Weight Initialization**

- Process: Generate all weights using the RNG with seed $s$
- Example: $W_{ij} = \text{normal}(0, 0.02, \text{seed}=42)$
- Meaning: Starting point for optimization

**Step 4: Training Determinism**

- Process: Same seed + same data → same gradients → same updates
- Meaning: Complete reproducibility

**Control Impact**: Seed controls **initial conditions** and **stochastic processes** throughout training. It's the **control parameter** for reproducibility.

---
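A minimal seeding routine matching Section 4.2 could look like this, assuming a PyTorch backend (the helper name `set_seed` is illustrative rather than the actual SheepOp API):

```python
# Sketch of the seed propagation described above: put every RNG the system
# touches into a state determined by s, then draw reproducible weights.
import random

import numpy as np
import torch

def set_seed(s: int) -> None:
    random.seed(s)
    np.random.seed(s)
    torch.manual_seed(s)                       # CPU RNG
    torch.cuda.manual_seed_all(s)              # all CUDA device RNGs (no-op without CUDA)
    torch.backends.cudnn.deterministic = True  # force deterministic kernels
    torch.backends.cudnn.benchmark = False

set_seed(42)
w = torch.empty(512, 512)
torch.nn.init.normal_(w, mean=0.0, std=0.02)   # W_0 ~ N(0, 0.02^2), reproducible under seed 42
```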
## 5. Embedding Layer Control

### 5.1 Embedding as Linear Transformation

The embedding layer performs a **lookup operation**:

```math
\mathcal{E}: \mathbb{N} \rightarrow \mathbb{R}^d
```

**Mathematical Formulation:**

```math
\mathbf{E} \in \mathbb{R}^{V \times d} \quad \text{(embedding matrix)}
```

```math
\mathbf{x}_t = \mathbf{E}[\mathbf{t}_t] = \mathbf{E}_t \in \mathbb{R}^d
```

where $\mathbf{t}_t \in \mathbb{N}$ is the token ID at position $t$.

### 5.2 Embedding Control System

**Batch Processing:**

```math
\mathbf{X} = \mathbf{E}[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
```

where $\mathbf{T} \in \mathbb{N}^{B \times n}$ is the batch of token IDs.

**Control Function:**

```math
\mathbf{X} = \mathcal{E}(\mathbf{T}, \mathbf{E})
```

**Gradient Flow:**

```math
\frac{\partial \mathcal{L}}{\partial \mathbf{E}} = \sum_{b,t} \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{b,t}} \cdot \mathbf{1}[\mathbf{T}_{b,t}]
```

where $\mathbf{1}[\mathbf{T}_{b,t}]$ is a one-hot indicator.

### 5.3 Step-by-Step Explanation

**Step 1: Token ID Input**

- Input: $t = 72$ (token ID for 'H')
- Meaning: Discrete index into vocabulary

**Step 2: Matrix Lookup**

- Process: $\mathbf{x} = \mathbf{E}[72]$
- Example: $\mathbf{x} = [0.1, -0.2, 0.3, ..., 0.05] \in \mathbb{R}^{512}$
- Meaning: Continuous vector representation

**Step 3: Semantic Encoding**

- Property: Similar tokens have similar embeddings (after training)
- Meaning: Embeddings capture semantic relationships

**Control Impact**: Embedding layer **projects** discrete tokens into continuous space, enabling gradient-based optimization.

---

## 6. Positional Encoding State

### 6.1 Positional Encoding as Additive Control

```math
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{B \times n \times d}
```

where $\mathbf{PE} \in \mathbb{R}^{n \times d}$ is the positional encoding matrix.

### 6.2 Positional Encoding Function

```math
PE_{(pos, i)} = \begin{cases} \sin\left(\frac{pos}{10000^{2i/d}}\right) & \text{if } i \text{ is even} \\ \cos\left(\frac{pos}{10000^{2(i-1)/d}}\right) & \text{if } i \text{ is odd} \end{cases}
```

### 6.3 Control System Interpretation

**Additive Control:**

```math
\mathbf{X}_{out} = \mathbf{X}_{in} + \mathbf{U}_{pos}
```

where $\mathbf{U}_{pos}$ is the **control input** representing position information.

**Meaning**: Positional encoding **injects** positional information into the embeddings.

### 6.4 Step-by-Step Explanation

**Step 1: Position Index**

- Input: Position $pos = 0, 1, 2, ..., n-1$
- Meaning: Absolute position in sequence

**Step 2: Encoding Generation**

- Process: Compute $PE_{(pos, i)}$ for each dimension $i$
- Example: $PE_{(0, 0)} = 0$, $PE_{(0, 1)} = 1$, $PE_{(1, 0)} \approx 0.84$
- Meaning: Unique pattern for each position

**Step 3: Addition Operation**

- Process: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$
- Meaning: Position information added to embeddings

**Step 4: Multi-Scale Representation**

- Property: Different dimensions encode different frequency scales
- Meaning: Model can learn both local and global positional patterns

**Control Impact**: Positional encoding provides **temporal/spatial awareness** to the model, enabling it to understand sequence order.

---
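A small sketch of the embedding lookup plus additive sinusoidal encoding from Sections 5 and 6; the dimensions follow the running example ($V = 128$, $d = 512$), and the code is illustrative rather than the actual SheepOp modules:

```python
# Embedding lookup E[t] followed by the additive control input U_pos.
import torch

def sinusoidal_pe(n: int, d: int) -> torch.Tensor:
    """PE[pos, 2k] = sin(pos / 10000^(2k/d)), PE[pos, 2k+1] = cos(pos / 10000^(2k/d))."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)       # (n, 1)
    k2 = torch.arange(0, d, 2, dtype=torch.float32)               # even dimension indices
    angle = pos / torch.pow(torch.tensor(10000.0), k2 / d)        # (n, d/2)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

vocab_size, d_model = 128, 512
embedding = torch.nn.Embedding(vocab_size, d_model)

t = torch.tensor([[72, 101, 108, 108, 111]])          # "Hello" token IDs, shape (B=1, n=5)
x = embedding(t)                                       # (1, 5, 512)
x_pos = x + sinusoidal_pe(t.shape[1], d_model)         # X_pos = X + PE
print(x_pos.shape)                                     # torch.Size([1, 5, 512])
```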
## 7. Self-Attention Control System

### 7.1 Attention as Information Routing

Self-attention can be modeled as a **dynamical control system** that routes information:

```math
\mathbf{O} = \text{Attention}(\mathbf{X}, \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V)
```

### 7.2 State-Space Model for Attention

**Query, Key, Value Generation:**

```math
\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{B \times n \times d} \\
\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{B \times n \times d} \\
\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{B \times n \times d}
```

**Attention Scores (Transfer Function)**, computed per head after reshaping $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ into $h$ heads of width $d_k = d/h$:

```math
\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times h \times n \times n}
```

**Attention Weights (Control Signal):**

```math
\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{B \times h \times n \times n}
```

**Output (Controlled Response):**

```math
\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{B \times h \times n \times d_k}
```

### 7.3 Control System Interpretation

**Attention as Feedback Control:**

```math
\mathbf{O}_i = \sum_{j=1}^{n} A_{ij} \mathbf{V}_j
```

where $A_{ij}$ is the **control gain** determining how much information flows from position $j$ to position $i$.

**Meaning**: Attention acts as a **learnable routing mechanism** controlled by similarities between queries and keys.

### 7.4 Multi-Head Attention Control

**Head Splitting:**

```math
\mathbf{Q}_h = \mathbf{Q}[:, :, h \cdot d_k : (h+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
```

**Parallel Processing:**

```math
\mathbf{O}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h), \quad h = 1, ..., H
```

**Concatenation:**

```math
\mathbf{O} = \text{Concat}[\mathbf{O}_1, \mathbf{O}_2, ..., \mathbf{O}_H] \in \mathbb{R}^{B \times n \times d}
```

### 7.5 Causal Masking Control

**Causal Mask:**

```math
M_{ij} = \begin{cases} 0 & \text{if } i \geq j \text{ (allowed)} \\ -\infty & \text{if } i < j \text{ (masked)} \end{cases}
```

**Masked Attention:**

```math
\mathbf{S}_{masked} = \mathbf{S} + M
```

**Effect**: Prevents information flow from future positions.

### 7.6 Step-by-Step Explanation

**Step 1: Query, Key, Value Generation**

- Process: Linear transformations of input
- Meaning: Create three representations: what to look for (Q), what to match (K), what to retrieve (V)

**Step 2: Similarity Computation**

- Process: $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$
- Meaning: Measure similarity/relevance between positions $i$ and $j$

**Step 3: Softmax Normalization**

- Process: $A_{ij} = \exp(S_{ij}) / \sum_k \exp(S_{ik})$
- Meaning: Convert similarities to probability distribution (attention weights)

**Step 4: Weighted Aggregation**

- Process: $O_i = \sum_j A_{ij} V_j$
- Meaning: Combine values weighted by attention probabilities

**Step 5: Information Flow**

- Property: Each position receives information from all preceding positions (under causal masking)
- Meaning: Enables long-range dependencies and context understanding

**Control Impact**: Self-attention is the **core control mechanism** that determines **what information flows where** in the sequence.

---
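The routing equations above can be condensed into a single-head sketch (illustrative only; the real block is multi-head with an output projection and dropout):

```python
# Single-head causal attention: O = softmax(QK^T / sqrt(d_k) + M) V.
import math
import torch

def causal_attention(x, w_q, w_k, w_v):
    """x: (B, n, d). w_q/w_k/w_v: (d, d_k). Returns the controlled response O."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # queries, keys, values
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # transfer function S, (B, n, n)
    n = scores.shape[-1]
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))        # M_ij = -inf for future positions
    attn = torch.softmax(scores, dim=-1)                    # control gains A_ij
    return attn @ v                                         # weighted aggregation

B, n, d = 1, 5, 64
x = torch.randn(B, n, d)
w_q, w_k, w_v = (torch.randn(d, d) * 0.02 for _ in range(3))
print(causal_attention(x, w_q, w_k, w_v).shape)             # torch.Size([1, 5, 64])
```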
## 8. Feed-Forward Control

### 8.1 Feed-Forward as Nonlinear Transformation

```math
\text{FFN}(\mathbf{X}) = \text{GELU}(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2
```

### 8.2 Control System Model

**Two-Stage Transformation:**

```math
\mathbf{H} = \mathbf{X} \mathbf{W}_1 + \mathbf{b}_1 \in \mathbb{R}^{B \times n \times d_{ff}} \\
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}} \\
\mathbf{O} = \mathbf{H}' \mathbf{W}_2 + \mathbf{b}_2 \in \mathbb{R}^{B \times n \times d}
```

### 8.3 GELU Activation Control

```math
\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
```

**Control Interpretation**: GELU applies **smooth gating**: negative values are suppressed toward zero, while positive values pass through largely unchanged.

### 8.4 Step-by-Step Explanation

**Step 1: Expansion**

- Process: $\mathbf{H} = \mathbf{X} \mathbf{W}_1$ expands the hidden dimension to $d_{ff} > d$
- Example: $d = 512 \rightarrow d_{ff} = 2048$
- Meaning: Increases capacity for complex transformations

**Step 2: Nonlinear Activation**

- Process: $\mathbf{H}' = \text{GELU}(\mathbf{H})$
- Meaning: Introduces nonlinearity, enabling complex function approximation

**Step 3: Compression**

- Process: $\mathbf{O} = \mathbf{H}' \mathbf{W}_2$ compresses back to $d$
- Meaning: Projects back to original dimension

**Control Impact**: FFN provides **nonlinear processing power** and **feature transformation** at each position.

---

## 9. Layer Normalization Feedback

### 9.1 Normalization as Feedback Control

```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
```

where:

- $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$ (mean)
- $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$ (variance)
- $\gamma, \beta$ = learnable parameters (scale and shift)

### 9.2 Control System Interpretation

**Normalization as State Regulation:**

```math
\mathbf{x}_{norm} = \gamma \odot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \beta
```

**Meaning**: Normalization **regulates** the distribution of activations, preventing saturation and improving gradient flow.

### 9.3 Pre-Norm Architecture

**Transformer Block with Pre-Norm:**

```math
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in}) \\
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm}) \\
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
```

**Control Impact**: Pre-norm architecture provides **stability** and **better gradient flow**.

### 9.4 Step-by-Step Explanation

**Step 1: Mean Computation**

- Process: $\mu = \frac{1}{d} \sum x_i$
- Meaning: Find center of distribution

**Step 2: Variance Computation**

- Process: $\sigma^2 = \frac{1}{d} \sum (x_i - \mu)^2$
- Meaning: Measure spread of distribution

**Step 3: Normalization**

- Process: $\hat{x}_i = (x_i - \mu) / \sqrt{\sigma^2 + \epsilon}$
- Meaning: Standardize to zero mean, unit variance

**Step 4: Scale and Shift**

- Process: $x_{out} = \gamma \odot \hat{x} + \beta$
- Meaning: Allow model to learn optimal scale and shift

**Control Impact**: Layer normalization provides **stability** and **faster convergence** by maintaining consistent activation distributions.

---
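A from-scratch sketch of the FFN and LayerNorm equations, wired in the pre-norm order of Section 9.3 (illustrative; in practice these correspond to `torch.nn.Linear`, `torch.nn.LayerNorm`, and `torch.nn.GELU`):

```python
# LayerNorm, exact-erf GELU, and the expand/compress FFN from Sections 8-9.
import math
import torch

def layer_norm(x, gamma, beta, eps: float = 1e-5):
    """gamma * (x - mu) / sqrt(sigma^2 + eps) + beta, per feature vector."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

def gelu(x):
    """GELU(x) = x * Phi(x), written with the exact erf form."""
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2: expand to d_ff, then project back to d."""
    return gelu(x @ w1 + b1) @ w2 + b2

d, d_ff = 512, 2048
x = torch.randn(1, 5, d)
gamma, beta = torch.ones(d), torch.zeros(d)
w1, b1 = torch.randn(d, d_ff) * 0.02, torch.zeros(d_ff)
w2, b2 = torch.randn(d_ff, d) * 0.02, torch.zeros(d)
out = ffn(layer_norm(x, gamma, beta), w1, b1, w2, b2)   # pre-norm ordering
print(out.shape)                                        # torch.Size([1, 5, 512])
```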
## 10. Complete System Dynamics

### 10.1 Complete Forward Pass

**System State Evolution:**

```math
\mathbf{h}_0 = \mathcal{E}(\mathbf{T}) + \mathbf{PE} \quad \text{(embedding + positional)} \\
\mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., L \\
\mathbf{y} = \mathbf{h}_L \mathbf{W}_{out} \in \mathbb{R}^{B \times n \times V}
```

### 10.2 Recursive System Equation

```math
\mathbf{h}_t^{(l)} = f_l(\mathbf{h}_t^{(l-1)}, \theta_l)
```

where:

```math
f_l(\mathbf{x}, \theta_l) = \mathbf{x} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{x}))) + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{x} + \text{Attention}(\text{LayerNorm}(\mathbf{x})))))
```

### 10.3 System Transfer Function

The complete system can be viewed as:

```math
\mathbf{Y} = \mathcal{F}(\mathbf{T}, \theta, \mathbf{s})
```

where:

- $\mathbf{T}$ = input tokens
- $\theta$ = all parameters
- $\mathbf{s}$ = seed

**Properties:**

- **Nonlinear**: Due to softmax, GELU, normalization
- **Differentiable**: All operations have gradients
- **Compositional**: Built from simpler functions

### 10.4 Step-by-Step System Flow

**Step 1: Input Encoding**

- Input: Token sequence $\mathbf{T}$
- Process: Embedding + Positional Encoding
- Output: $\mathbf{h}_0 \in \mathbb{R}^{B \times n \times d}$
- Meaning: Convert discrete tokens to continuous vectors with position info

**Step 2: Layer Processing**

- For each layer $l = 1, ..., L$:
- Process: Self-attention + FFN with residual connections
- Output: $\mathbf{h}_l \in \mathbb{R}^{B \times n \times d}$
- Meaning: Transform representations through attention and processing

**Step 3: Output Generation**

- Process: Final layer norm + output projection
- Output: $\mathbf{L} \in \mathbb{R}^{B \times n \times V}$ (logits)
- Meaning: Predict probability distribution over vocabulary

**Step 4: Probability Computation**

- Process: Softmax over logits
- Output: $\mathbf{p} \in \mathbb{R}^{B \times n \times V}$ (probabilities)
- Meaning: Normalized probability distribution for next token prediction

---

## 11. Training as Optimization Control

### 11.1 Training as Optimal Control Problem

**Objective Function:**

```math
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\mathbf{y}_i, \hat{\mathbf{y}}_i(\theta))
```

where:

- $\mathcal{L}$ = loss function (cross-entropy)
- $\mathbf{y}_i$ = true labels
- $\hat{\mathbf{y}}_i(\theta)$ = model predictions

**Optimization Problem:**

```math
\theta^* = \arg\min_{\theta} J(\theta)
```

### 11.2 Gradient-Based Control

**Gradient Computation:**

```math
\mathbf{g}_t = \nabla_\theta J(\theta_t) = \frac{\partial J}{\partial \theta_t}
```

**Parameter Update (AdamW):**

```math
\theta_{t+1} = \theta_t - \eta_t \left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \theta_t\right)
```

where:

- $\hat{\mathbf{m}}_t$ = bias-corrected first moment (momentum)
- $\hat{\mathbf{v}}_t$ = bias-corrected second moment (variance)
- $\eta_t$ = learning rate (controlled by scheduler)
- $\lambda$ = weight decay coefficient

### 11.3 Learning Rate Control

**Cosine Annealing Schedule:**

```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos(\pi \cdot \frac{t}{T_{max}})}{2}
```

**Control Interpretation**: Learning rate acts as **gain scheduling** - high gain initially for fast convergence, low gain later for fine-tuning.
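The schedule can be written directly from the formula; the $\eta_{max}$ and $T_{max}$ values below are illustrative, and PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same curve:

```python
# Direct implementation of the cosine-annealing gain schedule in Section 11.3.
import math

def cosine_lr(t: int, t_max: int, eta_max: float, eta_min: float = 0.0) -> float:
    """eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * t / T_max)) / 2."""
    return eta_min + (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max)) / 2.0

for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_lr(step, t_max=1000, eta_max=3e-4), 6))
# step 0    -> 3.0e-4 (high gain for fast convergence)
# step 500  -> 1.5e-4
# step 1000 -> ~0     (low gain for fine-tuning)
```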
### 11.4 Gradient Clipping Control

**Clipping Function** (with clipping threshold $\tau$):

```math
\mathbf{g}_{clipped} = \begin{cases} \mathbf{g} & \text{if } ||\mathbf{g}|| \leq \tau \\ \mathbf{g} \cdot \frac{\tau}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \tau \end{cases}
```

**Purpose**: Prevents **exploding gradients** that could destabilize training.

### 11.5 Step-by-Step Training Control

**Step 1: Forward Pass**

- Process: $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}, \theta_t)$
- Meaning: Compute predictions with current parameters

**Step 2: Loss Computation**

- Process: $\mathcal{L} = \text{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})$
- Meaning: Measure prediction error

**Step 3: Backward Pass**

- Process: $\mathbf{g} = \nabla_\theta \mathcal{L}$
- Meaning: Compute gradients for all parameters

**Step 4: Gradient Clipping**

- Process: $\mathbf{g}_{clipped} = \text{Clip}(\mathbf{g}, \tau)$
- Meaning: Prevent gradient explosion

**Step 5: Optimizer Update**

- Process: $\theta_{t+1} = \text{AdamW}(\theta_t, \mathbf{g}_{clipped}, \eta_t)$
- Meaning: Update parameters using adaptive learning rate

**Step 6: Learning Rate Update**

- Process: $\eta_{t+1} = \text{Scheduler}(\eta_t, t)$
- Meaning: Adjust learning rate according to schedule

**Control Impact**: Training process is a **closed-loop control system** where:

- **Error signal**: Loss
- **Controller**: Optimizer (AdamW)
- **Actuator**: Parameter updates
- **Plant**: Model forward pass

---

## 12. Inference Control Loop

### 12.1 Autoregressive Generation as Control Loop

**State-Space Model:**

```math
\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}, \theta) \\
\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out}) \\
\mathbf{x}_{t+1} \sim \text{Categorical}(\mathbf{p}_t)
```

### 12.2 Generation Control Function

**Step-by-Step:**

1. **Current State**: $\mathbf{h}_t$
2. **Output Generation**: $\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})$
3. **Sampling**: $x_{t+1} \sim \mathbf{p}_t$ (with temperature, top-k, top-p)
4. **State Update**: $\mathbf{h}_{t+1} = \mathcal{F}([\mathbf{h}_t, x_{t+1}], \theta)$
5. **Repeat**: Until max length or stop token

### 12.3 Sampling Control Parameters

**Temperature Control:**

```math
\mathbf{p}_t^{temp} = \text{softmax}\left(\frac{\mathbf{h}_t \mathbf{W}_{out}}{T}\right)
```

- $T < 1$: More deterministic (sharp distribution)
- $T > 1$: More random (flat distribution)
- $T = 1$: Default

**Top-k Filtering:**

```math
\mathbf{p}_t^{topk}[v] = \begin{cases} \mathbf{p}_t[v] & \text{if } v \in \text{top-k}(\mathbf{p}_t) \\ 0 & \text{otherwise} \end{cases}
```

**Top-p (Nucleus) Sampling:**

```math
\mathbf{p}_t^{topp}[v] = \begin{cases} \mathbf{p}_t[v] & \text{if } v \in S_p \\ 0 & \text{otherwise} \end{cases}
```

where $S_p$ is the smallest set such that $\sum_{v \in S_p} \mathbf{p}_t[v] \geq p$.
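A compact sketch of these three controls applied to one logit vector (an illustrative helper, not the actual SheepOp generation code):

```python
# Temperature scaling, top-k filtering, and top-p (nucleus) filtering from Section 12.3.
import torch

def sample_next(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=1.0) -> int:
    """logits: (V,). Returns one sampled token ID."""
    logits = logits / temperature                            # T<1 sharpens, T>1 flattens
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]           # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        remove = cum - probs > top_p                         # keep smallest set S_p with mass >= p
        logits[idx[remove]] = float("-inf")
    p = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(p, num_samples=1))

logits = torch.randn(128)                                    # one row of logits over V=128 tokens
token = sample_next(logits, temperature=0.8, top_k=40, top_p=0.9)
print(token)
```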
### 12.4 Step-by-Step Inference Control

**Step 1: Initialization**

- Input: Prompt tokens $\mathbf{P} = [p_1, ..., p_k]$
- Process: Initialize state $\mathbf{h}_0 = \mathcal{E}(\mathbf{P}) + \mathbf{PE}$
- Meaning: Set initial state from prompt

**Step 2: Forward Pass**

- Process: $\mathbf{h}_t = \text{Transformer}(\mathbf{h}_{t-1})$
- Output: Hidden state $\mathbf{h}_t$
- Meaning: Process current sequence

**Step 3: Logit Generation**

- Process: $\mathbf{l}_t = \mathbf{h}_t \mathbf{W}_{out}$
- Output: Logits $\mathbf{l}_t \in \mathbb{R}^V$
- Meaning: Unnormalized scores for each token

**Step 4: Probability Computation**

- Process: $\mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)$
- Output: Probability distribution $\mathbf{p}_t$
- Meaning: Normalized probabilities with temperature

**Step 5: Sampling**

- Process: $x_{t+1} \sim \mathbf{p}_t$ (with optional top-k/top-p)
- Output: Next token $x_{t+1}$
- Meaning: Stochastically select next token

**Step 6: State Update**

- Process: Append $x_{t+1}$ to sequence, update $\mathbf{h}_{t+1}$
- Meaning: Incorporate new token into state

**Step 7: Termination Check**

- Condition: $t < \text{max\_length}$ and $x_{t+1} \neq \text{<eos>}$
- If true: Go to Step 2
- If false: Return generated sequence

**Control Impact**: Inference is a **recurrent control system** where:

- **State**: Current hidden representation
- **Control**: Sampling strategy (temperature, top-k, top-p)
- **Output**: Generated token sequence

---

## Summary: Unified Control System Model

### Complete System Equation

```math
\mathbf{Y} = \mathcal{G}(\mathbf{C}, \theta, \mathbf{s}, T, \{k, p\})
```

where:

- $\mathbf{C}$ = input characters
- $\theta$ = model parameters
- $\mathbf{s}$ = seed
- $T$ = sampling temperature
- $\{k, p\}$ = top-k and top-p parameters

### System Components as Control Elements

1. **Tokenizer**: Input encoder $\mathcal{T}$
2. **Seed**: Initialization control $\mathbf{s}$
3. **Embeddings**: State projection $\mathcal{E}$
4. **Positional Encoding**: Temporal control $\mathbf{PE}$
5. **Attention**: Information routing $\mathcal{A}$
6. **FFN**: Nonlinear transformation $\mathcal{F}$
7. **Normalization**: State regulation $\mathcal{N}$
8. **Optimizer**: Parameter control $\mathcal{O}$
9. **Scheduler**: Learning rate control $\mathcal{S}$
10. **Sampling**: Output control $\mathcal{P}$

### Control Flow Summary

```
Input Characters
      ↓ [Tokenizer Control]
Token IDs
      ↓ [Seed Control]
Initialized Parameters
      ↓ [Embedding Control]
Vector Representations
      ↓ [Positional Control]
Position-Aware Vectors
      ↓ [Attention Control]
Context-Aware Representations
      ↓ [FFN Control]
Transformed Features
      ↓ [Normalization Control]
Stabilized Activations
      ↓ [Output Control]
Probability Distributions
      ↓ [Sampling Control]
Generated Tokens
```

Each component acts as a **control element** in a unified dynamical system, working together to transform input text into meaningful language model outputs.

---

## 13.
Block Diagram Analysis ### 13.1 Single Transformer Block Control System **Block Diagram (a): Detailed Single Transformer Block** ``` Input X ↓ ┌─────────────┐ │ LayerNorm │ └──────┬──────┘ ↓ ┌─────────────┐ │ Multi-Head │ │ Attention │ └──────┬──────┘ ↓ ┌─────────────┐ │ Dropout │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── (Residual Connection from X) └──────┬──────┘ ↓ ┌─────────────┐ │ LayerNorm │ └──────┬──────┘ ↓ ┌─────────────┐ │ Feed-Forward│ │ Network │ └──────┬──────┘ ↓ ┌─────────────┐ │ Dropout │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── (Residual Connection) └──────┬──────┘ ↓ Output X' ``` **Mathematical Transfer Function:** ```math \mathbf{X}_{out} = \mathbf{X}_{in} + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_{in} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X}_{in}))))) ``` ### 13.2 Simplified Transformer Block **Block Diagram (b): Simplified Single Block** ``` Input X ↓ ┌─────────────────────────────────────┐ │ TransformerBlock │ │ G_block(X) = X + Attn(LN(X)) + │ │ FFN(LN(X + Attn(LN(X))))│ └──────────────┬──────────────────────┘ ↓ Output X' ``` **Transfer Function:** ```math G_{block}(\mathbf{X}) = \mathbf{X} + G_{attn}(\text{LN}(\mathbf{X})) + G_{ffn}(\text{LN}(\mathbf{X} + G_{attn}(\text{LN}(\mathbf{X})))) where: - G_{attn} = Attention transfer function - G_{ffn} = Feed-forward transfer function - \text{LN} = Layer normalization ``` ### 13.3 Complete Model with Multiple Layers **Block Diagram (c): Cascaded Transformer Blocks** ``` Input Tokens T ↓ ┌─────────────┐ │ Embedding │ │ G_emb │ └──────┬──────┘ ↓ ┌─────────────┐ │ Positional │ │ G_pos │ └──────┬──────┘ ↓ ┌─────────────┐ │ Block 1 │ │ G_block₁ │ └──────┬──────┘ ↓ ┌─────────────┐ │ Block 2 │ │ G_block₂ │ └──────┬──────┘ ↓ ┌─────────────┐ │ ... │ └──────┬──────┘ ↓ ┌─────────────┐ │ Block L │ │ G_block_L │ └──────┬──────┘ ↓ ┌─────────────┐ │ Final Norm │ │ G_norm │ └──────┬──────┘ ↓ ┌─────────────┐ │ Output Proj │ │ G_out │ └──────┬──────┘ ↓ Output Logits ``` **Overall Transfer Function:** ```math \mathbf{Y} = G_{out} \circ G_{norm} \circ G_{block_L} \circ ... \circ G_{block_2} \circ G_{block_1} \circ G_{pos} \circ G_{emb}(\mathbf{T}) ``` ### 13.4 Closed-Loop Training System **Block Diagram (d): Training Control Loop** ``` Input Data X ↓ ┌─────────────┐ │ Model │ │ Forward │ │ F │ └──────┬──────┘ ↓ ┌─────────────┐ │ Output │ │ ŷ │ └──────┬──────┘ ↓ ┌─────────────┐ │ Loss │ │ L(ŷ, y) │ └──────┬──────┘ ↓ ┌─────────────┐ │ Gradient │ │ ∇θ │ └──────┬──────┘ ↓ ┌─────────────┐ │ Clipping │ │ Clip │ └──────┬──────┘ ↓ ┌─────────────┐ │ Optimizer │ │ AdamW │ └──────┬──────┘ ↓ ┌─────────────┐ │ Parameter │ │ Update │ └──────┬──────┘ ↓ ┌─────────────┐ │ - │ ←─── (Feedback to Model) └─────────────┘ ``` **Closed-Loop Transfer Function:** ```math \theta_{t+1} = \theta_t - \eta_t \cdot \text{AdamW}(\text{Clip}(\nabla_\theta L(\mathcal{F}(\mathbf{X}, \theta_t), \mathbf{y}))) ``` --- ## 14. Vector Visualization and Examples ### 14.1 Example Phrase: "Hello World" We'll trace through the complete system with the phrase **"Hello World"**. 
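Before stepping through the stages, the tensor shapes involved can be checked with a short sketch. The hyperparameters ($V = 128$, $d = 512$, $L = 6$, $H = 8$, $d_{ff} = 2048$) follow the running example, and PyTorch's built-in encoder layer is used here only as a stand-in for the actual SheepOp block:

```python
# Shape trace for the "Hello World" example (1 batch, n = 11 tokens).
import torch
import torch.nn as nn

V, d, n, L, H = 128, 512, 11, 6, 8

t = torch.tensor([[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]])   # (1, 11)
x = nn.Embedding(V, d)(t)                                  # embedding lookup: (1, 11, 512)
x = x + torch.zeros(n, d)                                  # + PE (zeros as a placeholder here)
block = nn.TransformerEncoderLayer(d_model=d, nhead=H, dim_feedforward=2048,
                                   activation="gelu", norm_first=True,
                                   batch_first=True)
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)   # causal mask M
for _ in range(L):                                         # reuse one block just to check shapes
    x = block(x, src_mask=mask)
logits = nn.Linear(d, V, bias=False)(x)                    # output projection: (1, 11, 128)
probs = torch.softmax(logits, dim=-1)
print(logits.shape, probs.shape)
```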
#### Step 1: Tokenization **Input:** `"Hello World"` **Process:** ``` Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd'] Token IDs: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100] ``` **Mathematical:** ```math \mathbf{c} = \text{"Hello World"} \mathbf{t} = \mathcal{T}(\mathbf{c}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100] ``` **Vector Representation:** - Dimension: $n = 11$ tokens - Token IDs: $\mathbf{t} \in \mathbb{N}^{11}$ #### Step 2: Embedding **Embedding Matrix:** $\mathbf{E} \in \mathbb{R}^{128 \times 512}$ **Lookup Operation:** ```math \mathbf{X} = \mathbf{E}[\mathbf{t}] = \begin{bmatrix} \mathbf{E}[72] \\ \mathbf{E}[101] \\ \mathbf{E}[108] \\ \mathbf{E}[108] \\ \mathbf{E}[111] \\ \mathbf{E}[32] \\ \mathbf{E}[87] \\ \mathbf{E}[111] \\ \mathbf{E}[114] \\ \mathbf{E}[108] \\ \mathbf{E}[100] \end{bmatrix} \in \mathbb{R}^{11 \times 512} ``` **Example Values (first 3 dimensions):** ```math \mathbf{E}[72] = [0.1, -0.2, 0.3, ...]^T \\ \mathbf{E}[101] = [-0.1, 0.3, -0.1, ...]^T \\ \mathbf{E}[108] = [0.05, 0.15, -0.05, ...]^T ``` **Vector Visualization:** ``` Token 'H' (ID=72): [0.10, -0.20, 0.30, ..., 0.05] (512-dim vector) Token 'e' (ID=101): [-0.10, 0.30, -0.10, ..., 0.02] (512-dim vector) Token 'l' (ID=108): [0.05, 0.15, -0.05, ..., 0.01] (512-dim vector) ... ``` #### Step 3: Positional Encoding **Positional Encoding Matrix:** $\mathbf{PE} \in \mathbb{R}^{11 \times 512}$ **Computation:** ```math PE_{(0, 0)} = \sin(0 / 10000^0) = 0 \\ PE_{(0, 1)} = \cos(0 / 10000^0) = 1 \\ PE_{(1, 0)} = \sin(1 / 10000^0) = \sin(1) \approx 0.8415 \\ PE_{(1, 1)} = \cos(1 / 10000^0) = \cos(1) \approx 0.5403 ``` **Addition:** ```math \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} ``` **Example (first token, first 3 dimensions):** ```math \mathbf{X}_{pos}[0, :3] = \begin{bmatrix} 0.1 \\ -0.2 \\ 0.3 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.3 \end{bmatrix} ``` #### Step 4: Multi-Head Attention **Query, Key, Value Projections:** Let $\mathbf{W}\_Q, \mathbf{W}\_K, \mathbf{W}\_V \in \mathbb{R}^{512 \times 512}$ ```math \mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{11 \times 512} ``` **Example Calculation (head 0, token 0):** For $h = 0 , d_k = 512/8 = 64 $: ```math \mathbf{Q}[0, :64] = \mathbf{X}_{pos}[0] \mathbf{W}_Q[:, :64] ``` **Attention Score Computation:** ```math S_{0,1} = \frac{\mathbf{Q}[0] \cdot \mathbf{K}[1]}{\sqrt{64}} = \frac{\sum_{i=0}^{63} Q_{0,i} \cdot K_{1,i}}{8} ``` **Example Numerical Calculation:** Assume: ```math \mathbf{Q}[0, :3] = [0.2, -0.1, 0.3] \\ \mathbf{K}[1, :3] = [0.1, 0.2, -0.1] S_{0,1} = \frac{0.2 \times 0.1 + (-0.1) \times 0.2 + 0.3 \times (-0.1)}{8} \\ = \frac{0.02 - 0.02 - 0.03}{8} = \frac{-0.03}{8} = -0.00375 ``` **Attention Weights:** ```math A_{0,:} = \text{softmax}(S_{0,:}) = \frac{\exp(S_{0,:})}{\sum_{j=0}^{10} \exp(S_{0,j})} ``` **Example:** If $S\_{0,:} = [-0.004, 0.05, 0.02, 0.02, 0.08, -0.01, 0.03, 0.08, 0.01, 0.02, 0.04]$ ```math \exp(S_{0,:}) = [0.996, 1.051, 1.020, 1.020, 1.083, 0.990, 1.030, 1.083, 1.010, 1.020, 1.041] \sum = 11.335 A_{0,:} = [0.088, 0.093, 0.090, 0.090, 0.096, 0.087, 0.091, 0.096, 0.089, 0.090, 0.092] ``` **Output Calculation:** ```math \mathbf{O}[0] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}[j] ``` **Example (first dimension):** ```math O_{0,0} = A_{0,0} V_{0,0} + A_{0,1} V_{1,0} + ... + A_{0,10} V_{10,0} \\ = 0.088 \times 0.2 + 0.093 \times 0.1 + ... 
+ 0.092 \times 0.15 \\ \approx 0.12 ``` #### Step 5: Feed-Forward Network **Input:** $\mathbf{X}\_{attn} \in \mathbb{R}^{11 \times 512}$ **First Linear Transformation:** ```math \mathbf{H} = \mathbf{X}_{attn} \mathbf{W}_1 \in \mathbb{R}^{11 \times 2048} ``` **Example (token 0, first dimension):** ```math H_{0,0} = \sum_{i=0}^{511} X_{attn,0,i} \cdot W_{1,i,0} Assuming X_{attn}[0, :3] = [0.12, -0.05, 0.08] and W_1[:3, :3] = \begin{bmatrix} 0.1 & 0.2 \\ -0.1 & 0.1 \\ 0.05 & -0.05 \end{bmatrix} H_{0,0} = 0.12 \times 0.1 + (-0.05) \times (-0.1) + 0.08 \times 0.05 \\ = 0.012 + 0.005 + 0.004 = 0.021 ``` **GELU Activation:** ```math \text{GELU}(0.021) = 0.021 \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{0.021}{\sqrt{2}}\right)\right) \text{erf}(0.021/\sqrt{2}) = \text{erf}(0.0148) \approx 0.0167 \text{GELU}(0.021) = 0.021 \times 0.5 \times (1 + 0.0167) = 0.021 \times 0.5084 \approx 0.0107 ``` **Second Linear Transformation:** ```math \mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{11 \times 512} ``` #### Step 6: Complete Forward Pass Through One Layer **Input:** $\mathbf{X}_{in} = \mathbf{X}_{pos} \in \mathbb{R}^{11 \times 512}$ **Step 6.1: Layer Normalization** ```math \mu_0 = \frac{1}{512} \sum_{i=0}^{511} X_{in,0,i} ``` **Example:** ```math \mu_0 = \frac{0.1 + 0.8 + 0.3 + ...}{512} \approx 0.02 \sigma_0^2 = \frac{1}{512} \sum_{i=0}^{511} (X_{in,0,i} - \mu_0)^2 \sigma_0^2 \approx \frac{(0.1-0.02)^2 + (0.8-0.02)^2 + ...}{512} \approx 0.15 \hat{X}_{0,0} = \frac{0.1 - 0.02}{\sqrt{0.15 + 1e-5}} = \frac{0.08}{0.387} \approx 0.207 ``` **Step 6.2: Attention Output** ```math \mathbf{X}_{attn} = \text{Attention}(\hat{\mathbf{X}}) ``` **Step 6.3: Residual Connection** ```math \mathbf{X}_{res1} = \mathbf{X}_{in} + \mathbf{X}_{attn} ``` **Example:** ```math X_{res1,0,0} = 0.1 + 0.12 = 0.22 ``` **Step 6.4: Second Layer Norm + FFN** ```math \mathbf{X}_{ffn} = \text{FFN}(\text{LayerNorm}(\mathbf{X}_{res1})) ``` **Step 6.5: Final Residual** ```math \mathbf{X}_{out} = \mathbf{X}_{res1} + \mathbf{X}_{ffn} ``` **Example:** ```math X_{out,0,0} = 0.22 + 0.15 = 0.37 ``` #### Step 7: Output Projection **After L layers:** ```math \mathbf{H}_{final} = \text{LayerNorm}(\mathbf{X}_{out}^{(L)}) \in \mathbb{R}^{11 \times 512} ``` **Output Projection:** ```math \mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128} ``` **Example (position 0):** ```math L_{0,:} = \mathbf{H}_{final}[0] \mathbf{W}_{out} \in \mathbb{R}^{128} ``` **Softmax:** ```math p_{0,v} = \frac{\exp(L_{0,v})}{\sum_{w=0}^{127} \exp(L_{0,w})} ``` **Example:** If $L*{0,72} = 5.2 (logit for 'H'), L*{0,101} = 3.1 (logit for 'e'), etc.$ ```math \exp(5.2) = 181.27 \\ \exp(3.1) = 22.20 \\ \vdots \sum_{w=0}^{127} \exp(L_{0,w}) \approx 250.0 p_{0,72} = \frac{181.27}{250.0} \approx 0.725 \quad \text{(72\% probability for H)} ``` --- ## 15. Complete Numerical Example: "Hello" Let's trace through the complete system with **"Hello"** step-by-step. ### Input: "Hello" ### Stage 1: Tokenization ```math \mathbf{c} = \text{"Hello"} = ['H', 'e', 'l', 'l', 'o'] \mathbf{t} = [72, 101, 108, 108, 111] ``` ### Stage 2: Embedding (d=512) ```math \mathbf{E} \in \mathbb{R}^{128 \times 512} \mathbf{X} = \begin{bmatrix} \mathbf{E}[72] \\ \mathbf{E}[101] \\ \mathbf{E}[108] \\ \mathbf{E}[108] \\ \mathbf{E}[111] \end{bmatrix} = \begin{bmatrix} 0.10 & -0.20 & 0.30 & ... & 0.05 \\ -0.10 & 0.30 & -0.10 & ... & 0.02 \\ 0.05 & 0.15 & -0.05 & ... & 0.01 \\ 0.05 & 0.15 & -0.05 & ... & 0.01 \\ -0.05 & 0.20 & 0.10 & ... 
& 0.03 \end{bmatrix} \in \mathbb{R}^{5 \times 512} ``` ### Stage 3: Positional Encoding ```math \mathbf{PE} = \begin{bmatrix} 0 & 1 & 0 & ... & 0 \\ 0.84 & 0.54 & 0.01 & ... & 0.00 \\ 0.91 & -0.42 & 0.02 & ... & 0.00 \\ 0.14 & -0.99 & 0.03 & ... & 0.00 \\ -0.76 & -0.65 & 0.04 & ... & 0.00 \end{bmatrix} \in \mathbb{R}^{5 \times 512} \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} = \begin{bmatrix} 0.10 & 0.80 & 0.30 & ... & 0.05 \\ 0.74 & 0.84 & -0.09 & ... & 0.02 \\ 0.96 & -0.27 & -0.03 & ... & 0.01 \\ 0.19 & -0.84 & -0.02 & ... & 0.01 \\ -0.81 & -0.45 & 0.14 & ... & 0.03 \end{bmatrix} ``` ### Stage 4: Attention (h=8 heads, d_k=64) **Query Generation:** ```math \mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{5 \times 512} ``` **Score Matrix (head 0):** ```math \mathbf{S}_0 = \frac{\mathbf{Q}_0 \mathbf{K}_0^T}{\sqrt{64}} \in \mathbb{R}^{5 \times 5} ``` **Example Values:** ```math \mathbf{S}_0 = \begin{bmatrix} 0.50 & -0.10 & 0.20 & 0.15 & 0.30 \\ -0.05 & 0.45 & 0.10 & 0.08 & 0.25 \\ 0.15 & 0.05 & 0.40 & 0.30 & 0.20 \\ 0.12 & 0.08 & 0.28 & 0.35 & 0.18 \\ 0.25 & 0.15 & 0.22 & 0.20 & 0.42 \end{bmatrix} ``` **Attention Weights:** ```math \mathbf{A}_0 = \text{softmax}(\mathbf{S}_0) = \begin{bmatrix} 0.35 & 0.15 & 0.22 & 0.20 & 0.28 \\ 0.15 & 0.38 & 0.20 & 0.18 & 0.27 \\ 0.23 & 0.18 & 0.32 & 0.30 & 0.26 \\ 0.21 & 0.19 & 0.28 & 0.33 & 0.25 \\ 0.27 & 0.22 & 0.26 & 0.25 & 0.36 \end{bmatrix} ``` **Output (head 0):** ```math \mathbf{O}_0 = \mathbf{A}_0 \mathbf{V}_0 \in \mathbb{R}^{5 \times 64} ``` **Concatenate All Heads:** ```math \mathbf{O} = \text{Concat}[\mathbf{O}_0, ..., \mathbf{O}_7] \in \mathbb{R}^{5 \times 512} ``` ### Stage 5: Feed-Forward ```math \mathbf{H} = \mathbf{O} \mathbf{W}_1 \in \mathbb{R}^{5 \times 2048} \mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{5 \times 2048} \mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{5 \times 512} ``` ### Stage 6: Output Logits After processing through all L layers: ```math \mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{5 \times 128} ``` **Example (position 4, predicting next token):** ```math L_{4,:} = [2.1, 1.5, ..., 5.2, ..., 3.1, ...] Where: - L_{4,111} = 5.2 (high score for 'o') - L_{4,32} = 4.8 (high score for space) - L_{4,87} = 4.5 (high score for 'W') ``` **Probability Distribution:** ```math \mathbf{p}_4 = \text{softmax}(L_{4,:}) = [0.01, 0.008, ..., 0.25, ..., 0.18, ...] p_{4,111} \approx 0.25 \quad \text{(25\% for o)} \\ p_{4,32} \approx 0.22 \quad \text{(22\% for space)} \\ p_{4,87} \approx 0.18 \quad \text{(18\% for W)} ``` --- ## 16. Vector Space Visualization ### 16.1 Embedding Space **2D Projection Example:** After embedding "Hello", tokens occupy positions in 512-dimensional space. 
Projected to 2D: ``` Token Positions (idealized 2D projection): 'l' (0.05, 0.15) ● 'e' (-0.10, 0.30) ● Origin (0, 0) ● 'H' (0.10, -0.20) ● 'o' (-0.05, 0.20) ● ``` **Distance in Embedding Space:** ```math d(\mathbf{E}[72], \mathbf{E}[101]) = ||\mathbf{E}[72] - \mathbf{E}[101]||_2 d = \sqrt{(0.1 - (-0.1))^2 + (-0.2 - 0.3)^2 + ...} \approx \sqrt{0.04 + 0.25 + ...} \approx 2.1 ``` ### 16.2 Attention Weight Visualization **Attention Matrix Visualization:** ``` Position 0 1 2 3 4 ┌─────┴─────┴─────┴─────┴──┐ Token 0 │ 0.35 0.15 0.22 0.20 0.28 │ 'H' │ │ Token 1 │ 0.15 0.38 0.20 0.18 0.27 │ 'e' │ │ Token 2 │ 0.23 0.18 0.32 0.30 0.26 │ 'l' │ │ Token 3 │ 0.21 0.19 0.28 0.33 0.25 │ 'l' │ │ Token 4 │ 0.27 0.22 0.26 0.25 0.36 │ 'o' └──────────────────────────┘ ``` **Interpretation:** - Token 0 ('H') attends most to itself (0.35) and token 4 (0.28) - Token 4 ('o') attends moderately to all positions - Higher values indicate stronger attention ### 16.3 Probability Distribution Visualization **Output Distribution for Position 5 (next token after "Hello"):** ``` Probability Distribution p[5, :] Probability │ 0.3 │ ● │ 0.2 │ ● ● │ 0.1 │ ● ● ● ● │ 0.0 ├─┴───┴───┴───┴───┴───┴───┴───┴─── Token IDs 32 72 87 101 108 111 ... 127 ␣ H W e l o ``` **Meaning:** - Highest probability for space (32) ≈ 0.28 - Next: 'o' (111) ≈ 0.23 - Then: 'W' (87) ≈ 0.18 - Model predicts space or continuation --- ## 17. Advanced Block Diagram Simplification ### 17.1 Complex Multi-Layer System Simplification Following control system reduction techniques, we can simplify the transformer model step-by-step: **Diagram (a): Original Complex System** ``` Input R (Tokens) ↓ ┌─────────────┐ │ Embedding │ │ G_emb │ └──────┬──────┘ ↓ ┌─────────────┐ │ Positional │ │ Encoding │ │ G_pos │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── Feedback from Layer 2 └──────┬──────┘ ↓ ┌─────────────┐ │ Layer 1 │ │ G_block₁ │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── Feedback from Output └──────┬──────┘ ↓ ┌─────────────┐ │ Layer 2 │ │ G_block₂ │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── Feedback H₁ └──────┬──────┘ ↓ ┌─────────────┐ │ Output Proj │ │ G_out │ └──────┬──────┘ ↓ Output C (Logits) ``` **Diagram (b): First Simplification (Combine Embedding and Positional)** ``` Input R ↓ ┌─────────────────────┐ │ G_emb_pos = │ │ G_pos ∘ G_emb │ └──────┬──────────────┘ ↓ ┌─────────────┐ │ + │ └──────┬──────┘ ↓ ┌─────────────┐ │ Layer 1 │ │ G_block₁ │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ └──────┬──────┘ ↓ ┌─────────────┐ │ Layer 2 │ │ G_block₂ │ └──────┬──────┘ ↓ ┌─────────────┐ │ + │ ←─── H₁ └──────┬──────┘ ↓ ┌─────────────┐ │ G_out │ └──────┬──────┘ ↓ Output C ``` **Diagram (c): Second Simplification (Combine Layers)** ``` Input R ↓ ┌─────────────────────┐ │ G_emb_pos │ └──────┬──────────────┘ ↓ ┌──────────────────────────────────┐ │ G_layers = G_block₂ ∘ G_block₁ │ │ Equivalent to: │ │ X + Δ₁(X) + Δ₂(X + Δ₁(X)) │ └──────┬───────────────────────────┘ ↓ ┌─────────────┐ │ + │ ←─── H₁ └──────┬──────┘ ↓ ┌─────────────┐ │ G_out │ └──────┬──────┘ ↓ Output C ``` **Diagram (d): Third Simplification (Combine with Output)** ``` Input R ↓ ┌──────────────────────────────┐ │ G_forward = │ │ G_out ∘ G_layers ∘ G_emb_pos │ └──────┬───────────────────────┘ ↓ ┌─────────────┐ │ + │ ←─── H₁ (Feedback) └──────┬──────┘ ↓ Output C ``` **Diagram (e): Final Simplified Transfer Function** ``` Input R ↓ ┌────────────────────────────────────────────┐ │ Overall Transfer Function: │ │ │ │ C/R = G_forward / (1 + G_forward × H₁) │ │ │ │ Where: │ │ G_forward = G_out ∘ G_layers ∘ G_emb_pos 
│ │ │ └──────┬─────────────────────────────────────┘ ↓ Output C ``` **Mathematical Derivation:** **Step 1:** Combine embedding and positional encoding: ```math G_{emb\_pos}(\mathbf{T}) = G_{pos}(G_{emb}(\mathbf{T})) = \mathbf{E}[\mathbf{T}] + \mathbf{PE} ``` **Step 2:** Combine transformer layers: ```math G_{layers}(\mathbf{X}) = G_{block_2}(G_{block_1}(\mathbf{X})) G_{layers}(\mathbf{X}) = \mathbf{X} + \Delta_1(\mathbf{X}) + \Delta_2(\mathbf{X} + \Delta_1(\mathbf{X})) where \Delta_l represents the transformation inside block l . ``` **Step 3:** Combine with output projection: ```math G_{forward}(\mathbf{T}) = G_{out}(G_{layers}(G_{emb\_pos}(\mathbf{T}))) ``` **Step 4:** Apply feedback reduction: ```math \frac{C}{R} = \frac{G_{forward}}{1 + G_{forward} \times H_1} ``` ### 17.2 Attention Block Simplification **Diagram (a): Detailed Attention** ``` Input X ↓ ┌─────────────┐ │ Q │ ←─── W_Q │ K │ ←─── W_K │ V │ ←─── W_V └──────┬──────┘ ↓ ┌─────────────┐ │ Scores │ │ S = QK^T/√d │ └──────┬──────┘ ↓ ┌─────────────┐ │ Softmax │ │ A = σ(S) │ └──────┬──────┘ ↓ ┌─────────────┐ │ Output │ │ O = AV │ └──────┬──────┘ ↓ ┌─────────────┐ │ Out Proj │ │ W_O │ └──────┬──────┘ ↓ Output X' ``` **Diagram (b): Simplified Attention Transfer Function** ``` Input X ↓ ┌──────────────────────────────┐ │ G_attn(X) = │ │ W_O · softmax(QK^T/√d) · V │ │ │ │ Where: │ │ Q = XW_Q, K = XW_K, V = XW_V │ └──────┬───────────────────────┘ ↓ Output X' ``` **Mathematical Transfer Function:** ```math G_{attn}(\mathbf{X}) = \mathbf{X} \mathbf{W}_O \cdot \text{softmax}\left(\frac{(\mathbf{X} \mathbf{W}_Q)(\mathbf{X} \mathbf{W}_K)^T}{\sqrt{d_k}}\right) \cdot (\mathbf{X} \mathbf{W}_V) ``` --- ## 18. Vector Trace: "Hello World" Complete Flow ### 18.1 Complete Vector Trace with Numerical Values **Input:** `"Hello World"` **Stage 1: Tokenization** ```math \mathbf{t} = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100] ``` **Stage 2: Embedding (showing first 4 dimensions)** ```math \mathbf{X} = \begin{bmatrix} [H] & 0.10 & -0.20 & 0.30 & 0.15 & ... \\ [e] & -0.10 & 0.30 & -0.10 & 0.08 & ... \\ [l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\ [l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\ [o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\ [ ] & 0.02 & 0.05 & 0.02 & 0.01 & ... \\ [W] & 0.15 & -0.15 & 0.25 & 0.12 & ... \\ [o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\ [r] & 0.08 & 0.10 & -0.08 & 0.04 & ... \\ [l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\ [d] & 0.12 & -0.08 & 0.18 & 0.09 & ... \end{bmatrix} \in \mathbb{R}^{11 \times 512} ``` **Stage 3: Positional Encoding (first 4 dimensions)** ```math \mathbf{PE} = \begin{bmatrix} [0] & 0.00 & 1.00 & 0.00 & 0.00 & ... \\ [1] & 0.84 & 0.54 & 0.01 & 0.00 & ... \\ [2] & 0.91 & -0.42 & 0.02 & 0.00 & ... \\ [3] & 0.14 & -0.99 & 0.03 & 0.00 & ... \\ [4] & -0.76 & -0.65 & 0.04 & 0.00 & ... \\ [5] & -0.96 & 0.28 & 0.05 & 0.00 & ... \\ [6] & -0.28 & 0.96 & 0.06 & 0.00 & ... \\ [7] & 0.65 & 0.76 & 0.07 & 0.00 & ... \\ [8] & 0.99 & -0.14 & 0.08 & 0.00 & ... \\ [9] & 0.42 & -0.91 & 0.09 & 0.00 & ... \\ [10] & -0.54 & -0.84 & 0.10 & 0.00 & ... 
\end{bmatrix} ``` **Stage 4: Combined Input** ```math \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} ``` **Example Row 0 (token 'H'):** ```math \mathbf{X}_{pos}[0, :4] = [0.10, -0.20, 0.30, 0.15] + [0.00, 1.00, 0.00, 0.00] = [0.10, 0.80, 0.30, 0.15] ``` **Stage 5: Attention (Head 0, showing attention from token 0 to all tokens)** ```math \mathbf{S}_0[0, :] = [0.50, -0.10, 0.20, 0.15, 0.30, -0.05, 0.18, 0.28, 0.12, 0.20, 0.22] \mathbf{A}_0[0, :] = \text{softmax}(\mathbf{S}_0[0, :]) = [0.35, 0.15, 0.22, 0.20, 0.28, 0.14, 0.19, 0.26, 0.17, 0.21, 0.23] **Meaning:** Token 'H' (position 0) attends: - 35% to itself - 28% to token 'o' (position 4) - 26% to token 'o' (position 7) - 23% to token 'd' (position 10) ``` **Stage 6: Attention Output** ```math \mathbf{O}_0[0, :] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}_0[j, :] ``` **Example (first dimension):** ```math O_{0,0,0} = 0.35 \times 0.12 + 0.15 \times 0.08 + ... + 0.23 \times 0.15 \approx 0.115 ``` **Stage 7: FFN Output** ```math \mathbf{H}_{ffn}[0, :4] = [0.15, -0.08, 0.22, 0.18] ``` **Stage 8: Final Output (after all layers)** ```math \mathbf{H}_{final}[0, :4] = [0.42, 0.25, 0.58, 0.31] ``` **Stage 9: Logits** ```math \mathbf{L}[0, :] = [2.1, 1.8, ..., 5.2, ..., 3.4, ...] Where L[0, 72] = 5.2 is highest (predicting 'H' at position 1). ``` **Stage 10: Probabilities** ```math \mathbf{p}[0, :] = \text{softmax}(\mathbf{L}[0, :]) = [0.01, 0.008, ..., 0.28, ..., 0.15, ...] p[0, 72] \approx 0.28 \quad \text{(28\% probability for H)} ``` --- ## 19. Vector Plots and Visualizations ### 19.1 Embedding Vector Trajectory **Trajectory Plot:** ``` 512-Dimensional Embedding Space (2D Projection) 0.3 │ 'e' (pos 1) │ ● 0.2 │ 'r' (pos 8) │ ● 0.1 │ 'l' (pos 2,3,9) 'o' (pos 4,7) │ ● ● 0.0 ├─────────────────────────────────────────── │ 'H' (pos 0) -0.1 │ ● │ -0.2 │ │ -0.3 │ 'W' (pos 6) │ ● └─────────────────────────────────────────── -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 ``` ### 19.2 Attention Heatmap **Attention Weight Matrix Visualization:** ``` Attention Weights A[i,j] for "Hello World" j → 0 1 2 3 4 5 6 7 8 9 10 ↓ ['H'] ['e'] ['l'] ['l'] ['o'] [' '] ['W'] ['o'] ['r'] ['l'] ['d'] i=0 ['H'] │ 0.35 0.15 0.22 0.20 0.28 0.14 0.19 0.26 0.17 0.21 0.23 │ i=1 ['e'] │ 0.15 0.38 0.20 0.18 0.27 0.16 0.18 0.25 0.19 0.22 0.20 │ i=2 ['l'] │ 0.23 0.18 0.32 0.30 0.26 0.17 0.21 0.24 0.25 0.31 0.23 │ i=3 ['l'] │ 0.21 0.19 0.28 0.33 0.25 0.18 0.20 0.23 0.24 0.30 0.22 │ i=4 ['o'] │ 0.27 0.22 0.26 0.25 0.36 0.19 0.23 0.29 0.24 0.27 0.25 │ i=5 [' '] │ 0.18 0.20 0.19 0.21 0.24 0.40 0.22 0.25 0.21 0.20 0.22 │ i=6 ['W'] │ 0.22 0.21 0.23 0.24 0.26 0.20 0.45 0.28 0.27 0.23 0.25 │ i=7 ['o'] │ 0.26 0.25 0.24 0.23 0.29 0.21 0.28 0.38 0.26 0.24 0.26 │ i=8 ['r'] │ 0.19 0.21 0.25 0.24 0.24 0.19 0.27 0.26 0.42 0.27 0.28 │ i=9 ['l'] │ 0.21 0.22 0.31 0.30 0.27 0.20 0.23 0.24 0.27 0.35 0.24 │ i=10['d'] │ 0.23 0.20 0.23 0.22 0.25 0.22 0.25 0.26 0.28 0.24 0.48 │ Color Coding: █ = 0.48-0.50 (very high attention) █ = 0.35-0.48 (high attention) █ = 0.25-0.35 (medium attention) █ = 0.15-0.25 (low attention) █ = 0.00-0.15 (very low attention) ``` ### 19.3 Probability Distribution Plot **Logits and Probabilities:** ``` Logits L[5, :] (predicting token after "Hello ") Logit Value │ 6.0 │ ● (token 87 'W') │ 5.0 │ ● (token 111 'o') │ 4.0 │ ● (token 32 ' ') ● (token 114 'r') │ 3.0 │ ● ● ● │ 2.0 │ ● ● ● ● ● ● ● ● ● ● ● │ 1.0 │ ● ● ● ● ● ● ● ● ● ● ● │ 0.0 ├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs 32 72 87 101 108 111 114 ... 
␣ H W e l o r Probabilities p[5, :] Probability │ 0.3│ ● ('W') │ 0.2│ ● (' ') ● ('o') │ 0.1│ ● ● ● ● ● ● ● │ 0.0├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs 32 72 87 101 108 111 114 ... ``` ### 19.4 Hidden State Evolution Through Layers **Layer-by-Layer Transformation:** ``` Hidden State Evolution for Token 'H' (position 0) Dimension 0: Layer 0: 0.10 (embedding + positional) Layer 1: 0.42 (after attention + FFN) Layer 2: 0.58 (after second layer) Layer 3: 0.65 (after third layer) ... ... Layer L: 0.72 (final hidden state) Dimension 1: Layer 0: 0.80 (embedding + positional) Layer 1: 0.25 (after attention + FFN) Layer 2: 0.18 (after second layer) Layer 3: 0.22 (after third layer) ... ... Layer L: 0.15 (final hidden state) ``` **Visualization:** ``` Hidden State Magnitude ||h[l]|| Over Layers Magnitude │ 1.0│ ● │ ● 0.8│ ● │ ● 0.6│ ● │ ● 0.4│ ● │ ● 0.2│ ● │ ● 0.0├───────────────────────── Layer 0 1 2 3 4 5 6 ``` --- ## 20. Summary: Complete Mathematical Trace ### Complete System Equation with Numerical Example **Text:** `"Hello World"` **Complete Mathematical Flow:** 1. **Tokenization:** ```math \mathbf{t} = \mathcal{T}(\text{"Hello World"}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100] ``` 2. **Embedding:** ```math \mathbf{X} = \mathbf{E}[\mathbf{t}] \in \mathbb{R}^{11 \times 512} ``` 3. **Positional Encoding:** ```math \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{11 \times 512} ``` 4. **Transformer Layers (L=6):** ```math \mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., 6 ``` 5. **Output:** ```math \mathbf{L} = \mathbf{h}_6 \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128} ``` 6. **Probabilities:** ```math \mathbf{p} = \text{softmax}(\mathbf{L}) \in \mathbb{R}^{11 \times 128} ``` **Final Prediction:** For position 5 (after "Hello "): ```math p[5, 87] = 0.28 \quad \text{(28\% for W)} \\ p[5, 32] = 0.22 \quad \text{(22\% for space)} \\ p[5, 111] = 0.18 \quad \text{(18\% for o)} ``` **Most Likely:** `'W'` → Complete prediction: `"Hello World"` --- _This document provides a complete mathematical control system formulation with block diagrams, vector visualizations, numerical examples, and step-by-step calculations for every component of the SheepOp LLM._