# SheepOp LLM - Mathematical Control System Model
Complete mathematical control system formulation of the SheepOp Language Model, treating the entire system as a unified mathematical control system with state-space representations, transfer functions, and step-by-step explanations.
## Table of Contents
1. [System Overview](#1-system-overview)
2. [State-Space Representation](#2-state-space-representation)
3. [Tokenizer as Input Encoder](#3-tokenizer-as-input-encoder)
4. [Seed Control System](#4-seed-control-system)
5. [Embedding Layer Control](#5-embedding-layer-control)
6. [Positional Encoding State](#6-positional-encoding-state)
7. [Self-Attention Control System](#7-self-attention-control-system)
8. [Feed-Forward Control](#8-feed-forward-control)
9. [Layer Normalization Feedback](#9-layer-normalization-feedback)
10. [Complete System Dynamics](#10-complete-system-dynamics)
11. [Training as Optimization Control](#11-training-as-optimization-control)
12. [Inference Control Loop](#12-inference-control-loop)
---
## 1. System Overview
### 1.1 Control System Architecture
The SheepOp LLM can be modeled as a **nonlinear dynamical control system** with:
- **Input**: Character sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$
- **State**: Hidden representations $\mathbf{h}_t$ at each layer and time step
- **Control**: Model parameters $\theta = \{W_Q, W_K, W_V, W_1, W_2, ...\}$
- **Output**: Probability distribution over vocabulary $\mathbf{p}_t \in \mathbb{R}^V$
**System Block Diagram:**
```
Input Sequence → Tokenizer → Embeddings → Positional Encoding →
[Transformer Layer 1] → [Transformer Layer 2] → ... → [Transformer Layer L]
Output Projection → Logits → Softmax → Output Probabilities
```
### 1.2 Mathematical System Formulation
The complete system can be expressed as:
```math
\mathbf{y}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_t, \theta, \mathbf{s})
```
where:
- $\mathbf{x}_t$ = input at time $t$
- $\mathbf{h}_t$ = hidden state at time $t$
- $\theta$ = system parameters (weights)
- $\mathbf{s}$ = seed for randomness
- $\mathcal{F}$ = complete forward function
---
## 2. State-Space Representation
### 2.1 Discrete-Time State-Space Model
For a transformer with $L$ layers and sequence length $n$:
**State Vector:**
```math
\mathbf{H}_t = \begin{bmatrix}
\mathbf{h}_t^{(1)} \\
\mathbf{h}_t^{(2)} \\
\vdots \\
\mathbf{h}_t^{(L)}
\end{bmatrix} \in \mathbb{R}^{L \times n \times d}
```
where $\mathbf{h}_t^{(l)} \in \mathbb{R}^{n \times d}$ is the hidden state at layer $l$.
**State Update Equation:**
```math
\mathbf{h}_t^{(l+1)} = f_l(\mathbf{h}_t^{(l)}, \theta_l), \quad l = 0, 1, ..., L-1
```
where $f_l$ is the transformation at layer $l$.
**Output Equation:**
```math
\mathbf{y}_t = g(\mathbf{h}_t^{(L)}, \theta_{out})
```
### 2.2 System Linearity Analysis
The system is **nonlinear** due to:
- Attention mechanism (softmax)
- Activation functions (GELU)
- Layer normalization
However, individual components can be analyzed as **piecewise linear** systems.
---
## 3. Tokenizer as Input Encoder
### 3.1 Tokenizer Control Function
The tokenizer maps a character sequence to a discrete token sequence:
```math
\mathcal{T}: \mathcal{C}^* \rightarrow \mathbb{N}^*
```
**Mathematical Formulation:**
For input sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$:
```math
\mathbf{t} = \mathcal{T}(\mathbf{c}) = [V(c_1), V(c_2), ..., V(c_n)]
```
where $V: \mathcal{C} \rightarrow \mathbb{N}$ is the vocabulary mapping function.
### 3.2 Vocabulary Mapping Function
```math
V(c) = \begin{cases}
0 & \text{if } c = \text{<pad>} \\
1 & \text{if } c = \text{<unk>} \\
2 & \text{if } c = \text{<bos>} \\
3 & \text{if } c = \text{<eos>} \\
v & \text{if } c \in \mathcal{C}_{vocab}
\end{cases}
```
**Control Properties:**
- **Deterministic**: The same input always produces the same output
- **Invertible**: For in-vocabulary tokens, $V^{-1}$ exists, so decoding recovers the original text
- **Injective**: Each in-vocabulary character maps to a unique token ID
### 3.3 Tokenizer State Space
The tokenizer maintains internal state:
```math
\Sigma_{\mathcal{T}} = \{V, V^{-1}, \text{padding\_strategy}, \text{max\_length}\}
```
**State Transition:**
```math
\Sigma_{\mathcal{T}}' = \Sigma_{\mathcal{T}} \quad \text{(static during operation)}
```
### 3.4 Step-by-Step Explanation
**Step 1: Character Extraction**
- Input: Raw text string "Hello"
- Process: Extract each character $c \in \{'H', 'e', 'l', 'l', 'o'\}$
- Meaning: Break down text into atomic units
**Step 2: Vocabulary Lookup**
- Process: Apply $V(c)$ to each character
- Example: $V('H') = 72, V('e') = 101, V('l') = 108, V('o') = 111$
- Meaning: Convert characters to numerical indices
**Step 3: Sequence Formation**
- Output: $\mathbf{t} = [72, 101, 108, 108, 111]$
- Meaning: Numerical representation ready for embedding
**Control Impact**: Tokenizer creates the **foundation** for all subsequent processing. Any error here propagates through the entire system.
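As a concrete illustration, a minimal character-level tokenizer along these lines can be sketched as below. The class name and vocabulary construction here are illustrative assumptions, not the repository's actual tokenizer (which assigns the specific IDs used in the examples).
```python
# Minimal character-level tokenizer sketch (illustrative; not the repository's implementation).
class CharTokenizer:
    SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]

    def __init__(self, corpus: str):
        # Build the vocabulary mapping V: character -> token ID.
        chars = sorted(set(corpus))
        self.vocab = {tok: i for i, tok in enumerate(self.SPECIALS)}
        self.vocab.update({c: i + len(self.SPECIALS) for i, c in enumerate(chars)})
        self.inv_vocab = {i: c for c, i in self.vocab.items()}  # V^{-1}

    def encode(self, text: str) -> list[int]:
        unk = self.vocab["<unk>"]
        return [self.vocab.get(c, unk) for c in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.inv_vocab.get(i, "<unk>") for i in ids)


tok = CharTokenizer("Hello World")
print(tok.encode("Hello"))              # deterministic: same input -> same IDs
print(tok.decode(tok.encode("Hello")))  # invertible for in-vocabulary characters
```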
---
## 4. Seed Control System
### 4.1 Seed as System Initialization
The seed $s \in \mathbb{N}$ controls **randomness** throughout the system:
```math
\mathcal{R}(\mathbf{x}, s) = \text{deterministic\_random}(\mathbf{x}, s)
```
### 4.2 Seed Propagation Function
**Initialization:**
```math
\text{seed\_torch}(s): \text{torch.manual\_seed}(s)
\text{seed\_cuda}(s): \text{torch.cuda.manual\_seed\_all}(s)
\text{seed\_cudnn}(s): \text{torch.backends.cudnn.deterministic} = \text{True}
```
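In PyTorch terms, this seeding step can be sketched as a single helper (the function name `seed_everything` and its exact scope are assumptions; the calls themselves are standard PyTorch/NumPy APIs):
```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Put every random number generator into a state determined by `seed`."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```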
**Mathematical Model:**
```math
\mathbb{P}(\mathbf{W} | s) = \begin{cases}
\delta(\mathbf{W} - \mathbf{W}_s) & \text{if deterministic} \\
\text{some distribution} & \text{if stochastic}
\end{cases}
```
where $\delta$ is the Dirac delta and $\mathbf{W}_s$ is the weight initialization given seed $s$.
### 4.3 Seed Control Equation
For weight initialization:
```math
\mathbf{W}_0 = \mathcal{I}(\mathbf{s}, \text{init\_method})
```
where $\mathcal{I}$ is the initialization function.
**Example - Normal Initialization:**
```math
\mathbf{W}_0 \sim \mathcal{N}(0, \sigma^2) \quad \text{with random state } r(s)
W_{ij} = \sigma \cdot \Phi^{-1}(U_{ij}(s))
```
where:
- $\mathcal{N}(0, \sigma^2)$ = normal distribution
- $\Phi^{-1}$ = inverse CDF
- $U_{ij}(s)$ = uniform random number from seed $s$
- $\sigma = 0.02$ (typical value)
### 4.4 Step-by-Step Explanation
**Step 1: Seed Input**
- Input: $s = 42$
- Meaning: Provides reproducibility guarantee
**Step 2: RNG State Initialization**
- Process: Set all random number generators to state based on $s$
- Meaning: Ensures deterministic behavior
**Step 3: Weight Initialization**
- Process: Generate all weights using RNG with seed $s$
- Example: $W_{ij} = \text{normal}(0, 0.02, \text{seed}=42)$
- Meaning: Starting point for optimization
**Step 4: Training Determinism**
- Process: Same seed + same data → same gradients → same updates
- Meaning: Complete reproducibility
**Control Impact**: Seed controls **initial conditions** and **stochastic processes** throughout training. It's the **control parameter** for reproducibility.
---
## 5. Embedding Layer Control
### 5.1 Embedding as Linear Transformation
The embedding layer performs a **lookup operation**:
```math
\mathcal{E}: \mathbb{N} \rightarrow \mathbb{R}^d
```
**Mathematical Formulation:**
```math
\mathbf{E} \in \mathbb{R}^{V \times d} \quad \text{(embedding matrix)}
\mathbf{x}_t = \mathbf{E}[\mathbf{t}_t] = \mathbf{E}_t \in \mathbb{R}^d
```
where $\mathbf{t}_t \in \mathbb{N}$ is the token ID at position $t$.
### 5.2 Embedding Control System
**Batch Processing:**
```math
\mathbf{X} = \mathbf{E}[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
```
where $\mathbf{T} \in \mathbb{N}^{B \times n}$ is the batch of token IDs.
**Control Function:**
```math
\mathbf{X} = \mathcal{E}(\mathbf{T}, \mathbf{E})
```
**Gradient Flow:**
```math
\frac{\partial \mathcal{L}}{\partial \mathbf{E}} = \sum_{b,t} \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{b,t}} \cdot \mathbf{1}[\mathbf{T}_{b,t}]
```
where $\mathbf{1}[\mathbf{T}_{b,t}]$ is a one-hot indicator.
### 5.3 Step-by-Step Explanation
**Step 1: Token ID Input**
- Input: $t = 72$ (token ID for 'H')
- Meaning: Discrete index into vocabulary
**Step 2: Matrix Lookup**
- Process: $\mathbf{x} = \mathbf{E}[72]$
- Example: $\mathbf{x} = [0.1, -0.2, 0.3, ..., 0.05] \in \mathbb{R}^{512}$
- Meaning: Continuous vector representation
**Step 3: Semantic Encoding**
- Property: Similar tokens have similar embeddings (after training)
- Meaning: Embeddings capture semantic relationships
**Control Impact**: Embedding layer **projects** discrete tokens into continuous space, enabling gradient-based optimization.
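A minimal sketch of this lookup in PyTorch follows; the dimensions are chosen to match the running example, and the module names are assumptions rather than the repository's exact layout:
```python
import torch
import torch.nn as nn

vocab_size, d_model = 128, 512
embedding = nn.Embedding(vocab_size, d_model)          # E in R^{V x d}

token_ids = torch.tensor([[72, 101, 108, 108, 111]])   # "Hello", shape (B=1, n=5)
x = embedding(token_ids)                               # lookup: shape (1, 5, 512)
print(x.shape)  # torch.Size([1, 5, 512])
```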
---
## 6. Positional Encoding State
### 6.1 Positional Encoding as Additive Control
```math
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{B \times n \times d}
```
where $\mathbf{PE} \in \mathbb{R}^{n \times d}$ is the positional encoding matrix.
### 6.2 Positional Encoding Function
```math
PE_{(pos, i)} = \begin{cases}
\sin\left(\frac{pos}{10000^{2i/d}}\right) & \text{if } i \text{ is even} \\
\cos\left(\frac{pos}{10000^{2(i-1)/d}}\right) & \text{if } i \text{ is odd}
\end{cases}
```
### 6.3 Control System Interpretation
**Additive Control:**
```math
\mathbf{X}_{out} = \mathbf{X}_{in} + \mathbf{U}_{pos}
```
where $\mathbf{U}_{pos}$ is the **control input** representing position information.
**Meaning**: Positional encoding **injects** positional information into the embeddings.
### 6.4 Step-by-Step Explanation
**Step 1: Position Index**
- Input: Position $pos = 0, 1, 2, ..., n-1$
- Meaning: Absolute position in sequence
**Step 2: Encoding Generation**
- Process: Compute $PE_{(pos, i)}$ for each dimension $i$
- Example: $PE_{(0, 0)} = 0$, $PE_{(0, 1)} = 1$, $PE_{(1, 0)} \approx 0.84$
- Meaning: Unique pattern for each position
**Step 3: Addition Operation**
- Process: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$
- Meaning: Position information added to embeddings
**Step 4: Multi-Scale Representation**
- Property: Different dimensions encode different frequency scales
- Meaning: Model can learn both local and global positional patterns
**Control Impact**: Positional encoding provides **temporal/spatial awareness** to the model, enabling it to understand sequence order.
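A short sketch that builds the sinusoidal matrix $\mathbf{PE}$ exactly as defined above (illustrative helper, assuming an even model dimension):
```python
import torch


def sinusoidal_positional_encoding(n: int, d: int) -> torch.Tensor:
    """Build the (n, d) sinusoidal positional encoding matrix PE."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1) position indices
    i = torch.arange(0, d, 2, dtype=torch.float32)            # even dimension indices
    angles = pos / torch.pow(10000.0, i / d)                  # (n, d/2) frequency scales
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe


pe = sinusoidal_positional_encoding(n=11, d=512)
# x_pos = x + pe  # additive control input, broadcast over the batch dimension
print(pe[0, :2], pe[1, 0])  # PE(0,0)=0, PE(0,1)=1, PE(1,0)≈0.84
```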
---
## 7. Self-Attention Control System
### 7.1 Attention as Information Routing
Self-attention can be modeled as a **dynamical control system** that routes information:
```math
\mathbf{O} = \text{Attention}(\mathbf{X}, \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V)
```
### 7.2 State-Space Model for Attention
**Query, Key, Value Generation:**
```math
\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{B \times n \times d}
\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{B \times n \times d}
\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{B \times n \times d}
```
**Attention Scores (Transfer Function):**
```math
\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times h \times n \times n}
```
**Attention Weights (Control Signal):**
```math
\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{B \times h \times n \times n}
```
**Output (Controlled Response):**
```math
\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{B \times h \times n \times d_k}
```
### 7.3 Control System Interpretation
**Attention as Feedback Control:**
```math
\mathbf{O}_i = \sum_{j=1}^{n} A_{ij} \mathbf{V}_j
```
where $A_{ij}$ is the **control gain** determining how much information flows from position $j$ to position $i$.
**Meaning**: Attention acts as a **learnable routing mechanism** controlled by similarities between queries and keys.
### 7.4 Multi-Head Attention Control
**Head Splitting:**
```math
\mathbf{Q}_h = \mathbf{Q}[:, :, h \cdot d_k : (h+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
```
**Parallel Processing:**
```math
\mathbf{O}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h), \quad h = 1, ..., H
```
**Concatenation:**
```math
\mathbf{O} = \text{Concat}[\mathbf{O}_1, \mathbf{O}_2, ..., \mathbf{O}_H] \in \mathbb{R}^{B \times n \times d}
```
### 7.5 Causal Masking Control
**Causal Mask:**
```math
M_{ij} = \begin{cases}
0 & \text{if } i \geq j \text{ (allowed)} \\
-\infty & \text{if } i < j \text{ (masked)}
\end{cases}
```
**Masked Attention:**
```math
\mathbf{S}_{masked} = \mathbf{S} + M
```
**Effect**: Prevents information flow from future positions.
### 7.6 Step-by-Step Explanation
**Step 1: Query, Key, Value Generation**
- Process: Linear transformations of input
- Meaning: Create three representations: what to look for (Q), what to match (K), what to retrieve (V)
**Step 2: Similarity Computation**
- Process: $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$
- Meaning: Measure similarity/relevance between positions $i$ and $j$
**Step 3: Softmax Normalization**
- Process: $A_{ij} = \exp(S_{ij}) / \sum_k \exp(S_{ik})$
- Meaning: Convert similarities to probability distribution (attention weights)
**Step 4: Weighted Aggregation**
- Process: $O_i = \sum_j A_{ij} V_j$
- Meaning: Combine values weighted by attention probabilities
**Step 5: Information Flow**
- Property: Each position receives information from all other positions (with causal masking)
- Meaning: Enables long-range dependencies and context understanding
**Control Impact**: Self-attention is the **core control mechanism** that determines **what information flows where** in the sequence.
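A compact sketch of causal scaled dot-product attention for a single head, following the equations above; this is an illustrative reimplementation for clarity, not the repository's attention module:
```python
import math

import torch
import torch.nn.functional as F


def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head causal attention. q, k, v: (B, n, d_k) -> output (B, n, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # S = QK^T / sqrt(d_k), shape (B, n, n)
    n = scores.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))          # block information flow from future positions
    weights = F.softmax(scores, dim=-1)                       # A: control gains, each row sums to 1
    return weights @ v                                        # O = AV


B, n, d_k = 1, 5, 64
q = torch.randn(B, n, d_k)
k = torch.randn(B, n, d_k)
v = torch.randn(B, n, d_k)
print(causal_attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```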
---
## 8. Feed-Forward Control
### 8.1 Feed-Forward as Nonlinear Transformation
```math
\text{FFN}(\mathbf{X}) = \text{GELU}(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2
```
### 8.2 Control System Model
**Two-Stage Transformation:**
```math
\mathbf{H} = \mathbf{X} \mathbf{W}_1 \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{O} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{B \times n \times d}
```
### 8.3 GELU Activation Control
```math
\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
```
**Control Interpretation**: GELU applies **smooth gating** - values near zero are suppressed, positive values pass through.
### 8.4 Step-by-Step Explanation
**Step 1: Expansion**
- Process: $\mathbf{H} = \mathbf{X} \mathbf{W}_1$ expands the dimension to $d_{ff} > d$
- Example: $d = 512 \rightarrow d_{ff} = 2048$
- Meaning: Increases capacity for complex transformations
**Step 2: Nonlinear Activation**
- Process: $\mathbf{H}' = \text{GELU}(\mathbf{H})$
- Meaning: Introduces nonlinearity, enabling complex function approximation
**Step 3: Compression**
- Process: $\mathbf{O} = \mathbf{H}' \mathbf{W}_2$ compresses back to $d$
- Meaning: Projects back to original dimension
**Control Impact**: FFN provides **nonlinear processing power** and **feature transformation** at each position.
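The expand-activate-compress structure can be sketched as a small module (an illustrative sketch with the dimensions used in this document; names are assumptions):
```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise feed-forward block: expand -> GELU -> compress (illustrative sketch)."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expansion to d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # compression back to d_model
        self.act = nn.GELU()                 # smooth gating nonlinearity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w1(x)))


ffn = FeedForward()
print(ffn(torch.randn(1, 11, 512)).shape)  # torch.Size([1, 11, 512])
```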
---
## 9. Layer Normalization Feedback
### 9.1 Normalization as Feedback Control
```math
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
```
where:
- $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$ (mean)
- $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$ (variance)
- $\gamma, \beta$ = learnable parameters (scale and shift)
### 9.2 Control System Interpretation
**Normalization as State Regulation:**
```math
\mathbf{x}_{norm} = \gamma \odot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \beta
```
**Meaning**: Normalization **regulates** the distribution of activations, preventing saturation and improving gradient flow.
### 9.3 Pre-Norm Architecture
**Transformer Block with Pre-Norm:**
```math
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
```
**Control Impact**: Pre-norm architecture provides **stability** and **better gradient flow**.
### 9.4 Step-by-Step Explanation
**Step 1: Mean Computation**
- Process: $\mu = \frac{1}{d} \sum x_i$
- Meaning: Find center of distribution
**Step 2: Variance Computation**
- Process: $\sigma^2 = \frac{1}{d} \sum (x_i - \mu)^2$
- Meaning: Measure spread of distribution
**Step 3: Normalization**
- Process: $\hat{x}_i = (x_i - \mu) / \sqrt{\sigma^2 + \epsilon}$
- Meaning: Standardize to zero mean, unit variance
**Step 4: Scale and Shift**
- Process: $x_{out} = \gamma \odot \hat{x} + \beta$
- Meaning: Allow model to learn optimal scale and shift
**Control Impact**: Layer normalization provides **stability** and **faster convergence** by maintaining consistent activation distributions.
---
## 10. Complete System Dynamics
### 10.1 Complete Forward Pass
**System State Evolution:**
```math
\mathbf{h}_0 = \mathcal{E}(\mathbf{T}) + \mathbf{PE} \quad \text{(embedding + positional)}
\mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., L
\mathbf{y} = \mathbf{h}_L \mathbf{W}_{out} \in \mathbb{R}^{B \times n \times V}
```
### 10.2 Recursive System Equation
```math
\mathbf{h}_t^{(l)} = f_l(\mathbf{h}_t^{(l-1)}, \theta_l)
```
where:
```math
f_l(\mathbf{x}, \theta_l) = \mathbf{x} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{x}))) + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{x} + \text{Attention}(\text{LayerNorm}(\mathbf{x})))))
```
### 10.3 System Transfer Function
The complete system can be viewed as:
```math
\mathbf{Y} = \mathcal{F}(\mathbf{T}, \theta, \mathbf{s})
```
where:
- $\mathbf{T}$ = input tokens
- $\theta$ = all parameters
- $\mathbf{s}$ = seed
**Properties:**
- **Nonlinear**: Due to softmax, GELU, normalization
- **Differentiable**: All operations have gradients
- **Compositional**: Built from simpler functions
### 10.4 Step-by-Step System Flow
**Step 1: Input Encoding**
- Input: Token sequence $\mathbf{T}$
- Process: Embedding + Positional Encoding
- Output: $\mathbf{h}_0 \in \mathbb{R}^{B \times n \times d}$
- Meaning: Convert discrete tokens to continuous vectors with position info
**Step 2: Layer Processing**
- For each layer $l = 1, ..., L$:
- Process: Self-attention + FFN with residual connections
- Output: $\mathbf{h}_l \in \mathbb{R}^{B \times n \times d}$
- Meaning: Transform representations through attention and processing
**Step 3: Output Generation**
- Process: Final layer norm + output projection
- Output: $\mathbf{L} \in \mathbb{R}^{B \times n \times V}$ (logits)
- Meaning: Predict probability distribution over vocabulary
**Step 4: Probability Computation**
- Process: Softmax over logits
- Output: $\mathbf{p} \in \mathbb{R}^{B \times n \times V}$ (probabilities)
- Meaning: Normalized probability distribution for next token prediction
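Putting the pieces together, a pre-norm transformer block and the full forward pass can be sketched as follows. This is an illustrative composition that uses PyTorch's built-in `nn.MultiheadAttention` rather than the from-scratch attention; class and dimension choices are assumptions matching the running example:
```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Pre-norm transformer block: x + Attn(LN(x)), then + FFN(LN(.)) (illustrative sketch)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)    # causal self-attention
        x = x + self.drop(attn_out)                           # first residual connection
        x = x + self.drop(self.ffn(self.ln2(x)))              # second residual connection
        return x


x = torch.randn(1, 11, 512)   # h_0 = embeddings + positional encoding
for _ in range(6):            # L = 6 stacked blocks (randomly initialized here, for shape only)
    x = PreNormBlock()(x)
logits = nn.Linear(512, 128)(nn.LayerNorm(512)(x))   # final norm + output projection
print(logits.shape)  # torch.Size([1, 11, 128])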
---
## 11. Training as Optimization Control
### 11.1 Training as Optimal Control Problem
**Objective Function:**
```math
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\mathbf{y}_i, \hat{\mathbf{y}}_i(\theta))
```
where:
- $\mathcal{L}$ = loss function (cross-entropy)
- $\mathbf{y}_i$ = true labels
- $\hat{\mathbf{y}}_i(\theta)$ = model predictions
**Optimization Problem:**
```math
\theta^* = \arg\min_{\theta} J(\theta)
```
### 11.2 Gradient-Based Control
**Gradient Computation:**
```math
\mathbf{g}_t = \nabla_\theta J(\theta_t) = \frac{\partial J}{\partial \theta_t}
```
**Parameter Update (AdamW):**
```math
\theta_{t+1} = \theta_t - \eta_t \left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \theta_t\right)
```
where:
- $\hat{\mathbf{m}}_t$ = bias-corrected first moment (momentum)
- $\hat{\mathbf{v}}_t$ = bias-corrected second moment (variance)
- $\eta_t$ = learning rate (controlled by scheduler)
- $\lambda$ = weight decay coefficient
### 11.3 Learning Rate Control
**Cosine Annealing Schedule:**
```math
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos(\pi \cdot \frac{t}{T_{max}})}{2}
```
**Control Interpretation**: Learning rate acts as **gain scheduling** - high gain initially for fast convergence, low gain later for fine-tuning.
### 11.4 Gradient Clipping Control
**Clipping Function:**
```math
\mathbf{g}_{clipped} = \begin{cases}
\mathbf{g} & \text{if } ||\mathbf{g}|| \leq \tau \\
\mathbf{g} \cdot \frac{\tau}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \tau
\end{cases}
```
where $\tau$ is the clipping threshold (distinct from the model parameters $\theta$).
**Purpose**: Prevents **explosive gradients** that could destabilize training.
### 11.5 Step-by-Step Training Control
**Step 1: Forward Pass**
- Process: $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}, \theta_t)$
- Meaning: Compute predictions with current parameters
**Step 2: Loss Computation**
- Process: $\mathcal{L} = \text{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})$
- Meaning: Measure prediction error
**Step 3: Backward Pass**
- Process: $\mathbf{g} = \nabla_\theta \mathcal{L}$
- Meaning: Compute gradients for all parameters
**Step 4: Gradient Clipping**
- Process: $\mathbf{g}_{clipped} = \text{Clip}(\mathbf{g}, \tau)$
- Meaning: Prevent gradient explosion
**Step 5: Optimizer Update**
- Process: $\theta_{t+1} = \text{AdamW}(\theta_t, \mathbf{g}_{clipped}, \eta_t)$
- Meaning: Update parameters using adaptive learning rate
**Step 6: Learning Rate Update**
- Process: $\eta_{t+1} = \text{Scheduler}(\eta_t, t)$
- Meaning: Adjust learning rate according to schedule
**Control Impact**: Training process is a **closed-loop control system** where:
- **Error signal**: Loss
- **Controller**: Optimizer (AdamW)
- **Actuator**: Parameter updates
- **Plant**: Model forward pass
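The closed loop above can be sketched as one standard PyTorch training step. The `model` and `loader` objects below are hypothetical stand-ins so the sketch runs on its own, not the repository's actual training pipeline; the optimizer, scheduler, and clipping calls are the standard PyTorch APIs named in this section.
```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; the real pipeline supplies its own model and data loader.
model = nn.Sequential(nn.Embedding(128, 512), nn.Linear(512, 128))
loader = [(torch.randint(0, 128, (2, 16)), torch.randint(0, 128, (2, 16)))]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)                   # controller
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000, eta_min=3e-5)   # gain schedule
criterion = nn.CrossEntropyLoss()

for inputs, targets in loader:
    logits = model(inputs)                                                 # plant: forward pass
    loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))   # error signal
    optimizer.zero_grad()
    loss.backward()                                                        # gradient computation
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)       # clipping, threshold tau = 1.0
    optimizer.step()                                                       # actuator: parameter update
    scheduler.step()                                                       # learning-rate gain scheduling
```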
---
## 12. Inference Control Loop
### 12.1 Autoregressive Generation as Control Loop
**State-Space Model:**
```math
\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}, \theta)
\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})
\mathbf{x}_{t+1} \sim \text{Categorical}(\mathbf{p}_t)
```
### 12.2 Generation Control Function
**Step-by-Step:**
1. **Current State**: $\mathbf{h}_t$
2. **Output Generation**: $\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})$
3. **Sampling**: $x_{t+1} \sim \mathbf{p}_t$ (with temperature, top-k, top-p)
4. **State Update**: $\mathbf{h}_{t+1} = \mathcal{F}([\mathbf{h}_t, x_{t+1}], \theta)$
5. **Repeat**: Until max length or stop token
### 12.3 Sampling Control Parameters
**Temperature Control:**
```math
\mathbf{p}_t^{temp} = \text{softmax}\left(\frac{\mathbf{h}_t \mathbf{W}_{out}}{T}\right)
```
- $T < 1$: More deterministic (sharp distribution)
- $T > 1$: More random (flat distribution)
- $T = 1$: Default
**Top-k Filtering:**
```math
\mathbf{p}_t^{topk}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in \text{top-k}(\mathbf{p}_t) \\
0 & \text{otherwise}
\end{cases}
```
**Top-p (Nucleus) Sampling:**
```math
\mathbf{p}_t^{topp}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in S_p \\
0 & \text{otherwise}
\end{cases}
```
where $S_p$ is the smallest set such that $\sum_{v \in S_p} \mathbf{p}_t[v] \geq p$.
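These three controls can be combined in a single sampling helper; the sketch below is illustrative (the function name and defaults are assumptions), but it follows the temperature, top-k, and nucleus definitions above:
```python
import torch
import torch.nn.functional as F


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Sample a token ID from a 1-D logit vector with temperature / top-k / top-p filtering."""
    logits = logits / temperature                           # temperature control
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")                # keep only the k highest-scoring tokens
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cut = cum_probs > top_p
        cut[1:] = cut[:-1].clone()                          # keep the token that crosses the threshold
        cut[0] = False                                      # always keep the single best token
        logits[sorted_idx[cut]] = float("-inf")             # drop tokens outside the nucleus S_p
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))


next_id = sample_next_token(torch.randn(128), temperature=0.8, top_k=40, top_p=0.9)
```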
### 12.4 Step-by-Step Inference Control
**Step 1: Initialization**
- Input: Prompt tokens $\mathbf{P} = [p_1, ..., p_k]$
- Process: Initialize state $\mathbf{h}_0 = \mathcal{E}(\mathbf{P}) + \mathbf{PE}$
- Meaning: Set initial state from prompt
**Step 2: Forward Pass**
- Process: $\mathbf{h}_t = \text{Transformer}(\mathbf{h}_{t-1})$
- Output: Hidden state $\mathbf{h}_t$
- Meaning: Process current sequence
**Step 3: Logit Generation**
- Process: $\mathbf{l}_t = \mathbf{h}_t \mathbf{W}_{out}$
- Output: Logits $\mathbf{l}_t \in \mathbb{R}^V$
- Meaning: Unnormalized scores for each token
**Step 4: Probability Computation**
- Process: $\mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)$
- Output: Probability distribution $\mathbf{p}_t$
- Meaning: Normalized probabilities with temperature
**Step 5: Sampling**
- Process: $x_{t+1} \sim \mathbf{p}_t$ (with optional top-k/top-p)
- Output: Next token $x_{t+1}$
- Meaning: Stochastically select next token
**Step 6: State Update**
- Process: Append $x_{t+1}$ to sequence, update $\mathbf{h}_{t+1}$
- Meaning: Incorporate new token into state
**Step 7: Termination Check**
- Condition: $t < \text{max\_length}$ and $x_{t+1} \neq \text{<eos>}$
- If true: Go to Step 2
- If false: Return generated sequence
**Control Impact**: Inference is a **recurrent control system** where:
- **State**: Current hidden representation
- **Control**: Sampling strategy (temperature, top-k, top-p)
- **Output**: Generated token sequence
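As a sketch, Steps 1-7 collapse into a short autoregressive loop. It assumes a `model` that returns logits of shape `(1, n, V)` and reuses the `sample_next_token` helper sketched above; both are illustrative assumptions rather than the repository's generation code:
```python
import torch


@torch.no_grad()
def generate(model, prompt_ids: list[int], max_length: int = 100,
             eos_id: int = 3, temperature: float = 1.0) -> list[int]:
    """Autoregressive generation: forward pass -> sample -> append -> repeat."""
    ids = torch.tensor([prompt_ids])                              # (1, k) prompt tokens
    while ids.size(1) < max_length:                               # termination: max length
        logits = model(ids)                                       # (1, n, V) forward pass
        next_id = sample_next_token(logits[0, -1], temperature=temperature)
        if next_id == eos_id:                                     # termination: <eos> token
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)  # state update
    return ids[0].tolist()

# Usage (with some language model `lm`): generate(lm, prompt_ids=[72, 101, 108, 108, 111])
```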
---
## Summary: Unified Control System Model
### Complete System Equation
```math
\mathbf{Y} = \mathcal{G}(\mathbf{C}, \theta, \mathbf{s}, \mathbf{T}, \{k, p\})
```
where:
- $\mathbf{C}$ = input characters
- $\theta$ = model parameters
- $\mathbf{s}$ = seed
- $\mathbf{T}$ = temperature
- $\{k, p\}$ = top-k and top-p parameters
### System Components as Control Elements
1. **Tokenizer**: Input encoder $\mathcal{T}$
2. **Seed**: Initialization control $\mathbf{s}$
3. **Embeddings**: State projection $\mathcal{E}$
4. **Positional Encoding**: Temporal control $\mathbf{PE}$
5. **Attention**: Information routing $\mathcal{A}$
6. **FFN**: Nonlinear transformation $\mathcal{F}$
7. **Normalization**: State regulation $\mathcal{N}$
8. **Optimizer**: Parameter control $\mathcal{O}$
9. **Scheduler**: Learning rate control $\mathcal{S}$
10. **Sampling**: Output control $\mathcal{P}$
### Control Flow Summary
```
Input Characters
↓ [Tokenizer Control]
Token IDs
↓ [Seed Control]
Initialized Parameters
↓ [Embedding Control]
Vector Representations
↓ [Positional Control]
Position-Aware Vectors
↓ [Attention Control]
Context-Aware Representations
↓ [FFN Control]
Transformed Features
↓ [Normalization Control]
Stabilized Activations
↓ [Output Control]
Probability Distributions
↓ [Sampling Control]
Generated Tokens
```
Each component acts as a **control element** in a unified dynamical system, working together to transform input text into meaningful language model outputs.
---
## 13. Block Diagram Analysis
### 13.1 Single Transformer Block Control System
**Block Diagram (a): Detailed Single Transformer Block**
```
Input X
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
┌─────────────┐
│ Multi-Head │
│ Attention │
└──────┬──────┘
┌─────────────┐
│ Dropout │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── (Residual Connection from X)
└──────┬──────┘
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
┌─────────────┐
│ Feed-Forward│
│ Network │
└──────┬──────┘
┌─────────────┐
│ Dropout │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── (Residual Connection)
└──────┬──────┘
Output X'
```
**Mathematical Transfer Function:**
```math
\mathbf{X}_{out} = \mathbf{X}_{in} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X}_{in}))) + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_{in} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X}_{in}))))))
```
### 13.2 Simplified Transformer Block
**Block Diagram (b): Simplified Single Block**
```
Input X
┌─────────────────────────────────────┐
│ TransformerBlock │
│ G_block(X) = X + Attn(LN(X)) + │
│ FFN(LN(X + Attn(LN(X))))│
└──────────────┬──────────────────────┘
Output X'
```
**Transfer Function:**
```math
G_{block}(\mathbf{X}) = \mathbf{X} + G_{attn}(\text{LN}(\mathbf{X})) + G_{ffn}(\text{LN}(\mathbf{X} + G_{attn}(\text{LN}(\mathbf{X}))))
```
where:
- $G_{attn}$ = attention transfer function
- $G_{ffn}$ = feed-forward transfer function
- $\text{LN}$ = layer normalization
### 13.3 Complete Model with Multiple Layers
**Block Diagram (c): Cascaded Transformer Blocks**
```
Input Tokens T
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
┌─────────────┐
│ Positional │
│ G_pos │
└──────┬──────┘
┌─────────────┐
│ Block 1 │
│ G_block₁ │
└──────┬──────┘
┌─────────────┐
│ Block 2 │
│ G_block₂ │
└──────┬──────┘
┌─────────────┐
│ ... │
└──────┬──────┘
┌─────────────┐
│ Block L │
│ G_block_L │
└──────┬──────┘
┌─────────────┐
│ Final Norm │
│ G_norm │
└──────┬──────┘
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
Output Logits
```
**Overall Transfer Function:**
```math
\mathbf{Y} = G_{out} \circ G_{norm} \circ G_{block_L} \circ ... \circ G_{block_2} \circ G_{block_1} \circ G_{pos} \circ G_{emb}(\mathbf{T})
```
### 13.4 Closed-Loop Training System
**Block Diagram (d): Training Control Loop**
```
Input Data X
┌─────────────┐
│ Model │
│ Forward │
│ F │
└──────┬──────┘
┌─────────────┐
│ Output │
│ ŷ │
└──────┬──────┘
┌─────────────┐
│ Loss │
│ L(ŷ, y) │
└──────┬──────┘
┌─────────────┐
│ Gradient │
│ ∇θ │
└──────┬──────┘
┌─────────────┐
│ Clipping │
│ Clip │
└──────┬──────┘
┌─────────────┐
│ Optimizer │
│ AdamW │
└──────┬──────┘
┌─────────────┐
│ Parameter │
│ Update │
└──────┬──────┘
┌─────────────┐
│ - │ ←─── (Feedback to Model)
└─────────────┘
```
**Closed-Loop Transfer Function:**
```math
\theta_{t+1} = \theta_t - \eta_t \cdot \text{AdamW}(\text{Clip}(\nabla_\theta L(\mathcal{F}(\mathbf{X}, \theta_t), \mathbf{y})))
```
---
## 14. Vector Visualization and Examples
### 14.1 Example Phrase: "Hello World"
We'll trace through the complete system with the phrase **"Hello World"**.
#### Step 1: Tokenization
**Input:** `"Hello World"`
**Process:**
```
Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
Token IDs: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
```
**Mathematical:**
```math
\mathbf{c} = \text{"Hello World"}
\mathbf{t} = \mathcal{T}(\mathbf{c}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
```
**Vector Representation:**
- Dimension: $n = 11$ tokens
- Token IDs: $\mathbf{t} \in \mathbb{N}^{11}$
#### Step 2: Embedding
**Embedding Matrix:** $\mathbf{E} \in \mathbb{R}^{128 \times 512}$
**Lookup Operation:**
```math
\mathbf{X} = \mathbf{E}[\mathbf{t}] = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111] \\
\mathbf{E}[32] \\
\mathbf{E}[87] \\
\mathbf{E}[111] \\
\mathbf{E}[114] \\
\mathbf{E}[108] \\
\mathbf{E}[100]
\end{bmatrix} \in \mathbb{R}^{11 \times 512}
```
**Example Values (first 3 dimensions):**
```math
\mathbf{E}[72] = [0.1, -0.2, 0.3, ...]^T \\
\mathbf{E}[101] = [-0.1, 0.3, -0.1, ...]^T \\
\mathbf{E}[108] = [0.05, 0.15, -0.05, ...]^T
```
**Vector Visualization:**
```
Token 'H' (ID=72): [0.10, -0.20, 0.30, ..., 0.05] (512-dim vector)
Token 'e' (ID=101): [-0.10, 0.30, -0.10, ..., 0.02] (512-dim vector)
Token 'l' (ID=108): [0.05, 0.15, -0.05, ..., 0.01] (512-dim vector)
...
```
#### Step 3: Positional Encoding
**Positional Encoding Matrix:** $\mathbf{PE} \in \mathbb{R}^{11 \times 512}$
**Computation:**
```math
PE_{(0, 0)} = \sin(0 / 10000^0) = 0 \\
PE_{(0, 1)} = \cos(0 / 10000^0) = 1 \\
PE_{(1, 0)} = \sin(1 / 10000^0) = \sin(1) \approx 0.8415 \\
PE_{(1, 1)} = \cos(1 / 10000^0) = \cos(1) \approx 0.5403
```
**Addition:**
```math
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}
```
**Example (first token, first 3 dimensions):**
```math
\mathbf{X}_{pos}[0, :3] = \begin{bmatrix}
0.1 \\ -0.2 \\ 0.3
\end{bmatrix} + \begin{bmatrix}
0 \\ 1 \\ 0
\end{bmatrix} = \begin{bmatrix}
0.1 \\ 0.8 \\ 0.3
\end{bmatrix}
```
#### Step 4: Multi-Head Attention
**Query, Key, Value Projections:**
Let $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{512 \times 512}$
```math
\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{11 \times 512}
```
**Example Calculation (head 0, token 0):**
For $h = 0$, $d_k = 512/8 = 64$:
```math
\mathbf{Q}[0, :64] = \mathbf{X}_{pos}[0] \mathbf{W}_Q[:, :64]
```
**Attention Score Computation:**
```math
S_{0,1} = \frac{\mathbf{Q}[0] \cdot \mathbf{K}[1]}{\sqrt{64}} = \frac{\sum_{i=0}^{63} Q_{0,i} \cdot K_{1,i}}{8}
```
**Example Numerical Calculation:**
Assume:
```math
\mathbf{Q}[0, :3] = [0.2, -0.1, 0.3] \\
\mathbf{K}[1, :3] = [0.1, 0.2, -0.1]
S_{0,1} = \frac{0.2 \times 0.1 + (-0.1) \times 0.2 + 0.3 \times (-0.1)}{8} \\
= \frac{0.02 - 0.02 - 0.03}{8} = \frac{-0.03}{8} = -0.00375
```
**Attention Weights:**
```math
A_{0,:} = \text{softmax}(S_{0,:}) = \frac{\exp(S_{0,:})}{\sum_{j=0}^{10} \exp(S_{0,j})}
```
**Example:**
If $S_{0,:} = [-0.004, 0.05, 0.02, 0.02, 0.08, -0.01, 0.03, 0.08, 0.01, 0.02, 0.04]$
```math
\exp(S_{0,:}) = [0.996, 1.051, 1.020, 1.020, 1.083, 0.990, 1.030, 1.083, 1.010, 1.020, 1.041]
\sum = 11.335
A_{0,:} = [0.088, 0.093, 0.090, 0.090, 0.096, 0.087, 0.091, 0.096, 0.089, 0.090, 0.092]
```
**Output Calculation:**
```math
\mathbf{O}[0] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}[j]
```
**Example (first dimension):**
```math
O_{0,0} = A_{0,0} V_{0,0} + A_{0,1} V_{1,0} + ... + A_{0,10} V_{10,0} \\
= 0.088 \times 0.2 + 0.093 \times 0.1 + ... + 0.092 \times 0.15 \\
\approx 0.12
```
#### Step 5: Feed-Forward Network
**Input:** $\mathbf{X}_{attn} \in \mathbb{R}^{11 \times 512}$
**First Linear Transformation:**
```math
\mathbf{H} = \mathbf{X}_{attn} \mathbf{W}_1 \in \mathbb{R}^{11 \times 2048}
```
**Example (token 0, first dimension):**
```math
H_{0,0} = \sum_{i=0}^{511} X_{attn,0,i} \cdot W_{1,i,0}
```
Assuming $X_{attn}[0, :3] = [0.12, -0.05, 0.08]$ and $W_1[:3, 0] = [0.1, -0.1, 0.05]$ (with the remaining terms negligible):
```math
H_{0,0} = 0.12 \times 0.1 + (-0.05) \times (-0.1) + 0.08 \times 0.05 \\
= 0.012 + 0.005 + 0.004 = 0.021
```
**GELU Activation:**
```math
\text{GELU}(0.021) = 0.021 \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{0.021}{\sqrt{2}}\right)\right)
\text{erf}(0.021/\sqrt{2}) = \text{erf}(0.0148) \approx 0.0167
\text{GELU}(0.021) = 0.021 \times 0.5 \times (1 + 0.0167) = 0.021 \times 0.5084 \approx 0.0107
```
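This value is easy to verify numerically with Python's standard library (a quick check of the hand calculation above):
```python
import math

x = 0.021
gelu = x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
print(round(gelu, 4))  # ≈ 0.0107, matching the hand calculation
```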
**Second Linear Transformation:**
```math
\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{11 \times 512}
```
#### Step 6: Complete Forward Pass Through One Layer
**Input:** $\mathbf{X}_{in} = \mathbf{X}_{pos} \in \mathbb{R}^{11 \times 512}$
**Step 6.1: Layer Normalization**
```math
\mu_0 = \frac{1}{512} \sum_{i=0}^{511} X_{in,0,i}
```
**Example:**
```math
\mu_0 = \frac{0.1 + 0.8 + 0.3 + ...}{512} \approx 0.02
\sigma_0^2 = \frac{1}{512} \sum_{i=0}^{511} (X_{in,0,i} - \mu_0)^2
\sigma_0^2 \approx \frac{(0.1-0.02)^2 + (0.8-0.02)^2 + ...}{512} \approx 0.15
\hat{X}_{0,0} = \frac{0.1 - 0.02}{\sqrt{0.15 + 1e-5}} = \frac{0.08}{0.387} \approx 0.207
```
**Step 6.2: Attention Output**
```math
\mathbf{X}_{attn} = \text{Attention}(\hat{\mathbf{X}})
```
**Step 6.3: Residual Connection**
```math
\mathbf{X}_{res1} = \mathbf{X}_{in} + \mathbf{X}_{attn}
```
**Example:**
```math
X_{res1,0,0} = 0.1 + 0.12 = 0.22
```
**Step 6.4: Second Layer Norm + FFN**
```math
\mathbf{X}_{ffn} = \text{FFN}(\text{LayerNorm}(\mathbf{X}_{res1}))
```
**Step 6.5: Final Residual**
```math
\mathbf{X}_{out} = \mathbf{X}_{res1} + \mathbf{X}_{ffn}
```
**Example:**
```math
X_{out,0,0} = 0.22 + 0.15 = 0.37
```
#### Step 7: Output Projection
**After L layers:**
```math
\mathbf{H}_{final} = \text{LayerNorm}(\mathbf{X}_{out}^{(L)}) \in \mathbb{R}^{11 \times 512}
```
**Output Projection:**
```math
\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}
```
**Example (position 0):**
```math
L_{0,:} = \mathbf{H}_{final}[0] \mathbf{W}_{out} \in \mathbb{R}^{128}
```
**Softmax:**
```math
p_{0,v} = \frac{\exp(L_{0,v})}{\sum_{w=0}^{127} \exp(L_{0,w})}
```
**Example:**
If $L_{0,72} = 5.2$ (logit for 'H'), $L_{0,101} = 3.1$ (logit for 'e'), etc.:
```math
\exp(5.2) = 181.27 \\
\exp(3.1) = 22.20 \\
\vdots
\sum_{w=0}^{127} \exp(L_{0,w}) \approx 250.0
p_{0,72} = \frac{181.27}{250.0} \approx 0.725 \quad \text{(72\% probability for H)}
```
---
## 15. Complete Numerical Example: "Hello"
Let's trace through the complete system with **"Hello"** step-by-step.
### Input: "Hello"
### Stage 1: Tokenization
```math
\mathbf{c} = \text{"Hello"} = ['H', 'e', 'l', 'l', 'o']
\mathbf{t} = [72, 101, 108, 108, 111]
```
### Stage 2: Embedding (d=512)
```math
\mathbf{E} \in \mathbb{R}^{128 \times 512}
\mathbf{X} = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111]
\end{bmatrix} = \begin{bmatrix}
0.10 & -0.20 & 0.30 & ... & 0.05 \\
-0.10 & 0.30 & -0.10 & ... & 0.02 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
-0.05 & 0.20 & 0.10 & ... & 0.03
\end{bmatrix} \in \mathbb{R}^{5 \times 512}
```
### Stage 3: Positional Encoding
```math
\mathbf{PE} = \begin{bmatrix}
0 & 1 & 0 & ... & 0 \\
0.84 & 0.54 & 0.01 & ... & 0.00 \\
0.91 & -0.42 & 0.02 & ... & 0.00 \\
0.14 & -0.99 & 0.03 & ... & 0.00 \\
-0.76 & -0.65 & 0.04 & ... & 0.00
\end{bmatrix} \in \mathbb{R}^{5 \times 512}
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} = \begin{bmatrix}
0.10 & 0.80 & 0.30 & ... & 0.05 \\
0.74 & 0.84 & -0.09 & ... & 0.02 \\
0.96 & -0.27 & -0.03 & ... & 0.01 \\
0.19 & -0.84 & -0.02 & ... & 0.01 \\
-0.81 & -0.45 & 0.14 & ... & 0.03
\end{bmatrix}
```
### Stage 4: Attention (h=8 heads, d_k=64)
**Query Generation:**
```math
\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{5 \times 512}
```
**Score Matrix (head 0):**
```math
\mathbf{S}_0 = \frac{\mathbf{Q}_0 \mathbf{K}_0^T}{\sqrt{64}} \in \mathbb{R}^{5 \times 5}
```
**Example Values:**
```math
\mathbf{S}_0 = \begin{bmatrix}
0.50 & -0.10 & 0.20 & 0.15 & 0.30 \\
-0.05 & 0.45 & 0.10 & 0.08 & 0.25 \\
0.15 & 0.05 & 0.40 & 0.30 & 0.20 \\
0.12 & 0.08 & 0.28 & 0.35 & 0.18 \\
0.25 & 0.15 & 0.22 & 0.20 & 0.42
\end{bmatrix}
```
**Attention Weights:**
```math
\mathbf{A}_0 = \text{softmax}(\mathbf{S}_0) = \begin{bmatrix}
0.35 & 0.15 & 0.22 & 0.20 & 0.28 \\
0.15 & 0.38 & 0.20 & 0.18 & 0.27 \\
0.23 & 0.18 & 0.32 & 0.30 & 0.26 \\
0.21 & 0.19 & 0.28 & 0.33 & 0.25 \\
0.27 & 0.22 & 0.26 & 0.25 & 0.36
\end{bmatrix}
```
**Output (head 0):**
```math
\mathbf{O}_0 = \mathbf{A}_0 \mathbf{V}_0 \in \mathbb{R}^{5 \times 64}
```
**Concatenate All Heads:**
```math
\mathbf{O} = \text{Concat}[\mathbf{O}_0, ..., \mathbf{O}_7] \in \mathbb{R}^{5 \times 512}
```
### Stage 5: Feed-Forward
```math
\mathbf{H} = \mathbf{O} \mathbf{W}_1 \in \mathbb{R}^{5 \times 2048}
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{5 \times 2048}
\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{5 \times 512}
```
### Stage 6: Output Logits
After processing through all $L$ layers:
```math
\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{5 \times 128}
```
**Example (position 4, predicting next token):**
```math
L_{4,:} = [2.1, 1.5, ..., 5.2, ..., 3.1, ...]
```
where:
- $L_{4,111} = 5.2$ (high score for 'o')
- $L_{4,32} = 4.8$ (high score for space)
- $L_{4,87} = 4.5$ (high score for 'W')
**Probability Distribution:**
```math
\mathbf{p}_4 = \text{softmax}(L_{4,:}) = [0.01, 0.008, ..., 0.25, ..., 0.18, ...]
p_{4,111} \approx 0.25 \quad \text{(25\% for o)} \\
p_{4,32} \approx 0.22 \quad \text{(22\% for space)} \\
p_{4,87} \approx 0.18 \quad \text{(18\% for W)}
```
---
## 16. Vector Space Visualization
### 16.1 Embedding Space
**2D Projection Example:**
After embedding "Hello", tokens occupy positions in 512-dimensional space. Projected to 2D:
```
Token Positions (idealized 2D projection):
'l' (0.05, 0.15)
'e' (-0.10, 0.30)
Origin (0, 0)
'H' (0.10, -0.20)
'o' (-0.05, 0.20)
```
**Distance in Embedding Space:**
```math
d(\mathbf{E}[72], \mathbf{E}[101]) = ||\mathbf{E}[72] - \mathbf{E}[101]||_2
d = \sqrt{(0.1 - (-0.1))^2 + (-0.2 - 0.3)^2 + ...} \approx \sqrt{0.04 + 0.25 + ...} \approx 2.1
```
### 16.2 Attention Weight Visualization
**Attention Matrix Visualization:**
```
Position 0 1 2 3 4
┌─────┴─────┴─────┴─────┴──┐
Token 0 │ 0.35 0.15 0.22 0.20 0.28 │ 'H'
│ │
Token 1 │ 0.15 0.38 0.20 0.18 0.27 │ 'e'
│ │
Token 2 │ 0.23 0.18 0.32 0.30 0.26 │ 'l'
│ │
Token 3 │ 0.21 0.19 0.28 0.33 0.25 │ 'l'
│ │
Token 4 │ 0.27 0.22 0.26 0.25 0.36 │ 'o'
└──────────────────────────┘
```
**Interpretation:**
- Token 0 ('H') attends most to itself (0.35) and token 4 (0.28)
- Token 4 ('o') attends moderately to all positions
- Higher values indicate stronger attention
### 16.3 Probability Distribution Visualization
**Output Distribution for Position 5 (next token after "Hello"):**
```
Probability Distribution p[5, :]
Probability
0.3 │ ●
0.2 │ ● ●
0.1 │ ● ● ● ●
0.0 ├─┴───┴───┴───┴───┴───┴───┴───┴─── Token IDs
32 72 87 101 108 111 ... 127
␣ H W e l o
```
**Meaning:**
- Highest probability for space (32) ≈ 0.28
- Next: 'o' (111) ≈ 0.23
- Then: 'W' (87) ≈ 0.18
- Model predicts space or continuation
---
## 17. Advanced Block Diagram Simplification
### 17.1 Complex Multi-Layer System Simplification
Following control system reduction techniques, we can simplify the transformer model step-by-step:
**Diagram (a): Original Complex System**
```
Input R (Tokens)
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
┌─────────────┐
│ Positional │
│ Encoding │
│ G_pos │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── Feedback from Layer 2
└──────┬──────┘
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── Feedback from Output
└──────┬──────┘
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── Feedback H₁
└──────┬──────┘
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
Output C (Logits)
```
**Diagram (b): First Simplification (Combine Embedding and Positional)**
```
Input R
┌─────────────────────┐
│ G_emb_pos = │
│ G_pos ∘ G_emb │
└──────┬──────────────┘
┌─────────────┐
│ + │
└──────┬──────┘
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
┌─────────────┐
│ + │
└──────┬──────┘
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
┌─────────────┐
│ G_out │
└──────┬──────┘
Output C
```
**Diagram (c): Second Simplification (Combine Layers)**
```
Input R
┌─────────────────────┐
│ G_emb_pos │
└──────┬──────────────┘
┌──────────────────────────────────┐
│ G_layers = G_block₂ ∘ G_block₁ │
│ Equivalent to: │
│ X + Δ₁(X) + Δ₂(X + Δ₁(X)) │
└──────┬───────────────────────────┘
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
┌─────────────┐
│ G_out │
└──────┬──────┘
Output C
```
**Diagram (d): Third Simplification (Combine with Output)**
```
Input R
┌──────────────────────────────┐
│ G_forward = │
│ G_out ∘ G_layers ∘ G_emb_pos │
└──────┬───────────────────────┘
┌─────────────┐
│ + │ ←─── H₁ (Feedback)
└──────┬──────┘
Output C
```
**Diagram (e): Final Simplified Transfer Function**
```
Input R
┌────────────────────────────────────────────┐
│ Overall Transfer Function: │
│ │
│ C/R = G_forward / (1 + G_forward × H₁) │
│ │
│ Where: │
│ G_forward = G_out ∘ G_layers ∘ G_emb_pos │
│ │
└──────┬─────────────────────────────────────┘
Output C
```
**Mathematical Derivation:**
**Step 1:** Combine embedding and positional encoding:
```math
G_{emb\_pos}(\mathbf{T}) = G_{pos}(G_{emb}(\mathbf{T})) = \mathbf{E}[\mathbf{T}] + \mathbf{PE}
```
**Step 2:** Combine transformer layers:
```math
G_{layers}(\mathbf{X}) = G_{block_2}(G_{block_1}(\mathbf{X}))
G_{layers}(\mathbf{X}) = \mathbf{X} + \Delta_1(\mathbf{X}) + \Delta_2(\mathbf{X} + \Delta_1(\mathbf{X}))
```
where $\Delta_l$ represents the transformation inside block $l$.
**Step 3:** Combine with output projection:
```math
G_{forward}(\mathbf{T}) = G_{out}(G_{layers}(G_{emb\_pos}(\mathbf{T})))
```
**Step 4:** Apply feedback reduction:
```math
\frac{C}{R} = \frac{G_{forward}}{1 + G_{forward} \times H_1}
```
### 17.2 Attention Block Simplification
**Diagram (a): Detailed Attention**
```
Input X
┌─────────────┐
│ Q │ ←─── W_Q
│ K │ ←─── W_K
│ V │ ←─── W_V
└──────┬──────┘
┌─────────────┐
│ Scores │
│ S = QK^T/√d │
└──────┬──────┘
┌─────────────┐
│ Softmax │
│ A = σ(S) │
└──────┬──────┘
┌─────────────┐
│ Output │
│ O = AV │
└──────┬──────┘
┌─────────────┐
│ Out Proj │
│ W_O │
└──────┬──────┘
Output X'
```
**Diagram (b): Simplified Attention Transfer Function**
```
Input X
┌──────────────────────────────┐
│ G_attn(X) = │
│ softmax(QK^T/√d) · V · W_O   │
│ │
│ Where: │
│ Q = XW_Q, K = XW_K, V = XW_V │
└──────┬───────────────────────┘
Output X'
```
**Mathematical Transfer Function:**
```math
G_{attn}(\mathbf{X}) = \text{softmax}\left(\frac{(\mathbf{X} \mathbf{W}_Q)(\mathbf{X} \mathbf{W}_K)^T}{\sqrt{d_k}}\right) (\mathbf{X} \mathbf{W}_V) \, \mathbf{W}_O
```
---
## 18. Vector Trace: "Hello World" Complete Flow
### 18.1 Complete Vector Trace with Numerical Values
**Input:** `"Hello World"`
**Stage 1: Tokenization**
```math
\mathbf{t} = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
```
**Stage 2: Embedding (showing first 4 dimensions)**
```math
\mathbf{X} = \begin{bmatrix}
[H] & 0.10 & -0.20 & 0.30 & 0.15 & ... \\
[e] & -0.10 & 0.30 & -0.10 & 0.08 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[ ] & 0.02 & 0.05 & 0.02 & 0.01 & ... \\
[W] & 0.15 & -0.15 & 0.25 & 0.12 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[r] & 0.08 & 0.10 & -0.08 & 0.04 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[d] & 0.12 & -0.08 & 0.18 & 0.09 & ...
\end{bmatrix} \in \mathbb{R}^{11 \times 512}
```
**Stage 3: Positional Encoding (first 4 dimensions)**
```math
\mathbf{PE} = \begin{bmatrix}
[0] & 0.00 & 1.00 & 0.00 & 0.00 & ... \\
[1] & 0.84 & 0.54 & 0.01 & 0.00 & ... \\
[2] & 0.91 & -0.42 & 0.02 & 0.00 & ... \\
[3] & 0.14 & -0.99 & 0.03 & 0.00 & ... \\
[4] & -0.76 & -0.65 & 0.04 & 0.00 & ... \\
[5] & -0.96 & 0.28 & 0.05 & 0.00 & ... \\
[6] & -0.28 & 0.96 & 0.06 & 0.00 & ... \\
[7] & 0.65 & 0.76 & 0.07 & 0.00 & ... \\
[8] & 0.99 & -0.14 & 0.08 & 0.00 & ... \\
[9] & 0.42 & -0.91 & 0.09 & 0.00 & ... \\
[10] & -0.54 & -0.84 & 0.10 & 0.00 & ...
\end{bmatrix}
```
**Stage 4: Combined Input**
```math
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}
```
**Example Row 0 (token 'H'):**
```math
\mathbf{X}_{pos}[0, :4] = [0.10, -0.20, 0.30, 0.15] + [0.00, 1.00, 0.00, 0.00] = [0.10, 0.80, 0.30, 0.15]
```
**Stage 5: Attention (Head 0, showing attention from token 0 to all tokens)**
```math
\mathbf{S}_0[0, :] = [0.50, -0.10, 0.20, 0.15, 0.30, -0.05, 0.18, 0.28, 0.12, 0.20, 0.22]
\mathbf{A}_0[0, :] = \text{softmax}(\mathbf{S}_0[0, :]) = [0.35, 0.15, 0.22, 0.20, 0.28, 0.14, 0.19, 0.26, 0.17, 0.21, 0.23]
```
**Meaning:** Token 'H' (position 0) attends:
- 35% to itself
- 28% to token 'o' (position 4)
- 26% to token 'o' (position 7)
- 23% to token 'd' (position 10)
**Stage 6: Attention Output**
```math
\mathbf{O}_0[0, :] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}_0[j, :]
```
**Example (first dimension):**
```math
O_{0,0,0} = 0.35 \times 0.12 + 0.15 \times 0.08 + ... + 0.23 \times 0.15 \approx 0.115
```
**Stage 7: FFN Output**
```math
\mathbf{H}_{ffn}[0, :4] = [0.15, -0.08, 0.22, 0.18]
```
**Stage 8: Final Output (after all layers)**
```math
\mathbf{H}_{final}[0, :4] = [0.42, 0.25, 0.58, 0.31]
```
**Stage 9: Logits**
```math
\mathbf{L}[0, :] = [2.1, 1.8, ..., 5.2, ..., 3.4, ...]
```
where $L[0, 72] = 5.2$ is the highest logit (predicting 'H' at position 1).
**Stage 10: Probabilities**
```math
\mathbf{p}[0, :] = \text{softmax}(\mathbf{L}[0, :]) = [0.01, 0.008, ..., 0.28, ..., 0.15, ...]
p[0, 72] \approx 0.28 \quad \text{(28\% probability for H)}
```
---
## 19. Vector Plots and Visualizations
### 19.1 Embedding Vector Trajectory
**Trajectory Plot:**
```
512-Dimensional Embedding Space (2D Projection)
0.3 │ 'e' (pos 1)
│ ●
0.2 │ 'r' (pos 8)
│ ●
0.1 │ 'l' (pos 2,3,9) 'o' (pos 4,7)
│ ● ●
0.0 ├───────────────────────────────────────────
│ 'H' (pos 0)
-0.1 │ ●
-0.2 │
-0.3 │ 'W' (pos 6)
│ ●
└───────────────────────────────────────────
-0.3 -0.2 -0.1 0.0 0.1 0.2 0.3
```
### 19.2 Attention Heatmap
**Attention Weight Matrix Visualization:**
```
Attention Weights A[i,j] for "Hello World"
j → 0 1 2 3 4 5 6 7 8 9 10
↓ ['H'] ['e'] ['l'] ['l'] ['o'] [' '] ['W'] ['o'] ['r'] ['l'] ['d']
i=0 ['H'] │ 0.35 0.15 0.22 0.20 0.28 0.14 0.19 0.26 0.17 0.21 0.23 │
i=1 ['e'] │ 0.15 0.38 0.20 0.18 0.27 0.16 0.18 0.25 0.19 0.22 0.20 │
i=2 ['l'] │ 0.23 0.18 0.32 0.30 0.26 0.17 0.21 0.24 0.25 0.31 0.23 │
i=3 ['l'] │ 0.21 0.19 0.28 0.33 0.25 0.18 0.20 0.23 0.24 0.30 0.22 │
i=4 ['o'] │ 0.27 0.22 0.26 0.25 0.36 0.19 0.23 0.29 0.24 0.27 0.25 │
i=5 [' '] │ 0.18 0.20 0.19 0.21 0.24 0.40 0.22 0.25 0.21 0.20 0.22 │
i=6 ['W'] │ 0.22 0.21 0.23 0.24 0.26 0.20 0.45 0.28 0.27 0.23 0.25 │
i=7 ['o'] │ 0.26 0.25 0.24 0.23 0.29 0.21 0.28 0.38 0.26 0.24 0.26 │
i=8 ['r'] │ 0.19 0.21 0.25 0.24 0.24 0.19 0.27 0.26 0.42 0.27 0.28 │
i=9 ['l'] │ 0.21 0.22 0.31 0.30 0.27 0.20 0.23 0.24 0.27 0.35 0.24 │
i=10['d'] │ 0.23 0.20 0.23 0.22 0.25 0.22 0.25 0.26 0.28 0.24 0.48 │
Color Coding:
█ = 0.48-0.50 (very high attention)
█ = 0.35-0.48 (high attention)
█ = 0.25-0.35 (medium attention)
█ = 0.15-0.25 (low attention)
█ = 0.00-0.15 (very low attention)
```
### 19.3 Probability Distribution Plot
**Logits and Probabilities:**
```
Logits L[5, :] (predicting token after "Hello ")
Logit
Value │
6.0 │ ● (token 87 'W')
5.0 │ ● (token 111 'o')
4.0 │ ● (token 32 ' ') ● (token 114 'r')
3.0 │ ● ● ●
2.0 │ ● ● ● ● ● ● ● ● ● ● ●
1.0 │ ● ● ● ● ● ● ● ● ● ● ●
0.0 ├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
␣ H W e l o r
Probabilities p[5, :]
Probability
0.3│ ● ('W')
0.2│ ● (' ') ● ('o')
0.1│ ● ● ● ● ● ● ●
0.0├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
```
### 19.4 Hidden State Evolution Through Layers
**Layer-by-Layer Transformation:**
```
Hidden State Evolution for Token 'H' (position 0)
Dimension 0:
Layer 0: 0.10 (embedding + positional)
Layer 1: 0.42 (after attention + FFN)
Layer 2: 0.58 (after second layer)
Layer 3: 0.65 (after third layer)
... ...
Layer L: 0.72 (final hidden state)
Dimension 1:
Layer 0: 0.80 (embedding + positional)
Layer 1: 0.25 (after attention + FFN)
Layer 2: 0.18 (after second layer)
Layer 3: 0.22 (after third layer)
... ...
Layer L: 0.15 (final hidden state)
```
**Visualization:**
```
Hidden State Magnitude ||h[l]|| Over Layers
Magnitude
1.0│ ●
│ ●
0.8│ ●
│ ●
0.6│ ●
│ ●
0.4│ ●
│ ●
0.2│ ●
│ ●
0.0├───────────────────────── Layer
0 1 2 3 4 5 6
```
---
## 20. Summary: Complete Mathematical Trace
### Complete System Equation with Numerical Example
**Text:** `"Hello World"`
**Complete Mathematical Flow:**
1. **Tokenization:**
```math
\mathbf{t} = \mathcal{T}(\text{"Hello World"}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
```
2. **Embedding:**
```math
\mathbf{X} = \mathbf{E}[\mathbf{t}] \in \mathbb{R}^{11 \times 512}
```
3. **Positional Encoding:**
```math
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{11 \times 512}
```
4. **Transformer Layers (L=6):**
```math
\mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., 6
```
5. **Output:**
```math
\mathbf{L} = \mathbf{h}_6 \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}
```
6. **Probabilities:**
```math
\mathbf{p} = \text{softmax}(\mathbf{L}) \in \mathbb{R}^{11 \times 128}
```
**Final Prediction:**
For position 5 (after "Hello "):
```math
p[5, 87] = 0.28 \quad \text{(28\% for W)} \\
p[5, 32] = 0.22 \quad \text{(22\% for space)} \\
p[5, 111] = 0.18 \quad \text{(18\% for o)}
```
**Most Likely:** `'W'` → Complete prediction: `"Hello World"`
---
_This document provides a complete mathematical control system formulation with block diagrams, vector visualizations, numerical examples, and step-by-step calculations for every component of the SheepOp LLM._