- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
SheepOp LLM - Mathematical Control System Model
Complete mathematical control system formulation of the SheepOp Language Model, treating the entire system as a unified mathematical control system with state-space representations, transfer functions, and step-by-step explanations.
Table of Contents
- System Overview
- State-Space Representation
- Tokenizer as Input Encoder
- Seed Control System
- Embedding Layer Control
- Positional Encoding State
- Self-Attention Control System
- Feed-Forward Control
- Layer Normalization Feedback
- Complete System Dynamics
- Training as Optimization Control
- Inference Control Loop
1. System Overview
1.1 Control System Architecture
The SheepOp LLM can be modeled as a nonlinear dynamical control system with:
- Input: Character sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$
- State: Hidden representations $\mathbf{h}_t$ at each layer and time step
- Control: Model parameters $\theta = \{W_Q, W_K, W_V, W_1, W_2, ...\}$
- Output: Probability distribution over the vocabulary $\mathbf{p}_t \in \mathbb{R}^V$
System Block Diagram:
Input Sequence → Tokenizer → Embeddings → Positional Encoding →
↓
[Transformer Layer 1] → [Transformer Layer 2] → ... → [Transformer Layer L]
↓
Output Projection → Logits → Softmax → Output Probabilities
1.2 Mathematical System Formulation
The complete system can be expressed as:
\mathbf{y}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_t, \theta, \mathbf{s})
where:
- $\mathbf{x}_t$ = input at time $t$
- $\mathbf{h}_t$ = hidden state at time $t$
- $\theta$ = system parameters (weights)
- $\mathbf{s}$ = seed for randomness
- $\mathcal{F}$ = complete forward function
2. State-Space Representation
2.1 Discrete-Time State-Space Model
For a transformer with L layers and sequence length n :
State Vector:
\mathbf{H}_t = \begin{bmatrix}
\mathbf{h}_t^{(1)} \\
\mathbf{h}_t^{(2)} \\
\vdots \\
\mathbf{h}_t^{(L)}
\end{bmatrix} \in \mathbb{R}^{L \times n \times d}
where
\mathbf{h}_t^{(l)} \in \mathbb{R}^{n \times d} is the hidden state at layer l .
State Update Equation:
\mathbf{h}_t^{(l)} = f_l(\mathbf{h}_t^{(l-1)}, \theta_l), \quad l = 1, 2, ..., L
where f_l is the transformation at layer l and \mathbf{h}_t^{(0)} is the embedded, position-encoded input.
Output Equation:
\mathbf{y}_t = g(\mathbf{h}_t^{(L)}, \theta_{out})
2.2 System Linearity Analysis
The system is nonlinear due to:
- Attention mechanism (softmax)
- Activation functions (GELU)
- Layer normalization
However, individual components can be linearized locally (through their Jacobians) and analyzed as approximately linear systems around an operating point.
3. Tokenizer as Input Encoder
3.1 Tokenizer Control Function
The tokenizer maps a character sequence to a discrete token sequence:
\mathcal{T}: \mathcal{C}^* \rightarrow \mathbb{N}^*
Mathematical Formulation:
For input sequence \mathbf{c} = [c_1, c_2, ..., c_n] :
\mathbf{t} = \mathcal{T}(\mathbf{c}) = [V(c_1), V(c_2), ..., V(c_n)]
where V: \mathcal{C} \rightarrow \mathbb{N} is the vocabulary mapping function.
3.2 Vocabulary Mapping Function
V(c) = \begin{cases}
0 & \text{if } c = \text{<pad>} \\
1 & \text{if } c = \text{<unk>} \\
2 & \text{if } c = \text{<bos>} \\
3 & \text{if } c = \text{<eos>} \\
v & \text{if } c \in \mathcal{C}_{vocab}
\end{cases}
Control Properties:
- Deterministic: Same input always produces same output
- Invertible: For most tokens, $V^{-1}$ exists
- Injective on the vocabulary: Each in-vocabulary character maps to a unique token ID
3.3 Tokenizer State Space
The tokenizer maintains internal state:
\Sigma_{\mathcal{T}} = \{V, V^{-1}, \text{padding\_strategy}, \text{max\_length}\}
State Transition:
\Sigma_{\mathcal{T}}' = \Sigma_{\mathcal{T}} \quad \text{(static during operation)}
3.4 Step-by-Step Explanation
Step 1: Character Extraction
- Input: Raw text string "Hello"
- Process: Extract each character $c \in \{'H', 'e', 'l', 'l', 'o'\}$
- Meaning: Break down text into atomic units
Step 2: Vocabulary Lookup
- Process: Apply $V(c)$ to each character
- Example: $V('H') = 72, V('e') = 101, V('l') = 108, V('o') = 111$
- Meaning: Convert characters to numerical indices
Step 3: Sequence Formation
- Output: $\mathbf{t} = [72, 101, 108, 108, 111]$
- Meaning: Numerical representation ready for embedding
Control Impact: Tokenizer creates the foundation for all subsequent processing. Any error here propagates through the entire system.
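The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's tokenizer: it assumes a 128-entry ASCII-sized vocabulary in which a character's ID is its code point (matching the example IDs 72, 101, 108, 111), reserves IDs 0 to 3 for the special tokens, and maps everything else to `<unk>`. The names `encode`, `decode`, `SPECIAL_TOKENS`, and `VOCAB_SIZE` are hypothetical.

```python
# Hypothetical character-level tokenizer sketch; names and vocabulary size are assumptions.
SPECIAL_TOKENS = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3}
UNK_ID = SPECIAL_TOKENS["<unk>"]
VOCAB_SIZE = 128  # assumption: ASCII-sized vocabulary


def encode(text: str) -> list[int]:
    """Map each character to a token ID (Step 1 and Step 2), returning the sequence (Step 3)."""
    ids = []
    for ch in text:
        code = ord(ch)
        # IDs 0-3 are reserved, and code points >= VOCAB_SIZE are out of vocabulary.
        ids.append(code if 4 <= code < VOCAB_SIZE else UNK_ID)
    return ids


def decode(ids: list[int]) -> str:
    """Inverse mapping for in-vocabulary IDs; special tokens are skipped."""
    return "".join(chr(i) for i in ids if 4 <= i < VOCAB_SIZE)


token_ids = encode("Hello")
print(token_ids)
```

Out-of-vocabulary characters all collapse to the same ID, so the inverse mapping only exists for in-vocabulary characters.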
4. Seed Control System
4.1 Seed as System Initialization
The seed s \in \mathbb{N} controls randomness throughout the system:
\mathcal{R}(\mathbf{x}, s) = \text{deterministic\_random}(\mathbf{x}, s)
4.2 Seed Propagation Function
Initialization:
\text{seed\_torch}(s): \text{torch.manual\_seed}(s)
\text{seed\_cuda}(s): \text{torch.cuda.manual\_seed\_all}(s)
\text{seed\_cudnn}(s): \text{torch.backends.cudnn.deterministic} = \text{True}
Mathematical Model:
\mathbb{P}(\mathbf{W} | s) = \begin{cases}
\delta(\mathbf{W} - \mathbf{W}_s) & \text{if deterministic} \\
\text{some distribution} & \text{if stochastic}
\end{cases}
where \delta is the Dirac delta and \mathbf{W}_s is the weight initialization given seed s .
4.3 Seed Control Equation
For weight initialization:
\mathbf{W}_0 = \mathcal{I}(\mathbf{s}, \text{init\_method})
where \mathcal{I} is the initialization function.
Example - Normal Initialization:
\mathbf{W}_0 \sim \mathcal{N}(0, \sigma^2) \quad \text{with random state } r(s)
W_{ij} = \sigma \cdot \Phi^{-1}(U_{ij}(s))
where:
- \mathcal{N}(0, \sigma^2) = normal distribution
- \Phi^{-1} = inverse CDF
- U_{ij}(s) = uniform random number from seed s
- \sigma = 0.02 (typical value)
4.4 Step-by-Step Explanation
Step 1: Seed Input
- Input: $s = 42$
- Meaning: Provides reproducibility guarantee
Step 2: RNG State Initialization
- Process: Set all random number generators to a state based on $s$
- Meaning: Ensures deterministic behavior
Step 3: Weight Initialization
- Process: Generate all weights using RNG with seed $s$
- Example: $W_{ij} = \text{normal}(0, 0.02, \text{seed}=42)$
- Meaning: Starting point for optimization
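Step 4: Training Determinism
- Process: Same seed + same data + same hardware and software configuration → same gradients → same updates
- Meaning: Reproducible training runs (some GPU kernels remain nondeterministic unless deterministic algorithms are explicitly enabled)
Control Impact: Seed controls initial conditions and stochastic processes throughout training. It is the control parameter for reproducibility.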
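As a rough illustration of the seed control equation $W_{ij} = \sigma \cdot \Phi^{-1}(U_{ij}(s))$, the sketch below uses only the Python standard library rather than PyTorch. The function name `init_weights` is hypothetical, and the clamping of the uniform draw is an added safeguard so the inverse CDF is never evaluated at 0.

```python
import random
from statistics import NormalDist


def init_weights(seed: int, rows: int, cols: int, sigma: float = 0.02) -> list[list[float]]:
    """Seeded normal initialization via inverse-CDF sampling (illustrative only)."""
    rng = random.Random(seed)            # private RNG whose state depends only on the seed
    std_normal = NormalDist(0.0, 1.0)    # Phi is its CDF, inv_cdf is Phi^{-1}
    weights = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            u = rng.random()                       # U_ij(s), uniform on [0, 1)
            u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep strictly inside (0, 1)
            row.append(sigma * std_normal.inv_cdf(u))
        weights.append(row)
    return weights


w_a = init_weights(seed=42, rows=2, cols=3)
w_b = init_weights(seed=42, rows=2, cols=3)
w_c = init_weights(seed=43, rows=2, cols=3)
print(w_a == w_b, w_a == w_c)
```

The same seed reproduces the same matrix, while a different seed gives a different starting point for optimization.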
5. Embedding Layer Control
5.1 Embedding as Linear Transformation
The embedding layer performs a lookup operation:
\mathcal{E}: \mathbb{N} \rightarrow \mathbb{R}^d
Mathematical Formulation:
\mathbf{E} \in \mathbb{R}^{V \times d} \quad \text{(embedding matrix)}
\mathbf{x}_t = \mathbf{E}[\mathbf{t}_t] = \mathbf{E}_t \in \mathbb{R}^d
where \mathbf{t}_t \in \mathbb{N} is the token ID at position t .
5.2 Embedding Control System
Batch Processing:
\mathbf{X} = \mathbf{E}[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}
where \mathbf{T} \in \mathbb{N}^{B \times n} is the batch of token IDs.
Control Function:
\mathbf{X} = \mathcal{E}(\mathbf{T}, \mathbf{E})
Gradient Flow:
\frac{\partial \mathcal{L}}{\partial \mathbf{E}} = \sum_{b,t} \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{b,t}} \cdot \mathbf{1}[\mathbf{T}_{b,t}]
where \mathbf{1}[\mathbf{T}_{b,t}] is a one-hot indicator.
5.3 Step-by-Step Explanation
Step 1: Token ID Input
- Input: $t = 72$ (token ID for 'H')
- Meaning: Discrete index into vocabulary
Step 2: Matrix Lookup
- Process: $\mathbf{x} = \mathbf{E}[72]$
- Example: $\mathbf{x} = [0.1, -0.2, 0.3, ..., 0.05] \in \mathbb{R}^{512}$
- Meaning: Continuous vector representation
Step 3: Semantic Encoding
- Property: Similar tokens have similar embeddings (after training)
- Meaning: Embeddings capture semantic relationships
Control Impact: Embedding layer projects discrete tokens into continuous space, enabling gradient-based optimization.
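The lookup $\mathbf{E}[t]$ is a linear transformation in disguise: selecting row $t$ gives the same result as multiplying a one-hot vector by $\mathbf{E}$. The sketch below checks this on a small made-up matrix; `embed_lookup`, `embed_one_hot`, and `E_toy` are illustrative names, not part of the model code.

```python
def embed_lookup(E, token_ids):
    """Row selection: x_t = E[t_t]."""
    return [list(E[t]) for t in token_ids]


def embed_one_hot(E, token_ids):
    """Equivalent linear form: x_t = onehot(t_t) @ E."""
    V, d = len(E), len(E[0])
    out = []
    for t in token_ids:
        one_hot = [1.0 if v == t else 0.0 for v in range(V)]
        out.append([sum(one_hot[v] * E[v][j] for v in range(V)) for j in range(d)])
    return out


# Toy embedding matrix with V = 6 tokens and d = 4 dimensions (made-up values).
E_toy = [[0.1 * v + 0.01 * j for j in range(4)] for v in range(6)]
ids = [3, 0, 5, 3]

looked_up = embed_lookup(E_toy, ids)
multiplied = embed_one_hot(E_toy, ids)
print(looked_up[0], multiplied[0])
```

This also explains the gradient formula in 5.2: only the rows of $\mathbf{E}$ selected by the one-hot indicators receive gradient.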
6. Positional Encoding State
6.1 Positional Encoding as Additive Control
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{B \times n \times d}
where \mathbf{PE} \in \mathbb{R}^{n \times d} is the positional encoding matrix.
6.2 Positional Encoding Function
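PE_{(pos, i)} = \begin{cases}
\sin\left(\frac{pos}{10000^{i/d}}\right) & \text{if } i \text{ is even} \\
\cos\left(\frac{pos}{10000^{(i-1)/d}}\right) & \text{if } i \text{ is odd}
\end{cases}
where i = 0, 1, ..., d-1 is the dimension index, so each sine/cosine pair (i, i+1) shares one frequency.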
6.3 Control System Interpretation
Additive Control:
\mathbf{X}_{out} = \mathbf{X}_{in} + \mathbf{U}_{pos}
where \mathbf{U}_{pos} is the **control input** representing position information.
Meaning: Positional encoding injects positional information into the embeddings.
6.4 Step-by-Step Explanation
Step 1: Position Index
- Input: Position $pos = 0, 1, 2, ..., n-1$
- Meaning: Absolute position in sequence
Step 2: Encoding Generation
- Process: Compute $PE_{(pos, i)}$ for each dimension $i$
- Example: $PE_{(0, 0)} = 0, PE_{(0, 1)} = 1, PE_{(1, 0)} \approx 0.84$
- Meaning: Unique pattern for each position
Step 3: Addition Operation
- Process: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$
- Meaning: Position information added to embeddings
Step 4: Multi-Scale Representation
- Property: Different dimensions encode different frequency scales
- Meaning: Model can learn both local and global positional patterns
Control Impact: Positional encoding provides temporal/spatial awareness to the model, enabling it to understand sequence order.
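A minimal sketch of the encoding computation follows. It assumes the standard sinusoidal scheme in which dimensions $2k$ and $2k+1$ share the frequency $1/10000^{2k/d}$; the function name `positional_encoding` is hypothetical.

```python
import math


def positional_encoding(n: int, d: int) -> list[list[float]]:
    """Return an n x d matrix of sinusoidal positional encodings."""
    pe = []
    for pos in range(n):
        row = []
        for i in range(d):
            pair = i - (i % 2)                   # even index shared by a sin/cos pair
            angle = pos / (10000 ** (pair / d))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


pe = positional_encoding(n=5, d=512)
print(round(pe[0][0], 4), round(pe[0][1], 4), round(pe[1][0], 4), round(pe[1][1], 4))
```

Low dimensions oscillate quickly with position while high dimensions change very slowly, which is the multi-scale property described in Step 4.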
7. Self-Attention Control System
7.1 Attention as Information Routing
Self-attention can be modeled as a dynamical control system that routes information:
\mathbf{O} = \text{Attention}(\mathbf{X}, \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V)
7.2 State-Space Model for Attention
Query, Key, Value Generation:
\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{B \times n \times d}
\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{B \times n \times d}
\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{B \times n \times d}
Attention Scores (Transfer Function):
\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times h \times n \times n}
Attention Weights (Control Signal):
\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{B \times h \times n \times n}
Output (Controlled Response):
\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{B \times h \times n \times d_k}
7.3 Control System Interpretation
Attention as Feedback Control:
\mathbf{O}_i = \sum_{j=1}^{n} A_{ij} \mathbf{V}_j
where A_{ij} is the **control gain** determining how much information flows from position j to position i .
Meaning: Attention acts as a learnable routing mechanism controlled by similarities between queries and keys.
7.4 Multi-Head Attention Control
Head Splitting:
\mathbf{Q}_h = \mathbf{Q}[:, :, h \cdot d_k : (h+1) \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}
Parallel Processing:
\mathbf{O}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h), \quad h = 0, ..., H-1
Concatenation:
\mathbf{O} = \text{Concat}[\mathbf{O}_0, \mathbf{O}_1, ..., \mathbf{O}_{H-1}] \in \mathbb{R}^{B \times n \times d}
7.5 Causal Masking Control
Causal Mask:
M_{ij} = \begin{cases}
0 & \text{if } i \geq j \text{ (allowed)} \\
-\infty & \text{if } i < j \text{ (masked)}
\end{cases}
Masked Attention:
\mathbf{S}_{masked} = \mathbf{S} + M
Effect: Prevents information flow from future positions.
7.6 Step-by-Step Explanation
Step 1: Query, Key, Value Generation
- Process: Linear transformations of input
- Meaning: Create three representations: what to look for (Q), what to match (K), what to retrieve (V)
Step 2: Similarity Computation
- Process: $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$
- Meaning: Measure similarity/relevance between positions $i$ and $j$
Step 3: Softmax Normalization
- Process: $A_{ij} = \exp(S_{ij}) / \sum_k \exp(S_{ik})$
- Meaning: Convert similarities to probability distribution (attention weights)
Step 4: Weighted Aggregation
- Process: $O_i = \sum_j A_{ij} V_j$
- Meaning: Combine values weighted by attention probabilities
Step 5: Information Flow
- Property: Each position receives information from itself and all earlier positions (later positions are blocked by the causal mask)
- Meaning: Enables long-range dependencies and context understanding
Control Impact: Self-attention is the core control mechanism that determines what information flows where in the sequence.
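The score, mask, softmax, and aggregation steps can be sketched for a single head in plain Python. This is an illustrative single-sequence version without batching or learned projections; `causal_attention` and the toy inputs are assumptions made for the example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    n, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if j <= i:   # allowed: current or earlier position
                scores.append(sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k))
            else:        # masked: future position
                scores.append(-math.inf)
        a = softmax(scores)
        weights.append(a)
        outputs.append([sum(a[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))])
    return outputs, weights


# Toy inputs: n = 3 positions, d_k = 2 (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
O, A = causal_attention(Q, K, V)
for row in A:
    print([round(w, 3) for w in row])
```

Each row of the weight matrix sums to 1 and is zero above the diagonal, so the first position can only attend to itself.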
8. Feed-Forward Control
8.1 Feed-Forward as Nonlinear Transformation
\text{FFN}(\mathbf{X}) = \text{GELU}(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2
8.2 Control System Model
Two-Stage Transformation:
\mathbf{H} = \mathbf{X} \mathbf{W}_1 + \mathbf{b}_1 \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}}
\mathbf{O} = \mathbf{H}' \mathbf{W}_2 + \mathbf{b}_2 \in \mathbb{R}^{B \times n \times d}
8.3 GELU Activation Control
\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
Control Interpretation: GELU applies smooth gating: strongly negative inputs are suppressed toward zero, large positive inputs pass through almost unchanged, and inputs near zero are scaled by roughly one half.
8.4 Step-by-Step Explanation
Step 1: Expansion
- Process: $\mathbf{H} = \mathbf{X} \mathbf{W}_1$ expands to $d_{ff} > d$
- Example: $d = 512 \rightarrow d_{ff} = 2048$
- Meaning: Increases capacity for complex transformations
Step 2: Nonlinear Activation
- Process: $\mathbf{H}' = \text{GELU}(\mathbf{H})$
- Meaning: Introduces nonlinearity, enabling complex function approximation
Step 3: Compression
- Process: $\mathbf{O} = \mathbf{H}' \mathbf{W}_2$ compresses back to $d$
- Meaning: Projects back to original dimension
Control Impact: FFN provides nonlinear processing power and feature transformation at each position.
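The expand, activate, compress sequence can be sketched for a single position as follows. The exact erf-based GELU is used, matching the formula in 8.3; the helper names and the tiny 2 → 4 → 2 weight matrices are made up for illustration and stand in for the real $d = 512$, $d_{ff} = 2048$ layers.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi expressed through erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Compute x @ W + b for a single vector x; W has shape len(x) x len(b)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    h = linear(x, W1, b1)                 # expansion to d_ff
    h_act = [gelu(v) for v in h]          # nonlinearity
    return linear(h_act, W2, b2)          # compression back to d


# Toy sizes d = 2, d_ff = 4 (made-up weights).
W1 = [[0.1, -0.2, 0.3, 0.0], [0.05, 0.1, -0.1, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.2, -0.1], [0.1, 0.1], [-0.3, 0.2], [0.0, 0.4]]
b2 = [0.0, 0.0]

y = ffn([1.0, -1.0], W1, b1, W2, b2)
print(gelu(-3.0), gelu(0.021), gelu(3.0))
```

Evaluating `gelu` at a strongly negative, near-zero, and large positive input shows the three gating regimes.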
9. Layer Normalization Feedback
9.1 Normalization as Feedback Control
\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
where:
- \mu = \frac{1}{d} \sum_{i=1}^{d} x_i (mean)
- \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 (variance)
- \gamma, \beta = learnable parameters (scale and shift)
9.2 Control System Interpretation
Normalization as State Regulation:
\mathbf{x}_{norm} = \gamma \odot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \beta
Meaning: Normalization regulates the distribution of activations, preventing saturation and improving gradient flow.
9.3 Pre-Norm Architecture
Transformer Block with Pre-Norm:
\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})
\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})
\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}
Control Impact: Pre-norm architecture provides stability and better gradient flow.
9.4 Step-by-Step Explanation
Step 1: Mean Computation
- Process: $\mu = \frac{1}{d} \sum x_i$
- Meaning: Find center of distribution
Step 2: Variance Computation
- Process: $\sigma^2 = \frac{1}{d} \sum (x_i - \mu)^2$
- Meaning: Measure spread of distribution
Step 3: Normalization
- Process: $\hat{x}_i = (x_i - \mu) / \sqrt{\sigma^2 + \epsilon}$
- Meaning: Standardize to zero mean, unit variance
Step 4: Scale and Shift
- Process: $x_{out} = \gamma \odot \hat{x} + \beta$
- Meaning: Allow model to learn optimal scale and shift
Control Impact: Layer normalization provides stability and faster convergence by maintaining consistent activation distributions.
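The four steps translate directly into code. The sketch below normalizes a single feature vector; `layer_norm` is an illustrative helper, with $\gamma = 1$ and $\beta = 0$ as assumed defaults and $\epsilon = 10^{-5}$ taken from the worked example later in this document.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance, then scale and shift."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d                                    # Step 1: mean
    var = sum((xi - mu) ** 2 for xi in x) / d          # Step 2: (biased) variance
    x_hat = [(xi - mu) / math.sqrt(var + eps) for xi in x]         # Step 3: normalize
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]    # Step 4: scale and shift


x = [0.1, 0.8, 0.3, -0.4]
y = layer_norm(x)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(round(y_mean, 6), round(y_var, 4))
```

The output variance is marginally below 1 because of the $\epsilon$ term in the denominator.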
10. Complete System Dynamics
10.1 Complete Forward Pass
System State Evolution:
\mathbf{h}_0 = \mathcal{E}(\mathbf{T}) + \mathbf{PE} \quad \text{(embedding + positional)}
\mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., L
\mathbf{y} = \mathbf{h}_L \mathbf{W}_{out} \in \mathbb{R}^{B \times n \times V}
10.2 Recursive System Equation
\mathbf{h}_t^{(l)} = f_l(\mathbf{h}_t^{(l-1)}, \theta_l)
where:
f_l(\mathbf{x}, \theta_l) = \mathbf{x} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{x}))) + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{x} + \text{Attention}(\text{LayerNorm}(\mathbf{x})))))
10.3 System Transfer Function
The complete system can be viewed as:
\mathbf{Y} = \mathcal{F}(\mathbf{T}, \theta, \mathbf{s})
where:
- \mathbf{T} = input tokens
- \theta = all parameters
- \mathbf{s} = seed
Properties:
- Nonlinear: Due to softmax, GELU, normalization
- Differentiable: All operations have gradients
- Compositional: Built from simpler functions
10.4 Step-by-Step System Flow
Step 1: Input Encoding
- Input: Token sequence $\mathbf{T}$
- Process: Embedding + Positional Encoding
- Output: $\mathbf{h}_0 \in \mathbb{R}^{B \times n \times d}$
- Meaning: Convert discrete tokens to continuous vectors with position info
Step 2: Layer Processing
- For each layer $l = 1, ..., L$:
  - Process: Self-attention + FFN with residual connections
  - Output: $\mathbf{h}_l \in \mathbb{R}^{B \times n \times d}$
  - Meaning: Transform representations through attention and processing
Step 3: Output Generation
- Process: Final layer norm + output projection
- Output: $\mathbf{L} \in \mathbb{R}^{B \times n \times V}$ (logits)
- Meaning: Unnormalized scores over the vocabulary at every position
Step 4: Probability Computation
- Process: Softmax over logits
- Output: $\mathbf{p} \in \mathbb{R}^{B \times n \times V}$ (probabilities)
- Meaning: Normalized probability distribution for next token prediction
11. Training as Optimization Control
11.1 Training as Optimal Control Problem
Objective Function:
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\mathbf{y}_i, \hat{\mathbf{y}}_i(\theta))
where:
- \mathcal{L} = loss function (cross-entropy)
- \mathbf{y}_i = true labels
- \hat{\mathbf{y}}_i(\theta) = model predictions
Optimization Problem:
\theta^* = \arg\min_{\theta} J(\theta)
11.2 Gradient-Based Control
Gradient Computation:
\mathbf{g}_t = \nabla_\theta J(\theta_t) = \frac{\partial J}{\partial \theta_t}
Parameter Update (AdamW):
\theta_{t+1} = \theta_t - \eta_t \left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \theta_t\right)
where:
- \hat{\mathbf{m}}_t = bias-corrected first-moment estimate (momentum)
- \hat{\mathbf{v}}_t = bias-corrected second-moment estimate (uncentered variance)
- \eta_t = learning rate (controlled by scheduler)
- \lambda = weight decay coefficient
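One AdamW update following the equation above can be sketched on a flat list of parameters. The function name `adamw_step` is hypothetical, and the default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\lambda = 0.01$) are common choices assumed here, not values confirmed by this document.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update with decoupled weight decay; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


theta, m, v = [1.0], [0.0], [0.0]
theta, m, v = adamw_step(theta, grad=[0.5], m=m, v=v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate plus the weight-decay term.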
11.3 Learning Rate Control
Cosine Annealing Schedule:
\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos(\pi \cdot \frac{t}{T_{max}})}{2}
Control Interpretation: Learning rate acts as gain scheduling - high gain initially for fast convergence, low gain later for fine-tuning.
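The schedule is a one-line function of the step index. In this sketch `cosine_lr` is an illustrative name and the values of $\eta_{max}$, $\eta_{min}$, and $T_{max}$ are made up.

```python
import math


def cosine_lr(t: int, t_max: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine annealing from eta_max at t = 0 down to eta_min at t = t_max."""
    return eta_min + (eta_max - eta_min) * 0.5 * (1.0 + math.cos(math.pi * t / t_max))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, t_max=100, eta_max=1e-3, eta_min=1e-5))
```

The rate starts at $\eta_{max}$, passes through the midpoint at $t = T_{max}/2$, and reaches $\eta_{min}$ at $t = T_{max}$.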
11.4 Gradient Clipping Control
Clipping Function:
\mathbf{g}_{clipped} = \begin{cases}
\mathbf{g} & \text{if } ||\mathbf{g}|| \leq \tau \\
\mathbf{g} \cdot \frac{\tau}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \tau
\end{cases}
where \tau is the clipping threshold (the maximum allowed gradient norm).
Purpose: Prevents exploding gradients that could destabilize training.
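Norm-based clipping rescales the whole gradient vector and leaves its direction unchanged. A minimal sketch on a flat list of gradient values, with the hypothetical helper name `clip_by_norm`:

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is scaled down to norm 1
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left untouched
print(large, small)
```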
11.5 Step-by-Step Training Control
Step 1: Forward Pass
- Process: $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}, \theta_t)$
- Meaning: Compute predictions with current parameters
Step 2: Loss Computation
- Process: $\mathcal{L} = \text{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})$
- Meaning: Measure prediction error
Step 3: Backward Pass
- Process: $\mathbf{g} = \nabla_\theta \mathcal{L}$
- Meaning: Compute gradients for all parameters
Step 4: Gradient Clipping
- Process: $\mathbf{g}_{clipped} = \text{Clip}(\mathbf{g}, \tau)$ with clipping threshold $\tau$
- Meaning: Prevent gradient explosion
Step 5: Optimizer Update
- Process: $\theta_{t+1} = \text{AdamW}(\theta_t, \mathbf{g}_{clipped}, \eta_t)$
- Meaning: Update parameters using adaptive learning rate
Step 6: Learning Rate Update
- Process: $\eta_{t+1} = \text{Scheduler}(\eta_t, t)$
- Meaning: Adjust learning rate according to schedule
Control Impact: Training process is a closed-loop control system where:
- Error signal: Loss
- Controller: Optimizer (AdamW)
- Actuator: Parameter updates
- Plant: Model forward pass
12. Inference Control Loop
12.1 Autoregressive Generation as Control Loop
State-Space Model:
\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}, \theta)
\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})
\mathbf{x}_{t+1} \sim \text{Categorical}(\mathbf{p}_t)
12.2 Generation Control Function
Step-by-Step:
- Current State: $\mathbf{h}_t$
- Output Generation: $\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})$
- Sampling: $x_{t+1} \sim \mathbf{p}_t$ (with temperature, top-k, top-p)
- State Update: $\mathbf{h}_{t+1} = \mathcal{F}([\mathbf{h}_t, x_{t+1}], \theta)$
- Repeat: Until max length or stop token
12.3 Sampling Control Parameters
Temperature Control:
\mathbf{p}_t^{temp} = \text{softmax}\left(\frac{\mathbf{h}_t \mathbf{W}_{out}}{T}\right)
- T < 1 : More deterministic (sharp distribution)
- T > 1 : More random (flat distribution)
- T = 1 : Default
Top-k Filtering:
\mathbf{p}_t^{topk}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in \text{top-k}(\mathbf{p}_t) \\
0 & \text{otherwise}
\end{cases}
Top-p (Nucleus) Sampling:
\mathbf{p}_t^{topp}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in S_p \\
0 & \text{otherwise}
\end{cases}
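where S_p is the smallest set of highest-probability tokens such that \sum_{v \in S_p} \mathbf{p}_t[v] \geq p . After top-k or top-p filtering, the remaining probabilities are renormalized to sum to 1 before sampling.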
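The three sampling controls can be sketched as small functions over a list of logits. All function names are illustrative, and renormalizing after each filter is an assumption about how the filtered distribution is used.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    ranked = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    keep = set(ranked[:k])
    return renormalize([p if v in keep else 0.0 for v, p in enumerate(probs)])


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    keep, cumulative = set(), 0.0
    for v in ranked:
        keep.add(v)
        cumulative += probs[v]
        if cumulative >= p:
            break
    return renormalize([q if v in keep else 0.0 for v, q in enumerate(probs)])


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]            # made-up logits for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=1.0)
rng = random.Random(0)
next_token = sample(top_p_filter(top_k_filter(probs, k=3), p=0.9), rng)
print(probs, next_token)
```

Lowering the temperature concentrates mass on the top token, while top-k and top-p remove the low-probability tail before the draw.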
12.4 Step-by-Step Inference Control
Step 1: Initialization
- Input: Prompt tokens $\mathbf{P} = [p_1, ..., p_k]$
- Process: Initialize state $\mathbf{h}_0 = \mathcal{E}(\mathbf{P}) + \mathbf{PE}$
- Meaning: Set initial state from prompt
Step 2: Forward Pass
- Process: $\mathbf{h}_t = \text{Transformer}(\mathbf{h}_{t-1})$
- Output: Hidden state $\mathbf{h}_t$
- Meaning: Process current sequence
Step 3: Logit Generation
- Process: $\mathbf{l}_t = \mathbf{h}_t \mathbf{W}_{out}$
- Output: Logits $\mathbf{l}_t \in \mathbb{R}^V$
- Meaning: Unnormalized scores for each token
Step 4: Probability Computation
- Process: $\mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)$
- Output: Probability distribution $\mathbf{p}_t$
- Meaning: Normalized probabilities with temperature
Step 5: Sampling
- Process: $x_{t+1} \sim \mathbf{p}_t$ (with optional top-k/top-p)
- Output: Next token $x_{t+1}$
- Meaning: Stochastically select next token
Step 6: State Update
- Process: Append $x_{t+1}$ to sequence, update $\mathbf{h}_{t+1}$
- Meaning: Incorporate new token into state
Step 7: Termination Check
- Condition: $t < \text{max\_length}$ and $x_{t+1} \neq \text{<eos>}$
- If true: Go to Step 2
- If false: Return generated sequence
Control Impact: Inference is a recurrent control system where:
- State: Current hidden representation
- Control: Sampling strategy (temperature, top-k, top-p)
- Output: Generated token sequence
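The loop structure of Steps 1 to 7 can be sketched with a stand-in for the model. Here `toy_logits` is a deliberately trivial placeholder for the transformer forward pass (it simply favors the next token ID), and `EOS_ID = 3` follows the vocabulary mapping in 3.2; neither reflects the real model.

```python
import math
import random

EOS_ID = 3  # <eos> in the vocabulary mapping


def toy_logits(tokens, vocab_size=8):
    """Placeholder for the forward pass: favors (last token + 1) mod vocab_size."""
    last = tokens[-1]
    return [2.0 if v == (last + 1) % vocab_size else 0.0 for v in range(vocab_size)]


def generate(prompt, max_length, temperature=1.0, seed=0):
    rng = random.Random(seed)                 # seed controls the sampling randomness
    tokens = list(prompt)                     # Step 1: initialize from the prompt
    while len(tokens) < max_length:           # Step 7: length check
        logits = toy_logits(tokens)           # Steps 2-3: forward pass and logits
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]     # Step 4: softmax with temperature
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]   # Step 5: sample
        tokens.append(nxt)                    # Step 6: append to the sequence
        if nxt == EOS_ID:                     # Step 7: stop token check
            break
    return tokens


out = generate(prompt=[5], max_length=10, temperature=0.7, seed=0)
print(out)
```

With a fixed seed the generated sequence is reproducible, which ties the sampling control back to the seed control of Section 4.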
Summary: Unified Control System Model
Complete System Equation
\mathbf{Y} = \mathcal{G}(\mathbf{C}, \theta, \mathbf{s}, \mathbf{T}, \{k, p\})
where:
- \mathbf{C} = input characters
- \theta = model parameters
- \mathbf{s} = seed
- \mathbf{T} = temperature
- \{k, p\} = top-k and top-p parameters
System Components as Control Elements
- Tokenizer: Input encoder $\mathcal{T}$
- Seed: Initialization control $\mathbf{s}$
- Embeddings: State projection $\mathcal{E}$
- Positional Encoding: Temporal control $\mathbf{PE}$
- Attention: Information routing $\mathcal{A}$
- FFN: Nonlinear transformation $\mathcal{F}$
- Normalization: State regulation $\mathcal{N}$
- Optimizer: Parameter control $\mathcal{O}$
- Scheduler: Learning rate control $\mathcal{S}$
- Sampling: Output control $\mathcal{P}$
Control Flow Summary
Input Characters
↓ [Tokenizer Control]
Token IDs
↓ [Seed Control]
Initialized Parameters
↓ [Embedding Control]
Vector Representations
↓ [Positional Control]
Position-Aware Vectors
↓ [Attention Control]
Context-Aware Representations
↓ [FFN Control]
Transformed Features
↓ [Normalization Control]
Stabilized Activations
↓ [Output Control]
Probability Distributions
↓ [Sampling Control]
Generated Tokens
Each component acts as a control element in a unified dynamical system, working together to transform input text into meaningful language model outputs.
13. Block Diagram Analysis
13.1 Single Transformer Block Control System
Block Diagram (a): Detailed Single Transformer Block
Input X
↓
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
↓
┌─────────────┐
│ Multi-Head │
│ Attention │
└──────┬──────┘
↓
┌─────────────┐
│ Dropout │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── (Residual Connection from X)
└──────┬──────┘
↓
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
↓
┌─────────────┐
│ Feed-Forward│
│ Network │
└──────┬──────┘
↓
┌─────────────┐
│ Dropout │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── (Residual Connection)
└──────┬──────┘
↓
Output X'
Mathematical Transfer Function:
\mathbf{X}_{res} = \mathbf{X}_{in} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X}_{in})))
\mathbf{X}_{out} = \mathbf{X}_{res} + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_{res})))
13.2 Simplified Transformer Block
Block Diagram (b): Simplified Single Block
Input X
↓
┌─────────────────────────────────────┐
│ TransformerBlock │
│ G_block(X) = X + Attn(LN(X)) + │
│ FFN(LN(X + Attn(LN(X))))│
└──────────────┬──────────────────────┘
↓
Output X'
Transfer Function:
G_{block}(\mathbf{X}) = \mathbf{X} + G_{attn}(\text{LN}(\mathbf{X})) + G_{ffn}(\text{LN}(\mathbf{X} + G_{attn}(\text{LN}(\mathbf{X}))))
where:
- G_{attn} = Attention transfer function
- G_{ffn} = Feed-forward transfer function
- \text{LN} = Layer normalization
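The block transfer function is easiest to read as two residual updates applied in sequence. The sketch below uses toy stand-ins for $G_{attn}$ and $G_{ffn}$ (a scaling and an elementwise square, chosen only so the code runs) and omits dropout, which acts as the identity at inference time.

```python
import math


def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def toy_attn(x):
    """Placeholder for G_attn; NOT real attention."""
    return [0.5 * v for v in x]


def toy_ffn(x):
    """Placeholder for G_ffn; NOT a real feed-forward network."""
    return [v * v for v in x]


def g_block(x, attn=toy_attn, ffn=toy_ffn, ln=layer_norm):
    """Pre-norm block: two residual updates in sequence."""
    x_res = [a + b for a, b in zip(x, attn(ln(x)))]          # X + G_attn(LN(X))
    return [a + b for a, b in zip(x_res, ffn(ln(x_res)))]    # ... + G_ffn(LN(X_res))


x = [0.1, 0.8, 0.3, -0.4]
print(g_block(x))
```

Because each sublayer only adds to its input, the identity path from input to output is never interrupted, which is the source of the gradient-flow benefit noted in 9.3.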
13.3 Complete Model with Multiple Layers
Block Diagram (c): Cascaded Transformer Blocks
Input Tokens T
↓
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
↓
┌─────────────┐
│ Positional │
│ G_pos │
└──────┬──────┘
↓
┌─────────────┐
│ Block 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ Block 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ ... │
└──────┬──────┘
↓
┌─────────────┐
│ Block L │
│ G_block_L │
└──────┬──────┘
↓
┌─────────────┐
│ Final Norm │
│ G_norm │
└──────┬──────┘
↓
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
↓
Output Logits
Overall Transfer Function:
\mathbf{Y} = G_{out} \circ G_{norm} \circ G_{block_L} \circ ... \circ G_{block_2} \circ G_{block_1} \circ G_{pos} \circ G_{emb}(\mathbf{T})
13.4 Closed-Loop Training System
Block Diagram (d): Training Control Loop
Input Data X
↓
┌─────────────┐
│ Model │
│ Forward │
│ F │
└──────┬──────┘
↓
┌─────────────┐
│ Output │
│ ŷ │
└──────┬──────┘
↓
┌─────────────┐
│ Loss │
│ L(ŷ, y) │
└──────┬──────┘
↓
┌─────────────┐
│ Gradient │
│ ∇θ │
└──────┬──────┘
↓
┌─────────────┐
│ Clipping │
│ Clip │
└──────┬──────┘
↓
┌─────────────┐
│ Optimizer │
│ AdamW │
└──────┬──────┘
↓
┌─────────────┐
│ Parameter │
│ Update │
└──────┬──────┘
↓
┌─────────────┐
│ - │ ←─── (Feedback to Model)
└─────────────┘
Closed-Loop Transfer Function:
\theta_{t+1} = \theta_t - \eta_t \cdot \text{AdamW}(\text{Clip}(\nabla_\theta L(\mathcal{F}(\mathbf{X}, \theta_t), \mathbf{y})))
14. Vector Visualization and Examples
14.1 Example Phrase: "Hello World"
We'll trace through the complete system with the phrase "Hello World".
Step 1: Tokenization
Input: "Hello World"
Process:
Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
Token IDs: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
Mathematical:
\mathbf{c} = \text{"Hello World"}
\mathbf{t} = \mathcal{T}(\mathbf{c}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
Vector Representation:
- Dimension: $n = 11$ tokens
- Token IDs: $\mathbf{t} \in \mathbb{N}^{11}$
Step 2: Embedding
Embedding Matrix: \mathbf{E} \in \mathbb{R}^{128 \times 512}
Lookup Operation:
\mathbf{X} = \mathbf{E}[\mathbf{t}] = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111] \\
\mathbf{E}[32] \\
\mathbf{E}[87] \\
\mathbf{E}[111] \\
\mathbf{E}[114] \\
\mathbf{E}[108] \\
\mathbf{E}[100]
\end{bmatrix} \in \mathbb{R}^{11 \times 512}
Example Values (first 3 dimensions):
\mathbf{E}[72] = [0.1, -0.2, 0.3, ...]^T \\
\mathbf{E}[101] = [-0.1, 0.3, -0.1, ...]^T \\
\mathbf{E}[108] = [0.05, 0.15, -0.05, ...]^T
Vector Visualization:
Token 'H' (ID=72): [0.10, -0.20, 0.30, ..., 0.05] (512-dim vector)
Token 'e' (ID=101): [-0.10, 0.30, -0.10, ..., 0.02] (512-dim vector)
Token 'l' (ID=108): [0.05, 0.15, -0.05, ..., 0.01] (512-dim vector)
...
Step 3: Positional Encoding
Positional Encoding Matrix: \mathbf{PE} \in \mathbb{R}^{11 \times 512}
Computation:
PE_{(0, 0)} = \sin(0 / 10000^0) = 0 \\
PE_{(0, 1)} = \cos(0 / 10000^0) = 1 \\
PE_{(1, 0)} = \sin(1 / 10000^0) = \sin(1) \approx 0.8415 \\
PE_{(1, 1)} = \cos(1 / 10000^0) = \cos(1) \approx 0.5403
Addition:
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}
Example (first token, first 3 dimensions):
\mathbf{X}_{pos}[0, :3] = \begin{bmatrix}
0.1 \\ -0.2 \\ 0.3
\end{bmatrix} + \begin{bmatrix}
0 \\ 1 \\ 0
\end{bmatrix} = \begin{bmatrix}
0.1 \\ 0.8 \\ 0.3
\end{bmatrix}
Step 4: Multi-Head Attention
Query, Key, Value Projections:
Let \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{512 \times 512}
\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{11 \times 512}
Example Calculation (head 0, token 0):
For h = 0 , d_k = 512/8 = 64 :
\mathbf{Q}[0, :64] = \mathbf{X}_{pos}[0] \mathbf{W}_Q[:, :64]
Attention Score Computation:
S_{0,1} = \frac{\mathbf{Q}[0] \cdot \mathbf{K}[1]}{\sqrt{64}} = \frac{\sum_{i=0}^{63} Q_{0,i} \cdot K_{1,i}}{8}
Example Numerical Calculation:
Assume, keeping only the first three of the 64 components for illustration:
\mathbf{Q}[0, :3] = [0.2, -0.1, 0.3] \\
\mathbf{K}[1, :3] = [0.1, 0.2, -0.1]
S_{0,1} = \frac{0.2 \times 0.1 + (-0.1) \times 0.2 + 0.3 \times (-0.1)}{8} \\
= \frac{0.02 - 0.02 - 0.03}{8} = \frac{-0.03}{8} = -0.00375
Attention Weights:
A_{0,:} = \text{softmax}(S_{0,:}) = \frac{\exp(S_{0,:})}{\sum_{j=0}^{10} \exp(S_{0,j})}
Example:
If S_{0,:} = [-0.004, 0.05, 0.02, 0.02, 0.08, -0.01, 0.03, 0.08, 0.01, 0.02, 0.04]
\exp(S_{0,:}) = [0.996, 1.051, 1.020, 1.020, 1.083, 0.990, 1.030, 1.083, 1.010, 1.020, 1.041]
\sum \approx 11.346
A_{0,:} = [0.088, 0.093, 0.090, 0.090, 0.095, 0.087, 0.091, 0.095, 0.089, 0.090, 0.092]
Output Calculation:
\mathbf{O}[0] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}[j]
Example (first dimension):
O_{0,0} = A_{0,0} V_{0,0} + A_{0,1} V_{1,0} + ... + A_{0,10} V_{10,0} \\
= 0.088 \times 0.2 + 0.093 \times 0.1 + ... + 0.092 \times 0.15 \\
\approx 0.12
Step 5: Feed-Forward Network
Input: \mathbf{X}_{attn} \in \mathbb{R}^{11 \times 512}
First Linear Transformation:
\mathbf{H} = \mathbf{X}_{attn} \mathbf{W}_1 \in \mathbb{R}^{11 \times 2048}
Example (token 0, first dimension):
H_{0,0} = \sum_{i=0}^{511} X_{attn,0,i} \cdot W_{1,i,0}
Assuming X_{attn}[0, :3] = [0.12, -0.05, 0.08] and W_1[:3, :2] = \begin{bmatrix} 0.1 & 0.2 \\ -0.1 & 0.1 \\ 0.05 & -0.05 \end{bmatrix}, and keeping only the first three of the 512 terms for illustration:
H_{0,0} = 0.12 \times 0.1 + (-0.05) \times (-0.1) + 0.08 \times 0.05 \\
= 0.012 + 0.005 + 0.004 = 0.021
GELU Activation:
\text{GELU}(0.021) = 0.021 \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{0.021}{\sqrt{2}}\right)\right)
\text{erf}(0.021/\sqrt{2}) = \text{erf}(0.0148) \approx 0.0167
\text{GELU}(0.021) = 0.021 \times 0.5 \times (1 + 0.0167) = 0.021 \times 0.5084 \approx 0.0107
Second Linear Transformation:
\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{11 \times 512}
Step 6: Complete Forward Pass Through One Layer
Input: \mathbf{X}_{in} = \mathbf{X}_{pos} \in \mathbb{R}^{11 \times 512}
Step 6.1: Layer Normalization
\mu_0 = \frac{1}{512} \sum_{i=0}^{511} X_{in,0,i}
Example:
\mu_0 = \frac{0.1 + 0.8 + 0.3 + ...}{512} \approx 0.02
\sigma_0^2 = \frac{1}{512} \sum_{i=0}^{511} (X_{in,0,i} - \mu_0)^2
\sigma_0^2 \approx \frac{(0.1-0.02)^2 + (0.8-0.02)^2 + ...}{512} \approx 0.15
\hat{X}_{0,0} = \frac{0.1 - 0.02}{\sqrt{0.15 + 1e-5}} = \frac{0.08}{0.387} \approx 0.207
Step 6.2: Attention Output
\mathbf{X}_{attn} = \text{Attention}(\hat{\mathbf{X}})
Step 6.3: Residual Connection
\mathbf{X}_{res1} = \mathbf{X}_{in} + \mathbf{X}_{attn}
Example:
X_{res1,0,0} = 0.1 + 0.12 = 0.22
Step 6.4: Second Layer Norm + FFN
\mathbf{X}_{ffn} = \text{FFN}(\text{LayerNorm}(\mathbf{X}_{res1}))
Step 6.5: Final Residual
\mathbf{X}_{out} = \mathbf{X}_{res1} + \mathbf{X}_{ffn}
Example:
X_{out,0,0} = 0.22 + 0.15 = 0.37
Step 7: Output Projection
After L layers:
\mathbf{H}_{final} = \text{LayerNorm}(\mathbf{X}_{out}^{(L)}) \in \mathbb{R}^{11 \times 512}
Output Projection:
\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}
Example (position 0):
L_{0,:} = \mathbf{H}_{final}[0] \mathbf{W}_{out} \in \mathbb{R}^{128}
Softmax:
p_{0,v} = \frac{\exp(L_{0,v})}{\sum_{w=0}^{127} \exp(L_{0,w})}
Example:
If L_{0,72} = 5.2 (logit for 'H'), L_{0,101} = 3.1 (logit for 'e'), etc.
\exp(5.2) = 181.27 \\
\exp(3.1) = 22.20 \\
\vdots
\sum_{w=0}^{127} \exp(L_{0,w}) \approx 250.0
p_{0,72} = \frac{181.27}{250.0} \approx 0.725 \quad \text{(about 72.5\% probability for H)}
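The softmax step can be reproduced with a short sketch. The first line recomputes the example probability from the stated normalizer of 250. The `softmax` helper subtracts the maximum logit before exponentiating, a common numerical-stability choice that is an assumption here and not something the text specifies. The three-logit vector is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example value: exp(5.2) divided by the stated normalizer of 250.
p_h = math.exp(5.2) / 250.0

# Hypothetical three-token logit vector, only to exercise the helper.
probs = softmax([5.2, 3.1, 0.0])
print(round(p_h, 3))
```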
15. Complete Numerical Example: "Hello"
Let's trace through the complete system with "Hello" step-by-step.
Input: "Hello"
Stage 1: Tokenization
\mathbf{c} = \text{"Hello"} = ['H', 'e', 'l', 'l', 'o']
\mathbf{t} = [72, 101, 108, 108, 111]
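The token IDs above coincide with ASCII code points, so a character-level tokenizer can be sketched with `ord` and `chr`. This assumes the tokenizer is a plain code-point lookup over a 128-entry vocabulary; the function names are illustrative.

```python
def tokenize(text: str) -> list[int]:
    """Map each character to its code point (assumed character-level vocabulary)."""
    return [ord(ch) for ch in text]

def detokenize(ids: list[int]) -> str:
    """Inverse mapping from token IDs back to text."""
    return "".join(chr(i) for i in ids)

token_ids = tokenize("Hello")
print(token_ids)
```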
Stage 2: Embedding (d=512)
\mathbf{E} \in \mathbb{R}^{128 \times 512}
\mathbf{X} = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111]
\end{bmatrix} = \begin{bmatrix}
0.10 & -0.20 & 0.30 & ... & 0.05 \\
-0.10 & 0.30 & -0.10 & ... & 0.02 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
-0.05 & 0.20 & 0.10 & ... & 0.03
\end{bmatrix} \in \mathbb{R}^{5 \times 512}
Stage 3: Positional Encoding
\mathbf{PE} = \begin{bmatrix}
0 & 1 & 0 & ... & 0 \\
0.84 & 0.54 & 0.01 & ... & 0.00 \\
0.91 & -0.42 & 0.02 & ... & 0.00 \\
0.14 & -0.99 & 0.03 & ... & 0.00 \\
-0.76 & -0.65 & 0.04 & ... & 0.00
\end{bmatrix} \in \mathbb{R}^{5 \times 512}
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} = \begin{bmatrix}
0.10 & 0.80 & 0.30 & ... & 0.05 \\
0.74 & 0.84 & -0.09 & ... & 0.02 \\
0.96 & -0.27 & -0.03 & ... & 0.01 \\
0.19 & -0.84 & -0.02 & ... & 0.01 \\
-0.81 & -0.45 & 0.14 & ... & 0.03
\end{bmatrix}
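The first two columns of the positional-encoding matrix match sin(pos) and cos(pos), which is what the standard sinusoidal scheme gives for its lowest dimension pair. The sketch below assumes that standard scheme (base 10000) and checks only those two columns of X_pos against the example; the remaining columns depend on the frequency schedule and are not compared here.

```python
import math

def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """Standard sinusoidal encoding: sin on even indices, cos on odd indices."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# First two embedding dimensions for 'H', 'e', 'l', 'l', 'o' from the example.
embed_first2 = [(0.10, -0.20), (-0.10, 0.30), (0.05, 0.15), (0.05, 0.15), (-0.05, 0.20)]

x_pos_first2 = []
for pos, (e0, e1) in enumerate(embed_first2):
    pe = sinusoidal_pe(pos, 512)
    x_pos_first2.append((e0 + pe[0], e1 + pe[1]))

for row in x_pos_first2:
    print(tuple(round(v, 2) for v in row))
```

Note that the two 'l' tokens share an embedding row but end up with different position-encoded vectors, which is the purpose of this stage.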
Stage 4: Attention (h=8 heads, d_k=64)
Query Generation:
\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{5 \times 512}
Score Matrix (head 0):
\mathbf{S}_0 = \frac{\mathbf{Q}_0 \mathbf{K}_0^T}{\sqrt{64}} \in \mathbb{R}^{5 \times 5}
Example Values:
\mathbf{S}_0 = \begin{bmatrix}
0.50 & -0.10 & 0.20 & 0.15 & 0.30 \\
-0.05 & 0.45 & 0.10 & 0.08 & 0.25 \\
0.15 & 0.05 & 0.40 & 0.30 & 0.20 \\
0.12 & 0.08 & 0.28 & 0.35 & 0.18 \\
0.25 & 0.15 & 0.22 & 0.20 & 0.42
\end{bmatrix}
Attention Weights:
\mathbf{A}_0 = \text{softmax}(\mathbf{S}_0) \approx \begin{bmatrix}
0.26 & 0.14 & 0.19 & 0.18 & 0.21 \\
0.16 & 0.26 & 0.18 & 0.18 & 0.21 \\
0.19 & 0.17 & 0.24 & 0.22 & 0.19 \\
0.18 & 0.18 & 0.22 & 0.23 & 0.19 \\
0.20 & 0.18 & 0.19 & 0.19 & 0.24
\end{bmatrix}
The softmax is applied row by row, so each row sums to 1 (up to rounding).
Output (head 0):
\mathbf{O}_0 = \mathbf{A}_0 \mathbf{V}_0 \in \mathbb{R}^{5 \times 64}
Concatenate All Heads:
\mathbf{O} = \text{Concat}[\mathbf{O}_0, ..., \mathbf{O}_7] \in \mathbb{R}^{5 \times 512}
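The score-to-weight-to-output path for one head can be sketched as follows. The score matrix is the S_0 given above. The value matrix `V0` is a hypothetical 5x2 stand-in (the real one is 5x64), and no causal mask is applied, matching the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Score matrix S_0 from the example (already scaled by sqrt(d_k)).
S0 = [
    [0.50, -0.10, 0.20, 0.15, 0.30],
    [-0.05, 0.45, 0.10, 0.08, 0.25],
    [0.15, 0.05, 0.40, 0.30, 0.20],
    [0.12, 0.08, 0.28, 0.35, 0.18],
    [0.25, 0.15, 0.22, 0.20, 0.42],
]

# Row-wise softmax gives the attention weights A_0.
A0 = [softmax(row) for row in S0]

# Hypothetical value matrix with 2 columns instead of 64.
V0 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.2, 0.8]]

# O_0 = A_0 V_0: each output row is a weighted average of the value rows.
O0 = [[sum(a * v[k] for a, v in zip(row, V0)) for k in range(len(V0[0]))] for row in A0]

for row in A0:
    print([round(a, 2) for a in row], "sum =", round(sum(row), 6))
```

Since every row of A_0 is a probability distribution, each output row is a convex combination of the value vectors.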
Stage 5: Feed-Forward
\mathbf{H} = \mathbf{O} \mathbf{W}_1 \in \mathbb{R}^{5 \times 2048}
\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{5 \times 2048}
\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{5 \times 512}
Stage 6: Output Logits
After processing through all L layers:
\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{5 \times 128}
Example (position 4, predicting next token):
L_{4,:} = [2.1, 1.5, ..., 5.2, ..., 3.1, ...]
Where:
- L_{4,111} = 5.2 (high score for 'o')
- L_{4,32} = 4.8 (high score for space)
- L_{4,87} = 4.5 (high score for 'W')
Probability Distribution:
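\mathbf{p}_4 = \text{softmax}(L_{4,:}) \approx [0.011, 0.006, ..., 0.25, ..., 0.031, ...]
p_{4,111} \approx 0.25 \quad \text{(25\% for o)} \\
p_{4,32} \approx 0.17 \quad \text{(17\% for space)} \\
p_{4,87} \approx 0.12 \quad \text{(12\% for W)}
Here the normalizer is Z = \sum_w \exp(L_{4,w}) \approx 725, so p_{4,111} = \exp(5.2)/Z \approx 0.25, p_{4,32} = \exp(4.8)/Z \approx 0.17, and p_{4,87} = \exp(4.5)/Z \approx 0.12.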
16. Vector Space Visualization
16.1 Embedding Space
2D Projection Example:
After embedding "Hello", tokens occupy positions in 512-dimensional space. Projected to 2D:
Token Positions (idealized 2D projection):
- 'l' (0.05, 0.15)
- 'e' (-0.10, 0.30)
- 'H' (0.10, -0.20)
- 'o' (-0.05, 0.20)
- Origin (0, 0)
Distance in Embedding Space:
d(\mathbf{E}[72], \mathbf{E}[101]) = ||\mathbf{E}[72] - \mathbf{E}[101]||_2
d = \sqrt{(0.1 - (-0.1))^2 + (-0.2 - 0.3)^2 + ...} \approx \sqrt{0.04 + 0.25 + ...} \approx 2.1
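The distance formula can be sketched as below. Only the first two embedding components are listed in the example, so the code computes the distance in that 2D slice. The full 512-dimensional distance needs all components, which are not given here.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# First two components of E[72] ('H') and E[101] ('e') from the example.
e_h = [0.10, -0.20]
e_e = [-0.10, 0.30]

d2 = l2_distance(e_h, e_e)  # sqrt(0.04 + 0.25)
print(round(d2, 4))
```

The 2D value is smaller than the full-space distance because every additional dimension can only add a non-negative term under the square root.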
16.2 Attention Weight Visualization
Attention Matrix Visualization:
Position 0 1 2 3 4
┌─────┴─────┴─────┴─────┴──┐
Token 0 │ 0.26 0.14 0.19 0.18 0.21 │ 'H'
│ │
Token 1 │ 0.16 0.26 0.18 0.18 0.21 │ 'e'
│ │
Token 2 │ 0.19 0.17 0.24 0.22 0.19 │ 'l'
│ │
Token 3 │ 0.18 0.18 0.22 0.23 0.19 │ 'l'
│ │
Token 4 │ 0.20 0.18 0.19 0.19 0.24 │ 'o'
└──────────────────────────┘
Interpretation:
- Token 0 ('H') attends most to itself (0.26) and token 4 (0.21)
- Token 4 ('o') attends moderately to all positions
- Higher values indicate stronger attention
- Each row sums to 1 (up to rounding), because the softmax normalizes over the positions each token attends to
16.3 Probability Distribution Visualization
Output distribution at position 4 (predicting the next token after "Hello"):
Probability Distribution p[4, :]
Probability
│
0.3 │ ●
│
0.2 │ ● ●
│
0.1 │ ● ● ● ●
│
0.0 ├─┴───┴───┴───┴───┴───┴───┴───┴─── Token IDs
32 72 87 101 108 111 ... 127
␣ H W e l o
Meaning:
- Highest probability for space (32) ≈ 0.28
- Next: 'o' (111) ≈ 0.23
- Then: 'W' (87) ≈ 0.18
- Model predicts space or continuation
17. Advanced Block Diagram Simplification
17.1 Complex Multi-Layer System Simplification
Following control system reduction techniques, we can simplify the transformer model step-by-step:
Diagram (a): Original Complex System
Input R (Tokens)
↓
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
↓
┌─────────────┐
│ Positional │
│ Encoding │
│ G_pos │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback from Layer 2
└──────┬──────┘
↓
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback from Output
└──────┬──────┘
↓
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback H₁
└──────┬──────┘
↓
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
↓
Output C (Logits)
Diagram (b): First Simplification (Combine Embedding and Positional)
Input R
↓
┌─────────────────────┐
│ G_emb_pos = │
│ G_pos ∘ G_emb │
└──────┬──────────────┘
↓
┌─────────────┐
│ + │
└──────┬──────┘
↓
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ + │
└──────┬──────┘
↓
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
↓
┌─────────────┐
│ G_out │
└──────┬──────┘
↓
Output C
Diagram (c): Second Simplification (Combine Layers)
Input R
↓
┌─────────────────────┐
│ G_emb_pos │
└──────┬──────────────┘
↓
┌──────────────────────────────────┐
│ G_layers = G_block₂ ∘ G_block₁ │
│ Equivalent to: │
│ X + Δ₁(X) + Δ₂(X + Δ₁(X)) │
└──────┬───────────────────────────┘
↓
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
↓
┌─────────────┐
│ G_out │
└──────┬──────┘
↓
Output C
Diagram (d): Third Simplification (Combine with Output)
Input R
↓
┌──────────────────────────────┐
│ G_forward = │
│ G_out ∘ G_layers ∘ G_emb_pos │
└──────┬───────────────────────┘
↓
┌─────────────┐
│ + │ ←─── H₁ (Feedback)
└──────┬──────┘
↓
Output C
Diagram (e): Final Simplified Transfer Function
Input R
↓
┌────────────────────────────────────────────┐
│ Overall Transfer Function: │
│ │
│ C/R = G_forward / (1 + G_forward × H₁) │
│ │
│ Where: │
│ G_forward = G_out ∘ G_layers ∘ G_emb_pos │
│ │
└──────┬─────────────────────────────────────┘
↓
Output C
Mathematical Derivation:
Step 1: Combine embedding and positional encoding:
G_{emb\_pos}(\mathbf{T}) = G_{pos}(G_{emb}(\mathbf{T})) = \mathbf{E}[\mathbf{T}] + \mathbf{PE}
Step 2: Combine transformer layers:
G_{layers}(\mathbf{X}) = G_{block_2}(G_{block_1}(\mathbf{X}))
G_{layers}(\mathbf{X}) = \mathbf{X} + \Delta_1(\mathbf{X}) + \Delta_2(\mathbf{X} + \Delta_1(\mathbf{X}))
where \Delta_l denotes the residual branch (attention plus feed-forward) of block l, so that G_{block_l}(\mathbf{X}) = \mathbf{X} + \Delta_l(\mathbf{X}).
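The expansion in Step 2 can be checked numerically with scalar stand-ins. The two residual branches below are hypothetical functions chosen only for illustration (one affine, one nonlinear). The identity holds for any choice because each block adds its branch to its own input.

```python
def make_block(delta):
    """Build a residual block G(x) = x + delta(x) from a branch function."""
    def block(x):
        return x + delta(x)
    return block

# Hypothetical residual branches (not the real attention/FFN).
def delta_1(x):
    return 0.5 * x + 0.1

def delta_2(x):
    return x * x

block_1 = make_block(delta_1)
block_2 = make_block(delta_2)

x = 0.3
composed = block_2(block_1(x))                       # G_block2(G_block1(x))
expanded = x + delta_1(x) + delta_2(x + delta_1(x))  # expanded form from Step 2
print(composed, expanded)
```

The second branch sees the already-updated state, so the layers cannot in general be collapsed into a sum of independent terms.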
Step 3: Combine with output projection:
G_{forward}(\mathbf{T}) = G_{out}(G_{layers}(G_{emb\_pos}(\mathbf{T})))
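Step 4: Apply feedback reduction. This is the standard closed-loop formula for linear blocks with a negative feedback path H_1, so for the nonlinear transformer it holds only as a linearized (small-signal) approximation: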
\frac{C}{R} = \frac{G_{forward}}{1 + G_{forward} \times H_1}
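The reduction formula can be illustrated on a scalar linear loop. This is a sketch under strong assumptions: the forward path and feedback path are constant gains `g` and `h` (hypothetical values), the feedback is negative, and the loop is iterated until it settles, which requires |g·h| < 1.

```python
def closed_loop_gain(g: float, h: float) -> float:
    """Closed-loop gain C/R = G / (1 + G*H) for a negative-feedback loop."""
    return g / (1.0 + g * h)

def simulate_loop(g: float, h: float, r: float, steps: int = 200) -> float:
    """Iterate c <- g * (r - h * c); converges when |g * h| < 1."""
    c = 0.0
    for _ in range(steps):
        c = g * (r - h * c)
    return c

g, h, r = 2.0, 0.25, 1.0   # hypothetical gains and unit input
predicted = closed_loop_gain(g, h) * r
simulated = simulate_loop(g, h, r)
print(predicted, simulated)
```

The simulated fixed point agrees with the formula, which is the behavior the block-diagram reduction summarizes.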
17.2 Attention Block Simplification
Diagram (a): Detailed Attention
Input X
↓
┌─────────────┐
│ Q │ ←─── W_Q
│ K │ ←─── W_K
│ V │ ←─── W_V
└──────┬──────┘
↓
┌─────────────┐
│ Scores │
│ S = QK^T/√d │
└──────┬──────┘
↓
┌─────────────┐
│ Softmax │
│ A = σ(S) │
└──────┬──────┘
↓
┌─────────────┐
│ Output │
│ O = AV │
└──────┬──────┘
↓
┌─────────────┐
│ Out Proj │
│ W_O │
└──────┬──────┘
↓
Output X'
Diagram (b): Simplified Attention Transfer Function
Input X
↓
┌──────────────────────────────┐
│ G_attn(X) = │
│ softmax(QK^T/√d) · V · W_O │
│ │
│ Where: │
│ Q = XW_Q, K = XW_K, V = XW_V │
└──────┬───────────────────────┘
↓
Output X'
Mathematical Transfer Function:
G_{attn}(\mathbf{X}) = \text{softmax}\left(\frac{(\mathbf{X} \mathbf{W}_Q)(\mathbf{X} \mathbf{W}_K)^T}{\sqrt{d_k}}\right) (\mathbf{X} \mathbf{W}_V) \mathbf{W}_O
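A single-head version of the attention map can be sketched in plain Python, with the output projection applied last. This is an illustration under simplifying assumptions: one head, no causal mask, no biases, and identity weight matrices chosen only so the result is easy to verify by hand.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v, w_o):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V W_O."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    return matmul(matmul(weights, v), w_o), weights

# Hypothetical 2x2 input and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(x, identity, identity, identity, identity)
print([round(v, 4) for v in out[0]])
```

With identity projections each token scores highest against itself, so the output row leans toward the token's own value vector while still mixing in the other one.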
18. Vector Trace: "Hello World" Complete Flow
18.1 Complete Vector Trace with Numerical Values
Input: "Hello World"
Stage 1: Tokenization
\mathbf{t} = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
Stage 2: Embedding (showing first 4 dimensions)
\mathbf{X} = \begin{bmatrix}
[H] & 0.10 & -0.20 & 0.30 & 0.15 & ... \\
[e] & -0.10 & 0.30 & -0.10 & 0.08 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[ ] & 0.02 & 0.05 & 0.02 & 0.01 & ... \\
[W] & 0.15 & -0.15 & 0.25 & 0.12 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[r] & 0.08 & 0.10 & -0.08 & 0.04 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[d] & 0.12 & -0.08 & 0.18 & 0.09 & ...
\end{bmatrix} \in \mathbb{R}^{11 \times 512}
Stage 3: Positional Encoding (first 4 dimensions)
\mathbf{PE} = \begin{bmatrix}
[0] & 0.00 & 1.00 & 0.00 & 0.00 & ... \\
[1] & 0.84 & 0.54 & 0.01 & 0.00 & ... \\
[2] & 0.91 & -0.42 & 0.02 & 0.00 & ... \\
[3] & 0.14 & -0.99 & 0.03 & 0.00 & ... \\
[4] & -0.76 & -0.65 & 0.04 & 0.00 & ... \\
[5] & -0.96 & 0.28 & 0.05 & 0.00 & ... \\
[6] & -0.28 & 0.96 & 0.06 & 0.00 & ... \\
[7] & 0.65 & 0.76 & 0.07 & 0.00 & ... \\
[8] & 0.99 & -0.14 & 0.08 & 0.00 & ... \\
[9] & 0.42 & -0.91 & 0.09 & 0.00 & ... \\
[10] & -0.54 & -0.84 & 0.10 & 0.00 & ...
\end{bmatrix}
Stage 4: Combined Input
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}
Example Row 0 (token 'H'):
\mathbf{X}_{pos}[0, :4] = [0.10, -0.20, 0.30, 0.15] + [0.00, 1.00, 0.00, 0.00] = [0.10, 0.80, 0.30, 0.15]
Stage 5: Attention (Head 0, showing attention from token 0 to all tokens)
\mathbf{S}_0[0, :] = [0.50, -0.10, 0.20, 0.15, 0.30, -0.05, 0.18, 0.28, 0.12, 0.20, 0.22]
\mathbf{A}_0[0, :] = \text{softmax}(\mathbf{S}_0[0, :]) \approx [0.123, 0.068, 0.091, 0.087, 0.101, 0.071, 0.090, 0.099, 0.084, 0.091, 0.093]
Meaning: Token 'H' (position 0) attends as follows (the weights sum to 1 across all 11 positions):
- 12.3% to itself
- 10.1% to token 'o' (position 4)
- 9.9% to token 'o' (position 7)
- 9.3% to token 'd' (position 10)
Stage 6: Attention Output
\mathbf{O}_0[0, :] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}_0[j, :]
Example (first dimension):
O_{0,0,0} = 0.123 \times 0.12 + 0.068 \times 0.08 + ... + 0.093 \times 0.15 \approx 0.115
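The weighted sum in Stage 6 can be sketched for one query token. The scores are the row S_0[0, :] given in Stage 5. The `values` list is a hypothetical stand-in for the first column of V_0, since only two of its entries appear in the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Scores from token 'H' (position 0) to all 11 positions.
scores = [0.50, -0.10, 0.20, 0.15, 0.30, -0.05, 0.18, 0.28, 0.12, 0.20, 0.22]
weights = softmax(scores)

# Positions receiving the most attention, highest first.
top4 = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:4]

# Hypothetical first-dimension values V_0[j, 0] for j = 0..10.
values = [0.1 * (j % 3) for j in range(11)]
output_0 = sum(w * v for w, v in zip(weights, values))

print(top4, round(sum(weights), 6), round(output_0, 4))
```

Because the weights are non-negative and sum to 1, the output always lies between the smallest and largest value being averaged.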
Stage 7: FFN Output
\mathbf{H}_{ffn}[0, :4] = [0.15, -0.08, 0.22, 0.18]
Stage 8: Final Output (after all layers)
\mathbf{H}_{final}[0, :4] = [0.42, 0.25, 0.58, 0.31]
Stage 9: Logits
\mathbf{L}[0, :] = [2.1, 1.8, ..., 5.2, ..., 3.4, ...]
Where L[0, 72] = 5.2 is the highest logit in this example (token 72 is 'H').
Stage 10: Probabilities
\mathbf{p}[0, :] = \text{softmax}(\mathbf{L}[0, :]) = [0.01, 0.008, ..., 0.28, ..., 0.15, ...]
p[0, 72] \approx 0.28 \quad \text{(28\% probability for H)}
19. Vector Plots and Visualizations
19.1 Embedding Vector Trajectory
Trajectory Plot:
512-Dimensional Embedding Space (2D Projection)
0.3 │ 'e' (pos 1)
│ ●
0.2 │ 'r' (pos 8)
│ ●
0.1 │ 'l' (pos 2,3,9) 'o' (pos 4,7)
│ ● ●
0.0 ├───────────────────────────────────────────
│ 'H' (pos 0)
-0.1 │ ●
│
-0.2 │
│
-0.3 │ 'W' (pos 6)
│ ●
└───────────────────────────────────────────
-0.3 -0.2 -0.1 0.0 0.1 0.2 0.3
19.2 Attention Heatmap
Attention Weight Matrix Visualization:
Illustrative attention strengths A[i,j] for "Hello World" (not normalized; true softmax weights in each row sum to 1)
j → 0 1 2 3 4 5 6 7 8 9 10
↓ ['H'] ['e'] ['l'] ['l'] ['o'] [' '] ['W'] ['o'] ['r'] ['l'] ['d']
i=0 ['H'] │ 0.35 0.15 0.22 0.20 0.28 0.14 0.19 0.26 0.17 0.21 0.23 │
i=1 ['e'] │ 0.15 0.38 0.20 0.18 0.27 0.16 0.18 0.25 0.19 0.22 0.20 │
i=2 ['l'] │ 0.23 0.18 0.32 0.30 0.26 0.17 0.21 0.24 0.25 0.31 0.23 │
i=3 ['l'] │ 0.21 0.19 0.28 0.33 0.25 0.18 0.20 0.23 0.24 0.30 0.22 │
i=4 ['o'] │ 0.27 0.22 0.26 0.25 0.36 0.19 0.23 0.29 0.24 0.27 0.25 │
i=5 [' '] │ 0.18 0.20 0.19 0.21 0.24 0.40 0.22 0.25 0.21 0.20 0.22 │
i=6 ['W'] │ 0.22 0.21 0.23 0.24 0.26 0.20 0.45 0.28 0.27 0.23 0.25 │
i=7 ['o'] │ 0.26 0.25 0.24 0.23 0.29 0.21 0.28 0.38 0.26 0.24 0.26 │
i=8 ['r'] │ 0.19 0.21 0.25 0.24 0.24 0.19 0.27 0.26 0.42 0.27 0.28 │
i=9 ['l'] │ 0.21 0.22 0.31 0.30 0.27 0.20 0.23 0.24 0.27 0.35 0.24 │
i=10['d'] │ 0.23 0.20 0.23 0.22 0.25 0.22 0.25 0.26 0.28 0.24 0.48 │
Color Coding:
█ = 0.48-0.50 (very high attention)
█ = 0.35-0.48 (high attention)
█ = 0.25-0.35 (medium attention)
█ = 0.15-0.25 (low attention)
█ = 0.00-0.15 (very low attention)
19.3 Probability Distribution Plot
Logits and Probabilities:
Logits L[5, :] (predicting token after "Hello ")
Logit
Value │
6.0 │ ● (token 87 'W')
│
5.0 │ ● (token 111 'o')
│
4.0 │ ● (token 32 ' ') ● (token 114 'r')
│
3.0 │ ● ● ●
│
2.0 │ ● ● ● ● ● ● ● ● ● ● ●
│
1.0 │ ● ● ● ● ● ● ● ● ● ● ●
│
0.0 ├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
␣ H W e l o r
Probabilities p[5, :]
Probability
│
0.3│ ● ('W')
│
0.2│ ● (' ') ● ('o')
│
0.1│ ● ● ● ● ● ● ●
│
0.0├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
19.4 Hidden State Evolution Through Layers
Layer-by-Layer Transformation:
Hidden State Evolution for Token 'H' (position 0)
Dimension 0:
Layer 0: 0.10 (embedding + positional)
Layer 1: 0.42 (after attention + FFN)
Layer 2: 0.58 (after second layer)
Layer 3: 0.65 (after third layer)
... ...
Layer L: 0.72 (final hidden state)
Dimension 1:
Layer 0: 0.80 (embedding + positional)
Layer 1: 0.25 (after attention + FFN)
Layer 2: 0.18 (after second layer)
Layer 3: 0.22 (after third layer)
... ...
Layer L: 0.15 (final hidden state)
Visualization:
Hidden State Magnitude ||h[l]|| Over Layers
Magnitude
│
1.0│ ●
│ ●
0.8│ ●
│ ●
0.6│ ●
│ ●
0.4│ ●
│ ●
0.2│ ●
│ ●
0.0├───────────────────────── Layer
0 1 2 3 4 5 6
20. Summary: Complete Mathematical Trace
Complete System Equation with Numerical Example
Text: "Hello World"
Complete Mathematical Flow:
- Tokenization:
\mathbf{t} = \mathcal{T}(\text{"Hello World"}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
- Embedding:
\mathbf{X} = \mathbf{E}[\mathbf{t}] \in \mathbb{R}^{11 \times 512}
- Positional Encoding:
\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{11 \times 512}
- Transformer Layers (L=6):
\mathbf{h}_0 = \mathbf{X}_{pos}, \quad \mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., 6
- Output:
\mathbf{L} = \text{LayerNorm}(\mathbf{h}_6) \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}
- Probabilities:
\mathbf{p} = \text{softmax}(\mathbf{L}) \in \mathbb{R}^{11 \times 128}
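The tensor shapes along this flow can be traced with a small bookkeeping sketch. It performs no arithmetic and only records the dimensions implied by the stated sizes (d = 512, 8 heads, FFN width 2048, vocabulary 128). The function and key names are illustrative.

```python
def trace_shapes(text, d_model=512, d_ff=2048, n_heads=8, vocab=128):
    """Return the shape of each intermediate tensor for a given input text."""
    n = len(text)               # one token per character
    d_k = d_model // n_heads    # per-head dimension
    return {
        "tokens": (n,),
        "embedding": (n, d_model),
        "positional": (n, d_model),
        "q_per_head": (n, d_k),
        "scores_per_head": (n, n),
        "ffn_hidden": (n, d_ff),
        "block_output": (n, d_model),
        "logits": (n, vocab),
        "probabilities": (n, vocab),
    }

shapes = trace_shapes("Hello World")
for name, shape in shapes.items():
    print(f"{name:16s} {shape}")
```

Only the attention scores grow quadratically with sequence length; every other tensor is linear in the number of tokens.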
Final Prediction:
For position 5 (after "Hello "):
p[5, 87] = 0.28 \quad \text{(28\% for W)} \\
p[5, 32] = 0.22 \quad \text{(22\% for space)} \\
p[5, 111] = 0.18 \quad \text{(18\% for o)}
Most Likely: 'W' → Complete prediction: "Hello World"
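Picking the most likely token corresponds to greedy decoding, sketched below with the three probabilities listed above. It assumes token IDs are character code points. Whether SheepOp samples greedily or uses temperature or top-k sampling is not stated here, so this shows only the simplest selection rule.

```python
def greedy_next_token(probs: dict[int, float]) -> str:
    """Return the character whose token ID has the highest probability."""
    best_id = max(probs, key=probs.get)
    return chr(best_id)

# Probabilities at position 5 from the example (other tokens omitted).
p5 = {87: 0.28, 32: 0.22, 111: 0.18}

next_char = greedy_next_token(p5)
continued = "Hello " + next_char
print(repr(next_char), repr(continued))
```

Repeating this step, with each chosen token appended to the input, yields the autoregressive generation loop.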
This document provides a complete mathematical control system formulation with block diagrams, vector visualizations, numerical examples, and step-by-step calculations for every component of the SheepOp LLM.