Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00


SheepOp LLM - Mathematical Control System Model

A complete mathematical control-system formulation of the SheepOp Language Model: the full pipeline is treated as a unified dynamical system and described with state-space representations, transfer functions, and step-by-step explanations.

Table of Contents

  1. System Overview
  2. State-Space Representation
  3. Tokenizer as Input Encoder
  4. Seed Control System
  5. Embedding Layer Control
  6. Positional Encoding State
  7. Self-Attention Control System
  8. Feed-Forward Control
  9. Layer Normalization Feedback
  10. Complete System Dynamics
  11. Training as Optimization Control
  12. Inference Control Loop

1. System Overview

1.1 Control System Architecture

The SheepOp LLM can be modeled as a nonlinear dynamical control system with:

  • Input: Character sequence \mathbf{c} = [c_1, c_2, ..., c_n]
  • State: Hidden representations \mathbf{h}_t at each layer and time step
  • Control: Model parameters \theta = \{W_Q, W_K, W_V, W_1, W_2, ...\}
  • Output: Probability distribution over the vocabulary \mathbf{p}_t \in \mathbb{R}^V

System Block Diagram:

Input Sequence → Tokenizer → Embeddings → Positional Encoding →
    ↓
    [Transformer Layer 1] → [Transformer Layer 2] → ... → [Transformer Layer L]
    ↓
    Output Projection → Logits → Softmax → Output Probabilities

1.2 Mathematical System Formulation

The complete system can be expressed as:


\mathbf{y}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_t, \theta, \mathbf{s})

where:

  •  \mathbf{x}_t  = input at time  t
  •  \mathbf{h}_t  = hidden state at time  t
  •  \theta  = system parameters (weights)
  •  \mathbf{s}  = seed for randomness
  •  \mathcal{F}  = complete forward function

2. State-Space Representation

2.1 Discrete-Time State-Space Model

For a transformer with L layers and sequence length n :

State Vector:

\mathbf{H}_t = \begin{bmatrix}
\mathbf{h}_t^{(1)} \\
\mathbf{h}_t^{(2)} \\
\vdots \\
\mathbf{h}_t^{(L)}
\end{bmatrix} \in \mathbb{R}^{L \times n \times d}

where

\mathbf{h}_t^{(l)} \in \mathbb{R}^{n \times d} is the hidden state at layer l .

State Update Equation:


\mathbf{h}_t^{(l+1)} = f_l(\mathbf{h}_t^{(l)}, \theta_l), \quad l = 0, 1, ..., L-1


where  f_l  is the transformation at layer  l .

Output Equation:


\mathbf{y}_t = g(\mathbf{h}_t^{(L)}, \theta_{out})

2.2 System Linearity Analysis

The system is nonlinear due to:

  • Attention mechanism (softmax)
  • Activation functions (GELU)
  • Layer normalization

However, individual components can be analyzed as piecewise linear systems.


3. Tokenizer as Input Encoder

3.1 Tokenizer Control Function

The tokenizer maps a character sequence to a discrete token sequence:


\mathcal{T}: \mathcal{C}^* \rightarrow \mathbb{N}^*

Mathematical Formulation:

For input sequence \mathbf{c} = [c_1, c_2, ..., c_n] :


\mathbf{t} = \mathcal{T}(\mathbf{c}) = [V(c_1), V(c_2), ..., V(c_n)]


where  V: \mathcal{C} \rightarrow \mathbb{N}  is the vocabulary mapping function.

3.2 Vocabulary Mapping Function


V(c) = \begin{cases}
0 & \text{if } c = \text{<pad>} \\
1 & \text{if } c = \text{<unk>} \\
2 & \text{if } c = \text{<bos>} \\
3 & \text{if } c = \text{<eos>} \\
v & \text{if } c \in \mathcal{C}_{vocab}
\end{cases}

Control Properties:

  • Deterministic: The same input always produces the same output
  • Injective on the vocabulary: Distinct in-vocabulary characters map to distinct token IDs, so V^{-1} exists for them
  • Lossy for unknown characters: Every out-of-vocabulary character collapses to <unk>, so the mapping is not fully invertible

3.3 Tokenizer State Space

The tokenizer maintains internal state:


\Sigma_{\mathcal{T}} = \{V, V^{-1}, \text{padding\_strategy}, \text{max\_length}\}

State Transition:


\Sigma_{\mathcal{T}}' = \Sigma_{\mathcal{T}} \quad \text{(static during operation)}

3.4 Step-by-Step Explanation

Step 1: Character Extraction

  • Input: Raw text string "Hello"
  • Process: Extract each character c \in \{'H', 'e', 'l', 'l', 'o'\}
  • Meaning: Break down text into atomic units

Step 2: Vocabulary Lookup

  • Process: Apply V(c) to each character
  • Example: V('H') = 72, V('e') = 101, V('l') = 108, V('o') = 111
  • Meaning: Convert characters to numerical indices

Step 3: Sequence Formation

  • Output: \mathbf{t} = [72, 101, 108, 108, 111]
  • Meaning: Numerical representation ready for embedding

Control Impact: Tokenizer creates the foundation for all subsequent processing. Any error here propagates through the entire system.


4. Seed Control System

4.1 Seed as System Initialization

The seed s \in \mathbb{N} controls randomness throughout the system:


\mathcal{R}(\mathbf{x}, s) = \text{deterministic\_random}(\mathbf{x}, s)

4.2 Seed Propagation Function

Initialization:


\text{seed\_torch}(s): \text{torch.manual\_seed}(s)


\text{seed\_cuda}(s): \text{torch.cuda.manual\_seed\_all}(s)


\text{seed\_cudnn}(s): \text{torch.backends.cudnn.deterministic} = \text{True}

Mathematical Model:


\mathbb{P}(\mathbf{W} | s) = \begin{cases}
\delta(\mathbf{W} - \mathbf{W}_s) & \text{if deterministic} \\
p_{\text{init}}(\mathbf{W}) & \text{if stochastic}
\end{cases}


where  \delta  is the Dirac delta and  \mathbf{W}_s  is the weight initialization given seed  s .

4.3 Seed Control Equation

For weight initialization:


\mathbf{W}_0 = \mathcal{I}(\mathbf{s}, \text{init\_method})


where  \mathcal{I}  is the initialization function.

Example - Normal Initialization:


\mathbf{W}_0 \sim \mathcal{N}(0, \sigma^2) \quad \text{with random state } r(s)



W_{ij} = \sigma \cdot \Phi^{-1}(U_{ij}(s))


where:
-  \mathcal{N}(0, \sigma^2)  = normal distribution
-  \Phi^{-1}  = inverse CDF
-  U_{ij}(s)  = uniform random number from seed  s
-  \sigma = 0.02  (typical value)

4.4 Step-by-Step Explanation

Step 1: Seed Input

  • Input: s = 42
  • Meaning: Provides reproducibility guarantee

Step 2: RNG State Initialization

  • Process: Set all random number generators to state based on s
  • Meaning: Ensures deterministic behavior

Step 3: Weight Initialization

  • Process: Generate all weights using RNG with seed s
  • Example: W_{ij} = \text{normal}(0, 0.02, \text{seed}=42)
  • Meaning: Starting point for optimization

Step 4: Training Determinism

  • Process: Same seed + same data → same gradients → same updates
  • Meaning: Complete reproducibility

Control Impact: Seed controls initial conditions and stochastic processes throughout training. It's the control parameter for reproducibility.
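The determinism argument can be demonstrated with a minimal sketch. It uses Python's stdlib RNG in place of the torch generators the pipeline actually seeds (Section 4.2); `init_weights` and its sizes are illustrative, not SheepOp's initializer.

```python
import random

def init_weights(seed: int, n: int, sigma: float = 0.02) -> list[float]:
    # Same seed -> same RNG state r(s) -> identical weight draw W_s.
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

w_a = init_weights(42, 4)
w_b = init_weights(42, 4)
assert w_a == w_b                 # reproducibility: delta(W - W_s)
assert w_a != init_weights(7, 4)  # a different seed gives different weights
```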


5. Embedding Layer Control

5.1 Embedding as Linear Transformation

The embedding layer performs a lookup operation:


\mathcal{E}: \mathbb{N} \rightarrow \mathbb{R}^d

Mathematical Formulation:


\mathbf{E} \in \mathbb{R}^{V \times d} \quad \text{(embedding matrix)}



\mathbf{x}_t = \mathbf{E}[\mathbf{t}_t] = \mathbf{E}_t \in \mathbb{R}^d


where  \mathbf{t}_t \in \mathbb{N}  is the token ID at position  t .

5.2 Embedding Control System

Batch Processing:


\mathbf{X} = \mathbf{E}[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}


where  \mathbf{T} \in \mathbb{N}^{B \times n}  is the batch of token IDs.

Control Function:


\mathbf{X} = \mathcal{E}(\mathbf{T}, \mathbf{E})

Gradient Flow:


\frac{\partial \mathcal{L}}{\partial \mathbf{E}} = \sum_{b,t} \frac{\partial \mathcal{L}}{\partial \mathbf{X}_{b,t}} \cdot \mathbf{1}[\mathbf{T}_{b,t}]


where  \mathbf{1}[\mathbf{T}_{b,t}]  is a one-hot indicator.

5.3 Step-by-Step Explanation

Step 1: Token ID Input

  • Input: t = 72 (token ID for 'H')
  • Meaning: Discrete index into vocabulary

Step 2: Matrix Lookup

  • Process: \mathbf{x} = \mathbf{E}[72]
  • Example: \mathbf{x} = [0.1, -0.2, 0.3, ..., 0.05] \in \mathbb{R}^{512}
  • Meaning: Continuous vector representation

Step 3: Semantic Encoding

  • Property: Similar tokens have similar embeddings (after training)
  • Meaning: Embeddings capture semantic relationships

Control Impact: Embedding layer projects discrete tokens into continuous space, enabling gradient-based optimization.
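The lookup itself can be sketched in a few lines. Dimensions here are toy values for illustration, and the random \mathbf{E} stands in for the learned embedding matrix.

```python
import random

V_SIZE, d = 128, 8                      # toy vocabulary size and model dim
rng = random.Random(42)
# Embedding matrix E in R^{V x d}: one learned row per token ID.
E = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(V_SIZE)]

def embed(token_ids: list[int]) -> list[list[float]]:
    # Lookup X = E[t]: row selection, equivalent to one_hot(t) @ E.
    return [E[t] for t in token_ids]

X = embed([72, 101, 108, 108, 111])     # "Hello"
assert len(X) == 5 and len(X[0]) == d
assert X[2] == X[3]                     # both 'l' tokens share one embedding
```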


6. Positional Encoding State

6.1 Positional Encoding as Additive Control


\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{B \times n \times d}


where  \mathbf{PE} \in \mathbb{R}^{n \times d}  is the positional encoding matrix.

6.2 Positional Encoding Function


PE_{(pos, i)} = \begin{cases}
\sin\left(\frac{pos}{10000^{2i/d}}\right) & \text{if } i \text{ is even} \\
\cos\left(\frac{pos}{10000^{2(i-1)/d}}\right) & \text{if } i \text{ is odd}
\end{cases}

6.3 Control System Interpretation

Additive Control:


\mathbf{X}_{out} = \mathbf{X}_{in} + \mathbf{U}_{pos}


where  \mathbf{U}_{pos}  is the **control input** representing position information.

Meaning: Positional encoding injects positional information into the embeddings.

6.4 Step-by-Step Explanation

Step 1: Position Index

  • Input: Position pos = 0, 1, 2, ..., n-1
  • Meaning: Absolute position in sequence

Step 2: Encoding Generation

  • Process: Compute PE_{(pos, i)} for each dimension i
  • Example: PE_{(0, 0)} = 0, PE_{(0, 1)} = 1, PE_{(1, 0)} \approx 0.84
  • Meaning: Unique pattern for each position

Step 3: Addition Operation

  • Process: \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}
  • Meaning: Position information added to embeddings

Step 4: Multi-Scale Representation

  • Property: Different dimensions encode different frequency scales
  • Meaning: Model can learn both local and global positional patterns

Control Impact: Positional encoding provides temporal/spatial awareness to the model, enabling it to understand sequence order.
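The encoding function of 6.2 computed directly, as a sketch that follows the piecewise formula above:

```python
import math

def pos_encoding(n: int, d: int) -> list[list[float]]:
    # PE_(pos,i): sin at even dimensions, cos at odd ones; the frequency
    # exponent 2i/d (resp. 2(i-1)/d) comes from the even member of each pair.
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for i in range(d):
            k = i if i % 2 == 0 else i - 1
            angle = pos / (10000 ** (2 * k / d))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe

pe = pos_encoding(4, 8)
assert pe[0][0] == 0.0 and pe[0][1] == 1.0     # PE_(0,0) = 0, PE_(0,1) = 1
assert abs(pe[1][0] - math.sin(1.0)) < 1e-12   # PE_(1,0) ~ 0.84
```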


7. Self-Attention Control System

7.1 Attention as Information Routing

Self-attention can be modeled as a dynamical control system that routes information:


\mathbf{O} = \text{Attention}(\mathbf{X}, \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V)

7.2 State-Space Model for Attention

Query, Key, Value Generation:


\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{B \times n \times d}


\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{B \times n \times d}


\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{B \times n \times d}

Attention Scores (Transfer Function):


\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times H \times n \times n} \quad \text{(after reshaping into } H \text{ heads of dimension } d_k = d/H\text{)}

Attention Weights (Control Signal):


\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{B \times H \times n \times n}

Output (Controlled Response):


\mathbf{O} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{B \times H \times n \times d_k}

7.3 Control System Interpretation

Attention as Feedback Control:


\mathbf{O}_i = \sum_{j=1}^{n} A_{ij} \mathbf{V}_j


where  A_{ij}  is the **control gain** determining how much information flows from position  j  to position  i .

Meaning: Attention acts as a learnable routing mechanism controlled by similarities between queries and keys.

7.4 Multi-Head Attention Control

Head Splitting:


\mathbf{Q}_h = \mathbf{Q}[:, :, (h-1) \cdot d_k : h \cdot d_k] \in \mathbb{R}^{B \times n \times d_k}, \quad h = 1, ..., H

Parallel Processing:


\mathbf{O}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h), \quad h = 1, ..., H

Concatenation:


\mathbf{O} = \text{Concat}[\mathbf{O}_1, \mathbf{O}_2, ..., \mathbf{O}_H] \in \mathbb{R}^{B \times n \times d}

7.5 Causal Masking Control

Causal Mask:


M_{ij} = \begin{cases}
0 & \text{if } i \geq j \text{ (allowed)} \\
-\infty & \text{if } i < j \text{ (masked)}
\end{cases}

Masked Attention:


\mathbf{S}_{masked} = \mathbf{S} + M

Effect: Prevents information flow from future positions.

7.6 Step-by-Step Explanation

Step 1: Query, Key, Value Generation

  • Process: Linear transformations of input
  • Meaning: Create three representations: what to look for (Q), what to match (K), what to retrieve (V)

Step 2: Similarity Computation

  • Process: S_{ij} = Q_i \cdot K_j / \sqrt{d_k}
  • Meaning: Measure similarity/relevance between positions  i  and  j

Step 3: Softmax Normalization

  • Process: A_{ij} = \exp(S_{ij}) / \sum_k \exp(S_{ik})
  • Meaning: Convert similarities to probability distribution (attention weights)

Step 4: Weighted Aggregation

  • Process: O_i = \sum_j A_{ij} V_j
  • Meaning: Combine values weighted by attention probabilities

Step 5: Information Flow

  • Property: Each position receives information from all other positions (with causal masking)
  • Meaning: Enables long-range dependencies and context understanding

Control Impact: Self-attention is the core control mechanism that determines what information flows where in the sequence.
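Steps 1 through 5 condense into a single-head sketch in pure Python. This is illustrative only: no batching, no learned projections, and \mathbf{Q}, \mathbf{K}, \mathbf{V} are taken as given.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)                            # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_attention(Q, K, V):
    """O_i = sum_{j<=i} A_ij V_j with A = softmax(QK^T / sqrt(d_k))."""
    n, dk, dv = len(Q), len(Q[0]), len(V[0])
    out = []
    for i in range(n):
        # Causal mask: only j <= i contribute (S_ij = -inf for j > i).
        scores = [sum(Q[i][t] * K[j][t] for t in range(dk)) / math.sqrt(dk)
                  for j in range(i + 1)]
        A = softmax(scores)                # attention weights (control gains)
        out.append([sum(A[j] * V[j][t] for j in range(i + 1))
                    for t in range(dv)])
    return out

O = causal_attention([[1.0, 0.0]], [[1.0, 0.0]], [[0.5, 0.5]])
assert O == [[0.5, 0.5]]   # the first position can only attend to itself
```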


8. Feed-Forward Control

8.1 Feed-Forward as Nonlinear Transformation


\text{FFN}(\mathbf{X}) = \text{GELU}(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2

8.2 Control System Model

Two-Stage Transformation:


\mathbf{H} = \mathbf{X} \mathbf{W}_1 \in \mathbb{R}^{B \times n \times d_{ff}}



\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{B \times n \times d_{ff}}



\mathbf{O} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{B \times n \times d}

8.3 GELU Activation Control


\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)

Control Interpretation: GELU applies smooth gating - values near zero are suppressed, positive values pass through.

8.4 Step-by-Step Explanation

Step 1: Expansion

  • Process: \mathbf{H} = \mathbf{X} \mathbf{W}_1 expands to d_{ff} > d
  • Example: d = 512 \rightarrow d_{ff} = 2048
  • Meaning: Increases capacity for complex transformations

Step 2: Nonlinear Activation

  • Process: \mathbf{H}' = \text{GELU}(\mathbf{H})
  • Meaning: Introduces nonlinearity, enabling complex function approximation

Step 3: Compression

  • Process: \mathbf{O} = \mathbf{H}' \mathbf{W}_2 compresses back to d
  • Meaning: Projects back to original dimension

Control Impact: FFN provides nonlinear processing power and feature transformation at each position.
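A per-position sketch of the expand-activate-compress pipeline. Dimensions and weights are toy values (the real W_1, W_2 are learned, and biases are omitted for brevity):

```python
import math

def gelu(x: float) -> float:
    # GELU(x) = x * Phi(x), exact erf form from Section 8.3.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(x, W1, W2):
    # Step 1: expansion d -> d_ff; Step 2: smooth gating; Step 3: compression.
    d_ff = len(W1[0])
    h = [gelu(sum(x[i] * W1[i][j] for i in range(len(x))))
         for j in range(d_ff)]
    return [sum(h[j] * W2[j][k] for j in range(d_ff))
            for k in range(len(W2[0]))]

assert gelu(0.0) == 0.0                     # values near zero are suppressed
y = ffn([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
assert abs(y[0] - gelu(1.0)) < 1e-12        # identity weights pass x through
```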


9. Layer Normalization Feedback

9.1 Normalization as Feedback Control


\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta


where:
-  \mu = \frac{1}{d} \sum_{i=1}^{d} x_i  (mean)
-  \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2  (variance)
-  \gamma, \beta  = learnable parameters (scale and shift)

9.2 Control System Interpretation

Normalization as State Regulation:


\mathbf{x}_{norm} = \gamma \odot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \beta

Meaning: Normalization regulates the distribution of activations, preventing saturation and improving gradient flow.

9.3 Pre-Norm Architecture

Transformer Block with Pre-Norm:


\mathbf{x}_{norm} = \text{LayerNorm}(\mathbf{x}_{in})


\mathbf{x}_{attn} = \text{Attention}(\mathbf{x}_{norm})


\mathbf{x}_{out} = \mathbf{x}_{in} + \mathbf{x}_{attn} \quad \text{(residual connection)}

Control Impact: Pre-norm architecture provides stability and better gradient flow.

9.4 Step-by-Step Explanation

Step 1: Mean Computation

  • Process: \mu = \frac{1}{d} \sum x_i
  • Meaning: Find center of distribution

Step 2: Variance Computation

  • Process: \sigma^2 = \frac{1}{d} \sum (x_i - \mu)^2
  • Meaning: Measure spread of distribution

Step 3: Normalization

  • Process: \hat{x}_i = (x_i - \mu) / \sqrt{\sigma^2 + \epsilon}
  • Meaning: Standardize to zero mean, unit variance

Step 4: Scale and Shift

  • Process: x_{out} = \gamma \odot \hat{x} + \beta
  • Meaning: Allow model to learn optimal scale and shift

Control Impact: Layer normalization provides stability and faster convergence by maintaining consistent activation distributions.
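The four steps as one function. A sketch: \gamma and \beta default to the identity transform, as they do at initialization.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    d = len(x)
    mu = sum(x) / d                                         # Step 1: mean
    var = sum((xi - mu) ** 2 for xi in x) / d               # Step 2: variance
    xhat = [(xi - mu) / math.sqrt(var + eps) for xi in x]   # Step 3: normalize
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * h + b for g, h, b in zip(gamma, xhat, beta)]  # Step 4: affine

y = layer_norm([0.1, 0.8, 0.3, -0.2])
assert abs(sum(y)) < 1e-9                                   # regulated mean
assert abs(sum(v * v for v in y) / len(y) - 1.0) < 1e-3     # ~unit variance
```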


10. Complete System Dynamics

10.1 Complete Forward Pass

System State Evolution:


\mathbf{h}_0 = \mathcal{E}(\mathbf{T}) + \mathbf{PE} \quad \text{(embedding + positional)}



\mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., L



\mathbf{y} = \mathbf{h}_L \mathbf{W}_{out} \in \mathbb{R}^{B \times n \times V}

10.2 Recursive System Equation


\mathbf{h}_t^{(l)} = f_l(\mathbf{h}_t^{(l-1)}, \theta_l)


where:


f_l(\mathbf{x}, \theta_l) = \mathbf{a} + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{a}))), \qquad \mathbf{a} = \mathbf{x} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{x})))

10.3 System Transfer Function

The complete system can be viewed as:


\mathbf{Y} = \mathcal{F}(\mathbf{T}, \theta, \mathbf{s})


where:
-  \mathbf{T}  = input tokens
-  \theta  = all parameters
-  \mathbf{s}  = seed

Properties:

  • Nonlinear: Due to softmax, GELU, normalization
  • Differentiable: All operations have gradients
  • Compositional: Built from simpler functions

10.4 Step-by-Step System Flow

Step 1: Input Encoding

  • Input: Token sequence \mathbf{T}
  • Process: Embedding + Positional Encoding
  • Output: \mathbf{h}_0 \in \mathbb{R}^{B \times n \times d}
  • Meaning: Convert discrete tokens to continuous vectors with position info

Step 2: Layer Processing

  • For each layer l = 1, ..., L :
    • Process: Self-attention + FFN with residual connections
    • Output: \mathbf{h}_l \in \mathbb{R}^{B \times n \times d}
    • Meaning: Transform representations through attention and processing

Step 3: Output Generation

  • Process: Final layer norm + output projection
  • Output: \mathbf{L} \in \mathbb{R}^{B \times n \times V} (logits)
  • Meaning: Predict probability distribution over vocabulary

Step 4: Probability Computation

  • Process: Softmax over logits
  • Output: \mathbf{p} \in \mathbb{R}^{B \times n \times V} (probabilities)
  • Meaning: Normalized probability distribution for next token prediction
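The four steps reduce to a composition of the layer maps f_l followed by the output projection. A structural sketch, with the blocks, final norm, and projection passed in as callables:

```python
def forward(h0, blocks, final_norm, W_out):
    """h_l = f_l(h_{l-1}) for l = 1..L, then logits = norm(h_L) @ W_out."""
    h = h0
    for f in blocks:                       # cascade of transformer blocks
        h = f(h)
    h = final_norm(h)
    # logits[t][v] = sum_i h[t][i] * W_out[i][v]
    return [[sum(row[i] * W_out[i][v] for i in range(len(row)))
             for v in range(len(W_out[0]))] for row in h]

# With identity blocks, identity norm, and an identity projection, the input
# passes through unchanged -- a quick structural sanity check.
I2 = [[1.0, 0.0], [0.0, 1.0]]
logits = forward([[1.0, 2.0]], [lambda h: h, lambda h: h], lambda h: h, I2)
assert logits == [[1.0, 2.0]]
```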

11. Training as Optimization Control

11.1 Training as Optimal Control Problem

Objective Function:


J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\mathbf{y}_i, \hat{\mathbf{y}}_i(\theta))


where:
-  \mathcal{L}  = loss function (cross-entropy)
-  \mathbf{y}_i  = true labels
-  \hat{\mathbf{y}}_i(\theta)  = model predictions

Optimization Problem:


\theta^* = \arg\min_{\theta} J(\theta)

11.2 Gradient-Based Control

Gradient Computation:


\mathbf{g}_t = \nabla_\theta J(\theta_t) = \frac{\partial J}{\partial \theta_t}

Parameter Update (AdamW):


\theta_{t+1} = \theta_t - \eta_t \left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda \theta_t\right)


where:
-  \hat{\mathbf{m}}_t  = biased-corrected momentum
-  \hat{\mathbf{v}}_t  = biased-corrected variance
-  \eta_t  = learning rate (controlled by scheduler)
-  \lambda  = weight decay coefficient

11.3 Learning Rate Control

Cosine Annealing Schedule:


\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{1 + \cos(\pi \cdot \frac{t}{T_{max}})}{2}

Control Interpretation: Learning rate acts as gain scheduling - high gain initially for fast convergence, low gain later for fine-tuning.
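The schedule as code, a direct transcription of the formula above:

```python
import math

def cosine_lr(t: int, T_max: int, eta_min: float, eta_max: float) -> float:
    # High gain at t = 0, decaying along a half cosine to eta_min at T_max.
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max)) / 2

assert abs(cosine_lr(0, 1000, 1e-5, 1e-3) - 1e-3) < 1e-15     # start: eta_max
assert abs(cosine_lr(1000, 1000, 1e-5, 1e-3) - 1e-5) < 1e-15  # end: eta_min
```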

11.4 Gradient Clipping Control

Clipping Function:


\mathbf{g}_{clipped} = \begin{cases}
\mathbf{g} & \text{if } ||\mathbf{g}|| \leq \tau \\
\mathbf{g} \cdot \frac{\tau}{||\mathbf{g}||} & \text{if } ||\mathbf{g}|| > \tau
\end{cases}


where  \tau  is the clipping threshold (written  \tau  to avoid confusion with the parameters  \theta ).

Purpose: Prevents explosive gradients that could destabilize training.
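The clipping function in code, with the threshold named tau to keep it distinct from the parameters θ:

```python
import math

def clip_gradient(g: list[float], tau: float) -> list[float]:
    # Rescale g onto the norm ball of radius tau when it exceeds it.
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= tau:
        return list(g)                    # small gradients pass unchanged
    return [gi * tau / norm for gi in g]

assert clip_gradient([0.3, 0.4], 1.0) == [0.3, 0.4]   # ||g|| = 0.5 <= 1
clipped = clip_gradient([3.0, 4.0], 1.0)              # ||g|| = 5.0 > 1
assert abs(math.sqrt(sum(c * c for c in clipped)) - 1.0) < 1e-12
```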

11.5 Step-by-Step Training Control

Step 1: Forward Pass

  • Process: \hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}, \theta_t)
  • Meaning: Compute predictions with current parameters

Step 2: Loss Computation

  • Process: \mathcal{L} = \text{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})
  • Meaning: Measure prediction error

Step 3: Backward Pass

  • Process: \mathbf{g} = \nabla_\theta \mathcal{L}
  • Meaning: Compute gradients for all parameters

Step 4: Gradient Clipping

  • Process: \mathbf{g}_{clipped} = \text{Clip}(\mathbf{g}, \tau), with clipping threshold \tau
  • Meaning: Prevent gradient explosion

Step 5: Optimizer Update

  • Process: \theta_{t+1} = \text{AdamW}(\theta_t, \mathbf{g}_{clipped}, \eta_t)
  • Meaning: Update parameters using adaptive learning rate

Step 6: Learning Rate Update

  • Process: \eta_{t+1} = \text{Scheduler}(\eta_t, t)
  • Meaning: Adjust learning rate according to schedule

Control Impact: Training process is a closed-loop control system where:

  • Error signal: Loss
  • Controller: Optimizer (AdamW)
  • Actuator: Parameter updates
  • Plant: Model forward pass

12. Inference Control Loop

12.1 Autoregressive Generation as Control Loop

State-Space Model:


\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}, \theta)



\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})



\mathbf{x}_{t+1} \sim \text{Categorical}(\mathbf{p}_t)

12.2 Generation Control Function

Step-by-Step:

  1. Current State: \mathbf{h}_t
  2. Output Generation: \mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})
  3. Sampling: x_{t+1} \sim \mathbf{p}_t (with temperature, top-k, top-p)
  4. State Update: \mathbf{h}_{t+1} = \mathcal{F}([x_1, ..., x_t, x_{t+1}], \theta) (the sampled token is appended to the input sequence)
  5. Repeat: Until max length or stop token

12.3 Sampling Control Parameters

Temperature Control:


\mathbf{p}_t^{temp} = \text{softmax}\left(\frac{\mathbf{h}_t \mathbf{W}_{out}}{T}\right)


-  T < 1 : More deterministic (sharp distribution)
-  T > 1 : More random (flat distribution)
-  T = 1 : Default

Top-k Filtering:


\mathbf{p}_t^{topk}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in \text{top-k}(\mathbf{p}_t) \\
0 & \text{otherwise}
\end{cases}

Top-p (Nucleus) Sampling:


\mathbf{p}_t^{topp}[v] = \begin{cases}
\mathbf{p}_t[v] & \text{if } v \in S_p \\
0 & \text{otherwise}
\end{cases}


where  S_p  is the smallest set such that  \sum_{v \in S_p} \mathbf{p}_t[v] \geq p .
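Temperature, top-k, and top-p combine into one filtering step; a sketch (tokens outside the kept set are zeroed and the distribution is renormalized):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def filter_dist(logits, T=1.0, top_k=None, top_p=None):
    # Temperature-scaled softmax, then zero out tokens outside top-k and
    # outside the nucleus S_p, then renormalize to a distribution.
    probs = softmax([l / T for l in logits])
    order = sorted(range(len(probs)), key=lambda v: -probs[v])
    keep = set(order)
    if top_k is not None:
        keep &= set(order[:top_k])
    if top_p is not None:
        nucleus, cum = set(), 0.0
        for v in order:                  # smallest set with mass >= p
            nucleus.add(v)
            cum += probs[v]
            if cum >= top_p:
                break
        keep &= nucleus
    kept = [p if v in keep else 0.0 for v, p in enumerate(probs)]
    z = sum(kept)
    return [p / z for p in kept]

d = filter_dist([2.0, 1.0, 0.1], top_k=2)
assert d[2] == 0.0 and abs(sum(d) - 1.0) < 1e-12
```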

12.4 Step-by-Step Inference Control

Step 1: Initialization

  • Input: Prompt tokens \mathbf{P} = [p_1, ..., p_k]
  • Process: Initialize state \mathbf{h}_0 = \mathcal{E}(\mathbf{P}) + \mathbf{PE}
  • Meaning: Set initial state from prompt

Step 2: Forward Pass

  • Process: \mathbf{h}_t = \text{Transformer}(\mathbf{h}_{t-1})
  • Output: Hidden state \mathbf{h}_t
  • Meaning: Process current sequence

Step 3: Logit Generation

  • Process: \mathbf{l}_t = \mathbf{h}_t \mathbf{W}_{out}
  • Output: Logits \mathbf{l}_t \in \mathbb{R}^V
  • Meaning: Unnormalized scores for each token

Step 4: Probability Computation

  • Process: \mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)
  • Output: Probability distribution \mathbf{p}_t
  • Meaning: Normalized probabilities with temperature

Step 5: Sampling

  • Process: x_{t+1} \sim \mathbf{p}_t (with optional top-k/top-p)
  • Output: Next token x_{t+1}
  • Meaning: Stochastically select next token

Step 6: State Update

  • Process: Append x_{t+1} to the sequence, update \mathbf{h}_{t+1}
  • Meaning: Incorporate new token into state

Step 7: Termination Check

  • Condition: t < \text{max\_length} and x_{t+1} \neq \text{<eos>}
  • If true: Go to Step 2
  • If false: Return generated sequence

Control Impact: Inference is a recurrent control system where:

  • State: Current hidden representation
  • Control: Sampling strategy (temperature, top-k, top-p)
  • Output: Generated token sequence
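Steps 1 through 7 as a closed loop. A sketch under stated assumptions: `forward` stands in for the full model and returns next-token logits for the current sequence; the token IDs and eos handling in the demo are illustrative.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def generate(prompt_ids, forward, max_length, eos_id, T=1.0, seed=42):
    rng = random.Random(seed)        # seed control: reproducible sampling
    seq = list(prompt_ids)           # Step 1: initialize state from prompt
    while len(seq) < max_length:     # Step 7: length check
        logits = forward(seq)        # Steps 2-3: forward pass -> logits
        probs = softmax([l / T for l in logits])   # Step 4: temperature
        nxt = rng.choices(range(len(probs)), weights=probs)[0]  # Step 5
        seq.append(nxt)              # Step 6: state update
        if nxt == eos_id:            # Step 7: stop-token check
            break
    return seq

# A toy "model" whose logits put almost all mass on token 3 (<eos> here).
assert generate([2, 72], lambda s: [0.0, 0.0, 0.0, 50.0], 10, 3) == [2, 72, 3]
```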

Summary: Unified Control System Model

Complete System Equation


\mathbf{Y} = \mathcal{G}(\mathbf{C}, \theta, \mathbf{s}, \mathbf{T}, \{k, p\})


where:
-  \mathbf{C}  = input characters
-  \theta  = model parameters
-  \mathbf{s}  = seed
-  \mathbf{T}  = temperature
-  \{k, p\}  = top-k and top-p parameters

System Components as Control Elements

  1. Tokenizer: Input encoder \mathcal{T}
  2. Seed: Initialization control \mathbf{s}
  3. Embeddings: State projection \mathcal{E}
  4. Positional Encoding: Temporal control \mathbf{PE}
  5. Attention: Information routing \mathcal{A}
  6. FFN: Nonlinear transformation \mathcal{F}
  7. Normalization: State regulation \mathcal{N}
  8. Optimizer: Parameter control \mathcal{O}
  9. Scheduler: Learning rate control \mathcal{S}
  10. Sampling: Output control \mathcal{P}

Control Flow Summary

Input Characters
    ↓ [Tokenizer Control]
Token IDs
    ↓ [Seed Control]
Initialized Parameters
    ↓ [Embedding Control]
Vector Representations
    ↓ [Positional Control]
Position-Aware Vectors
    ↓ [Attention Control]
Context-Aware Representations
    ↓ [FFN Control]
Transformed Features
    ↓ [Normalization Control]
Stabilized Activations
    ↓ [Output Control]
Probability Distributions
    ↓ [Sampling Control]
Generated Tokens

Each component acts as a control element in a unified dynamical system, working together to transform input text into meaningful language model outputs.


13. Block Diagram Analysis

13.1 Single Transformer Block Control System

Block Diagram (a): Detailed Single Transformer Block

Input X
    ↓
    ┌─────────────┐
    │ LayerNorm   │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Multi-Head  │
    │ Attention   │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Dropout    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── (Residual Connection from X)
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ LayerNorm   │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Feed-Forward│
    │  Network    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Dropout    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── (Residual Connection)
    └──────┬──────┘
           ↓
    Output X'

Mathematical Transfer Function:


\mathbf{X}_{mid} = \mathbf{X}_{in} + \text{Dropout}(\text{Attention}(\text{LayerNorm}(\mathbf{X}_{in}))), \qquad
\mathbf{X}_{out} = \mathbf{X}_{mid} + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{X}_{mid})))

13.2 Simplified Transformer Block

Block Diagram (b): Simplified Single Block

Input X
    ↓
    ┌─────────────────────────────────────┐
    │ TransformerBlock                    │
    │ G_block(X) = X + Attn(LN(X)) +      │
    │              FFN(LN(X + Attn(LN(X))))│
    └──────────────┬──────────────────────┘
                   ↓
              Output X'

Transfer Function:


G_{block}(\mathbf{X}) = \mathbf{X} + G_{attn}(\text{LN}(\mathbf{X})) + G_{ffn}(\text{LN}(\mathbf{X} + G_{attn}(\text{LN}(\mathbf{X}))))


where:
-  G_{attn}  = Attention transfer function
-  G_{ffn}  = Feed-forward transfer function
-  \text{LN}  = Layer normalization

13.3 Complete Model with Multiple Layers

Block Diagram (c): Cascaded Transformer Blocks

Input Tokens T
    ↓
    ┌─────────────┐
    │ Embedding   │
    │   G_emb     │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Positional  │
    │ G_pos       │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Block 1     │
    │ G_block₁    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Block 2     │
    │ G_block₂    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │    ...      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Block L     │
    │ G_block_L   │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Final Norm  │
    │ G_norm      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Output Proj │
    │ G_out       │
    └──────┬──────┘
           ↓
    Output Logits

Overall Transfer Function:


\mathbf{Y} = G_{out} \circ G_{norm} \circ G_{block_L} \circ ... \circ G_{block_2} \circ G_{block_1} \circ G_{pos} \circ G_{emb}(\mathbf{T})

13.4 Closed-Loop Training System

Block Diagram (d): Training Control Loop

Input Data X
    ↓
    ┌─────────────┐
    │   Model     │
    │  Forward    │
    │     F       │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │   Output    │
    │     ŷ       │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │    Loss     │
    │  L(ŷ, y)    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Gradient   │
    │    ∇θ       │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Clipping    │
    │   Clip      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Optimizer   │
    │  AdamW      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Parameter  │
    │   Update    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      -      │ ←─── (Feedback to Model)
    └─────────────┘

Closed-Loop Transfer Function:


\theta_{t+1} = \theta_t - \eta_t \cdot \text{AdamW}(\text{Clip}(\nabla_\theta L(\mathcal{F}(\mathbf{X}, \theta_t), \mathbf{y})))


14. Vector Visualization and Examples

14.1 Example Phrase: "Hello World"

We'll trace through the complete system with the phrase "Hello World".

Step 1: Tokenization

Input: "Hello World"

Process:

Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
Token IDs:   [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

Mathematical:


\mathbf{c} = \text{"Hello World"}


\mathbf{t} = \mathcal{T}(\mathbf{c}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

Vector Representation:

  • Dimension: n = 11 tokens
  • Token IDs: \mathbf{t} \in \mathbb{N}^{11}

Step 2: Embedding

Embedding Matrix: \mathbf{E} \in \mathbb{R}^{128 \times 512}

Lookup Operation:


\mathbf{X} = \mathbf{E}[\mathbf{t}] = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111] \\
\mathbf{E}[32] \\
\mathbf{E}[87] \\
\mathbf{E}[111] \\
\mathbf{E}[114] \\
\mathbf{E}[108] \\
\mathbf{E}[100]
\end{bmatrix} \in \mathbb{R}^{11 \times 512}

Example Values (first 3 dimensions):


\mathbf{E}[72] = [0.1, -0.2, 0.3, ...]^T \\
\mathbf{E}[101] = [-0.1, 0.3, -0.1, ...]^T \\
\mathbf{E}[108] = [0.05, 0.15, -0.05, ...]^T

Vector Visualization:

Token 'H' (ID=72):   [0.10, -0.20,  0.30, ..., 0.05]  (512-dim vector)
Token 'e' (ID=101):  [-0.10,  0.30, -0.10, ..., 0.02]  (512-dim vector)
Token 'l' (ID=108):  [0.05,  0.15, -0.05, ..., 0.01]  (512-dim vector)
...

Step 3: Positional Encoding

Positional Encoding Matrix: \mathbf{PE} \in \mathbb{R}^{11 \times 512}

Computation:


PE_{(0, 0)} = \sin(0 / 10000^0) = 0 \\
PE_{(0, 1)} = \cos(0 / 10000^0) = 1 \\
PE_{(1, 0)} = \sin(1 / 10000^0) = \sin(1) \approx 0.8415 \\
PE_{(1, 1)} = \cos(1 / 10000^0) = \cos(1) \approx 0.5403

Addition:


\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}

Example (first token, first 3 dimensions):


\mathbf{X}_{pos}[0, :3] = \begin{bmatrix}
0.1 \\ -0.2 \\ 0.3
\end{bmatrix} + \begin{bmatrix}
0 \\ 1 \\ 0
\end{bmatrix} = \begin{bmatrix}
0.1 \\ 0.8 \\ 0.3
\end{bmatrix}

Step 4: Multi-Head Attention

Query, Key, Value Projections:

Let \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{512 \times 512}


\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{11 \times 512}

Example Calculation (head 0, token 0):

For h = 0 , d_k = 512/8 = 64 :


\mathbf{Q}[0, :64] = \mathbf{X}_{pos}[0] \mathbf{W}_Q[:, :64]

Attention Score Computation:


S_{0,1} = \frac{\mathbf{Q}[0] \cdot \mathbf{K}[1]}{\sqrt{64}} = \frac{\sum_{i=0}^{63} Q_{0,i} \cdot K_{1,i}}{8}

Example Numerical Calculation:

Assume:


\mathbf{Q}[0, :3] = [0.2, -0.1, 0.3] \\
\mathbf{K}[1, :3] = [0.1, 0.2, -0.1]



S_{0,1} = \frac{0.2 \times 0.1 + (-0.1) \times 0.2 + 0.3 \times (-0.1)}{8} \\
= \frac{0.02 - 0.02 - 0.03}{8} = \frac{-0.03}{8} = -0.00375
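The arithmetic checks out in code, truncating the vectors to the three given dimensions:

```python
Q0 = [0.2, -0.1, 0.3]    # Q[0, :3] from above
K1 = [0.1, 0.2, -0.1]    # K[1, :3] from above
S01 = sum(q * k for q, k in zip(Q0, K1)) / 8   # sqrt(d_k) = sqrt(64) = 8
assert abs(S01 - (-0.00375)) < 1e-12
```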

Attention Weights:


A_{0,:} = \text{softmax}(S_{0,:}) = \frac{\exp(S_{0,:})}{\sum_{j=0}^{10} \exp(S_{0,j})}

Example:

If S\_{0,:} = [-0.004, 0.05, 0.02, 0.02, 0.08, -0.01, 0.03, 0.08, 0.01, 0.02, 0.04]


\exp(S_{0,:}) = [0.996, 1.051, 1.020, 1.020, 1.083, 0.990, 1.030, 1.083, 1.010, 1.020, 1.041]



\sum \approx 11.344



A_{0,:} = [0.088, 0.093, 0.090, 0.090, 0.095, 0.087, 0.091, 0.095, 0.089, 0.090, 0.092]

Output Calculation:


\mathbf{O}[0] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}[j]

Example (first dimension):


O_{0,0} = A_{0,0} V_{0,0} + A_{0,1} V_{1,0} + ... + A_{0,10} V_{10,0} \\
= 0.088 \times 0.2 + 0.093 \times 0.1 + ... + 0.092 \times 0.15 \\
\approx 0.12

Step 5: Feed-Forward Network

Input: \mathbf{X}_{attn} \in \mathbb{R}^{11 \times 512}

First Linear Transformation:


\mathbf{H} = \mathbf{X}_{attn} \mathbf{W}_1 \in \mathbb{R}^{11 \times 2048}

Example (token 0, first dimension):


H_{0,0} = \sum_{i=0}^{511} X_{attn,0,i} \cdot W_{1,i,0}


Assuming  X_{attn}[0, :3] = [0.12, -0.05, 0.08]  and the first two columns  W_1[:3, :2] = \begin{bmatrix} 0.1 & 0.2 \\ -0.1 & 0.1 \\ 0.05 & -0.05 \end{bmatrix}


H_{0,0} = 0.12 \times 0.1 + (-0.05) \times (-0.1) + 0.08 \times 0.05 \\
= 0.012 + 0.005 + 0.004 = 0.021

GELU Activation:


\text{GELU}(0.021) = 0.021 \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{0.021}{\sqrt{2}}\right)\right)



\text{erf}(0.021/\sqrt{2}) = \text{erf}(0.0148) \approx 0.0167



\text{GELU}(0.021) = 0.021 \times 0.5 \times (1 + 0.0167) = 0.021 \times 0.5084 \approx 0.0107
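The exact (erf-based) GELU used in this calculation, as a one-line sketch:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x) = x * 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# gelu(0.021) ≈ 0.0107, matching the hand calculation above
```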

Second Linear Transformation:


\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{11 \times 512}

Step 6: Complete Forward Pass Through One Layer

Input: \mathbf{X}_{in} = \mathbf{X}_{pos} \in \mathbb{R}^{11 \times 512}

Step 6.1: Layer Normalization


\mu_0 = \frac{1}{512} \sum_{i=0}^{511} X_{in,0,i}

Example:


\mu_0 = \frac{0.1 + 0.8 + 0.3 + ...}{512} \approx 0.02



\sigma_0^2 = \frac{1}{512} \sum_{i=0}^{511} (X_{in,0,i} - \mu_0)^2



\sigma_0^2 \approx \frac{(0.1-0.02)^2 + (0.8-0.02)^2 + ...}{512} \approx 0.15



\hat{X}_{0,0} = \frac{0.1 - 0.02}{\sqrt{0.15 + 10^{-5}}} = \frac{0.08}{0.387} \approx 0.207
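A NumPy layer-norm sketch reproducing this normalization (gamma=1, beta=0 assumed; the real layer adds a learnable scale and shift):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# The single-entry calculation above: (0.1 - 0.02) / sqrt(0.15 + 1e-5) ≈ 0.207
x_hat_00 = (0.1 - 0.02) / np.sqrt(0.15 + 1e-5)
```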

Step 6.2: Attention Output


\mathbf{X}_{attn} = \text{Attention}(\hat{\mathbf{X}})

Step 6.3: Residual Connection


\mathbf{X}_{res1} = \mathbf{X}_{in} + \mathbf{X}_{attn}

Example:


X_{res1,0,0} = 0.1 + 0.12 = 0.22

Step 6.4: Second Layer Norm + FFN


\mathbf{X}_{ffn} = \text{FFN}(\text{LayerNorm}(\mathbf{X}_{res1}))

Step 6.5: Final Residual


\mathbf{X}_{out} = \mathbf{X}_{res1} + \mathbf{X}_{ffn}

Example:


X_{out,0,0} = 0.22 + 0.15 = 0.37
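Steps 6.1 through 6.5 compose into a single pre-norm residual block. A minimal sketch with stand-in sub-layers (the constants 0.12 and 0.15 are the illustrative attention/FFN outputs from the example, not real computations):

```python
import numpy as np

def transformer_block(x, attn_fn, ffn_fn, ln1, ln2):
    """Pre-norm residual block: x + Attn(LN(x)), then + FFN(LN(.))."""
    x = x + attn_fn(ln1(x))   # Steps 6.1-6.3
    x = x + ffn_fn(ln2(x))    # Steps 6.4-6.5
    return x

out = transformer_block(
    np.array([[0.10]]),                          # X_in[0, 0]
    attn_fn=lambda h: np.full_like(h, 0.12),     # stand-in attention output
    ffn_fn=lambda h: np.full_like(h, 0.15),      # stand-in FFN output
    ln1=lambda h: h, ln2=lambda h: h,            # identity norms for the sketch
)
# 0.10 + 0.12 = 0.22, then 0.22 + 0.15 = 0.37, as computed above
```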

Step 7: Output Projection

After L layers:


\mathbf{H}_{final} = \text{LayerNorm}(\mathbf{X}_{out}^{(L)}) \in \mathbb{R}^{11 \times 512}

Output Projection:


\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}

Example (position 0):


L_{0,:} = \mathbf{H}_{final}[0] \mathbf{W}_{out} \in \mathbb{R}^{128}

Softmax:


p_{0,v} = \frac{\exp(L_{0,v})}{\sum_{w=0}^{127} \exp(L_{0,w})}

Example:

If  L_{0,72} = 5.2  (logit for 'H'),  L_{0,101} = 3.1  (logit for 'e'), etc.


\exp(5.2) = 181.27 \\
\exp(3.1) = 22.20 \\
\vdots



\sum_{w=0}^{127} \exp(L_{0,w}) \approx 250.0



p_{0,72} = \frac{181.27}{250.0} \approx 0.725 \quad \text{(72.5\% probability for H)}
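This probability follows directly from the softmax definition; a two-line check (the partition sum 250.0 is the assumed value from above):

```python
import math

p_H = math.exp(5.2) / 250.0   # exp(5.2) ≈ 181.27
# p_H ≈ 0.725
```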


15. Complete Numerical Example: "Hello"

Let's trace through the complete system with "Hello" step-by-step.

Input: "Hello"

Stage 1: Tokenization


\mathbf{c} = \text{"Hello"} = ['H', 'e', 'l', 'l', 'o']



\mathbf{t} = [72, 101, 108, 108, 111]
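For an ASCII-level vocabulary, tokenization is just code-point lookup (a sketch; the project's tokenizer may differ in detail):

```python
def tokenize(text: str) -> list[int]:
    """Map each character to its code point (vocab of 128 ASCII ids assumed)."""
    return [ord(c) for c in text]

def detokenize(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

tokens = tokenize("Hello")   # → [72, 101, 108, 108, 111]
```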

Stage 2: Embedding (d=512)


\mathbf{E} \in \mathbb{R}^{128 \times 512}



\mathbf{X} = \begin{bmatrix}
\mathbf{E}[72] \\
\mathbf{E}[101] \\
\mathbf{E}[108] \\
\mathbf{E}[108] \\
\mathbf{E}[111]
\end{bmatrix} = \begin{bmatrix}
0.10 & -0.20 & 0.30 & ... & 0.05 \\
-0.10 & 0.30 & -0.10 & ... & 0.02 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
0.05 & 0.15 & -0.05 & ... & 0.01 \\
-0.05 & 0.20 & 0.10 & ... & 0.03
\end{bmatrix} \in \mathbb{R}^{5 \times 512}

Stage 3: Positional Encoding


\mathbf{PE} = \begin{bmatrix}
0 & 1 & 0 & ... & 0 \\
0.84 & 0.54 & 0.01 & ... & 0.00 \\
0.91 & -0.42 & 0.02 & ... & 0.00 \\
0.14 & -0.99 & 0.03 & ... & 0.00 \\
-0.76 & -0.65 & 0.04 & ... & 0.00
\end{bmatrix} \in \mathbb{R}^{5 \times 512}



\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} = \begin{bmatrix}
0.10 & 0.80 & 0.30 & ... & 0.05 \\
0.74 & 0.84 & -0.09 & ... & 0.02 \\
0.96 & -0.27 & -0.03 & ... & 0.01 \\
0.19 & -0.84 & -0.02 & ... & 0.01 \\
-0.81 & -0.45 & 0.14 & ... & 0.03
\end{bmatrix}

Stage 4: Attention (h=8 heads, d_k=64)

Query Generation:


\mathbf{Q} = \mathbf{X}_{pos} \mathbf{W}_Q \in \mathbb{R}^{5 \times 512}

Score Matrix (head 0):


\mathbf{S}_0 = \frac{\mathbf{Q}_0 \mathbf{K}_0^T}{\sqrt{64}} \in \mathbb{R}^{5 \times 5}

Example Values:


\mathbf{S}_0 = \begin{bmatrix}
0.50 & -0.10 & 0.20 & 0.15 & 0.30 \\
-0.05 & 0.45 & 0.10 & 0.08 & 0.25 \\
0.15 & 0.05 & 0.40 & 0.30 & 0.20 \\
0.12 & 0.08 & 0.28 & 0.35 & 0.18 \\
0.25 & 0.15 & 0.22 & 0.20 & 0.42
\end{bmatrix}

Attention Weights:


\mathbf{A}_0 = \text{softmax}(\mathbf{S}_0) = \begin{bmatrix}
0.262 & 0.144 & 0.194 & 0.185 & 0.215 \\
0.159 & 0.262 & 0.184 & 0.181 & 0.214 \\
0.185 & 0.168 & 0.238 & 0.215 & 0.195 \\
0.183 & 0.176 & 0.215 & 0.231 & 0.195 \\
0.200 & 0.181 & 0.194 & 0.190 & 0.237
\end{bmatrix}

(Each row is a softmax distribution, so it sums to 1.)

Output (head 0):


\mathbf{O}_0 = \mathbf{A}_0 \mathbf{V}_0 \in \mathbb{R}^{5 \times 64}

Concatenate All Heads:


\mathbf{O} = \text{Concat}[\mathbf{O}_0, ..., \mathbf{O}_7] \in \mathbb{R}^{5 \times 512}
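The per-head split, attention, and concatenation can be sketched as follows (a hypothetical fused weight layout in which head h owns columns h·d_k through (h+1)·d_k):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, n_heads=8):
    """Project, attend per head on d_k-sized slices, then concatenate."""
    T, d = X.shape
    d_k = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_k, (h + 1) * d_k)
        S = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)      # (T, T) scores
        A = np.exp(S - S.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)          # softmax rows
        heads.append(A @ V[:, sl])                     # (T, d_k)
    return np.concatenate(heads, axis=-1)              # back to (T, d)

X = np.random.default_rng(0).standard_normal((5, 16))  # toy d_model = 16
O = multi_head_attention(X, np.eye(16), np.eye(16), np.eye(16), n_heads=8)
# O.shape == (5, 16)
```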

Stage 5: Feed-Forward


\mathbf{H} = \mathbf{O} \mathbf{W}_1 \in \mathbb{R}^{5 \times 2048}



\mathbf{H}' = \text{GELU}(\mathbf{H}) \in \mathbb{R}^{5 \times 2048}



\mathbf{O}_{ffn} = \mathbf{H}' \mathbf{W}_2 \in \mathbb{R}^{5 \times 512}
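Stage 5 in code, using the common tanh approximation of GELU (a sketch with toy shapes, not the trained weights):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """tanh approximation of GELU (close to the exact erf form)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, W2):
    """Position-wise FFN: GELU(X W1) W2, expanding d_model → d_ff → d_model."""
    return gelu(X @ W1) @ W2

rng = np.random.default_rng(0)
out = ffn(rng.standard_normal((5, 8)),     # toy d_model = 8, d_ff = 32
          rng.standard_normal((8, 32)),
          rng.standard_normal((32, 8)))
# out.shape == (5, 8)
```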

Stage 6: Output Logits

After processing through all L layers:


\mathbf{L} = \mathbf{H}_{final} \mathbf{W}_{out} \in \mathbb{R}^{5 \times 128}

Example (position 4, predicting next token):


L_{4,:} = [2.1, 1.5, ..., 5.2, ..., 3.1, ...]


Where:
-  L_{4,111} = 5.2  (high score for 'o')
-  L_{4,32} = 4.8  (high score for space)
-  L_{4,87} = 4.5  (high score for 'W')

Probability Distribution:


\mathbf{p}_4 = \text{softmax}(L_{4,:}) = [0.01, 0.008, ..., 0.25, ..., 0.18, ...]



p_{4,111} \approx 0.25 \quad \text{(25\% for o)} \\
p_{4,32} \approx 0.22 \quad \text{(22\% for space)} \\
p_{4,87} \approx 0.18 \quad \text{(18\% for W)}


16. Vector Space Visualization

16.1 Embedding Space

2D Projection Example:

After embedding "Hello", tokens occupy positions in 512-dimensional space. Projected to 2D:

Token Positions (idealized 2D projection):

        'l' (0.05, 0.15)
          ●

                    'e' (-0.10, 0.30)
                      ●

Origin (0, 0)
    ●

                      'H' (0.10, -0.20)
                        ●

                            'o' (-0.05, 0.20)
                              ●

Distance in Embedding Space:


d(\mathbf{E}[72], \mathbf{E}[101]) = ||\mathbf{E}[72] - \mathbf{E}[101]||_2



d = \sqrt{(0.1 - (-0.1))^2 + (-0.2 - 0.3)^2 + ...} \approx \sqrt{0.04 + 0.25 + ...} \approx 2.1
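Over just the three dimensions shown, the distance works out much smaller than the full 512-dimensional value (the ≈2.1 above needs all components); a partial check:

```python
import numpy as np

e_H = np.array([0.10, -0.20, 0.30])   # first dims of E[72]
e_e = np.array([-0.10, 0.30, -0.10])  # first dims of E[101]
d3 = np.linalg.norm(e_H - e_e)        # sqrt(0.04 + 0.25 + 0.16) ≈ 0.671 over 3 dims
```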

16.2 Attention Weight Visualization

Attention Matrix Visualization:

Position     0     1     2     3     4
        ┌───────────────────────────────┐
Token 0 │ 0.262 0.144 0.194 0.185 0.215 │  'H'
        │                               │
Token 1 │ 0.159 0.262 0.184 0.181 0.214 │  'e'
        │                               │
Token 2 │ 0.185 0.168 0.238 0.215 0.195 │  'l'
        │                               │
Token 3 │ 0.183 0.176 0.215 0.231 0.195 │  'l'
        │                               │
Token 4 │ 0.200 0.181 0.194 0.190 0.237 │  'o'
        └───────────────────────────────┘

Interpretation:

  • Each row is a softmax distribution, so its entries sum to 1
  • Token 0 ('H') attends most to itself (0.262), then to token 4 (0.215)
  • Token 4 ('o') attends most to itself (0.237), with the rest spread fairly evenly
  • Higher values indicate stronger attention

16.3 Probability Distribution Visualization

Output Distribution at Position 4 (predicting the token that follows "Hello"):

Probability Distribution p[4, :]

Probability
     │
 0.25│                         ●
     │   ●
 0.20│
     │           ●
 0.15│
     │
 0.10│
     │       ●       ●    ●        ●
 0.05│
     │
 0.00├──┴───┴───┴───┴────┴───┴────┴── Token IDs
       32  72  87  101  108 111  ... 127
       ␣   H   W   e    l   o

Meaning:

  • Highest probability for 'o' (111) ≈ 0.25
  • Next: space (32) ≈ 0.22
  • Then: 'W' (87) ≈ 0.18
  • The model predicts continuing the word or starting a new one, matching Stage 6 above

17. Advanced Block Diagram Simplification

17.1 Complex Multi-Layer System Simplification

Following control system reduction techniques, we can simplify the transformer model step-by-step:

Diagram (a): Original Complex System

Input R (Tokens)
    ↓
    ┌─────────────┐
    │   Embedding │
    │    G_emb    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Positional  │
    │   Encoding  │
    │    G_pos    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── Feedback from Layer 2
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Layer 1    │
    │ G_block₁   │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── Feedback from Output
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Layer 2    │
    │ G_block₂    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── Feedback H₁
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ Output Proj │
    │    G_out    │
    └──────┬──────┘
           ↓
    Output C (Logits)

Diagram (b): First Simplification (Combine Embedding and Positional)

Input R
    ↓
    ┌─────────────────────┐
    │ G_emb_pos =         │
    │ G_pos ∘ G_emb       │
    └──────┬──────────────┘
           ↓
    ┌─────────────┐
    │      +      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Layer 1    │
    │ G_block₁    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  Layer 2    │
    │ G_block₂    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── H₁
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │    G_out    │
    └──────┬──────┘
           ↓
    Output C

Diagram (c): Second Simplification (Combine Layers)

Input R
    ↓
    ┌─────────────────────┐
    │ G_emb_pos           │
    └──────┬──────────────┘
           ↓
    ┌──────────────────────────────────┐
    │ G_layers = G_block₂ ∘ G_block₁   │
    │ Equivalent to:                   │
    │ X + Δ₁(X) + Δ₂(X + Δ₁(X))        │
    └──────┬───────────────────────────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── H₁
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │    G_out    │
    └──────┬──────┘
           ↓
    Output C

Diagram (d): Third Simplification (Combine with Output)

Input R
    ↓
    ┌──────────────────────────────┐
    │ G_forward =                  │
    │ G_out ∘ G_layers ∘ G_emb_pos │
    └──────┬───────────────────────┘
           ↓
    ┌─────────────┐
    │      +      │ ←─── H₁ (Feedback)
    └──────┬──────┘
           ↓
    Output C

Diagram (e): Final Simplified Transfer Function

Input R
    ↓
    ┌────────────────────────────────────────────┐
    │ Overall Transfer Function:                 │
    │                                            │
    │ C/R = G_forward / (1 + G_forward × H₁)     │
    │                                            │
    │ Where:                                     │
    │ G_forward = G_out ∘ G_layers ∘ G_emb_pos   │
    │                                            │
    └──────┬─────────────────────────────────────┘
           ↓
    Output C

Mathematical Derivation:

Step 1: Combine embedding and positional encoding:


G_{emb\_pos}(\mathbf{T}) = G_{pos}(G_{emb}(\mathbf{T})) = \mathbf{E}[\mathbf{T}] + \mathbf{PE}

Step 2: Combine transformer layers:


G_{layers}(\mathbf{X}) = G_{block_2}(G_{block_1}(\mathbf{X}))



G_{layers}(\mathbf{X}) = \mathbf{X} + \Delta_1(\mathbf{X}) + \Delta_2(\mathbf{X} + \Delta_1(\mathbf{X}))


where  \Delta_l  represents the transformation inside block  l .

Step 3: Combine with output projection:


G_{forward}(\mathbf{T}) = G_{out}(G_{layers}(G_{emb\_pos}(\mathbf{T})))

Step 4: Apply feedback reduction:


\frac{C}{R} = \frac{G_{forward}}{1 + G_{forward} \times H_1}
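The forward-path reduction (Steps 1 through 3) is literal function composition; a toy sketch with hypothetical stand-ins for each block (the feedback division in Step 4 applies exactly only to linear systems, so it is omitted here):

```python
def compose(*fns):
    """compose(f, g, h)(x) == f(g(h(x))) — mirrors G_out ∘ G_layers ∘ G_emb_pos."""
    def composed(x):
        for f in reversed(fns):
            x = f(x)
        return x
    return composed

# Hypothetical stand-ins for the block transfer functions:
g_emb_pos = lambda t: t + 1
g_layers  = lambda x: 2 * x
g_out     = lambda h: h - 3
g_forward = compose(g_out, g_layers, g_emb_pos)
# g_forward(5) = (2 * (5 + 1)) - 3 = 9
```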

17.2 Attention Block Simplification

Diagram (a): Detailed Attention

Input X
    ↓
    ┌─────────────┐
    │      Q      │ ←─── W_Q
    │      K      │ ←─── W_K
    │      V      │ ←─── W_V
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │   Scores    │
    │ S = QK^T/√d │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │   Softmax   │
    │   A = σ(S)  │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │    Output   │
    │   O = AV    │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │   Out Proj  │
    │    W_O      │
    └──────┬──────┘
           ↓
    Output X'

Diagram (b): Simplified Attention Transfer Function

Input X
    ↓
    ┌──────────────────────────────┐
    │ G_attn(X) =                  │
    │ softmax(QK^T/√d) · V · W_O   │
    │                              │
    │ Where:                       │
    │ Q = XW_Q, K = XW_K, V = XW_V │
    └──────┬───────────────────────┘
           ↓
    Output X'

Mathematical Transfer Function:


G_{attn}(\mathbf{X}) = \text{softmax}\left(\frac{(\mathbf{X} \mathbf{W}_Q)(\mathbf{X} \mathbf{W}_K)^T}{\sqrt{d_k}}\right) (\mathbf{X} \mathbf{W}_V) \, \mathbf{W}_O


18. Vector Trace: "Hello World" Complete Flow

18.1 Complete Vector Trace with Numerical Values

Input: "Hello World"

Stage 1: Tokenization


\mathbf{t} = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

Stage 2: Embedding (showing first 4 dimensions)


\mathbf{X} = \begin{bmatrix}
[H] & 0.10 & -0.20 & 0.30 & 0.15 & ... \\
[e] & -0.10 & 0.30 & -0.10 & 0.08 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[ ] & 0.02 & 0.05 & 0.02 & 0.01 & ... \\
[W] & 0.15 & -0.15 & 0.25 & 0.12 & ... \\
[o] & -0.05 & 0.20 & 0.10 & 0.06 & ... \\
[r] & 0.08 & 0.10 & -0.08 & 0.04 & ... \\
[l] & 0.05 & 0.15 & -0.05 & 0.03 & ... \\
[d] & 0.12 & -0.08 & 0.18 & 0.09 & ...
\end{bmatrix} \in \mathbb{R}^{11 \times 512}

Stage 3: Positional Encoding (first 4 dimensions)


\mathbf{PE} = \begin{bmatrix}
[0] & 0.00 & 1.00 & 0.00 & 0.00 & ... \\
[1] & 0.84 & 0.54 & 0.01 & 0.00 & ... \\
[2] & 0.91 & -0.42 & 0.02 & 0.00 & ... \\
[3] & 0.14 & -0.99 & 0.03 & 0.00 & ... \\
[4] & -0.76 & -0.65 & 0.04 & 0.00 & ... \\
[5] & -0.96 & 0.28 & 0.05 & 0.00 & ... \\
[6] & -0.28 & 0.96 & 0.06 & 0.00 & ... \\
[7] & 0.65 & 0.76 & 0.07 & 0.00 & ... \\
[8] & 0.99 & -0.14 & 0.08 & 0.00 & ... \\
[9] & 0.42 & -0.91 & 0.09 & 0.00 & ... \\
[10] & -0.54 & -0.84 & 0.10 & 0.00 & ...
\end{bmatrix}

Stage 4: Combined Input


\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}

Example Row 0 (token 'H'):


\mathbf{X}_{pos}[0, :4] = [0.10, -0.20, 0.30, 0.15] + [0.00, 1.00, 0.00, 0.00] = [0.10, 0.80, 0.30, 0.15]

Stage 5: Attention (Head 0, showing attention from token 0 to all tokens)


\mathbf{S}_0[0, :] = [0.50, -0.10, 0.20, 0.15, 0.30, -0.05, 0.18, 0.28, 0.12, 0.20, 0.22]



\mathbf{A}_0[0, :] = \text{softmax}(\mathbf{S}_0[0, :]) = [0.123, 0.068, 0.091, 0.087, 0.101, 0.071, 0.090, 0.099, 0.084, 0.091, 0.093]


**Meaning:** Token 'H' (position 0) attends (the weights sum to 1 across all 11 positions):
- 12.3% to itself
- 10.1% to token 'o' (position 4)
- 9.9% to token 'o' (position 7)
- 9.3% to token 'd' (position 10)

Stage 6: Attention Output


\mathbf{O}_0[0, :] = \sum_{j=0}^{10} A_{0,j} \mathbf{V}_0[j, :]

Example (first dimension):


O_{0,0,0} = 0.123 \times 0.12 + 0.068 \times 0.08 + ... + 0.093 \times 0.15 \approx 0.11

Stage 7: FFN Output


\mathbf{H}_{ffn}[0, :4] = [0.15, -0.08, 0.22, 0.18]

Stage 8: Final Output (after all layers)


\mathbf{H}_{final}[0, :4] = [0.42, 0.25, 0.58, 0.31]

Stage 9: Logits


\mathbf{L}[0, :] = [2.1, 1.8, ..., 5.2, ..., 3.4, ...]


Where  L[0, 72] = 5.2  is highest (predicting 'H' at position 1).

Stage 10: Probabilities


\mathbf{p}[0, :] = \text{softmax}(\mathbf{L}[0, :]) = [0.01, 0.008, ..., 0.28, ..., 0.15, ...]



p[0, 72] \approx 0.28 \quad \text{(28\% probability for H)}
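Stage 10's probabilities are what drive autoregressive generation; a greedy decoding sketch (the `step_fn` argument stands in for the full forward pass):

```python
import numpy as np

def greedy_decode(step_fn, prompt_ids, n_new):
    """Append the argmax next-token id n_new times (greedy sampling)."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        p = step_fn(ids)              # probability vector over the 128-id vocab
        ids.append(int(np.argmax(p)))
    return ids

# Toy step_fn that always puts all probability mass on id 33 ('!'):
toy = lambda ids: np.eye(128)[33]
result = greedy_decode(toy, [72, 101], 2)
# result == [72, 101, 33, 33]
```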


19. Vector Plots and Visualizations

19.1 Embedding Vector Trajectory

Trajectory Plot:

512-Dimensional Embedding Space (2D Projection)

     0.3 │                          'e' (pos 1)
         │                            ●
     0.2 │                    'r' (pos 8)
         │                      ●
     0.1 │         'l' (pos 2,3,9)      'o' (pos 4,7)
         │            ●                 ●
     0.0 ├───────────────────────────────────────────
         │    'H' (pos 0)
    -0.1 │       ●
         │
    -0.2 │
         │
    -0.3 │                              'W' (pos 6)
         │                                  ●
         └───────────────────────────────────────────
           -0.3  -0.2  -0.1  0.0  0.1  0.2  0.3

19.2 Attention Heatmap

Attention Weight Matrix Visualization:

Attention Weights A[i,j] for "Hello World"

         j →   0    1    2    3    4    5    6    7    8    9   10
         ↓  ['H'] ['e'] ['l'] ['l'] ['o'] [' '] ['W'] ['o'] ['r'] ['l'] ['d']
i=0 ['H'] │ 0.12 0.07 0.09 0.09 0.10 0.07 0.09 0.10 0.08 0.09 0.09 │
i=1 ['e'] │ 0.06 0.16 0.08 0.08 0.11 0.07 0.08 0.11 0.08 0.09 0.08 │
i=2 ['l'] │ 0.09 0.07 0.12 0.11 0.10 0.06 0.08 0.09 0.09 0.11 0.09 │
i=3 ['l'] │ 0.08 0.07 0.11 0.13 0.10 0.07 0.08 0.09 0.09 0.11 0.08 │
i=4 ['o'] │ 0.10 0.08 0.09 0.09 0.13 0.07 0.08 0.10 0.08 0.10 0.09 │
i=5 [' '] │ 0.07 0.08 0.08 0.08 0.10 0.16 0.09 0.10 0.08 0.08 0.09 │
i=6 ['W'] │ 0.08 0.07 0.08 0.08 0.09 0.07 0.16 0.10 0.10 0.08 0.09 │
i=7 ['o'] │ 0.09 0.09 0.08 0.08 0.10 0.07 0.10 0.13 0.09 0.08 0.09 │
i=8 ['r'] │ 0.07 0.07 0.09 0.09 0.09 0.07 0.10 0.09 0.15 0.10 0.10 │
i=9 ['l'] │ 0.07 0.08 0.11 0.11 0.10 0.07 0.08 0.08 0.10 0.12 0.08 │
i=10['d'] │ 0.08 0.07 0.08 0.08 0.09 0.08 0.09 0.09 0.10 0.08 0.17 │

(Each row is a softmax distribution and sums to 1 up to rounding.)

Color Coding:
█ = 0.13-0.17 (very high attention, mostly self-attention)
█ = 0.11-0.13 (high attention)
█ = 0.09-0.11 (medium attention)
█ = 0.07-0.09 (low attention)
█ = 0.00-0.07 (very low attention)

19.3 Probability Distribution Plot

Logits and Probabilities:

Logits L[5, :] (predicting token after "Hello ")

Logit
Value │
  6.0 │                    ● (token 87 'W')
      │
  5.0 │           ● (token 111 'o')
      │
  4.0 │      ● (token 32 ' ')         ● (token 114 'r')
      │
  3.0 │  ●                         ●  ●
      │
  2.0 │  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●
      │
  1.0 │  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●
      │
  0.0 ├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
       32  72  87  101 108 111 114 ...
       ␣   H   W   e   l   o   r

Probabilities p[5, :]

Probability
    │
 0.3│                    ● ('W')
    │
 0.2│      ● (' ')              ● ('o')
    │
 0.1│  ●       ●  ●  ●  ●     ●  ●
    │
 0.0├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
     32  72  87  101 108 111 114 ...

19.4 Hidden State Evolution Through Layers

Layer-by-Layer Transformation:

Hidden State Evolution for Token 'H' (position 0)

Dimension 0:
Layer 0: 0.10  (embedding + positional)
Layer 1: 0.42  (after attention + FFN)
Layer 2: 0.58  (after second layer)
Layer 3: 0.65  (after third layer)
...      ...
Layer L: 0.72  (final hidden state)

Dimension 1:
Layer 0: 0.80  (embedding + positional)
Layer 1: 0.25  (after attention + FFN)
Layer 2: 0.18  (after second layer)
Layer 3: 0.22  (after third layer)
...      ...
Layer L: 0.15  (final hidden state)

Visualization:

Hidden State Magnitude ||h[l]|| Over Layers (illustrative)

Magnitude
    │
 1.0│                     ●
    │                  ●
 0.8│               ●
    │            ●
 0.6│         ●
    │      ●
 0.4│   ●
    │
 0.2│●
    │
 0.0├───────────────────────── Layer
    0  1  2  3  4  5  6

Because each residual connection adds the sub-layer output to the running state, the hidden-state norm typically grows with depth (consistent with the dimension-0 trace above).

20. Summary: Complete Mathematical Trace

Complete System Equation with Numerical Example

Text: "Hello World"

Complete Mathematical Flow:

  1. Tokenization:

   \mathbf{t} = \mathcal{T}(\text{"Hello World"}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

  2. Embedding:

   \mathbf{X} = \mathbf{E}[\mathbf{t}] \in \mathbb{R}^{11 \times 512}

  3. Positional Encoding:

   \mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE} \in \mathbb{R}^{11 \times 512}

  4. Transformer Layers (L=6):

   \mathbf{h}_l = \text{TransformerBlock}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., 6

  5. Output:

   \mathbf{L} = \mathbf{h}_6 \mathbf{W}_{out} \in \mathbb{R}^{11 \times 128}

  6. Probabilities:

   \mathbf{p} = \text{softmax}(\mathbf{L}) \in \mathbb{R}^{11 \times 128}

Final Prediction:

For position 5 (after "Hello "):


p[5, 87] = 0.28 \quad \text{(28\% for W)} \\
p[5, 32] = 0.22 \quad \text{(22\% for space)} \\
p[5, 111] = 0.18 \quad \text{(18\% for o)}

Most Likely: 'W' → Complete prediction: "Hello World"
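The six stages collapse into one short forward function; a sketch with random (untrained) weights and an identity stand-in layer, just to show the data flow and shapes:

```python
import numpy as np

def forward(text, E, PE, blocks, W_out):
    """Tokenize → embed → add PE → run L blocks → logits → softmax."""
    t = [ord(c) for c in text]                      # 1. tokenization
    x = E[t] + PE[: len(t)]                         # 2-3. embedding + positional
    for block in blocks:                            # 4. transformer layers
        x = block(x)
    logits = x @ W_out                              # 5. output projection
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)        # 6. probabilities

rng = np.random.default_rng(0)
p = forward("Hello World",
            E=rng.standard_normal((128, 16)),       # toy d_model = 16
            PE=np.zeros((32, 16)),
            blocks=[lambda x: x],                   # identity stand-in layer
            W_out=rng.standard_normal((16, 128)))
# p.shape == (11, 128), each row a distribution over the vocabulary
```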


This document provides a complete mathematical control system formulation with block diagrams, vector visualizations, numerical examples, and step-by-step calculations for every component of the SheepOp LLM.