# What is a Neural Network? Step-by-Step Explanation
A complete, step-by-step explanation of neural networks: what neurons are, what weights are, how the calculations work, and why they matter, with mathematical derivations and solved exercises.
## Table of Contents
1. [What is a Neural Network?](#61-what-is-a-neural-network)
2. [What is a Neuron?](#62-what-is-a-neuron)
3. [What are Weights?](#63-what-are-weights)
4. [How Neurons Calculate](#64-how-neurons-calculate)
5. [Why Weights are Important](#65-why-weights-are-important)
6. [Complete Mathematical Formulation](#66-complete-mathematical-formulation)
7. [Multi-Layer Neural Networks](#67-multi-layer-neural-networks)
8. [Exercise 1: Single Neuron Calculation](#68-exercise-1-single-neuron-calculation)
9. [Exercise 2: Multi-Layer Network](#69-exercise-2-multi-layer-network)
10. [Exercise 3: Learning Weights](#610-exercise-3-learning-weights)
11. [Key Takeaways](#611-key-takeaways)
---
## 6.1 What is a Neural Network?
### Simple Definition
A **neural network** is a computational model inspired by biological neurons that processes information through interconnected nodes (neurons) to make predictions or decisions.
### Visual Analogy
**Think of a neural network like a factory:**
```
Input → Worker 1 → Worker 2 → Worker 3 → Output
```
**Neural Network:**
```
Input → Neuron 1 → Neuron 2 → Neuron 3 → Output
```
**Each worker (neuron) does a specific job, and they work together to produce the final result.**
### Basic Structure
```
Input Layer    Hidden Layer    Output Layer

     ●              ●               ●
     ●              ●               ●
     ●              ●               ●
     ●              ●
```
**Key Components:**
- **Input Layer:** Receives data
- **Hidden Layers:** Process information
- **Output Layer:** Produces predictions
- **Connections:** Weights between neurons
---
## 6.2 What is a Neuron?
### Simple Definition
A **neuron** (also called a node or unit) is the basic processing unit of a neural network. It receives inputs, performs calculations, and produces an output.
### Biological Inspiration
**Biological Neuron:**
```
Dendrites  →  Cell Body  →  Axon  →  Synapses
(inputs)      (process)    (output)  (connections)
```
**Artificial Neuron:**
```
Inputs → Weighted Sum → Activation → Output
```
### Structure of a Neuron
```
Input 1 (x₁) ────┐
Input 2 (x₂) ────┼──→ [Σ] ─→ [f] ─→ Output (y)
Input 3 (x₃) ────┘
```
**Components:**
1. **Inputs:** Values fed into the neuron
2. **Weights:** Strength of connections
3. **Weighted Sum:** Sum of inputs × weights
4. **Bias:** Added constant
5. **Activation Function:** Applies nonlinearity
6. **Output:** Final result
### Visual Representation
```
Neuron:
┌───────────────────────┐
│ Inputs:  x₁, x₂, x₃   │
│ Weights: w₁, w₂, w₃   │
│                       │
│ z = Σ(xᵢ × wᵢ) + b    │
│ y = f(z)              │
│                       │
│ Output: y             │
└───────────────────────┘
```
**Where:**
- `z` = weighted sum (before activation)
- `f` = activation function
- `y` = output (after activation)
---
## 6.3 What are Weights?
### Simple Definition
**Weights** are numerical values that determine the strength of connections between neurons. They control how much each input contributes to the output.
### Visual Analogy
**Think of weights like volume controls:**
```
Music Source 1 ──[Volume: 0.8]──→ Speakers
Music Source 2 ──[Volume: 0.3]──→ Speakers
Music Source 3 ──[Volume: 0.5]──→ Speakers
```
**Higher weight = Louder contribution**
**Neural Network:**
```
Input 1 ──[Weight: 0.8]──→ Neuron
Input 2 ──[Weight: 0.3]──→ Neuron
Input 3 ──[Weight: 0.5]──→ Neuron
```
**Higher weight = Stronger influence**
### What Weights Do
**Weights determine:**
1. **How much each input matters**
2. **The relationship between inputs and outputs**
3. **What patterns the neuron learns**
**Example:**
**Weight = 0.1:**
- Input has small influence
- Weak connection
**Weight = 5.0:**
- Input has large influence
- Strong connection
**Weight = -2.0:**
- Input has negative influence
- Inverts the relationship
**Weight = 0.0:**
- Input has no influence
- Connection is cut
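As a tiny illustration (a Python sketch, not from the source code), here is how each of these weight values scales the same input's contribution:

```python
x = 1.0  # a fixed input value

# Each weight scales (or inverts, or cuts) the input's contribution x * w
for w in (0.1, 5.0, -2.0, 0.0):
    print(f"w = {w:>4}: contribution = {x * w}")

# w =  0.1: contribution = 0.1   (small influence)
# w =  5.0: contribution = 5.0   (large influence)
# w = -2.0: contribution = -2.0  (inverted influence)
# w =  0.0: contribution = 0.0   (connection cut)
```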
### Weight Matrix
**In a layer with multiple neurons:**
```
Input Layer       Weight Matrix       Output Layer

x₁ ───────────┐
              │    w₁₁  w₁₂               y₁
x₂ ───────────┼─   w₂₁  w₂₂    ────
              │    w₃₁  w₃₂               y₂
x₃ ───────────┘
```
**Weight Matrix:**
```
W = [w₁₁  w₁₂]
    [w₂₁  w₂₂]
    [w₃₁  w₃₂]
```
**Each row:** Connections from one input
**Each column:** Connections to one output
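A minimal NumPy sketch of this layout (the numbers are made up for illustration): each column of `W` holds one output neuron's weights, so a single matrix product computes every output at once.

```python
import numpy as np

# Rows = inputs (3 of them), columns = output neurons (2 of them)
W = np.array([[0.8, -2.0],   # w11, w12
              [0.3,  1.0],   # w21, w22
              [0.5,  0.0]])  # w31, w32

x = np.array([1.0, 2.0, 3.0])  # one input vector

y = x @ W  # y1 = 0.8*1 + 0.3*2 + 0.5*3 = 2.9;  y2 = -2*1 + 1*2 + 0*3 = 0.0
print(y)   # [2.9 0. ]
```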
---
## 6.4 How Neurons Calculate
### Step-by-Step Calculation
#### Step 1: Weighted Sum
**Multiply each input by its weight:**
```math
z = x_1 \times w_1 + x_2 \times w_2 + \dots + x_n \times w_n + b
```
**Or in vector form:**
```math
z = \mathbf{x} \cdot \mathbf{w} + b = \sum_{i=1}^{n} x_i w_i + b
```
**Where:**
- $x_i$ = input value
- $w_i$ = weight for input $i$
- $b$ = bias (constant)
- $n$ = number of inputs
#### Step 2: Add Bias
**Bias shifts the activation:**
```math
z = \sum_{i=1}^{n} x_i w_i + b
```
**Bias allows the neuron to:**
- Shift activation threshold
- Learn patterns independent of inputs
- Adjust baseline output
#### Step 3: Apply Activation Function
**Apply nonlinear function:**
```math
y = f(z)
```
**Common activation functions:**
**ReLU (Rectified Linear Unit):**
```math
f(z) = \max(0, z)
```
**Sigmoid:**
```math
f(z) = \frac{1}{1 + e^{-z}}
```
**Tanh:**
```math
f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
```
**GELU (used in transformers):**
```math
f(z) = z \cdot \Phi(z)
```
**Where $\Phi(z)$ is the CDF of the standard normal distribution.**
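For reference, here is a minimal NumPy sketch of these four activations. Note that the GELU below uses the common tanh approximation of $z \cdot \Phi(z)$ rather than the exact form:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in transformer code
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU:   ", relu(z))
print("Sigmoid:", np.round(sigmoid(z), 4))
print("Tanh:   ", np.round(np.tanh(z), 4))
print("GELU:   ", np.round(gelu(z), 4))
```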
### Complete Example
**Given:**
- Inputs: $x_1 = 0.5, x_2 = 0.3, x_3 = 0.8$
- Weights: $w_1 = 0.6, w_2 = 0.4, w_3 = 0.2$
- Bias: $b = 0.1$
- Activation: ReLU
**Step 1: Weighted Sum**
```
z = (0.5 × 0.6) + (0.3 × 0.4) + (0.8 × 0.2) + 0.1
= 0.3 + 0.12 + 0.16 + 0.1
= 0.68
```
**Step 2: Apply Activation**
```
y = ReLU(0.68)
= max(0, 0.68)
= 0.68
```
**Result:** Output = 0.68
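The same example as a few lines of Python (a sketch using NumPy, with the values from above):

```python
import numpy as np

x = np.array([0.5, 0.3, 0.8])  # inputs
w = np.array([0.6, 0.4, 0.2])  # weights
b = 0.1                        # bias

z = np.dot(x, w) + b           # weighted sum: 0.3 + 0.12 + 0.16 + 0.1
y = max(0.0, z)                # ReLU activation

print(round(z, 2), round(y, 2))  # 0.68 0.68
```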
---
## 6.5 Why Weights are Important
### Reason 1: They Determine What the Neuron Learns
**Different weights = Different patterns:**
**Pattern 1: Emphasis on Input 1**
```
w₁ = 5.0, w₂ = 0.1, w₃ = 0.1
→ Neuron cares mostly about input 1
```
**Pattern 2: Balanced Weights**
```
w₁ = 0.5, w₂ = 0.5, w₃ = 0.5
→ Neuron treats all inputs equally
```
**Pattern 3: Inverted Relationship**
```
w₁ = -2.0, w₂ = 1.0, w₃ = 1.0
→ Neuron inverts input 1's effect
```
### Reason 2: They Enable Learning
**Training adjusts weights:**
**Before Training:**
```
Weights: Random values
→ Random predictions
```
**After Training:**
```
Weights: Learned values
→ Accurate predictions
```
**Weights are what the model learns!**
### Reason 3: They Control Information Flow
**High weights:** Information flows easily
**Low weights:** Information flows weakly
**Zero weights:** Information blocked
**Negative weights:** Information inverted
### Reason 4: They Enable Complex Patterns
**Multiple neurons with different weights:**
```
Neuron 1: w₁ = 1.0, w₂ = 0.0 → Detects pattern A
Neuron 2: w₁ = 0.0, w₂ = 1.0 → Detects pattern B
Neuron 3: w₁ = 0.5, w₂ = 0.5 → Detects pattern C
```
**Together:** Model learns complex relationships!
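A toy NumPy sketch of these three detectors (illustrative values only): stacking their weight vectors as columns of one matrix lets all three fire in a single matrix product.

```python
import numpy as np

x = np.array([0.9, 0.2])  # a 2-feature input

# Columns: neuron 1 (passes x1), neuron 2 (passes x2), neuron 3 (averages both)
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])

h = x @ W
print(h)  # [0.9  0.2  0.55] -> three different "patterns" from the same input
```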
---
## 6.6 Complete Mathematical Formulation
### Single Neuron Formula
**Complete neuron calculation:**
```math
z = \sum_{i=1}^{n} x_i w_i + b
```
```math
y = f(z)
```
**Where:**
- $\mathbf{x} = [x_1, x_2, ..., x_n]$ = input vector
- $\mathbf{w} = [w_1, w_2, ..., w_n]$ = weight vector
- $b$ = bias (scalar)
- $f$ = activation function
- $z$ = weighted sum (before activation)
- $y$ = output (after activation)
### Matrix Formulation
**For multiple neurons:**
```math
\mathbf{z} = \mathbf{X} \mathbf{W} + \mathbf{b}
```
```math
\mathbf{Y} = f(\mathbf{z})
```
**Where:**
- $\mathbf{X} \in \mathbb{R}^{B \times n}$ = input matrix (B samples, n features)
- $\mathbf{W} \in \mathbb{R}^{n \times m}$ = weight matrix (n inputs, m neurons)
- $\mathbf{b} \in \mathbb{R}^{1 \times m}$ = bias vector
- $\mathbf{z} \in \mathbb{R}^{B \times m}$ = weighted sums
- $\mathbf{Y} \in \mathbb{R}^{B \times m}$ = outputs
**Example:**
**Input Matrix:**
```
X = [x₁₁  x₁₂]      (2 samples, 2 features)
    [x₂₁  x₂₂]
```
**Weight Matrix:**
```
W = [w₁₁  w₁₂]      (2 inputs, 2 neurons)
    [w₂₁  w₂₂]
```
**Bias Vector:**
```
b = [b₁ b₂] (2 neurons)
```
**Calculation:**
```
z = X × W + b
z₁₁ = x₁₁×w₁₁ + x₁₂×w₂₁ + b₁
z₁₂ = x₁₁×w₁₂ + x₁₂×w₂₂ + b₂
z₂₁ = x₂₁×w₁₁ + x₂₂×w₂₁ + b₁
z₂₂ = x₂₁×w₁₂ + x₂₂×w₂₂ + b₂
```
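The same batched calculation in NumPy, with made-up numbers for the 2×2 example above; note how the bias vector is broadcast across both samples:

```python
import numpy as np

X = np.array([[1.0, 2.0],    # 2 samples, 2 features
              [3.0, 4.0]])
W = np.array([[0.1, 0.2],    # 2 inputs -> 2 neurons
              [0.3, 0.4]])
b = np.array([0.5, 0.6])     # one bias per neuron, broadcast over samples

Z = X @ W + b
print(Z)
# [[1.2 1.6]    row 1: [1*0.1 + 2*0.3 + 0.5,  1*0.2 + 2*0.4 + 0.6]
#  [2.0 2.8]]   row 2: [3*0.1 + 4*0.3 + 0.5,  3*0.2 + 4*0.4 + 0.6]
```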
---
## 6.7 Multi-Layer Neural Networks
### Structure
```
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer

     x₁             h₁₁              h₂₁            y₁
     x₂             h₁₂              h₂₂            y₂
     x₃             h₁₃              h₂₃
```
### Forward Pass
**Layer 1:**
```math
\mathbf{h}_1 = f_1(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1)
```
**Layer 2:**
```math
\mathbf{h}_2 = f_2(\mathbf{h}_1 \mathbf{W}_2 + \mathbf{b}_2)
```
**Output Layer:**
```math
\mathbf{Y} = f_3(\mathbf{h}_2 \mathbf{W}_3 + \mathbf{b}_3)
```
**Chained together:**
```math
\mathbf{Y} = f_3(f_2(f_1(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2) \mathbf{W}_3 + \mathbf{b}_3)
```
**Each layer transforms the input!**
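A minimal sketch of the chained forward pass in NumPy, with random weights and ReLU at every layer (the layer sizes are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def layer(h, W, b):
    return relu(h @ W + b)   # one transformation: weighted sum, bias, activation

# 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 2)), np.zeros(2)

X = np.array([[0.5, -1.0, 2.0]])  # a batch of one sample
Y = layer(layer(layer(X, W1, b1), W2, b2), W3, b3)
print(Y.shape)  # (1, 2): one sample, two outputs
```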
---
## 6.8 Exercise 1: Single Neuron Calculation
### Problem
**Given a single neuron with:**
- Inputs: $x_1 = 2.0, x_2 = -1.0, x_3 = 0.5$
- Weights: $w_1 = 0.5, w_2 = -0.3, w_3 = 0.8$
- Bias: $b = 0.2$
- Activation function: ReLU $f(z) = \max(0, z)$
**Calculate the output of this neuron.**
### Step-by-Step Solution
#### Step 1: Weighted Sum
**Compute:**
```math
z = \sum_{i=1}^{3} x_i w_i + b
```
**Substitute values:**
```math
z = (2.0 \times 0.5) + (-1.0 \times -0.3) + (0.5 \times 0.8) + 0.2
```
**Calculate each term:**
```math
z = (1.0) + (0.3) + (0.4) + 0.2
```
**Sum:**
```math
z = 1.0 + 0.3 + 0.4 + 0.2 = 1.9
```
#### Step 2: Apply Activation Function
**Apply ReLU:**
```math
y = \text{ReLU}(z) = \max(0, z) = \max(0, 1.9) = 1.9
```
### Answer
**The output of the neuron is $y = 1.9$.**
### Verification
**Check calculation:**
- Input contribution 1: $2.0 \times 0.5 = 1.0$
- Input contribution 2: $-1.0 \times -0.3 = 0.3$
- Input contribution 3: $0.5 \times 0.8 = 0.4$
- Bias: $0.2$
- Total: $1.0 + 0.3 + 0.4 + 0.2 = 1.9$ ✓
- ReLU(1.9) = 1.9 ✓
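The same check as a NumPy sketch:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])
w = np.array([0.5, -0.3, 0.8])
b = 0.2

z = np.dot(x, w) + b             # 1.0 + 0.3 + 0.4 + 0.2 = 1.9
y = max(0.0, z)                  # ReLU
print(round(z, 2), round(y, 2))  # 1.9 1.9
```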
---
## 6.9 Exercise 2: Multi-Layer Network
### Problem
**Given a neural network with 2 layers:**
**Layer 1:**
- Inputs: $x_1 = 1.0, x_2 = 0.5$
- Weights: $W_1 = \begin{bmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{bmatrix}$
- Bias: $b_1 = [0.1, -0.1]$
- Activation: ReLU
**Layer 2:**
- Inputs: Outputs from Layer 1
- Weights: $W_2 = \begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix}$
- Bias: $b_2 = 0.2$
- Activation: ReLU
**Calculate the final output.**
### Step-by-Step Solution
#### Step 1: Layer 1 - Weighted Sum
**Input vector:**
```math
\mathbf{x} = [1.0, 0.5]
```
**Weight matrix:**
```math
\mathbf{W}_1 = \begin{bmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{bmatrix}
```
**Bias vector:**
```math
\mathbf{b}_1 = [0.1, -0.1]
```
**Calculate:**
```math
\mathbf{z}_1 = \mathbf{x} \mathbf{W}_1 + \mathbf{b}_1
```
**Matrix multiplication:**
```math
\mathbf{z}_1 = [1.0, 0.5] \begin{bmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{bmatrix} + [0.1, -0.1]
```
**Compute:**
```math
z_{1,1} = 1.0 \times 0.6 + 0.5 \times 0.2 + 0.1 = 0.6 + 0.1 + 0.1 = 0.8
```
```math
z_{1,2} = 1.0 \times 0.4 + 0.5 \times 0.8 + (-0.1) = 0.4 + 0.4 - 0.1 = 0.7
```
```math
\mathbf{z}_1 = [0.8, 0.7]
```
#### Step 2: Layer 1 - Apply Activation
**Apply ReLU:**
```math
\mathbf{h}_1 = \text{ReLU}(\mathbf{z}_1) = [\max(0, 0.8), \max(0, 0.7)] = [0.8, 0.7]
```
#### Step 3: Layer 2 - Weighted Sum
**Input (from Layer 1):**
```math
\mathbf{h}_1 = [0.8, 0.7]
```
**Weight matrix:**
```math
\mathbf{W}_2 = \begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix}
```
**Bias:**
```math
b_2 = 0.2
```
**Calculate:**
```math
z_2 = \mathbf{h}_1 \mathbf{W}_2 + b_2
```
**Matrix multiplication:**
```math
z_2 = [0.8, 0.7] \begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix} + 0.2
```
**Compute:**
```math
z_2 = 0.8 \times 0.5 + 0.7 \times 0.7 + 0.2 = 0.4 + 0.49 + 0.2 = 1.09
```
#### Step 4: Layer 2 - Apply Activation
**Apply ReLU:**
```math
y = \text{ReLU}(z_2) = \max(0, 1.09) = 1.09
```
### Answer
**The final output is $y = 1.09$.**
### Summary Table
<table>
<tr>
<th>Layer</th>
<th>Input</th>
<th>Weights</th>
<th>Bias</th>
<th>Weighted Sum</th>
<th>Activation</th>
<th>Output</th>
</tr>
<tr>
<td>1</td>
<td>[1.0, 0.5]</td>
<td>$$\begin{bmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{bmatrix}$$</td>
<td>[0.1, -0.1]</td>
<td>[0.8, 0.7]</td>
<td>ReLU</td>
<td>[0.8, 0.7]</td>
</tr>
<tr>
<td>2</td>
<td>[0.8, 0.7]</td>
<td>$$\begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix}$$</td>
<td>0.2</td>
<td>1.09</td>
<td>ReLU</td>
<td><strong>1.09</strong></td>
</tr>
</table>
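The entire two-layer computation from this exercise, as a NumPy sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x  = np.array([1.0, 0.5])
W1 = np.array([[0.6, 0.4],
               [0.2, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.5],
               [0.7]])
b2 = 0.2

h1 = relu(x @ W1 + b1)    # layer 1: [0.8, 0.7]
y  = relu(h1 @ W2 + b2)   # layer 2: [1.09]
print(h1, y)              # [0.8 0.7] [1.09]
```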
---
## 6.10 Exercise 3: Learning Weights
### Problem
**Given a neuron that should output 1.0 when inputs are [1.0, 1.0] and output 0.0 when inputs are [0.0, 0.0], find appropriate weights and bias.**
**Use:**
- Activation: Sigmoid $f(z) = \frac{1}{1 + e^{-z}}$
- Desired behavior: AND gate (output 1 only when both inputs are 1)
### Step-by-Step Solution
#### Step 1: Set Up Equations
**For input [1.0, 1.0], desired output ≈ 1.0:**
```math
f(w_1 \times 1.0 + w_2 \times 1.0 + b) \approx 1.0
```
**For input [0.0, 0.0], desired output ≈ 0.0:**
```math
f(w_1 \times 0.0 + w_2 \times 0.0 + b) \approx 0.0
```
**Note:** Sigmoid outputs range from 0 to 1, so:
- $f(z) \approx 1.0$ when $z \gg 0$ (e.g., $z > 5$)
- $f(z) \approx 0.0$ when $z \ll 0$ (e.g., $z < -5$)
#### Step 2: Solve for Bias
**From equation 2:**
```math
f(b) \approx 0.0
```
**For sigmoid to output ≈ 0:**
```math
b < -5
```
**Let's use:**
```math
b = -10
```
#### Step 3: Solve for Weights
**From equation 1:**
```math
f(w_1 + w_2 - 10) \approx 1.0
```
**For sigmoid to output ≈ 1:**
```math
w_1 + w_2 - 10 > 5
```
```math
w_1 + w_2 > 15
```
**Let's use equal weights:**
```math
w_1 = w_2 = 8.0
```
**Check:**
```math
w_1 + w_2 = 8.0 + 8.0 = 16.0 > 15 \quad ✓
```
#### Step 4: Verify Solution
**Test Case 1: Input [1.0, 1.0]**
```math
z = 1.0 \times 8.0 + 1.0 \times 8.0 + (-10) = 8.0 + 8.0 - 10 = 6.0
```
```math
y = \frac{1}{1 + e^{-6.0}} \approx \frac{1}{1 + 0.0025} \approx 0.9975 \approx 1.0 \quad ✓
```
**Test Case 2: Input [0.0, 0.0]**
```math
z = 0.0 \times 8.0 + 0.0 \times 8.0 + (-10) = -10
```
```math
y = \frac{1}{1 + e^{10}} \approx \frac{1}{1 + 22026} \approx 0.00005 \approx 0.0 \quad ✓
```
**Test Case 3: Input [1.0, 0.0]**
```math
z = 1.0 \times 8.0 + 0.0 \times 8.0 + (-10) = 8.0 - 10 = -2.0
```
```math
y = \frac{1}{1 + e^{2.0}} \approx \frac{1}{1 + 7.39} \approx 0.12 < 0.5 \quad ✓
```
**Test Case 4: Input [0.0, 1.0]**
```math
z = 0.0 \times 8.0 + 1.0 \times 8.0 + (-10) = 8.0 - 10 = -2.0
```
```math
y = \frac{1}{1 + e^{2.0}} \approx 0.12 < 0.5 \quad ✓
```
### Answer
**Appropriate weights and bias:**
- $w_1 = 8.0$
- $w_2 = 8.0$
- $b = -10.0$
**The neuron implements an AND gate correctly!**
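A short Python check of the AND-gate behavior over all four input combinations (a sketch using NumPy, thresholding the sigmoid output at 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([8.0, 8.0])
b = -10.0

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = sigmoid(np.dot(x, w) + b)
    print(x, round(float(y), 4), "->", int(y > 0.5))

# [0, 0] 0.0    -> 0
# [0, 1] 0.1192 -> 0
# [1, 0] 0.1192 -> 0
# [1, 1] 0.9975 -> 1
```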
### Key Insight
**This demonstrates learning:**
- Training finds weights that produce desired behavior
- Different weights = Different logic functions
- Learning algorithms (like backpropagation) automatically find these weights from data!
---
## 6.11 Key Takeaways
### Neurons
- **Neurons are the basic processing units**
- **They receive inputs, compute a weighted sum, and apply an activation**
- **The output is the result of the activation function**
### Weights
- **Weights control connection strength**
- **They determine what patterns neurons learn**
- **They are what the model learns during training**
- **They enable complex pattern recognition**
### Calculation
- **Weighted sum: $z = \sum x_i w_i + b$**
- **Activation: $y = f(z)$**
- **Matrix form enables efficient computation**
### Importance
- **Weights enable learning**
- **They control information flow**
- **They enable complex pattern recognition**
- **They are adjusted during training to minimize error**
### Neural Networks
- **Multiple neurons form layers**
- **Multiple layers form networks**
- **Each layer transforms the input**
- **Deep networks learn hierarchical features**
---
## Mathematical Summary
### Single Neuron
```math
z = \sum_{i=1}^{n} x_i w_i + b
```
```math
y = f(z)
```
### Multiple Neurons (Matrix Form)
```math
\mathbf{z} = \mathbf{X} \mathbf{W} + \mathbf{b}
```
```math
\mathbf{Y} = f(\mathbf{z})
```
### Multi-Layer Network
```math
\mathbf{h}_1 = f_1(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1)
```
```math
\mathbf{h}_2 = f_2(\mathbf{h}_1 \mathbf{W}_2 + \mathbf{b}_2)
```
```math
\mathbf{Y} = f_3(\mathbf{h}_2 \mathbf{W}_3 + \mathbf{b}_3)
```
---
_This document provides a comprehensive explanation of neural networks, neurons, weights, and calculations with mathematical derivations and solved exercises._