
What is Feed-Forward? Step-by-Step Explanation

A complete, step-by-step explanation of feed-forward networks in transformer models: how the model transforms and refines the features produced by attention.

Table of Contents

  1. The Problem Feed-Forward Solves
  2. What is Feed-Forward?
  3. How Feed-Forward Works: Step-by-Step
  4. Complete Example: Feed-Forward on "Hello"
  5. Why Feed-Forward Matters
  6. Complete Feed-Forward Formula
  7. Visual Representation
  8. Why Expand and Compress?
  9. Key Takeaways

3.1 The Problem Feed-Forward Solves

The Challenge

Attention provides context, but we need to process and transform that information.

Think of it like cooking:

  • Attention: Gathers ingredients (context)
  • Feed-Forward: Cooks and transforms ingredients (processing)

The Solution: Feed-Forward Network

Feed-Forward applies complex transformations to each position independently.


3.2 What is Feed-Forward?

Simple Definition

A Feed-Forward Network (FFN) is a two-layer neural network that:

  1. Expands the input to a larger dimension
  2. Applies a nonlinear transformation
  3. Compresses back to original dimension

Visual Analogy

Think of it like a funnel:

Input (512 dimensions)
    ↓
    ┌─────────────┐
    │   EXPAND    │
    │  512 → 2048 │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │ TRANSFORM   │
    │  (GELU)     │
    └──────┬──────┘
           ↓
    ┌─────────────┐
    │  COMPRESS   │
    │ 2048 → 512  │
    └──────┬──────┘
           ↓
Output (512 dimensions)
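
To make the funnel concrete, here is a minimal sketch of the expand-transform-compress structure as a PyTorch module. PyTorch and the names FeedForward, d_model, and d_ff are illustrative assumptions, not SheepOp's actual implementation.

import torch
import torch.nn as nn

# Illustrative sketch only -- not SheepOp's actual module.
class FeedForward(nn.Module):
    """Position-wise feed-forward network: expand -> GELU -> compress."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.expand = nn.Linear(d_model, d_ff)     # 512 -> 2048 (W1, b1)
        self.compress = nn.Linear(d_ff, d_model)   # 2048 -> 512 (W2, b2)
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [..., d_model]; the same weights are applied at every position
        return self.compress(self.activation(self.expand(x)))

ffn = FeedForward()
x = torch.randn(1, 3, 512)   # batch of 1, 3 positions, 512 dimensions each
print(ffn(x).shape)          # torch.Size([1, 3, 512])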

3.3 How Feed-Forward Works: Step-by-Step

High-Level Overview

Step 1: Expand dimension (512 → 2048)
Step 2: Apply nonlinear activation (GELU)
Step 3: Compress dimension (2048 → 512)

Detailed Step-by-Step

Step 1: Expansion (First Linear Layer)

Input: Vector of size 512
Output: Vector of size 2048

Mathematical Operation:

H = X × W₁ + b₁

Example:

Input X:

[0.10, -0.20, 0.30, ..., 0.05]  (512 numbers)

Weight Matrix W₁:

Shape: [512, 2048]
Each column transforms input to one output dimension

Process:

H[0] = X[0]×W₁[0,0] + X[1]×W₁[1,0] + ... + X[511]×W₁[511,0]
H[1] = X[0]×W₁[0,1] + X[1]×W₁[1,1] + ... + X[511]×W₁[511,1]
...
H[2047] = X[0]×W₁[0,2047] + ... + X[511]×W₁[511,2047]

Result:

H = [0.12, -0.08, 0.25, ..., 0.18]  (2048 numbers)

Why Expand?

  • More dimensions = more capacity for complex transformations
  • Allows the model to learn intricate patterns
  • Think of it as "more room to work"
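
To make the shapes concrete, here is the expansion step on random data (a NumPy sketch; the weights are random placeholders, not trained values):

import numpy as np

# Illustrative sketch with random, untrained weights.
rng = np.random.default_rng(0)
X = rng.standard_normal(512)                   # input vector X (512 numbers)
W1 = rng.standard_normal((512, 2048)) * 0.02   # weight matrix W1, shape [512, 2048]
b1 = np.zeros(2048)                            # bias b1

H = X @ W1 + b1    # H[j] = X[0]*W1[0,j] + ... + X[511]*W1[511,j] + b1[j]
print(H.shape)     # (2048,) -- expanded to 2048 numbers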

Step 2: Nonlinear Activation (GELU)

Apply GELU to each element:

GELU Function:

GELU(x) = x × Φ(x)

Where Φ(x) is the cumulative distribution function of the standard normal distribution

Simplified Understanding:

  • Large positive values → pass through almost unchanged
  • Values near zero → scaled down (roughly halved)
  • Negative values → pushed toward zero (suppressed)

Example:

Input H:

H = [0.12, -0.08, 0.25, ..., 0.18]

Apply GELU element-wise:

GELU(0.12) ≈ 0.12 × 0.548 ≈ 0.066
GELU(-0.08) ≈ -0.08 × 0.468 ≈ -0.037
GELU(0.25) ≈ 0.25 × 0.599 ≈ 0.150
...
GELU(0.18) ≈ 0.18 × 0.572 ≈ 0.103

Result:

H' = [0.066, -0.037, 0.150, ..., 0.103]  (2048 numbers)

Why Nonlinear?

  • A stack of purely linear layers collapses into a single linear layer
  • Nonlinearity lets the model approximate complex functions
  • Essential for learning patterns that are not simple weighted sums
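
The example values from this step can be checked directly from the definition GELU(x) = x × Φ(x). A small Python sketch using the exact standard-normal CDF (via math.erf):

import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for v in (0.12, -0.08, 0.25, 0.18):
    print(f"GELU({v:+.2f}) ≈ {gelu(v):+.3f}")
# GELU(+0.12) ≈ +0.066
# GELU(-0.08) ≈ -0.037
# GELU(+0.25) ≈ +0.150
# GELU(+0.18) ≈ +0.103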

Step 3: Compression (Second Linear Layer)

Input: Vector of size 2048
Output: Vector of size 512

Mathematical Operation:

O = H' × W₂ + b₂

Process:

O[0] = H'[0]×W₂[0,0] + H'[1]×W₂[1,0] + ... + H'[2047]×W₂[2047,0]
O[1] = H'[0]×W₂[0,1] + H'[1]×W₂[1,1] + ... + H'[2047]×W₂[2047,1]
...
O[511] = H'[0]×W₂[0,511] + ... + H'[2047]×W₂[2047,511]

Result:

O = [0.15, -0.10, 0.22, ..., 0.12]  (512 numbers)

Why Compress?

  • Project back to original dimension
  • Maintains consistent size throughout model
  • Combines expanded features into compact representation
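
Mirroring the expansion sketch from Step 1, the compression step is just a second matrix multiply, now with a [2048, 512] weight (again with random, untrained values):

import numpy as np

# Illustrative sketch with random, untrained weights.
rng = np.random.default_rng(0)
H_act = rng.standard_normal(2048)              # activated hidden vector H'
W2 = rng.standard_normal((2048, 512)) * 0.02   # weight matrix W2, shape [2048, 512]
b2 = np.zeros(512)                             # bias b2

O = H_act @ W2 + b2    # O[j] = H'[0]*W2[0,j] + ... + H'[2047]*W2[2047,j] + b2[j]
print(O.shape)         # (512,) -- back to the model dimension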

3.4 Complete Example: Feed-Forward on "Hello"

Input

Word: "Hello"
After Attention: [0.146, 0.108, 0.192, ..., 0.11]
Dimension: 512

Step-by-Step Processing

Step 1: Expansion

Input X:

[0.146, 0.108, 0.192, ..., 0.11]  (512 numbers)

Weight Matrix W₁:

Shape: [512, 2048]
Values: Learned during training

Compute:

H = X × W₁

Result:

H = [0.21, -0.15, 0.28, ..., 0.19]  (2048 numbers)

Visualization:

512 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   W₁     │
                    └──────────┘

Step 2: Activation

Input H:

[0.21, -0.15, 0.28, ..., 0.19]  (2048 numbers)

Apply GELU element-wise:

GELU(0.21) ≈ 0.122
GELU(-0.15) ≈ -0.066
GELU(0.28) ≈ 0.171
...
GELU(0.19) ≈ 0.109

Result:

H' = [0.122, -0.066, 0.171, ..., 0.109]  (2048 numbers)

Visualization:

2048 dimensions ──→ ┌──────────┐ ──→ 2048 dimensions
                    │   GELU   │
                    └──────────┘

Step 3: Compression

Input H':

[0.122, -0.066, 0.171, ..., 0.109]  (2048 numbers)

Weight Matrix W₂:

Shape: [2048, 512]
Values: Learned during training

Compute:

O = H' × W₂

Result:

O = [0.18, -0.12, 0.24, ..., 0.14]  (512 numbers)

Visualization:

2048 dimensions ──→ ┌──────────┐ ──→ 512 dimensions
                    │   W₂     │
                    └──────────┘

Final Output

Output: [0.18, -0.12, 0.24, ..., 0.14]  (512 numbers)

Meaning: The transformed representation of "Hello", with its contextual features processed and refined, ready for the next layer
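
The whole pass for one position can be traced in a few lines. The sketch below uses random weights, so only the shapes (not the exact numbers above, which come from trained weights) are reproduced:

import math
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

# Illustrative sketch with random, untrained weights.
rng = np.random.default_rng(0)
X = rng.standard_normal(512)                   # attention output for "Hello"
W1, b1 = rng.standard_normal((512, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 512)) * 0.02, np.zeros(512)

H = X @ W1 + b1          # Step 1: expansion, shape (2048,)
H_act = gelu(H)          # Step 2: activation, shape (2048,)
O = H_act @ W2 + b2      # Step 3: compression, shape (512,)
print(H.shape, H_act.shape, O.shape)           # (2048,) (2048,) (512,)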


3.5 Why Feed-Forward Matters

Benefit 1: Feature Transformation

Before FFN:

Input: Raw attention output
Information: Contextual relationships

After FFN:

Output: Transformed features
Information: Processed and refined understanding

Benefit 2: Non-Linear Processing

Linear operations (like the weighted sums attention uses to mix value vectors) can only perform limited transformations.
Non-linear operations (like the GELU inside the FFN) enable the model to learn complex functions.

Analogy:

  • Linear: Can only draw straight lines
  • Non-linear: Can draw curves, circles, complex shapes
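
This is easy to verify numerically: without the activation, the two linear layers of the FFN collapse into a single linear map, so the expansion would buy nothing. A small NumPy check with random matrices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 32))
W2 = rng.standard_normal((32, 8))

# Without GELU in between, (x @ W1) @ W2 equals x @ (W1 @ W2):
# the whole "two-layer" network is equivalent to ONE linear layer.
print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))   # True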

Benefit 3: Position-Wise Processing

FFN processes each position independently:

Position 0 ("Hello"): FFN → Transformed representation
Position 1 ("World"): FFN → Transformed representation

Each word gets its own transformation!
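
This independence is also easy to check: running the FFN on a whole sequence gives the same result as running it on each position separately. A PyTorch sketch (the layer sizes match this document; the code itself is illustrative, not SheepOp's):

import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

seq = torch.randn(2, 512)        # two positions: "Hello" and "World"
whole = ffn(seq)                 # process the whole sequence at once
per_pos = torch.stack([ffn(seq[0]), ffn(seq[1])])   # process each position alone

print(torch.allclose(whole, per_pos, atol=1e-6))    # True: positions never interact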


3.6 Complete Feed-Forward Formula

Mathematical Expression

FFN(X) = GELU(X × W₁ + b₁) × W₂ + b₂

Breaking it down:

Part 1: First Linear Transformation

H = X × W₁ + b₁
  • Expands from 512 to 2048 dimensions

Part 2: Non-Linear Activation

H' = GELU(H)
  • Applies non-linear transformation

Part 3: Second Linear Transformation

O = H' × W₂ + b₂
  • Compresses from 2048 back to 512 dimensions

Complete:

FFN(X) = O
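
Written as code, the formula is a single line (a sketch using torch.nn.functional; the weights here are random stand-ins passed in as arguments):

import torch
import torch.nn.functional as F

def ffn(X, W1, b1, W2, b2):
    # FFN(X) = GELU(X @ W1 + b1) @ W2 + b2
    return F.gelu(X @ W1 + b1) @ W2 + b2

X = torch.randn(3, 512)                       # 3 positions, 512 dims each
W1, b1 = torch.randn(512, 2048) * 0.02, torch.zeros(2048)
W2, b2 = torch.randn(2048, 512) * 0.02, torch.zeros(512)
print(ffn(X, W1, b1, W2, b2).shape)           # torch.Size([3, 512])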

3.7 Visual Representation

Feed-Forward Pipeline

Input Vector (512D)
    │
    │ [0.146, 0.108, 0.192, ..., 0.11]
    ↓
┌─────────────────────────────┐
│ Linear Layer 1              │
│ (512 → 2048 expansion)      │
│                             │
│ H = X × W₁                  │
└──────────┬──────────────────┘
           │
           │ [0.21, -0.15, 0.28, ..., 0.19] (2048D)
           ↓
┌─────────────────────────────┐
│ GELU Activation             │
│ (Non-linear transformation) │
│                             │
│ H' = GELU(H)                │
└──────────┬──────────────────┘
           │
           │ [0.122, -0.066, 0.171, ..., 0.109] (2048D)
           ↓
┌─────────────────────────────┐
│ Linear Layer 2              │
│ (2048 → 512 compression)    │
│                             │
│ O = H' × W₂                 │
└──────────┬──────────────────┘
           │
           │ [0.18, -0.12, 0.24, ..., 0.14] (512D)
           ↓
    Output Vector (512D)

Dimension Flow

512 ──→ [Expand] ──→ 2048 ──→ [Transform] ──→ 2048 ──→ [Compress] ──→ 512

Like a funnel: Expand → Transform → Compress


3.8 Why Expand and Compress?

The Expansion-Compression Strategy

Why not stay at 512 dimensions?

Answer: Expansion provides "working space"

Analogy:

  • Think of doing math on paper
  • Small paper (512D) = limited space
  • Large paper (2048D) = room to work
  • Then copy results back to small paper (512D)

Benefits:

  1. More capacity: 2048 dimensions = more parameters to learn
  2. Better transformations: More space = more complex functions
  3. Feature refinement: Transformation happens in expanded space

Why compress back?

Answer: Maintain consistent size throughout the model

  • All layers use 512 dimensions
  • Consistent size enables stacking layers
  • Easier to manage and optimize
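
The extra capacity is easy to quantify: with 512 model dimensions and a 2048-dimensional hidden layer, a single FFN block holds roughly 2.1 million parameters. A quick calculation (these sizes follow the examples in this document; the actual SheepOp configuration may differ):

d_model, d_ff = 512, 2048

w1 = d_model * d_ff       # 1,048,576 weights in the expansion layer
w2 = d_ff * d_model       # 1,048,576 weights in the compression layer
biases = d_ff + d_model   # 2,560 bias terms

print(f"{w1 + w2 + biases:,} parameters per feed-forward block")   # 2,099,712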

3.9 Key Takeaways: Feed-Forward

  • FFN transforms features through expansion and compression
  • Expands to a larger dimension for processing
  • Applies a non-linear transformation (GELU)
  • Compresses back to the original dimension
  • Processes each position independently


This document provides a step-by-step explanation of feed-forward networks, the component that transforms and refines features in transformer models.