Training Neural Networks

  • This lesson explains how neural networks learn from data using forward propagation and backpropagation algorithms.
  • Gradient Descent

    Gradient Descent is an optimization algorithm used to minimize the loss function.

    Basic Idea:

    • Move weights in the direction that reduces error

    • Use the gradient (derivative) of the loss function

    Update Rule:

    W = W − η · ∂L/∂W

    Where:

    • W = weight

    • η = learning rate

    • L = loss function
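    The update rule can be seen in action with a minimal sketch, using a toy loss L(W) = (W − 3)² chosen here purely for illustration (its gradient is easy to write by hand):

    ```python
    # Minimal sketch of the update rule W = W - eta * dL/dW,
    # using the toy loss L(W) = (W - 3)^2 (chosen for illustration).
    def gradient(w):
        # dL/dW for L(W) = (W - 3)^2
        return 2 * (w - 3)

    eta = 0.1   # learning rate
    w = 0.0     # initial weight

    for _ in range(100):
        w = w - eta * gradient(w)

    print(round(w, 4))  # → 3.0, the minimum of the toy loss
    ```

    Each step moves W a fraction η of the gradient toward the minimum, exactly as the formula above describes.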

    Intuition:

    Imagine standing on a hill:

    • Gradient tells you slope direction

    • You move downhill to reach minimum loss


    Learning Rate (η)

    The learning rate controls how big a step we take during weight update.

    If Learning Rate is:

    • Too small → Very slow training

    • Too large → Overshoots minimum, unstable

    • Just right → Fast and stable convergence

    Visualization Concept:

    Small steps → Slow but stable
    Large steps → Jump around minimum

    Choosing the correct learning rate is critical.
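    A toy demo makes the three regimes concrete. Here the loss is L(w) = w² (gradient 2w); the three learning-rate values are illustrative assumptions:

    ```python
    # Toy demo on the quadratic loss L(w) = w^2, comparing three learning rates.
    results = {}
    for eta in (0.01, 0.5, 1.1):        # too small, just right, too large
        w = 1.0
        for _ in range(20):
            w = w - eta * 2 * w         # update rule: w -= eta * dL/dw
        results[eta] = w

    print(results)
    # 0.01 barely moves, 0.5 lands exactly on the minimum, 1.1 diverges
    ```

    With η = 1.1 each step overshoots the minimum by more than it approached, so |w| grows instead of shrinking, which is exactly the instability described above.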


    Batch vs Mini-Batch vs Stochastic Gradient Descent

    Type                  | Data Used per Update       | Speed    | Stability
    ----------------------|----------------------------|----------|------------
    Batch GD              | Entire dataset             | Slow     | Very stable
    Stochastic GD (SGD)   | 1 sample                   | Fast     | Noisy
    Mini-Batch GD         | Small batch (32, 64, 128)  | Balanced | Most common

    Batch Gradient Descent

    • Computes gradient using full dataset

    • Computationally expensive for large data

    Stochastic Gradient Descent (SGD)

    • Updates weights for every single example

    • Faster but noisy updates

    Mini-Batch Gradient Descent

    • Uses small batches

    • Most practical and widely used
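    The three variants differ only in how much data each update uses. The following sketch shows this with a synthetic linear problem (the model y = 2x and the data are illustrative assumptions):

    ```python
    import numpy as np

    # Synthetic 1-D regression: learn w in y = w * x, true weight 2.0.
    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    y = 2.0 * X

    def train(batch_size, epochs=50, eta=0.1):
        w = 0.0
        for _ in range(epochs):
            idx = rng.permutation(len(X))           # shuffle each epoch
            for start in range(0, len(X), batch_size):
                b = idx[start:start + batch_size]
                # gradient of (1/2) * mean squared error on this batch
                grad = np.mean((w * X[b] - y[b]) * X[b])
                w -= eta * grad
        return w

    w_batch = train(batch_size=100)  # Batch GD: one update per epoch
    w_sgd = train(batch_size=1)      # SGD: one update per sample
    w_mini = train(batch_size=32)    # Mini-Batch GD: the common middle ground
    print(w_batch, w_sgd, w_mini)    # all approach the true weight 2.0
    ```

    Only the `batch_size` argument changes between the three calls; the update rule itself is identical, which is why mini-batch is simply the practical middle ground.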


    Backpropagation

    Backpropagation is the algorithm that:

    • Calculates gradients

    • Sends the error backward through the network

    • Updates weights

    Steps:

    1. Forward Propagation → Compute prediction

    2. Compute Loss

    3. Backward Propagation → Compute gradients

    4. Update Weights using Gradient Descent

    5. Repeat for many epochs
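    The five steps above can be sketched end to end on a tiny 2-4-1 network learning XOR; the architecture, activations, and hyperparameters here are illustrative assumptions, not the only valid choices:

    ```python
    import numpy as np

    # Tiny 2-4-1 network trained on XOR with the five steps above.
    rng = np.random.default_rng(0)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4))
    W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
    eta = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(10000):                 # 5. repeat for many epochs
        h = np.tanh(X @ W1 + b1)               # 1. forward propagation
        out = sigmoid(h @ W2 + b2)
        loss = np.mean((out - y) ** 2)         # 2. compute loss (MSE)
        d_out = (out - y) * out * (1 - out)    # 3. backward propagation
        d_h = (d_out @ W2.T) * (1 - h ** 2)    #    (chain rule, layer by layer)
        W2 -= eta * h.T @ d_out                # 4. update weights
        b2 -= eta * d_out.sum(0, keepdims=True)
        W1 -= eta * X.T @ d_h
        b1 -= eta * d_h.sum(0, keepdims=True)

    print(loss)  # small after training; out.round() matches the XOR targets
    ```

    Note how the backward pass reuses values from the forward pass (`h`, `out`): the error signal flows from the output layer back to the hidden layer before any weight is updated.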


    Chain Rule (Core of Backpropagation)

    Backpropagation works using the Chain Rule from calculus.

    Chain Rule Concept:

    If:

    L = f(g(x))

    Then:

    dL/dx = (dL/dg) × (dg/dx)

    In neural networks:

    • Loss depends on output

    • Output depends on hidden layer

    • Hidden layer depends on weights

    So gradients are calculated layer by layer backward.
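    A quick numeric check makes the chain rule tangible. The toy functions f(g) = g² and g(x) = 3x + 1 are assumptions chosen for illustration; the chain-rule product is compared against a finite-difference estimate:

    ```python
    # Numeric check of the chain rule with f(g) = g^2 and g(x) = 3x + 1.
    def g(x): return 3 * x + 1
    def f(u): return u ** 2
    def L(x): return f(g(x))

    x = 2.0
    dg_dx = 3.0                 # derivative of 3x + 1
    dL_dg = 2 * g(x)            # derivative of u^2 evaluated at u = g(x)
    chain = dL_dg * dg_dx       # chain rule: dL/dx = dL/dg * dg/dx

    # compare with a central finite-difference estimate of dL/dx
    eps = 1e-6
    numeric = (L(x + eps) - L(x - eps)) / (2 * eps)
    print(chain, round(numeric, 4))  # both are 42.0
    ```

    Backpropagation applies exactly this product, layer by layer, from the loss back to each weight.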

    Full Training Flow

    Input → Forward Propagation → Loss Calculation

                   ↓

            Backpropagation

                   ↓

           Weight Update (Gradient Descent)

                   ↓

                Repeat