Training Neural Networks

  • This lesson explains how neural networks learn from data using forward propagation and backpropagation algorithms.
  • Gradient Descent

    Gradient Descent is an optimization algorithm used to minimize the loss function.

    Basic Idea:

    • Move weights in the direction that reduces error

    • Use the gradient (derivative) of the loss function

    Update Rule:

    W = W − η · ∂L/∂W

    Where:

    • W = weight

    • η = learning rate

    • L = loss function
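    The update rule can be seen in action with a minimal sketch, using a toy loss L(W) = (W − 3)² chosen here purely for illustration (its gradient is easy to write by hand):

    ```python
    # Minimal sketch of the update rule W = W - eta * dL/dW,
    # using the toy loss L(W) = (W - 3)^2 (chosen for illustration).
    def gradient(w):
        # dL/dW for L(W) = (W - 3)^2
        return 2 * (w - 3)

    eta = 0.1   # learning rate
    w = 0.0     # initial weight

    for _ in range(100):
        w = w - eta * gradient(w)

    print(round(w, 4))  # → 3.0, the minimum of the toy loss
    ```

    Each step moves W a fraction η of the gradient toward the minimum, exactly as the formula above describes.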

    Intuition:

    Imagine standing on a hill:

    • Gradient tells you slope direction

    • You move downhill to reach minimum loss


    Learning Rate (η)

    The learning rate controls how big a step we take during weight update.

    If Learning Rate is:

    • Too small → Very slow training

    • Too large → Overshoots minimum, unstable

    • Just right → Fast and stable convergence

    Visualization Concept:

    Small steps → Slow but stable
    Large steps → Jump around minimum

    Choosing the correct learning rate is critical.
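    A toy demo makes the three regimes concrete. Here the loss is L(w) = w² (gradient 2w); the three learning-rate values are illustrative assumptions:

    ```python
    # Toy demo on the quadratic loss L(w) = w^2, comparing three learning rates.
    results = {}
    for eta in (0.01, 0.5, 1.1):        # too small, just right, too large
        w = 1.0
        for _ in range(20):
            w = w - eta * 2 * w         # update rule: w -= eta * dL/dw
        results[eta] = w

    print(results)
    # 0.01 barely moves, 0.5 lands exactly on the minimum, 1.1 diverges
    ```

    With η = 1.1 each step overshoots the minimum by more than it approached, so |w| grows instead of shrinking, which is exactly the instability described above.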


    Batch vs Mini-Batch vs Stochastic Gradient Descent

    Type                  | Data Used per Update       | Speed    | Stability
    ----------------------|----------------------------|----------|------------
    Batch GD              | Entire dataset             | Slow     | Very stable
    Stochastic GD (SGD)   | 1 sample                   | Fast     | Noisy
    Mini-Batch GD         | Small batch (32, 64, 128)  | Balanced | Most common

    Batch Gradient Descent

    • Computes gradient using full dataset

    • Computationally expensive for large data

    Stochastic Gradient Descent (SGD)

    • Updates weights for every single example

    • Faster but noisy updates

    Mini-Batch Gradient Descent

    • Uses small batches

    • Most practical and widely used
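    The three variants differ only in how much data each update uses. The following sketch shows this with a synthetic linear problem (the model y = 2x and the data are illustrative assumptions):

    ```python
    import numpy as np

    # Synthetic 1-D regression: learn w in y = w * x, true weight 2.0.
    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    y = 2.0 * X

    def train(batch_size, epochs=50, eta=0.1):
        w = 0.0
        for _ in range(epochs):
            idx = rng.permutation(len(X))           # shuffle each epoch
            for start in range(0, len(X), batch_size):
                b = idx[start:start + batch_size]
                # gradient of (1/2) * mean squared error on this batch
                grad = np.mean((w * X[b] - y[b]) * X[b])
                w -= eta * grad
        return w

    w_batch = train(batch_size=100)  # Batch GD: one update per epoch
    w_sgd = train(batch_size=1)      # SGD: one update per sample
    w_mini = train(batch_size=32)    # Mini-Batch GD: the common middle ground
    print(w_batch, w_sgd, w_mini)    # all approach the true weight 2.0
    ```

    Only the `batch_size` argument changes between the three calls; the update rule itself is identical, which is why mini-batch is simply the practical middle ground.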


    Backpropagation

    Backpropagation is the algorithm that:

    • Calculates gradients

    • Sends the error backward through the network

    • Updates weights

    Steps:

    1. Forward Propagation → Compute prediction

    2. Compute Loss

    3. Backward Propagation → Compute gradients

    4. Update Weights using Gradient Descent

    5. Repeat for many epochs
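    The five steps above can be sketched end to end on a tiny 2-4-1 network learning XOR; the architecture, activations, and hyperparameters here are illustrative assumptions, not the only valid choices:

    ```python
    import numpy as np

    # Tiny 2-4-1 network trained on XOR with the five steps above.
    rng = np.random.default_rng(0)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4))
    W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
    eta = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(10000):                 # 5. repeat for many epochs
        h = np.tanh(X @ W1 + b1)               # 1. forward propagation
        out = sigmoid(h @ W2 + b2)
        loss = np.mean((out - y) ** 2)         # 2. compute loss (MSE)
        d_out = (out - y) * out * (1 - out)    # 3. backward propagation
        d_h = (d_out @ W2.T) * (1 - h ** 2)    #    (chain rule, layer by layer)
        W2 -= eta * h.T @ d_out                # 4. update weights
        b2 -= eta * d_out.sum(0, keepdims=True)
        W1 -= eta * X.T @ d_h
        b1 -= eta * d_h.sum(0, keepdims=True)

    print(loss)  # small after training; out.round() matches the XOR targets
    ```

    Note how the backward pass reuses values from the forward pass (`h`, `out`): the error signal flows from the output layer back to the hidden layer before any weight is updated.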


    Chain Rule (Core of Backpropagation)

    Backpropagation works using the Chain Rule from calculus.

    Chain Rule Concept:

    If:

    L = f(g(x))

    Then:

    dL/dx = (dL/dg) × (dg/dx)

    In neural networks:

    • Loss depends on output

    • Output depends on hidden layer

    • Hidden layer depends on weights

    So gradients are calculated layer by layer backward.
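    A quick numeric check makes the chain rule tangible. The toy functions f(g) = g² and g(x) = 3x + 1 are assumptions chosen for illustration; the chain-rule product is compared against a finite-difference estimate:

    ```python
    # Numeric check of the chain rule with f(g) = g^2 and g(x) = 3x + 1.
    def g(x): return 3 * x + 1
    def f(u): return u ** 2
    def L(x): return f(g(x))

    x = 2.0
    dg_dx = 3.0                 # derivative of 3x + 1
    dL_dg = 2 * g(x)            # derivative of u^2 evaluated at u = g(x)
    chain = dL_dg * dg_dx       # chain rule: dL/dx = dL/dg * dg/dx

    # compare with a central finite-difference estimate of dL/dx
    eps = 1e-6
    numeric = (L(x + eps) - L(x - eps)) / (2 * eps)
    print(chain, round(numeric, 4))  # both are 42.0
    ```

    Backpropagation applies exactly this product, layer by layer, from the loss back to each weight.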

    Full Training Flow

    Input → Forward Propagation → Loss Calculation

                   ↓

            Backpropagation

                   ↓

           Weight Update (Gradient Descent)

                   ↓

                Repeat