
Optimizers

  • This lesson explains optimizers used in deep learning to update neural network weights and improve model training.
  • SGD (Stochastic Gradient Descent)

    Idea:

    Updates weights using the gradient computed from a single sample (or a mini-batch) rather than the full dataset.

    W = W - \eta \frac{\partial L}{\partial W}

    Advantages:

    • Simple

    • Low memory usage

    • Works well with large datasets

    Problems:

    • Can oscillate

    • Slow convergence

    • Gets stuck in local minima
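The update rule above can be sketched in a few lines of plain Python; `sgd_step` is an illustrative name, not a library function, and the toy loss L(w) = w^2 (gradient 2w) is chosen only so convergence is easy to see:

```python
def sgd_step(w, grad, lr=0.1):
    # W = W - eta * dL/dW
    return w - lr * grad

# Minimize the toy loss L(w) = w^2, whose gradient is 2w
w = 5.0
for _ in range(50):
    w = sgd_step(w, grad=2 * w)
print(w)  # close to the minimum at w = 0
```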

    Momentum

    Momentum improves SGD by adding velocity.

    Instead of moving only in the direction of the current gradient, it accumulates a velocity from previous update directions.

    Update Rule:

    v_t = \beta v_{t-1} + \eta \nabla L

    W = W - v_t

    Where:

    • β = momentum factor (usually 0.9)

    Benefits:

    • Reduces oscillations

    • Faster convergence

    • Escapes shallow local minima
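As a rough sketch (again with illustrative names and the same toy loss L(w) = w^2), the velocity accumulates past gradients instead of using the raw gradient step:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # v_t = beta * v_{t-1} + eta * grad
    v = beta * v + lr * grad
    # W = W - v_t
    return w - v, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # converges toward 0, with damped oscillation on the way
```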

    RMSProp (Root Mean Square Propagation)

    RMSProp adapts the learning rate for each parameter.

    It keeps an exponentially decaying average of squared gradients.

    E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2

    W = W - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

    Key Idea:

    If a parameter's gradient is large → its effective learning rate shrinks
    If a parameter's gradient is small → its effective learning rate grows

    Benefits:

    • Handles non-stationary data

    • Works well for RNNs
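A minimal NumPy sketch of the RMSProp update (illustrative, not the TensorFlow implementation) on a toy loss L(w) = w0^2 + 100 w1^2, where the two parameters have gradients of very different scales:

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, lr=0.05, beta=0.9, eps=1e-8):
    # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    # W = W - eta / sqrt(E[g^2]_t + eps) * g_t
    return w - lr / np.sqrt(avg_sq + eps) * grad, avg_sq

w, avg_sq = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(1000):
    grad = np.array([2.0 * w[0], 200.0 * w[1]])  # gradient scales differ by 100x
    w, avg_sq = rmsprop_step(w, avg_sq, grad)
print(w)  # both coordinates end up near 0 despite the scale difference
```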


    Adam (Adaptive Moment Estimation)

    Adam combines:

    • Momentum

    • RMSProp

    It keeps track of:

    • First moment (mean of gradients)

    • Second moment (variance of gradients)

    Update:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

    Final update, using the bias-corrected estimates \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t):

    W = W - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
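One full Adam step can be sketched as follows (illustrative names; standard defaults β₁ = 0.9, β₂ = 0.999; the m̂, v̂ terms are the bias corrections the standard algorithm applies because m and v start at zero):

```python
from math import sqrt

def adam_step(w, m, v, t, grad, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (mean of squared gradients)
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (sqrt(v_hat) + eps), m, v

# Toy loss L(w) = w^2 (gradient 2w); t starts at 1 for the bias correction
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, t, grad=2 * w)
print(w)
```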

    Why Is Adam Popular?

    • Fast convergence

    • Works for most problems

    • Adaptive learning rate

    • Less tuning required


    Adaptive Learning

    Adaptive learning means:

    • The learning rate changes during training

    • Different parameters get different learning rates

    Optimizers using adaptive learning:

    • RMSProp

    • Adam

    • Adagrad
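Adagrad, for example, accumulates squared gradients without any decay, so each parameter's effective learning rate can only shrink. A toy sketch (illustrative names, constant gradient of 1.0):

```python
from math import sqrt

def adagrad_step(w, g_sum, grad, lr=0.5, eps=1e-8):
    g_sum = g_sum + grad ** 2                   # accumulator never decays
    return w - lr / (sqrt(g_sum) + eps) * grad, g_sum

w, g_sum, effective_lrs = 0.0, 0.0, []
for _ in range(5):
    w, g_sum = adagrad_step(w, g_sum, grad=1.0)
    effective_lrs.append(0.5 / sqrt(g_sum))     # lr / sqrt(t) for a constant gradient
print(effective_lrs)  # 0.5, 0.353..., 0.288..., 0.25, 0.223... (strictly decreasing)
```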

    Optimizer Comparison

    | Optimizer | Speed     | Stability | Adaptive LR | Memory |
    |-----------|-----------|-----------|-------------|--------|
    | SGD       | Slow      | Medium    | No          | Low    |
    | Momentum  | Faster    | Good      | No          | Medium |
    | RMSProp   | Fast      | Very Good | Yes         | Medium |
    | Adam      | Very Fast | Excellent | Yes         | Higher |

    When to Use What?

    • Small dataset → Adam

    • RNN tasks → RMSProp / Adam

    • Large simple models → SGD + Momentum

    • Default choice → Adam

    Code Example (TensorFlow)

Testing Different Optimizers in a Neural Network using TensorFlow Keras

This Python example demonstrates how to experiment with different optimizers in a simple feedforward neural network using TensorFlow Keras. The model consists of a Dense hidden layer and an output layer. By changing the optimizer (SGD, Adam, RMSprop), you can observe how different optimization algorithms affect the training process and convergence for regression tasks using Mean Squared Error (MSE) loss.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10,)),            # 10 input features
    layers.Dense(64, activation='relu'),  # hidden layer
    layers.Dense(1)                       # single regression output
])

# Try different optimizers
model.compile(optimizer='sgd', loss='mse')
# model.compile(optimizer='adam', loss='mse')
# model.compile(optimizer='rmsprop', loss='mse')