
Optimizers

  • This lesson explains optimizers used in deep learning to update neural network weights and improve model training.
  • SGD (Stochastic Gradient Descent)

    Idea:

    Updates weights using the gradient computed from a single sample (or a mini-batch) rather than the full dataset.

    W = W - \eta \frac{\partial L}{\partial W}

    Advantages:

    • Simple

    • Low memory usage

    • Works well with large datasets

    Problems:

    • Can oscillate

    • Slow convergence

    • Gets stuck in local minima
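The update rule above can be sketched in a few lines of plain Python; `sgd_step` is an illustrative name, not a library function, and the toy loss L(w) = w^2 (gradient 2w) is chosen only so convergence is easy to see:

```python
def sgd_step(w, grad, lr=0.1):
    # W = W - eta * dL/dW
    return w - lr * grad

# Minimize the toy loss L(w) = w^2, whose gradient is 2w
w = 5.0
for _ in range(50):
    w = sgd_step(w, grad=2 * w)
print(w)  # close to the minimum at w = 0
```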

    Momentum

    Momentum improves SGD by adding velocity.

    Instead of moving only in the direction of the current gradient, it accumulates a velocity from previous update directions.

    Update Rule:

    v_t = \beta v_{t-1} + \eta \nabla L

    W = W - v_t

    Where:

    • β = momentum factor (usually 0.9)

    Benefits:

    • Reduces oscillations

    • Faster convergence

    • Escapes shallow local minima
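As a rough sketch (again with illustrative names and the same toy loss L(w) = w^2), the velocity accumulates past gradients instead of using the raw gradient step:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # v_t = beta * v_{t-1} + eta * grad
    v = beta * v + lr * grad
    # W = W - v_t
    return w - v, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # converges toward 0, with damped oscillation on the way
```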

    RMSProp (Root Mean Square Propagation)

    RMSProp adapts the learning rate for each parameter.

    It keeps an exponentially decaying average of squared gradients.

    E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2

    W = W - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

    Key Idea:

    If a parameter's gradient is large → its effective learning rate shrinks
    If a parameter's gradient is small → its effective learning rate grows

    Benefits:

    • Handles non-stationary data

    • Works well for RNNs
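A minimal NumPy sketch of the RMSProp update (illustrative, not the TensorFlow implementation) on a toy loss L(w) = w0^2 + 100 w1^2, where the two parameters have gradients of very different scales:

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, lr=0.05, beta=0.9, eps=1e-8):
    # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    # W = W - eta / sqrt(E[g^2]_t + eps) * g_t
    return w - lr / np.sqrt(avg_sq + eps) * grad, avg_sq

w, avg_sq = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(1000):
    grad = np.array([2.0 * w[0], 200.0 * w[1]])  # gradient scales differ by 100x
    w, avg_sq = rmsprop_step(w, avg_sq, grad)
print(w)  # both coordinates end up near 0 despite the scale difference
```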


    Adam (Adaptive Moment Estimation)

    Adam combines:

    • Momentum

    • RMSProp

    It keeps track of:

    • First moment (mean of gradients)

    • Second moment (variance of gradients)

    Update:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

    Final update, using the bias-corrected estimates \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t):

    W = W - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
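One full Adam step can be sketched as follows (illustrative names; standard defaults β₁ = 0.9, β₂ = 0.999; the m̂, v̂ terms are the bias corrections the standard algorithm applies because m and v start at zero):

```python
from math import sqrt

def adam_step(w, m, v, t, grad, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (mean of squared gradients)
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (sqrt(v_hat) + eps), m, v

# Toy loss L(w) = w^2 (gradient 2w); t starts at 1 for the bias correction
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, t, grad=2 * w)
print(w)
```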

    Why Is Adam Popular?

    • Fast convergence

    • Works for most problems

    • Adaptive learning rate

    • Less tuning required


    Adaptive Learning

    Adaptive learning means:

    • The learning rate changes during training

    • Different parameters get different learning rates

    Optimizers using adaptive learning:

    • RMSProp

    • Adam

    • Adagrad
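Adagrad, for example, accumulates squared gradients without any decay, so each parameter's effective learning rate can only shrink. A toy sketch (illustrative names, constant gradient of 1.0):

```python
from math import sqrt

def adagrad_step(w, g_sum, grad, lr=0.5, eps=1e-8):
    g_sum = g_sum + grad ** 2                   # accumulator never decays
    return w - lr / (sqrt(g_sum) + eps) * grad, g_sum

w, g_sum, effective_lrs = 0.0, 0.0, []
for _ in range(5):
    w, g_sum = adagrad_step(w, g_sum, grad=1.0)
    effective_lrs.append(0.5 / sqrt(g_sum))     # lr / sqrt(t) for a constant gradient
print(effective_lrs)  # 0.5, 0.353..., 0.288..., 0.25, 0.223... (strictly decreasing)
```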

    Optimizer Comparison

    | Optimizer | Speed     | Stability | Adaptive LR | Memory |
    |-----------|-----------|-----------|-------------|--------|
    | SGD       | Slow      | Medium    | No          | Low    |
    | Momentum  | Faster    | Good      | No          | Medium |
    | RMSProp   | Fast      | Very Good | Yes         | Medium |
    | Adam      | Very Fast | Excellent | Yes         | Higher |

    When to Use What?

    • Small dataset → Adam

    • RNN tasks → RMSProp / Adam

    • Large simple models → SGD + Momentum

    • Default choice → Adam

    Code Example (TensorFlow)

Testing Different Optimizers in a Neural Network using TensorFlow Keras

This Python example demonstrates how to experiment with different optimizers in a simple feedforward neural network using TensorFlow Keras. The model consists of a Dense hidden layer and an output layer. By changing the optimizer (SGD, Adam, RMSprop), you can observe how different optimization algorithms affect the training process and convergence for regression tasks using Mean Squared Error (MSE) loss.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10,)),            # 10 input features
    layers.Dense(64, activation='relu'),  # hidden layer
    layers.Dense(1)                       # single regression output
])

# Try different optimizers
model.compile(optimizer='sgd', loss='mse')
# model.compile(optimizer='adam', loss='mse')
# model.compile(optimizer='rmsprop', loss='mse')