Optimizers
- This lesson explains optimizers used in deep learning to update neural network weights and improve model training.
SGD (Stochastic Gradient Descent)
Idea:
Updates weights using the gradient computed from a single sample (or a mini-batch).
W = W - \eta \frac{\partial L}{\partial W}
Where:
η = learning rate
Advantages:
Simple
Low memory usage
Works well with large datasets
Problems:
Can oscillate
Slow convergence
Can get stuck in local minima
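To make the update rule concrete, here is a minimal sketch (illustrative only) that applies W = W − η·∂L/∂W to an assumed toy one-parameter loss L(W) = (W − 3)², whose gradient is 2(W − 3):

```python
# Minimal SGD sketch: W = W - eta * dL/dW
# Assumed toy loss: L(W) = (W - 3)^2, so dL/dW = 2 * (W - 3)
eta = 0.1   # learning rate
W = 0.0     # single weight, initialised at 0

for step in range(50):
    grad = 2 * (W - 3)   # gradient of the toy loss
    W = W - eta * grad   # SGD update rule

print(W)  # approaches 3, the minimum of the toy loss
```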
Momentum
Momentum improves SGD by adding velocity.
Instead of moving only in the direction of the current gradient, it keeps a velocity that remembers previous gradient directions.
Update Rule:
v_t = \beta v_{t-1} + \eta \nabla L
W = W - v_t
Where:
β = momentum factor (usually 0.9)
Benefits:
Reduces oscillations
Faster convergence
Escapes shallow local minima
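A minimal sketch of the momentum update on the same assumed toy loss L(W) = (W − 3)²; the velocity v accumulates past gradients instead of reacting only to the current one:

```python
# Minimal momentum sketch: v = beta * v + eta * grad;  W = W - v
# Same assumed toy loss: L(W) = (W - 3)^2
eta, beta = 0.1, 0.9
W, v = 0.0, 0.0

for step in range(200):
    grad = 2 * (W - 3)
    v = beta * v + eta * grad   # velocity remembers previous gradients
    W = W - v                   # step along the velocity

print(W)  # approaches 3; the path may overshoot before settling
```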
RMSProp (Root Mean Square Propagation)
RMSProp adapts learning rate for each parameter.
It keeps track of squared gradients.
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2
W = W - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
Key Idea:
If gradient is large → reduce learning rate
If gradient is small → increase learning rate
Benefits:
Handles non-stationary data
Works well for RNNs
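A minimal sketch of the RMSProp update on the same assumed toy loss; note how the running average of squared gradients rescales the step for the parameter:

```python
import numpy as np

# Minimal RMSProp sketch on the same assumed toy loss L(W) = (W - 3)^2
eta, beta, eps = 0.01, 0.9, 1e-8
W, Eg2 = 0.0, 0.0

for step in range(500):
    grad = 2 * (W - 3)
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2   # running average of squared gradients
    W = W - eta / np.sqrt(Eg2 + eps) * grad     # step scaled by that average

print(W)  # approaches 3; the effective step size stays close to eta
```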
Adam (Adaptive Moment Estimation)
Adam combines:
Momentum
RMSProp
It keeps track of:
First moment (mean of gradients)
Second moment (mean of squared gradients, i.e. the uncentered variance)
Update:
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
Final update:
W = W - \frac{\eta}{\sqrt{v_t} + \epsilon} m_t
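A minimal sketch of the Adam update on the same assumed toy loss. Standard Adam also bias-corrects m_t and v_t before the step (the equations above omit this for brevity), so the sketch includes that correction:

```python
import numpy as np

# Minimal Adam sketch on the same assumed toy loss L(W) = (W - 3)^2,
# including the bias correction used by standard Adam
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
W, m, v = 0.0, 0.0, 0.0

for t in range(1, 301):
    grad = 2 * (W - 3)
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)

print(W)  # approaches 3
```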
Why is Adam Popular?
Fast convergence
Works for most problems
Adaptive learning rate
Less tuning required
Adaptive Learning
Adaptive learning means:
Learning rate changes during training
Different parameters get different learning rates
Optimizers using adaptive learning:
RMSProp
Adam
Adagrad
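In Keras, these adaptive optimizers can also be created explicitly. The learning_rate values below are illustrative; each optimizer treats it only as a global base rate and rescales the step per parameter during training:

```python
import tensorflow as tf

# Adaptive optimizers in Keras: learning_rate is only the global base rate;
# each optimizer rescales the step per parameter during training
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.001)
```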
Optimizer Comparison
When to Use What?
Small dataset → Adam
RNN tasks → RMSProp / Adam
Large simple models → SGD + Momentum
Default choice → Adam
Code Example (TensorFlow)
Testing Different Optimizers in a Neural Network using TensorFlow Keras
This Python example demonstrates how to experiment with different optimizers in a simple feedforward neural network using TensorFlow Keras. The model consists of a Dense hidden layer and an output layer. By changing the optimizer (SGD, Adam, RMSprop), you can observe how different optimization algorithms affect the training process and convergence for regression tasks using Mean Squared Error (MSE) loss.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Simple feedforward regression model: one hidden Dense layer, one output unit
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dense(1)
])

# Try different optimizers by swapping the active compile line
model.compile(optimizer='sgd', loss='mse')
# model.compile(optimizer='adam', loss='mse')
# model.compile(optimizer='rmsprop', loss='mse')

# Dummy data (illustration only) so the training behaviour can be observed
X = np.random.rand(500, 10)
y = np.random.rand(500, 1)
model.fit(X, y, epochs=5, batch_size=32)