Activation Functions
- This lesson introduces activation functions used in neural networks to enable learning of non-linear patterns.
Sigmoid Function
Formula:
\sigma(x) = \frac{1}{1 + e^{-x}}
Output Range:
0 to 1
Use Case:
Binary classification (output layer)
Advantages:
Smooth and differentiable
Outputs probability-like values

Disadvantages:
Vanishing gradient problem
Not zero-centered
Slow convergence

Tanh (Hyperbolic Tangent)
Formula:
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
Output Range:
-1 to 1
Advantages:
Zero-centered
Stronger gradients than sigmoid

Disadvantages:
Still suffers from vanishing gradient
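The sigmoid and tanh formulas above can be sketched in plain Python (standard library only, no NN framework assumed):

```python
import math

def sigmoid(x: float) -> float:
    # sigma(x) = 1 / (1 + e^{-x}); output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    # tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}); output in (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# Sigmoid squashes into (0, 1); tanh is zero-centered in (-1, 1).
print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
```

Note that both saturate for large |x|, which is where their derivatives (and thus gradients) become tiny.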
ReLU (Rectified Linear Unit)
Formula:
f(x) = \max(0, x)
Output Range:
0 to ∞
Advantages:
Computationally efficient
Reduces vanishing gradient problem
Fast training

Disadvantages:
Dying ReLU problem (a neuron whose input stays negative always outputs 0, receives zero gradient, and permanently stops learning)
ReLU is the most commonly used activation in hidden layers.
Leaky ReLU
Formula:
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}
(α is small, e.g., 0.01)
Why Needed?
Fixes the Dying ReLU problem by allowing small gradient when input is negative.
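A minimal sketch of ReLU and Leaky ReLU as defined above (plain Python; `alpha = 0.01` follows the note under the formula):

```python
def relu(x: float) -> float:
    # f(x) = max(0, x); gradient is 0 for all x <= 0 (the "dying ReLU" regime)
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # f(x) = x if x > 0, else alpha * x; the small negative slope keeps gradients alive
    return x if x > 0 else alpha * x

print(relu(-3.0))        # 0.0   -> no signal for negative inputs
print(leaky_relu(-3.0))  # -0.03 -> small nonzero output and gradient
```

For positive inputs the two functions are identical; they differ only in how negative inputs are handled.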
Softmax Function
Formula:
\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
Output:
Produces probabilities
Sum of outputs = 1
Use Case:
Multi-class classification (output layer)
Example:
Input: [2.0, 1.0, 0.1]
Output: [0.66, 0.24, 0.10] (approximately)
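The example can be verified with a short sketch (plain Python; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(z):
    # Shift by max(z) for numerical stability; the result is mathematically unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # roughly [0.66, 0.24, 0.1]
# The outputs sum to 1 (up to floating-point error), so they behave like probabilities.
```

The largest logit (2.0) gets the largest probability, and exponentiation makes the gap between logits more pronounced.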
Vanishing Gradient Problem
What is it?
During backpropagation, gradients become extremely small as they move backward through layers.
Why does it happen?
Sigmoid and Tanh squash values into small ranges.
Their derivatives become very small for large positive or negative inputs.
In deep networks these small derivatives are multiplied layer by layer, so gradients shrink roughly exponentially with depth.
Effect:
Early layers learn very slowly
Training becomes inefficient
Deep networks fail to converge
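The shrinking effect can be made concrete with a small sketch: backpropagation multiplies one sigmoid derivative per layer (the sigmoid derivative never exceeds 0.25), so the gradient factor collapses toward zero. The 10-layer depth and the fixed pre-activation value of 2.0 are illustrative assumptions, and layer weights are ignored for simplicity:

```python
import math

def sigmoid_derivative(x: float) -> float:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Each layer contributes one derivative factor during backpropagation.
grad = 1.0
for layer in range(10):              # assumed 10-layer chain
    grad *= sigmoid_derivative(2.0)  # assumed pre-activation of 2.0 at every layer
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
```

After 10 layers the factor is on the order of 1e-10, which is why the early layers of a deep sigmoid network barely update.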
Solution:
Use ReLU / Leaky ReLU
Proper weight initialization
Batch Normalization
Use advanced architectures (ResNet, etc.)

Comparison Table