Activation Functions

  • This lesson introduces activation functions used in neural networks to enable learning of non-linear patterns.
  • Sigmoid Function

    Formula:

    \sigma(x) = \frac{1}{1 + e^{-x}}

    Output Range:

    0 to 1

    Use Case:

    • Binary classification (output layer)

    Advantages:

    Smooth and differentiable
    Outputs probability-like values

    Disadvantages:

    Vanishing gradient problem
    Not zero-centered
    Slow convergence
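A minimal Python sketch of the sigmoid, directly from the formula above (function name is illustrative):

```python
import math

def sigmoid(x):
    """Map any real x into (0, 1); smooth and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(0) is exactly 0.5; large |x| saturates toward 0 or 1,
# which is where the vanishing-gradient issue comes from.
```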


    Tanh (Hyperbolic Tangent)

    Formula:

    \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

    Output Range:

    -1 to 1

    Advantages:

    Zero-centered
    Stronger gradients than sigmoid

    Disadvantages:

    Still suffers from vanishing gradient
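Tanh can be written straight from its definition; a quick sketch (Python's `math.tanh` computes the same thing):

```python
import math

def tanh(x):
    """Zero-centered squashing into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# tanh(0) == 0, unlike sigmoid(0) == 0.5 — this is the
# zero-centering advantage mentioned above.
```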


    ReLU (Rectified Linear Unit)

    Formula:

    f(x) = \max(0, x)

    Output Range:

    0 to ∞

    Advantages:

    Computationally efficient
    Reduces vanishing gradient problem
    Fast training

    Disadvantages:

    Dying ReLU problem (neurons can stop learning if output becomes 0 permanently)

    ReLU is the most commonly used activation in hidden layers.
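The formula translates to a one-liner; a minimal sketch:

```python
def relu(x):
    """ReLU: pass positive inputs through, zero out the rest."""
    # For x <= 0 the output AND the gradient are 0 — a neuron stuck
    # in this region stops updating ("dying ReLU").
    return max(0.0, x)
```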


    Leaky ReLU

    Formula:

    f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}

    (α is small, e.g., 0.01)

    Why Needed?

    Fixes the Dying ReLU problem by allowing small gradient when input is negative.
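A sketch of the piecewise definition, with α as a keyword argument (0.01 as in the note above):

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha keeps a nonzero gradient for x <= 0."""
    return x if x > 0 else alpha * x
```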


    Softmax Function

    Formula:

    \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

    Output:

    • Produces probabilities

    • Sum of outputs = 1

    Use Case:

    • Multi-class classification (output layer)

    Example:

    Input: [2.0, 1.0, 0.1]

    Output: ≈ [0.66, 0.24, 0.10]
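The worked example can be reproduced with a few lines of Python. Subtracting the maximum before exponentiating (a standard trick, not part of the formula itself) avoids overflow without changing the result:

```python
import math

def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```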


    Vanishing Gradient Problem

    What is it?

    During backpropagation, gradients become extremely small as they move backward through layers.

    Why does it happen?

    • Sigmoid and Tanh squash values into small ranges.

    • Their derivatives become very small for large positive or negative inputs.

    • Deep networks → gradients shrink layer by layer.

    Effect:

    • Early layers learn very slowly

    • Training becomes inefficient

    • Deep networks fail to converge

    Solution:

    Use ReLU / Leaky ReLU
    Proper weight initialization
    Batch Normalization
    Use advanced architectures (ResNet, etc.)
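The layer-by-layer shrinkage can be seen numerically. The sigmoid's derivative σ(x)(1 − σ(x)) peaks at 0.25 (at x = 0), so even in this best case the gradient through 10 stacked sigmoid layers shrinks by a factor of at least 0.25 per layer (a simplified illustration, ignoring weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value is 0.25, at x = 0

# Best-case gradient after backpropagating through 10 sigmoid layers:
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)   # multiply by 0.25 per layer
# grad is 0.25**10, on the order of 1e-6 — early layers barely learn.
```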

    Comparison Table

    | Activation | Range | Use Case | Problem |
    |---|---|---|---|
    | Sigmoid | (0, 1) | Binary output | Vanishing gradient |
    | Tanh | (-1, 1) | Hidden layers (old models) | Vanishing gradient |
    | ReLU | [0, ∞) | Hidden layers (most common) | Dying ReLU |
    | Leaky ReLU | (-∞, ∞) | Improved ReLU | Slight complexity |
    | Softmax | (0, 1), sum = 1 | Multi-class output | None major |