Activation Functions

  • This lesson introduces activation functions used in neural networks to enable learning of non-linear patterns.
  • Sigmoid Function

    Formula:

    \sigma(x) = \frac{1}{1 + e^{-x}}

    Output Range:

    0 to 1

    Use Case:

    • Binary classification (output layer)

    Advantages:

    Smooth and differentiable
    Outputs probability-like values

    Disadvantages:

    Vanishing gradient problem
    Not zero-centered
    Slow convergence
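A minimal Python sketch of the sigmoid, directly from the formula above (function name is illustrative):

```python
import math

def sigmoid(x):
    """Map any real x into (0, 1); smooth and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(0) is exactly 0.5; large |x| saturates toward 0 or 1,
# which is where the vanishing-gradient issue comes from.
```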


    Tanh (Hyperbolic Tangent)

    Formula:

    \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

    Output Range:

    -1 to 1

    Advantages:

    Zero-centered
    Stronger gradients than sigmoid

    Disadvantages:

    Still suffers from vanishing gradient
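Tanh can be written straight from its definition; a quick sketch (Python's `math.tanh` computes the same thing):

```python
import math

def tanh(x):
    """Zero-centered squashing into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# tanh(0) == 0, unlike sigmoid(0) == 0.5 — this is the
# zero-centering advantage mentioned above.
```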


    ReLU (Rectified Linear Unit)

    Formula:

    f(x) = \max(0, x)

    Output Range:

    0 to ∞

    Advantages:

    Computationally efficient
    Reduces vanishing gradient problem
    Fast training

    Disadvantages:

    Dying ReLU problem (neurons can stop learning if output becomes 0 permanently)

    ReLU is the most commonly used activation in hidden layers.
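The formula translates to a one-liner; a minimal sketch:

```python
def relu(x):
    """ReLU: pass positive inputs through, zero out the rest."""
    # For x <= 0 the output AND the gradient are 0 — a neuron stuck
    # in this region stops updating ("dying ReLU").
    return max(0.0, x)
```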


    Leaky ReLU

    Formula:

    f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}

    (α is small, e.g., 0.01)

    Why Needed?

    Fixes the Dying ReLU problem by allowing small gradient when input is negative.
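A sketch of the piecewise definition, with α as a keyword argument (0.01 as in the note above):

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha keeps a nonzero gradient for x <= 0."""
    return x if x > 0 else alpha * x
```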


    Softmax Function

    Formula:

    \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

    Output:

    • Produces probabilities

    • Sum of outputs = 1

    Use Case:

    • Multi-class classification (output layer)

    Example:

    Input: [2.0, 1.0, 0.1]

    Output: ≈ [0.66, 0.24, 0.10]
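The worked example can be reproduced with a few lines of Python. Subtracting the maximum before exponentiating (a standard trick, not part of the formula itself) avoids overflow without changing the result:

```python
import math

def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```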


    Vanishing Gradient Problem

    What is it?

    During backpropagation, gradients become extremely small as they move backward through layers.

    Why does it happen?

    • Sigmoid and Tanh squash values into small ranges.

    • Their derivatives become very small for large positive or negative inputs.

    • Deep networks → gradients shrink layer by layer.

    Effect:

    • Early layers learn very slowly

    • Training becomes inefficient

    • Deep networks fail to converge

    Solution:

    Use ReLU / Leaky ReLU
    Proper weight initialization
    Batch Normalization
    Use advanced architectures (ResNet, etc.)
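The layer-by-layer shrinkage can be seen numerically. The sigmoid's derivative σ(x)(1 − σ(x)) peaks at 0.25 (at x = 0), so even in this best case the gradient through 10 stacked sigmoid layers shrinks by a factor of at least 0.25 per layer (a simplified illustration, ignoring weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value is 0.25, at x = 0

# Best-case gradient after backpropagating through 10 sigmoid layers:
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)   # multiply by 0.25 per layer
# grad is 0.25**10, on the order of 1e-6 — early layers barely learn.
```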

    Comparison Table

    | Activation | Range | Use Case | Problem |
    |---|---|---|---|
    | Sigmoid | (0, 1) | Binary output | Vanishing gradient |
    | Tanh | (-1, 1) | Hidden layers (old models) | Vanishing gradient |
    | ReLU | [0, ∞) | Hidden layers (most common) | Dying ReLU |
    | Leaky ReLU | (-∞, ∞) | Improved ReLU | Slight complexity |
    | Softmax | (0, 1), sum = 1 | Multi-class output | None major |