Activation functions help determine the output of a neural network. These functions are attached to each neuron in the network and determine whether it should be activated or not, based on whether each neuron's input is relevant for the model's prediction.
Activation functions also help normalize the output of each neuron, typically to a range between 0 and 1 or between -1 and 1.
In a neural network, inputs are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input by the weight gives the output of the neuron, which is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold.
Neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.
The binary step function formula is as follows: f(x) = 1 if x ≥ 0, else 0
The binary step function depends on a threshold value that decides whether a neuron should be activated or not.
The input fed to the activation function is compared to a certain threshold; if the input is greater than the threshold, the neuron is activated, otherwise it is deactivated, meaning that its output is not passed on to the next hidden layer.
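As a concrete illustration, here is a minimal NumPy sketch of a binary step activation (the threshold of 0 and the example inputs are assumptions for illustration, not values given in the text above):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Output 1 when the input is at or above the threshold, otherwise 0.
    return np.where(x >= threshold, 1.0, 0.0)

inputs = np.array([-2.0, -0.5, 0.3, 4.0])
print(binary_step(inputs))  # [0. 0. 1. 1.]
```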
Here are some of the limitations of binary step function:
- It cannot provide multi-value outputs — for example, it cannot be used for multi-class classification problems.
- The gradient of the step function is zero, which causes a hindrance in the backpropagation process.
The linear activation function formula is as follows: f(x) = x
The linear activation function, also known as "no activation" or the "identity function" (where the input is simply multiplied by 1.0), is one where the activation is proportional to the input.
The function does nothing to the weighted sum of the input; it simply returns the value it was given.
However, a linear activation function has two major problems:
- It’s not possible to use backpropagation as the derivative of the function is a constant and has no relation to the input x.
- All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer.
A network that uses only the linear activation function is effectively a linear regression model. Because of its limited power, it cannot create complex mappings between the network's inputs and outputs; the sketch below illustrates the collapse described above.
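Here is a minimal NumPy sketch (the weight matrices are random placeholders) showing that two stacked linear layers are exactly equivalent to a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # example input vector
W1 = rng.normal(size=(3, 4))  # weights of the first linear layer
W2 = rng.normal(size=(2, 3))  # weights of the second linear layer

# Applying two linear layers in sequence...
two_layers = W2 @ (W1 @ x)
# ...gives the same result as one linear layer with weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```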
Non-linear activation functions solve the following limitations of linear activation functions:
- They allow backpropagation because now the derivative function would be related to the input, and it’s possible to go back and understand which weights in the input neurons can provide a better prediction.
- They allow the stacking of multiple layers of neurons as the output would now be a non-linear combination of input passed through multiple layers. Any output can be represented as a functional computation in a neural network.
1. Sigmoid function
The sigmoid function formula is as follows: σ(x) = 1 / (1 + e^(-x))
It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice because of its range.
Looking at the sigmoid function, we can see that its output lies in the open interval (0, 1). The output can be interpreted as a probability, but strictly speaking it should not be treated as one. The sigmoid function was once very popular, and it can be thought of as the firing rate of a neuron. The middle region, where the slope is relatively steep, is the neuron's sensitive area; the sides, where the slope is very gentle, are the neuron's inhibitory area.
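A minimal NumPy sketch of the sigmoid and its derivative illustrates this: the slope is largest near the origin and nearly zero further away (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(x, round(sigmoid(x), 4), round(sigmoid_grad(x), 4))
# The gradient peaks at 0.25 at x = 0 and is roughly 0.000045 at x = +/-10.
```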
The function itself has certain drawbacks:
- When the input moves slightly away from the coordinate origin, the gradient of the function becomes very small, almost zero. During neural network backpropagation, we use the chain rule to calculate the derivative with respect to each weight w. When backpropagation passes through the sigmoid function, the derivative on this chain is very small. Moreover, it may pass through many sigmoid functions, which eventually causes the weight w to have little effect on the loss function, which is not conducive to optimizing the weights. This problem is called gradient saturation or gradient dispersion.
- The function output is not centered on 0, which reduces the efficiency of weight updates.
- The sigmoid function performs exponential operations, which are relatively slow to compute.
Advantages of the sigmoid function:
- Smooth gradient, preventing “jumps” in output values.
- Output values bound between 0 and 1, normalizing the output of each neuron.
- Clear predictions, i.e. very close to 1 or 0.
Sigmoid has three major disadvantages:
- Prone to gradient vanishing
- Function output is not zero-centered.
- Exponential operations are relatively time-consuming.
2. tanh function
The tanh function formula is as follows: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Tanh is the hyperbolic tangent function. The curves of the tanh and sigmoid functions are quite similar, so let's compare them. First of all, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference lies in the output interval.
The output interval of tanh is -1 to 1, and the whole function is zero-centered, which is better than sigmoid.
In general binary classification problems, the tanh function is used for the hidden layers and the sigmoid function is used for the output layer. However, these are not fixed rules; the specific activation function to use should be chosen according to the specific problem, or determined through experimentation.
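A small sketch comparing the two shows that tanh outputs are centered around zero while sigmoid outputs are not (the input grid is arbitrary):

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 9)    # evenly spaced sample inputs
sig = 1.0 / (1.0 + np.exp(-x))   # sigmoid: values in (0, 1)
tanh = np.tanh(x)                # tanh: values in (-1, 1)

print(sig.min(), sig.max(), sig.mean())     # mean is about 0.5
print(tanh.min(), tanh.max(), tanh.mean())  # mean is about 0.0
```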
3. ReLU function
The ReLU function formula is as follows: f(x) = max(0, x)
The ReLU function simply takes the maximum of its input and zero. Note that it is not differentiable everywhere (there is a kink at x = 0), but we can take a sub-gradient there. Although ReLU is simple, it is an important achievement of recent years.
The ReLU (Rectified Linear Unit) function is an activation function that is currently more popular. Compared with the sigmoid function and the tanh function, it has the following advantages:
- When the input is positive, there is no gradient saturation problem.
- The calculation speed is much faster, since the ReLU function involves only a simple linear relationship. Whether forward or backward, it is much faster than sigmoid and tanh, which both need to compute an exponential.
Disadvantages:
- When the input is negative, ReLU is completely inactive, which means that once a negative value is passed in, the neuron outputs zero and can "die." During forward propagation this is not necessarily a problem: some regions are sensitive and some are not. But during backpropagation, a negative input produces a gradient of exactly zero, which is the same problem the sigmoid and tanh functions have (see the sketch after this list).
- We can also see that the output of the ReLU function is either 0 or a positive number, which means that ReLU is not a zero-centered function.
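Here is a minimal sketch of ReLU and its sub-gradient (the sample inputs are arbitrary), showing that negative inputs give both zero output and zero gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Sub-gradient: 1 for positive inputs, 0 for negative inputs (0 chosen at x = 0).
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.] -> no gradient flows for negative inputs
```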
4. Leaky ReLU function
The Leaky ReLU function formula is as follows: f(x) = max(0.01x, x)
Leaky ReLU is an improved version of the ReLU function designed to solve the dying ReLU problem, as it has a small positive slope in the negative region.
To solve the dead ReLU problem, the negative part of ReLU is set to 0.01x instead of 0. Another intuitive idea is a parameter-based method, Parametric ReLU: f(x) = max(αx, x), where α can be learned through backpropagation. In theory, Leaky ReLU has all the advantages of ReLU without the dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
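A minimal sketch of both variants (here alpha is just a fixed example value; in Parametric ReLU it would be a learned parameter):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # f(x) = x for x >= 0, and 0.01 * x for x < 0
    return np.where(x >= 0, x, negative_slope * x)

def parametric_relu(x, alpha):
    # f(x) = max(alpha * x, x); alpha would be learned via backpropagation in practice
    return np.maximum(alpha * x, x)

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))            # [-0.05 -0.01  0.    2.  ]
print(parametric_relu(x, 0.1))  # [-0.5  -0.1   0.    2.  ]
```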
5. ELU (Exponential Linear Units) function
The ELU function formula is as follows: f(x) = x for x ≥ 0, and f(x) = α(e^x - 1) for x < 0
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the negative part of the function.
ELU uses an exponential curve to define the negative values, unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line.
ELU is also proposed to solve the problems of ReLU. Obviously, ELU has all the advantages of ReLU, and:
- No Dead ReLU issues.
- The mean of the output is close to 0, so it is approximately zero-centered.
One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
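A minimal sketch of ELU, assuming alpha = 1.0:

```python
import numpy as np

def elu(x, alpha=1.0):
    # f(x) = x for x >= 0, and alpha * (e^x - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(np.round(elu(x), 3))  # [-0.95  -0.632  0.     2.   ]
```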
6. Softmax function
The Softmax function formula is as follows: softmax(x_i) = e^(x_i) / Σ_j e^(x_j), for i = 1, …, K
For an arbitrary real vector of length K, Softmax can compress it into a real vector of length K with values in the range (0, 1), where the elements of the vector sum to 1.
It also has many applications in Multiclass Classification and neural networks. Softmax is different from the normal max function: the max function only outputs the largest value, and Softmax ensures that smaller values have a smaller probability and will not be discarded directly. It is a “max” that is “soft”.
The denominator of the Softmax function combines all components of the original output vector, which means that the different probabilities obtained from Softmax are related to each other. In the case of binary classification, Sigmoid gives: P(y = 1 | x) = 1 / (1 + e^(-x)).
For Softmax with K = 2, we have: P(y = 1 | x) = e^(x_1) / (e^(x_1) + e^(x_2)) = 1 / (1 + e^(-(x_1 - x_2))).
It can be seen that in the case of binary classification, Softmax degenerates into a Sigmoid applied to the difference of the two inputs.
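A small numerical check of this equivalence, using arbitrary example scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([2.0, 0.5])          # two-class scores x_1 and x_2
p_softmax = softmax(z)[0]         # probability of class 1 from Softmax
p_sigmoid = sigmoid(z[0] - z[1])  # Sigmoid of the difference x_1 - x_2
print(p_softmax, p_sigmoid, np.isclose(p_softmax, p_sigmoid))  # ... True
```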
Let's go over a simple example of multiclass classification.
Assume that you have three classes, meaning that there would be three neurons in the output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].
Applying the softmax function over these values to give a probabilistic view will result in the following outcome: [0.58, 0.23, 0.19].
Taking the index with the highest probability then gives the prediction: index 0 receives the largest weight, while indexes 1 and 2 receive much smaller weights. So the output would be the class corresponding to the first neuron (index 0) out of the three.
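The same example in code, reproducing the numbers above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.8, 0.9, 0.68])
probs = softmax(logits)
print(np.round(probs, 2))  # [0.58 0.23 0.19]
print(np.argmax(probs))    # 0 -> the predicted class is the one at index 0
```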
You can see now how the softmax activation function makes things easy for multi-class classification problems.
Activation functions play a crucial role in determining the output of neural networks, serving as mathematical “gates” that regulate the flow of information between neurons. From fundamental functions like binary step and linear activation to more advanced non-linear ones such as sigmoid, tanh, ReLU, Leaky ReLU, ELU, and Softmax, each activation function has its unique characteristics, strengths, and limitations.
Non-linear activation functions enable neural networks to learn complex patterns and relationships in data, making them indispensable for tasks requiring sophisticated modeling. Among these, Softmax stands out for its suitability in multiclass classification problems, ensuring that smaller values have a proportional probability and are not discarded outright.
However, each activation function comes with its own set of advantages and drawbacks. For instance, while sigmoid provides smooth gradients and clear predictions, it suffers from issues like gradient vanishing and non-zero centered outputs. On the other hand, ReLU offers faster computation and addresses gradient saturation, yet it faces challenges with dead neurons in negative input regions.
Choosing the right activation function is paramount for effective neural network design and optimal model performance in machine learning tasks. Understanding the characteristics and trade-offs of each activation function empowers data scientists and machine learning practitioners to make informed decisions in model architecture and training, ultimately leading to better outcomes in real-world applications.