Activation Functions for Deep Learning

Nour Ouhichi
4 min read · Jun 7, 2021

Definition:

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be “ON” (1) or “OFF” (0), depending on input. This is similar to the linear perceptron in neural networks.

The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make. As such, a careful choice of activation function must be made for each deep learning neural network project.

Types of activation functions:

The most common activation functions are listed below:

  • Binary step
  • Linear
  • Sigmoid
  • Tanh
  • ReLU
  • Softmax

1. Binary Step Activation Function

This is a very basic activation function, and it is the first that comes to mind whenever we want to bound the output. It is essentially a threshold-based classifier: we pick a threshold value and the neuron is activated or deactivated depending on whether its input exceeds that threshold.

f(x) = 1 if x >= 0 else 0

In this case, we set the threshold value to 0. It is very simple and is useful for binary classification problems.
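As a rough illustration, here is a minimal NumPy sketch of the binary step function with the threshold fixed at 0 (the function name and the sample inputs are just for illustration):

import numpy as np

def binary_step(x):
    # Output 1 where the input reaches the threshold (0), otherwise 0.
    return np.where(x >= 0, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0 0 1 1]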

2. Linear Activation Function

This is a simple straight-line activation function whose output is directly proportional to the weighted sum of the inputs. A linear activation gives a wide range of output values, and with a positive slope the firing rate increases as the input increases.

With the binary step, a neuron is either firing or not. If you are familiar with gradient descent in deep learning, you will notice that the derivative of a linear function is constant.

Y = mZ

The derivative with respect to Z is the constant m, which means the gradient is also constant and does not depend on Z. The updates made during backpropagation are therefore constant and independent of Z, which is not good for learning.

Moreover, with a linear activation the second layer is just a linear function of the previous layer's input. If we stack any number of linear layers, the whole network collapses to a single linear function of the first layer's input, as the sketch below shows.
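As a quick sanity check, the NumPy sketch below (the layer sizes and variable names are made up for illustration) shows that two stacked linear layers compose into a single linear map of the original input:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # weights of the first linear layer
W2 = rng.normal(size=(5, 2))   # weights of the second linear layer

# Two stacked layers with a linear activation f(z) = z ...
two_layers = (x @ W1) @ W2
# ... are the same as one layer whose weight matrix is W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True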

3. Sigmoid Activation Function

The sigmoid activation function is widely used and works well in practice. It gives a probabilistic interpretation to the output, squashing it into the range 0 to 1, which makes it a natural choice when we need to predict a probability or make a yes/no decision.

The equation for the sigmoid function is

f(x) = 1 / (1 + e^(-x))

The sigmoid function suffers from what is known as the vanishing gradient problem: large inputs are squashed into the range 0 to 1, so the derivative becomes very small and the network learns slowly. To avoid this, another activation function such as ReLU is often used, where the gradient does not shrink in this way for positive inputs.
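To make the vanishing gradient concrete, here is a minimal sketch of the sigmoid and its derivative (the sample inputs are illustrative, not from the article); note how small the derivative becomes for a large input:

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); at most 0.25 (at x = 0) and close to 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([0.0, 2.0, 10.0])
print(sigmoid(x))             # approximately [0.5, 0.88, 0.99995]
print(sigmoid_derivative(x))  # approximately [0.25, 0.10, 0.000045]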

4. Tanh (Hyperbolic Tangent) Activation Function

This activation function often works slightly better than the sigmoid. Like the sigmoid, it can be used to differentiate between two classes, but it maps negative inputs to negative outputs and ranges from -1 to 1, so its output is centered around zero.
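For illustration, here is a short comparison of tanh and sigmoid on the same made-up inputs, using NumPy's built-in np.tanh:

import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# tanh maps inputs to (-1, 1) and keeps negative inputs negative,
# while sigmoid squashes everything into (0, 1).
print(np.tanh(x))                # approximately [-0.995, -0.762, 0.0, 0.762, 0.995]
print(1.0 / (1.0 + np.exp(-x)))  # approximately [0.047, 0.269, 0.5, 0.731, 0.953]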

5. ReLU (Rectified Linear Unit) Activation Function

The rectified linear unit, or ReLU, is the most widely used activation function today. Its output ranges from 0 to infinity: positive inputs pass through unchanged, while all negative inputs are mapped to zero. Because the gradient is exactly zero for negative inputs, some neurons can get stuck outputting zero and stop learning, which creates a problem, but where there is a problem there is a solution: variants such as Leaky ReLU keep a small gradient for negative inputs.
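Here is a minimal sketch of ReLU, together with the Leaky ReLU variant mentioned above (the 0.01 slope is a common but illustrative choice, not a fixed rule):

import numpy as np

def relu(x):
    # Keep positive values as they are, map all negatives to zero.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope so the gradient is never exactly zero.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]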

6. Softmax Activation Function

Softmax is used mainly in the last (output) layer for decision making, similar to how the sigmoid activation is used. It converts the raw scores of the output layer into values between 0 and 1 that sum to exactly one, so they can be read as probabilities over the classes.

For binary classification, both sigmoid and softmax are viable choices, but for multi-class classification problems we generally use softmax, usually together with a cross-entropy loss.
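Here is a minimal, numerically stable softmax sketch over a vector of raw class scores (the scores themselves are made up):

import numpy as np

def softmax(z):
    # Subtracting the max keeps exp() from overflowing without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs of the last layer
probs = softmax(scores)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0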
