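Sigmoid unit: $f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
Rectified linear unit (ReLU):
- as stepped sigmoid: $f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5)$
- as softplus function: $f(x) = \log(1 + e^{x})$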
The softplus function can in turn be approximated by the max function (or hard max), i.e. $\max(0, x)$. The max function is commonly known as the Rectified Linear Function (ReL).
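The chain of approximations described above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and truncating the infinite stepped-sigmoid sum at `n_units=50` terms is an assumption made only so the sum can be evaluated.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def stepped_sigmoid(x, n_units=50):
    """Sum of sigmoid copies whose biases are shifted by -0.5, -1.5, -2.5, ...

    The infinite sum is truncated at n_units terms (an assumption for
    illustration; the remaining terms are negligible for moderate x).
    """
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_units + 1))


def softplus(x):
    """Smooth rectifier log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def relu(x):
    """Hard max: max(0, x)."""
    return max(0.0, x)


# Print the three functions side by side for a few inputs.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  stepped={stepped_sigmoid(x):.4f}  "
          f"softplus={softplus(x):.4f}  relu={relu(x):.4f}")
```

The stepped sigmoid and softplus values stay close across the whole range, while the hard max differs from softplus mainly near $x = 0$ and agrees with it increasingly well as $|x|$ grows.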
The figure below compares the ReL function (soft and hard) with the sigmoid function.
The major differences between the sigmoid and ReL functions are:
- The sigmoid function has range $(0, 1)$, whereas the ReL function has range $[0, \infty)$. Hence the sigmoid function can be used to model a probability, whereas ReL can be used to model a non-negative real number. NOTE: The view of the softplus function as an approximation of stepped sigmoid units relates to the binomial hidden units as discussed in
- The gradient of the sigmoid function vanishes as $x$ becomes large in either direction. However, the gradient of the ReL function does not vanish as $x$ increases. In fact, for the max function the gradient is $0$ for $x < 0$ and $1$ for $x > 0$.
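The gradient behaviour in the last point can be illustrated with a short sketch. The helper names are hypothetical, and returning 0 for the hard max gradient at exactly $x = 0$ is a convention assumed here, since the derivative is not defined at that point.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def softplus_grad(x):
    """Derivative of log(1 + exp(x)), which is the sigmoid itself."""
    return sigmoid(x)


def relu_grad(x):
    """Derivative of max(0, x); 0 at x == 0 is an assumed convention."""
    return 1.0 if x > 0 else 0.0


# The sigmoid gradient peaks at 0.25 and decays on both sides,
# while the ReL gradients stay near 1 for large positive x.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:+5.1f}  d_sigmoid={sigmoid_grad(x):.6f}  "
          f"d_softplus={softplus_grad(x):.6f}  d_relu={relu_grad(x):.1f}")
```

Note that for negative inputs the hard max gradient is exactly zero, so the non-vanishing property applies only on the positive side.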
The advantages of using Rectified Linear Units in neural networks are:
- If the hard max function is used as the activation function, it induces sparsity in the hidden units, because every unit with a negative input outputs exactly zero.
- As discussed earlier, ReLU does not suffer from the vanishing gradient problem faced by the sigmoid and tanh functions. It has also been shown that deep networks can be trained efficiently using ReLU even without pre-training.
- ReLU can be used in a Restricted Boltzmann machine to model real/integer valued inputs.
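The sparsity claim in the first point can be illustrated with a small sketch. Everything here is an assumption chosen for illustration: the layer sizes, the zero biases, and the standard normal weights and inputs are not taken from any particular model.

```python
import random


def relu(x):
    """Hard max: max(0, x)."""
    return max(0.0, x)


def hidden_activations(inputs, weights, biases):
    """One fully connected layer followed by the hard max activation."""
    return [relu(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


random.seed(0)  # fixed seed so the run is repeatable
n_in, n_hidden = 20, 1000  # illustrative sizes

inputs = [random.gauss(0.0, 1.0) for _ in range(n_in)]
weights = [[random.gauss(0.0, 1.0) for _ in range(n_in)]
           for _ in range(n_hidden)]
biases = [0.0] * n_hidden

h = hidden_activations(inputs, weights, biases)

# Fraction of hidden units whose output is exactly zero.
sparsity = sum(1 for a in h if a == 0.0) / len(h)
print(f"fraction of exactly-zero hidden units: {sparsity:.2f}")
```

With weights drawn symmetrically around zero, roughly half of the pre-activations are negative, so about half of the hidden units are switched off for any given input. A sigmoid layer with the same weights would produce no exact zeros.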
References:
- On Rectified Linear Units for Speech Processing
- Rectifier Nonlinearities Improve Neural Network Acoustic Models
- Deep Sparse Rectifier Neural Networks