What is special about rectifier neural units used in NN learning?

Sigmoid unit:

f(x) = \frac{1}{1 + e^{-x}}

Tanh unit:

f(x) = \tanh(x)

Rectified linear unit (ReLU):

f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5) \approx \log(1 + e^{x})

We call:

  • \sum_{i=1}^{\infty} \sigma(x - i + 0.5) the stepped sigmoid
  • \log(1 + e^{x}) the softplus function
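
A sketch of why this approximation holds (this reasoning step is not spelled out in the original answer): the terms \sigma(x - i + 0.5), i = 1, 2, \ldots, sample \sigma(x - t) at the midpoints t = 0.5, 1.5, 2.5, \ldots, so the stepped sigmoid is a unit-width midpoint Riemann sum of

\int_{0}^{\infty} \sigma(x - t)\, dt = \int_{-\infty}^{x} \sigma(u)\, du = \log(1 + e^{x}),

since \frac{d}{du} \log(1 + e^{u}) = \sigma(u) and \log(1 + e^{u}) \to 0 as u \to -\infty. The difference between the sum and the integral is small but not exactly zero.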

The softplus function can in turn be approximated by the max function (or "hard max"), i.e. \max(0, x + N(0, 1)). The max function is commonly known as the Rectified Linear function (ReL).
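
As a quick numerical check of these approximations, here is a minimal NumPy sketch (not from the original answer; the infinite sum is truncated at a finite number of terms, and noisy_relu is the max(0, x + N(0, 1)) unit described above):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softplus(x):
        return np.log1p(np.exp(x))  # log(1 + e^x)

    def stepped_sigmoid(x, n_terms=50):
        # Truncated version of sum_{i=1}^{inf} sigma(x - i + 0.5)
        i = np.arange(1, n_terms + 1)
        return sigmoid(x[:, None] - i + 0.5).sum(axis=1)

    def relu(x):
        # The "hard max" / Rectified Linear function
        return np.maximum(0.0, x)

    def noisy_relu(x, rng=None):
        # max(0, x + N(0, 1)) as described in the text
        rng = np.random.default_rng(0) if rng is None else rng
        return np.maximum(0.0, x + rng.standard_normal(x.shape))

    x = np.linspace(-5.0, 5.0, 11)
    print(np.max(np.abs(stepped_sigmoid(x) - softplus(x))))  # small (~1e-2, worst near x = 0)
    print(np.max(np.abs(softplus(x) - relu(x))))             # largest at x = 0, where softplus(0) = log 2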

In the figure below, the different activation functions are plotted together.
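
A minimal matplotlib sketch (an assumed plotting setup, not part of the original answer) that produces such a plot:

    import numpy as np
    import matplotlib.pyplot as plt

    xs = np.linspace(-5, 5, 500)
    plt.plot(xs, 1.0 / (1.0 + np.exp(-xs)), label="sigmoid")
    plt.plot(xs, np.tanh(xs), label="tanh")
    plt.plot(xs, np.log1p(np.exp(xs)), label="softplus")
    plt.plot(xs, np.maximum(0.0, xs), label="ReL (hard max)")
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.legend()
    plt.show()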

The major differences between the sigmoid and ReL functions are:

  • The sigmoid function has range [0, 1], whereas the ReL function has range [0, \infty). Because of its range, the sigmoid can be used to model a probability; hence it is commonly used for regression or probability estimation at the last layer even when ReL is used in the previous layers. NERD NOTE: The view of the softplus function as an approximation of stepped sigmoid units relates to binomial hidden units, as discussed in http://machinelearning.wustl.edu...
  • The gradient of the sigmoid function vanishes as x moves away from 0, so the unit is said to "saturate" in those regions. The ReL function is free of this problem on its positive side, which is unbounded and linear, so its gradient there does not vanish (see the sketch after this list).

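To make the saturation point concrete, here is a small sketch (not from the original answer) comparing the two gradients at a few inputs:

    import numpy as np

    def sigmoid_grad(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)          # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))

    def relu_grad(x):
        # d/dx max(0, x); the (sub)gradient at exactly 0 is taken as 0 here
        return (x > 0).astype(float)

    x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
    print(sigmoid_grad(x))  # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]: vanishes away from 0
    print(relu_grad(x))     # [0. 0. 0. 1. 1.]: constant 1 on the positive side
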
The advantages of using Rectified Linear Units in neural networks are:

  • If the hard max is used, it induces sparsity in the layer activations (see the sketch after this list).
  • As discussed earlier, ReLU does not suffer from the vanishing gradient problem, so it allows training deeper networks without pre-training.
  • ReLU can be used in Restricted Boltzmann machines to model real- or integer-valued inputs.
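
A quick illustration of the sparsity point from the first bullet (a sketch, not from the original answer): with zero-mean inputs, roughly half of the hard-max activations come out exactly zero.

    import numpy as np

    rng = np.random.default_rng(0)
    pre_activations = rng.standard_normal((1000, 256))  # a batch of hypothetical layer pre-activations
    activations = np.maximum(0.0, pre_activations)      # hard max / ReL

    sparsity = np.mean(activations == 0.0)
    print(f"fraction of exactly-zero activations: {sparsity:.2f}")  # ~0.50 for zero-mean inputs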

  • Tony

    Do you know how to prove that $\lim_{N \to \infty} \sum_{i=1}^{N} \sigma(x - i + 0.5) - \log(1 + e^{x}) = 0$?