What is special about rectifier neural units used in NN learning?

Sigmoid unit:
 $f(x) = \frac{1}{1 + \exp(-x)}$

Tanh unit:
 $f(x) = \tanh(x)$

Rectified linear unit (ReLU):
 $f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5) \approx \log(1 + e^{x})$
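
For concreteness, here is a minimal NumPy sketch of these units (the function names and the numerically stable softplus form are my own choices, not part of the original discussion):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x)), range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh_unit(x):
    # Hyperbolic tangent unit, range (-1, 1)
    return np.tanh(x)

def softplus(x):
    # Smooth rectifier log(1 + e^x); logaddexp(0, x) avoids overflow for large x
    return np.logaddexp(0.0, x)

def relu(x):
    # Hard rectifier max(0, x)
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(tanh_unit(x))
print(softplus(x))
print(relu(x))
```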

We refer to

  •  $\sum_{i=1}^{\infty} \sigma(x - i + 0.5)$ as the stepped sigmoid
  •  $\log(1 + e^{x})$ as the softplus function
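
The approximation above is easy to check numerically. The sketch below (my own, with an arbitrary truncation of the infinite sum at 50 terms) compares the stepped sigmoid with the softplus at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid(x, n_terms=50):
    # Partial sum of sum_{i=1}^{inf} sigma(x - i + 0.5);
    # terms with large i contribute almost nothing for moderate x.
    i = np.arange(1, n_terms + 1)
    return np.sum(sigmoid(x - i + 0.5))

def softplus(x):
    # log(1 + e^x), written in a numerically stable form
    return np.logaddexp(0.0, x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, stepped_sigmoid(x), softplus(x))
# The two columns agree to within a few hundredths over this range.
```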

The softplus function can be approximated by the max function (or hard max), i.e., $\max(0, x + N(0,1))$. The max function is commonly known as the Rectified Linear function (ReL).
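
A small sketch of this noisy hard-max unit, assuming the unit-variance Gaussian noise written above (other noise variances are used in practice); dropping the noise gives the plain ReL:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_relu(x, rng):
    # Hard max with additive Gaussian noise: max(0, x + N(0, 1))
    noise = rng.standard_normal(np.shape(x))
    return np.maximum(0.0, x + noise)

x = np.linspace(-2.0, 2.0, 5)
print(noisy_relu(x, rng))   # stochastic rectified outputs
print(np.maximum(0.0, x))   # deterministic ReL for comparison
```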

In the figure below, we compare the ReL function (soft and hard) with the sigmoid function.

The major differences between the sigmoid and ReL functions are:

  • The sigmoid function has range $[0,1]$, whereas the ReL function has range $[0,\infty)$. Hence the sigmoid function can be used to model a probability, whereas the ReL function can be used to model a positive real number. NOTE: The view of the softplus function as an approximation of stepped sigmoid units relates to the binomial hidden units discussed in http://machinelearning.wustl.edu...
  • The gradient of the sigmoid function vanishes as we increase or decrease $x$, whereas the gradient of the ReL function does not vanish as we increase $x$. In fact, for the max function, the gradient is $f'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}$ (see the numeric sketch after this list).
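
The gradient behaviour is easy to verify numerically. Here is a small sketch (my own) comparing the sigmoid gradient with the hard-max gradient at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); shrinks towards 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Gradient of max(0, x): 0 for x < 0, 1 for x > 0 (taken as 0 at x = 0 here)
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

for x in (-10.0, -2.0, 2.0, 10.0, 50.0):
    print(x, sigmoid_grad(x), relu_grad(x))
```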

The advantages of using Rectified Linear Units in neural networks are:

  • If the hard max function is used as the activation function, it induces sparsity in the hidden units (illustrated in the sketch after this list).
  • As discussed earlier, ReLU does not suffer from the vanishing gradient problem that affects the sigmoid and tanh functions. It has also been shown that deep networks can be trained efficiently using ReLU even without pre-training.
  • ReLU can be used in restricted Boltzmann machines to model real- or integer-valued inputs.
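
To illustrate the sparsity point, here is a small sketch using hypothetical, randomly drawn pre-activations; it counts the fraction of units that a ReLU layer sets exactly to zero versus a sigmoid layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for one hidden layer (batch of 1000, 500 units),
# drawn from a zero-mean Gaussian purely for illustration.
pre_activations = rng.standard_normal((1000, 500))

relu_out = np.maximum(0.0, pre_activations)
sigmoid_out = 1.0 / (1.0 + np.exp(-pre_activations))

# ReLU zeroes out roughly half of these units, giving a sparse code;
# the sigmoid leaves every unit with a non-zero activation.
print("fraction of exact zeros (ReLU):   ", np.mean(relu_out == 0.0))
print("fraction of exact zeros (sigmoid):", np.mean(sigmoid_out == 0.0))
```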

Comment from Tony:

    Do you know how to prove that $\lim_{N \to \infty} \sum_{i=1}^{N} \sigma(x - i + 0.5) - \log(1 + e^{x}) = 0$?