Microsoft Research introduced a new NN model that beats Google and the rest

MS researchers recently introduced a new deep (indeed very deep 🙂) NN model (PReLU Net) [1], and they push the state of the art on the ImageNet 2012 dataset from a 6.66% (GoogLeNet) to a 4.94% top-5 error rate.

In this work, they introduce a variation of the well-known ReLU activation function. They call it PReLU (Parametric Rectified Linear Unit). The idea is to allow negative activations on the ReLU function via a control parameter a that is also learned during training. Hence, PReLU allows negative activations, and in the paper they argue and empirically show that PReLU is better at mitigating the vanishing gradient problem for very deep neural networks (> 13 layers), thanks to those negative activations. That means more active units per layer, hence more gradient feedback at the backpropagation stage.

[Figure: the PReLU activation function. All figures are from the paper.]

As I mentioned earlier, PReLU requires a new learned parameter a for each channel (channel-wise) or a single one for each layer (channel-shared). In both cases, they show that PReLU improves results over ReLU activations, especially for deeper models. To be more precise, let's dive into the formulation. PReLU computes the following function in the forward pass:

$$f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0 \end{cases}$$

or equivalently, $f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$.
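To make this concrete, here is a minimal NumPy sketch of the channel-wise forward pass (the function name and shapes are my own illustration, not code from the paper; a = 0.25 is the initialization the authors use):

```python
import numpy as np

def prelu_forward(y, a):
    """Channel-wise PReLU: f(y_i) = y_i if y_i > 0, else a_i * y_i.

    y: pre-activations of shape (batch, channels)
    a: learned slope per channel, shape (channels,)
    """
    return np.where(y > 0, y, a * y)

# With a = 0.25, negative inputs are scaled down
# instead of being zeroed out as in plain ReLU.
y = np.array([[-2.0, 0.5], [1.0, -0.1]])
a = np.full(2, 0.25)
print(prelu_forward(y, a))  # [[-0.5, 0.5], [1.0, -0.025]]
```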

a is learned by backpropagation like any other weight, updated at each training step. Its gradient is the gradient coming from the deeper layer multiplied by the unit stimulus y when y ≤ 0 (the regime where a actually scales the output), and 0 when y > 0:

$$\frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}$$

$$\frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0 \end{cases}$$

$$\Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial E}{\partial a_i}$$

Here, E is the objective function, μ is the momentum, and ε is the learning rate.
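A rough NumPy sketch of these two steps, assuming a (batch, channels) layout (the function names and hyperparameter values are my own illustration, not the paper's code; I use the usual minus sign for descent):

```python
import numpy as np

def prelu_grad_a(y, grad_out):
    """dE/da per channel: df(y_i)/da_i is y_i where y_i <= 0 and 0
    otherwise, so the chain-rule product with the upstream gradient
    grad_out = dE/df(y) is summed over the batch dimension."""
    return np.sum(np.where(y > 0, 0.0, y) * grad_out, axis=0)

def momentum_update(a, delta_a, grad_a, mu=0.9, eps=0.01):
    """Momentum step for the slopes; the minus sign is the usual
    minimization convention."""
    delta_a = mu * delta_a - eps * grad_a
    return a + delta_a, delta_a

# One illustrative step for a 2-channel layer:
y = np.array([[-2.0, 0.5], [1.0, -0.1]])  # pre-activations
grad_out = np.ones_like(y)                # pretend upstream gradient
a, delta_a = momentum_update(np.full(2, 0.25), np.zeros(2),
                             prelu_grad_a(y, grad_out))
```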

Besides the PReLU function, they also use a Spatial Pyramid Pooling (SPP) layer [2] just before the fully connected layers. As a side note, SPP is a great tool that lets you process images of different sizes, sidestepping the fixed input size constraint of NN models.
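To give a feeling for how SPP produces a fixed-length output from variable-size inputs, here is a minimal sketch for a single feature map, assuming max pooling and a {1, 2, 4} pyramid (the pyramid configuration and function name are my own choices; see [2] for the real details):

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Pool one feature map (C, H, W) into a fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid
    and max-pooled per cell, so the output length is
    C * sum(n*n for n in levels) regardless of H and W.
    """
    C, H, W = fmap.shape
    out = []
    for n in levels:
        # Cell boundaries; ceil handles H, W not divisible by n.
        hs = [int(np.ceil(i * H / n)) for i in range(n + 1)]
        ws = [int(np.ceil(j * W / n)) for j in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = fmap[:, hs[i]:hs[i+1], ws[j]:ws[j+1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)

# Two different input sizes yield the same output length:
print(spatial_pyramid_pool(np.random.rand(8, 13, 17)).shape)  # (168,)
print(spatial_pyramid_pool(np.random.rand(8, 24, 24)).shape)  # (168,)
```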


For more details, please refer to [1], and I strongly suggest looking at [2] as well to see how the SPP layer behaves.

[1] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Computer Vision and Pattern Recognition. Retrieved from http://arxiv.org/abs/1502.01852

[2] He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. Computer Vision and Pattern Recognition. Retrieved from http://arxiv.org/abs/1406.4729
