Paper Review: Self-Normalizing Neural Networks

One of the main problems with neural networks is taming layer activations so that gradients stay stable and the network learns faster, without anything holding training back. Batch Normalization (BN) shows us that keeping activations at mean 0 and variance 1 seems to work. However, despite BN's indisputable effectiveness, it adds more layers and computation to your model, which you would rather avoid in the best case.

ELU (Exponential Linear Unit) is an activation function that aims to tame neural networks on the fly through a slight modification of the activation function. It keeps positive values as they are and maps negative values onto an exponential curve that saturates at -\alpha.

ELU function: ELU(x) = x if x > 0, and \alpha (e^x - 1) if x \le 0. \alpha is a constant you define.
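To make the definition concrete, here is a minimal NumPy sketch of ELU. The function name `elu` and the default `alpha=1.0` are my own choices for illustration, not something prescribed by the paper.

```python
import numpy as np


def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise.

    alpha=1.0 is an assumed default; it is a constant you pick yourself.
    """
    x = np.asarray(x, dtype=float)
    # Clamp the argument of exp at 0 so large positive inputs cannot overflow;
    # those entries are discarded by np.where anyway.
    negative_branch = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x > 0, x, negative_branch)


# Positive inputs pass through; negative inputs saturate toward -alpha.
values = elu([-5.0, -1.0, 0.0, 2.0])
```

Very negative inputs approach -\alpha but never go below it, which is the saturation that the later dropout variant relies on.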


ELU does its job well enough if you want to avoid the cost of Batch Normalization; however, its effectiveness rests on empirical results rather than on a theoretical proof. And finding a good \alpha is just a guess.

Self-Normalizing Neural Networks takes things to the next level. In short, it describes a new activation function, SELU (Scaled Exponential Linear Unit), a new initialization scheme and, as a consequence, a new dropout variant.

The main idea is to keep network activations in a certain basin defined by a mean and a variance. These can be any values of your choice, but in the paper they are mean 0 and variance 1 (similar to the notion of Batch Normalization). The question then is how to modify the ELU function with some scaling factors so that activations keep that mean and variance on the fly. The authors find these scaling values through a long theoretical justification. Roughly, the scaling factors are chosen so that the mapping of mean and variance from one layer to the next is a contraction whose fixed point is the targeted mean and variance. (This is just a verbal definition and by no means complete. Please refer to the paper for the theory.)

SELU is defined as SELU(x) = \lambda x if x > 0, and \lambda \alpha (e^x - 1) if x \le 0, with scaling factors \alpha and \lambda. After a long run of computations, these values turn out to be approximately 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946, respectively. Nevertheless, do not forget that these scaling factors specifically target mean 0 and variance 1. Any change to the target means these values change as well.
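As a quick sanity check of those two constants, the sketch below applies SELU to standard normal samples and measures the output mean and variance. The names `ALPHA`, `LAMBDA` and `selu` are mine, and the constants are truncated to double precision. If the pre-activations really are N(0, 1), the outputs should again have mean close to 0 and variance close to 1.

```python
import numpy as np

# Constants from the paper, truncated to float64 precision.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    """SELU: LAMBDA * x for x > 0, LAMBDA * ALPHA * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    negative_branch = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, negative_branch)


# Assumption for this check: pre-activations are exactly standard normal.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
y = selu(z)
selu_mean, selu_var = y.mean(), y.var()
print(f"mean={selu_mean:.4f} var={selu_var:.4f}")
```

This only checks the fixed point itself. It says nothing about how quickly activations that start away from mean 0 and variance 1 are pulled back, which is what the proof in the paper covers.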

Initialization is another important part of the whole method. The aim here is to start with the right values. They suggest sampling weights from a Gaussian distribution with mean 0 and variance 1/n, where n is the number of inputs to each unit (its fan-in).
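Here is a rough sketch of how the initialization and SELU work together, taking n to be the number of inputs to each unit as the paper does. The helper names (`init_weights`, `forward_stats`) and the depth, width and batch size are arbitrary choices of mine. It pushes standardized random inputs through a stack of bias-free fully-connected SELU layers and records the mean and variance of the activations after each layer.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    negative_branch = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, negative_branch)


def init_weights(n_in, n_out, rng):
    """Gaussian weights with mean 0 and variance 1/n_in (n_in = fan-in)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))


def forward_stats(depth=16, width=512, batch=1024, seed=0):
    """Return (mean, variance) of the activations after each layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # standardized inputs (assumed)
    stats = []
    for _ in range(depth):
        h = selu(h @ init_weights(width, width, rng))
        stats.append((h.mean(), h.var()))
    return stats


stats = forward_stats()
for layer, (m, v) in enumerate(stats, start=1):
    print(f"layer {layer:2d}: mean={m:+.3f} var={v:.3f}")
```

In this toy setting the statistics should hover around 0 and 1 with no normalization layer in between. Narrow layers make the per-layer numbers noisier, so treat the output as an illustration, not a benchmark.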

It is widely held that Dropout does not play well with Batch Normalization, since it perturbs network activations in a purely random manner. This method seems even more brittle to the effect of dropout. As a cure, they propose Alpha Dropout. It randomly sets inputs to the saturated negative value of SELU, which is -\alpha\lambda. Then an affine transformation is applied, with a and b values computed from the dropout rate and the targeted mean and variance. It randomizes the network without degrading its self-normalizing properties.
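The dropout step described above can be sketched as follows for the mean 0, variance 1 target. With keep probability q = 1 - rate and \alpha' = -\alpha\lambda, the paper's affine parameters are a = (q + \alpha'^2 q (1 - q))^{-1/2} and b = -a (1 - q) \alpha'. The function name `alpha_dropout`, its signature and the rate of 0.1 are my own choices, and the sketch assumes the incoming activations already have mean 0 and variance 1.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def alpha_dropout(x, rate, rng):
    """Alpha Dropout sketch for the mean-0 / variance-1 target.

    Dropped entries are set to the SELU saturation value -ALPHA * LAMBDA
    instead of 0, then a * (.) + b restores mean 0 and variance 1.
    """
    q = 1.0 - rate                      # keep probability
    alpha_p = -ALPHA * LAMBDA           # saturated negative value of SELU
    keep = rng.random(x.shape) < q      # True where the input is kept
    a = (q + alpha_p**2 * q * (1.0 - q)) ** -0.5
    b = -a * alpha_p * (1.0 - q)
    return a * np.where(keep, x, alpha_p) + b


# Assumption: incoming activations are standard normal.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
dropped = alpha_dropout(x, rate=0.1, rng=rng)
drop_mean, drop_var = dropped.mean(), dropped.var()
print(f"mean={drop_mean:.4f} var={drop_var:.4f}")
```

Without the affine correction, the dropped entries would shift the mean toward -\alpha\lambda and change the variance, which is exactly what the a and b terms undo.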

From a practical point of view, SELU seems promising since it reduces computation time relative to ReLU+BN for normalizing the network. In the paper they do not provide any vision-based baselines such as MNIST or CIFAR, and they focus only on fully-connected models. I am still curious to see its performance on these benchmarks against Batch Normalization. I plan to give it a shot in the near future.

One thing nagging at me after reading the paper is the obsession with mean 0 and variance 1, not only in this paper but also in other normalization techniques. Indeed, these values are just relative, so why 0 and 1 and not 0 and 4? If you have an answer to this, please ping me below.

  • Charlie Parker

    my suspicion is that N(0,1) is arbitrary as long as the data layer has the same N(0,1) normalization. Why would N(0,4) be better? I feel a proof would show that all the layers (including the data one) have to be at the same scale i.e. same normalization. It would be my guess.

  • Totoka

    Can it be that having variance 1 (or close) is just better for numerical stability?

  • Dmytro Mishkin

    Hi there 🙂

    I have added SELU to my ImageNet-128 px benchmark. Results are not promising 🙁

    F top-1 acc
    ReLU: 0.471
    ELU: 0.488
    SELU: 0.470

    • thanks for sharing your results. Yeah it is interesting to see this. Do you use BN for ReLU?

      • Dmytro Mishkin

        No. With BN, ReLU gives 0.499 acc, ELU 0.498

        • Thx for clarifying

        • I should also say that, I adore your github link 😉

  • llamda

    > They suggest to sample weights from a Gaussian distribution with mean 0 and variance 1/n where n is number of weights.

    No, n is the number of input neurons