NegOut: Substitute for MaxOut units

Maxout [1] units are well-known and frequently used building blocks for deep neural networks. For those unfamiliar with them, here is a basic explanation: a Maxout unit is a set of internal activation functions competing with each other for each instance; the activation of the winner is propagated as the output, and the losers are kept silent. At the backpropagation phase, this means we update only the winner unit. Implicitly, we always prefer to back-propagate the gradient signal through the strongest path. This is an important property of Maxout units, especially for very deep models, which are prone to gradient instability.
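
To make this concrete, here is a minimal NumPy sketch of a Maxout layer as I describe it above (my own illustration, not the code from [1]); `k` is the number of competing filters per output unit:

```python
import numpy as np

def maxout_forward(x, W, b, k):
    """Minimal Maxout sketch: W has shape (in_dim, out_dim * k), i.e. k
    competing linear filters per output unit. Only the winner's value is
    propagated; in backprop, only the winner receives gradient."""
    z = x @ W + b                      # (batch, out_dim * k) pre-activations
    z = z.reshape(x.shape[0], -1, k)   # (batch, out_dim, k): k competing filters per unit
    winners = z.argmax(axis=2)         # index of the winning filter for each unit
    out = z.max(axis=2)                # only the winner's value becomes the unit output
    return out, winners                # winners are needed to route gradients in backprop
```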

Although Maxout units have very good properties such as the ones above (please refer to the paper for more details), I am sceptical about their ability to encode the underlying information and pass it to the next layer. Here is a very simple example. Suppose we have two competing functions (filters) in a Maxout unit. One of these functions is receptive to edge structures whereas the other is receptive to corners. For one instance, the first filter might be the winner with a value of, say, ~3, which means the Maxout output is also ~3. For another instance, the other function might be the winner with approximately the same value, ~3. If we assume that each NN layer is a classifier that takes the previous layer's output as a feature vector (not a very wrong assumption, I guess), then we basically give the same value for different detections in the particular feature dimension that corresponds to our Maxout unit. Consequently, we cannot expect the next layer to be able to discern these two signals.

[Figure: Edge detector is the winner.]
[Figure: Corner detector is the winner, but the output value is the same.]

One can argue that we should evaluate a Maxout unit as a whole and that it is reminiscent of an OR function on top of multiple filters. This is a valid argument that I cannot refute directly, but the problem indicated above still stands. Besides, why would we waste our expensive NN parameters if we could come up with a better encoding scheme for Maxout units?

Here is one alternative approach for better encoding of competing functions, which I call NegOut. Let's assume we have a fixed ordering of the two competing functions, 1st and 2nd. If the winner is the 1st function, NegOut outputs the 1st function's value; otherwise it outputs the 2nd function's value, but negated. NegOut makes two assumptions. First, the competing functions are always positive (like ReLU functions). Second, there are exactly 2 competing functions.

[Figure: NegOut activation with different winners.]

If we consider the backpropagation signal, the only difference from a Maxout unit is that we take the negative of the gradient signal for the 2nd competing unit when it is the winner.
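
Here is a minimal NumPy sketch of the NegOut rule as I describe it (again an illustration, not the exact code used in the experiments):

```python
import numpy as np

def negout_forward(x, W, b):
    """Minimal NegOut sketch: two competing positive filters per unit. If the
    1st wins, output its value; if the 2nd wins, output its value negated.
    With the sign folded into the forward pass, an autodiff framework
    automatically negates the gradient of the 2nd filter when it wins."""
    z = np.maximum(0.0, x @ W + b)         # positive (ReLU-like) competing filters
    z = z.reshape(x.shape[0], -1, 2)       # (batch, out_dim, 2): exactly two filters per unit
    first_wins = z[:, :, 0] >= z[:, :, 1]  # does the 1st filter win the competition?
    out = np.where(first_wins, z[:, :, 0], -z[:, :, 1])
    return out
```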

As you can see from the figure, the inherent property here is that different winner detectors produce different output values, and the value captures both the structural difference and the strength of the winner's activation.

I performed some experiments on CIFAR-10 and MNIST comparing a Maxout network with a NegOut network using exactly the same architectures described in the Maxout paper [1]. The table below summarizes the results I observed from the initial runs, without any fine-tuning or hyper-parameter optimization yet. More comparisons on larger datasets are still in progress.

[Table: Results on CIFAR-10 and MNIST, averaged over 5 different runs.]

NegOut gives better results on CIFAR-10, although it is slightly worse on MNIST. Again, notice that no tuning has taken place for the NegOut network, whereas the Maxout network is optimized as described in the paper [1]. In addition, the NegOut network uses 2 competing units per set (as it is constrained by its nature) for the last FC layer, compared to the Maxout net, which uses 5 competing units. My expectation is to see a larger difference on bigger models and datasets, since representational power matters more as we scale up.

Here, I have tried to give a basic sketch of my recent work, by no means complete. Different observations and experiments are still running. I also need to include LWTA [2] to be fairer and to cover a wider range of competing-unit schemes. Please feel free to share your thoughts as well. Any contribution is appreciated.

PS: Lately, I have devoted myself to analyzing the internal dynamics of neural networks with different architectures, layers, and activation functions. The aim is to check under the hood and analyze intuitively well-functioning ideas applied to deep neural networks. I expect to share more of my findings on this blog.

[1] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. arXiv preprint arXiv:1302.4389.

[2] R. K. Srivastava, J. Masci, F. Gomez, and J. Schmidhuber. Understanding Locally Competitive Networks.


Recent Advances in Deep Learning #2

  1. Teacher - Student paradigm:
    • The idea was sparked (to the best of my knowledge) by Caruana et al. 2006. Basically, the idea is to train an ensemble of networks and use their outputs on a held-out set to distill the knowledge into a smaller network. This idea was recently revisited by G. Hinton's work, which trains a larger network and then uses its outputs, mixed with the original training data, to train a smaller network. One important trick is to use higher temperature values on the softmax layer of the teacher network so that class probabilities are smoothly distributed over the classes (a minimal temperature-softmax sketch follows this list). The student network is then able to learn the class relations induced by the teacher network in addition to the true classes of the instances, as it is supposed to. Eventually, we are able to compress the knowledge of the teacher net into a smaller network with fewer parameters and faster execution time. Bengio's group also has a similar work called FitNets, which benefits from the same idea from a wider perspective. They do not only use the outputs of the teacher net, but also carry the representational power of the teacher's hidden layers into the student net with a regression loss that approximates the teacher's hidden representations from the student's.
  2. Bayesian Breezes:
    • We are finally able to see some Bayesian arguments applied to deep models. One of the prevailing works is Max Welling's "Bayesian Dark Knowledge". Again we have the previous idea, but with a very simple trick in mind. Basically, we introduce Gaussian noise, scaled by the decaying learning rate, into the gradient signals. This noise induces MCMC dynamics in the network, and it implicitly learns an ensemble of networks. The teacher trained in this fashion is then used to train student nets with an approach similar to the one proposed by G. Hinton. I won't go into the mathematical details here. I guess this is one of the rare Bayesian approaches that is close to being applicable to real-time problems, with a simple trick that is enough to do all the Bayesian magic.
    • The Variational Auto-Encoder is not a new work, but it recently drew my attention. The difference between a VAE and a conventional AE is that, given a probability distribution, the VAE learns the best possible representation parametrized by that distribution. Let's say we want to fit a Gaussian distribution to the data. Then the VAE is able to learn the mean and standard deviation of multiple Gaussian functions (corresponding to the VAE latent units) with backpropagation, using a simple reparametrization trick (see the sketch after this list). Eventually, you obtain multiple Gaussians with different means and standard deviations on the latent units of the VAE, and you can sample new instances from them. You can learn more from this great tutorial.
  3. Recurrent Models for Visual Recognition:
    • ReNet is a paper from the Montreal group. It explains an alternative to convolutional neural networks for learning spatial structure over visual data. The idea relies on recurrent neural networks that scan the image in a sequence of horizontal and then vertical sweeps. In the end, the RNN is able to learn the structure over the whole image (or image patch). Although the results are not better than the state of the art, spotting a new alternative to old-fashioned convolution is an exciting effort.
  4. Model Accelerator and Compression Methods:
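
To make the temperature trick in item 1 concrete, here is a minimal sketch (my own illustration, not code from the Hinton paper); the temperature value T=4.0 is only an assumed example:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; higher T spreads probability mass more
    evenly over the classes, producing softer targets."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_targets_from_teacher(teacher_logits, T=4.0):
    # The student is trained to match these soft targets (at the same
    # temperature) in addition to the true labels; T=4.0 is illustrative.
    return softmax_with_temperature(teacher_logits, T)
```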
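
And here is a minimal sketch of the reparametrization trick behind the VAE latent units in item 2 (an illustration of the general recipe, not the code of any particular implementation):

```python
import numpy as np

def vae_sample_latent(mu, log_var, rng=None):
    """Instead of sampling z ~ N(mu, sigma^2) directly, which blocks
    backpropagation, sample eps ~ N(0, 1) and compute z = mu + sigma * eps,
    so mu and log_var (outputs of the encoder) stay differentiable."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)   # noise is independent of the parameters
    sigma = np.exp(0.5 * log_var)         # per-unit standard deviation
    return mu + sigma * eps               # differentiable w.r.t. mu and log_var
```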

devil in the implementation details

I was wrestling with an interesting problem lately. I trained a custom deep neural network model on ImageNet and ended up with very good results, at least in the training logs. I used Caffe for all of this. Then I ported my model to the Python interface and fed it some images. Boom! It was not working, and it even produced what looked like random probability values, as if it had not been trained for 4 days. It was really frustrating. After dozens of hours I rediscovered that the devil is in the details.

I was using one of the Batch Normalization PRs ("What is it?" - a little intro here) that is not merged into the master branch but seemed fine. Then I found the interesting problem. The code in that branch computes each batch's mean by looking only at that batch. When we give only one example at test time, the mean values are exactly the values of that particular image. This cancels everything out and the net starts to behave strangely. After a small search I found the solution, which uses a moving average instead of the exact batch average. Now I am at the implementation stage. The takeaway is: do not use any PR that is not merged into the master branch. That simple :)
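
Here is a minimal sketch of what the fix amounts to (my own illustration, not the code of the PR); `momentum` and `eps` are assumed values:

```python
import numpy as np

class RunningBatchNormStats:
    """Keep exponential moving averages of the mean/variance during training
    and use them at test time, instead of the statistics of the current
    batch (which degenerate when the batch holds a single image)."""
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def normalize(self, x, training):
        if training:
            batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
            self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
            self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
            mean, var = batch_mean, batch_var   # use batch statistics while training
        else:
            mean, var = self.mean, self.var     # use the moving averages at test time
        return (x - mean) / np.sqrt(var + self.eps)
```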

ImageNet winners after 2012

  • 2012
    • 0.15 - Supervision (AlexNet) - ~ 60954656 params
    • 0.26 - ISI (ensemble of features)
    • 0.27 - LEAR (Fisher Vectors)
  • 2013
    • 0.117 - Clarifai (paper)
    • 0.129 - NUS (very parametric but interesting method based on test and train data affinities)
    • 0.135 - ZF (same paper)
  • 2014
    • 0.06 - GoogLeNet (Inception Modules) -  ~ 11176896 params
    • 0.07 - VGGnet (Go deeper and deeper)
    • 0.08 - SPPnet  (A retrospective addition from early vision)

Bits and Missings for NNs

  • Adversarial instances and robust models
    • Generative Adversarial Networks - train a classifier net in opposition to another net that creates possible adversarial instances as training evolves.
    • Apply genetic algorithms every N training iterations of the net and create some adversarial instances.
    • Apply the fast gradient approach to image pixels to generate intruding images (a minimal sketch appears after this list).
    • Goodfellow states that DAEs or CAEs are not a full solution to this problem. (Verify why?)
  • Blind training of nets
    • We train huge networks in a very brute-force fashion. What I mean is, we use larger and larger models because we do not know how to learn concise and effective models. Instead we rely on redundancy and expect at least some units to be receptive to discriminating features.
  • Optimization (as always)
    • It seems inefficient to me to still be using back-propagation after all this work in the field. Another interesting fact: much of the effort in the research community goes into finding new tricks that ease back-propagation's flaws. I think we should replace back-propagation altogether instead of patching it with fleeting solutions.
    • Are we still using SGD? Still?
  • Sparsity ?
    • After a year of hot discussion about sparse representations and their similarity to human brain activity, the topic seems to have been shelved. I still believe sparsity is a very important part of good data representations. It should be integrated into state-of-the-art learning models, not only auto-encoders.
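
For the fast gradient item above, here is a minimal sketch of the idea (an illustration under my own assumptions, not the exact procedure of any specific paper); `epsilon` is an assumed step size and pixels are assumed to lie in [0, 1]:

```python
import numpy as np

def fast_gradient_adversarial(x, grad_wrt_x, epsilon=0.007):
    """Nudge each pixel a small step in the direction that increases the
    classification loss. `grad_wrt_x` is the gradient of the loss with
    respect to the input image, obtained from whatever framework trains
    the net."""
    x_adv = x + epsilon * np.sign(grad_wrt_x)  # step along the sign of the gradient
    return np.clip(x_adv, 0.0, 1.0)            # keep the result a valid image
```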

DISCLAIMER: If you are reading this, note that these are only captain's notes intended for my own research log, with many missing references and novice arguments.

What I read for deep-learning

Today, I spent some time on two new papers: one proposing a new way of training very deep neural networks (Highway Networks) and one proposing a new activation function for auto-encoders ("Zero-Bias Autoencoders and the Benefits of Co-Adapting Features") that avoids the use of regularization methods such as contraction or denoising.

Let's start with the first one. Highway Networks propose a new activation type similar to LSTM networks, and the authors claim that this peculiar activation is robust to any choice of initialization scheme and to the learning problems that occur in very deep NNs. It is also encouraging to see that they trained models with more than 100 layers. The basic intuition here is to learn a gating function, attached to a real activation function, that decides whether to pass the activation or the input itself. Here is the formulation:

y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T))

T(x, W_T) is the gating function and H(x, W_H) is the real activation. They use a sigmoid activation for gating and a rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn with 50 layers (the most my PC can handle).
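
As a concrete illustration of this formulation, here is a minimal NumPy sketch of a single highway layer (my own sketch, not the Lasagne implementation mentioned above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """A sigmoid gate T mixes the transformed activation H(x) with the
    unchanged input x; the paper suggests initializing b_T to a negative
    value so the layer initially passes the input through."""
    H = np.maximum(0.0, x @ W_H + b_H)  # rectifier transform H(x, W_H)
    T = sigmoid(x @ W_T + b_T)          # gate T(x, W_T) in (0, 1)
    return H * T + x * (1.0 - T)        # pass the transform where the gate opens, the input otherwise
```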

The other paper, "Zero-Bias Autoencoders and the Benefits of Co-Adapting Features", suggests the use of unbiased rectifier units for the inference stage of AEs. You can train your model with a biased rectifier unit, but at inference (test) time you should extract features by ignoring the bias term. They show that doing so gives better recognition on the CIFAR dataset. They also devise a new activation function with a similar intuition to Highway Networks. Again, there is a gating unit that thresholds the normal activation function.

[Equations from the paper: the thresholded activation function and the reconstruction of the proposed model.]

The first equation is the threshold function with a predefined threshold (they use 1 in their experiments). The second equation shows the reconstruction of the proposed model. Note that in this equation they threshold on the square of a linear activation, and they call this model TLin; they also use the plain linear function, which is called TRec. What this activation does is diminish the small activations, so the model is implicitly regularized without any additional regularizer. This is actually good for learning an over-complete representation of the given data.
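
A minimal sketch of how I read the two gated activations described above (my own interpretation of the description, not code from the paper); `theta=1.0` follows the threshold value mentioned above:

```python
import numpy as np

def trec(z, theta=1.0):
    # Keep the linear activation only where it exceeds the threshold.
    return z * (z > theta)

def tlin(z, theta=1.0):
    # Gate on the squared activation instead, so large activations of either sign pass.
    return z * (z ** 2 > theta)
```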

For more than this silly intro, please refer to the papers :) and warn me about any mistakes.

These two papers show a coming trend in the deep learning community: the use of more complex activation functions. We can call it controlling each unit's behavior in a smart way instead of letting it fire naively. My own view agrees with this idea. I believe we need even more sophistication in the smart units of our deep models, as in spike-and-slab networks.