What I read for deep-learning

Today, I spent some time on two new papers: one proposing a new way of training very deep neural networks (Highway-Networks), and one proposing a new activation function for Auto-Encoders (ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF CO-ADAPTING FEATURES) which avoids the use of any regularization methods such as Contraction or Denoising.

Let's start with the first one. Highway-Networks proposes a new activation type similar to LSTM networks, and the authors claim that this peculiar activation is robust to any choice of initialization scheme and to the learning problems that occur in very deep NNs. It is also exciting to see that they trained models with more than 100 layers. The basic intuition here is to learn a gating function, attached to a real activation function, that decides whether to pass the activation or the input itself. Here is the formulation:

y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))

T(x, W_T) is the gating function and H(x, W_H) is the real activation. They use a Sigmoid activation for the gate and a Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn for 50 layers (the most my PC can handle).
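
To make the formulation concrete, here is a minimal numpy sketch of a single highway layer's forward pass. It is my own illustration (not the authors' code and not my Lasagne implementation); the ReLU and Sigmoid choices simply follow the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_forward(x, W_H, b_H, W_T, b_T):
    """x: (batch, dim); W_H, W_T: (dim, dim); b_H, b_T: (dim,)."""
    H = np.maximum(0.0, x @ W_H + b_H)   # plain activation H(x, W_H), here a ReLU
    T = sigmoid(x @ W_T + b_T)           # gate T(x, W_T), squashed to (0, 1)
    return H * T + x * (1.0 - T)         # pass the activation or carry the input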

The other paper, ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF CO-ADAPTING FEATURES, suggests the use of non-biased rectifier units for the inference stage of AEs. You can train your model with biased Rectifier Units, but at inference time (test time) you should extract features by ignoring the bias term. They show that doing so gives better recognition on the CIFAR dataset. They also devise a new activation function which has a similar intuition to Highway Networks: again, there is a gating unit which thresholds the normal activation function.

[Equation images from the paper: the thresholded activation and the reconstruction of the proposed model.]

The first equation is the threshold function with a predefined threshold (they use 1 in their experiments). The second equation shows the reconstruction of the proposed model. Note that in this equation they threshold on the square of a linear activation and call this model TLin, but they also use the plain linear value, which is called TRec. What this activation does is suppress small activations, so the model is implicitly regularized without any additional regularizer. This is actually good for learning an over-complete representation of the given data.
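
As a reading aid, here is my own small numpy sketch of the two thresholded activations as I understand them from the description above; theta is the predefined threshold (1 in their experiments) and the function names just mirror the paper's TRec/TLin labels.

import numpy as np

def trec(z, theta=1.0):
    # keep the linear activation only where it exceeds the threshold
    return z * (z > theta)

def tlin(z, theta=1.0):
    # threshold on the square of the linear activation instead
    return z * (z ** 2 > theta)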

For more than this brief intro, please refer to the papers :) and warn me about any mistakes.

These two papers show an emerging trend in the Deep Learning community: the use of more complex activation functions. We can call it controlling each unit's behavior in a smart way instead of letting them fire naively. My own intuition agrees with this idea. I believe we need even more sophistication for smart units in our deep models, as in Spike and Slab networks.


A Slide About Model Evaluation Methods

Here we have a very good slide summarizing performance measures, statistical tests and sampling methods for model comparison and evaluation. You can refer to it when you have a couple of classifiers on different datasets and you want to see which one is better and why.

Gradient Boosted Trees Notes

Gradient Boosted Trees (GBT) is an ensemble method which incrementally learns new trees that reduce the present ensemble's residual error. This residual error resembles a gradient step of a linear model. A GBT tries to approximate the gradient step with a new tree and updates the present ensemble with this new tree, so that the whole model moves in the optimizing direction. This is not a very formal explanation, but it conveys my intuition.

One formal way to think about GBT is that, among all possible tree constructions, our algorithm just selects the ones useful for the given data. Hence, compared to all possible trees, the number of trees constructed in the model is very small. This is similar to constructing this infinite number of trees and averaging them with weights estimated by LASSO.
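
To make the "fit a new tree to the residuals" intuition concrete, here is a minimal sketch of gradient boosting for squared error. The function names and default values are mine, and it uses scikit-learn's DecisionTreeRegressor only as the base learner; it is a toy illustration, not a production GBT.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, n_rounds=100, shrinkage=0.1, max_depth=3):
    pred = np.full(len(y), y.mean())                # start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                         # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += shrinkage * tree.predict(X)         # small step along the estimated gradient
        trees.append(tree)
    return y.mean(), trees

def predict_gbt(base, trees, X, shrinkage=0.1):
    return base + shrinkage * sum(t.predict(X) for t in trees)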

GBT includes several hyperparameters, mostly for regularization.

  • Early stopping: how many rounds your GBT continues.
  • Shrinkage: limit the update of each tree with a coefficient 0 < \alpha < 1.
  • Data subsampling: do not use the whole data for each tree; instead, sample instances. In general a sampling ratio of n = 0.5 works, but it can be lower for larger datasets.
  • One side note: subsampling without shrinkage performs poorly.

Then my initial setting is (see the scikit-learn sketch after this list):

  • Run pretty long, with many rounds, while observing the loss on a validation set.
  • Use small shrinkage value \alpha = 0.001
  • Sample 0.5 of the data
  • Sample 0.9 of the features as well or do the reverse.
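
Here is the scikit-learn sketch promised above: how I would translate this starting point into GradientBoostingRegressor. The parameter names are sklearn's; the concrete values (including the round count) are just the initial setting listed above and should be tuned while watching a validation loss.

from sklearn.ensemble import GradientBoostingRegressor

gbt = GradientBoostingRegressor(
    n_estimators=5000,     # run long; decide the real stopping point on validation loss
    learning_rate=0.001,   # small shrinkage value alpha
    subsample=0.5,         # sample half of the instances for each tree
    max_features=0.9,      # sample 90% of the features at each split
)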

Kaggle Plankton Challenge Winner's Approach

I recently attended the Plankton Classification Challenge on Kaggle. I used a simpler (stupidly simple compared to the winner's) Deep NN model for my submissions and ended up in 192nd position among 1046 participants. Still, this was a very good experimental ground for me to test ideas that are new to the Deep Learning community and to try a couple of novel things, which I plan to explain later on my blog.

In this post, I share my notes about the winner's approach (which is explained here extensively).

Recent Advances in Deep Learning

In this text, I would like to talk about some of the recent advances in Deep Learning models; the list is by no means complete. (Click a heading for the reference.)

  1. Parametric Rectifier Linear Unit (PReLU)
    • The idea is to allow negative activation in the well-known ReLU unit by controlling it with a learnable parameter. In other words, you learn how much negative activation you need for each unit to discriminate classes. The work proposes that the PReLU unit is especially useful for very deep models that lack gradient propagation to the initial layers due to their depth. What is different is that PReLU allows more gradient to flow back by allowing negative activations.
  2. A new initialization method (MSRA for Caffe users)
    • Xavier initialization was proposed by Bengio's team and it considers the fan-in and fan-out of a unit to define the initial weights. However, this work points out that the Xavier method and its variants assume linear activation functions in their derivation. Hence, they propose a change tailored to the ReLU activation, and they empirically show its effect in practice with a better convergence rate.
  3. Batch Normalization 
    • This work makes data normalization a structural part of the model. They say that the distribution of each layer's inputs changes as the model evolves, which makes training sensitive to the initialization scheme and the learning schedule we use. Each mini-batch of the data is normalized by the described scheme just before its propagation through the network, and this allows faster convergence with larger learning rates and models that are robust to the chosen initialization scheme. Each mini-batch is normalized by its mean and variance, then scaled and shifted by a learned scale and offset (a small numpy sketch follows this list).

      From the paper
  4. Inception Layers
    • This is one of the ingredients of last year's ImageNet winner, GoogleNet. The trick is to use multi-scale filters together within a layer and concatenate their responses for the next layer. In that way we are able to capture different covariance structures at each layer through filters of different sizes and shapes.
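
Here is the small numpy sketch of the batch-normalization forward pass mentioned in item 3. It is my own illustration of the transform described there; gamma and beta stand for the learned scale and shift.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: (features,)."""
    mu = x.mean(axis=0)                    # per-feature mean of the mini-batch
    var = x.var(axis=0)                    # per-feature variance of the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learned scale and shift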

Comparison: SGD vs Momentum vs RMSprop vs Momentum+RMSprop vs AdaGrad

In this post I'll briefly introduce some update tricks for training your ML model. Then, I will present my empirical findings in a linked NOTEBOOK that uses a 2-layer Neural Network on the CIFAR dataset.

I assume you at least know what Stochastic Gradient Descent (SGD) is. If you don't, you can follow this tutorial. Beyond plain SGD, I'll consider some improvements of the SGD rule that result in better performance and faster convergence.

SGD is basically a way of optimizing your model parameters based on the gradient information of your loss function (Mean Square Error, Cross-Entropy Error, ...). We can formulate this as:

w(t) = w(t-1) - \epsilon * \bigtriangleup w(t)

w is the model parameter, \epsilon is the learning rate and \bigtriangleup w(t) is the gradient at time t.
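
As a quick reference, here is a tiny Python sketch of this plain SGD step (my own illustration, not the NOTEBOOK code):

def sgd_update(w, grad, lr=0.01):
    # w(t) = w(t-1) - epsilon * grad
    return w - lr * grad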

SGD by itself depends solely on the given instance (or batch of instances) of the present iteration. Therefore, it tends to take unstable update steps from iteration to iteration; as a corollary, convergence takes more time, or your model may even get stuck in a poor local minimum.

To mitigate this problem, we can use the Momentum idea (a refined variant, Nesterov Momentum, also exists in the literature). Intuitively, what momentum does is keep a history of the previous update steps and combine this information with the next gradient step, so that the resulting updates stay stable and conform to the optimization history. It basically prevents chaotic jumps. We can formulate the Momentum technique as follows:

v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w}(t)  (update velocity history with the new gradient)

\bigtriangleup w(t) = v(t) (The weight change is equal to the current velocity)

\alpha is the momentum coefficient, and 0.9 is a good value to start with. \frac{\partial E}{\partial w}(t) is the derivative of the loss with respect to w.
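
A small Python sketch of the momentum update, following the two formulas above (again my own illustration; the velocity is carried over between iterations):

def momentum_update(w, grad, velocity, lr=0.01, alpha=0.9):
    velocity = alpha * velocity - lr * grad   # v(t) = alpha * v(t-1) - epsilon * gradient
    return w + velocity, velocity             # the weight change is the current velocity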

Okay, we have now soothed the wild SGD updates with the moderation of Momentum. But the nature of SGD still poses another potential problem. The idea behind SGD is to approximate the real update step by averaging over the given instances (or mini-batches). Now think about a case where a model parameter gets a gradient of +0.001 for each instance, and then suddenly it gets -0.009 for a particular instance, which is possibly an outlier. That destroys all the gradient information accumulated before. A solution to this problem is suggested by G. Hinton in lecture 6 of the Coursera course; it is unpublished work, even though I believe it is worthy of publication. It is called RMSprop. It keeps a running average of the recent gradient magnitudes and divides the next gradient by this average, so that gradient values are loosely normalized. RMSprop is performed as below:

MeanSquare(w,t) = 0.9 MeanSquare(w, t-1) + 0.1 \frac{\partial E}{\partial w}(t)^2

\bigtriangleup w(t) = \epsilon\frac{\partial E}{\partial w}(t) / (\sqrt{MeanSquare(w,t)} + \mu)

\mu is a small smoothing value for numerical stability.
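
A Python sketch of the RMSprop update as described above (my own illustration; mean_square is the running average carried between iterations):

import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, mu=1e-8):
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2    # running average of squared gradients
    delta = lr * grad / (np.sqrt(mean_square) + mu)      # loosely normalized gradient step
    return w - delta, mean_square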

You can also combine Momentum and RMSprop by applying them successively and aggregating their update values.

Let's add AdaGrad before we finish. AdaGrad is an Adaptive Gradient method that applies a different adaptive learning rate to each feature. Hence it is especially well suited to sparse problems, and it is likely to find more discriminative features and filters for your Convolutional NN. Although you provide an initial learning rate, AdaGrad tunes it per feature dimension according to the history of the gradients. The formulation of AdaGrad is as below:

w_i(t) = w_i(t-1) - \frac{\epsilon}{\sqrt{\sum_{k=1}^{t}{g_{ki}}^2}} g_{ti}  where g_{ki} = \frac{\partial E}{\partial w_i}(k)

So the formula above states that, for each feature dimension, the learning rate is divided by the square root of the accumulated squared gradient history.
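
A Python sketch of the AdaGrad update, keeping a per-parameter sum of squared gradients (my own illustration; the small eps only guards against division by zero):

import numpy as np

def adagrad_update(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    grad_sq_sum = grad_sq_sum + grad ** 2                # accumulate the squared gradient history
    return w - lr * grad / (np.sqrt(grad_sq_sum) + eps), grad_sq_sum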

This completes my intro to the ideas applied in this NOTEBOOK, where you can see their practical results on the CIFAR dataset. Of course, this intro is not complete by itself; if you need more, refer to other resources. I really suggest the Coursera NN course by G. Hinton for the RMSprop idea, and these notes for AdaGrad.

For more information you can look at this great lecture slide from the Toronto group.

Lately, I found this great visualization of optimization methods. I really suggest you take a look at it.

Microsoft Research introduced a new NN model that beats Google and the others

MS researchers recently introduced a new deep (indeed very deep :) ) NN model (PReLU Net) [1], pushing the state of the art on the ImageNet 2012 dataset from a 6.66% (GoogLeNet) to a 4.94% top-5 error rate.

In this work, they introduce a variation of the well-known ReLU activation function. They call it PReLU (Parametric Rectified Linear Unit). The idea is to allow negative activations on the ReLU function through a control parameter a which is also learned during the training phase. Therefore, PReLU allows negative activations, and in the paper they argue and empirically show that PReLU is better at resolving the diminishing gradient problem for very deep neural networks (> 13 layers), due to this allowance of negative activations. That means more active units per layer, hence more gradient feedback at the backpropagation stage.
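
For reference, a minimal numpy sketch of the PReLU forward pass (my own illustration, not the paper's code); a is the learnable slope on the negative side, and a = 0 recovers the plain ReLU:

import numpy as np

def prelu(x, a):
    # positive inputs pass unchanged; negative inputs are scaled by the learned slope a
    return np.where(x > 0, x, a * x)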

All figures are from the paper.


Updating your local forked project with a commit made to the main project

This is from my stackoverflow question. Thanks to "bitoiu". Here is the real thread.

How to pick up a single commit from a remote repo

Assuming you have a local clone of the repo you forked, if you type in the following you should get a single origin:

> git remote show
origin

Unless you've added the original repo's location, you won't have access to the commit you want to pick into your local one. So we need to add that; let's assume this repo is https://github.com/GitbookIO/gitbook.git. Notice this is an HTTPS clone URL, because you won't have write access to this repo. Let's name it original_repo:

> git remote add original_repo https://github.com/GitbookIO/gitbook.git

And now let's get all the refs back:

> git fetch original_repo

At this point you have all you need locally; you'll just need to merge the commit into one of your branches, let's say your local master.

Find the commit you want to merge. This implies finding it in one of the branches the team used. It could already be merged to master, or you could be picking it up from the branch that was used for the pull request. Either way, just run a series of git log commands to find the commit you want if you don't know its reference. When you do, simply go to the branch where you want to merge the commit and run:

> git cherry-pick COMMIT_ID

This will bring the commit to whatever branch you are on at the moment.
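
Putting the whole cherry-pick flow together (the repo URL and branch names below are just the examples used in this post):

> git remote add original_repo https://github.com/GitbookIO/gitbook.git
> git fetch original_repo
> git log original_repo/master
> git checkout master
> git cherry-pick COMMIT_ID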

How to merge a branch from a remote repo

The only difference in these steps is that instead of doing a cherry-pick you will be doing a merge. So, imagining the contents of the pull request are in a branch named so-pr, you would simply do:

> git merge original_repo/so-pr

And that would merge the contents of so-pr into your working branch.

How to keep your forked project updated with the main project?


# Add the remote, call it "upstream":

git remote add upstream https://github.com/whoever/whatever.git

# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:

git fetch upstream

# Make sure that you're on your master branch:

git checkout master

# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:

git rebase upstream/master

# If you don't want to rewrite the history of your master branch
# (for example because other people may have cloned it), then you
# should replace the last command with the one below. However, for
# making further pull requests that are as clean as possible, it's
# probably better to rebase.

git merge upstream/master