Comparison: SGD vs Momentum vs RMSprop vs Momentum+RMSprop vs AdaGrad

In this post I'll briefly introduce some update tricks for training of your ML model. Then, I will present my empirical findings with a linked NOTEBOOK that uses 2 layer Neural Network on CIFAR dataset.

I assume at least you know what is Stochastic Gradient Descent (SGD). If you don't, you can follow this tutorial .  Beside, I'll consider some improvements of SGD rule that result better performance and faster convergence.

SGD is basically a way of optimizing your model parameters based on the gradient information of your loss function (Means Square Error, Cross-Entropy Error ... ). We can formulate this;

w(t) = w(t-1) - \epsilon * \bigtriangleup w(t)

w is the model parameter, \epsilon is learning rate and \bigtriangleup w(t) is the gradient at the time t.

SGD as itself  is solely depending on the given instance (or the batch of instances) of the present iteration. Therefore, it  tends to have unstable update steps per iteration and corollary convergence takes more time or even your model is akin to stuck into a poor local minima.

To solve this problem, we can use Momentum idea (Nesterov Momentum in literature). Intuitively, what momentum does is to keep the history of the previous update steps and combine this information with the next gradient step to keep the resulting updates stable and conforming the optimization history. It basically, prevents chaotic jumps.  We can formulate  Momentum technique as follows;

v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w}(t)  (update velocity history with the new gradient)

\bigtriangleup w(t) = v(t) (The weight change is equal to the current velocity)

\alpha is the momentum coefficient and 0.9 is a value to start. \frac{\partial E}{\partial w}(t) is the derivative of w wrt. the loss.

Okay we now soothe wild SGD updates with the moderation of Momentum lookup. But still nature of SGD proposes another potential problem. The idea behind SGD is to approximate the real update step by taking the average of the all given instances (or mini batches). Now think about a case where  a model parameter gets a gradient of +0.001 for each  instances then suddenly it gets -0.009 for a particular instance and this instance is possibly a outlier. Then it destroys all the gradient information before. The solution to such problem is suggested by G. Hinton in the Coursera course lecture 6 and this is an unpublished work even I believe it is worthy of.  This is called RMSprop. It keeps running average of its recent gradient magnitudes and divides the next gradient by this average so that loosely gradient values are normalized. RMSprop is performed as below;

MeanSquare(w,t) =0.9 MeansSquare(w, t-1)+0.1\frac{\partial E}{\partial w}(t)^2

\bigtriangleup w(t) = \epsilon\frac{\partial E}{\partial w}(t) / (\sqrt{MeanSquare(w,t)} + \mu)

\mu is a smoothing value for numerical convention.

You can also combine Momentum and RMSprop by applying successively and aggregating their update values.

Lets add AdaGrad before finish. AdaGrad is an Adaptive Gradient Method that implies different adaptive learning rates for each feature. Hence it is more intuitive for especially sparse problems and it is likely to find more discriminative features and filters for your Convolutional NN. Although you provide an initial learning rate, AdaGrad tunes it regarding the history of the gradients for each feature dimension. The formulation of AdaGrad is as below;

w_i(t) = w_i(t-1) + \frac{\epsilon}{\sum_{k=1}^{t}\sqrt{{g_{ki}}^2}}  where g_{ki} = \frac{\partial E}{\partial w_i}

So the upper formula states that, for each feature dimension, learning rate is divided by the all the squared root gradient history.

Now you completed my intro to the applied ideas in this NOTEBOOK and you can see the practical results of these applied ideas on CIFAR dataset. Of course this into does not mean complete by itself. If you need more refer to other resources. I really suggest the Coursera NN course by G. Hinton for RMSprop idea and this notes for AdaGrad.

For more information you can look this great lecture slide from Toronto Group.

Lately, I found this great visualization of optimization methods. I really suggest you to take a look at it.


Microsot Research introduced a new NN model that beats Google and the others

MS researcher recently introduced a new deep ( indeed very deep :) ) NN model (PReLU Net) [1] and they push the state of art in ImageNet 2012 dataset from 6.66% (GoogLeNet) to 4.94% top-5 error rate.

In this work, they introduce an alternation of well-known ReLU activation function. They call it PReLu (Parametric Rectifier Linear Unit). The idea behind is to allow negative activations on the ReLU function with a control parameter a which is also learned over the training phase. Therefore, PReLU allows negative activations and in the paper they argue and emprically show that PReLU is better to resolve diminishing gradient problem for very deep neural networks  (> 13 layers) due to allowance of negative activations. That means more activations per layer, hence more gradient feedback at the backpropagation stage.

all figures are from the paper

Continue reading Microsot Research introduced a new NN model that beats Google and the others

Updating your local forked project by a commit to the main project?

This is from my stackoverflow question. Thanks to "bitoiu". Here is the real thread.

How to pick up a single commit from a remote repo

Assuming you have a local clone of the repo you forked if you type in the following you should get a single origin:

> git show remote

Unless you've added the original's repo location, you won't have access to the commit you want to pick into your local one. So we need to add that, let's assume this repo is Notice this is an HTTPS clone URL because you won't have write access to this repo. Let's name it original_repo:

> git remote add original_repo

And now let's get all the refs back:

> git fetch origina_repo

At this point you have all you need locally, you'll just need to merge the commit into one of your branches, let's assume your local master.

Find the commit you want to merge. This implies finding it in one of the branches the team used. Could be already merged to master or you could be picking it up from the branch that was used for the pull request. Either way, just run a series of git log to check what commit you want if you don't know the reference. When you do simply go to the branch where you want to merge the commit to and run:

> git cherry-pick COMMIT_ID

This will bring the commit to whatever branch you are at the moment.

How to merge a branch from a remote repo

The only difference in this steps is that instead of doing the cherry-pick you will be doing a merge. So imagine the contents of the pull request are in a branch named so-pr, you would simply do:

> git merge original_repo/so-pr

And that would merge the contents of so-pr into your working branch.

How to keep your forked project updated with the main project ?

# Add the remote, call it "upstream":

git remote add upstream

# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:

git fetch upstream

# Make sure that you're on your master branch:

git checkout master

# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:

git rebase upstream/master

#If you don't want to rewrite the history of your master branch, (for # example because other people may have cloned it) then you should # replace the last command with However, for making further pull    # requests that are as clean as possible, it's probably better to # rebase.
git merge upstream/master. 


Intro. to Contractive Auto-Encoders

Contractive Auto-Encoder is a variation of well-known Auto-Encoder algorithm that has a solid background in the information theory and lately deep learning community. The simple Auto-Encoder targets to compress information of the given data as keeping the reconstruction cost lower as much as possible. However another use is to enlarge the given input's representation. In that case, you learn over-complete representation of the given data instead of compressing it. Most common implication is Sparse Auto-Encoder that learns over-complete representation but in a sparse (smart) manner. That means, for a given instance only informative set of units are activated, therefore you are able to capture more discriminative representation, especially if you use AE for pre-training of your deep neural network.

After this intro. what is special about Contraction Auto-Encoder (CAE)?  CAE simply targets to learn invariant representations to unimportant transformations for the given data. It only learns transformations that are exactly in the given dataset and try to avoid more. For instance, if you have set of car images and they have left and right view points in the dataset, then CAE is sensitive to those changes but it is insensitive to frontal view point. What it means that if you give a frontal car image to CAE after the training phase, it tries to contract its representation to one of the left or right view point car representation at the hidden layer. In that way you obtain some level of view point in-variance. (I know, this is not very good example for a cannier guy but I only try to give some intuition for CAE).

From the mathematical point of view, it gives the effect of contraction by adding an additional term to reconstruction cost. This addition is the Sqrt Frobenius norm of Jacobian of the hidden layer representation with respect to input values. If this value is zero, it means, as we change input values, we don't observe any change on the learned hidden representations. If we get very large values then the learned representation is unstable as the input values change.

This was just a small intro to CAE, if you like the idea please follow the below videos of Hugo Larochelle's lecture and Pascal Vincent's talk at ICML 2011 for the paper.


Here is the G. Hinton's talk at MIT about t inabilities of Convolutional Neural Networks and 4 basic arguments to solve these.

I just watched it with a slight distraction and I need to reiterate. However these are the basic arguments in which G. Hinton is proposed whilst the speech.

1.  CNN + Max Pooling is not the way of handling visual information as the human brain does. Yes, it works in practice for the current state of the art but, especially view point changes of the target objects are still unsolved.

2. Apply Equivariance instead of Invariance. Instead of learning invariant representations to the view point changes, learn changing representations correlated with the view point changes.

3. In the space of CNN weight matrices, view point changes are totally non-linear and therefore hard to learn. However, if we transfer instances into a space where the view point changes are globally linear, we can ease the problem. ( Use graphics representation uses explicit pose coordinates)

4. Route information to right set of neurons instead of unguided forward and backward passes. Define certain neuron groups ( called capsules ) that are receptive to  particular set of data clusters in the instance space and each of these capsules contributes to the whole model as much as the given instance's membership to neuron's cluster.

ML Work-Flow (Part 5) – Feature Preprocessing

We already discussed first four steps of ML work-flow. So far, we preprocessed crude data by DICTR (Discretization, Integration, Cleaning, Transformation, Reduction), then applied a way of feature extraction procedure to convert data into machine understandable representation, and finally divided data into different bunches like train and test sets . Now, it is time to preprocess feature values and make them ready for the state of art ML model ;).

We need Feature Preprocessing in order to:

  1. Evade scale differences between dimensions.
  2. Convey instances into a bounded region in the space.
  3. Remove correlations between different dimensions.

You may ask “Why are we so concerned about these?” Because

  1. Evading scale differences reduces unit differences between particular feature dimensions. Think about Age and Height of your customers. Age is scaled in years and Height is scaled in cm's. Therefore, these two dimension values are distributed in different manners. We need to resolve this and convert data into a scale invariant representation before training your ML algorithm, especially if you are using one of the linear models like Logistic Regression or SVM (Tree based models are more robust to scale differences).
  2. Conveying instances into a bounded region in the space resolves the representation biases between instances. For instance, if you work on a document classification problem with bag of words representation then you should care about document length since longer documents include more words which result in more crowded feature histograms. One of the reasonable ways to solve this issue is to divide each word frequency by the total word frequency in the document so that we can convert each histogram value into a probability of seeing that word in the document. As a result, document is represented with a feature vector that is 1 in total of its elements. This new space is called vector space model in the literature.
  3. Removing correlations between dimensions cleans your data from redundant information exposed by multiple feature dimensions. Hence data is projected into a new space where each dimension explains something independently important from the other feature dimensions.

Okay, I hope now we are clear why we are concerned about these. Henceforth, I'll try to emphasis some basic stuff in our toolkit for feature preprocessing.


  • Can be applied to both feature dimensions or data instances.
  • If we apply to dimensions, it reduces unit effect and if we apply to instances then we solve instance biases as in the case of the document classification problem.
  • The result of standardization is that each feature dimension (instance) is scaled into defined mean and variance so that we fix the unit differences between dimensions.
  •  z = (x-\mu)/\alpha  : for each dimension (instance),  subtract the mean and divide by the variance of that dimension (instance) so that each dimension is kept inside a mean = 0 , variance = 1 curve.

Min Max Scaling

  • Personally, I've not applied Min-Max Scaling to instances,
  • It is still useful for unit difference problem.
  • Instead of distributional consideration, it hinges the values in the range  [0,1].
  • x_{norm} = (x - x_{min})/(x_{max} - x_{min}) :  Find max and min values of the feature dimension and apply the formula.

Caveat 1: One common problem of Scaling and Standardization is you need to keep min and max for Scaling, mean and variance values for Standardization for the novel data and the test time. We estimate these values from only the training data and assume that these are still valid for the test and real world data. This assumption might be true for small problems but especially for online environment this caveat should be dealt with a great importance.

Sigmoid Functions

  • Sigmoid function naturally fetches given values into a [0, 1] range
  • Does not need any assumption about the data like mean and variance
  • It penalizes large values  more than the small ones.
  • You can use other activation functions like tanh.
Sigmoid function

Caveat 2: How to choose and what to choose are very problem dependent questions. However, if you have a clustering problem then standardization seems more reasonable for better similarity measure between instance and if you intend to use Neural Networks then some particular kind of NN demands [0,1] scaled data (or even more interesting scale ranges for better gradient propagation on the NN model). Also, I personally use sigmoid function for simple problems in order to get fast result by SVM without complex investigation.

Zero Phase Component Analysis (ZCA Whitening)

  • As I explained before, whitening is a process to reduce redundant information by decorrelating data with a final diagonal correlation matrix with preferable all diagonals are one.
  • It has especially very important implications in Image Recognition and Feature Learning  so as to make visual cues more concrete on images.
  • Instead of formula, it is more intuitive to wire some code
Covariance Matrices before and after ZCA

I tried to touch some methods and common concerns of feature preprocessing, by no means  complete. Nevertheless, a couple of takeaways from this post are; do not ignore normalizing your feature values before going into training phase and choose the correct method by investigating the values painstakingly.

PS: I actually promised to write a post per week but I am as busy as a bee right now and I barely find some time to write a new stuff. Sorry about it :(