**Teacher - Student paradigm:**

- To the best of my knowledge, the idea was sparked by Caruana et al. (2006). Basically, the idea is to train an ensemble of networks and use their outputs on a held-out set to distill the knowledge into a smaller network. The idea was recently revisited in G. Hinton's work, which trains a larger network and then uses its outputs, together with the original training data, to train a smaller network. One important trick is to use a higher temperature on the softmax layer of the teacher network, so that the class probabilities are spread more smoothly over the classes. The student network is then able to learn the class relations induced by the teacher network in addition to the true classes of the instances. Eventually, we are able to compress the knowledge of the teacher net into a smaller network with fewer parameters and faster execution time. Bengio also has a similar work called FitNets, which takes the same idea further. They use not only the outputs of the teacher net but also transfer the representational power of the teacher's hidden layers to the student net, through a regression loss that approximates the teacher's hidden-layer activations from the student's hidden-layer activations.
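The temperature trick described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the exact recipe of any of the papers: the function names, the `alpha` mixing weight and the default temperature are assumptions made for the example. The `temperature**2` factor on the soft term follows the scaling suggested in Hinton's distillation paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis; a higher temperature gives a flatter distribution."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Mix a soft-target loss (teacher) with a hard-label loss (true classes).

    alpha and temperature are illustrative hyperparameters, not values from a paper.
    """
    # Soft targets: teacher and student are both softened with the same temperature.
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = -(soft_targets * np.log(student_soft + 1e-12)).sum(axis=-1).mean()

    # Hard targets: ordinary cross-entropy at temperature 1.
    student_hard = softmax(student_logits)
    picked = student_hard[np.arange(len(labels)), labels]
    hard_loss = -np.log(picked + 1e-12).mean()

    # The soft term is rescaled by T^2 so both terms have comparable gradient size.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss
```

With `teacher_logits = np.array([[5.0, 2.0, -1.0]])`, comparing `softmax(teacher_logits, 1.0)` and `softmax(teacher_logits, 5.0)` shows how the higher temperature moves probability mass from the top class to the others, which is where the "class relations" come from.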

**Bayesian Breezes:**

- We are finally able to see some Bayesian arguments on deep models. One of the prominent works is "Bayesian Dark Knowledge" from Max Welling and colleagues. Again we have the previous idea, but with a very simple trick in mind. Basically, we add Gaussian noise, scaled by the decaying learning rate, to the gradient updates. This noise induces MCMC dynamics in the network, which implicitly learns an ensemble of networks. The teacher trained in that fashion is then used to train student nets with an approach similar to the one proposed by G. Hinton. I won't go into the mathematical details here. I guess this is one of the rare Bayesian approaches that is close to being applicable to real-time problems, since one simple trick is enough to do all the Bayesian magic.
- The Variational Autoencoder is not a new work, but it recently drew my attention. The difference between a VAE and a conventional AE is that, given a probability distribution, the VAE learns the best possible representation parametrized by that distribution. Let's say we want to fit a Gaussian distribution to the data. Then it is able to learn the mean and standard deviation of multiple Gaussians (one per VAE latent unit) with backpropagation, thanks to a simple reparametrization trick. Eventually, you obtain multiple Gaussians with different means and standard deviations on the latent units of the VAE, and you can sample new instances out of these. You can learn more from this great tutorial.
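The reparametrization trick mentioned in the VAE note can be illustrated as follows. This is a rough sketch under the usual assumption of a diagonal Gaussian on the latent units with a standard normal prior; the function names and the choice to predict the log-variance (rather than the standard deviation directly) are conventions picked for the example.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives only in eps, so z stays a differentiable
    function of mu and log_var, which is what lets backpropagation work.
    """
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over latent units, one value per instance."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
```

In a full VAE, an encoder network would output `mu` and `log_var` per instance, and the KL term would be added to the reconstruction loss; both networks are left out here to keep the sketch short.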

**Recurrent Models for Visual Recognition:**

- ReNet is a paper from the Montreal group. They propose an alternative to convolutional neural networks for learning spatial structure in visual data. Their idea relies on recurrent neural networks that scan the image as a sequence, sweeping along one spatial axis and then along the other. At the end, the RNN is able to learn the structure over the whole image (or image patch). Although their results are not better than the state of the art, spotting a new alternative to old-fashioned convolution is an exciting effort.
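To make the scanning idea more concrete, here is a very simplified sketch of one such layer. It is not the ReNet implementation: it assumes a plain tanh RNN instead of gated units, a grid of already-flattened patches as input, and hypothetical parameter tuples `(W_in, W_rec)`. Each sweep runs in both directions along one axis, and the second sweep consumes the output of the first, so every output position can depend on the whole input.

```python
import numpy as np

def rnn_sweep(seq, W_in, W_rec):
    """Run a plain tanh RNN over seq of shape (T, D); return hidden states (T, H)."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in seq:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_sweep(grid, params_fwd, params_bwd):
    """Sweep along axis 0 of grid (rows, cols, D) in both directions, column by column."""
    rows, cols, _ = grid.shape
    hidden = params_fwd[1].shape[0]
    out = np.zeros((rows, cols, 2 * hidden))
    for c in range(cols):
        fwd = rnn_sweep(grid[:, c], *params_fwd)
        # Reverse the sequence for the backward pass, then re-align the states.
        bwd = rnn_sweep(grid[::-1, c], *params_bwd)[::-1]
        out[:, c] = np.concatenate([fwd, bwd], axis=1)
    return out

def renet_layer(grid, p1_fwd, p1_bwd, p2_fwd, p2_bwd):
    """Sweep along axis 0 first, then along axis 1 on the resulting feature map."""
    first = bidirectional_sweep(grid, p1_fwd, p1_bwd)
    second = bidirectional_sweep(first.transpose(1, 0, 2), p2_fwd, p2_bwd)
    return second.transpose(1, 0, 2)
```

Note that the second pair of RNNs takes inputs of size `2 * hidden`, because the forward and backward states of the first sweep are concatenated.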

**Model Accelerator and Compression Methods:**

- We already talked about the dark knowledge approach, which is able to compress larger models into smaller ones. Besides that, there are some structural approaches to compressing larger models. One instance of these works is "Learning both Weights and Connections for Efficient Neural Networks". You can reach my personal note on this work by this link.
- "Neural Networks with Few Multiplications" by Bengio's team introduces yet another algorithmic solution for faster and less memory-hungry training.

# Stochastic Gradient formula for different learning algorithms

# Devil in the implementation details

I was wrestling with an interesting problem lately. I trained a custom deep neural network model on ImageNet and ended up with very good results, at least in the training logs. I used Caffe for all of this. Then I ported my model to the Python interface and fed it some object images. Boom! It did not work, and it even produced random-looking probability values, as if it had not been trained for 4 days. It was really frustrating. After dozens of hours I discovered that the "devil is in the details".

I was using one of the **Batch Normalization** ("What is it?" little intro here) PRs that was not merged into the master branch but seemed fine. Then I found this interesting problem. The code in the branch computes each batch's mean by looking only at that batch. When we give only one example at test time, the mean values are exactly the values of this particular image, so the normalized activations collapse to zero. This disables everything and the net starts to behave strangely. After a small search I found the solution, which uses a moving average collected during training instead of the exact batch average. Now I am at the implementation stage. The punchline is: do not use any PR that is not merged into the master branch, that simple 🙂
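The failure mode and the moving-average fix can be reproduced with a small NumPy sketch. This is not the Caffe code from the PR; the class name, the `momentum` value and the `training` flag are assumptions made for illustration, and the learned scale and shift parameters are left out.

```python
import numpy as np

class BatchNorm:
    """Toy batch normalization over the batch axis, with running statistics."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # illustrative value, not taken from the PR
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Statistics of the current batch only.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Keep a moving average so test time does not depend on the batch.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)
```

Passing a single example with `training=True` reproduces the bug: the batch mean equals the example itself, so the output is all zeros no matter what the input was. With `training=False` the stored moving averages are used and the example keeps its information.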

# ImageNet winners after 2012

- 2012
  - 0.15 - SuperVision (AlexNet) - ~60954656 params
  - 0.26 - ISI (ensemble of features)
  - 0.27 - LEAR (Fisher Vectors)
- 2013
- 2014
  - 0.06 - GoogLeNet (Inception Modules) - ~11176896 params
  - 0.07 - VGGnet (Go deeper and deeper)
  - 0.08 - SPPnet (A retrospective addition from early vision)

# Bits and Missings for NNs

**Adversarial instances and robust models**

- Generative Adversarial Network http://arxiv.org/abs/1406.2661 - Train the classifier net against another net that creates possible adversarial instances as training evolves.
- Apply genetic algorithms every N training iterations of the net to create some adversarial instances.
- Apply the fast gradient approach to image pixels to generate intruding (adversarial) images.
- Goodfellow states that DAE or CAE are not full solutions to this problem. (Verify why?)
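As a rough illustration of the fast gradient idea from the list above, the sketch below perturbs an input in the direction of the sign of the loss gradient. It assumes that the note refers to the fast gradient sign method, and it uses a toy logistic regression classifier instead of a deep net so that the input gradient can be written by hand; all names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, x, y):
    """Cross-entropy loss of a logistic regression classifier on one example."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fast_gradient_perturb(w, b, x, y, epsilon):
    """Move every input dimension by epsilon in the direction that increases the loss."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w  # gradient of the loss with respect to the input
    return x + epsilon * np.sign(grad_x)
```

For a deep net the only change is that `grad_x` comes from back-propagating the loss down to the pixels. Each pixel moves by at most `epsilon`, yet the loss on the perturbed input is higher than on the original.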

**Blind training of nets**

- We train huge networks in a very brute-force fashion. What I mean is, we use larger and larger models because we do not know how to learn concise and effective ones. Instead, we rely on redundancy and expect at least some units to be receptive to discriminative features.

**Optimization (as always)**

- It seems inefficient to me to still use back-propagation after all this work in the field. Another interesting fact: all the effort in the research community goes into finding new tricks that ease back-propagation's flaws. I think we should replace back-propagation altogether instead of relying on short-lived daily fixes.
- Still using SGD? Still?

**Sparsity?**

- After a year of hot discussion about sparse representations and their similarity to human brain activity, the topic seems to have been shelved. I still believe sparsity is a very important part of good data representations. It should be integrated into state-of-the-art learning models, not only autoencoders.

**DISCLAIMER**: If you are reading this, these are only captain's notes intended for my own research. There are many missing references and novice arguments.