Thanks to http://research.microsoft.com/pubs/192769/tricks-2012.pdf
I have been wrestling with an interesting problem lately. I trained a custom deep neural network on ImageNet and ended up with very good results, at least according to the training logs. I used Caffe for all of this. Then I ported my model to the Python interface and gave it some images. Boom! It did not work, and it even produced random-looking probability values, as if the net had not been trained for 4 days. It was really frustrating. After dozens of hours I discovered that the "devil is in the details".
I was using one of the Batch Normalization ("what is it?" little intro here) PRs that is not merged into the master branch but seemed fine. That is where I found the problem. The code in the branch computes the normalization mean by looking only at the current batch. When we give it a single example at test time, the batch mean is exactly the values of that particular image, so subtracting the mean wipes out the activations. This disables everything and the net starts to behave strangely. After a short search I found the solution, which is to use a moving average of the batch statistics collected during training instead of the exact batch average. Now I am at the implementation stage. The punchline is: do not use any PR that is not merged into the master branch, that simple 🙂
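To make the failure concrete, here is a minimal NumPy sketch, not the code from the Caffe PR. The `BatchNorm1D` class, the `batch_stats_only` helper, the `momentum` value and the toy data are all my own illustrative assumptions. It shows that normalizing a single example with its own batch statistics collapses it to zeros, while statistics accumulated as a moving average during training keep the signal.

```python
import numpy as np


class BatchNorm1D:
    """Minimal batch norm over the batch axis (no learned scale/shift).

    Hypothetical helper for illustration; momentum and eps are assumed values.
    """

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training):
        if training:
            # Use the statistics of the current batch and fold them
            # into the moving averages for later use at test time.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # At test time, ignore the batch and use the moving averages.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)


def batch_stats_only(x, eps=1e-5):
    """The failure mode: always normalize with the current batch's statistics."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=4)

# "Training": feed batches of toy activations with mean 3 and std 2.
for _ in range(200):
    bn.forward(rng.normal(loc=3.0, scale=2.0, size=(32, 4)), training=True)

# "Testing" with a single example.
single = rng.normal(loc=3.0, scale=2.0, size=(1, 4))
collapsed = batch_stats_only(single)               # mean == the example itself
normalized = bn.forward(single, training=False)    # uses moving averages

print("batch stats only:", collapsed)
print("moving averages: ", normalized)
```

With a batch of one, the batch mean equals the example and the variance is zero, so the first output is all zeros no matter what the input was. The second output still carries information about the input because it is normalized against statistics gathered over many training batches.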
- 0.15 - SuperVision (AlexNet) - ~ 60954656 params
- 0.26 - ISI (ensemble of features)
- 0.27 - LEAR (Fisher Vectors)
- 0.06 - GoogLeNet (Inception Modules) - ~ 11176896 params
- 0.07 - VGGnet (Go deeper and deeper)
- 0.08 - SPPnet (A retrospective addition from early vision)
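As a sanity check on the AlexNet parameter count listed above, here is a small arithmetic sketch. The layer shapes are my assumption (the commonly used two-group AlexNet layout with 227x227 inputs, as in the Caffe reference model), and I am assuming the figure counts weights only, without biases. The note itself does not say how the number was obtained.

```python
# (name, kernel_h, kernel_w, input channels seen by each filter, num filters)
# Assumed AlexNet layout; conv2, conv4 and conv5 use 2 groups, so each
# filter only sees half of the previous layer's channels.
conv_layers = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),
    ("conv5", 3, 3, 192, 256),
]

# (name, inputs, outputs); fc6 takes the 6x6x256 output of the last pooling layer.
fc_layers = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]

conv_weights = sum(kh * kw * c_in * n for _, kh, kw, c_in, n in conv_layers)
fc_weights = sum(n_in * n_out for _, n_in, n_out in fc_layers)
biases = sum(n for *_, n in conv_layers) + sum(n_out for *_, n_out in fc_layers)

total_weights = conv_weights + fc_weights
print("conv weights:", conv_weights)            # 2332704
print("fc weights:  ", fc_weights)              # 58621952
print("weights only:", total_weights)           # 60954656
print("with biases: ", total_weights + biases)  # 60965224
```

Under these assumptions the weights-only total matches the figure in the list, and it also shows that the fully connected layers hold the vast majority of the parameters.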
- Adversarial instances and robust models
  - Generative Adversarial Networks http://arxiv.org/abs/1406.2661 - Train the classifier net in opposition to another net that creates possible adversarial instances as training evolves.
  - Apply a genetic algorithm every N training iterations of the net to create some adversarial instances.
  - Apply the fast gradient approach to image pixels to generate adversarial images.
  - Goodfellow states that DAEs and CAEs are not full solutions to this problem. (verify why?)
- Blind training of nets
  - We train huge networks in a very brute-force fashion. What I mean is, we use larger and larger models because we do not know how to learn concise and effective ones. Instead we rely on redundancy and expect at least some units to be receptive to discriminating features.
- Optimization (as always)
  - It seems inefficient to me to still use back-propagation after all the work in the field. Another interesting fact: all the effort in the research community seems to go into finding new tricks that ease back-propagation's flaws. I think we should replace back-propagation altogether instead of relying on fleeting day-to-day fixes.
  - Still using SGD? Still?
- Sparsity?
  - After a year of hot discussion about sparse representations and their similarity to human brain activity, the topic seems to have been shelved. I still believe sparsity is a very important part of good data representations. It should be integrated into state-of-the-art learning models, not only autoencoders.
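For the fast gradient item in the adversarial list above, here is a rough NumPy sketch, assuming it refers to the fast gradient sign idea: nudge every input dimension by a small step in the direction that increases the loss. To keep it self-contained I use a logistic-regression "model" with hand-picked weights instead of a real network. The `fgsm` function name, the weights, the input and `epsilon` are all illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy(p, y):
    """Binary cross-entropy for a single prediction p and label y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def fgsm(x, y, w, b, epsilon):
    """One fast-gradient-sign step on the input of a logistic-regression model."""
    p = sigmoid(x @ w + b)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad_x = (p - y) * w
    # Move each input dimension by epsilon in the loss-increasing direction.
    return x + epsilon * np.sign(grad_x)


# Toy stand-in for a trained model (hand-picked, not learned).
w = np.array([2.0, -1.0, 0.5])
b = 0.1

x = np.array([0.6, 0.2, 0.4])  # "clean" input
y = 1.0                        # its true label

x_adv = fgsm(x, y, w, b, epsilon=0.5)

p_clean = sigmoid(x @ w + b)
p_adv = sigmoid(x_adv @ w + b)
loss_clean = cross_entropy(p_clean, y)
loss_adv = cross_entropy(p_adv, y)

print("clean:       p(y=1) = %.3f, loss = %.3f" % (p_clean, loss_clean))
print("adversarial: p(y=1) = %.3f, loss = %.3f" % (p_adv, loss_adv))
```

In this toy case the perturbation is bounded by `epsilon` in every dimension, yet it pushes the predicted probability of the true class below 0.5. For real images one would also clip the result back into the valid pixel range, which this sketch skips.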
DISCLAIMER: If you are reading this, it is only a captain's note intended for my own research. There are many missing references and novice arguments.