MS researcher recently introduced a new deep ( indeed very deep 🙂 ) NN model (PReLU Net)  and they push the state of art in ImageNet 2012 dataset from 6.66% (GoogLeNet) to 4.94% top-5 error rate.
In this work, they introduce an alternation of well-known ReLU activation function. They call it PReLu (Parametric Rectifier Linear Unit). The idea behind is to allow negative activations on the ReLU function with a control parameter which is also learned over the training phase. Therefore, PReLU allows negative activations and in the paper they argue and emprically show that PReLU is better to resolve diminishing gradient problem for very deep neural networks (> 13 layers) due to allowance of negative activations. That means more activations per layer, hence more gradient feedback at the backpropagation stage.
Contractive Auto-Encoder is a variation of well-known Auto-Encoder algorithm that has a solid background in the information theory and lately deep learning community. The simple Auto-Encoder targets to compress information of the given data as keeping the reconstruction cost lower as much as possible. However another use is to enlarge the given input's representation. In that case, you learn over-complete representation of the given data instead of compressing it. Most common implication is Sparse Auto-Encoder that learns over-complete representation but in a sparse (smart) manner. That means, for a given instance only informative set of units are activated, therefore you are able to capture more discriminative representation, especially if you use AE for pre-training of your deep neural network.
After this intro. what is special about Contraction Auto-Encoder (CAE)? CAE simply targets to learn invariant representations to unimportant transformations for the given data. It only learns transformations that are exactly in the given dataset and try to avoid more. For instance, if you have set of car images and they have left and right view points in the dataset, then CAE is sensitive to those changes but it is insensitive to frontal view point. What it means that if you give a frontal car image to CAE after the training phase, it tries to contract its representation to one of the left or right view point car representation at the hidden layer. In that way you obtain some level of view point in-variance. (I know, this is not very good example for a cannier guy but I only try to give some intuition for CAE).
From the mathematical point of view, it gives the effect of contraction by adding an additional term to reconstruction cost. This addition is the Sqrt Frobenius norm of Jacobian of the hidden layer representation with respect to input values. If this value is zero, it means, as we change input values, we don't observe any change on the learned hidden representations. If we get very large values then the learned representation is unstable as the input values change.
This was just a small intro to CAE, if you like the idea please follow the below videos of Hugo Larochelle's lecture and Pascal Vincent's talk at ICML 2011 for the paper.
Here is the G. Hinton's talk at MIT about t inabilities of Convolutional Neural Networks and 4 basic arguments to solve these.
I just watched it with a slight distraction and I need to reiterate. However these are the basic arguments in which G. Hinton is proposed whilst the speech.
1. CNN + Max Pooling is not the way of handling visual information as the human brain does. Yes, it works in practice for the current state of the art but, especially view point changes of the target objects are still unsolved.
2. Apply Equivariance instead of Invariance. Instead of learning invariant representations to the view point changes, learn changing representations correlated with the view point changes.
3. In the space of CNN weight matrices, view point changes are totally non-linear and therefore hard to learn. However, if we transfer instances into a space where the view point changes are globally linear, we can ease the problem. ( Use graphics representation uses explicit pose coordinates)
4. Route information to right set of neurons instead of unguided forward and backward passes. Define certain neuron groups ( called capsules ) that are receptive to particular set of data clusters in the instance space and each of these capsules contributes to the whole model as much as the given instance's membership to neuron's cluster.
In this post, I'll talk about the details of Feature Extraction (aka Feature Construction, Feature Aggregation …) in the path of successful ML. Finding good feature representations is a domain related process and it has an important influence on your final results. Even if you keep all the settings same, with different Feature Extraction methods you would observe drastically different results at the end. Therefore, choosing the correct Feature Extraction methodology requires painstaking work.
Feature Extraction is a process of conveying the given raw data into set of instance points embedded in a standardized, distinctive and machine understandable space. Standardized means comparable representations with same length; so you can compute similarities or differences of the instances that have initially very versatile structural differences (like different length documents). Distinctive means having different feature values for different class instances so that we can observe clusters of different classes in the new data space. Machine understandable representation is mostly the numerical representation of the given instances. You can understand any document by reading it but machines only understand semantics implied by the numbers. Continue reading ML Work-Flow (Part 3) - Feature Extraction→
Since the initial standpoint of science, technology and AI, scientists following Blaise Pascal and Von Leibniz ponder about a machine that is intellectually capable as much as humans. Famous writers like Jules Continue reading Brief History of Machine Learning→
Here I share enhanced version of one of my Quora answer to a similar question ...
There is no single answer for this question since there are many diverse set of methods to extract feature from an image.
First, what is called feature? "a distinctive attribute or aspect of something." so the thing is to have some set of values for a particular instance that diverse that instance from the counterparts. In the field of images, features might be raw pixels for simple problems like digit recognition of well-known Mnist dataset. However, in natural images, usage of simple image pixels are not descriptive enough. Instead there are two main steam to follow. One is to use hand engineered feature extraction methods (e.g. SIFT, VLAD, HOG, GIST, LBP) and the another stream is to learn features that are discriminative in the given context (i.e. Sparse Coding, Auto Encoders, Restricted Boltzmann Machines, PCA, ICA, K-means). Note that second alternative, Continue reading How does Feature Extraction work on Images?→