This work proposes yet another way to initialize a network, namely LSUV (Layer-Sequential Unit-Variance, shortened to LUV in the rest of this post), aimed especially at deep networks. The idea builds on the recently proposed orthogonal initialization and then fine-tunes the weights on data so that each layer's output has a variance of 1.
The scheme follows three stages:
- Initialize the weights with a unit-variance Gaussian.
- Orthogonalize: find the orthonormal components of these weights using SVD and replace the weights with these components.
- Using mini-batches of data, rescale the weights so that each layer's output has a variance of 1. This iterative procedure is walked through below.
To describe the procedure in words: at each iteration we feed a new mini-batch through the layer and compute the variance of its output. We compare that variance with the target variance of 1, using a tolerance threshold that we define. While the number of iterations is below the maximum and the difference is still above the threshold, we rescale the layer weights by dividing them by the square root of the mini-batch variance. Once this layer is initialized, we move on to the next layer.
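The procedure above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: it assumes fully connected layers with ReLU activations, it reuses one fixed mini-batch instead of drawing a new one per iteration, and the names `orthonormal_init`, `lsuv_init`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np


def orthonormal_init(fan_in, fan_out, rng):
    """Unit-variance Gaussian init, then replace the weights by their SVD components."""
    w = rng.standard_normal((fan_in, fan_out))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    # Exactly one of u / vt has the shape of w (both do when w is square).
    return u if u.shape == w.shape else vt


def lsuv_init(layer_sizes, batch, tol=0.01, max_iter=10, seed=0):
    """Initialize layers one by one so each pre-activation has variance close to 1."""
    rng = np.random.default_rng(seed)
    weights = []
    x = batch
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = orthonormal_init(fan_in, fan_out, rng)
        for _ in range(max_iter):
            var = np.var(x @ w)
            if abs(var - 1.0) < tol:
                break
            # Divide by the standard deviation (square root of the variance).
            w = w / np.sqrt(var)
        weights.append(w)
        # Assumed ReLU non-linearity before feeding the next layer.
        x = np.maximum(x @ w, 0.0)
    return weights
```

Skipping the call to `orthonormal_init` and starting from the plain Gaussian matrix gives the simpler variant that uses only the rescaling iterations.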
In essence, here is what the method does. First, we start with a plain Gaussian initialization, which we know is not enough for deep networks. The orthogonalization stage decorrelates the weights so that each unit of the layer starts learning from a distinctly different point in the space. In the final stage, the LUV iterations rescale the weights and keep the forward and backward propagated signals at a useful variance, guarding against vanishing or exploding gradients, similar to Batch Normalization but without its computational load. Nevertheless, as the authors also point out, LUV is not interchangeable with BN, especially for large datasets like ImageNet. Still, I would have liked to see a comparison of LUV vs. BN, but it is either not done or not written in the paper (Edit by the Author: Figure 3 of the paper has a CIFAR comparison of BN and LUV, and ImageNet results are posted at https://github.com/ducha-aiki/caffenet-benchmark).
The good side of this method is that it works, at least in my experiments on ImageNet with different architectures. It is also not much of a hurdle to code if you already have orthogonal initialization at hand. Even if you don't, you can start with a Gaussian initialization, skip the orthogonalization stage, and use the LUV iterations directly. It still works, with a slight decrease in performance.
There is a theoretical proof that a network with a single hidden layer and enough sigmoid units is able to learn any decision boundary. Empirical practice, however, tells us that learning good data representations demands deeper networks, like last year's ImageNet winner ResNet.
There are two important findings of this work. The first is that we need convolution, at least for image recognition problems, and the second is that deeper is always better. Their results are decisive even on a small dataset like CIFAR-10.
They also give a good little paragraph explaining how to build the best possible shallow networks from deep teachers:
- train state-of-the-art deep models
- form an ensemble from the best subset
- collect the ensemble's predictions on a large enough transfer set
- distill the teacher ensemble's knowledge into a shallow network.
(If you would like to see more about how to apply the teacher-student paradigm successfully, refer to the paper. It gives a very comprehensive set of instructions.)
Still, as the experimental results also show, the best possible shallow network falls behind its deep counterpart.
I believe the success of deep over shallow networks depends not on the theoretical basis but on how networks learn in practice. If we think of networks as representation machines that build coarse concepts out of finer details, then learning a face without knowing what an eye is does not seem feasible. Because information flows one way in convolutional networks, this hierarchy of concepts remains and prevents shallow architectures from learning as well as deep ones.
Then how can we train shallow networks comparable to deep ones, given such theoretical justifications? I believe one way is to add intra-layer connections, that is, connections from each unit of a layer to the other units of that same layer. They might be recurrent connections or just lateral connections that give shallow networks the chance to learn higher abstractions.
Convolution is also obviously necessary. Although we learn each filter from the whole input, each filter is still receptive to particular local patterns. Fully connected layers cannot do this, since each of their units learns from the whole spatial range of the input.
Given an imbalanced learning problem with a large class and a small class, with N and M instances respectively:
- Cluster the larger class into M clusters and use the cluster centers for training the model.
- If the model is a neural network or some compatible model, cluster the large class into K clusters and use these clusters as pseudo-classes to train your model. This method is also useful when training your network with a small number of classes. It pushes your net to learn fine-detailed representations.
- Divide the large class into subsets of M instances, then train multiple classifiers and use the ensemble.
- Hard mining is a solution that is unfortunately prone to over-fitting but yields good results in some particular cases such as object detection. The idea is to select the most confusing instances from the large set at each iteration. Thus, select the M most confusing instances from the large class, use them for that iteration, and repeat for the next iteration.
- For batch learning in particular, frequency-based batch sampling might be useful. For each batch you can sample instances from the small class with probability N/(M+N) and from the large class with probability M/(M+N), so that you prioritize the small class instances for the next batch. Since you apply data augmentation as in CNN models, repeating instances of the small class is mostly not a big problem.
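The frequency-based batch sampling idea can be sketched like this. Note that, to actually favor the small class, this sketch draws from it with probability N/(M+N), the larger of the two ratios; that reading of the intent is an assumption, and the function name `sample_balanced_batch` and its arguments are hypothetical.

```python
import numpy as np


def sample_balanced_batch(small_idx, large_idx, batch_size, rng):
    """Draw a batch of instance indices that favors the small class.

    Each slot comes from the small class with probability N / (M + N),
    where M = len(small_idx) and N = len(large_idx), otherwise from the
    large class. Sampling is with replacement, so small-class instances
    repeat and are expected to be varied by data augmentation.
    """
    m, n = len(small_idx), len(large_idx)
    p_small = n / (m + n)
    from_small = rng.random(batch_size) < p_small
    small_draw = rng.choice(small_idx, size=batch_size)
    large_draw = rng.choice(large_idx, size=batch_size)
    return np.where(from_small, small_draw, large_draw)
```

With a very strong imbalance this ratio leaves few large-class instances per batch, so a milder mixing probability may work better in practice.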
A note on metrics: plain accuracy is not a good measure for such problems, since you see very high accuracy if your model just predicts the larger class for all instances. Instead, prefer the ROC curve or keep watching precision and recall.
Please keep me updated if you know something more. Even though this is a very common issue in practice, it is still hard to find a working solution.
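A small sketch makes the point about accuracy concrete. The helper `precision_recall` below is a hypothetical name written for this example, and it treats label 1 as the small (positive) class by assumption.

```python
import numpy as np


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labeled `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    # Define the ratios as 0.0 when nothing was predicted or nothing exists.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return float(precision), float(recall)


# 990 large-class instances (0) and 10 small-class instances (1).
y_true = np.array([0] * 990 + [1] * 10)
# A model that always predicts the large class.
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))
precision, recall = precision_recall(y_true, y_pred)
```

Here the accuracy is 0.99 while precision and recall on the small class are both 0, which is exactly the failure that accuracy hides.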
In this post, I would like to compute the number of visual instances we observe over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.
Let's dive into the computation. Relying on one estimate, an average person sees the world at 45 fps. It goes to extremes for people like fighter pilots, who reach 225 fps with the adrenaline kicked in. I took the average lifetime as 71 years, which equals about 2.24 billion seconds, and we are awake for roughly two thirds of it, which makes 1.49 billion seconds. Then we assume that on average there are 86 billion neurons in our brain. This is our model size.
Eventually and roughly, that is, without any further investigation, we have a model with 86 billion parameters which learns from almost 67 billion images.
Of course this is not a rigorous way to come up with these numbers, but fun comes from ignorance 🙂
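The arithmetic behind these figures can be reproduced in a few lines. The awake fraction of two thirds is inferred from the ratio of the two quoted totals (1.49 / 2.24), and the 365.25-day year is an assumption made for this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length

lifetime_s = 71 * SECONDS_PER_YEAR     # about 2.24 billion seconds
awake_s = lifetime_s * 2 / 3           # about 1.49 billion seconds (assumed 2/3 awake)
frames_seen = awake_s * 45             # 45 fps, about 67 billion frames

print(f"lifetime: {lifetime_s / 1e9:.2f} billion s")
print(f"awake:    {awake_s / 1e9:.2f} billion s")
print(f"frames:   {frames_seen / 1e9:.1f} billion")
```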
After some crawling on the Internet, I stumbled upon this thread on Quora. For the lazy ones, the thread is about the things that humans can do but computers will not be able to do after N years. Many answers refer to the Turing Test, stating that the best AI is still not able to pass it and that therefore we do not need to worry about AI being an existential threat to humanity. First off, I ought to say that I am on the cautious side (like Elon Musk and Bill Gates) about AI being a threat. To explain myself, I would like to show that AI is a threat that has already begun to affect us, even if we take the Turing Test as the validation method. We only need to think about the test in a different way.
For those who don't know what the Turing Test is: A and B (one machine, one human) are hidden from the human observer C. Looking at the interaction between A and B, the observer C tries to decide which one is the human and which is the machine. If observer C cannot decide whether there is a machine or a human behind the curtain, then the machine passes the test. The conclusion is that the machine exhibits intelligent behavior equivalent to, or indistinguishable from, that of a human.
By this definition, it is one of the legitimate milestones for AI on the way to human-capable agents. Therefore, it is normal for people to use the Turing Test to evaluate present AI, its current state, and its future potential.
I have in mind a different formulation of the Turing Test, in which we replace the observer C with a machine as well. The remaining question then becomes: is the machine C able to identify the machine A, or is this identification even necessary any more? Thinking of the formulation in this way answers those who say AI is not a threat since it does not, and will not be able to, pass the Turing Test (at least in the short run). When we replace C with a machine, the machine does not need to pass the Turing Test to be a threat, right? Because we humans are out of the picture, like poor B depicted in the figure above.
Now let me explain what replacing the human observer with a machine means in practice. I believe real-life "communication" is a good way to illustrate the Turing Test. Think about the history of communication. We started with barefoot messengers and have arrived at the light-speed flow of today's world. Back then, we sent a message and waited a very long time for the response, because the tools were the bottleneck of communication. So first we sped up these tools and came up with new technologies. Today our tools are so fast that we are the bottleneck of the flow. We send mails and messages in a second, which floods the inboxes and message stacks and consequently overwhelms us as well. If we also accept that communication is the backbone of today's business world, companies do not want to waste time (time is money), so they attempt to replace the slowest part with faster alternatives, and computerized solutions take the stage in place of old-fashioned human ones. Now, after changing the tools of communication, we have also started to change the parties to the communication, up to the point where there is no need for any human being. We even have a fancy name for this: the Internet of "Things" (not humans any more). If you look at the statistics, a huge portion of the data flow is machine-to-machine communication. Could you say that, at this deeper level of the communication revolution, whether a human observer can distinguish a computer agent still matters? It is clear that we can still devastate our lives with our AI agents without them passing the Turing Test. Just watch unemployment rates against the growth of technological solutions.
Basically, what I am trying to say is: yes, the Turing Test is a barrier for a Sci-Fi level AI threat, but we have changed the rules of the test by placing machines on both sides of the curtain. That means there is no place in that test (or even in real life) for a human, unless machines cannot replace you yet, and be sure that day will come.
A final word: I am an AI guy and of course I am not saying we should stop, but it is an ominously advancing field. The punch line here is to underline the need for introspection about AI and related technologies, and for finding ways to make AI serve human needs and not the other way around. We should be skeptical and stay alert.
Maxout units are well-known and frequently used tools for Deep Neural Networks. For those who do not know, a Maxout unit is, in basic terms, a set of internal activation units that compete with each other for each instance; the activation of the winner is propagated as the output and the losers are kept silent. In the backpropagation phase, this means we update only the winner unit. It also means that, implicitly, we always back-propagate the gradient signal through the strongest path. This is an important aspect of Maxout units, especially for very deep models, which are prone to gradient instability.
Although Maxout units have very good properties like the ones I described (please refer to the paper for more details), I am skeptical of their ability to encode the underlying information and pass it to the next layer. Here is a very simple example. Suppose we have two competing functions (filters) in a Maxout unit. One of these functions responds to edge structures whereas the other responds to corners. For one instance, the first filter might win with a value of, let's say, ~3, which means the Maxout output is also ~3. For another instance, the other function wins with approximately the same value, ~3. If we assume that each NN layer is a classifier that takes the previous layer's output as a feature vector (not a very wrong assumption, I guess), then we give the same value to different detections on a particular feature dimension (the one corresponding to our Maxout unit). Eventually, we cannot expect the next layer to be able to tell these signals apart.
One can argue that we should evaluate a Maxout unit as a whole and that it is reminiscent of an OR function on top of multiple filters. This is a valid argument which I cannot refute directly, but the problem I indicated above is still up in the air. Besides, why would we waste our expensive NN parameters if we could come up with a better encoding scheme for Maxout units?
Here is one alternative approach for better encoding of competing functions, which we call NegOut. Let's assume we fix an ordering of the two competing functions in advance, as 1st and 2nd. If the winner is the 1st function, NegOut outputs the 1st function's value; otherwise it outputs the negative of the 2nd function's value. NegOut relies on two assumptions. First, the competing functions are always non-negative (like ReLU functions). Second, we have exactly 2 competing functions.
In terms of the backpropagation signal, the only difference from a Maxout unit is that we negate the gradient signal for the 2nd competing unit when it is the winner.
As you can see from the figure, the inherent property here is to output different values for different winning detectors, so that the value captures both the structural difference and the strength of the winner's activation.
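The forward and backward rules described above can be written down as a short NumPy sketch. The function names and the tie-breaking choice (the 1st function wins on equality) are assumptions made for illustration; the inputs `a` and `b` are assumed to be non-negative activations of the 1st and 2nd competing functions.

```python
import numpy as np


def negout_forward(a, b):
    """NegOut of two non-negative activations: a if it wins, otherwise -b."""
    first_wins = a >= b  # assumed tie-break: the 1st function wins ties
    out = np.where(first_wins, a, -b)
    return out, first_wins


def negout_backward(grad_out, first_wins):
    """Route the gradient to the winner only; negate it for the 2nd function."""
    grad_a = np.where(first_wins, grad_out, 0.0)
    grad_b = np.where(first_wins, 0.0, -grad_out)
    return grad_a, grad_b


a = np.array([3.0, 0.5])   # e.g. edge detector responses
b = np.array([1.0, 3.0])   # e.g. corner detector responses
out, mask = negout_forward(a, b)
grad_a, grad_b = negout_backward(np.ones_like(out), mask)
```

For these inputs a Maxout unit would output 3 in both cases, whereas NegOut outputs 3 and -3, so the next layer can tell which detector fired.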
I performed some experiments on CIFAR-10 and MNIST comparing a Maxout Network with a NegOut Network, using exactly the same architectures as described in the Maxout paper. The table below summarizes the results I observed in the initial runs, without any fine-tuning or hyper-parameter optimization yet. More comparisons on larger datasets are still in progress.
NegOut gives better results on CIFAR, although it is slightly behind on MNIST. Again, notice that no tuning has taken place for our NegOut network, whereas the Maxout Network is optimized as described in the paper. In addition, the NegOut network uses 2 competing units (as it is constrained by its nature) for the last FC layer, in comparison to the Maxout net, which uses 5 competing units. I expect the difference to grow as we move to larger models and datasets, since representational power matters more for good results as we scale up.
Here, I tried to give a basic sketch of my recent work, which is by no means complete. Different observations and experiments are still running. I also need to include LWTA to be fairer and to cover a wider range of competing units. Please feel free to share your thoughts as well. Any contribution is appreciated.
PS: Lately, I have been devoting myself to analyzing the internal dynamics of Neural Networks with different architectures, layers and activation functions. The aim is to look under the hood and analyze intuitively well-functioning ideas applied to Deep Neural Networks. I also expect to share more of my findings on my blog.
Maxout Networks. I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio. arXiv preprint arXiv:1302.4389
Understanding Locally Competitive Networks. Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, Jürgen Schmidhuber. http://arxiv.org/abs/1410.1165
The idea was sparked (to the best of my knowledge) by Caruana et al. 2006. Basically, the idea is to train an ensemble of networks and use their outputs on a held-out set to distill the knowledge into a smaller network. This idea was recently revisited by G. Hinton's work, which trains a larger network and then uses this network's outputs, mixed with the original training data, to train a smaller network. One important trick is to use higher temperature values in the softmax layer of the teacher network so that the class probabilities are distributed more smoothly over the classes. The student network is then able to learn the class relations induced by the teacher network, besides the true classes of the instances as it is supposed to. Eventually, we are able to compress the knowledge of the teacher net into a smaller network with fewer parameters and faster execution time. Bengio also has a similar work called FitNets, which applies the same idea from a wider angle. They do not only use the outputs of the teacher net; they also carry the representational power of the teacher's hidden layers over to the student net, through a regression loss that approximates the teacher's hidden layer activations from the student's hidden layer activations.
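The temperature trick is easy to see in code. The sketch below is a minimal NumPy illustration, not the exact loss of any of the papers above: `softmax_with_temperature` and `distillation_loss` are hypothetical names, and the loss is simply the cross-entropy between the softened teacher and student distributions.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis; higher temperature gives a smoother distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student's soft predictions against the teacher's soft targets."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return float(-(p * np.log(q)).sum(axis=-1).mean())


teacher_logits = np.array([5.0, 2.0, 0.0])
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=5.0)
```

At temperature 1 almost all the mass sits on the first class, while at temperature 5 the second and third classes receive visible probability, which is the class-relation signal the student learns from.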
Bayesian Breezes:
We are finally able to see some Bayesian arguments on Deep Models. One of the notable works is "Bayesian Dark Knowledge", co-authored by Max Welling. Again we have the previous idea, but with a very simple trick in mind. Basically, we add Gaussian noise, scaled by the decaying learning rate, to the gradient signals. This noise induces MCMC dynamics in the network, so it implicitly learns an ensemble of networks. The teacher trained in that fashion is then used to train student nets with an approach similar to the one proposed by G. Hinton. I won't go into the mathematical details here. I guess this is one of the rare Bayesian approaches that is close to being applicable to real-time problems, with a simple trick that is enough to do all the Bayesian magic.
The Variational Auto-Encoder is not a new work, but it recently drew my attention. The difference between a VAE and a conventional AE is that, given a probability distribution, the VAE learns the best possible representation parametrized by that distribution. Let's say we want to fit a Gaussian distribution to the data. Then it is able to learn the mean and standard deviation of multiple Gaussian functions (corresponding to the VAE's latent units) with backpropagation, using a simple reparametrization trick. Eventually, you obtain multiple Gaussians with different means and standard deviations on the latent units of the VAE, and you can sample new instances out of these. You can learn more from this great tutorial.
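The noisy update described above corresponds to what is usually called stochastic gradient Langevin dynamics. Here is a minimal sketch of one such step, under the standard form of that update (half the step size on the gradient, noise variance equal to the step size); the function name `sgld_step` is hypothetical and `grad` is assumed to be the gradient of the negative log posterior.

```python
import numpy as np


def sgld_step(theta, grad, step_size, rng):
    """One Langevin update: a gradient step plus Gaussian noise tied to the step size."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta - 0.5 * step_size * grad + noise
```

Calling this in a loop with a decaying `step_size` and collecting the visited `theta` values gives approximate posterior samples, which play the role of the implicit ensemble that the student is later trained to imitate.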
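The reparametrization trick mentioned above fits in a couple of lines. In this sketch the encoder is assumed to output a mean and a log-variance per latent unit (using the log-variance instead of the standard deviation is a common convention, adopted here as an assumption), and `sample_latent` is a hypothetical name.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of mu and log_var, which is what lets
    backpropagation reach the encoder parameters.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```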
Recurrent Models for Visual Recognition:
ReNet is a paper from the Montreal group. They propose an alternative to convolutional neural networks for learning spatial structures over visual data. Their idea relies on recurrent neural networks that scan the image as a sequence, first in the horizontal and then in the vertical direction. In the end, the RNN is able to learn the structure over the whole image (or image patch). Although their results are not better than the state of the art, spotting a new alternative to old-fashioned convolution is an exciting effort.
- Generative Adversarial Network (http://arxiv.org/abs/1406.2661): train the classifier net in opposition to another net that creates possible adversarial instances as training evolves.
- Apply genetic algorithms every N training iterations of the net to create some adversarial instances.
- Apply the fast gradient approach to image pixels to generate adversarial images.
- Goodfellow states that DAE or CAE are not full solutions to this problem. (Verify why?)
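For the fast gradient approach in the list above, here is a minimal sketch on a logistic-regression model, where the gradient of the loss with respect to the input has a closed form. A linear model stands in for a deep net purely for illustration, and the name `fgsm_logistic` and all parameter values are assumptions.

```python
import numpy as np


def fgsm_logistic(x, y, w, b, eps):
    """Fast gradient sign perturbation of input x for a logistic-regression model.

    For the cross-entropy loss with p = sigmoid(w.x + b), the gradient of the
    loss with respect to x is (p - y) * w. Each input dimension is moved by
    eps in the direction that increases the loss.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)


w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.4])
x_adv = fgsm_logistic(x, y=1.0, w=w, b=0.0, eps=0.1)
```

The perturbed input lowers the score of the true class (label 1 here) while changing every input dimension by at most `eps`.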
Blind training of nets
We train huge networks in a very brute-force fashion. What I mean is, we are using larger and larger models since we do not know how to learn concise and effective models. Instead, we rely on redundancy and expect that at least some units will be receptive to discriminating features.
Optimization (as always)
It seems inefficient to me to still use back-propagation after all this work in the field. Another interesting fact: all the effort in the research community goes into finding new tricks that ease back-propagation's flaws. I think we should replace back-propagation altogether instead of relying on short-lived daily fixes.
Still using SGD? Still?
After a year of hot discussion about sparse representations and their similarity to human brain activity, the topic seems to have been shelved. I still believe sparsity is a very important part of good data representations. It should be integrated into state-of-the-art learning models, not only AutoEncoders.
DISCLAIMER: If you are reading this, these are only a captain's notes intended for my own research. Expect many missing references and novice arguments.