Selfies are everywhere. With different fun masks, poses and filters, it goes crazy. When we coincide with any of these selfies, we automatically give an intuitive score regarding the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes but they are not fully prescribed.
With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them in accordance to its quality and beauty. The idea was to see whether it is possible to mimic that bizarre perceptual understanding of human with the recent advancements of AI. And if it is, then let's make a mobile application and let people use it for whatever purpose. Spoiler alert! We already developed Selfai app available on iOS and Android and we have one instagram bot @selfai_robot. You can check before reading.
Continuous distribution on the simplex which approximates discrete vectors (one hot vectors) and differentiable by its parameters with reparametrization trick used in VAE.
It is used for semi-supervised learning.
DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
Learning useful unsupervised image representations by using triplet loss on image patches. The triplet is defined by two image patches from the same images as the anchor and the positive instances and a patch from a different image which is the negative. It gives a good boost on CIFAR-10 after using it as a pretraning method.
How would you apply to real and large scale classification problem?
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
This paper states the following phrase. Traditional machine learning frameworks (VC dimensions, Rademacher complexity etc.) trying to explain how learning occurs are not very explanatory for the success of deep learning models and we need more understanding looking from different perspectives.
They rely on following empirical observations;
Deep networks are able to learn any kind of train data even with white noise instances with random labels. It entails that neural networks have very good brute-force memorization capacity.
Explicit regularization techniques - dropout, weight decay, batch norm - improves model generalization but it does not mean that same network give poor generalization performance without any of these. For instance, an inception network trained without ant explicit technique has 80.38% top-5 rate where as the same network achieved 83.6% on ImageNet challange with explicit techniques.
A 2 layers network with 2n+d parameters can learn the function f with n samples in d dimensions. They provide a proof of this statement on appendix section. From the empirical stand-view, they show the network performances on MNIST and CIFAR-10 datasets with 2 layers Multi Layer Perceptron.
Above observations entails following questions and conflicts;
Traditional notion of learning suggests stronger regularization as we use more powerful models. However, large enough network model is able to memorize any kind of data even if this data is just a random noise. Also, without any further explicit regularization techniques these models are able to generalize well in natural datasets. It shows us that, conflicting to general belief, brute-force memorization is still a good learning method yielding reasonable generalization performance in test time.
Classical approaches are poorly suited to explain the success of neural networks and more investigation is imperative in order to understand what is really going on from theoretical view.
Generalization power of the networks are not really defined by the explicit techniques, instead implicit factors like learning method or the model architecture seems more effective.
Explanation of generalization is need to be redefined in order to solve the conflicts depicted above.
My take : These large models are able to learn any function (and large does not mean deep anymore) and if there is any kind of information match between the training data and the test data, they are able to generalize well as well. Maybe it might be an explanation to think this models as an ensemble of many millions of smaller models on which is controlled by the zeroing effect of activation functions. Thus, it is able to memorize any function due to its size and implicated capacity but it still generalize well due-to this ensembling effect.
A crucial problem in a real DL system design is to capture test data distribution with the trained model which only sees the training data distribution. Therefore, it is always important to find a good data splitting scheme which at least gives the right measures to such divergence.
It is always a waste to spend all your time for fine-tunning your model on the measure of validation data taken from training data only. Because, when you deploy the model, it undergoes new instances sampled from dynamically shifting data distribution. If you have a chance to see some samples from this dynamic environment, use that to test your model on these real instances and keep your model more coherent and don't mislead your training flow.
That being said, on the above figure, the second row depicts the right way to choose your data split. And the third row shows the smoothed version which is suggested in practice.
Above figure shows common machine learning problems in relation to different components of your work flow. It is really important to understand what is really said here and what these problems explain.
Bias is the quality of your model on training data. If it predicts wrong on training, it has a "Bias" problem. If you have a good performance on training data but not on validation data, it yields "Variance" problem. If performance differs for validation data taken from training set and test set, it is "Train - Test mismatch". If performance suffers due to distribution shift on test time, it is "Overfitting".
Bias requires better architecture and longer training. Variance needs more data and regularization. Train - Test mismatch needs more training data from distribution similar to your test data. Overfitting needs regularization, more data, and data synthesis effort.
Above chart shows a salient way of conducting DL system evolution. Follow these decisions with empirical evidences and don't skip any of these in order not to be disappointed in the end. (I said it with many disappointments 🙂 )
When we see that train, validation errors are close enough to human level performance, it means more variance problem and we need to collect more data similar to test portion and hurdle more data synthesis work. Train and validation errors far from human level performance is the sign of bias problem, requires larger models and more training time. Keep in mind that, human performance is not the limit of what your model is theoretically capable of.
Disclaimer: Figures are taken from https://kevinzakka.github.io/2016/09/26/applying-deep-learning/ which summarizes Andrew Ng's talk.
A successful AI agent should communicate. It is all about language. It should understand and explain itself in words in order to communicate us. All of these spark with the "meaning" of words which the atomic part of human-wise communication. This is one of the fundamental problems of Natural Language Processing (NLP).
"meaning" is described as "the idea that is represented by a word, phrase, etc. How about representing the meaning of a word in a computer. The first attempt is to use some kind of hardly curated taxonomies such as WordNet. However such hand made structures not flexible enough, need human labor to elaborate and do not have semantic relations between words other then the carved rules. It is not what we expect from a real AI agent.
Then NLP research focused to use number vectors to symbolize words. The first use is to donate words with discrete (one-hot) representations. That is, if we assume a vocabulary with 1K words then we create a 1K length 0 vector with only one 1 representing the target word. Continue reading Why do we need better word representations ?→
This paper is an interesting work which tries to explain similarities and differences between representation learned by different networks in the same architecture.
To the extend of their experiments, they train 4 different AlexNet and compare the units of these networks by correlation and mutual information analysis.
They asks following question;
Can we find one to one matching of units between network , showing that these units are sensitive to similar or the same commonalities on the image?
Is the one to one matching stays the same by different similarity measures? They first use correlation then mutual information to confirm the findings.
Is a representation learned by a network is a rotated version of the other, to the extend that one to one matching is not possible between networks?
Is clustering plausible for grouping units in different networks?
Answers to these questions are as follows;
It is possible to find good matching units with really high correlation values but there are some units learning unique representation that are not replicated by the others. The degree of representational divergence between networks goes higher with the number of layers. Hence, we see large correlations by conv1 layers and it the value decreases toward conv5 and it is minimum by conv4 layer.
They first analyze layers by the correlation values among units. Then they measure the overlap with the mutual information and the results are confirming each other..
To see the differences between learned representation, they use a very smart trick. They approximate representations learned by a layer of a network by the another network using the same layer. A sparse approximation is performed using LASSO. The result indicating that some units are approximated well with 1 or 2 units of the other network but remaining set of units require almost 4 counterpart units for good approximation. It shows that some units having good one to one matching has local codes learned and other units have slight distributed codes approximated by multiple counterpart units.
They also run a hierarchical clustering in order to group similar units successfully.
For details please refer to the paper.
My discussion: We see that different networks learn similar representations with some level of accompanying uniqueness. It is intriguing to see that, after this paper, these are the unique representations causing performance differences between networks and whether the effect is improving or worsening. Additionally, maybe we might combine these differences at the end to improve network performances by some set of smart tricks.
One deficit of the paper is that they do not experiment deep networks which are the real deal of the time. As we see from the results, as the layers go deeper, different abstractions exhumed by different networks. I believe this is more harsh by deeper architectures such as Inception or VGG kind.
One another curious thing is to study Residual netwrosk. The intuition of Residual networks to pass the already learned representation to upper layers and adding more to residual channel if something useful learned by the next layer. That idea shows some promise that two residual networks might be more similar compared to two Inception networks. Moreover, we can compare different layers inside a single Residual Network to see at what level the representation stays the same.
Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.
In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.
Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.
Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.
We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.
At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T. Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.
I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.
My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.
Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.
There is theoretical proof that any one hidden layer network with enough number of sigmoid function is able to learn any decision boundary. Empirical practice, however, posits us that learning good data representations demands deeper networks, like the last year's ImageNet winner ResNet.
There are two important findings of this work. The first is,we need convolution, for at least image recognition problems, and the second is deeper is always better . Their results are so decisive on even small dataset like CIFAR-10.
They also give a good little paragraph explaining a good way to curate best possible shallow networks based on the deep teachers.
- train state of deep models
- form an ensemble by the best subset
- collect eh predictions on a large enough transfer test
- distill the teacher ensemble knowledge to shallow network.
(if you like to see more about how to apply teacher - student paradigm successfully refer to the paper. It gives very comprehensive set of instructions.)
Still, ass shown by the experimental results also, best possible shallow network is beyond the deep counterpart.
I believe the success of the deep versus shallow depends not the theoretical basis but the way of practical learning of the networks. If we think networks as representation machine which gives finer details to coerce concepts such as thinking to learn a face without knowing what is an eye, does not seem tangible. Due to the one way information flow of convolution networks, this hierarchy of concepts stays and disables shallow architectures to learn comparable to deep ones.
Then how can we train shallow networks comparable to deep ones, once we have such theoretical justifications. I believe one way is to add intra-layer connections which are connections each unit of one layer to other units of that layer. It might be a recursive connection or just literal connections that gives shallow networks the chance of learning higher abstractions.
Convolution is also obviously necessary. Although, we learn each filter from the whole input, still each filter is receptive to particular local commonalities. It is not doable by fully connected layers since it learns from the whole spatial range of the input.