# Paper Notes: The Shattered Gradients Problem ...

The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.

The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.

White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.

For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order $(/2^L$ where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.

One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).

Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.

The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It  is scaling residual connections by a constant value before summing up with the current layer activation.

#### A small discussion

This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.

Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here,  increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.

Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂

# Duplicate Question Detection with Deep Learning on Quora Dataset

Quora recently announced the first public dataset that they ever released. It includes 404351 question pairs with a label column indicating if they are duplicate or not.  In this post, I like to investigate this dataset and at least propose a baseline method with deep learning.

Beside the proposed method, it includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code you check Github.

## Data Quirks

There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance however when you consider the nature of the problem, it seems reasonable to keep the same data bias with your ML model since negative instances are more expectable in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest question is 1169 character (which is a long, complicated love affair question). I see that if any of the pairs is shorter than 10 characters, they do not make sense thus, I remove such pairs.  The average length is 59 and std is 32.

There are two other columns "q1id" and "q2id" but I really do not know how they are useful since the same question used in different rows has different ids.

Some labels are not true, especially for the duplicate ones. In anyways, I decided to rely on the labels and defer pruning due to hard manual effort.

## Proposed Method

#### Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector then I stack a Siamese network to detect if the pair is duplicate.

Word2Vec is a general term used for similar algorithms that embed words into a vector space with 300 dimensions in general.  These vectors capture semantics and even analogies between different words. The famous example is ;

king - man + woman = queen.

Word2Vec vectors can be used for may useful applications. You can compute semantic word similarity, classify documents or input these vectors to Recurrent Neural Networks for more advance applications.

There are two well-known algorithms in this domain. One is Google's network architecture which learns representation by trying to predict surrounding words of a target word given certain window size. GLOVE is the another methos which relies on co-occurrence matrices. GLOVE is easy to train and it is flexible to add new words out-side of your vocabulary. You might like visit this tutorial to learn more and check this brilliant use-case Sense2Vec.

We still need a way to combine word vectors for singleton question representation. One simple alternative is taking the mean of all word vectors of each question. This is simple but really effective way for document classification and I expect it to work for this problem too.   In addition,  it is possible to enhance mean vector representation by using TF-IDF scores defined for each word. We apply weighted average of word vectors by using these scores. It emphasizes importance of discriminating words and avoid useless, frequent words which are shared by many questions.

#### Siamese Network

I described Siamese network in a previous post. In short, it is a two way network architecture which takes two inputs from the both side. It projects data into a space in which similar items are contracted and dissimilar ones are dispersed over the learned space. It is computationally efficient since networks are sharing parameters.

## Implementation

Let's load the training data first.

For this particular problem, I train my own GLOVE model by using Gensim.

The above code trains a GLOVE model and saves it. It generates 300 dimensional vectors for words. Hyper parameters would be chosen better but it is just a baseline to see a initial performance. However, as I'll show this model gives performance below than my expectation. I believe, this is because our questions are short and does not induce a semantic structure that GLOVE is able to learn a salient model.

Due to the performance issue and the observation above, I decide to use a pre-trained GLOVE model which comes free with Spacy. It is trained on Wikipedia and therefore, it is stronger in terms of word semantics. This is how we use Spacy for this purpose.

Before going further, I really like Spacy. It is really fast and it does everything you need for NLP in a flash of time by hiding many intrinsic details. It deserves a good remuneration.  Similar to Gensim model, it also provides 300 dimensional embedding vectors.

The result I get from Spacy vectors is above Gensim model I trained. It is a better choice to go further with TF-IDF scoring.  For TF-IDF, I used scikit-learn (heaven of ML).  It provides TfIdfVectorizer which does everything you need.

After we find TF-IDF scores, we convert each question to a weighted average of word2vec vectors by these scores. The below code does this for just "question1" column.

Now, we are ready to create training data for Siamese network. Basically, I've just fetch the labels and covert mean word2vec vectors to numpy format. I split the data into train and test set too.

In this stage, we need to define Siamese network structure. I use Keras for its simplicity. Below, it is the whole script that I used for the definition of the model.

I share here the best performing network with residual connections. It is a 3 layers network using Euclidean distance as the measure of instance similarity. It has Batch Normalization per layer. It is particularly important since BN layers enhance the performance considerably. I believe, they are able to normalize the final feature vectors and Euclidean distance performances better in this normalized space.

I tried Cosine distance which is more concordant to Word2Vec vectors theoretically but cannot handle to obtain better results. I also tried to normalize data into unit variance or L2 norm but nothing gives better results than the original feature values.

Let's train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these but hitherto I am able to give promising baseline results.

## Results

In this section, I like to share test set accuracy values obtained by different model and feature extraction settings.  We expect to see improvement over 0.63 since when we set all the labels as 0, it is the accuracy we get.

These are the best results I obtain with varying GLOVE models. they all use the same network and hyper-parameters after I find the best on the last configuration depicted below.

• Gensim (my model) + Siamese: 0.69
• Spacy + Siamese :  0.72
• Spacy + TD-IDF + Siamese : 0.79

We can also investigate the effect of different model architectures.  These are the values following  the best word2vec model shown above.

• 2 layers net : 0.67
• 3 layers net + adam : 0.74
• 3 layers resnet (after relu BN) + adam : 0.77
• 3 layers resnet (before relu BN) + adam : 0.78
• 3 layers resnet (before relu BN) + adam + dropout : 0.75
• 3 layers resnet (before relu BN) + adam + layer concat : 0.79
• 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance : Fail

Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried to introduce Dropout between layers in different orders (before ReLU, after BN etc.), the best I obtain is 0.75.  Concatenation of different layers improves the performance by 1 percent as the final gain.

In conclusion, here I tried to present a solution to this unique problem by composing different aspects of deep learning. We start with Word2Vec and combine it  with TF-IDF and then use Siamese network to find duplicates. Results are not perfect and akin to different optimizations. However, it is just a small try to see the power of deep learning in this domain. I hope you find it useful :).

## Updates

• Switching last layer to FC layer improves performance to 0.84.
• By using bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain details with another post.

# Dilated Convolution

In simple terms, dilated convolution is just a convolution applied to input with defined gaps. With this definitions, given our input is an 2D image, dilation rate k=1 is normal convolution and k=2 means skipping one pixel per input and k=4 means skipping 3 pixels. The best to see the figures below with the same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is 3x3 in this example, and greed area is the receptive field captured by each of these inputs. Receptive field is the implicit area captured on the initial input by each input (unit) to the next layer .

Dilated convolution is a way of increasing receptive view (global view) of the network exponentially and linear parameter accretion. With this purpose, it finds usage in applications cares more about integrating knowledge of the wider context with less cost.

One general use is image segmentation where each pixel is labelled by its corresponding class. In this case, the network output needs to be in the same size of the input image. Straight forward way to do is to apply convolution then add deconvolution layers to upsample[1]. However, it introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolutions high and it avoids the need of upsampling [2][3].

Dilated convolution is applied in domains beside vision as well. One good example is WaveNet[4] text-to-speech solution and ByteNet learn time text translation. They both use dilated convolution in order to capture global view of the input with less parameters.

In short, dilated convolution is a simple but effective idea and you might consider it in two cases;

1. Detection of fine-details by processing inputs in higher resolutions.
2. Broader view of the input to capture more contextual information.
3. Faster run-time with less parameters

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2]Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3]Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5]Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099

# Ensembling Against Adversarial Instances

### What is Adversarial?

Machine learning is everywhere and we are amazed with capabilities of these algorithms. However, they are not great and sometimes they behave so dumb.  For instance, let's consider an image recognition model. This model  induces really high empirical performance and it works great for normal images. Nevertheless, it might fail when you change some of the pixels of an image even so this little perturbation might be indifferent to human eye. There we call this image an adversarial instance.

There are various methods to generate adversarial instances [1][2][3][4]. One method is to take derivative of the model outputs wrt the input values so that we can change instance values to manipulate the model decision. Another approach exploits genetic algorithms to generate manipulative instances which are confidently classified as a known concept (say 'dog') but they are nothing to human eyes.

So why these models are that weak against adversarial instances. One reliable idea states that because adversarial instances lie on the low probability regions of the instance space. Therefore, they are so weird to the network which is trained with a limited number of instances from higher probability regions.

That being said, maybe there is no way to escape from the fretting adversarial instances, especially when they are produced by exploiting weaknesses of a target model with a gradient guided probing. This is a analytic way of searching for a misleading input for that model with an (almost) guaranteed certainty. Therefore in one way or another, we find an perturbed input deceiving any model.

Due to that observation, I believe that adversarial instances can be resolved by multiple models backing each other. In essence, this is the motivation of this work.

### Proposed Work

In this work, I like to share my observations focusing on strength of the ensembles against adversarial instances. This is just a toy example with so much short-comings but I hope it'll give the idea with some emiprical evidences.

As a summary, this is what we do here;

• Train a baseline MNIST ConvNet.
• Create adversarial instances on this model by using cleverhans and save.
• Measure the baseline model performance on adversarial.
• Train the same ConvNet architecture including adversarial instances and measure its performance.
• Train an ensemble of 10 models of the same ConvNet architecture and measure ensemble performance and support the backing argument stated above.

My code full code can be seen on github and I here only share the results and observations. You need cleverhans, Tensorflow and Keras for adversarial generation and you need PyTorch for ensemble training. (Sorry for verbosity of libraries but I like to try PyTorch as well after yeras of tears with Lua).

One problem of the proposed experiment is that we do not recreate adversarial instances for each model and we use a previously created one. Anyways, I believe the empirical values verifies my assumption even in this setting.  In addition,  I plan to do more extensive study as a future work.

#### Create adversarial instances.

I start by training a simple ConvNet architecture on MNIST dataset by using legitimate train and test set splits. This network gives 0.98 test set accuracy after 5 epochs.

For creating adversarial instances, I use fast gradient sign method which perturbs images using the derivative of the model outputs wrt the input values.  You can see a bunch of adversarial samples below.

The same network suffers on adversarial instances (as above) created on the legitimate test set. It gives 0.09 accuracy which is worse then random guess.

#### Plot adversarial instances.

Then I like to see the representational power of the trained model on both the normal and the adversarial instances. I do this by using well-known dimension reduction technique T-SNE. I first compute the last hidden layer representation of the network per instance and use these values as an input to T-SNE which aims to project data onto 2-D space. Here is the final projection for the both types of data.

These projections clearly show that adversarial instances are just a random data points to the trained model and they are receding from the real data points creating what we call low probability regions for the trained model. I also trained the same model architecture by dynamically creating adversarial instances in train time then test its value on the adversarials created previously. This new model yields 0.98 on normal test set, 0.91 on previously created adversarial test set and 0.71 on its own dynamically created adversarial.

Above results show that including adversarial instances strengthen the model. However,  this is conforming to the low probability region argument. By providing adversarial, we let the model to discover low probability regions of adversarial instances. Beside, this is not applicable to large scale problems like ImageNet since you cannot afford to augment your millions of images per iteration. Therefore,  by assuming it works, ensembling is more viable alternative as already a common method to increase overall prediction performance.

#### Ensemble Training

In this part, I train multiple models in different ensemble settings. First, I train N different models with the same whole train data. Then, I bootstrap as I train N different models by randomly sampling data from the normal train set. I also observe the affect of N.

The best single model obtains 0.98 accuracy on the legitimate test set. However, the best single model only obtains 0.22 accuracy on the adversarial instances created in previous part.

When we ensemble models by averaging scores, we do not see any gain and we stuck on 0.24 accuracy for the both training settings. However, surprisingly when we perform max ensemble (only count on the most confident model for each instance), we observe 0.35 for uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equals to 50.

Increasing N raises the adversarial performance. It is much more effective on bootstrapped ensemble. With N=5 we obtain 0.27 for uniform ensemble and 0.32 for bootstrapped ensemble. With N=25 we obtain 0.30 and 0.45 respectively.

These values are interesting especially for the difference of mean and max ensemble. My intuition behind the superiority of maxing is maxing out predictions is able to cover up weaknesses of models by the most confident one, as I suggested in the first place. In that vein, one following observation is that adversarial performance increases as we use smaller random chunks for each model up to a certain threshold with increasing N (number of models in ensemble). It shows us that bootstrapping enables models to learn some of the local regions better and some worse but the worse sides are covered by the more confident model in the ensemble.

As I said before, it is not convenient to use previously created adversarials created by the baseline model in the first part. However, I believe my claim still holds. Assume that we include the baseline model in our best max ensemble above. Still its mistakes would be corrected by the other models. I also tried this (after the comments below) and include the baseline model in our ensemble. 0.57 accuracy only reduces to 0.55. It is still pretty high compared to any other method not seeing adversarial in the training phase.

#### Conclusion

1. It is much more harder to create adversarials for ensemble of models with gradient methods. However, genetic algorithms are applicable.
2. Blind stops of individual models are covered by the peers in the ensemble when we rely on the most confident one.
3. We observe that as we train a model with dynamically created adversarial instances per iteration, it resolves the adversarials created by the test set. That is, since as the model sees examples from these regions it becomes immune to adversarials. It supports the argument stating low probability regions carry adversarial instances.

### (Before finish) This is Serious!

Before I finish, I like to widen the meaning of this post's heading. Ensemble against adversarial!!

"Adversarial instances" is peculiar AI topic. It attracted so much interest first but now it seems forgotten beside research targeting GANs since it does not yield direct profit, compared to having better accuracy.

Even though this is the case hitherto, we need consider this topic more painstakingly from now on. As we witness more extensive and greater AI in many different domains (such as health, law, governace), adversarial instances akin to cause greater problems intentionally or by pure randomness. This is not a sci-fi scenario I'm drawing here. It is a reality as it is prototyped in [3]. Just switch a simple recognition model in [3]  with a AI ruling court for justice.

Therefore, if we believe in a future embracing AI as a great tool to "make the world better place!", we need to study this subject extensively before passing a certain AI threshold.

#### Last Words

This work overlooks many important aspects but after all it only aims to share some of my findings in a spare time research.  For a next post, I like study unsupervised models like Variational Encoders and Denoising Autoencoders by applying these on adversarial instances (I already started!). In addition, I plan to work on other methods for creating different types of adversarials.

From this post you should take;

• References to adversarial instances
• Good example codes waiting you on github that can be used many different projects.
•  Power of ensemble.
• Some of non-proven claims and opinions on the topic.

IN ANY WAY HOPE YOU LIKE IT ! 🙂

References

[1] Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.

[2] Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv: …, 1–10. Retrieved from http://arxiv.org/abs/1312.6199

[3] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697

[4] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Iclr 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572

# Selfai: A Method for Understanding Beauty in Selfies

Selfies are everywhere. With different fun masks, poses and filters,  it goes crazy.  When we coincide with any of these selfies, we automatically give an intuitive score regarding the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes but they are not fully prescribed.

With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them in accordance to its quality and beauty.  The idea was to see whether it is possible to mimic that bizarre perceptual understanding of human with the recent advancements of AI. And if it is, then let's make an application and let people use it for whatever purpose.  For now, we only have an Instagram bot @selfai_robot. You can check before reading.

# What I read lately

##### CATEGORICAL REPARAMETERIZATION WITH GUMBEL SOFTMAX
• Link: https://arxiv.org/pdf/1611.01144v1.pdf
• Continuous distribution on the simplex which approximates discrete vectors (one hot vectors) and differentiable by its parameters with reparametrization trick used in VAE.
• It is used for semi-supervised learning.

##### DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
• Learning useful unsupervised image representations by using triplet loss on image patches. The triplet is defined by two image patches from the same images as the anchor and the positive instances and a patch from a different image which is the negative.  It gives a good boost on CIFAR-10 after using it as a pretraning method.
• How would you apply to real and large scale classification problem?

##### MULTI-RESIDUAL NETWORKS
• For 110-layers ResNet the most contribution to gradient updates come from the paths with 10-34 layers.
• ResNet trained with only these effective paths has comparable performance with the full ResNet. It is done by sampling paths with lengths in the effective range for each mini-batch.
• Instead of going deeper adding more residual connections provides more boost due to the notion of exponential ensemble of shallow networks by the residual connections.
• Removing a residual block from a ResNet has negligible drop on performance in test time in contrast to VGG and GoogleNet.

# Paper review - Understanding Deep Learning Requires Rethinking Generalization

This paper states the following phrase. Traditional machine learning frameworks (VC dimensions, Rademacher complexity etc.) trying to explain how learning occurs are not very explanatory for the success of deep learning models and we need more understanding looking from different perspectives.

They rely on following empirical observations;

• Deep networks are able to learn any kind of train data even with white noise instances with random labels. It entails that neural networks have very good brute-force memorization capacity.
• Explicit regularization techniques - dropout, weight decay, batch norm - improves model generalization but it does not mean that same network give poor generalization performance without any of these. For instance, an inception network trained without ant explicit technique has 80.38% top-5 rate where as the same network achieved 83.6% on ImageNet challange with explicit techniques.
• A 2 layers network with 2n+d parameters can learn the function f with n samples in d dimensions. They provide a proof of this statement on appendix section. From the empirical stand-view, they show the network performances on MNIST and CIFAR-10 datasets with 2 layers Multi Layer Perceptron.

Above observations entails following questions and conflicts;

• Traditional notion of learning suggests stronger regularization as we use more powerful models. However, large enough network model is able to memorize any kind of data even if this data is just a random noise. Also, without any further explicit regularization techniques these models are able to generalize well in natural datasets.  It shows us that, conflicting to general belief, brute-force memorization is still a good learning method yielding reasonable generalization performance in test time.
• Classical approaches are poorly suited to explain the success of neural networks and more investigation is imperative in order to understand what is really going on from theoretical view.
• Generalization power of the networks are not really defined by the explicit techniques, instead implicit factors like learning method or the model architecture seems more effective.
• Explanation of generalization is need to be redefined in order to solve the conflicts depicted above.

My take :  These large models are able to learn any function (and large does not mean deep anymore) and if there is any kind of information match between the training data and the test data, they are able to generalize well as well. Maybe it might be an explanation to think this models as an ensemble of many millions of smaller models on which is controlled by the zeroing effect of activation functions.  Thus, it is able to memorize any function due to its size and implicated capacity but it still generalize well due-to this ensembling effect.

# Important Nuances to Train Deep Learning Models.

A crucial problem in a real DL system design is to capture test data distribution with the trained model which only sees the training data distribution.  Therefore, it is always important to find a good data splitting scheme which at least gives the right measures to such divergence.

It is always a waste to spend all your time for fine-tunning your model on the measure of validation data taken from training data only. Because, when you deploy the model, it undergoes new instances sampled from dynamically shifting data distribution. If you have a chance to see some samples from this dynamic environment, use that to test your model on these real instances and keep your model more coherent and don't mislead your training flow.

That being said, on the above figure, the second row depicts the right way to choose your data split. And the third row shows the smoothed version which is suggested in practice.

Above figure shows common machine learning problems in relation to different components of your work flow. It is really important to understand what is really said here and what these problems explain.

Bias is the quality of your model on training data. If it predicts wrong on training, it has a "Bias" problem. If you have a good performance on training data but not on validation data, it yields "Variance" problem. If performance differs for validation data taken from training set and test set, it is "Train - Test mismatch". If performance suffers due to distribution shift on test time, it is "Overfitting".

Bias requires better architecture and longer training. Variance needs more data and regularization. Train - Test mismatch needs more training data from distribution similar to your test data. Overfitting needs regularization, more data, and data synthesis effort.

Above chart shows a salient way of conducting  DL system evolution.  Follow these decisions with empirical evidences and don't skip any of these in order not to be disappointed in the end. (I said it with many disappointments 🙂 )

When we see that train, validation errors are close enough to human level performance, it means more variance problem and we need to  collect more data similar to test portion and hurdle more data synthesis work. Train and validation errors  far from human level performance is the sign of bias problem, requires larger models and more training time. Keep in mind that, human performance is not the limit of what your model is theoretically capable of.

Disclaimer: Figures are taken from https://kevinzakka.github.io/2016/09/26/applying-deep-learning/ which summarizes Andrew Ng's talk.

EDIT:

From NIPS 2016:

# Face Detection by Literature

Please ping me if you know something more.

Multi-view Face Detection Using Deep Convolutional Neural Network

1. Train face classifier with face (> 0.5 overlap) and background (<0.5 overlap) images.
2.  Compute heatmap over test image scaled to different sizes with sliding window
3.  Apply NMS .
4.  Computation intensive, especially for CPU.
•  http://arxiv.org/abs/1502.02766

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Keywords: object proposals, facial parts,  more annotation.

1. Use facial part annotations
2. Bottom up to detect face from facial parts.
3. "Faceness-Net’s pipeline consists of three stages,i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection."
4. Train part based classifiers based on attributes related to different parts of the face i.e. for hair part train ImageNet pre-trained network for color classification.
5. Very robust to occlusion and background clutter.
6. To much annotation effort.
7. Still object proposals (DL community should skip proposal approach. It complicate the problem by creating a new domain of problem :)) ).
• http://arxiv.org/abs/1509.06451

Supervised Transformer Network for Efficient Face Detection

• http://home.ustc.edu.cn/~chendong/STN_Detector/stn_detector.pdf

UnitBox: An Advanced Object Detection Network

• http://arxiv.org/abs/1608.02236

Deep Convolutional Network Cascade for Facial Point Detection

• http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf
• http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
• https://github.com/luoyetx/deep-landmark

WIDER FACE: A Face Detection Benchmark

A novel cascade detection method being a state of art at WIDER FACE

1. Train separate CNNs for small range of scales.
2. Each detector has two stages; Region Proposal Network + Detection Network
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/paper.pdf

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

1.  Similar to YOLO .
2.  Image pyramid of input
3.  Feed to network
4. Upsample feature maps after a layer.
5. Predict classification score and bbox location per pixel on upsampled feature map.
6. NMS to bbox locations.
7. SoA at MALF face dataset
• http://arxiv.org/pdf/1509.04874v3.pdf
• http://www.cbsr.ia.ac.cn/faceevaluation/results.html

Face Detection without Bells and Whistles

Keywords: no NN, DPM, Channel Features

1. ECCV 2014
2. Very high quality detections
3. Very slow on CPU and acceptable on GPU
• https://bitbucket.org/rodrigob/doppia/
• http://rodrigob.github.io/documents/2014_eccv_face_detection_with_supplementary_material.pdf

# Why do we need better word representations ?

A successful AI agent should communicate. It is all about language. It should understand and explain itself in words in order to communicate us.  All of these spark with the "meaning" of words which the atomic part of human-wise communication. This is one of the fundamental problems of Natural Language Processing (NLP).

"meaning" is described as "the idea that is represented by a word, phrase, etc. How about representing the meaning of a word in a computer. The first attempt is to use some kind of hardly curated taxonomies such as WordNet. However such hand made structures not flexible enough, need human labor to elaborate and  do not have semantic relations between words other then the carved rules. It is not what we expect from a real AI agent.

Then NLP research focused to use number vectors to symbolize words. The first use is to donate words with discrete (one-hot) representations. That is, if we assume a vocabulary with 1K words then we create a 1K length 0 vector with only one 1 representing the target word. Continue reading Why do we need better word representations ?