# Paper review - Understanding Deep Learning Requires Rethinking Generalization

This paper makes the following claim: traditional machine learning frameworks (VC dimension, Rademacher complexity, etc.) that try to explain how learning occurs do not explain the success of deep learning models well, and we need to understand it from different perspectives.

They rely on the following empirical observations:

• Deep networks are able to fit any kind of training data, even white-noise instances with random labels. This implies that neural networks have very good brute-force memorization capacity.
• Explicit regularization techniques (dropout, weight decay, batch norm) improve model generalization, but that does not mean the same network generalizes poorly without them. For instance, an Inception network trained without any explicit technique reaches 80.38% top-5 accuracy, whereas the same network achieved 83.6% on the ImageNet challenge with explicit techniques.
• A two-layer network with 2n+d parameters can represent any function f on a sample of n points in d dimensions. They provide a proof of this statement in the appendix. On the empirical side, they report the performance of two-layer multilayer perceptrons on MNIST and CIFAR-10.
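The 2n+d claim is easier to believe with a concrete construction. The sketch below is my own NumPy rendering of the usual argument: project the points onto a random direction `a`, place the n biases between the sorted projections, and solve a triangular system for the n output weights. The function names and the random choice of `a` are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def fit_two_layer_relu(X, y, seed=0):
    """Build c(x) = sum_j w[j] * relu(a @ x - b[j]) that fits (X, y) exactly.

    Parameter count: a has d entries, b has n, w has n, so 2n + d in total.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a = rng.normal(size=d)            # random direction; projections are distinct almost surely
    z = X @ a
    order = np.argsort(z)
    z_sorted = z[order]
    # Interleave biases so that b[0] < z[0] < b[1] < z[1] < ... (in sorted order).
    b = np.empty(n)
    b[0] = z_sorted[0] - 1.0
    b[1:] = 0.5 * (z_sorted[:-1] + z_sorted[1:])
    # A[i, j] = relu(z_i - b_j) is lower triangular with a positive diagonal.
    A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])
    return a, b, w

def predict(X, a, b, w):
    """Evaluate the two-layer ReLU network on the rows of X."""
    return np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w
```

Because the system is triangular with a nonzero diagonal, it has a solution for any labels, including purely random ones, which is exactly the memorization point made above.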

The observations above raise the following questions and conflicts:

• The traditional notion of learning suggests stronger regularization as we use more powerful models. However, a large enough network is able to memorize any kind of data, even if the data is just random noise. Also, without any explicit regularization, these models are able to generalize well on natural datasets. It shows us that, contrary to general belief, brute-force memorization can still be a good learning method that yields reasonable generalization performance at test time.
• Classical approaches are poorly suited to explaining the success of neural networks, and more investigation is imperative in order to understand what is really going on from a theoretical point of view.
• The generalization power of these networks is not really determined by the explicit techniques; implicit factors like the learning method or the model architecture seem to matter more.
• The explanation of generalization needs to be redefined in order to resolve the conflicts described above.

My take: These large models are able to learn any function (and large does not mean deep anymore), and if there is any kind of information match between the training data and the test data, they are able to generalize well too. One possible explanation is to think of these models as an ensemble of many millions of smaller models, selected by the zeroing effect of the activation functions. Thus, the network is able to memorize any function due to its size and implied capacity, but it still generalizes well due to this ensembling effect.

# Why do we need better word representations ?

A successful AI agent should communicate, and communication is all about language. The agent should understand words and explain itself in words in order to communicate with us. All of this starts with the "meaning" of words, which is the atomic part of human communication. This is one of the fundamental problems of Natural Language Processing (NLP).

"Meaning" is described as "the idea that is represented by a word, phrase, etc." How can we represent the meaning of a word in a computer? The first attempt was to use hand-curated taxonomies such as WordNet. However, such handmade structures are not flexible enough, need human labor to extend, and do not capture semantic relations between words other than the hand-coded rules. It is not what we expect from a real AI agent.

Then NLP research focused on using vectors of numbers to symbolize words. The first approach is to denote words with discrete (one-hot) representations. That is, if we assume a vocabulary of 1K words, then each word is a 1K-length vector of zeros with a single 1 at the position of the target word.
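A tiny sketch of the one-hot idea, using a made-up three-word vocabulary (the words and the `one_hot` helper are only for illustration). It also shows the well-known weakness of this encoding: any two different words have a dot product of zero, however related they are.

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]            # toy vocabulary, chosen for illustration
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Dot-product "similarity": different words never overlap, identical words always do.
sim_related = one_hot("hotel") @ one_hot("motel")
sim_same = one_hot("hotel") @ one_hot("hotel")
```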

# Object Detection Literature

<Please let me know if there are more works comparable to these below.>

R-CNN minus R

• http://arxiv.org/pdf/1506.06981.pdf

FasterRCNN (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks)

Keywords: RCNN, RoI pooling, object proposals, ImageNet 2015 winner.

PASCAL VOC2007: 73.2%

PASCAL VOC2012: 70.4%

ImageNet Val2 set: 45.4% MAP

1. Model agnostic
2. State of the art with Residual Networks
•  http://arxiv.org/pdf/1512.03385v1.pdf
3. Fast enough for offline systems and partially for online systems
• https://arxiv.org/pdf/1506.01497.pdf
• https://github.com/ShaoqingRen/faster_rcnn (official)
• https://github.com/rbgirshick/py-faster-rcnn
• http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bil722/slides/w05-FasterR-CNN.pdf
• https://github.com/precedenceguo/mx-rcnn
• https://github.com/mitmul/chainer-faster-rcnn

YOLO (You Only Look Once: Unified, Real-Time Object Detection)

Keywords: real-time detection, end2end training.

PASCAL VOC 2007: 63.4% (YOLO), 57.9% (Fast YOLO)

RUN-TIME : 45 FPS (YOLO), 155 FPS (Fast YOLO)

1. VGG-16 based model
2. End-to-end learning with no extra hassle (no proposals)
3. Fastest, with some loss of accuracy relative to Faster R-CNN
4. Applicable to online systems
• http://pjreddie.com/darknet/yolo/
• https://github.com/pjreddie/darknet
• https://github.com/BriSkyHekun/py-darknet-yolo (python interface to darknet)
• https://github.com/tommy-qichang/yolo.torch
• https://github.com/gliese581gg/YOLO_tensorflow
• https://github.com/ZhouYzzz/YOLO-mxnet
• https://github.com/xingwangsfu/caffe-yolo
• https://github.com/frankzhangrui/Darknet-Yolo (custom training)

MultiBox (Scalable Object Detection using Deep Neural Networks)

Keywords: cascade classifiers, object proposal network.

1. Similar to YOLO
2. Two successive networks for generating object proposals and classifying these
• http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Erhan_Scalable_Object_Detection_2014_CVPR_paper.pdf

ION (Inside - Outside Net)

Keywords: object proposal network, RNN, context features

1. RNN networks on top of conv5 layer in 4 different directions
2. Concatenate features from different layers with L2 normalization + rescaling
• (great slide) http://www.seanbell.ca/tmp/ion-coco-talk-bell2015.pdf

UnitBox (UnitBox: An Advanced Object Detection Network)

• https://arxiv.org/pdf/1608.01471v1.pdf

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

1. Similar to YOLO.
2. Build an image pyramid of the input.
3. Feed it to the network.
4. Upsample feature maps after a layer.
5. Predict classification score and bbox location per pixel on the upsampled feature map.
6. Apply NMS to the bbox locations.
• http://arxiv.org/pdf/1509.04874v3.pdf
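Several detectors in this list finish with non-maximum suppression (NMS), DenseBox's last step for example. Below is a minimal NumPy sketch of the usual greedy variant. It assumes boxes given as [x1, y1, x2, y2] and a hypothetical `iou_threshold` of 0.5; it is not taken from the DenseBox code.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]     # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```

The threshold trades duplicate detections against missed neighbours: a lower value suppresses more aggressively.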

MRCNN: Object detection via a multi-region & semantic segmentation-aware CNN model

PASCAL VOC2007: 78.2% MAP

PASCAL VOC2012: 73.9% MAP

Keywords: bbox regression, segmentation aware

1. Very large model with a lot of detail.
2. Divide each detection window into different regions.
3. Learn a different network per region scheme.
4. Strengthen the representation by using the entire-image network.
5. Use a segmentation-aware network which takes the entire image as input.
• http://arxiv.org/pdf/1505.01749v3.pdf
• https://github.com/gidariss/mrcnn-object-detection

SSD: Single Shot MultiBox Detector

PASCAL VOC2007: 75.5% MAP (SSD 500), 72.1% MAP (SSD 300)

PASCAL VOC2012: 73.1% MAP (SSD 500)

RUN-TIME: 23 FPS (SSD 500), 58 FPS (SSD 300)

Keywords: real-time, no object proposal, end2end training

1. Faster and more accurate than YOLO (their claim)
2. Not useful for small objects
• https://arxiv.org/pdf/1512.02325v2.pdf
• https://github.com/weiliu89/caffe/tree/ssd

CRAFT (CRAFT Objects from Images)

PASCAL VOC2007: 75.7% MAP

PASCAL VOC2012: 71.3% MAP

ImageNet Val2 set: 48.5% MAP

• intro: CVPR 2016. Cascade Region-proposal-network And FasT-rcnn. an extension of Faster R-CNN
• http://byangderek.github.io/projects/craft.html
• https://github.com/byangderek/CRAFT
• https://arxiv.org/abs/1604.03239

Hierarchical Object Detection with Deep Reinforcement Learning

1. Hierarchically propose object regions
2. Do not share conv computation by RoI pooling
3. Use direct proposals on the input image
4. Conv sharing reduces the performance due to spatial information loss (their claim)
5. They do not give extensive experimentation!
6. The given visual examples are simple, without any cluttered background!
7. Still, using Reinforcement Learning seems curious.
• https://arxiv.org/pdf/1611.03718v1.pdf

# Paper review: CONVERGENT LEARNING: DO DIFFERENT NEURAL NETWORKS LEARN THE SAME REPRESENTATIONS?

This paper is an interesting work which tries to explain the similarities and differences between representations learned by different networks with the same architecture.

In their experiments, they train 4 different AlexNets and compare the units of these networks by correlation and mutual information analysis. The questions they ask are:

• Can we find a one-to-one matching of units between networks, showing that these units are sensitive to similar or the same commonalities in the image?
• Does the one-to-one matching stay the same under different similarity measures? They first use correlation, then mutual information to confirm the findings.
• Is the representation learned by one network a rotated version of the other's, to the extent that one-to-one matching between networks is not possible?
• Is clustering plausible for grouping units in different networks?

The answers to these questions are as follows:

• It is possible to find well-matching units with really high correlation values, but there are some units that learn unique representations not replicated by the other networks. The degree of representational divergence between networks grows with depth. Hence, we see large correlations at the conv1 layers; the value decreases toward conv5 and is lowest at the conv4 layer.
• They first analyze layers by the correlation values among units. Then they measure the overlap with mutual information, and the two sets of results confirm each other.
• To see the differences between learned representations, they use a very smart trick. They approximate the representation learned by a layer of one network using the same layer of another network. A sparse approximation is performed using LASSO. The result indicates that some units are approximated well with 1 or 2 units of the other network, but the remaining units require almost 4 counterpart units for a good approximation. It shows that units with a good one-to-one matching have learned local codes, while the other units have slightly distributed codes that are approximated by multiple counterpart units.
• They also run hierarchical clustering, which successfully groups similar units.
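To make the correlation analysis concrete, here is a small NumPy sketch of how unit matching between two networks could be computed. It assumes we already have activation matrices of shape (num_samples, num_units) for the same layer of both networks on the same inputs. The function names are mine, and the argmax matching is the simplest possible scheme (each unit picks its best partner), not necessarily the exact matching procedure used in the paper.

```python
import numpy as np

def unit_correlations(acts_a, acts_b):
    """Pearson correlation between every unit of net A and every unit of net B.

    acts_a, acts_b: (num_samples, num_units) activations on the same inputs.
    Returns a (units_a, units_b) correlation matrix.
    """
    za = (acts_a - acts_a.mean(axis=0)) / acts_a.std(axis=0)
    zb = (acts_b - acts_b.mean(axis=0)) / acts_b.std(axis=0)
    return za.T @ zb / acts_a.shape[0]

def best_matches(corr):
    """For each unit in A, the index of its most correlated unit in B."""
    return corr.argmax(axis=1)
```

A strict one-to-one matching would need an assignment solver on top of the same correlation matrix; units whose best correlation stays low are the candidates for the "unique" representations mentioned above.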

For details please refer to the paper.

My discussion: We see that different networks learn similar representations with some level of accompanying uniqueness. After this paper, it would be intriguing to see whether these unique representations are what cause the performance differences between networks, and whether their effect is positive or negative. Additionally, we might combine these differences at the end to improve network performance with some set of smart tricks.

One deficit of the paper is that they do not experiment with the very deep networks which are the real deal these days. As we see from the results, as the layers go deeper, different abstractions are extracted by different networks. I believe this is more severe in deeper architectures such as Inception or VGG.

Another curious thing to study is Residual networks. The intuition of Residual networks is to pass the already learned representation to upper layers and to add more through the residual channel if the next layer learns something useful. That idea suggests that two residual networks might be more similar than two Inception networks. Moreover, we can compare different layers inside a single Residual Network to see up to what level the representation stays the same.

# Paper review: ALL YOU NEED IS A GOOD INIT

This work proposes yet another way to initialize your network, namely LSUV (Layer-sequential Unit-variance), targeting especially deep networks. The idea relies on the recently proposed Orthogonal initialization, followed by fine-tuning the weights with data so that each layer's output has a variance of 1.

The scheme follows four steps:

1. Initialize weights with a unit-variance Gaussian
2. Find the orthonormal components of these weights using SVD
3. Replace the weights with these components
4. Using minibatches of data, rescale the weights so that each layer's output has a variance of 1. This iterative procedure is described in the pseudocode below.

To describe the code in words: at each iteration we feed a new mini-batch and compute the variance of the layer output. We compare the difference between the computed variance and the target variance 1 against the threshold $Tol_{var}$. While the number of iterations is below the maximum number of iterations and the difference is above $Tol_{var}$, we rescale the layer weights by dividing them by the square root of the minibatch variance. After initializing this layer, we go on to the next layer.
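The procedure described above can be sketched in a few lines of NumPy for a plain fully connected ReLU network. This is my own minimal reading of the steps, not the authors' code: `lsuv_init`, `orthonormal`, the default `tol_var` and `max_iter` values, and the choice of ReLU between layers are all assumptions made for illustration.

```python
import numpy as np

def orthonormal(shape, rng):
    """Steps 1-3: Gaussian matrix replaced by an orthonormal basis taken from its SVD."""
    g = rng.normal(size=shape)
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(layer_sizes, batches, tol_var=0.05, max_iter=10, seed=0):
    """Step 4: layer by layer, rescale weights until the output variance is close to 1."""
    rng = np.random.default_rng(seed)
    weights = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = orthonormal((fan_in, fan_out), rng)
        for it in range(max_iter):
            x = batches[it % len(batches)]       # a new mini-batch per iteration
            for prev in weights:                 # forward through already initialized layers
                x = np.maximum(x @ prev, 0.0)
            var = (x @ w).var()                  # variance of this layer's output
            if abs(var - 1.0) < tol_var:
                break
            w = w / np.sqrt(var)                 # divide by the standard deviation
        weights.append(w)
    return weights
```

Dividing by the square root of the variance matters: scaling the weights by a factor c scales the output variance by c squared, so this choice brings the variance to 1 on the current batch in a single step.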

In essence, here is what this method does. First, we start with a normal Gaussian initialization, which we know is not enough for deep networks. The orthogonalization stage decorrelates the weights so that each unit of the layer starts learning from a distinctly different point in the space. At the final stage, the LSUV iterations rescale the weights and keep the forward and backward propagated signals close to a useful variance, guarding against the vanishing or exploding gradient problem, similar to Batch Normalization but without its computational load. Nevertheless, as the authors also point out, LSUV is not interchangeable with BN, especially for large datasets like ImageNet. Still, I'd like to see a comparison of LSUV vs. BN, but it is not done or not written in the paper (Edit by the Author: Figure 3 in the paper has a CIFAR comparison of BN and LSUV, and ImageNet results are posted on https://github.com/ducha-aiki/caffenet-benchmark).

The good side of this method is that it works, at least in my experiments on ImageNet with different architectures. It is also not much of a hurdle to code if you already have Orthogonal initialization at hand. Even if you don't have it, you can start with a Gaussian initialization scheme, skip the orthogonalization stage, and directly use the LSUV iterations. It still works, with a slight decrease in performance.

# Paper Review: Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)?

There is a theoretical proof that a network with one hidden layer and a large enough number of sigmoid units is able to learn any decision boundary. Empirical practice, however, tells us that learning good data representations demands deeper networks, like last year's ImageNet winner ResNet.

There are two important findings of this work. The first is that we need convolution, at least for image recognition problems, and the second is that deeper is always better. Their results are decisive even on a small dataset like CIFAR-10.

They also give a nice short paragraph explaining a good way to build the best possible shallow networks based on deep teachers.

- train state-of-the-art deep models

- form an ensemble from the best subset

- collect the ensemble's predictions on a large enough transfer set

- distill the teacher ensemble's knowledge into the shallow network.

(If you would like to see more about how to apply the teacher-student paradigm successfully, refer to the paper. It gives a very comprehensive set of instructions.)
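As a rough illustration of the last two steps of the recipe, the sketch below averages the logits of several teachers and fits a student to them with a squared-error ("mimic") loss. Regressing on logits is one common way to do this kind of distillation; treating it as the loss here is my assumption, and the linear student and all function names are toy placeholders rather than anything from the paper.

```python
import numpy as np

def ensemble_logits(teacher_logits):
    """Average the logits of several teachers: (num_teachers, N, C) -> (N, C)."""
    return np.mean(teacher_logits, axis=0)

def mimic_loss(student_logits, target_logits):
    """Mean squared error between student logits and ensemble logits."""
    return np.mean((student_logits - target_logits) ** 2)

def train_linear_student(X, targets, lr=0.1, steps=500):
    """Toy student: one linear layer fitted to the ensemble logits by gradient descent."""
    W = np.zeros((X.shape[1], targets.shape[1]))
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ W - targets) / targets.size   # gradient of mimic_loss
        W -= lr * grad
    return W
```

Note that no ground-truth labels appear anywhere: the student only sees the transfer set inputs and the ensemble's outputs on them.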

Still, as the experimental results also show, the best possible shallow network falls behind its deep counterpart.

My Discussion:

I believe the success of deep versus shallow networks depends not on the theoretical basis but on the way networks learn in practice. If we think of networks as representation machines that build coarse concepts out of finer details, then learning a face without knowing what an eye is does not seem feasible. Due to the one-way information flow of convolutional networks, this hierarchy of concepts remains and prevents shallow architectures from learning as well as deep ones.

Then how can we train shallow networks comparable to deep ones, given that we have such theoretical justifications? I believe one way is to add intra-layer connections, which connect each unit of a layer to other units of the same layer. These might be recurrent connections or just plain lateral connections that give shallow networks the chance to learn higher abstractions.

Convolution is also obviously necessary. Although we learn each filter from the whole input, each filter is still receptive to particular local commonalities. This is not doable with fully connected layers, since they learn from the whole spatial range of the input.

# ParseNet: Looking Wider to See Better

In this work, they pose two related problems and come up with a simple but functional solution. The problems are:

1. Learning object locations in the image with a Proposal + Classification approach is very tiresome, since it needs to classify >1000 patches per image. Therefore, end-to-end pixel-wise segmentation is a better solution, as proposed by FCN (Long et al. 2014).
2. FCN overlooks contextual information, since it predicts the class of each pixel independently. Therefore, even if the thing in the image is a cat, there might be unrelated predictions for different pixels. This can be addressed by applying a Conditional Random Field (CRF) on top of FCN, which is a way to consider context by using pixel relations. Nevertheless, this is still not a method that can be learned end-to-end, since the CRF needs an additional learning stage after FCN.

Based on these two problems they propose the ParseNet architecture. It captures contextual information by looking at each channel's feature map and aggregating the activation values. These aggregations are then merged and appended to the final features of the network, as depicted below:

Their experiments demonstrate the effectiveness of the additional contextual features. Yet there are two important points to consider before using these features together. Due to the scale differences between the activations of different layers, one needs to normalize per layer first and then append them together. They L2-normalize each layer's features. However, this results in very small feature values, which also slows down learning. As a cure, they learn a scale parameter for each feature, as used by the Batch Normalization method, so that they first normalize and then scale the values with scaling weights learned from the data.
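A compact NumPy sketch of the context idea as I understand it: global average pooling per channel, broadcasting the pooled vector back over the spatial grid, L2 normalization of both feature sets, a per-channel scale, and concatenation. The function names, the (C, H, W) layout, and passing the scales as plain arrays (in a real network they would be learned parameters) are assumptions for illustration.

```python
import numpy as np

def l2_normalize(feat, eps=1e-12):
    """Give each spatial position's channel vector unit L2 norm. feat: (C, H, W)."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    return feat / norm

def parsenet_context(feat, scale_local, scale_global):
    """Append a normalized, rescaled global context vector to every position.

    scale_local, scale_global: per-channel scales of shape (C,).
    Returns an array of shape (2C, H, W).
    """
    c, h, w = feat.shape
    pooled = feat.mean(axis=(1, 2), keepdims=True)        # global average pooling: (C, 1, 1)
    context = np.broadcast_to(pooled, (c, h, w))           # "unpool" by repeating over H x W
    local_n = l2_normalize(feat) * scale_local[:, None, None]
    context_n = l2_normalize(context) * scale_global[:, None, None]
    return np.concatenate([local_n, context_n], axis=0)
```

Without the scale factors, every position's feature vector would have norm 1, which is the "very small feature values" problem mentioned above.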

The takeaway from this paper, for me, is that adding intermediate layer features improves the results given a correct normalization framework, and as we add more layers, the network becomes more robust to local changes thanks to the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose; the VGG net does not use Batch Normalization. Therefore, using Batch Normalization from the start might remove the need for the additional scale parameters, and maybe even for the L2 normalization of the aggregated features. This is because Batch Normalization already scales and shifts the feature values into a common range.

Note: this article was written hastily; sorry for any inconvenience, mistakes, or badly written sentences.

# Think "Turing Test" in another way ?

After some crawling on the Internet, I stumbled upon this thread on Quora. For the lazy ones, the thread is about the things that humans can do but computers will still not be able to do after N years. There are many references to the Turing Test in the answers, stating that the best AI is still not able to pass the Turing Test; therefore we do not need to worry about AI being an existential threat to humanity. First off, I ought to say that I am on the cautious side (like Elon Musk and Bill Gates) on AI being a threat. To explain myself, I would like to show that AI is a threat that has already begun to affect us, even if we take the Turing Test as the validation method. We only need to think about the test in a different way.

For those who don't know what the Turing Test is: A and B (one machine, one human) are hidden from the human observer C. Looking at the interaction between A and B, the observer C tries to decide which one is the human and which one is the machine. If observer C cannot decide whether there is a machine or a human behind the curtain, then the machine passes the test. The conclusion is that the machine exhibits intelligent behavior equivalent to, or indistinguishable from, that of a human.

From the definition, it is one of the legitimate milestones for AI on the way to human-capable agents. Therefore, it is normal for people to use the Turing Test to evaluate present AI and to define its state and future potential.

I think of a different formulation of the Turing Test, where we replace the observer C with a machine as well. Then the remaining question turns out to be: is the machine C able to identify the machine A, or is this identification even necessary anymore? Thinking of the formulation in that way answers the AI supporters who say AI is not a threat since it does not and will not be able to pass the Turing Test (at least in the short run). Once we replace C with a machine, the machine does not need to pass the Turing Test to be a threat, right? Because we are out of the picture, like poor B depicted in the above figure.

Now let me explain what it means in practice to replace the human observer with a machine. I believe real-life "communication" is a good way to illustrate the Turing Test. Think about the history of communication. We started with barefoot messengers and have come to the light-speed flow of today's world. Back then, we would send a message and wait very long for the response. The reason was that the tools were the bottleneck of communication. First we sped up these tools and came up with new technologies. If we look at today, we see that our tools are so fast that we are now the bottleneck of the flow. We send our mails and messages in a second, which floods the inboxes and message stacks and consequently overwhelms us as well. If we also accept that communication is the backbone of today's business world, companies do not want to waste time (time is money) and attempt to replace the slowest part with faster alternatives, so computerized solutions come on stage in place of old-fashioned human solutions. Now, after we changed the tools of communication, we have also started to change the sides of the communication, up to a point where there is no need for any human being. We even have a fancy name for this: the Internet of "Things" (not humans any more). If you look at the statistics, you see that a huge portion of the data flow is machine-to-machine communication. Could you say that, at this greater level of communication revolution, the indistinguishability of a computer agent to a human observer is important? It is clear that our AI agents can still devastate our lives without passing the Turing Test. You can watch unemployment rates against the growth of technological solutions.

Basically, what I am trying to say here is: yes, the Turing Test is a barrier for a Sci-Fi level AI threat, but we have changed the rules of the test by placing machines on both sides of the curtain. That means there is no place in that test (or even in real life) for a human, unless some silly machine cannot replace you yet, but be sure that it is yet to come.

A final word: I am an AI guy, and of course I am not saying we should stop, but it is an ominously proceeding field. The punch line here is to underline the need for introspection about AI and related technologies, and for finding ways to make AI serve human needs, not the contrary or any other way. We should be skeptical and be warned.

# Harnessing Deep Neural Networks with Logic Rules

This work proposes a way to integrate first-order logic rules with neural network structures. It makes it possible to combine expert knowledge with workhorse deep neural networks. To be more specific, given a sentiment analysis problem, you know that if there is a "but" in the sentence, the sentiment changes direction along the sentence. Such rules are harnessed with the network.

The method combines two precursor ideas: knowledge distillation [Hinton et al. 2015] and posterior regularization [Ganchev et al. 2010]. We have a teacher network and a student network, and they learn simultaneously. The student network uses the labelled data directly and learns the model distribution P; then, given the logic rules, the teacher network adapts a distribution Q, keeping it close to P but within the constraints of the given logic rules. That projects what is inside P onto the distribution Q bounded by the logic rules, as the figure below suggests.

I don't want to go deep into the math, since my main purpose is to give the intuition rather than the formulation. In short, the first-order logic rules are given a mathematical formulation that is suitable for use in a loss function. Then the student loss is defined as the real network loss (cross-entropy) plus the loss of the logic rules, with an importance weight.
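For reference, the student objective can be written roughly as follows. This is my reconstruction from memory of the paper, so the notation ($\pi$ for the importance weight, $s_n$ for the teacher's soft prediction on $x_n$, $\sigma_\theta$ for the network's softmax output, $\ell$ for the cross-entropy loss) is an assumption and should be checked against the original:

$$
\theta^{(t+1)} = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} (1 - \pi)\, \ell\big(y_n, \sigma_\theta(x_n)\big) + \pi\, \ell\big(s_n^{(t)}, \sigma_\theta(x_n)\big)
$$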

$\theta$ is the student model weights; the first part of the loss is the network loss and the second part is the logic loss. This function distills the information adapted by the given rules into the student network.

The teacher network uses the KL divergence to find the best Q that is close to P, with slack variables for the rule constraints.

Since the problem is convex, the solution can be found through its dual form, which has a closed-form solution as below.
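To give a feel for the closed form, here is a NumPy sketch under my recollection that the teacher distribution is proportional to the student distribution times an exponential penalty on rule violations, that is q(y|x) ∝ p(y|x) exp(-C Σ_l λ_l (1 - r_l(x, y))). The array layout, the function names, and the constants `C` and `pi` are assumptions for illustration and not the paper's code.

```python
import numpy as np

def teacher_projection(p, rule_values, rule_weights, C=1.0):
    """Project student probabilities onto the rule-regularized teacher distribution.

    p: (N, K) student probabilities.
    rule_values: (L, N, K) soft truth value in [0, 1] of each rule for each label.
    rule_weights: (L,) confidence of each rule.
    """
    penalty = np.tensordot(rule_weights, 1.0 - rule_values, axes=1)   # (N, K)
    q = p * np.exp(-C * penalty)
    return q / q.sum(axis=1, keepdims=True)

def student_loss(p, y_onehot, q, pi=0.5, eps=1e-12):
    """(1 - pi) * cross-entropy to the true labels + pi * cross-entropy to the teacher."""
    log_p = np.log(p + eps)
    ce_true = -(y_onehot * log_p).sum(axis=1)
    ce_teacher = -(q * log_p).sum(axis=1)
    return np.mean((1.0 - pi) * ce_true + pi * ce_teacher)
```

When every label fully satisfies every rule, the penalty is zero and the teacher simply equals the student, so the rules only intervene where the student's beliefs conflict with them.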

So the whole algorithm is as follows:

For the experiments and use cases of this algorithm, please refer to the paper. They show promising results on sentiment classification with convolutional networks by defining such BUT rules for the network.

My takeaway is that it is perfectly possible to use expert knowledge with wild deep networks. I guess the recent trend in deep learning shows the same promise. It seems like our wild networks are on their way to becoming an efficient learning and inference tool for large probabilistic graphical models, together with variational methods and such rule-imposing methods. Still, such expert knowledge is tenuous in the domain of image recognition problems.

Disclaimer: this was written hastily without any review, so it is far from complete, but it targets the intuition of the work to make it memorable for later use.