# Paper review: CONVERGENT LEARNING: DO DIFFERENT NEURAL NETWORKS LEARN THE SAME REPRESENTATIONS?

paper: http://arxiv.org/pdf/1511.07543v3.pdf

code : https://github.com/yixuanli/convergent_learning

This paper is an interesting work which tries to explain the similarities and differences between representations learned by different networks with the same architecture.

Within the scope of their experiments, they train 4 different AlexNets and compare the units of these networks by correlation and mutual information analysis.

They ask the following questions:

- Can we find a one-to-one matching of units between networks, showing that these units are sensitive to similar or the same commonalities in the image?
- Does the one-to-one matching stay the same under different similarity measures? They first use correlation and then mutual information to confirm the findings.
- Is a representation learned by one network a rotated version of the other's, to the extent that one-to-one matching is not possible between networks?
- Is clustering plausible for grouping units in different networks?

The answers to these questions are as follows:

- It is possible to find well-matched units with very high correlation values, but some units learn unique representations that are not replicated by the other networks. The degree of representational divergence between networks grows with depth. Hence, we see large correlations at the conv1 layers; the value decreases toward conv5 and reaches its minimum at the conv4 layer.
- They first analyze layers by the correlation values among units. Then they measure the overlap with mutual information, and the results confirm each other.
- To see the differences between learned representations, they use a very smart trick. They approximate the representation learned by a layer of one network using the same layer of another network. A sparse approximation is performed using LASSO. The results indicate that some units are approximated well by 1 or 2 units of the other network, but the remaining units require almost 4 counterpart units for a good approximation. This shows that units with a good one-to-one matching have learned local codes, while the other units have slightly distributed codes approximated by multiple counterpart units.
- They also run a hierarchical clustering in order to group similar units successfully.
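The correlation-based matching can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions: the function names `unit_correlation` and `best_match` are hypothetical, the activations are random toy data rather than AlexNet features, and taking the arg-max per unit is a simple stand-in that may assign several units of one network to the same unit of the other.

```python
import numpy as np

def unit_correlation(acts_a, acts_b):
    """Pearson correlation between every unit of net A and every unit of net B.

    acts_a: (n_samples, n_units_a), acts_b: (n_samples, n_units_b),
    both recorded on the same inputs.
    """
    za = (acts_a - acts_a.mean(axis=0)) / acts_a.std(axis=0)
    zb = (acts_b - acts_b.mean(axis=0)) / acts_b.std(axis=0)
    return za.T @ zb / acts_a.shape[0]

def best_match(corr):
    """For each unit of net A, the index of the most correlated unit of net B."""
    return corr.argmax(axis=1)

# Toy data: net B holds the units of net A in shuffled order, plus a bit of noise.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((500, 6))
perm = rng.permutation(6)
acts_b = acts_a[:, perm] + 0.01 * rng.standard_normal((500, 6))

corr = unit_correlation(acts_a, acts_b)
match = best_match(corr)  # recovers the inverse of the shuffle
```

On real networks the interesting part is the units whose best correlation stays low, since those are the candidates for representations unique to one network.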

For details please refer to the paper.

**My discussion:** We see that different networks learn similar representations with some level of accompanying uniqueness. After this paper, it would be intriguing to see whether these unique representations are what cause performance differences between networks, and whether their effect is beneficial or harmful. Additionally, we might combine these differences at the end to improve network performance with some set of smart tricks.

One deficit of the paper is that they do not experiment with deeper networks, which are the real deal of the time. As we see from the results, as the layers go deeper, different abstractions are extracted by different networks. I believe this is more pronounced in deeper architectures of the Inception or VGG kind.

Another curious thing would be to study Residual networks. The intuition behind Residual networks is to pass the already learned representation to the upper layers and to add more through the residual channel if something useful is learned by the next layer. That idea suggests that two residual networks might be more similar than two Inception networks. Moreover, we can compare different layers inside a single Residual network to see up to what level the representation stays the same.

# Paper review: ALL YOU NEED IS A GOOD INIT

paper: http://arxiv.org/abs/1511.06422

code: https://github.com/yobibyte/yobiblog/blob/master/posts/all-you-need-is-a-good-init.md

This work proposes yet another way to initialize your network, namely LSUV (Layer-sequential unit-variance), targeting especially deep networks. The idea relies on the recently proposed orthogonal initialization, followed by fine-tuning the weights with data so that each layer output has a variance of 1.

The scheme follows these steps:

- Initialize the weights with a unit-variance Gaussian
- Find the components of these weights using SVD
- Replace the weights with these components
- Using minibatches of data, rescale the weights so that each layer output has a variance of 1. The paper describes this iterative procedure in pseudocode.

To describe the code in words: for each iteration we feed a new mini-batch and compute the output variance of the layer. We compare the computed variance with the target variance of 1, up to a threshold we define. As long as the number of iterations is below the maximum number of iterations and the difference is above the threshold, we rescale the layer weights by dividing them by the square root of the minibatch variance. After initializing this layer, we go on to the next layer.
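The procedure can be sketched as follows. This is a minimal NumPy sketch under my own assumptions, not the authors' implementation: it uses a plain stack of fully connected layers with ReLU, and the names `orthonormal`, `forward` and `lsuv_init`, as well as the defaults `tol=0.01` and `max_iters=10`, are hypothetical choices for illustration.

```python
import numpy as np

def orthonormal(fan_in, fan_out, rng):
    """Gaussian matrix replaced by its orthonormal SVD component."""
    w = rng.standard_normal((fan_in, fan_out))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    # Pick whichever factor has the shape of the weight matrix.
    return u if u.shape == w.shape else vt

def forward(x, weights):
    """Pass x through the layers that are already initialized (ReLU assumed)."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def lsuv_init(layer_sizes, batches, tol=0.01, max_iters=10, seed=0):
    """Initialize layer by layer, rescaling each to unit output variance."""
    rng = np.random.default_rng(seed)
    weights = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = orthonormal(fan_in, fan_out, rng)
        for it in range(max_iters):
            h = forward(batches[it % len(batches)], weights)
            var = (h @ w).var()          # output variance on this mini-batch
            if abs(var - 1.0) < tol:     # close enough to the target variance
                break
            w = w / np.sqrt(var)         # divide by the standard deviation
        weights.append(w)
    return weights
```

Because a linear layer's output variance scales with the square of its weights, a single rescaling already hits the target on the same batch; the extra iterations only matter when the batches differ.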

In essence, here is what this method does. First, we start with a standard Gaussian initialization, which we know is not enough for deep networks. The orthogonalization stage decorrelates the weights so that each unit of the layer starts learning from a distinctly different point in the space. At the final stage, the LSUV iterations rescale the weights and keep the forward and backward propagated signals close to a useful variance, guarding against the vanishing or exploding gradient problem, similar to Batch Normalization but without its computational load. Nevertheless, as the authors also point out, LSUV is not interchangeable with BN, especially for large datasets like ImageNet. Still, I'd like to see a comparison of LSUV vs BN, but it is not done or not written in the paper (Edit by the Author: Figure 3 in the paper has a CIFAR comparison of BN and LSUV, and ImageNet results are posted at https://github.com/ducha-aiki/caffenet-benchmark).

The good side of this method is that it works, at least in my experiments on ImageNet with different architectures. It is also not too much of a hurdle to code if you already have orthogonal initialization at hand. Even if you don't have it, you can start with a Gaussian initialization scheme, skip the orthogonalization stage and directly use the LSUV iterations. It still works, with a slight decrease in performance.

# Setting individual learning rate per layer in Torch

# Paper review: Dynamic Capacity Networks

Paper: http://arxiv.org/pdf/1511.07838v7.pdf

Decompose the network structure into two networks F and G, keeping a set of top layers T at the end. F is a small network and G is a more advanced one. Thus F is cheap to execute but has lower performance compared to G.

In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.

Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.

Given an input image x, the coarse network F is applied and coarse representations of different regions of the input are computed. These coarse representations are propagated to the top layers T, and T computes the final output of the network, which is the class predictions. An entropy measure is used to see how each coarse representation affects the model's uncertainty: if a region is salient, we expect a large change in the uncertainty with respect to its representation.

We select the top k input regions as salient, guided by the computed entropy changes, and then these regions are given to the fine network G to obtain finer representations. Eventually, we merge all the coarse and fine representations, give them to the top layers T again, and get the final predictions.
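To make the saliency step concrete, here is a small NumPy sketch under heavy assumptions of mine. The top layers T are replaced by a toy model (tanh, average over regions, one linear classifier `W`), the gradient of the entropy is taken by finite differences instead of backpropagation, and all function names are hypothetical. It only shows the idea of ranking regions by how strongly their coarse representation moves the prediction entropy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prediction_entropy(region_feats, W):
    """Entropy of the class prediction of a toy stand-in for the top layers T."""
    pooled = np.tanh(region_feats).mean(axis=0)   # merge the region features
    p = softmax(W @ pooled)
    return -np.sum(p * np.log(p + 1e-12))

def region_saliency(region_feats, W, eps=1e-5):
    """Norm of d(entropy)/d(region representation), by central differences."""
    grad = np.zeros_like(region_feats)
    for idx in np.ndindex(*region_feats.shape):
        plus, minus = region_feats.copy(), region_feats.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (prediction_entropy(plus, W)
                     - prediction_entropy(minus, W)) / (2 * eps)
    return np.linalg.norm(grad, axis=1)           # one score per region

def top_k_regions(saliency, k):
    """Indices of the k most salient regions, to be passed to the fine network G."""
    return np.argsort(-saliency)[:k]
```

With `region_feats` of shape `(n_regions, feat_dim)`, `top_k_regions(region_saliency(region_feats, W), k)` returns the regions whose fine representations would be computed next.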

At training time, all networks and layers are trained simultaneously. However, one might still decide to train the networks F and G separately using the same top layers T. The authors posit that simultaneous training is useful to keep the fine and coarse representations similar, so that the final layers T do not struggle too much to learn from two different representation distributions.

I only try to give an overview of the idea here; if you would like to see more detail and delve into the formulas, please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited to small datasets and small spatial dimensions. I would really like to see whether it is also useful for large problems like ImageNet or even larger.

Another caveat is that the datasets used for the experiments are not very cluttered. Therefore, it is easy to detect salient regions, even with some algorithmic techniques. Thus, the value of this method for real-life problems is still unclear to me.

# Paper Review: Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)?

There is a theoretical proof that a network with one hidden layer and a sufficient number of sigmoid units is able to learn any decision boundary. Empirical practice, however, tells us that learning good data representations demands deeper networks, like last year's ImageNet winner ResNet.

There are two important findings of this work. The first is that we need convolution, at least for image recognition problems, and the second is that deeper is always better. Their results are decisive even on a small dataset like CIFAR-10.

They also give a nice little paragraph explaining a good way to build the best possible shallow networks based on deep teachers.

- train state-of-the-art deep models

- form an ensemble by the best subset

- collect the predictions on a large enough transfer set

- distill the teacher ensemble knowledge to shallow network.

(If you would like to see more about how to apply the teacher-student paradigm successfully, refer to the paper. It gives a very comprehensive set of instructions.)
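The ensemble and distillation steps can be sketched as below. This is a toy NumPy illustration under my own assumptions: the student is a plain linear model, it is trained with a squared error on the averaged teacher logits (one common form of distillation, not necessarily the exact recipe of the paper), and the function names and hyperparameters are hypothetical.

```python
import numpy as np

def ensemble_targets(teacher_logits):
    """Average the logits of the teachers on the transfer set.

    teacher_logits: (n_teachers, n_samples, n_classes)
    """
    return np.mean(teacher_logits, axis=0)

def distill_linear_student(x, targets, lr=0.5, steps=100):
    """Fit a linear student to the ensemble logits by gradient descent."""
    n, c = targets.shape
    w = np.zeros((x.shape[1], c))
    losses = []
    for _ in range(steps):
        err = x @ w - targets                  # student logits minus teacher logits
        losses.append(float(np.mean(err ** 2)))
        grad = 2.0 * (x.T @ err) / (n * c)     # gradient of the mean squared error
        w -= lr * grad
    return w, losses
```

The transfer set does not need labels here, since the only supervision the student sees is the teacher output.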

Still, as the experimental results also show, the best possible shallow network falls behind its deep counterpart.

**My Discussion:**

I believe the success of deep versus shallow networks depends not on the theoretical basis but on the way networks learn in practice. If we think of networks as representation machines that add finer details to coarse concepts, then learning a face without knowing what an eye is does not seem feasible. Due to the one-way information flow of convolutional networks, this hierarchy of concepts persists and prevents shallow architectures from learning as well as deep ones.

Then how can we train shallow networks comparable to deep ones, given that we have such theoretical justifications? I believe one way is to add intra-layer connections, which connect each unit of a layer to the other units of that layer. These might be recursive connections or just lateral connections that give shallow networks the chance of learning higher abstractions.

Convolution is also obviously necessary. Although we learn each filter from the whole input, each filter is still receptive to particular local commonalities. This is not doable with fully connected layers, since they learn from the whole spatial range of the input.

# Fighting against class imbalance in a supervised ML problem.

ML on imbalanced data

Given an imbalanced learning problem with a large class and a small class, with N and M instances respectively:

- Cluster the larger class into M clusters and use the cluster centers for training the model.
- If it is a neural network or some compatible model, cluster the large class into K clusters and use these clusters as pseudo classes to train your model. This method is also useful when training your network with a small number of classes. It pushes your net to learn fine-detailed representations.
- Divide the large class into subsets with M instances each, then train multiple classifiers and use the ensemble.
- Hard-mining is a solution which is unfortunately prone to over-fitting but yields good results in some particular cases such as object detection. The idea is to select the most confusing instances from the large set per iteration. Thus, select the M most confusing instances from the large class, use them for that iteration and repeat for the next iteration.
- Especially for batch learning, frequency-based batch sampling might be useful. For each batch you can sample instances from the small class with probability N/(M+N) and from the large class with probability M/(M+N), so that you prioritize the small class instances for the next batch. Since you apply data augmentation techniques as in CNN models, repeating instances of the small class is mostly not a big problem.
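The batch sampling idea from the last item can be sketched like this. It is a minimal NumPy sketch under an assumption of mine: the small class is drawn with the inverse-frequency probability N/(M+N), so that it is the one being prioritized, and the function name `sample_batch` and its `p_small` override are hypothetical.

```python
import numpy as np

def sample_batch(large_idx, small_idx, batch_size, rng, p_small=None):
    """Draw instance indices for one batch, favoring the small class.

    large_idx, small_idx: arrays of instance indices of each class.
    p_small: probability of drawing from the small class; defaults to the
    inverse-frequency value N / (M + N), which is an assumption of this sketch.
    """
    n, m = len(large_idx), len(small_idx)
    if p_small is None:
        p_small = n / (n + m)
    from_small = rng.random(batch_size) < p_small
    # Sampling is with replacement, so small-class instances repeat across batches.
    small_draw = rng.choice(small_idx, size=batch_size)
    large_draw = rng.choice(large_idx, size=batch_size)
    return np.where(from_small, small_draw, large_draw)
```

With N = 900 and M = 100, roughly 90% of each batch then comes from the small class; passing `p_small=0.5` gives class-balanced batches instead.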

A note on metrics: plain accuracy is not a good measure for such problems, since you see very high accuracy if your model just predicts the larger class for all the instances. Instead, prefer the ROC curve or keep watching precision and recall.
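A tiny example shows why accuracy misleads here. The data is made up (95 instances of the large class labeled 0, 5 of the small class labeled 1) and the helper `binary_metrics` is a hypothetical name; precision and recall are defined as 0 when their denominator is empty.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall, treating label 1 as the small class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [0] * 95 + [1] * 5
always_large_class = [0] * 100   # a "model" that ignores the small class

accuracy, precision, recall = binary_metrics(y_true, always_large_class)
# accuracy is 0.95 while precision and recall on the small class are both 0.0
```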

Please keep me updated if you know something more. Even though this is a very common issue in practice, it is still hard to find a working solution.

# How many training samples do we observe over a lifetime?

In this post, I would like to compute the number of visual instances we observe over time, with the assumption that we visually perceive life as a constant video at a certain fps rate.

Let's dive into the computation. Relying on [1], an average person sees the world at 45 fps. It goes to extremes for people like fighter pilots, reaching 225 fps with the adrenaline kicked in. I took the average lifetime to be 71 years [3], which equals about 2.24 billion seconds, and we are awake for almost two thirds of it, which makes about 1.49 billion seconds. Then we assume that on average there are 86 billion neurons in our brain [2]. This is our model size.

Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from almost **67 billion images**.
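The arithmetic can be reproduced in a few lines. The only assumptions I add are a year of 365.25 days and an awake fraction of two thirds, which is the value that recovers the 1.49 billion seconds quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed length of a year

lifetime_s = 71 * SECONDS_PER_YEAR      # about 2.24 billion seconds
awake_s = lifetime_s * 2 / 3            # about 1.49 billion seconds awake
frames = awake_s * 45                   # about 67 billion frames at 45 fps

print(f"{lifetime_s / 1e9:.2f}B s, {awake_s / 1e9:.2f}B s awake, {frames / 1e9:.0f}B frames")
```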

Of course this is not a rigorous way to come up with these numbers, but fun comes by ignorance 🙂

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

# Methods used by us as Qualcomm Research at ImageNet 2015

# ParseNet: Looking Wider to See Better

**paper**: http://arxiv.org/pdf/1506.04579v2.pdf

**code** : https://gist.github.com/shelhamer/80667189b218ad570e82

In this work, they point out two related problems and come up with a simple but functional solution to them. The problems are:

- Learning object locations in the image with the Proposal + Classification approach is very tiresome since it needs to classify >1000 patches per image. Therefore, end-to-end pixel-wise segmentation is a better solution, as proposed by FCN (Long et al. 2014).
- FCN overlooks the contextual information since it predicts the label of each pixel independently. Therefore, even if the thing in the image is a cat, there might be unrelated predictions for different pixels. One solution is to apply a Conditional Random Field (CRF) on top of FCN. This is a way to consider context by using pixel relations. Nevertheless, this is still not a method that can be learned end-to-end, since the CRF needs an additional learning stage after FCN.

Based on these two problems, they propose the ParseNet architecture. It captures contextual information by looking at each channel's feature map and aggregating the activation values. These aggregations are then merged and appended to the final features of the network, as the paper's architecture figure depicts.

Their experiments demonstrate the effectiveness of the additional contextual features. Yet there are two important points to consider before using these features together. Due to the scale differences between the activations of each layer, one needs to normalize per layer first and then append them together. They L2-normalize each layer's features. However, this results in very small feature values, which also hinders the network from learning quickly. As a cure for this, they learn a scale parameter for each feature, as in the Batch Normalization method, so that they first normalize and then scale the values with scaling weights learned from the data.
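A rough NumPy sketch of the context feature with normalization and scaling, as I understand it, is given below. It makes several assumptions: a single `(C, H, W)` feature map, global average pooling as the aggregation, scale vectors passed in as plain arrays instead of being learned, and hypothetical function names.

```python
import numpy as np

def l2_normalize(feat, eps=1e-12):
    """L2-normalize a (C, H, W) feature map across channels at every location."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    return feat / norm

def parsenet_features(feat, scale_local, scale_global):
    """Append a normalized, scaled global context feature to the local features.

    feat: (C, H, W) local feature map.
    scale_local, scale_global: (C,) per-channel scales (learned in the paper,
    plain inputs in this sketch).
    """
    # Aggregate each channel over all spatial positions.
    context = feat.mean(axis=(1, 2), keepdims=True)          # (C, 1, 1)
    # Replicate the context vector to every spatial position.
    context_map = np.broadcast_to(context, feat.shape)       # (C, H, W)
    local = l2_normalize(feat) * scale_local[:, None, None]
    glob = l2_normalize(context_map) * scale_global[:, None, None]
    return np.concatenate([local, glob], axis=0)              # (2C, H, W)
```

Calling `parsenet_features(feat, np.ones(C), np.ones(C))` gives both halves unit norm at every location, which illustrates how small the individual values become before any scaling is applied.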

The takeaway from this paper, for me, is that adding intermediate layer features improves the results given a correct normalization framework, and as we add more layers, the network becomes more robust to local changes thanks to the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose; the VGG net does not use Batch Normalization. Therefore, using Batch Normalization from the start might remove the need for the additional scale parameters, and maybe even the L2 normalization of the aggregated features. This is because Batch Normalization already scales and shifts the feature values into a common norm.

**Note**: this is a hastily written article; sorry for any inconvenience, mistakes or stupidly written sentences.