# Paper Notes: The Shattered Gradients Problem ...

The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.

The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.

White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.

For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order $(/2^L$ where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.

One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).

Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.

The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It  is scaling residual connections by a constant value before summing up with the current layer activation.

#### A small discussion

This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.

Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here,  increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.

Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂

# Dilated Convolution

In simple terms, dilated convolution is just a convolution applied to input with defined gaps. With this definitions, given our input is an 2D image, dilation rate k=1 is normal convolution and k=2 means skipping one pixel per input and k=4 means skipping 3 pixels. The best to see the figures below with the same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is 3x3 in this example, and greed area is the receptive field captured by each of these inputs. Receptive field is the implicit area captured on the initial input by each input (unit) to the next layer .

Dilated convolution is a way of increasing receptive view (global view) of the network exponentially and linear parameter accretion. With this purpose, it finds usage in applications cares more about integrating knowledge of the wider context with less cost.

One general use is image segmentation where each pixel is labelled by its corresponding class. In this case, the network output needs to be in the same size of the input image. Straight forward way to do is to apply convolution then add deconvolution layers to upsample[1]. However, it introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolutions high and it avoids the need of upsampling [2][3].

Dilated convolution is applied in domains beside vision as well. One good example is WaveNet[4] text-to-speech solution and ByteNet learn time text translation. They both use dilated convolution in order to capture global view of the input with less parameters.

In short, dilated convolution is a simple but effective idea and you might consider it in two cases;

1. Detection of fine-details by processing inputs in higher resolutions.
2. Broader view of the input to capture more contextual information.
3. Faster run-time with less parameters

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2]Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3]Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5]Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099

# Selfai: A Method for Understanding Beauty in Selfies

Selfies are everywhere. With different fun masks, poses and filters,  it goes crazy.  When we coincide with any of these selfies, we automatically give an intuitive score regarding the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes but they are not fully prescribed.

With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them in accordance to its quality and beauty.  The idea was to see whether it is possible to mimic that bizarre perceptual understanding of human with the recent advancements of AI. And if it is, then let's make an application and let people use it for whatever purpose.  For now, we only have an Instagram bot @selfai_robot. You can check before reading.

# Face Detection by Literature

Please ping me if you know something more.

Multi-view Face Detection Using Deep Convolutional Neural Network

1. Train face classifier with face (> 0.5 overlap) and background (<0.5 overlap) images.
2.  Compute heatmap over test image scaled to different sizes with sliding window
3.  Apply NMS .
4.  Computation intensive, especially for CPU.
•  http://arxiv.org/abs/1502.02766

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Keywords: object proposals, facial parts,  more annotation.

1. Use facial part annotations
2. Bottom up to detect face from facial parts.
3. "Faceness-Net’s pipeline consists of three stages,i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection."
4. Train part based classifiers based on attributes related to different parts of the face i.e. for hair part train ImageNet pre-trained network for color classification.
5. Very robust to occlusion and background clutter.
6. To much annotation effort.
7. Still object proposals (DL community should skip proposal approach. It complicate the problem by creating a new domain of problem :)) ).
• http://arxiv.org/abs/1509.06451

Supervised Transformer Network for Efficient Face Detection

• http://home.ustc.edu.cn/~chendong/STN_Detector/stn_detector.pdf

UnitBox: An Advanced Object Detection Network

• http://arxiv.org/abs/1608.02236

Deep Convolutional Network Cascade for Facial Point Detection

• http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf
• http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
• https://github.com/luoyetx/deep-landmark

WIDER FACE: A Face Detection Benchmark

A novel cascade detection method being a state of art at WIDER FACE

1. Train separate CNNs for small range of scales.
2. Each detector has two stages; Region Proposal Network + Detection Network
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/paper.pdf

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

1.  Similar to YOLO .
2.  Image pyramid of input
3.  Feed to network
4. Upsample feature maps after a layer.
5. Predict classification score and bbox location per pixel on upsampled feature map.
6. NMS to bbox locations.
7. SoA at MALF face dataset
• http://arxiv.org/pdf/1509.04874v3.pdf
• http://www.cbsr.ia.ac.cn/faceevaluation/results.html

Face Detection without Bells and Whistles

Keywords: no NN, DPM, Channel Features

1. ECCV 2014
2. Very high quality detections
3. Very slow on CPU and acceptable on GPU
• https://bitbucket.org/rodrigob/doppia/
• http://rodrigob.github.io/documents/2014_eccv_face_detection_with_supplementary_material.pdf

# Paper review: Dynamic Capacity Networks

Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.

In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.

Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.

Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.

We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.

At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T.  Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.

I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.

Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.

# How many training samples we observe over life time ?

In this post, I like to compute what number of visual instances we observes over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.

Let's dive into the computation. Relying on [1],  average person can see the world with 45 fps on average. It goes to extremes for such people like fighter pilots which is 225fps with the adrenaline kicked in.  I took the average life time 71 years [3] equals to $2239056000$ (2 .24 billion) secs and we are awake almost $2/3$ of  it which makes $1492704000$ (1.49 billion) secs .  Then we assume that on average there are $86*10^9$ neurons in our brain [2]. This is our model size.

Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from  $1492704000 * 45 = 67171680000$  almost 67 billion images.

Of course this is not a convenient way to come with this numbers but fun comes by ignorance 🙂

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

# ParseNet: Looking Wider to See Better

In this work, they propose two related problems and comes with a simple but functional solution to this. the problems are;

1. Learning object location on the image with Proposal + Classification approach is very tiresome since it needs to classify >1000 patched per image. Therefore, use of end to end pixel-wise segmentation is a better solution as proposed by FCN (Long et al. 2014).
2. FCN oversees the contextual information since it predicts the objects of each pixel independently. Therefore, even the thing on the image is Cat, there might be unrelated predictions for different pixels. They solve this by applying Conditional Random Field (CRF) on top of FCN. This is a way to consider context by using pixel relations.  Nevertheless, this is still not a method that is able to learn end-to-end since CRF needs additional learning stage after FCN.

Based on these two problems they provide ParseNet architecture. It declares contextual information by looking each channel feature map and aggregating the activations values.  These aggregations then merged to be appended to final features of the network as depicted below;

Their experiments construes the effectiveness of the additional contextual features.  Yet there are two important points to consider before using these features together. Due to the scale differences of each layer activations, one needs to normalize first per layer then append them together.  They L2 normalize each layer's feature. However, this results very small feature values which also hinder the network to learn in a fast manner.  As a cure to this, they learn scale parameters to each feature as used by the Batch Normalization method so that they first normalize and scale the values with scaling weights learned from the data.

The takeaway from this paper,  for myself, adding intermediate layer features improves the results with a correct normalization framework and as we add more layers, network is more robust to local changes by the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose, VGG net does not use Batch Normalization. Therefore, use of Batch Normalization from the start might evades the need of additional scale parameters even maybe the L2 normalization of aggregated features. This is because, Batch Normalization already scales and shifts the feature values into a common norm.

Note: this is a hasty used article sorry for any inconvenience or mistake or stupidly written sentences.

# FAME: Face Association by Model Evolution - CVPR presentation.

My work to learn  people's face by a simple query to a image search engine. Just search the name at Google, prune the irrelevant images iteratively and train a final classifier.

# Recent Advances in Deep Learning

In this text, I would like to talk about some of the recent advances of Deep Learning models by no means complete. (Click heading for the reference)

1. Parametric Rectifier Linear Unit (PReLU)
• The idea is to allow negative activation in well-known ReLU units by controlling it with a learnable parameter. In other words, you learn how much negative activationsyou need for each unit to discriminate classes. In the work, it is proposed that PReLU unit is very useful for especially very deep models that lacks for gradient propagation to initial layers due to its depth. What is different is PReLU allows more gradient return by allowing negative activation.
2. A new initialization method (MSRA for Caffe users)
• Xavier initialization was proposed by Bengio's team and it considers number of fan-in and fan-out to a certain unit to define the initial weights.  However, the work says that Xavier method and its alternations considers linear activation functions for the formulation of the method. Hence, they propose some changes related to ReLU activation that they empirically proved its effect in practice with better convergence rate.
3. Batch Normalization
• This work serves data normalization as a structural part of the model. They say that the distribution of the training data changes as the model evolves and it priorities the initialization scheme and the learning schedule we use for the learning. Each mini-batch of the data is normalized with the described scheme just before its propagation through the network and it allows faster convergence  with larger learning rates and robust models to initialization scheme that we choose.  Each mini-batch is normalized by its mean and variance, then it is scaled and shifted by a learned coefficient and residual.

4. Inception Layers
• This is one of the ingredients of last year's ImageNet winner GoogleNet. The trick is to use multi-scale filters all together in a layer and concatenating their responses for the next layer. In that way we are able to learn difference covariances per each layer by different sizes and structures.