# Random Dilation Networks for Action Recognition in Videos

Lately, we (TwentyBN) took a part in Activity Net trimmed action recognition challenge. The dataset is called Kinetics and recently released. It is a collection of 10 second YouTube videos. Each video has a single label among 400 different action classes. The dataset released by DeepMind with a baseline 61% Top-1 and 81.3% Top-5. For baseline models please refer to their dataset paper. But, it took 2 months for people to briskly hoist the bar high above.

As you might see above, we have the best Top-5 accuracy with 97% which is ~16% improvement on top of the baseline. The average of Top-1 and Top-5 decides the leader-board which places us to 3rd place. Yet, it is a great result for us where we could dabble only 2 weeks with limited juice. Team matters here!! Thx to my mates Raghav Goyal and Valentin Haenel for being great.

Here, I like to succinctly describe our novel network architecture. It has the best single network performance. (We plan to share a more detailed description in a separate Medium soon.) Namely, it is called BesNet due to a cheap cryptographic reason :). BesNet yields 74% Top-1 with only RGB . It is half-size of the baseline network described in the DeepMind paper.

In detail, BesNet is devised on top of ResNet-50 architecture. Distinctly, BesNet performs 3D convolutions that are able to learn both spatiotemporal features. In a better extent, BesNet takes not a single frame, but a set of frames from a video. It convolves pixels between consecutive frames as wells as single frame pixels. Each ResNet-50 module buckled with 1x1 + 3x3 +1x1 filters in order. Each such module followed by a residual connection coming from preceding module. It uses ReLU activation followed by a Batch-Normalization for each layer. In order to convert Resnet-50 to BesNet, we inflate 1x1 filters to 3x1x1 filters and 3x3 filters to 1x3x3 filters where the ordering of the dimensions is sequence x height x width. After convolution layers, an average pooling layer aggregates spatial dimension as in the normal ResNet. Subsequently, a max pooling layer aggregates temporal dimensions. A fully-connected layer used for predictions.

BesNet is initialized with ImageNet weights. In order to convert 2D filter weights to 3D filter weights, we replicate 2D filters along an additional dimension and then normalize the weights by the replication factor. This normalization keeps the activation values stable despite the architectural change. For example, a 1x1 filter is converted to 3x1x1 by copying the 1x1 filter 3 times along the third dimension and weights are divided by 3 at the end.

In BesNet, 3x1x1 filters are responsible for temporal and spatial cross-channel regularities. 1x3x3 filters pay into only spatial properties of individual feature maps. This orientation excites several observations. First off, it decouples temporal and spatial computations. It learns specialized layers for each of the temporal and spatial dimensions. The idea also entertained by the pooling layers. We decomposed spatial and temporal dimension over average and max pooling layers respectively. BesNet reduces the spatial dimensions along the convolutional layers yet it keeps the size of temporal dimension constant. This makes BesNet flexible to handle videos with different number of frames. Hence, given a video with K frames, BesNet keeps the temporal dimension as K until the pooling layers. Thereafter, max pooling layer aggregates K temporal channels into one. In a practical sense, this is easy with a dynamic computational graph library. Pytorch is a bliss here !! (Sorry TF, You're so crusty.)

BesNet has a peculiar use of dilation in 3x1x1 layers which defines the real novel aspect of our architecture. BesNet uses dilation only on temporal dimension and it picks a random dilation factor per 3x1x1 layer for each mini-batch. It sets padding parameters in accordance to keep the temporal dimension unchanged. At the test time, each layer computes outputs for each possible dilation factor, then takes the average of the output feature maps. Random dilation enables the network to learn complex temporal relations. It also regularizes the network in the temporal domain. In practice, it reduces the effect of FPS used for casting videos into frames.

We discuss that for Kinetics, it is important to learn long range relations between frames. Videos are long and they have only a single label. So the network needs to learn the general context of the video. In that sense, small motions that are observed by a normal 3D convolution are not that important. Random dilation pays into this. It augments the contextual temporal window of the network.

Our experiments with only frame futures support our hypothesis here. We extracted frame features with ResNet-50 and train an MLP after pooling the features. It gets 65% accuracy. It is better than DeepMind's baseline network with 3D convolution layers. That shows us contextual information means more than motion learned by 3D layers.

Motion information might be complementary but not the core. It is then verified by the random dilation. BesNet with no dilation results 70% , dilation 2 68% and the random dilation 74% accuracy. This stands to be a simple empirical proof backing our claim here.

Random dilation is really easy to implement with Pytorch. Just take normal Conv class and overwrite its forward pass by randomizing dilation parameter. If you like to try out before we release fell free.

I try to give a very sketchy description of BesNet here by no means complete. Please ping me if you have any question. We plan to study BesNet a little more and share it in the near future in legit formats. We also plan to share a finer description of our challenge approach with some open-source enjoyment.

Please note that BesNet is a work in progress. Anyways, feedbacks are always warmly welcome. Best :).

# RaspberryPi Home Surveillance with only ~150 lines of Python Code.

I owned a Raspberry Pi long ago and it was just sitting in my tech wash box. After watching a Youtube session of creative Raspberry applications, with envy , I decided to try something by myself. The first obvious idea to me was a home security system to inspect your house while you are away.

The final thingy is able to detect and roughly localize any motion through a camera. It takes photos and mails them to your email account. Plus, we are able to interact with it in our local network using a simple web interface so we are able to activate or deactivate it in front of home door. I assume that if someone is able to reach the local wifi network, most probably s/he is one of us (fair enough ?).

# Paper Notes: The Shattered Gradients Problem ...

The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.

The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.

White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.

For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order $(/2^L$ where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.

One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).

Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.

The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It  is scaling residual connections by a constant value before summing up with the current layer activation.

#### A small discussion

This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.

Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here,  increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.

Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂

# Dilated Convolution

In simple terms, dilated convolution is just a convolution applied to input with defined gaps. With this definitions, given our input is an 2D image, dilation rate k=1 is normal convolution and k=2 means skipping one pixel per input and k=4 means skipping 3 pixels. The best to see the figures below with the same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is 3x3 in this example, and green area is the receptive field captured by each of these inputs. Receptive field is the implicit area captured on the initial input by each input (unit) to the next layer .

Dilated convolution is a way of increasing receptive view (global view) of the network exponentially and linear parameter accretion. With this purpose, it finds usage in applications cares more about integrating knowledge of the wider context with less cost.

One general use is image segmentation where each pixel is labelled by its corresponding class. In this case, the network output needs to be in the same size of the input image. Straight forward way to do is to apply convolution then add deconvolution layers to upsample[1]. However, it introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolutions high and it avoids the need of upsampling [2][3].

Dilated convolution is applied in domains beside vision as well. One good example is WaveNet[4] text-to-speech solution and ByteNet learn time text translation. They both use dilated convolution in order to capture global view of the input with less parameters.

In short, dilated convolution is a simple but effective idea and you might consider it in two cases;

1. Detection of fine-details by processing inputs in higher resolutions.
2. Broader view of the input to capture more contextual information.
3. Faster run-time with less parameters

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2]Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3]Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5]Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099

# Selfai: A Method for Understanding Beauty in Selfies

Selfies are everywhere. With different fun masks, poses and filters,  it goes crazy.  When we coincide with any of these selfies, we automatically give an intuitive score regarding the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes but they are not fully prescribed.

With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them in accordance to its quality and beauty.  The idea was to see whether it is possible to mimic that bizarre perceptual understanding of human with the recent advancements of AI. And if it is, then let's make an application and let people use it for whatever purpose.  For now, we only have an Instagram bot @selfai_robot. You can check before reading.

# Face Detection by Literature

Please ping me if you know something more.

Multi-view Face Detection Using Deep Convolutional Neural Network

1. Train face classifier with face (> 0.5 overlap) and background (<0.5 overlap) images.
2.  Compute heatmap over test image scaled to different sizes with sliding window
3.  Apply NMS .
4.  Computation intensive, especially for CPU.
•  http://arxiv.org/abs/1502.02766

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Keywords: object proposals, facial parts,  more annotation.

1. Use facial part annotations
2. Bottom up to detect face from facial parts.
3. "Faceness-Net’s pipeline consists of three stages,i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection."
4. Train part based classifiers based on attributes related to different parts of the face i.e. for hair part train ImageNet pre-trained network for color classification.
5. Very robust to occlusion and background clutter.
6. To much annotation effort.
7. Still object proposals (DL community should skip proposal approach. It complicate the problem by creating a new domain of problem :)) ).
• http://arxiv.org/abs/1509.06451

Supervised Transformer Network for Efficient Face Detection

• http://home.ustc.edu.cn/~chendong/STN_Detector/stn_detector.pdf

UnitBox: An Advanced Object Detection Network

• http://arxiv.org/abs/1608.02236

Deep Convolutional Network Cascade for Facial Point Detection

• http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf
• http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
• https://github.com/luoyetx/deep-landmark

WIDER FACE: A Face Detection Benchmark

A novel cascade detection method being a state of art at WIDER FACE

1. Train separate CNNs for small range of scales.
2. Each detector has two stages; Region Proposal Network + Detection Network
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
• http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/paper.pdf

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

1.  Similar to YOLO .
2.  Image pyramid of input
3.  Feed to network
4. Upsample feature maps after a layer.
5. Predict classification score and bbox location per pixel on upsampled feature map.
6. NMS to bbox locations.
7. SoA at MALF face dataset
• http://arxiv.org/pdf/1509.04874v3.pdf
• http://www.cbsr.ia.ac.cn/faceevaluation/results.html

Face Detection without Bells and Whistles

Keywords: no NN, DPM, Channel Features

1. ECCV 2014
2. Very high quality detections
3. Very slow on CPU and acceptable on GPU
• https://bitbucket.org/rodrigob/doppia/
• http://rodrigob.github.io/documents/2014_eccv_face_detection_with_supplementary_material.pdf

# Paper review: Dynamic Capacity Networks

Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.

In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.

Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.

Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.

We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.

At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T.  Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.

I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.

Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.

# How many training samples we observe over life time ?

In this post, I like to compute what number of visual instances we observes over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.

Let's dive into the computation. Relying on [1],  average person can see the world with 45 fps on average. It goes to extremes for such people like fighter pilots which is 225fps with the adrenaline kicked in.  I took the average life time 71 years [3] equals to $2239056000$ (2 .24 billion) secs and we are awake almost $2/3$ of  it which makes $1492704000$ (1.49 billion) secs .  Then we assume that on average there are $86*10^9$ neurons in our brain [2]. This is our model size.

Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from  $1492704000 * 45 = 67171680000$  almost 67 billion images.

Of course this is not a convenient way to come with this numbers but fun comes by ignorance 🙂

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

# ParseNet: Looking Wider to See Better

In this work, they propose two related problems and comes with a simple but functional solution to this. the problems are;

1. Learning object location on the image with Proposal + Classification approach is very tiresome since it needs to classify >1000 patched per image. Therefore, use of end to end pixel-wise segmentation is a better solution as proposed by FCN (Long et al. 2014).
2. FCN oversees the contextual information since it predicts the objects of each pixel independently. Therefore, even the thing on the image is Cat, there might be unrelated predictions for different pixels. They solve this by applying Conditional Random Field (CRF) on top of FCN. This is a way to consider context by using pixel relations.  Nevertheless, this is still not a method that is able to learn end-to-end since CRF needs additional learning stage after FCN.

Based on these two problems they provide ParseNet architecture. It declares contextual information by looking each channel feature map and aggregating the activations values.  These aggregations then merged to be appended to final features of the network as depicted below;

Their experiments construes the effectiveness of the additional contextual features.  Yet there are two important points to consider before using these features together. Due to the scale differences of each layer activations, one needs to normalize first per layer then append them together.  They L2 normalize each layer's feature. However, this results very small feature values which also hinder the network to learn in a fast manner.  As a cure to this, they learn scale parameters to each feature as used by the Batch Normalization method so that they first normalize and scale the values with scaling weights learned from the data.

The takeaway from this paper,  for myself, adding intermediate layer features improves the results with a correct normalization framework and as we add more layers, network is more robust to local changes by the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose, VGG net does not use Batch Normalization. Therefore, use of Batch Normalization from the start might evades the need of additional scale parameters even maybe the L2 normalization of aggregated features. This is because, Batch Normalization already scales and shifts the feature values into a common norm.

Note: this is a hasty used article sorry for any inconvenience or mistake or stupidly written sentences.