Category Archives: Computer Vision

Dilated Convolution

In simple terms, dilated convolution is just a convolution applied to input with defined gaps. With this definitions, given our input is an 2D image, dilation rate k=1 is normal convolution and k=2 means skipping one pixel per input and k=4 means skipping 3 pixels. The best to see the figures below with the same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is 3x3 in this example, and greed area is the receptive field captured by each of these inputs. Receptive field is the implicit area captured on the initial input by each input (unit) to the next layer .

Dilated convolution is a way of increasing receptive view (global view) of the network exponentially and linear parameter accretion. With this purpose, it finds usage in applications cares more about integrating knowledge of the wider context with less cost.

One general use is image segmentation where each pixel is labelled by its corresponding class. In this case, the network output needs to be in the same size of the input image. Straight forward way to do is to apply convolution then add deconvolution layers to upsample[1]. However, it introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolutions high and it avoids the need of upsampling [2][3].

Dilated convolution is applied in domains beside vision as well. One good example is WaveNet[4] text-to-speech solution and ByteNet learn time text translation. They both use dilated convolution in order to capture global view of the input with less parameters.

From [5]
In short, dilated convolution is a simple but effective idea and you might consider it in two cases;

  1. Detection of fine-details by processing inputs in higher resolutions.
  2. Broader view of the input to capture more contextual information.
  3. Faster run-time with less parameters

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2]Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3]Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5]Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099

Selfai: A Method for Understanding Beauty in Selfies

Selfies are everywhere. With different fun masks, poses and filters,  it goes crazy.  When we coincide with any of these selfies, we automatically give an intuitive score regarding the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes but they are not fully prescribed.

With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them in accordance to its quality and beauty.  The idea was to see whether it is possible to mimic that bizarre perceptual understanding of human with the recent advancements of AI. And if it is, then let's make a mobile application and let people use it for whatever purpose. Spoiler alert! We already developed Selfai app available on iOS and Android and we have one instagram bot @selfai_robot. You can check before reading.

 

Selfai - available on iOS and Android
Selfai - available on iOS and Android

 

Continue reading Selfai: A Method for Understanding Beauty in Selfies

Face Detection by Literature

Please ping me if you know something more.

Multi-view Face Detection Using Deep Convolutional Neural Network

  1. Train face classifier with face (> 0.5 overlap) and background (<0.5 overlap) images.
  2.  Compute heatmap over test image scaled to different sizes with sliding window
  3.  Apply NMS .
  4.  Computation intensive, especially for CPU.
  •  http://arxiv.org/abs/1502.02766

multiview_face

 

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Keywords: object proposals, facial parts,  more annotation.

  1. Use facial part annotations
  2. Bottom up to detect face from facial parts.
  3. "Faceness-Net’s pipeline consists of three stages,i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection."
  4. Train part based classifiers based on attributes related to different parts of the face i.e. for hair part train ImageNet pre-trained network for color classification.
  5. Very robust to occlusion and background clutter.
  6. To much annotation effort.
  7. Still object proposals (DL community should skip proposal approach. It complicate the problem by creating a new domain of problem :)) ).
  • http://arxiv.org/abs/1509.06451

facial_parts

 

Supervised Transformer Network for Efficient Face Detection

  • http://home.ustc.edu.cn/~chendong/STN_Detector/stn_detector.pdf

 

UnitBox: An Advanced Object Detection Network

  • http://arxiv.org/abs/1608.02236

 

Deep Convolutional Network Cascade for Facial Point Detection

  • http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf
  • http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
  • https://github.com/luoyetx/deep-landmark

 

WIDER FACE: A Face Detection Benchmark

A novel cascade detection method being a state of art at WIDER FACE

  1. Train separate CNNs for small range of scales.
  2. Each detector has two stages; Region Proposal Network + Detection Network
  • http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
  • http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/paper.pdf

face_wider

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

  1.  Similar to YOLO .
  2.  Image pyramid of input
  3.  Feed to network
  4. Upsample feature maps after a layer.
  5. Predict classification score and bbox location per pixel on upsampled feature map.
  6. NMS to bbox locations.
  7. SoA at MALF face dataset
  • http://arxiv.org/pdf/1509.04874v3.pdf
  • http://www.cbsr.ia.ac.cn/faceevaluation/results.html

Face Detection without Bells and Whistles

Keywords: no NN, DPM, Channel Features

  1. ECCV 2014
  2. Very high quality detections
  3. Very slow on CPU and acceptable on GPU
  • https://bitbucket.org/rodrigob/doppia/
  • http://rodrigob.github.io/documents/2014_eccv_face_detection_with_supplementary_material.pdf

Paper review: Dynamic Capacity Networks

Paper: http://arxiv.org/pdf/1511.07838v7.pdf

Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.

In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.

Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.

Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.

We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.

At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T.  Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.

I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.

Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.

How many training samples we observe over life time ?

In this post, I like to compute what number of visual instances we observes over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.

Let's dive into the computation. Relying on [1],  average person can see the world with 45 fps on average. It goes to extremes for such people like fighter pilots which is 225fps with the adrenaline kicked in.  I took the average life time 71 years [3] equals to 2239056000 (2 .24 billion) secs and we are awake almost 2/3 of  it which makes 1492704000 (1.49 billion) secs .  Then we assume that on average there are 86*10^9 neurons in our brain [2]. This is our model size.

Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from  1492704000 * 45 = 67171680000  almost 67 billion images.

Of course this is not a convenient way to come with this numbers but fun comes by ignorance 🙂

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

ParseNet: Looking Wider to See Better

paperhttp://arxiv.org/pdf/1506.04579v2.pdf

codehttps://gist.github.com/shelhamer/80667189b218ad570e82

In this work, they propose two related problems and comes with a simple but functional solution to this. the problems are;

  1. Learning object location on the image with Proposal + Classification approach is very tiresome since it needs to classify >1000 patched per image. Therefore, use of end to end pixel-wise segmentation is a better solution as proposed by FCN (Long et al. 2014).
  2. FCN oversees the contextual information since it predicts the objects of each pixel independently. Therefore, even the thing on the image is Cat, there might be unrelated predictions for different pixels. They solve this by applying Conditional Random Field (CRF) on top of FCN. This is a way to consider context by using pixel relations.  Nevertheless, this is still not a method that is able to learn end-to-end since CRF needs additional learning stage after FCN.

Based on these two problems they provide ParseNet architecture. It declares contextual information by looking each channel feature map and aggregating the activations values.  These aggregations then merged to be appended to final features of the network as depicted below;

Figure from the paper. It shows the problem told above and proposed feature aggregation
Figure from the paper. It shows the problem told above and proposed feature aggregation

 

Their experiments construes the effectiveness of the additional contextual features.  Yet there are two important points to consider before using these features together. Due to the scale differences of each layer activations, one needs to normalize first per layer then append them together.  They L2 normalize each layer's feature. However, this results very small feature values which also hinder the network to learn in a fast manner.  As a cure to this, they learn scale parameters to each feature as used by the Batch Normalization method so that they first normalize and scale the values with scaling weights learned from the data.

The takeaway from this paper,  for myself, adding intermediate layer features improves the results with a correct normalization framework and as we add more layers, network is more robust to local changes by the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose, VGG net does not use Batch Normalization. Therefore, use of Batch Normalization from the start might evades the need of additional scale parameters even maybe the L2 normalization of aggregated features. This is because, Batch Normalization already scales and shifts the feature values into a common norm.

Note: this is a hasty used article sorry for any inconvenience or mistake or stupidly written sentences.

Recent Advances in Deep Learning

In this text, I would like to talk about some of the recent advances of Deep Learning models by no means complete. (Click heading for the reference)

  1. Parametric Rectifier Linear Unit (PReLU)
    • The idea is to allow negative activation in well-known ReLU units by controlling it with a learnable parameter. In other words, you learn how much negative activationsyou need for each unit to discriminate classes. In the work, it is proposed that PReLU unit is very useful for especially very deep models that lacks for gradient propagation to initial layers due to its depth. What is different is PReLU allows more gradient return by allowing negative activation.PReLU
  2. A new initialization method (MSRA for Caffe users)
    • Xavier initialization was proposed by Bengio's team and it considers number of fan-in and fan-out to a certain unit to define the initial weights.  However, the work says that Xavier method and its alternations considers linear activation functions for the formulation of the method. Hence, they propose some changes related to ReLU activation that they empirically proved its effect in practice with better convergence rate.
  3. Batch Normalization 
    • This work serves data normalization as a structural part of the model. They say that the distribution of the training data changes as the model evolves and it priorities the initialization scheme and the learning schedule we use for the learning. Each mini-batch of the data is normalized with the described scheme just before its propagation through the network and it allows faster convergence  with larger learning rates and robust models to initialization scheme that we choose.  Each mini-batch is normalized by its mean and variance, then it is scaled and shifted by a learned coefficient and residual.

      From the paper
      From the paper
  4. Inception Layers
    • This is one of the ingredients of last year's ImageNet winner GoogleNet. The trick is to use multi-scale filters all together in a layer and concatenating their responses for the next layer. In that way we are able to learn difference covariances per each layer by different sizes and structures. inception_module

Intro. to Contractive Auto-Encoders

Contractive Auto-Encoder is a variation of well-known Auto-Encoder algorithm that has a solid background in the information theory and lately deep learning community. The simple Auto-Encoder targets to compress information of the given data as keeping the reconstruction cost lower as much as possible. However another use is to enlarge the given input's representation. In that case, you learn over-complete representation of the given data instead of compressing it. Most common implication is Sparse Auto-Encoder that learns over-complete representation but in a sparse (smart) manner. That means, for a given instance only informative set of units are activated, therefore you are able to capture more discriminative representation, especially if you use AE for pre-training of your deep neural network.

After this intro. what is special about Contraction Auto-Encoder (CAE)?  CAE simply targets to learn invariant representations to unimportant transformations for the given data. It only learns transformations that are exactly in the given dataset and try to avoid more. For instance, if you have set of car images and they have left and right view points in the dataset, then CAE is sensitive to those changes but it is insensitive to frontal view point. What it means that if you give a frontal car image to CAE after the training phase, it tries to contract its representation to one of the left or right view point car representation at the hidden layer. In that way you obtain some level of view point in-variance. (I know, this is not very good example for a cannier guy but I only try to give some intuition for CAE).

From the mathematical point of view, it gives the effect of contraction by adding an additional term to reconstruction cost. This addition is the Sqrt Frobenius norm of Jacobian of the hidden layer representation with respect to input values. If this value is zero, it means, as we change input values, we don't observe any change on the learned hidden representations. If we get very large values then the learned representation is unstable as the input values change.

This was just a small intro to CAE, if you like the idea please follow the below videos of Hugo Larochelle's lecture and Pascal Vincent's talk at ICML 2011 for the paper.