Face Detection by Literature

Please ping me if you know something more.

Multi-view Face Detection Using Deep Convolutional Neural Network

  1. Train a face classifier with face (> 0.5 overlap) and background (< 0.5 overlap) crops.
  2. Compute a heatmap over the test image, scaled to different sizes, with a sliding window.
  3. Apply NMS.
  4. Computationally intensive, especially on CPU.
  •  http://arxiv.org/abs/1502.02766
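The sliding-window stage above ends with NMS over the scored windows. A minimal sketch of greedy NMS (function names and the 0.3 threshold are my choices, not the paper's):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it by more than `thresh`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
nms(boxes, scores=[0.9, 0.8, 0.7])  # → [0, 2]
```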


 

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Keywords: object proposals, facial parts, extra annotation.

  1. Uses facial part annotations.
  2. Detects faces bottom-up from facial part responses.
  3. "Faceness-Net's pipeline consists of three stages, i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection."
  4. Trains part-based classifiers on attributes related to different parts of the face, e.g. for the hair part, train an ImageNet pre-trained network for color classification.
  5. Very robust to occlusion and background clutter.
  6. Too much annotation effort.
  7. Still relies on object proposals (the DL community should skip the proposal approach; it complicates the problem by creating a new problem domain :)) ).
  • http://arxiv.org/abs/1509.06451


 

Supervised Transformer Network for Efficient Face Detection

  • http://home.ustc.edu.cn/~chendong/STN_Detector/stn_detector.pdf

 

UnitBox: An Advanced Object Detection Network

  • http://arxiv.org/abs/1608.02236

 

Deep Convolutional Network Cascade for Facial Point Detection

  • http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf
  • http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
  • https://github.com/luoyetx/deep-landmark

 

WIDER FACE: A Face Detection Benchmark

A novel cascade detection method that is state of the art on WIDER FACE.

  1. Train separate CNNs, each for a small range of scales.
  2. Each detector has two stages: a Region Proposal Network + a Detection Network.
  • http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
  • http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/paper.pdf


DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

  1. Similar to YOLO.
  2. Image pyramid of the input.
  3. Feed it to the network.
  4. Upsample feature maps after a layer.
  5. Predict a classification score and bbox location per pixel on the upsampled feature map.
  6. Apply NMS to the bbox locations.
  7. State of the art on the MALF face dataset.
  • http://arxiv.org/pdf/1509.04874v3.pdf
  • http://www.cbsr.ia.ac.cn/faceevaluation/results.html

Face Detection without Bells and Whistles

Keywords: no NN, DPM, Channel Features

  1. ECCV 2014
  2. Very high quality detections
  3. Very slow on CPU and acceptable on GPU
  • https://bitbucket.org/rodrigob/doppia/
  • http://rodrigob.github.io/documents/2014_eccv_face_detection_with_supplementary_material.pdf

Why do we need better word representations ?

A successful AI agent should communicate. It is all about language. It should understand us and explain itself in words in order to communicate with us. All of this starts with the "meaning" of words, the atomic part of human communication. This is one of the fundamental problems of Natural Language Processing (NLP).

"meaning" is described as "the idea that is represented by a word, phrase, etc. How about representing the meaning of a word in a computer. The first attempt is to use some kind of hardly curated taxonomies such as WordNet. However such hand made structures not flexible enough, need human labor to elaborate and  do not have semantic relations between words other then the carved rules. It is not what we expect from a real AI agent.

Then NLP research focused on using number vectors to symbolize words. The first use is to denote words with discrete (one-hot) representations. That is, if we assume a vocabulary of 1K words, then we create a length-1K zero vector with a single 1 at the position of the target word.
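The one-hot scheme is trivial to sketch (the toy vocabulary below is mine):

```python
vocab = ["cat", "dog", "bird"]  # toy 3-word vocabulary

def one_hot(word, vocab):
    """Discrete (one-hot) representation: a zero vector with a single 1
    at the target word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

one_hot("dog", vocab)  # → [0, 1, 0]
```

Note that such vectors are orthogonal to each other, so they carry no notion of similarity between words, which is exactly the limitation that motivates denser representations.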

Object Detection Literature

Please let me know if there are more works comparable to those below.

R-CNN minus R

  • http://arxiv.org/pdf/1506.06981.pdf

 

FasterRCNN (Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks)

Keywords: RCNN, RoI pooling, object proposals, ImageNet 2015 winner.

PASCAL VOC2007: 73.2%

PASCAL VOC2012: 70.4%

ImageNet Val2 set: 45.4% MAP

  1. Model agnostic
  2. State of the art with Residual Networks
    •  http://arxiv.org/pdf/1512.03385v1.pdf
  3. Fast enough for offline systems and partially for online systems
  • https://arxiv.org/pdf/1506.01497.pdf
  • https://github.com/ShaoqingRen/faster_rcnn (official)
  • https://github.com/rbgirshick/py-faster-rcnn
  • http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bil722/slides/w05-FasterR-CNN.pdf
  • https://github.com/precedenceguo/mx-rcnn
  • https://github.com/mitmul/chainer-faster-rcnn
  • https://github.com/andreaskoepf/faster-rcnn.torch

 

YOLO (You Only Look Once: Unified, Real-Time Object Detection)

Keywords: real-time detection, end2end training.

PASCAL VOC 2007: 63.4% (YOLO), 57.9% (Fast YOLO)

RUN-TIME : 45 FPS (YOLO), 155 FPS (Fast YOLO)

  1. VGG-16 based model
  2. End-to-end learning with no extra hassle (no proposals)
  3. Fastest with some performance payback relative to Faster RCNN
  4. Applicable to online systems
  • http://pjreddie.com/darknet/yolo/
  • https://github.com/pjreddie/darknet
  • https://github.com/BriSkyHekun/py-darknet-yolo (python interface to darknet)
  • https://github.com/tommy-qichang/yolo.torch
  • https://github.com/gliese581gg/YOLO_tensorflow
  • https://github.com/ZhouYzzz/YOLO-mxnet
  • https://github.com/xingwangsfu/caffe-yolo
  • https://github.com/frankzhangrui/Darknet-Yolo (custom training)

 

MultiBox (Scalable Object Detection using Deep Neural Networks)

Keywords: cascade classifiers, object proposal network.

  1. Similar to YOLO
  2. Two successive networks: one generating object proposals and one classifying them
  • http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Erhan_Scalable_Object_Detection_2014_CVPR_paper.pdf
  • https://github.com/google/multibox
  • https://research.googleblog.com/2014/12/high-quality-object-detection-at-scale.html

 

ION (Inside - Outside Net) 

Keywords: object proposal network, RNN, context features

  1. RNNs run on top of the conv5 layer in 4 different directions
  2. Concatenate features from different layers with L2 normalization + rescaling
  • (great slide) http://www.seanbell.ca/tmp/ion-coco-talk-bell2015.pdf

 

UnitBox ( UnitBox: An Advanced Object Detection Network)

  • https://arxiv.org/pdf/1608.01471v1.pdf

 

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

  1. Similar to YOLO.
  2. Image pyramid of the input.
  3. Feed it to the network.
  4. Upsample feature maps after a layer.
  5. Predict a classification score and bbox location per pixel on the upsampled feature map.
  6. Apply NMS to the bbox locations.
  • http://arxiv.org/pdf/1509.04874v3.pdf

 

MRCNN: Object detection via a multi-region & semantic segmentation-aware CNN model

PASCAL VOC2007: 78.2% MAP

PASCAL VOC2012: 73.9% MAP

Keywords: bbox regression, segmentation aware

  1. Very large model with a great deal of detail.
  2. Divides each detection window into different regions.
  3. Learns a different network per region scheme.
  4. Empowers the representation by using an entire-image network.
  5. Uses a segmentation-aware network which takes the entire image as input.
  • http://arxiv.org/pdf/1505.01749v3.pdf
  • https://github.com/gidariss/mrcnn-object-detection

 

SSD: Single Shot MultiBox Detector

PASCAL VOC2007: 75.5% MAP (SSD 500), 72.1% MAP (SSD 300)

PASCAL VOC2012: 73.1% MAP (SSD 500)

RUN-TIME: 23 FPS (SSD 500), 58 FPS (SSD 300)

Keywords: real-time, no object proposal, end2end training

  1. Faster and more accurate than YOLO (their claim)
  2. Not useful for small objects
  • https://arxiv.org/pdf/1512.02325v2.pdf
  • https://github.com/weiliu89/caffe/tree/ssd
Results for SSD, YOLO and F-RCNN

 

CRAFT (CRAFT Objects from Images)

PASCAL VOC2007: 75.7% MAP

PASCAL VOC2012: 71.3% MAP

ImageNet Val2 set: 48.5% MAP

  • intro: CVPR 2016. Cascade Region-proposal-network And FasT-rcnn (CRAFT), an extension of Faster R-CNN
  • http://byangderek.github.io/projects/craft.html
  • https://github.com/byangderek/CRAFT
  • https://arxiv.org/abs/1604.03239

 

How to use Python Decorators

Decorators are handy sugar for Python programmers to shorten things and write more concise code.

For instance, you can use decorators for user authentication in your REST API servers. Assume that you need to authenticate the user before each REST call. Instead of prepending the same procedure to each call function, it is better to define a decorator and tag it onto your call functions.

Let's see the small example below. I hope it is self-descriptive.


"""
How to use Decorators:

Decorators are functions called by annotations
Annotations are the tags prefixed by @
"""

### Decorator functions ###
def helloSpace(target_func):
def new_func():
print "Hello Space!"
target_func()
return new_func

def helloCosmos(target_func):
def  new_func():
print "Hello Cosmos!"
target_func()
return new_func


@helloCosmos # annotation
@helloSpace # annotation
def hello():
print "Hello World!"

### Above code is equivalent to these lines
# hello = helloSpace(hello)
# hello = helloCosmos(hello)

### Let's Try
hello()
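Coming back to the REST-authentication motivation above, here is a hedged sketch of such a decorator; the token check, the dict-shaped request, and all names are made up for illustration and do not belong to any particular framework:

```python
from functools import wraps

def require_auth(func):
    """Hypothetical auth decorator: reject the call unless the request
    carries the expected token (hard-coded here purely for illustration)."""
    @wraps(func)
    def wrapper(request, *args, **kwargs):
        if request.get("token") != "secret":
            return {"status": 401, "body": "Unauthorized"}
        return func(request, *args, **kwargs)
    return wrapper

@require_auth
def get_profile(request):
    return {"status": 200, "body": "profile data"}

get_profile({"token": "secret"})  # → {'status': 200, 'body': 'profile data'}
get_profile({"token": "wrong"})   # → {'status': 401, 'body': 'Unauthorized'}
```

The point is that `get_profile` itself stays free of authentication logic; the check is attached once and can be reused on every call function.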

 

Comparison of Deep Learning Libraries After Years of Use

As we witness the golden age of AI and deep learning, many different tools and frameworks are continuously proposed by different communities. Sometimes it is even hard to keep up with what is going on. You choose one over another, then you see a new library and go for it. However, the right choice does not seem obvious to anyone.

From my point of view, libraries are measured by a flexibility vs. run-time trade-off. If a library is really easy to use, it is correspondingly slow. If the library is very fast, then it does not offer that much flexibility, or it is specialized to one particular type of model, like convolutional NNs, and does not support the type you are interested in, such as recurrent NNs.

After all the tears, sweat, and blood shed through years of experience in deep learning, I decided to share my own intuition and opinions about the common deep learning libraries, so that they might help you choose the right one for your own sake.

Let's start by defining some metrics to evaluate a library. These are the points that I consider.

Paper review: CONVERGENT LEARNING: DO DIFFERENT NEURAL NETWORKS LEARN THE SAME REPRESENTATIONS?

paper: http://arxiv.org/pdf/1511.07543v3.pdf
code : https://github.com/yixuanli/convergent_learning

This paper is an interesting work that tries to explain the similarities and differences between the representations learned by different networks with the same architecture.

In their experiments, they train 4 different AlexNets and compare the units of these networks by correlation and mutual information analysis.

They ask the following questions:

  • Can we find a one-to-one matching of units between networks, showing that these units are sensitive to similar or the same commonalities in the image?
  • Does the one-to-one matching stay the same under different similarity measures? They first use correlation, then mutual information to confirm the findings.
  • Is a representation learned by one network a rotated version of the other's, to the extent that one-to-one matching between networks is not possible?
  • Is clustering plausible for grouping units in different networks?

Answers to these questions are as follows;

  • It is possible to find well-matching units with really high correlation values, but some units learn unique representations that are not replicated by the others. The degree of representational divergence between networks grows with the number of layers: we see large correlations at the conv1 layer, and the values decrease in higher layers, reaching their minimum around conv4.
  • They first analyze layers by the correlation values among units. Then they measure the overlap with mutual information, and the results confirm each other.
  • To see the differences between learned representations, they use a very smart trick. They approximate the representations learned by a layer of one network with the same layer of another network. A sparse approximation is performed using LASSO. The results indicate that some units are approximated well by 1 or 2 units of the other network, but the remaining units require almost 4 counterpart units for a good approximation. This shows that the units with good one-to-one matchings have learned local codes, while the other units have slightly distributed codes approximated by multiple counterpart units.
  • They also run hierarchical clustering in order to successfully group similar units.
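The correlation-based matching in the first bullets can be illustrated on toy data; the synthetic "activations", the permutation, and all names below are mine, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for unit activations: rows = samples, columns = units.
# Net B's units are a permuted, noisy copy of net A's, so a good
# one-to-one matching exists by construction.
acts_a = rng.normal(size=(1000, 5))
perm = [3, 0, 4, 1, 2]
acts_b = acts_a[:, perm] + 0.1 * rng.normal(size=(1000, 5))

def best_matches(a, b):
    """For each unit of network A, the most correlated unit of network B."""
    n = a.shape[1]
    corr = np.corrcoef(a.T, b.T)[:n, n:]  # (units of A) x (units of B)
    return list(np.argmax(np.abs(corr), axis=1))

best_matches(acts_a, acts_b)  # → [1, 3, 4, 0, 2], the inverse of `perm`
```

The paper goes further (mutual information, LASSO-based sparse approximation), but the core idea of matching units by their activation correlation over a common input set is what this sketch shows.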

For details please refer to the paper.

My discussion: We see that different networks learn similar representations, with some level of accompanying uniqueness. It would be intriguing to see, following this paper, whether these unique representations are what cause the performance differences between networks, and whether their effect is improving or worsening. Additionally, maybe we might combine these differences at the end to improve network performance with some set of smart tricks.

One deficit of the paper is that they do not experiment with deeper networks, which are the real deal of our time. As we see from the results, as the layers go deeper, different abstractions are exhumed by different networks. I believe this is even more pronounced for deeper architectures of the Inception or VGG kind.

Another curious direction is to study Residual networks. The intuition behind Residual networks is to pass the already-learned representation to the upper layers and add to the residual channel only if the next layer learns something useful. That idea suggests that two Residual networks might be more similar to each other than two Inception networks. Moreover, we could compare different layers inside a single Residual network to see at what level the representation stays the same.

Paper review: ALL YOU NEED IS A GOOD INIT

paper: http://arxiv.org/abs/1511.06422
code: https://github.com/yobibyte/yobiblog/blob/master/posts/all-you-need-is-a-good-init.md

This work proposes yet another way to initialize your network, namely LSUV (Layer-Sequential Unit-Variance), targeting especially deep networks. The idea relies on the recently proposed orthogonal initialization, then fine-tunes the weights using the data so that each layer's output has variance 1.

The scheme follows four stages:

  1. Initialize the weights with a unit-variance Gaussian.
  2. Find the components of these weights using SVD.
  3. Replace the weights with these components.
  4. Using mini-batches of data, iteratively rescale the weights so that each layer's output has variance 1. This iterative procedure is described in the pseudo code below.
From the paper: pseudo code of the initialization scheme.

 

To describe the pseudo code in words: in each iteration we feed a new mini-batch and compute the variance of the layer's output. We compare the computed variance to the target variance of 1, using the tolerance threshold Tol_{var}. While the number of iterations is below the maximum and the difference is above Tol_{var}, we divide the layer weights by the square root of the mini-batch output variance. After initializing this layer, we move on to the next layer.
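A rough sketch of that per-layer loop; the function names, the toy linear layer, and the tolerance value are my choices, so see the paper's pseudo code for the exact procedure:

```python
import numpy as np

def lsuv_init(weights, get_batch, forward, tol_var=0.05, max_iter=10):
    """LSUV-style iteration for one layer: rescale the (already
    orthogonalized) weights until the layer's output variance is ~1.
    `forward(weights, batch)` returns the layer's output for a mini-batch."""
    for _ in range(max_iter):
        var = forward(weights, get_batch()).var()
        if abs(var - 1.0) < tol_var:
            break
        weights /= np.sqrt(var)  # divide by the output's standard deviation
    return weights

# Toy stand-in: a linear layer whose forward pass is a matrix product.
rng = np.random.default_rng(0)
W = lsuv_init(rng.normal(size=(64, 64)),
              get_batch=lambda: rng.normal(size=(128, 64)),
              forward=lambda w, x: x @ w)
var_after = (rng.normal(size=(128, 64)) @ W).var()  # close to 1.0
```

For a purely linear layer one rescale already lands near unit variance; with nonlinearities in between, the iterations matter.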

In essence, what does this method do? First, we start with a plain Gaussian initialization, which we know is not enough for deep networks. The orthogonalization stage decorrelates the weights so that each unit of the layer starts learning from a distinctly different point in the space. At the final stage, the LSUV iterations rescale the weights and keep the forward- and backward-propagated signals close to a useful variance, guarding against the vanishing and exploding gradient problems, similar to Batch Normalization but without the computational load. Nevertheless, as they also point out, LSUV is not interchangeable with BN, especially for large datasets like ImageNet. Still, I'd like to see a comparison of LSUV vs. BN, but it is either not done or not written into the paper (Edit by the author: Figure 3 in the paper has a CIFAR comparison of BN and LSUV, and ImageNet results are posted at https://github.com/ducha-aiki/caffenet-benchmark).

The good side of this method is that it works, at least in my experiments on ImageNet with different architectures. It is also not much of a hurdle to code, if you already have orthogonal initialization at hand. Even if you don't, you can start with a Gaussian initialization, skip the orthogonalization stage, and directly use the LSUV iterations. It still works, with a slight decrease in performance.

Paper review: Dynamic Capacity Networks

Paper: http://arxiv.org/pdf/1511.07838v7.pdf

Decompose the network structure into two networks, F and G, keeping a set of top layers T at the end. F and G are a small and a more advanced network structure, respectively. Thus F is cheap to execute, with lower performance compared to G.

In order to reduce the overall computation and embrace both the performance and computation gains provided by the two networks, they suggest an incremental pass of the input data through F and then G.

Network F decides the salient regions of the input using gradient feedback, and then these smaller regions are sent to network G for better recognition performance.

Given an input image x, the coarse network F is applied and coarse representations of different regions of the input are computed. These coarse representations are propagated to the top layers T, which compute the final output of the network, the class predictions. An entropy measure is used to see how each coarse representation affects the model's uncertainty, the idea being that if a region is salient, we expect a large change of uncertainty with respect to its representation.

We select the top k input regions as salient, guided by the computed entropy changes, and these regions are given to the fine network G to obtain finer representations. Eventually, we merge all the coarse and fine representations, feed them to the top layers T again, and get the final predictions.
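As a toy illustration of the entropy-based top-k selection: the sketch below ranks regions by how far their predictive entropy moves from a base prediction. This is a simplified stand-in for the paper's gradient-of-entropy saliency, and all names and data are mine:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def top_k_salient(region_probs, base_probs, k=2):
    """Rank regions by how much each region's coarse prediction shifts the
    model's uncertainty relative to the base prediction; keep the top k."""
    base_h = entropy(base_probs)
    changes = [abs(entropy(p) - base_h) for p in region_probs]
    return list(np.argsort(changes)[::-1][:k])

# Three candidate regions; region 1 shifts the uncertainty the most.
regions = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
top_k_salient(regions, base_probs=[0.5, 0.5], k=2)  # → [1, 2]
```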

At training time, all networks and layers are trained simultaneously. However, one might still decide to train networks F and G separately, using the same top layers T. The authors posit that simultaneous training is useful for keeping the fine and coarse representations similar, so that the final layers T do not struggle too much to learn from two different representation distributions.

I only try to give the overall idea here; if you would like to see more detail and dig into the formulas, please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited to small datasets and small spatial dimensions. I would really like to see whether it is also useful for large problems like ImageNet, or even larger.

Another caveat is that the datasets used for the experiments are not very cluttered, so it is easy to detect the salient regions, even with simple algorithmic techniques. Thus, the value of this method for real-life problems remains obscure to me.