# Raspberry Pi Home Surveillance with only ~150 lines of Python code

I have owned a Raspberry Pi for a long time, and it was just sitting in my tech box. After watching a YouTube session of creative Raspberry Pi applications, with envy, I decided to try something myself. The first obvious idea was a home security system to inspect your house while you are away.

The final thingy is able to detect and roughly localize any motion through a camera. It takes photos and mails them to your email account. Plus, you can interact with it over the local network through a simple web interface, so you can activate or deactivate it right at the front door. I assume that if someone is able to reach the local Wi-Fi network, most probably s/he is one of us (fair enough?).
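The motion detection part boils down to frame differencing. As a rough sketch of the idea (not the actual ~150-line script; the function name and thresholds here are my own), treating two grayscale frames as NumPy arrays:

```python
import numpy as np

def detect_motion(prev_frame, frame, threshold=25, min_pixels=500):
    # Count pixels whose intensity changed more than `threshold`;
    # if enough pixels changed, return a rough bounding box of the motion.
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    changed = diff > threshold
    if changed.sum() < min_pixels:
        return None  # no significant motion
    ys, xs = np.nonzero(changed)
    return (xs.min(), ys.min(), xs.max(), ys.max())
```

When a box comes back, the script snaps a photo and mails it. A proper background-subtraction method (e.g. OpenCV's MOG2) would be more robust than plain differencing, but this captures the gist.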

# Object Detection Literature

(Please let me know if there are other works comparable to those listed below.)

R-CNN minus R

• http://arxiv.org/pdf/1506.06981.pdf

Faster R-CNN (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks)

Keywords: RCNN, RoI pooling, object proposals, ImageNet 2015 winner.

PASCAL VOC2007: 73.2% mAP

PASCAL VOC2012: 70.4% mAP

ImageNet Val2 set: 45.4% mAP

1. Model agnostic
2. State of art with Residual Networks
•  http://arxiv.org/pdf/1512.03385v1.pdf
3. Fast enough for offline systems and partially for online systems
• https://arxiv.org/pdf/1506.01497.pdf
• https://github.com/ShaoqingRen/faster_rcnn (official)
• https://github.com/rbgirshick/py-faster-rcnn
• http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bil722/slides/w05-FasterR-CNN.pdf
• https://github.com/precedenceguo/mx-rcnn
• https://github.com/mitmul/chainer-faster-rcnn

YOLO (You Only Look Once: Unified, Real-Time Object Detection)

Keywords: real-time detection, end2end training.

PASCAL VOC 2007: 63.4% mAP (YOLO), 57.9% mAP (Fast YOLO)

RUN-TIME: 45 FPS (YOLO), 155 FPS (Fast YOLO)

1. VGG-16 based model
2. End-to-end learning with no extra hassle (no proposals)
3. Fastest, with some performance trade-off relative to Faster R-CNN
4. Applicable to online systems
• http://pjreddie.com/darknet/yolo/
• https://github.com/pjreddie/darknet
• https://github.com/BriSkyHekun/py-darknet-yolo (python interface to darknet)
• https://github.com/tommy-qichang/yolo.torch
• https://github.com/gliese581gg/YOLO_tensorflow
• https://github.com/ZhouYzzz/YOLO-mxnet
• https://github.com/xingwangsfu/caffe-yolo
• https://github.com/frankzhangrui/Darknet-Yolo (custom training)

MultiBox (Scalable Object Detection using Deep Neural Networks)

Keywords: cascade classifiers, object proposal network.

1. Similar to YOLO
2. Two successive networks: one generates object proposals and the other classifies them
• http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Erhan_Scalable_Object_Detection_2014_CVPR_paper.pdf

ION (Inside-Outside Net)

Keywords: object proposal network, RNN, context features

1. RNNs run on top of the conv5 layer in 4 different directions
2. Concatenate features from different layers with L2 normalization + rescaling
• (great slide) http://www.seanbell.ca/tmp/ion-coco-talk-bell2015.pdf

UnitBox (UnitBox: An Advanced Object Detection Network)

• https://arxiv.org/pdf/1608.01471v1.pdf

DenseBox (DenseBox: Unifying Landmark Localization with End to End Object Detection)

Keywords: upsampling, hardmining, no object proposal, BAIDU

1. Similar to YOLO.
2. Image pyramid of the input.
3. Feed it to the network.
4. Upsample the feature maps after a layer.
5. Predict a classification score and bbox location per pixel on the upsampled feature map.
6. Apply NMS to the bbox locations.
• http://arxiv.org/pdf/1509.04874v3.pdf
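Several of the detectors above (DenseBox, YOLO, SSD) finish with non-maximum suppression over the predicted boxes. For reference, a minimal greedy NMS over `(x1, y1, x2, y2)` boxes might look like this (a generic sketch, not any paper's exact implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # all remaining boxes that overlap it by more than iou_thresh.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection rectangle between box i and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou < iou_thresh]
    return keep
```

Real implementations differ in details (soft-NMS, per-class NMS, the +1 pixel convention), but the greedy loop is the common core.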

MRCNN: Object detection via a multi-region & semantic segmentation-aware CNN model

PASCAL VOC2007: 78.2% mAP

PASCAL VOC2012: 73.9% mAP

Keywords: bbox regression, segmentation aware

1. Very large model with a lot of detail.
2. Divide each detection window into different regions.
3. Learn a different network per region scheme.
4. Empower the representation by using a network over the entire image.
5. Use a segmentation-aware network which takes the entire image as input.
• http://arxiv.org/pdf/1505.01749v3.pdf
• https://github.com/gidariss/mrcnn-object-detection

SSD: Single Shot MultiBox Detector

PASCAL VOC2007: 75.5% mAP (SSD 500), 72.1% mAP (SSD 300)

PASCAL VOC2012: 73.1% mAP (SSD 500)

RUN-TIME: 23 FPS (SSD 500), 58 FPS (SSD 300)

Keywords: real-time, no object proposal, end2end training

1. Faster and more accurate than YOLO (their claim)
2. Not useful for small objects
• https://arxiv.org/pdf/1512.02325v2.pdf
• https://github.com/weiliu89/caffe/tree/ssd

CRAFT (CRAFT Objects from Images)

PASCAL VOC2007: 75.7% mAP

PASCAL VOC2012: 71.3% mAP

ImageNet Val2 set: 48.5% mAP

• intro: CVPR 2016. Cascade Region-proposal-network And FasT-rcnn; an extension of Faster R-CNN
• http://byangderek.github.io/projects/craft.html
• https://github.com/byangderek/CRAFT
• https://arxiv.org/abs/1604.03239

Hierarchical Object Detection with Deep Reinforcement Learning

1. Hierarchically propose object regions
2. Does not share conv computation via RoI pooling
3. Uses direct proposals on the input image
4. Conv sharing reduces performance due to spatial information loss (their claim)
5. They do not provide extensive experimentation!
6. The visual examples given are simple, without any cluttered background!
7. Still, using Reinforcement Learning here seems curious.
• https://arxiv.org/pdf/1611.03718v1.pdf

# How many training samples do we observe over a lifetime?

In this post, I would like to compute the number of visual instances we observe over a lifetime, with the assumption that we visually perceive life as a continuous video at a certain fps rate.

Let's dive into the computation. Relying on [1], an average person sees the world at 45 fps on average. It goes to extremes for people like fighter pilots, up to 225 fps with the adrenaline kicked in. I took the average lifetime of 71 years [3], which equals $2239056000$ (2.24 billion) secs, and we are awake for almost $2/3$ of it, which makes $1492704000$ (1.49 billion) secs. Then we assume that on average there are $86 \times 10^9$ neurons in our brain [2]. This is our model size.

Eventually and roughly, that means, without any further investigation, we have a model with 86 billion parameters which learns from $1492704000 \times 45 = 67171680000$, almost 67 billion, images.
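The arithmetic above is easy to check in a few lines of Python:

```python
# Back-of-envelope: images "seen" over a lifetime at 45 fps perception.
SECONDS_PER_YEAR = 365 * 24 * 3600           # 31,536,000
lifetime_secs = 71 * SECONDS_PER_YEAR        # 2,239,056,000 secs in 71 years
awake_secs = lifetime_secs * 2 // 3          # awake ~2/3 of the time
images_seen = awake_secs * 45                # 45 "frames" per waking second
neurons = 86 * 10**9                         # our "model size"
print(images_seen)                           # 67171680000 (~67 billion)
```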

Of course this is not a rigorous way to come up with these numbers, but fun comes from ignorance 🙂

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

# ImageNet winners after 2012

• 2012
  • 0.15 - Supervision (AlexNet) - ~60,954,656 params
  • 0.26 - ISI (ensemble of features)
  • 0.27 - LEAR (Fisher Vectors)
• 2013
  • 0.117 - Clarifai (paper)
  • 0.129 - NUS (very parametric but interesting method based on test and train data affinities)
  • 0.135 - ZF (same paper)
• 2014
  • 0.06 - GoogLeNet (Inception Modules) - ~11,176,896 params
  • 0.07 - VGGNet (go deeper and deeper)
  • 0.08 - SPPnet (a retrospective addition from early vision)

# Recent Advances in Deep Learning

In this text, I would like to talk about some of the recent advances in Deep Learning models; the list is by no means complete. (Click a heading for the reference.)

1. Parametric Rectifier Linear Unit (PReLU)
• The idea is to allow negative activations in the well-known ReLU unit by controlling them with a learnable parameter. In other words, you learn how much negative activation each unit needs to discriminate the classes. The work proposes that PReLU units are especially useful for very deep models, which lack gradient propagation to the initial layers due to their depth. The difference is that PReLU allows more gradient to flow back by permitting negative activations.
2. A new initialization method (MSRA for Caffe users)
• Xavier initialization was proposed by Bengio's team; it considers the number of fan-in and fan-out connections of a unit to define its initial weights. However, this work notes that the Xavier method and its variants assume linear activation functions in their formulation. Hence, they propose a change tailored to ReLU activations, and they empirically show a better convergence rate in practice.
3. Batch Normalization
• This work makes data normalization a structural part of the model. They note that the distribution of layer inputs changes as the model evolves, and that this makes the initialization scheme and the learning schedule critical. Each mini-batch is normalized with the described scheme just before propagating through the network; this allows faster convergence with larger learning rates and makes the model robust to the chosen initialization scheme. Each mini-batch is normalized by its mean and variance, then scaled and shifted by a learned coefficient and residual.
4. Inception Layers
• This is one of the ingredients of last year's ImageNet winner GoogLeNet. The trick is to use multi-scale filters together in a single layer and concatenate their responses for the next layer. In this way we are able to capture different covariances per layer through filters of different sizes and structures.
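The PReLU unit from item 1 is easy to sketch; here is a minimal NumPy version (the function name and parameterization are mine, shown only to make the idea concrete):

```python
import numpy as np

def prelu(x, a):
    # PReLU: identity for positive inputs; a learnable slope `a`
    # scales the negative side instead of zeroing it out like ReLU.
    return np.where(x > 0, x, a * x)
```

With `a = 0` this reduces to plain ReLU; in training, one `a` is learned per unit (or per channel).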
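Likewise, the per-mini-batch normalization in Batch Normalization (item 3) can be sketched in a few lines; the running statistics used at inference time and the backward pass are omitted, and the names are mine:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch (rows of x),
    # then scale by gamma and shift by beta, both of which are learned.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```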

# I presented my Master's dissertation

I am glad to have finally presented my master's dissertation, and I collected valuable feedback from my community. Earlier, I shared what I had done for the thesis in different posts. CMAP (Concept Map) and FAME (Face Association through Model Evolution) (I called it AME in the thesis to be more generic) are basically two different methods for mining visual concepts from noisy image sources such as Google Image Search or Flickr. You might prefer to look at those posts for details; I have also posted the presentation here for a brief view of my work.

# Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"

---- I am living the joy of seeing my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes all your day-to-night efforts worthwhile, REALLY!!! Before I start, I shall thank my supervisor Pinar Duygulu for her great guidance. ----

In this post, I would like to summarize the title work since I believe sometimes a friendly blog post might be more expressive than a solid scientific article.

"ConceptMap: Mining noisy web data for concept learning" proposes a pipeline so as to learn wide range of visual concepts by only defining a query to a image search engine. The idea is to query a concept at the service and download a huge bunch of images. Cluster images as removing the irrelevant instances. Learn a model from each of the clusters. At the end, each concept is represented by the ensemble of these classifiers. Continue reading Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"

# Does large data really help for Object Detection?

I stumbled upon an interesting BMVC 2012 paper (Do We Need More Training Data or Better Models for Object Detection? -- Zhu, Xiangxin, Vondrick, Carl, Ramanan, Deva, Fowlkes, Charless). It claims something contrary to the current notion of big data, which advocates the benefit of larger datasets for learning better models. The paper states that large training data is not that helpful for learning better models; indeed, more data can be harmful without careful tuning of your system!!

# How does Feature Extraction work on Images?

Here I share enhanced version of one of my Quora answer to a similar question ...

There is no single answer to this question since there is a diverse set of methods to extract features from an image.

First, what is a feature? "A distinctive attribute or aspect of something." So the point is to have a set of values for a particular instance that distinguishes it from its counterparts. For images, features might be raw pixels for simple problems like digit recognition on the well-known MNIST dataset. However, in natural images, raw pixels are not descriptive enough. Instead, there are two main streams to follow. One is to use hand-engineered feature extraction methods (e.g. SIFT, VLAD, HOG, GIST, LBP), and the other is to learn features that are discriminative in the given context (e.g. Sparse Coding, Auto Encoders, Restricted Boltzmann Machines, PCA, ICA, K-means). Note that the second alternative ...
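To make the hand-engineered stream concrete, here is a toy HOG-flavored descriptor: a single global histogram of gradient orientations. Real HOG computes such histograms over local cells with block normalization; this stripped-down version only illustrates the principle:

```python
import numpy as np

def orientation_histogram(img, bins=9):
    # Gradients along rows (y) and columns (x) of a grayscale image
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180   # unsigned orientation
    # Histogram of orientations, weighted by magnitude, normalized to unit sum
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist / (hist.sum() + 1e-8)
```

An image with a purely horizontal intensity ramp, for example, puts nearly all of its mass in the first (0°) bin.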