I owned a Raspberry Pi long ago and it was just sitting in my tech wash box. After watching a Youtube session of creative Raspberry applications, with envy , I decided to try something by myself. The first obvious idea to me was a home security system to inspect your house while you are away.
The final thingy is able to detect and roughly localize any motion through a camera. It takes photos and mails them to your email account. Plus, we are able to interact with it in our local network using a simple web interface so we are able to activate or deactivate it in front of home door. I assume that if someone is able to reach the local wifi network, most probably s/he is one of us (fair enough ?).
In this post, I like to compute what number of visual instances we observes over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.
Let's dive into the computation. Relying on , average person can see the world with 45 fps on average. It goes to extremes for such people like fighter pilots which is 225fps with the adrenaline kicked in. I took the average life time 71 years  equals to (2 .24 billion) secs and we are awake almost of it which makes (1.49 billion) secs . Then we assume that on average there are neurons in our brain . This is our model size.
Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from almost 67 billion images.
Of course this is not a convenient way to come with this numbers but fun comes by ignorance 🙂
The idea is to allow negative activation in well-known ReLU units by controlling it with a learnable parameter. In other words, you learn how much negative activationsyou need for each unit to discriminate classes. In the work, it is proposed that PReLU unit is very useful for especially very deep models that lacks for gradient propagation to initial layers due to its depth. What is different is PReLU allows more gradient return by allowing negative activation.
Xavier initialization was proposed by Bengio's team and it considers number of fan-in and fan-out to a certain unit to define the initial weights. However, the work says that Xavier method and its alternations considers linear activation functions for the formulation of the method. Hence, they propose some changes related to ReLU activation that they empirically proved its effect in practice with better convergence rate.
This work serves data normalization as a structural part of the model. They say that the distribution of the training data changes as the model evolves and it priorities the initialization scheme and the learning schedule we use for the learning. Each mini-batch of the data is normalized with the described scheme just before its propagation through the network and it allows faster convergence with larger learning rates and robust models to initialization scheme that we choose. Each mini-batch is normalized by its mean and variance, then it is scaled and shifted by a learned coefficient and residual.
This is one of the ingredients of last year's ImageNet winner GoogleNet. The trick is to use multi-scale filters all together in a layer and concatenating their responses for the next layer. In that way we are able to learn difference covariances per each layer by different sizes and structures.
I am glad to be presented my master dissertation at the end, I collect valuable feedback from my community. Before, I shared what I have done so far for the thesis on the different posts. CMAP (Concept Map) and FAME (Face Association through Model Evolution) (I called it AME at the Thesis for being more generic) are basically two different method for mining visual concepts from noisy image sources such as Google Image Search or Flickr. You might prefer to look at the posts for details or I posted here also the presentation for the brief view of my work.
---- I am living the joy of seeing my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes worthwhile all your day to night efforts, REALLY!!!. Before start, I shall thank to my supervisor Pinar Duygulu for her great guidance.----
In this post, I would like to summarize the title work since I believe sometimes a friendly blog post might be more expressive than a solid scientific article.
"ConceptMap: Mining noisy web data for concept learning" proposes a pipeline so as to learn wide range of visual concepts by only defining a query to a image search engine. The idea is to query a concept at the service and download a huge bunch of images. Cluster images as removing the irrelevant instances. Learn a model from each of the clusters. At the end, each concept is represented by the ensemble of these classifiers. Continue reading Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"→
I stumbled upon a interesting BMVC 2012 paper (Do We Need More Training Data or Better Models for Object Detection?-- Zhu, Xiangxin, Vondrick, Carl, Ramanan, Deva, Fowlkes, Charless). It is claming something contrary to current notion of big data theory that advocates benefit of large data-sets so as to learn better models with increasing training data size. Nevertheless, the paper states that large training data is not that much helpful for learning better models, indeed more data is maleficent without careful tuning of your system !!Continue reading Large data really helps for Object Detection ?→
Here I share enhanced version of one of my Quora answer to a similar question ...
There is no single answer for this question since there are many diverse set of methods to extract feature from an image.
First, what is called feature? "a distinctive attribute or aspect of something." so the thing is to have some set of values for a particular instance that diverse that instance from the counterparts. In the field of images, features might be raw pixels for simple problems like digit recognition of well-known Mnist dataset. However, in natural images, usage of simple image pixels are not descriptive enough. Instead there are two main steam to follow. One is to use hand engineered feature extraction methods (e.g. SIFT, VLAD, HOG, GIST, LBP) and the another stream is to learn features that are discriminative in the given context (i.e. Sparse Coding, Auto Encoders, Restricted Boltzmann Machines, PCA, ICA, K-means). Note that second alternative, Continue reading How does Feature Extraction work on Images?→