RaspberryPi Home Surveillance with only ~150 lines of Python Code.

I owned a Raspberry Pi long ago and it was just sitting in my tech wash box. After watching a Youtube session of creative Raspberry applications, with envy , I decided to try something by myself. The first obvious idea to me was a home security system to inspect your house while you are away.

The final thingy is able to detect and roughly localize any motion through a camera. It takes photos and mails them to your email account. Plus, we are able to interact with it in our local network using a simple web interface so we are able to activate or deactivate it in front of home door. I assume that if someone is able to reach the local wifi network, most probably s/he is one of us (fair enough ?). This is the final look of my raspi.

Error-Driven Incremental Learning with Deep CNNs

This paper posits a way of incremental training of a network where you have continuous flow of new data categories.  they propose two main problems related to that problem.  First, with increasing number of instances we need more capacitive networks which are hard to train compared to small networks. Therefore starting with a small network and gradually increasing its size seems feasible. Second is to expand the network instead of using already learned features in new tasks. For instance, if you would like to use a pre-trained ImageNet network to your specific problem using it as a sole feature extractor does not reflect the real potential of the network. Instead, training it as it goes wild with the new data is a better choice.

They also recall the forgetting problem when new data is proposed to a pre-trained model. Al ready learned features are forgotten with the new data and the problem.

The proposed method here relies on tree-like structures networks as the below figure depicts. The algorithm starts with a pretrained network L0 with K superclasses. When we add new classes (depicted green), we clone network L0 to leaf networks L1, L2 and branching network B. That is, all set of new networks are the exact clone of L0. Then B is the branching network which decides the leaf network to be activated for the given instance. Then activated leaf network leads to the final prediction for the given instance. For partition the classes the idea is to keep more confusing classes together so that the later stages of the network can resolve this confusion.  So any new set of classes with the corresponding instances are passed through the already trained networks and mostly active network by its softmax outputs is selected for that single category to be appended.  Another choice to increase the number of categories is to add the new categories to output layer only by keeping the network capacity the same. When we need to increase the capacity then we can branch the network again and this network stays as a branching network now. When we need to decide the leaf network following that branching network we sum the confidence values of the classes of each leaf network and maximum confidence network is selected as the leaf network. While all these processes, any parameter is transfered from a branching network to leaf networks unless we have some mismatch between category units. Only these mismatch parameters are initialized randomly.

This work proposes a good approach for a scalable learning architecture with new categories coming in time. It both considers how to add new categories and increase the network capacity in a guided manner. One another good side of this architecture is that each of these network can be trained independently so that we can parallelize the training process.

What I read for deep-learning

Today, I spent some time on two new papers proposing a new way of training very deep neural networks (Highway-Networks) and a new activation function for Auto-Encoders (ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF
CO-ADAPTING FEATURES) which evades the use of any regularization methods such as Contraction or Denoising.

Lets start with the first one. Highway-Networks proposes a new activation type similar to LTSM networks and they claim that this peculiar activation is robust to any choice of initialization scheme and learning problems occurred at very deep NNs. It is also incentive to see that they trained models with >100 number of layers. The basic intuition here is to learn a gating function attached to a real activation function that decides to pass the activation or the input itself. Here is the formulation  $T(x,W_t )$ is the gating function and $H(x,W_H)$ is the real activation. They use Sigmoid activation for gating and Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn for 50 layers (this is the most I can for my PC).

The other paper ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF CO-ADAPTING FEATURES suggests the use of non-biased rectifier units for the inference of AEs. You can train your model with a biased Rectifier Unit but at the inference time (test time), you should extract features by ignoring bias term. They show that doing so gives better recognition at CIFAR dataset. They also device a new activation function which has the similar intuition to Highway Networks.  Again, there is a gating unit which thresholds the normal activation function.  The first equation is the threshold function with a predefined threshold (they use 1 for their experiments).  The second equation shows the reconstruction of the proposed model. Pay attention that, in this equation they use square of a linear activation for thresholding and they call this model TLin  but they also use normal linear function which is called TRec. What this activation does here is to diminish the small activations so that the model is implicitly regularized without any additional regularizer. This is actually good for learning over-complete representation for the given data.

For more than this silly into, please refer to papers 🙂 and warn me for any mistake.

These two papers shows a new coming trend to Deep Learning community which is using complex activation functions . We can call it controlling each unit behavior in a smart way instead of letting them fire naively. My notion also agrees with this idea. I believe even more complication we need for smart units in our deep models like Spike and Slap networks.

Comparison: SGD vs Momentum vs RMSprop vs Momentum+RMSprop vs AdaGrad

In this post I'll briefly introduce some update tricks for training of your ML model. Then, I will present my empirical findings with a linked NOTEBOOK that uses 2 layer Neural Network on CIFAR dataset.

I assume at least you know what is Stochastic Gradient Descent (SGD). If you don't, you can follow this tutorial .  Beside, I'll consider some improvements of SGD rule that result better performance and faster convergence.

SGD is basically a way of optimizing your model parameters based on the gradient information of your loss function (Means Square Error, Cross-Entropy Error ... ). We can formulate this;

$w(t) = w(t-1) - epsilon * bigtriangleup w(t)$

$w$ is the model parameter, $epsilon$ is learning rate and $bigtriangleup w(t)$ is the gradient at the time $t$.

SGD as itself  is solely depending on the given instance (or the batch of instances) of the present iteration. Therefore, it  tends to have unstable update steps per iteration and corollary convergence takes more time or even your model is akin to stuck into a poor local minima.

To solve this problem, we can use Momentum idea (Nesterov Momentum in literature). Intuitively, what momentum does is to keep the history of the previous update steps and combine this information with the next gradient step to keep the resulting updates stable and conforming the optimization history. It basically, prevents chaotic jumps.  We can formulate  Momentum technique as follows;

$v(t) = alpha v(t-1) - epsilon frac{partial E}{partial w}(t)$  (update velocity history with the new gradient)

$bigtriangleup w(t) = v(t)$ (The weight change is equal to the current velocity)

$alpha$ is the momentum coefficient and 0.9 is a value to start. $frac{partial E}{partial w}(t)$ is the derivative of $w$ wrt. the loss.

Okay we now soothe wild SGD updates with the moderation of Momentum lookup. But still nature of SGD proposes another potential problem. The idea behind SGD is to approximate the real update step by taking the average of the all given instances (or mini batches). Now think about a case where  a model parameter gets a gradient of +0.001 for each  instances then suddenly it gets -0.009 for a particular instance and this instance is possibly a outlier. Then it destroys all the gradient information before. The solution to such problem is suggested by G. Hinton in the Coursera course lecture 6 and this is an unpublished work even I believe it is worthy of.  This is called RMSprop. It keeps running average of its recent gradient magnitudes and divides the next gradient by this average so that loosely gradient values are normalized. RMSprop is performed as below;

$MeanSquare(w,t) =0.9 MeansSquare(w, t-1)+0.1frac{partial E}{partial w}(t)^2$

$bigtriangleup w(t) = epsilonfrac{partial E}{partial w}(t) / (sqrt{MeanSquare(w,t)} + mu)$

$mu$ is a smoothing value for numerical convention.

You can also combine Momentum and RMSprop by applying successively and aggregating their update values.

$w_i(t) = w_i(t-1) + frac{epsilon}{sum_{k=1}^{t}sqrt{{g_{ki}}^2}} * g_i$  where $g_{i} = frac{partial E}{partial w_i}$

So tihe upper formula states that, for each feature dimension, learning rate is divided by the all the squared root gradient history.

Now you completed my intro to the applied ideas in this NOTEBOOK and you can see the practical results of these applied ideas on CIFAR dataset. Of course this into does not mean complete by itself. If you need more refer to other resources. I really suggest the Coursera NN course by G. Hinton for RMSprop idea and this notes for AdaGrad.

For more information you can look this great lecture slide from Toronto Group.

Lately, I found this great visualization of optimization methods. I really suggest you to take a look at it.

ML Work-Flow (Part 5) – Feature Preprocessing

We already discussed first four steps of ML work-flow. So far, we preprocessed crude data by DICTR (Discretization, Integration, Cleaning, Transformation, Reduction), then applied a way of feature extraction procedure to convert data into machine understandable representation, and finally divided data into different bunches like train and test sets . Now, it is time to preprocess feature values and make them ready for the state of art ML model ;).

We need Feature Preprocessing in order to:

1. Evade scale differences between dimensions.
2. Convey instances into a bounded region in the space.
3. Remove correlations between different dimensions.

You may ask “Why are we so concerned about these?” Because

1. Evading scale differences reduces unit differences between particular feature dimensions. Think about Age and Height of your customers. Age is scaled in years and Height is scaled in cm's. Therefore, these two dimension values are distributed in different manners. We need to resolve this and convert data into a scale invariant representation before training your ML algorithm, especially if you are using one of the linear models like Logistic Regression or SVM (Tree based models are more robust to scale differences).
2. Conveying instances into a bounded region in the space resolves the representation biases between instances. For instance, if you work on a document classification problem with bag of words representation then you should care about document length since longer documents include more words which result in more crowded feature histograms. One of the reasonable ways to solve this issue is to divide each word frequency by the total word frequency in the document so that we can convert each histogram value into a probability of seeing that word in the document. As a result, document is represented with a feature vector that is 1 in total of its elements. This new space is called vector space model in the literature.
3. Removing correlations between dimensions cleans your data from redundant information exposed by multiple feature dimensions. Hence data is projected into a new space where each dimension explains something independently important from the other feature dimensions.

Okay, I hope now we are clear why we are concerned about these. Henceforth, I'll try to emphasis some basic stuff in our toolkit for feature preprocessing.

Standardization

• Can be applied to both feature dimensions or data instances.
• If we apply to dimensions, it reduces unit effect and if we apply to instances then we solve instance biases as in the case of the document classification problem.
• The result of standardization is that each feature dimension (instance) is scaled into defined mean and variance so that we fix the unit differences between dimensions.
• $z = (x-mu)/alpha$  : for each dimension (instance),  subtract the mean and divide by the variance of that dimension (instance) so that each dimension is kept inside a mean = 0 , variance = 1 curve.

Min Max Scaling

• Personally, I've not applied Min-Max Scaling to instances,
• It is still useful for unit difference problem.
• Instead of distributional consideration, it hinges the values in the range  [0,1].
• $x_{norm} = (x - x_{min})/(x_{max} - x_{min})$ :  Find max and min values of the feature dimension and apply the formula.

Caveat 1: One common problem of Scaling and Standardization is you need to keep min and max for Scaling, mean and variance values for Standardization for the novel data and the test time. We estimate these values from only the training data and assume that these are still valid for the test and real world data. This assumption might be true for small problems but especially for online environment this caveat should be dealt with a great importance.

Sigmoid Functions

• Sigmoid function naturally fetches given values into a [0, 1] range
• Does not need any assumption about the data like mean and variance
• It penalizes large values  more than the small ones.
• You can use other activation functions like tanh.

Caveat 2: How to choose and what to choose are very problem dependent questions. However, if you have a clustering problem then standardization seems more reasonable for better similarity measure between instance and if you intend to use Neural Networks then some particular kind of NN demands [0,1] scaled data (or even more interesting scale ranges for better gradient propagation on the NN model). Also, I personally use sigmoid function for simple problems in order to get fast result by SVM without complex investigation.

Zero Phase Component Analysis (ZCA Whitening)

• As I explained before, whitening is a process to reduce redundant information by decorrelating data with a final diagonal correlation matrix with preferable all diagonals are one.
• It has especially very important implications in Image Recognition and Feature Learning  so as to make visual cues more concrete on images.
• Instead of formula, it is more intuitive to wire some code Covariance Matrices before and after ZCA

I tried to touch some methods and common concerns of feature preprocessing, by no means  complete. Nevertheless, a couple of takeaways from this post are; do not ignore normalizing your feature values before going into training phase and choose the correct method by investigating the values painstakingly.

PS: I actually promised to write a post per week but I am as busy as a bee right now and I barely find some time to write a new stuff. Sorry about it 🙁

ML WORK-FLOW (Part2) - Data Preprocessing

I try to keep my promised schedule on as much as possible. Here is the detailed the first step discussion of my proposed Machine Learning Work-Flow, that is Data Preprocessing.

Data Preprocessing is an important step in which mostly aims to improve raw data quality before you dwell into the technical concerns. Even-though this step involves very easy tasks to do, without this, you might observe very false or even freaking results at the end.

I also stated at the work-flow that, Data Preprocessing is statistical job other than ML. By saying this, Data Preprocessing demands good data inference and analysis just before any possible decision you made. These components are not the subjects of a ML course but are for a Statistics. Hence, if you aim to be cannier at ML as a whole, do not ignore statistics.

We can divide Data Preprocessing into 5 different headings;

1. Data Integration
2. Data Cleaning
3. Data Transformation
4. Data Discretization
5. Data Reduction

Why I chose industry over academy

In general, if I need to choose something over some other thing I enlist the positive and negative facts about options and have a basic summation to find out the correct one.

Here I itemize my subjective pros and cons list. Maybe you might find it skewed or ridiculous but these are based on my 3 years of hard core academic effort and 2 years in industry (sum of my partial efforts). I think they present at least some of the obstacles you would see in the both worlds.

Pros--

1. Academic life is the best in terms of freedom at work. You choose your study topic, at least to some extent, you team-up and follow the boundaries of human knowledge so as to extend it a bit. This is a very respectful and curious search. For sure, it is better than having a boss choosing your way to go. However , even this freedom is limited as in  the below comic 🙂 2. Dresscode. Yes, academy is not so certain to define a particular dresscode for you, in most cases. You are free to put on your comfortable shorts and flip-flops and go to your office to work. However, it should be pointed out that present industry also realized the idiocy of strict dresscodes and it provides better conditions for the employees as well. Yet, business is still not comparable with the academy. 3. Travel around the world with conferences, summer-schools, meetings, internships at low-cost. Meet people around the globe and feel the international sense.
4. Respectful job. It urges the sense of respect as you say you are an academic and  people usually assume you are more intelligent than the most, thanks to great scientist ancestors.
5. Set your schedule. Schedule of an academic is more flexible and you have a bit of freedom to define your work time.
6. Teaching. It is really great to envision young people with your knowledge and experience. Even-more, it is a vital role in a society since you are able to shape the future with the young people you touch.
7. Elegant social circle. Being an academic chains you with a social circle of people with a similar education level and supposedly similar level of cultivation. That of course does not mean that the industry consists of the ignorant but living in corporate life is more susceptible to facing unfortunate minds.

Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"

---- I am living the joy of seeing my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes worthwhile all your day to night efforts, REALLY!!!. Before start, I shall thank to my supervisor Pinar Duygulu for her great guidance.----

In this post, I would like to summarize the title work since I believe sometimes a friendly blog post might be more expressive than a solid scientific article.

"ConceptMap: Mining noisy web data for concept learning" proposes a pipeline so as to learn wide range of visual concepts by only defining a query to a image search engine. The idea is to query a concept at the service and download a huge bunch of images. Cluster images as removing the irrelevant instances. Learn a model from each of the clusters. At the end, each concept is represented by the ensemble of these classifiers. Continue reading Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"