# XNOR-Net

32x memory saving and 58x faster convolution operations. The Binary-Weight version of AlexNet loses only 2.9% Top-1 accuracy compared to the full-precision version. Binarizing both inputs and weights (XNOR-Net) widens the gap to 12.5%.
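As a quick sanity check on where the 32x figure comes from, here is a minimal sketch that stores only the sign of each 32-bit float weight as a single bit. The array size and the use of `np.packbits` are my own choices for illustration, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)  # full-precision weights

# Keep only the sign of each weight: 1 bit instead of 32.
sign_bits = weights >= 0           # boolean array, True stands for +1
packed = np.packbits(sign_bits)    # 8 signs stored per byte

full_bytes = weights.nbytes        # 1024 weights * 4 bytes
binary_bytes = packed.nbytes       # 1024 weights / 8 per byte
ratio = full_bytes / binary_bytes
print(full_bytes, binary_bytes, ratio)  # 4096 128 32.0
```

The per-filter scaling factor adds a little overhead on top of this, so 32x is the upper bound for the weights alone.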

When the weights are binary, the convolution operation can be approximated using only addition and subtraction. Binary-Weight networks can fit into mobile devices, with a 2x speed-up on the operations.

To take the idea further, XNOR-Net uses both binary weights and binary inputs. When both are binary, convolution can be computed with XNOR and bitcount operations. This enables CPU-time inference and training of even state-of-the-art models.

Here they give a good summary of compressing models into smaller sizes.

1. Shallow networks -- estimate deep models with shallower architectures, using methods such as information distillation.
2. Compressing networks -- compression of larger networks.
    1. Weight Decay [17]
    2. Optimal Brain Damage [18]
    3. Optimal Brain Surgeon [19]
    4. Deep Compression [22]
    5. HashNets [23]
3. Design compact layers -- keep the network minimal from the beginning.
    1. Decomposing 3x3 layers into two 1x1 layers [27]
    2. Replacing 3x3 layers with 1x1 layers, achieving 50% fewer parameters.
4. Quantization of parameters -- high precision is not so important for good results in deep networks [29].
    1. 8-bit values instead of 32-bit float weight values [31]
    2. Ternary weights and 3-bit activations [32]
    3. Quantization of layers with L2 loss [33]
5. Network binarization --
    1. Expectation Backpropagation [36]
    2. Binary Connect [38]
    3. BinaryNet [11]
    4. Retraining of a pre-trained model [41]

## Binary-Weight-Net

Binary-Weight-Net approximates the real-valued weights of a layer as $W \approx \alpha B$, where $\alpha$ is a scaling factor and every entry of $B$ is in $\{+1, -1\}$. Since the values of $B$ are binary, we can perform the convolution operation with only addition and subtraction.

$I * W \approx (I \oplus B)\,\alpha$

With the details given in the paper:

$B = \operatorname{sign}(W)$ and $\alpha = \frac{1}{n}\|W\|_{\ell_1}$
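To make the two formulas concrete, here is a minimal NumPy sketch. The function name `binarize_weights` and the choice of mapping `sign(0)` to +1 are my assumptions; the paper may handle zeros differently. The comparison at the end checks that the mean absolute value is a better scale than the naive choice of 1.

```python
import numpy as np

def binarize_weights(W):
    """Return (B, alpha) with B = sign(W) in {-1, +1} and alpha = (1/n) * ||W||_1."""
    B = np.where(W >= 0, 1.0, -1.0)   # sign(0) is mapped to +1 (assumption)
    alpha = np.abs(W).mean()          # l1 norm divided by the number of entries
    return B, alpha

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))       # one made-up 3x3 filter
B, alpha = binarize_weights(W)

# Squared approximation error with the l1-based scale versus a scale of 1.
err_scaled = np.sum((W - alpha * B) ** 2)
err_unit = np.sum((W - B) ** 2)
print(err_scaled <= err_unit)         # True
```

For a fixed $B = \operatorname{sign}(W)$, the squared error is a quadratic in the scale whose minimum sits exactly at the mean absolute value, which is why the scaled version can never be worse here.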

Training a Binary-Weight-Net involves 3 main steps: forward pass, backward pass, and parameter update. The weights are binarized in both the forward and backward passes, but the updates are applied to the real-valued weights so that small changes still take effect.
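The three steps can be sketched on a toy linear regression problem. This is only an illustration under my own assumptions: the data, the learning rate and the target weights are made up, and the gradient with respect to the binarized weights is copied straight onto the real-valued weights, which is a simplification of the gradient used in the paper.

```python
import numpy as np

def binarize(W):
    """Scaled binary weights: alpha * sign(W), with alpha = mean(|W|)."""
    alpha = np.abs(W).mean()
    return alpha * np.where(W >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))                        # toy inputs
true_w = np.array([0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, -0.5])
y = X @ true_w                                           # toy targets

W = 0.1 * rng.standard_normal(8)   # real-valued weights, kept between steps
lr = 0.05                          # hypothetical learning rate
losses = []
for step in range(200):
    W_bin = binarize(W)                     # 1. forward pass uses binarized weights
    err = X @ W_bin - y
    losses.append(np.mean(err ** 2))
    grad_bin = 2.0 * X.T @ err / len(X)     # 2. backward pass w.r.t. the binarized weights
    W -= lr * grad_bin                      # 3. update is applied to the real-valued weights

print(losses[0], losses[-1])
```

The point of step 3 is visible in the loop: a single update rarely flips a sign, so without the real-valued copy of `W` those small steps would simply be lost.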

## XNOR-Networks

At this stage, the idea is extended and the input values are also binarized, so that the convolution reduces to the binary operations XNOR and bitcount. Basically, the input values are binarized in the same way as the weights before: the sign operation gives the binary mapping, and the scale is estimated from the l1 norm of the input values.

$C = \operatorname{sign}(X^T)\operatorname{sign}(W) = H^T B$

$\gamma \approx \left(\frac{1}{n}\|X\|_{\ell_1}\right)\left(\frac{1}{n}\|W\|_{\ell_1}\right) = \beta \alpha$

where $\gamma$ is the scaling factor and $C$ is the binary mapping of the feature map after convolution.
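Here is a small sketch of why XNOR and bitcount are enough for the binary part $H^T B$. The helper names `to_bits` and `xnor_dot` are hypothetical, and I use plain Python integers as bit masks for readability; a real implementation would use packed machine words and a hardware popcount.

```python
import numpy as np

def to_bits(v):
    """Encode a {-1, +1} vector as an integer bit mask (bit i set means v[i] == +1)."""
    mask = 0
    for i, value in enumerate(v):
        if value > 0:
            mask |= 1 << i
    return mask

def xnor_dot(h_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n, computed from their bit masks."""
    agree = ~(h_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked down to n bits
    matches = bin(agree).count("1")              # bitcount
    return 2 * matches - n                       # matches minus mismatches

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)         # made-up input patch
w = rng.standard_normal(n)         # made-up filter
h = np.where(x >= 0, 1, -1)        # H = sign(X)
b = np.where(w >= 0, 1, -1)        # B = sign(W)

binary_dot = xnor_dot(to_bits(h), to_bits(b), n)
beta, alpha = np.abs(x).mean(), np.abs(w).mean()
approx = beta * alpha * binary_dot  # scaled binary estimate of the real dot product
print(binary_dot == int(h @ b))     # True
```

Every position where the two signs agree contributes +1 and every disagreement contributes -1, so the dot product is `2 * matches - n`. How close `approx` gets to the real-valued `x @ w` depends on the data, so I make no claim about that here.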

I am too lazy to go into more detail here. For more, including implementation details, have a look at the paper.

For works like this, it is always a pain to replicate the results. I hope they will release some code to serve as a basis. Other than this, using such tricks to compress gargantuan deep models into more moderate sizes is very useful for small groups who have no GPU back-end like the big companies, or for deploying such models on small computing devices. Given such a boost in computing time and such a small memory footprint, it is tempting to train such models as a big ensemble and compare it against a single full-precision model.

# My Notes - Weight Normalization

Deep learning is defined (Goodfellow et al., 2016) as a sub-field of machine learning that consists of learning models that are wholly or partially specified by a class of flexible differentiable functions.
In this study there are three main methods: Weight Normalization, a new data-dependent initialization method, and Mean-Only Batch Normalization.
Weight normalization is formalized as $w = \frac{g}{\|v\|} v$: each weight vector $w$ is decoupled into its norm $g$ and its direction $v / \|v\|$. They propose that SGD converges faster with this reparameterization.
They compare Weight Normalization with Batch Normalization. The main disadvantage they posit is that BN is stochastic because the data batches vary; one additional difference is that WN has a lower computational burden than BN.
The second perk is the data-dependent initialization of the network. They first feed an initial minibatch to the network and compute the mean and standard deviation of the activations per layer. Then, given initial weight values sampled with mean 0 and std 0.05, they set g = 1 / std and b = - mean / std.
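
The reparameterization and the initialization can be sketched together for a single dense layer. This is a minimal sketch under my own assumptions: the minibatch is made up, statistics are taken per output unit, and `weight_norm_layer` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 20)) * 3.0 + 1.0   # initial minibatch (made up)

v = rng.normal(0.0, 0.05, size=(20, 10))         # direction parameters, mean 0 and std 0.05
v_norm = np.linalg.norm(v, axis=0)               # one norm per output unit

# Pre-activations with unit-norm weight vectors (g = 1, b = 0).
t = X @ (v / v_norm)
mean, std = t.mean(axis=0), t.std(axis=0)

# Data-dependent initialization: g = 1 / std and b = -mean / std.
g = 1.0 / std
b = -mean / std

def weight_norm_layer(X, v, g, b):
    """y = X w + b with w = g * v / ||v||, one (g, b) pair per output unit."""
    w = g * v / np.linalg.norm(v, axis=0)
    return X @ w + b

y = weight_norm_layer(X, v, g, b)
print(np.allclose(y.mean(axis=0), 0.0), np.allclose(y.std(axis=0), 1.0))  # True True
```

On the initial minibatch the outputs come out with zero mean and unit standard deviation per unit, which is exactly the batch dependence the text is about: other batches will only match this approximately.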
One downside is that, since this scheme is batch-dependent, it might suffer on forthcoming batches with possibly different data statistics. However, they say that this scheme works well in practice.
The third perk is Mean-Only Batch Normalization.
This is a lighter operation because it avoids variance normalization. We can easily skip variance normalization because the initialization scheme has already applied it. Another upside is that avoiding variance normalization gives less noisy gradient feedback and therefore better learning.
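
A minimal sketch of the mean-only variant, assuming per-unit statistics over the minibatch; the function name and the toy pre-activations are mine.

```python
import numpy as np

def mean_only_batch_norm(t, b):
    """Subtract the minibatch mean per unit and add a bias; no variance scaling."""
    return t - t.mean(axis=0) + b

rng = np.random.default_rng(0)
t = rng.standard_normal((128, 4)) * 2.5 + 7.0   # pre-activations (made up)
b = np.zeros(4)                                 # learned bias, zero here
out = mean_only_batch_norm(t, b)

print(np.allclose(out.mean(axis=0), 0.0))            # True: the mean is removed
print(np.allclose(out.std(axis=0), t.std(axis=0)))   # True: the scale is untouched
```
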
On the experimental side, they note that batch normalization is 16% slower than weight normalization, whereas BN makes better progress, especially in the initial iterations. As a final remark, they report 7.31% test error on CIFAR-10, which is the state of the art to my knowledge (not better than my best network :)) in terms of published works. They also experiment with different architectures such as RNNs, reinforcement learning and others, but please refer to the paper for more.

# My Notes - SqueezeNet: AlexNet accuracy with 100X smaller network

My notes on Evernote.

# NegOut: Substitute for MaxOut units

Maxout [1] units are well-known and frequently used tools for Deep Neural Networks. For those who do not know them, a basic explanation: a Maxout unit is a set of internal activation units that compete with each other for each instance; the activation of the winner is propagated as the output and the losers are kept silent. In the backpropagation phase, this means we update only the winner unit. It also means, implicitly, that we always prefer to back-propagate the gradient signal through the strongest path. This is an important aspect of Maxout units, especially for very deep models, which are prone to gradient instability.

Although Maxout units have very good properties like the ones I mentioned (please refer to the paper for more details), I am a proactive sceptic of their ability to encode the underlying information and pass it to the next layer. Here is a very simple example. Suppose we have two competing functions (filters) in a Maxout unit. One of these functions responds to edge structures whereas the other responds to corners. For one instance, the first filter might be the winner with a value of, let's say, ~3, which means the Maxout output is also ~3. For another instance, the other function is the winner with approximately the same value, ~3. If we assume that each NN layer is a classifier that takes the previous layer's output as a feature vector (not a very wrong assumption, I guess), then we basically give the same value for different detections in a particular feature dimension (the one corresponding to our Maxout unit). Eventually, we cannot expect the next layer to be able to discern this signal.

One can argue that we should evaluate a Maxout unit as a whole, and that it is reminiscent of an OR function on top of multiple filters. This is a valid argument which I cannot refute directly, but the problem I indicated above is still up in the air. Besides, why would we waste our expensive NN parameters if we could come up with a better encoding scheme for Maxout units?

Here is one alternative approach for better encoding of competing functions, which we call NegOut. Let's assume we fix an ordering of the two competing functions in advance, as 1st and 2nd. If the winner is the 1st function, NegOut outputs the 1st function's value; otherwise it outputs the negative of the 2nd function's value. NegOut relies on two assumptions. First, the competing functions are never negative (like ReLU functions). Second, we have exactly 2 competing functions.

If we consider the backpropagation signal, the only difference from a Maxout unit is that we negate the gradient signal for the 2nd competing unit when it is the winner.
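The forward and backward rules described above can be sketched in a few lines of NumPy. The function names are hypothetical, and sending ties to the 1st unit is my own assumption, since the post does not say how ties are broken.

```python
import numpy as np

def negout_forward(a1, a2):
    """a1, a2: non-negative activations of the 1st and 2nd competing units."""
    first_wins = a1 >= a2                    # ties go to the 1st unit (assumption)
    out = np.where(first_wins, a1, -a2)      # the 2nd unit's value is negated when it wins
    return out, first_wins

def negout_backward(grad_out, first_wins):
    """Route the gradient to the winner; negate it when the 2nd unit won."""
    grad_a1 = np.where(first_wins, grad_out, 0.0)
    grad_a2 = np.where(first_wins, 0.0, -grad_out)
    return grad_a1, grad_a2

a1 = np.array([3.0, 0.5, 2.0])               # e.g. edge detector responses
a2 = np.array([1.0, 3.0, 2.5])               # e.g. corner detector responses
out, first_wins = negout_forward(a1, a2)
maxout = np.maximum(a1, a2)

print(out.tolist())      # [3.0, -3.0, -2.5]
print(maxout.tolist())   # [3.0, 3.0, 2.5]
```

In the first two instances Maxout emits the same value 3 for two different winners, while NegOut emits 3 and -3, so the sign tells the next layer which detector fired.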

As you can see from the figure, the inherent property here is that different winner detectors produce different output values, so the value captures both the structural difference and the strength of the winning activation.

I performed some experiments on CIFAR-10 and MNIST comparing a Maxout network with a NegOut network, using exactly the same architectures as explained in the Maxout paper [1]. The table below summarizes the results I observed in the initial runs, without any fine-tuning or hyper-parameter optimization yet. More comparisons on larger datasets are still in progress.

NegOut gives better results on CIFAR, although it is slightly worse on MNIST. Again, notice that no tuning has taken place for our NegOut network, whereas the Maxout network is optimized as described in the paper [1]. In addition, the NegOut network uses 2 competing sets of units (as it is constrained by its nature) for the last FC layer, in comparison to the Maxout net, which uses 5 competing units. My expectation is to see a larger difference as we move to larger models and datasets since, as we scale up, representational power matters more for better results.

Here, I tried to give a basic sketch of my recent work, which is by no means complete. Different observations and experiments are still running. I also need to include LWTA [2] to be fairer and to cover a wider range of competing units. Please feel free to share your thoughts as well. Any contribution is appreciated.

PS: Lately, I have devoted myself to analyzing the internal dynamics of Neural Networks with different architectures, layers and activation functions. The aim is to look under the hood and analyse ideas applied to Deep Neural Networks that intuitively seem to work well. I also expect to share more of my findings on my blog.

[1] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio. Maxout Networks. arXiv preprint arXiv:1302.4389

[2] R. K. Srivastava, J. Masci, F. Gomez, J. Schmidhuber. Understanding Locally Competitive Networks. http://arxiv.org/abs/1410.1165

# Stochastic Gradient formula for different learning algorithms

Thanks to http://research.microsoft.com/pubs/192769/tricks-2012.pdf

# FAME: Face Association by Model Evolution - CVPR presentation.

My work on learning people's faces from a simple query to an image search engine. Just search for the name on Google, prune the irrelevant images iteratively and train a final classifier.

Here is G. Hinton's talk at MIT about the inabilities of Convolutional Neural Networks and 4 basic arguments for solving them.

I watched it while slightly distracted, so I need to go over it again. However, these are the basic arguments that G. Hinton proposed during the talk.

1.  CNN + Max Pooling is not the way the human brain handles visual information. Yes, it works in practice for the current state of the art, but viewpoint changes of the target objects in particular are still unsolved.

2. Apply equivariance instead of invariance. Instead of learning representations that are invariant to viewpoint changes, learn representations that change in correlation with the viewpoint changes.

3. In the space of CNN weight matrices, viewpoint changes are totally non-linear and therefore hard to learn. However, if we transfer instances into a space where viewpoint changes are globally linear, we can ease the problem. (Use the representation that graphics uses, with explicit pose coordinates.)

4. Route information to the right set of neurons instead of unguided forward and backward passes. Define certain neuron groups (called capsules) that are receptive to a particular set of data clusters in the instance space; each of these capsules contributes to the whole model in proportion to the given instance's membership in that capsule's cluster.

# FAME: Face Association through Model Evolution

Here, I summarize a new method called FAME for learning face models from a noisy set of web images. I am studying this for my MS thesis. As a little intro to my thesis: the title is "Mining Web Images for Concept Learning" and it introduces two new methods for automatic learning of visual concepts from noisy web images. The first proposed method is FAME; the other, ConceptMap, was presented here before and has been accepted to ECCV14 (self promotion :)).

Before I start, I should note that FAME is not a fully finished work and is waiting for your valuable comments. Please leave a comment about anything you find useful, ridiculous, awkward or great.

In this work, we tackle the problem of learning face models for public figures from images collected from the web by querying a particular person's name. The collected images are called weakly-labelled because of the rough labelling given by the query. However, the data is very noisy even after face detection, with false detections or several irrelevant faces ...

# Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"

---- I am living the joy of seeing my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes all your day-and-night efforts worthwhile, REALLY!!! Before I start, I would like to thank my supervisor Pinar Duygulu for her great guidance. ----

In this post, I would like to summarize the work in the title, since I believe a friendly blog post can sometimes be more expressive than a solid scientific article.

"ConceptMap: Mining noisy web data for concept learning" proposes a pipeline for learning a wide range of visual concepts by only defining a query to an image search engine. The idea is to query a concept at the service and download a huge bunch of images, cluster the images while removing the irrelevant instances, and learn a model from each of the clusters. At the end, each concept is represented by the ensemble of these classifiers.