# Fighting against class imbalance in a supervised ML problem.

ML on imbalanced data

given a imbalanced learning problem with a large class and a small class with number of instances N and M respectively;

• cluster the larger class into M clusters and use cluster centers for training the model.
• If it is a neural network or some compatible model. Cluster the the large class into K clusters and use these clusters as pseudo classes to train your model. This method is also useful for training your network with small number of classes case. It pushes your net to learn fine-detailed representations.
• Divide large class into subsets with M instances then train multiple classifiers and use the ensemble.
• Hard-mining is a solution which is unfortunately akin to over-fitting but yields good results in some particular cases such as object detection. The idea is to select the most confusing instances from the large set per iteration. Thus, select M most confusing instances from the large class and use for that iteration and repeat for the next iteration.
• For specially batch learning, frequency based batch sampling might be useful. For each batch you can sample instances from the small class by the probability M/(M+N) and N/(M+N) for tha large class so taht you prioritize the small class instances for being the next batch. As you do data augmentation techniques like in CNN models, mostly repeating instances of small class is not a big problem.

Note for metrics, normal accuracy rate is not a good measure for suh problems since you see very high accuracy if your model just predicts the larger class for all the instances. Instead prefer ROC curve or keep watching Precision and Recall.

Please keep me updated if you know something more. Even, this is a very common issue in practice,  still hard to find a working solution.

Share

# How many training samples we observe over life time ?

In this post, I like to compute what number of visual instances we observes over time, with the assumption that we visually perceive life as a constant video with a certain fps rate.

Let's dive into the computation. Relying on [1],  average person can see the world with 45 fps on average. It goes to extremes for such people like fighter pilots which is 225fps with the adrenaline kicked in.  I took the average life time 71 years [3] equals to $2239056000$ (2 .24 billion) secs and we are awake almost $2/3$ of  it which makes $1492704000$ (1.49 billion) secs .  Then we assume that on average there are $86*10^9$ neurons in our brain [2]. This is our model size.

Eventually and roughly, that means without any further investigation, we have a model with 86 billion parameters which learns from  $1492704000 * 45 = 67171680000$  almost 67 billion images.

Of course this is not a convenient way to come with this numbers but fun comes by ignorance

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2826883/figure/F2/

[2] http://www.ncbi.nlm.nih.gov/pubmed/19226510

[3] http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/

# ParseNet: Looking Wider to See Better

In this work, they propose two related problems and comes with a simple but functional solution to this. the problems are;

1. Learning object location on the image with Proposal + Classification approach is very tiresome since it needs to classify >1000 patched per image. Therefore, use of end to end pixel-wise segmentation is a better solution as proposed by FCN (Long et al. 2014).
2. FCN oversees the contextual information since it predicts the objects of each pixel independently. Therefore, even the thing on the image is Cat, there might be unrelated predictions for different pixels. They solve this by applying Conditional Random Field (CRF) on top of FCN. This is a way to consider context by using pixel relations.  Nevertheless, this is still not a method that is able to learn end-to-end since CRF needs additional learning stage after FCN.

Based on these two problems they provide ParseNet architecture. It declares contextual information by looking each channel feature map and aggregating the activations values.  These aggregations then merged to be appended to final features of the network as depicted below;

Their experiments construes the effectiveness of the additional contextual features.  Yet there are two important points to consider before using these features together. Due to the scale differences of each layer activations, one needs to normalize first per layer then append them together.  They L2 normalize each layer's feature. However, this results very small feature values which also hinder the network to learn in a fast manner.  As a cure to this, they learn scale parameters to each feature as used by the Batch Normalization method so that they first normalize and scale the values with scaling weights learned from the data.

The takeaway from this paper,  for myself, adding intermediate layer features improves the results with a correct normalization framework and as we add more layers, network is more robust to local changes by the context defined by the aggregated features.

They use VGG16 and fine-tune it for their purpose, VGG net does not use Batch Normalization. Therefore, use of Batch Normalization from the start might evades the need of additional scale parameters even maybe the L2 normalization of aggregated features. This is because, Batch Normalization already scales and shifts the feature values into a common norm.

Note: this is a hasty used article sorry for any inconvenience or mistake or stupidly written sentences.

# Think "Turing Test" in another way ?

After some crawling on the Internet, I stumbled upon this thread on Quora. For the lazy ones, the thread is about the things that can be done by humans but not by computers after N years. There are many references to Turing Test in answers stating that the best AI is still not able to pass Turing Test; therefore we do not need to worry about AI being an existential threat for the humanity. First off, I ought to say that I am on the cautious side (like Elon Musk and Bill Gates) on AI being a threat. To explain myself, I would like to show that AI is a threat that has begun to affect, even we think the Turing Test as the validation method. We only need to think in a different way to verify the test.

For the ones who don't know what Turing Test is;  A and B (one machine - one human) are hidden from the human observer C. Looking at the interaction between  A and  B; the observer C tries to decide which one is human and which is the machine. If observer C cannot decide whether there is a machine or a human behind the curtain; than the machine passes the test. Conclusion is that machine exhibits intelligent behavior equivalent to, or indistinguishable from, that of a human.

From the definition, it is one of the legitimate milestones for AI to compass human capable agents. Therefore, it is normal for people to evaluate present AI to define its state and future potential using Turing Test.

I think a different formation regarding Turing Test where we replace the observer C with a machine as well. Then the remaining question turns out to be, is the machine C able to identify the machine A or even is this identification is necessary henceforth? Thinking the formulation in that way resolves many concerns for the AI supporters who say AI is not a threat since it does not and will not be able to pass Turing Test (at least in the short run). Nevertheless, when we replace C with a machine than the machine does not need to pass Turing Test to be a threat, right? Because we are out of the context like poor B depicted on the above figure.

Now let me explain, what does it mean in practice, changing the observer human with a machine. I believe real life "communication" is a good way to illustrate Turing Test.  Think about the communication history. We started with bare foot messengers and have come to light speed flow of the today's world. At the time, we were sending a message and waiting very long for the response. The reason was the tools were the bottleneck for the communication. First we expedited these tools and come up with new technologies. If we think today, then we see that our tools are so fast that we are the bottleneck of the flow any more. We send our mails and messages in a second that bursts the inboxes and message stacks and consequently bursts us as well. If we also accept that the communication is the bare bone of the today's business world, companies do not want to waste time - time is money- and attempt to replace the slowest part with faster alternatives and so computerized solutions come to stage in place of humanized old fashion solution.  Now, after we changed the tools for communication, we also start to change the sides of the communication up to a point that there is no need for any human being. There, we also have a fancy name for this Internet of "Things" (not humans any more). If you also look to the statistics, we see that huge partition of the data flow is between machine to machine communication.  Could you say, in a more immense level of communication revolution, indistinguishability of a computer agent by a human observer is important? It is clear that we can still devastate our lives by our AI agents without passing Turing Test. You can watch out unemployment rates with the growth of the technological solutions.

Basically, what I try to say here is, yes, Turing Test is a barrier for Sci-Fi level AI threat but we changed the rules of the test by placing machines on the both side of the curtain. That means, there is no place in that test (even in the real life)  for human unless some silly machine cannot  replace you, but be sure it is yet to come.

Final saying, I am an AI guy and of course I am not saying we should stop but it is an ominously proceeding field. The punch card here is to underline the need of introspection of AI and related technologies and finding ways to serve AI for human needs not the contrary or any other way. We should be skeptical and be warned.

# Harnessing Deep Neural Networks with Logic Rules

This work posits a way to integrate first order logic rules with neural networks structures. It enables to cooperate expert knowledge with the workhorse deep neural networks. For being more specific, given a sentiment analysis problem, you know that if there is "but" in the sentence the sentiment content changes direction along the sentence. Such rules are harnessed with the network.

The method combines two precursor ideas of information distilling [Hinton et al. 2015] and posterior regularization [Ganchev et al. 2010].  We have teacher and student networks. They learn simultaneously.  Student networks directly uses the labelled data and learns model distribution P then given the logic rules, teacher networks adapts distribution Q as keeping it close to P but in the constraints of the given logic rules. That projects what is inside P to distribution Q bounded by the logic rules. as the below figure suggests.

I don't like to go into deep math since my main purpose is to give the intuition rather than the formulation. However, formulation follows mathematical formulation of first order logic rules suitable to be in a loss function. Then the student loss is defined by the real network loss (cross-entropy) and the loss of the logic rules with a importance weight.

$\theta$ is the student model weight, the first part of the loss is the network loss and the second part is the logic loss. This function distills the information adapted by the given rules into student network.

Teacher network exploits KL divergence to approximate best Q which is close to P with a slack variable.

Since the problem is convex, solution van be found by its dual form with closed form solution as below.

So the whole algorithm is as follows;

For the experiments and use cases of this algorithm please refer to the paper. They show promising results at sentiment classification with convolution networks by definition of such BUT rules to the network.

My take away is, it is perfectly possible to use expert knowledge with the wild deep networks. I guess the recent trend of deep learning shows the same promise. It seems like our wild networks goes to be a efficient learning and inference rule for large graphical probabilistic models with variational methods and such rules imposing methods.  Still such expert knowledge is tenuous in the domain of image recognition problems.

Disclaimer; it is written hastily without any review therefore it is far from being complete but it targets the intuition of the work to make it memorable for latter use.

# XNOR-Net

32 x memory saving and 58 x faster convolution operation. Only 2.9% performance loss (Top-1) with Binary-Weight version for AlexNet compared to the full precision version. Input and Weight binarization, XNOR-Net, scales the gap to 12.5%.

When the weights are binary convolution operation can be approximated by only summation and subtraction. Binary-Wight networks can fit into mobile devices with 2x speed-up on the operations.

To take the idea further, XNER-Net uses both binary weights and inputs. When both of them binary this allows convolution with XNOR and bitcount operation.  This enable both CPU time inference and training of even state of art models.

Here they give a good summary of compressing models into smaller sizes.

1. Shallow networks --  estimate deep models with shallower architectures with different methods like information distilling.
2. Compressing networks -- compression of larger networks.
1. Weight Decay [17]
2. Optimal Brain Damage [18]
3. Optimal Brain Surgeon [19]
4. Deep Compression [22]
5. HashNets[23]
3. Design compact layers -- From the beginning keep the network minimal
1. Decomposing 3x3 layers to 2 1x1 layers [27]
2. Replace 3x3 layers with 1x1  layers achieving 50% less parameters.
4. Quantization of parameters -- High precision is not so important for good results in deep networks [29]
1. 8-bit values instead of 32-bit float weight values [31]
2. Ternary weights and 3-bits activation [32]
3. Quantization of layers with L2 loss  [33]
5. Network binarization --
1. Expectation Backpropagation [36]
2. Binary Connect [38]
3. BinaryNet [11]
4. Retaining of a pre-trained model  [41]

Binary-Weight-Net

Binary-Weight-Net is defined as a approximateion of real-valued layers as $W \approx \alpha B$ where $\alpha$ is scaling factor and $B \in [+1, -1]$. Since values are binary we can perform convolution operation with only summation and subtraction.

$I*W \approx (I \oplus B ) \alpha$

With the details given in the paper:

$B = sign(W)$ and $\alpha = 1/n||W||_{l1}$

Training of Binary-Weights-Net includes 3 main steps; forward pass, backward pass, parameters update. In both forward and backward stages weights are binarized but for updates real value weights are used to keep the small changes effective enough.

XNOR-Networks

At this stage, the idea is extended and input values are also binarized to reduce the convolution operation cost by using only binary operation XNOR and bitcount.  Basically, input values are binarized as the precious way they use for weight values. Sign operation is used for binary mapping of values and scale values are estimated by l1 norm of input values.

$C = sign(X^T)sign(W) = H^TB$

$\gamma \approx (1/n ||X||_{l1})(1/n||W||_{l1}) = \beta \alpha$

where $\gamma$ is the scale vector and $C$ is binary mapping of the feature mapping after convolution.

I am lazy to go into much more details. For more and implementation details have a look at the paper.

For such works, this is always pain to replicate the results.  I hope they will release some code work for being a basis.  Other then this, using such tricks to compress gargantuan deep models into more moderate sizes is very useful for small groups who has no GPU back-end like big companies or deploy such models into small computing devices.  Given such boost on computing time and small memory footprint, it is tempting  to train such models as a big ensemble and compare against single full precision model.

# Error-Driven Incremental Learning with Deep CNNs

This paper posits a way of incremental training of a network where you have continuous flow of new data categories.  they propose two main problems related to that problem.  First, with increasing number of instances we need more capacitive networks which are hard to train compared to small networks. Therefore starting with a small network and gradually increasing its size seems feasible. Second is to expand the network instead of using already learned features in new tasks. For instance, if you would like to use a pre-trained ImageNet network to your specific problem using it as a sole feature extractor does not reflect the real potential of the network. Instead, training it as it goes wild with the new data is a better choice.

They also recall the forgetting problem when new data is proposed to a pre-trained model. Al ready learned features are forgotten with the new data and the problem.

The proposed method here relies on tree-like structures networks as the below figure depicts. The algorithm starts with a pretrained network L0 with K superclasses. When we add new classes (depicted green), we clone network L0 to leaf networks L1, L2 and branching network B. That is, all set of new networks are the exact clone of L0. Then B is the branching network which decides the leaf network to be activated for the given instance. Then activated leaf network leads to the final prediction for the given instance.

For partition the classes the idea is to keep more confusing classes together so that the later stages of the network can resolve this confusion.  So any new set of classes with the corresponding instances are passed through the already trained networks and mostly active network by its softmax outputs is selected for that single category to be appended.  Another choice to increase the number of categories is to add the new categories to output layer only by keeping the network capacity the same. When we need to increase the capacity then we can branch the network again and this network stays as a branching network now. When we need to decide the leaf network following that branching network we sum the confidence values of the classes of each leaf network and maximum confidence network is selected as the leaf network.

While all these processes, any parameter is transfered from a branching network to leaf networks unless we have some mismatch between category units. Only these mismatch parameters are initialized randomly.

This work proposes a good approach for a scalable learning architecture with new categories coming in time. It both considers how to add new categories and increase the network capacity in a guided manner. One another good side of this architecture is that each of these network can be trained independently so that we can parallelize the training process.

# My Notes - Weight Normalization

Deep Learning is defined as (Goodfellow et al., 2016) a sub-field of machine learning consists in learning models that are wholly or partially specified by a class of flexible differentiable functions.
In this study there are three main methods which are Weight Normalization, a new data depended initialization method and Mean Only Batch Normalization.
Weight normalization id formalized as below. Weight values w are decoupled by their norms  g and the direction v / ||v||. In this way they propose that SGD gives faster convergence.
They compare Weight Normalization with Batch Normalization. The main disadvantage they posit that BN has stochasticity due to varying data batches and one additional difference is that WN has lower computational burden compared to BN.
the second perk is data depended initialization of the network. They first give a initial minibatch to network and compute mean activation and std per layer. Then given the initial weight values sampled from mean 0 and std 0.05, they set g = 1 / std and b = - mean / std
One downside is that since this scheme is batch depended, it might suffer for the forthcoming batches with possible different data statistics. However, they say that this scheme works well in practice.
The  third perk is Mean Only Batch Normalization.
This is a lighter operation due to the avoidance of variance normalization. We might easily skip variance normalization because of the initialization scheme already applied it. One another upside is that avodiance of variance normalization provides less distracted gradient feedbacks and therefore better learning.
At the experiments side, they note that batch normalization is 16% slower than weight normalization whereas BN yields better progress especially for initial iterations.  As a final remark they note 7.31% CIFAR-10 performance which is the state of art up to my knowledge (not better then my best network :)) in terms of published works. they also experiment with different architectures like RNNs , reinforcement learning and others but please refer to the paper for more.

# My Notes - SqueezeNet: AlexNet accuracy with 100X smaller network

My notes on Evernote.