Online Hard Example Mining (OHEM) is a way to pick hard examples with reduced computation cost to improve your network performance on borderline cases which generalize to the general performance. It is mostly used for Object Detection. Suppose you like to train a car detector and you have positive (with car) and negative images (with no car). Now you like to train your network. In practice, you find yourself in many negatives as oppose to relatively much small positives. To this end, it is clever to pick a subset of negatives that are the most informative for your network. Hard Example Mining is the way to go to this.
In general, to pick a subset of negatives, first you train your network for couple of iterations, then you run your network all along your negative instances then you pick the ones with the greater loss values. However, it is very computationally toilsome since you have possibly millions of images to process, and sub-optimal for your optimization since you freeze your network while picking your hard instances that are not all being used for the next couple of iterations. That is, you assume here all hard negatives you pick are useful for all the next iterations until the next selection. Which is an imperfect assumption especially for large datasets.
Okay, what Online means in this regard. OHEM solves these two aforementioned problems by performing hard example selection batch-wise. Given a batch sized K, it performs regular forward propagation and computes per instance losses. Then, it finds M<K hard examples in the batch with high loss values and it only back-propagates the loss computed over the selected instances. Smart hah ? 🙂
It reduces computation by running hand to hand with your regular optimization cycle. It also unties the assumption of the foreseen usefulness by picking hard examples per iteration so thus we now really pick the hard examples for each iteration.
If you like to test yourself, here is PyTorch OHEM implemetation that I offer you to use a bit of grain of salt.
Let's directly dive in. The thing here is to use Tensorboard to plot your PyTorch trainings. For this, I use TensorboardX which is a nice interface communicating Tensorboard avoiding Tensorflow dependencies.
First install the requirements;
pip install tensorboard
pip install tensorboardX
Things thereafter very easy as well, but you need to know how you need to communicate with the board to show your training and it is not that easy, if you don't know Tensorboard hitherto.
from tensorboardX import SummaryWriter
writer = SummaryWriter('your/path/to/log_files/')
# in training loop
writer.add_scalar('Train/Loss', loss, num_iteration)
writer.add_scalar('Train/Prec@1', top1, num_iteration)
writer.add_scalar('Train/Prec@5', top5, num_iteration)
# in validation loop
writer.add_scalar('Val/Loss', loss, epoch)
writer.add_scalar('Val/Prec@1', top1, epoch)
writer.add_scalar('Val/Pred@5', top5, epoch)
You can also see the embedding of your dataset
from torchvision import datasets
from tensorboardX import SummaryWriter
dataset = datasets.MNIST('mnist', train=False, download=True)
images = dataset.test_data[:100].float()
label = dataset.test_labels[:100]
features = images.view(100, 784)
writer.add_embedding(features, metadata=label, label_img=images.unsqueeze(1))
This is also how you can plot your model graph. The important part is to give the output tensor to writer as well with you model. So that, it computes the tensor shapes in between. I also need to say, it is very slow for large models.
import torch.nn as nn
import torchvision.utils as vutils
import numpy as np
import torch.nn.functional as F
import torchvision.models as models
from tensorboardX import SummaryWriter
self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
self.conv2_drop = nn.Dropout2d()
self.fc1 = nn.Linear(320, 50)
self.fc2 = nn.Linear(50, 10)
self.bn = nn.BatchNorm2d(20)
def forward(self, x):
x = F.max_pool2d(self.conv1(x), 2)
x = F.relu(x)+F.relu(-x)
x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
x = self.bn(x)
x = x.view(-1, 320)
x = F.relu(self.fc1(x))
x = F.dropout(x, training=self.training)
x = self.fc2(x)
x = F.log_softmax(x)
model = Mnist()
# if you want to show the input tensor, set requires_grad=True
res = model(torch.autograd.Variable(torch.Tensor(1,1,28,28), requires_grad=True))
writer = SummaryWriter()
ReLU is defined as a way to train an ensemble of exponential number of linear models due to its zeroing effect. Each iteration means a random set of active units hence, combinations of different linear models. They discuss, relying on the given observation, it might be useful to remove non-linearities for some layers and letting them to learn combination of linearities as the whole layer.
Another argument as poised, some representations are hard to approximate by a stack of non-linear layers. as shown by He et al. 2016. To this end, letting linearities for a subset of layers might ameliorate the condition.
The way they apply EraseReLU is removing the last ReLU layer of each "module". "Module" here is defined depending on the model architecture as shown above.
Experiments show that EraseReLU increases the performance of networks and its effect is larger for deeper networks. It is also more resilient to over-fitting for deep networks. The loss curves also show faster convergence for EraseReLU and the difference more obvious for larger datasets.
My 2 cents: Results are not that different on ImageNet but still better to the favor of EraseReLU. Then it might be the case of lucky shoot since there is no confidence interval or variance given for the trainings.
Faster convergence makes sense with the help of second guessing after the paper. Since there are more active units possible it entails to propagate more gradients. However, all such comments assumes that error signals are always positive. Which is very unlikely. Therefore, more open valves might cause more chaotic back-propagation signal.
Yet it is very simple idea, it shows faster convergence, better results and a good investgation of ReLU function. It think it is useful and can take its position in my next training session.
Disclaimer: This is written hastily in 10 mins. If you think something wrong or even worse let me know :).
There are numerous on-line and off-line technical resources about deep learning. Everyday people publish new papers and write new things. However, it is rare to see resources teaching practical concerns for structuring a deep learning projects; from top to bottom, from problem to solution. People know fancy technicalities but even some experienced people feel lost in the details, once they need to structure their own project.
Andrew Ng’s new initiative deeplearning.ai rushes to help at this stage. deeplearning.ai is a collection of courses teaching deep learning with all the necessary details (great place to start deep learning !!) and one of the courses called “Structuring Machine Learning Project” particularly teaches the design of a deep learning solution intuitively with real-life examples. It is not possible to give a single pill ruling the all but this course establishes a good basis to think about a DL project. And apparently, I still have things to learn from Andrew Ng. after all my years of experience. Note that, I also started my ML career with his famous Coursera ML course :). Thank you Mr. Ng.
Below, I try to plot a diagram summarizing what is mentioned in the course. However be aware that it is not a verbatim depiction and probably includes some of my own measures. In the end, It was a good exercise for me, and hopefully, is a good reference for you.
Please go and check the videos, if the diagram does not make sense at all and ping me if you have something that you don’t like.
Lately, we (TwentyBN) took a part in Activity Net trimmed action recognition challenge. The dataset is called Kinetics and recently released. It is a collection of 10 second YouTube videos. Each video has a single label among 400 different action classes. The dataset released by DeepMind with a baseline 61% Top-1 and 81.3% Top-5. For baseline models please refer to their dataset paper. But, it took 2 months for people to briskly hoist the bar high above.
As you might see above, we have the best Top-5 accuracy with 97% which is ~16% improvement on top of the baseline. The average of Top-1 and Top-5 decides the leader-board which places us to 3rd place. Yet, it is a great result for us where we could dabble only 2 weeks with limited juice. Team matters here!! Thx to my mates Raghav Goyal and Valentin Haenel for being great.
Here, I like to succinctly describe our novel network architecture. It has the best single network performance. (We plan to share a more detailed description in a separate Medium soon.) Namely, it is called BesNet due to a cheap cryptographic reason :). BesNet yields 74% Top-1 with only RGB . It is half-size of the baseline network described in the DeepMind paper.
In detail, BesNet is devised on top of ResNet-50 architecture. Distinctly, BesNet performs 3D convolutions that are able to learn both spatiotemporal features. In a better extent, BesNet takes not a single frame, but a set of frames from a video. It convolves pixels between consecutive frames as wells as single frame pixels. Each ResNet-50 module buckled with 1x1 + 3x3 +1x1 filters in order. Each such module followed by a residual connection coming from preceding module. It uses ReLU activation followed by a Batch-Normalization for each layer. In order to convert Resnet-50 to BesNet, we inflate 1x1 filters to 3x1x1 filters and 3x3 filters to 1x3x3 filters where the ordering of the dimensions is sequence x height x width. After convolution layers, an average pooling layer aggregates spatial dimension as in the normal ResNet. Subsequently, a max pooling layer aggregates temporal dimensions. A fully-connected layer used for predictions.
BesNet is initialized with ImageNet weights. In order to convert 2D filter weights to 3D filter weights, we replicate 2D filters along an additional dimension and then normalize the weights by the replication factor. This normalization keeps the activation values stable despite the architectural change. For example, a 1x1 filter is converted to 3x1x1 by copying the 1x1 filter 3 times along the third dimension and weights are divided by 3 at the end.
In BesNet, 3x1x1 filters are responsible for temporal and spatial cross-channel regularities. 1x3x3 filters pay into only spatial properties of individual feature maps. This orientation excites several observations. First off, it decouples temporal and spatial computations. It learns specialized layers for each of the temporal and spatial dimensions. The idea also entertained by the pooling layers. We decomposed spatial and temporal dimension over average and max pooling layers respectively. BesNet reduces the spatial dimensions along the convolutional layers yet it keeps the size of temporal dimension constant. This makes BesNet flexible to handle videos with different number of frames. Hence, given a video with K frames, BesNet keeps the temporal dimension as K until the pooling layers. Thereafter, max pooling layer aggregates K temporal channels into one. In a practical sense, this is easy with a dynamic computational graph library. Pytorch is a bliss here !! (Sorry TF, You're so crusty.)
BesNet has a peculiar use of dilation in 3x1x1 layers which defines the real novel aspect of our architecture. BesNet uses dilation only on temporal dimension and it picks a random dilation factor per 3x1x1 layer for each mini-batch. It sets padding parameters in accordance to keep the temporal dimension unchanged. At the test time, each layer computes outputs for each possible dilation factor, then takes the average of the output feature maps. Random dilation enables the network to learn complex temporal relations. It also regularizes the network in the temporal domain. In practice, it reduces the effect of FPS used for casting videos into frames.
We discuss that for Kinetics, it is important to learn long range relations between frames. Videos are long and they have only a single label. So the network needs to learn the general context of the video. In that sense, small motions that are observed by a normal 3D convolution are not that important. Random dilation pays into this. It augments the contextual temporal window of the network.
Our experiments with only frame futures support our hypothesis here. We extracted frame features with ResNet-50 and train an MLP after pooling the features. It gets 65% accuracy. It is better than DeepMind's baseline network with 3D convolution layers. That shows us contextual information means more than motion learned by 3D layers.
Motion information might be complementary but not the core. It is then verified by the random dilation. BesNet with no dilation results 70% , dilation 2 68% and the random dilation 74% accuracy. This stands to be a simple empirical proof backing our claim here.
Random dilation is really easy to implement with Pytorch. Just take normal Conv class and overwrite its forward pass by randomizing dilation parameter. If you like to try out before we release fell free.
I try to give a very sketchy description of BesNet here by no means complete. Please ping me if you have any question. We plan to study BesNet a little more and share it in the near future in legit formats. We also plan to share a finer description of our challenge approach with some open-source enjoyment.
Please note that BesNet is a work in progress. Anyways, feedbacks are always warmly welcome. Best :).
One of the main problems of neural networks is to tame layer activations so that one is able to obtain stable gradients to learn faster without any confining factor. Batch Normalization shows us that keeping values with mean 0 and variance 1 seems to work things. However, albeit indisputable effectiveness of BN, it adds more layers and computations to your model that you'd not like to have in the best case.
ELU (Exponential Linear Unit) is a activation function aiming to tame neural networks on the fly by a slight modification of activation function. It keeps the positive values as it is and exponentially skew negative values.
ELU does its job good enough, if you like to evade the cost of Bath Normalization, however its effectiveness does not rely on a theoretical proof beside empirical satisfaction. And finding a good is just a guess.
Self-Normalizing Neural Networks takes things to next level. In short, it describes a new activation function SELU (Scaled Exponential Linear Units), a new initialization scheme and a new dropout variant as a repercussion,
The main topic here is to keep network activation in a certain basin defined by a mean and a variance values. These can be any values of your choice but for the paper it is mean 0 and variance 1 (similar to notion of Batch Normalization). The question afterward is to modifying ELU function by some scaling factors to keep the activations with that mean and variance on the fly. They find these scaling values by a long theoretical justification. Stating that, scaling factors of ELU are supposed to be defined as such any passing value of ELU should be contracted to define mean and variance. (This is just verbal definition by no means complete. Please refer to paper to be more into theory side. )
Above, the scaling factors are shown as and . After long run of computations these values appears to be 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946 relatively. Nevertheless, do not forget that these scaling factors are targeting specifically mean 0 and variance 1. Any change preludes to change these values as well.
Initialization is also another important part of the whole method. The aim here is to start with the right values. They suggest to sample weights from a Gaussian distribution with mean 0 and variance where n is number of weights.
It is known with a well credence that Dropout does not play well with Batch Normalization since it smarting network activations in a purely random manner. This method seems even more brittle to dropout effect. As a cure, they propose Alpha Dropout. It randomly sets inputs to saturatied negative value of SELU which is . Then an affine transformation is applied to it with and values computed relative to dropout rate, targeted mean and variance.It randomizes network without degrading network properties.
In a practical point of view, SELU seems promising by reducing the computation time relative to RELU+BN for normalizing the network. In the paper they does not provide any vision based baseline such a MNIST, CIFAR and they only pounce on Fully-Connected models. I am still curios to see its performance vis-a-vis on these benchmarks agains Bath Normalization. I plan to give it a shoot in near future.
One tickle in my mind after reading the paper is the obsession of mean 0 and variance 1 for not only this paper but also the other normalization techniques. In deed, these values are just relative so why 0 and 1 but not 0 and 4. If you have a answer to this please ping me below.
I owned a Raspberry Pi long ago and it was just sitting in my tech wash box. After watching a Youtube session of creative Raspberry applications, with envy , I decided to try something by myself. The first obvious idea to me was a home security system to inspect your house while you are away.
The final thingy is able to detect and roughly localize any motion through a camera. It takes photos and mails them to your email account. Plus, we are able to interact with it in our local network using a simple web interface so we are able to activate or deactivate it in front of home door. I assume that if someone is able to reach the local wifi network, most probably s/he is one of us (fair enough ?).
The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.
The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.
White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.
For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.
One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).
Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.
The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It is scaling residual connections by a constant value before summing up with the current layer activation.
A small discussion
This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.
Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here, increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.
Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂