Here is a simple run-down of how to install everything needed to run Selenium on a Raspberry Pi. Basically, we first install Firefox, then Geckodriver, and finally Selenium, and we are ready to go.
Before we start, note that ChromeDriver no longer supports ARM processors, so it is not possible to use Chromium with Selenium on a Raspberry Pi.
First, install the system requirements: update the system, then install Firefox and xvfb (a display server implementing X11).
Online Hard Example Mining (OHEM) is a way to pick hard examples at a reduced computational cost, so that your network improves on borderline cases, which in turn improves its overall performance. It is mostly used for object detection. Suppose you want to train a car detector and you have positive images (with a car) and negative images (without a car). In practice, you end up with many negatives and relatively few positives. It is therefore wise to pick the subset of negatives that is most informative for your network. Hard example mining is the way to do this.
In general, to pick a subset of negatives, you first train your network for a couple of iterations, then run it over all your negative instances and pick the ones with the greatest loss values. However, this is computationally expensive, since you may have millions of images to process. It is also sub-optimal for the optimization, since the network is frozen while you pick the hard instances, and the picked instances are then reused for the next couple of iterations. That is, you assume that all the hard negatives you pick stay useful for every iteration until the next selection, which is an imperfect assumption, especially for large datasets.
Okay, so what does "online" mean in this regard? OHEM solves the two problems above by performing hard example selection batch-wise. Given a batch of size K, it performs regular forward propagation and computes per-instance losses. Then it finds the M<K hardest examples in the batch, those with the highest loss values, and back-propagates only the loss computed over the selected instances. Smart, huh? 🙂
It reduces computation by running hand in hand with your regular optimization cycle. It also removes the assumption of foreseen usefulness: because hard examples are picked per iteration, we now really pick the hard examples for each iteration.
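The batch-wise selection step can be sketched without any deep learning framework. The snippet below is only an illustration in plain Python, not the PyTorch implementation this post refers to; the function names and the toy batch are assumptions made for the example.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one instance."""
    return -math.log(probs[label])

def ohem_select(losses, num_hard):
    """Return the indices of the num_hard largest losses, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:num_hard]

def ohem_loss(losses, num_hard):
    """Mean loss over the selected hard examples only."""
    hard = ohem_select(losses, num_hard)
    return sum(losses[i] for i in hard) / len(hard)

# Hypothetical batch of K = 4 predictions (class probabilities) and labels.
batch_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
batch_labels = [0, 0, 1, 1]

# Regular forward pass: one loss value per instance.
losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]

# Keep only the M = 2 hardest instances for the backward pass.
hard_indices = ohem_select(losses, num_hard=2)
```

In a real training loop, only the loss returned by something like `ohem_loss` would be back-propagated, so the easy instances in the batch contribute no gradient.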
If you'd like to try it yourself, here is a PyTorch OHEM implementation, which I suggest you take with a grain of salt.
Let's dive right in. The goal here is to use Tensorboard to plot your PyTorch training runs. For this, I use TensorboardX, a nice interface to Tensorboard that avoids the Tensorflow dependency.
First, install the requirements:
pip install tensorboard
pip install tensorboardX
Things are very easy from there as well, but you need to know how to communicate with the board to show your training, and that is not so easy if you haven't used Tensorboard before.
from tensorboardX import SummaryWriter
writer = SummaryWriter('your/path/to/log_files/')
# in training loop
writer.add_scalar('Train/Loss', loss, num_iteration)
writer.add_scalar('Train/Prec@1', top1, num_iteration)
writer.add_scalar('Train/Prec@5', top5, num_iteration)
# in validation loop
writer.add_scalar('Val/Loss', loss, epoch)
writer.add_scalar('Val/Prec@1', top1, epoch)
writer.add_scalar('Val/Prec@5', top5, epoch)
You can also visualize an embedding of your dataset:
from torchvision import datasets
from tensorboardX import SummaryWriter
writer = SummaryWriter()
dataset = datasets.MNIST('mnist', train=False, download=True)
# first 100 test images and their labels
images = dataset.data[:100].float()
label = dataset.targets[:100]
# flatten each 28x28 image into a 784-dimensional feature vector
features = images.view(100, 784)
writer.add_embedding(features, metadata=label, label_img=images.unsqueeze(1))
writer.close()
This is also how you can plot your model graph. The important part is to give the writer a tensor along with your model, so that it can compute the tensor shapes in between. Older tensorboardX releases expect the model's output tensor here, while recent ones expect an example input. I should also say that it is very slow for large models.
import torch
import torch.nn as nn
import torch.nn.functional as F
from tensorboardX import SummaryWriter

class Mnist(nn.Module):
    def __init__(self):
        super(Mnist, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.bn = nn.BatchNorm2d(20)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), 2)
        x = F.relu(x) + F.relu(-x)
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = self.bn(x)
        x = x.view(-1, 320)  # flatten to (batch, 20 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Mnist()
# recent tensorboardX releases trace the model from an example input;
# older releases expected the output tensor of a forward pass instead
dummy_input = torch.rand(1, 1, 28, 28)
writer = SummaryWriter()
writer.add_graph(model, dummy_input)
writer.close()
Quora recently announced the first public dataset they have ever released. It includes 404351 question pairs with a label column indicating whether they are duplicates or not. In this post, I'd like to investigate this dataset and at least propose a baseline method with deep learning.
Besides the proposed method, the post includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code, check Github.
There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance; however, when you consider the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more likely in a real-life scenario.
When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest question is 1169 characters long (a long, complicated love affair question). I see that if either question of a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average question length is 59 characters and the standard deviation is 32.
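The length filter described above can be sketched in a few lines. This is a plain Python illustration rather than the Pandas code used for the post; the helper name and the two toy pairs are made up for the example.

```python
MIN_CHARS = 10  # threshold taken from the text above

def keep_pair(question1, question2, min_chars=MIN_CHARS):
    """True when both questions are at least min_chars characters long."""
    return len(question1) >= min_chars and len(question2) >= min_chars

# Hypothetical question pairs, only for illustration.
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("Why?", "Why is the sky blue during the day?"),
]

# The second pair is dropped because "Why?" is shorter than 10 characters.
filtered = [p for p in pairs if keep_pair(*p)]
```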
There are two other columns, "q1id" and "q2id", but I really do not know how they are useful, since the same question used in different rows has different ids.
Some labels are not correct, especially among the duplicate ones. In any case, I decided to rely on the labels and defer pruning because of the heavy manual effort it requires.
Converting Questions into Vectors
Here, I plan to use Word2Vec to convert each question into a semantic vector, and then stack a Siamese network on top to detect whether the pair is a duplicate.
Word2Vec is a general term for similar algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:
king - man + woman = queen.
Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors to Recurrent Neural Networks for more advanced applications.
There are two well-known algorithms in this domain. One is Google's network architecture, which learns representations by trying to predict the surrounding words of a target word within a given window size. GLOVE is another method, which relies on co-occurrence matrices. GLOVE is easy to train and it is flexible about adding new words from outside your vocabulary. You might like to visit this tutorial to learn more, and check this brilliant use-case, Sense2Vec.
We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all the word vectors of each question. This is a simple but really effective approach for document classification and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using the TF-IDF scores defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes discriminative words and suppresses useless, frequent words which are shared by many questions.
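To make the weighting concrete, here is a minimal sketch in plain Python. The two-dimensional vectors and the weights are made-up values chosen for the illustration, and `weighted_mean_vector` is a hypothetical helper, not part of Gensim, Spacy or scikit-learn.

```python
def weighted_mean_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of the tokens found in word_vectors."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(tok, 0.0)
        for k, value in enumerate(word_vectors[tok]):
            total[k] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known word: return the zero vector
    return [t / weight_sum for t in total]

# Toy 2-dimensional "word vectors" and TF-IDF-like weights (assumed values).
word_vectors = {"how": [1.0, 0.0], "python": [0.0, 1.0]}
tfidf = {"how": 0.5, "python": 1.5}
uniform = {"how": 1.0, "python": 1.0}

question = ["how", "python"]
plain_mean = weighted_mean_vector(question, word_vectors, uniform)
tfidf_mean = weighted_mean_vector(question, word_vectors, tfidf)
```

With uniform weights both words contribute equally, while the TF-IDF-like weights pull the question vector toward the rarer word "python".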
I described the Siamese network in a previous post. In short, it is a two-way network architecture which takes two inputs, one on each side. It projects data into a space in which similar items are pulled together and dissimilar ones are pushed apart. It is computationally efficient, since the two networks share parameters.
Let's load the training data first.
For this particular problem, I train my own word-embedding model by using Gensim.
The above code trains the model and saves it. It generates 300-dimensional vectors for words. The hyper-parameters could be chosen better, but it is just a baseline to see an initial performance. However, as I'll show, this model performs below my expectation. I believe this is because our questions are short and do not provide enough semantic structure for the model to learn salient representations.
Due to the performance issue and the observation above, I decided to use a pre-trained GLOVE model which comes free with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
Before going further: I really like Spacy. It is really fast and it does everything you need for NLP in a flash, hiding many intrinsic details. It deserves a lot of credit. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.
The result I get from the Spacy vectors is better than the one from the Gensim model I trained, so it is the better choice for going further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.
After we find the TF-IDF scores, we convert each question to an average of its word2vec vectors weighted by these scores. The code below does this for just the "question1" column.
Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I also split the data into train and test sets.
At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script that I used to define the model.
Here I share the best performing network, which has residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity. It has Batch Normalization in every layer. This is particularly important, since the BN layers enhance the performance considerably. I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.
I tried cosine distance, which is theoretically more in line with Word2Vec vectors, but could not obtain better results. I also tried normalizing the data to unit variance or unit L2 norm, but nothing gave better results than the original feature values.
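For reference, the two distance measures compared here can be written in a few lines. This is a plain Python sketch for illustration only, not the Keras code of the model, and the function names are my own.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two vectors pointing in the same direction but with different lengths:
# cosine distance ignores the length, Euclidean distance does not.
short_vec = [1.0, 2.0]
long_vec = [2.0, 4.0]
d_euclid = euclidean_distance(short_vec, long_vec)
d_cosine = cosine_distance(short_vec, long_vec)
```

Euclidean distance is sensitive to the scale of the features, which is one possible reason the normalization done by the BN layers matters for it.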
Let's train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these further, but so far I am able to give promising baseline results.
In this section, I'd like to share the test set accuracy values obtained with different model and feature extraction settings. We expect to see an improvement over 0.63, since that is the accuracy we get when we set all the labels to 0.
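The 0.63 baseline follows directly from the class counts given earlier (computed here on the full dataset, before any filtering). A quick check in Python, with variable names of my own choosing:

```python
num_negative = 255045  # non-duplicate pairs
num_positive = 149306  # duplicate pairs
total = num_negative + num_positive

# Predicting the majority class (label 0) for every pair.
baseline_accuracy = num_negative / total
print(round(baseline_accuracy, 2))  # prints 0.63
```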
These are the best results I obtained with varying GLOVE models. They all use the same network and hyper-parameters, which I tuned on the last configuration listed below.
Gensim (my model) + Siamese: 0.69
Spacy + Siamese : 0.72
Spacy + TF-IDF + Siamese : 0.79
We can also investigate the effect of different model architectures. These are the values obtained with the best word2vec model shown above.
Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between layers in different orders (before ReLU, after BN, etc.); the best I obtained is 0.75. Concatenating different layers improves the performance by 1 percent as the final gain.
In conclusion, I tried to present a solution to this unique problem by combining different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and are open to further optimization. However, this is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).
Switching the last layer to an FC layer improves performance to 0.84.
Using a bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain the details in another post.
Decorators are handy syntactic sugar for Python programmers; they shorten things and allow more concise programming.
For instance, you can use decorators for user authentication in your REST API servers. Assume that you need to authenticate the user before each REST call. Instead of adding the same procedure to each call function, it is better to define a decorator and tag it onto your call functions.
Let's see the small example below. I hope it is self-descriptive.
How to use Decorators:
Decorators are functions called by annotations
Annotations are the tags prefixed by @
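### Decorator functions ###
def helloSpace(target_func):
    def new_func():
        print("Hello Space!")
        target_func()
    return new_func

def helloCosmos(target_func):
    def new_func():
        print("Hello Cosmos!")
        target_func()
    return new_func

@helloCosmos # annotation
@helloSpace # annotation
def hello():
    print("Hello World!")

### Above code is equivalent to these lines
# hello = helloSpace(hello)
# hello = helloCosmos(hello)

### Let's Try
hello()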
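The authentication use case mentioned at the beginning can be sketched in the same way. This is only an illustration under simple assumptions: requests are plain dictionaries, `VALID_TOKENS` stands in for a real token store, and `require_auth` and `get_profile` are hypothetical names rather than part of any web framework.

```python
from functools import wraps

VALID_TOKENS = {"secret-token"}  # hypothetical token store

def require_auth(handler):
    """Reject the call unless the request carries a known token."""
    @wraps(handler)  # keep the name and docstring of the wrapped handler
    def wrapper(request, *args, **kwargs):
        if request.get("token") not in VALID_TOKENS:
            return {"status": 401, "body": "Unauthorized"}
        return handler(request, *args, **kwargs)
    return wrapper

@require_auth
def get_profile(request):
    """Hypothetical REST handler that needs an authenticated user."""
    return {"status": 200, "body": "profile of " + request["user"]}

ok = get_profile({"token": "secret-token", "user": "ada"})
denied = get_profile({"user": "ada"})
```

The authentication check lives in one place, and every handler that needs it only has to carry the `@require_auth` tag.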