Quora recently announced the first public dataset that they ever released. It includes 404351 question pairs with a label column indicating if they are duplicate or not. In this post, I like to investigate this dataset and at least propose a baseline method with deep learning.
Beside the proposed method, it includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code you check Github.
There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance however when you consider the nature of the problem, it seems reasonable to keep the same data bias with your ML model since negative instances are more expectable in a real-life scenario.
When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest question is 1169 character (which is a long, complicated love affair question). I see that if any of the pairs is shorter than 10 characters, they do not make sense thus, I remove such pairs. The average length is 59 and std is 32.
There are two other columns "q1id" and "q2id" but I really do not know how they are useful since the same question used in different rows has different ids.
Some labels are not true, especially for the duplicate ones. In anyways, I decided to rely on the labels and defer pruning due to hard manual effort.
Converting Questions into Vectors
Here, I plan to use Word2Vec to convert each question into a semantic vector then I stack a Siamese network to detect if the pair is duplicate.
Word2Vec is a general term used for similar algorithms that embed words into a vector space with 300 dimensions in general. These vectors capture semantics and even analogies between different words. The famous example is ;
king - man + woman = queen.
Word2Vec vectors can be used for may useful applications. You can compute semantic word similarity, classify documents or input these vectors to Recurrent Neural Networks for more advance applications.
There are two well-known algorithms in this domain. One is Google's network architecture which learns representation by trying to predict surrounding words of a target word given certain window size. GLOVE is the another methos which relies on co-occurrence matrices. GLOVE is easy to train and it is flexible to add new words out-side of your vocabulary. You might like visit this tutorial to learn more and check this brilliant use-case Sense2Vec.
We still need a way to combine word vectors for singleton question representation. One simple alternative is taking the mean of all word vectors of each question. This is simple but really effective way for document classification and I expect it to work for this problem too. In addition, it is possible to enhance mean vector representation by using TF-IDF scores defined for each word. We apply weighted average of word vectors by using these scores. It emphasizes importance of discriminating words and avoid useless, frequent words which are shared by many questions.
I described Siamese network in a previous post. In short, it is a two way network architecture which takes two inputs from the both side. It projects data into a space in which similar items are contracted and dissimilar ones are dispersed over the learned space. It is computationally efficient since networks are sharing parameters.
Let's load the training data first.
For this particular problem, I train my own GLOVE model by using Gensim.
The above code trains a GLOVE model and saves it. It generates 300 dimensional vectors for words. Hyper parameters would be chosen better but it is just a baseline to see a initial performance. However, as I'll show this model gives performance below than my expectation. I believe, this is because our questions are short and does not induce a semantic structure that GLOVE is able to learn a salient model.
Due to the performance issue and the observation above, I decide to use a pre-trained GLOVE model which comes free with Spacy. It is trained on Wikipedia and therefore, it is stronger in terms of word semantics. This is how we use Spacy for this purpose.
Before going further, I really like Spacy. It is really fast and it does everything you need for NLP in a flash of time by hiding many intrinsic details. It deserves a good remuneration. Similar to Gensim model, it also provides 300 dimensional embedding vectors.
The result I get from Spacy vectors is above Gensim model I trained. It is a better choice to go further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfIdfVectorizer which does everything you need.
After we find TF-IDF scores, we convert each question to a weighted average of word2vec vectors by these scores. The below code does this for just "question1" column.
Now, we are ready to create training data for Siamese network. Basically, I've just fetch the labels and covert mean word2vec vectors to numpy format. I split the data into train and test set too.
In this stage, we need to define Siamese network structure. I use Keras for its simplicity. Below, it is the whole script that I used for the definition of the model.
I share here the best performing network with residual connections. It is a 3 layers network using Euclidean distance as the measure of instance similarity. It has Batch Normalization per layer. It is particularly important since BN layers enhance the performance considerably. I believe, they are able to normalize the final feature vectors and Euclidean distance performances better in this normalized space.
I tried Cosine distance which is more concordant to Word2Vec vectors theoretically but cannot handle to obtain better results. I also tried to normalize data into unit variance or L2 norm but nothing gives better results than the original feature values.
Let's train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these but hitherto I am able to give promising baseline results.
In this section, I like to share test set accuracy values obtained by different model and feature extraction settings. We expect to see improvement over 0.63 since when we set all the labels as 0, it is the accuracy we get.
These are the best results I obtain with varying GLOVE models. they all use the same network and hyper-parameters after I find the best on the last configuration depicted below.
Gensim (my model) + Siamese: 0.69
Spacy + Siamese : 0.72
Spacy + TD-IDF + Siamese : 0.79
We can also investigate the effect of different model architectures. These are the values following the best word2vec model shown above.
Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried to introduce Dropout between layers in different orders (before ReLU, after BN etc.), the best I obtain is 0.75. Concatenation of different layers improves the performance by 1 percent as the final gain.
In conclusion, here I tried to present a solution to this unique problem by composing different aspects of deep learning. We start with Word2Vec and combine it with TF-IDF and then use Siamese network to find duplicates. Results are not perfect and akin to different optimizations. However, it is just a small try to see the power of deep learning in this domain. I hope you find it useful :).
Switching last layer to FC layer improves performance to 0.84.
By using bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain details with another post.
Decorators are handy sugars for Python programmers to shorten things and provides more concise programming.
For instance you can use decorators for user authentication for your REST API servers. Assume that, you need to auth. the user for before each REST calls. Instead of appending the same procedure to each call function, it is better to define decorator and tagging it onto your call functions.
Let's see the small example below. I hope it is self-descriptive.
How to use Decorators:
Decorators are functions called by annotations
Annotations are the tags prefixed by @
### Decorator functions ###
print "Hello Space!"
print "Hello Cosmos!"
@helloCosmos # annotation
@helloSpace # annotation
print "Hello World!"
### Above code is equivalent to these lines
# hello = helloSpace(hello)
# hello = helloCosmos(hello)
### Let's Try
This is from my stackoverflow question. Thanks to "bitoiu". Here is the real thread.
How to pick up a single commit from a remote repo
Assuming you have a local clone of the repo you forked if you type in the following you should get a single origin:
> git show remote
Unless you've added the original's repo location, you won't have access to the commit you want to pick into your local one. So we need to add that, let's assume this repo ishttps://github.com/GitbookIO/gitbook.git. Notice this is an HTTPS clone URL because you won't have write access to this repo. Let's name it original_repo:
At this point you have all you need locally, you'll just need to merge the commit into one of your branches, let's assume your local master.
Find the commit you want to merge. This implies finding it in one of the branches the team used. Could be already merged to master or you could be picking it up from the branch that was used for the pull request. Either way, just run a series of git log to check what commit you want if you don't know the reference. When you do simply go to the branch where you want to merge the commit to and run:
> git cherry-pick COMMIT_ID
This will bring the commit to whatever branch you are at the moment.
How to merge a branch from a remote repo
The only difference in this steps is that instead of doing the cherry-pick you will be doing a merge. So imagine the contents of the pull request are in a branch named so-pr, you would simply do:
> git merge original_repo/so-pr
And that would merge the contents of so-pr into your working branch.
# Add the remote, call it "upstream":
git remote add upstream https://github.com/whoever/whatever.git
# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:
git fetch upstream
# Make sure that you're on your master branch:
git checkout master
# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:
git rebase upstream/master
#If you don't want to rewrite the history of your master branch, (for # example because other people may have cloned it) then you should # replace the last command with However, for making further pull # requests that are as clean as possible, it's probably better to # rebase.
git merge upstream/master.