Duplicate Question Detection with Deep Learning on Quora Dataset

Quora recently released its first ever public dataset. It includes 404,351 question pairs with a label column indicating whether they are duplicates. In this post, I'd like to investigate this dataset and propose at least a baseline method with deep learning.

Besides the proposed method, this post includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code, check GitHub.

Data Quirks

There are 255,045 negative (non-duplicate) and 149,306 positive (duplicate) instances. This induces a class imbalance; however, considering the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more likely in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is useless for the task) and the longest question is 1,169 characters (which is a long, complicated love affair question). I observe that if either question in a pair is shorter than 10 characters, the pair does not make sense for the task, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.

There are two other columns, "qid1" and "qid2", but I do not really see how they are useful, since the same question used in different rows has different ids.

Some labels are not correct, especially for the duplicate ones. In any case, I decided to rely on the labels and defer any pruning, since it would require heavy manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is a duplicate.

Word2Vec is a general term for a family of algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors into Recurrent Neural Networks for more advanced applications.
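
As a quick illustration, an analogy query with Gensim looks roughly like this (the GoogleNews vector file below is only an example of a pre-trained model, not something used later in this post):

import gensim

# Load a pre-trained 300-dimensional word2vec model (the file name is just an example).
w2v = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# "king - man + woman" should land near "queen" in the embedding space.
print(w2v.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))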

There are two well-known algorithms in this domain. One is Google's word2vec architecture, which learns representations by trying to predict the surrounding words of a target word within a certain window size. GloVe is the other method, which relies on co-occurrence matrices. GloVe is easy to train and it is flexible enough to add new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use case, Sense2Vec.

We still need a way to combine word vectors into a single question representation. One simple alternative is to take the mean of all word vectors in each question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using TF-IDF scores defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes discriminating words and suppresses useless, frequent words that are shared by many questions.

Siamese Network

I described the Siamese network in a previous post. In short, it is a two-branch network architecture that takes two inputs, one on each side. It projects the data into a space in which similar items are contracted and dissimilar ones are dispersed. It is computationally efficient since the two branches share parameters.

A Siamese network tries to contract instances belonging to the same class and disperse instances from different classes in the feature space.
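
This objective is commonly trained with a contrastive loss; a minimal Keras-backend sketch of the standard formulation (the exact loss used later is assumed to follow this form) is:

from keras import backend as K

def contrastive_loss(y_true, y_pred, margin=1.0):
    # y_pred is the Euclidean distance between the two branch outputs.
    # Duplicate pairs (y_true = 1) are pulled together; non-duplicate pairs
    # are pushed at least `margin` apart.
    return K.mean(y_true * K.square(y_pred) +
                  (1.0 - y_true) * K.square(K.maximum(margin - y_pred, 0.0)))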

 

Implementation

Let's load the training data first.
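
A minimal Pandas sketch looks like this (adjust the file name to wherever you saved the Quora pairs file):

import pandas as pd

# The file name is a placeholder; point it at your local copy of the Quora pairs file.
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
df = df.dropna(subset=['question1', 'question2'])

# Drop pairs where either question is shorter than 10 characters (see Data Quirks).
df = df[(df['question1'].str.len() >= 10) & (df['question2'].str.len() >= 10)]
print(df.shape)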

For this particular problem, I train my own word2vec model using Gensim.
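
In rough outline, the training looks like this (the hyper-parameter values here are only placeholders, not the ones from the original run; note that `size` becomes `vector_size` in Gensim 4.x):

from gensim.models import Word2Vec

# Build a corpus from both question columns (df comes from the loading step above).
corpus = [str(q).lower().split()
          for q in pd.concat([df['question1'], df['question2']])]

# Hyper-parameters are placeholders.
w2v_model = Word2Vec(corpus, size=300, window=5, min_count=2, workers=4)
w2v_model.save('quora_w2v.model')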

The above code trains the model and saves it. It generates 300-dimensional vectors for words. The hyper-parameters could be chosen more carefully, but this is just a baseline to see initial performance. However, as I'll show, this model performs below my expectations. I believe this is because our questions are short and do not provide enough semantic structure for the model to learn salient representations.

Due to the performance issue and the observation above, I decided to use a pre-trained GloVe model which comes free with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
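
Roughly (the model name below is only an example; any English Spacy model that ships 300-dimensional vectors will do):

import spacy

# Assumes an English model with vectors, e.g. installed via
# `python -m spacy download en_core_web_md` (the model name is a placeholder).
nlp = spacy.load('en_core_web_md')

doc = nlp('How can I be a good geologist?')
vectors = [token.vector for token in doc]   # one 300-d vector per token
print(len(vectors), vectors[0].shape)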

Before going further: I really like Spacy. It is really fast and does everything you need for NLP in a flash, hiding many intrinsic details. It deserves real credit. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.

The results I get from the Spacy vectors are better than those of the Gensim model I trained, so it is the better choice to build on with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.
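
We fit it over all question text and keep a word-to-IDF map, roughly like this (word2tfidf is the dictionary referred to in the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer

questions = list(df['question1'].astype(str)) + list(df['question2'].astype(str))

tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(questions)

# Map each vocabulary word to its IDF weight
# (get_feature_names becomes get_feature_names_out in newer scikit-learn).
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))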

After we compute the TF-IDF scores, we convert each question into a weighted average of its word vectors using these scores. The code below does this for just the "question1" column.
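
Roughly (this is a sketch; it mirrors the try/except IDF fallback discussed in the comments below, and the same loop is then repeated for the "question2" column):

import numpy as np

# Weighted mean of Spacy token vectors for each question1, weighted by IDF.
q1_vectors = []
for text in df['question1'].astype(str):
    doc = nlp(text)
    mean_vec = np.zeros(300)
    weight_sum = 0.0
    for token in doc:
        try:
            idf = word2tfidf[str(token)]
        except KeyError:
            idf = 0.0   # word not in the TF-IDF vocabulary
        mean_vec += token.vector * idf
        weight_sum += idf
    if weight_sum > 0:
        mean_vec /= weight_sum
    q1_vectors.append(mean_vec)

df['q1_feats'] = q1_vectors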

Now, we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I also split the data into training and test sets.
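
Roughly like this (it assumes the weighted-average step above was also run for "question2" into df['q2_feats']):

from sklearn.model_selection import train_test_split
import numpy as np

X1 = np.vstack(df['q1_feats'].values)
X2 = np.vstack(df['q2_feats'].values)   # question2 processed the same way as question1
y = df['is_duplicate'].values

# Stack the two question vectors per pair: shape (n_pairs, 2, 300).
X = np.stack([X1, X2], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)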

In this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below, I outline the definition of the model.

I share here the best performing network, which uses residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity, with Batch Normalization after each layer. This is particularly important, since the BN layers enhance performance considerably. I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.
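
In outline, the base network and the Siamese assembly look roughly like this (Keras 2 style; the layer sizes are placeholders, and contrastive_loss is the function sketched earlier):

from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization, Activation, Lambda, add
from keras import backend as K

def euclidean_distance(tensors):
    a, b = tensors
    return K.sqrt(K.maximum(K.sum(K.square(a - b), axis=1, keepdims=True), K.epsilon()))

def create_base_network(input_dim):
    # 3 Dense layers, BN before ReLU, with a residual connection in the middle.
    inp = Input(shape=(input_dim,))
    x = Dense(128)(inp)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    shortcut = x
    x = Dense(128)(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = add([x, shortcut])   # residual connection

    x = Dense(128)(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    return Model(inp, x)

input_dim = 300
base_network = create_base_network(input_dim)

input_a = Input(shape=(input_dim,))
input_b = Input(shape=(input_dim,))
distance = Lambda(euclidean_distance)([base_network(input_a), base_network(input_b)])

model = Model([input_a, input_b], distance)
model.compile(loss=contrastive_loss, optimizer='adam')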

I tried cosine distance, which is theoretically more suited to Word2Vec vectors, but could not obtain better results with it. I also tried normalizing the data to unit variance or unit L2 norm, but nothing gives better results than the original feature values.

Let's train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these further, but even so I am able to give promising baseline results.
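
The training and evaluation calls look roughly like this (the epoch and batch-size values are placeholders; the model and arrays come from the sketches above):

import numpy as np

model.fit([X_train[:, 0], X_train[:, 1]], Y_train,
          validation_data=([X_test[:, 0], X_test[:, 1]], Y_test),
          batch_size=128, epochs=20)

# Accuracy by thresholding the predicted distance at 0.5
# (the corrected accuracy computation discussed in the comments below).
pred = model.predict([X_test[:, 0], X_test[:, 1]])
te_acc = np.mean(np.equal(pred.ravel() < 0.5, Y_test))
print('Test accuracy: %0.4f' % te_acc)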

Results

In this section, I'd like to share the test set accuracy values obtained with different model and feature extraction settings. We expect to see improvement over 0.63, which is the accuracy we get if we predict all labels as 0.

These are the best results I obtain with the different embedding models. They all use the same network and hyper-parameters, which I found by tuning on the last configuration listed below.

  • Gensim (my model) + Siamese: 0.69
  • Spacy + Siamese :  0.72
  • Spacy + TF-IDF + Siamese : 0.79

We can also investigate the effect of different model architectures. These are the values obtained with the best word2vec setting shown above.

  • 2 layers net : 0.67
  • 3 layers net + adam : 0.74
  • 3 layers resnet (after relu BN) + adam : 0.77
  • 3 layers resnet (before relu BN) + adam : 0.78
  • 3 layers resnet (before relu BN) + adam + dropout : 0.75
  • 3 layers resnet (before relu BN) + adam + layer concat : 0.79
  • 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance : Fail

Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried to introduce Dropout between layers in different positions (before ReLU, after BN, etc.); the best I obtained was 0.75. Concatenating the outputs of different layers improves the performance by 1 percent as the final gain.

In conclusion, I tried to present a solution to this interesting problem by composing different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and are open to further optimization. However, this is just a small attempt to show the power of deep learning in this domain. I hope you find it useful :).

Updates

  • Switching the last layer to an FC layer improves performance to 0.84.
  • Using bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain the details in another post.
Comments
  • Bahadır

    After the embedding, for feature extraction you could put a CNN/bidirectional LSTM + max pooling, with sigmoid or tanh instead of ReLU on top of them, and any similarity measure at the very top; an architecture like that could give higher accuracy (keeping the Siamese structure exactly as it is, so it still runs through both branches). As you can guess, training takes a long time even on a GPU.
    I couldn't be bothered to write it in English right now =)

  • Dan Ofer

    What was your AUC?
    (I have 85 without any W2V)

    • Don't know yet. Let me return once I measure AUC. If you have accuracy we can also compare it too.

      • Dan Ofer

        Sweet.
        (Big fan of your blog btw)

      • Dan Ofer

        76.7 % accuracy roughly.

  • Jeroen van Dalen

    What is the baseline of just using the distance metric straight after vectorisation, with a (learned) cut-off point? Thus: input, vector (GloVe or your other methods), distance, cutoff? Would you expect it to do better than chance?

    • Sorry for the very late response. It is better than random, but not by a large margin. I tried it before doing anything else but don't remember the exact values.

  • Zhiguo

    We got 88.69% on our test set. You can find the details about our model at https://arxiv.org/pdf/1702.03814.pdf, and our dataset partition can be found at https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing

  • Dan Ofer

    Might interest you: spacy/Thinc just did something similar (but with "neural bag of words"). Their results are better, but I'm not clear how they did the train/test split.

    https://explosion.ai/blog/quora-deep-text-pair-classification

  • Reiji Tellez

    Cool article. Have you looked into doc2vec for further comparisons? Gensim of course has both flavors (DM / DBOW). We have looked into averages of word vectors vs. doc2vec for classification tasks varying the # epochs and our results are quite interesting....Great work btw!

    • Thanks for the compliment :). I plan to check doc2vec, but I guess the problems with word2vec will also apply to doc2vec, perhaps to an even greater extent.

      • Big Deeper

        Hello. Could you clarify if the embedded code corresponds to the "3 layers resnet (before relu BN) + adam + layer concat : 0.79" line? Thanks.

  • abhi

    Interesting. I get 0.80+ accuracy without any complicated method and within 50 lines of python code. I'll publish my blog post soon 😉

  • Иван Косолапов

    What versions of libraries do you use? I can not start your code (

    • it is really hard to check this since it was a temporary virtual env.

  • Sri

    Great post! Love the explanation and the step by step code to follow along. One question - when you refer to resnet are you referring to the architecture proposed here: https://arxiv.org/pdf/1512.03385.pdf ? If so how do you build that using Keras. The network in the `create_base_network` seems to be a 3 layer feed forward network with batch normalization.

  • Fouad

    Thank you for your amazing post. I am wondering how you submitted your results to the competition and what your best score was.
    The Siamese network outputs a Euclidean distance; how did you transform the distance into a probability to calculate the LogLoss? Are you using 1/(1+d) or the exponential transformation?

    • I've never submitted to the challenge. But you can obtain predictions by thresholding the Siamese distance. You could also replace the Siamese distance layer with an MLP and get better accuracy.

  • Ryan De Vera

    Thanks for this post! It is very insightful. Would you be willing to share some of the details around your second update? Does your base network consist of a single 1D convolutional layer that feeds into a BiLSTM? How many filters does your convolutional layer use?

    • Canoe

      same questions here

      • Ryan de Vera

        Hey Canoe! Wondering if you had any luck with this? I was able to run the Conv1D and BiLSTM but I was unable to reproduce the 0.91 accuracy. I tried using 128, 256 & 512 filters in the convolutional layer.

        • Big Deeper

          Are you able to reproduce even 0.79 accuracy? The accuracy function itself appears to be flawed.

        • Big Deeper

          I got 100% validation accuracy with a few network changes and a specific initialization, but I know it is absolutely wrong, because the accuracy function does not compute what it is supposed to compute.

    • Big Deeper

      I am having a problem recreating numbers from the code that is already here. Maybe you can try getting the code in this blog to run and get the 79% number to help either confirm it, or see if there is an issue. Replacing one layer with a Keras Dense layer and playing with a limited number of Conv1D and RNNs combos should not be too difficult.

    • Unfortunately, I don't plan to share these details, at least until the end of the Kaggle competition.

  • Big Deeper

    I made an attempt to recreate the results for "3 layers resnet (before relu BN) + adam + layer concat", which is the best configuration proposed here. I ran it for 100 epochs/iterations and 2 passes per loop (200) and I did not see anything much above 74%, mostly some high 73%+ as computed by "te_acc = compute_accuracy(pred, Y_test)". I was testing it on a Titan X. The score jumped up and down a lot, but it mostly stayed below 74% after touching it.

    Can you shed some light on the specific conditions for the 79% measurement?

  • Manuel C

    Great job. Two questions, 1) Would it be better to use CountVectorizer with binary=True instead of tf-idf, since the questions are short? 2) Since you use batch normalization, maybe you can use a sigmoid activation function instead of ReLU without gradient vanishing?

  • Big Deeper

    I don't think this function "te_acc = compute_accuracy(pred, Y_test)" correctly computes accuracy. Let's say there are five (5) data points that result in "distance" < 0.5. Then labels[predictions.ravel() < 0.5] would extract a slice from labels that is 5 items long. If those 5 items happen to be all ones (1) then labels[X_validate.ravel() < 0.5].mean() should result in one (1) or 100% accuracy, which is clearly wrong as there are hundreds of thousands of other points with label = 1 still not considered, there may be negative question pairs which are also incorrectly classified.

    Do I misunderstand something about what labels[predictions.ravel() < 0.5].mean() computes?

    • You are right. I used the code below locally but forgot to change it here:

      def compute_accuracy(predictions, labels):
          return np.mean(np.equal(predictions.ravel() < 0.5, labels))

  • Jenkins yu

    Is there any reason that you set idf=0 if the word does not exist in word2tfidf?
    try:
        idf = word2tfidf[str(word)]
    except:
        # print word
        idf = 0
    I mean, isn't it more reasonable to set idf=1?
    I am also confused about why some words are thrown away in word2tfidf. I checked part of them and found their lengths are > 1. Here are some samples of those words:
    "sheik
    bsd
    roose
    adarsh
    mather
    elliot
    20a"

    • You are right on the first point.
      Some words are not in the word2vec vocabulary. I think this is the reason for these removals.

  • AAKASH NAIN

    Thanks for the great insight. Can you please share the BRNN approach too ?

    • I don't plan to share this until Kaggle competition ends. Sorry about that.

  • Gareth Dwyer

    As others in the comments worked out, the accuracy function, copied from the Keras example, is flawed. It calculates precision, not accuracy, so results can be pretty strange (and you can get 100% by predicting only one label).

    It should be

    def compute_accuracy(predictions, labels):
        return np.mean(np.equal(predictions.ravel() < 0.5, labels))

    See https://github.com/fchollet/keras/issues/5963

    • Big Deeper

      I guess any hope that the author would answer any of the questions pointing out problems would be futile, although he was pretty quick to accept praise.

    • If you assume that the labels are 0 and 1, then my computation is the right accuracy, which is also the case in my code.

    • You are right. I changed it locally but forgot to reflect it here. But the values are computed by the correct code.

  • Anyone trying to replicate things here, please be aware of the wrong accuracy calculation in the example code. I corrected it locally before releasing these values but forgot to correct it on the Gist. Just now, I corrected it here as well!!! (Thanks @bigdeeper:disqus)

    I do not plan to release the latest code until Kaggle ends, and even then no promises; it depends on the workload. But I should say that hyper-parameter tuning is the key to this solution, at both the feature extraction and model training steps. I suggest you use probabilistic hyper-parameter optimization frameworks for this purpose rather than simple grid search.

    And please be aware that this post is only about investigating the Quora problem with a Siamese net. I do not intend to reach the state of the art or anything else, just fun. Happy hacking!!!

  • Abu Bakr Soliman

    The competition has finished and the hackers won!! Anyway, we still need your code updates to be posted, please 🙂