
Installing OpenCV 3.2 to Anaconda Environment with ffmpeg Support

Sometimes it is a real mess to install OpenCV on your system. Nevertheless, it is a great library for almost any vision problem, and you are often obliged to use it. (No complaints, just C++.)

I list my commands here in sequence and hope they work for you too.

Install dependencies


apt install gcc g++ git libjpeg-dev libpng-dev libtiff5-dev libjasper-dev libavcodec-dev libavformat-dev libswscale-dev pkg-config cmake libgtk2.0-dev libeigen3-dev libtheora-dev libvorbis-dev libxvidcore-dev libx264-dev sphinx-common libtbb-dev yasm libfaac-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenexr-dev libgstreamer-plugins-base1.0-dev libavcodec-dev libavutil-dev libavfilter-dev libavformat-dev libavresample-dev

conda install libgcc

Download OpenCV


# First, go to the folder where you want to host the installation
wget https://github.com/Itseez/opencv/archive/3.2.0.zip

unzip 3.2.0.zip
cd opencv-3.2.0

mkdir build
cd build

CMake and Set Up OpenCV

This cmake command targets Python 3.x and your target virtual environment. Therefore, activate your environment before running it. Do not forget to adjust the flags for your own setup.


cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=$(python3 -c "import sys; print(sys.prefix)") \
      -D PYTHON3_EXECUTABLE=$(which python3) \
      -D PYTHON3_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
      -D PYTHON3_PACKAGES_PATH=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
      -D BUILD_opencv_python2=OFF \
      -D WITH_CUDA=OFF \
      -D WITH_OPENGL=ON \
      -D WITH_OPENCL=ON \
      -D WITH_IPP=ON \
      -D WITH_TBB=ON \
      -D WITH_EIGEN=ON \
      -D WITH_V4L=ON \
      -D WITH_VTK=OFF \
      -D ENABLE_AVX=ON \
      -D BUILD_TIFF=ON \
      -D BUILD_opencv_java=OFF \
      -D BUILD_TESTS=OFF \
      -D BUILD_PERF_TESTS=OFF \
      -D INSTALL_PYTHON_EXAMPLES=ON \
      -D INSTALL_C_EXAMPLES=OFF \
      -D BUILD_EXAMPLES=ON ..

make -j 4

sudo make install

Then check your installation in Python:


import cv2

print(cv2.__version__)  # should print 3.2.0


Duplicate Question Detection with Deep Learning on Quora Dataset

Quora recently announced the first public dataset they have ever released. It includes 404,351 question pairs with a label column indicating whether they are duplicates or not. In this post, I'd like to investigate this dataset and propose at least a baseline method with deep learning.

Besides the proposed method, this post includes some examples showing how to use Pandas, Gensim, Spacy, and Keras. For the full code, check GitHub.

Data Quirks

There are 255,045 negative (non-duplicate) and 149,306 positive (duplicate) instances. This induces a class imbalance; however, given the nature of the problem, it seems reasonable to keep the same bias in your ML model, since negative instances are more likely in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest question is 1169 characters (a long, complicated love-affair question). If either question in a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.

There are two other columns, "q1id" and "q2id", but I do not really see how they are useful, since the same question appearing in different rows has different ids.

Some labels are not correct, especially among the duplicates. Even so, I decided to rely on the given labels and defer any pruning, since it would require hard manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is a duplicate.

Word2Vec is a general term for a family of algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.
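
As a quick illustration, this is how such an analogy query looks with Gensim's pre-trained vectors (this snippet is an aside of mine, not part of the original post, and the downloaded model name is just one option):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pre-trained Google News vectors (large download)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# 'queen' is expected near the top of the result list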

Word2Vec vectors can be used for many useful applications: you can compute semantic word similarity, classify documents, or feed these vectors into Recurrent Neural Networks for more advanced applications.

There are two well-known algorithms in this domain. One is Google's network architecture, which learns representations by trying to predict the surrounding words of a target word within a certain window size. GloVe is the other method, which relies on co-occurrence matrices. GloVe is easy to train and flexible for adding new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use case, Sense2Vec.

We still need a way to combine word vectors into a single representation per question. One simple alternative is taking the mean of all word vectors in each question. This is a simple but effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean-vector representation with TF-IDF scores defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes discriminating words and down-weights useless, frequent words that are shared by many questions.

Siamese Network

I described Siamese networks in a previous post. In short, a Siamese network is a two-branch architecture that takes two inputs, one on each side. It projects data into a space in which similar items are contracted and dissimilar ones are dispersed. It is computationally efficient since the two branches share parameters.

Siamese network tries to contract instances belonging to the same classes and disperse instances from different classes in the feature space.

 

Implementation

Let's load the training data first.
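
Something along these lines should work (a minimal sketch; the file name is illustrative and the 10-character filter comes from the Data Quirks section above):

import pandas as pd

# the Quora release is a tab-separated file with question1, question2 and is_duplicate columns
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
df = df.dropna(subset=['question1', 'question2'])

# drop pairs where either question is shorter than 10 characters
mask = (df['question1'].str.len() >= 10) & (df['question2'].str.len() >= 10)
df = df[mask]
print(df.shape)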

For this particular problem, I train my own word2vec model using Gensim.
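
A minimal sketch of that training step (Gensim's trainer is its Word2Vec class; the tokenization and hyper-parameters here are illustrative, and df comes from the loading snippet above):

from gensim.models import Word2Vec

# naive whitespace tokenization over all questions; gensim >= 4 API
sentences = [str(q).lower().split() for q in pd.concat([df['question1'], df['question2']])]
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)
w2v.save('quora_w2v.model')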

The trained model generates 300-dimensional vectors for words. The hyper-parameters could be chosen more carefully, but this is just a baseline to see the initial performance. However, as I'll show, this model performs below my expectations. I believe this is because our questions are short and do not provide enough semantic structure to learn a salient model.

Due to the performance issue and the observation above, I decide to use a pre-trained GloVe model that comes free with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
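
Roughly like this (a sketch; the model name is an assumption, and any Spacy model that ships word vectors will do):

import spacy

nlp = spacy.load('en_core_web_md')          # any model with word vectors works here
doc = nlp('How can I be a good geologist?')
print(doc.vector.shape)                      # mean of the token vectors, e.g. (300,)
print(doc[0].vector.shape)                   # vector of a single token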

Before going further: I really like Spacy. It is really fast, and it does everything you need for NLP in a flash while hiding many of the intrinsic details. It deserves great credit. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.

The result I get from the Spacy vectors is above the Gensim model I trained, so it is the better choice for going further with TF-IDF scoring. For TF-IDF, I use scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.
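
For illustration, a minimal sketch of that step (here I keep only the IDF part of the score as a word-to-weight lookup, which is a simplification of mine; df and pd come from the loading snippet above):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(pd.concat([df['question1'], df['question2']]).astype(str))

# map each vocabulary word to its inverse-document-frequency weight
word2weight = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))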

After we find the TF-IDF scores, we convert each question into a weighted average of its word2vec vectors using these scores. The code below does this for just the "question1" column.
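
A sketch of that conversion, assuming the nlp, df, and word2weight objects from the snippets above (the 300-dimensional fallback and the default weight of 1.0 for out-of-vocabulary words are my assumptions):

import numpy as np

def question_to_vec(text):
    doc = nlp(str(text))
    vecs, weights = [], []
    for token in doc:
        if token.has_vector:
            vecs.append(token.vector)
            weights.append(word2weight.get(token.text.lower(), 1.0))
    if not vecs:
        return np.zeros(300, dtype='float32')  # no usable tokens
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))

q1_vectors = np.vstack([question_to_vec(q) for q in df['question1']])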

Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels, convert the mean word2vec vectors to numpy format, and split the data into train and test sets.
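
Something like this (a sketch; q2_vectors is assumed to be computed the same way as q1_vectors, and the split ratio is illustrative):

from sklearn.model_selection import train_test_split

labels = df['is_duplicate'].values.astype('float32')
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(
    q1_vectors, q2_vectors, labels, test_size=0.2, random_state=42)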

At this stage, we need to define the Siamese network structure. I use Keras for its simplicity; the full model-definition script is in the GitHub repository.

I share here the best-performing network, which uses residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity, with Batch Normalization in every layer. This is particularly important, since the BN layers enhance performance considerably; I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.
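
Below is a rough tf.keras sketch that matches this description, not the exact script from the repository (layer sizes, the contrastive-loss margin, and the optimizer settings are assumptions):

from tensorflow.keras import Model, backend as K
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Activation, Lambda, add, concatenate

def euclidean_distance(tensors):
    a, b = tensors
    return K.sqrt(K.maximum(K.sum(K.square(a - b), axis=1, keepdims=True), K.epsilon()))

def contrastive_loss(y_true, y_pred, margin=1.0):
    # duplicates (y=1) are pulled together, non-duplicates pushed apart up to the margin
    y_true = K.cast(y_true, y_pred.dtype)
    return K.mean(y_true * K.square(y_pred) +
                  (1.0 - y_true) * K.square(K.maximum(margin - y_pred, 0.0)))

def dense_block(x, units=128):
    # Dense -> BatchNormalization -> ReLU ("BN before ReLU")
    x = Dense(units)(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x)

def base_network(input_dim=300):
    inp = Input(shape=(input_dim,))
    h1 = dense_block(inp)
    h2 = add([h1, dense_block(h1)])     # residual connection
    h3 = dense_block(h2)
    out = concatenate([h1, h2, h3])     # "layer concat" variant from the results below
    return Model(inp, out)

base = base_network()
q1_in = Input(shape=(300,))
q2_in = Input(shape=(300,))
distance = Lambda(euclidean_distance)([base(q1_in), base(q2_in)])
model = Model([q1_in, q2_in], distance)
model.compile(optimizer='adam', loss=contrastive_loss)
model.summary()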

I tried cosine distance, which is theoretically more concordant with Word2Vec vectors, but could not obtain better results with it. I also tried normalizing the data to unit variance or unit L2 norm, but nothing gave better results than the original feature values.

Let's train the network with the prepared data. I use the same model and hyper-parameters for all configurations. It is always possible to optimize these further, but even so I am able to give promising baseline results.
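
A training call along these lines would do it (the epoch count, batch size, and the distance threshold used for accuracy are assumptions of mine):

history = model.fit([X1_train, X2_train], y_train,
                    validation_data=([X1_test, X2_test], y_test),
                    epochs=50, batch_size=256)

# score by thresholding the learned distance
distances = model.predict([X1_test, X2_test]).ravel()
accuracy = ((distances < 0.5).astype(int) == y_test).mean()
print('test accuracy: %.2f' % accuracy)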

Results

In this section, I'd like to share the test-set accuracy values obtained with different model and feature-extraction settings. We expect to see an improvement over 0.63, since that is the accuracy we get when we predict all labels as 0.

These are the best results I obtain with the different word-vector models. They all use the same network and hyper-parameters, which I found by tuning the last configuration listed below.

  • Gensim (my model) + Siamese: 0.69
  • Spacy + Siamese :  0.72
  • Spacy + TF-IDF + Siamese : 0.79

We can also investigate the effect of different model architectures. These are the values obtained with the best word2vec model shown above.

  • 2 layers net : 0.67
  • 3 layers net + adam : 0.74
  • 3 layers resnet (after relu BN) + adam : 0.77
  • 3 layers resnet (before relu BN) + adam : 0.78
  • 3 layers resnet (before relu BN) + adam + dropout : 0.75
  • 3 layers resnet (before relu BN) + adam + layer concat : 0.79
  • 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance : Fail

Adam works quite well for this problem compared to SGD with learning-rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between layers in different orders (before ReLU, after BN, etc.); the best I obtain is 0.75. Concatenating the different layers improves performance by 1 percent, as the final gain.

In conclusion, I have tried to present a solution to this unique problem by composing different aspects of deep learning: we start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and leave room for further optimization. However, it is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).

Updates

  • Switching the last layer to an FC layer improves performance to 0.84.
  • Using bidirectional RNNs and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain the details in another post.

How to use Python Decorators

Decorators are handy syntactic sugar for Python programmers: they shorten things and enable more concise programming.

For instance, you can use decorators for user authentication in your REST API server. Assume that you need to authenticate the user before each REST call. Instead of repeating the same procedure in each handler function, it is better to define a decorator and tag it onto your handler functions.
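
A decorator for that scenario might look roughly like this (a hypothetical sketch of mine, not tied to any real framework; require_auth, the token check, and the response dicts are all illustrative):

# the decorator rejects a call unless the request carries a valid token
def require_auth(handler):
    def wrapper(request, *args, **kwargs):
        if request.get("token") != "secret-token":   # placeholder check
            return {"status": 401, "body": "Unauthorized"}
        return handler(request, *args, **kwargs)
    return wrapper

@require_auth
def get_profile(request):
    return {"status": 200, "body": "profile data"}

print(get_profile({"token": "secret-token"}))  # passes the check
print(get_profile({"token": "wrong"}))         # rejected by the decorator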

Let's see the small example below. I hope it is self-explanatory.


"""
How to use Decorators:

Decorators are functions called by annotations
Annotations are the tags prefixed by @
"""

### Decorator functions ###
def helloSpace(target_func):
    def new_func():
        print("Hello Space!")
        target_func()
    return new_func

def helloCosmos(target_func):
    def new_func():
        print("Hello Cosmos!")
        target_func()
    return new_func


@helloCosmos  # annotation
@helloSpace  # annotation
def hello():
    print("Hello World!")

### Above code is equivalent to these lines
# hello = helloSpace(hello)
# hello = helloCosmos(hello)

### Let's try it
hello()

 


Updating your local forked project with a commit made to the main project

This is from my Stack Overflow question. Thanks to "bitoiu". Here is the original thread.

How to pick up a single commit from a remote repo

Assuming you have a local clone of the repo you forked, if you type in the following you should get a single remote, origin:

> git remote
origin

Unless you've added the original repo's location, you won't have access to the commit you want to pick into your local clone. So we need to add it; let's assume the repo is https://github.com/GitbookIO/gitbook.git. Notice this is an HTTPS clone URL, because you won't have write access to this repo. Let's name it original_repo:

> git remote add original_repo https://github.com/GitbookIO/gitbook.git

And now let's get all the refs back:

> git fetch original_repo

At this point you have all you need locally; you'll just need to merge the commit into one of your branches, let's say your local master.

Find the commit you want to merge. This implies finding it in one of the branches the team used: it could already be merged to master, or you could be picking it up from the branch that was used for the pull request. Either way, just run a series of git log commands to find the commit you want if you don't know its reference. When you do, simply go to the branch where you want to merge the commit and run:

> git cherry-pick COMMIT_ID

This will bring the commit to whatever branch you are at the moment.

How to merge a branch from a remote repo

The only difference in these steps is that instead of doing a cherry-pick you will be doing a merge. So, imagining the contents of the pull request are in a branch named so-pr, you would simply do:

> git merge original_repo/so-pr

And that would merge the contents of so-pr into your working branch.
