Comparison of Deep Learning Libraries After Years of Use

As we witness the golden age of AI and deep learning, different communities keep proposing new tools and frameworks. Sometimes it is hard even to keep up with what is going on. You choose one library over another, then you see a new one and go for it. Still, the right choice does not seem obvious to anyone.

From my point of view, libraries are measured by the trade-off between flexibility and run-time. If a library is really easy to use, it tends to be correspondingly slow. If it is very fast, then it offers less flexibility, or it is so specialized to a particular type of model, such as convolutional NNs, that it does not support the type you are interested in, such as recurrent NNs.

After all the blood, sweat and tears shed over years of experience in deep learning, I decided to share my own intuition and opinion about the common deep learning libraries, in the hope that it helps you choose the right one for your needs.

Let's start by defining some metrics to evaluate a library. These are the points that I consider:

  1. Community support: It is really important, especially for a beginner, to be able to ask questions and get answers while learning the library. This basically depends on the size and visibility of the library's community.
  2. Documentation: Even if you are familiar with a library, its extensive and evolving nature makes up-to-date documentation vital. A library might be the best, but it also needs solid documentation to prove it to a user.
  3. Stability: Many of the libraries are open-source. That is of course good, but open-source can also mean more fragile and buggy implementations. It is also really hard to tell in advance whether a library is stable enough for your code. It takes time to investigate, and the worst case is discovering the real problem at the end of your implementation. That is really disruptive; I experienced it once and never again 🙂
  4. Run-time performance: This includes GPU and CPU run-times, use of the hardware capabilities, distributed training with multiple GPUs on a single machine or across multiple machines, and memory use, which limits the models you can train.
  5. Flexibility: Experimenting with new things and developing custom tools and layers is also a crucial part of the game. If you are a researcher, it is maybe the foremost point that you count on.
  6. Development: Some libraries are developed at a great pace, so it is always easy to find new functionality and state-of-the-art layers and functions. That is good, but sometimes it makes the library hard to keep up with, especially if its documentation is deficient.
  7. Pre-trained models and examples: For a hacker who wants to use deep learning, this is the most important metric. Many of the successful deep learning models are trained on big computer clusters with very extensive experimentation. Not everyone can budget for such computation power. Therefore, it is important to have pre-trained models to start from.

Below, I discuss each library in turn on these points.


Torch is the one I started using most recently. As most of you know, Torch is a Lua-based library used extensively by the Facebook and Twitter research teams for deep learning products and research.

  1. Community Support: It has a relatively small community compared to other libraries, but it is still very responsive to any problem or question that you encounter.
  2. Documentation: Good documentation is waiting for you. Still, if you are new to Lua too, sometimes it is not enough and it leaves you googling for more.
  3. Stability: It is really stable. I have not seen any robustness problem yet.
  4. Run-time performance: This is the strongest point of Torch. It uses all the capability of whatever hardware you have. You can switch between hardware back-ends by importing the corresponding modules. The switch is not completely invisible, but it is still easy to convert your CUDA code to CPU or vice versa. One other difference: you need to convert a CPU or GPU model explicitly to run it on the other back-end. A trained model is not compatible with all architectures without a touch. Thus you see many questions asking "How can I convert GPU model to CPU?". It is not hard, but you need to know it. It is also very easy to use multiple GPUs in a single machine, but I have not yet seen any support for distributed training in a multi-machine setting.
  5. Flexibility: Due to the weirdness of Lua, it is not my choice if I need to develop something custom. Also, it has no powerful auto-differentiation mechanism (AFAIK), so you need to code the back-propagation function for your layers as well. Despite such caveats, it is still the library most widely adopted by researchers and publications.
  6. Development: It is maybe the most successful library at keeping up with what is new in the deep learning literature. It is actively developed, but it is not that open to contributions from third-party developers. At least this is my intuition.
  7. Pre-trained models and examples: It has pretty good pre-trained model support, with models such as ResNet generally released by the Facebook team. Other than this, it supports converting Caffe models, like some other libraries. In addition, you can find examples of different architectures and problems, including NLP, vision and others.


I really like the performance and its aggressive resource use on CPU or GPU. However, I observe a bit more memory use on the GPU, which is a bottleneck for training large models. I'd personally use Torch as my main tool, but Lua still seems very intricate compared to Python. For a guy like me, who uses Python for everything, using Torch models complicates development. Still, there is a great library, Lutorpy, which makes Torch models usable from Python.


MxNet is a library backed by the Distributed (Deep) Machine Learning Community (DMLC), which has already conducted many great projects, such as xgboost, the dream tool of many Kagglers. Although it is not much highlighted on the web or in deep learning communities, it is a really powerful library supporting many different languages, including Python, Scala and R.

  1. Community Support: Every conversation and question goes through GitHub issues, and many problems are answered directly by the core developers, which is great. On the other hand, it sometimes takes a while to get your answers.
  2. Documentation: Compared to its really fast development progress, its documentation falls slightly behind. It is better to follow merge requests on the repo and read the raw code to see what is happening. Besides that, it has a very good collection of examples and tutorials in different formats and languages.
  3. Stability: The intense development effort causes some instability issues. I experienced a couple of those. For instance, a trained model gives different outputs with different back-end architectures. I guess they solved it to some extent, but I still see it with some of my models.
  4. Run-time performance: I believe it is the fastest library in terms of training time. One important note is to set all the options well for your machine configuration. It has a really efficient memory footprint compared to other libraries, thanks to its optimized computational graph paradigm. I am able to train many large models that do not fit with the other libraries. It is very easy to use multiple GPUs, and it supports distributed training as well via a distributed SGD algorithm.
  5. Flexibility: It is based on a third-party tensor computation library developed in C++, called MShadow. Therefore, you need to learn that first to develop custom things that utilize the full potential of the library. You can also code custom things through language interfaces like Python, and it is possible to combine already implemented blocks into custom functionality. However, to be honest, I have not seen many researchers using MxNet.
  6. Development: It exhibits a really good development effort, mainly regulated by the core developers of the team. Still, they are open to pull requests and to discussing something new with you.
  7. Pre-trained models and examples: You can convert some pre-trained Caffe models, like VGG, with the provided script. They also released InceptionV3-type ImageNet networks and an InceptionV2-type model trained on the 21K ImageNet collection, which is really great for fine-tuning. I am also waiting for ResNet, but there is still none.


This is the library of my choice for many of my projects, mostly due to its run-time efficiency, really solid Python support and lower GPU memory use.

Some criticism: MxNet mostly supports vision problems, and they have only partially started to work on NLP architectures. You need to convert all data to their data format for the best efficiency; it slows down implementation but makes things more efficient in terms of memory and hard-drive use. Still, for small projects, it is a pain. You can convert your data to a numpy array and use that, but then you are not able to use the extensive set of data augmentation techniques provided by the library.


Theano is maintained by the Montreal group. It is the first of its kind as far as I know. It is a Python library which takes your written code and compiles it to C++ and CUDA. It targets machine learning applications in general, not just deep learning. It also converts the code to a computational graph, like MxNet, and then optimizes memory and execution. However, all these optimizations take a good deal of time, which is the real problem of the library. Since Theano is a general-purpose machine learning library, the following points are based on the deep learning libraries Lasagne and Keras, which share many properties.

  1. Community Support: Both have big communities, with Google user groups and GitHub issue pages. I'd say Keras has more support than Lasagne. You can get any question answered quickly.
  2. Documentation: Simple but powerful documentation for both. Once you get the logic behind these libraries, developing your own models and applications is very fluid. Each important subject is explained by an example, which is something I really like about scikit-learn as well.
  3. Stability: They are really fast-paced libraries. Because Theano makes it simple to develop new things, they pick up what is new easily, but that is also dangerous in terms of stability. As long as you do not rely on the latest features, they are stable.
  4. Run-time performance: They are bounded by the abilities of Theano; beyond that, Theano-based libraries differ only in programming technique and in how correctly they use Theano. The real problem for these libraries is the compile time you wait through before model execution. It is sometimes too much to bear, especially for large models. Once compilation succeeds, after the last update of Theano, training on GPU is really fast. I do not have much experience with CPU execution. Memory use is not as efficient as MxNet but still comparable with Torch. AFAIK, they started to support multi-GPU execution with the latest version of Theano, but distributed training is still out of scope.
  5. Flexibility: Due to the auto-differentiation of Theano and the syntactic goodies of Python, it is really easy to develop something new. You only need to take an already implemented layer or function and modify it into your custom thing.
  6. Development: These libraries are truly community-driven open-source projects. They are very fast to capture what is new. Because development is so easy, sometimes one thing has lots of alternative implementations.
  7. Pre-trained models and examples: They provide VGG networks, and there are scripts to convert Caffe models. However, I have not experimented with converted Caffe models in these libraries.


If we need to compare Keras and Lasagne, Keras is more modular and hides all the details from the developer, which is reminiscent of scikit-learn. Lasagne is more like a toolbox which you use to come up with more custom things.

I believe these libraries are perfect for quick prototyping. Anything can be implemented in a flash without the details being hidden from your view.


Caffe is the flagship of deep learning libraries for both industry and research. It is the first successful open-source implementation, with a very solid but simple foundation. You do not need to know how to code to use Caffe. You define your network with a description file and train it.

  1. Community Support: It has maybe the largest community. I believe anyone interested in deep learning has some experience with it. It has a large and long-standing Google users group and GitHub issue pages that are full of information.
  2. Documentation: I find the documentation is always a bit behind the current state of the library. Even though it does not have an extensive documentation page comparable to other libraries, you can always find tutorials and examples on the web to learn more. A simple Google query gives many different resources as well.
  3. Stability: It is a really solid library. It uses well-known libraries for matrix operations and CUDA calls. I have not seen any problem yet.
  4. Run-time performance: It is not the best but always acceptable. It uses well-established libraries for run-time critical operations like convolution, and it is bounded by these libraries. Custom solutions can give better run-times, but they degrade stability by about the same amount. You can switch between the CPU and GPU back-ends with a simple call, without any change to your code. It does well in terms of memory consumption, but it still uses too much compared to MxNet, especially for Inception-type models. One problem is that it does not support GPUs other than Nvidia. There are of course branches for that, but I have never used them. It supports multi-GPU training on a single machine but not distributed training.
  5. Flexibility: Learning to code with Caffe is not that hard, but the documentation is not helpful enough. You need to look at the source code to understand what is happening and use existing implementations as templates for your custom code. After you understand the basics, it is easy to use and bend the library as you need. It has a good Python interface and supports new layers written in Python. It is a good library which hides the GPU and CPU integration from the developer. Caffe is well accepted by the research community as well.
  6. Development: It has very broad developer support and many forks that target different applications, but the master branch is very picky about anything new. This is good for a stable library, but it is also the cause of so many forks. For instance, Batch Normalization was merged into the master branch only after years of waiting and discussion.
  7. Pre-trained models and examples: The Caffe model zoo is the heaven of pre-trained models for a variety of domains, and the collection keeps growing. It has a good set of example code that can kick-start your own project.


Caffe is the first successful deep learning library in many respects. It is stable and efficient.

Sometimes it is a huge bother to define large models with a model description file. It makes things fragile and error-prone. For example, you can miss a number or mistype it, and then your model crashes. Finding such small problems across hundreds of lines is a huge bother. In such cases, the Python interface is the wiser choice: you define some functions to create common layers.
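To make the idea of "functions that create common layers" concrete, here is a minimal sketch that does not need Caffe installed. The helper names (`conv_layer`, `relu_layer`, `build_net`) are hypothetical and only emit prototxt-style text; this is an illustration of the approach, not Caffe's own Python API.

```python
def conv_layer(name, bottom, num_output, kernel_size=3, pad=1):
    """Return a prototxt-style Convolution layer definition as text."""
    return (
        'layer {\n'
        f'  name: "{name}"\n'
        '  type: "Convolution"\n'
        f'  bottom: "{bottom}"\n'
        f'  top: "{name}"\n'
        '  convolution_param {\n'
        f'    num_output: {num_output}\n'
        f'    kernel_size: {kernel_size}\n'
        f'    pad: {pad}\n'
        '  }\n'
        '}\n'
    )


def relu_layer(name, bottom):
    """Return an in-place ReLU layer definition as text."""
    return (
        'layer {\n'
        f'  name: "{name}"\n'
        '  type: "ReLU"\n'
        f'  bottom: "{bottom}"\n'
        f'  top: "{bottom}"\n'
        '}\n'
    )


def build_net(widths, data_blob="data"):
    """Stack conv + ReLU blocks; one entry in `widths` per block."""
    parts, bottom = [], data_blob
    for i, width in enumerate(widths, start=1):
        conv_name = f"conv{i}"
        parts.append(conv_layer(conv_name, bottom, width))
        parts.append(relu_layer(f"relu{i}", conv_name))
        bottom = conv_name
    return "".join(parts)


# Three blocks are described by three numbers instead of six hand-typed layers.
net_text = build_net([64, 128, 256])
```

Changing the depth or the widths then means editing one list, which removes most of the chances to mistype a number somewhere in hundreds of lines.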

NOTE: This is all my own experience with these libraries. Please correct me if you see something wrong or misleading. Hope this helps you. BEST 🙂




This paper is an interesting work which tries to explain the similarities and differences between representations learned by different networks with the same architecture.

Within the scope of their experiments, they train four different AlexNets and compare the units of these networks by correlation and mutual information analysis.

They ask the following questions:

  • Can we find a one-to-one matching of units between networks, showing that these units are sensitive to similar or the same commonalities in the image?
  • Does the one-to-one matching stay the same under different similarity measures? They first use correlation, then mutual information, to confirm the findings.
  • Is the representation learned by one network a rotated version of the other's, to the extent that one-to-one matching is not possible between the networks?
  • Is clustering plausible for grouping units in different networks?

Answers to these questions are as follows:

  • It is possible to find well-matching units with really high correlation values, but there are some units learning unique representations that are not replicated by the other networks. The degree of representational divergence between networks grows with depth. Hence, we see large correlations at the conv1 layers; the value decreases toward conv5 and is at its minimum at the conv4 layer.
  • They first analyze layers by the correlation values among units. Then they measure the overlap with mutual information, and the results confirm each other.
  • To see the differences between the learned representations, they use a very smart trick. They approximate the representations learned by a layer of one network using the same layer of another network. A sparse approximation is performed using LASSO. The results indicate that some units are approximated well with 1 or 2 units of the other network, but the remaining units require almost 4 counterpart units for a good approximation. This shows that units with a good one-to-one matching have learned local codes, while other units have slightly distributed codes approximated by multiple counterpart units.
  • They also run a hierarchical clustering in order to group similar units successfully.
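
The correlation-based matching can be sketched roughly as follows. This is a minimal illustration under my own assumptions: the activations are synthetic, the function name `match_units` is hypothetical, and the matching is a simple greedy "best partner per unit" rule rather than the exact procedure of the paper.

```python
import numpy as np


def match_units(acts_a, acts_b):
    """Match each unit of net A to its most correlated unit in net B.

    acts_a, acts_b: arrays of shape (n_samples, n_units) holding the
    activations of the same layer in two networks on the same inputs.
    """
    # Standardize each unit so the product below is a Pearson correlation.
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    corr = za.T @ zb / acts_a.shape[0]      # shape (units_a, units_b)
    best = corr.argmax(axis=1)              # greedy, not forced one-to-one
    return best, corr[np.arange(corr.shape[0]), best]


rng = np.random.default_rng(0)
acts_a = rng.normal(size=(1000, 8))
perm = rng.permutation(8)
# Synthetic "net B": the same units in shuffled order plus a little noise.
acts_b = acts_a[:, perm] + 0.1 * rng.normal(size=(1000, 8))
best, strength = match_units(acts_a, acts_b)
```

On real activations, units whose best correlation stays low would be the candidates for the "unique" representations mentioned above.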

For details please refer to the paper.

My discussion: We see that different networks learn similar representations with some accompanying level of uniqueness. After this paper, it is intriguing to ask whether these unique representations are what cause performance differences between networks, and whether their effect is improving or worsening. Additionally, we might be able to combine these differences at the end to improve network performance with some set of smart tricks.

One deficit of the paper is that they do not experiment with deeper networks, which are the real deal these days. As we see from the results, as the layers go deeper, different abstractions are extracted by different networks. I believe this is more severe for deeper architectures such as the Inception or VGG kind.

Another curious thing would be to study Residual networks. The intuition of Residual networks is to pass the already learned representation to upper layers and to add more through the residual channel if something useful is learned by the next layer. That idea suggests that two residual networks might be more similar than two Inception networks. Moreover, we can compare different layers inside a single Residual network to see up to what level the representation stays the same.



This work proposes yet another way to initialize your network, namely LUV (Layer-sequential Unit-variance; the paper abbreviates it as LSUV), targeting especially deep networks. The idea relies on the recently proposed Orthogonal initialization and on fine-tuning the weights with data so that each layer output has a variance of 1.

The scheme follows four stages:

  1.  Initialize weights with a unit-variance Gaussian
  2.  Find the components of these weights using SVD
  3.  Replace the weights with these components
  4.  Using minibatches of data, rescale the weights so that each layer output has a variance of 1. This iterative procedure is described in the pseudo code below.
FROM the paper. Pseudo code of the initialization scheme.


To describe the code in words: at each iteration we feed a new mini-batch and compute the output variance of the layer. We compare the difference between the computed variance and the target variance of 1 against the threshold Tol_{var}. As long as the number of iterations is below the maximum number of iterations and the difference is above Tol_{var}, we rescale the layer weights by dividing them by the square root of the mini-batch output variance. After initializing this layer, we go on to the next layer.
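
For a single linear layer, the procedure might look like the sketch below. It is a simplified illustration, not the authors' code: the function names, the tolerance, the iteration cap and the synthetic Gaussian mini-batches are all my assumptions.

```python
import numpy as np


def orthonormal(shape, rng):
    """Gaussian matrix replaced by its orthonormal SVD factor."""
    a = rng.normal(size=shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    # Pick whichever factor has the requested shape.
    return u if u.shape == shape else vt


def lsuv_layer(weights, batches, tol_var=0.05, max_iter=10):
    """Rescale `weights` until the layer output variance is close to 1."""
    for i, x in enumerate(batches):
        var = (x @ weights).var()
        if abs(var - 1.0) < tol_var or i >= max_iter:
            break
        # Dividing by the standard deviation divides the variance by `var`.
        weights = weights / np.sqrt(var)
    return weights


rng = np.random.default_rng(0)
w = orthonormal((64, 32), rng)
# Synthetic mini-batches with a deliberately "wrong" input scale.
batches = (rng.normal(scale=3.0, size=(256, 64)) for _ in range(20))
w = lsuv_layer(w, batches)
out_var = (rng.normal(scale=3.0, size=(4096, 64)) @ w).var()
```

In a real network the same loop would run layer by layer, feeding each layer the activations produced by the already initialized layers below it.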

In essence, this is what the method does. First, we start with a normal Gaussian initialization, which we know is not enough for deep networks. The orthogonalization stage decorrelates the weights so that each unit of the layer starts to learn from a distinctly different point in the space. At the final stage, the LUV iterations rescale the weights and keep the forward and backward propagated signals close to a useful variance, guarding against the vanishing or exploding gradient problem, similar to Batch Normalization but without the computational load. Nevertheless, as they also point out, LUV is not interchangeable with BN, especially for large datasets like ImageNet. Still, I'd like to see a comparison of LUV vs BN, but it is not done or not written up in the paper (Edit by the Author: Figure 3 in the paper has a CIFAR comparison of BN and LUV, and ImageNet results are posted on

The good side of this method is that it works, at least in my experiments on ImageNet with different architectures. It is also not too much of a hurdle to code if you already have Orthogonal initialization at hand. Even if you don't have it, you can start with a Gaussian initialization scheme, skip the orthogonalization stage and directly use the LUV iterations. It still works, with a slight decrease in performance.

Paper review: Dynamic Capacity Networks


Decompose the network structure into two networks F and G, keeping a set of top layers T at the end. F is a small network and G is a more advanced one. Thus F is cheap to execute but has lower performance compared to G.

In order to reduce the overall computation and get both the performance and the computation gains of the two networks, they suggest an incremental pass of the input data from F to G.

Network F decides the salient regions of the input by using gradient feedback, and then these smaller regions are sent to network G for better recognition performance.

Given an input image x, the coarse network F is applied and coarse representations of different regions of the input are computed. These coarse representations are propagated to the top layers T, and T computes the final output of the network, which is the class predictions. An entropy measure is used to see how each coarse representation affects the model's uncertainty: if a region is salient, we expect a large change in the uncertainty with respect to its representation.

We select the top k input regions as salient, guided by the computed entropy changes, and these regions are given to the fine network G to obtain finer representations. Eventually, we merge all the coarse and fine representations, give them to the top layers T again and get the final predictions.
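
The region selection step can be sketched in a few lines. Everything here is a toy stand-in under my own assumptions: `top_layers` is a hypothetical replacement for T (tanh, average pooling, softmax), the coarse features are random, and the entropy gradient is estimated by finite differences instead of back-propagation.

```python
import numpy as np


def top_layers(regions, w):
    """Hypothetical stand-in for T: pool region features, then softmax."""
    logits = w @ np.tanh(regions).mean(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()


def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())


def saliency(regions, w, eps=1e-4):
    """Norm of the entropy gradient w.r.t. each region's coarse vector."""
    scores = np.zeros(regions.shape[0])
    for r in range(regions.shape[0]):
        grad = np.zeros(regions.shape[1])
        for d in range(regions.shape[1]):
            plus, minus = regions.copy(), regions.copy()
            plus[r, d] += eps
            minus[r, d] -= eps
            # Central difference approximation of dH/dc[r, d].
            grad[d] = (entropy(top_layers(plus, w))
                       - entropy(top_layers(minus, w))) / (2 * eps)
        scores[r] = np.linalg.norm(grad)
    return scores


rng = np.random.default_rng(0)
coarse = rng.normal(size=(16, 8))   # 16 regions, 8-dim coarse features
w = rng.normal(size=(10, 8))        # 10 classes
k = 4
# Indices of the k regions whose features move the entropy the most.
salient = np.argsort(saliency(coarse, w))[::-1][:k]
```

Only the regions listed in `salient` would then be passed through the expensive network G.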

At training time, all networks and layers are trained simultaneously. However, one might still decide to train the networks F and G separately using the same top layers T. The authors posit that simultaneous training is useful for keeping the fine and coarse representations similar, so that the final layers T do not struggle too much to learn from two different representation distributions.

I only try to give the overall idea here; if you would like to see more detail and delve into the formulas, please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited to small datasets and small spatial dimensions. I would really like to see whether it is also useful for large problems like ImageNet or even larger.

Another caveat is that the datasets used for the experiments are not very cluttered. Therefore, it is easy to detect salient regions, even with some algorithmic techniques. Thus, the value of this method in real-life problems is still unclear to me.

Paper Review: Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)?

There is a theoretical proof that a network with one hidden layer and a sufficient number of sigmoid units is able to learn any decision boundary. Empirical practice, however, tells us that learning good data representations demands deeper networks, like last year's ImageNet winner ResNet.

There are two important findings of this work. The first is that we need convolution, at least for image recognition problems, and the second is that deeper is always better. Their results are decisive even on a small dataset like CIFAR-10.

They also give a good little paragraph explaining a good way to obtain the best possible shallow networks based on the deep teachers.

- train state-of-the-art deep models

- form an ensemble from the best subset

- collect the predictions on a large enough transfer set

- distill the teacher ensemble knowledge into the shallow network.

(If you would like to see more about how to apply the teacher-student paradigm successfully, refer to the paper. It gives a very comprehensive set of instructions.)
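
The last two steps of the recipe can be sketched as below. I am assuming here that the student is trained to regress the averaged teacher logits with a squared error, which is one common form of distillation; the teacher outputs are synthetic and the function names are hypothetical, so check the paper for the exact objective.

```python
import numpy as np


def ensemble_logits(teacher_logits):
    """Average logits over teachers.

    teacher_logits has shape (n_teachers, n_samples, n_classes).
    """
    return np.mean(teacher_logits, axis=0)


def mimic_loss(student_logits, target_logits):
    """Mean squared error between student and ensemble logits."""
    return float(np.mean((student_logits - target_logits) ** 2))


rng = np.random.default_rng(0)
# Synthetic stand-in for 4 teachers scored on a 100-sample transfer set.
teachers = rng.normal(size=(4, 100, 10))
targets = ensemble_logits(teachers)
# A student that is close to, but not exactly at, the ensemble output.
student = targets + 0.1 * rng.normal(size=targets.shape)
loss = mimic_loss(student, targets)
```

The shallow student would be trained by minimizing `mimic_loss` over the transfer set instead of the usual cross-entropy against hard labels.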

Still, as the experimental results also show, the best possible shallow network is behind its deep counterpart.

FROM PAPER, network performances. As you see, performance gets better with the number of layers, and the teacher is always better than the student.


My Discussion:

I believe the success of deep versus shallow depends not on the theoretical basis but on the way networks learn in practice. If we think of networks as representation machines which add finer details to coarse concepts, then something like learning a face without knowing what an eye is does not seem feasible. Due to the one-way information flow of convolutional networks, this hierarchy of concepts remains and prevents shallow architectures from learning on par with deep ones.

Then how can we train shallow networks comparable to deep ones, given that we have such theoretical justifications? I believe one way is to add intra-layer connections, which connect each unit of a layer to other units of the same layer. These might be recurrent connections or just lateral connections that give shallow networks the chance to learn higher abstractions.

Convolution is also obviously necessary. Although we learn each filter from the whole input, each filter is still receptive to particular local commonalities. This is not doable with fully connected layers, since they learn from the whole spatial range of the input.

Fighting against class imbalance in a supervised ML problem.

ML on imbalanced data

Given an imbalanced learning problem with a large class and a small class, with N and M instances respectively:

  • Cluster the larger class into M clusters and use the cluster centers for training the model.
  • If it is a neural network or some compatible model, cluster the large class into K clusters and use these clusters as pseudo classes to train your model. This method is also useful when training your network with a small number of classes. It pushes your net to learn fine-detailed representations.
  • Divide the large class into subsets of M instances, then train multiple classifiers and use the ensemble.
  • Hard-mining is a solution which is unfortunately prone to over-fitting but yields good results in some particular cases such as object detection. The idea is to select the most confusing instances from the large set per iteration. Thus, select the M most confusing instances from the large class, use them for that iteration and repeat for the next iteration.
  • Especially for batch learning, frequency-based batch sampling might be useful. For each batch, you can sample instances from the small class with probability N/(M+N) and from the large class with probability M/(M+N), so that you prioritize the small class instances for the next batch. As long as you apply data augmentation, as in CNN models, repeating instances of the small class is mostly not a big problem.
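
The batch sampling idea from the last item can be sketched like this. I am assuming the goal is to prioritize the small class, so the sketch picks the small class with probability N/(M+N); the function name `sample_batch` and the index ranges are made up for illustration.

```python
import numpy as np


def sample_batch(large_idx, small_idx, batch_size, rng):
    """Draw a batch where the class choice is weighted by inverse frequency."""
    n, m = len(large_idx), len(small_idx)
    p_small = n / (n + m)               # the rare class is picked more often
    from_small = rng.random(batch_size) < p_small
    batch = np.where(
        from_small,
        rng.choice(small_idx, size=batch_size),   # sampled with replacement
        rng.choice(large_idx, size=batch_size),
    )
    return batch, from_small


rng = np.random.default_rng(0)
large_idx = np.arange(0, 9000)        # N = 9000 instances of the large class
small_idx = np.arange(9000, 10000)    # M = 1000 instances of the small class
batch, from_small = sample_batch(large_idx, small_idx, 512, rng)
```

Small-class instances get repeated often under this scheme, which is why it works best together with data augmentation.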

Note on metrics: the plain accuracy rate is not a good measure for such problems, since you see very high accuracy if your model just predicts the larger class for all instances. Instead, prefer the ROC curve or keep watching precision and recall.
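
A tiny made-up example shows how misleading accuracy can be. The labels below are synthetic (990 large-class and 10 small-class instances), and the "model" simply predicts the large class every time.

```python
import numpy as np


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (small) class labelled 1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)


y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # always predict the large class
accuracy = float(np.mean(y_true == y_pred))
precision, recall = precision_recall(y_true, y_pred)
```

Accuracy comes out at 99% here while recall on the small class is zero, which is exactly the failure that accuracy hides.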

Please keep me updated if you know something more. Even though this is a very common issue in practice, it is still hard to find a working solution.

How many training samples do we observe over a lifetime?

In this post, I would like to compute the number of visual instances we observe over time, assuming that we visually perceive life as a constant video with a certain fps rate.

Let's dive into the computation. Relying on [1], an average person sees the world at 45 fps. It goes to extremes for people like fighter pilots, reaching 225 fps with the adrenaline kicked in. I took the average lifetime as 71 years [3], which equals 2239056000 (2.24 billion) secs, and we are awake almost 2/3 of it, which makes 1492704000 (1.49 billion) secs. Then we assume that on average there are 86*10^9 neurons in our brain [2]. This is our model size.

Eventually and roughly, without any further investigation, that means we have a model with 86 billion parameters which learns from 1492704000 * 45 = 67171680000, almost 67 billion, images.
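
The arithmetic is easy to reproduce. The sketch below uses the same inputs as above (71 years, awake 2/3 of the time, 45 fps, 86 billion neurons) and assumes 365-day years; the variable names are mine.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60     # ignoring leap years
life_seconds = 71 * SECONDS_PER_YEAR      # 2,239,056,000
awake_seconds = life_seconds * 2 // 3     # 1,492,704,000
frames = awake_seconds * 45               # 67,171,680,000
neurons = 86 * 10**9
frames_per_neuron = frames / neurons      # a bit under one frame per neuron
```

So under these assumptions the "dataset" has fewer samples than the "model" has parameters, at roughly 0.78 frames per neuron.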

Of course this is not a rigorous way to come up with these numbers, but the fun comes from ignorance 🙂




Eren Golge's Blog