Category Archives: What I learn today

Devil in the implementation details

I have been wrestling with an interesting problem lately. I trained a custom deep neural network model on ImageNet and ended up with very good results, at least on the training logs. I used Caffe for all of this. Then I ported my model to the Python interface and gave some objects to it. Boom! It was not working and even produced seemingly random probability values, as if it had not been trained for 4 days. It was really frustrating. After dozens of hours I discovered that the devil is in the details.

I was using one of the Batch Normalization PRs ("what is it?" a little intro here) that is not merged into the master branch but seemed fine. Then I found that interesting problem. The code in the branch computes each batch's mean by only looking at that batch. When we give only one example at test time, the mean values are exactly the values of that particular image, so subtracting them zeroes the activations out. This disables everything and the net starts to behave strangely. After a small search I found the solution, which uses a moving average instead of the exact batch average. Now I am at the implementation stage. The punchline is: do not use any PR that is not merged into the master branch. That simple 🙂
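To make the failure mode concrete, here is a toy numpy sketch of the difference (my own illustration, not the actual Caffe PR code): normalizing a single test image with its own batch statistics collapses everything to zero, while an exponential moving average collected during training does not.

import numpy as np

# toy per-feature batch norm over a (batch_size, n_features) array
running_mean, running_var, momentum = 0.0, 1.0, 0.9

def bn_train(batch):
    """Normalize with the batch's own statistics and update the moving averages."""
    global running_mean, running_var
    mu, var = batch.mean(axis=0), batch.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var  = momentum * running_var  + (1 - momentum) * var
    return (batch - mu) / np.sqrt(var + 1e-5)

def bn_test_batch_stats(x):
    # the broken behaviour: for a single example, x.mean(axis=0) == x and the
    # variance is zero, so every normalized activation collapses to 0
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

def bn_test_moving_average(x):
    # the fix: reuse the statistics accumulated over the whole training run
    return (x - running_mean) / np.sqrt(running_var + 1e-5)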


What I read for deep-learning

Today I spent some time on two new papers: one proposing a new way of training very deep neural networks (Highway Networks) and one proposing a new activation function for auto-encoders (Zero-bias Autoencoders and the Benefits of Co-adapting Features) which avoids the use of any regularization method such as contraction or denoising.

Let's start with the first one. Highway Networks proposes a new activation type similar to LSTM networks, and the authors claim that this peculiar activation is robust to any choice of initialization scheme and to the learning problems that occur in very deep NNs. It is also exciting to see that they trained models with more than 100 layers. The basic intuition here is to learn a gating function, attached to a real activation function, that decides whether to pass the activation or the input itself. Here is the formulation:

y = H(x, W_H) · T(x, W_T) + x · (1 − T(x, W_T))

so the layer output reduces to the plain activation H(x, W_H) when the gate outputs T = 1, and to the identity (y = x) when it outputs T = 0.

T(x, W_T) is the gating function and H(x, W_H) is the real activation. They use a Sigmoid activation for the gate and a Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn with 50 layers (this is the most my PC can handle).
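For reference, here is a minimal numpy sketch of a single highway layer under my reading of the formula above (the names W_H, b_H, W_T, b_T are mine, and this is an illustration rather than my Lasagne code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer for x of shape (batch, dim); both weight matrices are (dim, dim)."""
    H = np.maximum(0.0, x @ W_H + b_H)   # rectifier activation H(x, W_H)
    T = sigmoid(x @ W_T + b_T)           # gate T(x, W_T), in (0, 1)
    return H * T + x * (1.0 - T)         # pass the activation or the input itself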

The other paper, Zero-bias Autoencoders and the Benefits of Co-adapting Features, suggests the use of non-biased rectifier units for the inference stage of AEs. You can train your model with a biased Rectifier Unit, but at inference time (test time) you should extract features while ignoring the bias term. They show that doing so gives better recognition on the CIFAR dataset. They also devise a new activation function which has a similar intuition to Highway Networks. Again, there is a gating unit which thresholds the normal activation function.

h = T(z) = z · 1(z² > θ),   where z = W x

x_hat = W^T h

The first equation is the thresholded activation with a predefined threshold θ (they use 1 in their experiments). The second equation shows the reconstruction of the proposed model. Pay attention that they threshold on the square of a linear activation and call this model TLin, but they also use the plain linear (rectified) threshold, which is called TRec. What this activation does is suppress the small activations, so the model is implicitly regularized without any additional regularizer. This is actually good for learning an over-complete representation of the given data.
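In code, my reading of the two variants looks roughly like this (a sketch under my own notation, assuming tied weights and no bias terms; not the authors' code):

import numpy as np

def tlin(z, theta=1.0):
    # TLin: keep a linear activation only where its square exceeds the threshold
    return z * (z ** 2 > theta)

def trec(z, theta=1.0):
    # TRec: keep the activation only where it exceeds the threshold (rectifier-style)
    return z * (z > theta)

def reconstruct(x, W, theta=1.0):
    # zero-bias reconstruction with tied weights and no bias terms anywhere
    h = tlin(W @ x, theta)
    return W.T @ h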

For more than this silly intro, please refer to the papers 🙂 and warn me about any mistakes.

These two papers show an emerging trend in the Deep Learning community: the use of more complex activation functions. We can call it controlling each unit's behavior in a smart way, instead of letting them fire naively. My intuition also agrees with this idea. I believe we need even more sophistication for the smart units in our deep models, such as Spike and Slab networks.


Microsoft Research introduced a new NN model that beats Google and the others

MS researchers recently introduced a new deep (indeed very deep 🙂) NN model (PReLU Net) [1] and they push the state of the art on the ImageNet 2012 dataset from 6.66% (GoogLeNet) to 4.94% top-5 error rate.

In this work, they introduce an alteration of the well-known ReLU activation function. They call it PReLU (Parametric Rectified Linear Unit). The idea behind it is to allow negative activations in the ReLU function through a control parameter a, which is also learned during the training phase. Therefore, PReLU allows negative activations, and in the paper they argue and empirically show that PReLU is better at resolving the vanishing gradient problem for very deep neural networks (> 13 layers) thanks to this allowance of negative activations. That means more active units per layer, hence more gradient feedback at the backpropagation stage.
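The function itself is a one-liner; here is a numpy sketch (in the paper a is learned per channel, here it is just a scalar for illustration):

import numpy as np

def prelu(x, a):
    # identity for positive inputs, a learned slope `a` for negative ones;
    # a = 0 recovers plain ReLU, a small fixed constant recovers Leaky ReLU
    return np.maximum(0.0, x) + a * np.minimum(0.0, x)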

[Figure: the PReLU activation; all figures are from the paper]



Simple Parallel Processing in Python

Here is a very concise overview of the Python multiprocessing module and its benefits. It is certainly an important module for large-scale data mining and machine learning projects and Kaggle-like challenges. Therefore, take a brief look at the slides to discover how to speed up your project cycle.
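As a quick taste of the module, a minimal sketch of the usual Pool pattern (the worker function here is just a stand-in for any CPU-bound job):

from multiprocessing import Pool

def heavy_work(item):
    # stand-in for any CPU-bound job, e.g. per-image feature extraction
    return item ** 2

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # spawn 4 worker processes
        results = pool.map(heavy_work, range(100))
    print(results[:5])                           # [0, 1, 4, 9, 16]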

For more info refer to :

Why can't the poor be handed out lots of money to make them rich?

Answer by Yishan Wong:

Because money is not wealth.

This is a fundamental misunderstanding that many people have about money, and in fact probably stands at the center of why some people are good at making money while others are not, why some people are wealthy and others are not.

(First, set aside the issues of inherited wealth, or "unfairly earned" money.  Those are distortive effects, but let's focus on the dominant factor)

Wealth ("being rich") means producing things of value.  It does not mean "having lots of money."

The key word there is value.  That word is more important than wealth or money, it is the real central factor around which human endeavor and economies revolve.  Money and wealth are big words that get a lot of play, but value is a boring word that most people don't notice.  It is actually the important one.  Value is what is produced when you do work, mine resources, develop an idea, produce an invention, engage in mutually-beneficial commerce, etc.  Value is the "thing" that humans make (out of nothing) by working, creating, trading, etc.


Some possible Matrix Algebra libraries based on C/C++

I've gathered the following from online research so far:

I've used Armadillo a little bit, and found the interface to be intuitive enough, and it was easy to locate binary packages for Ubuntu (and I'm assuming other Linux distros). I haven't compiled it from source, but my hope is that it wouldn't be too difficult. It meets most of my design criteria, and uses dense linear algebra. It can call LAPACK or MKL routines.

I've heard good things about Eigen, but haven't used it. It claims to be fast, uses templating, and supports dense linear algebra. It doesn't have LAPACK or BLAS as a dependency, but appears to be able to do everything that LAPACK can do (plus some things LAPACK can't). A lot of projects use Eigen.


SQL injection with UNION ALL : HTS realistic mission 4

Fischer’s Animal Products: A company slaughtering animals and turning their skin into overpriced products sold to rich bastards! Help animal rights activists increase political awareness by hacking their mailing list.

So I finally got around to writing a walkthrough/guide for Hack This Site realistic mission 4. Your objective is to get the email addresses of the subscribers to the newsletter of Fischer's Animal Products.
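The core idea behind a UNION ALL injection, sketched here in Python with made-up table and column names (the real ones in the mission differ), is that unsanitized input lets the attacker append a second SELECT whose rows get rendered into the page:

# Hypothetical illustration; the vulnerable page pastes the user value straight into its SQL:
user_input = "3"
query = "SELECT title, body FROM news WHERE id=" + user_input

# An attacker instead supplies an id that appends a second SELECT whose column
# count matches the first one:
user_input = "3 UNION ALL SELECT email, 1 FROM subscribers"
query = "SELECT title, body FROM news WHERE id=" + user_input
print(query)
# SELECT title, body FROM news WHERE id=3 UNION ALL SELECT email, 1 FROM subscribers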


Hacker's first target file /etc/passwd on Linux! Why?

On Linux, the /etc/passwd file includes information about the user accounts on the operating system. The details of each user account, its numeric IDs, home directory, login shell and the password field (nowadays usually just a placeholder, with the real hash kept in /etc/shadow), are stored here along with some extra information. Here is the general structure of the file with the explanation needed to interpret it:
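As a quick illustration of that structure, a minimal Python sketch that prints the seven colon-separated fields of each entry, following the standard layout (name, password placeholder, UID, GID, GECOS/comment, home directory, login shell):

# Print the seven colon-separated fields of every /etc/passwd entry.
with open("/etc/passwd") as f:
    for line in f:
        name, pw, uid, gid, gecos, home, shell = line.rstrip("\n").split(":")
        print(f"{name:<15} uid={uid:<6} gid={gid:<6} home={home:<22} shell={shell}")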


What is "long long" type in c++?

long long is not the same as long (although they can have the same size, e.g. on most 64-bit POSIX systems). It is only guaranteed that a long long is at least as long as a long, and the standard requires it to be at least 64 bits wide. On most platforms, a long long is a 64-bit signed integer type.

You could use long long to store an 8-byte value safely on most conventional platforms, but it is better to use int64_t/int_least64_t from <stdint.h>/<cstdint> to make it clear that you want an integer type of at least 64 bits.
