I was wrestling with an interesting problem lately. I trained a custom deep neural network model on ImageNet and got very good results, at least in the training logs. I used Caffe for all of this. Then I ported my model to the Python interface and fed some objects to it. Boom! It did not work, and it even produced random probability values, as if it had not been trained for four days. It was really frustrating. After a dozen hours I discovered that the devil is in the details.
I was using one of the Batch Normalization (what is it? a little intro here) PRs that is not merged into the master branch but seemed fine. Then I found that interesting problem. The code in that branch computes each batch's mean by looking only at that batch. When we give only one example at test time, the mean values are exactly the values of that particular image. This throws everything off and the net starts to behave strangely. After a small search I found the solution, which uses a moving average instead of the exact batch average. Now I am at the implementation stage. The punch line is: do not use any PR that is not merged into the master branch. That simple 🙂
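The fix can be sketched as follows: keep running estimates of the mean and variance while training, and use those frozen statistics at test time instead of the current batch's own statistics. This is a minimal NumPy sketch of the idea, not the actual Caffe PR code; the class name and the `momentum` and `eps` defaults are my own choices.

```python
import numpy as np

class BatchNormStats:
    """Track running mean/variance during training; use them at test time."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            batch_mean = x.mean(axis=0)
            batch_var = x.var(axis=0)
            # Exponential moving average, instead of the per-batch
            # statistics alone.
            self.running_mean = (self.momentum * self.running_mean
                                 + (1 - self.momentum) * batch_mean)
            self.running_var = (self.momentum * self.running_var
                                + (1 - self.momentum) * batch_var)
            mean, var = batch_mean, batch_var
        else:
            # At test time even a single example is normalized with the
            # accumulated statistics, never with its own values.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)
```

With the buggy per-batch scheme, a single test example is normalized by its own mean and maps to all zeros; with the running statistics, different inputs keep distinct, meaningful outputs.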
Let's start with the first one. Highway Networks propose a new activation type similar to LSTM networks, and the authors claim that this peculiar activation is robust to any choice of initialization scheme and to the learning problems that occur in very deep NNs. It is also exciting to see that they trained models with more than 100 layers. The basic intuition here is to learn a gating function, attached to a real activation function, that decides whether to pass the activation or the input itself. Here is the formulation:
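In the paper's notation, with T the gating function and H the plain transform activation, the layer output is:

```latex
y = H(x, W_H) \cdot T(x, W_T) + x \cdot \bigl(1 - T(x, W_T)\bigr)
```

When the gate saturates at T = 0 the layer simply carries the input through unchanged, and when T = 1 it behaves like an ordinary layer; everything in between blends the two.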
Here T is the gating function and H is the real activation. They use a Sigmoid activation for gating and a Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn with 50 layers (the most my PC can handle).
The other paper, "Zero-Bias Autoencoders and the Benefits of Co-Adapting Features", suggests using non-biased rectifier units for the inference of AEs. You can train your model with a biased Rectifier Unit, but at inference (test) time you should extract features by ignoring the bias term. They show that doing so gives better recognition on the CIFAR dataset. They also devise a new activation function with an intuition similar to Highway Networks. Again, there is a gating unit that thresholds the normal activation function.
The first equation is the threshold function with a predefined threshold (they use 1 in their experiments). The second equation shows the reconstruction of the proposed model. Pay attention that in this equation they use the square of a linear activation for thresholding, and they call this model TLin; they also use a plain linear function, which is called TRec. What this activation does is diminish the small activations, so the model is implicitly regularized without any additional regularizer. This is actually good for learning an over-complete representation of the given data.
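The two gates described above can be sketched in a few lines of NumPy. This is my own illustration of the idea, not the paper's code; the function names `trec`/`tlin` and the threshold default are mine, with the threshold set to 1 as in their experiments.

```python
import numpy as np

def trec(z, theta=1.0):
    # Truncated rectifier: pass the pre-activation only where it
    # exceeds the threshold; everything below is silenced.
    return z * (z > theta)

def tlin(z, theta=1.0):
    # Truncated linear: gate on the *square* of the pre-activation,
    # so large-magnitude values of either sign pass through.
    return z * (z ** 2 > theta)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(trec(z))  # only 2.0 survives the gate
print(tlin(z))  # both -2.0 and 2.0 survive
```

Small activations are zeroed out in both variants, which is exactly the implicit regularization effect the paper is after.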
For more than this silly intro, please refer to the papers 🙂 and warn me about any mistakes.
These two papers show a new trend coming to the Deep Learning community: using complex activation functions. We can call it controlling each unit's behavior in a smart way instead of letting them fire naively. My intuition also agrees with this idea. I believe we need even more complexity for smart units in our deep models, like Spike-and-Slab networks.
MS researchers recently introduced a new deep (indeed, very deep 🙂) NN model (PReLU Net), and they pushed the state of the art on the ImageNet 2012 dataset from a 6.66% (GoogLeNet) to a 4.94% top-5 error rate.
In this work, they introduce an alteration of the well-known ReLU activation function. They call it PReLU (Parametric Rectified Linear Unit). The idea is to allow negative activations on the ReLU function via a control parameter that is also learned during the training phase. Therefore PReLU allows negative activations, and in the paper they argue, and empirically show, that PReLU is better at resolving the diminishing-gradient problem for very deep neural networks (> 13 layers) thanks to the allowance of negative activations. That means more activations per layer, hence more gradient feedback at the backpropagation stage.
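The activation itself is tiny. Here is a NumPy sketch of the forward pass; in the real network `a` is a learned parameter (one per channel in the paper), and the 0.25 default matches the paper's initialization, but this standalone function is my own illustration.

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU: identity for positive inputs, a learned slope `a`
    # for negative inputs (a=0 recovers plain ReLU).
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))  # negatives are scaled by 0.25 instead of clipped to 0
```

Because the negative side still carries a (scaled) signal, gradients keep flowing through units that a plain ReLU would have silenced.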
Here is a very concise overview of the Python multiprocessing module and its benefits. It is certainly an important module for large-scale data mining and machine learning projects and Kaggle-like challenges. So take a brief look at that slide to discover how to speed up your project cycle.
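As a taste of what the module gives you, here is a minimal sketch using `multiprocessing.Pool` to fan a CPU-bound function out across worker processes (the `square` function and pool size are just illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Any CPU-bound function; each worker runs in its own process,
    # sidestepping the GIL for true parallel computation.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

`pool.map` behaves like the built-in `map` but splits the work across the processes, which is often all you need to parallelize a feature-extraction or cross-validation loop.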
This is a fundamental misunderstanding that many people have about money; in fact, it probably stands at the center of why some people are good at making money while others are not, and why some people are wealthy and others are not.
(First, set aside the issues of inherited wealth or "unfairly earned" money. Those are distortive effects, but let's focus on the dominant factor.)
Wealth ("being rich") means producing things of value. It does not mean "having lots of money."
The key word there is value. That word is more important than wealth or money; it is the real central factor around which human endeavor and economies revolve. Money and wealth are big words that get a lot of play, but value is a boring word that most people don't notice. It is actually the important one. Value is what is produced when you do work, mine resources, develop an idea, produce an invention, engage in mutually beneficial commerce, etc. Value is the "thing" that humans make (out of nothing) by working, creating, trading, etc.
I've gathered the following from online research so far:
I've used Armadillo a little bit, and found the interface to be intuitive enough, and it was easy to locate binary packages for Ubuntu (and I'm assuming other Linux distros). I haven't compiled it from source, but my hope is that it wouldn't be too difficult. It meets most of my design criteria, and uses dense linear algebra. It can call LAPACK or MKL routines.
Fischer’s Animal Products: A company slaughtering animals and turning their skin into overpriced products sold to rich bastards! Help animal rights activists increase political awareness by hacking their mailing list.
On Linux, the /etc/passwd file includes information about the user accounts on the operating system. Account details and, historically, the password hash for each user are stored here, along with some extra information, and the file's simple colon-separated structure makes it easy to interpret.
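Concretely, each line of /etc/passwd holds seven colon-separated fields. Here is a tiny Python sketch with a made-up entry (on modern systems the second field is usually just `x`, with the real hash kept in /etc/shadow):

```python
# The seven colon-separated fields of a classic /etc/passwd entry.
# The sample line below is invented for illustration.
sample = "alice:x:1000:1000:Alice Example:/home/alice:/bin/bash"

fields = ["username", "password", "UID", "GID", "GECOS", "home", "shell"]
entry = dict(zip(fields, sample.split(":")))
print(entry["username"], entry["shell"])  # alice /bin/bash
```

The GECOS field is free-form text (typically the user's full name), and the final field names the login shell started for that account.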
long long is not the same as long (although they can have the same size, e.g. on most 64-bit POSIX systems). It is only guaranteed that a long long is at least as long as a long. On most platforms, a long long represents a 64-bit signed integer type.
You could use long long to store the 8-byte value safely on most conventional platforms, but it is better to use int64_t/int_least64_t from <stdint.h>/<cstdint> to make clear that you want an integer type of at least 64 bits.