paper: http://arxiv.org/pdf/1603.05279v1.pdf

32x memory saving and 58x faster convolution operations. The Binary-Weight version of AlexNet loses only 2.9% Top-1 accuracy compared to the full-precision version. XNOR-Net, which binarizes both inputs and weights, widens the gap to 12.5%.

When the weights are binary, the convolution operation can be approximated using only additions and subtractions. Binary-Weight networks can fit on mobile devices, with a 2x speed-up on the operations.

To take the idea further, XNOR-Net uses both binary weights and binary inputs. When both are binary, convolution can be computed with XNOR and bitcount operations. This enables both inference and training of even state-of-the-art models on CPUs.
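To see why XNOR and bitcount are enough, here is a minimal sketch of a dot product between two +1/-1 vectors done purely with bit operations. The helper names `pack_signs` and `xnor_dot` are my own, not from the paper, and a real implementation would work on packed machine words rather than Python integers.

```python
def pack_signs(v):
    """Encode a +1/-1 vector as an integer bitmask: bit i is 1 where v[i] == +1."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")     # bitcount
    return 2 * matches - n              # (#matches) - (#mismatches)


a = [+1, -1, -1, +1, +1, -1, +1, +1]
b = [+1, +1, -1, -1, +1, -1, -1, +1]

# Reference result with ordinary multiplications.
reference = sum(x * y for x, y in zip(a, b))
result = xnor_dot(pack_signs(a), pack_signs(b), len(a))
```

Each matching pair of signs contributes +1 and each mismatch contributes -1, so the dot product is twice the number of matches minus the vector length.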

The paper also gives a good summary of approaches for compressing models into smaller sizes:

  1. Shallow networks -- approximate deep models with shallower architectures, using methods such as knowledge distillation.
  2. Compressing networks -- compression of larger networks.
    1. Weight Decay [17]
    2. Optimal Brain Damage [18]
    3. Optimal Brain Surgeon [19]
    4. Deep Compression [22]
    5. HashNets [23]
  3. Design compact layers -- keep the network minimal from the beginning
    1. Decomposing 3x3 layers into two 1x1 layers [27]
    2. Replacing 3x3 layers with 1x1 layers, achieving 50x fewer parameters.
  4. Quantization of parameters -- High precision is not so important for good results in deep networks [29]
    1. 8-bit values instead of 32-bit float weight values [31]
    2. Ternary weights and 3-bits activation [32]
    3. Quantization of layers with L2 loss  [33]
  5. Network binarization --
    1. Expectation Backpropagation [36]
    2. Binary Connect [38]
    3. BinaryNet [11]
    4. Retraining of a pre-trained model [41]



Binary-Weight-Net is defined as an approximation of real-valued layers as $W \approx \alpha B$, where $\alpha$ is a scaling factor and $B$ is a binary tensor with entries in $\{+1, -1\}$. Since the values are binary, we can perform the convolution operation with only additions and subtractions.

$$I * W \approx (I \oplus B)\,\alpha$$

where $\oplus$ denotes a convolution computed without any multiplication.

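Following the derivation in the paper, the optimal values are:

$$B = \mathrm{sign}(W), \qquad \alpha = \frac{1}{n}\lVert W \rVert_{\ell_1}$$

where $n$ is the number of elements in $W$.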
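As a rough illustration of these two formulas, the sketch below binarizes a random weight vector and evaluates a dot product against it using only additions and subtractions plus one final scaling. The function names are mine, mapping sign(0) to +1 is an assumption, and a flat vector stands in for a convolution filter.

```python
import random


def binarize_weights(W):
    """Return (B, alpha) with B = sign(W) in {-1, +1} and alpha = (1/n) * ||W||_l1."""
    B = [1 if w >= 0 else -1 for w in W]   # assumption: sign(0) is mapped to +1
    alpha = sum(abs(w) for w in W) / len(W)
    return B, alpha


def binary_dot(I, B, alpha):
    """Dot product against +1/-1 weights: add or subtract, then scale once."""
    acc = 0.0
    for x, b in zip(I, B):
        acc = acc + x if b > 0 else acc - x
    return alpha * acc


def sq_error(W, B, a):
    """Squared approximation error ||W - a * B||^2 for a given scale a."""
    return sum((w - a * b) ** 2 for w, b in zip(W, B))


random.seed(0)
W = [random.gauss(0, 1) for _ in range(64)]   # toy "filter"
I = [random.gauss(0, 1) for _ in range(64)]   # toy "input patch"

B, alpha = binarize_weights(W)
exact = sum(x * w for x, w in zip(I, W))      # full-precision dot product
approx = binary_dot(I, B, alpha)              # binary-weight approximation
```

With $B = \mathrm{sign}(W)$ fixed, `sq_error` is a quadratic in the scale, and the mean absolute value of the weights is exactly its minimizer, which is why that value is used for $\alpha$.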

Training a Binary-Weight-Net involves three main steps: forward pass, backward pass, and parameter update. The weights are binarized in both the forward and backward passes, but the updates are applied to the real-valued weights, so that small changes are not lost.

Binary-Weight-Net training cycle
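The cycle can be sketched on a toy problem: a single linear layer trained with plain gradient descent. This is a simplified sketch, not the paper's exact procedure. In particular, I assume a straight-through update (the gradient computed with respect to the binarized weights is applied directly to the real-valued ones), and the target weights are chosen to be exactly representable as a scale times signs.

```python
import random


def binarize(W):
    """Return alpha * sign(W), with alpha the mean absolute value of W."""
    alpha = sum(abs(w) for w in W) / len(W)
    return [alpha if w >= 0 else -alpha for w in W]


random.seed(0)
n_features, n_samples = 8, 256
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]

# Toy target (assumption): weights that are exactly 0.5 * (+1/-1).
target_w = [0.5 if random.random() < 0.5 else -0.5 for _ in range(n_features)]
y = [sum(x * w for x, w in zip(row, target_w)) for row in X]

W = [random.gauss(0, 0.1) for _ in range(n_features)]  # real-valued weights
lr = 0.05
losses = []

for step in range(200):
    # 1. Forward pass with binarized weights.
    W_bin = binarize(W)
    err = [sum(x * w for x, w in zip(row, W_bin)) - t for row, t in zip(X, y)]
    losses.append(sum(e * e for e in err) / n_samples)

    # 2. Backward pass: gradient of the mean squared error w.r.t. W_bin.
    grad = [2 * sum(e * row[j] for e, row in zip(err, X)) / n_samples
            for j in range(n_features)]

    # 3. Update the real-valued weights (straight-through assumption).
    W = [w - lr * g for w, g in zip(W, grad)]
```

The real-valued `W` accumulates many small steps, and a sign in `W_bin` only flips once the corresponding real-valued weight crosses zero. Updating the binary weights directly would throw those small steps away.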



At this stage, the idea is extended and the input values are also binarized, reducing the cost of convolution by using only the binary operations XNOR and bitcount. Basically, the input values are binarized in the same way as the weight values: the sign operation gives the binary mapping, and the scale is estimated from the l1 norm of the input values.

$$C = \mathrm{sign}(X^T)\,\mathrm{sign}(W) = H^T B$$

$$\gamma \approx \left(\frac{1}{n}\lVert X \rVert_{\ell_1}\right)\left(\frac{1}{n}\lVert W \rVert_{\ell_1}\right) = \beta \alpha$$

where $\gamma$ is the scaling factor and $C$ is the binary mapping of the feature map after convolution.
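A small numerical sketch of these quantities, using random vectors in place of real activations and weights. It assumes the magnitudes of the inputs and of the weights are independent, which is what makes the product of the two separate scales a reasonable stand-in for the joint one. All variable names are my own.

```python
import random

random.seed(0)
n = 4096
X = [random.gauss(0, 1) for _ in range(n)]   # toy input values
W = [random.gauss(0, 1) for _ in range(n)]   # toy weight values

# Binary mappings H = sign(X) and B = sign(W); sign(0) -> +1 is an assumption.
H = [1 if x >= 0 else -1 for x in X]
B = [1 if w >= 0 else -1 for w in W]

# Separate scales from the l1 norms.
beta = sum(abs(x) for x in X) / n
alpha = sum(abs(w) for w in W) / n

# Joint scale versus the product of the separate scales.
gamma_exact = sum(abs(x) * abs(w) for x, w in zip(X, W)) / n
gamma_approx = beta * alpha

# Integer-valued binary dot product; a packed implementation would use
# XNOR and bitcount here instead of multiplications.
binary_dot = sum(h * b for h, b in zip(H, B))

exact = sum(x * w for x, w in zip(X, W))     # full-precision dot product
approx = gamma_approx * binary_dot           # scaled binary approximation
```

How close `approx` gets to `exact` depends on the data. The sketch only shows which quantities are computed in floating point (the two scales) and which are purely binary (the dot product of signs).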

I am too lazy to go into much more detail. For more, including implementation details, have a look at the paper.

For such works, it is always a pain to replicate the results. I hope they will release some code to serve as a basis. Other than that, using such tricks to compress gargantuan deep models into more moderate sizes is very useful for small groups that, unlike big companies, have no GPU back-end, or for deploying such models on small computing devices. Given such a boost in computing time and such a small memory footprint, it is tempting to train such models as a big ensemble and compare it against a single full-precision model.