Recently, I started at Mozilla Research. I am really excited to be a part of a small but great team working hard to solve important ML problems. And everything is open-sourced. We license things to make open-sourced. Oxymoron by first sight isn't it. But I like it !!
Before my presence, our team already released the best known open-sourced STT (Speech to Text) implementation based on Tensorflow. The next step is to improve the current Baidu's Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. So after these two projects, anyone around the world will be able to create his own Alexa without any commercial attachment. Which is the real way to democratize AI, at least I believe it is?
Up until now, I worked on a variety of data types and ML problems, except audio. Now it is time to learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to our Github repository hosting PyTorch implementation of the first version implementation. (We switched to PyTorch for obvious reasons). It is a work in progress and please feel free to comment and contribute.
Below I like to share my pinpoint summary of the well-known TTS papers which are by no means complete but useful to highlight important aspects of these papers. Let's start.
Phonemes: units of sounds, we pronounce as we speak. Necessary since very similar words in the letter might be pronounced very differently (e.g. "Rough" "Though")
Vocoder: part of the system decoding from features to audio signals. Wave is used in Deep Voice at that stage.
Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
Autoregressive Model: Specifies a model depending linearly on its own outputs and on a parameter set which can be approximated.
Query, Key, Value: Key is used by the attention module to compute attention weights. Value is the vector stipulated by the attention weights to compute the module output. A query vector is the hidden state of the decoder.
Grapheme: Cool way to say character.
Error Modes: Sub-optimal status for the attention block where it is not able to escape.
Monotonic Attention: Use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a certain relation btw the output at time t and the input at time t. However, it is not that reasonable for translation problem since words orders might not be the same. https://arxiv.org/pdf/1704.00784.pdf
MOS: Mean Opinion Score. Crowd-source the evaluation process with native speakers. It is not easy to measure, especially for a layman.
Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
Teacher Forcing: Providing model's expected output at time t as input at time t+1. It is controlled ground-truth feedback as a teacher does to a student.
Casual convolution: Convolution which does not foresee the future units given the reference time step T which we like to predict next. In practice, it is implemented by setting right padding orientation to normal convolution layers.
Deep Voice (25 Feb 2017)
Text to phonemes. Deterministically computed with a dictionary. Or Seq2Seq model to deal with the unseen words.
Lately, I study time series to see something more out the limit of my experience. I decide to use what I learn in cryptocurrency price predictions with a hunch of being rich. Kidding? Or not :). As I see more about the intricacies of the problem I got deeper and I got a new challenge out of this. Now, I am in a process of creating something new using traditional machine learning to latest reinforcement learning achievements.
So the story aside, I like to see if an AI bot trading without manual help is possible or is a luring dream. Lately, I read a lot about the topic from traditional financial technical analysis to latest ML solutions. What I see at the ML front is many people claim to use lazy ML with success and sell deceitful dreams.What I call lazy ML is, downloading data , training the model and done. We are rich!! What I really experience is they have false conclusion induced by false interpretations. And the bad side of this, many other people try to replicate their results (aka beginner me) and waste a lot of time. Here, I like to show a particular mistake in those works with a accompanying code helping us to realize the problem better off.
Briefly, this work illustrates a simple supervised setting where a model predicts the next Bitcoin move given the current state. Here is the full Notebook and to see more advance set of experiments check out the repo. Hope you like that.
Online Hard Example Mining (OHEM) is a way to pick hard examples with reduced computation cost to improve your network performance on borderline cases which generalize to the general performance. It is mostly used for Object Detection. Suppose you like to train a car detector and you have positive (with car) and negative images (with no car). Now you like to train your network. In practice, you find yourself in many negatives as oppose to relatively much small positives. To this end, it is clever to pick a subset of negatives that are the most informative for your network. Hard Example Mining is the way to go to this.
In general, to pick a subset of negatives, first you train your network for couple of iterations, then you run your network all along your negative instances then you pick the ones with the greater loss values. However, it is very computationally toilsome since you have possibly millions of images to process, and sub-optimal for your optimization since you freeze your network while picking your hard instances that are not all being used for the next couple of iterations. That is, you assume here all hard negatives you pick are useful for all the next iterations until the next selection. Which is an imperfect assumption especially for large datasets.
Okay, what Online means in this regard. OHEM solves these two aforementioned problems by performing hard example selection batch-wise. Given a batch sized K, it performs regular forward propagation and computes per instance losses. Then, it finds M<K hard examples in the batch with high loss values and it only back-propagates the loss computed over the selected instances. Smart hah ? 🙂
It reduces computation by running hand to hand with your regular optimization cycle. It also unties the assumption of the foreseen usefulness by picking hard examples per iteration so thus we now really pick the hard examples for each iteration.
If you like to test yourself, here is PyTorch OHEM implementation that I offer you to use a bit of grain of salt.
ReLU is defined as a way to train an ensemble of exponential number of linear models due to its zeroing effect. Each iteration means a random set of active units hence, combinations of different linear models. They discuss, relying on the given observation, it might be useful to remove non-linearities for some layers and letting them to learn combination of linearities as the whole layer.
Another argument as poised, some representations are hard to approximate by a stack of non-linear layers. as shown by He et al. 2016. To this end, letting linearities for a subset of layers might ameliorate the condition.
The way they apply EraseReLU is removing the last ReLU layer of each "module". "Module" here is defined depending on the model architecture as shown above.
Experiments show that EraseReLU increases the performance of networks and its effect is larger for deeper networks. It is also more resilient to over-fitting for deep networks. The loss curves also show faster convergence for EraseReLU and the difference more obvious for larger datasets.
My 2 cents: Results are not that different on ImageNet but still better to the favor of EraseReLU. Then it might be the case of lucky shoot since there is no confidence interval or variance given for the trainings.
Faster convergence makes sense with the help of second guessing after the paper. Since there are more active units possible it entails to propagate more gradients. However, all such comments assumes that error signals are always positive. Which is very unlikely. Therefore, more open valves might cause more chaotic back-propagation signal.
Yet it is very simple idea, it shows faster convergence, better results and a good investgation of ReLU function. It think it is useful and can take its position in my next training session.
Disclaimer: This is written hastily in 10 mins. If you think something wrong or even worse let me know :).
There are numerous on-line and off-line technical resources about deep learning. Everyday people publish new papers and write new things. However, it is rare to see resources teaching practical concerns for structuring a deep learning projects; from top to bottom, from problem to solution. People know fancy technicalities but even some experienced people feel lost in the details, once they need to structure their own project.
Andrew Ng’s new initiative deeplearning.ai rushes to help at this stage. deeplearning.ai is a collection of courses teaching deep learning with all the necessary details (great place to start deep learning !!) and one of the courses called “Structuring Machine Learning Project” particularly teaches the design of a deep learning solution intuitively with real-life examples. It is not possible to give a single pill ruling the all but this course establishes a good basis to think about a DL project. And apparently, I still have things to learn from Andrew Ng. after all my years of experience. Note that, I also started my ML career with his famous Coursera ML course :). Thank you Mr. Ng.
Below, I try to plot a diagram summarizing what is mentioned in the course. However be aware that it is not a verbatim depiction and probably includes some of my own measures. In the end, It was a good exercise for me, and hopefully, is a good reference for you.
Please go and check the videos, if the diagram does not make sense at all and ping me if you have something that you don’t like.
One of the main problems of neural networks is to tame layer activations so that one is able to obtain stable gradients to learn faster without any confining factor. Batch Normalization shows us that keeping values with mean 0 and variance 1 seems to work things. However, albeit indisputable effectiveness of BN, it adds more layers and computations to your model that you'd not like to have in the best case.
ELU (Exponential Linear Unit) is a activation function aiming to tame neural networks on the fly by a slight modification of activation function. It keeps the positive values as it is and exponentially skew negative values.
ELU does its job good enough, if you like to evade the cost of Bath Normalization, however its effectiveness does not rely on a theoretical proof beside empirical satisfaction. And finding a good is just a guess.
Self-Normalizing Neural Networks takes things to next level. In short, it describes a new activation function SELU (Scaled Exponential Linear Units), a new initialization scheme and a new dropout variant as a repercussion,
The main topic here is to keep network activation in a certain basin defined by a mean and a variance values. These can be any values of your choice but for the paper it is mean 0 and variance 1 (similar to notion of Batch Normalization). The question afterward is to modifying ELU function by some scaling factors to keep the activations with that mean and variance on the fly. They find these scaling values by a long theoretical justification. Stating that, scaling factors of ELU are supposed to be defined as such any passing value of ELU should be contracted to define mean and variance. (This is just verbal definition by no means complete. Please refer to paper to be more into theory side. )
Above, the scaling factors are shown as and . After long run of computations these values appears to be 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946 relatively. Nevertheless, do not forget that these scaling factors are targeting specifically mean 0 and variance 1. Any change preludes to change these values as well.
Initialization is also another important part of the whole method. The aim here is to start with the right values. They suggest to sample weights from a Gaussian distribution with mean 0 and variance where n is number of weights.
It is known with a well credence that Dropout does not play well with Batch Normalization since it smarting network activations in a purely random manner. This method seems even more brittle to dropout effect. As a cure, they propose Alpha Dropout. It randomly sets inputs to saturatied negative value of SELU which is . Then an affine transformation is applied to it with and values computed relative to dropout rate, targeted mean and variance.It randomizes network without degrading network properties.
In a practical point of view, SELU seems promising by reducing the computation time relative to RELU+BN for normalizing the network. In the paper they does not provide any vision based baseline such a MNIST, CIFAR and they only pounce on Fully-Connected models. I am still curios to see its performance vis-a-vis on these benchmarks agains Bath Normalization. I plan to give it a shoot in near future.
One tickle in my mind after reading the paper is the obsession of mean 0 and variance 1 for not only this paper but also the other normalization techniques. In deed, these values are just relative so why 0 and 1 but not 0 and 4. If you have a answer to this please ping me below.
The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.
The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.
White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.
For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.
One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).
Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.
The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It is scaling residual connections by a constant value before summing up with the current layer activation.
A small discussion
This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.
Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here, increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.
Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂
In simple terms, dilated convolution is just a convolution applied to input with defined gaps. With this definitions, given our input is an 2D image, dilation rate k=1 is normal convolution and k=2 means skipping one pixel per input and k=4 means skipping 3 pixels. The best to see the figures below with the same k values.
The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is 3x3 in this example, and green area is the receptive field captured by each of these inputs. Receptive field is the implicit area captured on the initial input by each input (unit) to the next layer .
Dilated convolution is a way of increasing receptive view (global view) of the network exponentially and linear parameter accretion. With this purpose, it finds usage in applications cares more about integrating knowledge of the wider context with less cost.
One general use is image segmentation where each pixel is labelled by its corresponding class. In this case, the network output needs to be in the same size of the input image. Straight forward way to do is to apply convolution then add deconvolution layers to upsample. However, it introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolutions high and it avoids the need of upsampling .
Dilated convolution is applied in domains beside vision as well. One good example is WaveNet text-to-speech solution and ByteNet learn time text translation. They both use dilated convolution in order to capture global view of the input with less parameters.
In short, dilated convolution is a simple but effective idea and you might consider it in two cases;
Detection of fine-details by processing inputs in higher resolutions.
Broader view of the input to capture more contextual information.
Faster run-time with less parameters
 Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Iclr, 1–14. Retrieved from http://arxiv.org/abs/1412.7062
Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. Iclr, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099
Machine learning is everywhere and we are amazed with capabilities of these algorithms. However, they are not great and sometimes they behave so dumb. For instance, let's consider an image recognition model. This model induces really high empirical performance and it works great for normal images. Nevertheless, it might fail when you change some of the pixels of an image even so this little perturbation might be indifferent to human eye. There we call this image an adversarial instance.
There are various methods to generate adversarial instances . One method is to take derivative of the model outputs wrt the input values so that we can change instance values to manipulate the model decision. Another approach exploits genetic algorithms to generate manipulative instances which are confidently classified as a known concept (say 'dog') but they are nothing to human eyes.
So why these models are that weak against adversarial instances. One reliable idea states that because adversarial instances lie on the low probability regions of the instance space. Therefore, they are so weird to the network which is trained with a limited number of instances from higher probability regions.
That being said, maybe there is no way to escape from the fretting adversarial instances, especially when they are produced by exploiting weaknesses of a target model with a gradient guided probing. This is a analytic way of searching for a misleading input for that model with an (almost) guaranteed certainty. Therefore in one way or another, we find an perturbed input deceiving any model.
Due to that observation, I believe that adversarial instances can be resolved by multiple models backing each other. In essence, this is the motivation of this work.
In this work, I like to share my observations focusing on strength of the ensembles against adversarial instances. This is just a toy example with so much short-comings but I hope it'll give the idea with some emiprical evidences.
As a summary, this is what we do here;
Train a baseline MNIST ConvNet.
Create adversarial instances on this model by using cleverhans and save.
Measure the baseline model performance on adversarial.
Train the same ConvNet architecture including adversarial instances and measure its performance.
Train an ensemble of 10 models of the same ConvNet architecture and measure ensemble performance and support the backing argument stated above.
My code full code can be seen on github and I here only share the results and observations. You need cleverhans, Tensorflow and Keras for adversarial generation and you need PyTorch for ensemble training. (Sorry for verbosity of libraries but I like to try PyTorch as well after yeras of tears with Lua).
One problem of the proposed experiment is that we do not recreate adversarial instances for each model and we use a previously created one. Anyways, I believe the empirical values verifies my assumption even in this setting. In addition, I plan to do more extensive study as a future work.
Create adversarial instances.
I start by training a simple ConvNet architecture on MNIST dataset by using legitimate train and test set splits. This network gives 0.98 test set accuracy after 5 epochs.
For creating adversarial instances, I use fast gradient sign method which perturbs images using the derivative of the model outputs wrt the input values. You can see a bunch of adversarial samples below.
The same network suffers on adversarial instances (as above) created on the legitimate test set. It gives 0.09 accuracy which is worse then random guess.
Plot adversarial instances.
Then I like to see the representational power of the trained model on both the normal and the adversarial instances. I do this by using well-known dimension reduction technique T-SNE. I first compute the last hidden layer representation of the network per instance and use these values as an input to T-SNE which aims to project data onto 2-D space. Here is the final projection for the both types of data.
These projections clearly show that adversarial instances are just a random data points to the trained model and they are receding from the real data points creating what we call low probability regions for the trained model. I also trained the same model architecture by dynamically creating adversarial instances in train time then test its value on the adversarials created previously. This new model yields 0.98 on normal test set, 0.91 on previously created adversarial test set and 0.71 on its own dynamically created adversarial.
Above results show that including adversarial instances strengthen the model. However, this is conforming to the low probability region argument. By providing adversarial, we let the model to discover low probability regions of adversarial instances. Beside, this is not applicable to large scale problems like ImageNet since you cannot afford to augment your millions of images per iteration. Therefore, by assuming it works, ensembling is more viable alternative as already a common method to increase overall prediction performance.
In this part, I train multiple models in different ensemble settings. First, I train N different models with the same whole train data. Then, I bootstrap as I train N different models by randomly sampling data from the normal train set. I also observe the affect of N.
The best single model obtains 0.98 accuracy on the legitimate test set. However, the best single model only obtains 0.22 accuracy on the adversarial instances created in previous part.
When we ensemble models by averaging scores, we do not see any gain and we stuck on 0.24 accuracy for the both training settings. However, surprisingly when we perform max ensemble (only count on the most confident model for each instance), we observe 0.35 for uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equals to 50.
Increasing N raises the adversarial performance. It is much more effective on bootstrapped ensemble. With N=5 we obtain 0.27 for uniform ensemble and 0.32 for bootstrapped ensemble. With N=25 we obtain 0.30 and 0.45 respectively.
These values are interesting especially for the difference of mean and max ensemble. My intuition behind the superiority of maxing is maxing out predictions is able to cover up weaknesses of models by the most confident one, as I suggested in the first place. In that vein, one following observation is that adversarial performance increases as we use smaller random chunks for each model up to a certain threshold with increasing N (number of models in ensemble). It shows us that bootstrapping enables models to learn some of the local regions better and some worse but the worse sides are covered by the more confident model in the ensemble.
As I said before, it is not convenient to use previously created adversarials created by the baseline model in the first part. However, I believe my claim still holds. Assume that we include the baseline model in our best max ensemble above. Still its mistakes would be corrected by the other models. I also tried this (after the comments below) and include the baseline model in our ensemble. 0.57 accuracy only reduces to 0.55. It is still pretty high compared to any other method not seeing adversarial in the training phase.
It is much more harder to create adversarials for ensemble of models with gradient methods. However, genetic algorithms are applicable.
Blind stops of individual models are covered by the peers in the ensemble when we rely on the most confident one.
We observe that as we train a model with dynamically created adversarial instances per iteration, it resolves the adversarials created by the test set. That is, since as the model sees examples from these regions it becomes immune to adversarials. It supports the argument stating low probability regions carry adversarial instances.
(Before finish) This is Serious!
Before I finish, I like to widen the meaning of this post's heading. Ensemble against adversarial!!
"Adversarial instances" is peculiar AI topic. It attracted so much interest first but now it seems forgotten beside research targeting GANs since it does not yield direct profit, compared to having better accuracy.
Even though this is the case hitherto, we need consider this topic more painstakingly from now on. As we witness more extensive and greater AI in many different domains (such as health, law, governace), adversarial instances akin to cause greater problems intentionally or by pure randomness. This is not a sci-fi scenario I'm drawing here. It is a reality as it is prototyped in . Just switch a simple recognition model in  with a AI ruling court for justice.
Therefore, if we believe in a future embracing AI as a great tool to "make the world better place!", we need to study this subject extensively before passing a certain AI threshold.
This work overlooks many important aspects but after all it only aims to share some of my findings in a spare time research. For a next post, I like study unsupervised models like Variational Encoders and Denoising Autoencoders by applying these on adversarial instances (I already started!). In addition, I plan to work on other methods for creating different types of adversarials.
From this post you should take;
References to adversarial instances
Good example codes waiting you on github that can be used many different projects.
Power of ensemble.
Some of non-proven claims and opinions on the topic.
IN ANY WAY HOPE YOU LIKE IT ! 🙂
 Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.
 Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv: …, 1–10. Retrieved from http://arxiv.org/abs/1312.6199
 Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697
 Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Iclr 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572