Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative to traditional solutions: it only requires text and the corresponding voice clips to train the model, which avoids the toil of fine-grained data annotation. However, Tacotron can also be very time-consuming to train, especially if you don't know the right hyperparameters to begin with. Here, I would like to share a gradual training scheme that eases the training difficulty. In my experiments, it provides faster training, more tolerance to hyperparameter choices, and more time with your family.
In summary, Tacotron is an Encoder-Decoder architecture with Attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames, which are ultimately converted to speech by an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but it has fewer model parameters than Tacotron2. Tacotron2 is much simpler but it is ~4x larger (~7m vs ~24m parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to start experimenting with Tacotron2.
Here is the trick. Tacotron has a parameter called 'r' which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter for reducing the amount of computation, since the larger 'r' is, the fewer decoder iterations are needed. But setting the value too high might reduce the performance as well. Another benefit of a higher 'r' value is that the alignment module stabilizes much faster. If you talk to someone who has used Tacotron, they probably know what a struggle the attention can be. So finding the right trade-off for 'r' is a big deal. In the original Tacotron paper, the authors used r=2 for the best-reported model. They also emphasize the challenge of training the model with r=1.
Gradual training comes to the rescue at this point. What it means is that we initially set 'r' large, such as 7. Then, as the training continues, we reduce it until convergence. This simple trick helps quite magically to solve two main problems. First, it helps the network learn the monotonic attention after almost the first epoch. Second, it speeds up convergence considerably. As a result, the final model ends up with more stable and resilient attention without any degradation in performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.
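To make the idea concrete, here is a minimal sketch of how such a schedule could be looked up during training. The step boundaries, 'r' values and batch sizes below are made-up placeholders, not the schedule behind the results in this post, and the helper names are hypothetical. The second helper shows why a larger 'r' means fewer decoder iterations.

```python
import math

# Hypothetical schedule: (first training step, r, batch_size).
# The numbers are illustrative only, not the schedule used for the results in this post.
GRADUAL_SCHEDULE = [
    (0, 7, 32),
    (10_000, 5, 32),
    (50_000, 3, 32),
    (100_000, 2, 16),
    (200_000, 1, 8),
]


def settings_for_step(step, schedule=GRADUAL_SCHEDULE):
    """Return the (r, batch_size) of the last entry whose start step has been reached."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, new_r, new_batch_size in schedule:
        if step >= start_step:
            r, batch_size = new_r, new_batch_size
    return r, batch_size


def decoder_iterations(num_frames, r):
    """Number of decoder iterations needed to emit num_frames frames, r frames at a time."""
    return math.ceil(num_frames / r)


for step in (0, 20_000, 250_000):
    r, batch_size = settings_for_step(step)
    print(step, r, batch_size, decoder_iterations(700, r))
```

In a training loop, one would call something like settings_for_step at the start of each epoch and rebuild the data loader whenever 'r' or the batch size changes.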
Here, I would like to share some results that demonstrate the effectiveness of the method. I used the LJSpeech dataset for all the results. The training schedule can be summarized as follows. (You can see that I also change the batch_size, but this is not necessary if you have enough GPU memory.)
Below you can see the attention at validation time after just 1K iterations with the training schedule above.
Next, let's check the model training curve and convergence.
You can listen to voice examples generated with the final model using the Griffin-Lim vocoder. To my ear, the quality of these examples is quite good.
It was a short post, but if you would like to replicate the results, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. I hope the imperfect documentation in the repo helps you. Otherwise, you can always ask for help by creating an issue or on the Mozilla TTS Discourse page. There are some other cool things in the repo that I will also write about in the future. Until next time!
Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This might turn into a more legit research work in the future.
Recently, I started at Mozilla Research. I am really excited to be part of a small but great team working hard to solve important ML problems. And everything is open-sourced. We license things to make them open source. An oxymoron at first sight, isn't it? But I like it!!
Before I joined, our team had already released the best known open-source STT (Speech to Text) implementation, based on TensorFlow. The next step is to improve the current implementation of Baidu's Deep Speech architecture and also to implement a new TTS (Text to Speech) solution that completes the whole conversational AI agent. So after these two projects, anyone around the world will be able to create their own Alexa without any commercial attachment. That is the real way to democratize AI, at least I believe it is.
Up until now, I have worked on a variety of data types and ML problems, except audio. Now it is time to learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I would like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to our GitHub repository hosting the PyTorch implementation of the first version. (We switched to PyTorch for obvious reasons.) It is a work in progress, so please feel free to comment and contribute.
Below I share my pinpoint summaries of the well-known TTS papers. They are by no means complete, but they are useful for highlighting the important aspects of these papers. Let's start.
Phonemes: units of sound that we pronounce as we speak. They are necessary since words with very similar spelling might be pronounced very differently (e.g. "Rough" vs. "Though").
Vocoder: the part of the system that decodes features into audio signals. WaveNet is used in Deep Voice at that stage.
Fundamental Frequency - F0: the lowest frequency of a periodic waveform, describing the pitch of the sound.
Autoregressive Model: a model whose output depends linearly on its own previous outputs and on a parameter set which can be estimated.
Query, Key, Value: The key is used by the attention module to compute the attention weights. The value is the vector weighted by the attention weights to compute the module output. The query vector is the hidden state of the decoder.
Grapheme: Cool way to say character.
Error Modes: Sub-optimal states of the attention block from which it is not able to escape.
Monotonic Attention: Use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a certain relation between the output at time t and the input at time t. However, it is not that reasonable for translation problems, since word orders might not be the same. https://arxiv.org/pdf/1704.00784.pdf
MOS: Mean Opinion Score. The evaluation is crowd-sourced to native speakers. It is not easy to measure, especially for a layman.
Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
Teacher Forcing: Providing the model's expected output at time t as its input at time t+1. It is controlled ground-truth feedback, like what a teacher gives to a student.
Causal convolution: Convolution which does not see future units relative to the reference time step T that we want to predict next. In practice, it is implemented by padding normal convolution layers on one side only, the side of the past.
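As an illustration of the last term, here is a minimal PyTorch sketch of a causal 1D convolution. The class name and sizes are my own choices for the example, not taken from any of the papers. It pads only on the left and then checks that changing a future input leaves the earlier outputs untouched.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1D convolution whose output at time t only depends on inputs at times <= t."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Pad only on the left (the past), so no future step leaks into the output.
        self.left_padding = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x has shape (batch, channels, time)
        x = F.pad(x, (self.left_padding, 0))
        return self.conv(x)


torch.manual_seed(0)
layer = CausalConv1d(in_channels=1, out_channels=1, kernel_size=3)
x = torch.randn(1, 1, 10)
y = layer(x)

# Changing a future input (t = 7) must not change earlier outputs (t < 7).
x_future_changed = x.clone()
x_future_changed[0, 0, 7] += 1.0
y_changed = layer(x_future_changed)
print(y.shape)
print(torch.allclose(y[..., :7], y_changed[..., :7]))
```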
Deep Voice (25 Feb 2017)
Text to phonemes. Deterministically computed with a dictionary, or with a Seq2Seq model to deal with unseen words.
Lately, I have been studying time series to see something beyond the limits of my experience. I decided to apply what I learn to cryptocurrency price prediction, with a hunch of getting rich. Kidding? Or not :). As I saw more of the intricacies of the problem, I got in deeper and found a new challenge in it. Now I am in the process of creating something new, using everything from traditional machine learning to the latest reinforcement learning achievements.
So, the story aside, I would like to see whether an AI bot that trades without manual help is possible or just an alluring dream. Lately, I have read a lot about the topic, from traditional financial technical analysis to the latest ML solutions. What I see on the ML front is that many people claim to use lazy ML with success and sell deceitful dreams. What I call lazy ML is downloading data, training the model, and done. We are rich!! What I actually find is that they reach false conclusions induced by false interpretations. And the bad side of this is that many other people try to replicate their results (aka beginner me) and waste a lot of time. Here, I would like to show a particular mistake in those works, with accompanying code that helps us understand the problem better.
Briefly, this work illustrates a simple supervised setting where a model predicts the next Bitcoin move given the current state. Here is the full Notebook, and to see a more advanced set of experiments, check out the repo. I hope you like it.
Let's dive right in. The idea here is to use TensorBoard to plot your PyTorch training runs. For this, I use TensorboardX, which is a nice interface for communicating with TensorBoard without a TensorFlow dependency.
First, install the requirements:
pip install tensorboard
pip install tensorboardX
Things are very easy after that as well, but you need to know how to communicate with the board to show your training, and that is not so easy if you have not used TensorBoard before.
from tensorboardX import SummaryWriter
writer = SummaryWriter('your/path/to/log_files/')
# in training loop
writer.add_scalar('Train/Loss', loss, num_iteration)
writer.add_scalar('Train/Prec@1', top1, num_iteration)
writer.add_scalar('Train/Prec@5', top5, num_iteration)
# in validation loop
writer.add_scalar('Val/Loss', loss, epoch)
writer.add_scalar('Val/Prec@1', top1, epoch)
writer.add_scalar('Val/Prec@5', top5, epoch)
You can also see the embedding of your dataset
from torchvision import datasets
from tensorboardX import SummaryWriter
writer = SummaryWriter()
dataset = datasets.MNIST('mnist', train=False, download=True)
images = dataset.test_data[:100].float()
label = dataset.test_labels[:100]
features = images.view(100, 784)
writer.add_embedding(features, metadata=label, label_img=images.unsqueeze(1))
This is also how you can plot your model graph. The important part is to give the output tensor to the writer along with your model, so that it can compute the tensor shapes in between. I should also say that it is very slow for large models.
import torch
import torch.nn as nn
import torchvision.utils as vutils
import numpy as np
import torch.nn.functional as F
import torchvision.models as models
from tensorboardX import SummaryWriter

class Mnist(nn.Module):
    def __init__(self):
        super(Mnist, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.bn = nn.BatchNorm2d(20)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), 2)
        x = F.relu(x)+F.relu(-x)
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = self.bn(x)
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        x = F.log_softmax(x, dim=1)
        return x

model = Mnist()
# if you want to show the input tensor, set requires_grad=True
res = model(torch.autograd.Variable(torch.Tensor(1,1,28,28), requires_grad=True))
writer = SummaryWriter()
# give the model together with its output tensor, as described above
# (newer tensorboardX releases expect an example input tensor here instead)
writer.add_graph(model, res)
writer.close()
ReLU is described as a way to train an ensemble of an exponential number of linear models, due to its zeroing effect. Each iteration activates a random set of units, hence a different combination of linear models. Relying on this observation, the authors argue that it might be useful to remove the non-linearities from some layers and let them learn a combination of linearities as a whole layer.
Another argument they pose is that some representations are hard to approximate with a stack of non-linear layers, as shown by He et al. 2016. To this end, keeping a subset of layers linear might ameliorate the condition.
The way they apply EraseReLU is by removing the last ReLU layer of each "module". "Module" here is defined depending on the model architecture, as shown above.
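To illustrate what "removing the last ReLU of a module" means, here is a small PyTorch sketch. The module below is a hypothetical two-layer residual block that I made up for the example. It is not one of the architectures from the paper, and the flag name erase_last_relu is my own.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """A hypothetical 'module': two conv + BN layers with a residual connection."""

    def __init__(self, channels, erase_last_relu=False):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()
        # EraseReLU: replace the module's last ReLU with an identity mapping.
        self.last_activation = nn.Identity() if erase_last_relu else nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.last_activation(out + x)


torch.manual_seed(0)
x = torch.randn(2, 8, 16, 16)
baseline = ConvModule(8, erase_last_relu=False)
erased = ConvModule(8, erase_last_relu=True)

print((baseline(x) < 0).any().item())  # the final ReLU clips every negative value
print((erased(x) < 0).any().item())    # without it, negative values pass through
```

The only difference between the two modules is the final activation, so the change is a one-line switch in an existing model definition.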
Experiments show that EraseReLU increases the performance of networks, and its effect is larger for deeper networks. It is also more resilient to over-fitting for deep networks. The loss curves also show faster convergence for EraseReLU, and the difference is more obvious for larger datasets.
My 2 cents: Results are not that different on ImageNet, but still in favor of EraseReLU. It might be a lucky shot, though, since no confidence interval or variance is given for the training runs.
Faster convergence makes sense, with some second-guessing after reading the paper. Since more units can be active, more gradients are propagated. However, all such comments assume that error signals are always positive, which is very unlikely. Therefore, more open valves might also cause a more chaotic back-propagation signal.
Still, it is a very simple idea that shows faster convergence, better results, and a good investigation of the ReLU function. I think it is useful and can take its place in my next training session.
Disclaimer: This was written hastily in 10 mins. If you think something is wrong, or even worse, let me know :).
Lately, we (TwentyBN) took part in the ActivityNet trimmed action recognition challenge. The dataset is called Kinetics and was released recently. It is a collection of 10-second YouTube videos. Each video has a single label among 400 different action classes. The dataset was released by DeepMind with a baseline of 61% Top-1 and 81.3% Top-5. For the baseline models, please refer to their dataset paper. But it took only 2 months for people to briskly hoist the bar high above that.
As you might see above, we have the best Top-5 accuracy with 97%, which is a ~16-point improvement over the baseline. The average of Top-1 and Top-5 decides the leaderboard, which puts us in 3rd place. Still, it is a great result for us, given that we could dabble for only 2 weeks with limited juice. Team matters here!! Thx to my mates Raghav Goyal and Valentin Haenel for being great.
Here, I would like to succinctly describe our novel network architecture. It has the best single-network performance. (We plan to share a more detailed description in a separate Medium post soon.) It is called BesNet, for a cheap cryptographic reason :). BesNet yields 74% Top-1 with only RGB input. It is half the size of the baseline network described in the DeepMind paper.
In detail, BesNet is built on top of the ResNet-50 architecture. Unlike ResNet-50, BesNet performs 3D convolutions that are able to learn both spatial and temporal features. To put it better, BesNet takes not a single frame but a set of frames from a video. It convolves pixels between consecutive frames as well as pixels within a single frame. Each ResNet-50 module is built from 1x1 + 3x3 + 1x1 filters, in that order. Each such module is followed by a residual connection coming from the preceding module. It uses ReLU activation followed by Batch Normalization for each layer. In order to convert ResNet-50 to BesNet, we inflate 1x1 filters to 3x1x1 filters and 3x3 filters to 1x3x3 filters, where the ordering of the dimensions is sequence x height x width. After the convolution layers, an average pooling layer aggregates the spatial dimensions as in the normal ResNet. Subsequently, a max pooling layer aggregates the temporal dimension. A fully-connected layer is used for predictions.
BesNet is initialized with ImageNet weights. In order to convert 2D filter weights to 3D filter weights, we replicate the 2D filters along an additional dimension and then normalize the weights by the replication factor. This normalization keeps the activation values stable despite the architectural change. For example, a 1x1 filter is converted to 3x1x1 by copying the 1x1 filter 3 times along the new temporal dimension, and the weights are then divided by 3.
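Here is a minimal PyTorch sketch of this weight inflation. The function name inflate_conv2d is hypothetical and this is not the BesNet code itself. For the sake of the check, the sketch uses no temporal padding, so a clip of 3 identical frames collapses to a single output frame that should match the 2D layer's output.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d, time_kernel):
    """Build a Conv3d whose (time x height x width) kernel replicates a Conv2d kernel.

    The 2D weights are copied time_kernel times along a new temporal axis and
    divided by time_kernel, so a temporally constant input yields the same output.
    """
    kh, kw = conv2d.kernel_size
    ph, pw = conv2d.padding
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, kh, kw),
        stride=(1,) + tuple(conv2d.stride),
        padding=(0, ph, pw),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kh, kw) -> (out, in, time_kernel, kh, kw), scaled by 1 / time_kernel
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


torch.manual_seed(0)
conv2d = nn.Conv2d(4, 6, kernel_size=1)
conv3d = inflate_conv2d(conv2d, time_kernel=3)

# A "video" made of 3 identical frames: batch x channels x time x height x width.
frame = torch.randn(1, 4, 8, 8)
video = frame.unsqueeze(2).repeat(1, 1, 3, 1, 1)

out2d = conv2d(frame)             # shape (1, 6, 8, 8)
out3d = conv3d(video).squeeze(2)  # temporal size 3 - 3 + 1 = 1, squeezed away
print(torch.allclose(out2d, out3d, atol=1e-5))
```

The same function covers the 3x3 to 1x3x3 case with time_kernel=1, where the division has no effect.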
In BesNet, 3x1x1 filters are responsible for temporal and spatial cross-channel regularities. 1x3x3 filters attend only to the spatial properties of individual feature maps. This design leads to several observations. First off, it decouples temporal and spatial computations. It learns specialized layers for each of the temporal and spatial dimensions. The same idea is carried over to the pooling layers: we split the spatial and temporal dimensions between the average and max pooling layers, respectively. BesNet reduces the spatial dimensions along the convolutional layers, yet it keeps the size of the temporal dimension constant. This makes BesNet flexible enough to handle videos with different numbers of frames. Hence, given a video with K frames, BesNet keeps the temporal dimension at K until the pooling layers. Thereafter, the max pooling layer aggregates the K temporal channels into one. In a practical sense, this is easy with a dynamic computational graph library. PyTorch is bliss here!! (Sorry TF, you're so crusty.)
BesNet has a peculiar use of dilation in 3x1x1 layers, which is the really novel aspect of our architecture. BesNet uses dilation only on the temporal dimension, and it picks a random dilation factor per 3x1x1 layer for each mini-batch. It sets the padding parameters accordingly to keep the temporal dimension unchanged. At test time, each layer computes outputs for each possible dilation factor, then takes the average of the output feature maps. Random dilation enables the network to learn complex temporal relations. It also regularizes the network in the temporal domain. In practice, it reduces the effect of the FPS used when converting videos into frames.
We argue that for Kinetics, it is important to learn long-range relations between frames. Videos are long and they have only a single label. So the network needs to learn the general context of the video. In that sense, the small motions that are observed by a normal 3D convolution are not that important. Random dilation contributes to this. It widens the contextual temporal window of the network.
Our experiments with only frame features support our hypothesis here. We extracted frame features with ResNet-50 and trained an MLP after pooling the features. It gets 65% accuracy. This is better than DeepMind's baseline network with 3D convolution layers. That shows us that contextual information means more than the motion learned by 3D layers.
Motion information might be complementary, but it is not the core. This is then verified by the random dilation. BesNet yields 70% accuracy with no dilation, 68% with dilation 2, and 74% with random dilation. This stands as simple empirical evidence backing our claim here.
Random dilation is really easy to implement with PyTorch. Just take the normal Conv class and override its forward pass to randomize the dilation parameter. If you would like to try it out before we release, feel free.
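A minimal sketch of that recipe could look as follows. It is my reconstruction from the description in this post, not the released BesNet code, and the set of candidate dilation factors (1, 2, 3) is an assumption made for the example.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomDilationConv3d(nn.Conv3d):
    """3x1x1 convolution with a random temporal dilation per forward pass.

    Training: one dilation factor is sampled for each call (i.e. per mini-batch).
    Evaluation: outputs for all dilation factors are averaged.
    The candidate dilation factors are an assumption of this sketch.
    """

    def __init__(self, in_channels, out_channels, dilations=(1, 2, 3)):
        super().__init__(in_channels, out_channels, kernel_size=(3, 1, 1))
        self.dilations = tuple(dilations)

    def _conv(self, x, dilation):
        # With temporal kernel size 3, padding == dilation keeps the temporal size fixed.
        return F.conv3d(x, self.weight, self.bias, stride=1,
                        padding=(dilation, 0, 0), dilation=(dilation, 1, 1))

    def forward(self, x):
        if self.training:
            return self._conv(x, random.choice(self.dilations))
        outputs = [self._conv(x, d) for d in self.dilations]
        return torch.stack(outputs).mean(dim=0)


torch.manual_seed(0)
layer = RandomDilationConv3d(4, 8)
video = torch.randn(2, 4, 16, 7, 7)  # batch x channels x time x height x width

layer.train()
print(layer(video).shape)
layer.eval()
print(layer(video).shape)
```

Because the same weights are reused for every dilation factor, the layer adds no parameters compared to a plain 3x1x1 convolution.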
I have tried to give a very sketchy description of BesNet here, which is by no means complete. Please ping me if you have any questions. We plan to study BesNet a little more and share it in the near future in legit formats. We also plan to share a finer description of our challenge approach, with some open-source enjoyment.
Please note that BesNet is a work in progress. Anyway, feedback is always warmly welcome. Best :).
Quora recently announced the first public dataset that they have ever released. It includes 404351 question pairs with a label column indicating whether they are duplicates or not. In this post, I would like to investigate this dataset and at least propose a baseline method with deep learning.
Besides the proposed method, it includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code, check GitHub.
There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance; however, when you consider the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more likely in a real-life scenario.
When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest question is 1169 characters long (which is a long, complicated love affair question). I see that if either question of a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.
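The length statistics and the filtering step can be sketched with Pandas as below. The tiny DataFrame is a made-up stand-in for the real file, and the column names "question2" and "is_duplicate" are assumptions ("question1" is the column name used later in this post).

```python
import pandas as pd

# Toy stand-in for the Quora file; the real data is not loaded here.
# "question2" and "is_duplicate" are assumed column names.
df = pd.DataFrame({
    "question1": ["What is AI?", "How do I learn Python quickly?", "Why?"],
    "question2": ["What is artificial intelligence?",
                  "What is the fastest way to learn Python?",
                  "Why is the sky blue?"],
    "is_duplicate": [1, 1, 0],
})

MIN_CHARS = 10  # threshold from the text

lengths = pd.concat([df["question1"].str.len(), df["question2"].str.len()])
print(lengths.min(), lengths.max(), lengths.mean(), lengths.std())

# Keep a pair only if both of its questions have at least MIN_CHARS characters.
keep = (df["question1"].str.len() >= MIN_CHARS) & (df["question2"].str.len() >= MIN_CHARS)
filtered = df[keep].reset_index(drop=True)
print(len(df), "->", len(filtered))
```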
There are two other columns, "q1id" and "q2id", but I really do not know how they are useful, since the same question used in different rows has different ids.
Some labels are not correct, especially for the duplicate ones. Anyway, I decided to rely on the labels and defer pruning, since it requires a lot of manual effort.
Converting Questions into Vectors
Here, I plan to use Word2Vec to convert each question into a semantic vector, and then I stack a Siamese network on top to detect whether the pair is a duplicate.
Word2Vec is a general term used for similar algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:
king - man + woman = queen.
Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or input these vectors to Recurrent Neural Networks for more advanced applications.
There are two well-known algorithms in this domain. One is Google's network architecture, which learns representations by trying to predict the surrounding words of a target word given a certain window size. GLOVE is the other method, which relies on co-occurrence matrices. GLOVE is easy to train and it is flexible enough to add new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check this brilliant use-case, Sense2Vec.
We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all word vectors of each question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using the TF-IDF score defined for each word. We apply a weighted average of word vectors using these scores. It emphasizes the importance of discriminating words and downweights useless, frequent words which are shared by many questions.
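The difference between the plain mean and the TF-IDF weighted mean can be sketched in a few lines of numpy. Everything here is a toy assumption: the three questions are invented, random 4-dimensional vectors stand in for real 300-dimensional embeddings, and the IDF formula (log(N / df) + 1) is one common variant rather than the exact scoring used for the results in this post.

```python
import math
from collections import Counter

import numpy as np

questions = [
    "how can i learn python",
    "how can i learn to cook",
    "what is the capital of france",
]

# Toy 4-dimensional random vectors stand in for real 300-dimensional word vectors.
rng = np.random.default_rng(0)
vocabulary = sorted({word for q in questions for word in q.split()})
word_vectors = {word: rng.normal(size=4) for word in vocabulary}

# Inverse document frequency: words shared by many questions get a lower weight.
document_frequency = Counter(word for q in questions for word in set(q.split()))
idf = {word: math.log(len(questions) / df) + 1.0 for word, df in document_frequency.items()}


def mean_vector(question):
    """Plain mean of the word vectors of a question."""
    return np.mean([word_vectors[w] for w in question.split()], axis=0)


def tfidf_weighted_vector(question):
    """Average of word vectors weighted by (term frequency * idf)."""
    counts = Counter(question.split())
    weights = np.array([counts[w] * idf[w] for w in counts])
    vectors = np.array([word_vectors[w] for w in counts])
    return weights @ vectors / weights.sum()


print(idf["how"], idf["python"])
print(mean_vector(questions[0]))
print(tfidf_weighted_vector(questions[0]))
```

In the first question, "python" appears in only one question and therefore pulls the weighted vector toward itself more strongly than the shared words "how", "can", "i" and "learn".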
I described the Siamese network in a previous post. In short, it is a two-way network architecture which takes two inputs, one from each side. It projects data into a space in which similar items are pulled together and dissimilar ones are pushed apart. It is computationally efficient since the two branches share parameters.
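The core of the idea, one set of weights applied to both inputs followed by a distance, fits in a short numpy sketch. This is not the Keras model used in this post: the single-layer branch, the 64-dimensional output and the contrastive loss are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights, shared by both branches of the Siamese network.
W = rng.normal(scale=0.1, size=(300, 64))
b = np.zeros(64)


def embed(x):
    """Shared branch: a single ReLU layer projecting question vectors to 64 dims."""
    return np.maximum(x @ W + b, 0.0)


def euclidean_distance(left, right):
    return np.sqrt(np.sum((left - right) ** 2, axis=1))


def contrastive_loss(distance, is_duplicate, margin=1.0):
    """Pull duplicate pairs together, push non-duplicates at least `margin` apart."""
    positive = is_duplicate * distance ** 2
    negative = (1 - is_duplicate) * np.maximum(margin - distance, 0.0) ** 2
    return np.mean(positive + negative)


q1 = rng.normal(size=(5, 300))  # stand-ins for question1 vectors
q2 = rng.normal(size=(5, 300))  # stand-ins for question2 vectors
labels = np.array([1, 0, 1, 0, 0])

distance = euclidean_distance(embed(q1), embed(q2))
print(distance.shape, contrastive_loss(distance, labels))

# Identical inputs always land on the same point because the weights are shared.
print(euclidean_distance(embed(q1), embed(q1)))
```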
Let's load the training data first.
For this particular problem, I train my own GLOVE model by using Gensim.
The above code trains a GLOVE model and saves it. It generates 300-dimensional vectors for words. The hyperparameters could be chosen better, but it is just a baseline to see the initial performance. However, as I'll show, this model gives performance below my expectation. I believe this is because our questions are short and do not induce enough semantic structure for GLOVE to learn a salient model.
Due to the performance issue and the observation above, I decided to use a pre-trained GLOVE model which comes free with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
Before going further: I really like Spacy. It is really fast and it does everything you need for NLP in a flash by hiding many intricate details. It deserves a lot of credit. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.
The results I get from the Spacy vectors are better than those from the Gensim model I trained. It is the better choice to go further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.
After we find the TF-IDF scores, we convert each question to a weighted average of word2vec vectors using these scores. The code below does this for just the "question1" column.
Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I split the data into train and test sets too.
At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script that I used to define the model.
Here I share the best performing network, with residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity. It has Batch Normalization per layer. This is particularly important, since the BN layers enhance the performance considerably. I believe they are able to normalize the final feature vectors, and Euclidean distance performs better in this normalized space.
I tried Cosine distance, which is theoretically more in line with Word2Vec vectors, but could not obtain better results. I also tried to normalize the data to unit variance or unit L2 norm, but nothing gives better results than the original feature values.
Let's train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these, but so far I am able to give promising baseline results.
In this section, I would like to share the test set accuracy values obtained with different model and feature extraction settings. We expect to see an improvement over 0.63, since that is the accuracy we get when we set all the labels to 0.
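The 0.63 figure follows directly from the class counts given earlier. A quick check, computed on the full dataset (the exact value on a particular test split may differ slightly):

```python
negatives = 255045  # non-duplicate pairs
positives = 149306  # duplicate pairs
total = negatives + positives

# Predicting label 0 (non-duplicate) for every pair is right on all negatives.
majority_baseline = negatives / total
print(total, round(majority_baseline, 2))
```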
These are the best results I obtained with varying GLOVE models. They all use the same network and hyper-parameters, which I tuned on the last configuration listed below.
Gensim (my model) + Siamese: 0.69
Spacy + Siamese : 0.72
Spacy + TF-IDF + Siamese : 0.79
We can also investigate the effect of different model architectures. These are the values obtained with the best word2vec model shown above.
Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried to introduce Dropout between layers in different orders (before ReLU, after BN etc.), but the best I obtained is 0.75. Concatenation of different layers improves the performance by 1 percent as the final gain.
In conclusion, I have tried to present a solution to this unique problem by combining different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and are open to further optimization. However, it is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).
Switching the last layer to an FC layer improves performance to 0.84.
Using a bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I'll explain the details in another post.
Machine learning is everywhere and we are amazed by the capabilities of these algorithms. However, they are not perfect, and sometimes they behave quite dumbly. For instance, let's consider an image recognition model. This model achieves really high empirical performance and works great on normal images. Nevertheless, it might fail when you change some of the pixels of an image, even though this small perturbation might be imperceptible to the human eye. We call such an image an adversarial instance.
There are various methods to generate adversarial instances. One method is to take the derivative of the model outputs wrt the input values, so that we can change the instance values to manipulate the model decision. Another approach exploits genetic algorithms to generate manipulative instances which are confidently classified as a known concept (say 'dog') but mean nothing to human eyes.
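As a small illustration of the gradient-based approach, here is a numpy sketch of a fast-gradient-sign style perturbation on a toy logistic regression model. The weights, the input and the epsilon value are made up for the example, and the model is far simpler than a ConvNet. For such a model the gradient of the cross-entropy loss with respect to the input has a closed form, so no autograd library is needed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fgsm(x, y, w, b, epsilon):
    """Fast gradient sign perturbation for a logistic regression model.

    For p = sigmoid(w.x + b) and cross-entropy loss, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    # Step in the direction that increases the loss, bounded by epsilon per feature.
    return x + epsilon * np.sign(grad_x)


# Hypothetical "trained" weights for a toy 4-feature binary classifier.
w = np.array([1.5, -2.0, 0.5, 1.0])
b = 0.0
x = np.array([0.4, -0.3, 0.2, 0.1])
y = 1.0

x_adv = fgsm(x, y, w, b, epsilon=0.3)
print(sigmoid(x @ w + b))       # confidence for class 1 on the clean input
print(sigmoid(x_adv @ w + b))   # confidence after the perturbation
print(np.abs(x_adv - x).max())  # each feature moved by at most epsilon
```

With these toy numbers the clean input is classified as class 1, while the perturbed input falls on the other side of the decision boundary even though no feature moved by more than epsilon.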
So why are these models so weak against adversarial instances? One plausible explanation is that adversarial instances lie in the low-probability regions of the instance space. Therefore, they look weird to a network that is trained with a limited number of instances from higher-probability regions.
That being said, maybe there is no way to escape from the fretting adversarial instances, especially when they are produced by exploiting the weaknesses of a target model with gradient-guided probing. This is an analytic way of searching for a misleading input for that model with (almost) guaranteed success. Therefore, in one way or another, we can find a perturbed input that deceives any given model.
Given that observation, I believe that adversarial instances can be countered by multiple models backing each other up. In essence, this is the motivation for this work.
In this work, I would like to share my observations, focusing on the strength of ensembles against adversarial instances. This is just a toy example with many shortcomings, but I hope it will convey the idea with some empirical evidence.
In summary, this is what we do here:
Train a baseline MNIST ConvNet.
Create adversarial instances on this model by using cleverhans and save.
Measure the baseline model performance on the adversarial instances.
Train the same ConvNet architecture including adversarial instances and measure its performance.
Train an ensemble of 10 models of the same ConvNet architecture, measure the ensemble performance, and support the backing argument stated above.
My full code can be seen on GitHub, and here I only share the results and observations. You need cleverhans, TensorFlow and Keras for adversarial generation, and you need PyTorch for ensemble training. (Sorry for the multitude of libraries, but I wanted to try PyTorch as well after years of tears with Lua.)
One problem with the proposed experiment is that we do not recreate adversarial instances for each model; we use previously created ones. Anyway, I believe the empirical values verify my assumption even in this setting. In addition, I plan to do a more extensive study as future work.
Create adversarial instances.
I start by training a simple ConvNet architecture on the MNIST dataset, using the legitimate train and test set splits. This network gives 0.98 test set accuracy after 5 epochs.
For creating adversarial instances, I use the fast gradient sign method, which perturbs images using the derivative of the model outputs wrt the input values. You can see a bunch of adversarial samples below.
The same network suffers on the adversarial instances (as above) created from the legitimate test set. It gives 0.09 accuracy, which is worse than a random guess.
Plot adversarial instances.
Then I would like to see the representational power of the trained model on both the normal and the adversarial instances. I do this by using the well-known dimension reduction technique t-SNE. I first compute the last hidden layer representation of the network per instance and use these values as the input to t-SNE, which projects the data onto a 2-D space. Here is the final projection for both types of data.
These projections clearly show that adversarial instances are just random data points to the trained model, and they lie away from the real data points, in what we call low-probability regions for the trained model. I also trained the same model architecture by dynamically creating adversarial instances at train time, and then tested it on the adversarial instances created previously. This new model yields 0.98 on the normal test set, 0.91 on the previously created adversarial test set, and 0.71 on its own dynamically created adversarial instances.
The above results show that including adversarial instances strengthens the model. However, this conforms to the low-probability region argument: by providing adversarial instances, we let the model discover the low-probability regions where they lie. Besides, this is not applicable to large-scale problems like ImageNet, since you cannot afford to augment your millions of images per iteration. Therefore, assuming it works, ensembling is a more viable alternative, as it is already a common method to increase overall prediction performance.
In this part, I train multiple models in different ensemble settings. First, I train N different models on the same, whole training data. Then, I bootstrap: I train N different models, each on data randomly sampled from the normal train set. I also observe the effect of N.
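The two data settings can be sketched as index sets, one per model. The number of models and the chunk size below are hypothetical, and sampling with replacement is my reading of "bootstrap" rather than a detail confirmed by the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
num_train = 60000   # size of the MNIST training set
num_models = 5      # N, the number of models in the ensemble
chunk_size = 10000  # hypothetical number of samples drawn for each model

# Uniform setting: every model sees the same, whole training set.
uniform_indices = [np.arange(num_train) for _ in range(num_models)]

# Bootstrapped setting: every model gets its own random sample, drawn with replacement.
bootstrap_indices = [rng.choice(num_train, size=chunk_size, replace=True)
                     for _ in range(num_models)]

for model_id, indices in enumerate(bootstrap_indices):
    print(model_id, indices[:5], len(np.unique(indices)))
```

Each index array can then be used to build the training subset for one model of the ensemble.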
The best single model obtains 0.98 accuracy on the legitimate test set. However, the best single model only obtains 0.22 accuracy on the adversarial instances created in the previous part.
When we ensemble models by averaging scores, we do not see any gain and we are stuck at 0.24 accuracy for both training settings. However, surprisingly, when we perform a max ensemble (only counting on the most confident model for each instance), we observe 0.35 for the uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equal to 50.
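The difference between the two ensembling rules is easy to see on a toy case. The probabilities below are invented: two models are mildly wrong and one is confidently right. Averaging follows the majority, whereas the max rule follows the most confident model.

```python
import numpy as np

# Class probabilities from 3 models for 1 instance with 3 classes (made-up numbers).
# Two models mildly prefer class 0, one model confidently prefers class 2.
probs = np.array([
    [[0.6, 0.2, 0.2]],
    [[0.6, 0.2, 0.2]],
    [[0.1, 0.1, 0.8]],
])  # shape: (num_models, num_instances, num_classes)


def mean_ensemble(probs):
    """Average the scores over models, then pick the best class."""
    return probs.mean(axis=0).argmax(axis=1)


def max_ensemble(probs):
    """Per class, keep the highest score of any model, then pick the best class.

    This amounts to trusting the single most confident model for each instance.
    """
    return probs.max(axis=0).argmax(axis=1)


print(mean_ensemble(probs))
print(max_ensemble(probs))
```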
Increasing N raises the adversarial performance, and it is much more effective on the bootstrapped ensemble. With N=5 we obtain 0.27 for the uniform ensemble and 0.32 for the bootstrapped ensemble. With N=25 we obtain 0.30 and 0.45 respectively.
These values are interesting, especially the difference between mean and max ensemble. My intuition behind the superiority of maxing is that taking the max over predictions covers up the weaknesses of individual models with the most confident one, as I suggested in the first place. In that vein, one further observation is that adversarial performance increases as we use smaller random chunks for each model, up to a certain threshold, with increasing N (the number of models in the ensemble). It shows us that bootstrapping enables models to learn some local regions better and some worse, but the worse sides are covered by the more confident model in the ensemble.
As I said before, it is not ideal to use adversarials previously created against the baseline model in the first part. However, I believe my claim still holds. Assume that we include the baseline model in our best max ensemble above. Its mistakes would still be corrected by the other models. I also tried this (after the comments below) and included the baseline model in the ensemble. The 0.57 accuracy only drops to 0.55. That is still pretty high compared to any other method that does not see adversarials in the training phase.
It is much harder to create adversarials for an ensemble of models with gradient methods. However, genetic algorithms are applicable.
Blind spots of individual models are covered by their peers in the ensemble when we rely on the most confident one.
We observe that when we train a model with adversarial instances created dynamically per iteration, it resolves the adversarials created from the test set. That is, as the model sees examples from these regions, it becomes immune to the adversarials. This supports the argument that low probability regions carry adversarial instances.
(Before I finish) This is Serious!
Before I finish, I would like to widen the meaning of this post's heading: Ensemble against adversarial!!
"Adversarial instances" is a peculiar AI topic. It attracted a lot of interest at first, but now it seems forgotten, apart from research targeting GANs, since it does not yield direct profit compared to having better accuracy.
Even though this has been the case hitherto, we need to consider this topic more painstakingly from now on. As we witness more extensive and more capable AI in many different domains (such as health, law, governance), adversarial instances are likely to cause greater problems, intentionally or by pure randomness. This is not a sci-fi scenario I'm drawing here. It is a reality, as it is prototyped in . Just replace the simple recognition model in  with an AI ruling a court of justice.
Therefore, if we believe in a future embracing AI as a great tool to "make the world a better place!", we need to study this subject extensively before passing a certain AI threshold.
This work overlooks many important aspects, but after all it only aims to share some findings from my spare-time research. For a next post, I would like to study unsupervised models like Variational Autoencoders and Denoising Autoencoders by applying them to adversarial instances (I already started!). In addition, I plan to work on other methods for creating different types of adversarials.
From this post you should take:
References on adversarial instances.
Good example code waiting for you on GitHub that can be used in many different projects.
The power of ensembles.
Some unproven claims and opinions on the topic.
ANYWAY, I HOPE YOU LIKE IT! 🙂
Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.
Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv: …, 1–10. Retrieved from http://arxiv.org/abs/1312.6199
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572
This paper studies how semantic information is described by the higher-level units of a network, and the blind spots of network models against adversarial instances. The authors illustrate the learned semantics by inferring the maximally activating instances per unit. They also interpret the effect of adversarial examples and their generalization across different network architectures and datasets.
The findings might be summarized as follows:
Certain dimensions of each layer reflect different semantics of the data. (This is a well-known fact by now, so I skip discussing it further.)
Adversarial instances are general to different models and datasets.
Adversarial instances are more significant to higher layers of the networks.
Auto-Encoders are more resilient to adversarial instances.
Adversarial instances are general to different models and datasets.
They posit that adversarials exploiting a particular network architecture are also hard to classify for others. They illustrate this by creating adversarial instances that yield a 100% error rate on the target network architecture and using them on another network. It is shown that these adversarial instances are still hard for the other network (a network with a 2% error rate degrades to 5%). Of course, the influence is not as strong as on the target architecture (which has a 100% error rate).
Adversarial instances are more significant to higher layers of networks.
As you go to higher layers of the network, the instability induced by adversarial instances increases, as they measure with the Lipschitz constant. This is a justifiable observation given that the higher layers capture more abstract semantics, and therefore any perturbation of the input might override the constituted semantics. (For instance, a concept of "dog head" might be perturbed into something random.)
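For intuition on what such a measurement looks like, here is a small sketch under my own assumptions: a fully connected layer followed by ReLU is at most as "stretchy" as the largest singular value of its weight matrix, and the bound up to a given layer is the product of the per-layer bounds. The weights below are hypothetical and picked only so the numbers are easy to read.

```python
import numpy as np

def layer_lipschitz_bound(w):
    """Largest singular value of W: an upper bound for x -> relu(W @ x)."""
    return float(np.linalg.norm(w, ord=2))

def cumulative_bounds(weights):
    """Upper bound of the Lipschitz constant from the input up to each layer."""
    return np.cumprod([layer_lipschitz_bound(w) for w in weights])

# Hypothetical weights of a two-layer network, chosen only for illustration.
weights = [np.array([[3.0, 0.0], [0.0, 1.0]]), np.array([[2.0, 0.0], [0.0, 0.5]])]
bounds = cumulative_bounds(weights)
```

When the per-layer values are above 1, the cumulative bound grows with depth, so a small input perturbation is allowed to move the higher-layer representation further. Note that this is only an upper bound, not the actual amplification.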
Auto-Encoders are more resilient to adversarial instances.
AE is an unsupervised algorithm, and it is different from the other models used in the paper since it learns the implicit distribution of the training data instead of mere discriminative features. Thus, it is expected to be more tolerant to adversarial instances. Table 2 shows that the AE model needs stronger perturbations to reach 100% classification error with generated adversarials.
One intriguing observation is that a shallow model with no hidden units is nevertheless more robust to adversarial instances created from the deeper models. This questions the claim that adversarial instances generalize. I believe that, if the term generality is supposed to hold, then a higher degree of susceptibility ought to be obtained in this example (and in others too).
I am also happy to see that the unsupervised method is more robust to adversarials, as expected, since I believe general AI is only possible with unsupervised learning, which learns the space of the data instead of memorizing things. This is also what I plan to examine after this paper, to see how new tools like Variational Autoencoders behave against adversarial instances.
I believe that it is really hard to fight adversarial instances, especially the ones created by counter-optimization against a particular supervised model. A supervised model always has flaws to be exploited in this manner since it memorizes things [ref], and when you go beyond its scope (especially since adversarial instances are of low probability), it naturally makes mistakes. Besides, it is known that a neural network converges to a local minimum due to its non-convex nature. Therefore, by definition, it has such weaknesses.
Adversarial instances are, in a practical sense, not a big deal right now. However, this is likely to become a far more important topic as we journey toward more advanced AI. Right now, an ML model only makes tolerable mistakes. However, consider the advanced systems awaiting us in the near future, used for matters of great importance such as deciding who is guilty or who has cancer. Then this becomes a question of far greater importance.
Selfies are everywhere. With different fun masks, poses and filters, it goes crazy. When we come across any of these selfies, we automatically give it an intuitive score for its quality and beauty. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes, but they are not fully prescribed.
With the folks at 8bit.ai, we decided to develop a system that analyzes selfie images and scores them according to their quality and beauty. The idea was to see whether it is possible to mimic that bizarre perceptual understanding of humans with the recent advancements in AI. And if it is, then let's make an application and let people use it for whatever purpose. For now, we only have an Instagram bot, @selfai_robot. You can check it out before reading on.