Here is the presentation, I gave at Mozilla All-Hands Orlando about https://github.com/mozilla/TTS
To explain briefly, WSL enables you to run Linux on Win10 and you can use your favorite Linux tools (bash, zsh, vim) for your development cycle and you can enjoy Win10 for the rest. It obviates the need for dual-boot configuration which might be a nightmare sometimes.
Why I do this? Basically, if you have an Optimus Laptop, it is an onerous job to set up a Linux distro. You need to find the right Nvidia driver to enable GPU. Then you need to install nvidia-prime or if you are lucky, you make bumblebee work. Let's say you've done everything. After some time, you update something on your system by mistake and the next thing you see a black screen for the next reboot. It is time to search what is wrong on your phone and try to fix it. It is horrendous!
As far as my experience goes, WSL Linux gives all the necessary features for your development with a vital exception of reaching to GPU. You can apt-get software, run it. Even you can run a software with UI if you set things right. However, due to the GPU limitation, you are able to compile CUDA codes but cannot run on Linux. Here I just like to explain, how you can deal with limitation with a small trick using the ability of WSL Linux running Win binaries.
The first thing to do is to install your preferred Linux distro from Windows Store. Just go to the store, search for the distro and install. If installation is not available, you might need to update your Windows.
1. Install Linux and activate WSL
Before launching Linux, follow the documnetation here to activate WSL on Win10.
After you installed the distro and activated WSL, you can either open the command-line and type ```bash``` or directly use the Linux launcher to get into the linux terminal.
2. Install Hyper Terminal for Linux like experience.
One problem I've experienced with Windows command-line is the differences of shortcuts (Copy-Paste) and inability to open multiple tabs. These are quite important features for Linux custom. I solved this by switching to hyper terminal. It simulates the best possible Linux like experience.
3. Install Conda in Windows and add its binaries to `path`
Now you have Linux and a cool terminal. It is time to install the rest. Note that, if you don't bother to use GPU, you can install everything you like on Linux right away and use. For this example, we install miniconda to Windows and use the python.exe from Linux to run our codes on GPU.
Another cool thing about Linux on WSL is that it enables you to run Windows binaries on Linux environment. Also, Windows' PATH environment variable is exposed to Linux too. As we install python with miniconda, it asks you to add python.exe to the PATH variable. Just do it. Then we set an alias on the Linux side to run python.exe, when we type python so that we can develop things on Linux but run the code on Windows by using the GPU.
Now install miniconda. Say next until you see the screen below and set the ticks for the all options.
After the installation, if you run python on command-line you should see python session running as shown below.
You should also be able to run python form Linux. Open the terminal, switch Linux, type python.exe and you get it working.
4. Create aliases on Linux
The last step is to creating an alias on Linux bash that runs python.exe when you call python. These are the aliases I set.
alias conda="conda.exe" alias ipython="ipython.exe" alias nosetests="nosetests.exe" alias pip="pip.exe"
After all, you should be able to run your code on GPU. One important note is that since we use python on windows, you need to set folder paths in relation to Windows. Don't forget the escape character for separating folder.
Right now, I created a folder /users/erogol/projects and I keep my development craft in it. So it is actually different from the home folder set for your Linux installation. But it does not matter since we use windows file paths. Now, you can install your favorite editor and enjoy training new models.
Please let me know if I skip something here. It is very likely since I wrote this after I set everything.
It is good too see that Microsoft changed direction and start to embrance Linux into their ecosystem by listening the needs of their users. It was a meaningless fight from the start.
Small Intro. and Background
Recently, I started at Mozilla Research. I am really excited to be a part of a small but a great team working hard to solve important ML problems. And everything is open-sourced. We license things to make open-sourced. Oxymoron by the first sight, isn't it. But I like it !!
Before my presence, our team already released the best known open-sourced STT (Speech to Text) implementation based on Tensorflow. The next step is to improve the current Baidu's Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. So after these two projects, anyone around the world will be able to create his own Alexa without any commercial attachment. Which is the real way to democratize AI, at least I believe it is.
Up until now, I worked on variety of data types and ML problems, except audio. Now it is time learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I like to share the top notch DL architectures dealing with TTS (Text to Speech). I also invite you to our Github repository hosting PyTorch implementation of the first version implementation. (We switched to PyTorch for obvious reasons). It is a work in progress and please feel free to comment and contribute.
Below I like to share my pinpoint summary of the well-known TTS papers which are by no means complete but useful to highlight important aspects of these papers. Let's start.
- Prosody: https://en.wikipedia.org/wiki/Prosody_(linguistics)
- Phonemes : units of sounds, we pronounce as we speak. Necessary since very similar words in letter might be pronounced very differently (e.g. "Rough" "Though")
- Vocoder: part of the system decoding from features to audio signals. Wave is used in Deep Voice at that stage.
- Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
- Autoregressive Model: Specifies a model depending linearly on its own outputs and on a parameter set which can be approximated.
- Query, Key, Value: Key is used by attention module to compute attention weights. Value is the vector stipulated by the attention weights to compute the module output. Query vector is the hidden state of the decoder.
- Grapheme: Cool way to say character.
- Error Modes: Sub-optimal status for the attention block where it is not able to escape.
- Monotonic Attention: Use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a certain relation btw the output at time t and the input at time t. However, it is not that reasonable for translation problem since words orders might no be the same. https://arxiv.org/pdf/1704.00784.pdf
- MOS: Mean Opinion Score. Crows-source the evaluation process with native speakers. It is not easy to measure, especially for a layman.
- Context vector: Output of an attention module which summarizes multiple time-step output of the encoder.
- Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
- Teacher Forcing: Providing model's expected output at time t as a input at time t+1. It is controlled ground-truth feedback as a teacher does to a student.
- Casual convolution: Convolution which does not foresee the future units given the reference time step T which we like to predict next. In practice, it is implemented by setting right padding orientation to to normal convolution layers.
Deep Voice (25 Feb 2017)
- Text to phonemes. Deterministically computed with a dictionary. Or Seq2Seq model to deal with the unseen words.
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU dictionary
- The same phoneme might hold different durations in different words. We need to predict the duration. It is sequence depended.
- Fundamental frequency for the pitch of the each phoneme. It is sequence depended.
- Frequency + Phonemes + Duration = Voice synthesis. It is done via Google's WaveNet.
- Segmentation Model
- Segment audio signal to phonemes.
- CTC loss
- Predict phoneme pairs due to probability mass
- Audio clip of “It was early spring”
- Phonemes (label)
- [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Pairs of Phonemes with their start time
- [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
- Pairs of Phonemes with their start time
- Fundamental Freq & Duration Models
- Segmentation model predictions are the labels for these models.
- Duration, Probability, F0 for each phoneme; [H, 0.1, 25hz], ...
- Audio Synthesizer Model
- Simplified WaveNet
- Duration and F0 for phonemes + audio signals (labels)
- Synthesis audio signal
- Segmentation Model
Here I like to give a simple run-down to install all requirements to make Selenium available on a Raspi. Basically, we install first Firefox, then Geckodriver and finally Selenium and we are ready to go.
Before start, better to note that ChromeDriver does not support ARM processors anymore, therefore it is not possible to use Chromium with Selenium on Raspberry.
First, install system requirements. Update the system, install Firefox and xvfb (display server implementing X11);
sudo apt-get update sudo apt-get install iceweasel sudo apt-get install xvfb
Then, install python requirements. Selenium, PyVirtualDisplay that you can use for running Selenium with hidden browser display and xvfbwrapper.
sudo pip install selenium sudo pip install PyVirtualDisplay sudo pip install xvfbwrapper
Hope everything run well and now you can test the installation.
from pyvirtualdisplay import Display from selenium import webdriver display = Display(visible=0, size=(1024, 768)) display.start() driver = webdriver.Firefox() driver.get('http://www.erogol.com/') driver.quit() display.stop()
Lately, I study time series to see something more out the limit of my experience. I decide to use what I learn in cryptocurrency price predictions with a hunch of being rich. Kidding? Or not :). As I see more about the intricacies of the problem I got deeper and I got a new challenge out of this. Now, I am in a process of creating something new using traditional machine learning to latest reinforcement learning achievements.
So the story aside, I like to see if an AI bot trading without manual help is possible or is a luring dream. Lately, I read a lot about the topic from traditional financial technical analysis to latest ML solutions. What I see at the ML front is many people claim to use lazy ML with success and sell deceitful dreams.What I call lazy ML is, downloading data , training the model and done. We are rich!! What I really experience is they have false conclusion induced by false interpretations. And the bad side of this, many other people try to replicate their results (aka beginner me) and waste a lot of time. Here, I like to show a particular mistake in those works with a accompanying code helping us to realize the problem better off.
Briefly, this work illustrates a simple supervised setting where a model predicts the next Bitcoin move given the current state. Here is the full Notebook and to see more advance set of experiments check out the repo. Hope you like that.
Online Hard Example Mining (OHEM) is a way to pick hard examples with reduced computation cost to improve your network performance on borderline cases which generalize to the general performance. It is mostly used for Object Detection. Suppose you like to train a car detector and you have positive (with car) and negative images (with no car). Now you like to train your network. In practice, you find yourself in many negatives as oppose to relatively much small positives. To this end, it is clever to pick a subset of negatives that are the most informative for your network. Hard Example Mining is the way to go to this.
In general, to pick a subset of negatives, first you train your network for couple of iterations, then you run your network all along your negative instances then you pick the ones with the greater loss values. However, it is very computationally toilsome since you have possibly millions of images to process, and sub-optimal for your optimization since you freeze your network while picking your hard instances that are not all being used for the next couple of iterations. That is, you assume here all hard negatives you pick are useful for all the next iterations until the next selection. Which is an imperfect assumption especially for large datasets.
Okay, what Online means in this regard. OHEM solves these two aforementioned problems by performing hard example selection batch-wise. Given a batch sized K, it performs regular forward propagation and computes per instance losses. Then, it finds M<K hard examples in the batch with high loss values and it only back-propagates the loss computed over the selected instances. Smart hah ? 🙂
It reduces computation by running hand to hand with your regular optimization cycle. It also unties the assumption of the foreseen usefulness by picking hard examples per iteration so thus we now really pick the hard examples for each iteration.
If you like to test yourself, here is PyTorch OHEM implementation that I offer you to use a bit of grain of salt.
Let's directly dive in. The thing here is to use Tensorboard to plot your PyTorch trainings. For this, I use TensorboardX which is a nice interface communicating Tensorboard avoiding Tensorflow dependencies.
First install the requirements;
pip install tensorboard pip install tensorboardX
Things thereafter very easy as well, but you need to know how you need to communicate with the board to show your training and it is not that easy, if you don't know Tensorboard hitherto.
... from tensorboardX import SummaryWriter ... writer = SummaryWriter('your/path/to/log_files/') ... # in training loop writer.add_scalar('Train/Loss', loss, num_iteration) writer.add_scalar('Train/Prec@1', top1, num_iteration) writer.add_scalar('Train/Prec@5', top5, num_iteration) ... # in validation loop writer.add_scalar('Val/Loss', loss, epoch) writer.add_scalar('Val/Prec@1', top1, epoch) writer.add_scalar('Val/Pred@5', top5, epoch)
You can also see the embedding of your dataset
from torchvision import datasets from tensorboardX import SummaryWriter dataset = datasets.MNIST('mnist', train=False, download=True) images = dataset.test_data[:100].float() label = dataset.test_labels[:100] features = images.view(100, 784) writer.add_embedding(features, metadata=label, label_img=images.unsqueeze(1))
This is also how you can plot your model graph. The important part is to give the output tensor to writer as well with you model. So that, it computes the tensor shapes in between. I also need to say, it is very slow for large models.
import torch import torch.nn as nn import torchvision.utils as vutils import numpy as np import torch.nn.functional as F import torchvision.models as models from tensorboardX import SummaryWriter class Mnist(nn.Module): def __init__(self): super(Mnist, self).__init__() self.conv1 = nn.Conv2d(1, 10, kernel_size=5) self.conv2 = nn.Conv2d(10, 20, kernel_size=5) self.conv2_drop = nn.Dropout2d() self.fc1 = nn.Linear(320, 50) self.fc2 = nn.Linear(50, 10) self.bn = nn.BatchNorm2d(20) def forward(self, x): x = F.max_pool2d(self.conv1(x), 2) x = F.relu(x)+F.relu(-x) x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2)) x = self.bn(x) x = x.view(-1, 320) x = F.relu(self.fc1(x)) x = F.dropout(x, training=self.training) x = self.fc2(x) x = F.log_softmax(x) return x model = Mnist() # if you want to show the input tensor, set requires_grad=True res = model(torch.autograd.Variable(torch.Tensor(1,1,28,28), requires_grad=True)) writer = SummaryWriter() writer.add_graph(model, res) writer.close()
ReLU is defined as a way to train an ensemble of exponential number of linear models due to its zeroing effect. Each iteration means a random set of active units hence, combinations of different linear models. They discuss, relying on the given observation, it might be useful to remove non-linearities for some layers and letting them to learn combination of linearities as the whole layer.
Another argument as poised, some representations are hard to approximate by a stack of non-linear layers. as shown by He et al. 2016. To this end, letting linearities for a subset of layers might ameliorate the condition.
The way they apply EraseReLU is removing the last ReLU layer of each "module". "Module" here is defined depending on the model architecture as shown above.
Experiments show that EraseReLU increases the performance of networks and its effect is larger for deeper networks. It is also more resilient to over-fitting for deep networks. The loss curves also show faster convergence for EraseReLU and the difference more obvious for larger datasets.
My 2 cents: Results are not that different on ImageNet but still better to the favor of EraseReLU. Then it might be the case of lucky shoot since there is no confidence interval or variance given for the trainings.
Faster convergence makes sense with the help of second guessing after the paper. Since there are more active units possible it entails to propagate more gradients. However, all such comments assumes that error signals are always positive. Which is very unlikely. Therefore, more open valves might cause more chaotic back-propagation signal.
Yet it is very simple idea, it shows faster convergence, better results and a good investgation of ReLU function. It think it is useful and can take its position in my next training session.
Disclaimer: This is written hastily in 10 mins. If you think something wrong or even worse let me know :).
There are numerous on-line and off-line technical resources about deep learning. Everyday people publish new papers and write new things. However, it is rare to see resources teaching practical concerns for structuring a deep learning projects; from top to bottom, from problem to solution. People know fancy technicalities but even some experienced people feel lost in the details, once they need to structure their own project.
Andrew Ng’s new initiative deeplearning.ai rushes to help at this stage. deeplearning.ai is a collection of courses teaching deep learning with all the necessary details (great place to start deep learning !!) and one of the courses called “Structuring Machine Learning Project” particularly teaches the design of a deep learning solution intuitively with real-life examples. It is not possible to give a single pill ruling the all but this course establishes a good basis to think about a DL project. And apparently, I still have things to learn from Andrew Ng. after all my years of experience. Note that, I also started my ML career with his famous Coursera ML course :). Thank you Mr. Ng.
Below, I try to plot a diagram summarizing what is mentioned in the course. However be aware that it is not a verbatim depiction and probably includes some of my own measures. In the end, It was a good exercise for me, and hopefully, is a good reference for you.
Please go and check the videos, if the diagram does not make sense at all and ping me if you have something that you don’t like.
Lately, we (TwentyBN) took a part in Activity Net trimmed action recognition challenge. The dataset is called Kinetics and recently released. It is a collection of 10 second YouTube videos. Each video has a single label among 400 different action classes. The dataset released by DeepMind with a baseline 61% Top-1 and 81.3% Top-5. For baseline models please refer to their dataset paper. But, it took 2 months for people to briskly hoist the bar high above.
As you might see above, we have the best Top-5 accuracy with 97% which is ~16% improvement on top of the baseline. The average of Top-1 and Top-5 decides the leader-board which places us to 3rd place. Yet, it is a great result for us where we could dabble only 2 weeks with limited juice. Team matters here!! Thx to my mates Raghav Goyal and Valentin Haenel for being great.
Here, I like to succinctly describe our novel network architecture. It has the best single network performance. (We plan to share a more detailed description in a separate Medium soon.) Namely, it is called BesNet due to a cheap cryptographic reason :). BesNet yields 74% Top-1 with only RGB . It is half-size of the baseline network described in the DeepMind paper.
In detail, BesNet is devised on top of ResNet-50 architecture. Distinctly, BesNet performs 3D convolutions that are able to learn both spatiotemporal features. In a better extent, BesNet takes not a single frame, but a set of frames from a video. It convolves pixels between consecutive frames as wells as single frame pixels. Each ResNet-50 module buckled with 1x1 + 3x3 +1x1 filters in order. Each such module followed by a residual connection coming from preceding module. It uses ReLU activation followed by a Batch-Normalization for each layer. In order to convert Resnet-50 to BesNet, we inflate 1x1 filters to 3x1x1 filters and 3x3 filters to 1x3x3 filters where the ordering of the dimensions is sequence x height x width. After convolution layers, an average pooling layer aggregates spatial dimension as in the normal ResNet. Subsequently, a max pooling layer aggregates temporal dimensions. A fully-connected layer used for predictions.
BesNet is initialized with ImageNet weights. In order to convert 2D filter weights to 3D filter weights, we replicate 2D filters along an additional dimension and then normalize the weights by the replication factor. This normalization keeps the activation values stable despite the architectural change. For example, a 1x1 filter is converted to 3x1x1 by copying the 1x1 filter 3 times along the third dimension and weights are divided by 3 at the end.
In BesNet, 3x1x1 filters are responsible for temporal and spatial cross-channel regularities. 1x3x3 filters pay into only spatial properties of individual feature maps. This orientation excites several observations. First off, it decouples temporal and spatial computations. It learns specialized layers for each of the temporal and spatial dimensions. The idea also entertained by the pooling layers. We decomposed spatial and temporal dimension over average and max pooling layers respectively. BesNet reduces the spatial dimensions along the convolutional layers yet it keeps the size of temporal dimension constant. This makes BesNet flexible to handle videos with different number of frames. Hence, given a video with K frames, BesNet keeps the temporal dimension as K until the pooling layers. Thereafter, max pooling layer aggregates K temporal channels into one. In a practical sense, this is easy with a dynamic computational graph library. Pytorch is a bliss here !! (Sorry TF, You're so crusty.)
BesNet has a peculiar use of dilation in 3x1x1 layers which defines the real novel aspect of our architecture. BesNet uses dilation only on temporal dimension and it picks a random dilation factor per 3x1x1 layer for each mini-batch. It sets padding parameters in accordance to keep the temporal dimension unchanged. At the test time, each layer computes outputs for each possible dilation factor, then takes the average of the output feature maps. Random dilation enables the network to learn complex temporal relations. It also regularizes the network in the temporal domain. In practice, it reduces the effect of FPS used for casting videos into frames.
We discuss that for Kinetics, it is important to learn long range relations between frames. Videos are long and they have only a single label. So the network needs to learn the general context of the video. In that sense, small motions that are observed by a normal 3D convolution are not that important. Random dilation pays into this. It augments the contextual temporal window of the network.
Our experiments with only frame futures support our hypothesis here. We extracted frame features with ResNet-50 and train an MLP after pooling the features. It gets 65% accuracy. It is better than DeepMind's baseline network with 3D convolution layers. That shows us contextual information means more than motion learned by 3D layers.
Motion information might be complementary but not the core. It is then verified by the random dilation. BesNet with no dilation results 70% , dilation 2 68% and the random dilation 74% accuracy. This stands to be a simple empirical proof backing our claim here.
Random dilation is really easy to implement with Pytorch. Just take normal Conv class and overwrite its forward pass by randomizing dilation parameter. If you like to try out before we release fell free.
I try to give a very sketchy description of BesNet here by no means complete. Please ping me if you have any question. We plan to study BesNet a little more and share it in the near future in legit formats. We also plan to share a finer description of our challenge approach with some open-source enjoyment.
Please note that BesNet is a work in progress. Anyways, feedbacks are always warmly welcome. Best :).