Two Attention Methods for Better Alignment with Tacotron

In this post, I'd like to introduce two methods that have worked well in my experience for better attention alignment in Tacotron models. If you'd like to try them yourself, you can visit Mozilla TTS. The first method is the Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.

Bidirectional Decoder

(Figure from the paper.)

Bidirectional decoding uses an extra decoder which takes the encoder outputs in reverse order, and an extra loss function compares the output states of the forward decoder with those of the backward one. With this additional loss, the forward decoder models what it needs to expect for the next iterations. In this regard, the backward decoder punishes bad decisions of the forward decoder and vice versa.

Intuitively, if the forward decoder fails to align the attention, that causes a big loss, and ultimately it learns to go monotonically through the alignment process with a correction induced by the backward decoder. Therefore, this method is able to prevent the "catastrophic failure" where the attention falls apart in the middle of a sentence and never aligns again.

At inference time, the paper suggests using only the forward decoder and discarding the backward one. However, it is possible to think of more elaborate ways to combine these two models.
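
To make the extra loss concrete, here is a minimal PyTorch sketch of how the agreement term between the two decoders could be computed. The shapes and names are my own assumptions for illustration, not the exact Mozilla TTS implementation.

import torch
import torch.nn.functional as F

def decoder_agreement_loss(fwd_states, bwd_states, weight=1.0):
    # fwd_states, bwd_states: (batch, time, dim) output states of the
    # forward and backward decoders; the backward decoder ran on the
    # time-reversed sequence, so flip it back before comparing.
    bwd_states = torch.flip(bwd_states, dims=[1])
    return weight * F.l1_loss(fwd_states, bwd_states)

# total loss = forward reconstruction loss + backward reconstruction loss
#              + decoder_agreement_loss(fwd_states, bwd_states)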

Example attention figures from both decoders.

There are two main pitfalls to this method. First, due to the additional parameters of the backward decoder, training is slower (almost 2x), and this makes a huge difference especially when the reduction rate is low (the number of frames the model generates per iteration). Second, if the backward decoder penalizes the forward one too harshly, overall prosody degrades. Because of this, the paper suggests activating the additional loss only for fine-tuning.

My experience is that bidirectional training is quite robust against alignment problems and is especially useful if your dataset is hard. It also aligns almost right after the first epoch. Yes, at inference time it sometimes causes pronunciation problems, but I solved this by doing the opposite of the paper's suggestion: I fine-tuned the network without the additional loss for just one epoch, and everything started to work well.

Graves Attention

Tacotron uses Bahdanau Attention, which is a content-based attention method. However, it does not consider location information, so it needs to learn the monotonicity of the alignment just by looking at the content, which is hard. Tacotron2 uses Location Sensitive Attention, which takes into account the previous attention weights. By doing so, it learns the monotonic constraint. But it does not solve all of the problems, and you can still experience failures with long or out-of-domain sentences.

Graves Attention is an alternative that uses content information to decide how far it needs to move along the alignment per iteration. It does this by using a mixture of Gaussian distributions.

Graves Attention takes the context vector of time t-1 and passes it through a couple of fully connected layers ([FC > ReLU > FC] in our model) to estimate the step size, variance and distribution weights for time t. The estimated step size is then used to update the means of the Gaussian modes. Intuitively, the mean is the point of interest on the alignment path, the variance is the attention window around this point of interest, and the distribution weight is the importance of each distribution head.

(g, b, k) = FC(ReLU(FC(c_{t-1}))) \\
\delta = softplus(k) \\
\sigma = exp(-b) \\
w = softmax(g) \\
\mu_{t} = \mu_{t-1} + \delta \\
\alpha_{i,j} = \sum_{k} w_{k} \exp\left(-\frac{(j-\mu_{i,k})^2}{2\sigma_{i,k}}\right)

Above, I try to formulate how I compute the alignment in my implementation. g, b, k are intermediate values, \delta is the step size, \sigma is the variance, and w_{k} is the distribution weight for GMM node k. (You can also check the code.)
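
Below is a minimal PyTorch sketch of this formulation. The layer sizes, tensor shapes and class name are assumptions for illustration; the actual code in Mozilla TTS differs in its details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    # K-head Gaussian mixture attention following the equations above
    def __init__(self, query_dim, K=5):
        super().__init__()
        self.K = K
        # [FC > ReLU > FC] producing (g, b, k) for each head
        self.layers = nn.Sequential(
            nn.Linear(query_dim, query_dim),
            nn.ReLU(),
            nn.Linear(query_dim, 3 * K),
        )

    def forward(self, query, inputs, mu_prev):
        # query: (B, query_dim) context/decoder state of time t-1
        # inputs: (B, T_enc, D) encoder outputs
        # mu_prev: (B, K) Gaussian means of the previous step
        B, T_enc, _ = inputs.shape
        g, b, k = self.layers(query).view(B, 3, self.K).unbind(dim=1)
        delta = F.softplus(k)          # step size
        sigma = torch.exp(-b)          # variance
        w = torch.softmax(g, dim=-1)   # head weights
        mu = mu_prev + delta           # means only move forward
        # attention weights over encoder steps j = 0 .. T_enc-1
        j = torch.arange(T_enc, device=inputs.device).view(1, 1, T_enc)
        alpha = (w.unsqueeze(-1) *
                 torch.exp(-(j - mu.unsqueeze(-1)) ** 2 /
                           (2 * sigma.unsqueeze(-1)))).sum(dim=1)
        context = torch.bmm(alpha.unsqueeze(1), inputs).squeeze(1)
        return context, alpha, mu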

Some other versions are explained here, but so far I have found that the above formulation works best for me, without any NaNs in training. I also realized that with the best-claimed method in that paper, one of the distribution nodes overruns the others in the middle of training and, basically, the attention starts to run on a single Gaussian head.

Test-time attention plots with Graves Attention. The attention looks softer due to the scale differences of the values. In the first step, the attention weights are large since the distributions have almost uniform weights, and they differentiate as the attention runs forward.

The benefit of using a GMM is more robust attention. It is also computationally lightweight compared to both bidirectional decoding and normal location attention. Therefore, you can increase your batch size and possibly converge faster.

The downside is that, although my experiments are not complete, GMM attention has so far given slightly worse prosody and naturalness compared to the other methods.

Comparison

Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on the LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are 50 sentences in total.

Out of these 50 sentences, Bidirectional Decoding fails on 1, Graves Attention on 6, Location Sensitive Attention on 18, and Location Sensitive Attention with inference-time windowing on 11.

In terms of prosodic quality, in my opinion: Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say the quality difference is hardly observable on the LJSpeech dataset. I also need to point out that it is a hard dataset.

If you'd like to try these methods, they are all implemented in Mozilla TTS, so give them a try.



Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative to traditional solutions. It only requires text and corresponding voice clips to train the model, avoiding the toil of fine-grained data annotation. However, Tacotron can also be very time-demanding to train, especially if you don't know the right hyperparameters to begin with. Here, I'd like to share a gradual training scheme to ease the training difficulty. In my experiments, it provides faster training, tolerance for hyperparameters and more time with your family.

In summary, Tacotron is an Encoder-Decoder architecture with Attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames to be ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but has fewer model parameters than Tacotron2. Tacotron2 is much simpler but ~4x larger (~7m vs ~24m parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to start experimenting with Tacotron2 soon.

Tacotron architecture (Thx @yweweler for the figure)

Here is the trick. Tacotron has a parameter called 'r' which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter for reducing the number of computations, since the larger 'r' is, the fewer decoder iterations are needed. But setting the value too high might reduce performance as well. Another benefit of a higher 'r' value is that the alignment module stabilizes much faster. If you talk to someone who has used Tacotron, they probably know what a struggle the attention can be. So finding the right trade-off for 'r' is a big deal. In the original Tacotron paper, the authors used r=2 for the best-reported model. They also emphasize the challenge of training the model with r=1.

Gradual training comes to the rescue at this point. What it means is that we set 'r' initially large, such as 7, and then reduce it as training continues until convergence. This simple trick helps, quite magically, to solve two main problems. First, it helps the network learn monotonic attention after almost the first epoch. Second, it expedites convergence quite a lot. As a result, the final model ends up with more stable and resilient attention without any degradation in performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.

Here, I'd like to share some results to show the effectiveness. I used the LJSpeech dataset for all the results. The training schedule can be summarized as follows. (You can see I also change the batch_size, but it is not necessary if you have enough GPU memory.)

"gradual_training": [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Tacotron after 950 steps on LJSpeech. Don't worry about the last part; it is just because the model does not know where to stop initially.

Next, let's check the model training curve and convergence.


(Ignore the plot in the middle.) You can see the model jumping from r=7 to r=5; there is an obvious gain right after the jump. The test-time model results are after 300K steps, with r=1 from 290K steps onward. The training plot runs until ~300K iterations.

You can listen to voice examples generated with the final model using the Griffin-Lim vocoder. I'd say the quality of these examples is quite good to my ear.

It was a short post, but if you'd like to replicate the results here, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. I hope the imperfect documentation on the repo helps you. Otherwise, you can always ask for help by creating an issue or on the Mozilla TTS Discourse page. There are some other cool things in the repo that I will also write about in the future. Until next time..!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This might turn into a more formal research work in the future.


Using WSL Linux on Windows 10 for Deep Learning Development.

To explain briefly, WSL enables you to run Linux on Win10, so you can use your favorite Linux tools (bash, zsh, vim) for your development cycle and enjoy Win10 for the rest. It obviates the need for a dual-boot configuration, which can sometimes be a nightmare.

Why do I do this? Basically, if you have an Optimus laptop, it is an onerous job to set up a Linux distro. You need to find the right Nvidia driver to enable the GPU. Then you need to install nvidia-prime or, if you are lucky, make bumblebee work. Let's say you've done everything. After some time, you update something on your system by mistake, and the next thing you see is a black screen on the next reboot. Then it is time to search on your phone for what went wrong and try to fix it. It is horrendous!

As far as my experience goes, WSL Linux gives you all the necessary features for development, with the vital exception of access to the GPU. You can apt-get software and run it. You can even run software with a UI if you set things up right. However, due to the GPU limitation, you are able to compile CUDA code but cannot run it on Linux. Here I'd just like to explain how you can deal with this limitation with a small trick, using WSL Linux's ability to run Windows binaries.

The first thing to do is to install your preferred Linux distro from the Windows Store. Just go to the store, search for the distro and install it. If the installation is not available, you might need to update your Windows.

1. Install Linux and activate WSL

Before launching Linux, follow the documentation here to activate WSL on Win10.

After you have installed the distro and activated WSL, you can either open the command line and type ```bash```, or directly use the Linux launcher to get into the Linux terminal.

2. Install Hyper Terminal for a Linux-like experience.

One problem I've experienced with the Windows command line is the different shortcuts (copy-paste) and the inability to open multiple tabs. These are quite important features for anyone used to Linux. I solved this by switching to the Hyper terminal. It provides the closest thing to a Linux-like experience.

3. Install Conda on Windows and add its binaries to `PATH`

Now you have Linux and a cool terminal. It is time to install the rest. Note that if you don't need the GPU, you can install everything you like on Linux right away and use it there. For this example, we install Miniconda on Windows and use its python.exe from Linux to run our code on the GPU.

Another cool thing about Linux on WSL is that it lets you run Windows binaries in the Linux environment. Also, Windows' PATH environment variable is exposed to Linux too. When we install Python with Miniconda, it asks whether to add python.exe to the PATH variable. Just do it. Then we set an alias on the Linux side that runs python.exe when we type python, so that we can develop things on Linux but run the code on Windows, using the GPU.

Now install Miniconda. Click Next until you see the screen below and tick all the options.

After the installation, if you run python on the command line, you should see a Python session running as shown below.

You should also be able to run Python from Linux. Open the terminal, switch to Linux, type python.exe, and you should get it working.

4. Create aliases on Linux

The last step is to create aliases on the Linux bash that run the Windows executables (e.g. python.exe when you call python). These are the aliases I set.


alias python="python.exe"
alias conda="conda.exe"
alias ipython="ipython.exe"
alias nosetests="nosetests.exe"
alias pip="pip.exe"
alias nvidia-smi="/mnt/c/Program\ Files/NVIDIA\ Corporation/NVSMI/nvidia-smi.exe"

After all this, you should be able to run your code on the GPU. One important note: since we use Python on Windows, you need to set folder paths in Windows terms. Don't forget the escape character when separating folders.

Right now, I created a folder /users/erogol/projects and I keep my development craft in it. It is actually different from the home folder set for your Linux installation, but that does not matter since we use Windows file paths. Now you can install your favorite editor and enjoy training new models.

Please let me know if I skipped something here. It is quite possible, since I wrote this after I had set everything up.

It is good to see that Microsoft changed direction and started to embrace Linux in their ecosystem by listening to the needs of their users. It was a meaningless fight from the start.

Edit: You need to run the terminal with 'Run as Administrator' to install things with conda on Windows.


Text to Speech Deep Learning Architectures

Small Intro. and Background

Recently, I started at Mozilla Research. I am really excited to be part of a small but great team working hard to solve important ML problems. And everything is open-sourced. We license things to make them open-source. An oxymoron at first sight, isn't it? But I like it!!

Before my arrival, our team had already released the best-known open-source STT (Speech to Text) implementation, based on Tensorflow. The next step is to improve the current Baidu Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. After these two projects, anyone around the world will be able to create their own Alexa without any commercial attachment. That is the real way to democratize AI, at least I believe it is.

Up until now, I have worked on a variety of data types and ML problems, except audio. Now it is time to learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I'd like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to our Github repository hosting the PyTorch implementation of the first version. (We switched to PyTorch for obvious reasons.) It is a work in progress, so please feel free to comment and contribute.

Below I'd like to share my pinpoint summaries of the well-known TTS papers, which are by no means complete but are useful for highlighting the important aspects of these papers. Let's start.

Glossary

  • Prosody: https://en.wikipedia.org/wiki/Prosody_(linguistics)
  • Phonemes: units of sound that we pronounce as we speak. Necessary since words spelled very similarly might be pronounced very differently (e.g. "Rough" vs "Though")
  • Vocoder: the part of the system that decodes features into audio signals. WaveNet is used in Deep Voice at that stage.
  • Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
  • Autoregressive Model: a model that depends linearly on its own previous outputs and on a parameter set which can be approximated.
  • Query, Key, Value: the Key is used by the attention module to compute attention weights. The Value is the vector weighted by the attention weights to compute the module output. The Query vector is the hidden state of the decoder.
  • Grapheme: Cool way to say character.
  • Error Modes: sub-optimal states of the attention block from which it is not able to escape.
  • Monotonic Attention: use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a close relation between the output at time t and the input at time t. However, it is not as reasonable for translation problems since word order might not be the same. https://arxiv.org/pdf/1704.00784.pdf
  • MOS: Mean Opinion Score. Crowd-sources the evaluation process with native speakers. It is not easy to measure, especially for a layman.
  • Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
  • Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
  • Teacher Forcing: providing the ground-truth output of time t as the model input at time t+1. It is controlled ground-truth feedback, as a teacher gives to a student.
  • Causal convolution: a convolution which does not see future units beyond the reference time step t that we want to predict next. In practice, it is implemented by padding normal convolution layers on the past side only (see the sketch right after this list).
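
As a side note to the last item, here is a minimal PyTorch sketch of a 1D causal convolution (channel-first tensors assumed); it is only an illustration, not code from any of the papers below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # convolution whose output at time t only sees inputs up to time t
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad the past side only
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):              # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)            # same time length as the input

y = CausalConv1d(80, 80, kernel_size=3, dilation=2)(torch.randn(4, 80, 100))
assert y.shape == (4, 80, 100)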

Deep Voice (25 Feb 2017)

  • Text to phonemes. Deterministically computed with a dictionary, or with a Seq2Seq model to deal with unseen words.
  • The same phoneme might have different durations in different words. We need to predict the duration. It is sequence-dependent.
  • Fundamental frequency for the pitch of each phoneme. It is sequence-dependent.
  • Frequency + Phonemes + Duration = Voice synthesis. It is done via Google's WaveNet.
  • Models
    • Segmentation Model
      • Segment audio signal to phonemes.
      • CTC loss
      • Predict phoneme pairs due to probability mass
      • Inputs:
        • Audio clip of “It was early spring”
        • Phonemes (label)
          • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
      • Outputs:
        • Pairs of Phonemes with their start time
          • [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
    • Fundamental Freq & Duration Models
      • Segmentation model predictions are the labels for these models.
      • Inputs:
        • Phonemes
      • Outputs:
        • Duration, Probability, F0 for each phoneme; [H, 0.1, 25hz], ...
    • Audio Synthesizer Model
      • Simplified WaveNet
      • Inputs:
        • Duration and F0 for phonemes + audio signals (labels)
      • Outputs:
        • Synthesized audio signal
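
To make the data flow of the pipeline above easier to follow, here is a hypothetical glue-code sketch; every name in it is made up for illustration and is not a real API.

from typing import Callable, Dict, List

def deep_voice_tts(text: str,
                   lexicon: Dict[str, List[str]],       # word -> phonemes
                   g2p: Callable[[str], List[str]],     # seq2seq fallback for unseen words
                   duration_f0_model: Callable,         # per-phoneme duration & F0
                   synthesizer: Callable):              # simplified WaveNet
    # 1. text to phonemes: dictionary lookup, seq2seq model for unseen words
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes += lexicon.get(word) or g2p(word)
    # 2. predict duration and fundamental frequency (F0) for each phoneme
    durations_and_f0 = duration_f0_model(phonemes)
    # 3. phonemes + durations + F0 -> waveform
    return synthesizer(phonemes, durations_and_f0)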



Setting Up Selenium on RaspberryPi 2/3

Selenium is a great tool for Internet scraping or automated testing of websites. I personally use it for scraping dynamic-content websites in which the content is created by JavaScript routines. Lately, I also tried to run Selenium on a Raspberry Pi and found out that it is not easy to install all the requirements. Here I'd like to share my commands to make things easier for you.

Here is a simple run-down to install all the requirements to make Selenium available on a Raspi. Basically, we install Firefox first, then Geckodriver and finally Selenium, and we are ready to go.

Before starting, it is better to note that ChromeDriver does not support ARM processors anymore, therefore it is not possible to use Chromium with Selenium on a Raspberry.

First, install the system requirements. Update the system, then install Firefox and xvfb (a display server implementing X11):

sudo apt-get update
sudo apt-get install iceweasel
sudo apt-get install xvfb

Then, install the Python requirements: Selenium, PyVirtualDisplay (which you can use to run Selenium with a hidden browser display) and xvfbwrapper.

sudo pip install selenium
sudo pip install PyVirtualDisplay
sudo pip install xvfbwrapper

Hopefully everything ran well, and now you can test the installation.

from pyvirtualdisplay import Display
from selenium import webdriver

# start a virtual (hidden) display so Firefox can run without a screen
display = Display(visible=0, size=(1024, 768))
display.start()

# open Firefox through Geckodriver and load a page
driver = webdriver.Firefox()
driver.get('http://www.erogol.com/')
driver.quit()

display.stop()

 

 


Why mere Machine Learning cannot predict Bitcoin price

Lately, I have been studying time series to see something beyond the limits of my experience. I decided to use what I learn for cryptocurrency price prediction, with a hunch of getting rich. Kidding? Or not :). As I saw more of the intricacies of the problem, I got deeper into it and found a new challenge. Now, I am in the process of creating something new, ranging from traditional machine learning to the latest reinforcement learning achievements.

So, the story aside, I'd like to see whether an AI bot trading without manual help is possible or just an alluring dream. Lately, I have read a lot about the topic, from traditional financial technical analysis to the latest ML solutions. What I see on the ML front is that many people claim to use lazy ML with success and sell deceitful dreams. What I call lazy ML is: download the data, train the model, done. We are rich!! What I really observe is that they draw false conclusions induced by false interpretations. And the bad side of this is that many other people try to replicate their results (aka beginner me) and waste a lot of time. Here, I'd like to show a particular mistake in those works, with accompanying code that helps us see the problem more clearly.

Briefly, this work illustrates a simple supervised setting where a model predicts the next Bitcoin move given the current state. Here is the full Notebook, and to see a more advanced set of experiments, check out the repo. Hope you like it.



Online Hard Example Mining on PyTorch

Online Hard Example Mining (OHEM) is a way to pick hard examples at reduced computation cost, in order to improve your network's performance on borderline cases, which generalizes to overall performance. It is mostly used for Object Detection. Suppose you want to train a car detector and you have positive (with car) and negative (no car) images. Now you'd like to train your network. In practice, you end up with many negatives and comparatively few positives. To this end, it is clever to pick the subset of negatives that are the most informative for your network. Hard Example Mining is the way to do this.

In a detection problem, hard examples correspond to false positive detections, depicted here in red.

In general, to pick a subset of negatives, you first train your network for a couple of iterations, then run it over all your negative instances and pick the ones with the largest loss values. However, this is computationally very costly since you possibly have millions of images to process, and it is sub-optimal for your optimization since you freeze the network while picking hard instances, not all of which will actually be useful over the next couple of iterations. That is, you assume that all the hard negatives you pick remain useful for every iteration until the next selection, which is an imperfect assumption, especially for large datasets.

Okay, so what does Online mean in this regard? OHEM solves the two aforementioned problems by performing hard example selection batch-wise. Given a batch of size K, it performs a regular forward pass and computes per-instance losses. Then it finds the M<K hard examples in the batch with the highest loss values and back-propagates only the loss computed over these selected instances. Smart, huh? 🙂

It reduces computation by running hand in hand with your regular optimization cycle. It also unties the assumption of foreseen usefulness by picking hard examples per iteration, so now we really pick the hard examples for each iteration.
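
Here is a minimal sketch of the idea for a classification-style loss; the function name and the plain top-k selection are my own simplification, not the original OHEM RoI formulation.

import torch
import torch.nn.functional as F

def ohem_loss(logits, targets, top_m):
    # per-instance losses, no reduction yet
    losses = F.cross_entropy(logits, targets, reduction="none")
    # keep only the M hardest examples of this batch
    hard_losses, _ = torch.topk(losses, k=min(top_m, losses.size(0)))
    return hard_losses.mean()

# usage: loss = ohem_loss(model(x), y, top_m=16); loss.backward()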

If you'd like to test it yourself, here is a PyTorch OHEM implementation that I offer you to use with a grain of salt.
