# Two Attention Methods for Better Alignment with Tacotron

In this post, I would like to introduce two methods that, in my experience, work well for improving attention alignment in Tacotron models. If you would like to try them yourself, you can visit Mozilla TTS. The first method is the Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.

## Bidirectional Decoder

Bidirectional decoding uses an extra decoder that takes the encoder outputs in reverse order, together with an extra loss function that compares the output states of the forward decoder with those of the backward one. With this additional loss, the forward decoder learns what to expect in the upcoming iterations. In this regard, the backward decoder penalizes bad decisions of the forward decoder and vice versa.
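As a rough illustration of how the loss terms can fit together, here is a minimal sketch in plain Python. It is not the Mozilla TTS implementation: the function names, the use of an L1 distance, the `agreement_weight` parameter and the assumption that the backward decoder emits its frames in reverse time order are all assumptions made for the sake of the example.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally shaped [frames][bins] lists."""
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += abs(x - y)
            count += 1
    return total / count


def bidirectional_loss(forward_out, backward_out, target, agreement_weight=1.0):
    """Sum of both reconstruction losses plus a forward/backward agreement term."""
    # Assumption: the backward decoder produced its frames last-to-first,
    # so we flip them along time before comparing.
    backward_aligned = backward_out[::-1]
    forward_loss = l1_loss(forward_out, target)
    backward_loss = l1_loss(backward_aligned, target)
    # The extra term: both decoders are penalized when they disagree.
    agreement_loss = l1_loss(forward_out, backward_aligned)
    return forward_loss + backward_loss + agreement_weight * agreement_loss


# Toy example: 3 frames with 1 spectrogram bin each (hypothetical values).
target = [[0.0], [1.0], [2.0]]
forward_out = [[0.0], [1.0], [2.0]]
backward_out = [[2.0], [1.0], [1.0]]  # reverse time order, last frame is off
loss = bidirectional_loss(forward_out, backward_out, target)
print(f"total loss: {loss:.3f}")
```

Setting `agreement_weight` to zero in this sketch recovers two independently trained decoders, which is one way to picture fine-tuning without the additional loss.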

Intuitively, if the forward decoder fails to align the attention, this causes a large loss, and ultimately the decoder learns to move monotonically through the alignment with a correction induced by the backward decoder. Therefore, this method is able to prevent "catastrophic failure", where the attention falls apart in the middle of a sentence and never aligns again.

At inference time, the paper suggests using only the forward decoder and dropping the backward decoder. However, it is possible to think of more elaborate ways to combine the two.

There are two main pitfalls of this method. First, due to the additional parameters of the backward decoder, the model is slower to train (almost 2x), and this makes a huge difference especially when the reduction rate (the number of frames the model generates per iteration) is low. Second, if the backward decoder penalizes the forward one too harshly, overall prosody degrades. For this reason, the paper suggests activating the additional loss only for fine-tuning.

My experience is that bidirectional training is quite robust against alignment problems and is especially useful if your dataset is hard. It also aligns almost right after the first epoch. Admittedly, it sometimes causes pronunciation problems at inference time, but I solved this by doing the opposite of the paper's suggestion: I fine-tuned the network without the additional loss for just one epoch and everything started to work well.

## Graves Attention

Tacotron uses Bahdanau Attention, which is a content-based attention method. However, it does not consider location information, so it has to learn the monotonicity of the alignment just by looking at the content, which is hard. Tacotron2 uses Location Sensitive Attention, which takes the previous attention weights into account. By doing so, it learns the monotonic constraint. But this does not solve all of the problems, and you can still experience failures with long or out-of-domain sentences.

Graves Attention is an alternative that uses content information to decide how far to move along the alignment per iteration. It does this by using a mixture of Gaussian distributions.

Graves Attention takes the context vector of time t-1, passes it through a couple of fully connected layers ([FC > ReLU > FC] in our model) and estimates the step size, variance and distribution weights for time t. The estimated step size is then used to update the means of the Gaussian modes. Intuitively, the mean is the point of interest on the alignment path, the variance is the attention window around this point of interest, and the distribution weight is the importance of each distribution head.

In the formulation I use to compute the alignment in my implementation, $g, b, k$ are intermediate values, $\delta$ is the step size, $\sigma$ is the variance and $w_{k}$ is the distribution weight for the GMM node $k$. (You can also check the code.)

Some other versions are explained here, but so far I have found that my formulation works best for me, without any NaNs in training. I also noticed that with the method claimed to be the best in this paper, one of the distribution nodes overruns the others in the middle of training, and the attention basically starts to run on a single Gaussian head.
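To make the mechanics concrete, here is a minimal plain-Python sketch of a single GMM attention step. It is an illustration under assumptions, not necessarily the exact formulation in the Mozilla TTS code: it assumes a softmax for the distribution weights, a softplus for both the step size and the variance, and an unnormalized Gaussian window. The names `gmm_attention_step` and `EPS` and the toy inputs are hypothetical.

```python
import math

EPS = 1e-5  # assumed small constant that keeps the variance strictly positive


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(g, b, k, mu_prev, num_inputs):
    """One attention step over `num_inputs` encoder positions.

    g, b, k are the intermediate values for each of the K Gaussian heads
    (in a real model they would come from the FC > ReLU > FC network).
    """
    w = softmax(g)                              # distribution weights w_k
    sigma = [softplus(x) + EPS for x in b]      # variances (assumed softplus)
    delta = [softplus(x) for x in k]            # step sizes, never negative
    mu = [m + d for m, d in zip(mu_prev, delta)]  # means only move forward
    alpha = []
    for j in range(num_inputs):
        # Weighted sum of Gaussian windows centred on each head's mean.
        a = sum(w_k * math.exp(-((j - mu_k) ** 2) / (2.0 * s_k))
                for w_k, s_k, mu_k in zip(w, sigma, mu))
        alpha.append(a)
    return alpha, mu


# Toy run: 2 heads, 10 encoder positions, constant intermediate values.
mu = [0.0, 0.0]
mu_history = []
for _ in range(3):
    alpha, mu = gmm_attention_step(g=[0.0, 0.0], b=[0.0, 0.0], k=[1.0, 1.0],
                                   mu_prev=mu, num_inputs=10)
    mu_history.append(mu[0])
peak = alpha.index(max(alpha))
print("means over steps:", mu_history, "attention peak at position", peak)
```

Because the step size is never negative in this sketch, the means can only move forward, which is where the monotonic behaviour of the alignment comes from.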

The benefit of using GMM is more robust attention. It is also computationally lightweight compared to both bidirectional decoding and normal location attention. Therefore, you can increase your batch size and possibly converge faster.

The downside is that, although my experiments are not complete, GMM attention has so far provided slightly worse prosody and naturalness compared to the other methods.

## Comparison

Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on the LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are 50 sentences in total.

Out of these 50 sentences, Bidirectional Decoding has 1 failure (2%), Graves Attention has 6 (12%), Location Sensitive Attention has 18 (36%), and Location Sensitive Attention with inference-time windowing has 11 (22%).

In terms of prosodic quality, in my opinion, Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say that the quality difference is hardly observable on the LJSpeech dataset. I also need to point out that it is a hard dataset.

If you would like to try these methods, they are all implemented in Mozilla TTS, so give it a try.

# Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative to traditional solutions. It only requires text and corresponding voice clips to train the model, and it avoids the toil of fine-grained annotation of the data. However, Tacotron can also be very time-consuming to train, especially if you don't know the right hyperparameters to begin with. Here, I would like to share a gradual training scheme that eases the training difficulty. In my experiments, it provides faster training, tolerance for hyperparameters and more time with your family.

In summary, Tacotron is an Encoder-Decoder architecture with Attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames, which are ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but has fewer model parameters than Tacotron2. Tacotron2 is much simpler but more than 3x larger (~7M vs ~24M parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to begin experimenting with Tacotron2.

Here is the trick. Tacotron has a parameter called 'r', which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter for reducing the amount of computation, since the larger 'r' is, the fewer decoder iterations are needed. But setting the value too high might reduce performance as well. Another benefit of a higher 'r' value is that the alignment module stabilizes much faster. If you talk to someone who has used Tacotron, they probably know what a struggle the attention can be. So finding the right trade-off for 'r' is a big deal. In the original Tacotron paper, the authors used r=2 for the best-reported model. They also emphasize the challenge of training the model with r=1.
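To put numbers on the computational side of this trade-off, here is a small sketch that counts decoder iterations for a given 'r'. The 800-frame utterance length and the function name `decoder_iterations` are hypothetical, chosen only for illustration.

```python
import math


def decoder_iterations(num_frames, r):
    """Decoder steps needed when each step predicts r spectrogram frames."""
    # The last step may be only partially filled, hence the ceiling.
    return math.ceil(num_frames / r)


num_frames = 800  # hypothetical utterance length in spectrogram frames
iterations = {r: decoder_iterations(num_frames, r) for r in (7, 5, 3, 2, 1)}
for r, steps in iterations.items():
    print(f"r={r}: {steps} decoder iterations")
```

Going from r=7 to r=1 makes the decoder run roughly seven times as many steps for the same utterance, which is why a low 'r' is expensive to train.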

Gradual training comes to the rescue at this point. It means that we initially set 'r' to a large value, such as 7. Then, as training continues, we reduce it until convergence. This simple trick helps, quite magically, to solve two main problems. First, it helps the network learn monotonic attention almost right after the first epoch. Second, it speeds up convergence considerably. As a result, the final model has more stable and resilient attention without any degradation of performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.
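One way to apply such a schedule in a training loop is sketched below in plain Python. The function name `schedule_for_step` is hypothetical and the stage format `[start_step, r, batch_size]` is assumed, so treat it as an illustration rather than the exact code in the repo.

```python
# Stage values taken from the schedule used for the results in this post.
# Each stage is [start_step, r, batch_size].
GRADUAL_TRAINING = [
    [0, 7, 32],
    [10000, 5, 32],
    [50000, 3, 32],
    [130000, 2, 16],
    [290000, 1, 8],
]


def schedule_for_step(global_step, schedule):
    """Return (r, batch_size) of the latest stage that has already started."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, stage_r, stage_batch_size in schedule:
        if global_step >= start_step:
            r, batch_size = stage_r, stage_batch_size
    return r, batch_size


# Query the schedule at a few training steps.
for step in (0, 9999, 10000, 200000, 300000):
    r, batch_size = schedule_for_step(step, GRADUAL_TRAINING)
    print(f"step {step}: r={r}, batch_size={batch_size}")
```

In a real training loop, you would call something like this at the start of each epoch and, whenever 'r' changes, update the decoder and rebuild the data loader with the new batch size.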

Here, I would like to share some results to show the effectiveness. I used the LJSpeech dataset for all the results. The training schedule can be summarized as follows. (As you can see, I also change the batch_size, but this is not necessary if you have enough GPU memory.)

"gradual_training": [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Next, let's check the model training curve and convergence.

You can listen to voice examples generated with the final model using the Griffin-Lim vocoder. I'd say the quality of these examples is quite good to my ear.

It was a short post, but if you would like to replicate the results here, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. I hope the imperfect documentation in the repo helps you. Otherwise, you can always ask for help by creating an issue or on the Mozilla TTS Discourse page. There are some other cool things in the repo that I will also write about in the future. Until next time!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This might become a more formal piece of research in the future.