# Two Attention Methods for Better Alignment with Tacotron

In this post, I would like to introduce two methods that, in my experience, work well for improving attention alignment in Tacotron models. If you would like to try them yourself, you can visit Mozilla TTS. The first method is the Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.

## Bidirectional Decoder

Bidirectional decoding uses an extra decoder that takes the encoder outputs in reverse order, together with an extra loss function that compares the output states of the forward decoder with those of the backward one. With this additional loss, the forward decoder learns what to expect in the upcoming iterations. In this sense, the backward decoder penalizes bad decisions of the forward decoder and vice versa.

Intuitively, if the forward decoder fails to align the attention, it incurs a large loss, so it ultimately learns to move monotonically through the alignment, with a correction induced by the backward decoder. Therefore, this method is able to prevent "catastrophic failure", where the attention falls apart in the middle of a sentence and never aligns again.
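The extra loss can be sketched in a few lines of plain Python. This is only an illustration: it assumes an L1 distance between the forward decoder outputs and the time-reversed backward decoder outputs, and the function name `agreement_loss` and the toy frames are hypothetical. The loss used in the paper or in Mozilla TTS may differ.

```python
def agreement_loss(forward_frames, backward_frames):
    """Mean absolute difference between the forward outputs and the
    time-reversed backward outputs (an assumed L1 agreement term)."""
    if len(forward_frames) != len(backward_frames):
        raise ValueError("both decoders must produce the same number of frames")
    # The backward decoder runs from the last frame to the first,
    # so flip its output to line it up with the forward decoder.
    aligned_backward = backward_frames[::-1]
    total, count = 0.0, 0
    for f_frame, b_frame in zip(forward_frames, aligned_backward):
        for f_val, b_val in zip(f_frame, b_frame):
            total += abs(f_val - b_val)
            count += 1
    return total / count


# Toy example: 3 frames with 2 values each (made-up numbers).
forward = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
backward_good = [[4.0, 5.0], [2.0, 3.0], [0.0, 1.0]]  # same content, reversed
backward_bad = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

print(agreement_loss(forward, backward_good))  # 0.0
print(agreement_loss(forward, backward_bad))   # 2.5
```

When the two decoders agree, the term vanishes. When one of them drifts, the term grows and pushes both back towards a consistent alignment.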

At inference time, the paper suggests using only the forward decoder and discarding the backward decoder. However, it is possible to think of more elaborate ways to combine these two models.

There are two main pitfalls of this method. First, due to the additional parameters of the backward decoder, the model is slower to train (almost 2x), and this makes a huge difference especially when the reduction rate (the number of frames the model generates per iteration) is low. Second, if the backward decoder penalizes the forward one too harshly, overall prosody degrades. For this reason, the paper suggests activating the additional loss only for fine-tuning.
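To see why a low reduction rate hurts, here is a small back-of-the-envelope sketch. The utterance length of 800 frames and the helper name `decoder_steps` are made up for illustration.

```python
import math


def decoder_steps(num_frames, reduction_rate):
    """Number of decoder iterations needed to emit num_frames frames
    when each iteration generates reduction_rate frames."""
    return math.ceil(num_frames / reduction_rate)


num_frames = 800  # hypothetical utterance length in spectrogram frames
for r in (1, 2, 5):
    print(f"r={r}: {decoder_steps(num_frames, r)} iterations per decoder")
```

With a second decoder every one of these iterations is paid for twice, so the smaller the reduction rate, the larger the absolute cost of the backward pass.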

My experience is that bidirectional training is quite robust against alignment problems, and it is especially useful if your dataset is hard. It also aligns almost right after the first epoch. It does sometimes cause pronunciation problems at inference time, but I solved this by doing the opposite of the paper's suggestion: I fine-tuned the network without the additional loss for just one epoch and everything started to work well.

## Graves Attention

Tacotron uses Bahdanau Attention, which is a content-based attention method. However, it does not consider location information, so it has to learn the monotonicity of the alignment just by looking at the content, which is hard. Tacotron2 uses Location Sensitive Attention, which takes the previous attention weights into account. By doing so, it learns the monotonic constraint. But it does not solve all of the problems, and you can still experience failures with long or out-of-domain sentences.

Graves Attention is an alternative that uses content information to decide how far it needs to move along the alignment at each iteration. It does this by using a mixture of Gaussian distributions.

Graves Attention takes the context vector of time t-1, passes it through a couple of fully connected layers ([FC > ReLU > FC] in our model), and estimates the step size, variance and distribution weights for time t. The estimated step size is then used to update the means of the Gaussian modes. By analogy, the mean is the point of interest on the alignment path, the variance is the attention window around this point of interest, and the distribution weight is the importance of each distribution head.

The formulation above is my attempt to show how I compute the alignment in my implementation. $g, b, k$ are intermediate values. $\delta$ is the step size, $\sigma$ is the variance, and $w_{k}$ is the distribution weight for GMM node $k$. (You can also check the code.)

Some other versions are explained here, but so far I have found that the above formulation works best for me, without any NaNs in training. I also noticed that with the method claimed to be the best in this paper, one of the distribution nodes takes over the others in the middle of training, and attention basically starts to run on a single Gaussian head.
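To make the moving parts concrete, here is a minimal sketch of one GMM attention step in plain Python. It is an illustration under assumptions rather than the exact formulation used in Mozilla TTS: it assumes softplus for the step size and the variance, softmax for the distribution weights, and an unnormalized Gaussian window. The name `gmm_attention_step` and all numbers are hypothetical.

```python
import math


def softplus(x):
    """Smooth positive mapping, log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def softmax(values):
    """Normalize values to positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(g, b, k, prev_means, num_inputs):
    """One attention step over num_inputs encoder positions.

    g, b, k are the raw network outputs, one value per Gaussian head.
    Returns the unnormalized attention weights and the updated means.
    """
    weights = softmax(g)                  # w_k: importance of each head
    variances = [softplus(x) for x in b]  # sigma: width of the window
    steps = [softplus(x) for x in k]      # delta: always positive
    # A positive step can only move the mean forward, which keeps the
    # alignment monotonic by construction.
    means = [m + d for m, d in zip(prev_means, steps)]
    alignment = []
    for j in range(num_inputs):
        score = sum(
            w * math.exp(-((j - mu) ** 2) / (2.0 * var))
            for w, mu, var in zip(weights, means, variances)
        )
        alignment.append(score)
    return alignment, means


# Toy run with 2 heads over 6 encoder positions; all numbers are made up.
means = [0.0, 0.0]
for _ in range(3):
    alignment, means = gmm_attention_step(
        g=[0.5, -0.5], b=[0.0, 0.0], k=[1.0, 0.0],
        prev_means=means, num_inputs=6,
    )
    print([round(m, 2) for m in means], alignment.index(max(alignment)))
```

The point to notice is that monotonicity is not learned here. It comes for free from the positive step size, which is why this family of attention is hard to derail on long sentences.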

The benefit of using GMM is more robust attention. It is also computationally lightweight compared to both bidirectional decoding and regular location attention. Therefore, you can increase your batch size and possibly converge faster.

The downside is that, although my experiments are not complete, GMM attention has so far provided slightly worse prosody and naturalness compared to the other methods.

## Comparison

Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on the LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are 50 sentences in total.

Out of these 50 sentences, Bidirectional Decoding has 1 failure, Graves Attention has 6, Location Sensitive Attention has 18, and Location Sensitive Attention with inference-time windowing has 11.
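As failure rates, the counts above look like this. The snippet is only a convenience for reading the numbers; the helper name `failure_rate` is made up.

```python
failures = {
    "Bidirectional Decoding": 1,
    "Graves Attention": 6,
    "Location Sensitive Attention": 18,
    "Location Sensitive Attention with windowing": 11,
}
num_sentences = 50


def failure_rate(count, total):
    """Failure rate in percent."""
    return 100.0 * count / total


# Print the methods from most to least robust.
for name, count in sorted(failures.items(), key=lambda item: item[1]):
    rate = failure_rate(count, num_sentences)
    print(f"{name}: {count}/{num_sentences} = {rate:.0f}%")
```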

In terms of prosodic quality, my ranking is Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say that the quality difference is hardly noticeable on the LJSpeech dataset. I also need to point out that it is a hard dataset.

If you would like to try these methods, they are all implemented in Mozilla TTS, so give it a try.

# Pull a repository with all its submodules

When you use a git repository with submodules, you need to pull all of them at once to keep them consistent. The following call does it for you.

git submodule foreach git pull origin master

# Updating your local fork with a commit from the main project

This is from my Stack Overflow question. Thanks to "bitoiu". Here is the original thread.

## How to pick up a single commit from a remote repo

Assuming you have a local clone of the repo you forked, typing the following should list a single remote, origin:

> git remote show
origin


Unless you've added the original repo's location, you won't have access to the commit you want to pick into your local one. So we need to add it. Let's assume this repo is https://github.com/GitbookIO/gitbook.git. Notice this is an HTTPS clone URL, because you won't have write access to this repo. Let's name it original_repo:

> git remote add original_repo https://github.com/GitbookIO/gitbook.git


And now let's get all the refs back:

> git fetch original_repo


At this point you have all you need locally. You just need to merge the commit into one of your branches; let's assume your local master.

Find the commit you want to merge. This means finding it in one of the branches the team used. It could already be merged to master, or you could be picking it up from the branch that was used for the pull request. Either way, if you don't know the reference, run git log a few times to find the commit you want. When you have it, simply go to the branch where you want the commit and run:

> git cherry-pick COMMIT_ID


This will bring the commit into whatever branch you are on at the moment.

## How to merge a branch from a remote repo

The only difference in these steps is that instead of doing a cherry-pick you will be doing a merge. So, if the contents of the pull request are in a branch named so-pr, you would simply do:

> git merge original_repo/so-pr


And that would merge the contents of so-pr into your working branch.

# How to keep your forked project updated with the main project?


    # Add the remote, call it "upstream":
    git remote add upstream https://github.com/whoever/whatever.git

    # Fetch all the branches of that remote into remote-tracking branches,
    # such as upstream/master:
    git fetch upstream

    # Make sure that you're on your master branch:
    git checkout master

    # Rewrite your master branch so that any commits of yours that
    # aren't already in upstream/master are replayed on top of that
    # other branch:
    git rebase upstream/master

If you don't want to rewrite the history of your master branch (for example, because other people may have cloned it), then you should replace the last command with:

    git merge upstream/master

However, for making further pull requests that are as clean as possible, it's probably better to rebase.



# Kohonen Learning Procedure K-Means vs Lloyd's K-means

K-means may be the most common data quantization method, used widely across many different problem domains. Even though it relies on a very simple idea, it gives satisfying results at a low computational cost.

Underneath the K-means optimization, the objective is to minimize the distance between each data point and its closest centroid (cluster center). We can write the objective as:

$\arg\min \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2$

$\mu_i$ is the closest centroid to instance $x_j$.
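As a quick illustration, the objective can be evaluated directly for a given set of centroids. The sketch below is plain Python with made-up 2D points; the helper names `squared_distance` and `kmeans_objective` are hypothetical, and it only scores a set of centroids rather than optimizing them.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_objective(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(
        min(squared_distance(p, c) for c in centroids)
        for p in points
    )


# Two well-separated groups of points (made-up data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centroids = [(0.0, 1.0), (10.0, 1.0)]  # one centroid per group
poor_centroids = [(5.0, 0.0), (5.0, 2.0)]   # both stuck in the middle

print(kmeans_objective(points, good_centroids))  # 4.0
print(kmeans_objective(points, poor_centroids))  # 100.0
```

Any K-means variant is trying to drive this number down. The variants differ in how they update the centroids to get there.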