Tag Archives: research

Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative over traditional solutions. It only requires text and corresponding voice clips to train the model. It avoids the toil of fine-grained annotation of the data. However, Tacotron might also be very time demanding to train, especially if you don't know the right hyperparameters, to begin with. Here, I like to share a gradual training scheme to ease the training difficulty. In my experiments, it provides faster training, tolerance for hyperparameters and more time with your family.

In summary, Tacotron is an Encoder-Decoder architecture with Attention. it takes a sentence as a sequence of characters (or phonemes) and it outputs sequence of spectrogram frames to be ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but it has fewer model parameters as opposed to Tacotron2. Tacotron2 is much simpler but it is ~4x larger (~7m vs ~24m parameters). To be clear, so far, I mostly use gradual training method with Tacotron and about to begin to experiment with Tacotron2 soon.

Tacotron architecture (Thx @yweweler for the figure)

Here is the trick. Tacotron has a parameter called 'r' which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter to reduce the number of computations since the larger 'r', the fewer the decoder iterations. But setting the value to high might reduce the performance as well. Another benefit of higher r value is that the alignment module stabilizes much faster. If you talk someone who used Tacotron, he'd probably know what struggle the attention means. So finding the right trade-off for 'r' is a great deal. In the original Tacotron paper, authors used 'r' as 2 for the best-reported model. They also emphasize the challenge of training the model with r=1.

Gradual training comes to the rescue at this point. What it means is that we set 'r' initially large, such as 7. Then, as the training continues, we reduce it until the convergence. This simple trick helps quite magically to solve two main problems. The first, it helps the network to learn the monotonic attention after almost the first epoch. The second, it expedites convergence quite much. As a result, the final model happens to have more stable and resilient attention without any degrigation of performance. You can even eventually let the network to train with r=1 which was not even reported in the original paper.

Here, I like to share some results to prove the effectiveness. I used LJspeech dataset for all the results. The training schedule can be summarized as follows. (You see I also change the batch_size but it is not necessary if you have enough GPU memory.)

"gradual_training": [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Tacotron after 950 steps on LJSpeech. Don't worry about the last part, it is just because the model does not know where to stop initially.

Next, let's check the model training curve and convergence.


(Ignore the plot in the middle.) You see here the model jumping from r=7 to r=5. There is obvious easy gain after the jump.
Test time model results after 300K. r=1 after 290K steps.
Here is the training plot until ~300K iterations.
(For some reason I could not move the first plot to the end)

You can listen to voice examples generated with the final model using GriffinLim vocoder. I'd say the quality of these examples is quite good to my ear.

It was a short post but if you like to replicate the results here, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. Hope, imperfect documentation on the repo would help you. Otherwise, you can always ask for help creating an issue or on Mozilla TTS Discourse page. There are some other cool things in the repo that I also write about in the future. Until next time..!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This work might be a more legit research work in the future.

Share

What I read lately

CATEGORICAL REPARAMETERIZATION WITH GUMBEL SOFTMAX
  • Link: https://arxiv.org/pdf/1611.01144v1.pdf
  • Continuous distribution on the simplex which approximates discrete vectors (one hot vectors) and differentiable by its parameters with reparametrization trick used in VAE.
  • It is used for semi-supervised learning.

 

DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
  • Learning useful unsupervised image representations by using triplet loss on image patches. The triplet is defined by two image patches from the same images as the anchor and the positive instances and a patch from a different image which is the negative.¬† It gives a good boost on CIFAR-10 after using it as a pretraning method.
  • How would you apply to real and large scale classification problem?

 

UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

 

MULTI-RESIDUAL NETWORKS
  • For 110-layers ResNet the most contribution to gradient updates come from the paths with 10-34 layers.
  • ResNet trained with only these effective paths has comparable performance with the full ResNet. It is done by sampling paths with lengths in the effective range for each mini-batch.
  • Instead of going deeper adding more residual connections provides more boost due to the notion of exponential ensemble of shallow networks by the residual connections.
  • Removing a residual block from a ResNet has negligible drop on performance in test time in contrast to VGG and GoogleNet.
Share

Share Research Data Sets via Torrent

I recognized a newbie but very bright idea today. The idea is to share academic data sets and papers via torrent. Especially, if you are working on big scale of data sets like ImageNet , having such a distributed approach is just delighting (albeit it presently does not include ImageNet) because in many cases of downloads, in the course of time your download speed starts to attenuate a very small values, even with additional download peers it gets worse. However, in such a torrent based system, it is on the other way around. If you are familiar to bit-torrent, you well know that as the data is distributed to many machines, you experienced faster download speed owing to the nature of torrent system.

Share

A good document organizing tool Zotero

I just started to be a graduate student who need to be ready for reading lots of and lots of article and papers online. Thus I search a tool that can help me while dealing with all these stuffs and I found that tool ZOTERO. I started to use it for taking notes about papers and keep track of the articles and sources online. It is really great tool and I thing someone like me can find it useful. You might watch that informative video.

Share