
Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used text-to-speech architecture. It is a very flexible alternative to traditional solutions: it only requires text and the corresponding voice clips to train the model, avoiding the toil of fine-grained data annotation. However, Tacotron can also be very time-demanding to train, especially if you don't know the right hyperparameters to begin with. Here I'd like to share a gradual training scheme that eases the training difficulty. In my experiments, it provides faster training, more tolerance to hyperparameters, and more time with your family.

In summary, Tacotron is an encoder-decoder architecture with attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames, which are ultimately converted to speech by an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is the more complicated architecture but has fewer model parameters, whereas Tacotron2 is much simpler but roughly 4x larger (~7m vs ~24m parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to start experimenting with Tacotron2 soon.

Tacotron architecture (Thx @yweweler for the figure)

Here is the trick. Tacotron has a parameter called 'r' which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter to reduce the amount of computation, since the larger 'r' is, the fewer decoder iterations are needed. But setting the value too high might hurt performance as well. Another benefit of a higher r value is that the alignment module stabilizes much faster. If you talk to someone who has used Tacotron, they probably know what a struggle the attention can be. So finding the right trade-off for 'r' matters a great deal. In the original Tacotron paper, the authors used r=2 for their best-reported model, and they also emphasize the challenge of training the model with r=1.
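To make the effect of 'r' concrete, here is a toy calculation of decoder iterations per utterance (the spectrogram length below is made up purely for illustration):

import math

n_frames = 700  # hypothetical length of the target spectrogram
for r in (7, 5, 3, 2, 1):
    # each decoder iteration predicts r frames, so larger r means fewer iterations
    print(f"r={r}: {math.ceil(n_frames / r)} decoder iterations")
# r=7: 100, r=5: 140, r=3: 234, r=2: 350, r=1: 700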

Gradual training comes to the rescue at this point. The idea is to set 'r' to a large value initially, such as 7, and then reduce it as training continues, until convergence. This simple trick helps quite magically to solve two main problems. First, it helps the network learn monotonic attention after almost the first epoch. Second, it expedites convergence considerably. As a result, the final model ends up with more stable and resilient attention without any degradation of performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.

Here I'd like to share some results to show the effectiveness. I used the LJSpeech dataset for all the results. The training schedule can be summarized as follows. (You can see I also change the batch_size, but that is not necessary if you have enough GPU memory.)

"gradual_training": [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Tacotron after 950 steps on LJSpeech. Don't worry about the last part; it is just because the model does not know where to stop initially.

Next, let's check the model training curve and convergence.


Training curves until ~300K iterations. You can see the model jumping from r=7 to r=5; there is an obvious, easy gain right after the jump.
Test-time model results after 300K steps (r=1 after 290K steps).

You can listen to voice examples generated with the final model using the Griffin-Lim vocoder. I'd say the quality of these examples is quite good to my ear.
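For reference, Griffin-Lim reconstruction can be run with off-the-shelf tooling. Below is a minimal librosa sketch on a toy signal; the STFT settings here are arbitrary and do not correspond to the Mozilla TTS config.

import numpy as np
import librosa
import soundfile as sf

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # toy 220 Hz tone standing in for model output
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram

# Griffin-Lim iteratively estimates the phase from the magnitude spectrogram.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
sf.write("reconstructed.wav", y_hat, sr)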

This was a short post, but if you'd like to replicate the results here, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. I hope the (admittedly imperfect) documentation on the repo helps you. Otherwise, you can always ask for help by creating an issue or posting on the Mozilla TTS Discourse page. There are some other cool things in the repo that I'll also write about in the future. Until next time..!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This work might turn into a more formal piece of research in the future.


Text to Speech Deep Learning Architectures

Small Intro. and Background

Recently, I started at Mozilla Research. I am really excited to be part of a small but great team working hard to solve important ML problems. And everything is open-sourced. We license things to keep them open-source. An oxymoron at first sight, isn't it? But I like it!

Before I joined, our team had already released the best-known open-source STT (Speech to Text) implementation, based on TensorFlow. The next step is to improve the current Baidu Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. After these two projects, anyone around the world will be able to create their own Alexa without any commercial attachment. That is the real way to democratize AI, or at least I believe it is.

Up until now, I had worked on a variety of data types and ML problems, except audio. Now it is time to learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I'd like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to our GitHub repository hosting the PyTorch implementation of the first version. (We switched to PyTorch for obvious reasons.) It is a work in progress, and please feel free to comment and contribute.

Below I'd like to share my pinpoint summaries of the well-known TTS papers. They are by no means complete, but they are useful for highlighting the important aspects of these papers. Let's start.

Glossary

  • Prosody: https://en.wikipedia.org/wiki/Prosody_(linguistics)
  • Phonemes: units of sound we produce as we speak. Necessary since words with very similar spelling might be pronounced very differently (e.g. "Rough" vs. "Though").
  • Vocoder: the part of the system that decodes audio signals from intermediate features. WaveNet is used in Deep Voice at that stage.
  • Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
  • Autoregressive Model: a model whose output depends linearly on its own previous outputs and on a parameter set that can be approximated.
  • Query, Key, Value: the key is used by the attention module to compute attention weights. The value is the vector weighted by the attention weights to compute the module output. The query vector is the hidden state of the decoder.
  • Grapheme: Cool way to say character.
  • Error Modes: sub-optimal states that the attention block gets stuck in and cannot escape from.
  • Monotonic Attention: use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a close relation between the output at time t and the input at time t. However, it is not as reasonable for translation, since word order might not be the same. https://arxiv.org/pdf/1704.00784.pdf
  • MOS: Mean Opinion Score. The evaluation is crowd-sourced to native speakers. It is not easy to judge, especially for a layman.
  • Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
  • Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
  • Teacher Forcing: providing the model's expected (ground-truth) output at time t as its input at time t+1. It is controlled ground-truth feedback, as a teacher gives to a student.
  • Causal convolution: a convolution that does not see future units relative to the reference time step we want to predict. In practice, it is implemented by padding normal convolution layers asymmetrically so that each output only depends on current and past inputs.
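Since causal convolutions also come up in WaveNet below, here is a minimal PyTorch sketch of the idea (a simple left-padding implementation, written for illustration only):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # 1-D convolution whose output at time t only sees inputs up to time t.
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))  # pad only the past side
        return self.conv(x)

x = torch.randn(1, 16, 100)
y = CausalConv1d(16, 32, kernel_size=3)(x)
print(y.shape)  # torch.Size([1, 32, 100]) -- same time length, no future leakage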

Deep Voice (25 Feb 2017)

  • Text to phonemes. Deterministically computed with a dictionary, or with a Seq2Seq model to deal with unseen words.
  • The same phoneme might have different durations in different words, so we need to predict the duration. It is sequence-dependent.
  • Fundamental frequency gives the pitch of each phoneme. It is also sequence-dependent.
  • Frequency + Phonemes + Duration = voice synthesis, done via Google's WaveNet. (A toy sketch of this pipeline follows the list below.)
  • Models
    • Segmentation Model
      • Segment audio signal to phonemes.
      • CTC loss
      • Predicts phoneme pairs, so that the CTC probability mass lands on the boundaries between phonemes
      • Inputs:
        • Audio clip of “It was early spring”
        • Phonemes (label)
          • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
      • Outputs:
        • Pairs of Phonemes with their start time
          • [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
    • Fundamental Freq & Duration Models
      • Segmentation model predictions are the labels for these models.
      • Inputs:
        • Phonemes
      • Outputs:
        • Duration, probability, and F0 for each phoneme; e.g. [H, 0.1, 25 Hz], ...
    • Audio Synthesizer Model
      • Simplified WaveNet
      • Inputs:
        • Duration and F0 for phonemes + audio signals (labels)
      • Outputs:
        • Synthesized audio signal
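
To make the data flow above more concrete, here is a toy, hypothetical sketch of how the Deep Voice components chain at inference time. All function names and values are illustrative stand-ins, not the original implementation.

# Toy lexicon; a real system uses a full pronunciation dictionary
# plus a seq2seq grapheme-to-phoneme model for unseen words.
PHONEME_DICT = {"it": ["IH1", "T"], "was": ["W", "AA1", "Z"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_DICT.get(word, ["<UNK>"]))
        phonemes.append(".")  # silence marker between words
    return phonemes

def predict_duration_and_f0(phonemes):
    # Stand-in for the duration/F0 models: one (duration, F0) pair per phoneme.
    return [(0.1, 25.0) for _ in phonemes]

def synthesize(phonemes, prosody):
    # Stand-in for the simplified WaveNet conditioned on phonemes, durations and F0.
    return [0.0] * int(sum(d for d, _ in prosody) * 16000)  # fake audio samples

def tts(text):
    phonemes = text_to_phonemes(text)
    prosody = predict_duration_and_f0(phonemes)
    return synthesize(phonemes, prosody)

audio = tts("it was")
print(len(audio))  # number of synthesized samples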

