
Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative to traditional solutions: it only requires text and the corresponding voice clips to train the model, avoiding the toil of fine-grained data annotation. However, Tacotron can also be very time-demanding to train, especially if you don't know the right hyperparameters to begin with. Here, I'd like to share a gradual training scheme to ease the training difficulty. In my experiments, it provides faster training, more tolerance to hyperparameters, and more time with your family.

In summary, Tacotron is an encoder-decoder architecture with attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames that are ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is the more complicated architecture but it has fewer model parameters than Tacotron2. Tacotron2 is much simpler but it is ~4x larger (~7m vs ~24m parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to start experimenting with Tacotron2 soon.

Tacotron architecture (thanks to @yweweler for the figure)
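As a rough sketch of the data flow described above (illustrative pseudocode only, not the actual Mozilla TTS API; every component name here is a placeholder):

def tacotron_tts(text, text_to_ids, encoder, attention_decoder, vocoder):
    # sentence -> sequence of character (or phoneme) ids
    ids = text_to_ids(text)
    # the encoder turns the id sequence into hidden states
    encoder_states = encoder(ids)
    # the attention decoder emits spectrogram frames conditioned on the encoder states
    spectrogram = attention_decoder(encoder_states)
    # the vocoder (e.g. Griffin-Lim or WaveRNN) converts the spectrogram to a waveform
    return vocoder(spectrogram)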

Here is the trick. Tacotron has a parameter called 'r' which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter for reducing the number of computations, since the larger 'r' is, the fewer decoder iterations are needed. But setting the value too high might also reduce performance. Another benefit of a higher 'r' value is that the alignment module stabilizes much faster. If you talk to anyone who has used Tacotron, they probably know what a struggle attention can be. So finding the right trade-off for 'r' is a big deal. In the original Tacotron paper, the authors used r=2 for the best-reported model. They also emphasize the challenge of training the model with r=1.
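To make the effect of 'r' concrete, here is a tiny illustration (plain Python, not Tacotron code): a spectrogram of T frames needs roughly ceil(T / r) decoder iterations.

import math

def decoder_iterations(num_frames, r):
    # number of decoder iterations needed to emit `num_frames` spectrogram frames
    # when `r` frames are predicted per iteration
    return math.ceil(num_frames / r)

print(decoder_iterations(800, 7))  # 115 iterations
print(decoder_iterations(800, 2))  # 400 iterations
print(decoder_iterations(800, 1))  # 800 iterations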

Gradual training comes to the rescue at this point. The idea is that we set 'r' to a large value initially, such as 7, and then reduce it as training continues, until convergence. This simple trick helps, almost magically, with two main problems. First, it helps the network learn monotonic attention within roughly the first epoch. Second, it speeds up convergence considerably. As a result, the final model ends up with more stable and resilient attention without any degradation in performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.

Here I'd like to share some results to demonstrate the effectiveness. I used the LJSpeech dataset for all the results. The training schedule can be summarized as follows. (You will see that I also change the batch size, but that is not necessary if you have enough GPU memory.)

"gradual_training": [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Tacotron after 950 steps on LJSpeech. Don't worry about the last part; it is just because the model does not yet know where to stop.

Next, let's check the model training curve and convergence.


Training curve: you can see the model jumping from r=7 to r=5; there is an obvious easy gain right after the jump.
Test-time model results after 300K steps (r=1 after 290K steps).
The training plot until ~300K iterations.

You can listen to voice examples generated with the final model using the Griffin-Lim vocoder. I'd say the quality of these examples is quite good, to my ear.

This was a short post, but if you'd like to replicate the results here, you can visit our repo Mozilla TTS and just run the training with the provided config.json file. I hope the (admittedly imperfect) documentation on the repo helps you. Otherwise, you can always ask for help by creating an issue or on the Mozilla TTS Discourse page. There are some other cool things in the repo that I will also write about in the future. Until next time..!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This work might turn into a more formal piece of research in the future.


Using WSL Linux on Windows 10 for Deep Learning Development

To explain briefly, WSL enables you to run Linux on Windows 10, so you can use your favorite Linux tools (bash, zsh, vim) for your development cycle and enjoy Windows 10 for the rest. It obviates the need for a dual-boot configuration, which can sometimes be a nightmare.

Why do I do this? Basically, if you have an Optimus laptop, it is an onerous job to set up a Linux distro. You need to find the right NVIDIA driver to enable the GPU. Then you need to install nvidia-prime or, if you are lucky, get bumblebee to work. Let's say you've done everything. After some time, you update something on your system by mistake, and the next thing you see is a black screen on the next reboot. Then it is time to search on your phone for what went wrong and try to fix it. It is horrendous!

As far as my experience goes, WSL Linux gives you all the necessary features for development, with one vital exception: access to the GPU. You can apt-get software and run it. You can even run software with a UI if you set things up right. However, due to the GPU limitation, you can compile CUDA code but you cannot run it on Linux. Here I'd like to explain how you can deal with this limitation with a small trick, using WSL Linux's ability to run Windows binaries.

The first thing to do is to install your preferred Linux distro from the Windows Store. Just go to the store, search for the distro, and install it. If the installation is not available, you might need to update Windows.

1. Install Linux and activate WSL

Before launching Linux, follow the documentation here to activate WSL on Windows 10.

After you have installed the distro and activated WSL, you can either open the command line and type `bash`, or directly use the Linux launcher to get into the Linux terminal.

2. Install the Hyper terminal for a Linux-like experience

One problem I've experienced with the Windows command line is the difference in shortcuts (copy-paste) and the inability to open multiple tabs. These are quite important features for someone used to Linux. I solved this by switching to the Hyper terminal. It gives the closest possible Linux-like experience.

3. Install Conda on Windows and add its binaries to the `PATH`

Now you have Linux and a cool terminal. It is time to install the rest. Note that if you don't care about using the GPU, you can install everything you like on the Linux side right away and use it there. For this example, we install Miniconda on Windows and use its python.exe from Linux to run our code on the GPU.

Another cool thing about Linux on WSL is that it lets you run Windows binaries from the Linux environment. Windows' PATH environment variable is also exposed to Linux. As we install Python with Miniconda, it asks whether to add python.exe to the PATH variable. Just do it. Then we set an alias on the Linux side that runs python.exe when we type python, so that we can develop things on Linux but run the code on Windows, using the GPU.

Now install Miniconda. Click Next until you see the screen below and tick all the options.

After the installation, if you run python on the command line, you should see a Python session running as shown below.

You should also be able to run Python from Linux. Open the terminal, switch to Linux, type python.exe, and you should get it working.

4. Create aliases on Linux

The last step is to create aliases in the Linux bash that run the Windows executables when you type the usual commands (python.exe for python, and so on). These are the aliases I set.


alias python="python.exe"
alias conda="conda.exe"
alias ipython="ipython.exe"
alias nosetests="nosetests.exe"
alias pip="pip.exe"
alias nvidia-smi="/mnt/c/Program\ Files/NVIDIA\ Corporation/NVSMI/nvidia-smi.exe"

After all this, you should be able to run your code on the GPU. One important note: since we use Python on Windows, you need to set folder paths the Windows way. Don't forget the escape character when separating folders.

Right now, I have created a folder /users/erogol/projects and I keep my development work in it. It is different from the home folder set for your Linux installation, but that does not matter since we use Windows file paths. Now you can install your favorite editor and enjoy training new models.
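For instance, any path inside a script run through the python alias should be written the Windows way; the folder below is only an illustration.

# Illustration only: with the Windows python.exe, paths in your code should be
# Windows paths; escape the backslashes (or use a raw string).
data_path = "C:\\users\\erogol\\projects\\data"
# equivalently: data_path = r"C:\users\erogol\projects\data"
print(data_path)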

Please let me know if I skipped something here. It is very likely, since I wrote this after I had already set everything up.

It is good to see that Microsoft has changed direction and started to embrace Linux in their ecosystem by listening to the needs of their users. It was a meaningless fight from the start.

 

Edit: You need to run the terminal with 'Run as Administrator' to install things with conda on Windows.


Installing OpenCV 3.2 to Anaconda Environment with ffmpeg Support

Sometimes it is a real mess trying to install OpenCV on your system. Nevertheless, it is a really great library for any kind of vision work, and you are more or less obliged to use it. (No complaints, just C++.)

I have tried to list my commands here in sequence and I hope it works for you too.

Install dependencies


apt install gcc g++ git libjpeg-dev libpng-dev libtiff5-dev libjasper-dev libavcodec-dev libavformat-dev libswscale-dev pkg-config cmake libgtk2.0-dev libeigen3-dev libtheora-dev libvorbis-dev libxvidcore-dev libx264-dev sphinx-common libtbb-dev yasm libfaac-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenexr-dev libgstreamer-plugins-base1.0-dev libavcodec-dev libavutil-dev libavfilter-dev libavformat-dev libavresample-dev

conda install libgcc

Download OpenCV


# First, go to the folder where you want to host the installation
wget https://github.com/Itseez/opencv/archive/3.2.0.zip

unzip 3.2.0.zip
cd opencv-3.2.0

mkdir build
cd build

Run CMake and set up OpenCV

This cmake command targets Python 3.x and your target virtual environment. Therefore, activate your environment before running it. Do not forget to adjust the flags for your case.


cmake -DWITH_CUDA=OFF \
      -DCMAKE_BUILD_TYPE=RELEASE \
      -DCMAKE_INSTALL_PREFIX=/usr/local \
      -DBUILD_TIFF=ON \
      -DBUILD_opencv_java=OFF \
      -DENABLE_AVX=ON \
      -DWITH_OPENGL=ON \
      -DWITH_OPENCL=ON \
      -DWITH_IPP=ON \
      -DWITH_TBB=ON \
      -DWITH_EIGEN=ON \
      -DWITH_V4L=ON \
      -DWITH_VTK=OFF \
      -DBUILD_TESTS=OFF \
      -DBUILD_PERF_TESTS=OFF \
      -DBUILD_opencv_python2=OFF \
      -DPYTHON3_EXECUTABLE=$(which python3) \
      -DPYTHON3_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
      -DPYTHON3_PACKAGES_PATH=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
      -DINSTALL_PYTHON_EXAMPLES=ON \
      -DINSTALL_C_EXAMPLES=OFF \
      -DBUILD_EXAMPLES=ON ..

make -j 4

sudo make install

Then check your installation on Python


import cv2

print(cv2.__version__)  # should output 3.2.0


Some CNN visualization tools and techniques

Deep Visualization Toolbox

Github: https://github.com/yosinski/deep-visualization-toolbox

Understanding Image Representations by Inverting Them

Paper: https://arxiv.org/pdf/1412.0035v1.pdf

Learning FRAME Models Using CNN filters

Project page:  http://www.stat.ucla.edu/~yang.lu/project/deepFrame/main.html

Convergent Learning: Do different neural networks learn the same representations?

Github: https://github.com/yixuanli/convergent_learning

Torch-visbox

https://github.com/Aysegul/torch-visbox

Plot caffe models online

http://ethereon.github.io/netscope/#/editor

Grad-CAM: Gradient-weighted Class Activation Mapping

https://github.com/ramprs/grad-cam/

Quiver: Interactive Feature Visualization for Keras

https://github.com/jakebian/quiver

CS231 Stanford notes on Visualization

http://cs231n.github.io/understanding-cnn/

 
