# Speech to Text Deep Learning Architectures

### Small Intro. and Background

Recently, I started at Mozilla Research. I am really excited to be a part of a small but a great team working hard to solve important ML problems with a great deal of being open-sourced. It is the first place, I've even been where people warn you to license your code to enable other to use it freely. Oxymoron by the first sight, isn't it.

Before my presence, our team already released the best known open-sourced STT (Speech to Text) implementation based on Tensorflow. The next step is to improve the current Baidu's Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. So after these two projects, anyone around the world will be able to create his own Alexa without any commercial attachment. Which is the real way to democratize AI, at least I believe it is.

Up until now, I worked on variety of data types and ML problems, except audio. Now it is time learn a it. And the first thing to do is a comprehensive literature review (like a boss). Here I like to share top notch DL architectures dealing with TTS (Text to Speech). I also invite you to our Github repository hosting PyTorch implementation of the first version of our implementation. (We switched to PyTorch for obvious reasons). It is a work in progress and please feel free to comment and contribute.

Below I like to share my pinpoint summary of the well-known TTS papers which are by no means complete but useful to highlight important aspects of these papers. Let's start.

### Glossary

• Phonemes : units of sounds, we pronounce as we speak. Necessary since very similar words in letter might be pronounced very differently (e.g. "Rough" "Though")
• Vocoder: part of the system decoding from features to audio signals. Wave is used in Deep Voice at that stage.
• Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
• Autoregressive Model: Specifies a model depending linearly on its own outputs and on a parameter set which can be approximated.
• Query, Key, Value: Key is used by attention module to compute attention weights. Value is the vector stipulated by the attention weights to compute the module output. Query vector is the hidden state of the decoder.
• Grapheme: Cool way to say character.
• Error Modes: Sub-optimal status for the attention block where it is not able to escape.
• Monotonic Attention: Use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a certain relation btw the output at time t and the input at time t. However, it is not that reasonable for translation problem since words orders might no be the same. https://arxiv.org/pdf/1704.00784.pdf
• MOS: Mean Opinion Score. Crows-source the evaluation process with native speakers. It is not easy to measure, especially for a layman.
• Context vector: Output of an attention module which summarizes multiple time-step output of the encoder.
• Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
• Teacher Forcing: Providing model's expected output at time t as a input at time t+1. It is controlled ground-truth feedback as a teacher does to a student.
• Casual convolution: Convolution which does not foresee the future units given the reference time step T which we like to predict next. In practice, it is implemented by setting right padding orientation to to normal convolution layers.

### Deep Voice (25 Feb 2017)

• Text to phonemes. Deterministically computed with a dictionary. Or Seq2Seq model to deal with the unseen words.
• The same phoneme might hold different durations in different words. We need to predict the duration. It is sequence depended.
• Fundamental frequency for the pitch of the each phoneme. It is sequence depended.
• Frequency + Phonemes + Duration = Voice synthesis. It is done via Google's WaveNet.
• Models
• Segmentation Model
• Segment audio signal to phonemes.
• CTC loss
• Predict phoneme pairs due to probability mass
• Inputs:
• Audio clip of “It was early spring”
• Phonemes (label)
• [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
• Outputs:
• Pairs of Phonemes with their start time
• [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
• Fundamental Freq & Duration Models
• Segmentation model predictions are the labels for these models.
• Inputs:
• Phonemes
• Outputs:
• Duration, Probability, F0 for each phoneme; [H, 0.1, 25hz], ...
• Audio Synthesizer Model
• Simplified WaveNet
• Inputs:
• Duration and F0 for phonemes + audio signals (labels)
• Outputs:
• Synthesis audio signal

# Paper review: EraseReLU

ReLU is defined as a way to train an ensemble of exponential number of linear models due to its zeroing effect. Each iteration means a random set of active units hence, combinations of different linear models. They discuss, relying on the given observation, it might be useful to remove non-linearities for some layers and letting them to learn combination of linearities as the whole layer.

Another argument as poised, some representations are hard to approximate by a stack of non-linear layers. as shown by He et al. 2016. To this end, letting linearities for a subset of layers might ameliorate the condition.

The way they apply EraseReLU is removing the last ReLU layer of each "module". "Module" here is defined depending on the model architecture as shown above.

Experiments show that EraseReLU increases the performance of networks and its effect is larger for deeper networks. It is also more resilient to over-fitting for deep networks. The loss curves also show faster convergence for EraseReLU and the difference more obvious for larger datasets.

My 2 cents: Results are not that different on ImageNet but still better to the favor of EraseReLU. Then it might be the case of lucky shoot since there is no confidence interval or variance given for the trainings.

Faster convergence makes sense with the help of second guessing after the paper. Since there are more active units possible it entails to propagate more gradients. However, all such comments assumes that error signals are always positive. Which is very unlikely. Therefore, more open valves might cause more chaotic back-propagation signal.

Yet it is very simple idea, it shows faster convergence, better results and a good investgation of ReLU function. It think it is useful and can take its position in my next training session.

Disclaimer: This is written hastily in 10 mins. If you think something wrong or even worse let me know :).

# Paper Review: Self-Normalizing Neural Networks

One of the main problems of neural networks is to tame layer activations so that one is able to obtain stable gradients to learn faster without any confining factor. Batch Normalization shows us that keeping values with mean 0 and variance 1 seems to work things. However, albeit indisputable effectiveness of BN, it adds more layers and computations to your model that you'd not like to have in the best case.

ELU (Exponential Linear Unit) is a activation function aiming to tame neural networks on the fly by a slight modification of activation function. It keeps the positive values as it is and exponentially skew negative values.

ELU does its job good enough, if you like to evade the cost of Bath Normalization, however its effectiveness does not rely on a theoretical proof beside empirical satisfaction. And finding a good $\alpha$ is just a guess.

Self-Normalizing Neural Networks takes things to next level. In short, it describes a new activation function SELU (Scaled Exponential Linear Units), a new initialization scheme and a new dropout variant as a repercussion,

The main topic here is to keep network activation in a certain basin defined by a mean and a variance values. These can be any values of your choice but for the paper it is mean 0 and variance 1 (similar to notion of Batch Normalization). The question afterward is to modifying ELU function by some scaling factors to keep the activations with that mean and variance on the fly. They find these scaling values by a long theoretical justification. Stating that, scaling factors of ELU are supposed to be defined as such any passing value of ELU should be contracted to define mean and variance.  (This is just verbal definition by no means complete. Please refer to paper to be more into theory side. )

Above, the scaling factors are shown as $\alpha$ and $\lambda$.  After long run of computations these values appears to be 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946 relatively. Nevertheless, do not forget that these scaling factors are targeting specifically mean 0 and variance 1.  Any change preludes to change these values as well.

Initialization is also another important part of the whole method. The aim here is to start with the right values. They suggest to sample weights from a Gaussian distribution with mean 0 and variance $1/n$ where n is number of weights.

It is known with a well credence that Dropout does not play well with Batch Normalization since it smarting network activations in a purely random manner. This method seems even more brittle to dropout effect. As a cure, they propose Alpha Dropout. It randomly sets inputs to saturatied negative value of SELU which is $-\alpha\lambda$. Then an affine transformation is applied to it with $a$ and $b$ values computed relative to dropout rate, targeted mean and variance.It randomizes network without degrading network properties.

In a practical point of view, SELU seems promising by reducing the computation time relative to RELU+BN for normalizing the network. In the paper they does not provide any vision based baseline such a MNIST, CIFAR and they only pounce on Fully-Connected models. I am still curios to see its performance vis-a-vis on these benchmarks agains Bath Normalization. I plan to give it a shoot in near future.

One tickle in my mind after reading the paper is the obsession of mean 0 and variance 1 for not only this paper but also the other normalization techniques. In deed, these values are just relative so why 0 and 1 but not 0 and 4. If you have a answer to this please ping me below.

# Paper Notes: The Shattered Gradients Problem ...

The whole heading of the paper is "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". It is really interesting work with all its findings about gradient dynamics of neural networks. It also examines Batch Normalization (BN) and Residual Networks (Resnet) under this problem.

The problem, dubbed "Shattered Gradients", described as gradient feedbacks resembling random noise for nearby data points. White noise gradients (random value around 0 with some unknown variance) are not useful for training and they stall the network. What we expect to see is Brownian noise (next value is obtained with a small change on the last value) from a working model. Deep neural networks are more prone to white noise gradients. However, latest advances like BN and Resnet are described to be more resilient to random gradients even in deep networks.

White noise gradients undermines the effectiveness of networks because they violates gradient based learning methods which expects similar gradient feedbacks for data points close by in the vector space. Once you have white noise gradient for such close points, the model is not able to capture data manifold through these learning algorithms. Brownian updates yields more correlation on updates and this preludes effective learning.

For normal networks, they give a empirical evidence that the correlation of network updates decreases with the order $(/2^L$ where L is number of layers. Decreasing correlation means more white noise gradient feedbacks.

One important reason of white noise feedbacks is to be co-activations of network units. From a working model, we expect to have units receptive to different structures in the given data. Therefore, for each different instance, different subset of units should be active for effective information flux. They observe that as activation goes through layers, co-activation rate goes higher. BN layers prevents this by keeping the co-activation rate 1/4 (1/4 units are active per layer).

Beside the co-activation rate, how dispersed units activation is another important question. Thus, similar instances need to activate similar subset of units and activation should be distributes to other subsets as we change the data structure. This stage is where the skip-connections get into the play. Their observation is skip-connections improve networks in that respect. This can be observed at below figure.

The effectiveness of skip-connections increases with Beta scaling introduced by InceptionV4 architecture. It  is scaling residual connections by a constant value before summing up with the current layer activation.

#### A small discussion

This is a very intriguing paper to me as being one of the scarse works investigating network dynamics instead of blind updates on architectures for racing accuracy values.

Resnet is known to be train hundreds of layers which was not possible before. Now, with this work, we have another scientific argument explaining its effectiveness. I also like to point Veit et al. (2016) demystifying Resnet as an ensemble of many shallow networks. When we combine both of these papers, it makes total sense to me how Resnets are useful for training very deep networks. If shattered gradient effect, as stated here,  increasing with number of layers with the order 2^L then it is impossible to train hundred layers with an ad-hoc network. Corollary, since Resnet behaves like a ensemble of shallow networks this effects is rehabilitated. We are able to see it empirically in this paper and it is complimentary in that sense.

Note: This hastily written paper note might include any kind of error. Please let me know if you find one. Best 🙂

# Paper Notes: Intriguing Properties of Neural Networks

Paper: https://arxiv.org/abs/1312.6199

This paper studies description of semantic information with higher level units of an network and blind spot of the network models againt adversarial instances. They illustrate the learned semantics inferring maximally activating instances per unit. They also interpret the effect of adversarial examples and their generalization on different network architectures and datasets.

Findings might be summarized as follows;

1. Certain dimensions of the each layer reflects different semantics of data. (This is a well-known fact to this date therefore I skip this to discuss more)
2. Adversarial instances are general to different models and datasets.
3. Adversarial instances are more significant to higher layers of the networks.
4. Auto-Encoders are more resilient to adversarial instances.

#### Adversarial instances are general to different models and datasets.

They posit that advertorials exploiting a particular network architectures are also hard to classify for the others. They illustrate it by creating adversarial instances yielding 100% error-rate on the target network architecture and using these on the another network. It is shown that these adversarial instances are still hard for the other network ( a network with 2% error-rate degraded to 5%). Of course the influence is not that strong compared to the target architecture (which has 100% error-rate).

#### Adversarial instances are more significant to higher layers of networks.

As you go to higher layers of the network, instability induced by adversarial instances increases as they measure by Lipschitz constant. This is justifiable observation with that the higher layers capture more abstract semantics and therefore any perturbation on an input might override the constituted semantic. (For instance a concept of "dog head" might be perturbed to something random).

#### Auto-Encoders are more resilient to adversarial instances.

AE is an unsupervised algorithm and it is different from the other models used in the paper since it learns the implicit distribution of the training data instead of mere discriminant features. Thus, it is expected to be more tolerant to adversarial instances. It is understood by Table2 that AE model needs stronger perturbations to achieve 100% classification error with generated adversarials.

#### My Notes

One intriguing observation is that shallow model with no hidden unit is yet to be more robust to adversarial instance created from the deeper models. It questions the claim of generalization of adversarial instances. I believe, if the term generality is supposed to be hold, then a higher degree of susceptibility ought to be obtained in this example (and in other too).

I also happy to see that unsupervised method is more robust to adversarial as expected since I believe the notion of general AI is only possible with the unsupervised learning which learns the space of data instead of memorizing things. This is also what I plan to examine after this paper to see how the new tools like Variational Auto Encoders behave againt adversarial instance.

I believe that it is really hard to fight with adversarial instances especially, the ones created by counter optimization against a particular supervised model. A supervised model always has flaws to be exploited in this manner since it memorizes things [ref] and when you go beyond its scope (especially with adversarial instances are of low probability), it makes natural mistakes. Beside, it is known that a neural network converges to local minimum due to its non-convex nature. Therefore, by definition, it has such weaknesses.

Adversarial instances are, in practical sense, not a big deal right now.However, this is akin to be a far more important topic, as we journey through a more advanced AI. Right now, a ML model only makes tolerable mistakes. However, consider advanced systems waiting us in a close future with a use of great importance such as deciding who is guilty, who has cancer. Then this is question of far more important means.

##### CATEGORICAL REPARAMETERIZATION WITH GUMBEL SOFTMAX
• Continuous distribution on the simplex which approximates discrete vectors (one hot vectors) and differentiable by its parameters with reparametrization trick used in VAE.
• It is used for semi-supervised learning.

##### DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
• Learning useful unsupervised image representations by using triplet loss on image patches. The triplet is defined by two image patches from the same images as the anchor and the positive instances and a patch from a different image which is the negative.  It gives a good boost on CIFAR-10 after using it as a pretraning method.
• How would you apply to real and large scale classification problem?

##### MULTI-RESIDUAL NETWORKS
• For 110-layers ResNet the most contribution to gradient updates come from the paths with 10-34 layers.
• ResNet trained with only these effective paths has comparable performance with the full ResNet. It is done by sampling paths with lengths in the effective range for each mini-batch.
• Instead of going deeper adding more residual connections provides more boost due to the notion of exponential ensemble of shallow networks by the residual connections.
• Removing a residual block from a ResNet has negligible drop on performance in test time in contrast to VGG and GoogleNet.

# Paper review - Understanding Deep Learning Requires Rethinking Generalization

This paper states the following phrase. Traditional machine learning frameworks (VC dimensions, Rademacher complexity etc.) trying to explain how learning occurs are not very explanatory for the success of deep learning models and we need more understanding looking from different perspectives.

They rely on following empirical observations;

• Deep networks are able to learn any kind of train data even with white noise instances with random labels. It entails that neural networks have very good brute-force memorization capacity.
• Explicit regularization techniques - dropout, weight decay, batch norm - improves model generalization but it does not mean that same network give poor generalization performance without any of these. For instance, an inception network trained without ant explicit technique has 80.38% top-5 rate where as the same network achieved 83.6% on ImageNet challange with explicit techniques.
• A 2 layers network with 2n+d parameters can learn the function f with n samples in d dimensions. They provide a proof of this statement on appendix section. From the empirical stand-view, they show the network performances on MNIST and CIFAR-10 datasets with 2 layers Multi Layer Perceptron.

Above observations entails following questions and conflicts;

• Traditional notion of learning suggests stronger regularization as we use more powerful models. However, large enough network model is able to memorize any kind of data even if this data is just a random noise. Also, without any further explicit regularization techniques these models are able to generalize well in natural datasets.  It shows us that, conflicting to general belief, brute-force memorization is still a good learning method yielding reasonable generalization performance in test time.
• Classical approaches are poorly suited to explain the success of neural networks and more investigation is imperative in order to understand what is really going on from theoretical view.
• Generalization power of the networks are not really defined by the explicit techniques, instead implicit factors like learning method or the model architecture seems more effective.
• Explanation of generalization is need to be redefined in order to solve the conflicts depicted above.

My take :  These large models are able to learn any function (and large does not mean deep anymore) and if there is any kind of information match between the training data and the test data, they are able to generalize well as well. Maybe it might be an explanation to think this models as an ensemble of many millions of smaller models on which is controlled by the zeroing effect of activation functions.  Thus, it is able to memorize any function due to its size and implicated capacity but it still generalize well due-to this ensembling effect.

# Paper review: CONVERGENT LEARNING: DO DIFFERENT NEURAL NETWORKS LEARN THE SAME REPRESENTATIONS?

This paper is an interesting work which tries to explain similarities and differences between representation learned by different networks in the same architecture.

To the extend of their experiments, they train 4 different AlexNet and compare the units of these networks by correlation and mutual information analysis.

• Can we find one to one matching of units between network , showing that these units are sensitive to similar or the same commonalities on the image?
• Is the one to one matching stays the same by different similarity measures? They first use correlation then mutual information to confirm the findings.
• Is a representation learned by a network is a rotated version of the other, to the extend that one to one matching is not possible  between networks?
• Is clustering plausible for grouping units in different networks?

Answers to these questions are as follows;

• It is possible to find good matching units with really high correlation values but there are some units learning unique representation that are not replicated by the others. The degree of representational divergence between networks goes higher with the number of layers. Hence, we see large correlations by conv1 layers and it the value decreases toward conv5 and it is minimum by conv4 layer.
• They first analyze layers by the correlation values among units. Then they measure the overlap with the mutual information and the results are confirming each other..
• To see the differences between learned representation, they use a very smart trick. They approximate representations  learned by a layer of a network by the another network using the same layer.  A sparse approximation is performed using LASSO. The result indicating that some units are approximated well with 1 or 2 units of the other network but remaining set of units require almost 4 counterpart units for good approximation. It shows that some units having good one to one matching has local codes learned and other units have slight distributed codes approximated by multiple counterpart units.
• They also run a hierarchical clustering in order to group similar units successfully.

For details please refer to the paper.

My discussion: We see that different networks learn similar representations with some level of accompanying uniqueness. It is intriguing  to see that, after this paper, these  are the unique representations causing performance differences between networks and whether the effect is improving or worsening. Additionally, maybe we might combine these differences at the end to improve network performances by some set of smart tricks.

One deficit of the paper is that they do not experiment deep networks which are the real deal of the time. As we see from the results, as the layers go deeper,  different abstractions exhumed by different networks. I believe this is more harsh by deeper architectures such as Inception or VGG kind.

One another curious thing is to study Residual netwrosk. The intuition of Residual networks to pass the already learned representation to upper layers and adding more to residual channel if something useful learned by the next layer. That idea shows some promise that two residual networks might be more similar compared to two Inception networks. Moreover, we can compare different layers inside a single Residual Network to see at what level the representation stays the same.

# Paper review: ALL YOU NEED IS A GOOD INIT

This work proposes yet another way to initialize your network, namely LUV (Layer-sequential Unit-variance) targeting especially deep networks.  The idea relies on lately served Orthogonal initialization and fine-tuning the weights by the data to have variance of 1 for each layer output.

The scheme follows three stages;

1.  Initialize weights by unit variance Gaussian
2.  Find components of these weights using SVD
3.  Replace the weights with these components
4.  By using minibatches of data, try to rescale weights to have variance of 1 for each layer. This iterative procedure is described as below pseudo code.

In order to describe the code in words, for each iteration we give a new mini-batch and compute the output variance. We compare the computed variance by the threshold we defined as $Tol_{var}$ to the target variance 1.   If number of iterations is below the maximum number iterations or the difference is above $Tol_{var}$ we rescale the layer weights by the squared variance of the minibatch.  After initializing this layer go on to the next layer.

In essence, what this method does. First, we start with a normal Gaussian initialization which we know that it is not enough for deep networks. Orthogonalization stage, decorrelates the weights so that each unit of the layer starts to learn from particularly different point in the space. At the final stage, LUV iterations rescale the weights and keep the back and forth propagated signals close to a useful variance against vanishing or exploding gradient problem , similar to Batch Normalization but without computational load.  Nevertheless, as also they points, LUV is not interchangeable with BN for especially large datasets like ImageNet. Still, I'd like to see a comparison with LUV vs BN but it is not done or not written to paper (Edit by the Author: Figure 3 on the paper has CIFAR comparison of BN and LUV and ImageNet results are posted on https://github.com/ducha-aiki/caffenet-benchmark).

The good side of this method is it works, for at least for my experiments made on ImageNet with different architectures. It is also not too much hurdle to code, if you already have Orthogonal initialization on the hand. Even, if you don't have it, you can start with a Gaussian initialization scheme and skip Orthogonalization stage and directly use LUV iterations. It still works with slight decrease of performance.

# Paper review: Dynamic Capacity Networks

Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.

In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.

Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.

Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.

We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.

At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T.  Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.

I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.

My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.

Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.