In recent years, Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) trained with the Connectionist Temporal Classification (CTC) objective won many international handwriting recognition evaluations. The CTC algorithm is based on a forward-backward procedure, avoiding the need for a segmentation of the input before training. The network outputs are character labels and a special non-character label. On the other hand, in the hybrid Neural Network / Hidden Markov Models (NN/HMM) framework, networks are trained with framewise criteria to predict state labels. In this paper, we show that CTC training is close to forward-backward training of NN/HMMs, and can be extended to more standard HMM topologies. We apply this method to Multi-Layer Perceptrons (MLPs), and investigate the properties of CTC, especially the role of the special label.
Acknowledgements. This work was carried out during my PhD in the research team of A2iA and at the LIMSI-CNRS lab, under the supervision of Prof. Hermann Ney (RWTH Aachen, LIMSI) and Christopher Kermorvant (Teklia, A2iA). Please cite one of the following if you wish to refer to this work:
Questions:
NB. Some useful references for integrated NN/HMM training: [Bengio et al., 1992, Haffner, 1993, Senior and Robinson, 1996, Konig et al., 1996, Hennebert et al., 1997, Yan et al., 1997], and for CTC: [Graves et al., 2006]
The equations are in the paper(s). Basically:
- CTC = HMM training without transition/prior probabilities (zeroth-order model), and with specific outputs for standalone NN recognition (= one HMM state per character + blank)
- CTC could be applied with different HMM topologies, and to other kinds of NNs than RNNs (e.g. MLPs)
- CTC = cross-entropy training + forward-backward to consider all possible segmentations
- so we can compare the training strategies and see the effect of forward-backward, with different HMM topologies (number of HMM states per character)
The differences are summarized in the following table:
Training | Cost | NN outputs |
---|---|---|
Cross-entropy | $$-\log \prod_t P(q_t | x_t)$$ | HMM states (5-6 / char.) |
CTC | $$-\log {\color{myblue} \sum_{\mathbf{q}}} \prod_t P(q_t | \mathbf{x})$$ | characters + blank |
HMM training | $$-\log {\color{myblue} \sum_{\mathbf{q}}} \prod_t \frac{P(q_t | x_t)}{\color{myred} P(q_t)} {\color{myred} P(q_t | q_{t-1})}$$ | HMM states (5-6 / char.) |
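To make the relation concrete, here is a minimal numpy sketch (log-domain, illustrative names and sizes only) of the first two costs in the table: the framewise negative log-likelihood for a fixed alignment, and the CTC cost obtained by summing over all valid alignments with the forward recursion of [Graves et al., 2006]. The per-frame posteriors `log_probs` stand in for the softmax outputs of any network.

```python
import numpy as np

def framewise_nll(log_probs, alignment):
    """-log prod_t P(q_t | x_t) for a fixed alignment (one state label per frame)."""
    T = log_probs.shape[0]
    return -log_probs[np.arange(T), alignment].sum()

def ctc_nll(log_probs, labels, blank=0):
    """-log of the sum over all valid alignments, via the CTC forward recursion."""
    # Interleave blanks: l' = [blank, y_1, blank, y_2, ..., y_L, blank]
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    S, T = len(ext), log_probs.shape[0]
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                 # stay on the same node
            if s > 0:
                cands.append(alpha[t - 1, s - 1])     # advance by one node
            # skip a blank, allowed when the label differs from the one two nodes back
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

rng = np.random.default_rng(0)
T, K = 20, 5                                   # frames; output classes (label 0 = blank)
log_probs = np.log(rng.dirichlet(np.ones(K), size=T))
print(ctc_nll(log_probs, labels=[2, 3, 2]))    # sum over all segmentations
print(framewise_nll(log_probs, np.array([0, 2, 2, 0, 3, 3] + [0] * 14)))
```

Adding the priors $P(q_t)$ and transitions $P(q_t | q_{t-1})$ inside the recursion would turn the same forward pass into the HMM cost of the third row.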
Goal: study the differences and dynamics of training methods
First, let's do the usual comparison, the one you will find in many papers switching from classical framewise training on HMM alignments to CTC. We compare, on the one hand, the framewise system, with standard multi-state HMMs without blank, and, on the other hand, CTC as defined in [Graves et al., 2006], with one-state character models and blank:
MLP | RNN |
One of the aspects of CTC, the one most visible in the cost function, is the segmentation-free approach obtained by summing over all possible labelings. Let's forget about the blank and evaluate only this aspect, with different numbers of states per character:
MLP | RNN |
Back to the blank label: let's see the effect of optionally adding it between every character model, first with framewise training, then with CTC-like training, for different numbers of states per character:
MLP | RNN |
MLP | RNN |
To complete the picture, we may also evaluate the segmentation-free aspect of CTC when the blank label is included:
MLP | RNN |
Merging all these plots, we can see that more or less all curves, i.e. all choices of blank inclusion and training method, look practically the same, except for one:
MLP | RNN |
A typical observation in CTC-trained RNNs is the dominance of blank predictions in the output sequence, with localized peaks for character predictions. Interestingly, this behaviour still occurs with more states (e.g. with three states: three peaks of state predictions, with many blanks between each character). We also observe it with CTC-trained MLPs. Thus, we can conclude that this typical output is due to the interaction between the blank and CTC training, rather than a consequence of using RNNs or single-state character models.
On the following figure, we show how the outputs of a network with one state per character and blank evolve at the beginning of CTC training. In the top left plot, we see the outputs before training. Since the weights of the network have not yet been adjusted, the outputs are more or less random. As the training procedure advances, the prediction of the blank symbol increases, until its posterior probability is close to one for the whole sequence (end of the second line). At this point, the network predicts only blank labels. Then (third line), the probabilities of the character labels start increasing, and peaks emerge at specific locations.
This is not surprising. In the CTC graph, there is a blank between every two characters. In the early stages of training, the NN outputs are more or less random (to a first approximation, drawn from a uniform distribution), so any path is equally likely; but when we sum the blank posteriors at a given timestep in the forward-backward procedure, the target for the blank is much higher than for any other character. The backpropagated error makes the network more likely to give a high posterior probability to the blanks, which in turn favours paths with many blanks in subsequent forward-backward computations of the cost.
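This effect can be checked exactly on a tiny example. The following brute-force sketch (toy sizes chosen purely for illustration) enumerates all valid CTC paths for a two-character target under uniform network outputs, in which case every valid path is equally likely, and measures the share of frames assigned to each label: the blank receives more posterior mass than any single character.

```python
import itertools
from collections import Counter

BLANK = 0
target = [1, 2]          # a two-character word, e.g. "ab"
T, K = 8, 3              # 8 frames, labels {blank, a, b}

def collapse(path):
    """CTC collapse: remove consecutive repetitions, then blanks."""
    out = []
    for p in path:
        if out and p == out[-1]:
            continue
        out.append(p)
    return [p for p in out if p != BLANK]

frame_mass = Counter()
n_paths = 0
for path in itertools.product(range(K), repeat=T):
    if collapse(path) == target:      # keep only paths mapping to the target
        n_paths += 1
        frame_mass.update(path)       # count how many frames carry each label

# Per-frame posterior of each label (averaged over time, uniform path weights).
for label in range(K):
    print(f"label {label}: {frame_mass[label] / (n_paths * T):.3f}")
```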
The previous figure shows the posterior probabilities computed with the forward-backward algorithm during training, which are the targets of the CTC training algorithm. Because one label out of two in the graph is a blank, this symbol gets a higher posterior probability. When the network starts outputting only blanks, the paths with many blanks become even more likely. At the same time, the difference between the network output for the blank label and the posterior probability of the blank computed with the forward-backward algorithm becomes small.
Predicting only blanks penalizes the cost function because of the low probabilities given to characters, and the network only has to find one location at which to predict each character, to "jump" from one sequence of blanks to another. Since a valid path must go through the character labels, the posterior probabilities of these labels also increase. We see on the second plot of the previous figure that at some positions the character labels have a higher posterior probability. Since the network predicts only blanks, the gradients for characters at these positions are high, and the network is trained to output the character labels at these specific locations. Finally, the path with blanks everywhere and characters at localized positions has a much higher probability, and the training algorithm may not give much consideration to the other possible segmentations (bottom plot of the previous figure).
Actually, the CTC-without-blank results presented previously were not obtained by training with CTC from scratch. When we did so, the training procedure did not converge to a good result. In the next figure, we show the outputs of a neural network when there is no blank symbol between characters. On top, we display the output after framewise training. Below is the output after plain CTC training from a random initialization. The big gray area corresponds to the whitespace, and the character predictions, in color, only appear at the beginning and the end of the sequence.
Just as the blank was predominant in the CTC graph, the whitespace symbol is now the most frequent one, and the network learnt to predict it everywhere. Although it is not clear why we do not observe this behaviour with blank, it seems to indicate that this symbol is important when training with CTC from a random initialization, maybe by helping to align the outputs of the network with the target sequence.
When we initialize the network with one epoch of framewise training before switching to CTC training, this problem disappears. Thus, when there is no blank, we should follow the same procedure as in [Senior and Robinson, 1996, Hennebert et al., 1997], consisting of first training the network on Viterbi alignments, and then refining it with forward-backward training.
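As a concrete picture of this two-stage schedule, here is a hedged PyTorch sketch with synthetic data: a single framewise pass with cross-entropy against a fixed alignment (standing in for the Viterbi alignment), followed by training that sums over all alignments. PyTorch's `nn.CTCLoss` (which always includes a blank) stands in for the forward-backward stage; the no-blank variant used in the experiments would require a custom graph like the one sketched earlier. The tiny MLP, the sizes and the random data are illustrative only.

```python
import torch
import torch.nn as nn

T, B, K, H = 50, 4, 30, 64        # frames, batch, classes (0 = blank), hidden size
net = nn.Sequential(nn.Linear(20, H), nn.Tanh(), nn.Linear(H, K))

x = torch.randn(T, B, 20)                       # frame features
alignment = torch.randint(1, K, (T, B))         # framewise targets (e.g. a Viterbi alignment)
labels = torch.randint(1, K, (B, 10))           # transcriptions for the CTC stage
label_lens = torch.full((B,), 10)
input_lens = torch.full((B,), T)

opt = torch.optim.SGD(net.parameters(), lr=1e-2)

# Stage 1: one framewise epoch (cross-entropy against the fixed alignment).
ce = nn.CrossEntropyLoss()
loss = ce(net(x).reshape(T * B, K), alignment.reshape(T * B))
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: CTC training (sum over all alignments via forward-backward).
ctc = nn.CTCLoss(blank=0)
for _ in range(10):
    log_probs = net(x).log_softmax(dim=-1)      # (T, B, K)
    loss = ctc(log_probs, labels, input_lens, label_lens)
    opt.zero_grad(); loss.backward(); opt.step()
```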
For training, by skewing the output distribution towards blanks, it prevents alignment issues in the early stages of training (especially for whitespaces), when the network's outputs are not yet informative.
For decoding, the blanks are uninformative and shared by all word models. Peaked and localized character predictions make corrections less costly (only one frame has to be changed to correct a substitution, deletion or insertion), which makes decoding faster and keeps more hypotheses for a fixed beam.
NB. The optimal optical scale in decoding was 1 with CTC, and 1/(average character length) for framewise or no-blank training.
Thanks to the previous analysis, we can devise experiments to try to avoid the peaks, or at least to take advantage of our understanding of CTC to improve the training procedure. Among the things we tried:
Framewise initialization There are no peaks with framewise training, and framewise initialization enabled the convergence of models without blank. With framewise initialization and blank, however, the peaks reappear when we switch to CTC training.
Repeating labels Since the peaks are induced by the blank symbol being over-represented at each timestep in the CTC graph, we modified the CTC graph to force each character prediction to last at least N steps, with N=2, 3 or 5 (i.e. similar to an N-state HMM with a shared emission model). We did not enforce this constraint at decoding time, but it turned out that the network learnt to predict exactly sequences of N identical labels, the remaining frames still being blanks (blanks are uninformative, so they are probably still easier to learn).
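One way to picture this modification, sketched under our reading rather than as the exact implementation used in the experiments: view the target as an HMM-style state graph in which every character is a chain of N states with tied emissions, with optional blank states in between, and run the same log-domain forward recursion over that graph. For simplicity the sketch glosses over the standard CTC rule for repeated characters; all names and sizes are illustrative.

```python
import numpy as np

def build_graph(labels, blank=0, min_dur=1):
    """Return per-state emission labels, allowed predecessor sets, and final states."""
    emit, preds = [], []
    entry = set()                      # states that may transition into the next unit
    for y in labels:
        b = len(emit)                  # optional blank before this character
        emit.append(blank)
        preds.append({b} | entry)
        for n in range(min_dur):       # min_dur chained states with tied emissions
            s = len(emit)
            emit.append(y)
            p = {b} | entry if n == 0 else {s - 1}   # enter via the blank or skip it
            if n == min_dur - 1:
                p = p | {s}            # self-loop only on the last state of the chain
            preds.append(p)
        entry = {len(emit) - 1}        # only the last character state exits
    b = len(emit)                      # optional final blank
    emit.append(blank)
    preds.append({b} | entry)
    return emit, preds, entry | {b}

def graph_nll(log_probs, emit, preds, finals):
    """-log of the sum over all paths through the graph (forward algorithm)."""
    T, S = log_probs.shape[0], len(emit)
    alpha = np.full((T, S), -np.inf)
    for s in (0, 1):                   # start in the first blank or the first character state
        alpha[0, s] = log_probs[0, emit[s]]
    for t in range(1, T):
        for s in range(S):
            prev = [alpha[t - 1, p] for p in preds[s]]
            alpha[t, s] = np.logaddexp.reduce(prev) + log_probs[t, emit[s]]
    return -np.logaddexp.reduce([alpha[T - 1, s] for s in finals])

emit, preds, finals = build_graph([3, 1, 4], min_dur=3)    # toy 3-character target
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(6), size=40))     # 40 frames, 6 classes
print(graph_nll(log_probs, emit, preds, finals))
```

With `min_dur=1` this reduces to a one-state-per-character graph with optional blanks; larger values enforce the minimum duration while still allowing each character to last longer.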
Blank penalty We also tried to add a penalty on the transitions to blank in the computation of the CTC cost. We managed to control the importance of the peaks, but the performance also degraded towards that of the models with more states and no blank.
Lower learning rate To limit the impact of the high gradients on blanks at the beginning of training, and to keep more alternatives, we trained the networks with a smaller learning rate. For RNNs, peaks remained the best option, but the results were poorer. For MLPs, we managed to avoid the peaks while obtaining better performance, which seems to confirm that it is too difficult, or sub-optimal, for an MLP to predict localized peaks.
Adaptive learning rate per parameter (RMSProp) Last but not least, a similar idea to control the impact of the gradients for the blank label is to use an adaptive learning rate (per parameter). On more complicated databases, we sometimes observe plateaus at the beginning of CTC training. The cost value more or less corresponds to blank-only predictions (see the black and blue curves on the figure below). [Louradour and Kermorvant, 2014] argue that this is due to the difficulty, for this training method, of both aligning and recognizing characters. They propose a training curriculum where short sequences (an easier alignment task) are favoured at the beginning of training.
Following our analysis, we suppose that it is also explained by the difficulty of starting to predict characters instead of blanks, because of the backpropagated error signal. Thus, we chose a training algorithm that adapts the learning rate of each parameter according to the past gradients: RMSProp [Tieleman and Hinton, 2012]. With this approach, the learning rate is smaller if the parameter has received high gradients, and bigger in the case of small gradients. From the labels' perspective, we may hope that it will mitigate the high gradients coming from the blank.
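For reference, the per-parameter scaling just described boils down to the following update (a minimal numpy sketch; the decay and epsilon values are the usual illustrative defaults, not necessarily those used in the experiments):

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update: divide the step by a running RMS of past gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2      # running mean of squared gradients
    param = param - lr * grad / (np.sqrt(cache) + eps)   # large past gradients -> smaller effective step
    return param, cache
```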
In the next figure, we present the evolution of predictions under CTC training with SGD and RMSProp. This is a toy example: the values of the predictions (inputs of the CTC) are directly optimized from random initialization (there is no neural network, it only shows the dynamics of CTC training).
Simple SGD | RMSProp |
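A hedged reconstruction of this toy setting in PyTorch (illustrative sizes and learning rates): the frame-level scores are themselves the parameters, optimized directly with the CTC loss, once with plain SGD and once with RMSprop, while tracking how many frames end up predicted as blank.

```python
import torch
import torch.nn as nn

T, K, L = 100, 20, 10                 # frames, classes (0 = blank), target length
target = torch.randint(1, K, (1, L))
ctc = nn.CTCLoss(blank=0)

for opt_cls, lr in [(torch.optim.SGD, 1e-1), (torch.optim.RMSprop, 1e-2)]:
    torch.manual_seed(0)
    logits = nn.Parameter(0.01 * torch.randn(T, 1, K))   # the "predictions", no network
    opt = opt_cls([logits], lr=lr)
    for step in range(500):
        loss = ctc(logits.log_softmax(-1), target,
                   torch.tensor([T]), torch.tensor([L]))
        opt.zero_grad(); loss.backward(); opt.step()
    blank_frac = (logits.argmax(-1) == 0).float().mean().item()
    print(opt_cls.__name__, round(loss.item(), 3), round(blank_frac, 2))
```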
On two tasks exhibiting this plateau behaviour, we see on the next plot that even if smaller learning rates or curriculum training alleviate the problem, with Adagrad (or similarly RMSProp) the problem completely vanishes:
Task 1 | Task 2 |
We studied the CTC training algorithm, its relation to framewise and HMM training, and the role of the blank symbol in the CTC.
Copyright Notice. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.