The automatic transcription of text in handwritten documents has many applications, from automatic document processing to indexing and document understanding. One of the most popular approaches today consists of scanning the text line image with a sliding window, from which features are extracted and modelled by Hidden Markov Models (HMMs). Combined with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory recurrent neural networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers have recently produced significant reductions in error rates.
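To make the sliding-window idea concrete, the following is a minimal sketch, assuming a grey-scale line image and illustrative values for the window width, shift, and features (mean intensities per horizontal band); it is not the exact pipeline used in the thesis.

```python
import numpy as np

def sliding_window_features(line_image, width=9, shift=3, n_bands=8):
    """Slide a fixed-width window over a text-line image (height x length)
    and compute one simple feature vector per frame: here, the mean
    intensity in a few horizontal bands of the window."""
    height, length = line_image.shape
    frames = []
    for x in range(0, max(length - width, 1), shift):
        window = line_image[:, x:x + width]
        # split the window into horizontal bands and average each one
        bands = np.array_split(window, n_bands, axis=0)
        frames.append([band.mean() for band in bands])
    # one feature vector per frame, to be modelled by the HMM / neural network
    return np.asarray(frames)

# Example: a synthetic "line image" of height 64 and length 500 pixels
line = np.random.rand(64, 500)
print(sliding_window_features(line).shape)  # (n_frames, n_bands)
```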
In this thesis, we conduct a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network / HMM scheme, in order to better understand and evaluate their relative importance. First, we show that deep neural networks produce consistent and significant improvements over networks with one or two hidden layers, independently of the kind of network, MLP or RNN, and of the input, handcrafted features or pixels. Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in closing the performance gap between the two kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are learnt automatically along the way. Despite the dominance of LSTM-RNNs in the recent handwriting recognition literature, we show that deep MLPs achieve comparable results. Moreover, we evaluate different training criteria: with sequence-discriminative training, we report improvements for MLP/HMMs similar to those observed in speech recognition, and we show how the Connectionist Temporal Classification (CTC) framework is especially suited to RNNs. Finally, dropout, a recently proposed technique for regularizing neural networks, has lately been applied to LSTM-RNNs. We study its effect at different positions within LSTM-RNNs, extending previous work, and we show that its position relative to the recurrent connections is important.
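As a concrete illustration of the placement question, here is a minimal sketch of dropout applied only to the feed-forward input connections of a recurrent layer, leaving the recurrent path untouched. It uses a simplified tanh cell standing in for an LSTM, and all dimensions and the dropout rate are illustrative assumptions rather than the thesis configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def recurrent_layer(x_seq, W_in, W_rec, b, p_drop=0.5, train=True):
    """Plain tanh recurrent layer (a simplified stand-in for an LSTM) with
    dropout on the feed-forward input connections only; the recurrent path
    h_{t-1} -> h_t is left intact, so the layer's memory is not corrupted."""
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x_t in x_seq:
        if train:
            # inverted dropout mask applied to the input (feed-forward) connection
            mask = (rng.random(x_t.shape) > p_drop) / (1.0 - p_drop)
            x_t = x_t * mask
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        outputs.append(h)
    return np.stack(outputs)

# Tiny example: 20 frames of 8-dimensional features, 16 hidden units
x_seq = rng.standard_normal((20, 8))
W_in = rng.standard_normal((16, 8))
W_rec = rng.standard_normal((16, 16))
b = np.zeros(16)
print(recurrent_layer(x_seq, W_in, W_rec, b).shape)  # (20, 16)
```

Moving the mask onto the `W_rec @ h` term instead would drop units on the recurrent connections themselves, which is precisely the kind of placement whose effect is compared in the thesis.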
We conducted the experiments on three public databases, representing two languages (English and French) and two epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014. The results of the final systems presented in this thesis, namely MLPs and RNNs with handcrafted feature or pixel inputs, are comparable to the state of the art on Rimes and IAM. Moreover, the combination of these systems outperforms all published results on the considered databases.
Keywords: Deep Neural Networks, Multi-Layer Perceptrons, Recurrent Neural Networks, Hidden Markov Models, Dropout, Handwriting Recognition