


Pattern Recognition 41 (2008) 3452 -- 3460


Hidden Markov model-based ensemble methods for offline handwritten text line recognition

Roman Bertolami∗, Horst Bunke
Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland

ARTICLE INFO

Article history:
Received 10 July 2007
Received in revised form 25 March 2008
Accepted 2 April 2008

Keywords:
Offline handwritten text line recognition; Ensemble methods; Confidence measures

ABSTRACT

This paper investigates various ensemble methods for offline handwritten text line recognition. To obtain ensembles of recognisers, we implement bagging, random feature subspace, and language model variation methods. For the combination, the word sequences returned by the individual ensemble members are first aligned. Then a confidence-based voting strategy determines the final word sequence. A number of confidence measures based on normalised likelihoods and alternative candidates are evaluated. Experiments show that the proposed ensemble methods can improve the recognition accuracy over an optimised single reference recogniser.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Early research activities in offline handwriting recognition were restricted to isolated character [1] or numeral recognition [2]. Next, the recognition of cursively written words and digit sequences was considered, motivated by automatic cheque [3] and postal address reading [4]. Research on general handwritten text recognition, as considered in this paper, started much later. As of today, this problem is still considered largely unexplored, particularly if no constraints are imposed during writing.

The main challenges in offline handwritten text recognition are the individual writing styles, the large number of different word classes, and the word segmentation problem. In a writer independent system, the differences in writing style as well as in writing instruments are typically high. Since natural languages are considered, the underlying lexicon must contain a large number of word classes to cover many different domains of text. Additionally, segmentation errors often occur because the correct number of words in a text line is unknown in advance. For these reasons, a high recognition accuracy is difficult to achieve. In the literature, recognition rates between 50% and 80% are reported, depending on the experimental setup [5--8].

Automatic reading of general handwritten text is interesting for tasks such as the transcription and indexing of handwritten historical archives and the automatic reading of forms, handwritten faxes, personal notes, and annotations on documents. However, today's

∗ Corresponding author. Tel.: +41 316314865.
E-mail addresses: [email protected] (R. Bertolami), [email protected] (H. Bunke).

0031-3203/$30.00 © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.04.003

recognition systems rarely achieve recognition rates that are good enough for these applications.

A possible strategy to improve the accuracy of pattern classifiers is the use of ensemble methods, which have been shown to be effective for many different classification problems [9,10]. By combining the results of multiple classifiers, the recognition accuracy is often improved compared to a single classifier. Given that some errors made by the individual recognisers are different, we can expect that by combining multiple classifiers, errors of individual recognisers can be corrected.

The contribution of the current paper is that ensemble methods are applied for the first time to the problem of offline handwritten text line recognition. Multiple ensemble generation methods, based on either altering the input data or the system architecture, are evaluated. Furthermore, we compare various confidence measures for the combination of recognised word sequences.

The remaining part of the paper is organised as follows. In the next section, related work is discussed. Section 3 introduces the implemented confidence measures and the handwritten text line recogniser that is used as the base recogniser in all considered multiple classifier systems. Section 4 describes the ensemble creation methods. Next, the combination algorithms are discussed in Section 5. An experimental evaluation is presented in Section 6, and conclusions are drawn in the last section of the paper.

2. Related work

In the handwriting recognition literature, several ensemble methods have been presented for character [11], numeral [12--14], and word [15,16] recognition. An automatic self-configuration scheme,



using genetic algorithms to combine multiple character recognition systems, has been proposed in Ref. [11]. In numeral recognition, the application of statistical combination methods has been reported in Ref. [12], where the so-called behaviour knowledge space methods resulted in successful classifier combinations. A feature selection approach based on a hierarchical algorithm was used in Ref. [13] to build ensembles of digit recognisers. In Ref. [14], a framework to combine numeral string recognisers was proposed that applies a graph-based approach for combination. An evaluation of several decision combination strategies for handwritten word recognition has been reported in Ref. [15]. Borda count methods, fuzzy integrals, and multilayer perceptrons have been compared. In Ref. [16], various ensemble methods including bagging, boosting, and feature subspace methods have been applied to offline handwritten word recognition.

The investigation of ensemble methods for unconstrained offline handwritten text line recognition, as considered in this paper, has started only recently [17--19]. The combination of multiple text line recognition systems requires additional synchronisation effort, i.e. an alignment procedure, because the number of words in the output returned by the individual recognisers might differ. In Ref. [17], positional information output by the recognisers reduces the search space of the alignment procedure. This information leads to a substantial speed-up of the alignment process without a significant loss of recognition accuracy. A novel ensemble member generation strategy based on specific integration of a language model was proposed in Ref. [18]. The ROVER combination method [20] was applied for the first time in handwritten text line recognition to combine the recognition results. A statistical decision method which can be applied to problems with an arbitrarily large number of classes has been introduced in Ref. [19], where the ensemble members originate from random feature subspaces.

3. Handwritten text line recognition system

The offline handwritten text line recognition system we use as the base recogniser for the ensembles is an extended version of the recognition system introduced in Ref. [21]. The system can be divided into three phases: pre-processing, recognition, and post-processing. During the pre-processing phase the image is normalised and the features are extracted. These features are then the input for a hidden Markov model (HMM)-based recogniser. In the post-processing phase, confidence measures are computed for each recognised word. In the following subsections, these individual phases are described in greater detail. Enhancements to the recogniser introduced in Ref. [21] were implemented at the language model integration level, i.e. the integration parameters are systematically optimised, as well as in the modelling of the characters, where each character model has an individually optimised number of states.

3.1. Image normalisation and feature extraction

To reduce the impact of different writing styles, a handwritten text line image is first normalised with respect to skew, slant, baseline position, and average letter width. During skew correction, the image is rotated such that the line on which the words are written becomes horizontal. The slant correction brings the handwriting into an upright position by applying a shear transformation. The baseline positioning scales the three main areas of the text line (i.e. the ascender part, the middle part, and the descender part) to predefined heights. The average letter width is estimated and normalised to a predefined value by a horizontal scaling transformation. An example of the image normalisation steps appears in Fig. 1.

After these normalisation steps, a handwritten text line is converted into a sequence of feature vectors using a sliding window. The window has a width of one pixel and moves from left to right

over the image, one pixel at each step. Nine geometrical features are extracted at each position of the sliding window. The first three features contain the number of foreground pixels in the window as well as the first- and second-order moments of the foreground pixels. Features four to seven contain the positions of the upper and the lower contour and their first-order derivatives. The last two features represent the number of vertical black--white transitions and the pixel density between the upper and the lower contour. We refer to Ref. [21] for further details.
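A minimal sketch of how the nine features could be computed for one window column, assuming a binary column (1 = foreground, indexed top to bottom); the exact moment and derivative definitions follow Ref. [21] and may differ in detail:

```python
import numpy as np

def window_features(col, prev_upper=None, prev_lower=None):
    """Nine geometrical features for a one-pixel-wide window.

    `col` is a 1-D binary array (1 = foreground).  `prev_upper` and
    `prev_lower` are the contour positions of the previous window,
    used to approximate the first-order derivatives.
    """
    h = len(col)
    fg = np.flatnonzero(col)                     # foreground row indices
    if len(fg) == 0:
        return [0.0] * 9
    n_fg = float(len(fg))                        # feature 1: pixel count
    mean = fg.mean() / h                         # feature 2: first-order moment
    second = float((fg.astype(float) ** 2).mean()) / h ** 2  # feature 3
    upper, lower = int(fg[0]), int(fg[-1])       # features 4 and 6: contours
    d_upper = 0.0 if prev_upper is None else float(upper - prev_upper)
    d_lower = 0.0 if prev_lower is None else float(lower - prev_lower)
    transitions = int(np.sum(np.abs(np.diff(col))))  # feature 8
    density = float(col[upper:lower + 1].mean())     # feature 9
    return [n_fg, mean, second, upper, d_upper, lower, d_lower,
            transitions, density]
```

Applied column by column, this yields the sequence of nine-dimensional feature vectors fed to the HMM recogniser.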

3.2. HMM-based recognition

For the recognition we apply an HMM-based technique. HMMs have become a standard tool for recognising various types of observation sequences, including speech [22], online handwriting [23], and offline handwriting [7,24,25]. In our system, each character is modelled with a separate HMM. For all HMMs a linear topology is used, i.e. each state of the HMM has only two transitions, one to itself and one to the next state. Because the characters differ in length, the number of states is chosen individually for each character, as proposed in Ref. [8]. Twelve Gaussian mixture components model the output distribution in each state.
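The linear topology can be made concrete by writing down its transition matrix: every state keeps a self-loop and a single forward transition. A small illustrative sketch (the probabilities are placeholders, not trained values):

```python
import numpy as np

def linear_topology(n_states, p_stay=0.6):
    """Transition matrix of a linear (left-to-right) HMM: state i may
    only stay in state i or advance to state i + 1 -- no skips."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay             # self-loop
        A[i, i + 1] = 1.0 - p_stay   # forward transition
    A[-1, -1] = 1.0                  # final state absorbs
    return A
```

In training, `p_stay` would be re-estimated per state by Baum--Welch; here it merely illustrates the two-transition structure.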

The Baum--Welch algorithm [22] is used for the training of the HMMs, and the recognition is performed by the Viterbi algorithm [26]. A statistical bigram language model supports the recognition process. For language model smoothing we use the Kneser--Ney technique [27].

3.3. Confidence measures

A common way to express the certainty of a recogniser about its decision is to calculate a confidence value for each recognised word. Based on this confidence value, rejection strategies can be implemented. If the confidence value of a recognised word exceeds a specific threshold, the recognition result is accepted. Otherwise, it is rejected. In the context of classifier combination, confidence measures are used to give a result with a higher confidence a higher importance in the combination.

In the literature, a large number of confidence measures have been developed. In offline handwriting recognition, confidence measures for address reading [4], cheque processing [28], character [29], and word [30] recognition systems have been proposed. Confidence measures based on alternative candidates have been proposed in continuous speech recognition [31,32]. Recently, similar confidence measures have also been applied to handwritten text recognition [33].

In this paper we implement three different confidence measures. The first of these confidence measures is based on likelihoods, while the other two make use of alternative candidates. Next, these confidence measures are described in more detail.

3.3.1. Likelihood-based confidence

The first confidence measure is derived from normalised likelihoods. Based on state transition probabilities and emission probabilities, the HMM-based recogniser accumulates log likelihoods for each frame, i.e. each position of the sliding window. The sum of these log likelihoods gives the log likelihood score for a word and is used as the recognition score in the decoding step [22]. Because this recognition score is influenced by the length of the handwritten word, it is normalised, i.e. it is divided by the number of frames. The result is an average log likelihood, which we use as a confidence measure.

The likelihood-based confidence measure is simple and fast to compute. However, because the likelihoods output by a continuous HMM-based recogniser are typically not probabilities, its reliability is limited.
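As a sketch, the normalisation amounts to a simple average over the per-frame log likelihoods, which are assumed to be provided by the decoder:

```python
def likelihood_confidence(frame_log_likelihoods):
    """Average log likelihood of a word: the accumulated per-frame
    log likelihoods divided by the number of frames (word length)."""
    if not frame_log_likelihoods:
        raise ValueError("word must span at least one frame")
    return sum(frame_log_likelihoods) / len(frame_log_likelihoods)
```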

Page 3: Hidden Markov model-based ensemble methods for offline handwritten text line recognition

3454 R. Bertolami, H. Bunke / Pattern Recognition 41 (2008) 3452 -- 3460

Fig. 1. Pre-processing of an image of handwritten text. The first line shows the original image, whereas the normalised image appears on the second line. The normalisation includes skew correction, slant correction, baseline positioning, and width normalisation.

Fig. 2. Counting the number of times n a recognised word occurs in alternative candidate sequences.

3.3.2. Confidence derived from candidates

A more sophisticated type of confidence measure is derived from a list of candidates. In addition to the recognition result, i.e. the recogniser's top ranked output W = (w_1, ..., w_m), the considered list contains K alternative candidates W_1, ..., W_K, where W_i = (w_1^i, ..., w_{m_i}^i).

The quality of these alternative candidates is a key aspect for a good performance of a confidence measure. In the ideal case, an alternative candidate sequence distinguishes itself from the recognised top ranked word sequence exactly at the positions where top ranked output words have been recognised incorrectly. Of course, in practice this is rarely the case, as alternative candidates sometimes differ in words that have been recognised correctly or coincide with misrecognised words.

A common way to produce alternative candidates is the extraction of an n-best list, containing the n highest ranked transcriptions of a given image of handwritten text. However, it has been shown in the speech [32] as well as in the handwriting recognition literature [34] that candidates derived from specific integration of a language model provide better rejection performance than n-best lists. Therefore, we use this method to produce the alternative candidates. We refer to Ref. [34] for more details.

Once the alternative candidates are available, they are aligned with the recognised top ranked output W using dynamic string alignment [35]. Based on this alignment, a confidence measure p(c|w_i, n) is computed for each word w_i of W. The quantity p(c|w, n) represents the probability of a word w of the top ranked output being recognised correctly, where c ∈ {0, 1} (0 stands for incorrect and 1 for correct) and n ∈ {0, ..., K} corresponds to the number of times the word w is observed in the K alternative candidates. See Fig. 2 for an example.

If our training set were large enough, it would be possible to estimate the probability p(c|w, n) for every value of n and all the words w contained in the dictionary. Since most words appear only a few times (many words do not appear in the training set at all), we approximate the expression p(c|n, w) by two different confidence measures.

The first approximation, Conf1, estimates the probability p(c|n, w) by p(c|n). The underlying assumption is that the probability of being correctly recognised is independent of the considered word w. This assumption allows a straightforward and robust estimation of p(c|n):

Conf1 = p(c|n) (1)
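Estimating p(c|n) by relative frequencies can be sketched as follows, assuming (n, correct) pairs have already been collected by aligning training-set recognition results with their K alternative candidates:

```python
from collections import defaultdict

def estimate_conf1(samples, k):
    """Estimate Conf1 = p(c = 1 | n) by relative frequencies.

    `samples` is a list of (n, correct) pairs from the training set:
    n = number of the K alternative candidates containing the word,
    correct = whether the top ranked word was right.
    Returns a list p[n] for n = 0 .. k."""
    hits, total = defaultdict(int), defaultdict(int)
    for n, correct in samples:
        total[n] += 1
        hits[n] += int(correct)
    return [hits[n] / total[n] if total[n] else 0.0 for n in range(k + 1)]
```

At test time, a recognised word observed n times among the candidates is simply assigned the confidence p[n].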

However, the assumption that the probability of a correct recognition is independent of the considered word may be too severe. There are words that are easy to recognise, while others are more difficult. Therefore, the second approximation, Conf2, explicitly considers the current word w. Bayes' rule is used to reformulate p(c|n, w):

p(c|n,w) = p(n|c,w) · p(c|w) / Σ_{x=0,1} p(n|x,w) · p(x|w)   (2)

We then simplify the right-hand side of Eq. (2) using the assumption that p(n|c,w) ≈ p(n|c) [31]. By this approximation we finally obtain confidence measure Conf2 as follows:

Conf2 = p(n|c) · p(c|w) / Σ_{x=0,1} p(n|x) · p(x|w)   (3)

All probabilities occurring in Eqs. (1) and (3) are estimated using relative frequencies obtained from the training set during the training phase. If there are not enough training samples of a word w to estimate p(c|w), Conf2 is not applicable and Conf1 is used instead.
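A direct transcription of Eq. (3), assuming the probability tables have already been estimated from training-set relative frequencies:

```python
def conf2(n, w, p_n_given_c, p_c_given_w):
    """Eq. (3): Conf2 = p(n|c=1) p(c=1|w) / sum_x p(n|x) p(x|w).

    p_n_given_c[c][n] -- probability of observing the word n times in
                         the K candidates, given correctness c in {0, 1}
    p_c_given_w[w][c] -- word-specific prior of being (in)correct
    """
    num = p_n_given_c[1][n] * p_c_given_w[w][1]
    den = sum(p_n_given_c[x][n] * p_c_given_w[w][x] for x in (0, 1))
    return num / den if den > 0 else 0.0
```

The tables here are illustrative stand-ins for the frequencies estimated during the training phase.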

4. Ensemble generation

To implement an ensemble method, two main issues have to be addressed. First, an ensemble creation strategy must be defined to generate multiple classifiers. The second issue is to find an appropriate combination method that enables us to fuse the results of the individual classifiers and to derive the final result. This section addresses the ensemble creation problem, whereas the combination methods are discussed in Section 5.

Diversity among the individual ensemble members is an important aspect of obtaining good results with ensemble methods [36,37]. Since the goal is to correct the errors of one ensemble member with the output of other ensemble members, we need a certain diversity among the classifiers in the ensemble. Intuitively speaking, the members should make as few coincident errors as possible.

Basically, there are two different strategies to automatically create multiple diverse classifiers. Under the first strategy, the training data are altered, while in the second, the architecture of the recognisers varies. Many different methods have been proposed to generate multiple classifiers by supplying the classifier with different training data. The best known among these methods are k-fold cross validation [38], bagging [39], boosting [40], and the random feature subspace method [41]. Much less research has been conducted on using different classifier architectures, or varying parts of the classifiers. Typically such variations are problem specific and not as generally applicable as the use of different training sets. As an example of varying the classifier architecture, in Ref. [42] the number of hidden neurons is changed to produce multiple artificial neural networks.

In the present paper, we investigate two methods that alter the training data, namely bagging and random feature subspace, and one method where the architecture is altered such that the integration of the underlying statistical language model differs for the different



ensemble members.1 To determine the final ensemble, we use an overproduce-and-select strategy [10] by applying a greedy forward search to select the individual ensemble members.

4.1. Bagging

Bagging is an acronym for bootstrap aggregating. It was introduced in Ref. [39] and was among the first methods proposed for ensemble creation. The ensemble contains classifiers trained on bootstrap replicas of the training set.

Given a training set S of size N, the bagging method builds n new training sets S_1, ..., S_n, each of size N, by randomly choosing elements of the original training set. The same element may be chosen multiple times. If all elements are chosen with equal probability, only about 63.2% of all training elements are included in each training set S_i on average.

A recogniser R_i is then trained for each of the generated sets S_i. Thus, an ensemble of n classifiers is obtained from the bagging procedure. The diversity of the recognisers, needed for the ensemble to work properly, originates from the differences among the training sets.
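The bootstrap replica construction can be sketched in a few lines; for large N, each replica covers roughly 63.2% (1 − 1/e) of the distinct training elements:

```python
import random

def bagging_sets(training_set, n, seed=0):
    """Build n bootstrap replicas of the training set, each of the
    original size N, sampling with replacement."""
    rng = random.Random(seed)
    N = len(training_set)
    return [[rng.choice(training_set) for _ in range(N)] for _ in range(n)]
```

Each replica would then be used to train one recogniser R_i of the ensemble.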

4.2. Random feature subspace

In the random feature subspace method [41], the individual recognisers use only a subset of all available features for training and testing. These subsets are chosen randomly with a fixed size d. The only constraint is that the same subset must not be used more than once.

For the handwriting recognition system we use, only nine features are available, which is a rather low number. The dimension d of the subsets is set to six. This number turned out to be optimal in a similar experiment in the field of handwritten word recognition [16].

It is worth noting that the total amount of data used to train a single recogniser is approximately the same with the feature subspace and with the bagging method, i.e. about 65% of the original training data. However, when it comes to the test data, only six out of nine features are used by a single feature subspace recogniser, whereas a single bagging recogniser uses all available data of a text line.
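Drawing the feature subsets can be sketched as follows; the constraint that no subset is used twice is enforced with a set of already drawn subsets:

```python
import random
from math import comb

def feature_subspaces(n_features, d, n_sets, seed=0):
    """Draw n_sets distinct feature subsets of size d out of
    n_features available features (no subset used more than once)."""
    if n_sets > comb(n_features, d):
        raise ValueError("not enough distinct subsets available")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n_sets:
        # sorted tuple makes subsets order-independent for uniqueness
        seen.add(tuple(sorted(rng.sample(range(n_features), d))))
    return [list(s) for s in sorted(seen)]
```

With nine features and d = 6 as in the paper, at most C(9, 6) = 84 distinct recognisers can be generated this way.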

4.3. Language model integration variation

One possible way to create multiple recognition results by architecture modification is to alter the integration of the statistical language model [18]. It has been shown that the parts of a recognised word sequence that are sensitive to changes in the underlying language model are often recognised incorrectly [32,33]. For those parts we are seeking alternative interpretations to improve the recognition rate.

For an HMM-based recognition system with an integrated language model, such as the one used in this work, the most likely word sequence W = (w_1, ..., w_m) for a given observation sequence X is computed as follows:

W = argmax_W { log p(X|W) + α log p(W) + m·β }   (4)

According to Eq. (4), the likelihood of the optical model p(X|W), which is the result of the HMM decoding, is combined with the likelihood p(W) obtained from the language model. Because HMM decoding and language model merely produce approximations of probabilities, we use two additional parameters, α and β, to control

1 We did not apply boosting for reasons of computational complexity. Bagging and random feature subspace recognisers can be trained in parallel on different machines, whereas boosting requires sequential training. Furthermore, the entire training set has to be decoded for each recogniser in the ensemble when boosting is applied.

Fig. 3. Multiple recognition results derived from language model integration variations. Specific parameters α and β are used in Eq. (4) to build diverse recognisers.

the integration of the language model. The parameter α is called the grammar scale factor and weights the impact of the statistical language model. The term word insertion penalty is used for parameter β, which controls the segmentation rate of the recogniser. A higher value of β results in more individual words being output by the recogniser.

By varying the parameters α and β, various recognition results can be produced from the same image of a handwritten text. To obtain n recognition results, we choose n different parameter pairs (α_i, β_i), where i = 1, ..., n.

An example of recognisers obtained by varying the integration of a language model appears in Fig. 3. Multiple recognition results are produced for the handwritten text "Barry and Eric have enthusiasm". This example provides a good illustration of the impact of the two parameters α and β. If parameter α increases, nonsense word sequences are eliminated. On the other hand, the average number of words (including punctuation marks) increases as the value of β gets larger.
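The effect of the two parameters can be illustrated by scoring candidate sequences with Eq. (4); the candidate sequences and log likelihoods below are invented for illustration:

```python
def combined_score(log_p_x_given_w, log_p_w, m, alpha, beta):
    """Score of Eq. (4) for one candidate word sequence: optical log
    likelihood + grammar scale factor * LM log likelihood
    + word insertion penalty * sequence length m."""
    return log_p_x_given_w + alpha * log_p_w + m * beta

def best_sequence(candidates, alpha, beta):
    """candidates: list of (words, log_p_x_given_w, log_p_w) tuples;
    returns the word sequence maximising Eq. (4)."""
    return max(candidates,
               key=lambda c: combined_score(c[1], c[2], len(c[0]),
                                            alpha, beta))[0]
```

Running `best_sequence` with different (α_i, β_i) pairs over the same candidates mimics how the diverse ensemble members of Fig. 3 arise.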

4.4. Ensemble member selection

To optimise the composition of the ensemble, we apply an ensemble member selection strategy. The idea is not to use all available recognisers, but only those recognisers that add a benefit to the ensemble. This method is known as overproduce-and-select [10].

On a validation set, we apply a greedy forward search to find the optimised ensemble. First, the individual recogniser which performs best is selected as the first ensemble member. Then, for each of the remaining recognisers, we tentatively add it to the selected ensemble members and measure the performance of the new ensemble. The best performing ensemble is saved and used for continuation. Iteratively, we add the best remaining individual recogniser to the ensemble. Thus, at each iteration the ensemble size increases by one. We continue until the last available recogniser has been added. Then, we determine the best performing ensemble among all generated ensembles, which, in general, does not contain all available recognisers. This best performing ensemble is used as the final ensemble. Note that with this ensemble member selection strategy, we also optimise the ensemble size n.

It is worth noting that, because of its greedy nature, the forward search does not provide an optimal ensemble in general. However, given m recognisers, a complete search would require all 2^m − 1 non-empty ensembles to be evaluated, which is only feasible for very small values of m. The proposed greedy forward search substantially reduces the number of ensembles to be evaluated to Σ_{k=1}^{m} k = m(m+1)/2. The experimental results of previous work have shown that this search procedure provides a reasonable trade-off between computational cost, combination accuracy, and ensemble diversity [43]. In order to avoid overfitting on the test set, the search should be conducted on an independent validation set.
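A sketch of the greedy forward search; `evaluate` stands in for the validation-set accuracy of a combined ensemble (a hypothetical black box here):

```python
def greedy_forward_selection(recognisers, evaluate):
    """Overproduce-and-select via greedy forward search.

    `recognisers` is a list of recogniser ids; `evaluate(subset)`
    returns the validation accuracy of the combined subset.
    Returns the best ensemble seen over all iterations and its score."""
    remaining = list(recognisers)
    selected = []
    best_ens, best_acc = None, float("-inf")
    while remaining:
        # tentatively add each remaining recogniser; keep the best
        cand = max(remaining, key=lambda r: evaluate(selected + [r]))
        selected = selected + [cand]
        remaining.remove(cand)
        acc = evaluate(selected)
        if acc > best_acc:
            best_acc, best_ens = acc, list(selected)
    return best_ens, best_acc
```

Because the best ensemble over all iterations is returned, the selected ensemble size n is optimised as a side effect, as noted above.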



5. Result combination

Many different methods for decision-level classifier combination have been proposed in the literature [10,44]. They depend on the type of output produced by the individual classifiers. If the output is only the best-ranked class, voting can be applied. More sophisticated combination schemes look at dependencies between classifiers in the so-called behaviour knowledge space [12]. If the classifiers' output is a ranked list of classes, Borda count or related methods can be applied [45]. In the most general case, a classifier outputs a confidence value for each recognised class. This confidence value is then considered in the combination decision, e.g. in the confidence-based voting approach described in Section 5.2.

The combination of handwritten text line recognisers is different from most other classifier combination problems because the output of a recogniser consists of a sequence of classes rather than just a single class. Standard classifier combination rules, as discussed above, are not directly applicable to the problem of text line combination. Because of segmentation errors, it cannot be assumed that the sequences produced by the different recognisers all have the same length. Therefore, some synchronisation mechanism is needed. It has been proposed to use dynamic programming techniques to align the individual outputs of the recognisers. However, this topic is still under research, and only a few solutions have been reported in the handwriting recognition literature [14,17,18,46].

The combination method we apply in the present paper is based on the recogniser output voting error reduction (ROVER) algorithm. ROVER was developed in the domain of continuous speech recognition [20] and has become a standard tool for the combination of multiple sequence results [46,47]. The combination based on ROVER can be divided into an alignment and a voting phase, which are described next.

5.1. Word sequence alignment

In the alignment phase, we have to find an alignment of n word sequences. For computational reasons, a sub-optimal incremental alignment algorithm, which operates sequentially, is applied. At the beginning, the first two sequences are aligned using a standard string matching algorithm [35]. The result of this alignment is a word transition network (WTN). The third word sequence is then aligned with this WTN, resulting in a new WTN, which next is aligned with the fourth word sequence, and so on. We refer to Ref. [20] for further details.

The iterative alignment procedure does not guarantee an optimal solution with minimal edit cost, as the alignment is affected by the order in which the word sequences are considered. In practice, however, the sub-optimal alignment often provides an adequate trade-off between computational complexity and alignment accuracy.
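The pairwise alignment step underlying this procedure can be sketched as a standard edit-distance computation followed by a backtrace that pads gaps with a null symbol. The sketch below is a simplification: it aligns two plain word sequences rather than a sequence against a WTN, and the function name and null symbol are illustrative, not taken from the original system.

```python
# Minimal sketch: align two word sequences by edit distance (insertion,
# deletion, and substitution all cost 1) and return aligned columns,
# padding gaps with a null symbol.
EPS = "<eps>"  # stands in for the null transition arc

def align(seq1, seq2):
    n, m = len(seq1), len(seq2)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (seq1[i - 1] != seq2[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace: recover one optimal alignment as a list of word pairs.
    i, j, columns = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (seq1[i - 1] != seq2[j - 1]):
            columns.append((seq1[i - 1], seq2[j - 1])); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            columns.append((seq1[i - 1], EPS)); i -= 1
        else:
            columns.append((EPS, seq2[j - 1])); j -= 1
    return columns[::-1]
```

Aligning the third and later sequences against the growing WTN works analogously, except that the substitution cost compares a word against the set of words on a WTN segment.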

An example of multiple sequence alignment using ROVER appears in Fig. 4. Given the image of the handwritten text the mouth-organ, the recognisers R1, R2, and R3 produce three different results. In the first step, the results of R1 and R2 are aligned in a single WTN. Subsequently, the result of R3 is aligned with this WTN. Note that because the result of R3 contains a word that does not appear in the output of R1 and R2, a null transition arc ε must be added to the WTN. Note furthermore that punctuation marks are treated in the same manner as full words.

5.2. Confidence-based voting

The voting phase fuses the different word sequences once they are aligned in a WTN. The goal is to identify the best scoring word sequence in the WTN and extract it as the final result.

Fig. 4. Example of an iterative alignment of multiple recognition results.

The decisions about the word to be finally output are made individually for each segment of the WTN. Thus, neither of the adjacent segments has an effect on the current segment. The decision depends on the size n of the ensemble, on the number of occurrences m_w of a word w in the current segment, and on the confidence value c_w of word w. The confidence value c_w is defined as the maximum confidence² among all occurrences of w at the current position in the WTN. For each occurring word class w, we calculate the score s_w as follows:

s_w = λ · (m_w / n) + (1 − λ) · c_w    (5)

As a final result for the current segment, we select the word class w with the highest score s_w.
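As an illustration, the per-segment decision of Eq. (5) can be sketched as follows; the function and variable names are ours, and the null-transition confidence is passed in as a fixed constant:

```python
from collections import defaultdict

EPS = "<eps>"  # null transition arc in the WTN

def vote_segment(candidates, n, lam, c_eps):
    """candidates: list of (word, confidence) pairs occurring in one WTN
    segment; n: ensemble size; lam: weight lambda of Eq. (5)."""
    occurrences = defaultdict(int)
    confidence = defaultdict(float)
    for word, conf in candidates:
        occurrences[word] += 1
        # c_w is the maximum confidence among all occurrences of w;
        # null transitions get the fixed confidence c_eps instead.
        confidence[word] = max(confidence[word], c_eps if word == EPS else conf)
    scores = {w: lam * occurrences[w] / n + (1 - lam) * confidence[w]
              for w in occurrences}
    # Select the word class with the highest score s_w.
    return max(scores, key=scores.get)
```

With lam = 1 the confidence term vanishes and the function reduces to simple plurality voting.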

To apply Eq. (5), we experimentally determine the value of λ. Parameter λ weights the impact of the number of occurrences against the confidence measure c_w. Additionally, we determine the confidence measure c_ε for null transition arcs, because no intrinsic confidence score is associated with a null transition ε. For this purpose, we evaluate various values of λ and c_ε on a validation set.

A special case of confidence-based voting occurs if λ = 1 in Eq. (5). Then, the confidence measure c_w does not have any impact on s_w, which means that simple plurality voting is applied.

6. Experiments and results

In this section we present an experimental evaluation of the proposed ensemble methods. The base recogniser used for the experiments is the HMM-based recognition system described in Section 3.

6.1. Experimental setup

All experiments reported in this section are conducted on handwritten text lines from the IAM database³ [48]. A writer-independent task has been considered, which implies that no information about the writers who contributed to the test set is available during the training and validation phase. The training set consists of 6161 text lines written by 283 writers; 56 writers have contributed 920 text lines to the validation set, and the test set contains 2781 text lines produced by 161 writers.

2 The maximum rule performed better than other rules, e.g. the average rule, in preliminary experiments; this is consistent with the findings reported in Ref. [20].

3 The IAM database is publicly available for download at http://www.iam.unibe.ch/fki/databases/iam-handwriting-database.


Fig. 5. Validation of the ensemble size and composition with greedy forward search for each ensemble generation strategy and each combination method.

The statistical language model is based on three different corpora: the LOB corpus [49], the Brown corpus [50], and the Wellington corpus [51]. A bigram language model is built for each of the corpora. These bigram models are then linearly combined with optimised mixture weights [52].
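A linear combination of the corpus-specific bigram models can be sketched as follows. The dictionary-based representation and names are illustrative, and the mixture weights are assumed to be tuned on held-out data as in Ref. [52]:

```python
def combine_bigrams(models, weights):
    """models: list of dicts mapping (previous_word, word) -> probability;
    weights: mixture weights, assumed to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9

    def prob(prev, word):
        # Linearly interpolated bigram probability over all corpora.
        return sum(mu * m.get((prev, word), 0.0)
                   for mu, m in zip(weights, models))

    return prob
```

In practice each component model would itself be smoothed (e.g. with backing-off [27]) rather than assigning zero probability to unseen bigrams.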

The underlying lexicon consists of the 20,000 most frequent words that occur in the corpora. The lexicon has not been closed over the test set, i.e. there are out-of-vocabulary words in the test set that do not occur among the 20,000 most frequent words included in the lexicon. This scenario is more realistic than a closed lexicon, because the texts in the test set are usually unknown in advance. Our test set contains 6.3% out-of-vocabulary words. This results in a word level accuracy of at most 93.7%, even assuming perfect recognition.

As a reference system we train and optimise a single recogniser. The training is conducted on the entire training data. The integration of the statistical language model is optimised as described in Ref. [8].

Twenty-four ensemble members are generated with each of the ensemble generation methods described in Section 4, i.e. bagging, random feature subspace, and language model integration variation. The ensembles generated with a single method will be called single-source ensembles. For each single-source ensemble member, we calculate the confidence measures described in Section 3.3. In addition to the three single-source ensembles, we build a multi-source ensemble by including ensemble members from all three single ensemble generation methods. The motivation for multi-source ensembles is that a higher diversity among the ensemble members can be expected if they are generated with different procedures.

The combination is performed with the ROVER-based algorithm as described in Section 5. After the alignment, the decisions are made for each aligned segment. First, only plurality voting is applied for combination. Then, the three confidence measures, i.e. the likelihood-based confidence and the alternative-based confidences Conf1 and Conf2, are used for a more sophisticated decision.

The recognition performance is measured in terms of word level accuracy. This accuracy is defined as the number of correctly recognised words minus the number of insertions, i.e. additionally recognised words, divided by the total number of words in the transcription. A 100% word level accuracy is only reached if the recognition result matches the transcription word by word.
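This accuracy measure can be computed from an edit-distance alignment between the recognition result and the transcription; the sketch below counts correct words and insertions directly from the dynamic-programming backtrace (names are ours):

```python
def word_accuracy(hypothesis, reference):
    """(correct - insertions) / len(reference), via Levenshtein alignment."""
    n, m = len(reference), len(hypothesis)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace, counting correct words and insertions.
    i, j, correct, insertions = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            correct += reference[i - 1] == hypothesis[j - 1]
            i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            i -= 1                      # deletion: a reference word was missed
        else:
            insertions += 1; j -= 1     # insertion: an extra hypothesis word
    return (correct - insertions) / n
```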

6.2. Optimisation on the validation set

The validation set is used to optimise the integration of the bigram language model, to train the probabilities of the alternative-based confidence measures, to validate the ROVER parameters, and to optimise the ensemble size and composition.

The grammar scale factor α and the word insertion penalty β, which control the integration of the statistical language model, are optimised for each ensemble member as well as for the reference system. The optimised values are found by systematically testing values α ∈ [0, 100] and β ∈ [−100, 250] on the validation set.
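This optimisation is a plain grid search over the two parameters. A sketch follows; the callback, function names, and the use of α/β as parameter symbols are our assumptions:

```python
def tune_lm_integration(validation_accuracy, alpha_values, beta_values):
    """validation_accuracy(alpha, beta) -> word accuracy on the validation
    set when decoding with grammar scale factor alpha and word insertion
    penalty beta. Returns the best (alpha, beta, accuracy) triple."""
    best = (None, None, float("-inf"))
    for alpha in alpha_values:
        for beta in beta_values:
            acc = validation_accuracy(alpha, beta)
            if acc > best[2]:
                best = (alpha, beta, acc)
    return best
```

The same exhaustive pattern applies to the ROVER parameters λ and c_ε mentioned in Section 5.2.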

For the confidence measures of Section 3.3.2, the probabilities p(c|n), p(n|c), and p(c|w) are estimated by calculating the relative frequencies on the validation set.⁴ To get reliable estimates of p(c|w),

4 For computational complexity reasons only the validation set is used to estimate these probabilities. With this approach we avoid the expensive decoding of the large training set.


Table 1
Recognition accuracy of the different ensemble methods on the validation set

                   Voting   Likelihood   Conf1   Conf2
Bagging            71.51    71.51        71.96   72.23
Random subspace    71.49    71.44        71.56   71.74
LM variations      68.34    68.52        68.73   69.39
Multi-source       72.60    72.70        72.85   73.15

The reference system achieves an accuracy of 69.02%.

Table 2
Ensemble size of the different ensemble methods with a given confidence measure

                   Voting   Likelihood   Conf1   Conf2
Bagging            12       11           11      13
Random subspace    18       18           15      13
LM variations      7        12           5       7
Multi-source       23       21           27      27

The ensemble size is optimised on the validation set.

Table 3
Recognition accuracy of the different ensemble methods on the test set

                   Voting   Likelihood   Conf1   Conf2
Bagging            65.63    65.67        65.88   66.37
Random subspace    65.08    65.32        65.53   65.39
LM variations      63.38    63.13        63.50   63.83
Multi-source       66.73    66.50        66.66   67.17
Reference system   64.48

The reference system achieves an accuracy of 64.48%. The best performing ensemble that uses a single ensemble generation method is obtained by bagging with Conf2. Using multiple ensemble generation methods with Conf2 further improves this accuracy to 67.17%.

we define a minimum of thirty samples that must be available for estimation. For all other words, Conf2 is backed off to Conf1.

The greedy ensemble member selection described in Section 4.4 is applied for each ensemble generation method and the different confidence measures. During this selection, the parameters λ and c_ε of the ROVER combination are optimised for each validated ensemble. The results of the ensemble selection method appear in Fig. 5. Independent of the ensemble generation strategy, confidence measure Conf2 outperforms the other confidence measures and plurality voting. Bagging provides the best performing single-source ensembles. We note that ensembles generated by language model variation do not perform well. The highest accuracy on the validation set is obtained if we use all three sources to build ensembles. The optimised performance of the ensembles on the validation set is summarised in Table 1. However, because many system parameters have been optimised on the validation set, the explanatory power of the differences between the individual approaches on the validation set, including the single base recogniser used as reference system, is limited.
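The greedy forward search over ensemble members can be sketched as follows; evaluate_ensemble stands for decoding the validation set with the combined subset and is an assumed callback:

```python
def greedy_forward_selection(members, evaluate_ensemble):
    """Iteratively add the member that most improves the combined
    validation accuracy; stop when no addition helps."""
    selected, remaining = [], list(members)
    best_accuracy = float("-inf")
    improved = True
    while improved and remaining:
        improved, best_member = False, None
        for member in remaining:
            accuracy = evaluate_ensemble(selected + [member])
            if accuracy > best_accuracy:
                best_accuracy, best_member, improved = accuracy, member, True
        if best_member is not None:
            selected.append(best_member)
            remaining.remove(best_member)
    return selected, best_accuracy
```

Because each candidate subset must be combined and re-scored on the validation set, the cost of this search grows with both the ensemble pool size and the validation set size.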

Table 2 shows the optimised ensemble sizes. Ensembles generated with language model variations reach their optimised performance with only 5--12 members. In contrast, random feature subspace methods require 13--18 ensemble members to perform well. As expected, multi-source ensembles have more diversity among the members; thus, adding more recognisers to these ensembles yields further gains in recognition performance.

6.3. Test set results

The results on the test set appear in Table 3. The reference system is an optimised single recogniser that achieves 64.48% accuracy. Similar to the validation set, confidence measure Conf2 outperforms the other decision methods, except for random feature subspace ensembles. If only a single ensemble generation method is considered, the bagging method clearly outperforms random feature subspace and language model variation ensembles. Bagging with Conf2 achieves 66.37% recognition accuracy and is the best performing single-source ensemble. Combining recognisers obtained by language model variation does not improve performance over the reference system. However, if these recognisers are combined with bagging and random feature subspace recognisers in multi-source ensembles, accuracy further improves, up to 67.17% when Conf2 is used. The multi-source ensemble combined with plain voting achieves a considerably high accuracy of 66.73%, which is an indication that the more diverse the recognisers are, the less important the confidence measure becomes. All improvements over the reference system are statistically significant at the 5% significance level.

7. Conclusions

This paper provides a systematic investigation and comparison of different ensemble methods for offline handwritten text line recognition. We discussed the underlying handwriting base recogniser, ensemble creation methods, and result combination schemes. Experimental evaluation on a large data set of handwritten text lines shows the effectiveness of the proposed methods.

The handwritten text line recognisers are based on hidden Markov models. After image normalisation, a sliding window method extracts a feature vector sequence from the image of a handwritten text. Each character is modelled with an individual number of states. Next, based on a lexicon, word models are built. Finally, a bigram language model enables us to build text line models. Three different confidence measures are computed in a post-processing step.

From a single base recogniser, we automatically generate ensembles with random feature subspaces, bagging, and variation of the integration of the statistical language model. The combination is performed with the ROVER combination framework, investigating the suitability of the different confidence measures to the combination task. Because all ensemble members are derived from the same base recogniser, we can expect them to be quite similar. On the other hand, we can improve recognition performance without the expensive development of additional base recognisers.

Experimental evaluation on a large set of handwritten text lines indicates that the proposed ensemble methods improve recognition accuracy compared to a single recognition system. If only a single ensemble generation strategy is considered, bagging performs best. Further improvements are achieved if multiple ensemble generation methods are used to build an ensemble. Alternative candidate-based confidence measures outperform likelihood-based confidence measures and simple plurality voting for most ensembles.

Although the proposed ensemble methods for hidden Markov model-based recognisers achieve significant improvements in the difficult task of recognising entire handwritten text lines, further investigations are needed in order to achieve recognition results comparable to the human ability to read. Improvements should be achieved at different levels, i.e. at the image preprocessing level to reduce variability in a writer-independent system, at the feature extraction level where additional features and feature transformations can be investigated, and at the recognition level where additional recognition approaches based on neural networks and conditional random fields should be systematically evaluated. Finally, the postprocessing level, including confidence measure calculation and ensemble methods, has been addressed only recently and therefore there is still much room for improvement.

Future work could include the application of boosting to the considered task. The problem is that the training of boosted ensembles cannot be executed in parallel, and therefore it is computationally extremely expensive. Additionally, a combination of the different


ensemble generation methods could be considered to build a single ensemble member. Furthermore, developing a strategy to include the language model in the combination process would be promising. With the standard ROVER approach, the decisions are made independently for each segment of the word transition network. Thus, language model information is lost. Finally, it would be interesting to analyse diversity for the considered ensemble methods in greater detail.

Acknowledgements

This research was supported by the Swiss National Science Foundation (Nr. 200020-19124/1). The authors thank Dr. Matthias Zimmermann for providing the statistical language model.

References

[1] S. Impedovo, L. Ottaviano, S. Occhiegro, Optical character recognition---a survey, Int. J. Pattern Recognition Artif. Intell. 5 (1991) 1--24.

[2] C. Suen, C. Nadal, R. Legault, T. Mai, L. Lam, Computer recognition of unconstrained handwritten numerals, Proc. IEEE 80 (7) (1992) 1162--1180.

[3] S. Impedovo, P. Wang, H. Bunke (Eds.), Automatic Bankcheck Processing, World Scientific, Singapore, 1997.

[4] A. Brakensiek, G. Rigoll, Handwritten address recognition using hidden Markov models, in: A. Dengel, M. Junker, A. Weisbecker (Eds.), Reading and Learning, Springer, Berlin, 2004, pp. 103--122.

[5] G. Kim, V. Govindaraju, S. Srihari, An architecture for handwritten text recognition systems, in: S.-W. Lee (Ed.), Advances in Handwriting Recognition, World Scientific, Singapore, 1999, pp. 163--172.

[6] A. Senior, A. Robinson, An off-line cursive handwriting recognition system, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 309--321.

[7] A. Vinciarelli, S. Bengio, H. Bunke, Offline recognition of unconstrained handwritten texts using HMMs and statistical language models, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 709--720.

[8] M. Zimmermann, J.-C. Chappelier, H. Bunke, Offline grammar-based recognition of handwritten sentences, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 818--821.

[9] B.V. Dasarathy, Decision Fusion, IEEE Computer Society Press, Los Alamitos, CA, USA, 1994.

[10] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, NY, 2004.

[11] K. Sirlantzkis, M. Fairhurst, M. Hoque, Genetic algorithms for multi-classifier system configuration: a case study in character recognition, in: 2nd International Workshop on Multiple Classifier Systems, Cambridge, England, Lecture Notes in Computer Science, vol. 2096, Springer, Berlin, 2001, pp. 99--108.

[12] Y. Huang, C. Suen, A method of combining multiple experts for the recognition of unconstrained handwritten numerals, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1) (1995) 90--94.

[13] L.S. Oliveira, M. Morita, R. Sabourin, Feature selection for ensembles applied to handwriting recognition, Int. J. Document Anal. Recognition 8 (4) (2006) 262--279.

[14] X. Ye, M. Cheriet, C.Y. Suen, StrCombo: combination of string recognizers, Pattern Recognition Lett. 23 (2002) 381--394.

[15] P. Gader, M. Mohamed, J. Keller, Fusion of handwritten word classifiers, Pattern Recognition Lett. 17 (1996) 577--584.

[16] S. Gunter, H. Bunke, Ensembles of classifiers for handwritten word recognition, Int. J. Document Anal. Recognition 5 (4) (2003) 224--232.

[17] U.-V. Marti, H. Bunke, Use of positional information in sequence alignment for multiple classifier combination, in: J. Kittler, F. Roli (Eds.), 2nd International Workshop on Multiple Classifier Systems, Cambridge, England, Lecture Notes in Computer Science, vol. 2096, Springer, Berlin, 2001, pp. 388--398.

[18] R. Bertolami, H. Bunke, Multiple handwritten text recognition systems derived from specific integration of a language model, in: Proceedings of the 8th International Conference on Document Analysis and Recognition, Seoul, Korea, vol. 1, 2005, pp. 521--524.

[19] R. Bertolami, H. Bunke, Multiple classifier methods for offline handwritten text line recognition, in: M. Haindl, J. Kittler, F. Roli (Eds.), 7th International Workshop on Multiple Classifier Systems, Prague, Czech Republic, Lecture Notes in Computer Science, vol. 4472, Springer, Berlin, 2007, pp. 72--81.

[20] J. Fiscus, A post-processing system to yield reduced word error rates: recognizer output voting error reduction, in: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, 1997, pp. 347--352.

[21] U.-V. Marti, H. Bunke, Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system, Int. J. Pattern Recognition Artif. Intell. 15 (2001) 65--90.

[22] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257--286.

[23] J.F. Pitrelli, J. Subrahmonia, M.P. Perrone, Confidence modeling for handwriting recognition: algorithms and applications, Int. J. Document Anal. Recognition 8 (1) (2006) 35--46.

[24] G. Fink, Markov Models for Pattern Recognition, From Theory to Applications, Springer, Heidelberg, 2007.

[25] A.H. Toselli, V. Romero, E. Vidal, L. Rodriguez, Computer assisted transcription of handwritten text images, in: Proceedings of the 9th International Conference on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 944--948.

[26] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimal decoding algorithm, IEEE Trans. Inform. Theory 13 (2) (1967) 260--269.

[27] R. Kneser, H. Ney, Improved backing-off for m-gram language modeling, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Detroit, USA, 1995, pp. 181--184.

[28] N. Gorski, Optimizing error-reject trade off in recognition systems, in: Proceedings of the 4th International Conference on Document Analysis and Recognition, vol. 2, Ulm, Germany, 1997, pp. 1092--1096.

[29] J. Pitrelli, M.P. Perrone, Confidence-scoring post-processing for off-line handwritten-character recognition verification, in: Proceedings of the 7th International Conference on Document Analysis and Recognition, vol. 1, Edinburgh, Scotland, 2003, pp. 278--282.

[30] A.L. Koerich, Rejection strategies for handwritten word recognition, in: Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition, Tokyo, Japan, 2004, pp. 479--484.

[31] A. Sanchis, V. Jimenez, E. Vidal, Efficient use of the grammar scale factor to classify incorrect words in speech recognition verification, in: Proceedings of the International Conference on Pattern Recognition, vol. 3, Barcelona, Spain, 2000, pp. 278--281.

[32] T. Zeppenfeld, M. Finke, K. Ries, M. Westphal, A. Waibel, Recognition of conversational telephone speech using the JANUS speech engine, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 1997, pp. 1815--1818.

[33] R. Bertolami, M. Zimmermann, H. Bunke, Rejection strategies for offline handwritten text line recognition, Pattern Recognition Lett. 27 (16) (2006) 2005--2012.

[34] M. Zimmermann, R. Bertolami, H. Bunke, Rejection strategies for offline handwritten sentence recognition, in: Proceedings of the 17th International Conference on Pattern Recognition, vol. 2, Cambridge, England, 2004, pp. 550--553.

[35] R. Wagner, M. Fischer, The string-to-string correction problem, J. ACM 21 (1) (1974) 168--173.

[36] G. Brown, J. Wyatt, R. Harris, X. Yao, Diversity creation methods: a survey and categorisation, Inform. Fusion 6 (2005) 5--20.

[37] T. Windeatt, Diversity measures for multiple classifier system analysis and design, Inform. Fusion 6 (1) (2004) 21--36.

[38] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the International Joint Conference on Artificial Intelligence, 1995, pp. 1137--1145.

[39] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123--140.

[40] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to Boosting, in: Proceedings of the European Conference on Computational Learning Theory, 1995, pp. 23--37.

[41] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832--844.

[42] D. Partridge, W.B. Yates, Engineering multiversion neural-net systems, Neural Comput. 8 (4) (1996) 869--893.

[43] R. Bertolami, H. Bunke, Ensemble methods for handwritten text line recognition systems, in: Proceedings of the International Conference on Systems, Man and Cybernetics, Hawaii, USA, 2005, pp. 2334--2339.

[44] A. Rahmann, M. Fairhurst, Multiple expert classification: a new methodology for parallel decision fusion, Int. J. Document Anal. Recognition 3 (1) (2000) 40--55.

[45] T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1) (1994) 66--75.

[46] W. Wang, A. Brakensiek, G. Rigoll, Combination of multiple classifiers for handwritten word recognition, in: Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, Canada, 2002, pp. 117--122.

[47] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. Gadde, M. Plauche, C. Richey, E. Shriberg, K. Sonmez, J. Zheng, F. Weng, The SRI March 2000 Hub-5 Conversational Speech Transcription System, 2000.

[48] U.-V. Marti, H. Bunke, The IAM-database: an English sentence database for offline handwriting recognition, Int. J. Document Anal. Recognition 5 (2002) 39--46.

[49] S. Johansson, E. Atwell, R. Garside, G. Leech, The Tagged LOB Corpus, User's Manual, Norwegian Computing Center for the Humanities, Bergen, Norway, 1986.

[50] W.N. Francis, H. Kucera, Brown Corpus Manual. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers, Department of Linguistics, Brown University, Providence, RI, USA, 1979.

[51] L. Bauer, Manual of Information to Accompany the Wellington Corpus of Written New Zealand English, Department of Linguistics, Victoria University, Wellington, New Zealand, 1993.

[52] J. Goodman, A bit of progress in language modeling, Tech. Rep. MSR-TR-2001-72, Microsoft Research, 2001.


About the Author---ROMAN BERTOLAMI received his M.S. degree in Computer Science from the University of Bern, Switzerland, in 2004. The topic of his master thesis was the rejection of words in off-line handwritten sentence recognition. He is now employed as research and lecture assistant in the research group of computer vision and artificial intelligence at the University of Bern. His current research interests include the combination of multiple text line recognizers as well as confidence measures for handwritten text recognition systems.

About the Author---HORST BUNKE received his M.S. and Ph.D. degrees in Computer Science from the University of Erlangen, Germany. In 1984, he joined the University of Bern, Switzerland, where he is a professor in the Computer Science Department. He was Department Chairman from 1992 to 1996, Dean of the Faculty of Science from 1997 to 1998, and a member of the Executive Committee of the Faculty of Science from 2001 to 2003. From 1998 to 2000 Horst Bunke served as 1st Vice-President of the International Association for Pattern Recognition (IAPR). In 2000 he also was Acting President of this organization. Horst Bunke is a Fellow of the IAPR, former Editor-in-Charge of the International Journal of Pattern Recognition and Artificial Intelligence, Editor-in-Chief of the Journal Electronic Letters of Computer Vision and Image Analysis, Editor-in-Chief of the book series on Machine Perception and Artificial Intelligence by World Scientific Publ. Co., Advisory Editor of Pattern Recognition, Associate Editor of Acta Cybernetica and Frontiers of Computer Science in China, and former Associate Editor of the International Journal of Document Analysis and Recognition, and Pattern Analysis and Applications. Horst Bunke received an honorary doctor degree from the University of Szeged, Hungary. He was on the program and organization committee of many conferences and served as a referee for numerous journals and scientific organizations. He is a member of the Scientific Advisory Board of the German Research Center for Artificial Intelligence (DFKI). Horst Bunke has more than 550 publications, including 36 authored, co-authored, edited or co-edited books and special editions of journals.