
Recognition of Tibetan Wood Block Prints with

Generalized Hidden Markov and Kernelized Modified

Quadratic Distance Function

Fares Hedayati, Jike Chong, Kurt Keutzer

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-2010-138

http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-138.html

November 23, 2010


Copyright © 2010, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


Recognition of Tibetan Wood Block Prints with Generalized Hidden Markov and

Kernelized Modified Quadratic Distance Function

Fares Hedayati, Jike Chong, Kurt Keutzer

Abstract—Recognition of Tibetan wood block print is a difficult problem that has many challenging steps. We propose a two-stage framework involving image preprocessing, which consists of noise removal and baseline detection, and simultaneous character segmentation and recognition by the aid of a generalized hidden Markov model (also known as gHMM). For the latter stage, we train a gHMM and run the generalized Viterbi algorithm on our image to decode observations. There are two major motivations for using gHMM. First, it incorporates a language model into our recognition system, which in turn enforces grammar and disambiguates classification errors caused by printing errors and image noise. Second, gHMM solves the segmentation challenge. Simply put, a gHMM is an HMM whose emission model allows multiple consecutive observations to be mapped to the same state. For the features of our emission model, we apply line and circle Hough transforms to stroke detection, and use class-specific scaling for feature weighting. With gHMM, we find KMQDF to be the most effective distance metric for discriminating character classes. The accuracy of our system is 90.03%.

Index Terms—Document analysis, handwriting analysis, optical character recognition, pattern analysis


1 INTRODUCTION

There is a vast knowledge base of 14,000 volumes of Buddhist teaching stored in Tibetan wood-block print that is of great interest to scholars around the world. These ancient works are only available in physical reproductions and are difficult to access and search. There is great value and interest in bringing these ancient documents into searchable electronic form and making them accessible to everyone. The characters in wood blocks used for printing are individually carved, and contain large variations compared to modern machine prints. Neighboring characters are often merged; lines can be significantly skewed and sometimes touch one another. In many aspects, the recognition problem is as complex as handwriting recognition. There has been some previous work in optical character recognition of Tibetan machine print [1] [2] [3]. Masami et al. [1] proposed a character categorization technique that separately recognizes basic consonants, combination characters and vowels. In a later work, Masami et al. [1] used Euclidean distance with differential weights for the classification. In the work by Ding et al. [3], a modified quadratic discriminant function is used for classification, which better shapes the region that defines a class in the feature space. We leverage the techniques these previous works applied to machine-printed Tibetan, and extend them to wood-block print optical character recognition of Tibetan, which is as yet an unsolved

problem. We have encountered numerous cases in segmentation and classification where a higher-level language model is required to resolve ambiguities. This motivated us to use generalized hidden Markov models, which both solve the segmentation problem and make classification more accurate. Other approaches that are not based on gHMM are also presented here, along with their overall accuracies. We tried two other approaches beside gHMM for our OCR system, namely a segmentation-based approach and a thinning-based approach. The segmentation-based approach resulted in 68.08% accuracy, which is much lower than the 90.03% accuracy of the gHMM. For this approach, we used vertical histograms, DP space search, and local edges as segmentation points. With the addition of cost-based segmentation insertion for under-segmentation cases, we were able to achieve 92% segmentation accuracy. This shows that segmentation is still a challenge in this approach and the final accuracy of the system is highly dependent on the initial segmentation. Finally, the thinning-based approach is based on extracting the skeleton of the text. We tried three different thinning algorithms. These methods are discussed in Section 5.2. The feature extraction of the gHMM-based and segmentation-based OCRs and their distance functions are the same. This made testing the overall accuracy of these two approaches easier than that of the thinning-based approach. We did not investigate ways to extract discriminative and robust features


Fig. 1. General structure of Tibetan syllable (after Masami [1])

from the skeleton of the text. Researchers should come up with their own features. We only present the thinning algorithms here to give researchers insights into possible directions in Tibetan block print OCR.

2 SURVEY ON OCR AND OFF-LINE HANDWRITING RECOGNITION

Optical character recognition, abbreviated to OCR, is a field of research in computer vision, pattern recognition and digital image processing that converts images of typewritten, handwritten or printed text into machine-editable text. This process of conversion usually involves intelligent association of pieces of an image with their corresponding characters; hence OCR is an artificial intelligence research field. Off-line handwriting recognition, a field of research in OCR, is the ability of computers to translate human handwriting into machine-editable text. Off-line handwriting recognition is different from on-line handwriting recognition, as the latter involves recognition of text as it is written on a special digitizer or PDA by interpreting the pen-tip movements of users. The former, on the other hand, gets static images of text as input. Off-line handwriting recognition is a difficult task due to variation in people's handwriting and the amount of noise in handwritten documents. This has made researchers in the field focus on limited ranges of inputs. Consequently, research in this field has been moving in different directions specified by the range of inputs, and the state of the art should be studied separately for each of these ranges. Since Tibetan block prints are individually carved and printed, their recognition is an off-line handwriting recognition problem. As there has been no research on off-line recognition of Tibetan block prints (the only research that has ever been done on Tibetan OCR has been on printed Tibetan texts [1] [2] [3]), we state the state of the art of other, somewhat similar, off-line handwriting recognition problems. One of the simplest forms of off-line handwriting recognition is digit recognition. Handwritten digit recognition is a typical example of off-line handwriting recognition

that has been extensively researched. Four major independent methods have been used by researchers in the field [12]: shape context matching [13], pixel-to-pixel image matching with local contexts [15], invariant support vector machines [14], and convolutional neural nets with virtual data [16]. The accuracies of these methods fall above 99%. In other words, off-line digit recognition could be considered a solved problem, as digits misclassified by the aforementioned methods are hard for humans to recognize as well. The first two of the four approaches use the idea of deformable templates and correspondence between pixels of test and reference images. The other two methods use the machine learning techniques of convolutional neural networks and SVMs to classify digits. Off-line handwriting recognition of text is a much harder problem than digit recognition, primarily because of the segmentation problem and the number of classes to be recognized. The large number of classes is a substantial bottleneck in handwriting recognition; in most languages the number of characters is much larger than ten, and in Chinese, for instance, researchers are faced with thousands of characters to recognize. The other challenge is the segmentation problem. In contrast to digit recognition, which classifies digits in isolation, handwritten text recognition should classify all characters of an image in context. In most languages characters are not segmented, due to image noise and writing style. Even though research on optical character recognition of printed text has been carried out for most languages, off-line handwriting recognition for many languages is a new field of research. Arabic, Chinese and scripts with Roman alphabets are among the languages with substantial research on off-line handwriting recognition. We explain some of the common trends in these instances of handwriting recognition and the state of the art of each. 
For Chinese off-line recognition, segmentation and normalization are a common approach to preprocess the image before recognition [17]. Segmented characters are then recognized through stroke-based approaches, holistic approaches or radical approaches. The state of the art of Chinese off-line handwriting recognition is as follows: an accuracy of 98.1% for regular or hand-print scripts, 82.1% for fluent scripts, and 70.1% for cursive scripts [17]. Surveys on off-line cursive word recognition of Roman scripts show that three major approaches have been used in this field, namely holistic approaches, segmentation-based approaches and HMM-based approaches. Normalization of the image and some preprocessing are also common prior to the recognition step [18]. In segmentation-based approaches a word is first segmented into its constituent characters and each character is then recognized individually. Holistic approaches, on the other hand, recognize words as a whole without trying to break them into individual pieces. HMM-based approaches use hidden Markov models, whose language model


and emission model allow them to avoid segmentation. This method is adopted from speech recognition, despite the fact that observations in speech are one-dimensional while in text they are two-dimensional. There has been very little effort to develop two-dimensional HMMs or two-dimensional stochastic models, due to the complexity of these models; however, with increases in computational power these models could be of great use. HMM-based approaches and holistic approaches have also been a common trend in Arabic and Persian off-line handwriting recognition systems [11] [19]. Another trend in Arabic handwriting recognition has been the use of fuzzy constrained character graph models, which are fuzzily labeled graphs that represent characters [20]. This method has resulted in an accuracy of 73.6%.

3 PROPERTIES OF THE TIBETAN SCRIPT

The Tibetan character set is composed of 30 consonants and 4 vowels. We define a character to be a stack of consonants with optional vowels. A syllable is a group of characters with one essential consonant (EC) and 7 optional parts, as shown in Figure 1. There can be up to one consonant before the combined character (CC) with the EC, and up to two consonants after the CC. A stack with an EC is tightly packed together, and some consonants change form as they are stacked.

4 OCR CHALLENGES

The characters in wood blocks used for printing are individually carved, and contain large variations compared to modern machine prints. Neighboring characters are often merged; lines can be significantly skewed and sometimes touch one another. In many aspects, the recognition problem is as complex as handwriting recognition. Off-line recognition of wood block print Tibetan texts is a much more challenging task than that of printed Tibetan texts. In single-font systems, only a bitmap font mask for exact matching is required; in multi-font systems, on the other hand, structure- and contour-based feature extraction is required. Since each machine-printed character is well defined, with no interference from neighbors, segmentation is an easy task there. This is in contrast to wood block prints, where large width, height and proportion variations are common and strokes intrude from neighboring characters and lines. Character strokes merge with neighboring characters due to manual rubbing printing techniques. In wood block prints, characters merge with one another with no clear boundaries. Knowledge of character forms and language is required to disambiguate segmentation points. This problem defines the entire recognition framework. Tseks, the word delimiters in Tibetan, are clearly distinguishable in machine-printed texts, and corner cases are resolved by rule-based

Fig. 2. Printed versus wood block print

Fig. 3. Baselines aligned after recognition

syntax correction. The majority of tseks, on the other hand, are not optically visible in wood block prints. There is significant blurring with the baseline of neighboring characters, and in-depth knowledge of Tibetan vocabulary is required for effective tsek insertion. Figure 2 compares a line of wood block print with a line of machine-printed Tibetan text.

4.1 BASELINE DETECTION CHALLENGES

Baselines are not always straight and parallel. Furthermore, the distance between baselines is not uniform and varies from line to line and from page to page. Sometimes, due to noise or printing errors, the top of one line touches the bottom of another. All of this makes baseline detection hard. Figure 9 shows the challenges that baseline detection is faced with. After baseline detection, lines should be aligned equally (see Figure 3).

4.2 SEGMENTATION CHALLENGES

Segmentation of baselines into characters is one of the most challenging steps in off-line recognition of Tibetan block prints. Segmentation requires syntax knowledge; in a sense it is a chicken-and-egg problem. If we recognize the characters we can segment them easily, but we need to segment characters in order for them to be recognized (see Figure 2). Even though the majority of segmentation points can be detected by some heuristics (see Section 5.1), there are still cases where segmentation point detection is impossible without context knowledge (see Figure 4). Segmentation has a direct effect on recognition: cropped images with wrong segmentation points are not going to be recognized correctly, no matter how accurate and well-trained the recognition system is. To segment or not to segment is a critical decision that we have to make for our OCR system. Using generalized hidden Markov models, we can avoid independent segmentation and combine the step with recognition. See Section 5.3.


Fig. 4. Image in red box is under-segmented, the one in green is over-segmented

4.3 RECOGNITION CHALLENGES

As mentioned earlier at the beginning of Section 4, optical character recognition of Tibetan block prints is as hard as off-line handwriting recognition. Characters that are separated by gaps in machine-printed texts are not visually separable in Tibetan block prints. This makes recognition hard, because we have to know whether the image we are trying to recognize corresponds to a character, rather than to noise or to parts of multiple characters. Another challenge is the amount of noise and the variation in width, height, size and curvature of each character. Features that are invariant to noise, rotation, smear and other transformations are needed to overcome this challenge. In short, segmentation and invariant features are needed for off-line recognition of Tibetan block prints.

5 OVERVIEW OF THE RECOGNITION APPROACHES

5.1 Segmentation-based approaches

There are two approaches for a Tibetan recognition framework based on segmentation. The first is segmenting lines into characters and recognizing the characters (see Figure 6); the second is segmenting lines into characters, then into their constituent consonants and vowels, and then recognizing the consonants and vowels (see Figure 5).

The first approach involves discriminating approximately 500 common characters, whereas the second approach involves discriminating approximately 50 consonant and vowel symbols. Since characters, vowels and consonants are not completely separable in Tibetan block prints, due to noise, line slant, smearing of the ink, etc., the first approach seems more reasonable. Here we present the pipeline of the first approach. After detecting the baselines of an image, we segment each line into characters. For each of these cropped character images, features are extracted and fed into a distance function. The distance function finds the character class closest to its input and outputs that class. The feature extraction and distance function of our segmentation-based approach are exactly the same as those of the gHMM-based approach. For more information about

Fig. 5. Two stack characters segmented into vowels and consonants

them, please refer to Sections 10 and 11. In this section we present the segmentation algorithm of the method.

Segmenting a line into characters is a significant problem in Tibetan wood block prints. Unlike machine-printed Tibetan, character width and spacing are highly variable. Searching for white spaces in the vertical histogram finds only 35% of segmentation points (see Figure 6). Top vowels and bottom vowels frequently extend into the neighboring characters, and characters often merge into each other, sometimes at multiple points. By using dynamic programming to find points that connect the top and bottom white space, we obtain no more than 60% of the expected segmentation points. By extracting Tibetan-specific features, such as the tail-stroke at the right end of more than 80% of the characters, we capture up to 85% of the segmentation points (see Figure 9). The above heuristics do not find all segmentation points. By detecting gaps between segmentation points that are larger than the typical character width, we insert potential segmentation points not captured by the previous heuristics. For wide gaps, multiple segmentation scenarios are possible; for example, since a typical Tibetan wood block print character is 15-25 pixels wide at 300 dpi, a gap of 70 pixels in the detected segmentation points may contain 3 or 4 characters. We examine the potential segmentation sites under the different segmentation scenarios, and choose the scenario with the lowest average cost. The insertion cost is calculated by a discrimination function over the vertical histogram:

g(m) = (2V(m) − V(m−1) − V(m+1)) / (V(m) + 1)    (1)

where V(m) is the number of black pixels in the vertical strip at position x = m, and g(m) is the tsek insertion cost; segmentation points are the local minima of this function. Applying this technique increases the segmentation point detection rate to 92%. There remain many segmentation points that pose a significant challenge for image analysis techniques to detect: 1) a closely linked ga and shey near the end of a sentence, 2) neighboring characters touching at more than one place, and 3) ambiguous segmentation points that allow multiple segmentation possibilities. These require some higher-level language model to help discriminate.

5.2 Thinning Based Approach

In the thinning-based approach, features are extracted from the skeleton of the image. Extracting features


Fig. 6. Character segmentation

from the skeleton of an image has been a common approach for many OCR researchers. For instance, Khorsheed [11] uses a thinning algorithm similar to Stentiford [5] to extract the skeleton of images of Arabic manuscripts. He then traverses these skeletons and extracts different features from them; curvature, orientation, and the percentage of the skeleton above or below the baseline are some of these features. He finally feeds these features to a hidden Markov model trained to classify Arabic characters. Our initial goal was feature extraction from the skeleton. However, since the amount of noise in Tibetan block prints is very high, we ended up using our thinning algorithms for noise removal and smoothing only. We developed three methods for thinning Tibetan block prints. Figure 7 shows the outputs of these three methods, with the original image at the bottom. Even though we only used one of these algorithms, namely the third method, we present all three here for future researchers in the field. The first thinning algorithm is the most sophisticated of the three. It is based on contour detection and iterative shrinking of the image; its pseudo-code is given in Algorithm 1.

In Algorithm 1, after detecting the contour of an image, we mark its inner contour. The inner contour is defined here as the immediate black neighbors of a contour that are trapped inside the contour. If a contour pixel does not have any trapped neighbors, that pixel itself is also included in the inner contour. After this step, the current contour is replaced by the inner contour that was just marked, and the original image is shrunk by removing the old contour. This step is repeated iteratively until the skeleton of the image is found and the contour does not change anymore. In each step of the iteration, we make sure that the image remains connected by identifying joint points on the contour and putting them back on the new contour. Joint points are contour pixels that are neighbors with no more than two inside pixels and have at least one contour-pixel neighbor that does not have any inside neighbors. The output of this thinning algorithm is further smoothed and thinned by another skeletonizer, Stentiford [5]. The second thinning algorithm is based on contour detection

Algorithm 1 Input: binary image β; Output: thinned image γ

while true do
    δ ← ContourDetection(β)
        (keeps only black pixels that have at least one white neighbor)
    η ← InnerContourDetection(δ, β)
        (keeps only black pixels in β that are adjacent to black pixels of δ)
    β̂ ← β − RemoveJointPoints(δ)
        (RemoveJointPoints removes black pixels that are in δ, are neighbors
        with no more than two pixels in β − δ, and have at least one neighbor
        in δ that does not have any neighbors in β − δ)
    δ ← η
    if β̂ ≠ β then
        β ← β̂
    else
        break
    end if
end while
γ ← Stentiford(β)
return γ
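As a concrete reference, the ContourDetection step shared by our thinning algorithms can be sketched in Python; treating pixels outside the image as white, and using 8-connectivity, are our assumptions:

```python
def contour(img):
    """ContourDetection step from Algorithm 1 (a sketch): keep the
    black pixels (1s) that have at least one white 8-neighbor.
    img is a list of rows of 0/1 ints; pixels outside the image
    count as white (our assumption)."""
    h, w = len(img), len(img[0])

    def has_white_neighbor(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) == (0, 0):
                    continue
                ny, nx = y + dy, x + dx
                # Out-of-bounds neighbors are treated as white.
                if not (0 <= ny < h and 0 <= nx < w) or img[ny][nx] == 0:
                    return True
        return False

    return [[1 if img[y][x] == 1 and has_white_neighbor(y, x) else 0
             for x in range(w)] for y in range(h)]
```

On a solid 5 × 5 black square, this keeps the 16 border pixels and drops the 3 × 3 interior, which is exactly the one-pixel ring the algorithm peels off per iteration.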

Fig. 7. The fourth line is the original image and the other three lines are outputs of our thinning algorithms.

and iteratively shrinking the image, like the previous method; however, it is much simpler, and it does not run the skeletonization algorithm on the output. The third thinning method, corresponding to the third line in Figure 7, is exactly like the second thinning method, iteratively shrinking the contour until it converges to the skeleton, with one exception: when updating the contour by the inner contour, we keep any contour pixel that has more than four black neighbors. This keeps the contour more connected than the other method does. Algorithm 2 summarizes this method:

5.3 gHMM Based Approach

Segmentation and classification are two interdependent stages in our OCR project. Facing the same problem in optical character recognition of Latin manuscripts, Jaety Edwards et al. used generalized


Algorithm 2 Input: binary image β; Output: thinned image γ

while true do
    δ ← ContourDetection(β)
        (keeps only black pixels that have at least one white neighbor)
    η ← InnerContourDetection(δ, β)
        (keeps only black pixels in β that are adjacent to black pixels of δ)
    β̂ ← β − RemoveJointPoints(δ)
        (RemoveJointPoints removes black pixels that are in δ and have more
        than four black neighbors)
    δ ← η
    if β̂ ≠ β then
        β ← β̂
    else
        break
    end if
end while
γ ← β
return γ

hidden Markov models [10]. We have encountered numerous cases in segmentation and classification where a higher-level language model is required to resolve ambiguities. [3] proposed an iterative approach, where the character recognition result is fed back to the character segmentation and classification stages. Our gHMM-based model instead uses a structured approach for incorporating the language model, based on the generalized hidden Markov model (gHMM). Simply put, a gHMM is an HMM whose emission model allows multiple consecutive observations to be mapped to the same state. The problem to be solved is to distinguish good segmentation points from bad ones based on the character transition model and the state emission model. A gHMM is characterized by three constituents: the state transition model, the initial state, and the state emission model. The language model is encoded in all three parameters. For OCR, the hidden states are the characters to be recognized, labeled with their relative position in the word, and the observations are the images at the input. The character-to-character transition model is encoded in the state transition model and the initial state. The emission model then helps to compute the probability that an image is an instance of a state. Given a sequence of observations, our gHMM uses a variation of the Viterbi algorithm to find the most probable sequence of characters that produced the image.

6 OUR APPROACH: GHMM

We find the gHMM-based approach the most promising and most accurate. This is because it recognizes Tibetan characters in context, not in isolation, by using a

language model of the Tibetan language that incorporates grammar into the recognition system. Furthermore, it solves the segmentation and classification problems simultaneously: it finds the best way to segment characters and the most likely sequence of characters that could correspond to an image of a line of Tibetan block print. This method increases the accuracy of a segmentation-based model by 25%.

7 COMPONENTS OF GHMM

7.1 Emission Model: Feature Extraction and Observation Probability

The emission model is the part common to all classification algorithms in general and OCR systems in particular. It consists of two parts: feature extraction and the probability distribution of feature vectors for each class; these probability distributions are known as observation probabilities. Feature extraction is the process of converting the input data into a vector of features. All classification algorithms eventually go through this process, and the more descriptive the extracted features, the more accurate the final system. We extract 157 features for our OCR system; these features are discussed in detail in Section 9. The observation probability can also be used to define a distance function that measures the closeness of feature vectors to each other. Without a proper distance function that groups vectors of the same class together and separates them from vectors of other classes, classification would be impossible or very inaccurate. There are four different distance functions that we worked with; Section 10 explains them in detail.

7.2 Language Model

Hidden Markov models have a language model component, which incorporates the transition and initial probabilities of a set of states, in our case characters, into the system. This helps the recognition system avoid unlikely and ungrammatical sentences and take the grammar of the language into account. Section 11.1 explains this part in detail.
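A minimal sketch of how such initial and transition probabilities might be estimated from a character-level corpus; the report does not give its estimation procedure, so the add-one smoothing and the space-separated character tokens used here are our assumptions:

```python
from collections import Counter

def bigram_model(corpus_lines, smoothing=1.0):
    """Estimate initial and transition probabilities for a character
    HMM from a corpus.  Characters are space-separated tokens here,
    and add-one (Laplace) smoothing is applied; both are our
    assumptions, not the report's stated procedure."""
    initial, pairs, unigrams, vocab = Counter(), Counter(), Counter(), set()
    for line in corpus_lines:
        chars = line.split()
        if not chars:
            continue
        initial[chars[0]] += 1
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            pairs[(a, b)] += 1
            unigrams[a] += 1
    V = len(vocab)
    n_lines = sum(initial.values())
    P_init = {c: (initial[c] + smoothing) / (n_lines + smoothing * V)
              for c in vocab}
    P_trans = {(a, b): (pairs[(a, b)] + smoothing) /
                       (unigrams[a] + smoothing * V)
               for a in vocab for b in vocab}
    return P_init, P_trans
```

Both distributions are properly normalized: the initial probabilities sum to one, as do the transitions out of each character, which is what the Viterbi decoder requires.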

7.3 Viterbi Decoder for Simultaneous Character Segmentation and Character Classification

The Viterbi decoder is the most essential part of a gHMM system; it is the heart of the system. The gHMM can be thought of as a black box that takes the raw image of a line of Tibetan as its input and produces the most probable sequence of Tibetan characters, with its corresponding segmentation points, as its output. The algorithm that performs this simultaneous segmentation and classification is the Viterbi algorithm. Refer to Section 11.3 for a thorough explanation of the algorithm.
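The generalized Viterbi idea, in which a state may span several consecutive observation columns, can be sketched as a dynamic program. This is an illustrative sketch under our own assumptions (log-space scores and a fixed set of candidate segment widths), not the authors' exact decoder:

```python
def generalized_viterbi(n_cols, states, widths, log_init, log_trans, log_emit):
    """Decode a line of n_cols pixel columns with a gHMM (a sketch).

    A state may emit a segment of several consecutive columns;
    log_emit(s, i, j) scores columns [i, j) as one instance of
    state s.  Returns the best log-score and the decoded list of
    (state, start, end) segments."""
    # best[j][s] = (score of best path whose last segment ends at
    #               column j in state s, back-pointer or None)
    best = [dict() for _ in range(n_cols + 1)]
    for s in states:                      # first segment starts at column 0
        for w in widths:
            if w <= n_cols:
                sc = log_init[s] + log_emit(s, 0, w)
                if s not in best[w] or sc > best[w][s][0]:
                    best[w][s] = (sc, None)
    for j in range(1, n_cols):            # extend partial paths
        for s, (sc, _) in best[j].items():
            for t in states:
                for w in widths:
                    k = j + w
                    if k > n_cols:
                        continue
                    nsc = sc + log_trans[(s, t)] + log_emit(t, j, k)
                    if t not in best[k] or nsc > best[k][t][0]:
                        best[k][t] = (nsc, (j, s))
    # Backtrack from the best state whose segment ends at the last column.
    s, (score, _) = max(best[n_cols].items(), key=lambda kv: kv[1][0])
    segs, j = [], n_cols
    while True:
        _, bp = best[j][s]
        segs.append((s, bp[0] if bp else 0, j))
        if bp is None:
            break
        j, s = bp
    return score, segs[::-1]
```

Because the emission score is attached to a whole span of columns rather than to a single observation, the arg-max simultaneously chooses the segmentation (the span boundaries) and the classification (the state labels), which is exactly the property the text attributes to the gHMM decoder.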


Fig. 8. Line of Tibetan block print with its skeleton

8 PREPARING THE INPUT TO GHMM: IMAGE PREPROCESSING

8.1 Noise Removal

One of the main challenges in recognition of Tibetan block prints is that they are very noisy. Noise in computer vision is defined as those pixels that are not part of the image content; they make recognition hard. Noise removal is an essential step in our OCR pipeline. In this step, all pixels are traversed one by one, and those that do not have more than two non-white neighbors are whitened, i.e., deleted. This way, noisy pixels and dots are removed and the image is cleaned.
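The whitening rule just described can be sketched as follows; treating pixels outside the image as white is our assumption:

```python
def remove_noise(img):
    """Whiten pixels that have at most two black 8-neighbors, per
    the rule in Section 8.1 (a sketch).  img is a list of rows of
    0/1 ints (1 = black ink, 0 = white); pixels outside the image
    are treated as white (our assumption)."""
    h, w = len(img), len(img[0])

    def black_neighbors(y, x):
        count = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) == (0, 0):
                    continue
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1:
                    count += 1
        return count

    # Keep a black pixel only if it has more than two black neighbors.
    return [[1 if img[y][x] == 1 and black_neighbors(y, x) > 2 else 0
             for x in range(w)] for y in range(h)]
```

An isolated black dot is deleted, while every pixel of a solid 3 × 3 blob survives (even its corners have three black neighbors), so genuine strokes are preserved.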

8.2 Thinning Thick and Noisy Characters

Noise is not restricted to random dots; smeary characters are considered noise too. The characters in wood blocks used for printing are individually carved, and contain large variations compared to modern machine prints. Some of these variations are not tolerable: when a character is too smeary and thickened by printing errors, recognizing it becomes very hard. These characters are thinned by Algorithm 2 in Section 5.2, which extracts the skeleton of the characters. After thinning, we add pixels around the skeleton in two steps. In the first step, the direct neighbors of the skeleton are added to the thinned image, which originally contained the skeleton only; in the second step, the immediate neighbors of the thinned image are added. This way, the pixels that are either in the skeleton or at most two pixels away from it are included in the final thinned image. Figure 8 shows a line of Tibetan block print with its skeleton extracted using our skeleton extraction algorithm. For more details on this thinning algorithm, refer to Section 5.2.
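The two-step regrowing of the skeleton can be sketched as repeated neighbor dilation; the use of 8-connectivity here is our assumption:

```python
def regrow(skeleton, steps=2):
    """Re-thicken a skeleton by adding the immediate 8-neighbors of
    all black pixels, repeated `steps` times (Section 8.2 adds
    neighbors in two such steps).  skeleton is a list of rows of
    0/1 ints; 8-connectivity is our assumption."""
    h, w = len(skeleton), len(skeleton[0])
    img = [row[:] for row in skeleton]
    for _ in range(steps):
        grown = [row[:] for row in img]
        for y in range(h):
            for x in range(w):
                if img[y][x] == 1:
                    # Mark every in-bounds neighbor as black.
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < h and 0 <= nx < w:
                                grown[ny][nx] = 1
        img = grown
    return img
```

Starting from a one-pixel skeleton point, one step grows a 3 × 3 blob and two steps a 5 × 5 blob, matching the description that the result keeps every pixel within two pixels of the skeleton.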

8.3 Baseline Detection

The lines of text in Tibetan are aligned to a baseline near the top of the characters. This feature can be used to align Tibetan text lines in an image for effective character window search. However, Tibetan texts are written on wide and short pages, and the significant variations in line slant angle and line spacing across a page make any global baseline search technique ineffective and error-prone (see the left of Figure 9). We use a horizontal histogram of a typical line of Tibetan text, learned from samples, to dynamically detect local line offset variations in short segments of 100-200 pixels

Fig. 9. Baseline detection using horizontal histograms

across the page; see the right of Figure 9. For each local window, which is usually 60 x 100-200 pixels, we build a horizontal histogram that contains the number of black pixels along each y position in the window. This histogram is then compared to our typical histogram learned from training samples, and if they are close enough, the vertical offset of the window, i.e. its local baseline, is detected. The vertical offset for each pixel is computed by interpolating neighboring measurements. Each line is separated to avoid overwriting neighboring lines, and pixels are shifted rather than rotated for computational efficiency. This way, the lines are aligned for character segmentation.
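The histogram-matching step can be sketched as follows; the sum-of-squared-differences criterion is an assumption, since the text only says the window histogram and the learned histogram must be "close enough":

```python
import numpy as np

def local_baseline_offset(window, template):
    """Find the vertical offset of a text line inside a local window.

    window:   2-D binary array (1 = black), e.g. a 60 px tall slice of the page.
    template: 1-D "typical" horizontal histogram learned from training lines,
              shorter than the window height.
    Returns the offset whose histogram best matches the template (smallest
    sum of squared differences); the real system also thresholds this
    distance before accepting the match.
    """
    hist = window.sum(axis=1).astype(float)   # black pixels per row
    t = np.asarray(template, dtype=float)
    best, best_d = 0, float("inf")
    for off in range(len(hist) - len(t) + 1):
        d = float(((hist[off:off + len(t)] - t) ** 2).sum())
        if d < best_d:
            best, best_d = off, d
    return best
```

The per-window offsets found this way would then be interpolated across the page, as described above.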

9 EMISSION MODEL OF GHMM, PART I: FEATURE EXTRACTION

The segmentation points found in Section 5.1 are used to divide each line of text into individual characters. For training the classifier, we call these characters class templates. Several classes of features are considered for wood-block print OCR.

1. Skeletonization and stroke detection (structure based approach)
2. Histograms and profiles (texture based approach)
3. Hough transform and variations (structure and texture based approach)

Skeletonization and stroke detection have been used successfully in many handwriting OCR systems [4]. The approach thins a stroke down to single pixel width while keeping the connected components connected. We developed three thinning algorithms. While these approaches are suitable for handwriting with thin strokes, the strokes in Tibetan wood block print are dense and thick, and the skeletonization process generates considerable systematic noise: the thinning algorithms produce split ends at the tips of strokes and have difficulties thinning dense junction points. Our experiments with vertical and horizontal histograms and profiles showed that histograms are sensitive to stroke thickness and profiles are easily affected by cross-sections generated while segmenting merged characters. We therefore use the Hough transform as the basis for stroke detection; Sections 9.1 and 9.2 explain stroke detection in detail. The Hough transform is an image space transformation that maps points in the x-y Euclidean plane into the Hough space, defined by θ and ρ. By checking for the strongest response in



Fig. 10. Lines and circles detected using Hough Transform

Hough space, we can quickly find the most prominent line feature in the image; see Figure 10.

9.1 Hough Line Extraction

The Hough transform has been frequently used to extract features from image outlines [6]. Tibetan wood block print is not suited for such an extraction approach. Firstly, a character outline may be badly distorted by noise and merged characters. Secondly, thin outlines require precise matches with angles (θ), resulting in expensive computation and features that are less robust to rotational variations. In our work, we compute the Hough transform from preprocessed image data, i.e. with noise removed and overly thick characters thinned to normal size. Preprocessing keeps the strokes thick and only makes the smeary, overly thick strokes look like the other strokes. Extracting the Hough response from images with thick strokes has a positive and a negative implication for feature extraction. The positive implication is that a thick stroke can respond strongly to lines of many angles, so only a few angles need to be tested to collect a significant response from a stroke. This makes the stroke more rotationally invariant, as small amounts of rotation will still yield a prominent response. The negative implication is that thick strokes add considerable noise to Hough transform responses, as two or three cross sections of thick strokes yield a response as high as an average length stroke. To mitigate this problem, we verify a high response in the Hough space by checking the collinear pixels in the x-y plane that produced it. We define a validated prominent stroke to be a stroke that has a certain number of consecutive pixels. The Hough space efficiently stores a wealth of information about the character. As illustrated above, the strength of the response at a (θ, ρ) position indicates how many collinear points lie on that line in the x-y plane. We can also examine the response along the ρ axis in Hough space for a particular orientation θ. This shows us the parallel strokes found in the orientation θ at various offsets.
For example, long vertical strokes that appear at the right side of the image will have a significant response at θ = 0 and at a ρ offset corresponding to the right side of the image.

This information helps determine the location of the response. We use the Hough transform to extract a total of 90 features. Analysis shows that 6 orientations of equally spaced θ are enough to capture the strokes in Tibetan. For each stroke orientation, we bin the captured strokes by their verified consecutive length into long, medium, and short segments. For each bin, five features are extracted: the number of verified prominent strokes, the maximum and minimum offset ρ of verified prominent strokes, the average offset ρ of prominent strokes, and the average offset ρ of all responses, prominent and non-prominent. The non-prominent responses are those lines with enough collinear points to trigger a response but without enough of those points found to be consecutive. These features help capture both structural and textural information in the character. The extraction of 6 orientations on thick strokes provides us with local rotational invariance. The binning of stroke length provides invariance to the size of strokes, which varies significantly over characters, even on the same line. Small variations of the stroke location in the direction of the stroke do not affect the features. Small variations of the stroke location perpendicular to the stroke only translate to a small change in the offset ρ, thus providing local translational invariance.
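A minimal version of the per-orientation accumulator and (count, max ρ, min ρ, mean ρ) features might look like this; the consecutive-pixel verification and the length binning described in the text are omitted for brevity, and the response threshold is an assumed parameter:

```python
import numpy as np

def hough_line_features(img, n_theta=6, response_thresh=5):
    """Sketch of the per-orientation Hough line features described above.

    Builds a Hough accumulator over n_theta equally spaced orientations,
    then for each orientation reports (count, max rho, min rho, mean rho)
    of the responses at or above the threshold.
    """
    ys, xs = np.nonzero(img)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(*img.shape)))
    acc = np.zeros((n_theta, 2 * diag + 1), dtype=int)
    for y, x in zip(ys, xs):
        for i, th in enumerate(thetas):
            # rho = x cos(theta) + y sin(theta), shifted to a valid index
            rho = int(round(x * np.cos(th) + y * np.sin(th))) + diag
            acc[i, rho] += 1
    feats = []
    for i in range(n_theta):
        rhos = np.nonzero(acc[i] >= response_thresh)[0] - diag
        if len(rhos):
            feats += [len(rhos), rhos.max(), rhos.min(), float(rhos.mean())]
        else:
            feats += [0, 0, 0, 0.0]
    return feats
```

A ten-pixel vertical stroke at x = 3 produces exactly one prominent response, at θ = 0 and ρ = 3.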

9.2 Hough Circle Extraction

Tibetan scripts contain many loops and curves that are not easily extracted with straight line strokes. We extend the Hough transform to produce 13 more features for detecting circular and half-ellipsoid shapes, to capture the loops in characters like cha and chha, and the curves in ya-ta. We found the circle response to be highly sensitive to the circle radius. Analysis shows that three bins of radii produce the most discriminating features: those that capture small circles as in cha, chha, tsha, zha, tha, a, ya, la, sha; those that capture larger circles as in ba, kha and ga; and those that capture half ellipsoids as in characters with ya-ta consonants. For each of the circles detected, we construct 3 bins for their locations: those that touch the top baseline, those that are one stroke away from the baseline, and those that are far below the bottom baseline. Each bin contains 2 features: the number of circles detected above a response threshold and the maximum response produced in that bin. Since there are at most two circles in each bin, this method of extracting circle features is invariant to shift along the x-axis. As the y locations of circles are binned into top, middle, or bottom, the method is also locally invariant to shift along the y-axis. For the half ellipsoids detected, one binary feature is constructed corresponding to the presence of a half ellipse of a certain size oriented downwards in the bottom part of the template.
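Circle detection by Hough voting can be sketched as follows for a single radius bin (the paper's three radius bins would call this once per bin; the number of voting angles is an assumed parameter):

```python
import numpy as np

def hough_circle_response(img, radius, n_angles=36):
    """Sketch of circle detection by Hough voting, for one radius bin.

    Every black pixel votes for the centers of all circles of the given
    radius passing through it; strong accumulator cells mark detected
    circles.  Returns the center accumulator.
    """
    h, w = img.shape
    acc = np.zeros((h, w), dtype=int)
    ys, xs = np.nonzero(img)
    for ang in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        cy = (ys - radius * np.sin(ang)).round().astype(int)
        cx = (xs - radius * np.cos(ang)).round().astype(int)
        ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
        np.add.at(acc, (cy[ok], cx[ok]), 1)
    return acc
```

Thresholding and binning the accumulator peaks by vertical position would then yield the per-bin (count, max response) features described above.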



9.3 Character Complexity Features

In addition to stroke-based extraction, we also extract a complexity measure for characters by counting the number of stroke crossings encountered in two orientations, horizontal and vertical. Some characters like da have a simple structure and their maximum number of horizontal or vertical crossings is never more than three, whereas for more complex characters like wa the number of crossings in either direction is at least three. We have allocated ten features for each template to capture the complexity of its structure. These features are built as follows: we construct horizontal and vertical histograms of the number of crossings in the corresponding directions, and the total number of crossings is recorded for the vertical and horizontal orientations.
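The crossing counts can be sketched as below, where a crossing is taken to be a white-to-black transition along a scan line (a common convention; the text does not spell out its exact definition):

```python
import numpy as np

def crossing_counts(img):
    """Count stroke crossings row-wise and column-wise.

    A "crossing" is a white-to-black transition while scanning a row
    (horizontal crossings) or a column (vertical crossings); simple glyphs
    have few crossings, complex ones have many.
    Returns (horizontal_histogram, vertical_histogram).
    """
    def transitions(line):
        # prepend a white pixel so a stroke starting at position 0 counts
        padded = np.concatenate(([0], line))
        return int(((padded[1:] == 1) & (padded[:-1] == 0)).sum())

    horiz = [transitions(row) for row in img]    # one count per row
    vert = [transitions(col) for col in img.T]   # one count per column
    return horiz, vert
```

The histograms and totals built from these counts give the ten complexity features above.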

9.4 Directional features

We partition the image into eight horizontal and four vertical strips. For the horizontal strips we extract the following features: the number of black pixels in each column, the height of the highest pixel in each column, the number of horizontal crossings for each column, the mean grayscale of the strip, and the x-mean and the y-mean of each column. For the vertical strips we only extract the mean number of horizontal crossings for each row. In summary we extract 38 directional features from the image.

9.5 Additional Features

Finally we extract five basic features from the image: the grayscale of the image, the width of the image, the top vowel of the image if there is any, and the x-mean and the y-mean of the image. For the top-vowel feature, we used a simple decision tree algorithm to classify the top part of the image into five classes, one for each possible top vowel and one for the case when there is no vowel. The decision tree is very simple and works as follows. First the part above the baseline, which contains vowels only, is cut from the image and sent to a vowel classifier. This vowel classifier detects intervals where there are contiguous chunks of black pixels; these chunks are the vowels. On each of these vowels our classifier then runs the following decision tree. If the numbers of horizontal and vertical crossings are each at least two, the vowel is marked i; otherwise, if the beginning and the ending parts of the vowel are local maxima, the vowel is marked o; otherwise, if the slope of the vowel is close to negative one, the vowel is marked e; otherwise, if the vowel is less than fifteen pixels wide it is marked n, and p otherwise. Here e, o and i are top vowels, and n is the Tibetan mark tsa-phru that is put on top of some characters like tsa. Finally, p is the right part of the top vowel o. Since in some wood prints this vowel is disjoint, our classifier might recognize it as

Fig. 11. Top vowels are marked with red ellipses; the top vowels from left to right are e, o, i, o, n and o

two separate vowels, e or n as its left part and p as its right part. Based on this observation our classifier checks all vowels that are identified as p, and if the preceding vowel is e or n, it combines the two vowels into o. See Figure 11 for how these vowels look in Tibetan block prints.
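The decision tree and the p-merging rule can be sketched as below; the inputs are assumed to be pre-computed measurements of one vowel chunk, and the slope tolerance is an assumed threshold (the text only says "close to negative one"):

```python
def classify_top_vowel(n_horiz_cross, n_vert_cross, ends_are_maxima,
                       slope, width_px):
    """Sketch of the top-vowel decision tree described above.

    Returns one of 'i', 'o', 'e', 'n', 'p' following the rules in the
    text; the 0.25 slope tolerance is an assumption.
    """
    if n_horiz_cross >= 2 and n_vert_cross >= 2:
        return 'i'
    if ends_are_maxima:
        return 'o'
    if abs(slope - (-1.0)) < 0.25:
        return 'e'
    if width_px < 15:
        return 'n'
    return 'p'

def merge_split_o(vowels):
    """Merge a disjoint 'o' recognized as ('e' or 'n') followed by 'p'."""
    out = []
    for v in vowels:
        if v == 'p' and out and out[-1] in ('e', 'n'):
            out[-1] = 'o'
        else:
            out.append(v)
    return out
```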

9.6 Feature summary

In summary, we have a feature set of 157 features: 90 features for 6 orientations of straight line strokes, 13 features for loops and curves, 10 features as a measure of character complexity, 5 basic features and 38 directional features.

10 EMISSION MODEL OF GHMM, PART II: DISTANCE FUNCTION

Tibetan requires a large number of classes to be discriminated. In such situations, previous works [7] [3] have proposed multilevel hierarchical classification. In this work, for simplicity, we start with a single level of classification based only on the centroids of classes. This approach requires an effective method to measure the closeness of a sample point to a class. The set of features described above contains measures for feature counts, offsets, and peak responses. Features with different numerical ranges are first normalized to the range (0,1). This normalized feature space, however, is highly inseparable with respect to the classes: classification using Euclidean distance achieves only 48% accuracy for a correctly segmented image. The issue arises because we have a global uniform weighting on all the features. Certain characters such as ka and ja have prominent stroke features but noisy circle features, whereas cha and chha have prominent circle features but noisy stroke features. To increase the signal-to-noise ratio when measuring closeness to a class, we adopt a class-wise feature weighting approach, where features are weighted differently depending on which class they are measuring distance to. In general, this approach shapes the high dimensional sphere in the feature space around a class centroid into an ellipsoid, to better approximate the feature space belonging to the class. We experiment with four different distance metrics:

1. Weighted Euclidean Distance,
2. Weighted Euclidean Distance with Multi-cluster Classes,
3. Modified Quadratic Discrimination Function (MQDF) [8],
4. Kernelized Modified Quadratic Discrimination Function (KMQDF) [9].



The weighted Euclidean distance approach measures the variance of each feature over all training templates within a class. The difference in each feature between a test sample and a class is then scaled by the inverse of the feature-wise variance in that class. This is a degenerate case of the Mahalanobis distance where the covariance matrix is diagonal. There are two corner cases to handle in this approach. When a feature is very consistent within a class, it has a very low variance, which can lead to numerical stability problems in the variance computation step and can allow too much weight to be placed on that feature. We solve this by setting a lower bound on the variance, thus limiting the weight one can put on a particular feature. The second corner case occurs when we only have one sample for a class, in which case we set the variances uniformly across all features. Class-wise weighting of features increased prediction accuracy significantly, to 67.5% for a correctly segmented image. The classification results for weighted Euclidean distance indicate that simpler characters are predicted with less accuracy, as some prominent features tend to have large variations depending on the robustness of certain features. We mitigate this problem by splitting the simpler, more populous classes into multiple clusters using the k-means algorithm. The number of clusters in each class is proportional to the number of samples in that class. Clusters whose population is less than a certain percentage of the average cluster population in a class are treated as outliers and are not used in the classification phase. There are two parameters to determine, namely the threshold for outliers and the scaling factor for the number of clusters in each class; these are determined by an exhaustive search in the parameter space. The weighted Euclidean distance can only construct ellipsoids in the feature space aligned with the axes.
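The class-wise weighting, the variance floor, and the single-sample fallback described above can be sketched as (the floor value is an assumed parameter):

```python
import numpy as np

def fit_class_weights(samples, var_floor=1e-3):
    """Per-class feature weighting for the weighted Euclidean distance.

    samples: (n, d) training templates of one class.
    Returns (centroid, inverse variances).  The variance floor handles
    the first corner case above (near-constant features); a single-sample
    class gets uniform weights, handling the second.
    """
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    if len(samples) < 2:
        w = np.ones(samples.shape[1])
    else:
        var = np.maximum(samples.var(axis=0), var_floor)
        w = 1.0 / var
    return mu, w

def weighted_distance(x, mu, w):
    """Feature-wise inverse-variance weighted squared distance to a class."""
    return float((w * (np.asarray(x) - mu) ** 2).sum())
```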
To better approximate the data variation for a class, we use Principal Component Analysis (PCA) to extract the directions of most variance, and apply MQDF as a measure of the distance between a sample and a class. Applying this metric improved the prediction accuracy to 75% for a correctly segmented image. Finally, KMQDF is the kernelized version of MQDF. It embeds the data in a higher dimensional Hilbert feature space, which presumably makes it more separable. We use the kernel trick in KMQDF to avoid directly mapping data to the new feature space. The family of kernel functions that we used was polynomial, of the form k(x, y) = (x·y + 1)^p; we found p = 1 to be the most accurate kernel. For further details please refer to Section 10.2. The kernel trick only needs the dot product of the newly-mapped data, which is easily computed as a function of the input in its original feature space. Applying this metric improved the prediction accuracy to 80% for a correctly segmented image.

10.1 MQDF distance function

We can model each class by a multivariate Gaussian distribution and use the negative of its log-probability as a distance function. This is exactly the aforementioned Mahalanobis distance function. The reason we derive our distance function from a probability distribution is that we want to use it later in our generalized hidden Markov model as the emission model. Let w_i denote the ith class, µ_i the mean of the ith class and Σ_i its covariance; the probability that x is generated by class w_i is:

p(x|w_i) = (2π)^(−d/2) |Σ_i|^(−1/2) exp( −(x − µ_i)^T Σ_i^(−1) (x − µ_i) / 2 )    (2)

We denote the distance of x to w_i as

Q(x|w_i) = −log p(x|w_i) = (d/2) log(2π) + (1/2) log |Σ_i| + (x − µ_i)^T Σ_i^(−1) (x − µ_i) / 2    (3)

We use the negative of the log likelihood as the distance function for two reasons: the log likelihood has a convenient form for normal distributions, and likelihood and distance are inversely related, so as the likelihood of a class w_i increases, the distance of x to that class decreases. If we diagonalize the covariance matrix as Σ_i = β_i Λ_i β_i^T, where Λ_i is the diagonal matrix of eigenvalues of Σ_i and β_i is the corresponding orthonormal matrix of eigenvectors, the Mahalanobis distance function becomes

(1/2) [ Σ_{j=1..d} (β_{ij}^T (x − µ_i))^2 / λ_{ij} + Σ_{j=1..d} log λ_{ij} + d log 2π ]    (4)

Since training of the Mahalanobis classifier always underestimates the pattern eigenvalues due to the limited sample set, the minor eigenvalues become unstable noise and affect the classifier's robustness. By smoothing them in the MQDF classifier, not only is the classification performance improved, but the computation time and storage for the parameters are also reduced. In MQDF we substitute the minor eigenvalues, including those that are zero, with a small class-dependent constant δ_i, and we only use the k major eigenvectors, approximating the rest by the same δ_i. Putting it all together, the distance from x to class w_i is

Q(x|w_i) = −log p(x|w_i) = (1/2) [ Σ_{j=1..k} (β_{ij}^T (x − µ_i))^2 / λ_{ij} + Σ_{j=1..k} log λ_{ij} + r_i(x) / δ_i + (d − k) log δ_i + d log 2π ]    (5)

where r_i(x) = ||x − µ_i||^2 − Σ_{j=1..k} (β_{ij}^T (x − µ_i))^2. For more details refer to [8].

10.2 KMQDF distance function

Yang et al. first used KMQDF for facial expression recognition and showed that it outperforms MQDF [9]. KMQDF is the kernelized version of MQDF: it embeds the data in a higher dimensional Hilbert feature space, which presumably makes it more separable. We use the kernel trick in KMQDF to avoid directly mapping data to the new feature space. The kernel trick only needs the dot product of the newly-mapped data, which is easily computed as a function of the input in its original feature space. Our experiments show that k(x, y) = (x·y + 1)^p with p = 1 is a good embedding kernel function for our 157 features. For the mathematical details of deriving the KMQDF distance function refer to [9]. Here we state the final distance function and explain its parameters. The KMQDF distance function looks like equation (5):

(1/2) [ Σ_{j=1..k} (M / λ_{ij}^2) ( r_{ij}^T (R_{it} − 1_M R_{it} − R_i 1_{M1} + 1_M R_i 1_{M1}) )^2 + Σ_{j=1..k} log(λ_{ij} / M) + r_i(x) / δ_i + (d − k) log δ_i + d log 2π ]    (6)

where

r_i(x) = ||φ(x) − µ_i^φ||^2 − Σ_{j=1..k} ( β_{ij}^T (φ(x) − µ_i^φ) )^2
       = φ(x)·φ(x) − 2 · 1_{M1} R_{it} + 1_M R_i 1_{M1} − Σ_{j=1..k} (1/λ_{ij}) ( r_{ij}^T (R_{it} − 1_M R_{it} − R_i 1_{M1} + 1_M R_i 1_{M1}) )^2    (7)

φ is the mapping function; we do not need to know this mapping because of the kernel trick, i.e. ker(x, y) = φ(x)·φ(y).

M is the number of samples for the class.
R_{it} is a 1 × M matrix containing the dot products of all M samples of the class with the input x after being mapped to the new feature space; the kernel trick is used to fill in this matrix.
R_i is an M × M Gram matrix containing the dot products of all M samples of the class after being mapped to the new feature space; the kernel trick is used to fill in this matrix.
1_M is an M × M matrix with all elements equal to 1/M.
1_{M1} is an M × 1 matrix with all elements equal to 1/M.
λ_{ij} and r_{ij} are the eigenvalues and eigenvectors of R_i.

We used up to 125 samples for each class, i.e. our maximum M equals 125. This is a tradeoff between accuracy, computation and memory: storing all samples with their eigenvalues and eigenvectors requires a lot of memory and is not practical, and computing dot products and multiplications of M × M matrices is very time consuming. Our experiments show that capping M at 125 does not affect the accuracy of our model while saving time and memory.
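The kernel-trick matrices R_i and R_it for one class can be sketched as below for the polynomial kernel; the eigendecomposition and centering steps of Eqs. (6)-(7) are not repeated here:

```python
import numpy as np

def poly_kernel(x, y, p=1):
    """Polynomial kernel k(x, y) = (x . y + 1)^p; p = 1 worked best here."""
    return (np.dot(x, y) + 1.0) ** p

def class_gram_matrices(samples, x, p=1):
    """Kernel-trick computation of the matrices used by KMQDF.

    samples: (M, d) training templates of one class (M capped, e.g. at 125).
    Returns (R_i, R_it): the M x M Gram matrix of the class samples and the
    length-M vector of kernel values between each sample and the input x,
    both evaluated without ever mapping into the Hilbert feature space.
    """
    S = np.asarray(samples, dtype=float)
    R_i = (S @ S.T + 1.0) ** p                          # (x_a . x_b + 1)^p
    R_it = (S @ np.asarray(x, dtype=float) + 1.0) ** p  # (x_a . x + 1)^p
    return R_i, R_it
```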

11 PUTTING ALL COMPONENTS OF GHMM TOGETHER

Segmentation and classification are the two interdependent stages in our OCR project. We have encountered numerous cases in segmentation and classification where the higher level language model is required to resolve ambiguities. [3] proposed an iterative approach, where the character recognition result is fed back to the character segmentation and classification stages. We use a structured approach for incorporating the language model, based on the generalized hidden Markov model (gHMM). Simply put, a gHMM is an HMM where the emission model allows multiple consecutive observations to be mapped to the same state. The problem to be solved is to distinguish good segmentation points from bad ones based on the character transition model and the state emission models. A gHMM is characterized by three parameters: the state transition model, the initial state, and the state emission model. The language model is encoded in all three parameters. For OCR, the hidden states are the characters to be recognized, labeled with their relative position in the word, and the observations are the images at the input. The character-to-character transition model is encoded in the state transition model and the initial state. The emission model then helps compute the probability that an image is an instance of a state. Given a sequence of observations, our gHMM uses a variation of the Viterbi algorithm to find the most probable sequence of characters that produced the image. Section 11.1 describes the language model of our Tibetan gHMM, Section 11.2 gives an overview of the emission model, Section 11.3 describes the decoding problem and the Viterbi algorithm in gHMM, and Section 11.4 explains the training process and forced alignment.

11.1 Language Model

The language model consists of the transition and initial probabilities of the gHMM. It is an essential



element of the gHMM in that it disambiguates segmentation and classification errors. We tried two different language models for our gHMM. In the first, we treated stack characters as states; in the second, we labeled each stack character with its relative position in the word and the size of the word. The second model resulted in a more accurate gHMM. The reason is intuitive: the second model enforces the grammar more and discards less grammatical outputs in the Viterbi algorithm. We label our states in the following way. A stack character's label contains the information about the character's relative position in a word and the size of the word. Possible labels are: 1-1, 1-2, 2-2, 1-3, 2-3, 3-3, 1-4, 2-4, 3-4, 4-4. As an example, 2-3 means the character is the second character in a three-character word. There are two major advantages to using labeled stack characters over unlabeled ones. Labeled stack characters constitute a more accurate language model, so they disallow certain paths in the decoding trellis that are otherwise allowed in the unlabeled model; at the same time, certain paths are returned with higher probability in this model than in the unlabeled model, hence the overall accuracy is increased. Usually tseks, the end-of-word delimiters in Tibetan, are blurred in the original image, which makes their recognition very hard. Tsek insertion is a free by-product of our language model: any state that is the first stack character of a word, namely a stack character with one of the labels 1-1, 1-2, 1-3, or 1-4, has a tsek before it.

11.2 Emission Model

The emission model contains the observation and length probabilities of all states. We tried the four different distance functions described in Section 10. MQDF and KMQDF were the most accurate of the four, with KMQDF the most accurate of all; for more information about KMQDF refer to Section 10.2. As KMQDF and MQDF are negatives of log likelihoods, they can be used in our emission model: they are easily converted to observation probabilities by exponentiating the negative of the distance function. Since, for numerical stability, all calculations are done in log space, we only have to change the sign of our distance function to get the log of the observation probability.

11.3 Decoding problem, the Viterbi algorithm

In a gHMM the joint probability of a line image and a certain sequence of states is:

p(c_1, ..., c_m, image) = Π_{t>1} p(c_t | c_{t−1}) p(d_t | c_{t−1}) p(image | d_t, c_t)    (8)

where c_t is the current state and d_t is the number of observations in the current state. The decoding problem looks for a sequence of states that maximizes the posterior probability of that sequence, given the image. Since

p(c_1, c_2, ..., c_m | image) = p(c_1, c_2, ..., c_m, image) / p(image)    (9)

the sequence of states that maximizes the joint probability also maximizes the posterior probability. Exploiting this fact, the Viterbi algorithm finds the maximizing sequence of states by maximizing the joint probability. As the number of sequences of a certain length is exponential in the length, and the length of the maximizing sequence is unknown due to the lack of segmentation, the total number of possible sequences is a sum of exponentials over all possible lengths, which is still exponential. The Viterbi algorithm avoids this exponential set with the aid of dynamic programming in the following way:

δ_t(i, l) = max_{c_1, c_2, ..., c_m, m} p( image(0, t), S(t, l, i) )    (10)

where S(t, l, i) = c_1, c_2, ..., c_m means that observations t − l to t of the image correspond to the ith state. Let a_{ji} denote the transition probability from the jth state to the ith state, let b_i(t, l) be the probability of observing observations t − l to t of the image given that we are in the ith state, and let π(i, l) be the initial probability of the ith state with length l. Then it is straightforward to derive the following recursive formulas:

δ_t(i, l) = b_i(t, l) max_{j,w} δ_{t−l}(j, w) a_{ji}    (11)

δ_0(i, l) = b_i(0, l) π(i, l)    (12)

The Viterbi algorithm retrieves the maximizing sequence of states by looking for the maximum of δ_T(i, l) over all possible states and lengths and tracing back from there, where T is the length of the image, i.e. the number of observations.
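The recursion of Eqs. (11)-(12) can be sketched as follows; this simplified version returns only the best log probability and omits the backpointers needed to recover the state sequence:

```python
import numpy as np

def ghmm_viterbi(T, n_states, max_len, log_b, log_a, log_pi):
    """Sketch of the generalized-HMM Viterbi recursion of Eqs. (11)-(12).

    T: number of observations; log_b(i, t, l): log probability that the l
    observations ending at t were emitted by state i; log_a[j, i]: log
    transition probability from state j to state i; log_pi[i, l-1]: log
    initial probability of state i with length l.
    """
    NEG = -np.inf
    delta = np.full((T, n_states, max_len + 1), NEG)
    # initialization, Eq. (12): state i emits the first l observations
    for i in range(n_states):
        for l in range(1, max_len + 1):
            if l - 1 < T:
                delta[l - 1, i, l] = log_pi[i, l - 1] + log_b(i, l - 1, l)
    # recursion, Eq. (11): the previous segment ends at observation t - l
    for t in range(T):
        for i in range(n_states):
            for l in range(1, min(max_len, t) + 1):
                prev = delta[t - l, :, :].max(axis=1) + log_a[:, i]
                best = prev.max()
                if best > NEG:
                    delta[t, i, l] = max(delta[t, i, l], best + log_b(i, t, l))
    return delta[T - 1].max()
```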

11.4 Training

Training is the most challenging task of our gHMM. Figure 12 shows a line of Tibetan block print with its segmentation points. Supervised, manual segmentation and labeling of observations is time consuming and impractical. Having initialized our emission model with seven pages of manually-segmented block prints, we used the automatic and unsupervised technique of forced alignment to further train our model. Researchers use the forced alignment algorithm, a type of hard EM algorithm, when the corresponding hidden states are given and only the segmentation points are missing. Our training data consist of images of lines of Tibetan block prints with their corresponding sequences of Tibetan characters, i.e. the ground truth. Since our training data lack segmentation points, we treat them as latent variables, with the indices of the state sequence as our states. For instance, if we know that an input image (one line of Tibetan block



Fig. 12. A line of Tibetan block print; the red lines correspond to the segmentation points.

print) corresponds to a 20-character sequence of Tibetan characters, then the states of our EM are the indices of this sequence, i.e. 0 to 19. Hard EM finds the most probable alignment by running the Viterbi algorithm on this new set of states, with the following changes:

a. Each column of the trellis is initialized with our new set of states.
b. For each state i, only state i−1 is considered as the possible previous state.
c. The transition and initial probabilities are not taken into consideration, because the sequence of states is given and we are only maximizing over the observation probabilities; in other words, we are looking for a segmentation that maximizes the product of observation probabilities, given the sequence of states.
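The constrained Viterbi pass of steps a-c can be sketched as below; `max_len`, a cap on how many observations one character may span, is an assumed practical bound:

```python
import numpy as np

def forced_alignment(T, m, log_b, max_len):
    """Sketch of the forced-alignment (hard-EM) segmentation step.

    T: number of observation columns in the line image; m: length of the
    known character sequence; log_b(i, t, l): log emission probability of
    the l observations ending at t under the i-th ground-truth character.
    Only the transition i-1 -> i is allowed, and transition/initial
    probabilities are ignored, as in steps a-c above.
    Returns the segment lengths (one per character) maximizing the
    product of observation probabilities.
    """
    NEG = -np.inf
    best = np.full((m, T), NEG)           # best[i, t]: chars 0..i cover obs 0..t
    back = np.zeros((m, T), dtype=int)
    for i in range(m):
        for t in range(T):
            for l in range(1, max_len + 1):
                if t - l < -1:
                    break
                if i == 0 and t - l == -1:
                    prev = 0.0            # first character starts at obs 0
                elif i > 0 and t - l >= 0:
                    prev = best[i - 1, t - l]
                else:
                    continue
                score = prev + log_b(i, t, l)
                if score > best[i, t]:
                    best[i, t], back[i, t] = score, l
    # trace back the segment lengths
    lengths, t = [], T - 1
    for i in range(m - 1, -1, -1):
        l = back[i, t]
        lengths.append(l)
        t -= l
    return lengths[::-1]
```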

Having found the most likely segmentation, we can re-estimate our gHMM, and with the new gHMM we can re-segment our images. We continue this cycle until the model becomes stable. We trained our model using fifty pages of Tibetan block prints. After twenty iterations of forced alignment our model output the same alignments, meaning that it had converged to a stable solution.

12 RESULTS

We have constructed a database of 6555 characters from 7 pages of Tibetan wood-block print for our experiments. The document used is from the publisher Dege Kanjur, titled 400 Praises to the Buddha. Our sample had 168 classes of stack characters, with a maximum of 699 instances per class. We used cross validation to train and test our model. We manually segmented and labeled all pages of Dege Kanjur for training and initializing the gHMM. We ran forced alignment, with a total of twenty iterations, on yet another collection of training samples totaling 54 pages to further improve our gHMM. This was done for each cross validation step. Our measure of accuracy was the Levenshtein distance, a measure of the closeness of two sequences: the Levenshtein distance between two strings is given by the minimum number of insertions, deletions, or substitutions of characters needed to convert one string into the other. Our gHMM achieved 89.49% accuracy using the Levenshtein distance to compare the Viterbi output to the ground truth. Later on we manually segmented 104 pages and retrained our OCR on them to see how accurate our forced alignment EM was compared to the case when the latent variables, i.e. the segmentation points, are available. Manual segmentation increased the accuracy to

TABLE 1
Error averages of four OCR systems

Thinned with 104 manually segmented pages: 8.712834
Thinned with 54 manually segmented pages:  8.862216
Not thinned, EM:                           12.40975
Thinned, EM:                               10.5180

Fig. 13. Comparison of the errors of four OCR systems: A (green) represents thinned with 104 manually segmented pages, B (yellow) thinned with 54 manually segmented pages, C (blue) not thinned with EM, and D (red) thinned with EM.

91.29%, which boosted the accuracy by only 1.8%. This shows the effectiveness of our training with EM. Table 1 shows the error rates of four different OCR systems: two OCR systems trained with EM, one with thinning and preprocessing and one without, and two OCR systems trained on manually-segmented pages, one with 54 pages and one with 104 pages. Table 1 further shows that thinning and image preprocessing increase the overall accuracy, and that increasing the training samples from 54 pages to 104 pages increases the accuracy by only 0.15%; in other words, more training does not have a noticeable effect on the overall accuracy.
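The Levenshtein distance, and one way of normalizing it into the accuracy figures reported above (the exact normalization is an assumption), can be sketched as:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def accuracy(output, truth):
    """One plausible normalization: 1 - edit distance / ground-truth length."""
    return 1.0 - levenshtein(output, truth) / len(truth)
```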

13 CONCLUSION

In examining approaches to solving the OCR problem for Tibetan block prints, one critical decision is whether to segment the image. This problem is common to other offline handwriting recognition tasks, such as Arabic, Roman and Chinese OCR systems. Our results indicate that segmentation and recognition should not be two separate steps: one needs to recognize characters in the image in order to segment them, and characters are segmented in order to be recognized. This chicken-and-egg problem is solved simultaneously with the aid of the gHMM; in other words, the gHMM carries out segmentation and recognition at the same time. In conclusion, the gHMM-based approach outperforms segmentation-based approaches for optical character recognition of Tibetan block prints. Furthermore, correct choices of language model and distance function, noise-invariant discriminative features, and sufficient training of the gHMM have substantial effects on the final accuracy of the overall system. More specifically, the kernelized MQDF distance function outperforms MQDF, which

Page 16: Recognition of Tibetan Wood Block Prints with Generalized ... · writing recognition and the state of the art of each. For Chinese off-line recognition segmentation and nor-malization

14

was initially used in our system and per-formed bet-ter than Mahalonobis-based and Euclidean distancefunctions. For language model of gHMM, characterslabeled with their relative distance to tseks and size oftheir word outperformed vanilla language model thatlacked these labels. This is due to the fact that labelsof relative distance to tseks and size of words enforcemore grammar into language model than the vanillamodel. Finally using Hard EM to train our gHMM,makes the emission model more robust and accurate.
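For reference, the (non-kernelized) MQDF score discussed above can be sketched in a few lines of pure Python. The class means, eigenvalue/eigenvector pairs, and the regularization constant `delta` below are toy values, not trained parameters from our system; the kernelized variant would replace these inner products with kernel evaluations in feature space.

```python
import math

def mqdf(x, mean, eig, delta):
    """Modified quadratic discriminant function (after Kimura et al.):
    the minor eigenvalues of the class covariance are replaced by a
    single constant `delta`, regularizing the quadratic form.  `eig`
    is a list of (eigenvalue, eigenvector) pairs for the k dominant
    principal axes; a smaller score means a better match."""
    d = len(x)
    diff = [a - b for a, b in zip(x, mean)]
    sq_norm = sum(v * v for v in diff)
    score, proj_sq = 0.0, 0.0
    for lam, phi in eig:
        p = sum(v * w for v, w in zip(diff, phi))  # projection on axis
        score += p * p / lam + math.log(lam)
        proj_sq += p * p
    # Residual subspace: remaining variance is spread over delta.
    score += (sq_norm - proj_sq) / delta + (d - len(eig)) * math.log(delta)
    return score

def classify(x, classes, delta=0.1):
    """Pick the class whose MQDF score is smallest.  `classes` maps a
    label to a (mean, eigenpair-list) tuple."""
    return min(classes, key=lambda c: mqdf(x, classes[c][0], classes[c][1], delta))
```

With two toy 2-D classes centered at (0, 0) and (4, 4), `classify` assigns a point near the origin to the first class and a point near (4, 4) to the second.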

14 FUTURE WORK

Our experiments show that good features can easily increase recognition accuracy even when the distance function is not discriminative enough. Hence one direction for future research is improving the quality of the features that feed the gHMM's emission model. Good features should have two characteristics. First, they should be discriminative, i.e. able to separate classes of characters from each other. Second, they should be invariant to different types of noise, rotation, and smearing. Good features do not, however, eliminate the need for generalized hidden Markov models, as they do not solve the segmentation problem. A higher level of language modeling could also increase the accuracy of our approach. As in speech recognition, instead of the single most probable sequence of states, the Viterbi algorithm could output an N-best list of them; a higher-level language model could then post-process these outputs and filter out unlikely and ungrammatical sentences.
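The N-best rescoring idea above can be sketched as follows. The interpolation weight `alpha` and the toy language model in the usage example are illustrative assumptions, not components of our system.

```python
def rescore(nbest, lm_score, alpha=0.5):
    """Re-rank an N-best list from the Viterbi decoder with a
    higher-level language model.  Each hypothesis is a
    (sentence, decoder_score) pair; scores are in the log domain, so
    the combined score is a weighted sum.  `alpha` controls how much
    the language model overrides the decoder."""
    return max(nbest,
               key=lambda h: (1 - alpha) * h[1] + alpha * lm_score(h[0]))
```

For example, a language model that penalizes ungrammatical strings can promote a hypothesis the decoder ranked second: `rescore([("bad seg", -2.0), ("good seg", -2.5)], lambda s: 0.0 if s.startswith("good") else -5.0)` selects the second hypothesis.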
