giDoc 1.0 Manual

giDoc Team

October 21, 2008

Contents

1 What is giDoc?

2 Old Text Assisted Text Recognition

3 Installing giDoc

4 Using giDoc

    4.1 Text Recognition Technology
    4.2 Project Preferences
    4.3 Block and Line Detection
    4.4 HTK Training
    4.5 Transcription
        4.5.1 Interactive Transcription Tool
        4.5.2 Automatic Transcription Tool

5 Advanced Features

    5.1 Preprocess
        5.1.1 RLSA
        5.1.2 Clean
        5.1.3 Feature Extraction
        5.1.4 Median Filter
        5.1.5 Normsize
        5.1.6 Skew
        5.1.7 Slant
        5.1.8 Slope
        5.1.9 Substract
    5.2 Non-interactive use

1 What is giDoc?

giDoc is a prototype intended to become a fully operational tool for transcribing old text documents. Its name joins two words: iDoc and Gimp. iDoc, because that is the project under which it is being developed, a project sponsored by the Spanish government; and Gimp, because giDoc has been developed as a set of plug-ins for The Gimp, a very powerful open source image manipulation program. The giDoc tools make it easy to transcribe old text documents with the assistance of a text recognition system. You don't need to be an expert in recognition techniques to use the giDoc plug-in: the prototype does all the hard work internally for you, although an advanced mode is available for expert users.

2 Old Text Assisted Text Recognition

TODO

3 Installing giDoc

In order to install giDoc you need:

• The Gimp 2.4 [1].

• HTK.

• Bison and Flex.

giDoc has been created with autotools, so you only have to run "./configure", "make" and "make install", and the tools and some scripts will be installed as Gimp plug-ins. This package has been tested on Linux and Mac OS X; support for Windows is expected in the future.

giDoc installs various advanced tools which are not visible by default. To show them in the giDoc menu, use "./autogen --enable-developer", "make" and "make install" instead.

You can check whether the plug-ins are correctly installed by opening the Gimp and checking in the procedure browser whether there are any procedures starting with "gidoc".

[1] It could work with lower versions, but the latest is recommended.

4 Using giDoc

In this section we explain the expected workflow of a transcriber and give a short description of the tools.

All tools included in the giDoc plug-in can be accessed through the "giDoc" tab, which is shown in every image loaded in GIMP.

The images used by giDoc MUST be in the Gimp native format, .xcf. From now on, an image containing handwritten text will be referred to as a page.

4.1 Text Recognition Technology

Before continuing, it is important to describe text recognition briefly: text recognition is the branch of pattern recognition that studies the computer-based translation of images of written text into machine-editable text.

giDoc includes this technology, and therefore its tools are capable of transcribing the text in an image for you transparently in a semi-automatic process.

Before recognizing text in an image you only need two things:

• Mark the text location in the image (see Section 4.3).

• Have properly trained models available (see Section 4.4).

Text recognition is intended to assist the user in the transcription process; it cannot do all the work by itself. OCR technology is not perfect and some errors are to be expected in the transcriptions recognized by the system, so you must always verify them.

Bear in mind that the more text is used to train the models, the better the recognition system becomes.

4.2 Project Preferences

First of all, a set of general preferences must be defined. These preferences include the complete set of options needed by the giDoc tools, such as the image preprocessing type, the feature extraction method, various parameters used during training and recognition, and several performance adjustments.

Project Preferences can be accessed through giDoc -> Preferences.

TODO update image

Project preferences should be established once so that they can be used throughout a whole project.

A project is a set of pages (XCF images containing handwritten text) that share some properties; for instance, the pages of a book.

The project preferences have to be saved only once. You can close the page where you established them, open a different one, or even restart GIMP, and the project preferences will still apply to all the pages you load. If you want to work on a page from a different project, you can easily load its preferences by clicking the Open button and selecting the project directory.

TODO explain the tabs

Project Preferences are divided into X different tabs... Here is a short description of each parameter:

TODO update to the new preferences

• Project directory, the directory where these preferences and all the files generated by giDoc will be stored.

• Training Index File, a file containing a list of transcribed pages that will be used in training. Complete paths must be used.

• Test Index File, a file containing a list of pages that will be recognized by the "giDoc/recognize" option. Complete paths must be used.

• Key Binding File, the GTK file where the user-defined keyboard shortcuts of the transcription interface are stored.

• Preprocess script, name of the Gimp script used to preprocess lines in the XCF document.

• Feature Extraction Script, name of the Gimp script used to extract features from a preprocessed line.

• Image resolution, currently not used.

• Threshold, the threshold used in monochrome images.

• Histogram projection based on, tells how to obtain the projection histograms.

• Layout origin-x, the x coordinate of the top left corner of the rectangle representing the text layout.

• Layout origin-y, the y coordinate of the top left corner of the rectangle representing the text layout.

• Layout width, the width of the rectangle representing the text layout.

• Layout height, the height of the rectangle representing the text layout.

• Number of lines of the text block, number of lines that will be created inside the layout rectangle.

• Pixels under baseline, defines how many pixels under the baseline will be sampled when preprocessing.

• Pixels over baseline, defines how many pixels over the baseline will be sampled when preprocessing.

• HMM mixtures, number of Gaussian mixtures per state that will be trained.

• HMM states, number of states per character that will be trained.

• HMM iterations, number of iterations to run at each step of the training.

• HMM window width, width of the feature vector (it must be a multiple of three).

• LS file, HTK list of symbols file.

• HMM file, HTK HMM macros file.

• Vocabulary file, HTK dictionary file.

• Language file, HTK language file (word net).

• WIP, HVite word insertion penalty.

• GSF, HVite grammar scale factor.

• Recognition pruning, HVite recognition pruning.

• Enable verification, toggles hypothesis verification.

• Verif Reliable Threshold, threshold over which a hypothesis is considered correct.

• Verif Unreliable Threshold, threshold over which a hypothesis is considered incorrect.
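
To give an idea of how the HTK-related preferences relate to HTK itself, the sketch below assembles an HVite command line from them. The option letters (-H, -w, -p, -s, -t, -S, -i) are standard HVite options. Everything else (the function name, the dictionary keys, the file names and the values) is hypothetical, and this manual does not describe how giDoc actually invokes HVite.

```python
def build_hvite_command(prefs, script_file, output_mlf):
    """Assemble an HVite argument list from giDoc-style preferences.

    The keys of `prefs` are made-up names for this sketch; they are not
    giDoc's real preference identifiers.
    """
    return [
        "HVite",
        "-H", prefs["hmm_file"],        # HMM macros file
        "-w", prefs["language_file"],   # word network
        "-p", str(prefs["wip"]),        # word insertion penalty
        "-s", str(prefs["gsf"]),        # grammar scale factor
        "-t", str(prefs["pruning"]),    # pruning beam width
        "-S", script_file,              # list of feature files to recognize
        "-i", output_mlf,               # output label file (MLF)
        prefs["vocabulary_file"],       # dictionary
        prefs["ls_file"],               # list of symbols (HMM list)
    ]


# Hypothetical values, for illustration only.
example_prefs = {
    "hmm_file": "hmms/macros",
    "language_file": "lm/wordnet",
    "wip": 0.0,
    "gsf": 5.0,
    "pruning": 250.0,
    "vocabulary_file": "lm/dictionary",
    "ls_file": "hmms/symbols.list",
}

command = build_hvite_command(example_prefs, "test.scp", "result.mlf")
print(" ".join(command))
```

Building the command as a list rather than a single string keeps each preference visibly tied to one option and avoids quoting problems with paths that contain spaces.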

4.3 Block and Line Detection

Before transcribing an image containing handwritten text, the document layout of the image must be defined. The layout defines the Block of Text and the Text Lines in the image that are going to be transcribed. A Block is a rectangle that encloses consecutive Lines of text.

This can be performed almost automatically using the giDoc tools: select the block of text in the image using GIMP's Rectangular Selection tool. Then call first the giDoc Block Detection tool, and then the giDoc Line Detection tool. You will notice that a rectangular path encloses the selected region, and that the lines inside it have been automatically detected (see Figure ??).

TODO insert an XCF image showing before and after (a single image)

The Line Detection tool may make some errors. Make sure all lines are correctly underlined, and fix the paths if necessary using Gimp's path tool (for more details, see the path documentation).

TODO perhaps an image showing how to correct a line (before and after, zoomed in on one line)

Once both the Block and the Lines have been defined you can start to transcribe (see Section 4.5).

Bear in mind that the giDoc plug-in can only manage one block per image. Multi-block functionality is expected in the future.

4.4 HTK Training

TODO update if necessary

giDoc is able to train HTK models from transcribed pages. You have to define some HMM values in the Project Preferences and make a list with the full pathnames of the transcribed .xcf files on which the models will be trained (see Figure 1). Training needs to run some preprocessing before obtaining the features. Initially giDoc installs two Scheme scripts that define the series of Gimp plug-ins that perform the preprocessing. You can change the script to apply in the preferences. For further information about writing Scheme plug-ins, visit the Tutorial on Gimp page.

Open an image and click "giDoc/HTK training". If a training index file has been created, the plug-in will open it and use the files it lists for training. If the index file does not exist, the plug-in will train the HMMs using only the open page. If you started Gimp from a terminal, useful information will be printed there.
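
A small script can help build the training index file. The sketch below collects the .xcf files found in a directory and writes their absolute paths to an index file. The one-path-per-line layout and the file name train.idx are assumptions; this manual only states that complete paths must be used.

```python
import os
import tempfile


def write_index_file(pages_dir, index_path):
    """Write the absolute path of every .xcf file in pages_dir to index_path.

    One path per line is an assumed layout, not a documented giDoc format.
    """
    pages = sorted(
        os.path.abspath(os.path.join(pages_dir, name))
        for name in os.listdir(pages_dir)
        if name.lower().endswith(".xcf")
    )
    with open(index_path, "w", encoding="utf-8") as index:
        for page in pages:
            index.write(page + "\n")
    return pages


# Demonstration on a temporary directory with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("page01.xcf", "page02.xcf", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

index_file = os.path.join(demo_dir, "train.idx")
indexed_pages = write_index_file(demo_dir, index_file)
print(indexed_pages)
```

The same function could be used for the test index file, since both lists are described as files of complete page paths.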

Figure 1: Train plug-in needed values.

4.5 Transcription

Once the lines have been marked in the image you can start transcribing using giDoc. This can be done using either the Interactive Transcription tool or the Automatic Transcription tool.

Interactive mode permits both manual transcription and text recognition, while Automatic mode uses only text recognition.

4.5.1 Interactive Transcription Tool

Interactive transcription is the main transcription tool. It lets you transcribe manually, edit existing transcriptions, and also perform automatic recognition if desired.

Click on "giDoc/Interactive Transcription" and the tool's interface will be loaded. The interface is divided into three zones.

TODO insert image of the interface

In the first zone there is a preview image of the lines being transcribed. Note that this preview shows the image at its real size, so the editor's window may not fit on your screen if the image is too large. You can use "Image/Rescale Image" to reduce the image size.

When performing recognition, if verification is enabled the system will perform various evaluations to determine the confidence of the recognized words. Words suspected of being incorrect will be marked in the preview using an orange box (medium confidence) or a red box (low confidence).
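
The colour coding can be pictured as a three-way split of a per-word confidence score using the two verification thresholds of Section 4.2. The sketch below assumes scores between 0 and 1 where higher means more reliable, and assumes that scores falling between the two thresholds are the medium confidence case. This manual does not spell out these details, so the function and its names are hypothetical.

```python
def confidence_mark(score, reliable_threshold, unreliable_threshold):
    """Return None (no mark), "orange" or "red" for a word confidence score.

    Assumes higher scores are more reliable and that the unreliable
    threshold is not above the reliable one.
    """
    if unreliable_threshold > reliable_threshold:
        raise ValueError("unreliable threshold must not exceed reliable threshold")
    if score >= reliable_threshold:
        return None        # considered correct, left unmarked
    if score < unreliable_threshold:
        return "red"       # low confidence
    return "orange"        # medium confidence


# Made-up words and scores, for illustration only.
words = [("anno", 0.95), ("domini", 0.60), ("mccxl", 0.20)]
marks = [(word, confidence_mark(score, 0.8, 0.4)) for word, score in words]
print(marks)  # [('anno', None), ('domini', 'orange'), ('mccxl', 'red')]
```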

TODO insert image with a verification example

TODO comment on the last-segment mark when forceout is used, and insert image

In the second zone, text boxes show the transcription associated with each line. This text can be edited as desired. UTF-8 is supported, so you can enter whichever character you need. To make it easy to identify which line is being edited, it is shown underlined in the preview according to the lines marked during line detection (see Section 4.3).

The third zone is a set of interface preferences. You can adjust the font used for the transcriptions, navigate through the document, change the number of lines shown, and define some key bindings to help with transcription.

In the key binding dialog you can assign a key combination (e.g. <Alt>1) to a text (e.g. "ción"), as shown in Figure 2, so that when you press <Alt>1, "ción" is written in the line currently being transcribed.

Figure 2: Example of some key-bindings
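
Conceptually, a key binding is a lookup from a key combination to a piece of text that gets written into the current line. The sketch below models that with a dictionary. It does not use GTK and does not reflect the format of giDoc's key binding file, which this manual does not describe; the second binding is invented for the example.

```python
# "<Alt>1" -> "ción" is the example from the manual; "<Alt>2" is made up.
KEY_BINDINGS = {
    "<Alt>1": "ción",
    "<Alt>2": "que",
}


def press(key, line_text, bindings=KEY_BINDINGS):
    """Return the line text after pressing `key`; unbound keys change nothing."""
    return line_text + bindings.get(key, "")


line = "transcrip"
line = press("<Alt>1", line)
print(line)  # transcripción
```

Bindings like these pay off for suffixes and abbreviations that recur constantly in a given set of documents.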

After transcribing, remember to save the XCF file or the changes will be lost.

4.5.2 Automatic Transcription Tool

This tool is designed to automatically perform text recognition on all the lines of a page or a set of pages without your supervision.

Click on "giDoc/Automatic Transcription" to access the tool.

Once the process has ended you can use the Interactive Transcription Tool to access, verify and edit the recognized transcriptions.

5 Advanced Features

TODO update (include the "advanced" items of the menu)

5.1 Preprocess

In the "giDoc/Preprocess" menu you will find various preprocessing plug-ins. These plug-ins are an adaptation of the preprocessing tools currently used in the PRHLT group, made by Moises Pastor, Veronica Romero and Alejandro Toselli. To use these tools the image has to be in grayscale (you can set this at Image/Image mode/Grayscale) and have only one layer. Note that not all the plug-ins can be applied to a selected zone; some of them are applied to the whole page. Most of the plug-ins have an interface where the user can adjust options. If there is more than one layer after using a plug-in, it is probably due to an alpha layer; delete it using "Layer/Transparency/Delete alpha layer" before continuing.

5.1.1 RLSA

This plug-in applies the Run Length Smoothing Algorithm to the currently selected region of the image. You can adjust the minimum length between the regions that will be joined. An example can be seen in Figure 3.

Figure 3: RLSA algorithm
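
For readers unfamiliar with run length smoothing, the sketch below shows the usual horizontal form of the technique on a binary image (1 = ink): background gaps between two ink pixels are filled when they are short enough. The parameter name max_gap and the choice to leave gaps at the image borders untouched are assumptions of this sketch; the plug-in's own parameter and border handling may differ.

```python
def rlsa_row(row, max_gap):
    """Fill runs of 0s of length <= max_gap that lie between two 1s."""
    result = list(row)
    last_ink = None
    for x, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None:
                gap = x - last_ink - 1
                if 0 < gap <= max_gap:
                    for fill in range(last_ink + 1, x):
                        result[fill] = 1   # join the two ink regions
            last_ink = x
    return result


def rlsa_horizontal(image, max_gap):
    """Apply horizontal run length smoothing to every row of the image."""
    return [rlsa_row(row, max_gap) for row in image]


sample_row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
smoothed_row = rlsa_row(sample_row, 2)
print(smoothed_row)  # [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

With a suitable gap length, the characters of a word or of a whole line merge into one blob, which makes blocks and lines easier to locate.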

5.1.2 Clean

TODO

5.1.3 Feature Extraction

This plug-in applies the feature extraction used at the PRHLT group and draws it on the image. It performs a transformation similar to what is done in speech recognition, adding the first horizontal and vertical derivatives to the signal. An example can be seen in Figure 4.
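
The idea of adding derivatives to the signal can be sketched as follows. Starting from a small grid of gray levels (assumed here to have been computed already, for instance as averages over cells of the line image), each cell contributes three values: the gray level, its horizontal derivative and its vertical derivative. The finite-difference scheme and all names are assumptions of this sketch and are not taken from the PRHLT implementation.

```python
def derivative(values):
    """Central differences, with one-sided differences at the borders."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    result = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        result.append((values[hi] - values[lo]) / (hi - lo))
    return result


def extract_features(grid):
    """Return one feature vector per column: 3 values for each grid row."""
    rows, cols = len(grid), len(grid[0])
    horizontal = [derivative(row) for row in grid]
    vertical = [derivative([grid[r][c] for r in range(rows)]) for c in range(cols)]
    frames = []
    for c in range(cols):
        frame = []
        for r in range(rows):
            frame.extend([float(grid[r][c]), horizontal[r][c], vertical[c][r]])
        frames.append(frame)
    return frames


gray_grid = [
    [0, 2, 4],
    [1, 3, 5],
]
frames = extract_features(gray_grid)
print(frames[0])  # [0.0, 2.0, 1.0, 1.0, 2.0, 1.0]
```

Each column of the grid becomes one vector whose length is three times the number of grid rows, read from left to right much like the frames of a speech signal.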

5.1.4 Median Filter

This plug-in applies a median filter to the selected region of the image. This kind of filter is used to remove the noise known as salt-and-pepper noise. An example can be seen in Figure 5. You can also use the median filter available in Gimp at "Filter/Enhance/Despeckle".
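
As a reminder of what a median filter does, the sketch below replaces every pixel by the median of its neighbourhood. The square window, the handling of the borders (the window is simply clipped to the image) and the use of the upper median for even-sized windows are choices made for this sketch, not a description of the plug-in.

```python
def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighbourhood."""
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = sorted(
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            out_row.append(window[len(window) // 2])
        output.append(out_row)
    return output


# A flat gray patch with one "salt" pixel (255) and one "pepper" pixel (0).
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 0, 10],
    [10, 10, 10, 10],
]
cleaned = median_filter(noisy)
print(cleaned)
```

Because isolated extreme values never reach the middle of the sorted window, both outliers disappear while the flat background is left unchanged.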

Figure 4: Feature Extraction algorithm

Figure 5: Median Filter algorithm

5.1.5 Normsize

This plug-in normalizes the size of an image representing a text line. It cuts out the descenders and ascenders of the text and resizes them to a defined size. This step is done to normalize the size of all the line images. The plug-in is applied to the whole image, which must represent one line of text. An example can be seen in Figure 6.

Figure 6: Normsize algorithm
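
A minimal sketch of this kind of size normalization follows. It assumes the rows delimiting the main body of the text (upper and lower) are already known, splits the line image into an ascender zone, a body zone and a descender zone, and rescales each zone to a fixed number of rows using nearest-neighbour sampling. How the plug-in finds those rows and which interpolation it uses is not stated in this manual; the names and target heights are hypothetical.

```python
def resize_rows(rows, target_height, width, background=0):
    """Rescale a list of pixel rows to target_height rows (nearest neighbour)."""
    if not rows:
        # Nothing to sample from: emit blank rows so the height stays fixed.
        return [[background] * width for _ in range(target_height)]
    n = len(rows)
    return [list(rows[i * n // target_height]) for i in range(target_height)]


def normalize_line(image, upper, lower, asc_height, body_height, desc_height):
    """Normalize the three vertical zones of a line image to fixed heights.

    `upper` and `lower` are the first and last rows of the text body.
    """
    width = len(image[0])
    ascenders = image[:upper]
    body = image[upper:lower + 1]
    descenders = image[lower + 1:]
    return (
        resize_rows(ascenders, asc_height, width)
        + resize_rows(body, body_height, width)
        + resize_rows(descenders, desc_height, width)
    )


# Each row is filled with its own index so the sampling is easy to follow.
line_image = [[r, r] for r in range(6)]
normalized = normalize_line(line_image, upper=2, lower=3,
                            asc_height=1, body_height=2, desc_height=1)
print(normalized)  # [[0, 0], [2, 2], [3, 3], [4, 4]]
```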

5.1.6 Skew

TODO

5.1.7 Slant

This plug-in corrects the slant, or vertical curvature, of an image representing a text line. This process evens out the natural slant variations of different writers. An artificial example can be seen in Figure 7.

Figure 7: Slant correction algorithm
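
Slant correction is commonly implemented as a horizontal shear: each row is shifted sideways by an amount proportional to its height above the bottom of the line. The sketch below applies such a shear for a given slant angle. Estimating the angle is not shown, and this manual does not state that the plug-in works this way, so read it as an illustration of the general technique with hypothetical names.

```python
import math


def deslant(image, angle_degrees, background=0):
    """Shear a line image to undo a slant of angle_degrees from the vertical.

    Positive angles mean strokes lean to the right (top further right).
    """
    height, width = len(image), len(image[0])
    shear = math.tan(math.radians(angle_degrees))
    output = []
    for y, row in enumerate(image):
        # Rows higher above the bottom row are shifted further to the left.
        shift = int(round((height - 1 - y) * shear))
        new_row = [background] * width
        for x, pixel in enumerate(row):
            nx = x - shift
            if 0 <= nx < width:
                new_row[nx] = pixel
        output.append(new_row)
    return output


# A stroke leaning 45 degrees to the right.
slanted = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
upright = deslant(slanted, 45)
print(upright)  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```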

5.1.8 Slope

TODO

5.1.9 Substract

TODO

5.2 Non-interactive use

TODO
