indexing and searching of manuscript collections

28
Transkribus User Conference 2018 - Vienna (Austria) Indexing and Searching of Manuscript Collections Alejandro H. Toselli and Enrique Vidal Pattern Recognition and Human Language Technology Research Center Universitat Politècnica de València, Spain {ahector,evidal}@prhlt.upv.es February 9th, 2018 Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11/09/2018 1 / 16

Upload: others

Post on 09-Jun-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Indexing and Searching of Manuscript Collections

Transkribus User Conference 2018 - Vienna (Austria)

Indexing and Searching of Manuscript Collections

Alejandro H Toselli and Enrique Vidal

Pattern Recognition and Human Language Technology Research CenterUniversitat Politegravecnica de Valegravencia Spainahectorevidalprhltupves

February 9th 2018

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 1 16

Outline

Textual Access to Untranscribed Manuscripts 3

Probabilistic Indexing of Text Images 4

Performance Measures 8

Preparatory Steps and Laboratory Results 9

Keyword Search Demonstrators 11

Other Indexing Tasks 15

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 2 16

Textual Access to Untranscribed Manuscripts 3

Textual access to Untranscribed Manuscripts

I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible

I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access

I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections

I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes

Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 2: Indexing and Searching of Manuscript Collections

Outline

Textual Access to Untranscribed Manuscripts 3

Probabilistic Indexing of Text Images 4

Performance Measures 8

Preparatory Steps and Laboratory Results 9

Keyword Search Demonstrators 11

Other Indexing Tasks 15

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 2 16

Textual Access to Untranscribed Manuscripts 3

Textual access to Untranscribed Manuscripts

I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible

I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access

I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections

I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes

Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 3: Indexing and Searching of Manuscript Collections

Textual Access to Untranscribed Manuscripts 3

Textual access to Untranscribed Manuscripts

I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible

I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access

I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections

I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes

Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 4: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 5: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo

To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 6: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Handwritten Text Indexing and Search Pixel-level Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 7: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Probabilistic Word Indexing from the Posteriorgram

P

X

Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage

But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 8: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 9: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Lexicon-free Probabilistic Index Example

200

150

100

50

0 100 200 300 400 500 600

pageID=Bentham-071-021-002-part keyword relPrb bounding box

2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31

MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31

NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31

THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31

THE 0927 430 88 30 31LHE 0056 434 88 25 31

REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31

THE 0993 110 115 43 31MATTER 0998 160 115 93 31

OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31

MATTER 0990 425 116 100 31OF 0995 542 115 25 31

LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31

TAUE 0031 575 115 55 31

LANE 0012 575 115 59 31

THE 0990 1 198 28 31MATTER 0934 61 198 64 31

OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31

FACT 0017 182 198 46 31AS 0142 200 198 29 31

HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31

GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31

SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31

SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31

Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 10: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Technologies InvolvedI Optical modelling Deep CNN-RNN network

I Textual context modelled by finite state character n-grams

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 11: Indexing and Searching of Manuscript Collections

Probabilistic Indexing of Text Images 4

Probabilistic Text Image Indexing and Search System Diagram

Textimages

Ingestion

Database

Page-levelindices

indexing toolKWS amp

Keyword search

I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices

I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process

I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 12: Indexing and Searching of Manuscript Collections

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 13: Indexing and Searching of Manuscript Collections

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 14: Indexing and Searching of Manuscript Collections

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 15: Indexing and Searching of Manuscript Collections

Performance Measures 8

Probabilistic Indexing amp Search Precision-Recall Tradeoff Model

Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance

Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved

If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

High confidence theshold

Low confidence threshold

Perfect (AP=10)Aut Transcript (AP=06)

Prob Index (AP=08)

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)

In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)

This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 16: Indexing and Searching of Manuscript Collections

Preparatory Steps and Laboratory Results 9

Preparatory Steps Use of TransKribus Tools

I Line detection

I Transcribing (or page-image text aligning) a small number of training pages

I Resizing images as required to approximately achieve uniform resolution

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 17: Indexing and Searching of Manuscript Collections

Preparatory Steps and Laboratory Results 9

Large Scale Collections Laboratory ResultsChancery

XIV-XV century medieval registers written in old French by many hands

Bentham

18th-19th century manuscripts written in English by many hands

TSO

15th-16th century manuscripts written in old Spanish by many hands

0

02

04

06

08

1

0 02 04 06 08 1

Precisionπ

Recall ρ

Chancery AP=075 mAP=068

Bentham AP=079 mAP=061

TSO AP=080 mAP=083

Collection traintest Char LM Query words

Chancery 34195 acts 5-gram 6506

Bentham 155846 pgs 8-gram 3597

TSO cross-val 286 pgs 8-gram 5409

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 18: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

Chancery Indexing and Search Live Demonstration

PRHLT Search Interface httpprhlt-kwsprhlthimanis

Probabilistic Index basic statistics (as of Sep-2017)Computed

Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619

Estimated from index probabilities

Running words 44216365Running words Page 264769Average Spots Running word 60

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 19: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

Bentham papers Indexing and Search Large Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham

Probabilistic Index basic statistics (as of Nov-2018)Computed

Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198

Estimated from index probabilities

Running words 25487932Running words Page 283Average Spots Running word 78

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 20: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Indexing and Search Large-Scale Demonstrator

PRHLT Search Interface httpprhlt-carabelaprhltupvestso

Probabilistic Index basic statistics (as of Nov-2018)Computed

Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180

Estimated from index probabilities

Running words 5396497Running words Page 150Average Spots Running word 79

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 21: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 22: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 23: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 24: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 25: Indexing and Searching of Manuscript Collections

Keyword Search Demonstrators 11

TSO Large Scale Demonstrator Query Examples

PRHLT TSO search interface httpprhlt-carabelaprhltupvestso

A small sample of query possibilities

Austriaotomano

teniente || alferez || sargentosol ampamp espantildeol

(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)

[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 26: Indexing and Searching of Manuscript Collections

Other Indexing Tasks 15

Other KWS Demonstrators

Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at

httptranscriptoriumeudemotsKWSdemos

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 27: Indexing and Searching of Manuscript Collections

Other Indexing Tasks 15

Thanks for Your Attention

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16

Page 28: Indexing and Searching of Manuscript Collections

Other Indexing Tasks 15

Average Precision (AP) versus Mean Average Precision (mAP)

AP mAP

Type of Ranking Global spot list One spot list for each query

Averaging type Microover all query events

Macroover the APs of isolated queries

Is it invariant to monotonicscore transformations no yes

Does score consistencyhave an impact yes no

Need all queries be relevant no yes

Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16