indexing and searching of manuscript collections
TRANSCRIPT
Transkribus User Conference 2018 - Vienna (Austria)
Indexing and Searching of Manuscript Collections
Alejandro H Toselli and Enrique Vidal
Pattern Recognition and Human Language Technology Research CenterUniversitat Politegravecnica de Valegravencia Spainahectorevidalprhltupves
February 9th 2018
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 1 16
Outline
Textual Access to Untranscribed Manuscripts 3
Probabilistic Indexing of Text Images 4
Performance Measures 8
Preparatory Steps and Laboratory Results 9
Keyword Search Demonstrators 11
Other Indexing Tasks 15
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 2 16
Textual Access to Untranscribed Manuscripts 3
Textual access to Untranscribed Manuscripts
I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible
I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access
I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections
I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes
Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Outline
Textual Access to Untranscribed Manuscripts 3
Probabilistic Indexing of Text Images 4
Performance Measures 8
Preparatory Steps and Laboratory Results 9
Keyword Search Demonstrators 11
Other Indexing Tasks 15
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 2 16
Textual Access to Untranscribed Manuscripts 3
Textual access to Untranscribed Manuscripts
I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible
I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access
I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections
I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes
Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Textual Access to Untranscribed Manuscripts 3
Textual access to Untranscribed Manuscripts
I Massive text image collections have been compiled by libraries and archives allover the world but their textual content remains practically inaccessible
I If perfect or sufficiently accurate text image transcripts were available imagetextual context could be straightforwardly indexed for plaintext textual access
I But manual or even interactive-predictive assisted transcription is entierelyprohibitive to deal with massive image collections
I And fully automatic transcription results lack the level of accuracy needed foruseful text indexing and search purposes
Good news indexing and textual search can be directly carried out on untrascribedimages as we will see now
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 3 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities (P ) for a text image X and word v =rdquomatterrdquo
To compute P an accurate contextual (n-gram based) word classifier can be used In thisexample this helped to achieve very low posteriors in a region of X around (i=100 j=60)where a very similar (but different) word ldquomattersrdquo is written
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Handwritten Text Indexing and Search Pixel-level Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Probabilistic Word Indexing from the Posteriorgram
P
X
Directly computing and using a full pixel-level posteriorgram would entail a formidablecomputational load and would require prohibitive amounts of indexing storage
But for each word image region relevance probabilities and locations are easily derived from thePosteriorgram ndash and used to probabilistically index the word in an efficient way
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 4 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0989 77 36 99 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
All character strings or ldquopseudo-wordsrdquo which are likely enough to be real words are indexed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Lexicon-free Probabilistic Index Example
200
150
100
50
0 100 200 300 400 500 600
pageID=Bentham-071-021-002-part keyword relPrb bounding box
2 0929 1 36 20 3121 0064 1 36 24 31IT 0982 33 36 27 31IF 0012 33 36 26 31
MATTERS 0998 160 115 93 31MATTER 0011 77 36 93 31
NOT 0999 216 36 7 31WHETHER 1000 256 36 99 31
THE 0997 389 36 33 31MIS-SUPPOSAL 1000 455 36 193 31
THE 0927 430 88 30 31LHE 0056 434 88 25 31
REGARDS 0857 5 115 84 31UGARDS 0138 5 115 80 31
THE 0993 110 115 43 31MATTER 0998 160 115 93 31
OF 0996 271 115 23 31FACT 0999 306 115 49 31OR 0973 377 115 37 31ON 0021 377 115 42 31
MATTER 0990 425 116 100 31OF 0995 542 115 25 31
LAM 0407 575 115 30 31BIMR 0175 575 115 55 31 LAW 0032 575 115 36 31
TAUE 0031 575 115 55 31
LANE 0012 575 115 59 31
THE 0990 1 198 28 31MATTER 0934 61 198 64 31
OF 0988 141 198 28 31FAST 0367 182 198 62 31FAR 0186 182 198 36 31
FACT 0017 182 198 46 31AS 0142 200 198 29 31
HAE 0022 200 198 29 31WHERE 0992 255 198 90 31YOU 0761 365 198 45 31YOW 0030 365 198 45 31
GOUS 0064 372 198 47 31SUPPOSE 0975 429 198 120 31
SUPFROSE 0024 429 198 125 31SOME 0834 570 198 78 31
SONER 0016 576 198 83 31OME 0109 580 198 65 31ME 0022 620 198 22 31
Spots for MATTER and MATTERS marked in colors according to their Relevance Probabilities
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 5 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Technologies InvolvedI Optical modelling Deep CNN-RNN network
I Textual context modelled by finite state character n-grams
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 6 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Probabilistic Indexing of Text Images 4
Probabilistic Text Image Indexing and Search System Diagram
Textimages
Ingestion
Database
Page-levelindices
indexing toolKWS amp
Keyword search
I ldquoKWS amp indexing toolrdquo Off-line pre-computation of probabilistic indices
I ldquoIngestionrdquo Off-line creation of the actual database Typically a simple andcomputationally cheap process
I ldquoKeyword searchrdquo On-line user query analysis find the requested information andpresent the retrived images Short response times needed
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 7 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Performance Measures 8
Probabilistic Indexing amp Search Precision-Recall Tradeoff Model
Indexing and search quality can be assessed bymeans of precision (π) amp recall (ρ) performance
Precision is high if most of the retrieved resultsare correct while recall is high if most of the exist-ing correct results are retrieved
If perfectly correct text were indexed yoursquod get asingle ldquoidealrdquo point with ρ = π = 1 0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
High confidence theshold
Low confidence threshold
Perfect (AP=10)Aut Transcript (AP=06)
Prob Index (AP=08)
If automatic (typically noisy) handwritten text transcripts are naively indexed just as plaintextprecision and recall are also fixed values albeit not ldquoidealrdquo (pehaps something like with Aver-age Precision AP 06)
In contrast probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting athreshold on the system confidence (relevance probability)
This flexible ldquoprecision-recall tradeoff modelrdquo obviously allows for better search and retrievalperformance than naive plaintext searching on automatic noisy transcripts
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 8 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Preparatory Steps and Laboratory Results 9
Preparatory Steps Use of TransKribus Tools
I Line detection
I Transcribing (or page-image text aligning) a small number of training pages
I Resizing images as required to approximately achieve uniform resolution
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 9 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Preparatory Steps and Laboratory Results 9
Large Scale Collections Laboratory ResultsChancery
XIV-XV century medieval registers written in old French by many hands
Bentham
18th-19th century manuscripts written in English by many hands
TSO
15th-16th century manuscripts written in old Spanish by many hands
0
02
04
06
08
1
0 02 04 06 08 1
Precisionπ
Recall ρ
Chancery AP=075 mAP=068
Bentham AP=079 mAP=061
TSO AP=080 mAP=083
Collection traintest Char LM Query words
Chancery 34195 acts 5-gram 6506
Bentham 155846 pgs 8-gram 3597
TSO cross-val 286 pgs 8-gram 5409
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 10 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
Chancery Indexing and Search Live Demonstration
PRHLT Search Interface httpprhlt-kwsprhlthimanis
Probabilistic Index basic statistics (as of Sep-2017)Computed
Volumes 167Page Indexed 67413Spots 266301333Average Spots Page 1594619
Estimated from index probabilities
Running words 44216365Running words Page 264769Average Spots Running word 60
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 11 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
Bentham papers Indexing and Search Large Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvesbentham
Probabilistic Index basic statistics (as of Nov-2018)Computed
Boxes 173Page images Indexed 95247 89911Spots 197651336Average Spots Page 2198
Estimated from index probabilities
Running words 25487932Running words Page 283Average Spots Running word 78
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 12 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Indexing and Search Large-Scale Demonstrator
PRHLT Search Interface httpprhlt-carabelaprhltupvestso
Probabilistic Index basic statistics (as of Nov-2018)Computed
Manuscripts 328Page imagesIndexed 41122 36010Spots 42477144Average Spots Page 1180
Estimated from index probabilities
Running words 5396497Running words Page 150Average Spots Running word 79
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 13 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Keyword Search Demonstrators 11
TSO Large Scale Demonstrator Query Examples
PRHLT TSO search interface httpprhlt-carabelaprhltupvestso
A small sample of query possibilities
Austriaotomano
teniente || alferez || sargentosol ampamp espantildeol
(valor || dolor) ampamp (amor || honor)Isabel (belleza || hermosura || nobleza)
[ Don Juan Austria ][ Lope de Vega ][Calderon de la Barca]
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 14 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Other KWS Demonstrators
Several others Handwritting Text Keyword Indexing and Search PRHLT demonstratorscan be tried at
httptranscriptoriumeudemotsKWSdemos
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 15 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Thanks for Your Attention
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16
Other Indexing Tasks 15
Average Precision (AP) versus Mean Average Precision (mAP)
AP mAP
Type of Ranking Global spot list One spot list for each query
Averaging type Microover all query events
Macroover the APs of isolated queries
Is it invariant to monotonicscore transformations no yes
Does score consistencyhave an impact yes no
Need all queries be relevant no yes
Toselli and Vidal (PRHLT) Transliteration and PI Trimming 11092018 16 16