#kbdata: exploring potential impact of technology limitations on dh research
![Page 1: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/1.jpg)
#kbdata: Exploring potential impact of technology limitations on DH research
Myriam C. Traub, Jacco van Ossenbruggen
Centrum Wiskunde & Informatica, Amsterdam
![Page 2: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/2.jpg)
Translate the established tradition of source criticism to the digital world and create a new tradition of tool criticism to systematically identify and explain technology-induced bias. http://event.cwi.nl/toolcriticism/ #toolcrit
![Page 3: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/3.jpg)
Context
✤ SealincMedia project, original goals:
✤ crowdsourcing enrichment
✤ measure effect on scholarly tasks
✤ Who are the scholars?
✤ What are their tasks?
![Page 4: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/4.jpg)
Interviews
✤ Aim:
✤ Find out what types of research tasks scholars perform on digital archives
✤ Which quantitative / distant reading tasks are not (sufficiently) supported
✤ Scholars with experience in performing historical research on digital archives
(see TPDL 2015 paper for details)
![Page 5: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/5.jpg)
I mostly use digital archives for exploration of a topic, selecting material for close reading (T1, T2) or external processing (T4).
OCR quality in digital archives / libraries is partly very bad.
I cannot quantify its impact on my research tasks.
I would not trust quantitative analyses (T3.a, T3.b) based on this data sufficiently to use it in publications.
![Page 6: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/6.jpg)
Categorisation of research tasks
T1 find the first mention of a concept
T2 find a subset with relevant documents
T3 investigate quantitative results over time
T3.a compare quantitative results for two terms
T3.b compare quantitative results from two corpora
T4 tasks using external tools on archive data
![Page 7: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/7.jpg)
Literature
✤ OCR quality is addressed from the perspective of the collection owner/OCR software developer
✤ Usability studies for digital libraries
✤ Robustness of search engines towards OCR errors
✤ Error removal in post-processing either systematically or intellectually
![Page 8: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/8.jpg)
Two different perspectives of quality evaluation
We care about average performance on representative subsets for generic cases.
I care about actual performance on my non-representative subset for my specific query.
![Page 9: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/9.jpg)
Use case
✤ Aims:
✤ To study the impact on research tasks in detail
✤ Identify starting points for workarounds and/or further research
✤ Tasks T1 - T3
![Page 10: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/10.jpg)
T1: Finding the first mention
✤ Key requirement: recall
✤ 100% recall is unrealistic
✤ Aim: Find out how a scholar can assess the reliability of results
![Page 11: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/11.jpg)
First mention of “Amsterdam” in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
earliest document: 1618
![Page 12: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/12.jpg)
Understanding potential sources of bias and errors
Pipeline stages shown: scanning, pre-processing, OCR, post-processing, ingestion
✤ many details difficult to reconstruct
✤ essential to understand overall impact
![Page 13: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/13.jpg)
First mention of “Amsterdam” in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
“Amfterdam”: 1624
earliest document: 1618
![Page 14: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/14.jpg)
OCR confidence values useful?
✤ Available for all items in the collection: page, word, character
✤ Only for highest ranked words / characters, other candidates missing
✤ This information would be required to estimate recall.
![Page 15: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/15.jpg)
Confusion table
✤ Applied frequent OCR confusions to query
✤ 23 alternative spellings, but none of them yielded an earlier mention
✤ Problem: long tail
Amstcrdam 16-01-1743
Amstordam 01-08-1772
Amsttrdam 04-08-1705
Amslerdam 12-12-1673
Amslcrdam 20-06-1797
Amslordam 29-06-1813
Amsltrdam 13-04-1810
Amscerdam 17-10-1753
Amsccrdam 16-02-1816
Amscordam 01-11-1813
Amsctrdam 16-06-1823
Amfterdam already found
Amftcrdam 17-08-1644
Amftordam 31-01-1749
Amfttrdam 26-11-1675
Amflerdam 03-03-1629
Amflcrdam 01-03-1663
Amflordam 05-03-1723
Amfltrdam 01-09-1672
Amfcerdam 22-04-1700
Amfccrdam 27-11-1742
Amfcordam -
Amfctrdam 09-10-1880
correct confused
s f
n u
e c
n a
t l
t c
h b
l i
e o
e t
full table available online: http://dx.doi.org/10.6084/m9.figshare.1448810
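The query expansion described on this slide can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `confusion_variants` is hypothetical, and only the ten confusion pairs listed on the slide are used, not the full table.

```python
from itertools import product

# Confusion pairs (correct, confused) as listed on the slide.
# Assumption: the full online table contains more pairs than these ten.
CONFUSIONS = [
    ("s", "f"), ("n", "u"), ("e", "c"), ("n", "a"), ("t", "l"),
    ("t", "c"), ("h", "b"), ("l", "i"), ("e", "o"), ("e", "t"),
]


def confusion_variants(term, confusions=CONFUSIONS):
    """Return all spellings of `term` obtainable by replacing characters
    with their known OCR confusions (hypothetical helper)."""
    alternatives = {}
    for correct, confused in confusions:
        alternatives.setdefault(correct, []).append(confused)
    # For every position: the original character plus its confusions.
    options = [[ch] + alternatives.get(ch, []) for ch in term]
    candidates = ("".join(chars) for chars in product(*options))
    # Drop the unchanged query itself.
    return [c for c in candidates if c != term]


variants = confusion_variants("Amsterdam")
print(len(variants), "alternative spellings")
```

For "Amsterdam" this yields the 23 alternatives listed above. A misrecognition such as "Amsterstam" is not among them, which illustrates the long-tail problem: frequent single-character confusions do not cover every error that occurs in the archive.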
![Page 16: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/16.jpg)
First mention of “Amsterdam” in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
“Amfterdam”: 1624
“Amsterstam”: 1618
earliest document: 1618
![Page 17: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/17.jpg)
“Amsterdam”: 1642
“Amfterdam”: 1624
“Amsterstam”: 1618
earliest document: 1618
Update! Corrections for 17th century newspapers were crowdsourced!
“Amsterdam”: 1620
![Page 18: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/18.jpg)
… but why not 1618?
![Page 19: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/19.jpg)
| | Confusion Matrix | OCR Confidence Values | Alternative Confidence Values |
|---|---|---|---|
| available: | sample only | full corpus | not available |
| T1 | find all queries for x, impractical | estimated precision, not helpful | improve recall |
| T2 | as above | estimated precision, requires improved UI | improve recall |
| T3 | pattern summarized over set of alternative queries | estimates of corrected precision | estimates of corrected recall |
| T3.a | warn for different susceptibility to errors | as above, warn for different distribution of confidence values | as above |
| T3.b | as above | as above | as above |
![Page 20: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/20.jpg)
Conclusions
Problems
✤ Scholars see OCR quality as a serious problem, but cannot assess its impact
✤ OCR technology is unlikely to be perfect
✤ OCR errors are reported in terms of averages measured over representative samples
✤ Impact on a specific research task cannot be assessed based on average error metrics
Start of solutions
✤ Impact of OCR is different for different research tasks, so these tasks need to be made explicit
✤ OCR errors are often assumed to be random, but are often partly systematic
✤ Tool pipelines and their limitations need to be transparent & better documented
![Page 21: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/21.jpg)
No silver bullet
✤ we propose novel strategies that solve part of the problem:
✤ critical attitude (awareness and better support)
✤ transparency (provenance, open source, documentation, …)
✤ alternative quality metrics (taking research context into account)
![Page 22: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/22.jpg)
[Figure: number of documents per decade, 1700 to 2000 (y-axis 0 to 15,000,000); series: # documents total, # documents viewed]
Viewed documents (blue) compared to overall corpus size (red)
RQ: Is this tiny fragment biased by technology?
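The comparison in this figure comes down to a per-decade ratio of viewed documents to all documents. A minimal sketch, assuming each document is represented only by its publication year; the function names and the toy input are hypothetical and not taken from the KB logs.

```python
from collections import Counter


def decade(year):
    """Map a publication year to the start of its decade, e.g. 1642 -> 1640."""
    return year // 10 * 10


def viewed_fraction_per_decade(all_years, viewed_years):
    """Fraction of documents viewed at least once, per decade.

    all_years: publication year of every document in the corpus.
    viewed_years: publication year of every unique viewed document.
    """
    total = Counter(decade(y) for y in all_years)
    viewed = Counter(decade(y) for y in viewed_years)
    # Counter returns 0 for decades without any viewed document.
    return {d: viewed[d] / total[d] for d in sorted(total)}


# Toy data for illustration only.
fractions = viewed_fraction_per_decade(
    all_years=[1618, 1620, 1624, 1642, 1643],
    viewed_years=[1620, 1642],
)
print(fractions)
```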
![Page 23: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/23.jpg)
User logs
✤ 5 months on 8 servers
✤ March - July 2015
✤ 100 M requests
✤ 4 M queries
✤ 1 M unique queries (dominated by named entities)
✤ 2.7 M unique documents viewed
![Page 24: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/24.jpg)
Top viewed documents (March - July 2015)
1. views: 700
2. views: 243
3. views: 189
http://resolver.kb.nl/resolve?urn=ddd:011010313
http://resolver.kb.nl/resolve?urn=ddd:010775269
http://resolver.kb.nl/resolve?urn=ddd:011148923
![Page 25: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/25.jpg)
Top 25 queries (# IP hashes)
493 armeense
283 telegraaf
200 doodvonnis batavia
176 ajax
168 voetbal
166 nieuwsblad van het noorden
149 suriname
142 oorlog
132 hitler
132 vvd PROX complot
131 amsterdam
129 volkskrant
126 algemeen handelsblad
122 armeensche
119 limburgs dagblad
119 de telegraaf
114 zoetemelk
114 rotterdam
114 20e eeuw
113 het vrije volk
112 staatscourant
112 brand
108 de waarheid
103 soekaboemi
97 overleden
![Page 26: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/26.jpg)
Can we measure bias in all queries?
![Page 27: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/27.jpg)
Candidate metric to measure search bias
✤ Retrievability (IR, Azzopardi, CIKM 2008)
✤ measures how often documents are retrieved for a given set of queries Q
✤ compares popular documents against non-popular
✤ Inequality expressed with Gini coefficient and Lorenz curve
✤ Inequality correlated with user interest is fine…
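The metric named above can be illustrated with a short sketch. It assumes the cumulative form of retrievability with equal query weights, where r(d) counts the queries in Q that return document d within the top c results, and it uses the standard Gini formula over sorted scores. The function names and the toy rankings are hypothetical; this is not the code used for the experiments.

```python
def retrievability(rankings, cutoff, all_doc_ids):
    """Count, per document, the queries that rank it within the top `cutoff`.

    rankings: dict mapping each query to its ranked list of document ids.
    all_doc_ids: every document in the collection, so that documents that
    are never retrieved get a score of 0.
    """
    scores = {doc_id: 0 for doc_id in all_doc_ids}
    for ranking in rankings.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores


def gini(values):
    """Gini coefficient: 0 means all values are equal, values close to 1
    mean that a few items hold almost all of the total."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * total)


# Toy rankings for illustration only.
rankings = {"q1": ["a", "b", "c"], "q2": ["a", "c"], "q3": ["a"]}
scores = retrievability(rankings, cutoff=2, all_doc_ids=["a", "b", "c", "d"])
print(scores, gini(scores.values()))
```

Including documents that are never retrieved matters here: leaving them out would make the distribution look more equal than it is.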
![Page 28: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/28.jpg)
Experimental setup
✤ Repeat original experiment with synthesised queries
✤ Run experiment with real queries from log
✤ note the ratio: 1M queries vs 100M documents
✤ To do: test known item search for different quality OCR, different media, different titles, …
![Page 29: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/29.jpg)
Lorenz curves
c=10, Gini=0.97
c=100, Gini=0.90
c=1000, Gini=0.78
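A Lorenz curve of this kind plots the cumulative share of documents (sorted from least to most retrievable) against their cumulative share of the total retrievability. The sketch below shows how the points of such a curve could be computed from a list of r(d) scores; the function name `lorenz_points` and the toy input are assumptions for illustration.

```python
def lorenz_points(values):
    """Return (share of documents, share of total score) pairs,
    starting at (0, 0), for values sorted in ascending order."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    if not ordered or total == 0:
        return points
    cumulative = 0
    for i, value in enumerate(ordered, start=1):
        cumulative += value
        points.append((i / len(ordered), cumulative / total))
    return points


# Perfect equality lies on the diagonal; the further the curve sags
# below it, the higher the Gini coefficient.
print(lorenz_points([1, 1, 1, 1]))
print(lorenz_points([0, 0, 0, 4]))
```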
![Page 30: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/30.jpg)
![Page 31: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/31.jpg)
[Figure: four histogram panels; x-axis ret_score (1 to 10), log-scaled count axis; panel labels counts_16_log, counts_17_log, counts_18_log, counts_18_log]
![Page 32: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/32.jpg)
![Page 33: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/33.jpg)
OCR page confidence values (x) and number of views by users (y)
For documents that were viewed at least once.
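One way to quantify the relationship shown in this plot is a rank correlation between page confidence and view count. The sketch below uses Spearman's rho, chosen here as an illustration and not a method stated on the slides. All function names and the sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up sample: page confidence values and view counts.
confidence = [0.35, 0.60, 0.80, 0.95]
views = [1, 4, 2, 7]
print(spearman(confidence, views))
```

A value near 0 would be consistent with there being no monotonic relation between OCR confidence and views; it would not rule out effects that only appear for specific queries or subsets.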
![Page 34: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/34.jpg)
[Figure: three panels (c=10, c=100, c=1000); x-axis decades, 1700 to 2000; y-axis percentages of r(d); legend r(d): 0, 1, 2, 3, 4]
![Page 35: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/35.jpg)
[Figure: x-axis decades, 1700 to 2000; y-axis percentages of r(d), 0.00 to 0.15; legend r(d): 0, 1, 2, 3, 4]
![Page 36: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/36.jpg)
[Figure: number of documents per decade, 1700 to 2000 (y-axis 0 to 15,000,000); series: # documents total, # documents viewed]
Viewed documents compared to overall corpus size (per decade)
RQ: Is this tiny fragment biased by technology?
![Page 37: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/37.jpg)
[Figure: number of documents per decade, 1700 to 2000, on a logarithmic y-axis (1024 to 4194304); series: # documents total, # documents viewed]
![Page 38: #kbdata: Exploring potential impact of technology limitations on DH research](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f316f61a28ab86428b4619/html5/thumbnails/38.jpg)
Conclusions
✤ Only a small fragment of the newspaper corpus is viewed, or even retrieved in the top 10, 100 or 1000 results
✤ No clear evidence that retrieval bias is correlated with OCR errors. Why?
✤ there is no relation
✤ we look for patterns at too generic a level
✤ back to the specificity of the use cases?
✤ Other forms of bias that are measurable/quantifiable?