The Certainty of Citations
A proposal for an objective method of measuring certainty
Genealogy Background
Notice the light at the top of the picture.
The FM Bobo Story
Grandmother
Grandfather of grandmother
1860 Census
1870 Census
Marriage Record
Carroll County Arkansas Marriage Records
Eastern District Grooms Index 1869-1930
Book/Page | Groom | Age | Bride | Age | Date
A 63 | BOBO FRANCES M. | 19 | LITTRELL MATILDA | 16 | 6/02/1872
http://www.rootsweb.com/~arcchs/MARB.html
Note 3 year gap in age.
1880 Census
Remember Jarrett for later
1920 Census
Jarrett’s Funeral Book
Record Summary
Record Date | Record Type | Birth Reported | Age Reported | Implied Birth | Death Reported | Census Age Date
8/23/1860 | CEN | | 8 | 1852 | | 1-Jun
7/14/1870 | CEN | | 18 | 1852 | | 1-Jun
6/2/1872 | MAR | | 19 | 1853 | |
6/17/1880 | CEN | | 25 | 1855 | | 1-Jun
1/22/1920 | CEN | | 71 | 1849 | | 1-Jan
2/12/1951 | FUN | | | | 11/17/1932 |
1/1/1955 | GRAV | 10/1/1845 | | 1845 | 11/10/1931 |
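The Implied Birth values in the summary follow from simple subtraction: the year of the record minus the reported age. A minimal sketch in Python, assuming (the slide does not say this) that ages are whole years and that the census age date is ignored; the function name is hypothetical:

```python
from datetime import date


def implied_birth_year(record_date: date, reported_age: int) -> int:
    """Year of the record minus the reported age.

    Assumption: the census reference date (1-Jun, 1-Jan) is not used here,
    so the result can be off by one year for some birthdays.
    """
    return record_date.year - reported_age


# Records from the summary that report an age.
records = [
    (date(1860, 8, 23), "CEN", 8),
    (date(1870, 7, 14), "CEN", 18),
    (date(1872, 6, 2), "MAR", 19),
    (date(1880, 6, 17), "CEN", 25),
    (date(1920, 1, 22), "CEN", 71),
]

for record_date, record_type, age in records:
    print(record_type, record_date.year, implied_birth_year(record_date, age))
```

Run against the five records that report an age, this reproduces the implied birth years in the table, which range from 1849 to 1855.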
Let’s talk about that …
Note person partially in picture.
The Information Flow Diagram
• Event – an association of an action, place, time, and person(s)
EVENT
Dick Eastman at GENTECH2, January 1994
The Information Flow Diagram
• Reporter – a person who creates a record about an event.
• We can measure confidence or bias.
EVENT
REPORTER
John Wylie, president of GENTECH for 5 years
The Information Flow Diagram
• Record – a report about an event, which may not be complete or accurate
• Measure granularity.
EVENT
REPORTER
RECORD
What’s Granularity?
 | Small | Medium | Large
NAME | James Powell Sharbrough | J Sharbrough | Sharbrough
DATE | June 2, 1872 | June, 1872 | 1872
PLACE | 123 Elm St | Harris County | Texas
Granularity Examples
 | Case 1 | Case 2 | Case 3
Name | FM Bobo - 2 | Francis M Bobo – 3 | Bobo - 1
Date | 1953 - 1 | June 1853 – 2 | 2 Jun 1872 - 3
Place | 153 Elm St, Tulsa, OK - 3 | Carroll Co, Ark – 2 | Ark - 1
Total | 6 | 7 | 5
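The totals in the granularity examples are the sum of the name, date, and place levels, each from 1 (coarse) to 3 (fine). A minimal sketch in Python of that addition; the function name and the dictionary layout are assumptions made for illustration:

```python
def granularity_score(name: int, date: int, place: int) -> int:
    """Sum of the three granularity levels, each 1 (coarse) to 3 (fine)."""
    for label, level in (("name", name), ("date", date), ("place", place)):
        if level not in (1, 2, 3):
            raise ValueError(f"{label} level must be 1, 2, or 3, got {level}")
    return name + date + place


# Levels taken from the three example cases on the slide.
cases = {
    "Case 1": {"name": 2, "date": 1, "place": 3},
    "Case 2": {"name": 3, "date": 2, "place": 2},
    "Case 3": {"name": 1, "date": 3, "place": 1},
}

totals = {case: granularity_score(**levels) for case, levels in cases.items()}
print(totals)
```

With three fields scored 1 to 3, the total always falls between 3 and 9.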
The Information Flow Diagram
• Reviewer – a person who reviews records and draws conclusions.
• Evaluate ER Gap, evaluate Reporter.
EVENT
REPORTER
RECORD
REVIEWER
Tony Burroughs, NGS 2001, Portland OR
“ER Gap”
The Information Flow Diagram
• Conclusion – a statement by a reviewer about a collection of records related to an event
• Report – a collection of conclusions.
EVENT
REPORTER
RECORD
REVIEWER
REPORT
ER Gap
[Diagram: a scale running from Near to Far. “Primary” Record: 2. “Secondary” Record: 1. All Records about my family: 0.]
Features of EVIDENCE: The Record
• Granularity
• “Mind the Gap” - ER Gap
• Reporter
CONCLUSION – Rate It
• 1 - Believe
• 2 - Know
• 3 - Can Prove
• 0 – No claim
• Negative numbers -1, -2, -3
TRUST: The Report
• Do this like eBay
So many formulas …
• … so few examples.
• Record granularity measurement – 3 to 9
• ER Gap – 0, 1, or 2
• Reviewer evaluation of reporter – 1 to 10
• Reviewer confidence - -3 to 3
• Trust number, positive feedback ratio
• [Granularity / 5] + [ER Gap] + [Report Eval / 5] + [Reviewer Confidence] + [Trust ratio / 0.5]
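As one worked example of the formula above, here is a minimal sketch in Python. The function name, the parameter names, and the sample values are hypothetical, and the trust ratio is assumed to be a fraction between 0 and 1 (the slide only calls it a positive feedback ratio):

```python
def certainty_score(
    granularity: int,          # record granularity measurement, 3 to 9
    er_gap: int,               # ER Gap, 0, 1, or 2
    reporter_eval: float,      # reviewer's evaluation of the reporter
    reviewer_confidence: int,  # -3 to 3
    trust_ratio: float,        # assumed here to be a fraction from 0 to 1
) -> float:
    """Sum the five terms exactly as written on the slide."""
    return (
        granularity / 5
        + er_gap
        + reporter_eval / 5
        + reviewer_confidence
        + trust_ratio / 0.5
    )


# Illustrative inputs only; they are not taken from the presentation.
score = certainty_score(
    granularity=7,
    er_gap=2,
    reporter_eval=5,
    reviewer_confidence=2,
    trust_ratio=0.9,
)
print(round(score, 2))
```

For these inputs the terms are 1.4 + 2 + 1 + 2 + 1.8, so the score comes to about 8.2.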
Demographic Info
Medical Info
The Death Certificate
It’s “What-if” Time
What if we could make the future however we like?
Mechanical Certainty
Finding Needles in Really Big Haystacks
Record Linking
• Building Indices
• Finding larger patterns
$$\text{FREQUENCY RATIO} = \frac{\text{frequency of outcome } (x, y) \text{ among LINKED pairs}}{\text{frequency of outcome } (x, y) \text{ among UNLINKABLE pairs}}$$
Where:
• x indicates the identifier and its value on the record from the file initiating the search (record A);
• y indicates the identifier and its value on the record from the file being searched (record B);
• LINKED pairs may refer either to all linked pairs, or to a defined subset of these; and
• UNLINKABLE pairs may refer either to all unlinkable pairs, or to a defined subset, provided the linked and the unlinkable sets (or subsets) are otherwise strictly comparable with each other.
Examples
– FIRST INITIALS: AGREEMENT; DISAGREEMENT; LETTER “Q”
– YEAR OF BIRTH: SIMILARITY (difference = 1 year); DISSIMILARITY (difference = 11+ years)
– GIVEN NAMES: SIMILARITY (first 3 letters agree, none disagree, e.g. Sam vs Samuel); SIMILARITY + DISSIMILARITY (first 3 letters agree, 4th disagrees, e.g. Samuel vs Sampson)
– DIFFERENT BUT LOGICALLY RELATED IDENTIFIERS: PLACE of WORK vs PLACE of DEATH (Provo vs Salt Lake City)
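Some more examples
Percentage frequencies among links and non-links, with global frequency ratios (links/non-links)
Identifiers compared | Comparison outcome | Links (%) | Non-links (%) | Global frequency ratio
SURNAME | Agree | 96.5 | 0.1 | 965/1
SURNAME | Disagree | 3.5 | 99.9 | 1/29
FIRST NAME | Agree | 79.0 | 0.9 | 88/1
FIRST NAME | Disagree | 21.0 | 99.1 | 1/5
MIDDLE INITIAL | Agree | 88.8 | 7.5 | 12/1
MIDDLE INITIAL | Disagree | 11.2 | 92.5 | 1/8
YEAR OF BIRTH | Agree | 77.3 | 1.1 | 70/1
YEAR OF BIRTH | Disagree | 22.7 | 98.9 | 1/4
MONTH OF BIRTH | Agree | 93.3 | 8.3 | 11/1
MONTH OF BIRTH | Disagree | 6.7 | 91.7 | 1/14
DAY OF BIRTH | Agree | 85.1 | 3.3 | 26/1
DAY OF BIRTH | Disagree | 14.9 | 96.7 | 1/6
STATE/COUNTRY OF BIRTH | Agree | 98.1 | 11.7 | 8/1
STATE/COUNTRY OF BIRTH | Disagree | 1.9 | 88.3 | 1/46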
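Each global frequency ratio in the table is the percentage among links divided by the percentage among non-links, rounded and written as odds. A minimal sketch in Python that reproduces a few rows; the function names are hypothetical and the rounding rule is inferred from the table rather than stated in the presentation:

```python
def frequency_ratio(links_pct: float, nonlinks_pct: float) -> float:
    """Frequency of an outcome among links divided by that among non-links."""
    return links_pct / nonlinks_pct


def as_odds(ratio: float) -> str:
    """Write a ratio the way the table does: 'n/1' if >= 1, otherwise '1/n'."""
    if ratio >= 1:
        return f"{round(ratio)}/1"
    return f"1/{round(1 / ratio)}"


# (identifier, outcome): (links %, non-links %), copied from the table.
table = {
    ("SURNAME", "Agree"): (96.5, 0.1),
    ("SURNAME", "Disagree"): (3.5, 99.9),
    ("YEAR OF BIRTH", "Agree"): (77.3, 1.1),
    ("YEAR OF BIRTH", "Disagree"): (22.7, 98.9),
}

for (identifier, outcome), (links, nonlinks) in table.items():
    print(identifier, outcome, as_odds(frequency_ratio(links, nonlinks)))
```

A ratio far above 1 (surname agreement) argues strongly for a link, and a ratio far below 1 (surname disagreement) argues against one.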
Discrimination
• A lookup table containing the frequencies of values for identifiers, as they appear in the file being searched.
• SURNAMES Brown (0.39), Aube (0.014), and Skuda (0.00004).
• FIRST NAMES John(5.30), Axel (0.020), and Ulder (0.0045).
Competing Hypotheses
Record Date | Record Type | Birth Reported | Age Reported | Implied Birth | Death Reported | Census Age Date | Rate
8/23/1860 | CEN | | 8 | 1852 | | 1-Jun | 60
7/14/1870 | CEN | | 18 | 1852 | | 1-Jun | 60
6/2/1872 | MAR | | 19 | 1853 | | | 40
1/22/1920 | CEN | | 71 | 1849 | | 1-Jan | 40
6/17/1880 | CEN | | 25 | 1855 | | 1-Jun | 25
2/12/1951 | FUN | | | | 11/17/1932 | | 10
1/1/1955 | GRAV | 10/1/1845 | | 1845 | 11/10/1931 | | 5
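One way to compare the competing birth-year hypotheses is to group the records by implied birth year and add up their rates. The slide does not say how the rates should be combined, so the summing below is an assumption made purely for illustration. A minimal sketch in Python:

```python
from collections import defaultdict

# (record date, record type, implied birth year, rate), from the table.
# The funeral record is left out because it reports no birth information.
records = [
    ("8/23/1860", "CEN", 1852, 60),
    ("7/14/1870", "CEN", 1852, 60),
    ("6/2/1872", "MAR", 1853, 40),
    ("1/22/1920", "CEN", 1849, 40),
    ("6/17/1880", "CEN", 1855, 25),
    ("1/1/1955", "GRAV", 1845, 5),
]

# Assumption: support for a hypothesis is the sum of the rates of the
# records that imply it.
support = defaultdict(int)
for _record_date, _record_type, birth_year, rate in records:
    support[birth_year] += rate

# Strongest hypothesis first.
ranked = sorted(support.items(), key=lambda item: item[1], reverse=True)
for birth_year, total in ranked:
    print(birth_year, total)
```

Under that assumption 1852 leads, because two census records agree on it, while the grave marker's 1845 has the least support.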
The Digital Research Assistant
• Search for records on internet
• Evaluate their relevance to assignment
• Evaluate their granularity, confidence, etc
• Evaluate patterns, such as families
• Report matches
• Let me set the knobs for the parameters
The DRA will have ...
• A hierarchy of useful comparison algorithms
• A method of searching across the Internet - and paying for it
• A method of documenting the source of that search that satisfies the rules of preserving intellectual property and academic research
Who knows what the formula will be?
• We are asking which dragons must be slain, but we aren’t saying how it must happen.
• We are talking about possible ways to accomplish our goal.
• That goal is connecting to new information, with confidence.
Summary
• Any type of review
– Measurements of Records
– Measurement of conclusions
– Rating of publishers
• Mechanical searches
– Record Linking
– Smart Searches
– Groupwork and Rights
Never forget to have fun