HATHITRUST A Shared Digital Repository
“Unique,” “Descriptive,” and Other Damned Lies: The Challenges of
Identifying Related Records
Valerie Glenn and Bill Dueber
LITA Forum
November 14, 2015
Overview
• Introduction/Background
• What we’re trying to do & why
• What is a Federal Government Document?
• What’s been done
• Next steps
Background
• 2011 Constitutional Convention – Ballot Initiative #4
• Resolved: “that HathiTrust facilitate collective action to create a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies”
• Resolved: “that HathiTrust develop a process of catalog record review to ensure accurate and full display of U.S. federal publications including those issued by GPO and other federal agencies”
What are we trying to do?
• Define the corpus of US federal documents
• Identify documents that aren’t in the HathiTrust Digital Library
• Find documents and digitize them
What is a Federal Document?
• How the HathiTrust Digital Library defines federal document
• How the Registry defines federal document
• How libraries identified federal documents
• Examples out-of-scope/bad records:
  • Uncharted
  • Other governments’ documents
  • Organizations with “United States” or “national” in their name
  • Reprints / reproductions
What’s Been Done
• Matching on Identifiers
  • OCLC #
  • LCCN
  • ISSN
  • SuDoc Call number
• “Duplicates”
• Related (parts of the same series, etc.)
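Identifier matching of this kind can be sketched as grouping records that share any identifier value. The following is a minimal Python illustration, not the Registry's actual code: the record ids, the identifier values, and the function name `cluster_by_identifier` are all made up for the example. It produces candidate clusters only, since a shared identifier does not prove that two records describe the same item.

```python
from collections import defaultdict

def cluster_by_identifier(records):
    """Group record ids that share any (identifier type, value) pair.

    `records` maps a hypothetical record id to a set of
    (type, value) tuples such as ("oclc", "1001").
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (type, value) -> first record id that carried it
    for rec_id, identifiers in records.items():
        find(rec_id)
        for key in identifiers:
            if key in first_seen:
                union(first_seen[key], rec_id)
            else:
                first_seen[key] = rec_id

    clusters = defaultdict(set)
    for rec_id in records:
        clusters[find(rec_id)].add(rec_id)
    return sorted(sorted(c) for c in clusters.values())

# Made-up identifiers, for illustration only.
records = {
    "rec1": {("oclc", "1001"), ("lccn", "73-000001")},
    "rec2": {("oclc", "1001")},
    "rec3": {("lccn", "73-000001"), ("sudoc", "X 1.1:")},
    "rec4": {("issn", "0000-0000")},
}
print(cluster_by_identifier(records))
# [['rec1', 'rec2', 'rec3'], ['rec4']]
```

Here rec2 and rec3 never share an identifier directly; they land in one cluster because each matches rec1. That transitive behavior is a design choice of the sketch, and it is also how one bad identifier can merge unrelated records.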
Enumeration and Chronology
Image found at http://goo.gl/qkrd0Q
Quick Record-matching Quiz #1
•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973
•A textbook of oral pathology by Shafer, William G. Published: 1974
Quick Record-matching Quiz #2
•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973
•Mathematical preparation for general physics with calculus / by Davidson, Ronald Published: 1973
Quick Record-matching Quiz #3
What is the most reliable unique identifier in all of Libraryland?
OCLC Number
FEEL BAD!!!!!
Enum/Chron
FEEL BAD!!!!!
Examples
1985
v. 3
NO. 1-12 1963-64
This stuff we can parse with a few dozen lines of Ruby, or even a regex.
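As a rough idea of what parsing the easy cases looks like, here is a minimal sketch in Python (the slide mentions Ruby; this is not the presenters' parser). The label set, the output shape, and the name `parse_simple` are assumptions made for the example. Anything the two patterns do not fully account for is reported as unparsed rather than guessed at.

```python
import re

# Assumed label vocabulary; real enum/chron strings use many more.
LABELS = {"V": "volume", "NO": "number", "PT": "part"}
LABELED = re.compile(r"\b(V|NO|PT)S?\.?\s*(\d+)(?:-(\d+))?", re.IGNORECASE)
# A four-digit year, optionally followed by "-YY" or "-YYYY".
YEAR = re.compile(r"\b(1[6-9]\d\d|20\d\d)(?:-(\d{2}|\d{4}))?\b")

def parse_simple(enumchron):
    """Return a dict of (start, end) ranges, or None if text is left over."""
    parsed = {}

    def take_labeled(m):
        start = int(m.group(2))
        end = int(m.group(3)) if m.group(3) else start
        parsed[LABELS[m.group(1).upper()]] = (start, end)
        return " "

    def take_year(m):
        start = int(m.group(1))
        end = start
        if m.group(2):
            tail = m.group(2)
            # "1963-64" -> 1964: borrow the missing leading digits.
            end = int(m.group(1)[: 4 - len(tail)] + tail)
        parsed["year"] = (start, end)
        return " "

    rest = LABELED.sub(take_labeled, enumchron)
    rest = YEAR.sub(take_year, rest, count=1)
    if rest.strip(" .,:;"):
        return None  # something we did not understand remains
    return parsed

print(parse_simple("1985"))              # {'year': (1985, 1985)}
print(parse_simple("v. 3"))              # {'volume': (3, 3)}
print(parse_simple("NO. 1-12 1963-64"))  # {'number': (1, 12), 'year': (1963, 1964)}
```

Returning None on leftover text is deliberate: a partial parse that silently drops a supplement or an index would look correct while being wrong.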
Examples
V. 138 NO. 125-127 PT. 2 SEP 15-17 1992
NO. 3-4, 8, 13, 15-20, 22-23, 25, 27-28, 30-31, 33, 39-41, 43-44:V. 1, 45-46, 48-58, 63-66, 68-81, 83-91, 95, 99-113, 115-128, 130-133, 135-136, 144-145, 147-148, 151, 155, 157, 159, 162-164, 173-174, 178, 180, 182, 185, 190, 195, 198-199, 201-202, 205, 207-208
Examples
V. 33:NO. 36-54+SS1-4;SUP. ;ANNUAL SUMM. 1984
Examples
31-40D
V. 45:NO. 7-9V. 45:NO. 7-92008
2011:pt.1 (1.501-1.640) = P.1 (1.501-1.640)/2011
V 11-13,14b/d no 11ab - 14 Jul 93 + abs 1992/93 c-f not e index
Examples
982
NOS. 9-1461 WITH MANY EXCEPTIONS
So...where are we?
• Parser is up over 1,000 lines, with a long way to go
• “parse” about 65% of enumchron (3.5M)
• Not at all sure they’re all right
• ...or how to compare them
• ...or how to do gap detection
• ...or what to do with the other 35%
FEEL BAD!!!!!
Next steps
• Refine enum/chron parsing
• String matching
• Automated gap detection
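For the simplest case, automated gap detection might look like the sketch below: expand a list of issue numbers and ranges, then report what is missing in between. This is an illustration in Python under strong assumptions, not a planned implementation: the input is a clean, comma-separated list with the "NO." label already removed, and the function names are hypothetical. Real strings with embedded volume markers or suffixes would be misread by it.

```python
import re

def expand_ranges(text):
    """Expand a string like '3-4, 8, 15-20' into a sorted list of integers."""
    numbers = set()
    for start, end in re.findall(r"(\d+)(?:-(\d+))?", text):
        lo = int(start)
        hi = int(end) if end else lo
        numbers.update(range(lo, hi + 1))
    return sorted(numbers)

def find_gaps(held):
    """Return (first_missing, last_missing) pairs between held numbers."""
    gaps = []
    for prev, nxt in zip(held, held[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps

# The opening of one of the example strings shown earlier, label removed.
held = expand_ranges("3-4, 8, 13, 15-20, 22-23")
print(find_gaps(held))
# [(5, 7), (9, 12), (14, 14), (21, 21)]
```

The sketch only sees gaps between the lowest and highest held numbers. Whether anything is missing before the first or after the last issue cannot be determined from the holdings string alone.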
How to find out more
• HathiTrust Registry of US Federal Government Documents: http://www.hathitrust.org/usdocs_registry
• Contact Bill: [email protected] / @billdueber
• Contact Valerie: [email protected] / @vdglenn
Thank you! Questions?