"unique," "descriptive," and other damned lies: the challenges of identifying...

HATHITRUST A Shared Digital Repository “Unique,” “Descriptive,” and Other Damned Lies: The Challenges of Identifying Related Records Valerie Glenn and Bill Dueber LITA Forum November 14, 2015

TRANSCRIPT

Page 1: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records

HATHITRUST A Shared Digital Repository

“Unique,” “Descriptive,” and Other Damned Lies: The Challenges of

Identifying Related Records

Valerie Glenn and Bill Dueber

LITA Forum

November 14, 2015

Page 2

Overview

• Introduction/Background

• What we’re trying to do & why

• What is a Federal Government Document?

• What’s been done

• Next steps

Page 3

Background

• 2011 Constitutional Convention – Ballot Initiative #4

• Resolved: “that HathiTrust facilitate collective action to create a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies”

• Resolved: “that HathiTrust develop a process of catalog record review to ensure accurate and full display of U.S. federal publications including those issued by GPO and other federal agencies”

Page 4

What are we trying to do?

• Define the corpus of US federal documents

• Identify documents that aren’t in the HathiTrust Digital Library

• Find documents and digitize them

Page 6

What’s Been Done

• Matching on Identifiers
  • OCLC #
  • LCCN
  • ISSN
  • SuDoc Call number

• “Duplicates”

• Related (parts of the same series, etc.)
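One way to picture identifier matching is the Ruby sketch below. It is an illustration under stated assumptions, not the code behind the slides: the record shape (a Hash with `:id` and `:oclc` keys) and the helper names `normalize_oclc` and `candidate_matches` are hypothetical. It handles only OCLC numbers, stripping the common `(OCoLC)`, `ocm`, `ocn`, and `on` prefixes and any leading zeros before grouping records. The same grouping could in principle be repeated for LCCN, ISSN, and SuDoc numbers, each with its own normalization.

```ruby
require "set"

# Hypothetical helper: reduce an OCLC number string to its bare digits.
# Handles forms such as "(OCoLC)ocm00012345", "ocn123456789", and "12345".
def normalize_oclc(raw)
  digits = raw.to_s.sub(/\A\s*(?:\(OCoLC\))?\s*(?:ocm|ocn|on)?\s*/i, "")[/\A\d+/]
  digits && digits.sub(/\A0+(?=\d)/, "")
end

# Hypothetical helper: group record ids by normalized OCLC number and keep
# only the groups that contain more than one record.
def candidate_matches(records)
  groups = Hash.new { |hash, key| hash[key] = Set.new }
  records.each do |rec|
    rec[:oclc].each do |raw|
      key = normalize_oclc(raw)
      groups[key] << rec[:id] if key
    end
  end
  groups.select { |_, ids| ids.size > 1 }.transform_values(&:to_a)
end

records = [
  { id: "a", oclc: ["(OCoLC)ocm00012345"] },
  { id: "b", oclc: ["12345"] },
  { id: "c", oclc: ["ocn999"] }
]
matches = candidate_matches(records)
```

Here `matches` groups records "a" and "b" under the key "12345". A shared identifier only makes two records candidates: they may be true duplicates, related parts of one series, or unrelated records carrying a wrong number, so each group still needs review.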

Page 7

Enumeration and Chronology

Image found at http://goo.gl/qkrd0Q

Page 8

Quick Record-matching Quiz #1

•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973

•A textbook of oral pathology by Shafer, William G. Published: 1974

Page 9

Quick Record-matching Quiz #2

•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973

•Mathematical preparation for general physics with calculus / by Davidson, Ronald Published: 1973

Page 10

Quick Record-matching Quiz #3

What is the most reliable unique identifier in all of Libraryland?

Page 11

Quick Record-matching Quiz #3

What is the most reliable unique identifier in all of Libraryland?

OCLC Number

Page 12

FEEL BAD!!!!!

Page 13

Enum/Chron

Page 14

FEEL BAD!!!!!

Page 15

Examples

1985

v. 3

NO. 1-12 1963-64

This stuff we can parse with a few dozen lines of Ruby, or even a regex.
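As a rough illustration of what "a few dozen lines" might look like, here is a minimal Ruby sketch that handles only the three strings on this slide. It is not the parser described in the talk. The names `ENUM_LABELS`, `parse_simple_enumchron`, and `expand_year`, the label set (V, NO, PT), and the returned Hash shape are all assumptions made for the example.

```ruby
# Assumed label set; real enumchron data uses many more labels and spellings.
ENUM_LABELS = { "V" => :volume, "NO" => :number, "PT" => :part }.freeze

# Complete an abbreviated end year from its start year: ("1963", "64") => 1964.
def expand_year(start, finish)
  (start[0, 4 - finish.length] + finish).to_i
end

# Returns a Hash of ranges, or nil when any part of the string is not understood.
def parse_simple_enumchron(str)
  result = {}
  rest = str.strip.upcase

  # Labeled enumeration such as "V. 3" or "NO. 1-12"; matched text is removed.
  rest = rest.gsub(/\b(V|NO|PT)\.?\s*(\d+)(?:-(\d+))?/) do
    first = $2.to_i
    last = $3 ? $3.to_i : first
    result[ENUM_LABELS.fetch($1)] = (first..last)
    ""
  end

  # A four-digit year, optionally a range with an abbreviated end ("1963-64").
  if (m = rest.match(/\b(\d{4})(?:-(\d{2,4}))?\b/))
    start = m[1].to_i
    finish = m[2] ? expand_year(m[1], m[2]) : start
    result[:years] = (start..finish)
    rest = m.pre_match + m.post_match
  end

  # Leftover text means the string is outside what this sketch understands.
  rest.strip.empty? ? result : nil
end
```

With these assumptions, "1985" yields a single-year range, "v. 3" yields volume 3, and "NO. 1-12 1963-64" yields numbers 1 through 12 and years 1963 through 1964. Anything with leftover text, such as a month and day range, returns nil rather than a guess.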

Page 16

Examples

V. 138 NO. 125-127 PT. 2 SEP 15-17 1992

NO. 3-4, 8, 13, 15-20, 22-23, 25, 27-28, 30-31, 33, 39-41, 43-44:V. 1, 45-46, 48-58, 63-66, 68-81, 83-91, 95, 99-113, 115-128, 130-133, 135-136, 144-145, 147-148, 151, 155, 157, 159, 162-164, 173-174, 178, 180, 182, 185, 190, 195, 198-199, 201-202, 205, 207-208

Page 18

Examples

V. 33:NO. 36-54+SS1-4;SUP. ;ANNUAL SUMM. 1984

Page 19

Examples

31-40D

V. 45:NO. 7-9V. 45:NO. 7-92008

2011:pt.1 (1.501-1.640) = P.1 (1.501-1.640)/2011

V 11-13,14b/d no 11ab - 14 Jul 93 + abs 1992/93 c-f not e index

Page 20

Examples

982

NOS. 9-1461 WITH MANY EXCEPTIONS

Page 21

So...where are we?

• Parser is over 1,000 lines, with a long way to go

• “parse” about 65% of enumchron (3.5M)

• Not at all sure they’re all right

• ...or how to compare them

• ...or how to do gap detection

• ...or what to do with the other 35%

Page 22

FEEL BAD!!!!!

Page 23

Next steps

• Refine enum/chron parsing

• String matching

• Automated gap detection
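To make "automated gap detection" concrete, here is a small Ruby sketch of the simplest case. It is a sketch under assumptions, not the project's method. It assumes the enumeration has already been reduced to a clean comma-separated list of numbers and ranges. The helper names `expand_numbers` and `gaps` are hypothetical. A token like the "43-44:V. 1" in the long example earlier would raise an error here rather than be silently misread.

```ruby
# Expand a clean list such as "3-4, 8, 13, 15-20" into sorted integers.
# Integer(n, 10) raises ArgumentError on anything that is not a plain number.
def expand_numbers(list)
  list.split(",").flat_map do |token|
    first, last = token.strip.split("-", 2).map { |n| Integer(n, 10) }
    (first..(last || first)).to_a
  end.uniq.sort
end

# Return the runs of missing numbers between the lowest and highest held.
def gaps(held)
  held.uniq.sort.each_cons(2)
      .select { |a, b| b - a > 1 }
      .map { |a, b| (a + 1)..(b - 1) }
end

held = expand_numbers("3-4, 8, 13, 15-20")
missing = gaps(held)
```

For this input `missing` is the three ranges 5..7, 9..12, and 14..14. The sketch only sees gaps between the lowest and highest number held. It cannot tell whether issues 1 and 2 were ever published, or whether the series continued past 20.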

Page 24

How to find out more

• HathiTrust Registry of US Federal Government Documents: http://www.hathitrust.org/usdocs_registry

• Contact Bill: [email protected], @billdueber

• Contact Valerie: [email protected], @vdglenn

Page 25

Thank you! Questions?