brian lavoie research scientist oclc mining for copyright evidence asis&t 2008 columbus, oh...

Brian Lavoie
Research Scientist
OCLC

Mining for Copyright Evidence

ASIS&T 2008, Columbus, OH
October 28, 2008

Roadmap

•Copyright investigations

•OCLC Copyright Evidence Registry

•WorldCat as a source of copyright evidence

•Mapping across multiple data sources

Copyright investigation

•IPR issues amplified in digital environments
  • Rights management metadata: INDECS, ODRL, ONIX, PREMIS, …
  • Section 108 Study Group
  • LC/JISC study: copyright law & digital preservation

•Copyright investigation increasingly important
  • Mass digitization, Web harvesting, preservation, …
  • RLG Programs report (March 2008)
    • More common, but yet to converge on standardized workflow
    • Ambiguity over sources of copyright evidence, procedural due diligence, benchmarks for decision-making
    • Need data and tools to reduce cost and improve reliability of copyright investigations

OCLC Copyright Evidence Registry

CER essentials

•Collaborative environment for discovering and sharing information about copyright status of books
  • Search WorldCat and other data sources for copyright evidence
  • Record results of copyright investigations and share with others
  • Rules engine: implement your own rules for assessing copyright status as an automated process operating on information in CER

•Currently in pilot phase
  http://www.worldcat.org/copyrightevidence

•OCLC Research provided support during pilot development looking at:
  • WorldCat as source of copyright evidence
  • Mapping WorldCat data to other data sources
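The "rules engine" idea above can be pictured with a small sketch. This is not the CER implementation, which the slides do not describe; the `Evidence` class, the rule function, and the labels it returns are all hypothetical, and the labels describe the state of the evidence rather than any legal conclusion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evidence:
    """Hypothetical bundle of evidence about one book."""
    publication_year: Optional[int]
    renewal_found: Optional[bool]  # None means renewal records were not checked


# A rule inspects the evidence and returns a label, or None if it does not apply.
Rule = Callable[[Evidence], Optional[str]]


def rule_renewal_window(ev: Evidence) -> Optional[str]:
    # Illustrative rule: only looks at books published 1923-1963 (assumed inclusive).
    if ev.publication_year is None or not (1923 <= ev.publication_year <= 1963):
        return None
    if ev.renewal_found is None:
        return "needs renewal search"
    return "renewal on record" if ev.renewal_found else "no renewal found"


def assess(ev: Evidence, rules: List[Rule]) -> str:
    # The first rule that returns a label wins; otherwise the case stays open.
    for rule in rules:
        label = rule(ev)
        if label is not None:
            return label
    return "undetermined"
```

Each institution could supply its own list of rules to `assess`, which is one way to read "implement your own rules".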

00819cam 2200253Ka 4500
001 ocn180754687
003 OCoLC
005 20080625054239.0
008 071103s2008 nyua 000 0 eng d
040 $a BTCTA $c BTCTA $d BAKER $d JBL
020 $a 9780399534294
020 $a 0399534296
092 0 $a 636.7532 $2 22
100 1 $a Foster, Stephen, $d 1962-
245 10 $a Walking Ollie, or, Winning the love of a difficult dog / $c Stephen Foster.
246 30 $a Winning the love of a difficult dog
250 $a 1st American ed.
260 $a New York : $b Penguin Group, $c 2008.
300 $a 177 p. : $b ill. ; $c 21 cm.
500 $a "A Perigee book"
650 0 $a Lurcher $v Anecdotes.

WorldCat as a source of copyright evidence

WorldCat as a copyright evidence data source

•WorldCat very good source for detailed author/title information
  • Author/creator name(s): main entry (1xx), added entries (7xx)
  • Title information: Title statement (245), uniform title (240)

•Publication data also extensive, but sometimes a bit spotty …
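As a rough illustration of pulling these author and title data points out of a record, the sketch below works on a one-field-per-line text display of a MARC record (tag, indicators, then `$`-coded subfields). That line layout and the function names are assumptions made for this example; production code would normally read binary MARC or MARCXML with a dedicated library.

```python
import re


def parse_field(line):
    """Split a display line such as '245 10 $a Title / $c Author.' into
    (tag, {subfield_code: [values]})."""
    tag = line[:3]
    subfields = {}
    for code, value in re.findall(r"\$(\w)\s*([^$]*)", line[3:]):
        subfields.setdefault(code, []).append(value.strip())
    return tag, subfields


def author_title(lines):
    """Collect $a from 1xx/7xx fields (names) and from 245/240 (titles)."""
    authors, titles = [], []
    for line in lines:
        tag, sf = parse_field(line)
        if tag[0] in ("1", "7"):
            authors.extend(sf.get("a", []))
        elif tag in ("245", "240"):
            titles.extend(sf.get("a", []))
    return authors, titles


record = [
    "100 1 $a Foster, Stephen, $d 1962-",
    "245 10 $a Walking Ollie, or, Winning the love of a difficult dog / $c Stephen Foster.",
    "260 $a New York : $b Penguin Group, $c 2008.",
]
authors, titles = author_title(record)
```

The extracted values still carry cataloging punctuation (trailing commas and slashes), which is why a normalization step is needed before comparing them with other data sources.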

[Chart: Frequency of occurrence of key MARC data points in WorldCat records (books only)]

Copyright evidence from multiple sources

Example: WorldCat and the Stanford Copyright Renewal Database

•Copyright renewal important for items published between 1923 and 1963
  • Renewal required to extend copyright protection
  • Renewals after 1977: available in online database
  • Renewals before 1977: print form only

•Stanford Copyright Renewal Database
  • Converted pre-1977 renewal information to machine-readable form; manually searchable in online database
  • Books only
  • http://collections.stanford.edu/copyrightrenewals/

•Automate matching between Stanford records and WorldCat?

Automated matching

•Copy of Stanford database: 246,300 records

•Copy of WorldCat (January 2008): 96,185,960 records

•Cross-record field correspondence:

  Stanford    WorldCat
  TITL        245 $a
  AUTH        100 $a, first instance of 700 $a

•Constructed strings of normalized title/author key combinations; looked for matches across data sources
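The key-based matching described above might look roughly like the following. The slides do not give the actual normalization rules, so everything here is an assumption: lowercasing, stripping accents and punctuation, and sorting author-name tokens so that inverted and direct name order ("Carnes, Paul Nathaniel" versus "Paul Nathaniel Carnes") produce the same key.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    # Strip accents, lowercase, replace punctuation with spaces, collapse whitespace.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_key(title, author):
    # Sorting author tokens is an assumed heuristic to ignore name order.
    author_tokens = sorted(normalize(author).split())
    return normalize(title) + "|" + " ".join(author_tokens)


def find_matches(stanford, worldcat):
    """Inputs are iterables of (record_id, title, author) tuples.
    Returns (stanford_id, worldcat_id) pairs that share a key."""
    index = defaultdict(list)
    for wc_id, title, author in worldcat:
        index[match_key(title, author)].append(wc_id)
    pairs = []
    for st_id, title, author in stanford:
        for wc_id in index.get(match_key(title, author), []):
            pairs.append((st_id, wc_id))
    return pairs


stanford = [("RE066267", "For freedom and belief.", "Paul Nathaniel Carnes.")]
worldcat = [
    ("ocm04682408", "For freedom and belief;", "Carnes, Paul Nathaniel,"),
    ("ocn180754687", "Walking Ollie, or, Winning the love of a difficult dog /", "Foster, Stephen,"),
]
pairs = find_matches(stanford, worldcat)
```

Indexing the larger file once and probing it with keys from the smaller one keeps the work roughly linear in the number of records, which matters when one side has tens of millions of rows.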

Results

•430,070 matching pairs of Stanford/WorldCat records
  • Multiple WorldCat matches to some Stanford records
  • Implies … 81,663 unique Stanford records matched to WorldCat (about 33 percent of Stanford database)
  • Interpret as lower bound on number of potential matches

•Some QA …
  •Sample of matches checked manually to verify validity
    • Excellent results!
  •Sample of Stanford records with no WC match; checked manually to try to find match
    • Results mixed
    • Differences in formatting/parsing/division of data between renewal records and WorldCat

Matching precision

31 percent of matches were one-to-one
78 percent of matching clusters had 5 or fewer WorldCat records
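Figures of this kind can be derived from a list of matched pairs by grouping on the Stanford record ID. The sketch below is one possible reading, with a "cluster" taken to mean all WorldCat records matched to a single Stanford record and both shares computed over matched Stanford records; the slides do not spell out the exact definitions, and the function name is made up for illustration.

```python
from collections import Counter


def cluster_stats(pairs):
    """pairs: iterable of (stanford_id, worldcat_id) matches."""
    # Cluster size = number of distinct WorldCat records per Stanford record.
    sizes = Counter(st_id for st_id, _ in set(pairs))
    n = len(sizes)
    if n == 0:
        return {"matched_stanford_records": 0,
                "share_one_to_one": 0.0,
                "share_five_or_fewer": 0.0}
    one_to_one = sum(1 for size in sizes.values() if size == 1)
    small = sum(1 for size in sizes.values() if size <= 5)
    return {"matched_stanford_records": n,
            "share_one_to_one": one_to_one / n,
            "share_five_or_fewer": small / n}


# Toy data: A and D have one match each, B has two, C has six.
pairs = [("A", "w1"), ("B", "w1"), ("B", "w2"), ("D", "w9")]
pairs += [("C", "w%d" % i) for i in range(10, 16)]
stats = cluster_stats(pairs)
```

Applied to the reported totals, 430,070 pairs over 81,663 matched Stanford records works out to an average of roughly 5.3 WorldCat records per cluster.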

Example

WorldCat record

00612nam 2200205I 4500
001 ocm04682408
003 OCoLC
005 20010627101256.0
008 790222s1952 mau 000 0 eng
010 $a 53006423
040 $a DLC $c CLE
050 $a BX9842 $b .C27
092 $a 288 $b 205
100 1 $a Carnes, Paul Nathaniel, $d 1921-
245 10 $a For freedom and belief; $b a manual for Unitarians.
260 $a Boston, $b Beacon Press $c [1952]
300 $a 71 p. $b illus. $c 20 cm.
490 0 $a Beacon references series
650 0 $a Unitarian Universalist churches $x Doctrinal and controversial works.

Stanford record

ID: RE066267
DATE: 1980
TITL: For freedom and belief.
AUTH: Paul Nathaniel Carnes.
OREG: A76385
DREG: 11Sep80
ODAT: 18Dec52
CLNA: Freda Carnes (W)
OCLS: A
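Once a pair like this has been matched on title and author, other fields can serve as a cross-check. The sketch below parses the Stanford record's "KEY: value" lines and compares the year in ODAT with the year in the WorldCat 260 $c. Reading ODAT as the original registration date and DREG as the renewal registration date is an assumption here, as is treating every two-digit year as 19xx; the helper names are hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_renewal_record(lines):
    """Turn 'KEY: value' lines into a dict, splitting on the first colon."""
    record = {}
    for line in lines:
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def parse_short_date(text):
    # '18Dec52' -> date(1952, 12, 18). Assumes 19xx; strptime's %y would
    # map 52 to 2052, so the century is added by hand.
    m = re.fullmatch(r"(\d{1,2})([A-Za-z]{3})(\d{2})", text)
    if m is None:
        raise ValueError("unrecognized date: %r" % text)
    return date(1900 + int(m.group(3)), MONTHS[m.group(2).title()], int(m.group(1)))


def year_from_imprint(subfield_c):
    # Pull a four-digit year out of a 260 $c value such as '[1952]'.
    m = re.search(r"\d{4}", subfield_c)
    return int(m.group()) if m else None


stanford = parse_renewal_record([
    "ID: RE066267", "DATE: 1980", "TITL: For freedom and belief.",
    "AUTH: Paul Nathaniel Carnes.", "OREG: A76385", "DREG: 11Sep80",
    "ODAT: 18Dec52", "CLNA: Freda Carnes (W)", "OCLS: A",
])
same_year = parse_short_date(stanford["ODAT"]).year == year_from_imprint("[1952]")
```

Agreement between the two years does not prove the records describe the same book, but it is one more piece of evidence that an automated process can record alongside the title/author match.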

Perspective

•Assessment of copyright status depends on available body of “copyright evidence”

•Often, thorough assessment will require synthesizing evidence from multiple sources
  • e.g., WorldCat and Stanford databases

•Cost and effort of accumulating copyright evidence lowered when links between data sources can be established through automated techniques

•Can apply many familiar data processing techniques for this purpose
  • Parsing/extracting data within records
  • Linking records across data sources

Conclusion

•Traditionally, WorldCat data supports cataloging, resource discovery, resource sharing
  • But WorldCat data can be repurposed to support range of library decision-making needs (e.g., copyright investigation)

•Decision-making is increasingly data-driven
  • What would an “evidence base” look like in various library decision-making contexts?
  • What questions need to be asked of the data? Can they be generalized & automated?

•Data-mining task is two-fold:
  • Identify/expose right WorldCat data to support need in question
  • Combine WorldCat data with relevant data from other sources

•Create value and lower cost
