AN EXPLORATORY STUDY OF THE DESCRIPTION FIELD
IN THE DIGITAL PUBLIC LIBRARY OF AMERICA
Hannah Tarver
Oksana L. Zavalina
Mark Phillips
October 14, 2016
Outline of presentation
• Introduction and background
• Methodology of the study
• Some findings
• Discussion
• Conclusions and future research
INTRODUCTION AND BACKGROUND
Description metadata
• Repeatable metadata elements with free-text data values
• One or more elements in different metadata schemes
  • Dublin Core: Description (abstract, table of contents)
  • Metadata Object Description Schema: Abstract, Note, Table of Contents
  • Encoded Archival Description: Scope and Content
  • Machine-Readable Cataloging: 5XX notes (a total of 53 fields)
Description metadata: data values
• Best practice recommendations suggest including information about the information object, such as:
  • an abstract, table of contents, reference to a graphical representation of content, or a free-text account of the content (DC)
  • subject, significance, and function (CCO and CDWA)
  • provenance and history, language (OSU Libraries)
  • etc.
Description metadata application
Inconsistent levels of application
• in 40%-99% of metadata records:
  • 50.9% in OAIster (Ward, 2003)
  • 40-75% across university digital repositories (Kurtz, 2010)
  • 99% in digital video collections (Weagley, Gelches, & Park, 2010)
  • 100% in digital image collections (Park, 2006)
• by 72%-89% of data providers in aggregations:
  • 72% in OAIster (Ward, 2003)
  • 89% in IMLS DCC aggregation (Jackson et al., 2008)
DPLA
• Digital Public Library of America (launched in 2013)
• One of the largest and fastest-growing digital repositories
• Distributed network model:
  • Content hubs
  • Service hubs
DPLA & Description
Description defined in DPLA Metadata Application Profile as “Includes but is not limited to: an abstract, a table of contents, or a free-text account of described resource”
• Discrepancy in Description documentation
  • recommended field in Intro to DPLA Data Model: version 4 (2015)
  • optional in complete DPLA Metadata Application Profile: version 4 (2015)
• Metadata normalized at harvesting into DPLA
  • Various native metadata fields mapped to dcterms:description
Problem statement
Lack of systematic empirical studies of digital library metadata with the focus on free-text Description metadata
• in very large aggregations (e.g., HathiTrust, Internet Archive, DPLA, etc.)
METHODS
Problem statement & Research Questions
Lack of systematic empirical studies of digital library metadata with the focus on free-text Description metadata
• in very large aggregations (e.g., HathiTrust, Internet Archive, DPLA, etc.)
• What is the overall usage of Description field by hubs in DPLA?
• How can length of data values provide insight into Description metadata practices among DPLA hubs?
Data collection & processing
Big Data approach: DPLA Bulk Download http://dp.la/info/developers/download
• over 11.5 million metadata records in a single compressed JSON file
• each record parsed, all instances of Description field extracted
• Solr full-text indexer for Description fields: StatsComponent http://wiki.apache.org/solr/StatsComponent
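The extraction step ("each record parsed, all instances of Description field extracted") could look roughly like the sketch below. It is an illustration only, not the code used in the study. The field paths (`_source`, `sourceResource.description`, `provider.name`) and both function names are assumptions made for this example; the fallback label `undefined_provider` is borrowed from the hub list in the findings. Dropping blank strings is also a choice made here, not something the slides specify.

```python
from typing import Any, Dict, List


def description_instances(record: Dict[str, Any]) -> List[str]:
    """Return every Description value in one parsed record as a list of strings."""
    # Assumed layout: the field lives at sourceResource.description,
    # optionally wrapped in a "_source" envelope.
    source = record.get("_source", record)
    resource = source.get("sourceResource") or {}
    value = resource.get("description")
    if value is None:
        return []
    # The field is repeatable: normalize a single string to a one-item list.
    if isinstance(value, str):
        value = [value]
    # Keep only non-empty strings so blank instances are not counted.
    return [v for v in value if isinstance(v, str) and v.strip()]


def hub_name(record: Dict[str, Any]) -> str:
    """Return the contributing hub, or a placeholder when none is recorded."""
    source = record.get("_source", record)
    provider = source.get("provider") or {}
    return provider.get("name", "undefined_provider")
```

Normalizing every record to a list of strings keeps the later per-hub counts simple: the number of instances in a record is just the length of the returned list.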
Data analysis
• Level of application of Description field by DPLA hub
  • field instances per record
• Length of data value in Description field – range, mean, standard deviation of:
  • number of characters
  • number of words
  • average word length
  • proportion of data value that consists of letters, punctuation, or integers
• Content analysis of data values from Description field instances (n=200)
(The findings presented here)
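The length measures listed on the data analysis slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Solr StatsComponent computation used in the study: words are split on whitespace, "punctuation" means ASCII punctuation, and the standard deviation is the population form. The names `value_measures` and `summarize` are hypothetical.

```python
import statistics
import string
from typing import Dict, List


def value_measures(value: str) -> Dict[str, float]:
    """Compute the per-value length measures for one Description data value."""
    words = value.split()  # assumption: a word is a whitespace-delimited token
    n_chars = len(value)
    letters = sum(ch.isalpha() for ch in value)
    digits = sum(ch.isdigit() for ch in value)
    punct = sum(ch in string.punctuation for ch in value)  # ASCII punctuation only
    return {
        "chars": n_chars,
        "words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "pct_letters": letters / n_chars if n_chars else 0.0,
        "pct_punct": punct / n_chars if n_chars else 0.0,
        "pct_digits": digits / n_chars if n_chars else 0.0,
    }


def summarize(numbers: List[float]) -> Dict[str, float]:
    """Range, mean, and standard deviation for one measure across many values."""
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        # Population standard deviation: an assumption made for this sketch.
        "stdev": statistics.pstdev(numbers),
    }
```

For example, `value_measures("P950.")` reports 5 characters and 1 word, with most of the value made up of digits, which is the kind of profile that separates identifiers from prose descriptions.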
FINDINGS: LEVEL OF APPLICATION
% of records with 1+ Description instance
Only 5 out of 29 hubs include Description in all records; 22 hubs include it in 50%+ of records
MAX number of Description instances per record
[Bar chart: maximum number of Description instances per record, by hub; axis from 0 to 180. Hubs shown: artstor, bhl, cdl, david_rumsey, digital-commonwealth, digitalnc, esdn, georgia, getty, gpo, harvard, hathitrust, indiana, internet_archive, kdl, mdl, missouri-hub, mwdl, nara, nypl, scdl, smithsonian, the_portal_to_texas_history, tn, uiuc, undefined_provider, usc, virginia, washington]
7 hubs with over 20 instances: 4 with over 50
Average no. of Description instances per record
Above 2.00 for 8 of 29 hubs: 5 of these hubs have unusually high max number of instances
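The three level-of-application measures reported on the preceding slides (share of records with at least one Description instance, maximum instances per record, average instances per record) can be sketched as a single pass over per-record counts. This is an illustration only. The function name is hypothetical, and the average is assumed to be taken over all of a hub's records, including those with no Description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def level_of_application(counts: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, float]]:
    """Aggregate (hub, number of Description instances) pairs, one per record."""
    totals = defaultdict(lambda: {"records": 0, "with_desc": 0, "instances": 0, "max": 0})
    for hub, n in counts:
        t = totals[hub]
        t["records"] += 1
        t["with_desc"] += 1 if n > 0 else 0
        t["instances"] += n
        t["max"] = max(t["max"], n)
    return {
        hub: {
            "pct_with_description": 100.0 * t["with_desc"] / t["records"],
            "max_instances": t["max"],
            # Assumption: averaged over every record in the hub.
            "avg_instances": t["instances"] / t["records"],
        }
        for hub, t in totals.items()
    }
```

Because the function consumes an iterable, the counts can be streamed from the parsed records without holding millions of records in memory.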
% of Description instances with unique values
[Chart: % of Description instances with unique values, by hub; axis from 0% to 90%]
Under 50% of values are unique for all but 6 hubs
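One way to compute a uniqueness percentage for a hub is sketched below. The slides do not define "unique", so this sketch assumes it means the number of distinct values divided by the number of instances; counting only values that occur exactly once would be an equally plausible reading and would give lower figures. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def pct_unique(values: Iterable[str]) -> float:
    """Distinct Description values as a percentage of all instances."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Assumption: "unique" is read as distinct values over total instances.
    return 100.0 * len(counts) / total
```

Boilerplate values repeated across many records (for instance a short format note) pull this percentage down quickly, which is consistent with the low figures reported for most hubs.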
FINDINGS: LENGTH OF DATA VALUES
Ave. length of data values (no. of characters)
200 characters or less for most hubs
Very high variability for 2 hubs
MAX length of data values (no. of characters)
More findings: content analysis of data values
Data value in Description field | Category
1 glass negative: b&w; 8 x 10 in.; sulfiding. | Physical object description
This material has been provided by The Royal College of Surgeons of England. The original may be consulted at The Royal College of Surgeons of England | Rights or usage statement
This image shows a section of Thorn Cemetery including gravestones. | Object content description
Microform. | Object type or format
Title supplied by cataloger. | Note or metadata source
This series contains transcripts of proceedings, depositions, and oral examinations prepared exclusively for or in the District Court. The depositions and oral examinations were taken out of court and are primarily interviews with School Board representatives and employees concerning the development, implementation, and review of desegregation plans. | Collection-level content description
P950. | Identifier or call number
(Main source of non-unique data values; see slide 18)
DISCUSSION, CONCLUSIONS & FUTURE RESEARCH
Variability issues
• Not all hubs (and/or partner institutions within service hubs) may:
  • consider free-text Description fields (e.g., notes of various kinds) to be equally important, or
  • enforce the usage of Description fields
• Wide variability of the number of instances of Description and the length of data values
  • Higher (but not outlying) lengths could indicate more rigorous standards of Description in hubs
  • Outlier records with Description field values 20,000 characters+ should be reviewed as to their appropriateness to local descriptive metadata input rules
(Unexpected: departure from best practice guidelines)
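Flagging the outlier records mentioned above for review could be done with a simple length filter. This is a sketch under assumptions: the 20,000-character cut-off comes from the slide, while the function name and the (record identifier, value) input shape are hypothetical.

```python
from typing import Iterable, List, Tuple

REVIEW_THRESHOLD = 20_000  # characters, the cut-off named on the slide


def flag_long_values(
    values: Iterable[Tuple[str, str]], threshold: int = REVIEW_THRESHOLD
) -> List[Tuple[str, int]]:
    """Return (record id, length) for each Description value at or above the threshold."""
    return [(rid, len(text)) for rid, text in values if len(text) >= threshold]
```

The output is a review list rather than a verdict: whether a long value is appropriate still depends on the local descriptive metadata input rules.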
Mapping issues
The variety of information types observed in Description data values might be due to 1 or more factors:
• differing perceptions of Description semantics among DPLA hubs and contributors
• absence of a better match than Description to map the information from native metadata (richer than Dublin Core) in DPLA
• insufficient consistency of contributed records for more accurate mapping
Other possible problems
• Context
  • Shorter values might be due to local practices where description of item is in the context of the description of collection
  • Context lost in aggregation
• Quality
  • Outliers with shortest data value lengths might indicate lack of relevant information about an item
  • Outliers with longest data value lengths might be due to full text harvested with the metadata record
Conclusions
• Simple statistical analyses can provide better understanding of metadata usage in aggregation
  • Large sets of empirical data (big data approach, eliminates sampling error)
  • Diversity allows for better understanding of contributing institutions’ varying practices
• Recommendation for aggregators to include in metadata application profiles a separate Note property for mapping of information that does not fit Description
• Additional research is needed (some ideas on the next 2 slides)
Future research
Use of language in Description data values
• number of words
• average word length
• proportion of data values that consist of letters, punctuation, or integers
• proportion of words from list of frequently-used English words (e.g., 1K, 5K etc., standard English dictionary)
(Data collected but not analyzed yet)
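The last measure in the list, the proportion of words drawn from a frequent-word list, might be computed along these lines. This is a sketch, not part of the study: the tokenization rule (runs of ASCII letters and apostrophes, lowercased) and the function name are assumptions, and the vocabulary would have to be loaded from whichever 1K or 5K list is chosen.

```python
import re
from typing import Set


def known_word_share(value: str, vocabulary: Set[str]) -> float:
    """Share of the words in a Description value that appear in a reference word list."""
    # Assumption: a word is a run of ASCII letters or apostrophes, compared in lowercase.
    words = re.findall(r"[A-Za-z']+", value.lower())
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)
```

A value made up mostly of common words is likely to be prose, while identifiers, call numbers, and physical-description strings would tend to score near zero.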
More future research
1. Categorizing information in Description
  • automatically identifying some of this information to map data values more accurately or mark them for review for quality control.
2. Comparing perceived importance with actual application of Description metadata in DPLA hubs’ native metadata
3. Research into:
  • How different institutions perceive item-level (and collection-level) metadata in native systems and as part of an aggregation
  • End-user perception of usefulness of descriptive information in helping find items
    ✓ could help to refine guidelines on Description field
Questions? Comments? Ideas?
Works cited
• Baca, M. and P. Harpring (Eds.). (2009). Categories for the Description of Works of Art (CDWA). Getty Research Institute, Santa Monica.
• Baca, M., et al. (2006). Cataloging Cultural Objects: A Guide to Describing Cultural Works and their Images. American Library Association, Chicago.
• Digital Public Library of America (2015a, March 5). An introduction to the DPLA metadata model. Retrieved from http://dp.la/info/wp-content/uploads/2015/03/Intro_to_DPLA_metadata_model.pdf
• Digital Public Library of America (2015b, March 5). Metadata application profile: Version 4.0. Retrieved from http://dp.la/info/wp-content/uploads/2015/03/MAPv4.pdf
• Encoded Archival Description. (2002). Retrieved from http://www.loc.gov/ead/
• Encoded Archival Description: EAD3. (2015). Retrieved from http://www2.archivists.org/sites/all/files/TagLibrary-VersionEAD3.pdf
• Hillmann, D. (2005). Using Dublin Core. Retrieved from http://dublincore.org/documents/usageguide/
• Jackson, A. S., M. Han, K. Groetsch, M. Mustafoff and T. W. Cole. (2008). Dublin Core metadata harvested through OAI-PMH. Journal of Library Metadata, 8(1), 5-21.
• Kurtz, M. (2010). Dublin Core, DSpace, and a brief analysis of three university repositories. Information Technology & Libraries, 29(1), 40-46. Retrieved from http://ejournals.bc.edu/ojs/index.php/ital/article/view/3157/2771
• OSU Knowledge Bank Metadata Application Profile for Digital Video. (2011). Retrieved from https://library.osu.edu/documents/knowledge-bank/KnowledgeBankMetadataApplicationProfile2011.pdf
• Park, J. (2006). Semantic interoperability and metadata quality: An analysis of metadata item records of digital image collections. Knowledge Organization, 33(1), 20-34.
• Ward, J. (2003). A quantitative analysis of unqualified Dublin Core metadata element set usage within data providers registered with the Open Archives Initiative. Proceedings of the 2003 Joint Conference on Digital Libraries, pp. 315-317.
• Weagley, J., E. Gelches, & J. Park. (2010). Interoperability and metadata quality in digital video repositories: a study of Dublin Core. Journal of Library Metadata, 10(1), 37-57. DOI: 10.1080/19386380903546984.