moritz a universe of data

Some Notes on Digital Data – with a suggestion
Tom Moritz / Internet Archive
February, 2009

A UNIVERSE OF DATA???

What is “data”? The US NSF DataNet solicitation defines “data” as: “Any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” [i]

This definition is technically acceptable but not scientifically epistemic. In fact, it is useful to think of “data” in two distinct ways. “Data” refers (as in the DataNet definition) to the computer-readable code that is stored in, accessed from or flows between computers. “Data” also means precise, well-defined representations of observations, descriptions or measurements of a referent (object or event) recorded in some standard, well-specified way.

The more inclusive DataNet definition has the virtue of forcing us to consider a unified, holistic approach to knowledge and to the formal resources that inform and express it; we are forced to confront the Web as it exists today.

HOW MUCH DATA?

In a now famous quip, Lewis Carroll noted that the perfect scale for maps was 1:1 but that farmers tend to become disgruntled when such maps are unrolled over their fields. The notion that we could theoretically record “everything” in real time – “1:1 capture” – leaves us to ponder the limits of “data” collection, management and longevity – full-life-cycle [ii] curation and stewardship.

With the evolution of satellite coverages, nanotechnology, robotics and embedded network sensors, it is possible, for example, to systematically record presence/absence data for birds at a nesting site – at every nesting site in a given area – 24-7, forever [SEE for example: http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14 ] [iii] or for that matter to record every human heartbeat. [iv] And to archive these data in perpetuity?

(The casual assumption that we might comprehensively save all data is belied by a recent forecast projecting that in 2007, the total data produced on earth for the first time exceeded the available storage. [v])

Upload: tom-moritz

Post on 18-Dec-2014



[vi]

WHO’S RESPONSIBLE?

It is also the case that technology, standards and methodologies, that institutions, organizations and professions, have evolved and become established to manage and preserve logical domains of knowledge as well as selected technical formats of data.

The point respecting logical segments is relatively clear – natural history museums and herbaria hold preserved (i.e. dead) organisms as specimens; zoos and gardens and aquaria hold living organisms ex situ; protected areas hold living organisms in situ; cryogenics facilities hold tissue samples – similarly, their libraries hold logically corresponding published or archival works.

Respecting technical formats: libraries hold bound paper/print materials; archives hold unbound paper/manuscript or unbound paper/typescript materials; media repositories hold non-print media; computer centers hold datasets and complex models (hypothetical assemblages of data that generate new data); art museums hold paintings and sculptures; a dance company performs dances; and an indigenous group stewards its “old knowledge”.

Similarly, librarians and archivists, curators and zookeepers, rangers and information technologists, dancers and shamans have all received vocational charge for siloed segments of our “knowledge base”.

But who is responsible for the whole? Before the advent of digital technology this latter question would have been metaphysically interesting but pointless – no longer, it seems. Scanning our society and culture, it seems libraries and librarians are the most eligible candidates for the role.

And if the received “compartments” (organizational, professional, logical structures) are no longer dictated by operational constraints (e.g. the ability to curate a dragonfly or to select and conserve a book), how can we most effectively organize the management of knowledge as data?

At the national level, there are prime examples of institutions that admirably serve logical domains of our knowledge base; the National Library of Medicine is one. [vii] The Library of Congress alone has the stature and scope of interest to command our trust and expectations.

BUT DATA FOR WHAT???

Harvard biologist Richard Lewontin notes that – like the drunk looking for his keys under a streetlight “because the light is better there” – research has often been constrained to studies for which career-oriented researchers have the apparatus and methods to produce creditable (e.g. laudable, promotion-worthy) results. [viii]

Our current era has seen an evolution of technology that challenges comfortable “disciplinary” categories of research and conventional format-defined codes of fiduciary responsibility. Not only have traditional distinctions between the domains of the arts and the humanities and the sciences been challenged, but the conventions of scientific disciplines in themselves – as foci for research and investment – are being challenged. New possibilities for trans-disciplinarity are emerging, but the requisite tools and methods are not yet fully formed and organizational paths for such research are not always clear.

AND HOW DOES DATA HAVE MEANING?

When data is considered in the scientific or research context, its semantic properties necessarily become essential. Thus our ability to contextualize data becomes primary. Parameters of time and space are immediately relevant – some data will have a geographic context (deriving one parameter of meaning from location, in situ); other data will be essentially ageographic (ex situ), experimental and independent of geography but not of experimental frame. Time as a parameter of data may similarly be historical or ahistorical. Agency, materials, equipment (calibration) and operations also set primary parameters for data.

Huge – dare we say “exorbitant”? – investments have been made in the “metadata industry”, most particularly in library and archival cataloging. In the new-media Web environment, other solutions operating upon natural language and “native [pre-existent] metadata” have produced prodigious, cost-effective (profitable) results.

WHOSE DATA?

In an era when combinations and recombinations of data are routine, “demand side” problems occur respecting validation and certification of results and “supply side” problems occur respecting attribution and credit for the originators of data. Moreover, scientists’ claims for discrete personal “priority” of discovery are inevitably being challenged. Collaboration is more and more common and, as foreseen by Robert K. Merton [ix], an individual’s contribution to the whole corpus of knowledge is less and less clearly attributable. Notions of “authorship” are challenged by anonymous institutional/organizational claims to authorship. [x] And “small science” (ecology, field biology, etc.) – where the individual scientist is still seen as a single actor – is often perceived as weakly developed, as providing no more than “disaggregated components of an incipient network”. [xi]

At the same time there has been a quantum increase in the effort to isolate and to monetize intellectual property. [xii] Intellectual “assets” – whether in the form of genomic discoveries or scientific journal articles – have become increasingly commoditized. [xiii]

It is also the case that the digital environment has disrupted traditional economic value chains (this has been obviously true in the publishing industry and in the entertainment industry, where the consequences of these pressures have been accusations, threats and lawsuits – often to the bizarre extent that natural allies in the value chain have attacked each other, or even to the degree that customers/clients of an industry have been attacked by the industry itself).

A GLOBAL DATA IMPERATIVE???

Perhaps neglecting Faust (?), Thomas Jefferson asserted, “The field of knowledge is the common property of all mankind.” It seems more responsible to consider an ethical scale of need that compels free and open public access to the results of nondestructive research (obviously the definition of “nondestructive” requires debate). This spectrum of common need includes: human health, pharmacology, public health; agrarian and agricultural knowledge; environmental knowledge and conservation; and – more generally – most nondestructive science and technology, critical for education.

The dilemma we face, worldwide, is that most developing countries and developing segments of society are those least capable of clearing the thresholds of use imposed by market controls on knowledge in all forms. [xiv]

In the naive exuberance that formed the League of Nations, an “International Committee on Intellectual Cooperation” was envisioned as a forum for global focus on common goods. Today, in a far more exact way, we have the opportunity to plan and develop technical resources, standards and methodologies that will not deny the benefits of human knowledge to the least privileged.

A comprehensive strategy requires that we successfully address four primary modalities of constraint: technology, culture, economy and law.

The Internet Archive – focusing on R&D and prototyping – has built essential components of what could ultimately become a full service, full life cycle “collective utility” or “service cloud” for open digital management of human knowledge. This evolution does not require that the Archive itself become this “service cloud” but that it compose a comprehensive response and – together with other institutions and organizations, programs and initiatives – catalyze a comprehensive response. [xv]

Most essential elements are in place – or at least emerging. We can and should act now.

[i] Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601, p. 5.

[ii] “the data management life cycle (including data creation, access, use, and preservation)”. Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601, p. 5.

[iii] Or as another instance see recent NYT article: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009. http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse

[iv] The California poet William Everson once asked poignantly: “And when the last coyote has been tagged…?”

[v] “…the amount of information created, captured or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.” John Gantz et al. (IDC), The diverse and exploding digital universe: an updated forecast of worldwide information growth through 2011. (March, 2008) www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf

[vi] Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009. SEE: http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse

[vii] HISTORIC BUDGET SUPPORT FOR NLM

[viii] R. Lewontin, The Triple Helix: Gene, Organism, Environment

[ix] “Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic. The scientist’s claim to “his” intellectual “property” is limited to that of recognition and esteem which, if the institution functions with a modicum of efficiency, is roughly commensurate with the significance of the increments brought to the common fund of knowledge.” Robert K. Merton, “A Note on Science and Democracy,” Journal of Legal and Political Sociology 1 (1942): 121.

[x] SEE for example: Peter Galison, “The Collective Author,” in M. Biagioli and P. Galison (eds.), Scientific Authorship: Credit and Intellectual Property in Science. NY, Routledge, 2003.

[xi] SEE: The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. J.M. Esanu and P.F. Uhlir (eds.), Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies.

[xii] SEE L. Lessig, Code

[xiii] SEE Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,” MIT Sloan Management Review, 44 (2) Fall, 2002: 77.

[xiv] SEE for ex.:

[xv] A short list is relatively easy to compose…