scientific discovery

Upload: varunrc

Post on 16-Mar-2016


TRANSCRIPT

  • This work is licensed under a Creative Commons Attribution 3.0 United States License.

    The Fourth Paradigm: Data-Intensive Scientific Discovery

    Tony Hey, Corporate Vice President

    Microsoft External Research

  • Tony Hey: An Introduction

    Commander of the British Empire

  • The Fourth Paradigm

  • A Digital Data Deluge in Research

    Data collection: sensor networks, satellite surveys, high-throughput laboratory instruments, observation devices, supercomputers, LHC

    Data processing, analysis, visualization: legacy codes, workflows, data mining, indexing, searching, graphics

    Archiving: digital repositories, libraries, preservation

    SensorMap. Functionality: map navigation. Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

    Scientific visualizations. NSF Cyberinfrastructure report, March 2007

  • Emergence of a Fourth Research Paradigm

    1. Thousand years ago: Experimental Science. Description of natural phenomena

    2. Last few hundred years: Theoretical Science. Newton's Laws, Maxwell's Equations

    3. Last few decades: Computational Science. Simulation of complex phenomena

    4. Today: Data-Intensive Science. Scientists overwhelmed with data sets from many different sources: data captured by instruments, data generated by simulations, data generated by sensor networks

    eScience is the set of tools and technologies to support data federation and collaboration: for analysis and data mining, for data visualization and exploration, for scholarly communication and dissemination

    With thanks to Jim Gray

    Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the WorldWide Telescope service.

    Science must move from data to information to knowledge


  • "The impact of Jim Gray's thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science."

    Bill Gates, Chairman, Microsoft Corporation

    "One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive science. This is recognized as a new paradigm beyond experimental and theoretical research and computer simulations of natural phenomena, one that requires new tools, techniques, and ways of working."

    Douglas Kell, University of Manchester

    "The contributing authors in this volume have done an extraordinary job of helping to refine an understanding of this new paradigm from a variety of disciplinary perspectives."

    Gordon Bell, Microsoft Research

    http://research.microsoft.com/fourthparadigm/

  • Listed 7 key areas for action by Funding Agencies:

    1. Fund both development and support of software tools

    2. Invest at all levels of the funding pyramid

    3. Fund development of generic Laboratory Information Management Systems

    4. Fund research into scientific data management, data analysis, data visualization, new algorithms and tools

  • Remaining three key areas for action relate to the future of Scholarly Communication and Libraries:

    5. Establish Digital Libraries that support the other sciences like the NLM does for Medicine

    6. Fund development of new authoring tools and publication models

    7. Explore development of digital data libraries that contain scientific data (not just the metadata) and support integration with published literature

  • Developing a Sustainable e-Infrastructure

  • Accelerating time to insight with Advanced Research Tools and Services

    Our goal is to accelerate research by collaborating with academic communities to use advanced computer science research technologies

    Aim to help scientists spend less time on IT issues and more time on science by creating open tools and services based on Microsoft platforms and productivity software

  • Data Acquisition and Modeling

    Life Under Your Feet

    Researchers at The Johns Hopkins University are deploying large arrays of wireless soil sensors in a variety of environmental settings, including a park, an urban forest and a wetland. The networks enable scientists to monitor ecological changes on an unprecedented scale and offer insights into hydrology, greenhouse gases and the activity of organisms in the soil.

    The Swiss Experiment: Powerful Software Improves Environmental Forecasting

    Environmental scientists face many challenges in monitoring and understanding our planet's changing climate. Through an international collaboration called the Swiss Experiment, environmental scientists and computer science experts are deploying advanced sensor networks and data management tools to improve environmental monitoring and forecasting.

  • Collaboration and Visualization

    SciScope Speeds Data Retrieval from Multiple Repositories

    For environmental scientists and engineers, finding and retrieving relevant data can be a daunting and tedious task. Microsoft Research is developing an online search engine called SciScope that enables researchers to search multiple data repositories simultaneously and retrieve information in a consistent format.

    Research Information Center

    Collaboration and information sharing among researchers are among the most important but challenging aspects of scientific research. In recent years, scientists have begun using virtual research environments to exchange information with colleagues in specific areas of study. Microsoft Research and The British Library are teaming up to build the Research Information Centre.
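
    The idea of searching several repositories at once and returning results in one consistent format can be sketched roughly as below. This is only an illustration: the two in-memory "repositories", their field names, and the `search_all` helper are hypothetical and do not reflect SciScope's actual design or API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repositories, each with its own record layout.
REPO_A = [{"site": "Creek-01", "var": "temperature", "val": 12.5}]
REPO_B = [
    {"station_name": "Lake-7", "parameter": "temperature", "reading": 9.0},
    {"station_name": "Lake-7", "parameter": "ph", "reading": 7.2},
]


def search_a(variable):
    """Query repository A and map its records onto the common schema."""
    return [
        {"station": r["site"], "variable": r["var"], "value": r["val"], "source": "A"}
        for r in REPO_A
        if r["var"] == variable
    ]


def search_b(variable):
    """Query repository B and map its records onto the common schema."""
    return [
        {"station": r["station_name"], "variable": r["parameter"],
         "value": r["reading"], "source": "B"}
        for r in REPO_B
        if r["parameter"] == variable
    ]


def search_all(variable, adapters=(search_a, search_b)):
    """Run every adapter concurrently and merge the normalized results."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda adapter: adapter(variable), adapters))
    return [record for batch in batches for record in batch]


if __name__ == "__main__":
    for record in search_all("temperature"):
        print(record)
```

    Each adapter hides one repository's layout, so adding a repository means adding one function rather than changing the caller.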

  • Analysis and Data Mining

    PhyloD

    Statistical tool used to analyze DNA of HIV from large studies of infected patients

    Typical job, 10-20 CPU hours, with extreme jobs requiring 1K-2K CPU hours

    Very CPU efficient

    Requires a large number of test runs for a given job (1-10M tests)

    Highly compressed data per job (~100 KB per job)

    Trident: A Scientific Workflow Workbench Brings Clarity to Data

    Scientists at the University of Washington are working with Microsoft External Research to demonstrate how marrying visualization and workflow technologies can allow researchers to better manage, evaluate and interact with even the most complex scientific data sets.

  • Disseminate and Share

    Chem4Word: Chemistry Drawing in Word

    Created in collaboration with University of Cambridge; Peter Murray-Rust, et al.

    Relationships: Navigate and link referenced chemistry

    Data: Semantics stored in Chemistry Markup Language

    Intent: Recognizes chemical dictionary and ontology terms

    Author/edit 1D and 2D chemistry. Change chemical layout styles.

    Intelligence: Verifies validity of authored chemistry

  • Disseminate and Share

    Ontology Plug-In for Word

    Phil Bourne, Lynn Fink, John Wilbanks

    Relationships: Ontology browser

    Intent: Term recognition & disambiguation

    Services: Ontology download web service

  • Archiving and Preservation

    Zentity

    A semantic computing platform to store and expose relationships between digital assets

    Flexible data model enables many scenarios and can be easily extended over time

    Native support for RSS, OAI-PMH, OAI-ORE, AtomPub and SWORD

    Default web UI with CSS support and custom ASP.Net controls

  • Archiving and Preservation

    oreChem: the Chemical Semantic Web

    [Diagram labels: molecules, text, experiments, measurements, documents, data, scientists]

    Mashup (reuse) data

    Semantic storage

    Compound document authoring

  • Network analysis is of growing importance in academic, commercial, and Internet social media contexts

    Existing Social Network Tools are challenging for many novice users

    Tools like Excel are widely used

    Leveraging a spreadsheet as a host for Social Network Analysis lowers barriers to network data analysis and display

    Leverage spreadsheet for storage of edge and vertex data

    Apply dynamic filters to the data
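
    As a rough sketch of what storing edge data in spreadsheet rows and applying a filter could look like, the example below reads a small edge table and computes vertex degrees before and after filtering. The CSV text, the column names (`Vertex1`, `Vertex2`, `Weight`), and the helper functions are assumptions made for illustration, not the tool's actual worksheet layout or code.

```python
import csv
import io
from collections import Counter

# Hypothetical edge worksheet exported as CSV: one row per edge.
EDGE_SHEET = """Vertex1,Vertex2,Weight
alice,bob,3
alice,carol,1
bob,carol,2
carol,dave,5
"""


def load_edges(text):
    """Read edge rows into (source, target, weight) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Vertex1"], row["Vertex2"], int(row["Weight"])) for row in reader]


def filter_edges(edges, min_weight):
    """A simple 'dynamic filter': keep edges at or above a weight threshold."""
    return [edge for edge in edges if edge[2] >= min_weight]


def degrees(edges):
    """Undirected degree of each vertex, derived from the edge list."""
    counts = Counter()
    for source, target, _weight in edges:
        counts[source] += 1
        counts[target] += 1
    return dict(counts)


if __name__ == "__main__":
    edges = load_edges(EDGE_SHEET)
    print(degrees(edges))
    print(degrees(filter_edges(edges, min_weight=2)))
```

    Because the vertex metrics are derived from the edge rows, changing the filter threshold and recomputing is enough to update the view.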

  • Intent: Insert Creative Commons licenses from within Office 2007

    Relationships: license information stored as RDF/XML within the document OOXML

    Services: Integrates with Creative Commons Web API to create new licenses

    http://ccaddin2007.codeplex.com
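
    To make "license information stored as RDF/XML" concrete, here is a minimal sketch that builds one RDF/XML license statement using the Creative Commons `cc:license` property. How the add-in actually embeds this in the OOXML package is not shown in the slides, so the `license_rdf` helper and the shape of the output are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"


def license_rdf(license_url):
    """Return an RDF/XML string stating that this work carries the given license."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("cc", CC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:about="" refers to the enclosing document itself.
    work = ET.SubElement(root, f"{{{CC_NS}}}Work", {f"{{{RDF_NS}}}about": ""})
    ET.SubElement(work, f"{{{CC_NS}}}license", {f"{{{RDF_NS}}}resource": license_url})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(license_rdf("http://creativecommons.org/licenses/by/3.0/us/"))
```

    Storing the license as a machine-readable statement, rather than only as visible footer text, is what lets other tools discover it automatically.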

  • The Future Research e-Infrastructure: Client + Cloud

  • PhyloD

    Statistical tool used to analyze DNA of HIV from large studies of infected patients

    PhyloD was developed by Microsoft Research and has been highly impactful

    Small but important group of researchers: 100s of HIV and HepC researchers actively use it; 1000s of research communities rely on these results

    Typical job, 10-20 CPU hours, with extreme jobs requiring 1K-2K CPU hours

    Very CPU efficient

    Requires a large number of test runs for a given job (1-10M tests)

    Highly compressed data per job (~100 KB per job)

    PhyloD now ported as a Windows Azure Cloud Service

    Cloud enables agile deployment of scalable scientific services

    Cover of PLoS Biology, November 2008

    Courtesy of Roger Barga
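
    The job figures on this slide (many small tests, little input data) suggest why such a workload spreads well across cloud workers. The back-of-envelope sketch below assumes the tests are independent and roughly equal in cost, which the slide does not state; the helper names and the pairing of 20 CPU hours with 10M tests are illustrative only.

```python
def seconds_per_test(cpu_hours, n_tests):
    """Average CPU seconds spent per statistical test."""
    return cpu_hours * 3600 / n_tests


def split_tests(n_tests, n_workers):
    """Split test indices 0..n_tests-1 into near-equal (start, stop) ranges."""
    base, extra = divmod(n_tests, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def ideal_wall_clock_hours(cpu_hours, n_workers):
    """Best-case elapsed time, ignoring scheduling and data-transfer overhead."""
    return cpu_hours / n_workers


if __name__ == "__main__":
    print(seconds_per_test(20, 10_000_000))   # average cost of one test
    print(split_tests(10, 3))                 # toy example of work partitioning
    print(ideal_wall_clock_hours(2000, 100))  # extreme job on 100 workers
```

    Under these assumptions an extreme 2K CPU-hour job would take about 20 hours on 100 workers, and with only ~100 KB of input per job the data transfer cost stays small relative to compute.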

  • AzureMODIS Service

    [Architecture diagram labels: Scientist; Source Metadata; Scientific Results; AzureMODIS Service Web Role Portal; Request Queue; Source Imagery Download Sites; Reprojection Queue; Reduction Queue; Data Collection Stage; Reprojection Stage; Analysis/Reduction Stage]

    Science pipeline for download, initial processing and reduction of satellite imagery. Developed by MSR, UVa, UCB.

    Dramatically lowers resource and complexity barriers to use satellite imagery for terrestrial hydrology and geoscience.

    Common imagery location determination and upload from diverse sources

    Common reprojection and harmonization to produce science-ready imagery with the same length, time and quality attributes

    Optional scientist-provided reduction algorithm (.NET, Java, or MatLab)

    On-demand scalability beyond local desktop or cluster

    In use now to compute 10-year continental-scale water balance for North America. Per year:

    500 GB (~60K files) upload of 9 different source imagery products from 15 different locations

    400 GB reprojected harmonized imagery consuming ~3500 CPU hours

    5 GB reduced science result leveraging reported field data aggregates consuming ~60 CPU hours

    Additional science requests pending

    Expanding above to Europe

    Additional source imagery products and formats

    Catharine van Ingen (MSR), Jie Li, Marty Humphreys (UVA), Youngryel Ryu (UCB), Deb Agarwal (BWC/LBL)
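
    The staged, queue-connected structure of this pipeline (data collection, then reprojection, then reduction) can be sketched with in-process queues as below. This is a minimal illustration and not the service's implementation: Python's `queue.Queue` stands in for the cloud queues, and the stage functions, tile identifiers, and pixel values are placeholders invented for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Run one pipeline stage: take items from inbox, process, pass results on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(work(item))


# Placeholder stage functions; real stages would download, reproject,
# and reduce satellite imagery.
def collect(tile):
    return {"tile": tile, "pixels": [1, 2, 3, 4]}


def reproject(image):
    return {**image, "reprojected": True}


def reduce_image(image):
    return (image["tile"], sum(image["pixels"]) / len(image["pixels"]))


def run_pipeline(tiles):
    """Push tile requests through the three stages and collect the results."""
    request_q, reprojection_q, reduction_q, results_q = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=stage, args=(collect, request_q, reprojection_q)),
        threading.Thread(target=stage, args=(reproject, reprojection_q, reduction_q)),
        threading.Thread(target=stage, args=(reduce_image, reduction_q, results_q)),
    ]
    for worker in workers:
        worker.start()
    for tile in tiles:
        request_q.put(tile)
    request_q.put(STOP)
    for worker in workers:
        worker.join()
    results = []
    while True:
        item = results_q.get()
        if item is STOP:
            return results
        results.append(item)


if __name__ == "__main__":
    print(run_pipeline(["tile-a", "tile-b"]))
```

    Decoupling the stages with queues means each stage can be given more workers independently, which is one way to read the slide's "on-demand scalability" point.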

  • Led by Newcastle University, UK (Paul Watson), project supported by ER

    Investigating applicability of commercial clouds for scientific research

    Build a working prototype for use cases in chemo-informatics

    Uses Microsoft technologies to build science-related services (Windows Azure, Silverlight)

    Built initial proof of concept

    Silverlight UI for basic Quantitative Structure-Activity Relationship (QSAR) modeling

    Demonstrated ability to scale QSAR computations in Windows Azure

  • A knowledge ecosystem:

    A richer authoring experience

    An ecosystem of services

    Semantic storage

    Open, Collaborative, Interoperable, and Automatic

    Data/information is interconnected through machine-interpretable information (e.g. paper X is about star Y)

    Social networks are a special case of data meshes

    Attribution: Chris Bizer
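
    The "paper X is about star Y" example can be made concrete as subject-predicate-object statements that software can query. The tiny in-memory store below is a sketch under that assumption; the `TripleStore` class and the identifiers such as `paperX` and `isAbout` are hypothetical and are not taken from any system named in these slides.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")


class TripleStore:
    """A minimal in-memory store of subject-predicate-object statements."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


if __name__ == "__main__":
    store = TripleStore()
    store.add("paperX", "isAbout", "starY")
    store.add("paperZ", "isAbout", "starY")
    store.add("paperX", "hasAuthor", "scientistA")
    # Which papers are about star Y?
    for triple in store.match(predicate="isAbout", obj="starY"):
        print(triple.subject)
```

    The same pattern query works for people as for papers, which fits the slide's remark that social networks are a special case of data meshes.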

  • Vision of Future Research e-Infrastructure using Client + Cloud resources

    [Diagram labels: scholarly communications; domain-specific services; instant messaging; identity; document store; blogs & social networking; mail; notification; search; books; citations; visualization and analysis services; storage/data services; compute services; virtualization; project management; reference management; knowledge management; knowledge discovery]

  • The site contains access and downloads of relevant open tools and resources for the worldwide academic research community. Examples of our open tools and services:

    Plug-ins for Office: Ontology Add-in for Word; Article Authoring Add-in for Word; Chem4Word (Chemistry Drawing in Word)

    Microsoft Biology Foundation (MBF): Enables and accelerates fundamental advances in biology

    F#: Collaboration with the academic and research community on F#'s typed functional and object-oriented programming on the .NET platform

    Software Engineering Tools:

    Spec#: Program verifier for C# extended with design by contract

    VCC: Program verifier for Concurrent C

    PEX: automatic unit testing tool for .NET

    CHESS: Unit testing tools for concurrent Win32 executable and .NET

  • Microsoft Research: http://research.microsoft.com

    Microsoft Research downloads: http://research.microsoft.com/research/downloads

    Microsoft External Research: http://research.microsoft.com/externalresearch

    Science at Microsoft: http://www.microsoft.com/science

    CodePlex: http://www.codeplex.com

    The Faculty Connection: http://www.microsoft.com/education/facultyconnection

    MSDN Academic Alliance: http://msdn.microsoft.com/enus/academic
