scientific data management (v2)

Scientific Data Management. A tutorial at ICADL 2011, October 24, 2011. Jian Qin, School of Information Studies, Syracuse University. http://eslib.ischool.syr.edu/

Upload: jian-qin

Post on 11-May-2015


Category:

Education



DESCRIPTION

Covers basic concepts of data and data management and a few examples of data services provided by universities.

TRANSCRIPT

Page 1: Scientific data management (v2)

Scientific Data Management
A tutorial at ICADL 2011

October 24, 2011

Jian Qin
School of Information Studies
Syracuse University
http://eslib.ischool.syr.edu/

Page 2: Scientific data management (v2)

The morning ahead

12/18/11 15:51 Overview of E-Science 2

An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?

Case study: The gravitational wave research data management

Group work: Role play in developing data management initiatives

Page 3: Scientific data management (v2)

Overview of E-Science

An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?

Characteristics of e-science

Data sets, data collections, and data repositories

Why does it matter to libraries?

Page 4: Scientific data management (v2)

E-Science

"In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet."

National e-Science Centre. (2008). Defining e-Science. http://www.nesc.ac.uk/nesc/define.html

Page 5: Scientific data management (v2)

E-Infrastructure for the research lifecycle

http://epubs.cclrc.ac.uk/bitstream/3857/science_lifecycle_STFC_poster1.PDF

Page 6: Scientific data management (v2)

Shift in Science Paradigms

• Thousand years ago: science was empirical, describing natural phenomena
• A few hundred years ago: theoretical branch, using models and generalizations
• A few decades ago: a computational approach, simulating complex phenomena
• Today: data exploration (eScience), unifying theory, experiment, and simulation
– Data captured by instruments or generated by simulator
– Processed by software
– Information/knowledge stored in computer
– Scientist analyzes database/files using data management and statistics

Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

Page 7: Scientific data management (v2)


Page 8: Scientific data management (v2)

X-Info

• The evolution of X-Info and Comp-X for each discipline X
• How to codify and represent our knowledge

The Generic Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share with others
• Query and Vis tools
• Building and executing models
• Integrating data and literature
• Documenting experiments
• Curation and long-term preservation

[Diagram labels: Experiments & Instruments; Simulations; Literature; Other Archives; facts; questions; answers]

Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

Page 9: Scientific data management (v2)

Useful resources
• What is eScience?
• eScience Initiatives
• Science Research and Data
• Science Data Management
• Literature Reviews
• Data Policy Issues
• eScience Research Centers

• http://eslib.ischool.syr.edu/index.php?option=com_content&view=section&id=9&Itemid=83

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Page 10: Scientific data management (v2)

A FEW IMPORTANT CONCEPTS

Page 11: Scientific data management (v2)

Data

Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.

An artist's conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems. http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf

Page 12: Scientific data management (v2)

Scientific data formats

• Common data format
• Image formats
• Matrix formats
• Microarray file formats
• Communication protocols

Page 13: Scientific data management (v2)

Scientific datasets
• The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
• The boundaries of datasets vary from discipline to discipline

NCSA HDF Development Group. (1998). HDF 4.1r2 User's Guide. http://www.hdfgroup.org/training/HDFtraining/UsersGuide/SDS_SD.fm1.html#48894
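To make the SDS definition above concrete, here is a minimal sketch in plain Python of "a group of data structures used to store and describe multidimensional arrays". It is not the HDF API: the class name `ScientificDataSet`, its fields, and the sample temperature values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ScientificDataSet:
    """Hypothetical, simplified stand-in for an SDS: array values plus their description."""
    name: str
    shape: Tuple[int, ...]        # size of each dimension
    dim_names: Tuple[str, ...]    # one label per dimension
    values: List[float]           # array values flattened in row-major order
    attributes: Dict[str, str] = field(default_factory=dict)  # descriptive metadata

    def __post_init__(self) -> None:
        # Check that the description is consistent with the stored values.
        expected = 1
        for size in self.shape:
            expected *= size
        if len(self.values) != expected:
            raise ValueError("number of values does not match shape")
        if len(self.dim_names) != len(self.shape):
            raise ValueError("need exactly one name per dimension")

    def get(self, *index: int) -> float:
        """Return the value at a multidimensional index (row-major layout)."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, size in zip(index, self.shape):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            offset = offset * size + i
        return self.values[offset]


# Made-up example: 2 time steps x 3 stations of temperature readings.
sds = ScientificDataSet(
    name="surface_temperature",
    shape=(2, 3),
    dim_names=("time", "station"),
    values=[10.0, 11.5, 9.8, 10.4, 12.1, 9.9],
    attributes={"units": "degrees Celsius"},
)
```

The point of the sketch is that the array alone is not the dataset: the shape, dimension names, and attributes travel with the values so that someone else can interpret them.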

Page 14: Scientific data management (v2)

Scientific workflows
• Steps in the data collection and analysis process
• Different types of scientific workflows:
– Data-intensive
– Compute-intensive
– Analysis-intensive
– Visualization-intensive

Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.

Page 15: Scientific data management (v2)

Example: Ecological dataset
• Floristic diversity data
– Related links
– Data attributes
– Download link

Page 16: Scientific data management (v2)

Example: Biodiversity dataset
• Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
– Metadata record output in different standard formats
– URL for dataset download

Page 17: Scientific data management (v2)

Example: The Significant Earthquake Database

• The Significant Earthquake Database
– A database containing data about significant earthquake events and the damages caused
– An interface for extracting a subset of data
– A link to download the whole dataset
– Documentation

Page 18: Scientific data management (v2)

Social Science Data

Page 19: Scientific data management (v2)

Research data collections

Data output varies along three dimensions:
• Size: from smaller, team-based to larger, discipline-based
• Metadata standards: from none or random to multiple, comprehensive
• Management: from a heroic individual inside the team to organized, institutionalized

Page 20: Scientific data management (v2)

Research collections
• Limited processing or long-term management
• Do not conform to any data standards
• Varying sizes and formats of data files
• Low level of processing, lack of plan for data products
• Low awareness of metadata standards and data management issues

Page 21: Scientific data management (v2)

Resource collections
• Authored by a community of investigators, within a domain of science or engineering
• Developed with community-level standards
• Lifetime is between mid- and long-term

• Example: Hubbard Brook Ecosystem Study (http://www.hubbardbrook.org)
– One of the regional sites in the Long Term Ecological Research Network (LTER)
– Community of the ecological domain
– Community of investigators from around the country on ecosystem study
– Ecological Metadata Language (EML), a community-level standard
– Cataloged, searchable dataset collections

Page 22: Scientific data management (v2)

Reference collection
• Example: Global Biodiversity Information Facility
– Created by large segments of the science community
– Conforms to robust, well-established and comprehensive standards, e.g.
  • ABCD (Access to Biological Collection Data)
  • Darwin Core
  • DiGIR (Distributed Generic Information Retrieval)
  • Dublin Core Metadata standard
  • GGF (Global Grid Forum)
  • Invasive Alien Species Profile
  • LSID (Life Sciences Identifier)
  • OGC (Open Geospatial Consortium)

Page 23: Scientific data management (v2)

Global Biodiversity Information Facility

http://www.gbif.org/informatics/discoverymetadata/a-metadata-infrastructure/

http://www.tdwg.org/standards/

Page 24: Scientific data management (v2)

Datasets, data collections, and data repositories

• Data collections are built for larger segments of science and engineering
• Datasets
– typically centered around an event or a study
– contain a single file or multiple files in various formats
– coupled with documentation about the background of data collection and processing
• Data repository: a system for storing, managing, preserving, and providing access to datasets

A repository may contain one or more data collections
A data collection may contain one or more datasets
A dataset may contain one or more data files
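The containment relationships on this slide (repository, data collection, dataset, data file) can be sketched as nested record types. This is only an illustrative model, not any repository system's actual schema; the class names, field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    path: str
    file_format: str


@dataclass
class Dataset:
    title: str
    documentation: str  # background on how the data were collected and processed
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Repository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def count_files(self) -> int:
        """Count data files across every dataset in every collection."""
        return sum(
            len(dataset.files)
            for collection in self.collections
            for dataset in collection.datasets
        )


# Made-up example: one repository, one collection, two datasets, three files.
repo = Repository(
    name="Example campus repository",
    collections=[
        DataCollection(
            name="Ecology",
            datasets=[
                Dataset(
                    title="Plot survey",
                    documentation="Field protocol and processing notes",
                    files=[
                        DataFile("survey.csv", "CSV"),
                        DataFile("readme.txt", "text"),
                    ],
                ),
                Dataset(
                    title="Stream chemistry",
                    documentation="Sampling schedule and lab methods",
                    files=[DataFile("chem.xml", "XML")],
                ),
            ],
        )
    ],
)
```

Each level holds a list of the level below it, which mirrors the "may contain one or more" wording of the slide.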

Page 25: Scientific data management (v2)

An emerging trend in academic libraries

Page 26: Scientific data management (v2)

Initiatives in research libraries

• Pressure points:
– Lack of resources
– Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
– Lack of a unifying direction on campus

Data support and services in institutions: 45%

Libraries involved in supporting eScience: 73%

Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://www.arl.org/bm~doc/escience_report2010.pdf

Page 27: Scientific data management (v2)

Data management challenges

• No one-size-fits-all solution
• Requires an in-depth understanding of scientific workflows and the research lifecycle
• Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy

Page 28: Scientific data management (v2)

Data preservation challenges

• Data formats
– Vary in data types, e.g. vector and raster data types
– Format conversions, e.g. from an old version to a newer one
• Data relations
– e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
• Semantic issues
– Naming datasets and attributes

Page 29: Scientific data management (v2)

Data access challenges

• Reliability
• Authenticity
• Leverage technology to make data access easier and more effective
– Cross-database search
– Integration applications

Page 30: Scientific data management (v2)

Supporting digital research data
• Lifecycle of research data
– Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
– Edit: organize, annotate, clean, filter…
– Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
– Publish: disseminate data via portals and associate datasets with research publications
– Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
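The lifecycle stages listed on this slide can be sketched as an ordered enumeration. The stage names come from the slide; the strictly linear ordering and the helper `next_stage` are simplifying assumptions made here for illustration, since real data may be reused or re-edited many times before being preserved or destroyed.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Lifecycle stages as named on the slide, in the order they are listed."""
    CREATE = 1
    EDIT = 2
    USE_REUSE = 3
    PUBLISH = 4
    PRESERVE_OR_DESTROY = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage listed after `stage`, or None at the end (assumes a linear flow)."""
    members = list(Stage)  # Enum members iterate in definition order
    position = members.index(stage)
    if position + 1 < len(members):
        return members[position + 1]
    return None


def remaining_stages(stage: Stage) -> list:
    """List every stage that follows `stage` under the same linear assumption."""
    members = list(Stage)
    return members[members.index(stage) + 1:]
```

For example, a service checklist could call `remaining_stages(Stage.EDIT)` to see which stages a dataset still has ahead of it.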

Page 31: Scientific data management (v2)

Supporting data management

The data deluge
• Numerical, image, video
• Models, simulations, bit streams
• XML, CSV, DB, HTML

Researchers need:
• Specialized search engines to discover the data they need
• Powerful data mining tools to use and analyze the data

Page 32: Scientific data management (v2)

Research data management

• Institution: financial and policy support
• Community: user requirements
• Science domain: data content idiosyncrasies

Evolving and interconnecting:
– Institutional repository
– Community repository
– National repository
– International repository

eScience librarian

Page 33: Scientific data management (v2)

Implications for the scholarly communication process

• Publishing: data publishing; new scholarly publishing models (open access, institutional and community repositories, self-publishing, library publishing, ....)
• Curation: maintaining, preserving and adding value to digital research data throughout its lifecycle.
• Archiving: the long-term storage, retrieval, and use of scientific data and methods.

Page 34: Scientific data management (v2)

12/18/11 15:50 Promoting scholarly communication: How to take the first step? 34

Evolution of terminology

Page 35: Scientific data management (v2)

Case study 1: Developing an institutional policy for data preservation and sharing

Page 36: Scientific data management (v2)

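Current state

• Researchers
• Data, files
• College and department servers
• Campus servers
• Campus institutional repository
• Disciplinary repositories
• Journal and conference paper publication

Questions about the data and files:
• What file formats?
• How are they organized?
• How are they used?
• Can they be shared with people outside the project team?
• If so, under what conditions and rules?
• How are files and data preserved?
• What laws and regulations must be followed?

Is there a disciplinary repository? Has anything been deposited? Is the campus repository linked to the disciplinary repository?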

Page 37: Scientific data management (v2)

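Current state
• No unified rules and regulations
• No awareness of file and data management
• No policies on data use and sharing

• Survey existing institutional data policies
• Obtain support from university leadership and relevant departments
• Proof of Concept Project

Goals
• Establish a unified policy for data acquisition, use, management, and sharing
• Build an institutional data repository (campus cyberinfrastructure-enabled support)
• Publicize widely and persuade researchers with evidence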

Page 38: Scientific data management (v2)

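Actions!

• President
– VP for Research: Office of Research, Library, IT services
– VP for Academic Affairs: iSchool, College…

Survey existing institutional data policies, write a report, and offer recommendations to the VP for Research

Collaborate with relevant university departments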

Page 39: Scientific data management (v2)


Page 40: Scientific data management (v2)

DATA MANAGEMENT PRACTICES IN ACADEMIC LIBRARIES

Page 41: Scientific data management (v2)

http://researchdata.wisc.edu/

Page 42: Scientific data management (v2)

https://confluence.cornell.edu/display/rdmsgweb/Home

Page 43: Scientific data management (v2)

http://libraries.mit.edu/guides/subjects/data-management/

Page 44: Scientific data management (v2)
Page 45: Scientific data management (v2)

Summary
• Managing research data is motivated by:
– Government funding agencies' policies
– Needs for data sharing, cross validation of data and research, credit, and large-scale interdisciplinary discovery
• Organizational changes:
– New organizational units within the university library or at the university level
– Virtual group
– Collaboration among key units: libraries, IT services, research administration office

Page 46: Scientific data management (v2)

Summary

• Types of services
– Training faculty and students for data literacy
– Data curation services (data repositories, digital libraries, archiving data)
– Consulting services
– Data management plan
– Developing data policies

Page 47: Scientific data management (v2)