
3DAPAS/ECMLS Panel: Dynamic Distributed Data Intensive Analysis Environments for Life Sciences

June 8, 2011, San Jose

Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

Discussion Topics?

• Programming methods: languages vs. frameworks (advantages and disadvantages of each)

• Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?

• Does Life Sciences want clouds or supercomputers?

• Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?

• How important is data security and privacy?

Programming methods: languages vs. frameworks for data intensive/Life Science areas

• SaaS offers “Blast etc.” on demand

• Role of PGAS, data parallel compilers (like Chapel), i.e. mainstream HPC high level approaches

• Nodes v. Cores v. GPUs – do hybrid programming models have special features?

• MapReduce v. MPI

• Distributed environments like SAGA

• Data parallel analysis languages like Pig Latin, Sawzall

• Role of databases like SciDB and SQL based analysis

– See DryadLINQ

• Is R (cloud R, parallel R) critical?

– What about Excel, Matlab …
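To make the MapReduce side of the "MapReduce v. MPI" item concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to counting k-mers in sequence reads. The function names, the choice of k-mer counting as the task, and the sample reads are all illustrative assumptions; a real deployment would run the same three steps on a framework such as Hadoop rather than in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-length window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample reads, for illustration only.
    print(count_kmers(["ACGTAC", "GTACGT"], k=3))
```

Each map call touches only its own read, which is what lets a framework spread the map step across nodes without the explicit message passing an MPI version would need.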

Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?

• This is related to the privacy and programming model questions

• Is data stored in central resources?

• Does data have co-located compute resources (cloud)?

• If co-located, are data and compute on the same cluster?

– How is data spread out over disks on nodes?

• Or is data in a storage system supporting a wide area file system shared by nodes of the cloud?

• Or is data in a database (SciDB, SkyServer)?

• Or is data in an object store like OpenStack?

• What kind of middleware exists, or needs to be developed, to enable effective compute-data movement? Or is it just a run-time scheduling problem?

• What are the performance issues, and how do we get data there for dynamic data such as that produced by sequencers?
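One way to read the "run-time scheduling problem" question is as locality-aware task placement: run the task on a node that already holds the data, and move the data only when no such node is free. The sketch below illustrates that idea under stated assumptions; `place_task`, `replica_map`, and `free_slots` are hypothetical names, not part of any particular cloud or middleware API.

```python
def place_task(block_id, replica_map, free_slots):
    """Return (node, is_local) for a task that reads block_id.

    replica_map: dict mapping block id -> list of nodes holding a copy (hypothetical).
    free_slots:  dict mapping node -> number of idle task slots (hypothetical).
    The chosen node's slot count is decremented in place.
    """
    # Prefer moving compute to data: any replica holder with an idle slot.
    for node in replica_map.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True

    # Otherwise the data must move: pick the node with the most idle slots.
    candidates = [node for node, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None, False
    node = max(candidates, key=lambda n: free_slots[n])
    free_slots[node] -= 1
    return node, False


if __name__ == "__main__":
    replicas = {"b1": ["n1", "n2"]}
    slots = {"n1": 0, "n2": 1, "n3": 2}
    print(place_task("b1", replicas, slots))  # local placement on a replica holder
    print(place_task("b1", replicas, slots))  # replica holders busy, remote read
```

The second call shows the cost the slide asks about: once replica holders are busy, placement falls back to a remote node and the block has to cross the network.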

Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?

• Is it a few large centers?

• Is it a distributed set of repositories containing, say, all data from a particular lab?

– Or both of the above?

– How to manage and present the stream of new data?

• The world created ~1000 exabytes of data this year – how much will Life Sciences create?

• Relative importance of large shared data centers versus instrumental or computer generated individually owned data?

• Is data replication important?

• Storage model – files, objects, databases?

• How often are the different types of data read (presumably written once!)?

– Which data is most important? Raw or processed to some level?

• What is the metadata challenge?

Does Life Sciences want Clouds or Supercomputers?

• Clouds are cost effective and elastic for varying need

• Supercomputers support low latency (MPI) parallel applications

• Clouds are the main commercial offering; supercomputers are the main academic large scale computing solution

– Also Open Science Grid, EGI ….

• Cost (time) of transporting data from sequencers and repositories to analysis engines (clouds)

– Will NLR or Internet2 link to clouds? They do to TeraGrid

• What can LS data-intensive community learn from the HEP community? e.g., Will the HEP approach of community-wide "workload management systems" and VOs work?

• What is the role of Campus Clusters/resources in genomic data-sharing?

• No history of large cloud budgets in federal grants
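The "cost (time) of transporting data" item can be estimated with simple arithmetic: transfer time is the data size in bits divided by the usable link rate. The sketch below works that out; the 1 TB dataset, the 10 Gb/s link, and the `efficiency` factor are illustrative assumptions, not figures from the panel.

```python
def transfer_seconds(size_bytes, link_bits_per_sec, efficiency=1.0):
    """Estimate the seconds needed to move size_bytes over a network link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    link rate that is actually achieved end to end.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_rate = link_bits_per_sec * efficiency
    return size_bytes * 8 / usable_rate


if __name__ == "__main__":
    one_terabyte = 1e12   # bytes; assumed dataset size
    ten_gigabit = 10e9    # bits per second; assumed link rate
    ideal = transfer_seconds(one_terabyte, ten_gigabit)
    halved = transfer_seconds(one_terabyte, ten_gigabit, efficiency=0.5)
    print(f"ideal: {ideal / 60:.1f} min, at 50% efficiency: {halved / 60:.1f} min")
```

Under these assumptions a terabyte takes roughly 13 minutes at the full line rate, and the time scales linearly with dataset size and inversely with achieved bandwidth.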

How important is data security and privacy?

• Human Genome processing cannot use the most cost effective solutions, which will be shared resources such as public clouds

– Commercial, military applications

• What other research applications have such concerns?

– Analysis of copyrighted material such as digital books

• Partly technical; partly a policy issue of establishing a trusted approach

– Companies accept off site paper storage?

• See recent hacking attacks such as Sony network, Gmail

• How important is fault tolerance/autonomic computing?
