calstate csuperb 2012 · discipline x e.g. (bio-informatics , computational-biology)! how to codify...

Post on 08-Oct-2020

iPlant Atmosphere: Emerging Role of Cloud Infrastructure for Training Data Scientists

Nirav Merchant
iPlant Collaborative
University of Arizona
http://iplantcollaborative.org

Topic Coverage
• The 4th Paradigm of Science
• What is “Cloud Computing”?
• What is “Big Data”?
• Why you should care about it
• Who is a “Data Scientist”?
• What is Cyberinfrastructure?
• iPlant Atmosphere: Embracing cloud platform

Science Paradigms
1. Thousand years ago: science was empirical
   describing natural phenomena, observations
2. Last few hundred years: theoretical branch
   using models, generalizations
3. Last few decades: a computational branch
   simulating complex phenomena
4. Today: data exploration (eScience)
   unify theory, experiment, and simulation

Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunication Board in Mountain View, CA, on January 11, 2007

The Fourth Paradigm: Data-Intensive Scientific Discovery
• "Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets."
• The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The Discovery Lifecycle

The Fourth Paradigm: Data-Intensive Scientific Discovery

Evolution of X-Info
• The evolution of X-Info and Comp-X for each discipline X, e.g. (Bio-Informatics, Computational-Biology)
• How to codify and represent our knowledge
• The Generic Problems:
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to share it with others
  • Query and Vis tools
  • Building and executing models
  • Integrating data and literature
  • Documenting experiments
  • Curation and long-term preservation

Simple Formula

The Reality

Perl, Python, Java, Ruby, Fortran, C, C#, C++, R, Matlab, etc.

Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc.

and lots of glue...

Simple Formula

• Classic paradigm: You produce data, analyze, interpret (end to end)

• Conventional paradigm: Consortia/centers produce data and you consume it

• New paradigm: Consortia/centers have produced data and are creating “cyberinfrastructure” to tackle the “grand challenge”

X prize for sequencing

The 2012 guidelines are different; this graphic is outdated

X prize for analyzing it?

So who else is talking about it?

The “V” of big data

• Volume
• Velocity
• Variety
• (Value)

Attributed to Gartner Consulting

Big Data
• Extracting meaningful results from vast amounts of data (linked data)
• Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• “Big Data” Is Only the Beginning of Extreme Information Management
• Big Data Technology: All Is Not New

Attributed to Gartner Consulting

The transition

http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

BioInformation :: Data Flavors

• Sequences (much more than NGS!)
• Structures
• Images
• Video
• Audio
• Pathways (graphs)
• Text (Publications)
• Traces
• Combination (e.g. Video & Traces)
• And much more ...

EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

1541-1672/11 2011 IEEE INTELLIGENT SYSTEMS

• Bioinformatics has become too central to biology to be left to specialist bioinformaticians.

• "Biologists are all bioinformaticians now" - Lincoln Stein, Dec. 2008
  http://genomebiology.com/2008/9/12/114

• Abstraction:
  C.T. is operating in terms of multiple layers of abstraction simultaneously
  C.T. is defining the relationships between the layers

• Automation:
  C.T. is thinking in terms of mechanizing the abstraction layers and their relationships

• Mechanization is possible due to precise and exacting notations and models

• There is some “machine” below (human or computer, virtual or physical)

• They give us the ability and audacity to scale.

• Man is the best computer we can put aboard a spacecraft.
  And the only one that can be mass produced with unskilled labor.

-- Wernher von Braun

SaaS: Software as a Service (e.g. Clustering/Assembly is a service)

IaaS: Infrastructure as a Service (get computer time with a credit card and with a Web interface like EC2)

PaaS: Platform as a Service: IaaS plus core software capabilities on which you build SaaS (e.g. Hadoop/MapReduce is a Platform)

Cyberinfrastructure is “Research as a Service”

http://salsahpc.indiana.edu

• We have designed iPlant to be consistent with the pillars of CIF21
  ✓ High Performance Computing
  ✓ Data and Data Analysis
  ✓ Virtual Organization
  ✓ Learning and Workforce

The iPlant Collaborative Cyberinfrastructure Philosophy

Typical End Users

Computational Users

TeraGrid XSEDE

The iPlant Collaborative Cyberinfrastructure for the Plant Sciences

• For a challenge as broad as “plant science,” a focus on specific applications/tools is a moving target, and never enough.

• It is most important to build a platform that can support diverse and constantly evolving needs. “Cyberinfrastructure” is, in fact, infrastructure. The platform can lift all the apps, not select winners and losers.

“The useful lifetime of our analysis toolchains is now 6 months”
    - Matthew Trunnell, Broad Institute

The iPlant Collaborative Cyberinfrastructure for the Plant Sciences

• The iPlant CI is designed as infrastructure.
• This means it is a platform upon which other projects can build.
• Use of the iPlant infrastructure can take one of several forms:
  ✓ Storage
  ✓ Computation
  ✓ Hosting
  ✓ Web Services
  ✓ Scalability

The iPlant Collaborative: Ways to access iPlant

• Atmosphere: For cloud infrastructure
• iPlant Data Storage: All data large and small
• The Discovery Environment: Integrated Web apps
• MyPlant: Social Networking
• DNASubway: Annotation and more
• Standalone Apps: TNRS, TreeViewer, PhytoBisque, etc.
• The API: For programmers embedding iPlant CI capabilities
• Command line for experts (through TeraGrid/XSEDE)

iPlant Data Store: Free Your Data

Different Users, Different Access Needs: One Data Store

iPlant Data Store Performance: UC Berkeley to iDS

• Dec 5th, 2011:
  • 100GB: 29m15s

36,000 Students, 2000 Faculty

39,000 Students, 2900 Faculty/Staff

iPlant Data Store Performance: UC Berkeley to iDS

Source            Destination   Copy Method   Time (seconds)
CD                Desktop PC    cp            320
Berkeley Server   Desktop PC    scp           150
External Drive    Desktop PC    cp            36
USB 2.0 Flash     Desktop PC    cp            30
iDS               Desktop PC    iget          18
Desktop PC        Desktop PC    cp            15

https://pods.iplantcollaborative.org/wiki/display/start/How+fast+is+the+iPlant+Data+Store

1 GB / 17.5 seconds

Desktop PC (UA): Mac OS X with 7.2K Internal Hard Drive
External Drive: USB 2.0, 5.4K Hard Drive
Flash Drive: USB 2.0 Patriot XT
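
As a rough check on the transfer figures quoted for the iPlant Data Store, the timings can be converted to average throughput. The sketch below is only arithmetic on the numbers from the slides; the function name is made up for illustration, 1 GB is taken as 1000 MB, and the per-row rates assume that every row of the copy-time table moved the same 1 GB file, which the slides suggest but do not state.

```python
def throughput_mb_per_s(size_gb: float, seconds: float) -> float:
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000.0 / seconds


# Figures quoted on the slides.
berkeley_to_ids = throughput_mb_per_s(100, 29 * 60 + 15)  # 100 GB in 29m15s
ids_to_desktop = throughput_mb_per_s(1, 17.5)             # 1 GB in 17.5 s

# Copy times (seconds) from the table.
# ASSUMPTION: each row copied the same 1 GB file.
copy_seconds = {
    "CD (cp)": 320,
    "Berkeley Server (scp)": 150,
    "External Drive (cp)": 36,
    "USB 2.0 Flash (cp)": 30,
    "iDS (iget)": 18,
    "Desktop PC (cp)": 15,
}
rates = {name: throughput_mb_per_s(1, secs) for name, secs in copy_seconds.items()}

if __name__ == "__main__":
    print(f"UC Berkeley to iDS: {berkeley_to_ids:.1f} MB/s")
    print(f"iDS to desktop:     {ids_to_desktop:.1f} MB/s")
    for name, rate in rates.items():
        print(f"{name:24s} {rate:6.1f} MB/s")
```

Both quoted transfers work out to roughly 57 MB/s, so the long-distance 100 GB transfer and the 1 GB `iget` to a desktop ran at about the same average rate.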

Customized cloud platform for computing on your terms!

Atmosphere: motivation
• Standalone GUI-based applications are frequently required for analysis
• GUI apps are not easy to transform into web apps
• Need to handle complex software dependencies (e.g. specific bioperl version and R modules)
• Users need full control of their software stack (occasional sudo access)
• Not all computation completes in a 24-hour queue (HPC limitations!)
• Need to share desktop/applications for collaborative analysis (remote collaborators)
• Availability of next-gen map-reduce based algorithms (currently we have limited support)

Challenges of existing cloud platforms

• Amazon Web Services (AWS)
  http://aws.amazon.com/
  • Flexible and scalable
  • High level of expertise required for configuration
  • Fairly challenging for biologists to master all steps
  • Limited lifecycle management (cost, time mgmt)
  • Lacks easy desktop integration
  • Lacks easy tools for large data transfer

Steps to get started!

What is Atmosphere?
• Self-service cloud infrastructure
• Designed to make the underlying cloud infrastructure easy for novice users
• Built on open source Eucalyptus
• Fully integrated with iPlant authentication, storage, and HPC capabilities
• Enables users to build custom images/appliances and share them with the community
• Cross-platform desktop access to GUI applications in the cloud (using VNC)
• Provides easy web-based access to resources

Who is this tutorial designed for?
• Users wanting to launch configured images in Atmosphere (like an app store™)
• Developers distributing applications
• Prototyping/testing new software/modules
• Tailored software training setups (custom workshops, laboratory courses, etc.)
• Extending compute capabilities of existing applications, i.e. utilizing the iPlant API

• API-compatible implementation of Amazon EC2/S3 interfaces
• Virtualize the execution environment for applications and services
• Up to 12 core / 48 GB instances
• Access to Cloud Storage + EBS
• Run servers, CloudBurst desktop use cases. Big data and the desktop are co-local again!

>60 hosted applications in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)

The iPlant Collaborative Project Atmosphere™: Custom Cloud Computing

Atmosphere: Collaboration

iPlant Data Store

Lifecycle

How to Connect

Different Ways to Log in to VMs

Typical hands-on exercise
• Launching an instance (one per team)
• Connecting to it (VNC and ssh) using the web browser and VNC client software
• Bringing data from iDS to Atmosphere (use idrop or icommands)
• Launching an application
• Installing a new application (optional)
• Saving data back to iDS
• Collaborating with other users (sharing your session)
• Terminating the instance when you are done
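
The command-line part of the exercise (ssh in, pull data from iDS with icommands, push results back) can be sketched as follows. This is an illustrative sketch, not the tutorial's own script: it only builds the command lines and runs nothing, and the user name, IP address, iDS path, and result file name are hypothetical placeholders. `iinit`, `iget`, and `iput` are the standard iRODS icommands; `exercise_commands` and `as_shell` are names made up for this example.

```python
import shlex


def exercise_commands(user, instance_ip, ids_path, result_file):
    """Build (but do not run) the shell commands for one pass through the exercise."""
    ids_home = ids_path.rsplit("/", 1)[0]  # collection that holds the input file
    return [
        ["ssh", f"{user}@{instance_ip}"],                 # from your own machine
        ["iinit"],                                        # on the instance: authenticate to iDS
        ["iget", ids_path, "."],                          # bring data from iDS to the instance
        ["iput", result_file, f"{ids_home}/{result_file}"],  # save results back to iDS
    ]


def as_shell(commands):
    """Render each command as a safely quoted shell line."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmds = exercise_commands(
        user="alice",
        instance_ip="192.0.2.10",
        ids_path="/iplant/home/alice/reads.fastq",
        result_file="results.txt",
    )
    for line in as_shell(cmds):
        print(line)
```

Launching and terminating the instance, and the VNC session, happen through the Atmosphere web interface and a VNC client, so they are left out of the sketch.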

Exercise (cont.)
• Atmosphere manual is at: https://pods.iplantcollaborative.org/wiki/x/Iaxm
• To get started, point your browser to https://atmo-beta.iplantcollaborative.org/
• Log in with your iPlant credentials
• Please do not launch an instance until your instructor has directed you to do so.
  It is ideal if 4 or 5 users form a team and launch a single instance to limit the load on the system
• Once launched, check email for availability of your instance (usually 10 to 15 min), or check the web interface
• Once your instance is ready, use its IP address to connect to it, then invite others to connect and collaborate with them on a shared screen
• For today's session we will use the “NGS Viewers” application

Users of Atmosphere
• Workshops:
  • Frontiers and Techniques in Plant Sciences, CSHL 2011, 2012
  • Genotyping by Sequencing, Cornell Computational Biology
• Graduate/undergraduate course work:
  • BCB 660, Volker Brendel and Amy Toth, Fall 2011, Iowa State University
  • ISTA 420/520, Nirav Merchant & Eric Lyons, Fall 2012, Univ. of Arizona
  • Intro. Bioinformatics, Anne Lorraine, Fall 2012, Univ. of North Carolina
• Popular community contributed images:
  • PhytoMorph (Nate Miller, U. Wisconsin)
  • Twig2Genome (Haibao Tang, JCVI)
  • Julin Maloof, UC Davis*

Examples

Where do I start?
• www.iplantcollaborative.org
• User.iplantcollaborative.org (get an account)
• Wiki.iplantcollaborative.org (documentation)
• Forum.iplantcollaborative.org (ask questions)

Future Data Scientists need CT!

Putting it all to work

Wayne Stayskal, The Tampa Tribune

Executive Team: Steve Goff, Dan Stanzione

Staff: Greg Abram, Sonali Aditya, Roger Barthelson, Brad Boyle, Todd Bryan, Gordon Burleigh, John Cazes, Mike Conway, Karen Cranston, Rion Doodey, Andy Edmonds, Dmitry Fedorov, Michael Gatto, Utkarsh Gaur, Cornel Ghiban, Michael Gonzales, Hariolf Häfele, Matthew Hanlon, Anthony Heath, Barbara Heath, Matthew Helmke, Natalie Henriques, Uwe Hilgert, Nicole Hopkins, Eun-Sook Jeong, Logan Johnson, Chris Jordan, B.D. Kim, Kathleen Kennedy, Mohammed Khalfan, Seung-jin Kim, Lars Koersterk, Sangeeta Kuchimanchi, Kristian Kvilekval, Aruna Lakshmanan, Sue Lauter, Tina Lee, Andrew Lenards, Zhenyuan Lu, Eric Lyons, Naim Matasci, Sheldon McKay, Robert McLay, Angel Mercer, Dave Micklos, Nathan Miller, Steve Mock, Martha Narro, Praveen Nuthulapati, Shannon Oliver, Shiran Pasternak, William Peil, Titus Purdin, J.A. Raygoza Garay, Dennis Roberts, Jerry Schneider, Bruce Schumaker, Sriramu Singaram, Edwin Skidmore, Brandon Smith, Mary Margaret Sprinkle, Sriram Srinivasan, Josh Stein, Lisa Stillwell, Kris Urie, Peter Van Buren, Hans Vasquez-Gross, Matthew Vaughn, Fusheng Wei, Jason Williams, John Wregglesworth, Weijia Xu, Jill Yarmchuk

Faculty Advisors & Collaborators: Ali Akoglu, Greg Andrews, Kobus Barnard, Sue Brown, Thomas Brutnell, Michael Donoghue, Casey Dunn, Brian Enquist, Damian Gessler, Ruth Grene, John Hartman, Matthew Hudson, Dan Kliebenstein, Jim Leebens-Mack, David Lowenthal, Robert Martienssen, B.S. Manjunath, Nirav Merchant, David Neale, Brian O’Meara, Sudha Ram, David Salt, Mark Schildhauer, Doug Soltis, Pam Soltis, Edgar Spalding, Alexis Stamatakis, Ann Stapleton, Lincoln Stein, Val Tannen, Todd Vision, Doreen Ware, Steve Welch, Mark Westneat

Students: Peter Bailey, Jeremy Beaulieu, Devi Bhattacharya, Storme Briscoe, Ya-Di Chen, John Donoghue, Steven Gregory, Yekatarina Khartianova, Monica Lent, Amgad Madkour, Aniruddha Marathe, Kurt Michaels, Dhanesh Prasad, Andrew Predoehl, Jose Salcedo, Shalini Sasidharan, Gregory Striemer, Jason Vandeventer, Kuan Yang

Postdocs: Barbara Banbury, Jamie Estill, Bindu Joseph, Christos Noutsos, Brad Ruhfel, Stephen A. Smith, Chunlao Tang, Lin Wang, Liya Wang, Norman Wickett

The iPlant Collaborative
