calstate csuperb 2012 · discipline x e.g. (bio-informatics , computational-biology)! how to codify...

1

iPlant Atmosphere: Emerging role of Cloud Infrastructure for training Data Scientists!

1

Nirav Merchant iPlant Collaborative University of Arizona http://iplantcollaborative.org

Topic Coverage ²  The 4th Paradigm of Science ² What is “Cloud Computing” ? ² What is “Big Data” ? ² Why you should care about it ? ² Who is a “Data Scientist” ? ² What is Cyberinfrastructure ²  iPlant Atmosphere: Embracing cloud platform

2

Science Paradigms 1. Thousand years ago: science was empirical

describing natural phenomena, observations 2. Last few hundred years: theoretical branch

using models, generalizations 3. Last few decades: a computational branch

simulating complex phenomena 4. Today: data exploration (eScience)

unify theory, experiment, and simulation

3

Based on the transcript of a talk given by the late Jim Gray"to the National Research Council – Computer Science and Telecommunication Board in Mountain View, CA, on January 11, 2007

The Fourth Paradigm: "Data-Intensive Scientific Discovery ²  Increasingly, scientific breakthroughs will be

powered by advanced computing capabilities that help researchers manipulate and explore massive datasets."

²  The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

4 http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The Discovery Lifecycle

5

The Fourth Paradigm: Data-Intensive Scientific Discovery

Evolution of X-Info ²  The evolution of X-Info and Comp-X for each

discipline X e.g. (Bio-Informatics , Computational-Biology)

² How to codify and represent our knowledge ²  The Generic Problems:

6

• How to share it with others • Query and Vis tools • Building and executing models • Integrating data and literature • Documenting experiments • Curation and long-term preservation

• Data ingest • Managing a petabyte • Common schema • How to organize it • How to reorganize it

7

+ =

Simple Formula

The Reality

8

+ + PERL Python Java Ruby Fortran C C# C++ R Matlab etc.

Amazon Azure Rackspace Campus HPC XSEDE Etc.

and lots of glue…..

+ =

Simple Formula

11

² Classic paradigm: You produce data, analyze, interpret (end to end)

² Conventional paradigm: Consortium/centers produce data and you consume it

² New Paradigm: Consortium/centers have produced data and creating “cyber infrastructure” to tackle the “grand challenge”

12

∧

X prize for sequencing

14 2012 guidelines are different, this graphics is out dated

X prize for analyzing it ?

? 15

So who else is talking about it

16

The “V” of big data

² Volume ² Velocity ² Variety ² (Value)

17 Attributed to Gartner Consulting

Big Data ² Extracting meaningful results from vast

amount of data (linked data) ² Big data “information assets” demand cost-

effective, innovative forms of information processing for enhanced insight and decision making.

²  “Big Data” Is only the Beginning of Extreme Information Management

² Big Data Technology, all Is Not New

18 Attributed to Gartner Consulting

The transition

19 http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

BioInforma*on :: Data Flavors

² Sequences (much more than NGS !) ² Structures ² Images ² Video ² Audio ² Pathways (graphs) ² Text (PublicaFons) ² Traces ² CombinaFon (eg Video & Traces) ² And much more …

22

EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

23 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

24 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

25 1541-1672/11 2011 IEEE INTELLIGENT SYSTEMS

26

² Bioinformatics has become too central to biology to be left to specialist bioinformaticians.

² Biologists are all bioinformaticians now" - Lincoln Stein Dec. 2008""http://genomebiology.com/2008/9/12/114

27

²  Abstraction: "C.T. is operating in terms of multiple layers of abstraction simultaneously "C.T. is defining the relationships the between layers

²  Automation:"C.T. is thinking in terms of mechanizing the abstraction layers and their relationships

²  Mechanization is possible due to precise and exacting notations and models

²  There is some “machine” below (human or computer, virtual or physical)

²  They give us the ability and audacity to scale.

28

² Man is the best computer we can put aboard a spacecraft. "And the only one that can be mass produced with unskilled labor.

-- Wernher von Braun

29

SaaS: SoJware as a Service (e.g. Clustering/Assembly is a service)

IaaS: Infrastructure as a Service (get computer Fme with a credit card and with a Web interface like EC2)

PaaS: PlaRorm as a Service IaaS plus core soJware capabiliFes on which you build SaaS

(e.g. Hadoop/MapReduce is a PlaRorm)

Cyberinfrastructure Is “Research as a Service”

http://salsahpc.indiana.edu

² We have designed iPlant to be consistent with the pillars of CIF21 ü  High Performance CompuFng ü  Data and Data Analysis ü  Virtual OrganizaFon ü  Learning and Workforce

The iPlant Collabora*ve Cyberinfrastructure Philosophy

Typical End Users

ComputaFonal Users

Teragrid XSEDE

The iPlant Collabora*ve Cyberinfrastructure for the Plant Sciences

²  For a challenge as broad as “plant science,” focus on specific applicaFons/tools is a moving target, and never enough.

²  Most important to build a plaRorm that can support diverse

and constantly evolving needs. “Cyberinfrastructure” is, in fact, infrastructure. The plaRorm can liJ all the apps, not select winners and losers.

“The useful life*me of our analysis toolchains is now 6 months”

-‐Ma8hew Trunnel, Broad Ins*tute



•  The iPlant CI is designed as infrastructure. •  This means it is a plaRorm upon which other projects

can build. •  Use of the iPlant infrastructure can take one of several

forms: ü  Storage ü  ComputaFon ü  HosFng ü  Web Services ü  Scalability

The iPlant Collabora*ve Ways to access iPlant

•  Atmosphere: For cloud infrastructure •  iPlant Data Storage: All data large and small •  The Discovery Environment: Integrated Web apps. •  MyPlant: Social Networking. •  DNASubway: AnnotaFon and more •  Standalone Apps: TNRS, TreeViewer, PhytoBisque, etc •  The API: For programmers embedding iPlant CI capabiliFes

•  Command line for experts (thru TeraGrid/XSEDE)

iPlant Data Store: Free Your Data"

Different Users, "Different Access Needs: "One Data Store"

iPlant Data Store Performance"UC Berkeley to iDS"

² Dec 5th, 2011: "²  100GB: 29m15s"

36,000 Students 2000 Faculty

39,000 Students 2900 Faculty/Staff

100GB: 29m15s"

iPlant Data Store Performance"UC Berkeley to iDS"

Source Destination Copy Method Time (seconds) CD Desktop PC cp 320

Berkeley Server Desktop PC scp 150

External Drive Desktop PC cp 36

USB 2.0 Flash Desktop PC cp 30

iDS Desktop PC iget 18

Desktop PC Desktop PC cp 15

https://pods.iplantcollaborative.org/wiki/display/start/How+fast+is+the+iPlant+Data+Store

1 GB / 17.5 seconds"

Desktop PC (UA): Mac OS X with 7.2K Internal Hard Drive External Drive: USB 2.0: 5.4k Hard Drive Flash Drive: USB 2.0 Patriot XT

Customized cloud platform for computing on your terms !

Atmosphere: motivation ²  Standalone GUI-based applications are frequently

required for analysis ²  GUI apps not easily to transform into web apps ²  Need to handle complex software dependencies (e.g

specific bioperl version and R modules) ²  Users needing full control of their software stack

(occasional sudo access) ²  All computation does not complete in a 24 hour queue

(HPC limitations !) ²  Need to share desktop/applications for collaborative

analysis (remote collaborators) ²  Availability of Next Gen map-reduce based algorithms

(currently we have limited support)

Challenges of existing cloud platforms

² Amazon Web Services (AWS)"http://aws.amazon.com/

²  Flexible and scalable ² High level of expertise required for configurations ²  Fairly challenging for biologists to master all

steps ²  Limited lifecycle management (cost, time mgmt ) ²  Lack easy desktop integration ²  Lack easy tools for large data transfer

Steps to get started !

What is Atmosphere ? ² Self-service cloud infrastructure ² Designed to make underlying cloud

infrastructure easy to use by novice user ² Built on open source Eucalyptus ²  Fully integrated into iPlant authentication and

storage and HPC capabilities ² Enables users to build custom images/

appliances and share with community ² Cross-platform desktop access to GUI

applications in the cloud (using VNC) ² Provide easy web based access to resources

Who is this tutorial designed for ? ² Users wanting to launch configured images

in atmosphere (like app store™) ² Developers for application distribution ² Prototyping/Testing new software/modules ²  Tailored software training setups (custom

workshops/laboratory courses etc) ² Extend compute capabilities of existing

applications i.e. utilize iPlant API

²  API-‐compaFble implementaFon of Amazon EC2/S3 interfaces

²  Virtualize the execuFon environment for applicaFons and services

²  Up to 12 core / 48 GB instances ²  Access to Cloud Storage + EBS ²  Run servers, CloudBurst desktop use

cases. Big data and the desktop are co-‐local again!

>60 hosted applicaFons in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)

The iPlant Collabora*ve Project Atmosphere™: Custom Cloud CompuFng

Atmosphere: Collaboration

iPlant Data Store

Lifecycle

How to Connect

Different Ways to Log in to VMs

Typical Hands on exercise ²  Launching a instance (one per team) ²  Connecting to it (vnc and ssh) using the web

browser and vnc client software ²  Bringing data from iDS to Atmosphere (use idrop

or icommands) ²  Launching a application ²  Installing a new application (optional) ²  Saving data back to iDS ²  Collaborating with other users (sharing your

session) ²  Terminating the instance when you are done

Exercise (cont.) ²  Atmosphere manual is at: https://pods.iplantcollaborative.org/wiki/

x/Iaxm ²  To get started point your browser to

https://atmo-beta.iplantcollaborative.org/ ²  Login with your iPlant credentials ²  Please do not launch an instance until your instructor has directed

you to do so."It is ideal if 4 or 5 users form a team and launch a single instance to limit the load on the system

²  Once launched, check email for availability of your instance (usually 10 to 15 min) or you can check the web interface

²  Once you instance is ready you use the IP address and connect to it, and invite others to connect to it and collaborate with them with a shared screen

²  For todays session we will use the “NGS Viewers” application

Users of Atmosphere ²  Workshops:

²  Frontiers and Techniques in Plant Sciences"CSHL 2011,2012

²  Genotyping by Sequencing "Cornell Computational Biology

²  Graduate/U. Graduate course work: ²  BCB 660 Volker Brendel and Amy Toth"

Fall 2011, Iowa State University ²  ISTA 420/520 Nirav Merchant & Eric Lyons"

Fall 2012, Univ. of Arizona ²  Intro. Bioinformaics, Anne Lorraine"

Fall 2012l Univ. of North Carolina ²  Popular community contributed images:

²  PhytoMorph (Nate Miller, U. Wisconsin) ²  Twig2Genome (Haibao Tang, JCVI) ²  Julin Maloof, UC Davis*

Examples

52

Where do I start ? ² www.iplantcollaborative.org ² User.iplantcollaborative.org (get a account) ² Wiki.iplantcollaborative.org (documentation) ²  Forum.iplantcollaborative.org (ask questions)

53

54

Future Data Scientist need CT !

Pukng it all to work

Wayne Stayskal, The Tampa Tribune

Staff: Greg Abram Sonali Aditya Roger Barthelson Brad Boyle Todd Bryan Gordon Burleigh John Cazes Mike Conway Karen Cranston Rion Doodey Andy Edmonds Dmitry Fedorov Michael Gatto Utkarsh Gaur Cornel Ghiban Michael Gonzales Hariolf Häfele Matthew Hanlon

74

Metadata Data Tools Workflows Viz

Executive Team: Steve Goff Dan Stanzione

Faculty Advisors & Collaborators: Ali Akoglu Greg Andrews Kobus Barnard Sue Brown Thomas Brutnell Michael Donoghue Casey Dunn Brian Enquist Damian Gessler Ruth Grene John Hartman Matthew Hudson Dan Kliebenstein Jim Leebens-Mack David Lowenthal Robert Martienssen

Students: Peter Bailey Jeremy Beaulieu Devi Bhattacharya Storme Briscoe Ya-Di Chen John Donoghue Steven Gregory Yekatarina Khartianova Monica Lent Amgad Madkour

B.S. Manjunath Nirav Merchant David Neale Brian O’Meara Sudha Ram David Salt Mark Schildhauer Doug Soltis Pam Soltis Edgar Spalding Alexis Stamatakis Ann Stapleton Lincoln Stein Val Tannen Todd Vision Doreen Ware Steve Welch Mark Westneat

Andrew Lenards Zhenyuan Lu Eric Lyons Naim Matasci Sheldon McKay Robert McLay Angel Mercer Dave Micklos Nathan Miller Steve Mock Martha Narro Praveen Nuthulapati Shannon Oliver Shiran Pasternak William Peil Titus Purdin J.A. Raygoza Garay Dennis Roberts Jerry Schneider

Anthony Heath Barbara Heath Matthew Helmke Natalie Henriques Uwe Hilgert Nicole Hopkins Eun-Sook Jeong Logan Johnson Chris Jordan B.D. Kim Kathleen Kennedy Mohammed Khalfan Seung-jin Kim Lars Koersterk Sangeeta Kuchimanchi Kristian Kvilekval Aruna Lakshmanan Sue Lauter Tina Lee

Bruce Schumaker Sriramu Singaram Edwin Skidmore Brandon Smith Mary Margaret Sprinkle Sriram Srinivasan Josh Stein Lisa Stillwell Kris Urie Peter Van Buren Hans Vasquez-Gross Matthew Vaughn Fusheng Wei Jason Williams John Wregglesworth Weijia Xu Jill Yarmchuk

Aniruddha Marathe Kurt Michaels Dhanesh Prasad Andrew Predoehl Jose Salcedo Shalini Sasidharan Gregory Striemer Jason Vandeventer Kuan Yang

Postdocs: Barbara Banbury Jamie Estill Bindu Joseph Christos Noutsos Brad Ruhfel Stephen A. Smith Chunlao Tang Lin Wang Liya Wang Norman Wickett

The iPlant Collabora*ve

calstate csuperb 2012 · discipline x e.g. (bio-informatics , computational-biology)! how to codify...

Documents