calstate csuperb 2012 · discipline x e.g. (bio-informatics , computational-biology)! how to codify...
TRANSCRIPT
1
iPlant Atmosphere: Emerging role of Cloud Infrastructure for training Data Scientists!
1
Nirav Merchant iPlant Collaborative University of Arizona http://iplantcollaborative.org
Topic Coverage ² The 4th Paradigm of Science ² What is “Cloud Computing” ? ² What is “Big Data” ? ² Why you should care about it ? ² Who is a “Data Scientist” ? ² What is Cyberinfrastructure ² iPlant Atmosphere: Embracing cloud platform
2
Science Paradigms 1. Thousand years ago: science was empirical
describing natural phenomena, observations 2. Last few hundred years: theoretical branch
using models, generalizations 3. Last few decades: a computational branch
simulating complex phenomena 4. Today: data exploration (eScience)
unify theory, experiment, and simulation
3
Based on the transcript of a talk given by the late Jim Gray"to the National Research Council – Computer Science and Telecommunication Board in Mountain View, CA, on January 11, 2007
The Fourth Paradigm: "Data-Intensive Scientific Discovery ² Increasingly, scientific breakthroughs will be
powered by advanced computing capabilities that help researchers manipulate and explore massive datasets."
² The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.
4 http://research.microsoft.com/en-us/collaboration/fourthparadigm/
The Discovery Lifecycle
5
The Fourth Paradigm: Data-Intensive Scientific Discovery
Evolution of X-Info ² The evolution of X-Info and Comp-X for each
discipline X e.g. (Bio-Informatics , Computational-Biology)
² How to codify and represent our knowledge ² The Generic Problems:
6
• How to share it with others • Query and Vis tools • Building and executing models • Integrating data and literature • Documenting experiments • Curation and long-term preservation
• Data ingest • Managing a petabyte • Common schema • How to organize it • How to reorganize it
7
+ =
Simple Formula
The Reality
8
+ + PERL Python Java Ruby Fortran C C# C++ R Matlab etc.
Amazon Azure Rackspace Campus HPC XSEDE Etc.
and lots of glue…..
+ =
Simple Formula
10
11
² Classic paradigm: You produce data, analyze, interpret (end to end)
² Conventional paradigm: Consortium/centers produce data and you consume it
² New Paradigm: Consortium/centers have produced data and creating “cyber infrastructure” to tackle the “grand challenge”
12
∧
13
X prize for sequencing
14 2012 guidelines are different, this graphics is out dated
X prize for analyzing it ?
? 15
So who else is talking about it
16
The “V” of big data
² Volume ² Velocity ² Variety ² (Value)
17 Attributed to Gartner Consulting
Big Data ² Extracting meaningful results from vast
amount of data (linked data) ² Big data “information assets” demand cost-
effective, innovative forms of information processing for enhanced insight and decision making.
² “Big Data” Is only the Beginning of Extreme Information Management
² Big Data Technology, all Is Not New
18 Attributed to Gartner Consulting
The transition
19 http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
BioInforma*on :: Data Flavors
² Sequences (much more than NGS !) ² Structures ² Images ² Video ² Audio ² Pathways (graphs) ² Text (PublicaFons) ² Traces ² CombinaFon (eg Video & Traces) ² And much more …
21
22
EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
23 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
24 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
25 1541-1672/11 2011 IEEE INTELLIGENT SYSTEMS
26
² Bioinformatics has become too central to biology to be left to specialist bioinformaticians.
² Biologists are all bioinformaticians now" - Lincoln Stein Dec. 2008""http://genomebiology.com/2008/9/12/114
27
² Abstraction: "C.T. is operating in terms of multiple layers of abstraction simultaneously "C.T. is defining the relationships the between layers
² Automation:"C.T. is thinking in terms of mechanizing the abstraction layers and their relationships
² Mechanization is possible due to precise and exacting notations and models
² There is some “machine” below (human or computer, virtual or physical)
² They give us the ability and audacity to scale.
28
² Man is the best computer we can put aboard a spacecraft. "And the only one that can be mass produced with unskilled labor.
-- Wernher von Braun
29
SaaS: SoJware as a Service (e.g. Clustering/Assembly is a service)
IaaS: Infrastructure as a Service (get computer Fme with a credit card and with a Web interface like EC2)
PaaS: PlaRorm as a Service IaaS plus core soJware capabiliFes on which you build SaaS
(e.g. Hadoop/MapReduce is a PlaRorm)
Cyberinfrastructure Is “Research as a Service”
http://salsahpc.indiana.edu
² We have designed iPlant to be consistent with the pillars of CIF21 ü High Performance CompuFng ü Data and Data Analysis ü Virtual OrganizaFon ü Learning and Workforce
The iPlant Collabora*ve Cyberinfrastructure Philosophy
Typical End Users
ComputaFonal Users
Teragrid XSEDE
The iPlant Collabora*ve Cyberinfrastructure for the Plant Sciences
² For a challenge as broad as “plant science,” focus on specific applicaFons/tools is a moving target, and never enough.
² Most important to build a plaRorm that can support diverse
and constantly evolving needs. “Cyberinfrastructure” is, in fact, infrastructure. The plaRorm can liJ all the apps, not select winners and losers.
“The useful life*me of our analysis toolchains is now 6 months”
-‐Ma8hew Trunnel, Broad Ins*tute
The iPlant Collabora*ve Cyberinfrastructure for the Plant Sciences
The iPlant Collabora*ve Cyberinfrastructure for the Plant Sciences
• The iPlant CI is designed as infrastructure. • This means it is a plaRorm upon which other projects
can build. • Use of the iPlant infrastructure can take one of several
forms: ü Storage ü ComputaFon ü HosFng ü Web Services ü Scalability
The iPlant Collabora*ve Ways to access iPlant
• Atmosphere: For cloud infrastructure • iPlant Data Storage: All data large and small • The Discovery Environment: Integrated Web apps. • MyPlant: Social Networking. • DNASubway: AnnotaFon and more • Standalone Apps: TNRS, TreeViewer, PhytoBisque, etc • The API: For programmers embedding iPlant CI capabiliFes
• Command line for experts (thru TeraGrid/XSEDE)
iPlant Data Store: Free Your Data"
Different Users, "Different Access Needs: "One Data Store"
iPlant Data Store Performance"UC Berkeley to iDS"
² Dec 5th, 2011: "² 100GB: 29m15s"
36,000 Students 2000 Faculty
39,000 Students 2900 Faculty/Staff
100GB: 29m15s"
iPlant Data Store Performance"UC Berkeley to iDS"
Source Destination Copy Method Time (seconds) CD Desktop PC cp 320
Berkeley Server Desktop PC scp 150
External Drive Desktop PC cp 36
USB 2.0 Flash Desktop PC cp 30
iDS Desktop PC iget 18
Desktop PC Desktop PC cp 15
https://pods.iplantcollaborative.org/wiki/display/start/How+fast+is+the+iPlant+Data+Store
1 GB / 17.5 seconds"
Desktop PC (UA): Mac OS X with 7.2K Internal Hard Drive External Drive: USB 2.0: 5.4k Hard Drive Flash Drive: USB 2.0 Patriot XT
Customized cloud platform for computing on your terms !
Atmosphere: motivation ² Standalone GUI-based applications are frequently
required for analysis ² GUI apps not easily to transform into web apps ² Need to handle complex software dependencies (e.g
specific bioperl version and R modules) ² Users needing full control of their software stack
(occasional sudo access) ² All computation does not complete in a 24 hour queue
(HPC limitations !) ² Need to share desktop/applications for collaborative
analysis (remote collaborators) ² Availability of Next Gen map-reduce based algorithms
(currently we have limited support)
Challenges of existing cloud platforms
² Amazon Web Services (AWS)"http://aws.amazon.com/
² Flexible and scalable ² High level of expertise required for configurations ² Fairly challenging for biologists to master all
steps ² Limited lifecycle management (cost, time mgmt ) ² Lack easy desktop integration ² Lack easy tools for large data transfer
Steps to get started !
What is Atmosphere ? ² Self-service cloud infrastructure ² Designed to make underlying cloud
infrastructure easy to use by novice user ² Built on open source Eucalyptus ² Fully integrated into iPlant authentication and
storage and HPC capabilities ² Enables users to build custom images/
appliances and share with community ² Cross-platform desktop access to GUI
applications in the cloud (using VNC) ² Provide easy web based access to resources
Who is this tutorial designed for ? ² Users wanting to launch configured images
in atmosphere (like app store™) ² Developers for application distribution ² Prototyping/Testing new software/modules ² Tailored software training setups (custom
workshops/laboratory courses etc) ² Extend compute capabilities of existing
applications i.e. utilize iPlant API
² API-‐compaFble implementaFon of Amazon EC2/S3 interfaces
² Virtualize the execuFon environment for applicaFons and services
² Up to 12 core / 48 GB instances ² Access to Cloud Storage + EBS ² Run servers, CloudBurst desktop use
cases. Big data and the desktop are co-‐local again!
>60 hosted applicaFons in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)
The iPlant Collabora*ve Project Atmosphere™: Custom Cloud CompuFng
Atmosphere: Collaboration
iPlant Data Store
Lifecycle
How to Connect
Different Ways to Log in to VMs
Typical Hands on exercise ² Launching a instance (one per team) ² Connecting to it (vnc and ssh) using the web
browser and vnc client software ² Bringing data from iDS to Atmosphere (use idrop
or icommands) ² Launching a application ² Installing a new application (optional) ² Saving data back to iDS ² Collaborating with other users (sharing your
session) ² Terminating the instance when you are done
Exercise (cont.) ² Atmosphere manual is at: https://pods.iplantcollaborative.org/wiki/
x/Iaxm ² To get started point your browser to
https://atmo-beta.iplantcollaborative.org/ ² Login with your iPlant credentials ² Please do not launch an instance until your instructor has directed
you to do so."It is ideal if 4 or 5 users form a team and launch a single instance to limit the load on the system
² Once launched, check email for availability of your instance (usually 10 to 15 min) or you can check the web interface
² Once you instance is ready you use the IP address and connect to it, and invite others to connect to it and collaborate with them with a shared screen
² For todays session we will use the “NGS Viewers” application
Users of Atmosphere ² Workshops:
² Frontiers and Techniques in Plant Sciences"CSHL 2011,2012
² Genotyping by Sequencing "Cornell Computational Biology
² Graduate/U. Graduate course work: ² BCB 660 Volker Brendel and Amy Toth"
Fall 2011, Iowa State University ² ISTA 420/520 Nirav Merchant & Eric Lyons"
Fall 2012, Univ. of Arizona ² Intro. Bioinformaics, Anne Lorraine"
Fall 2012l Univ. of North Carolina ² Popular community contributed images:
² PhytoMorph (Nate Miller, U. Wisconsin) ² Twig2Genome (Haibao Tang, JCVI) ² Julin Maloof, UC Davis*
Examples
52
Where do I start ? ² www.iplantcollaborative.org ² User.iplantcollaborative.org (get a account) ² Wiki.iplantcollaborative.org (documentation) ² Forum.iplantcollaborative.org (ask questions)
53
54
Future Data Scientist need CT !
Pukng it all to work
Wayne Stayskal, The Tampa Tribune
56
Staff: Greg Abram Sonali Aditya Roger Barthelson Brad Boyle Todd Bryan Gordon Burleigh John Cazes Mike Conway Karen Cranston Rion Doodey Andy Edmonds Dmitry Fedorov Michael Gatto Utkarsh Gaur Cornel Ghiban Michael Gonzales Hariolf Häfele Matthew Hanlon
74
Metadata Data Tools Workflows Viz
Executive Team: Steve Goff Dan Stanzione
Faculty Advisors & Collaborators: Ali Akoglu Greg Andrews Kobus Barnard Sue Brown Thomas Brutnell Michael Donoghue Casey Dunn Brian Enquist Damian Gessler Ruth Grene John Hartman Matthew Hudson Dan Kliebenstein Jim Leebens-Mack David Lowenthal Robert Martienssen
Students: Peter Bailey Jeremy Beaulieu Devi Bhattacharya Storme Briscoe Ya-Di Chen John Donoghue Steven Gregory Yekatarina Khartianova Monica Lent Amgad Madkour
B.S. Manjunath Nirav Merchant David Neale Brian O’Meara Sudha Ram David Salt Mark Schildhauer Doug Soltis Pam Soltis Edgar Spalding Alexis Stamatakis Ann Stapleton Lincoln Stein Val Tannen Todd Vision Doreen Ware Steve Welch Mark Westneat
Andrew Lenards Zhenyuan Lu Eric Lyons Naim Matasci Sheldon McKay Robert McLay Angel Mercer Dave Micklos Nathan Miller Steve Mock Martha Narro Praveen Nuthulapati Shannon Oliver Shiran Pasternak William Peil Titus Purdin J.A. Raygoza Garay Dennis Roberts Jerry Schneider
Anthony Heath Barbara Heath Matthew Helmke Natalie Henriques Uwe Hilgert Nicole Hopkins Eun-Sook Jeong Logan Johnson Chris Jordan B.D. Kim Kathleen Kennedy Mohammed Khalfan Seung-jin Kim Lars Koersterk Sangeeta Kuchimanchi Kristian Kvilekval Aruna Lakshmanan Sue Lauter Tina Lee
Bruce Schumaker Sriramu Singaram Edwin Skidmore Brandon Smith Mary Margaret Sprinkle Sriram Srinivasan Josh Stein Lisa Stillwell Kris Urie Peter Van Buren Hans Vasquez-Gross Matthew Vaughn Fusheng Wei Jason Williams John Wregglesworth Weijia Xu Jill Yarmchuk
Aniruddha Marathe Kurt Michaels Dhanesh Prasad Andrew Predoehl Jose Salcedo Shalini Sasidharan Gregory Striemer Jason Vandeventer Kuan Yang
Postdocs: Barbara Banbury Jamie Estill Bindu Joseph Christos Noutsos Brad Ruhfel Stephen A. Smith Chunlao Tang Lin Wang Liya Wang Norman Wickett
The iPlant Collabora*ve