“A Campus-Scale High Performance Cyberinfrastructure is Required
for Data-Intensive Research”
Keynote Presentation
CENIC 2013
Held at Calit2@UCSD
March 11, 2013
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
“Blueprint for the Digital University”--Report of the UCSD Research Cyberinfrastructure Design Team
• A Five Year Process Begins Pilot Deployment This Year
research.ucsd.edu/documents/rcidt/RCIDTReportFinal2009.pdf
No Data Bottlenecks--Design for
Gigabit/s Data Flows
April 2009
See talk on RCI by Richard Moore Today at 4pm
Calit2 Sunlight OptIPuter Exchange Connects 60 Campus Sites Each Dedicated at 10Gbps
Maxine Brown, EVL, UIC
OptIPuter Project Manager
Rapid Evolution of 10GbE Port Prices Makes Campus-Scale 10Gbps CI Affordable
2005 2007 2009 2010 2011 2013
$80K/port Chiaro (60 max)
$5K Force 10 (40 max)
$500 Arista 48 ports
$400 (48 ports – today); 576 ports (2013)
• Port Pricing is Falling
• Density is Rising – Dramatically
• Cost of 10GbE Approaching Cluster HPC Interconnects
Source: Philip Papadopoulos, SDSC/Calit2
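The per-port prices listed on this slide can be compared with a few lines of arithmetic. This is a minimal sketch that uses only the four dollar figures as printed; the labels and the helper name `fold_reduction` are illustrative, and the year axis is deliberately not paired with individual prices because the extracted text does not make that pairing explicit.

```python
# Per-port 10GbE prices as printed on the slide (US dollars).
# Labels are shorthand for the slide's entries, not official product names.
port_prices = [
    ("Chiaro", 80_000),
    ("Force 10", 5_000),
    ("Arista", 500),
    ("today (48 ports)", 400),
]

first_price = port_prices[0][1]


def fold_reduction(price):
    """Return how many times cheaper a port is than the first price point."""
    return first_price / price


for label, price in port_prices:
    print(f"{label:>18}: ${price:>6,} per port, "
          f"{fold_reduction(price):.0f}x cheaper than the first entry")

# Overall drop from the first to the last listed price.
overall = fold_reduction(port_prices[-1][1])
```

On these figures the last listed price is a factor of 200 below the first, which is the trend the slide title points to.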
Arista Enables SDSC’s Massively Parallel 10G Switched Data Analysis Resource
Partnering Opportunities with NSF: SDSC’s Gordon, Dedicated Dec. 5, 2011
• Data-Intensive Supercomputer Based on SSD Flash Memory and Virtual Shared Memory SW
– Emphasizes MEM and IOPS over FLOPS
– Supernode has Virtual Shared Memory:
– 2 TB RAM Aggregate
– 8 TB SSD Aggregate
– Total Machine = 32 Supernodes
– 4 PB Disk Parallel File System >100 GB/s I/O
• System Designed to Accelerate Access to Massive Datasets being Generated in Many Fields of Science, Engineering, Medicine, and Social Science
Source: Mike Norman, Allan Snavely SDSC
Gordon Bests Previous Mega I/O per Second by 25x
Creating a “Big Data Freeway” System Connecting Instruments, Computers, & Storage
Phil Papadopoulos, PI
Larry Smarr, co-PI
PRISM@UCSD
Start Date 1/1/13
See talk on PRISM by Phil P. Tomorrow at 9am
Many Disciplines Beginning to Need Dedicated High Bandwidth on Campus
• Remote Analysis of Large Data Sets
– Particle Physics
• Connection to Remote Campus Compute & Storage Clusters
– Ocean Observatory
– Microscopy and Next Gen Sequencers
• Providing Remote Access to Campus Data Repositories
– Protein Data Bank and Mass Spectrometry
• Enabling Remote Collaborations
– National and International
How to Terminate a CENIC 100G Campus Connection
PRISM@UCSD Enables Remote Analysis of Large Data Sets
CERN’s CMS Detector is One of the World’s Most Complex Scientific Instruments
See talk on LHC 100G Networks by Azher Mughal, Caltech Today at 10am
CERN’s CMS Experiment Generates Massive Amounts of Data
UCSD is a Tier-2 LHC Data Center
Source: Frank Wuerthwein, Physics UCSD
Flow Out of CERN for CMS Detector Peaks at 32 Gbps!
Source: Frank Wuerthwein, Physics UCSD
CMS Flow Into Fermilab Peaks at 10 Gbps
Source: Frank Wuerthwein, Physics UCSD
CMS Flow into UCSD Physics Peaks at 2.4 Gbps
Source: Frank Wuerthwein, Physics UCSD
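To put the three peak CMS flow rates in perspective, the sketch below estimates how long a dataset would take to move at each rate. It is a rough illustration only: the 100 TB dataset size is a hypothetical figure chosen for the example, the function name `transfer_seconds` is illustrative, and the calculation ignores protocol overhead and assumes the peak rate is sustained for the whole transfer.

```python
# Peak CMS flow rates quoted on the slides, in gigabits per second.
peak_gbps = {
    "out of CERN": 32.0,
    "into Fermilab": 10.0,
    "into UCSD Physics": 2.4,
}


def transfer_seconds(size_terabytes, rate_gbps):
    """Seconds to move size_terabytes (decimal TB) at rate_gbps.

    Ignores protocol overhead and assumes the rate is sustained.
    """
    bits = size_terabytes * 1e12 * 8
    return bits / (rate_gbps * 1e9)


# Hypothetical dataset size, chosen only for illustration.
dataset_tb = 100

hours = {
    name: transfer_seconds(dataset_tb, rate) / 3600
    for name, rate in peak_gbps.items()
}

for name, h in hours.items():
    print(f"{name:>18}: {h:.1f} hours for {dataset_tb} TB")
```

Under these assumptions the same dataset takes roughly 7 hours at 32 Gbps, about 22 hours at 10 Gbps, and close to 93 hours at 2.4 Gbps.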
Open for all of science, including biology, chemistry, computer science, engineering, mathematics, medicine, and physics
The Open Science Grid: A Consortium of Universities and National Labs
Source: Frank Wuerthwein, Physics UCSD
Dan Cayan USGS Water Resources Discipline
Scripps Institution of Oceanography, UC San Diego
much support from Mary Tyree, Mike Dettinger, Guido Franco and other colleagues
Sponsors: California Energy Commission, NOAA RISA program, California DWR, DOE, NSF
Planning for climate change in California: substantial shifts on top of already high climate variability
Greenhouse Gas Emissions and Concentration
CMIP3 GCMs
UCSD Campus Climate Researchers Need to Download Results from Remote Supercomputer Simulations
Source: Dan Cayan, SIO UCSD
GCMs ~150 km downscaled to Regional models ~12 km
Many simulations
IPCC AR4 and IPCC AR5 have been downscaled using statistical methods
INCREASING VOLUME OF CLIMATE SIMULATIONS
in comparison to 4th IPCC (CMIP3) GCMs:
Latest Generation CMIP5 Models Provide:
• More Simulations
• Higher Spatial Resolution
• More Developed Process Representation
• Daily Output is More Available
Global to Regional Downscaling
Source: Dan Cayan, SIO UCSD
average summer afternoon temperature
GFDL A2 1km downscaled to 1km
Hugo Hidalgo, Tapash Das, Mike Dettinger
HOW MUCH CALIFORNIA SNOW LOSS?
Initial projections indicate substantial reduction in snow water for Sierra Nevada+
declining Apr 1 SWE:
2050 median SWE ~ 2/3 historical median
2100 median SWE ~ 1/3 historical median
PRISM@UCSD Enables Connection to Remote Campus Compute & Storage Clusters
The OOI CI is Built on Dedicated 10GE and Serves Researchers, Education, and Public
Source: Matthew Arrott, John Orcutt OOI CI
Reused Undersea Optical Cables Form a Part of the Ocean Observatories
Source: John Delaney UWash OOI
Source: John Orcutt, Matthew Arrott, SIO/Calit2
OOI CI is Built on Dedicated Optical Networks and Federal Agency & Commercial Clouds
OOI CI Team at Scripps Institution of Oceanography Needs Connection to Its Server Complex in Calit2
Ultra High Resolution Microscopy Images Created at the National Center for Microscopy Imaging
Zeiss Merlin 3View w/ 32k x 32k Scanning and Automated Mosaicing:
Current = 1-2 TB/week, soon 12 TB/week
JEOL-4000EX w/ 8k x 8k CCD, Automated Mosaicing, and Serial Tomography:
Current = 1 TB/week
FEI Titan w/ 4k x 4k STEM, EELS, 4k x 3.5k DDD, 4k x 4k CCD, Automated Mosaicing, and Multi-tilt Tomography:
Current = 1 TB/week
200-500 TB/year Raw; >2 PB/year Aggregate
Microscopes Are Big Data Generators – Driving Software & Cyberinfrastructure Development
Source: Mark Ellisman, School of Medicine, UCSD
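The weekly data volumes quoted for these microscopes can be converted into an average network rate. The sketch below assumes decimal terabytes and a perfectly even transfer spread over the whole week; the function name `sustained_mbps` and the dictionary labels are illustrative. Real instrument traffic arrives in bursts, so the averages computed here are a lower bound on what the network has to carry at any moment, not a sizing recommendation.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds


def sustained_mbps(terabytes_per_week):
    """Average rate in megabits/s needed to move a weekly volume continuously.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Mbit = 1e6 bits.
    """
    bits = terabytes_per_week * 1e12 * 8
    return bits / SECONDS_PER_WEEK / 1e6


# Weekly volumes as printed on the slide (TB/week).
instruments = {
    "Zeiss Merlin 3View (current, upper figure)": 2,
    "Zeiss Merlin 3View (soon)": 12,
    "JEOL-4000EX": 1,
    "FEI Titan": 1,
}

for name, tb_per_week in instruments.items():
    print(f"{name}: {sustained_mbps(tb_per_week):.1f} Mbps average")
```

On these assumptions 1 TB/week averages to about 13 Mbps and 12 TB/week to about 159 Mbps.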
NIH National Center for Microscopy & Imaging Research Integrated Infrastructure of Shared Resources
Source: Steve Peltier, Mark Ellisman, NCMIR
Local SOM Infrastructure
Scientific Instruments
End User Workstations
Shared Infrastructure
Agile System that Spans Resource Classes
SDSC Gordon Supercomputer Analysis of LS Gut Microbiome Displayed on Calit2 VROOM
Calit2 VROOM-FuturePatient Expedition
See Live Demo on Calit2 to CICESE 10G
Weds at 8:30am
PRISM@UCSD Enables Providing Remote Access to Campus Data Repositories
Protein Data Bank (PDB) Needs Bandwidth to Connect Resources and Users
• Archive of experimentally determined 3D structures of proteins, nucleic acids, complex assemblies
• One of the largest scientific resources in life sciences
Source: Phil Bourne and Andreas Prlić, PDB
Hemoglobin
Virus
PDB Usage Is Growing Over Time
• More than 300,000 Unique Visitors per Month
• Up to 300 Concurrent Users
• ~10 Structures are Downloaded per Second 24/7/365
• Increasingly Popular Web Services Traffic
Source: Phil Bourne and Andreas Prlić, PDB
RCSB PDB: 159 million entry downloads
PDBe: 34 million entry downloads
PDBj: 16 million entry downloads
2010 FTP Traffic
Source: Phil Bourne and Andreas Prlić, PDB
• Why is it Important?
– Enables PDB to Better Serve Its Users by Providing Increased Reliability and Quicker Results
• How Will it be Done?
– By More Evenly Allocating PDB Resources at Rutgers and UCSD
– By Directing Users to the Closest Site
• Need High Bandwidth Between Rutgers & UCSD Facilities
PDB Plans to Establish Global Load Balancing
Source: Phil Bourne and Andreas Prlić, PDB
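The idea of "Directing Users to the Closest Site" can be illustrated with a toy example. The sketch below is not PDB's actual mechanism, which the slides do not describe. It substitutes a simple geographic rule (great-circle distance via the haversine formula) for whatever routing the project uses. The site coordinates are rough approximations, and the names `SITES`, `great_circle_km`, and `closest_site` are all hypothetical. Production load balancers typically also weigh network latency and server load, which this sketch ignores.

```python
import math

# Approximate site coordinates (latitude, longitude in degrees).
# These are rough values assumed for illustration only.
SITES = {
    "Rutgers": (40.50, -74.45),
    "UCSD": (32.88, -117.23),
}

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_site(user_location):
    """Return the name of the site geographically nearest to the user."""
    return min(SITES, key=lambda name: great_circle_km(user_location, SITES[name]))


# Hypothetical user locations (approximate city coordinates).
print(closest_site((41.88, -87.63)))   # a user near Chicago
print(closest_site((35.68, 139.69)))   # a user near Tokyo
```

Whichever site a user lands on must hold current data, which is why the slide pairs this plan with a need for high bandwidth between the Rutgers and UCSD facilities.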
UCSD Center for Computational Mass Spectrometry Becoming Global MS Repository
ProteoSAFe: Compute-intensive discovery MS at the click of a button
MassIVE: repository and identification platform for all MS data in the world
Source: Nuno Bandeira, Vineet Bafna, Pavel Pevzner, Ingolf Krueger, UCSD
proteomics.ucsd.edu
Automation: Do it Billions of Times
• Large Volumes of Data from Many Sources--Must Automate
– Thousands of Users, Tens of Thousands of Searches
– Multi-Omics: Proteomics, Metabolomics, Proteogenomics, Natural Products, Glycomics, etc.
• CCMS ProteoSAFe
– Scalable: Distributed Computation over 1000s of CPUs
– Accessible: Intuitive Web-Based User Interfaces
– Flexible: Easy Integration of New Analysis Workflows
• Already Analyzed >1B Spectra in >26,000 Searches from >2,200 users
PRISM@UCSD Enables Remote National and International Collaborations
Tele-Collaboration for Audio Post-Production
Realtime Picture & Sound Editing Synchronized Over IP
Skywalker Sound@Marin Calit2@San Diego
Tele-Collaboration for Cinema Post-Production
Disney + Skywalker Sound + Digital Domain + Laser Pacific
NTT Labs + UCSD/Calit2 + UIC/EVL + Pacific Interface
Collaboration Between EVL’s CAVE2 and Calit2’s VROOM Over 10Gb Wavelength
EVL
Calit2
Source: NTT Sponsored ON*VECTOR Workshop at Calit2 March 6, 2013
Calit2 is Linked to CICESE at 10G Coupling OptIPortals at Each Site
See Live Demo on Calit2 to CICESE 10G
Weds at 8:30am
PRAGMA: A Practical Collaboration Framework
Build and Sustain Collaborations
Advance & Improve Cyberinfrastructure
Through Applications
Source: Peter Arzberger, Calit2 UCSD