A Uniquely Flexible HPC Resource for New Communities and Data Analytics Nick Nystrom · Director, Strategic Applications & Bridges PI · [email protected] HPE Theater Presentation · November 19, 2015 © 2015 Pittsburgh Supercomputing Center


A Uniquely Flexible HPC Resource for New Communities and Data Analytics

Nick Nystrom · Director, Strategic Applications & Bridges PI · [email protected] · HPE Theater Presentation · November 19, 2015

© 2015 Pittsburgh Supercomputing Center


Pittsburgh is a city of bridges: from its history in steel to its leadership in computer science and biotechnology, between diverse neighborhoods housing its many universities, and at PSC, from science-inspired national cyberinfrastructure to researchers’ breakthroughs.

Bridges is a new kind of converged HPC + Big Data system that will integrate advanced memory technologies to empower new communities, bring desktop convenience to HPC, connect to campuses, and intuitively express data-intensive workflows.


From HPC to Big Data

Pan-STARRS telescope (http://pan-starrs.ifa.hawaii.edu/public/)

Genome sequencers (Wikipedia Commons)

NOAA climate modeling (http://www.ornl.gov/info/ornlreview/v42_3_09/article02.shtml)

Collections (Horniman museum: http://www.horniman.ac.uk/get_involved/blog/bioblitz-insects-reviewed)

Legacy documents (Wikipedia Commons)

Environmental sensors: Water temperature profiles from tagged hooded seals (http://www.arctic.noaa.gov/report11/biodiv_whales_walrus.html)

Library of Congress stacks (https://www.flickr.com/photos/danlem2001/6922113091/)

Video (Wikipedia Commons)

Social networks and the Internet

New Emphases


More data has been created in the past 2 years than in the entire previous history of the human race. http://www.sintef.no/en/corporate-news/big-data--for-better-or-worse/

300+ hours of video are uploaded to YouTube every minute. http://www.reelseo.com/youtube-300-hours/

More than 1 billion people used Facebook in a single day. https://www.facebook.com/zuck/posts/10102329188394581

There are 500 million tweets per day, 200 billion per year. http://www.internetlivestats.com/twitter-statistics/

The Square Kilometer Array (SKA) will gather 14 exabytes and store ~1 petabyte of data per day. http://www.scribd.com/doc/125147649/Ultimate-Big-Data-Challenge#page=1

Between 2 and 40 exabytes will be needed to store just human genomes by 2025. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195#pbio.1002195.ref009

1.4 million data points per game are collected for each of the Premier League’s 20 soccer stadiums. https://www.linkedin.com/pulse/how-big-data-analytics-changing-soccer-bernard-marr

By 2020, the amount of data available to the healthcare industry will double every 73 days. https://www.linkedin.com/pulse/apple-ibm-team-up-new-big-data-health-platform-bernard-marr?trk=mp-reader-card

A 4003 connectome dataset requires >100 terabytes. A full mouse brain would be ~1 exabyte. http://people.cs.pitt.edu/~mosse/capstone/2013-fall/Wetzel-CompSci-project-Aug2013.ppt
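Two of the figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only (the function names are chosen here, and a 365-day year is assumed): it shows that doubling every 73 days compounds to a 32-fold increase per year, and that 500 million tweets per day comes to roughly 182.5 billion per year, which the slide rounds to 200 billion.

```python
def yearly_growth_factor(doubling_days: float, days_per_year: float = 365.0) -> float:
    """Growth multiple over one year for a quantity that doubles every `doubling_days`."""
    return 2.0 ** (days_per_year / doubling_days)


def per_year(per_day: float, days_per_year: float = 365.0) -> float:
    """Scale a daily rate to a yearly total (assumes a constant daily rate)."""
    return per_day * days_per_year


if __name__ == "__main__":
    # Healthcare data doubling every 73 days: 365 / 73 = 5 doublings per year.
    print(yearly_growth_factor(73))   # 32.0
    # 500 million tweets per day, expressed in billions per year.
    print(per_year(500e6) / 1e9)      # 182.5
```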


HPC → Big Data: Changing Algorithms

Scientific Visualization

Statistics

Calculations on Data

Optimization (numerical)

Structured Data

Machine Learning

Optimization (decision-making)

Natural Language Processing

Image Analysis

Video

Sound

Unstructured Data

Graph Analytics

Information Visualization


The $9.65M Bridges acquisition is made possible by National Science Foundation (NSF) award #ACI-1445606:

Bridges: From Communities and Data to Workflows and Insight

is delivering Bridges

Disclaimer: The following presentation conveys the current plan for Bridges. Details are subject to change.


An Important Addition to the National Advanced Cyberinfrastructure Ecosystem

Bridges will be a new resource on XSEDE and will interoperate with other XSEDE resources, Advanced Cyberinfrastructure (ACI) projects, campuses, and instruments nationwide.

• Data Infrastructure Building Blocks (DIBBs)
– Data Exacell (DXC)
– Integrating Geospatial Capabilities into HUBzero
– Building a Scalable Infrastructure for Data-Driven Discovery & Innovation in Education
– Other DIBBs projects

• Other ACI projects

Examples:

• High-throughput genome sequencers
• Reconstructing brain circuits from high-resolution electron microscopy
• Temple University’s new Science, Education, and Research Center
• Carnegie Mellon University’s Gates Center for Computer Science
• Social networks and the Internet


Motivating Use Cases (examples)

Data-intensive applications & workflows

Gateways – the power of HPC without the programming

Shared data collections & related analysis tools

Cross-domain analytics

Graph analytics, machine learning, genome sequence assembly, and other large-memory applications

Scaling research questions beyond the laptop

Scaling research from individuals to teams and collaborations

Very large in-memory databases

Optimization & parameter sweeps

Distributed & service-oriented architectures

Data assimilation from large instruments and Internet data

Leveraging an extensive collection of interoperating software


Potential Applications (Examples)

• Finding causal relationships in cancer genomics, lung disease, and brain dysfunction

• Analysis of financial markets and policies

• Improving the effectiveness of organ donation networks

• Assembling large genomes and metagenomes

• Recognizing events and enabling search for videos

• Understanding how the brain is connected from EM data

• Addressing societal issues from social media data

• Analyzing large corpora in the digital humanities

• Cross-observational analyses in astronomy & other sciences

• Data integration and fusion for history and related fields


Objectives and Approach

• Bring HPC to nontraditional users and research communities.

• Allow high-performance computing to be applied effectively to big data.

• Bridge to campuses to streamline access and provide cloud-like burst capability.

• Leveraging PSC’s expertise with shared memory, Bridges will feature 3 tiers of large, coherent shared-memory nodes: 12TB, 3TB, and 128GB.

• Bridges implements a uniquely flexible environment featuring interactivity, gateways, databases, distributed (web) services, high-productivity programming languages and frameworks, virtualization, and campus bridging.

EMBO Mol Med (2013) DOI: 10.1002/emmm.201202388: Proliferation of cancer-causing mutations throughout life

Alex Hauptmann et al.: Efficient large-scale content-based multimedia event detection


Interactivity

• Interactivity is the feature most frequently requested by nontraditional HPC communities.

• Interactivity provides immediate feedback for doing exploratory data analytics and testing hypotheses.

• Bridges will offer interactivity through a combination of virtualization for lighter-weight applications and dedicated nodes for more demanding ones.


Gateways and Tools for Building Them

Gateways provide easy-to-use access to Bridges’ HPC and data resources, allowing users to launch jobs, orchestrate complex workflows, and manage data from their browsers.

- Extensive leveraging of databases and polystore systems
- Great attention to HCI is needed to get these right

Download sites for MEGA-6 (Molecular Evolutionary Genetic Analysis), from www.megasoftware.net

Interactive pipeline creation in GenePattern (Broad Institute)

Col*Fusion portal for the systematic accumulation, integration, and utilization of historical data, from http://colfusion.exp.sis.pitt.edu/colfusion/


Virtualization and Containers

• Virtual Machines (VMs) will enable flexibility, customization, security, reproducibility, ease of use, and interoperability with other services.

• Early user demand on PSC’s Data Exacell research pilot project has centered on VMs for custom database and web server installations (to develop data-intensive, distributed applications) and on containers for reproducibility.

• Bridges leverages OpenStack to provision resources among interactive, batch, Hadoop, and VM uses.


High-Productivity Programming

Supporting the languages that communities are already using is critical for successful application of HPC to their research questions.


Bridges’ large memory is great for Spark!

Bridges enables workflows that integrate Spark/Hadoop, HPC, and/or shared-memory components.

Spark + Hadoop Ecosystem
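The Spark and Hadoop ecosystems are organized around map and reduce steps applied to partitioned data. The following is a minimal pure-Python sketch of that pattern, not Spark's or Hadoop's API; the function names and the tiny in-memory "partitions" are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within a single partition of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts of two partitions."""
    return left + right


def word_count(partitions):
    """Apply the map step to every partition, then fold the partial results."""
    return reduce(merge_counts, (map_partition(p) for p in partitions), Counter())


if __name__ == "__main__":
    # Two hypothetical partitions, standing in for data spread across nodes.
    partitions = [["big data", "HPC"], ["Big memory", "hpc hpc"]]
    print(word_count(partitions))
```

On a real cluster the map step would run in parallel on the nodes holding each partition; here it runs sequentially to keep the sketch self-contained.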


Campus Bridging

• Through a pilot project with Temple University, the Bridges project will explore new ways to transition data and computing seamlessly between campus and XSEDE resources.

• Federated identity management will allow users to use their local credentials for single sign-on to remote resources, facilitating data transfers between Bridges and Temple’s local storage systems.

• Burst offload will enable cloud-like offloading of jobs from Temple to Bridges and vice versa during periods of unusually heavy load.
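One way to picture burst offload is as a routing decision made when a job is submitted. The sketch below is a hypothetical policy, not the mechanism used by Bridges or Temple: the `Cluster` class, the load metric (busy plus queued cores divided by total cores), and the `burst_threshold` value are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical snapshot of a cluster's state at job-submission time."""
    name: str
    total_cores: int
    busy_cores: int
    queued_cores: int = 0

    @property
    def load(self) -> float:
        # Demand relative to capacity; values above 1.0 mean jobs are waiting.
        return (self.busy_cores + self.queued_cores) / self.total_cores


def choose_cluster(home: Cluster, remote: Cluster, burst_threshold: float = 1.5) -> Cluster:
    """Run at home unless home is unusually loaded and the remote site is less loaded."""
    if home.load > burst_threshold and remote.load < home.load:
        return remote
    return home


if __name__ == "__main__":
    campus = Cluster("campus", total_cores=1000, busy_cores=1000, queued_cores=900)
    bridges = Cluster("bridges", total_cores=20000, busy_cores=15000)
    print(choose_cluster(campus, bridges).name)  # campus is overloaded, so the job bursts out
    print(choose_cluster(bridges, campus).name)  # bridges is below the threshold, so it stays
```

Because the function is symmetric in its arguments, the same rule covers offloading in either direction, matching the "and vice versa" in the slide.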

Federated identity management

Burst offload

http://www.temple.edu/medicine/research/RESEARCH_TUSM/


Purpose-built Intel® Omni-Path topology for data-intensive HPC

http://psc.edu/bridges

• 800 HPE Apollo 2000 (128GB) compute nodes
• 42 HPE ProLiant DL580 (3TB) compute nodes
• 4 HPE Integrity Superdome X (12TB) compute nodes
• 16 RSM nodes with NVIDIA K80 GPUs
• 32 RSM nodes with NVIDIA next-generation GPUs
• 12 HPE ProLiant DL380 database nodes
• 6 HPE ProLiant DL360 web server nodes
• 4 MDS nodes
• 2 front-end nodes
• 2 boot nodes
• 8 management nodes
• 20 Storage Building Blocks, implementing the parallel Pylon filesystem (~10PB) using PSC’s SLASH2 filesystem
• 20 “leaf” Intel® OPA edge switches
• 6 “core” Intel® OPA edge switches: fully interconnected, 2 links per switch
• Intel® OPA cables


High-Performance, Data-Intensive Computing

• 3 tiers of large, coherent shared-memory nodes
• The latest Intel® Xeon® CPUs
• NVIDIA® Tesla® dual-GPU accelerators

Memory per node | Number of nodes | Node type | Example applications
12 TB | 4 | HPE Integrity Superdome X | Genomics, machine learning, graph analytics, other extreme-memory applications
3 TB | 42 | HPE ProLiant DL580 | Virtualization and interactivity including large-scale visualization and analytics; mid-spectrum memory-intensive jobs
128 GB | 800 | HPE Apollo 2000 | Execution of most components of workflows, interactivity, Hadoop, and capacity computing
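The node counts and per-node memory sizes above imply an aggregate memory figure that the slides do not state. The following sketch computes it under the assumption that 1 TB = 1024 GB (so 128 GB = 0.125 TB); the `TIERS` table and function name are illustrative, and the GPU, database, web server, and other service nodes are not included.

```python
# (node type, number of nodes, memory per node in TB), taken from the slides.
TIERS = [
    ("HPE Integrity Superdome X", 4, 12.0),
    ("HPE ProLiant DL580", 42, 3.0),
    ("HPE Apollo 2000", 800, 0.125),  # 128 GB, assuming 1 TB = 1024 GB
]


def total_memory_tb(tiers) -> float:
    """Sum of (node count x memory per node) across all tiers, in TB."""
    return sum(count * mem_tb for _, count, mem_tb in tiers)


if __name__ == "__main__":
    for name, count, mem_tb in TIERS:
        print(f"{name}: {count} x {mem_tb} TB = {count * mem_tb} TB")
    print("Total:", total_memory_tb(TIERS), "TB")  # 48 + 126 + 100 = 274.0 TB
```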


Database and Web Server Nodes

• Dedicated database nodes (HPE ProLiant DL380) will power persistent relational and NoSQL databases
– Support data management and data-driven workflows
– SSDs for high IOPS; RAIDed HDDs for high capacity

• Dedicated web server nodes (HPE ProLiant DL360)
– Enable distributed, service-oriented architectures
– High-bandwidth connections to XSEDE and the Internet


Data Management

• Pylon: A large, central, high-performance filesystem
– Visible to all nodes
– Large datasets, community repositories (~10 PB usable)

• Distributed (node-local) storage
– Enhance application portability
– Improve overall system performance
– Improve performance consistency to the shared filesystem

• Acceleration for Hadoop-based applications


Intel® Omni-Path Architecture

• Omni-Path will connect all nodes and the shared filesystem, providing Bridges and its users with:
– 100 Gbps line speed per port; 25 GB/s bidirectional bandwidth per port
– 160M MPI messages per second
– 48-port edge switch reduces interconnect complexity and cost
– HPC performance, reliability, and QoS
– OFA-compliant applications supported without modification
– Early access to this new, important, forward-looking technology

• The Intel Omni-Path Architecture will be deployed in Bridges using a novel topology developed by PSC to enable data-intensive HPC.
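The two per-port figures quoted above are consistent with each other: 100 Gbps is 12.5 GB/s in one direction, or 25 GB/s counting both directions. The sketch below shows the conversion and uses it for an idealized estimate of how long a 12 TB dataset would take to move through a single port. It assumes decimal units (1 TB = 1000 GB) and ignores protocol overhead, filesystem limits, and contention, so the result is a lower bound rather than a measured or promised figure; the function names are illustrative.

```python
def gbps_to_gbytes_per_s(gigabits_per_s: float) -> float:
    """Convert a line rate in gigabits per second to gigabytes per second (8 bits per byte)."""
    return gigabits_per_s / 8


def transfer_seconds(size_tb: float, gbytes_per_s: float) -> float:
    """Ideal time to move `size_tb` terabytes at a sustained rate, assuming 1 TB = 1000 GB."""
    return size_tb * 1000 / gbytes_per_s


if __name__ == "__main__":
    one_way = gbps_to_gbytes_per_s(100)        # 12.5 GB/s per direction
    bidirectional = 2 * one_way                # 25.0 GB/s, matching the slide
    print(one_way, bidirectional)
    print(transfer_seconds(12, one_way) / 60)  # 16.0 minutes for 12 TB, ideal case
```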


Example: Causal Discovery Portal

Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence

Web node

VM

Apache Tomcat

Messaging

Database node

VM

Other DBs

LSM Node (3TB)

ESM Node (12TB)

Analytics: TETRAD (FastGES, …) and related implementations

Browser-based UI

• Authentication
• Data
• Provenance

Execute causal discovery algorithms

Omni-Path

• Prepare and upload data

• Run causal discovery algorithms

• Visualize results

Internet

Memory-resident datasets

Pylon filesystem

TCGA

fMRI
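In the portal architecture above, analyses run on either an LSM node (3 TB) or an ESM node (12 TB) with memory-resident datasets. The slides do not say how a node is chosen, so the following is only a hypothetical sketch of one plausible rule: pick the smallest tier whose memory covers the dataset plus working space. The `overhead_factor`, the function name, and the 1 TB = 1024 GB convention are assumptions.

```python
# Memory tiers named in the diagram, smallest first (GB, assuming 1 TB = 1024 GB).
TIERS_GB = [("LSM", 3 * 1024), ("ESM", 12 * 1024)]


def pick_node_tier(dataset_gb: float, overhead_factor: float = 2.0) -> str:
    """Return the smallest tier that fits the dataset plus working memory.

    `overhead_factor` is a hypothetical multiplier for the algorithm's working
    space; a real portal would need to calibrate it per algorithm.
    """
    needed_gb = dataset_gb * overhead_factor
    for name, capacity_gb in TIERS_GB:
        if needed_gb <= capacity_gb:
            return name
    raise ValueError(f"{needed_gb:.0f} GB exceeds the largest node ({TIERS_GB[-1][1]} GB)")


if __name__ == "__main__":
    print(pick_node_tier(1000))  # needs 2000 GB, fits in an LSM node
    print(pick_node_tier(2000))  # needs 4000 GB, requires an ESM node
```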


Getting Started on Bridges

• Starter Allocation: https://www.xsede.org/allocations
– Can request anytime... including now!
– Can request XSEDE ECSS (Extended Collaborative Support Service)

• Research Allocation (XRAC): https://www.xsede.org/allocations
– Appropriate for larger requests; can request ECSS
– Quarterly submission windows; Next: Dec. 15, 2015–Jan. 15, 2016

• Early User Period
– Users with starter or research proposals may be eligible for Bridges’ Early User Period (starting late 2015)

• Questions?
– See http://psc.edu/bridges or email Nick Nystrom at [email protected]