
World-Class Supercomputing on a Flexible, Standards-based Cluster

Bull, science+computing and Intel® Cluster Ready Help RWTH Aachen University Deliver Powerful, Flexible HPC Resources to More than a Thousand Researchers

When the Center for Computing and Communication at RWTH Aachen University decided to upgrade its high-performance computing (HPC) system, project leaders released an open tender for a system that would combine massive capacity with the flexibility to support a broad range of research needs. The winning solution was a 292 teraflop, 1,712-node supercomputer designed by Bull, implemented by science+computing, and based on the Intel® Cluster Ready architecture.

CHALLENGES

• Meet the escalating computing demands of research teams that increasingly rely on advanced simulation to push the boundaries of scientific knowledge.

• Support a wide range of applications and workloads, while optimizing utilization and time-to-results for dozens of simultaneous users.

SOLUTION

• An Intel Cluster Ready certified HPC system from Bull that ranks as one of the 30 largest research supercomputers¹ in the world.

IMPACT

• A 5X increase in computing capacity to stay at the leading edge of computer-aided science and engineering research.

• A flexible cluster design that provides massively parallel computing resources as needed, along with shared-memory supernodes for scale-up computing.

• Simplified deployment and maintenance, with Intel® Cluster Checker software and optimized tools and support from science+computing.

• Assured compatibility with a wide range of HPC applications through Intel Cluster Ready certification.

CASE STUDY
Intel® Cluster Ready
High-Performance Computing

“Intel worked closely with us during the certification process and provided excellent support for solving the issues identified by Intel® Cluster Checker.”

Dieter an Mey, HPC Department Head, Center for Computing and Communication, RWTH Aachen University

Advancing Science with a Massive New HPC Cluster

The frontiers of science have changed in recent years. A single scientist working alone can still impact our view of the world, but progress increasingly depends on collaboration among many individuals and on the tremendous computational power provided by today’s high-performance computing (HPC) clusters. This holds true across a wide range of disciplines. Alongside theory and experimentation, computer simulation has become the third pillar of science. It enables research teams to explore complex interactions that would be impossible or impractical to reproduce in physical experiments.

As science advances, so do computing requirements. Leading research institutions, such as RWTH Aachen University, need to continuously scale their HPC resources to empower their research teams. As one of the nine German “Universities of Excellence,” the university provides computing resources to more than a thousand scientists working in fields that include engineering, physical sciences, chemistry, biology, mathematics, and computer science. All the key disciplines at the university use HPC to explore the universe, and the power and flexibility of those resources can have a significant impact on their progress.

To meet growing demands, the university recently embarked on a major upgrade of its HPC environment. After defining requirements in an open tender, the university received design bids from a number of leading vendors. The winning solution was a 1,712-node supercomputer designed by Bull and based on the Intel Cluster Ready architecture. According to Dieter an Mey, head of the HPC team at the Center for Computing and Communication, “We chose Bull because the overall architecture of the proposed system met the demands of our researchers and their disciplines perfectly. We believe Bull is an HPC partner in whom we can have confidence, and who understands how to meet our needs over the coming years.”

A Flexible Design to Support Diverse Research Requirements

Given the broad scope of research supported by RWTH Aachen University, the new cluster was designed to be both massive and flexible. Not all HPC workloads are the same. Some can be divided into sub-tasks that can be run simultaneously on a large number of relatively small server nodes. Other applications perform better on a single multiprocessor server with a large shared memory space—or on a small cluster of relatively large servers. To meet these diverse needs, the cluster for RWTH Aachen University was designed with two major partitions.

An MPI Partition for Massively Parallel Execution

The MPI partition includes 1,350 two-socket bullx B500 server blades, all based on the Intel® Xeon® processor 5600 series. With six cores per processor and Intel Hyper-Threading Technology, this partition can handle up to 32,400 simultaneous tasks, or threads. It is called the “MPI partition” because the software applications it runs rely on the Message Passing Interface (MPI) to enable scalable performance in such a massively distributed hardware environment.
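That thread count follows directly from the hardware: 1,350 blades × 2 sockets × 6 cores × 2 hardware threads per core = 32,400. To illustrate the programming model this partition serves, here is a minimal MPI sketch in C; the case study includes no source code, so the workload (a simple numerical integration for pi) is purely hypothetical:

/* Minimal MPI sketch (illustrative only): estimate pi by splitting
 * quadrature intervals across ranks, then combining partial sums
 * with a single reduction -- the message-passing pattern that lets
 * one program span hundreds of distributed-memory nodes.
 * Compile: mpicc pi_mpi.c -o pi_mpi
 * Run:     mpirun -np 48 ./pi_mpi   (any rank count works)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    const long n = 100000000L;   /* total quadrature steps */
    double local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank integrates an interleaved subset of the intervals. */
    const double h = 1.0 / (double)n;
    for (long i = rank; i < n; i += size) {
        double x = h * ((double)i + 0.5);
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;

    /* Partial results travel by message passing to rank 0. */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.12f (computed on %d ranks)\n", pi, size);

    MPI_Finalize();
    return 0;
}

Because each rank holds only its own data and communicates explicitly, a job like this can grow from a handful of cores to the full partition simply by launching more ranks.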



MPI Partition (one of three rows). 1,350 Bull B500 blades based on Intel® Xeon® processors can handle up to 32,400 simultaneous tasks.


An SMP Partition to Provide Large, Shared-Memory Configurations

The symmetric multiprocessing (SMP) partition provides larger server nodes for applications that do not run efficiently on a cluster of smaller servers. It includes 342 four-socket bullx s6010 modules based on the Intel Xeon processor 7500 series. This processor family provides more compute resources (cores, cache, and bandwidth), so each module can support substantially heavier workloads. These processors also provide advanced reliability, availability, and serviceability (RAS) features. RAS is particularly important when an application runs in a vertically integrated environment, since the failure of a single node will have a greater impact on productivity.

During the first stage of cluster operation, the bullx s6010 modules were used individually. However, they are designed so that two or more modules can be connected together to form larger SMP configurations. In the next stage of operation, they will be connected in groups of four to provide 16-socket systems that will accelerate performance for some workloads.
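Workloads on this partition rely on shared-memory threading rather than message passing. The case study names no specific codes, so the following OpenMP sketch in C is a hypothetical stand-in; it shows the scale-up model in which every thread of one large node reads and writes the same arrays in a single address space:

/* Minimal OpenMP sketch (illustrative only): a threaded dot product
 * on a shared-memory node. The arrays exist once in shared memory;
 * no explicit communication is needed.
 * Compile: gcc -fopenmp dot_omp.c -o dot_omp
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000L

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double dot = 0.0;
    if (!a || !b) return 1;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Each thread accumulates into a private copy of 'dot'; the
     * reduction clause combines the copies when the loop ends. */
    #pragma omp parallel for reduction(+:dot)
    for (long i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %.1f using up to %d threads\n", dot, omp_get_max_threads());
    free(a);
    free(b);
    return 0;
}

The convenience of one shared address space is also the model’s limit: a program can grow only as large as the biggest shared-memory node, which is why coupling four modules into a 16-socket system extends the partition’s reach.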

High Performance Networking and Scalable Storage

All servers in the RWTH Aachen cluster are connected to a fat-tree, QDR InfiniBand network that provides high-bandwidth, low-latency, and very efficient node-to-node communications. The disk storage system supports up to 3 petabytes of data. To maximize performance and cost-effectiveness, data is automatically moved to the optimal disk type (SAS or SATA) based on priority and access patterns.

The World’s Largest Intel Cluster Ready Certified System

When the Aachen Center for Computing and Communication initially defined its cluster requirements, it did not stipulate a system that complied with the Intel Cluster Ready architecture. However, engineers at Bull and science+computing (a subsidiary of the Bull Group) had recently implemented a large cluster based on the architecture and knew Intel® Cluster Checker software would help simplify implementation and maintenance. They also knew Intel Cluster Ready certification would help to ensure greater application compatibility in the university’s production environment.

Faster, Simpler Implementation

When building a cluster based on the Intel Cluster Ready architecture, Intel Cluster Checker software can be used to test hardware and software components and configurations. Problems that might otherwise be difficult to detect, such as a wrong BIOS version or an underperforming network card, can be identified quickly and easily.

This can save many hours of effort, even when building a small cluster. The advantages are even greater for large clusters, and Intel Cluster Checker played a significant role in getting the RWTH Aachen University system up and running quickly and with reduced effort. Each hardware node was tested with Intel Cluster Checker at the factory. This helped Bull engineers identify and resolve issues more efficiently and ensure each node was production-ready when shipped. It also simplified testing and troubleshooting for science+computing engineers at the production site.

Enhanced Application Compatibility and More Efficient Maintenance

Intel Cluster Checker can be used to certify compliance with the Intel Cluster Ready architecture, and the RWTH Aachen supercomputer has been successfully certified. Certification ensures interoperability with a wide range of off-the-shelf HPC applications from leading independent software vendors (ISVs). RWTH Aachen University research teams can be confident these applications will run as expected right out of the box, which saves time and frees up technical personnel to focus on other issues. The standards-based architecture also helps to simplify development for custom applications, since developers can count on a consistent hardware and software platform.

“Intel worked closely with us during the certification process,” says Dieter an Mey, “and provided excellent support for solving the issues identified by Intel Cluster Checker.” With the certification process finished, RWTH Aachen now has a working baseline for the cluster configuration, which can be used for periodic health checks and further improvements. Maintenance teams are able to identify and resolve hardware and software issues earlier, so they can keep the cluster running optimally and maintain application compatibility.

Cooling infrastructure. 1,600 kW of air and water cooling are required to keep the cluster operating at optimal temperatures.


Empowering Researchers Across Diverse Fields

The performance and flexibility of the RWTH Aachen supercomputer will be key to driving advances across diverse fields of study. Application developers can take full advantage of the benefits offered by MPI, OpenMP, and hybrid parallel programs, and system administrators can tailor cluster resources to closely match the needs of individual applications. According to Dieter an Mey, “This HPC system is a critical resource for our users in Engineering Sciences, Physics, Chemistry, Biology, Math and Computer Sciences. It significantly contributes to the research progress made at RWTH Aachen University in simulation sciences, from gaining deep insight in natural phenomena to the development of new materials and technologies.”
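The hybrid programs mentioned above combine the two models sketched earlier: MPI carries data between nodes while OpenMP threads share memory within each node. A minimal, hypothetical C sketch of the pattern:

/* Minimal hybrid MPI+OpenMP sketch (illustrative only): each MPI
 * rank runs an OpenMP-threaded loop over its share of the work,
 * then the per-rank results are combined by message passing.
 * Compile: mpicc -fopenmp hybrid.c -o hybrid
 * Run:     OMP_NUM_THREADS=12 mpirun -np 4 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    const long n_per_rank = 10000000L;
    long local_count = 0, total = 0;

    /* Request a thread level that allows OpenMP regions between
     * MPI calls (only the main thread calls MPI here). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Scale-up within the node: threads share the rank's memory. */
    #pragma omp parallel for reduction(+:local_count)
    for (long i = 0; i < n_per_rank; i++)
        local_count++;   /* stand-in for real per-element work */

    /* Scale-out across nodes: results move by message passing. */
    MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%ld elements on %d ranks x up to %d threads each\n",
               total, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

On a system like Aachen’s, each rank would typically map to one node or socket of the MPI partition, or to a large SMP node, with OpenMP threads filling the cores beneath it.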

For more information, visit the following Web sites.

Intel Cluster Ready intel.com/go/cluster

Bull HPC Solutions www.bull.com/extreme-computing/bullx.html

science+computing www.science-computing.de/en.html

RWTH Aachen Supercomputer www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/~omk/beschreibung_der_hpc_systeme/?lang=en

Center for Computing and Communication of RWTH Aachen University www.rz.rwth-aachen.de/go/id/owz/?lang=en

Find an Intel Cluster Ready solution that is right for your organization. Contact your Intel representative or visit intel.com/go/cluster.

SOLUTION PROVIDED BY:

1 Top500 List - November 2011 (1-100). Top500® Computer Sites. November 2011. December 30, 2011. http://www.top500.org/list/2011/11/100.

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at http://software.intel.com/en-us/articles/optimization-notice.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A “Mission Critical Application” is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL’S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS’ FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined”. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

This document and the information given are for the convenience of Intel’s customer base and are provided “AS IS” WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not grant any license to any of the intellectual property described, displayed, or contained herein. Intel® products are not intended for use in medical, lifesaving, life-sustaining, critical control, or safety systems, or in nuclear facility applications.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright ©2011 Intel Corporation. All rights reserved. Intel, the Intel logo and Intel Cluster Ready are trademarks or registered trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Printed in USA 1211/KE/HEM/XX/PDF Please Recycle 326552-001US


Aachen supercomputer. Together, the MPI partition (left) and the SMP partition (right) provide up to 292 teraflops of computing power.