OPERA Software Architecture OSA - Zettaflops (zettaflops.org/spc08/Crago-ISI-FTSCENT-OSA-5-08.pdf)

-UNCLASSIFIED-

OPERA Software Architecture (OSA)

Steve Crago, Janice McMahon

USC/ISI-East

May 29, 2008



Outline

Introduction

OPERA Software Environment

Software Fault Tolerance for OPERA

Multicore Trends

[Figure: example multicore processors: x86 Multi-Core, Cell, GPU, TILE64]

Power density and complexity problems have driven current and next-generation processors to put multiple cores on a chip
- Intel Core Duo
- AMD
- 8-core Sun Niagara-2
- Tilera TILE64

On-chip parallelism improves throughput of multiple applications, but results in programming challenges
- Single applications must be parallelized
- Parallel applications must be scalable
- Requires highly skilled programmers or better tools
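
The scalability point can be made concrete with Amdahl's law, which is not on the slide and is brought in here only as an illustration: if a fraction p of a program's run time can be parallelized, the speedup on n cores is at most 1 / ((1 - p) + p / n). The sketch below is a minimal version of that formula; the function name and the 95% figure are assumptions chosen for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup from Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed example: 95% of the run time is parallelizable.
    for cores in (2, 8, 64):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Under that assumption, 64 cores give a bound of roughly 15x rather than 64x, which is why the remaining serial portion matters more as the core count grows.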


Interprocessor Communication

Low-latency connections between compute tiles expose new software issues (multi-core ≠ Symmetric Multiprocessor ≠ High Performance Computer)

[Figure: a Traditional Multiprocessor (labels: DRAM ~100 cycles; ~1000 cycles) compared with a Multi-Core Architecture drawn as an 8 x 8 grid of tiles numbered 0 to 63 (labels: DRAM ~100 cycles; <10 cycles latency between tiles)]
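
To make the tile numbering in the figure concrete, the sketch below maps a tile number to a (row, column) position and counts the hops between two tiles. It assumes the row-major numbering shown in the figure (0 to 7 in the first row) and assumes a 2D mesh in which a message moves one row or column per hop. The slide gives only the latency label between tiles, so the routing model and the function names are hypothetical.

```python
GRID_WIDTH = 8  # the figure shows 8 rows of 8 tiles, numbered 0 to 63


def tile_position(tile: int, width: int = GRID_WIDTH) -> tuple[int, int]:
    """Return (row, column) for a tile, assuming row-major numbering."""
    if not 0 <= tile < width * width:
        raise ValueError(f"tile {tile} is outside the {width}x{width} grid")
    return divmod(tile, width)


def mesh_hops(src: int, dst: int, width: int = GRID_WIDTH) -> int:
    """Hop count between two tiles, assuming a 2D mesh (Manhattan distance)."""
    src_row, src_col = tile_position(src, width)
    dst_row, dst_col = tile_position(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)
```

Under these assumptions `mesh_hops(0, 63)` is 14, the longest path in the grid, so where a task is placed affects how far its messages travel.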

Performance Sources

Instruction Level Parallelism
- Fine grained
- Available in general-purpose legacy code
- Exploited by traditional superscalar and VLIW uniprocessors
- Typically limited by branch prediction and dependencies

Data Level Parallelism
- Efficient to exploit
- Available in streaming applications
- Relatively easy for programmer to specify
- Exploited by multimedia extensions, Cell, SIMD architectures

Thread Level Parallelism
- Executes tasks in parallel
- Needed for scalability
- Trickier to program or extract from program
- Initial focus of commercial multi-core architectures

Multi-core tools must consider these sources of parallelism simultaneously. This differentiates multi-core tools from traditional uniprocessor and multiprocessor tools.
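
As a rough illustration of how data-level and thread-level parallelism can meet in one program, the sketch below applies the same operation to every element of a list (the data-parallel part) and hands independent chunks to worker threads (the thread-level part). The chunking scheme, the worker count, and the function names are assumptions made for the example. In CPython the threads will not speed up this CPU-bound loop; the point is the decomposition, which a multi-core tool would have to discover or be told about.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(values: list[int]) -> int:
    """Per-chunk work: the same operation applied to every element."""
    return sum(v * v for v in values)


def split_into_chunks(values: list[int], parts: int) -> list[list[int]]:
    """Split values into at most `parts` contiguous chunks of similar size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum_of_squares(values: list[int], workers: int = 4) -> int:
    """Thread-level decomposition: one task per chunk, results combined."""
    chunks = split_into_chunks(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

The result matches the sequential `sum_of_squares` for the same input; only the way the work is divided changes.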

Government/Industry Multi-core Roles

Industry
- Hardware: stamp x86 cores on a chip (e.g. Intel, AMD)
- Hardware: develop aggressive multi-core chip for specific commercial markets (e.g. Tilera)
- Evolutionary software tools and programming support
- General-purpose software research and education (e.g. fund universities to teach students how to program multi-core)

Government Responsibilities
- Special needs (e.g. radiation hardening)
- Application libraries (e.g. MPI, VSIPL)
- Domain-specific tools
- Run-time management for dynamic scenarios and autonomous operation (resource management and fault tolerance)
- Longer term research



OPERA Software Architecture Goals

Enable application demonstration of OPERA processor
- Provide basic functionality to allow hardware technology demonstration
- Leverage commercial Tilera software base
  - Current support is for sequential compiler and library for message passing and shared memory

Provide baseline programming environment for future missions
- Based on familiar programming models
- Standard-compliant APIs

Provide technology base for future software technologies
- Domain-specific parallelization
- Dynamic resource management
- Software fault tolerance


OPERA Software Architecture

Long-Term Multicore System Software Vision

Supports dynamic, fault-tolerant, and autonomous operation

[Figure: layered system diagram. Labels: Applications (programming languages, explicit/implicit parallelism); Multi-core System Software (Static Compiler; Application/System Libraries, static and dynamic; binaries; intermediate representations; Run-time System; Fault Tolerance; Resource Manager; System Utilities; Dynamic Compiler); Multicore Hardware Resources (processing cores, I/O ports, memory banks, network links)]

[Figure: build flow. Labels: Applications, Application Libraries, Tilera Modified Compiler, Linker, Communication Libraries, Parallel Binary Executable]


OSA Funded Tasks

Tool extensions
- Floating point extensions: compiler, simulator, debugger, iLib library
- Legacy software support: C++, MPI, OpenMP, VSIPL, pVSIPL++, OpenPNL

Performance and productivity tools
- Parallel performance analysis
- Parallel debug
- Run-time monitor and run-time system
- Fine-grain parallel compiler

Applications
- JPEG2K
- CAF
- Co-add
- On-board processing



Software Fault Tolerance

Rad-hard by design (RHBD) as applied to OPERA will provide protection for total dose, latchup, and single-event upsets
- Rate of upsets can be traded off with cost of protection
- RHBD can be used in combination with system-level software techniques

Other sources of faults will still exist, e.g.
- Software error
- Physical damage

Software fault tolerance provides mitigation for faults while minimizing overhead and can work in conjunction with RHBD to provide overall mission reliability

Run-time Research

Resource management and introspection
- Core allocation
- Power management
- Introspection hooks
  - Performance monitoring
  - Behavior monitoring
  - Support for programming and performance tuning

Fault tolerance
- Support from compiler and run-time system
- Limited redundancy
- Check-pointing and roll-back
- Interaction with resource management
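
Check-pointing and roll-back can be sketched in a few lines. The example below is a minimal in-memory illustration, not the OSA design: the class and method names are hypothetical, a fault is modeled as a Python exception, and the checkpoint lives in the same process, whereas a real system would need to keep it somewhere that survives the fault and coordinate it across cores.

```python
import copy


class CheckpointedTask:
    """Hypothetical task wrapper that can roll back to its last checkpoint."""

    def __init__(self, initial_state: dict):
        self.state = copy.deepcopy(initial_state)
        self._checkpoint = copy.deepcopy(initial_state)

    def checkpoint(self) -> None:
        """Save a private copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, step) -> bool:
        """Apply step(state); on any exception roll back and report failure."""
        try:
            step(self.state)
        except Exception:
            self.rollback()
            return False
        self.checkpoint()
        return True
```

A step that fails partway through leaves no trace: the state seen by the next step is the one saved after the last successful step. Checkpointing after every step is the simplest policy; checkpointing less often trades recovery time for lower overhead.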

Fault Tolerance Opportunities

Redundancy available
- Cores
- Networks
- Memory interfaces
- I/O

Programmability
- Fault tolerance and performance can be tuned according to application needs
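
One common way to spend redundant cores on fault tolerance is to run a task several times and vote on the result. The sketch below is an assumption-laden illustration rather than anything described in the talk: the function names are hypothetical, the replicas run one after another in a single process as a stand-in for separate cores, and results must be hashable so they can be counted. The `replicas` parameter is the tuning knob: more replicas cost more compute but tolerate more faulty results.

```python
from collections import Counter


def majority_vote(results: list):
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when no value has a strict majority.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replica results")
    return value


def run_redundantly(task, argument, replicas: int = 3):
    """Run the same task `replicas` times and vote on the outcome."""
    return majority_vote([task(argument) for _ in range(replicas)])
```

With three replicas a single wrong result is outvoted; an application that cannot afford the threefold cost could run one replica and rely on other mechanisms instead.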

Fault Tolerance Challenges

Increased state
- Cores
- Networks
- Memory interfaces
- I/O

Programmability
- Programming can be topology-dependent and tied to physical location
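
One way to loosen the tie between a program and physical tile locations is a level of indirection: tasks are addressed by logical ID and a table records which physical tile hosts each one. The sketch below illustrates that idea only; the class, its methods, and the spare-tile policy are hypothetical and are not taken from the talk.

```python
class TileMap:
    """Hypothetical table mapping logical task IDs to physical tile numbers."""

    def __init__(self, total_tiles: int, logical_tasks: int):
        if logical_tasks > total_tiles:
            raise ValueError("more tasks than tiles")
        # Initially task i runs on tile i; the remaining tiles are spares.
        self.mapping = {task: task for task in range(logical_tasks)}
        self.spares = list(range(logical_tasks, total_tiles))

    def physical_tile(self, task: int) -> int:
        """Look up the tile currently hosting a logical task."""
        return self.mapping[task]

    def mark_failed(self, tile: int) -> None:
        """Retire a tile; move any task it hosted to the next spare tile."""
        if tile in self.spares:
            self.spares.remove(tile)
            return
        for task, hosted_on in self.mapping.items():
            if hosted_on == tile:
                if not self.spares:
                    raise RuntimeError("no spare tiles left")
                self.mapping[task] = self.spares.pop(0)
                return
```

Code that looks up `physical_tile(task)` keeps working after a remap, but the moved task may now sit farther from the tiles it communicates with, which is the topology dependence the slide warns about.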

Summary

OPERA is an opportunity to provide unprecedented general-purpose performance for space

OPERA software will build on commercial software to provide an environment suitable for government applications

OPERA provides both challenges and opportunities for fault tolerance
- Being addressed through both hardware and software