Modern Computing: Cloud, Distributed, & High Performance


Page 1: Modern Computing: Cloud, Distributed, & High Performance

SECTION 3: MODERN COMPUTING: CLOUD, DISTRIBUTED & HIGH PERFORMANCE

DR. ÜMİT V. ÇATALYÜREK, PROFESSOR AND ASSOCIATE CHAIR, Georgia Institute of Technology

JANUARY 27, 2017

The Big Data to Knowledge (BD2K) Guide to the Fundamentals of Data Science

1

Page 2: Modern Computing: Cloud, Distributed, & High Performance

ÜMİT V. ÇATALYÜREK

• A Professor in the School of Computational Science & Engineering in the College of Computing at the Georgia Institute of Technology.

• A recipient of an NSF CAREER award

• The principal investigator of several awards from the Department of Energy, the National Institutes of Health, & the National Science Foundation.

• An Associate Editor for Parallel Computing, & an editorial board member for IEEE Transactions on Parallel & Distributed Systems, & the Journal of Parallel & Distributed Computing.

• A Fellow of IEEE, a member of ACM & SIAM, the Chair of IEEE TCPP for the 2016-2017 term, & the Vice-Chair of ACM SIGBio for the 2015-2018 term.

• Main research areas: parallel computing, combinatorial scientific computing & biomedical informatics.

• More information about Dr. Ümit V. Çatalyürek can be found at http://cc.gatech.edu/~umit.

2

Page 3: Modern Computing: Cloud, Distributed, & High Performance

MODERN COMPUTING: CLOUD, DISTRIBUTED & HIGH PERFORMANCE COMPUTING

Ümit V. Çatalyürek, Professor and Associate Chair, School of Computational Science and Engineering, Georgia Institute of Technology

The BD2K Guide to the Fundamentals of Data Science Series, 27 January 2017

3

Page 4: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

4

Page 5: Modern Computing: Cloud, Distributed, & High Performance

What does High Performance Computing (HPC) mean?

• There is no such thing as “Low Performance Computing”

• “HPC most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business” (insideHPC)

• “HPC allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.” (Amazon AWS)

• “HPC is the use of parallel processing for running advanced application programs efficiently, reliably and quickly… The term HPC is occasionally used as a synonym for supercomputing.” (SearchEnterpriseLinux/WhatIs.com)

5

Page 6: Modern Computing: Cloud, Distributed, & High Performance

My Definition of High Performance Computing (HPC)

• Efficient use of computing platforms for running application programs quickly.

• Why do we care about speed?
  • We do not want science to wait for computing.

• Why do we care about efficiency?
  • Efficient use of resources means more resources available to all of us.
  • Somebody has to pay the bills!
  • When you have an efficient program, it will also be very fast!

• Supercomputing is HPC, but HPC does not mean just supercomputing.
  • For supercomputers, check top500.org (more later).

6

Page 7: Modern Computing: Cloud, Distributed, & High Performance

Computing Today

• Computing = Parallel Computing = HPC

• Any “computer” you touch has parallel processing power:
  • Your laptop’s CPU has at least 2 cores.
  • Your cell phone has 4-8 cores!

• This is a BD2K seminar: data (and hence computational need) is BIG!
  • So big that it does not fit into your computer.
  • It takes too long to compute on your computer.
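Not part of the original slides: a minimal Python sketch to check the "every machine is parallel" claim on your own computer, assuming only the standard library.

```python
# Report how many logical cores (hardware threads) the interpreter can see.
import os

print(f"Logical cores visible to Python: {os.cpu_count()}")
```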

[Chart: growth of GenBank and WGS bases (megabases, log scale) from Dec-82 to Jul-15. Source: http://www.genome.gov/sequencingcosts/. Inset: Oxford Nanopore MinION MkI.]

7

Page 8: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

8

Page 9: Modern Computing: Cloud, Distributed, & High Performance

History of Single “Processor” Performance

9

[Chart: history of single-“processor” performance over time, with annotations marking the shift to RISC and the move to multi-cores.]

Page 10: Modern Computing: Cloud, Distributed, & High Performance

Bandwidth and Latency

• Bandwidth or throughput
  • Total work done in a given time
  • 10,000-25,000X improvement for processors
  • 300-1200X improvement for memory and disks

• Latency or response time
  • Time between start and completion of an event
  • 30-80X improvement for processors
  • 6-8X improvement for memory and disks

10

Page 11: Modern Computing: Cloud, Distributed, & High Performance

Bandwidth and Latency

11

Log-log plot of bandwidth and latency milestones

Page 12: Modern Computing: Cloud, Distributed, & High Performance

Flynn’s Taxonomy

12

Instructions (Single/Multiple) × Data (Single/Multiple):

• SISD (Single Instruction, Single Data): single-threaded process
• SIMD (Single Instruction, Multiple Data): vector processing
• MISD (Multiple Instruction, Single Data): pipeline architecture
• MIMD (Multiple Instruction, Multiple Data): shared-/distributed-memory computing

Page 13: Modern Computing: Cloud, Distributed, & High Performance

SISD

13

[Diagram: a single processor applies one instruction stream to one data stream (D D D ...).]

Page 14: Modern Computing: Cloud, Distributed, & High Performance

SIMD

14

[Diagram: a single instruction stream is applied in lockstep to many data streams (D0, D1, ..., Dn) across the processor's parallel lanes.]
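A minimal sketch (not from the slides) of SIMD-style data parallelism, assuming NumPy is installed: the same operation is expressed once over a whole array instead of element by element in a loop.

```python
# One "instruction" (multiply by 2, add 1) applied to many data elements.
import numpy as np

data = np.arange(100_000, dtype=np.float64)   # D0 ... Dn

# Scalar (SISD-style) loop: one element per iteration.
scalar = np.empty_like(data)
for i in range(data.size):
    scalar[i] = 2.0 * data[i] + 1.0

# Vectorized (SIMD-style): the same operation over the whole array at once.
vectorized = 2.0 * data + 1.0

assert np.allclose(scalar, vectorized)
```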

Page 15: Modern Computing: Cloud, Distributed, & High Performance

GPU (SIMD) Advantage

15

Images are from W. Dally’s SC10 Keynote Talk.

Page 16: Modern Computing: Cloud, Distributed, & High Performance

MIMD

16

[Diagram: multiple processors, each executing its own instruction stream on its own data stream.]
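A small sketch of MIMD execution (my illustration, not the speaker's code), assuming standard Python: independent worker processes each run their own instruction stream on their own portion of the data.

```python
# Each worker process executes independently on its own slice of the data (MIMD).
from multiprocessing import Pool

def partial_sum(chunk):
    # Every process runs this function on different data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # four independent work items
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```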

Page 17: Modern Computing: Cloud, Distributed, & High Performance

Memory Topology: Shared

17

[Diagram: several processors all connected to a single shared memory.]

a.k.a. SMPs

Page 18: Modern Computing: Cloud, Distributed, & High Performance

Memory Topology: Distributed

18

[Diagram: several processor + memory pairs connected by a network; each processor directly accesses only its own memory.]
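A hedged sketch of the distributed-memory model, assuming the mpi4py package and an MPI runtime are available (neither is named on the slide): each rank owns its own memory, and results are combined through explicit communication over the network.

```python
# Run with, e.g.: mpiexec -n 4 python distributed_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id
size = comm.Get_size()          # total number of processes

# Each rank computes on its own local data (its own memory).
local_value = sum(range(rank * 1000, (rank + 1) * 1000))

# Explicit communication combines the partial results on rank 0.
total = comm.reduce(local_value, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum:", total)
```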

Page 19: Modern Computing: Cloud, Distributed, & High Performance

Memory Topology: Hybrid

19

[Diagram: several nodes connected by a network; within each node, multiple processors share one memory.]

Page 20: Modern Computing: Cloud, Distributed, & High Performance

Memory Topology: Hybrid + Heterogeneous

20

[Diagram: several nodes connected by a network; within each node, multiple processors share one memory and a GPU accelerator is attached.]
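A minimal illustration of offloading work to an accelerator in such a heterogeneous node, assuming a CUDA-capable GPU and the CuPy library (neither is named on the slide); this is a sketch, not the talk's code.

```python
# Offload a matrix multiply to the GPU, then copy the result back to host memory.
import numpy as np
import cupy as cp   # assumption: CuPy installed and a CUDA GPU present

a_host = np.random.rand(2048, 2048).astype(np.float32)

a_gpu = cp.asarray(a_host)        # host -> device copy
c_gpu = a_gpu @ a_gpu             # computed on the GPU
c_host = cp.asnumpy(c_gpu)        # device -> host copy

print(c_host.shape)
```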

Page 21: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

21

Page 22: Modern Computing: Cloud, Distributed, & High Performance

Oxen or Chicken Dilemma

• "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?”Seymour Cray

22

Page 23: Modern Computing: Cloud, Distributed, & High Performance

23

Page 24: Modern Computing: Cloud, Distributed, & High Performance

Highlights from Top500

24

Page 25: Modern Computing: Cloud, Distributed, & High Performance

Highlights from Top500

25

Page 26: Modern Computing: Cloud, Distributed, & High Performance

Highlights from Top500

26

Page 27: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

27

Page 28: Modern Computing: Cloud, Distributed, & High Performance

Amdahl’s Law

28

$$\mathrm{Speedup_{overall}} = \frac{\mathrm{ExTime_{old}}}{\mathrm{ExTime_{new}}} = \frac{1}{(1-\mathrm{Fraction_{enhanced}}) + \dfrac{\mathrm{Fraction_{enhanced}}}{\mathrm{Speedup_{enhanced}}}}$$

Best you could ever hope to do:

$$\mathrm{Speedup_{maximum}} = \frac{1}{1-\mathrm{Fraction_{enhanced}}}$$

where

$$\mathrm{ExTime_{new}} = \mathrm{ExTime_{old}} \times \left[(1-\mathrm{Fraction_{enhanced}}) + \frac{\mathrm{Fraction_{enhanced}}}{\mathrm{Speedup_{enhanced}}}\right]$$

Page 29: Modern Computing: Cloud, Distributed, & High Performance

Amdahl’s Law Example:

$$\mathrm{Speedup_{overall}} = \frac{1}{(1-\mathrm{Fraction_{enhanced}}) + \dfrac{\mathrm{Fraction_{enhanced}}}{\mathrm{Speedup_{enhanced}}}} = \frac{1}{(1-0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$

29

• A sequence-analysis pipeline has a “slow” step which does error correction of the input reads.

• The new CPU is 10X faster.
  • The server is I/O bound, so 60% of the time is spent waiting for I/O.

• Apparently, it’s human nature to be attracted by “10X faster,” vs. keeping in perspective that it’s just 1.6X faster (see the sketch below).
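A small sketch of this calculation (my own illustration, assuming Python); the 40% fraction and 10X enhancement come from the slide.

```python
# Amdahl's law: overall speedup when only a fraction of the work is enhanced.
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Slide example: compute is 40% of the time, and the new CPU is 10x faster.
print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
# Upper bound even with an infinitely fast CPU: 1 / (1 - 0.4)
print(round(1 / (1 - 0.4), 2))             # 1.67
```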

Page 30: Modern Computing: Cloud, Distributed, & High Performance

Multiple Sequence Alignment

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

or

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

• Optimal: O(2^n ∏ |l_i|)
• For 6 sequences of length 100, if the constant is 10^-9 seconds:
  • running time 6.4 × 10^4 seconds (~17.7 hours)
• Add 2 sequences:
  • running time 2.6 × 10^9 seconds (~82.4 years!) (checked in the short sketch below)

30
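A quick check of the arithmetic above (my own sketch in Python, using the slide's 10^-9-second constant and the O(2^n ∏ l_i) cost model for sequences of equal length):

```python
# Optimal MSA cost model from the slide: constant * 2^n * l^n seconds
# for n sequences of equal length l.
def msa_seconds(n: int, length: int, constant: float = 1e-9) -> float:
    return constant * (2 ** n) * (length ** n)

print(msa_seconds(6, 100))   # ~6.4e4 seconds (~17.7 hours)
print(msa_seconds(8, 100))   # ~2.6e9 seconds (~82 years)
```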

Page 31: Modern Computing: Cloud, Distributed, & High Performance

CLUSTAL W

• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
  • Pairwise Alignment (n²l²)
    • A distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower)
  • Computation of Guide Tree (n³): phylogenetic tree
    • Computed from the distance matrix
    • Iteratively selecting aligned pairs and linking them
  • Progressive Alignment (nl²)
    • A series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences
    • The order in the Guide Tree determines the ordering of sequence alignments
    • At each step, either two sequences are aligned, a new sequence is aligned with a group, or two groups are aligned
• n: number of sequences in the query
• l: average sequence length

31
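An illustrative outline of the three-phase progressive-alignment strategy described above (a sketch, not CLUSTAL W's actual code); `pairwise_distance`, `build_guide_tree`, and `align_profiles` are hypothetical placeholder functions supplied by the caller.

```python
def progressive_alignment(sequences, pairwise_distance, build_guide_tree, align_profiles):
    """Hypothetical sketch of a CLUSTAL W-style pipeline; helpers are placeholders."""
    n = len(sequences)

    # 1) Pairwise alignment, ~O(n^2 l^2): distance matrix over all sequence pairs.
    dist = [[pairwise_distance(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]

    # 2) Guide tree, ~O(n^3): join order computed from the distance matrix.
    join_order = build_guide_tree(dist)        # list of (group_a, group_b) merges

    # 3) Progressive alignment, ~O(n l^2): align larger and larger groups,
    #    following the guide tree's join order.
    groups = {i: [sequences[i]] for i in range(n)}
    for a, b in join_order:
        merged = align_profiles(groups.pop(a), groups.pop(b))
        groups[min(a, b)] = merged
    (final_alignment,) = groups.values()
    return final_alignment
```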

Page 32: Modern Computing: Cloud, Distributed, & High Performance

Speeding up CLUSTAL W

[Chart: breakdown of CLUSTAL W execution time on a PIII-650MHz as the number of GPCR sequences grows from 25 to 1000; time fractions for pairwise alignment, guide tree, and progressive alignment.]

• By parallelizing the most time-consuming part: pairwise alignment

[Chart: speedup of the parallelized version of CLUSTAL W on 1-8 processors; series: ideal (linear) speedup, pairwise-alignment speedup, and total speedup.]

32

Page 33: Modern Computing: Cloud, Distributed, & High Performance

More on Amdahl’s law

[Chart: Amdahl's-law speedup vs. number of processors (up to ~9,000) for serial fractions of 10%, 5%, 2%, 1%, 0.5%, and 0.1%; speedup flattens out well below the processor count unless the serial fraction is tiny.]

33

Page 34: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

34

Page 35: Modern Computing: Cloud, Distributed, & High Performance

Levels of the Memory Hierarchy

Level (upper = faster, lower = larger) / Capacity / Access Time / Cost:

• CPU registers: 100s of bytes, 300-500 ps (0.3-0.5 ns)
• L1 and L2 cache: 10s-100s of KBytes, ~1 ns - ~10 ns, $1000s/GByte
• Main memory: GBytes, 80 ns - 200 ns, ~$100/GByte
• Disk: 10s of TBytes, 10 ms (10,000,000 ns), ~$1/GByte
• Tape: “infinite” capacity, sec-min, ~$1/GByte

Data moves between levels in units of instruction operands (registers ↔ L1), blocks (L1 ↔ L2, L2 ↔ memory), pages (memory ↔ disk), and files (disk ↔ tape).

35
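A small experiment (my own, assuming NumPy) that makes the hierarchy visible: traversing a matrix along its in-memory layout (row-major) touches contiguous, cache-friendly blocks, while column-order traversal strides through memory. The exact gap depends on cache sizes, but row order is typically noticeably faster.

```python
# Locality demo: same arithmetic, different traversal order.
import time
import numpy as np

a = np.random.rand(4096, 4096)        # row-major (C-order) layout

t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(a.shape[0]))     # contiguous rows
t1 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(a.shape[1]))     # strided columns
t2 = time.perf_counter()

print(f"row-order: {t1 - t0:.3f}s, column-order: {t2 - t1:.3f}s")
assert np.isclose(row_sum, col_sum)
```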

Page 36: Modern Computing: Cloud, Distributed, & High Performance

Locality Aware Remote Visualization

• Scientific and clinical research generate multi-GB to multi-TB of spatially and temporally correlated data
  • Different spatial and temporal resolutions
  • Different acquisition modalities, from CT to light microscopy to electron micrography
  • Example applications: Visible Human, mouse BIRN

• DataCutter streams data to an MPI-based OSC parallel renderer

• Setup
  • Full-color Visible Woman dataset
  • Super-sampled at 2x for the entire dataset, 4x and 8x for regions of the dataset
  • Data stored on 20 nodes
  • 8 rendering nodes and 1 compositing node with texture VR
  • Remote thin client connected over the internet

Page 37: Modern Computing: Cloud, Distributed, & High Performance

System Overview

Page 38: Modern Computing: Cloud, Distributed, & High Performance

Query Execution

Page 39: Modern Computing: Cloud, Distributed, & High Performance

Implementation of OSC Parallel Renderer

Page 40: Modern Computing: Cloud, Distributed, & High Performance

Implementation of OSC Parallel Renderer

Page 41: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

41

Page 42: Modern Computing: Cloud, Distributed, & High Performance

Current and Emerging Scientific Applications

42

Processing Remotely-Sensed Data (NOAA TIROS-N w/ AVHRR sensor)

AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.

A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit
• ~110 minutes
• ~40 megabytes
• ~15,000 scan lines

One scan line is 409 IFOVs.

Other examples: Satellite Data Processing, DCE-MRI Analysis, Short Sequence Mapping, Quantum Chemistry, Image Processing, Multimedia Video Surveillance, Montage

Page 43: Modern Computing: Cloud, Distributed, & High Performance

Application Patterns

• Complex and diverse processing structures

43

[Figure: the remotely-sensed data processing example (NOAA TIROS-N / AVHRR) from the previous slide, cast as a Bag-of-Tasks Model.]

[Diagram: Data Analysis Applications → Bag-of-Tasks Applications; each task reads its own file.]

Page 44: Modern Computing: Cloud, Distributed, & High Performance

Application Patterns

• Complex and diverse processing structures

44

[Diagram: Data Analysis Applications → Bag-of-Tasks Applications (task, file) and Workflows → Non-streaming workflows (sequential or parallel tasks).]

Page 45: Modern Computing: Cloud, Distributed, & High Performance

Application Patterns

• Complex and diverse processing structures

45

[Diagram: Data Analysis Applications → Bag-of-Tasks Applications (task, file) and Workflows → Non-streaming and Streaming workflows (sequential or parallel tasks).]

Page 46: Modern Computing: Cloud, Distributed, & High Performance

Taxonomy of Parallelism

• Complex and diverse processing structures
• Varied parallelism

46

[Diagram: bag-of-tasks applications; independent sequential tasks, each with its own file, are assigned to processors P1-P4: task-parallelism.]

Page 47: Modern Computing: Cloud, Distributed, & High Performance

Application Patterns

• Complex and diverse processing structures
• Varied parallelism

• Bag-of-tasks applications: task-parallelism (sketched below)

47
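A minimal sketch of the bag-of-tasks pattern (my illustration, not code from the talk): independent tasks farmed out to a pool of worker processes; `analyze_one_task` is a hypothetical stand-in for a real per-file analysis.

```python
# Bag-of-tasks: independent tasks distributed across worker processes.
from concurrent.futures import ProcessPoolExecutor

def analyze_one_task(task_id: int) -> int:
    # Hypothetical stand-in for analyzing one independent input (e.g., one file).
    return sum(i * i for i in range(task_id, task_id + 10_000))

if __name__ == "__main__":
    tasks = list(range(100))                       # one independent task per input
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(analyze_one_task, tasks))
    print(len(results), "tasks completed")
```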

Page 48: Modern Computing: Cloud, Distributed, & High Performance

Taxonomy of Parallelism

• Complex and diverse processing structures
• Varied parallelism

[Diagram: a non-streaming workflow of sequential or parallel tasks mapped onto processors P1-P4: task-parallelism and data-parallelism.]

48

Page 49: Modern Computing: Cloud, Distributed, & High Performance

Taxonomy of Parallelism

• Complex and diverse processing structures
• Varied parallelism

• Bag-of-tasks: task-parallelism
• Non-streaming workflows: task- and data-parallelism

49

Page 50: Modern Computing: Cloud, Distributed, & High Performance

Taxonomy of Parallelism

• Complex and diverse processing structures
• Varied parallelism

[Diagram: a streaming workflow of sequential or parallel tasks mapped onto processors P1-P4: task-parallelism, data-parallelism, and pipelined-parallelism.]

50

Page 51: Modern Computing: Cloud, Distributed, & High Performance

Taxonomy of Parallelism

• Complex and diverse processing structures
• Varied parallelism

• Bag-of-tasks: task-parallelism
• Non-streaming workflows: task- and data-parallelism
• Streaming workflows: task-, data- and pipelined-parallelism (see the pipeline sketch below)

51
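A sketch of pipelined parallelism for a streaming workflow (mine, not from the talk), assuming standard Python: two stages run as separate processes connected by a bounded queue, so stage 2 works on item i while stage 1 is already producing item i+1.

```python
# Two-stage streaming pipeline: producer and consumer overlap in time.
from multiprocessing import Process, Queue

def stage1_read(out_q: Queue, n_items: int) -> None:
    for i in range(n_items):
        out_q.put(i * 2)          # stand-in for "read/decode item i"
    out_q.put(None)               # end-of-stream marker

def stage2_analyze(in_q: Queue) -> None:
    while True:
        item = in_q.get()
        if item is None:
            break
        _ = item ** 2             # stand-in for "analyze item"

if __name__ == "__main__":
    q = Queue(maxsize=8)          # bounded buffer between pipeline stages
    p1 = Process(target=stage1_read, args=(q, 100))
    p2 = Process(target=stage2_analyze, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("pipeline finished")
```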

Page 52: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

52

Page 53: Modern Computing: Cloud, Distributed, & High Performance

An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Classify biopsy tissue images into different subtypes of prognostic significance

• Very high resolution slides
  • Divided into smaller tiles

• Multi-resolution image analysis
  • Mimics the way pathologists perform their analysis
  • If classification at lower resolution is not satisfactory, the analysis algorithm is executed at higher resolution(s), hence the dynamic workload.

53

Page 54: Modern Computing: Cloud, Distributed, & High Performance

Why do we need HPC?

• Due to the large sizes of whole-slide images
  • A 120K x 120K image digitized at 40x occupies more than 40 GB.

• The processing time on a single CPU
  • For an image tile of 1K x 1K: ≈6 secs w/ Matlab, 850 msecs w/ C++
  • For a “small” 50K x 50K slide (assuming 50% background): ≈20 min (a rough check follows below)

• In algorithm development
  • Algorithm development is done in Matlab
  • Requires evaluation of many different techniques, parameters, etc.

• In clinical practice, 8-9 biopsy samples are collected per patient. For an average of 500 neuroblastoma patients treated annually, our biomedical image analysis consumes:
  • On a CPU: 24 months using Matlab and 3.4 months using C++.
  • Can we reduce this to a couple of days or even hours?

54
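A rough back-of-the-envelope check of the slide's numbers (my own arithmetic, using the 850 ms/tile C++ figure and the slide's 50%-background assumption):

```python
# Rough check: processing time for a 50K x 50K slide in 1K x 1K tiles.
tiles = (50_000 // 1_000) ** 2            # 2,500 tiles
processed = tiles * 0.5                   # 50% of tiles are background and skipped
seconds_cpp = processed * 0.850           # 850 ms per tile with C++
print(seconds_cpp / 60)                   # ~17.7 minutes, i.e. on the order of 20 min
```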

Page 55: Modern Computing: Cloud, Distributed, & High Performance

Computational Infrastructure

[Diagram: a whole-slide image is divided into image tiles (40X magnification); classification labels (Label 1, Label 2, background, undetermined) are assigned to produce a classification map. Computation units: CPU (C/C++), CPU (SSE), Intel Xeon Phi, GPU, ...]

55

Page 56: Modern Computing: Cloud, Distributed, & High Performance

Characterizing the GPU/CPU speed-up

Operation:            Color conversion | Co-occurrence matrices | LBP operator | Histogram
Color channels:       Three            | Three                  | One          | One
Output results:       1Kx1K tile       | 4x4 matrix             | 1Kx1K tile   | 256 bins
Computational weight: Heavy            | Average                | Heavy        | Low
Operator type:        Streaming        | Iterative              | Streaming    | Iterative
Data reuse:           None             | Strong                 | Little       | Strong
Locality of access:   None             | High                   | Little       | High
Arithmetic intensity: Heavy            | Low                    | Average      | Low
Memory access:        Low              | High                   | Average      | High
GPU speed-up:         166.09x          | 16.75x                 | 85.86x       | 8.32x

56

Page 57: Modern Computing: Cloud, Distributed, & High Performance

Effect of runtime optimizations

57

Homogeneous base case

Heterogeneous base case

Tile recalculation rate: % of tiles recalculated at higher resolution.

ODDS improves performance even in the base case

Using an additional CPU-only machine is more than 3x faster than the GPU-only version

Source: Cluster Comput (2012) 15:125–144

Table 6: Different demand-driven scheduling policies used in Sect. 6

Policy  | Area of effect | Sender queue policy | Receiver queue policy | Size of request for data buffers
DDFCFS  | Intra-filter   | Unsorted            | Unsorted              | Static
DDWRR   | Intra-filter   | Unsorted            | Sorted by speedup     | Static
ODDS    | Inter-filter   | Sorted by speedup   | Sorted by speedup     | Dynamic

In Table 6 we present three demand-driven policies (where consumer filters only get as much data as they request) used in our evaluation. All these scheduling policies maintain some minimal queue at the receiver side, such that processor idle time is avoided. Simpler policies like round-robin or random do not fit into the demand-driven paradigm, as they simply push data buffers down to the consumer filters without any knowledge of whether the data buffers are being processed efficiently. As such, we do not consider these to be good scheduling methods, and we exclude them from our evaluation.

The First-Come, First-Served (DDFCFS) policy simply maintains FIFO queues of data buffers on both ends of the stream, and a filter instance requesting data will get whatever data buffer is next out of the queue. The DDWRR policy uses the same technique as DDFCFS on the sender side, but sorts its receiver-side queue of data buffers by the relative speedup to give the highest-performing data buffers to each processor. Both DDFCFS and DDWRR have a static value for requests for data buffers during execution, which is chosen by the programmer. For ODDS, discussed in Sect. 5.3, the sender and receiver queues are sorted by speedup and the receiver's number of requests for data buffers is dynamically calculated at run-time.

6.5.1 Homogeneous cluster base case

This section presents the results of experiments run in the homogeneous cluster base case, which consists of a single CPU/GPU-equipped machine. In these experiments, we compared ODDS to DDWRR. DDWRR is the only one used for comparison because it achieved the best performance among the intra-filter task assignment policies (see Sect. 6.3). These experiments used NBIA with asynchronous copy, and 26,742 image tiles with two resolution levels, as in Sect. 6.3, and the tile recalculation rate is varied.

Fig. 17: Homogeneous base case evaluation
Fig. 18: Tiles processed by CPU for each communication policy as recalculation rate is varied

The results, presented in Fig. 17, surprisingly show that even for one processing node ODDS could surpass the performance allowed by DDWRR. The gains due to asynchronous transfers between ODDS and DDWRR at a 20% tile recalculation rate, for instance, are around 23%. The improvements obtained by ODDS are directly related to the ability to better select data buffers that maximize the performance of the target processing units. It occurs even for one processing machine because the data buffers are queued at the sender side for both policies, but ODDS selects the data buffers that maximize the performance of all processors of the receiver, improving the ability of the receiver filter to better assign tasks locally.

Figure 18 presents the percentage of tasks processed by the CPU according to the communication policy and tile recalculation rate. As shown, DDFCFS is only able to process a reasonable amount of tiles when the reconfiguration rate is 0%; its collaboration to the entire execution is minimal for the other experiments. When analyzing DDWRR and ODDS, on the other hand, both allow the CPU to compute a significant number of tiles for all values of reconfiguration rate, which directly explains the performance gap between

Page 58: Modern Computing: Cloud, Distributed, & High Performance

Outline

• HPC
  • What is it? Why?

• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers

• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism

• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma

• Summary

58

Page 59: Modern Computing: Cloud, Distributed, & High Performance

How about Cloud Computing?

• Cloud Computing
  • It is not really “Cloud”; it is someone else’s computer!
  • Rent instead of buy.
  • Pay for compute, data storage, and transfer.
  • Our current best bet to enable sharing of large data, workflows, and computational resources.
  • For “most of us,” our best bet to achieve scalability and speed.

• Sample reading:
  • Eric E. Schadt, Michael D. Linderman, Jon Sorenson, Lawrence Lee, and Garry P. Nolan. “Computational solutions to large-scale data management and analysis.” Nature Reviews Genetics 11, 647-657 (September 2010). doi:10.1038/nrg2857
  • http://www.nature.com/nrg/multimedia/compsolutions/slideshow.html
  • See also: Correspondence by Trelles et al. | Correspondence by Schadt et al.

59

Page 60: Modern Computing: Cloud, Distributed, & High Performance

Summary

• How to speed up your application?

• Focus on the common case
  • If only 50% can be “improved,” the best you can get is a 2x speedup!

• Pay attention to locality
  • Reduce data movement
  • Move computation to data

• Take advantage of parallelism
  • Multiple types of parallelism: task-, data- and pipelined-parallelism
  • The fastest processor does not mean your application will run fast; find the most suitable architecture.
  • GPUs are good for “regular” computations
  • GPUs can be up to 10x faster compared to a multi-core CPU; in many real-life applications it is usually 3-5x

60

Page 61: Modern Computing: Cloud, Distributed, & High Performance

QUESTIONS?

61