TRANSCRIPT
JST-CREST “Extreme Big Data” Project (2013-2018)
Supercomputers: compute- & batch-oriented, more fragile. Cloud IDC: very low BW & efficiency, but highly available and resilient.
Convergent Architecture (Phases 1~4): large-capacity NVM, high-bisection network.
[Diagram: PCB with a TSV interposer carrying a high-powered main CPU and low-power CPUs, each low-power CPU paired with DRAM and NVM/Flash stacks.]
2 Tbps HBM, 4~6 HBM channels, 1.5 TB/s DRAM & NVM BW; 30 PB/s I/O BW possible, 1 yottabyte/year.
EBD system software, incl. the EBD object system.
Introduction: Problem Domain
In most living organisms, the genetic instructions used in their development are stored in the long polymeric molecule called DNA. DNA consists of two long polymers of simple units called nucleotides. The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T).
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi-GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011.
Future non-silo extreme big data scientific apps, co-designed with exascale big data HPC:
• Large-scale metagenomics
• Massive sensors and data assimilation in weather prediction
• Ultra-large-scale graphs and social infrastructures
Co-design of the EBD system software with the apps: Graph Store / EBD Bag co-design; distributed KVS / EBD KVS co-design; Cartesian plane co-design.
Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds? What are the issues regarding architectural, algorithmic, and system software evolution? What is the role of GPUs?
The Graph500, June 2014 and June 2015: K Computer #1. Tokyo Tech [EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu.

List            Rank  GTEPS     Implementation
November 2013   4     5524.12   Top-down only
June 2014       1     17977.05  Efficient hybrid
November 2014   2               Efficient hybrid
June 2015       1     38621.4   Hybrid + node compression
*Problem size is weak scaling

A "brain-class" graph. K Computer: 88,000 nodes, 700,000 CPU cores, 1.6 petabytes memory, 20 GB/s Tofu NW ≫ LLNL-IBM Sequoia: 1.6 million CPUs, 1.6 petabytes memory.
[Chart: elapsed time (ms, 0-1500) from 64 nodes (Scale 30) to 65536 nodes (Scale 40), broken down by communication; 73% of total execution time is spent waiting in communication.]
Large-Scale Graph Processing Using NVM
1. Hybrid-BFS (Beamer '11)  2. Proposal  3. Experiment
Experimental machine: CPU Intel Xeon E5-2690 ×2; DRAM 256 GB; NVM EBD-I/O 2 TB ×2.
[Chart: median GTEPS (giga traversed edges per second, 0-6) vs. SCALE (# vertices = 2^SCALE, 23-31) for DRAM + EBD-I/O vs. DRAM only; DRAM only reaches 4.1 GTEPS before hitting its capacity limit, while DRAM + EBD-I/O continues past that limit at 3.8 GTEPS.]
EBD-I/O: 8× mSATA SSDs behind a RAID card (RAID 0). (Product images: www.adaptec.com, www.crucial.com)
Result: a 4× larger graph with only 6.9% performance degradation [Iwabuchi, IEEE BigData 2014]. NVM holds the full-size graph, while DRAM holds the highly accessed graph data, loaded before the BFS starts.
Switching between the two approaches, top-down and bottom-up: # of frontier vertices n_frontier, # of all vertices n_all, parameters α and β (see the sketch below).
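As a minimal sketch of how such a direction switch works, the C++ fragment below follows the common Beamer-style heuristic; the threshold values and the use of vertex counts alone (the original formulation also uses edge counts) are illustrative assumptions, not the tuned parameters of this work.

```cpp
// Sketch of Hybrid-BFS direction switching (after Beamer '11).
// n_frontier: vertices in the current frontier; n_all: all vertices.
// alpha, beta: tuning parameters; the values here are illustrative.
#include <cstddef>

enum class Direction { TopDown, BottomUp };

Direction choose_direction(Direction current, std::size_t n_frontier,
                           std::size_t n_all) {
    const double alpha = 14.0;  // assumed; governs top-down -> bottom-up
    const double beta  = 24.0;  // assumed; governs bottom-up -> top-down
    if (current == Direction::TopDown && n_frontier > n_all / alpha)
        return Direction::BottomUp;  // frontier is large: scanning unvisited
                                     // vertices bottom-up touches less data
    if (current == Direction::BottomUp && n_frontier < n_all / beta)
        return Direction::TopDown;   // frontier shrank again: revert
    return current;
}
```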
Ranked 3rd in the Green Graph 500 (June 2014): Tokyo Institute of Technology's GraphCREST-Custom #1 is ranked No. 3 in the Big Data category of the Green Graph 500, with 35.21 MTEPS/W on Scale 31, on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair.
EBD Algorithm Kernels
GPU-based Distributed Sorting [Shamoto, IEEE BigData 2014, IEEE Trans. Big Data 2015]
• Sorting: a kernel algorithm for various EBD processing
• Fast sorting methods
– Distributed sorting (sorting for distributed systems): splitter-based parallel sort, radix sort, merge sort
– Sorting on heterogeneous architectures: many sorting algorithms are accelerated by many cores and high memory bandwidth
• Sorting for large-scale heterogeneous systems remains an open question
• We developed and evaluated a bandwidth- and latency-reducing GPU-based HykSort on TSUBAME2.5 via latency hiding; now preparing to release the sorting library
[Charts: weak-scaling sort rate (keys/second) vs. # of processes (2 per node, up to 2048) for HykSort with 1 thread, 6 threads, and GPU + 6 threads, plus predicted curves PCIe_10/50/100/200 and a scenario with accelerators 4× faster than the K20x; annotated speedups ×1.4, ×3.61, ×389, reaching 0.25 TB/s.]
Performance prediction: ×2.2 speedup compared to the CPU-based implementation when the PCI bandwidth increases to 50 GB/s; 8.8% reduction of overall runtime when the accelerators work 4× faster than the K20x.
• PCIe_#: # GB/s bandwidth of the interconnect between CPU and GPU
• Weak-scaling performance (Grand Challenge on TSUBAME2.5): 1~1024 nodes (2~2048 GPUs), 2 processes per node, each node holding 2 GB of 64-bit integers
• C.f. Yahoo/Hadoop Terasort: 0.02 TB/s (including I/O)
GPU implementation of splitter-based sorting (HykSort), as sketched below.
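Since the slides reference splitter-based parallel sort without showing it, here is a minimal single-process sketch of the general splitter-based scheme (the family HykSort belongs to), with per-"rank" vectors standing in for MPI ranks; the names and sampling details are illustrative, not HykSort's actual implementation.

```cpp
// Splitter-based distributed sort, simulated in one process:
// local sort -> sample -> choose global splitters -> exchange -> final sort.
// Assumes the inputs contain at least some keys overall.
#include <algorithm>
#include <cstddef>
#include <vector>

using Key = long long;

std::vector<std::vector<Key>>
splitter_sort(std::vector<std::vector<Key>> ranks) {
    const std::size_t p = ranks.size();
    // 1. Local sort on every rank.
    for (auto& local : ranks) std::sort(local.begin(), local.end());
    // 2. Each rank contributes regular samples; pick p-1 global splitters.
    std::vector<Key> samples;
    for (auto& local : ranks)
        for (std::size_t i = 1; i < p && !local.empty(); ++i)
            samples.push_back(local[i * local.size() / p]);
    std::sort(samples.begin(), samples.end());
    std::vector<Key> splitters;
    for (std::size_t i = 1; i < p && !samples.empty(); ++i)
        splitters.push_back(samples[i * samples.size() / p]);
    // 3. Partition by splitter; in MPI this is the all-to-all exchange.
    std::vector<std::vector<Key>> out(p);
    for (auto& local : ranks)
        for (Key k : local) {
            std::size_t dest =
                std::upper_bound(splitters.begin(), splitters.end(), k)
                - splitters.begin();
            out[dest].push_back(k);
        }
    // 4. Final local sort (a real implementation merges sorted runs instead).
    for (auto& local : out) std::sort(local.begin(), local.end());
    return out;
}
```

In the real HykSort the exchange is staged to reduce bandwidth and the GPU accelerates the local sort and merge phases, which is what the latency-hiding evaluation above measures.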
GPU + NVM + PCIe-SSD sorting: our new Xtr2sort library [H. Sato et al., SC15 poster].
Single node: 2-socket Xeon, 36 cores; 128 GB DDR4; K40 GPU (12 GB); PCIe SSD card (2.4 TB).
[Chart compares in-core GPU, Xtr2sort (GPU+CPU+NVM), and CPU+NVM sorting.]
Object Storage Design in OpenNVM [Takatsu et al., GPC 2015]
• New interface: sparse address space, atomic batch operations, and persistent trim
• Simple design with fixed-size regions, enabled by the sparse address space and persistent trim
– Freed by persistent trim, with no reuse
– Region size large enough to store one object
• Optimization techniques for object creation: bulk reservation and bulk initialization (sketched below)
[Diagram: a super region followed by regions 1..N, linked via NextRegionID[].]
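The following is an illustrative sketch of the bulk-reservation idea under the assumptions above: fixed-size regions and a persistent allocation record; the class and method names are hypothetical, not the paper's actual API.

```cpp
// Bulk reservation sketch: one durable metadata write covers a whole batch
// of object creations instead of one write per object, which is the
// amortization the "128 reservations" result in the chart below evaluates.
#include <cstdint>

class RegionAllocator {
    uint64_t next_persisted_ = 0;  // allocation frontier recorded on NVM
    uint64_t next_volatile_  = 0;  // in-memory frontier within the batch
    static constexpr uint64_t kBatch = 128;  // batch size from the slide

    void persist_frontier(uint64_t v) {
        (void)v;  // stub: write v durably to the NVM metadata region
    }

public:
    // Returns the region ID for a new object; persists metadata only once
    // per kBatch creations.
    uint64_t create_object_region() {
        if (next_volatile_ == next_persisted_) {
            next_persisted_ += kBatch;
            persist_frontier(next_persisted_);
        }
        return next_volatile_++;
    }
};
```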
[Chart: object creation performance (KOPS, 0-800) vs. # of threads (1-32) for baseline, 128 reservations, and 128 reservations + 32 initializations, with annotated gains of ×2.8 and ×1.5, peaking at 746 Kops/s.]
Comparison: XFS 15.6 Kops/s; DirectFS 61.3 Kops/s; proposal 746 Kops/s.
Fuyumasa Takatsu, Kohei Hiraga, and Osamu Tatebe, "Design of object storage using OpenNVM for high-performance distributed file system", 10th International Conference on Green, Pervasive and Cloud Computing (GPC 2015), May 4, 2015.
Concurrent B+Tree Index for Native NVM-KVS [Jabri]
• Enables range-query support for a KVS running natively on NVM such as the Fusion-io ioDrive
• Design of a lock-free concurrent B+Tree: lock-free search, insert and delete operations; dynamic rebalancing of the tree; nodes to be split or merged are frozen until replaced by new nodes
• Asynchronous interface using future/promise in C++11/14 (see the sketch below)
[Stack: in-memory B+Tree over an OpenNVM-like KVS interface on NVM (Fusion-io flash device), together forming an NVM-KVS supporting range queries.]
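As a minimal sketch of what a future/promise-based asynchronous KVS interface looks like, consider the fragment below; the class and method names are hypothetical, and std::optional is a C++17 convenience (a sentinel value would serve in strict C++11/14).

```cpp
// Asynchronous get(): the caller receives a future immediately; a worker
// fulfils the promise once the (lock-free) B+Tree lookup completes.
#include <functional>
#include <future>
#include <memory>
#include <optional>
#include <string>
#include <thread>

class AsyncKVS {
public:
    std::future<std::optional<std::string>> get(const std::string& key) {
        auto p = std::make_shared<std::promise<std::optional<std::string>>>();
        auto f = p->get_future();
        submit([this, key, p] { p->set_value(tree_lookup(key)); });
        return f;  // caller can continue and wait later: f.get()
    }

private:
    std::optional<std::string> tree_lookup(const std::string& key) {
        (void)key;
        return std::nullopt;  // stand-in for the concurrent B+Tree search
    }
    void submit(std::function<void()> task) {
        std::thread(std::move(task)).detach();  // stand-in for a worker pool
    }
};
```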
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11"
Compu
ta(o
n*(m
e*[sec]�
Number*of*samples*processed*in*one*itera(on�
Model"A"
Model"B"
Model"C"
Performance Modeling of a Large-Scale Asynchronous Deep Learning System under Realistic SGD Settings
Yosuke Oyama (1), Akihiro Nomura (1), Ikuro Sato (2), Hiroki Nishimura (3), Yukimasa Tamatsu (3), and Satoshi Matsuoka (1); (1) Tokyo Institute of Technology, (2) DENSO IT LABORATORY, INC., (3) DENSO CORPORATION
Background
• Deep convolutional neural networks (DCNNs) have achieved state-of-the-art performance in various machine learning tasks such as image recognition
• The asynchronous stochastic gradient descent (ASGD) method has been proposed to accelerate DNN training
– It may cause unrealistic training settings and degrade recognition accuracy on large-scale systems, due to a large, non-trivial mini-batch size
0"5"
10"15"20"25"30"35"40"
0" 100" 200" 300" 400" 500" 600"
Top$5&valid
a,on
&error&[%
]�
Epoch�
48"GPUs"
1"GPU"
Better
Worsethan1GPUtraining
Validation ErrorofILSVRC 2012Classification Task onTwoPlatforms:
Trained11layerCNNwithASGDmethod
Proposal and Evaluation
• We propose an empirical performance model for an ASGD training system on GPU supercomputers, which predicts CNN computation time and the time to sweep the entire dataset
– It considers the "effective mini-batch size", the time-averaged mini-batch size, as a criterion for training quality (sketched below)
• Our model achieves 8% prediction error for these metrics on average on a given platform, and steadily chooses the fastest configuration on two different supercomputers that nearly meets a target effective mini-batch size
[Figures: measured (solid) and predicted (dashed) CNN computation time for three 15-17-layer models; predicted epoch time of the ILSVRC 2012 classification task, where the shaded area indicates the effective mini-batch size is within 138±25%.]
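As one concrete reading of the "time-averaged mini-batch size" criterion, the sketch below computes a time-weighted average of per-iteration mini-batch sizes from training logs; this illustrates the stated definition and is not the authors' code.

```cpp
// Effective mini-batch size, read here as the time-averaged mini-batch size:
// batch[i] samples were in flight for dur[i] seconds of wall-clock time.
#include <cstddef>
#include <vector>

double effective_minibatch(const std::vector<double>& batch,
                           const std::vector<double>& dur) {
    double weighted = 0.0, total = 0.0;
    for (std::size_t i = 0; i < batch.size() && i < dur.size(); ++i) {
        weighted += batch[i] * dur[i];  // time-weighted sample count
        total    += dur[i];
    }
    return total > 0.0 ? weighted / total : 0.0;
}
```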
[Contour chart: predicted epoch time (5e+02 to 1e+05 sec) over the number of nodes (10-40) and the number of samples processed in one iteration (2-10), marking the best configuration to achieve the shortest epoch time.]
TSUBAME3.0
• 2006 TSUBAME1.0: 80 teraflops, #1 Asia, #7 world; "Everybody's Supercomputer"
• 2010 TSUBAME2.0: 2.4 petaflops, #4 world; "Greenest Production SC"; 2011 ACM Gordon Bell Prize
• 2013 TSUBAME2.5 upgrade: 5.7 PF DFP / 17.1 PF SFP, 20% power reduction
• 2013 TSUBAME-KFC: #1 Green 500; /DL upgrade → 1.5 PF/rack
• 2017 TSUBAME3.0: 15~20 PF (DFP), ~4 PB/s memory BW, 9~10 GFlops/W power efficiency; big data & cloud convergence
Workloads: large-scale simulation, big data analytics, industrial apps
2017Q2 TSUBAME3.0: Towards Exa & Big Data
1. "Everybody's Supercomputer": high performance (15~20 petaflops, ~4 PB/s memory, ~1 Pbit/s NW), innovative high cost/performance packaging & design, in a mere 100 m²
2. "Extreme Green": 9~10 GFlops/W power-efficient architecture, system-wide power control, advanced cooling, future energy-reservoir load leveling & energy recovery
3. "Big Data Convergence": extremely high BW & capacity, deep memory hierarchy, extreme I/O acceleration, a big data SW stack for machine learning/DNN, graph processing, ...
4. "Cloud SC": dynamic deployment, container-based node co-location & dynamic configuration, resource elasticity, assimilation of public clouds, ...
5. "Transparency": full monitoring & user visibility of machine & job state, accountability via reproducibility
Comparison of Machine Learning / AI Capabilities
• K Computer (2011): deep learning at FP32, 11.4 petaflops
• TSUBAME2.5 (2013) + TSUBAME3.0 (2017), 8000 GPUs: deep learning / AI capability of up to ~100 petaflops at FP16+FP32, roughly 10× the K Computer (effectively more, due to the optimized DL SW stack on GPUs), plus up to 100 PB of online storage
• C.f. BG/Q Sequoia (2011): 22 petaflops SFP/DFP
2015 Proposal to MEXT: Big Data and HPC Convergent Infrastructure → "National Big Data Science Center" (Tokyo Tech GSIC)
• "Big data" is currently processed and managed by domain laboratories → no longer scalable
• HPCI HPC center → converged HPC and big data science center
• People convergence: domain scientists + data scientists + CS/infrastructure → big data science center
• Data services, including large data handling, big data structures (e.g. graphs), ML/DNN/AI services, ...
2013 TSUBAME2.5 upgrade (5.7 petaflops, 17 PF DNN) → 2017Q1 TSUBAME3.0 + 2.5 (green & big data, 60~80 PF DNN): the HPCI leading machine with ultra-fast memory, network and I/O, backed by mid-tier parallel-FS storage and an archival long-term object store, serving big data science applications.
National Labs with Data
Present old-style data science: domain labs with segregated data facilities and no mutual collaboration; inefficient, not scalable, and without enough data scientists. The main reason: we have shared-resource HPC centers but no "data center" per se.
Proposed: convergence of top-tier HPC and big data infrastructure, with data management, big data storage, and a deep learning SW infrastructure; virtual multi-institutional data science → people convergence. Goal: 100 petabytes, with a 100 Gbps L2 connection to commercial clouds.
TSUBAME4, beyond 2021~2022: K-in-a-Box ("Golden Box") BD/EC convergent architecture
• 1/500 the size, 1/150 the power, 1/500 the cost, ×5 the DRAM+NVM memory
• 10 petaflops, 10 petabytes of hierarchical memory (K: 1.5 PB), 10K nodes
• 50 GB/s interconnect (200-300 Tbps bisection BW); conceptually similar to HP's "The Machine"
• A datacenter in a box: large datacenters will become "Jurassic"
Acceleration of EBD Processing (1)
• Large capacity: multi-terabytes, petabytes, exabytes; kernel algorithms for discrete data (graph, sort, etc.)
• EBD characteristics: sparse and random data structures; frequent and abundant data transfer, implying low-latency and high-bandwidth access
• EBD solutions (research):
– High capacity at low power: non-volatile memory, deep memory hierarchy
– High bandwidth: fast on-package memory + memory hierarchy + supercomputer network (>100 Gbps injection, petabits bisection) + bandwidth-reducing algorithms for EBD
– Low latency: latency reduction via 3D memory stacking, fast on-package memory and low-latency networks; latency hiding via many cores + many threads + latency-reducing algorithms for EBD
Our research: define & invent EBD architecture + algorithms + system SW
Acceleration of EBD Processing (2)
• Classification algorithms: statistical modeling/optimization, machine learning
• EBD characteristics: iterative numerical optimization; the kernel may be sparse (e.g. SVM) or dense (e.g. deep learning); parallelism is difficult due to massive sample sizes (10~100 billion images)
• EBD solutions (our research): employ traditional and new HPC/supercomputer parallelization and acceleration strategies
– Sparse algorithms: high-bandwidth processors (e.g. GPUs) with stacked and on-package memory + memory hierarchy + supercomputing network + bandwidth-reducing algorithms (sparse linear algebra)
– Dense algorithms: many-core high-FLOPS processors (e.g. GPUs) + algorithmic advances for strong scaling
– High-volume data: utilize "burst buffer" technology (incl. clouds)
(Limited showing today.)
Optimized Graph500 Program (1), Bandwidth-Reducing Algorithm: Sparse Matrix Representation with Bitmap
• Problem: since the partitioned graph is a hypersparse matrix, we need an efficient hypersparse matrix representation for large-scale distributed graph processing.
• Our proposal: a sparse matrix representation with a bitmap, enabling compression of row indexes and fast access to each row (see the sketch below).
Data size of the row index (MB/node, 8064 partitions, Scale 36):
CSR (Compressed Sparse Row): 1806
DCSC: 861
Coarse index + skip list: 309
Bitmap (proposal): 337
Performance comparison with other methods (8064 nodes, Scale 36): DCSC 2,294 GTEPS; coarse index + skip list 2,653 GTEPS; bitmap 3,328 GTEPS.
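To make the bitmap idea concrete, here is an illustrative sketch of a bitmap-compressed row index: one bit per row marks non-empty rows, and a popcount-based rank maps a row to its slot in a compact offsets array. The actual layout in the optimized Graph500 code may differ.

```cpp
// Bitmap row index for a hypersparse matrix (illustrative).
// Uses the GCC/Clang __builtin_popcountll intrinsic.
#include <cstdint>
#include <utility>
#include <vector>

struct BitmapRowIndex {
    std::vector<uint64_t> bitmap;   // bit r set iff row r is non-empty
    std::vector<uint32_t> offsets;  // CSR-style offsets, non-empty rows only
    std::vector<uint32_t> cols;     // column indices (edge endpoints)

    bool nonempty(uint64_t r) const {
        return (bitmap[r / 64] >> (r % 64)) & 1u;
    }
    // Rank of row r among non-empty rows; a real implementation caches
    // block-level popcounts instead of rescanning every word.
    uint64_t rank(uint64_t r) const {
        uint64_t k = 0;
        for (uint64_t w = 0; w < r / 64; ++w)
            k += __builtin_popcountll(bitmap[w]);
        return k + __builtin_popcountll(bitmap[r / 64] &
                                        ((uint64_t{1} << (r % 64)) - 1));
    }
    // Half-open range of row r's columns in cols, assuming nonempty(r).
    std::pair<uint32_t, uint32_t> row(uint64_t r) const {
        uint64_t i = rank(r);
        return {offsets[i], offsets[i + 1]};
    }
};
```

Only non-empty rows pay for an offset entry, which is where the row-index compression relative to CSR in the table above comes from.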
Optimized Graph500 Program (2), Bandwidth-Reducing Algorithm: Vertex Reordering for Bitmap Optimization
• Our idea: create reordered vertex numbers by sorting vertices by degree; use the reordered number for bitmap access and the original number for other processing. This (i) localizes memory access and (ii) reduces the size of the bitmap by cutting off its unnecessary part.
• Result: 16% speedup from the reduction of bitmap data, 28% speedup from localized memory access, and 49% speedup in total (8064 nodes).
[Chart: performance at 8064 nodes, Scale 36: proposal 3,328 GTEPS vs. 2,596 (only remove unnecessary part), 2,235 (no reorder), and 1,891 (convert at last).]
A sketch of the degree-based reordering follows.
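The sketch below builds the two ID maps under the assumption that reordering simply ranks vertices by descending degree; the exact tie-breaking and bitmap-trimming details of the actual code are not shown.

```cpp
// Degree-based vertex reordering: sort vertices by descending degree and
// build both directions of the ID mapping. BFS then uses new_id only for
// bitmap access and orig_id for all other processing, as the slide states.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

void build_reordering(const std::vector<uint32_t>& degree,
                      std::vector<uint32_t>& new_id,    // original -> reordered
                      std::vector<uint32_t>& orig_id) { // reordered -> original
    const uint32_t n = static_cast<uint32_t>(degree.size());
    orig_id.resize(n);
    std::iota(orig_id.begin(), orig_id.end(), 0u);
    std::stable_sort(orig_id.begin(), orig_id.end(),
                     [&](uint32_t a, uint32_t b) { return degree[a] > degree[b]; });
    new_id.resize(n);
    for (uint32_t i = 0; i < n; ++i) new_id[orig_id[i]] = i;
}
```

High-degree vertices cluster at the front of the reordered range, so the frequently touched part of the bitmap stays compact and cache-resident, matching the memory-access-localization and bitmap-reduction gains reported above.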