TRANSCRIPT
JST-CREST “Extreme Big Data” Project (2013-2018)
Supercomputers: compute- & batch-oriented, more fragile. Cloud IDC: very low BW & efficiency, but highly available and resilient.
Convergent Architecture (Phases 1~4): large-capacity NVM, high-bisection network.
[Diagram: PCB with a TSV interposer carrying a high-powered main CPU and low-power CPUs, each low-power CPU paired with DRAM and NVM/Flash stacks.]
2 Tbps HBM, 4~6 HBM channels, 1.5 TB/s DRAM & NVM BW; 30 PB/s I/O BW possible, 1 yottabyte/year.
EBD system software, incl. the EBD object system.
Introduction: Problem Domain
In most living organisms, the genetic instructions used in their development are stored in the long polymeric molecule called DNA. DNA consists of two long polymers of simple units called nucleotides. The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T).
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi-GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011.
Future non-silo extreme big data scientific apps, co-designed with exascale big data HPC:
• Large-scale metagenomics
• Massive sensors and data assimilation in weather prediction
• Ultra-large-scale graphs and social infrastructures
Co-design of the EBD system software with the apps: Graph Store / EBD Bag co-design; distributed KVS / EBD KVS co-design; Cartesian plane co-design.
Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds? What are the issues regarding architectural, algorithmic, and system software evolution? What is the role of GPUs?
The Graph500, June 2014 and June 2015: K Computer #1. Tokyo Tech [EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu.

List            Rank  GTEPS     Implementation
November 2013   4     5524.12   Top-down only
June 2014       1     17977.05  Efficient hybrid
November 2014   2               Efficient hybrid
June 2015       1     38621.4   Hybrid + node compression
*Problem size is weak scaling

A "brain-class" graph. K Computer: 88,000 nodes, 700,000 CPU cores, 1.6 petabytes memory, 20 GB/s Tofu NW ≫ LLNL-IBM Sequoia: 1.6 million CPUs, 1.6 petabytes memory.
[Chart: elapsed time (ms, 0-1500) from 64 nodes (Scale 30) to 65536 nodes (Scale 40), broken down by communication; 73% of total execution time is spent waiting in communication.]
Large-Scale Graph Processing Using NVM
1. Hybrid-BFS (Beamer '11)  2. Proposal  3. Experiment
Experimental machine: CPU Intel Xeon E5-2690 ×2; DRAM 256 GB; NVM EBD-I/O 2 TB ×2.
[Chart: median GTEPS (giga traversed edges per second, 0-6) vs. SCALE (# vertices = 2^SCALE, 23-31) for DRAM + EBD-I/O vs. DRAM only; DRAM only reaches 4.1 GTEPS before hitting its capacity limit, while DRAM + EBD-I/O continues past that limit at 3.8 GTEPS.]
EBD-I/O: 8× mSATA SSDs behind a RAID card (RAID 0). (Product images: www.adaptec.com, www.crucial.com)
Result: a 4× larger graph with only 6.9% performance degradation [Iwabuchi, IEEE BigData 2014]. NVM holds the full-size graph, while DRAM holds the highly accessed graph data, loaded before the BFS starts.
Switching between the two approaches, top-down and bottom-up: # of frontier vertices n_frontier, # of all vertices n_all, parameters α and β (see the sketch below).
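As a minimal sketch of how such a direction switch works, the C++ fragment below follows the common Beamer-style heuristic; the threshold values and the use of vertex counts alone (the original formulation also uses edge counts) are illustrative assumptions, not the tuned parameters of this work.

```cpp
// Sketch of Hybrid-BFS direction switching (after Beamer '11).
// n_frontier: vertices in the current frontier; n_all: all vertices.
// alpha, beta: tuning parameters; the values here are illustrative.
#include <cstddef>

enum class Direction { TopDown, BottomUp };

Direction choose_direction(Direction current, std::size_t n_frontier,
                           std::size_t n_all) {
    const double alpha = 14.0;  // assumed; governs top-down -> bottom-up
    const double beta  = 24.0;  // assumed; governs bottom-up -> top-down
    if (current == Direction::TopDown && n_frontier > n_all / alpha)
        return Direction::BottomUp;  // frontier is large: scanning unvisited
                                     // vertices bottom-up touches less data
    if (current == Direction::BottomUp && n_frontier < n_all / beta)
        return Direction::TopDown;   // frontier shrank again: revert
    return current;
}
```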
Ranked 3rd in the Green Graph 500 (June 2014): Tokyo Institute of Technology's GraphCREST-Custom #1 is ranked No. 3 in the Big Data category of the Green Graph 500, with 35.21 MTEPS/W on Scale 31, on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair.
EBD Algorithm Kernels
GPU-based Distributed Sorting [Shamoto, IEEE BigData 2014, IEEE Trans. Big Data 2015]
• Sorting: a kernel algorithm for various EBD processing
• Fast sorting methods
– Distributed sorting (sorting for distributed systems): splitter-based parallel sort, radix sort, merge sort
– Sorting on heterogeneous architectures: many sorting algorithms are accelerated by many cores and high memory bandwidth
• Sorting for large-scale heterogeneous systems remains an open question
• We developed and evaluated a bandwidth- and latency-reducing GPU-based HykSort on TSUBAME2.5 via latency hiding; now preparing to release the sorting library
[Charts: weak-scaling sort rate (keys/second) vs. # of processes (2 per node, up to 2048) for HykSort with 1 thread, 6 threads, and GPU + 6 threads, plus predicted curves PCIe_10/50/100/200 and a scenario with accelerators 4× faster than the K20x; annotated speedups ×1.4, ×3.61, ×389, reaching 0.25 TB/s.]
Performance prediction: ×2.2 speedup compared to the CPU-based implementation when the PCI bandwidth increases to 50 GB/s; 8.8% reduction of overall runtime when the accelerators work 4× faster than the K20x.
• PCIe_#: # GB/s bandwidth of the interconnect between CPU and GPU
• Weak-scaling performance (Grand Challenge on TSUBAME2.5): 1~1024 nodes (2~2048 GPUs), 2 processes per node, each node holding 2 GB of 64-bit integers
• C.f. Yahoo/Hadoop Terasort: 0.02 TB/s (including I/O)
GPU implementation of splitter-based sorting (HykSort), as sketched below.
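Since the slides reference splitter-based parallel sort without showing it, here is a minimal single-process sketch of the general splitter-based scheme (the family HykSort belongs to), with per-"rank" vectors standing in for MPI ranks; the names and sampling details are illustrative, not HykSort's actual implementation.

```cpp
// Splitter-based distributed sort, simulated in one process:
// local sort -> sample -> choose global splitters -> exchange -> final sort.
// Assumes the inputs contain at least some keys overall.
#include <algorithm>
#include <cstddef>
#include <vector>

using Key = long long;

std::vector<std::vector<Key>>
splitter_sort(std::vector<std::vector<Key>> ranks) {
    const std::size_t p = ranks.size();
    // 1. Local sort on every rank.
    for (auto& local : ranks) std::sort(local.begin(), local.end());
    // 2. Each rank contributes regular samples; pick p-1 global splitters.
    std::vector<Key> samples;
    for (auto& local : ranks)
        for (std::size_t i = 1; i < p && !local.empty(); ++i)
            samples.push_back(local[i * local.size() / p]);
    std::sort(samples.begin(), samples.end());
    std::vector<Key> splitters;
    for (std::size_t i = 1; i < p && !samples.empty(); ++i)
        splitters.push_back(samples[i * samples.size() / p]);
    // 3. Partition by splitter; in MPI this is the all-to-all exchange.
    std::vector<std::vector<Key>> out(p);
    for (auto& local : ranks)
        for (Key k : local) {
            std::size_t dest =
                std::upper_bound(splitters.begin(), splitters.end(), k)
                - splitters.begin();
            out[dest].push_back(k);
        }
    // 4. Final local sort (a real implementation merges sorted runs instead).
    for (auto& local : out) std::sort(local.begin(), local.end());
    return out;
}
```

In the real HykSort the exchange is staged to reduce bandwidth and the GPU accelerates the local sort and merge phases, which is what the latency-hiding evaluation above measures.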
GPU + NVM + PCIe-SSD sorting: our new Xtr2sort library [H. Sato et al., SC15 poster].
Single node: 2-socket Xeon, 36 cores; 128 GB DDR4; K40 GPU (12 GB); PCIe SSD card (2.4 TB).
[Chart compares in-core GPU, Xtr2sort (GPU+CPU+NVM), and CPU+NVM sorting.]
Object Storage Design in OpenNVM [Takatsu et al., GPC 2015]
• New interface: sparse address space, atomic batch operations, and persistent trim
• Simple design with fixed-size regions, enabled by the sparse address space and persistent trim
– Freed by persistent trim, with no reuse
– Region size large enough to store one object
• Optimization techniques for object creation: bulk reservation and bulk initialization (sketched below)
[Diagram: a super region followed by regions 1..N, linked via NextRegionID[].]
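The following is an illustrative sketch of the bulk-reservation idea under the assumptions above: fixed-size regions and a persistent allocation record; the class and method names are hypothetical, not the paper's actual API.

```cpp
// Bulk reservation sketch: one durable metadata write covers a whole batch
// of object creations instead of one write per object, which is the
// amortization the "128 reservations" result in the chart below evaluates.
#include <cstdint>

class RegionAllocator {
    uint64_t next_persisted_ = 0;  // allocation frontier recorded on NVM
    uint64_t next_volatile_  = 0;  // in-memory frontier within the batch
    static constexpr uint64_t kBatch = 128;  // batch size from the slide

    void persist_frontier(uint64_t v) {
        (void)v;  // stub: write v durably to the NVM metadata region
    }

public:
    // Returns the region ID for a new object; persists metadata only once
    // per kBatch creations.
    uint64_t create_object_region() {
        if (next_volatile_ == next_persisted_) {
            next_persisted_ += kBatch;
            persist_frontier(next_persisted_);
        }
        return next_volatile_++;
    }
};
```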
[Chart: object creation performance (KOPS, 0-800) vs. # of threads (1-32) for baseline, 128 reservations, and 128 reservations + 32 initializations, with annotated gains of ×2.8 and ×1.5, peaking at 746 Kops/s.]
Comparison: XFS 15.6 Kops/s; DirectFS 61.3 Kops/s; proposal 746 Kops/s.
Fuyumasa Takatsu, Kohei Hiraga, and Osamu Tatebe, "Design of object storage using OpenNVM for high-performance distributed file system", 10th International Conference on Green, Pervasive and Cloud Computing (GPC 2015), May 4, 2015.
Concurrent B+Tree Index for Native NVM-KVS [Jabri]
• Enables range-query support for a KVS running natively on NVM such as the Fusion-io ioDrive
• Design of a lock-free concurrent B+Tree: lock-free search, insert and delete operations; dynamic rebalancing of the tree; nodes to be split or merged are frozen until replaced by new nodes
• Asynchronous interface using future/promise in C++11/14 (see the sketch below)
[Stack: in-memory B+Tree over an OpenNVM-like KVS interface on NVM (Fusion-io flash device), together forming an NVM-KVS supporting range queries.]
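As a minimal sketch of what a future/promise-based asynchronous KVS interface looks like, consider the fragment below; the class and method names are hypothetical, and std::optional is a C++17 convenience (a sentinel value would serve in strict C++11/14).

```cpp
// Asynchronous get(): the caller receives a future immediately; a worker
// fulfils the promise once the (lock-free) B+Tree lookup completes.
#include <functional>
#include <future>
#include <memory>
#include <optional>
#include <string>
#include <thread>

class AsyncKVS {
public:
    std::future<std::optional<std::string>> get(const std::string& key) {
        auto p = std::make_shared<std::promise<std::optional<std::string>>>();
        auto f = p->get_future();
        submit([this, key, p] { p->set_value(tree_lookup(key)); });
        return f;  // caller can continue and wait later: f.get()
    }

private:
    std::optional<std::string> tree_lookup(const std::string& key) {
        (void)key;
        return std::nullopt;  // stand-in for the concurrent B+Tree search
    }
    void submit(std::function<void()> task) {
        std::thread(std::move(task)).detach();  // stand-in for a worker pool
    }
};
```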
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11"
Compu
ta(o
n*(m
e*[sec]�
Number*of*samples*processed*in*one*itera(on�
Model"A"
Model"B"
Model"C"
Performance Modeling of a Large-Scale Asynchronous Deep Learning System under Realistic SGD Settings
Yosuke Oyama (1), Akihiro Nomura (1), Ikuro Sato (2), Hiroki Nishimura (3), Yukimasa Tamatsu (3), and Satoshi Matsuoka (1); (1) Tokyo Institute of Technology, (2) DENSO IT LABORATORY, INC., (3) DENSO CORPORATION
Background
• Deep convolutional neural networks (DCNNs) have achieved state-of-the-art performance in various machine learning tasks such as image recognition
• The asynchronous stochastic gradient descent (ASGD) method has been proposed to accelerate DNN training
– It may cause unrealistic training settings and degrade recognition accuracy on large-scale systems, due to a large, non-trivial mini-batch size
0"5"
10"15"20"25"30"35"40"
0" 100" 200" 300" 400" 500" 600"
Top$5&valid
a,on
&error&[%
]�
Epoch�
48"GPUs"
1"GPU"
Better
Worsethan1GPUtraining
Validation ErrorofILSVRC 2012Classification Task onTwoPlatforms:
Trained11layerCNNwithASGDmethod
Proposal and Evaluation
• We propose an empirical performance model for an ASGD training system on GPU supercomputers, which predicts CNN computation time and the time to sweep the entire dataset
– It considers the "effective mini-batch size", the time-averaged mini-batch size, as a criterion for training quality (sketched below)
• Our model achieves 8% prediction error for these metrics on average on a given platform, and steadily chooses the fastest configuration on two different supercomputers that nearly meets a target effective mini-batch size
[Figures: measured (solid) and predicted (dashed) CNN computation time for three 15-17-layer models; predicted epoch time of the ILSVRC 2012 classification task, where the shaded area indicates the effective mini-batch size is within 138±25%.]
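As one concrete reading of the "time-averaged mini-batch size" criterion, the sketch below computes a time-weighted average of per-iteration mini-batch sizes from training logs; this illustrates the stated definition and is not the authors' code.

```cpp
// Effective mini-batch size, read here as the time-averaged mini-batch size:
// batch[i] samples were in flight for dur[i] seconds of wall-clock time.
#include <cstddef>
#include <vector>

double effective_minibatch(const std::vector<double>& batch,
                           const std::vector<double>& dur) {
    double weighted = 0.0, total = 0.0;
    for (std::size_t i = 0; i < batch.size() && i < dur.size(); ++i) {
        weighted += batch[i] * dur[i];  // time-weighted sample count
        total    += dur[i];
    }
    return total > 0.0 ? weighted / total : 0.0;
}
```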
[Contour chart: predicted epoch time (5e+02 to 1e+05 sec) over the number of nodes (10-40) and the number of samples processed in one iteration (2-10), marking the best configuration to achieve the shortest epoch time.]
TSUBAME3.0
• 2006 TSUBAME1.0: 80 teraflops, #1 Asia, #7 world; "Everybody's Supercomputer"
• 2010 TSUBAME2.0: 2.4 petaflops, #4 world; "Greenest Production SC"; 2011 ACM Gordon Bell Prize
• 2013 TSUBAME2.5 upgrade: 5.7 PF DFP / 17.1 PF SFP, 20% power reduction
• 2013 TSUBAME-KFC: #1 Green 500; /DL upgrade → 1.5 PF/rack
• 2017 TSUBAME3.0: 15~20 PF (DFP), ~4 PB/s memory BW, 9~10 GFlops/W power efficiency; big data & cloud convergence
Workloads: large-scale simulation, big data analytics, industrial apps
2017Q2 TSUBAME3.0: Towards Exa & Big Data
1. "Everybody's Supercomputer": high performance (15~20 petaflops, ~4 PB/s memory, ~1 Pbit/s NW), innovative high cost/performance packaging & design, in a mere 100 m²
2. "Extreme Green": 9~10 GFlops/W power-efficient architecture, system-wide power control, advanced cooling, future energy-reservoir load leveling & energy recovery
3. "Big Data Convergence": extremely high BW & capacity, deep memory hierarchy, extreme I/O acceleration, a big data SW stack for machine learning/DNN, graph processing, ...
4. "Cloud SC": dynamic deployment, container-based node co-location & dynamic configuration, resource elasticity, assimilation of public clouds, ...
5. "Transparency": full monitoring & user visibility of machine & job state, accountability via reproducibility
Comparison of Machine Learning / AI Capabilities
• K Computer (2011): deep learning at FP32, 11.4 petaflops
• TSUBAME2.5 (2013) + TSUBAME3.0 (2017), 8000 GPUs: deep learning / AI capability of up to ~100 petaflops at FP16+FP32, roughly 10× the K Computer (effectively more, due to the optimized DL SW stack on GPUs), plus up to 100 PB of online storage
• C.f. BG/Q Sequoia (2011): 22 petaflops SFP/DFP
2015 Proposal to MEXT: Big Data and HPC Convergent Infrastructure → "National Big Data Science Center" (Tokyo Tech GSIC)
• "Big data" is currently processed and managed by domain laboratories → no longer scalable
• HPCI HPC center → converged HPC and big data science center
• People convergence: domain scientists + data scientists + CS/infrastructure → big data science center
• Data services, including large data handling, big data structures (e.g. graphs), ML/DNN/AI services, ...
2013 TSUBAME2.5 upgrade (5.7 petaflops, 17 PF DNN) → 2017Q1 TSUBAME3.0 + 2.5 (green & big data, 60~80 PF DNN): the HPCI leading machine with ultra-fast memory, network and I/O, backed by mid-tier parallel-FS storage and an archival long-term object store, serving big data science applications.
National Labs with Data
Present old-style data science: domain labs with segregated data facilities and no mutual collaboration; inefficient, not scalable, and without enough data scientists. The main reason: we have shared-resource HPC centers but no "data center" per se.
Proposed: convergence of top-tier HPC and big data infrastructure, with data management, big data storage, and a deep learning SW infrastructure; virtual multi-institutional data science → people convergence. Goal: 100 petabytes, with a 100 Gbps L2 connection to commercial clouds.
TSUBAME4, beyond 2021~2022: K-in-a-Box ("Golden Box") BD/EC convergent architecture
• 1/500 the size, 1/150 the power, 1/500 the cost, ×5 the DRAM+NVM memory
• 10 petaflops, 10 petabytes of hierarchical memory (K: 1.5 PB), 10K nodes
• 50 GB/s interconnect (200-300 Tbps bisection BW); conceptually similar to HP's "The Machine"
• A datacenter in a box: large datacenters will become "Jurassic"
Acceleration of EBD Processing (1)
• Large capacity: multi-terabytes, petabytes, exabytes; kernel algorithms for discrete data (graph, sort, etc.)
• EBD characteristics: sparse and random data structures; frequent and abundant data transfer, implying low-latency and high-bandwidth access
• EBD solutions (research):
– High capacity at low power: non-volatile memory, deep memory hierarchy
– High bandwidth: fast on-package memory + memory hierarchy + supercomputer network (>100 Gbps injection, petabits bisection) + bandwidth-reducing algorithms for EBD
– Low latency: latency reduction via 3D memory stacking, fast on-package memory and low-latency networks; latency hiding via many cores + many threads + latency-reducing algorithms for EBD
Our research: define & invent EBD architecture + algorithms + system SW
Acceleration of EBD Processing (2)
• Classification algorithms: statistical modeling/optimization, machine learning
• EBD characteristics: iterative numerical optimization; the kernel may be sparse (e.g. SVM) or dense (e.g. deep learning); parallelism is difficult due to massive sample sizes (10~100 billion images)
• EBD solutions (our research): employ traditional and new HPC/supercomputer parallelization and acceleration strategies
– Sparse algorithms: high-bandwidth processors (e.g. GPUs) with stacked and on-package memory + memory hierarchy + supercomputing network + bandwidth-reducing algorithms (sparse linear algebra)
– Dense algorithms: many-core high-FLOPS processors (e.g. GPUs) + algorithmic advances for strong scaling
– High-volume data: utilize "burst buffer" technology (incl. clouds)
(Limited showing today.)
Optimized Graph500 Program (1), Bandwidth-Reducing Algorithm: Sparse Matrix Representation with Bitmap
• Problem: since the partitioned graph is a hypersparse matrix, we need an efficient hypersparse matrix representation for large-scale distributed graph processing.
• Our proposal: a sparse matrix representation with a bitmap, enabling compression of row indexes and fast access to each row (see the sketch below).
Data size of the row index (MB/node, 8064 partitions, Scale 36):
CSR (Compressed Sparse Row): 1806
DCSC: 861
Coarse index + skip list: 309
Bitmap (proposal): 337
Performance comparison with other methods (8064 nodes, Scale 36): DCSC 2,294 GTEPS; coarse index + skip list 2,653 GTEPS; bitmap 3,328 GTEPS.
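To make the bitmap idea concrete, here is an illustrative sketch of a bitmap-compressed row index: one bit per row marks non-empty rows, and a popcount-based rank maps a row to its slot in a compact offsets array. The actual layout in the optimized Graph500 code may differ.

```cpp
// Bitmap row index for a hypersparse matrix (illustrative).
// Uses the GCC/Clang __builtin_popcountll intrinsic.
#include <cstdint>
#include <utility>
#include <vector>

struct BitmapRowIndex {
    std::vector<uint64_t> bitmap;   // bit r set iff row r is non-empty
    std::vector<uint32_t> offsets;  // CSR-style offsets, non-empty rows only
    std::vector<uint32_t> cols;     // column indices (edge endpoints)

    bool nonempty(uint64_t r) const {
        return (bitmap[r / 64] >> (r % 64)) & 1u;
    }
    // Rank of row r among non-empty rows; a real implementation caches
    // block-level popcounts instead of rescanning every word.
    uint64_t rank(uint64_t r) const {
        uint64_t k = 0;
        for (uint64_t w = 0; w < r / 64; ++w)
            k += __builtin_popcountll(bitmap[w]);
        return k + __builtin_popcountll(bitmap[r / 64] &
                                        ((uint64_t{1} << (r % 64)) - 1));
    }
    // Half-open range of row r's columns in cols, assuming nonempty(r).
    std::pair<uint32_t, uint32_t> row(uint64_t r) const {
        uint64_t i = rank(r);
        return {offsets[i], offsets[i + 1]};
    }
};
```

Only non-empty rows pay for an offset entry, which is where the row-index compression relative to CSR in the table above comes from.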
Optimized Graph500 Program (2), Bandwidth-Reducing Algorithm: Vertex Reordering for Bitmap Optimization
• Our idea: create reordered vertex numbers by sorting vertices by degree; use the reordered number for bitmap access and the original number for other processing. This (i) localizes memory access and (ii) reduces the size of the bitmap by cutting off its unnecessary part.
• Result: 16% speedup from the reduction of bitmap data, 28% speedup from localized memory access, and 49% speedup in total (8064 nodes).
[Chart: performance at 8064 nodes, Scale 36: proposal 3,328 GTEPS vs. 2,596 (only remove unnecessary part), 2,235 (no reorder), and 1,891 (convert at last).]
A sketch of the degree-based reordering follows.
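The sketch below builds the two ID maps under the assumption that reordering simply ranks vertices by descending degree; the exact tie-breaking and bitmap-trimming details of the actual code are not shown.

```cpp
// Degree-based vertex reordering: sort vertices by descending degree and
// build both directions of the ID mapping. BFS then uses new_id only for
// bitmap access and orig_id for all other processing, as the slide states.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

void build_reordering(const std::vector<uint32_t>& degree,
                      std::vector<uint32_t>& new_id,    // original -> reordered
                      std::vector<uint32_t>& orig_id) { // reordered -> original
    const uint32_t n = static_cast<uint32_t>(degree.size());
    orig_id.resize(n);
    std::iota(orig_id.begin(), orig_id.end(), 0u);
    std::stable_sort(orig_id.begin(), orig_id.end(),
                     [&](uint32_t a, uint32_t b) { return degree[a] > degree[b]; });
    new_id.resize(n);
    for (uint32_t i = 0; i < n; ++i) new_id[orig_id[i]] = i;
}
```

High-degree vertices cluster at the front of the reordered range, so the frequently touched part of the bitmap stays compact and cache-resident, matching the memory-access-localization and bitmap-reduction gains reported above.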