Performance and energy analysis using transactional workloads
Anastasia Ailamaki, EPFL and RAW Labs SA
Students: Danica Porobic, Utku Sirin, and Pinar Tozun
Online Transaction Processing
$20B+ industry
Primary application for databases
Characteristics:
– Has many concurrent requests
– Touches a small part of the whole data
– Needs high & predictable performance
Hardware OLTP runs on
[Figure: hardware timeline, 2005 to 2020. Labels: core (ILP, SMT), multisocket multicores, heterogeneous many-cores]
Hardware keeps providing new forms of parallelism
How's the utilization?
Utilizing modern processors
TPC-C, 100GB data, 1 worker thread, Intel Ivy Bridge
[Figure: execution cycles breakdown (0% to 100%), stalled vs. busy, for Shore-MT, DBMS D, VoltDB, HyPer, DBMS M]
Processor stalled most of the time
Scaling up OLTP on multisockets
[Figure: throughput vs. number of sockets (1 to 8)]
Multisocket servers severely under-utilized
Why care about power?
Energy efficiency as important as performance
[Figure: monthly datacenter costs [J. R. Hamilton]. Slices: 57%, 8%, 18%, 13%, 4%. Legend: Servers, Networking Equipment, Power Distribution & Cooling, Power, Other]
30% power-related
Dynamic fraction increasing
[Figure: energy proportionality, power vs. utilization, today vs. ideal]
• Why is my system under-utilizing hardware?
• Why isn't my system faster on new hardware?
• Are new processors more energy-efficient?
Analyzing performance and energy
• Macrobenchmarks or microbenchmarks?
• Execution time breakdowns
• Measuring energy efficiency
Utilization (microarchitecture level)
[Figure: instructions per cycle (0 to 4, maximum marked) for TPC-B, TPC-C, TPC-E. Panels: Intel Xeon X5660, Sun Niagara T2]
OLTP utilizes lean cores better
TPC-E has higher IPC
[EDBT13]
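The IPC metric on this slide is retired instructions divided by elapsed core cycles, and utilization can be read as IPC relative to the most instructions the core can complete per cycle. A minimal sketch of that arithmetic follows; the counter readings and the `max_ipc` value are hypothetical placeholders, not measurements from the talk.

```python
def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per cycle from two hardware counter readings."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


def utilization(ipc_value: float, max_ipc: float) -> float:
    """Fraction of the core's peak per-cycle instruction rate actually used."""
    return ipc_value / max_ipc


# Hypothetical counter readings, not measurements from the talk.
sample_ipc = ipc(instructions_retired=8_000_000, core_cycles=10_000_000)
sample_util = utilization(sample_ipc, max_ipc=4.0)
print(f"IPC = {sample_ipc:.2f}, utilization = {sample_util:.0%}")
```

Comparing the same workload on two processors only makes sense relative to each processor's own maximum, which is presumably why the plots mark it.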
Macrobenchmark: Execution Cycles & Stalls
Intel Xeon X5660
[Figure: two panels (0% to 100%) for TPC-B, TPC-C, TPC-E: execution cycles breakdown (stalled vs. busy) and core stalls (instruction vs. rest)]
Over 70% of time goes to stalls
Instruction stalls are the main problem
[EDBT13]
Utilization across systems [SIGMOD16]
TPC-C, 100GB, 1 worker thread, Intel Ivy Bridge
[Figure: execution cycles breakdown (0% to 100%), stalled vs. busy, for Shore-MT, DBMS D, VoltDB, HyPer, DBMS M; systems grouped into on-disk and in-memory]
Even in-memory systems stall >60% of the time
Microbenchmarks for what-if analysis
100GB, 1 worker thread, Intel Ivy Bridge
[Figure: IPC (0 to 4) vs. number of rows read (1, 10, 100) for Shore-MT, DBMS D, VoltDB, HyPer, DBMS M]
Lower data locality -> low IPC for some systems
[SIGMOD16]
OLTP on hardware islands
[Figure: two sockets (Socket 0, Socket 1), each with cores with private L1 and L2 caches, a shared L3, a memory controller, and inter-socket links. Latency labels: 21 cycles, 72 cycles, 237 cycles. Other labels: Column, Column, scan; 2x12-core Intel Xeon E5-2650L v3, 24x16GiB DDR3]
How to measure the impact?
[PVLDB12]
Partition-sensitive microbenchmark
• Single-site version
– probe/update N rows from the local site
• Multisite version
– probe/update 1 row from the local site
– probe/update N-1 rows uniformly from any site
– sites may reside on the same instance
[PVLDB12]
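The two transaction shapes above can be sketched as a small row-selection routine. This is an illustration under stated assumptions, not the benchmark's actual code: the function names, the `(site, row)` representation, and the uniform choice of row within a site are all hypothetical, and the mapping of sites to instances is not modeled.

```python
import random


def pick_rows(n_rows, local_site, num_sites, rows_per_site, multisite, rng):
    """Return the (site, row) pairs touched by one transaction."""
    def row_on(site):
        return (site, rng.randrange(rows_per_site))

    if not multisite:
        # Single-site version: all N rows come from the local site.
        return [row_on(local_site) for _ in range(n_rows)]
    # Multisite version: 1 local row, then N-1 rows from uniformly chosen
    # sites (a chosen site may again be the local one).
    accesses = [row_on(local_site)]
    accesses += [row_on(rng.randrange(num_sites)) for _ in range(n_rows - 1)]
    return accesses


def make_workload(num_txns, pct_multisite, **kwargs):
    """Mix single-site and multisite transactions at the given percentage."""
    rng = kwargs["rng"]
    return [
        pick_rows(multisite=rng.random() * 100 < pct_multisite, **kwargs)
        for _ in range(num_txns)
    ]


rng = random.Random(42)  # fixed seed so runs are repeatable
txns = make_workload(1000, pct_multisite=20, n_rows=10, local_site=0,
                     num_sites=4, rows_per_site=1_000_000, rng=rng)
remote = sum(any(site != 0 for site, _ in t) for t in txns)
print(f"{remote} of {len(txns)} transactions touch a remote site")
```

Sweeping `pct_multisite` from 0 to 100 gives the x-axis used in the throughput plots that follow.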
Multisite transactions: read only
4 sockets x 6 cores, 10 rows accessed
[Figure: throughput (KTps, 0 to 200) vs. % multisite transactions (0 to 100) for shared-everything, shared-nothing, island shared-nothing]
More instances -> faster performance degradation
[PVLDB12]
Multisite transactions: updates
4 sockets x 6 cores, 10 rows accessed
[Figure: throughput (KTps, 0 to 120) vs. % multisite transactions (0 to 100) for shared-everything, shared-nothing, island shared-nothing]
Update distributed transactions are more expensive
[PVLDB12]
Analyzing performance and energy
• Macrobenchmarks or microbenchmarks?
• Execution time breakdowns
• Measuring energy efficiency
My first toy: PII Xeon
[Figure: processor block diagram: fetch/decode unit, dispatch/execute unit, retire unit, instruction pool, L1 I-cache, L1 D-cache, L2 cache, main memory]
Branch prediction, non-blocking caches, out-of-order
[VLDB99]
Where Does Time Go?
Time = T_Computation + T_Memory + T_Branch + T_Resource - T_Overlap
• Computation
• Stalls
– Cache misses
– Branch mispredictions
– Other execution pipeline stalls
✪ Stall time and computation overlap
[VLDB99]
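Because stall time and computation overlap, the four components add up to more than the measured time, and the overlap term is whatever is left over. A small worked example of solving the formula for the overlap, with hypothetical cycle counts chosen only to illustrate:

```python
def overlap(total_cycles, computation, memory, branch, resource):
    """Solve Time = T_C + T_M + T_B + T_R - T_Overlap for T_Overlap."""
    return computation + memory + branch + resource - total_cycles


# Hypothetical cycle counts for one query, not data from the paper.
components = {"computation": 400, "memory": 350, "branch": 150, "resource": 250}
measured_total = 1000

t_overlap = overlap(measured_total, **components)
print(f"components sum to {sum(components.values())} cycles, "
      f"measured {measured_total}, so T_Overlap = {t_overlap}")
```

A negative result would signal that some component was underestimated or left out, which is one way to sanity-check a breakdown.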
Setup and Methodology
• Four commercial DBMSs: A, B, C, D
• 6400 PII Xeon/MT running Windows NT 4
• Used PII counters
• Correctness: measured & computed CPI
Range selection (sequential, indexed)
select avg(a3)
from R
where a2 > Lo and a2 < Hi
Equijoin (sequential)
select avg(a3)
from R, S
where R.a2 = S.a1
[VLDB99]
Adapted formula
T_Q + T_{OVL} = T_C + T_M + T_B + T_R
T_M = T_{L1I} + T_{L1D} + T_{L2I} + T_{L2D} + T_{L3I} + T_{L3D} + T_{DTLB} + T_{ITLB}
T_B = #branch-misp x penalty
T_R = T_{FU} + T_{DEP} + T_{MISC}
[SIGMOD16]
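The slide annotates the branch term as a count multiplied by a penalty. A sketch of the same count-times-penalty estimate applied to the cache-miss part of the memory term follows; applying it per cache level is an assumption here, the penalty values are hypothetical (real ones depend on the processor), and the TLB terms are left out for brevity.

```python
# Hypothetical per-miss penalties in cycles; real values depend on the processor.
PENALTY = {"L1I": 8, "L1D": 8, "L2I": 28, "L2D": 28, "L3I": 180, "L3D": 180}


def memory_stall_cycles(miss_counts, penalty=PENALTY):
    """Estimate each memory stall component as (number of misses) x (penalty)."""
    return {level: miss_counts.get(level, 0) * penalty[level] for level in penalty}


def per_kilo_instructions(cycles_by_level, instructions):
    """Normalize stall cycles to 1000 retired instructions."""
    return {lvl: c * 1000 / instructions for lvl, c in cycles_by_level.items()}


# Hypothetical counter readings.
misses = {"L1I": 50_000, "L1D": 20_000, "L2I": 5_000, "L2D": 4_000, "L3D": 1_000}
stalls = memory_stall_cycles(misses)
t_memory = sum(stalls.values())
print(per_kilo_instructions(stalls, instructions=1_000_000))
print("memory stall cycles:", t_memory)
```

Normalizing to 1000 instructions makes breakdowns comparable across systems that execute different numbers of instructions per transaction.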
Cache misses today
[Figure: misses per k-instructions (0 to 100), split into LLC, L2D, L1D, L2I, L1I, for TPC-B, TPC-C, TPC-E. Panels: Intel Xeon X5660 (32KB L1-I & 32KB L1-D), Sun Niagara T2 (16KB L1-I & 8KB L1-D)]
TPC-E has lower data miss ratio
L1-I misses dominate
[EDBT13]
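Misses per k-instructions is the miss count scaled to 1000 retired instructions, which lets cache levels and benchmarks be compared on one axis. A minimal sketch, using hypothetical counter values rather than the plotted data:

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per 1000 retired instructions."""
    return 1000.0 * misses / instructions


# Hypothetical counter readings for one run, not data from the slide.
counters = {"L1I": 45_000, "L2I": 6_000, "L1D": 15_000, "L2D": 5_000, "LLC": 1_500}
instructions = 1_000_000

by_level = {level: mpki(count, instructions) for level, count in counters.items()}
dominant = max(by_level, key=by_level.get)
print(by_level)
print("dominant level:", dominant)
```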
TPC-E's lower miss ratio
Average per transaction
[Figure: # records accessed (0 to 150) and # pages accessed (0 to 40, split into heap and index) for TPC-B, TPC-C, TPC-E]
More scans -> increased page reuse
Breaking down clock cycles [SIGMOD16]
TPC-C, 100GB, 1 worker thread, Intel Ivy Bridge
[Figure: stall cycles breakdown per 1000 instructions (0 to 800), split into LLC D, L2D, L1D, LLC I, L2I, L1I, for Shore-MT, DBMS D, VoltDB, HyPer, DBMS M; L1I and LLC D highlighted]
L1I or LLC D stalls dominate
Where do L1I stalls come from?
Read N rows, 100GB, 1 worker thread, Intel Ivy Bridge
[Figure: stall cycles breakdown per 1000 instructions (0 to 1000), L1I and L2I, vs. number of rows read (1, 10, 100) for DBMS D and DBMS M]
Code outside storage mgr -> high L1I misses
Analyzing performance and energy
• Macrobenchmarks or microbenchmarks?
• Execution time breakdowns
• Measuring energy efficiency
ARM server-grade processor
Fat, lean
- B/c you can shut down nodes etc.
- Instead of 1 Xeon, use 10 Atoms…
- But then, switch cost/energy/faults…
- Not as clean as a single server.
- Scalability
- Tail latency
- Add ARM like another column to the table
- Put new ARM systems names, logos!
MORE FAT CORE EXAMPLES: SPARC, ETC.
OLTP on ARM: performance & power?
[Figure: ARM pipeline: 4-way fetch, decode, rename, dispatch; issue to Branch, Integer 0, Integer 1, FP/SIMD 0, FP/SIMD 1, Load, Store]
[DAMON16]
Xeon vs. ARM [DAMON16]
Processor        Intel Xeon          ARM Cortex-A57
# Sockets        2 (one is active)   1
# Cores/socket   8                   8
Issue width      4                   4
Clock speed      2.00GHz             2.00GHz
L1I / L1D        32KB / 32KB         32KB / 32KB
L2               256KB               256KB
L3 (shared)      20MB                8MB
RAM              256GB               16GB
ARM is a promising alternative
Silo, TPC-C, 5GB, single thread
[Figure: three panels comparing ARM and Xeon (0 to 15): normalized throughput (2.5x), normalized power (15x), normalized energy efficiency (6x)]
[DAMON16]
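The three ratios on this slide are tied together by the definition of energy efficiency as work done per unit of energy, i.e. throughput divided by power. A short worked example follows; reading the annotations as "Xeon throughput is 2.5x ARM's" and "Xeon power is 15x ARM's" is an assumption about the plot, and the function name is hypothetical.

```python
def energy_efficiency(throughput_tps: float, power_watts: float) -> float:
    """Transactions per joule: (transactions/second) / (joules/second)."""
    return throughput_tps / power_watts


# Normalized so that ARM = 1.0 on both axes. The Xeon values are the slide's
# ratios under the assumed reading: 2.5x the throughput at 15x the power.
arm = energy_efficiency(throughput_tps=1.0, power_watts=1.0)
xeon = energy_efficiency(throughput_tps=2.5, power_watts=15.0)

advantage = arm / xeon
print(f"ARM is {advantage:.0f}x more energy-efficient")
```

Under that reading, 15 / 2.5 = 6 matches the 6x annotated on the energy-efficiency panel.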
ARM achieves energy proportionality
Silo, TPC-C, 5GB
[Figure: three panels for ARM and Xeon vs. number of threads (1 to 7): normalized throughput (0 to 20), normalized power (0 to 30), normalized energy efficiency (0 to 1)]
[DAMON16]
ARM is less suitable for low latency
TPC-C, 5GB, all cores active
[Figure: normalized latency (0 to 12) at the 50th, 99th, and 99.9th percentiles for ARM and Xeon. Panels: Shore-MT, Silo. Annotated gaps: 2x, 11x, 5x, 3x]
[DAMON16]
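The 50th, 99th, and 99.9th percentiles reported here can be computed from per-transaction latency samples. A minimal sketch using the nearest-rank definition follows; the choice of that definition and the sample latencies are assumptions for illustration, not the paper's method or data.

```python
import math


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_samples:
        raise ValueError("no samples")
    n = len(sorted_samples)
    rank = min(n, max(1, math.ceil(pct / 100 * n)))
    return sorted_samples[rank - 1]


# Hypothetical latencies in microseconds: mostly fast, with a slow tail.
latencies = sorted([100] * 980 + [400] * 15 + [2000] * 5)
report = {p: percentile(latencies, p) for p in (50, 99, 99.9)}
for p, value in report.items():
    print(f"{p}th percentile: {value} us")
```

The median hides the slow tail entirely in this sample, which is why tail percentiles are reported separately.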
Lessons learned
• Macrobenchmarks show big picture
• Microbenchmarks reveal details
• Breakdowns correlate numbers
• Sensitivity analysis highlights trends
• Right methodology is essential for understanding behavior
References
[DAMON16] U. Sirin, R. Appuswamy, and A. Ailamaki: OLTP on a server-grade ARM.
[EDBT13] P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki: From A to E: Analyzing TPC's OLTP Benchmarks – The obsolete, the ubiquitous, the unexplored.
[PVLDB12] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki: OLTP on Hardware Islands.
[SIGMOD16] U. Sirin, P. Tözün, D. Porobic, and A. Ailamaki: Micro-architectural Analysis of In-memory OLTP.
[VLDB99] A. Ailamaki, D. DeWitt, M. Hill, and D. Wood: DBMSs On A Modern Processor: Where Does Time Go?