performance and energy analysis using transactional workloads

AnastasiaAilamakiEPFLandRAWLabsSA

students:DanicaPorobic,Utku Sirin,andPinarTozun

Performanceandenergyanalysisusingtransactionalworkloads

$20B+industry

Characteristics:– Hasmanyconcurrentrequests– Touchsmallpartofwholedata– Needhigh&predictableperformance

Primaryapplicationfordatabases

OnlineTransactionProcessing

HardwareOLTPrunson

Core

ILPSMT multisocketmulticores

2005 2020

heterogeneousmany-cores

How’stheutilization?Hardwarekeepsprovidingnewformsofparallelism

TPC-C100GBdata1workerthreadIntelIvyBridge

0%10%20%30%40%50%60%70%80%90%

100%

Shore-MT DBMSD VoltDB HyPer DBMSM

ExecutionCyclesBreakdo

wn

Utilizingmodernprocessors

STAL

LED

BUSY

Processorstalledmostofthetime

ScalingupOLTPonmultisockets

1 2 3 4 5 6 7 8Th

roughp

utNumberofsockets

Multisocketserversseverelyunder-utilized

Energyefficiencyasimportantasperformance

Whycareaboutpower?

57%

8%

18%

13%4% Servers

NetworkingEquipmentPowerDistribution&CoolingPower

Other

Monthlydatacentercosts[J.R.Hamilton]

30%power-relatedDynamicfractionincreasing

Power

Utilization

TodayIdeal

Energyproportionality

• Whyismysystemunder-utilizinghardware?

• Whyisn’tmysystemfasteronnewhardware?

• Arenewprocessorsmoreenergy-efficient?

Analyzingperformanceandenergy

• Macrobenchmarks orMicrobenchmarks?

• Executiontimebreakdowns

• Measuringenergyefficiency

Utilization(microarchitecturelevel)

0

0.5

1

1.5

2

2.5

3

3.5

4

TPC-B TPC-C TPC-E

InstructionsperCycle

IntelXeonX5660

0

0.5

1

1.5

2

2.5

3

3.5

4

TPC-B TPC-C TPC-EInstructionsperCycle

SunNiagaraT2

maximum

maximum

OLTPutilizesleancoresbetter

TPC-EhashigherIPC

[EDBT13]

Macrobenchmark:ExecutionCycles&StallsIntelXeonX5660

0%

20%

40%

60%

80%

100%

TPC-B TPC-C TPC-E

ExecutionCyclesBreakdo

wn

0%

20%

40%

60%

80%

100%

TPC-B TPC-C TPC-ECo

reStalls

Over70%oftimegoestostallsInstructionstallsarethemainproblem

STAL

LED

BUSY

INSTRU

CTION

REST

[EDBT13]

0%

20%

40%

60%

80%

100%

Shore-MT DBMSD VoltDB HyPer DBMSMExecutioncyclesbreakdo

wn

in-memoryon-disk

TPC-C,100GB1workerthreadIntelIvyBridge

Evenin-memorysystemsstall>60%time

STAL

LED

BUSY

Utilizationacrosssystems[SIGMOD16]

Microbenchmarks forwhat-ifanalysis100GB

1workerthreadIntelIvyBridge

Lowerdatalocalityà lowIPCforsomesystems

0

1

2

3

4

1 10 100 1 10 100 1 10 100 1 10 100 1 10 100


IPC

Numberofrowsread

[SIGMOD16]

Core Core Core

L1

L2

L1

L2

L1

L2

L3

Core

L1

L2

Memorycontroller

Inter-socketlinks

Core Core Core

L1

L2

L1

L2

L1

L2

L3

Memorycontroller

Core

L1

L2

Inter-socketlinks

Socket0 Socket1

Column Column

2x12-core Intel Xeon E5-2650L v3, 24x16GiB DDR3

scan

21cycles 72cycles237cycles

OLTPonhardwareislands

Howmeasureimpact?

[PVLDB12]

Partitionsensitivemicrobenchmark• Singlesiteversion

– probe/updateNrowsfromthelocalsite

• Multisiteversion– probe/update1rowfromthelocalsite– probe/updateN-1rowsuniformlyfromanysite– sitesmayresideonthesameinstance

[PVLDB12]

OLTPdeploymentconfigurations

Shared-everything Islandshared-nothing Shared-nothing

[PVLDB12]

Multisitetransactions:readonly

0

50

100

150

200

0 20 40 60 80 100

Throughp

ut(K

Tps)

%multisitetransactions

Shared-everythingShared-nothingIslandshared-nothing

4socketx6cores10rowsaccessed

Moreinstances->fasterperformancedegradation

[PVLDB12]

Multisitetransactions:updates

020406080

100120

0 20 40 60 80 100

Throughp

ut(K

Tps)

%multisitetransactions

Shared-everythingShared-nothingIslandshared-nothing

Updatedistributedtransactionsaremoreexpensive

4socketx6cores10rowsaccessed

[PVLDB12]

Myfirsttoy:PIIXeon

FETCH/DECODE

UNIT

DISPATCH EXECUTE

UNIT

RETIRE UNIT

INSTRUCTION POOL

L1 I-CACHE L1 D-CACHE

L2 CACHE

ÊBranchprediction,non-blockingcaches,out-of-order

MAIN MEMORY

[VLDB99]

WhereDoesTimeGo?

Time=TComputation+TMemory+TBranch+TResource-TOverlap

• Computation• Stalls

– Cachemisses– Branchmispredictions– Otherexecutionpipelinestalls

✪ Stalltimeandcomputationoverlap

[VLDB99]

SetupandMethodology

• FourcommercialDBMSs:A,B,C,D• 6400PIIXeon/MTrunningWindowsNT4• UsedPIIcounters• Correctness:Measured&computedCPI

Range Selection(sequential, indexed)

select avg (a3)from Rwhere a2 > Lo and a2 < Hi

Equijoin(sequential)

select avg (a3)from R, Swhere R.a2 = S.a1

[VLDB99]

Twoveryusefulbreakdowns[VLDB99]

processorstalled>50%oftimemoststalls:L1IandL2D

Adaptedformula#branch-misp

xpenalty

𝑇" + 𝑇$%& = 𝑇( + 𝑇) + 𝑇* + 𝑇+

𝑇) = 𝑇&,- + 𝑇&,. + 𝑇&/- + 𝑇&/.+𝑇&0- + 𝑇&0.

𝑇+ = 𝑇12 + 𝑇.34 +𝑇)-5(

𝑇.6&* + 𝑇-6&*

+

[SIGMOD16]

0

20

40

60

80

100

TPC-B TPC-C TPC-E

Miss

esperk-In

structions

CacheMissestoday

TPC-EhaslowerdatamissratioL1-Imissesdominate

IntelXeonX566032KBL1-I&32KBL1-D

SunNiagaraT216KBL1-I&8KBL1-D

0

20

40

60

80

100

TPC-B TPC-C TPC-E

Miss

esperk-In

structions

LLC L2D L1D L2I L1I

[EDBT13]

TPC-E’slowermissratio

0

30

60

90

120

150

TPC-B TPC-C TPC-E

#RecordsAccessed

Morescansà Increasedpagereuse

Averagepertransaction

0

5

10

15

20

25

30

35

40

TPC-B TPC-C TPC-E#PagesAccessed

HeapIndex

0100200300400500600700800


Stallcyclesb

reakdo

wnin

1000

instructions

LLCD L2D L1DLLCI L2I L1I

L1I

LLCD

TPC-C,100GB1workerthreadIntelIvyBridge

L1IorLLCDstallsdominate

Breakingdownclockcycles[SIGMOD16]

0

200

400

600

800

1000

1 10 100 1 10 100Numberofrowsread Numberofrowsread

DBMSD DBMSM

Stallcyclesb

reakdo

wnin

1000

instructions

L2IL1I

L1I

L2I

ReadNrows,100GB

1workerthreadIntelIvyBridge

WheredoL1Istallscomefrom?

Codeoutsidestoragemgrà highL1Imisses

ARMserver-gradeprocessor

Fatlean

- B/cyoucanshutdownnodesetc.- Insteadof1Xeon,use10Atoms…

- Butthen,switchcost/energy/faults…- Notascleanasasingleserver.

- Scalability- Taillatency- AddARMlikeanothercolumntothetable- PutnewARMsystemsnames,logos!

MOREFATCOREEXAMPLES:SPARC,ETC.

ARM

OLTPonARM:performance&power?

4-wayfetch,decode,

rename,dispatch Issue

Branch

Integer0

Integer1

FP/SIMD0

FP/SIMD1

Load

Store

[DAMON16]

Processor IntelXeon ARMCortex-A57#Sockets 2(oneisactive) 1

# Cores/socket 8 8Issuewidth 4 4Clockspeed 2.00GHz 2.00GhzL1I/L1D 32KB/32KB 32KB/32KB

L2 256KB 256KBL3(shared) 20MB 8MB

RAM 256GB 16GB

Xeonvs.ARM[DAMON16]

ARMisapromisingalternativeSilo,TPC-C,5GB

Singlethread

0

5

10

15Normalize

dthroughp

utARM Xeon

2.5x

0

5

10

15

Normalize

dpo

wer

15x

0

5

10

15

Normalize

den

ergyefficien

cy

6x

[DAMON16]

ARMachievesenergyproportionalitySilo,TPC-C,5GB

0

10

20

1 2 3 4 5 6 7

Normalize

dthroughp

ut

Numberofthreads

ARMXeon

0

15

30

1 2 3 4 5 6 7

Normalize

dpo

wer

Numberofthreads

0

0.5

1

1 2 3 4 5 6 7

Normalize

den

ergyefficien

cy

Numberofthreads

[DAMON16]

0

2

4

6

8

10

12

50th 99th 99.9th

Normalize

dlatency

ARM Xeon

0

2

4

6

8

10

12

50th 99th 99.9th

Normalize

dlatency

ARMislesssuitableforlowlatencyTPC-C,5GB

Allcoresactive

Shore-MT Silo

2x

11x

5x

3x

[DAMON16]

• Macrobenchmarks showbigpicture• Microbenchmarks revealdetails• Breakdownscorrelatenumbers• Sensitivityanalysishighlightstrends• Rightmethodologyisessentialforunderstandingbehavior

Lessonslearned

References

[DAMON16]U.Sirin,R.Appuswamy,andA.Ailamaki:OLTPonaserver-gradeARM.[EDBT13]P.Tözün,I.Pandis,C.Kaynak,D.Jevdjic,andA.Ailamaki:FromAtoE:AnalyzingTPC’sOLTPBenchmarks– Theobsolete,theubiquitous,theunexplored.[PVLDB12]D.Porobic,I.Pandis,M.Branco,P.Tözün,andA.Ailamaki:OLTPonHardwareIslands.[SIGMOD16]U.Sirin,P.Tözün,D.Porobic,andA.Ailamaki:Micro-architecturalAnalysisofIn-memoryOLTP[VLDB99]A.Ailamaki,D.DeWitt,M.Hill,andD.Wood:DBMSsOnAModernProcessor:WhereDoesTimeGo?

performance and energy analysis using transactional workloads

Documents