Reducing OLTP Instruction Misses with Thread Migration
Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos
University of Toronto, École Polytechnique Fédérale de Lausanne
OLTP on an Intel Xeon X5660
Shore-MT, Hyper-threading disabled
[Figure: instructions per cycle (0 to 0.9) for TPC-C and TPC-E]
IPC < 1 on a 4-issue machine
[Figure: breakdown of core stalls (0% to 100%) for TPC-C and TPC-E, split into resource (includes data) and instruction stalls]
70-80% of stalls are instruction stalls
OLTP L1 Instruction Cache Misses
Trace Simulation, 4-way L1-I Cache, Shore-MT
[Figure: misses per k-instruction (0 to 60) vs. cache size (16 KB to 1024 KB) for TPC-C and TPC-E; annotation: "Most common today!"]
~512KB is enough for OLTP instruction footprint
Reducing Instruction Stalls at the hardware level
• Larger L1-I cache size: higher access latency
• Different replacement policies: do not really affect OLTP workloads
• Advanced prefetching: has too much space overhead (40KB per core)
• Simultaneous multi-threading: increases IPC per hardware context, but pollutes the cache
Alternative: Thread Migration
• Enables usage of aggregate L1-I capacity
  – Large cache size without increased latency
• Can exploit instruction commonality
  – Localizes common transaction instructions
• Dynamic hardware solution
  – More general purpose
Transactions Running in Parallel
[Figure: threads T1, T2, T3 executing a transaction whose instruction stream is divided into parts that can fit into L1-I]
Common instructions among concurrent threads
Scheduling Threads
[Figure: threads T1, T2, T3 scheduled on cores 0-3 over time under Traditional scheduling and under TMi, showing the L1-I contents and the total misses for each schedule]
TMi
• Group threads
• Wait till L1-I is almost full
  – Count misses
  – Record last N misses
  – Misses > threshold => Migrate
[Figure: timeline of cores 0 and 1 and their L1-I; threads T1 and T2 run Transaction A, threads T3 and T4 run Transaction B]
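The trigger described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' hardware design: the class name `MissTracker`, its methods, and the `cache_almost_full` flag are hypothetical, and the sketch assumes misses are only counted once the L1-I is reported as almost full. The default threshold (256) and history length (6) are taken from the experimental setup slide.

```python
from collections import deque

MISS_THRESHOLD = 256  # "Miss-threshold is 256" (experimental setup slide)
LAST_N = 6            # "Last 6 misses are kept" (experimental setup slide)


class MissTracker:
    """Hypothetical per-thread bookkeeping for the migration trigger."""

    def __init__(self, threshold=MISS_THRESHOLD, last_n=LAST_N):
        self.threshold = threshold
        self.miss_count = 0
        # Only the most recent `last_n` missed block addresses are kept.
        self.recent_misses = deque(maxlen=last_n)

    def on_instruction_miss(self, block_addr, cache_almost_full):
        """Record one L1-I miss and return True if the thread should migrate."""
        if not cache_almost_full:
            # Assumption: misses while the cache is still filling are ignored.
            return False
        self.miss_count += 1
        self.recent_misses.append(block_addr)
        return self.miss_count > self.threshold

    def reset(self):
        """Start counting afresh, for example after a migration."""
        self.miss_count = 0
        self.recent_misses.clear()
```

How "almost full" is detected is not stated on the slide, so the sketch leaves that decision to the caller.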
TMi
Where to migrate?
• Check the last N misses recorded in other caches
  1) No matching cache => Move to an idle core if one exists
  2) Matching cache => Move to that core
  3) None of the above => Do not move
[Figure: timeline of threads T1 and T2 (Transaction A) on cores 0 and 1 and their L1-I]
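The three rules for picking a destination core could be expressed along these lines. This is a sketch under stated assumptions: the function name `choose_target_core` and its parameters are hypothetical, each remote L1-I is modelled as a plain set of block addresses, and a cache is treated as "matching" only if it holds at least `min_matches` of the recorded misses (all of them by default). The slide does not define the matching criterion, so that part is a guess.

```python
def choose_target_core(recent_misses, current_core, l1i_contents,
                       idle_cores, min_matches=None):
    """Return the core id to migrate to, or None to stay on current_core.

    recent_misses: block addresses of the last N misses of this thread.
    l1i_contents:  dict mapping core id -> set of blocks in that core's L1-I.
    idle_cores:    set of core ids that are currently idle.
    """
    needed = len(recent_misses) if min_matches is None else min_matches

    # Matching cache => move to that core (the best match wins).
    best_core, best_hits = None, 0
    for core, blocks in l1i_contents.items():
        if core == current_core:
            continue
        hits = sum(1 for addr in recent_misses if addr in blocks)
        if hits >= needed and hits > best_hits:
            best_core, best_hits = core, hits
    if best_core is not None:
        return best_core

    # No matching cache => move to an idle core if one exists.
    for core in sorted(idle_cores):
        if core != current_core:
            return core

    # None of the above => do not move.
    return None
```

For example, with `l1i_contents = {0: {1, 2}, 1: {10, 11, 12}, 2: set()}`, a thread on core 0 whose recent misses are `[10, 11]` would be sent to core 1, since that cache already holds both blocks.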
Experimental Setup
• Trace Simulation
  – PIN to extract instructions & data accesses per transaction
  – 16 core system
  – 32KB 8-way set-associative L1 caches
  – Miss-threshold is 256
  – Last 6 misses are kept
• Shore-MT as the storage manager
  – Workloads: TPC-C, TPC-E
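To make the trace-simulation setup concrete, here is a minimal sketch of how misses per k-instruction could be computed for a 32KB 8-way L1-I from a list of instruction addresses. It is not the simulator used for the talk. The names `SetAssociativeCache` and `mpki` are hypothetical, the 64-byte line size and LRU replacement are assumptions (the slide states neither), and every address in the trace is counted as one instruction.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes=32 * 1024, ways=8, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes  # 64 B is an assumption
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.line_bytes
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)   # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[block] = True
        return False


def mpki(instruction_addresses, cache):
    """Misses per 1000 instructions for a trace of instruction addresses."""
    instructions = 0
    misses = 0
    for addr in instruction_addresses:
        instructions += 1
        if not cache.access(addr):
            misses += 1
    return 1000.0 * misses / instructions if instructions else 0.0
```

With these defaults the cache has 64 sets. A real trace from PIN would replace the address list passed to `mpki`.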
Impact on L1-I Misses
[Figure: instruction misses per k-instruction (0 to 45) for No Migration, TMi, and TMi Blind on TPC-C and TPC-E]
Instruction misses reduced by half
Impact on L1-D Misses
[Figure: misses per k-instruction (0 to 45), split into instruction, read data, and write data misses, for No Migration, TMi, and TMi Blind on TPC-C and TPC-E]
Cannot ignore increased data misses
TMi’s Challenges
• Dealing with the data left behind
  – Prefetching
• Depends on thread identification
  – Software assisted
  – Hardware detection
• OS support needed
  – Disabling OS control over thread scheduling
Conclusion
• ~50% of the time OLTP stalls on instructions
• Spread computation through thread migration
• TMi
  – Halves L1-I misses
  – Time-wise ~30% expected improvement
  – Data misses should be handled
Thank you!
BACKUP
L1-I Misses per K-Instruction
[Figure: instruction MPKI (0 to 50), split into capacity, conflict, and compulsory misses, for cache sizes 16K to 1M with 2-way, 4-way, 8-way, and FA (fully associative) caches, on TPC-C and TPC-E]
L1-D Misses per K-Instruction
[Figure: data MPKI (0 to 10), split into capacity, conflict, and compulsory misses, for cache sizes 16K to 1M with 8-way and FA (fully associative) caches, on TPC-C and TPC-E]
Replacement Policies
[Figure: I-MPKI and D-MPKI (0 to 30) under the LRU, LIP, BIP, and DIP replacement policies, on TPC-C and TPC-E]
Experimental Setup
Intel Xeon X5660 Server

| Parameter | Value |
|---|---|
| #Sockets | 2 |
| #Cores in a Socket | 6 (OoO) |
| #HW Contexts | 24 |
| Clock Speed | 2.80GHz |
| Memory | 48GB |
| LLC (L3) | 12 MB |
| L2 (per core) | 256KB |
| L1 (per core) | 32KB (both I and D) |
| Hyper-Threading | Enabled |
| OS | Ubuntu 10.04 with Linux kernel 2.6.32 |

• Intel VTune 2011
  – Interface for hardware counters
• Working set fits in RAM
• Log flushed to RAM
• Each run:
  – Starts with initial database
  – Each worker executes 1000 transactions before VTune starts collecting numbers for 60 secs
Formulas
• IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
• Data Stalls = RESOURCE_STALLS.ANY
• Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
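As a worked illustration of these formulas, the sketch below derives IPC and the instruction-stall share from a dictionary of raw counter values. The function name `stall_breakdown` and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk. Only the counter names come from the slide.

```python
def stall_breakdown(counters):
    """Apply the slide's formulas to a dict of hardware counter values."""
    ipc = counters["INST_RETIRED.ANY_P"] / counters["CPU_CLK_UNHALTED.THREAD"]
    data_stalls = counters["RESOURCE_STALLS.ANY"]
    instruction_stalls = (
        counters["UOPS_ISSUED.CORE_STALL_CYCLES"] - counters["RESOURCE_STALLS.ANY"]
    )
    total_stalls = data_stalls + instruction_stalls
    return {
        "ipc": ipc,
        "data_stalls": data_stalls,
        "instruction_stalls": instruction_stalls,
        # Share of all stall cycles attributed to instructions.
        "instruction_stall_share": (
            instruction_stalls / total_stalls if total_stalls else 0.0
        ),
    }


# Made-up counter values, for illustration only.
sample = {
    "INST_RETIRED.ANY_P": 500,
    "CPU_CLK_UNHALTED.THREAD": 1000,
    "RESOURCE_STALLS.ANY": 200,
    "UOPS_ISSUED.CORE_STALL_CYCLES": 800,
}
result = stall_breakdown(sample)
```

With these sample values the IPC is 0.5 and instructions account for 600 of the 800 stall cycles, a share of 0.75.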
OLTP L1 Instruction Cache Misses
Trace Simulation, 4-way L1-I Cache, Shore-MT
[Figure: capacity misses per k-instruction (0 to 50) vs. cache size (16K to 1M) for TPC-C and TPC-E; annotation: "Most common today!"]
~512KB is enough for OLTP instruction footprint