2013/06/10 yun-chung yang kandemir, m., yemliha, t. ; kultursay, e. pennsylvania state univ.,...
TRANSCRIPT
![Page 1: 2013/06/10 Yun-Chung Yang Kandemir, M., Yemliha, T. ; Kultursay, E. Pennsylvania State Univ., University Park, PA, USA Design Automation Conference (DAC),](https://reader030.vdocuments.site/reader030/viewer/2022032516/56649c785503460f9492db92/html5/thumbnails/1.jpg)
Paper Presentation
2013/06/10 Yun-Chung Yang
Kandemir, M.; Yemliha, T.; Kultursay, E.
Pennsylvania State Univ., University Park, PA, USA
Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE
Pages 954–959
A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications
- Abstract
- Related Work
- Motivation
- Difference between inter and intra application
- Proposed Method
- Experiment Result
- Conclusion
Outline
This paper focuses on the problem of how to partition the cache space given to a multithreaded application across its threads. The authors show that different threads of a multithreaded application can have different cache space requirements; propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies; present a comprehensive experimental analysis of the proposed scheme; and show average improvements of 17.1% and 18.6% on the SPECOMP and PARSEC suites, respectively.
Abstract
Related Work
Off-chip bandwidth [3, 10, 13]
Processor cores [6]
Resource Management
Shared cache [5, 4, 8, 11, 12, 17, 18, 20]
Application granularity
Intra-application shared cache [16]
This paper: improve the cache layer problem
Run the facesim (PARSEC) and art (SPECOMP) applications.
Apply six schemes and record the average memory access time (AMAT):
- No-partition
- Uniform
- Nonuniform
- Nonuniform-L2
- Nonuniform-L3
- Dynamic
Dynamic, which divides the application into fixed epochs, outperforms the rest.
Motivation
Inter- and intra-application cache partitioning differ in both objective and implementation.
Intra-application cache partitioning tries to minimize the latency of the slowest thread. It is handled by the runtime system or a dynamic compiler.
Inter-application cache partitioning tries to optimize workload throughput. It is an OS problem.
Difference between Inter & Intra App.
Dynamic Partition System
Helper Thread whose main responsibility is to partition the cache space allocated to the application to maximize its performance.
The Proposed Method
System Interfacing
Performance Monitoring
Performance Modeling
Each OS epoch is composed of many application epochs, each of which is divided into 5 parts:
- Performance Monitoring
- Performance Modeling
- Resource Partitioning
- System Interfacing
- Application Execution
Proposed Method(cont.)
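The epoch structure on this slide can be sketched as a simple nested loop. This is only an illustration of the ordering, not the paper's helper-thread code: the function name `run_os_epoch` and the `handlers` dictionary are hypothetical, and each handler stands in for whatever work that part actually performs.

```python
# The five parts named on the slide, in the order listed there.
PHASES = (
    "performance monitoring",
    "performance modeling",
    "resource partitioning",
    "system interfacing",
    "application execution",
)


def run_os_epoch(num_app_epochs, handlers):
    """Run every part, in order, once per application epoch.

    `handlers` is a hypothetical mapping from part name to a callable
    taking the application-epoch index.
    """
    trace = []
    for epoch in range(num_app_epochs):
        for phase in PHASES:
            handlers[phase](epoch)
            trace.append((epoch, phase))
    return trace
```

The returned trace records the order in which the parts ran, which makes the ordering easy to inspect.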
Use average memory access time (AMAT) as the measure of a thread's cache performance.
AMAT:
- The ratio of total cycles spent on memory instructions to the total number of instructions
- Depends on the cache partition size
- Takes the different cache levels into account
Performance Monitoring
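As a minimal sketch of the monitoring step, the AMAT ratio from this slide can be computed from two per-thread counters. The `ThreadCounters` type and its field names are assumptions made for illustration; the slides do not say how the counters are collected, and the denominator follows the slide's wording (total number of instructions).

```python
from dataclasses import dataclass


@dataclass
class ThreadCounters:
    """Hypothetical per-thread counters sampled during one epoch."""
    memory_cycles: int  # total cycles spent on memory instructions
    instructions: int   # total number of instructions counted


def amat(c):
    """AMAT as worded on the slide: memory cycles over instruction count."""
    if c.instructions == 0:
        return 0.0
    return c.memory_cycles / c.instructions


def slowest_thread(counters):
    """Return the id of the thread with the highest AMAT.

    `counters` maps a thread id to its ThreadCounters.
    """
    return max(counters, key=lambda tid: amat(counters[tid]))
```

Picking out the slowest thread matters here because the intra-application objective is to minimize the latency of that thread.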
Need to predict the impact of increasing or decreasing the cache space allocated to a thread.
Each thread is expressed as a 3D plot, with the X and Y axes giving the cache space allocated from L2 and L3, respectively.
For thread i, the values at points d(sL2, sL3) are used to build a dynamic model for thread i.
Purpose: predict the performance of a thread.
Performance Modeling
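One way to picture such a per-thread model is a table of observed points (sL2, sL3) with their measured AMAT, queried by nearest observed point. This is a stand-in chosen for brevity, not the model from the paper, which the slides do not spell out; the class `ThreadModel` and its methods are hypothetical.

```python
class ThreadModel:
    """Hypothetical per-thread model: observed (s_L2, s_L3) -> AMAT samples."""

    def __init__(self):
        self.samples = {}  # (s_l2, s_l3) -> most recent measured AMAT

    def observe(self, s_l2, s_l3, measured_amat):
        """Record the AMAT measured under one L2/L3 way allocation."""
        self.samples[(s_l2, s_l3)] = measured_amat

    def predict(self, s_l2, s_l3):
        """Predict AMAT at an allocation from the nearest observed point."""
        if not self.samples:
            raise ValueError("no observations yet")
        nearest = min(
            self.samples,
            key=lambda p: (p[0] - s_l2) ** 2 + (p[1] - s_l3) ** 2,
        )
        return self.samples[nearest]
```

Each epoch adds a point to the surface, so predictions for nearby allocations improve as execution proceeds.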
For the ith L2 cache, qL2,i denotes the total number of cache ways allocated to this application.
The qL2,i ways are shared by mL2,i threads (from 0 to mL2,i).
The number of ways allocated to the kth thread is denoted sL2,i(k).
Cache Space Partitioning
P[t] denotes the cache resources (numbers of ways in L2 and L3).
Cache Space Partitioning Algorithm
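The algorithm itself appears only as a figure on the slide and is not reproduced in this transcript. As a rough illustration of the stated goal (reducing the latency of the slowest thread), the sketch below uses a simple greedy heuristic for a single cache: it repeatedly moves one way from the thread with the lowest predicted AMAT to the thread with the highest. The heuristic, the function `partition_ways`, and the `predict(k, ways)` callback are all assumptions, not the paper's method.

```python
def partition_ways(total_ways, predict, num_threads, min_ways=1):
    """Greedy sketch: shift ways from the fastest thread to the slowest one.

    predict(k, ways) is a hypothetical model returning the predicted AMAT
    of thread k when it owns `ways` ways of this cache.
    """
    # Start from a uniform split; leftover ways go to the first threads.
    base, extra = divmod(total_ways, num_threads)
    alloc = [base + (1 if k < extra else 0) for k in range(num_threads)]

    while True:
        cost = [predict(k, alloc[k]) for k in range(num_threads)]
        slow = max(range(num_threads), key=cost.__getitem__)
        donors = [
            k for k in range(num_threads)
            if k != slow and alloc[k] > min_ways
        ]
        if not donors:
            break
        fast = min(donors, key=cost.__getitem__)
        # Accept the move only if it lowers the worst predicted AMAT.
        new_worst = max(
            predict(slow, alloc[slow] + 1),
            predict(fast, alloc[fast] - 1),
            *(cost[k] for k in range(num_threads) if k not in (slow, fast)),
        )
        if new_worst >= cost[slow]:
            break
        alloc[slow] += 1
        alloc[fast] -= 1
    return alloc
```

The allocation always sums to `total_ways`, matching the constraint that the ways given to the application are shared among its threads. Because a move is accepted only when the worst predicted AMAT strictly drops, the loop terminates.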
New partition information is delivered to the OS using a system call.
A new instruction is added to the ISA.
COID = core ID, CLVL = cache level, CAID = cache ID, W = 64-bit-wide way allocation
System Interfacing
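To make the operands concrete, here is a small sketch that packs a set of way indices into a 64-bit value for W and bundles it with the other three fields. Treating W as a one-bit-per-way mask is an assumption; the slide only says it is a 64-bit-wide way allocation. The names `way_mask` and `PartitionRequest` are hypothetical, and the encoding of the instruction itself is not given in the slides.

```python
from dataclasses import dataclass


def way_mask(ways):
    """Pack a collection of way indices (0..63) into a 64-bit mask."""
    mask = 0
    for w in ways:
        if not 0 <= w < 64:
            raise ValueError(f"way index {w} does not fit in a 64-bit mask")
        mask |= 1 << w
    return mask


@dataclass(frozen=True)
class PartitionRequest:
    """Operands named on the slide; the field semantics here are assumed."""
    coid: int  # core ID
    clvl: int  # cache level, e.g. 2 or 3
    caid: int  # cache ID within that level
    w: int     # 64-bit way allocation, modeled here as a bit mask


# Example: give core 1 the first four ways of L2 cache 0.
req = PartitionRequest(coid=1, clvl=2, caid=0, w=way_mask(range(4)))
```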
The experimental environment
Comparison with other schemes
Average memory access time: the main target of the performance monitoring
Execution Cycle
What we want to know
Use SIMICS and GEMS to model the multicore architecture shown on the slide.
Run SPECOMP and PARSEC applications.
Use 120 million instructions as the application epoch.
Experiment Environment
Apply 8 schemes and record the average memory access time:
- No-partition
- Uniform: partition as evenly as possible across the cores
- Static Best: the best static partition, found through exhaustive search
- Dynamic: the proposed method
- Dynamic-L2: partition only L2
- Dynamic-L3: partition only L3
- L2+L3: a separate performance model for each level
- Ideal: the optimal strategy
Experiment Environment(cont.)
Improved performance.
Shows that the scheme balances the data access latency of different threads.
As the execution goes on, they all end up at an AMAT of about 8 cycles.
- Intra-application cache partitioning for multithreaded applications.
- Dynamic model, able to partition the cache across multiple layers.
- Average improvement of 17.1% in SPECOMP and 18.6% in PARSEC.
My Comment
- Reminds me of the importance of software and hardware cooperation.
- Threads are a main issue in CMPs.
Conclusion