Speculative Parallelization in
Decoupled Look-ahead
Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering
University of Rochester, Rochester, NY
Motivation
• Single-thread performance is still an important design goal
• Branch mispredictions and cache misses remain important performance hurdles
• One effective approach: decoupled look-ahead
• The look-ahead thread can often become the new bottleneck
• Speculative parallelization is well suited to accelerate it
  • More parallelism opportunities: the look-ahead thread does not contain all dependences
  • Simple architectural support: look-ahead is not correctness-critical
Outline
• Motivation
• Baseline decoupled look-ahead
• Speculative parallelization in look-ahead
• Experimental analysis
• Summary
Baseline Decoupled Look-ahead
• A binary parser generates the skeleton from the original program
• The skeleton runs on a separate core and
  • Maintains its own memory image in the local L1, with no write-back to the shared L2
  • Sends all branch outcomes through a FIFO queue and also helps prefetching
*A. Garg and M. Huang. “A Performance-Correctness Explicitly Decoupled Architecture”. In Proc. Int’l Symp. on Microarch., November 2008.
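The skeleton/main-thread organization above can be sketched as a producer/consumer pair. This is a toy model, not the paper's implementation: the function names and FIFO depth are illustrative. The bounded queue is the key detail; the skeleton thread blocks when it runs too far ahead, which is the natural throttling a decoupled design provides.

```python
# Toy sketch of decoupled look-ahead (hypothetical code, for illustration):
# a skeleton thread resolves branches early and pushes outcomes into a
# bounded FIFO; the main thread consumes them instead of predicting.
import threading
import queue

FIFO_DEPTH = 4                      # small queue bounds how far ahead we run
branch_fifo = queue.Queue(maxsize=FIFO_DEPTH)

def skeleton_thread(branches):
    # Runs the distilled program: resolves each branch earlier than the
    # main thread would, then forwards the outcome.
    for outcome in branches:
        branch_fifo.put(outcome)    # blocks when FIFO is full (throttling)
    branch_fifo.put(None)           # sentinel: skeleton finished

def main_thread(results):
    # Consumes pre-resolved outcomes in program order.
    while (outcome := branch_fifo.get()) is not None:
        results.append(outcome)

branches = [True, False, True, True, False]
results = []
t = threading.Thread(target=skeleton_thread, args=(branches,))
t.start()
main_thread(results)
t.join()
print(results)                      # [True, False, True, True, False]
```

Because the skeleton maintains its own memory image and is not correctness-critical, the main thread can treat these outcomes as hints and recover when they are wrong.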
Practical Advantages of Decoupled Look-ahead
• Look-ahead thread is a single, self-reliant agent
  • No need for quick spawning and register communication support
  • Little management overhead on the main thread
  • Easier for run-time control to disable
• Natural throttling mechanism to prevent run-away prefetching
• Look-ahead thread size is comparable to an aggregation of short helper threads
Look-ahead: A New Bottleneck
• Comparing four systems
  • Baseline
  • Decoupled look-ahead
  • Ideal
  • Look-ahead alone
• Application categories
  • Bottlenecks removed
  • Speed of look-ahead not the problem
  • Look-ahead is the new bottleneck
Unique Opportunities
• Skeleton code offers more parallelism
  • Certain dependences removed during slicing for the skeleton
• Look-ahead is inherently error-tolerant
  • Can ignore dependence violations
  • Little to no support needed, unlike in conventional TLS

Assembly code from equake
Software Support
• Dependence analysis
  • Profile-guided, coarse-grain at the basic-block level
• Spawn and target points
  • Basic blocks with a consistent dependence distance greater than a threshold DMIN
  • The spawned thread executes from the target point
• Loop-level parallelism is also exploited

Spawn and target points over time, showing the available parallelism at the basic-block level for a 2-core/context system
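The spawn/target selection above can be sketched as a small filter over a dependence-distance profile. Everything here is a hypothetical illustration (the function name, the consistency test, and the example profile); only the DMIN-threshold idea comes from the slide.

```python
# Hypothetical sketch of profile-guided spawn/target selection:
# keep basic-block pairs whose dependence distance (in basic blocks)
# is consistent across dynamic instances and always above DMIN.
DMIN = 15   # minimum dependence distance worth a spawn (from the slide)

def pick_spawn_target_pairs(profile):
    """profile maps (spawn_bb, target_bb) -> list of observed distances."""
    pairs = []
    for (spawn_bb, target_bb), dists in profile.items():
        consistent = max(dists) - min(dists) <= 2   # illustrative tolerance
        if min(dists) > DMIN and consistent:
            pairs.append((spawn_bb, target_bb))
    return pairs

profile = {
    ("BB3", "BB40"): [20, 21, 20, 20],   # far and stable -> good spawn point
    ("BB5", "BB12"): [6, 7, 6],          # too close (< DMIN) -> rejected
    ("BB8", "BB90"): [30, 80, 31],       # unstable distance -> rejected
}
print(pick_spawn_target_pairs(profile))  # [('BB3', 'BB40')]
```

In the real system the spawned thread would begin executing from the target point while the spawning thread continues toward it.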
Parallelism in Look-ahead Binary

Available theoretical parallelism for a 2-core/context system (DMIN = 15 basic blocks); parallelism potential with stable and consistent target and spawn points
Hardware and Runtime Support
• Thread spawning and merging
  • Not too different from regular thread spawning, except
    • The spawned thread shares the same register and memory state
    • The spawning thread terminates at the target PC
• Value communication
  • Register-based: handled naturally through shared registers in SMT
  • Memory-based: can be supported at different levels
    • Partial versioning in the cache at line granularity

Spawning of a new look-ahead thread
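As a rough illustration of line-level partial versioning, here is a minimal sketch. The class, its fields, and the merge policy are assumptions for illustration, not the paper's hardware design; the point it captures is that because look-ahead is not correctness-critical, speculative lines can simply be discarded rather than validated.

```python
# Minimal model of partial versioning at cache-line granularity:
# the spawned thread buffers its stores in private per-line versions;
# loads check the private version first, then fall back to shared state.
LINE_SIZE = 64

def line_of(addr):
    return addr // LINE_SIZE

class SpeculativeCache:
    def __init__(self, shared_mem):
        self.shared = shared_mem      # state visible to the primary thread
        self.versions = {}            # line -> {addr: value}, private copies

    def store(self, addr, value):
        # Store goes to a private version of the line, never to shared state.
        self.versions.setdefault(line_of(addr), {})[addr] = value

    def load(self, addr):
        line = self.versions.get(line_of(addr))
        if line and addr in line:     # hit in the speculative version
            return line[addr]
        return self.shared.get(addr, 0)

    def merge(self):
        # Spawning thread reached the target PC: drop speculative lines.
        # No violation detection or repair is needed for look-ahead.
        self.versions.clear()

shared = {0x100: 7}
spec = SpeculativeCache(shared)
spec.store(0x100, 42)                 # private to the spawned thread
assert spec.load(0x100) == 42         # spawned thread sees its own store
assert shared[0x100] == 7             # shared state untouched
spec.merge()
assert spec.load(0x100) == 7
```

A conventional TLS design would instead need to detect cross-thread dependence violations and squash; here that machinery is optional, which is what keeps the hardware support simple.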
Experimental Setup
• Program/binary analysis tool: based on ALTO
• Simulator: based on a heavily modified SimpleScalar
  • SMT, look-ahead, and speculative parallelization support
  • True execution-driven simulation (faithful value modeling)

Microarchitectural and cache configurations
Speedup of Speculative Parallel Decoupled Look-ahead
• 14 applications in which look-ahead is the bottleneck
• Speedup of look-ahead systems over the single-thread baseline
  • Decoupled look-ahead over single-thread baseline: 1.61x
  • Speculative look-ahead over single-thread baseline: 1.81x
  • Speculative look-ahead over decoupled look-ahead: 1.13x
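As a quick sanity check, the three reported speedups are mutually consistent: the gain over the single-thread baseline is roughly the product of the other two.

```python
# Consistency check of the reported average speedups.
decoupled_over_st = 1.61      # decoupled look-ahead vs. single thread
spec_over_decoupled = 1.13    # speculative vs. decoupled look-ahead
spec_over_st = 1.81           # speculative look-ahead vs. single thread

# 1.61 * 1.13 = 1.8193, which matches the reported 1.81x to rounding.
assert abs(decoupled_over_st * spec_over_decoupled - spec_over_st) < 0.02
print(round(decoupled_over_st * spec_over_decoupled, 2))  # 1.82
```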
Speculative Parallelization in Look-ahead vs. Main Thread
• The skeleton provides more opportunities for parallelization
• Speedup of speculative parallelization
  • Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
  • Speculative main thread over the single-thread baseline: 1.07x

IPC on a 2-core system: single-thread baseline, TLS on the main thread (1.07x), decoupled look-ahead, and speculative look-ahead (1.13x over decoupled, 1.81x over single thread)
Flexibility in Hardware Design
• Comparison of the regular (partial versioning) cache support with two alternatives
  • No cache versioning support
  • Dependence violation detection and squash
Other Details in the Paper
• The effect of spawning a look-ahead thread in preserving the lead of the overall look-ahead system
• Technique to avoid spurious spawns in the dispatch stage, which can be subject to branch misprediction
• Technique to mitigate the damage of runaway spawns
• Impact of speculative parallelization on overall recoveries
• Modifications to the branch queue in a multiple look-ahead system
Summary
• Decoupled look-ahead can significantly improve single-thread performance, but the look-ahead thread often becomes the new bottleneck
• The look-ahead thread lends itself to TLS acceleration
  • Skeleton construction removes dependences and increases parallelism
  • Hardware design is flexible and can be a simple extension of SMT
• A straightforward implementation of TLS look-ahead achieves an average 1.13x (up to 1.39x) speedup
http://www.ece.rochester.edu/projects/acal/
References (Partial)
• J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int’l Conf. on Supercomputing, pages 68–75, July 1997.
• O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 129–140, February 2003.
• J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int’l Symp. on Comp. Arch., pages 14–25, June 2001.
• C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int’l Symp. on Comp. Arch., pages 2–13, June 2001.
• A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int’l Symp. on High-Perf. Comp. Arch., pages 37–48, January 2001.
• Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int’l Symp. on Microarch., pages 269–280, December 2000.
• G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int’l Symp. on Comp. Arch., pages 414–425, June 1995.
References (cont.)
• H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In Proc. Int’l Conf. on Parallel Arch. and Compilation Techniques, pages 231–242, September 2005.
• A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int’l Symp. on Microarch., November 2008.
• C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int’l Symp. on Microarch., pages 85–96, November 2002.
• D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
• J. Renau, K. Strauss, L. Ceze, W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread-Level Speculation on a CMP. IEEE Micro, 26(1):80–91, January/February 2006.
• P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and Runahead Execution. In Proc. Int’l Conf. on Supercomputing, pages 410–420, June 2009.
• M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In Proc. Int’l Symp. on Comp. Arch., pages 13–24, June 2000.
Backup Slides
For reference only
Register Renaming Support
Modified Banked Branch Queue
Speedup of all applications
Impact on Overall Recoveries
“Partial” Recovery in Look-ahead
• When a recovery happens in the primary look-ahead
  • Do we kill the secondary look-ahead or not?
  • If we don’t kill it, we gain around 1% performance improvement
  • After a recovery in the main thread, the secondary thread often survives (~1000 instructions)
• Spawning a secondary look-ahead thread helps preserve the slip of the overall look-ahead system
Quality of Thread Spawning
• Successful and run-away spawns: killed after a certain time
• Impact of the lazy spawning policy
  • Wait a few cycles when a spawning opportunity arrives
  • Avoids spurious spawns; saves some wrong-path execution
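The lazy spawning policy can be modeled as a small filter over a timeline of events. The DELAY value and the event encoding are illustrative assumptions, not the paper's parameters: a spawn seen at dispatch is only committed if no squash (e.g., from a branch misprediction) arrives during the waiting window.

```python
# Toy model of lazy spawning: wait DELAY cycles before committing a spawn,
# so spawns triggered on a wrong path are squashed before they start.
DELAY = 3   # cycles to wait before committing to a spawn (illustrative)

def spawns_taken(events):
    """events: list of (cycle, kind) with kind 'spawn' or 'squash'.
    A spawn at cycle c is taken unless a squash lands in [c, c + DELAY)."""
    squashes = [c for c, kind in events if kind == "squash"]
    taken = []
    for c, kind in events:
        if kind == "spawn" and not any(c <= s < c + DELAY for s in squashes):
            taken.append(c)
    return taken

events = [(10, "spawn"),                 # taken: no squash within the window
          (20, "spawn"), (21, "squash"), # spurious: squashed while waiting
          (30, "spawn")]                 # taken
print(spawns_taken(events))              # [10, 30]
```

The trade-off is the one the slide names: a few cycles of added spawn latency in exchange for fewer spurious spawns and less wrong-path execution.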