analysis of multithreading capabilities of current high–performance processors 2005

22
Analysis of Multithreading Capabilities of Current High-Performance Processors Kamil Kędzierski, Francisco J. Cazorla, Mateo Valero [email protected], [email protected], [email protected]

Upload: kamil-kedzierski

Post on 07-Jul-2015

128 views

Category:

Technology


0 download

DESCRIPTION

Kamil Kędzierski, Francisco J. Cazorla, Mateo Valero, “Analysis of Multithreading Capabilities of Current High–Performance Processors”, XVII Jornadas de Paralelismo, September 18 – 20, 2006, Albacete, Spain

TRANSCRIPT

Page 1: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Analysis of Multithreading Capabilitiesof Current High-Performance

Processors

Kamil Kędzierski, Francisco J. Cazorla, Mateo [email protected], [email protected], [email protected]

Page 2: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

IntroductionImplementationPerformanceConclusions

Page 3: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Do we need multithreading?

• Problem: resources not used efficiently in a super-scalar processor

• Goal: increase efficiency – do more computation with similar resource amount

• One possible solution: SMT – Simultaneous Multithreading

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

IntroductionImplementationPerformanceConclusions

MotivationHistory

Page 4: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

How did we get to SMT?

• Multithreading first appeared in the 1950s, and Simultaneous Multithreading was investigated by IBM in 1968

• Super-scalar architectures: only one thread is running at a given time

• Multiprocessor: each thread uses a different set of resources

[ http://www.cs.clemson.edu/~mark/multithreading.html ]

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

IntroductionImplementationPerformanceConclusions

MotivationHistory

Page 5: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

How did we get to SMT?

• Coarse-grain MT: thread switch on a long-latency operations– a.k.a. "concurrent multithreading (CMT)“, "switch-on-event multithreading (SOEMT)“,

"dynamic blocked multithreading (BMT)"

• Fine-grain MT: thread switch not only on a long-latency ops

– a.k.a. "vertical multithreading“,"interleaved multithreading (IMT)“, "time-sharing multithreading“, "hardware multiprogramming" (1958)

• Simultaneous MT: ops issue from different threads at the same clock cycle

[ http://www.cs.clemson.edu/~mark/multithreading.html ]

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

IntroductionImplementationPerformanceConclusions

MotivationHistory

Page 6: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

IntroductionImplementationsPerformanceConclusions

Page 7: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Studied implementations

• IBM’s POWER5

• Intel’s Xeon (Pentium 4)

POWER5 die photo

[B. Sinharoy et al,“POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 8: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

SMT implementation in POWER5

• SMT added to super-scalar architecture– Second PC

– Return stack in branch prediction logic duplicated

– GPR/FPR rename mapper expanded for second set of registers

– Completion logic duplicated

– Thread bit added to most address/tag busses

[Ron Kalla et al,”Simultaneous Multi-threading Implementation in POWER5”, HotChips, 2003 ]

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 9: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

POWER5 architecture

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 10: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

SMT implementation in Xeon

• SMT added to super-scalar architecture– Second PC

– ITLB duplicated

– Microcode queue duplicated

– Return stack buffer duplicated

– Rename logic duplicated

– Thread bit added to most address/tag busses

[Deborah T. Marr et al “Hyper-Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 11: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Xeon architecture

Additional stages in case of cache miss – used to fill the Trace Cache

[Deborah T. Marr et al “Hyper-Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

[David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6, Issue 1, 2002 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 12: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Simplified pipeline’s views

• IBM’s POWER5

• Intel’s Xeon (Pentium 4)

[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”, ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems

2006 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 13: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

SMT implementations comparison

[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”, ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems

2006 ]

Resource POWER5 Intel XeonPC duplicated duplicatedInstruction Cache shared sharedITLB shared duplicatedDTLB shared sharedBHT shared sharedReturn Stack Buffer duplicated duplicatedDecode shared sharedInstruction Buffer/uCode Queue duplicated duplicated

Group Formation shared -Mapping/Rename shared duplicatedIQ shared partitionedScheduler - sharedRegister Read shared sharedFUs shared sharedStore Queues duplicated partitionedRegister Write shared sharedGCT/Commit duplicated partitioned

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 14: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

IntroductionImplementationsPerformanceConclusions

Page 15: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

POWER5 performance

0,97 + 0,97 = 1,94

[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ]

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 16: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

POWER5 performance

0,85 + 1 = 1,85

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 17: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

POWER5 performance

0,85 + 1 = 1,85

[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]

[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ]

Shows only similar trends!

Pictures for two different architectures

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 18: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Xeon performance

Varying processors in a system Varying server workloads

[Deborah T. Marr et al “Hyper-Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 19: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Average performance comparison

• POWER5’s performance is reported to increase up to 41%

• Xeon’s performance is reported to increase up to 28%

• BUT POWER5 is dual-core architecture

[Deborah T. Marr et al “Hyper-Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]

[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]

IntroductionImplementationsPerformanceConclusions

IBM’s POWER5Intel’s XeonComparison

Page 20: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Outline

• Introduction

• Implementations

• Performance

• Conclusions

• References

IntroductionImplementationsPerformanceConclusions

Page 21: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Conclusions

• Two architectures have been studied (POWER5 and Xeon)

• Adding second thread usually increases performance– Heavy depends on a given workload

• Most of the resources are shared

• Only the crucial ones are duplicated/partitioned:– Return Stack Buffer, Instruction Buffer/uCode Queue, Store Queues

and GCT/Commit logic

• In addition Xeon more often duplicates/partitiones than shares:– ITLB, Rename and IQ

• The thread level parallelism is a good answer on how to increase performance without drastically increasing the resources

IntroductionImplementationsPerformanceConclusions

ConclusionsReferences

Page 22: Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

References1. R. Kalla, B. Sinharoy, J. M. Tendler, ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO

2004, pages 40-472. B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner “POWER5 system

microarchitecture”, IBM Journal of Research and Development 2005, pages 505-5213. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel “Characterization of

simultaneous multithreading (SMT) efficiency in POWER5”, IBM Journal of Research and Development 2005, pages 555-564

4. Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, Michael Upton “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Volume 6, Issue 1, 2002

5. David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6, Issue 1, 2002

6. Ron Kalla, BalaramSinharoy, Joel Tendler, ”Simultaneous Multi-threading Implementation in POWER5”, A Symposium on High Performance Chips (HotChips), 19th August 2003,

7. G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. “The microarchitecture of the Pentium 4 processor”, Intel Technology Journal, Volume 5, Issue 1, 2001

8. Joel Emer, “EV8:The Post-Ultimate Alpha”, International Conference on Parallel Architectures and Compilation Techniques, page 155, 2001

9. Shubu Mukherjee, “The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K”, 1999, http://www.cs.wisc.edu/~shubu/talks/alpha.ppt

10. Rick Merritt, “Designers cut fresh paths to parallelism”, EETimeshttp://www.eetimes.com/story/OEG19991008S0014

11. Francisco J. Cazorla Almeida, “Quality of Service for Simultaneous Multithreading Processors (QoS for SMT Processors)”, PhD Thesis, DAC, UPC, 2005

12. Kamil Kędzierski, Francisco J. Cazorla and Mateo Valero, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”, ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems, Poster Session, July 26th, 2006 pages 113-116

13. http://www.cs.clemson.edu/~mark/multithreading.html

IntroductionImplementationsPerformanceConclusions

ConclusionsReferences