hyper-threading - ninewhilenine.org · cs203b-advanced computer architecture, ... smt 3....

32
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis [email protected] CS203B-Advanced Computer Architecture, Spring 2004 – p.1/32

Upload: leduong

Post on 04-Aug-2018

232 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Hyper-Threading:

Simultaneous Multithreadingon Pentium 4

Presented by:

Thomas Repantis

[email protected]

CS203B-Advanced Computer Architecture, Spring 2004 – p.1/32

Page 2: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Overview

Multiple threads executing on a single processorwithout switching.

1. Threads

2. SMT

3. Hyper-Threading on P4

4. OS and Compiler Support

5. Performance for Different Applications

CS203B-Advanced Computer Architecture, Spring 2004 – p.2/32

Page 3: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Threads

• Process: “A task being run by the computer.”• Context: Describes a process’s current state of

execution (registers, flags, PC...).• Thread: A “light-weight” process (has its own PC

and SP, but single address space and globalvariables).

• Each process consists of at least one thread.• Threads allow faster context-switching and

fine-grain multitasking.

CS203B-Advanced Computer Architecture, Spring 2004 – p.3/32

Page 4: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Single-Threaded CPU

A lot of bubbles in the in-struction issue and in thepipeline!

CS203B-Advanced Computer Architecture, Spring 2004 – p.4/32

Page 5: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Single-Threaded SMP

Executing processes aredoubled, but bubbles aredoubled as well!

CS203B-Advanced Computer Architecture, Spring 2004 – p.5/32

Page 6: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Superthreaded CPU

Each issue and eachpipeline stage can con-tain instructions of thesame thread only.

CS203B-Advanced Computer Architecture, Spring 2004 – p.6/32

Page 7: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Hyper-Threaded CPU (SMT)

Instructions of differentthreads can be sched-uled on the same stage.

CS203B-Advanced Computer Architecture, Spring 2004 – p.7/32

Page 8: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

SMT vs TeraMTA

• Each processor of the TeraMTA has 128 streams,that include a PC and 32 registers.

• Each stream is assigned to a thread.• Instructions from different streams can be

pipelined on the same processor.• However, in TeraMTA only a single thread is

active on any given cycle.

CS203B-Advanced Computer Architecture, Spring 2004 – p.8/32

Page 9: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

SMT Benefits

SMT:• Gives the OS the illusion of several (currently two)

logical processors.• Makes efficient use of resources.• Overcomes the barrier of limited amount of ILP

within just one thread.• Is implemented by dividing processor resources to

replicated, partitioned, and shared.

CS203B-Advanced Computer Architecture, Spring 2004 – p.9/32

Page 10: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Replicated Resources

Each logical processor has independent:• Instruction Pointer• Register Renaming Logic• Instruction TLB• Return Stack Predictor• Advanced Programmable Interrupt Controller• Other architectural registers

CS203B-Advanced Computer Architecture, Spring 2004 – p.10/32

Page 11: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Partitioned Resources

Each logical processors gets exactly half of:• Re-order buffers (ROBs)• Load/Store buffers• Several queues (e.g. scheduling, uop

(micro-operations))

Partitioning prohibits a logical processor from monopo-

lizing the resources.

CS203B-Advanced Computer Architecture, Spring 2004 – p.11/32

Page 12: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Statically Partitioned Queue

Specific positions are as-signed to each proces-sor.

CS203B-Advanced Computer Architecture, Spring 2004 – p.12/32

Page 13: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Dynamically Partitioned Queue

A limit is imposed to thepositions each processorcan use, but no specificpositions are assigned.

CS203B-Advanced Computer Architecture, Spring 2004 – p.13/32

Page 14: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Shared Resources

Each logical processor shares SMT-unawareresources:

• Execution Units• Microarchitectural registers (GPRs, FPRs)• Caches: trace cache, L1, L2, L3

Sharing:+ Enables efficient use of resources, but...

- Allows a thread to monopolize a resource (e.g. cache

thrashing).

CS203B-Advanced Computer Architecture, Spring 2004 – p.14/32

Page 15: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Pentium 4

• 32-bit

• 2.4 to 3.4 GHz clock frequency

• 800 MHz system bus

• 0.13-micron technology

• 8KB L1 data cache, 12KB L1instruction cache, 256KB to 1MBL2 cache, 2MB L3 cache

• NetBurst microarchitecture(hyper-pipelined)

• Hyper-Threading technology

CS203B-Advanced Computer Architecture, Spring 2004 – p.15/32

Page 16: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Front-End Pipeline

(a) Trace Cache Hit

(b) Trace Cache Miss

CS203B-Advanced Computer Architecture, Spring 2004 – p.16/32

Page 17: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Out-Of-Order Execution EnginePipeline

CS203B-Advanced Computer Architecture, Spring 2004 – p.17/32

Page 18: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Implementation Goals Achieved

• Minimal die area cost (less than 5% more diearea).

• Stall of one logical processor does not stall theother (buffering queues between pipeline logicblocks).

• When only one thread is running, speed should bethe same as without H-T (partitioned resourcesare dedicated to it).

CS203B-Advanced Computer Architecture, Spring 2004 – p.18/32

Page 19: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Single- and Multi-Task Modes

Partitioned resources are dedicated to one of thelogical processors when the other is HALTed.

CS203B-Advanced Computer Architecture, Spring 2004 – p.19/32

Page 20: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Operating System Optimizations

When the OS schedules threads to logical processorsit should:

• HALT an inactive logical processor, to avoidwasting resources for idle loops (continuouslychecking for available work).

• Schedule threads to logical processors on differentphysical processors instead of the same (whenpossible), to avoid using the same physicalexecution resources.

CS203B-Advanced Computer Architecture, Spring 2004 – p.20/32

Page 21: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

OS Optimizations

The Linux kernel (2.6 series) distinguishes betweenlogical and physical processors:

• H-T-aware passive and active load-balancing• H-T-aware task pickup• H-T-aware affinity• H-T-aware wakeup

CS203B-Advanced Computer Architecture, Spring 2004 – p.21/32

Page 22: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Compiler Optimizations

Intel 8.0 C++ and FORTRAN compilers:

Automatic optimizations:• Vectorization• Advanced instruction selection

Programmer-controlled optimizations:• Insertion of Streaming-SIMD-Extensions 3 (SSE3)

instructions• Insertion of OpenMP directives

CS203B-Advanced Computer Architecture, Spring 2004 – p.22/32

Page 23: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance gain from automaticoptimizations

SPEC CPU 2000 shows significant speedup not onlyfrom H-T specific (QxP) but even for general P4 (QxN)optimizations.

CS203B-Advanced Computer Architecture, Spring 2004 – p.23/32

Page 24: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance gain from manualoptimizations

SPEC OMPM 2001 shows speedup achieved byautomatic optimizations in combination with OpenMPdirectives.

CS203B-Advanced Computer Architecture, Spring 2004 – p.24/32

Page 25: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Thread-level Parallelism of DesktopApplications

• Unlike server workloads, interactive desktopapplications focus on response time and not onend-to-end throughput.

• Average response time improvement on dual- vsuni-processor measured 22%.

• The application programmer has to exploitmulti-threading.

• More than 2 processors yield no greatimprovements.

CS203B-Advanced Computer Architecture, Spring 2004 – p.25/32

Page 26: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance in Client-ServerApplications

While H-T offers no gain or degradation in API callsand user application workloads, it achievesconsiderable speedups in multi-threaded workloads.

CS203B-Advanced Computer Architecture, Spring 2004 – p.26/32

Page 27: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance in File ServerWorkloads

Good speedups in multi-threaded workloads, whetherfilesystem and socket calls, or just socket calls.

CS203B-Advanced Computer Architecture, Spring 2004 – p.27/32

Page 28: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance in Online TransactionProcessing

21% performance gain in the case of 1 and 2processors.

CS203B-Advanced Computer Architecture, Spring 2004 – p.28/32

Page 29: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Performance in Web Serving

16 to 28% performance gain.

CS203B-Advanced Computer Architecture, Spring 2004 – p.29/32

Page 30: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Conclusions

• Hyper-Threading enables thread-level parallelismby duplicating the architectural state of theprocessor, while sharing one set of processorexecution resources.

• When scheduling threads, the OS sees two logicalprocessors.

• While not providing the performance achieved byadding a second processor, Hyper-Threading canoffer a 30% improvement.

• Resource contention limits the performancebenefits for certain applications.

• Performance gains are evident in multi-threadedworkloads, which are usually found in servers.CS203B-Advanced Computer Architecture, Spring 2004 – p.30/32

Page 31: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

References

1. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”,Intel Technology Journal, Volume 06-Issue 01, 2002.

2. D. Tulsen et al., “ Simultaneous Multithreading: Maximizing On-Chip Parallelism”,ISCA, 1995.

3. J. Stokes, “Introduction to Multithreading, Superthreading and Hyperthreading”,Ars Technica, 2002.

4. K. Smith et al., “Support for the Intel Pentium 4 Processor with Hyper-ThreadingTechnology in Intel 8.0 Compilers”, Intel Technology Journal, Volume 08-Issue 01,2004.

5. D. Vianney, “Hyper-Threading speeds Linux”, IBM Linux developerWorks, 2003.

6. J.Hennessy, D. Patterson, “Computer Architecture: A Quantitative Approach”, 3rdEdition, pp. 608–615, 2003.

7. “Hyper-Threading Technology on the Intel Xeon Processor Family for Servers”,Intel White Paper, 2004.

8. K. Flautner et al., “Thread-level Parallelism and Interactive Performance ofDesktop Applications”, ASPLOS, 2000.

9. L. Carter et al., “Performance and Programming Experience on the Tera MTA”,SIAM Conference on Parallel Processing, 1999.

CS203B-Advanced Computer Architecture, Spring 2004 – p.31/32

Page 32: Hyper-Threading - ninewhilenine.org · CS203B-Advanced Computer Architecture, ... SMT 3. Hyper-Threading on P4 4. ... “Hyper-Threading Technology Architecture and Microarchitecture”,

Thank you!

Questions/Comments?

CS203B-Advanced Computer Architecture, Spring 2004 – p.32/32