Multithreading Fundamentals

DESCRIPTION

Everybody knows the lock keyword, but how is it implemented? What are its performance characteristics? Gaël Fraiteur scratches the surface of multithreaded programming in .NET and goes deep through the Windows kernel down to CPU microarchitecture.

TRANSCRIPT

Page 1: Multithreading Fundamentals

@gfraiteur

A deep investigation of how .NET multithreading primitives map to hardware and the Windows kernel

You Thought You Understood Multithreading

Gaël Fraiteur

PostSharp Technologies
Founder & Principal Engineer

Page 2: Multithreading Fundamentals

@gfraiteur

Hello!

My name is GAEL.

No, I don’t think my accent is funny.

My Twitter: @gfraiteur

Page 3: Multithreading Fundamentals

@gfraiteur

Agenda

1. Hardware

2. Windows Kernel

3. Application

Page 4: Multithreading Fundamentals

@gfraiteur

Microarchitecture
Hardware

Page 5: Multithreading Fundamentals

@gfraiteur

There was no RAM in Colossus, the world’s first electronic programmable computing device (1944).

http://en.wikipedia.org/wiki/File:Colossus.jpg

Page 6: Multithreading Fundamentals

@gfraiteur

Hardware: Processor microarchitecture

Intel Core i7 980x

Page 7: Multithreading Fundamentals

@gfraiteur

Hardware: Hardware latency

Latency (ns):
  DRAM        26.2
  L3 Cache     8.58
  L2 Cache     3
  L1 Cache     1.2
  CPU Cycle    0.3

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra.

Page 8: Multithreading Fundamentals

@gfraiteur

Hardware: Cache Hierarchy

• Distributed into caches:
  • L1 (per core)
  • L2 (per core)
  • L3 (per die)

• Synchronized through an asynchronous “bus”:
  • Request/response
  • Message queues
  • Buffers

[Diagram: two dies (Die 0 with cores 0–3, Die 1 with cores 4–7); each core has its own L2 cache, each die has a shared L3 cache, and data is transferred through core and processor interconnects to the memory store (DRAM).]

Page 9: Multithreading Fundamentals

@gfraiteur

Hardware: Memory Ordering

• Issues due to implementation details:

• Asynchronous bus

• Caching & Buffering

• Out-of-order execution

Page 10: Multithreading Fundamentals

@gfraiteur

Hardware: Memory Ordering

Scenario 1:
  Processor 0: Store A := 1; Store B := 1
  Processor 1: Load A; Load B
  Possible values of (A, B) for the loads?

Scenario 2:
  Processor 0: Load A; Store B := 1
  Processor 1: Load B; Store A := 1
  Possible values of (A, B) for the loads?

Scenario 3:
  Processor 0: Store A := 1; Load B
  Processor 1: Store B := 1; Load A
  Possible values of (A, B) for the loads?
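
The third scenario is the classic store-buffer litmus test. The following C# harness is a sketch, not code from the talk (the class name StoreLoadTest and the iteration count are illustrative): it replays the scenario many times and counts how often both loads return 0, an outcome that only a reordering of each store with the following load can produce.

    using System;
    using System.Threading;

    static class StoreLoadTest
    {
        const int Iterations = 1000000;
        static int A, B, loadedA, loadedB;

        static void Main()
        {
            int observed = 0;

            // Three participants: the two "processors" plus the main thread,
            // which resets the variables and checks the result between rounds.
            using (var barrier = new Barrier(3))
            {
                new Thread(() =>
                {
                    for (int i = 0; i < Iterations; i++)
                    {
                        barrier.SignalAndWait();   // start of round
                        A = 1; loadedB = B;        // Processor 0: Store A := 1; Load B
                        barrier.SignalAndWait();   // end of round
                    }
                }) { IsBackground = true }.Start();

                new Thread(() =>
                {
                    for (int i = 0; i < Iterations; i++)
                    {
                        barrier.SignalAndWait();
                        B = 1; loadedA = A;        // Processor 1: Store B := 1; Load A
                        barrier.SignalAndWait();
                    }
                }) { IsBackground = true }.Start();

                for (int i = 0; i < Iterations; i++)
                {
                    A = 0; B = 0;                  // reset while both workers wait at the barrier
                    barrier.SignalAndWait();       // release both "processors"
                    barrier.SignalAndWait();       // wait until both have run their two instructions
                    if (loadedA == 0 && loadedB == 0) observed++;
                }
            }

            // On x86/AMD64, which permit a store to be reordered after a later load,
            // this count is typically non-zero.
            Console.WriteLine("(A, B) == (0, 0) observed {0} times out of {1}", observed, Iterations);
        }
    }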

Page 11: Multithreading Fundamentals

@gfraiteur

Hardware: Memory Models: Guarantees

Reorderings permitted by architecture (source: http://en.wikipedia.org/wiki/Memory_ordering):

• Loads reordered after loads: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
• Loads reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
• Stores reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, x86 oostore, IA-64
• Stores reordered after loads: all architectures listed (Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA-64, zSeries)
• Atomic reordered with loads: Alpha, ARMv7, POWER, SPARC RMO, IA-64
• Atomic reordered with stores: Alpha, ARMv7, POWER, SPARC RMO, SPARC PSO, IA-64
• Dependent loads reordered: Alpha only
• Incoherent instruction cache pipeline: 10 of the 12 architectures listed

x86, AMD64: strong ordering. ARM: weak ordering.

Page 12: Multithreading Fundamentals

@gfraiteur

Hardware: Memory Models Compared

Scenario 1:
  Processor 0: Store A := 1; Store B := 1
  Processor 1: Load A; Load B
  Possible values of (A, B) for the loads? • x86: • ARM:

Scenario 2:
  Processor 0: Load A; Store B := 1
  Processor 1: Load B; Store A := 1
  Possible values of (A, B) for the loads? • x86: • ARM:

Page 13: Multithreading Fundamentals

@gfraiteur

Hardware: Compiler Memory Ordering

• The compiler (CLR) can:

• Cache (memory into register),

• Reorder

• Coalesce writes

• volatile keyword disables compiler optimizations. And that’s all!
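
A minimal sketch of what this means in practice (the Worker class and _stop field are illustrative names): without volatile, the JIT may read the flag once, keep it in a register and never notice the other thread’s write; with volatile, every iteration re-reads the field.

    using System;
    using System.Threading;

    class Worker
    {
        // Without 'volatile', the JIT is allowed to cache _stop in a register
        // and the loop below may never observe the write from Stop().
        private volatile bool _stop;

        public void Run()
        {
            long iterations = 0;
            while (!_stop)          // volatile read: fetched from memory on every iteration
                iterations++;
            Console.WriteLine("Stopped after {0} iterations.", iterations);
        }

        public void Stop()
        {
            _stop = true;           // volatile write
        }

        static void Main()
        {
            var worker = new Worker();
            var thread = new Thread(worker.Run);
            thread.Start();

            Thread.Sleep(1000);     // let the worker spin for a moment
            worker.Stop();          // without volatile, Run() might spin forever
            thread.Join();
        }
    }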

Page 14: Multithreading Fundamentals

@gfraiteur

Hardware: Thread.MemoryBarrier()

  Processor 0: Store A := 1; Thread.MemoryBarrier(); Load B
  Processor 1: Store B := 1; Thread.MemoryBarrier(); Load A

Possible values of (A, B)? • Without MemoryBarrier: • With MemoryBarrier:

Barriers serialize access to memory.
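
As a sketch of the fix for the store-buffer scenario above (variable names as in the earlier harness), each side issues a full fence between its store and its load:

    // Processor 0:
    A = 1;
    Thread.MemoryBarrier();   // the store to A becomes globally visible before B is loaded
    int loadedB = B;

    // Processor 1:
    B = 1;
    Thread.MemoryBarrier();   // the store to B becomes globally visible before A is loaded
    int loadedA = A;

With both barriers in place at least one of the loads must see the other processor’s store, so the (0, 0) outcome disappears; dropping the barrier on either side brings it back.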

Page 15: Multithreading Fundamentals

@gfraiteur

Hardware: Atomicity

  Processor 0: Store A := (long) 0xFFFFFFFF
  Processor 1: Load A

Possible values of A for the load? • x86 (32-bit): • AMD64:

Reads and writes are atomic only:
• Up to the native word size (IntPtr)
• For properly aligned fields (attention to struct fields!)
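
A sketch of a torn read (class and field names illustrative): on a 32-bit runtime a long is written as two 32-bit halves, so a reader on another thread can observe a mix of the old and new halves. Interlocked.Read and Interlocked.Exchange provide atomic 64-bit access even on x86.

    using System;
    using System.Runtime.CompilerServices;
    using System.Threading;

    static class TornReadDemo
    {
        static long _value;   // written alternately as 0 and -1 (0xFFFFFFFFFFFFFFFF)

        // NoInlining keeps the JIT from hoisting the read out of the loop below.
        [MethodImpl(MethodImplOptions.NoInlining)]
        static long ReadValue() { return _value; }

        static void Main()
        {
            new Thread(() =>
            {
                long[] values = { 0L, -1L };
                for (int i = 0; ; i++)
                    _value = values[i & 1];    // plain 64-bit store: not atomic on a 32-bit CLR
            }) { IsBackground = true }.Start();

            for (int i = 0; i < 100000000; i++)
            {
                long read = ReadValue();       // plain 64-bit load: may be torn on 32-bit
                if (read != 0L && read != -1L)
                {
                    Console.WriteLine("Torn read: 0x{0:X16}", read);
                    return;
                }
            }
            Console.WriteLine("No torn read observed (expected on a 64-bit runtime).");

            // Atomic alternative, valid on both 32-bit and 64-bit:
            //   long snapshot = Interlocked.Read(ref _value);
            //   Interlocked.Exchange(ref _value, 42L);
        }
    }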

Page 16: Multithreading Fundamentals

@gfraiteur

Hardware: Interlocked access

• The only locking mechanism at hardware level.

• System.Threading.Interlocked:
  • Add:             Add( ref x, int a )        { x += a; return x; }
  • Increment:       Inc( ref x )               { x += 1; return x; }
  • Exchange:        Exc( ref x, int v )        { int t = x; x = v; return t; }
  • CompareExchange: CAS( ref x, int v, int c ) { int t = x; if ( t == c ) x = v; return t; }
  • Read( ref long )

• Implemented by L3 cache and/or interconnect messages.

• Implicit memory barrier
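
A small sketch contrasting a plain increment with Interlocked.Increment on a shared counter (class and field names illustrative): the plain x++ is a load, an add and a store that can interleave between threads, so updates get lost; the interlocked version is a single atomic read-modify-write.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    static class InterlockedDemo
    {
        static int _plainCounter;
        static int _interlockedCounter;

        static void Main()
        {
            Parallel.For(0, 1000000, i =>
            {
                _plainCounter++;                                // load + add + store: increments can be lost
                Interlocked.Increment(ref _interlockedCounter); // atomic, with an implicit full memory barrier
            });

            Console.WriteLine("Plain increment:       {0}", _plainCounter);        // usually less than 1000000
            Console.WriteLine("Interlocked.Increment: {0}", _interlockedCounter);  // always 1000000
        }
    }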

Page 17: Multithreading Fundamentals

@gfraiteur

Hardware: Cost of interlocked

Latency (ns):
  Interlocked increment (contended, same socket)   9.75
  L3 Cache                                         8.58
  Interlocked increment (non-contended)            5.5
  Non-interlocked increment                        1.87
  L1 Cache                                         1.2

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.

Page 18: Multithreading Fundamentals

@gfraiteur

Hardware: Cost of cache coherency

Page 19: Multithreading Fundamentals

@gfraiteur

Multitasking

• No thread, no process at hardware level

• There is no such thing as “wait”

• One core never does more than one “thing” at the same time (except HyperThreading)

• Task-State Segment:

• CPU State

• Permissions (I/O, IRQ)

• Memory Mapping

• A task runs until interrupted by hardware or OS

Page 20: Multithreading Fundamentals

@gfraiteur

Lab: Non-Blocking Algorithms

• Non-blocking stack (4.5 MT/s)

• Ring buffer (13.2 MT/s)
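
The lab code itself is not part of the transcript, but the idea behind the non-blocking stack is the classic Treiber stack: push and pop build the new state on the side and then publish it with Interlocked.CompareExchange, retrying if another thread won the race for the top node. A minimal sketch (type and member names illustrative):

    using System.Threading;

    // Lock-free stack: producers and consumers compete for _top with CAS instead of a lock.
    public class NonBlockingStack<T>
    {
        private class Node
        {
            public readonly T Value;
            public Node Next;
            public Node(T value) { Value = value; }
        }

        private Node _top;

        public void Push(T value)
        {
            var node = new Node(value);
            while (true)
            {
                Node oldTop = _top;
                node.Next = oldTop;
                // Publish the new node only if nobody has replaced the top in the meantime.
                if (Interlocked.CompareExchange(ref _top, node, oldTop) == oldTop)
                    return;
            }
        }

        public bool TryPop(out T value)
        {
            while (true)
            {
                Node oldTop = _top;
                if (oldTop == null)
                {
                    value = default(T);
                    return false;
                }
                // Claim the top node only if it is still the top.
                if (Interlocked.CompareExchange(ref _top, oldTop.Next, oldTop) == oldTop)
                {
                    value = oldTop.Value;
                    return true;
                }
            }
        }
    }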

Page 21: Multithreading Fundamentals

@gfraiteur

Windows Kernel
Operating System

Page 22: Multithreading Fundamentals

@gfraiteur

SideKick (1988). http://en.wikipedia.org/wiki/SideKick

Page 23: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Processes and Threads

Process:
• Virtual Address Space
• Resources
• At least 1 thread
• (other things)

Thread:
• Program (sequence of instructions)
• CPU State
• Wait Dependencies
• (other things)

Page 24: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Processes and Threads

8 logical processors (4 hyper-threaded cores)
2,384 threads, 155 processes

Page 25: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Dispatching threads to CPU

[Diagram: threads move between three states.]
• Executing: one thread per logical CPU (CPU 0–3 running threads G, H, I, J)
• Wait: threads blocked on dispatcher objects (mutex, semaphore, event, timer): threads A, B, C
• Ready: threads waiting for a CPU to become free: threads D, E, F

Page 26: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Dispatcher Objects (WaitHandle)

Characteristics:
• Live in the kernel
• Sharable among processes
• Expensive!

Kinds of dispatcher objects:
• Event
• Mutex
• Semaphore
• Timer
• Thread

Page 27: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Cost of kernel dispatching

Duration (ns):
  Normal increment                                2
  Non-contended interlocked increment             6
  Contended interlocked increment                10
  Access kernel dispatcher object               295
  Create + dispose kernel dispatcher object   2,093

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI)

Page 28: Multithreading Fundamentals

@gfraiteur

Windows Kernel: Wait Methods

• Thread.Sleep

• Thread.Yield

• Thread.Join

• WaitHandle.Wait

• (Thread.SpinWait)
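
A small sketch of a WaitHandle-based wait (names illustrative): one thread blocks on an AutoResetEvent, which wraps a kernel event, until another thread signals it; each wait/signal round trip pays the kernel dispatcher cost shown earlier.

    using System;
    using System.Threading;

    static class WaitHandleDemo
    {
        // AutoResetEvent wraps a kernel event object exposed through WaitHandle.
        static readonly AutoResetEvent _workReady = new AutoResetEvent(false);

        static void Main()
        {
            var worker = new Thread(() =>
            {
                Console.WriteLine("Worker: waiting for the event...");
                _workReady.WaitOne();              // blocks in the kernel until signaled
                Console.WriteLine("Worker: signaled, doing the work.");
            });
            worker.Start();

            Thread.Sleep(500);                     // pretend to do something else
            _workReady.Set();                      // wakes exactly one waiting thread
            worker.Join();                         // Thread.Join is itself a wait on the thread object
        }
    }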

Page 29: Multithreading Fundamentals

@gfraiteur

Windows Kernel: SpinWait: smart busy loops

1. Thread.SpinWait (Hardware awareness, e.g. Intel HyperThreading)

2. Thread.Sleep(0) (give up to threads of same priority)

3. Thread.Yield() (give up to threads of any priority)

4. Thread.Sleep(1) (sleeps for 1 ms)

Processor 0:

  A = 1;
  B = 1;

Processor 1:

  SpinWait w = new SpinWait();
  int localB;
  if ( A == 1 )
  {
      while ( (localB = B) == 0 )
          w.SpinOnce();
  }

Page 30: Multithreading Fundamentals

@gfraiteur

.NET Framework
Application Programming

Page 31: Multithreading Fundamentals

@gfraiteur

Application Programming: Design Objectives

• Ease of use
• High performance

…from applications!

Page 32: Multithreading Fundamentals

@gfraiteur

Application Programming: Instruments from the platform

Hardware:
• Interlocked
• SpinWait
• Volatile

Operating System:
• Thread
• WaitHandle (mutex, event, semaphore, …)
• Timer

Page 33: Multithreading Fundamentals

@gfraiteur

Application Programming: New Instruments

Concurrency & Synchronization:
• Monitor (lock), SpinLock
• ReaderWriterLock
• Lazy, LazyInitializer
• Barrier, CountdownEvent

Data Structures:
• BlockingCollection
• Concurrent(Queue|Stack|Bag|Dictionary)
• ThreadLocal, ThreadStatic

Thread Dispatching:
• ThreadPool
• Dispatcher, ISynchronizeInvoke
• BackgroundWorker

Frameworks:
• PLINQ
• Tasks
• Async/Await

Page 34: Multithreading Fundamentals

@gfraiteur

Concurrency & Synchronization

Application Programming

Page 35: Multithreading Fundamentals

@gfraiteur

Concurrency & Synchronization: Monitor (the lock keyword)

1. Start with interlocked operations (no contention)

2. Continue with “spin wait”.

3. Create kernel event and wait

Good performance if low contention
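
A minimal sketch of the lock keyword protecting a shared counter (type and member names illustrative): the compiler expands the lock block into Monitor.Enter and Monitor.Exit in a try/finally, and Monitor itself escalates through the three stages above.

    using System;
    using System.Threading.Tasks;

    public class Counter
    {
        private readonly object _sync = new object();   // private object dedicated to locking
        private int _value;

        public void Increment()
        {
            lock (_sync)    // Monitor.Enter(_sync) ... Monitor.Exit(_sync) in a try/finally
            {
                _value++;
            }
        }

        public int Value
        {
            get { lock (_sync) return _value; }
        }

        public static void Main()
        {
            var counter = new Counter();
            Parallel.For(0, 1000000, i => counter.Increment());
            Console.WriteLine(counter.Value);    // always 1000000
        }
    }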

Page 36: Multithreading Fundamentals

@gfraiteur

Concurrency & Synchronization: ReaderWriterLock

• Concurrent reads allowed

• Writes require exclusive access
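
A sketch of the pattern using ReaderWriterLockSlim, the lighter-weight successor to ReaderWriterLock (the cache class and its key/value types are illustrative): any number of readers may hold the read lock at once, while a writer waits for exclusive access.

    using System.Collections.Generic;
    using System.Threading;

    public class PriceCache
    {
        private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
        private readonly Dictionary<string, decimal> _prices = new Dictionary<string, decimal>();

        // Many concurrent readers are allowed.
        public bool TryGetPrice(string symbol, out decimal price)
        {
            _lock.EnterReadLock();
            try { return _prices.TryGetValue(symbol, out price); }
            finally { _lock.ExitReadLock(); }
        }

        // A write requires exclusive access: it waits until all readers have left.
        public void SetPrice(string symbol, decimal price)
        {
            _lock.EnterWriteLock();
            try { _prices[symbol] = price; }
            finally { _lock.ExitWriteLock(); }
        }
    }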

Page 37: Multithreading Fundamentals

@gfraiteur

Concurrency & Synchronization
Lab: ReaderWriterLock

Page 38: Multithreading Fundamentals

@gfraiteur

Data Structures
Application Programming

Page 39: Multithreading Fundamentals

@gfraiteur

Data Structures: BlockingCollection

• Use case: pass messages between threads
• TryTake: wait for an item and take it
• Add: wait for free capacity (if bounded) and add the item to the queue
• CompleteAdding: no more items
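
A sketch of the message-passing use case with exactly these members (queue name, capacity and message type are illustrative): a producer adds items and then calls CompleteAdding; the consumer loops on TryTake, which returns false once the collection is complete and empty.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    static class BlockingCollectionDemo
    {
        static void Main()
        {
            // Bounded to 100 items: Add blocks while the queue is full.
            var messages = new BlockingCollection<string>(100);

            var consumer = Task.Run(() =>
            {
                string message;
                // Waits for an item; returns false after CompleteAdding once the queue is empty.
                while (messages.TryTake(out message, Timeout.Infinite))
                    Console.WriteLine("Consumed: " + message);
            });

            for (int i = 0; i < 10; i++)
                messages.Add("message " + i);   // waits for free capacity if the queue is full

            messages.CompleteAdding();          // no more items
            consumer.Wait();
        }
    }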

Page 40: Multithreading Fundamentals

@gfraiteur

Data Structures
Lab: BlockingCollection

Page 41: Multithreading Fundamentals

@gfraiteur

Data Structures: ConcurrentStack

[Diagram: producers A, B, C and consumers A, B, C all operate on the top node of a linked list of nodes.]

Consumers and producers compete for the top of the stack.

Page 42: Multithreading Fundamentals

@gfraiteur

Data Structures: ConcurrentQueue

[Diagram: producers A, B, C enqueue at the tail node of a linked list of nodes; consumers A, B, C dequeue from the head node.]

Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.
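
A sketch of the non-blocking Try pattern these collections share (names and counts illustrative): TryDequeue never waits, it simply returns false when the queue happened to be empty at that instant.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    static class ConcurrentQueueDemo
    {
        static void Main()
        {
            var queue = new ConcurrentQueue<int>();

            // Several producers compete for the tail...
            Parallel.For(0, 1000, i => queue.Enqueue(i));

            // ...and several consumers compete for the head.
            long sum = 0;
            Parallel.For(0, 4, worker =>
            {
                int item;
                while (queue.TryDequeue(out item))      // false when the queue is empty
                    Interlocked.Add(ref sum, item);
            });

            Console.WriteLine(sum);                     // 0 + 1 + ... + 999 = 499500
        }
    }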

Page 43: Multithreading Fundamentals

@gfraiteur

Data Structures: ConcurrentBag

[Diagram: the bag keeps its nodes in per-thread local storage; producer A and consumer A work primarily on thread local storage A, producer B and consumer B on thread local storage B.]

• Threads preferentially consume nodes they produced themselves.
• No ordering

Page 44: Multithreading Fundamentals

@gfraiteur

Data Structures: ConcurrentDictionary

• TryAdd

• TryUpdate (CompareExchange)

• TryRemove

• AddOrUpdate

• GetOrAdd
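
A sketch of GetOrAdd and AddOrUpdate on a word-count map (names and input data illustrative). Because updates are applied optimistically (TryUpdate is listed above as a CompareExchange), the update delegate can be invoked more than once for the same key and should therefore be side-effect free.

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    static class ConcurrentDictionaryDemo
    {
        static void Main()
        {
            var counts = new ConcurrentDictionary<string, int>();
            string[] words = { "lock", "thread", "lock", "cache", "thread", "lock" };

            Parallel.ForEach(words, word =>
            {
                // Add 1 for a new word, or atomically bump the existing count.
                counts.AddOrUpdate(word, 1, (key, current) => current + 1);
            });

            foreach (var pair in counts)
                Console.WriteLine("{0}: {1}", pair.Key, pair.Value);

            // GetOrAdd returns the existing value, or adds (and returns) the given one.
            int lockCount = counts.GetOrAdd("lock", 0);
            Console.WriteLine("\"lock\" appears {0} times", lockCount);
        }
    }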

Page 45: Multithreading Fundamentals

@gfraiteur

Thread Dispatching
Application Programming

Page 46: Multithreading Fundamentals

@gfraiteur

Thread Dispatching: ThreadPool

• Problems solved:

• Execute a task asynchronously (QueueUserWorkItem)

• Waiting for WaitHandle (RegisterWaitForSingleObject)

• Creating a thread is expensive

• Good for I/O intensive operations

• Bad for CPU intensive operations
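
A sketch of the two problems listed above (callback bodies and timeouts illustrative): QueueUserWorkItem hands a short task to a pool thread, and RegisterWaitForSingleObject lets the pool wait on a WaitHandle instead of blocking a dedicated thread.

    using System;
    using System.Threading;

    static class ThreadPoolDemo
    {
        static void Main()
        {
            // 1. Execute a task asynchronously on a pool thread.
            ThreadPool.QueueUserWorkItem(state =>
                Console.WriteLine("Work item on pool thread {0}",
                                  Thread.CurrentThread.ManagedThreadId));

            // 2. Let the pool wait on a WaitHandle instead of blocking our own thread.
            var signal = new ManualResetEvent(false);
            ThreadPool.RegisterWaitForSingleObject(
                signal,
                (state, timedOut) => Console.WriteLine("Signaled (timed out: {0})", timedOut),
                null,       // state passed to the callback
                5000,       // timeout in milliseconds
                true);      // execute the callback only once

            signal.Set();           // fires the registered callback on a pool thread
            Thread.Sleep(500);      // give the pool callbacks time to run before the process exits
        }
    }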

Page 47: Multithreading Fundamentals

@gfraiteur

Thread Dispatching: Dispatcher

• Execute a task on the UI thread

• Synchronously: Dispatcher.Invoke

• Asynchronously: Dispatcher.BeginInvoke

• Under the hood, uses the HWND message loop
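
A minimal WPF-flavoured sketch (it assumes a Window whose XAML defines a TextBlock named statusText; that control name is invented for the example): a background thread finishes some work and marshals the UI update back through the window’s Dispatcher.

    using System;
    using System.Threading;
    using System.Windows;

    public partial class MainWindow : Window
    {
        private void StartWork()
        {
            new Thread(() =>
            {
                Thread.Sleep(2000);   // simulate background work off the UI thread

                // Synchronous: blocks this thread until the UI thread has executed the delegate.
                Dispatcher.Invoke(new Action(() => statusText.Text = "Done (Invoke)"));

                // Asynchronous: posts the delegate to the UI thread's message loop and returns.
                Dispatcher.BeginInvoke(new Action(() => statusText.Text = "Done (BeginInvoke)"));
            }) { IsBackground = true }.Start();
        }
    }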

Page 48: Multithreading Fundamentals

@gfraiteur

Thread Dispatching
Lab: background UI

Page 49: Multithreading Fundamentals

@gfraiteur

Q&A

Gaël Fraiteur
[email protected]
@gfraiteur

Page 50: Multithreading Fundamentals

@gfraiteur

Summary

Page 51: Multithreading Fundamentals

@gfraiteur

http://www.postsharp.net/
[email protected]