10/20/2006 ELEG652-06F 1
Topic 5
Synchronization and Costs for Shared Memory
“... You will be assimilated. Resistance is futile.” (Star Trek)
Synchronization
• The orchestration of two or more threads (or processes) to complete a task in a correct manner and avoid any data races
• Data Race or Race Condition
  – "An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write"
• Achieved via atomicity and / or serializability
Atomicity
• Atomic: from the Greek "atomos," which means indivisible
• An "all or none" scheme
• An instruction (or a group of them) will appear as if it was (they were) executed in a single step
  – All side effects of the instruction(s) in the block are seen in their totality or not at all
• Side effects: writes and (causal) reads of the variables inside the atomic block
Atomicity
• Word-aligned loads and stores are atomic in almost all architectures
• Unaligned and bigger-than-word accesses are usually not atomic
• What happens when a non-atomic operation goes wrong?
  – The final result will be a garbled combination of values
  – Complete operations might be lost in the process
• Strong versus weak atomicity
Synchronization
• Applied to shared variables
• Synchronization may or may not enforce ordering
• High-level synchronization types
  – Semaphores
  – Mutexes
  – Barriers
  – Critical Sections
  – Monitors
  – Condition Variables
Semaphores
• Intelligent counters of resources
  – Zero means not available
• An abstract data type with two operations
  – P (Dutch "probeer te verlagen": "try to decrease"): waits (busy-waits or sleeps) if the resource is not available
  – V (Dutch "verhoog": "increase"): frees the resource
• Binary vs. Blocking vs. Counting semaphores
  – Binary: the initial value will allow threads to obtain it
  – Blocking: the initial value will block the threads
  – Counting: the initial value is not zero
• Note: P and V are atomic operations!
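As a concrete illustration (not from the original slides), here is a minimal Python sketch of P and V using the standard library's counting semaphore; the non-blocking probes make the counter behavior visible:

```python
import threading

# A counting semaphore initialized with 2 resources.
sem = threading.Semaphore(2)

assert sem.acquire(blocking=False)      # P: take resource 1
assert sem.acquire(blocking=False)      # P: take resource 2
assert not sem.acquire(blocking=False)  # count is 0: P would have to wait
sem.release()                           # V: free one resource
assert sem.acquire(blocking=False)      # P succeeds again
```

With `blocking=True` (the default), a failed P would sleep instead of returning False, which matches the "waits if the resource is not available" behavior above.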
Mutex
• Mutual Exclusion Lock
• A binary semaphore used to ensure that one thread (and only one) will access the resource
  – P: lock the mutex
  – V: unlock the mutex
• It doesn't enforce ordering
• Fine vs. coarse grained
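A small Python sketch (not from the slides) of why the mutex matters: without the lock, concurrent `counter += 1` updates can be lost; with it, every update survives:

```python
import threading

counter = 0
mutex = threading.Lock()

def worker():
    global counter
    for _ in range(50_000):
        with mutex:        # P: lock -- only one thread inside at a time
            counter += 1   # critical section
        # V: unlock happens automatically when leaving the "with" block

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 200_000  # no increments were lost
```

Note that the mutex guarantees exclusion but not ordering: which of the four threads enters next is unspecified.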
Barriers
• A high-level programming construct
• Ensures that all participating threads will wait at a program point for all other (participating) threads to arrive before they can continue
• Types of barriers
  – Tree Barriers (software assisted)
  – Centralized Barriers
  – Tournament Barriers
  – Fine-grained Barriers
  – Butterfly-style Barriers
  – Consistency Barriers (e.g. #pragma omp flush)
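The waiting behavior can be demonstrated with Python's built-in barrier (an illustration added here, not part of the slides): no thread passes the barrier until every participant has arrived.

```python
import threading

barrier = threading.Barrier(3)   # 3 participating threads
log = []                         # list.append is thread-safe in CPython

def worker(name):
    log.append((name, "arrived"))
    barrier.wait()               # block until all 3 threads call wait()
    log.append((name, "resumed"))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()

# Every "arrived" entry precedes every "resumed" entry.
tags = [tag for _, tag in log]
assert tags == ["arrived"] * 3 + ["resumed"] * 3
```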
Critical Sections
• A piece of code that is executed by one and only one thread at any point in time
• If thread T1 finds the critical section in use, it waits until the critical section is free
• Special case:
  – Conditional critical sections: threads wait on a "given" signal to resume execution
  – Better implemented with lock-free techniques (e.g. Transactional Memory)
Monitors and Condition Variables

• A monitor consists of:
  – A set of procedures that work on shared variables
  – A set of shared variables
  – An invariant
  – A lock to protect the state from access by other threads
• Condition Variables
  – Tied to the monitor's invariant (but they can be used in other schemes)
  – A signal placeholder for other threads' activities
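A minimal monitor sketch in Python (an illustration added here; the class name `Cell` is invented for the example): a lock protects the shared state, and a condition variable lets a consumer sleep until a producer signals.

```python
import threading

class Cell:
    """A one-slot monitor: shared state + lock + condition variable."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ready = threading.Condition(self._lock)
        self._value = None            # the shared variable

    def put(self, v):
        with self._lock:              # monitor procedures hold the lock
            self._value = v
            self._ready.notify()      # signal a waiting thread

    def take(self):
        with self._lock:
            while self._value is None:  # re-check: guards against spurious wakeups
                self._ready.wait()      # releases the lock while sleeping
            v, self._value = self._value, None
            return v

cell = Cell()
result = []
t = threading.Thread(target=lambda: result.append(cell.take()))
t.start()        # consumer blocks inside take() ...
cell.put(42)     # ... until the producer signals
t.join()
assert result == [42]
```

The `while` (rather than `if`) around `wait()` is the standard monitor idiom: the invariant is re-checked after every wakeup.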
Much More …
• However, all of these are abstractions
• Major elements
  – A synchronization element that ensures atomicity
    • Locks!
  – A synchronization element that ensures ordering
    • Barriers!
• Implementations and types
  – Common types of atomic primitives
  – Read-Modify-Write cycles
• Synchronization overhead may break a system
  – Unnecessary consistency actions
  – Communication cost between threads
• Why do Distributed Memory Machines have "implicit" synchronization?
Topic 5a
Locks
Implementation
• Atomic Primitives
  – Fetch and Φ operations
    • Read-Modify-Write cycles
    • Test and Set
    • Fetch and Store
      – Exchange a register and a memory location
    • Fetch and Add
    • Compare and Swap
      – Conditionally exchange the value of a memory location
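To make the Fetch and Φ family concrete, here is a Python emulation (added for illustration; real CPUs execute each of these as one indivisible instruction, whereas here a hidden lock plays that role):

```python
import threading

class AtomicCell:
    """Software emulation of hardware read-modify-write primitives."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def fetch_and_add(self, delta):
        with self._lock:
            old, self._value = self._value, self._value + delta
            return old                  # Fetch and Add returns the OLD value

    def fetch_and_store(self, new):
        with self._lock:                # atomic exchange (Fetch and Store)
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._lock:                # write only if the value matches
            if self._value == expected:
                self._value = new
                return True
            return False

c = AtomicCell(5)
assert c.fetch_and_add(3) == 5            # value is now 8
assert c.fetch_and_store(10) == 8         # value is now 10
assert c.compare_and_swap(10, 0) is True  # value is now 0
assert c.compare_and_swap(10, 99) is False  # no match: value stays 0
```

Test and Set is the special case `fetch_and_store(1)` on a Boolean cell.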
Implementation
• Used by programmers to implement more complex synchronization constructs
• Waiting behavior
  – Scheduler based: the process / thread is de-scheduled and will be scheduled again at a future time
  – Busy wait: the process / thread polls the resource until it is available
  – Dependent on the hardware / OS / scheduler behavior
Types of (Software) Locks: The Spin Lock Family

• The Simple Test-and-Set Lock
  – Polls a shared Boolean variable: a binary semaphore
  – Uses Fetch and Φ operations to operate on the binary semaphore
  – Expensive!
    • Wastes bandwidth
    • Generates extra bus transactions
• The test-and-test-and-set approach
  – Just poll (with ordinary reads) while the lock is in use; only attempt the atomic operation when it looks free
Types of (Software) Locks: The Spin Lock Family

• Delay-based Locks
  – Spin locks in which a delay has been introduced in testing the lock
  – Constant delay
  – Exponential back-off
    • Best results
  – The test-and-test-and-set scheme is not needed
Types of (Software) Locks: The Spin Lock Family

Pseudo code:

enum lock_t { UNLOCKED = 0, LOCKED = 1 };

void acquire_lock(lock_t *L) {
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {  // it was already held
        sleep(delay);
        delay *= 2;                              // exponential back-off
    }
}

void release_lock(lock_t *L) {
    *L = UNLOCKED;
}
Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only one per lock acquisition
• Strongly fair lock
  – No starvation
  – A FIFO service
• Implementation: two counters
  – A Request counter and a Release counter
Types of (Software) Locks: The Ticket Lock

Walkthrough (five threads T1-T5; a Request counter and a Release counter):

1. Request = 0, Release = 0: T1 acquires the lock
2. Request = 1, Release = 0: T2 requests the lock
3. Request = 2, Release = 0: T3 requests the lock
4. Request = 3, Release = 1: T1 releases the lock, T2 gets the lock, T4 requests the lock
5. Request = 4, Release = 1: T5 requests the lock
6. Request = 5, Release = 1: T1 requests the lock again
7. Request = 5, Release = 2: T2 releases the lock, T3 acquires the lock
Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only read operations on the release counter while waiting
• However, a lot of memory and network bandwidth is still wasted
• Back-off techniques are also used
  – Exponential back-off
    • A bad idea here
  – Constant delay
    • Set to the minimum time a lock is held
  – Proportional back-off
    • Dependent on how many threads are waiting for the lock
Types of (Software) Locks: The Ticket Lock

Pseudocode:

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock() {
    unsigned int my_ticket = fetch_and_increment(&next_ticket);
    while (1) {
        sleep(my_ticket - now_serving);   // proportional back-off
        if (now_serving == my_ticket) return;
    }
}

void release_lock() {
    now_serving = now_serving + 1;
}
Types of (Software) Locks: The Array-Based Queue Lock

• Problem with the ticket lock: contention on the release counter
• Cache coherence and memory traffic
  – Invalidation of the counter variable, and all requests hit a single memory bank
• Two elements
  – An array and a tail pointer that indexes it
  – The array is as big as the number of processors
  – Fetch and store: address of the array element
  – Fetch and increment: tail pointer
• FIFO ordering
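A Python sketch of the array-based (Anderson-style) queue lock, added for illustration; `_hw` again emulates the atomic tail increment, and the array must have at least as many slots as there are contending threads:

```python
import threading

class ArrayQueueLock:
    """Array-based queue lock: each waiter spins on its own slot."""
    def __init__(self, nslots):
        self._hw = threading.Lock()   # emulates atomic fetch-and-increment
        self.flags = [True] + [False] * (nslots - 1)  # slot 0 starts enabled
        self.tail = 0
        self.nslots = nslots

    def _fetch_and_increment_tail(self):
        with self._hw:
            t, self.tail = self.tail, self.tail + 1
            return t

    def acquire(self):
        slot = self._fetch_and_increment_tail() % self.nslots
        while not self.flags[slot]:   # spin on a private slot, not a shared counter
            pass
        self.flags[slot] = False      # consume the permission
        return slot

    def release(self, slot):
        self.flags[(slot + 1) % self.nslots] = True  # enable the successor

lock = ArrayQueueLock(4)   # 4 slots >= 3 threads
counter = 0

def worker():
    global counter
    for _ in range(30):
        slot = lock.acquire()
        counter += 1
        lock.release(slot)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 90
```

Because each waiter polls a different array element, a release invalidates only the successor's slot rather than a counter that everyone is reading.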
Types of (Software) Locks: The Array-Based Queue Lock

Walkthrough (five processors T1-T5; one array slot per processor and a tail pointer indexing the array):

1. Initial state: the tail pointer points to the beginning of the array; every array element except the first is marked "wait"
2. T1 gets the lock (its slot reads "enter")
3. T2 requests: fetch-and-increment advances the tail to the next slot, where T2 spins
4. T3 requests: the tail advances again
5. T1 releases; the next slot is set to "enter" and T2 gets the lock
6. T4 requests: the tail advances
7. T1 requests again: the tail advances
8. T2 releases; T3 gets the lock
Types of (Software) Locks: The Queue Locks

• They use too much memory
  – Linear space (relative to the number of processors) per lock
• Array
  – Easy to implement
• Linked list (QNODE)
  – Better cache management
Types of (Software) Locks: The MCS Lock

• Characteristics
  – FIFO ordering
  – Spins on locally accessible flag variables
  – Small amount of space per lock
  – Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  – QNODEs are assigned to local memory
  – Threads spin on local memory
MCS: How Does It Work?

• Each processor enqueues its own private lock variable into a queue and spins on it
  – Key: spin locally
    • CC model: spin in the local cache
    • DSM model: spin in local private memory
  – No contention
• On lock release, the releaser unlocks the next lock in the queue
  – Bus/network contention occurs only on the actual unlock
  – No starvation (the order of lock acquisitions is defined by the list)
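The description above can be sketched in Python (added for illustration; the `_hw` lock emulates the atomic fetch-and-store and compare-and-swap on the tail pointer, which a real MCS lock gets from hardware):

```python
import threading

class QNode:
    def __init__(self):
        self.locked = False   # the flag this thread will spin on
        self.next = None

class MCSLock:
    def __init__(self):
        self._hw = threading.Lock()   # emulates hardware atomicity on tail
        self.tail = None

    def _swap_tail(self, node):          # fetch-and-store on tail
        with self._hw:
            prev, self.tail = self.tail, node
            return prev

    def _cas_tail(self, expected, new):  # compare-and-swap on tail
        with self._hw:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        prev = self._swap_tail(node)
        if prev is not None:          # someone is ahead of us
            node.locked = True        # set our flag BEFORE linking in
            prev.next = node
            while node.locked:        # spin on OUR OWN node only
                pass

    def release(self, node):
        if node.next is None:
            if self._cas_tail(node, None):  # no successor: lock is free
                return
            while node.next is None:        # a successor is mid-enqueue
                pass
        node.next.locked = False            # hand the lock to the successor

lock = MCSLock()
counter = 0

def worker():
    global counter
    node = QNode()                # each thread's private queue node
    for _ in range(30):
        lock.acquire(node)
        counter += 1
        lock.release(node)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 90
```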
MCS Lock
• Requires an atomic instruction:
  – compare-and-swap
  – fetch-and-store
• If there is no compare-and-swap
  – an alternative release algorithm exists, with:
    • extra complexity
    • loss of strict FIFO ordering
    • a theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
MCS: Example
Four CPUs, each with a QNODE holding a flag and a next pointer, plus a shared tail pointer:

• Init: the tail pointer is null
• Proc 1 gets the lock: the tail points to CPU 1's node
• Proc 2 tries: it enqueues its node (flag = 1) behind CPU 1's node and spins

• CPU 1 holds the "real" lock
• CPU 2, CPU 3 and CPU 4 spin on their own flags
• When CPU 1 releases, it releases the lock and changes the flag variable of the next node in the list
Implementation: Modern Alternatives

• Fetch and Φ operations
  – They are restrictive
  – Not all architectures support all of them
• Problem: one general atomic operation is hard to implement!
• Solution: provide two primitives from which atomic operations can be built
• Load Linked and Store Conditional
  – Remember the PowerPC lwarx and stwcx instructions
An Example: Swap

try: mov  R3, R4      ; copy the new value
     ld   R2, 0(R1)   ; load the old value from memory
     st   R3, 0(R1)   ; store the new value
     mov  R4, R2      ; return the old value in R4

Exchange the contents of register R4 with the memory location pointed to by R1.

Not atomic! Another processor can write to 0(R1) between the load and the store.
An Example: Atomic Swap

try: mov  R3, R4      ; copy the new value
     ll   R2, 0(R1)   ; load linked: load and set the reservation
     sc   R3, 0(R1)   ; store conditional: fails if the reservation is lost
     beqz R3, try     ; retry if the store conditional failed
     mov  R4, R2      ; return the old value in R4

Swap (Fetch and Store) using ll and sc.

If another processor writes to the location pointed to by R1 before the sc can complete, the reservation (usually kept in a register) is lost. This means the sc will fail and the code will loop back and try again.
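A rough Python emulation of the ll/sc retry pattern (added for illustration; real reservations live in hardware, so here a version stamp that is bumped on every successful store stands in for the reservation):

```python
import threading

class LLSCCell:
    """Emulated load-linked / store-conditional memory cell."""
    def __init__(self, value=0):
        self._hw = threading.Lock()
        self._value = value
        self._stamp = 0          # bumped on every successful store

    def load_linked(self):
        with self._hw:
            return self._value, self._stamp   # value + "reservation"

    def store_conditional(self, new, stamp):
        with self._hw:
            if self._stamp != stamp:
                return False     # reservation lost: someone stored meanwhile
            self._value = new
            self._stamp += 1
            return True

def atomic_swap(cell, new):
    """Fetch-and-store built from ll/sc: retry until the sc succeeds."""
    while True:
        old, stamp = cell.load_linked()
        if cell.store_conditional(new, stamp):
            return old

cell = LLSCCell(7)
assert atomic_swap(cell, 9) == 7   # returns the old value
assert atomic_swap(cell, 1) == 9
```

The same retry loop with `old + 1` instead of a fixed new value yields fetch-and-increment, mirroring the next slide.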
Another Example: Fetch and Increment, and a Spin Lock

Fetch and Increment using ll-sc:

try: ll   R2, 0(R1)
     addi R2, R2, #1
     sc   R2, 0(R1)
     beqz R2, try

Spin Lock using exch:

        li   R2, #1
lockit: exch R2, 0(R1)
        bnez R2, lockit

The exch instruction is equivalent to the Atomic Swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 = unlocked; 1 = locked.
Performance Penalty
Example

Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much). Determine the performance penalty.
Answer
It takes over 12,000 cycles total for all processors to pass through the lock!

Note the contention for the lock and the serialization of the bus transactions.

See the example on p. 596, Hennessy and Patterson, 3rd Ed.
Performance Penalty
• Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queue lock which only updates on a miss

Hennessy and Patterson, p. 603
Performance Penalty
• Answer:
  – First access: n + 1 bus transactions
  – Subsequent accesses: 2(n - 1) transactions
  – Total: 3n - 1
  – For n = 10: 29 bus transactions, or 2,900 clock cycles
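The arithmetic above can be checked directly (a small helper added for illustration; the formula is the one stated on the slide, with n = number of processors):

```python
def queue_lock_bus_transactions(n):
    first = n + 1              # initial acquisition (per the slide)
    subsequent = 2 * (n - 1)   # all later hand-offs together
    return first + subsequent  # = 3n - 1

assert queue_lock_bus_transactions(10) == 29
assert queue_lock_bus_transactions(10) * 100 == 2900  # clock cycles
```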
Implementing Locks Using Coherence
Spin lock with coherent caches (test-and-test-and-set):

lockit: ld   R2, 0(R1)    ; spin by reading the cached copy
        bnez R2, lockit   ; lock still held?
        li   R2, #1
        exch R2, 0(R1)    ; try the atomic swap
        bnez R2, lockit   ; lost the race: spin again

Cache coherence steps for three processors contending for the lock:

Step | P0            | P1                                        | P2                                        | Lock state     | Bus activity
1    | Has lock      | Spins                                     | Spins                                     | Shared         | None
2    | Sets lock = 0 | Invalidate received                       | Invalidate received                       | Exclusive      | Write invalidate from P0
3    |               | Cache miss                                | Cache miss                                | Shared         | Write-back from P0
4    |               | Waits                                     | Reads lock = 0                            | Shared         | Cache miss for P2 satisfied
5    |               | Reads lock = 0                            | Executes swap, gets cache miss            | Shared         | Cache miss for P1 satisfied
6    |               | Executes swap, gets cache miss            | Swap completes: returns 0 and sets lock = 1 | Exclusive (P2) | Invalidate from P2
7    |               | Swap completes: returns 1 and sets lock = 1 | Enters critical section                 | Exclusive (P1) | Write-back from P2
8    |               | Spins, testing if lock = 0                |                                           |                | None

The same lock with ll-sc:

lockit: ll   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        sc   R2, 0(R1)
        beqz R2, lockit
Some Graphs

Performance of spin locks on a sixty-processor Butterfly, showing the increase in network latency. The x-axis represents processors and the y-axis represents time in microseconds.

Extracted from "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." John M. Mellor-Crummey and Michael L. Scott. January 1991.
Topic 5b
Barriers
The Barrier Construct
• The idea behind software barriers
  – A program point at which all participating threads wait for each other before continuing
• Difficulty
  – Overhead of synchronizing the threads
  – Network and memory bandwidth issues
• Implementation
  – Centralized
    • Simple to implement with locks
  – Tree based
    • Better with bandwidth
Centralized Barriers
• A barrier in which all threads / processors wait for each other "serially"
• Typical implementation: a lock, a counter, and a release flag
  – The counter keeps a tally of the arrived threads
  – The flag is what waiting threads spin on until everyone arrives
• A thread arrives at the barrier and increments the counter by one (atomically)
• It then checks whether it is the last one
  – If it isn't, it waits
  – If it is, it unblocks (awakes) the rest of the threads
Centralized Barrier: Pseudo Code

int count = 0;
int sense = 1;

void central_barrier() {
    lock(L);
    if (count == 0) sense = 0;
    count++;
    unlock(L);
    if (count == PROCESSORS) {
        sense = 1;
        count = 0;
    } else
        spin_until(sense == 1);
}

It may deadlock or malfunction when the barrier is reused (see the next slide).
Centralized Barrier
Failure scenario with three threads (T1, T2, T3) and two consecutive barriers separated by work:

1. T1 arrives at barrier 1, increments count and spins
2. T2 arrives at barrier 1, increments count and spins
3. T3 arrives at barrier 1, increments count and changes sense
4. T3 is delayed before resetting count, while T1 does its work
5. T1 reaches barrier 2, increments count, and is then delayed
6. T3 resumes and resets count, losing T1's increment
7. T2 and T3 arrive at barrier 2 and spin forever
Centralized Barrier: Pseudo Code: Sense-Reversing Barrier

int count = 0;
int sense = 1;

void central_barrier() {
    static int local_sense = 1;   // one private copy per thread
    local_sense = !local_sense;
    lock(L);
    count++;
    if (count == PROCESSORS) {
        count = 0;
        sense = local_sense;
    }
    unlock(L);
    spin_until(sense == local_sense);
}

It waits correctly because the spin target distinguishes barrier episodes: a thread's local_sense for the current barrier differs from the sense left over from the previous barrier, so resetting count can never strand a thread.
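A runnable Python version of the sense-reversing barrier (added for illustration; `threading.local` provides the per-thread `local_sense` that the pseudocode's "static" variable is meant to be):

```python
import threading

class SenseBarrier:
    """Sense-reversing centralized barrier (safe for repeated reuse)."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = True
        self.lock = threading.Lock()
        self.local = threading.local()   # per-thread local_sense

    def wait(self):
        if not hasattr(self.local, "sense"):
            self.local.sense = True
        self.local.sense = not self.local.sense
        with self.lock:
            self.count += 1
            last = (self.count == self.n)
            if last:
                self.count = 0                 # safe: nobody spins on count
                self.sense = self.local.sense  # releases the other threads
        if not last:
            while self.sense != self.local.sense:  # spin on the global sense
                pass

PHASES, N = 3, 3
b = SenseBarrier(N)
log = []

def worker():
    for phase in range(PHASES):
        log.append(phase)   # record which phase this thread is in
        b.wait()

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads: t.start()
for t in threads: t.join()

# The barrier keeps phases strictly separated: N zeros, then N ones, ...
assert log == [p for p in range(PHASES) for _ in range(N)]
```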
Centralized Barrier
Performance

Suppose there are 10 processors on a bus that each try to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from the barrier, and exit the barrier. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take?

Hennessy and Patterson, p. 598
Centralized Barrier
• Steps through the barrier (assume an ll-sc lock is used); for the ith processor:
  – LL the lock: i times
  – SC the lock: i times
  – Load count: 1 time
  – LL the lock again: i - 1 times
  – Store count: 1 time
  – Store lock: 1 time
  – Load sense: 2 times
• Total transactions for the ith processor: 3i + 4
• Total over n processors: (3n^2 + 11n)/2 - 1
• For n = 10: 204 bus transactions, or 20,400 clock cycles
Tree Type Barriers
• The software combining tree barrier
  – A single shared variable becomes a tree of accesses
  – Each parent node combines the results of its children
  – A group of processors per leaf
  – The last processor to arrive updates the leaf and then moves up
  – A two-pass scheme:
    • Bottom-up: update counts
    • Top-down: update sense and resume
  – Objective
    • Reduces memory contention
  – Disadvantage
    • Spins on memory locations whose positions cannot be statically determined
Tree Type Barriers
• Butterfly Barrier
  – Based on the butterfly network scheme for broadcasting and reduction
  – Pairwise synchronizations
    • At step k: processor i signals processor i xor 2^k
  – If the number of processors is not a power of two, existing processors stand in for the missing ones
  – Max synchronizations: 2 * floor(log2 P)

(The original slide shows rounds R0 to R3 of this pairwise signalling among eight processors.)
Tree Type Barriers
• Dissemination Barrier
  – Similar to the butterfly barrier but with fewer synchronization operations: ceil(log2 P) rounds
  – At step k: processor i signals processor (i + 2^k) mod P
  – Advantages:
    • The flags on which each processor spins are statically assigned (better locality)
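The signalling pattern above can be tabulated directly (a small helper added for illustration, listing each round's partner for every processor):

```python
import math

def dissemination_rounds(P):
    """Partner signalled by each processor in each round, for P processors."""
    rounds = []
    for k in range(math.ceil(math.log2(P))):
        rounds.append([(i + 2**k) % P for i in range(P)])
    return rounds

# With P = 5 processors, ceil(log2 5) = 3 rounds suffice.
r = dissemination_rounds(5)
assert len(r) == 3
assert r[0] == [1, 2, 3, 4, 0]   # round 0: processor i signals (i + 1) mod 5
assert r[1] == [2, 3, 4, 0, 1]   # round 1: processor i signals (i + 2) mod 5
assert r[2] == [4, 0, 1, 2, 3]   # round 2: processor i signals (i + 4) mod 5
```

Note how the pattern works for any P, not just powers of two, which is one advantage over the butterfly scheme.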
Tree Type Barriers
• Tournament Barriers
  – A tree-style barrier
  – A round of the tournament = a level of the tree
  – Winners are statically decided
    • No Fetch and Φ operations are needed
  – Processor i sets a flag awaited by processor j; processor i then drops out of the tournament and j continues
  – The final winner wakes all the others
  – Types
    • CREW (concurrent read, exclusive write): a global variable is used to signal back
    • EREW (exclusive read, exclusive write): separate flags on which each processor spins
Bibliography
• Hennessy, John; Patterson, David. Computer Architecture: A Quantitative Approach, 3rd Ed. "Chapter 6: Multiprocessors and Thread-Level Parallelism."
• Mellor-Crummey, John; Scott, Michael. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." January 1991.