
10/20/2006 ELEG652-06F 1

Topic 5

Synchronization and Costs for Shared Memory

“... You will be assimilated. Resistance is futile.” Star Trek

Synchronization

• The orchestration of two or more threads (or processes) to complete a task correctly and avoid any data races
• Data Race or Race Condition
  – “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write”
• Atomicity and/or serializability

Atomicity

• Atomic: from the Greek “atomos”, which means indivisible
• An “all or none” scheme
• An instruction (or a group of them) will appear as if it was (they were) executed in a single try
  – All side effects of the instruction(s) in the block are seen in their totality or not at all
• Side effects: writes and (causal) reads to the variables inside the atomic block

Atomicity

• Word-aligned loads and stores are atomic in almost all architectures
• Unaligned and bigger-than-word accesses are usually not atomic
• What happens when non-atomic operations go wrong
  – The final result will be a garbled combination of values
  – Complete operations might be lost in the process
• Strong versus weak atomicity

Synchronization

• Applied to shared variables
• Synchronization might enforce ordering or not
• High-level synchronization types
  – Semaphores
  – Mutex
  – Barriers
  – Critical Sections
  – Monitors
  – Condition Variables

Semaphores

• Intelligent counters of resources
  – Zero means not available
• An abstract data type with two operations
  – P (probeer te verlagen, “try to decrease”): waits (busy-waits or sleeps) if the resource is not available
  – V (verhoog, “increase”): frees the resource
• Binary vs. blocking vs. counting semaphores
  – Binary: the initial value allows threads to obtain it
  – Blocking: the initial value blocks the threads
  – Counting: the initial value is not zero
• Note: P and V are atomic operations!
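The P and V operations can be sketched with C11 atomics. This is a minimal busy-waiting illustration, not a production semaphore; the names `sem_sketch_t`, `sem_sketch_try_p`, `sem_sketch_p` and `sem_sketch_v` are hypothetical and do not come from the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counting semaphore: the value is the number of free resources. */
typedef struct { atomic_int value; } sem_sketch_t;

void sem_sketch_init(sem_sketch_t *s, int initial) {
    atomic_init(&s->value, initial);
}

/* Non-blocking P: try to decrease; returns false if no resource is free. */
bool sem_sketch_try_p(sem_sketch_t *s) {
    int v = atomic_load(&s->value);
    while (v > 0) {
        /* On failure, v is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->value, &v, v - 1))
            return true;
    }
    return false;
}

/* P: busy-wait until a resource could be taken. */
void sem_sketch_p(sem_sketch_t *s) {
    while (!sem_sketch_try_p(s)) { /* spin */ }
}

/* V: increase, freeing one resource. */
void sem_sketch_v(sem_sketch_t *s) {
    atomic_fetch_add(&s->value, 1);
}
```

Initializing the value to 1 gives the binary case, and initializing it to 0 gives the blocking case described above.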

Mutex

• Mutual exclusion lock
• A binary semaphore that ensures that one thread (and only one) accesses the resource at a time
  – P: lock the mutex
  – V: unlock the mutex
• It doesn't enforce ordering
• Fine-grained vs. coarse-grained

Barriers

• A high-level programming construct
• Ensures that all participating threads wait at a program point for all other (participating) threads to arrive before they can continue
• Types of barriers
  – Tree barriers (software assisted)
  – Centralized barriers
  – Tournament barriers
  – Fine-grained barriers
  – Butterfly-style barriers
  – Consistency barriers (e.g. #pragma omp flush)

Critical Sections

• A piece of code that is executed by one and only one thread at any point in time
• If T1 finds the CS in use, it waits until the CS is free
• Special case:
  – Conditional critical sections: threads wait on a “given” signal to resume execution
  – Better implemented with lock-free techniques (e.g. Transactional Memory)

Monitors and Condition Variables

• A monitor consists of:
  – A set of procedures to work on shared variables
  – A set of shared variables
  – An invariant
  – A lock to protect from access by other threads
• Condition variables
  – The invariant in a monitor (but it can be used in other schemes)
  – A signal placeholder for other threads' activities

Much More …

• However, all of these are abstractions
• Major elements
  – A synchronization element that ensures atomicity
    • Locks!
  – A synchronization element that ensures ordering
    • Barriers!
• Implementations and types
  – Common types of atomic primitives
  – Read – Modify – Write Back cycles
• Synchronization overhead may break a system
  – Unnecessary consistency actions
  – Communication cost between threads
• Why do distributed memory machines have “implicit” synchronization?


Topic 5a

Locks

Implementation

• Atomic primitives
  – Fetch and Φ operations
    • Read – Modify – Write cycles
    • Test and Set
    • Fetch and Store
      – Exchange register and memory
    • Fetch and Add
    • Compare and Swap
      – Conditionally exchange the value of a memory location
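For readers who want to try these primitives, the following sketch shows one way they map onto the C11 `<stdatomic.h>` interface. The wrapper names (`prim_*`) are hypothetical; only the `atomic_*` calls are standard.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test and Set: set the flag and report whether it was already set. */
bool prim_test_and_set(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}

/* Fetch and Store: exchange a value with memory, return the old contents. */
int prim_fetch_and_store(atomic_int *m, int v) {
    return atomic_exchange(m, v);
}

/* Fetch and Add: add to memory, return the old contents. */
int prim_fetch_and_add(atomic_int *m, int d) {
    return atomic_fetch_add(m, d);
}

/* Compare and Swap: store desired only if memory still holds expected. */
bool prim_compare_and_swap(atomic_int *m, int expected, int desired) {
    return atomic_compare_exchange_strong(m, &expected, desired);
}
```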

Implementation

• Used by programmers to implement more complex synchronization constructs
• Waiting behavior
  – Scheduler based: the process/thread is descheduled and will be scheduled again at a future time
  – Busy wait: the process/thread polls on the resource until it is available
  – Dependent on the hardware / OS / scheduler behavior

Types of (Software) Locks: The Spin Lock Family

• The simple test-and-set lock
  – Polls a shared Boolean variable: a binary semaphore
  – Uses Fetch and Φ operations to operate on the binary semaphore
  – Expensive!
    • Wastes bandwidth
    • Generates extra bus transactions
  – The test-and-test-and-set approach
    • Just poll (read) while the lock is in use
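The test-and-test-and-set idea can be sketched in C11 as follows. This is an illustration under assumptions: `ttas_lock_t` and the `ttas_*` functions are hypothetical names, and default (sequentially consistent) atomics are used for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock_t;   /* hypothetical name */

void ttas_init(ttas_lock_t *l) { atomic_init(&l->locked, false); }

/* One attempt: returns true if the lock was acquired. */
bool ttas_try_acquire(ttas_lock_t *l) {
    /* "Test": a plain read first, which can be served from the local cache. */
    if (atomic_load(&l->locked)) return false;
    /* "Test and set": the expensive read-modify-write, only when it looked free. */
    return !atomic_exchange(&l->locked, true);
}

void ttas_acquire(ttas_lock_t *l) {
    while (!ttas_try_acquire(l)) { /* spin */ }
}

void ttas_release(ttas_lock_t *l) { atomic_store(&l->locked, false); }
```

Dropping the initial `atomic_load` turns this into the simple test-and-set lock, where every poll is a read-modify-write.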

Types of (Software) Locks: The Spin Lock Family

• Delay-based locks
  – Spin locks in which a delay has been introduced in testing the lock
  – Constant delay
  – Exponential back-off
    • Best results
  – The test-and-test-and-set scheme is not needed

Types of (Software) Locks: The Spin Lock Family

Pseudo code:

    enum lock_state { UNLOCKED, LOCKED };

    /* test_and_set(L, v): atomically store v in *L and return the old value */
    void acquire_lock(lock_t *L) {
        int delay = 1;
        while (test_and_set(L, LOCKED) == LOCKED) {  /* lock was already held */
            sleep(delay);
            delay *= 2;                              /* exponential back-off */
        }
    }

    void release_lock(lock_t *L) {
        *L = UNLOCKED;
    }

Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only one per lock acquisition
• Strongly fair lock
  – No starvation
• A FIFO service
• Implementation: two counters
  – A request counter and a release counter

Types of (Software) Locks: The Ticket Lock

Example with five threads (T1 to T5) sharing one lock:

    Request  Release  Event
    0        0        T1 acquires the lock
    1        0        T2 requests the lock
    2        0        T3 requests the lock
    3        1        T1 releases the lock; T2 gets the lock; T4 requests the lock
    4        1        T5 requests the lock
    5        1        T1 requests the lock
    5        2        T2 releases the lock; T3 acquires the lock

Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only read operations on the release counter
• However, a lot of memory and network bandwidth is still wasted
• Back-off techniques are also used
  – Exponential back-off
    • A bad idea
  – Constant delay
    • Minimum time of holding a lock
  – Proportional back-off
    • Dependent on how many threads are waiting for the lock

Types of (Software) Locks: The Ticket Lock

Pseudocode:

    unsigned int next_ticket = 0;
    unsigned int now_serving = 0;

    void acquire_lock() {
        /* atomically take the next ticket; returns the old counter value */
        unsigned int my_ticket = fetch_and_increment(&next_ticket);
        while (1) {
            /* proportional back-off: wait longer the further back in line we are */
            sleep(my_ticket - now_serving);
            if (now_serving == my_ticket) return;
        }
    }

    void release_lock() {
        now_serving = now_serving + 1;
    }
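The ticket lock pseudocode can be turned into compilable C11 roughly as follows. This is a sketch under assumptions: the type and function names (`ticket_lock_t`, `ticket_*`) are hypothetical, the back-off delay is left out, and taking a ticket is split from waiting so that each step can be observed separately.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock_t;

void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* The single fetch-and-increment performed per acquisition. */
unsigned ticket_take(ticket_lock_t *l) {
    return atomic_fetch_add(&l->next_ticket, 1);
}

/* Read-only check on the release counter. */
bool ticket_is_served(ticket_lock_t *l, unsigned my_ticket) {
    return atomic_load(&l->now_serving) == my_ticket;
}

void ticket_acquire(ticket_lock_t *l) {
    unsigned my_ticket = ticket_take(l);
    while (!ticket_is_served(l, my_ticket)) { /* spin; a back-off delay could go here */ }
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);
}
```

Because tickets are handed out in increasing order and served in the same order, the FIFO property falls out of the two counters alone.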

Types of (Software) Locks: The Array Based Queue Lock

• Contention on the release counter
• Cache coherence and memory traffic
  – Invalidation of the counter variable and the requests to a single memory bank
• Two elements
  – An array and a tail pointer that indexes the array
  – The array is as big as the number of processors
  – Fetch and store: address of the array element
  – Fetch and increment: tail pointer
• FIFO ordering
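A minimal sketch of an array-based queue lock in C11 is shown below. It assumes five slots and hypothetical names (`array_lock_t`, `array_lock_*`); it ignores counter wrap-around and does not pad the slots to separate cache lines, which a real implementation would want.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS 5   /* assumed number of processors */

typedef struct {
    atomic_bool can_enter[NSLOTS];  /* true = "Enter", false = "Wait" */
    atomic_uint tail;               /* next slot to hand out */
} array_lock_t;

void array_lock_init(array_lock_t *l) {
    /* Only the first slot starts out as "Enter". */
    for (int i = 0; i < NSLOTS; i++) atomic_init(&l->can_enter[i], i == 0);
    atomic_init(&l->tail, 0);
}

/* Fetch-and-increment on the tail picks a private slot. */
unsigned array_lock_enqueue(array_lock_t *l) {
    return atomic_fetch_add(&l->tail, 1) % NSLOTS;
}

bool array_lock_may_enter(array_lock_t *l, unsigned slot) {
    return atomic_load(&l->can_enter[slot]);
}

/* Returns the slot, which the caller passes back to release. */
unsigned array_lock_acquire(array_lock_t *l) {
    unsigned slot = array_lock_enqueue(l);
    while (!array_lock_may_enter(l, slot)) { /* spin on own slot only */ }
    return slot;
}

void array_lock_release(array_lock_t *l, unsigned slot) {
    atomic_store(&l->can_enter[slot], false);                /* own slot back to "Wait" */
    atomic_store(&l->can_enter[(slot + 1) % NSLOTS], true);  /* hand over to the next slot */
}
```

Each waiter polls a different array element, so a release invalidates one waiter's location instead of a counter shared by all.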

Types of (Software) Locks: The Array Based Queue Lock

Example with five array slots (T1 to T5). The tail pointer initially points to the beginning of the array, and all array elements except the first one are marked to wait.

    Event                  T1     T2     T3     T4     T5
    Initial state          Enter  Wait   Wait   Wait   Wait
    T1 gets the lock       Enter  Wait   Wait   Wait   Wait
    T2 requests            Enter  Wait   Wait   Wait   Wait
    T3 requests            Enter  Wait   Wait   Wait   Wait
    T1 releases; T2 gets   Wait   Enter  Wait   Wait   Wait
    T4 requests            Wait   Enter  Wait   Wait   Wait
    T1 requests            Wait   Enter  Wait   Wait   Wait
    T2 releases; T3 gets   Wait   Wait   Enter  Wait   Wait

Types of (Software) Locks: The Queue Locks

• They use too much memory
  – Linear space (relative to the number of processors) per lock
• Array
  – Easy to implement
• Linked list: QNODE
  – Cache management

Types of (Software) Locks: The MCS Lock

• Characteristics
  – FIFO ordering
  – Spins on locally accessible flag variables
  – Small amount of space per lock
  – Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  – QNODEs are assigned to local memory
  – Threads spin on local memory

MCS: How does it work?

• Each processor enqueues its own private lock variable into a queue and spins on it
  – Key: spin locally
    • CC model: spin in local cache
    • DSM model: spin in local private memory
  – No contention
• On lock release, the releaser unlocks the next lock in the queue
  – Bus/network contention only on the actual unlock
  – No starvation (the order of lock acquisitions is defined by the list)

MCS Lock

• Requires atomic instructions:
  – compare-and-swap
  – fetch-and-store
• If there is no compare-and-swap
  – An alternative release algorithm
    • Extra complexity
    • Loss of strict FIFO ordering
    • Theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
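The MCS lock, in the variant that uses both fetch-and-store and compare-and-swap, can be sketched in C11 as below. The names (`mcs_node_t`, `mcs_lock_t`, `mcs_*`) are hypothetical, and default sequentially consistent atomics are used instead of tuned memory orderings.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* the flag this thread spins on */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

void mcs_init(mcs_lock_t *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* fetch-and-store: append ourselves and learn who was last before us */
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);                   /* link behind predecessor */
        while (atomic_load(&me->locked)) { /* spin on our own node */ }
    }
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* compare-and-swap: if we are still the tail, the queue becomes empty */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) return;
        /* a successor is in the middle of linking itself; wait for it */
        while ((succ = atomic_load(&me->next)) == NULL) { /* spin */ }
    }
    atomic_store(&succ->locked, false);   /* hand the lock to the successor */
}
```

Each thread passes its own node to `mcs_acquire` and the same node to `mcs_release`, so every waiter spins only on memory it owns.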

MCS: Example

[Figure: queue of per-CPU nodes (CPU 1 to CPU 4), each with a Flag and a Next field, and the Tail pointer, shown in three states: Init, Proc 1 gets, Proc 2 tries]

• CPU 1 holds the “real” lock
• CPU 2, CPU 3 and CPU 4 spin on their flags
• When CPU 1 releases, it releases the lock and changes the flag variable of the next CPU in the list

Implementation: Modern Alternatives

• Fetch and Φ operations
  – They are restrictive
  – Not all architectures support all of them
• Problem: a single general atomic operation is hard!
• Solution: provide two primitives to generate atomic operations
• Load Linked and Store Conditional
  – Remember the PowerPC lwarx and stwcx instructions

An Example: Swap

    try: mov R3, R4       ; copy the exchange value
         ld  R2, 0(R1)    ; load the old memory value
         st  R3, 0(R1)    ; store the new value
         mov R4, R2       ; return the old value in R4

Exchanges the contents of register R4 with the memory location pointed to by R1.

Not atomic!

An Example: Atomic Swap

    try: mov  R3, R4      ; move the exchange value
         ll   R2, 0(R1)   ; load linked
         sc   R3, 0(R1)   ; store conditional (R3 = 0 on failure)
         beqz R3, try     ; retry if the store failed
         mov  R4, R2      ; put the loaded value in R4

Swap (fetch and store) using ll and sc.

If another processor writes to the location pointed to by R1 before the sc completes, the reservation (usually kept in a register) is lost. The sc then fails and the code loops back to try again.
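The same load, attempt, retry structure can be written portably in C11. This is an analogy and not a direct translation: C has no ll/sc, so the sketch uses `atomic_compare_exchange_weak`, which is allowed to fail spuriously much like a store conditional. The function name `swap_sketch` is hypothetical.

```c
#include <stdatomic.h>

/* Fetch-and-store built from a retry loop, mirroring the ll / sc / beqz pattern. */
int swap_sketch(atomic_int *mem, int new_value) {
    int old = atomic_load(mem);   /* plays the role of ll */
    /* Plays the role of sc + beqz: on failure, old is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(mem, &old, new_value)) { }
    return old;
}
```

In practice `atomic_exchange` does this in one call; the loop is spelled out here only to show the retry structure.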

Another Example: Fetch and Increment and Spin Lock

Fetch and increment using ll-sc:

    try: ll   R2, 0(R1)     ; load linked
         addi R3, R2, #1    ; increment into R3 so the fetched value stays in R2
         sc   R3, 0(R1)     ; store conditional
         beqz R3, try       ; retry if the store failed

Spin lock using ll-sc:

            li   R2, #1
    lockit: exch R2, 0(R1)    ; atomic exchange
            bnez R2, lockit   ; already locked?

The exch instruction is equivalent to the atomic swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 = unlocked; 1 = locked.

Performance Penalty

Example

Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much!). Determine the performance penalty.

Answer

It takes over 12,000 cycles in total for all processors to pass through the lock!

Note the contention on the lock and the serialization of the bus transactions.

See the example on p. 596, Hennessy and Patterson, 3rd Ed.

Performance Penalty

• Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queue lock which only updates on a miss

Hennessy and Patterson, p. 603

Performance Penalty

• Answer (n = 10 processors):
  – First time: n + 1
  – Subsequent accesses: 2(n - 1)
  – Total: 3n - 1
  – 29 bus cycles, or 2,900 clock cycles
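The arithmetic behind the queue lock answer, written out with n = 10 processors and 100 clock cycles per bus transaction as given in the example:

```latex
\begin{align*}
\underbrace{(n + 1)}_{\text{first time}} + \underbrace{2(n - 1)}_{\text{subsequent accesses}}
  &= 3n - 1 \\
3 \cdot 10 - 1 &= 29 \text{ bus transactions} \\
29 \times 100 &= 2900 \text{ clock cycles}
\end{align*}
```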

Implementing Locks Using Coherence

Using exch:

    lockit: ld   R2, 0(R1)
            bnez R2, lockit
            li   R2, #1
            exch R2, 0(R1)
            bnez R2, lockit

    Step  P0            P1                         P2                         State   Bus
    1     Has lock      Spins                      Spins                      S       None
    2     Set lock = 0  Inv rcvd                   Inv rcvd                   E       Write Inv from P0
    3                   Cache miss                 Cache miss                 S       WB from P0
    4                   Waits                      Lock = 0                   S       Cache miss (P2) satisfied
    5                   Lock = 0                   Swap; cache miss           S       Cache miss (P1) satisfied
    6                   Swap; cache miss           Completes swap; returns 0  E (P2)  Inv from P2
                                                   and sets L = 1
    7                   Swap completes; returns 1  Enter                      E (P1)  WB
                        and sets L = 1
    8                   Spins, testing L = 0                                          None

S = shared, E = exclusive, Inv = invalidate, WB = write back.

Using ll-sc:

    lockit: ll   R2, 0(R1)
            bnez R2, lockit
            li   R2, #1
            sc   R2, 0(R1)
            beqz R2, lockit

Some Graphs

Increase in network latency on a Butterfly with sixty processors.

Performance of spin locks on a Butterfly. The x-axis represents processors and the y-axis represents time in microseconds.

Extracted from “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”, John M. Mellor-Crummey and Michael L. Scott, January 1991.


Topic 5b

Barriers

The Barrier Construct

• The idea for software barriers
  – A program point at which all participating threads wait for each other to arrive before continuing
• Difficulty
  – Overhead of synchronizing the threads
  – Network and memory bandwidth issues
• Implementation
  – Centralized
    • Simple to implement with locks
  – Tree based
    • Better with bandwidth

Centralized Barriers

• A normal barrier in which all threads/processors wait for each other “serially”
• Typical implementation: two spin locks
  – One waits for all threads to arrive
  – One keeps a tally of the arrived threads
• A thread arrives at the barrier and increments the counter by one (atomically)
• It checks whether it is the last one
  – If it isn't, it waits
  – If it is, it unblocks (awakes) the rest of the threads

Centralized Barrier: Pseudo Code

    int count = 0;
    bool sense = true;

    void central_barrier() {
        lock(L);
        if (count == 0) sense = 0;   /* first arrival closes the barrier */
        count++;
        unlock(L);
        if (count == PROCESSORS) {   /* last arrival opens it again */
            sense = 1;
            count = 0;
        } else {
            spin(sense == 1);        /* spin until sense == 1 */
        }
    }

It may deadlock or malfunction.

Centralized Barrier

[Figure: threads T1, T2 and T3 passing through Barrier 1, Work, and Barrier 2]

• T1 arrives at the barrier, increments count and spins
• T2 arrives at the barrier, increments count and spins
• T3 arrives at the barrier, increments count and changes sense
• T3 is delayed and T1 does its work
• T1 reaches the next barrier, increments count and is delayed
• T3 starts again and resets the count
• T2 and T3 arrive at the barrier and spin forever

Centralized Barrier: Pseudo Code for the Reverse Sense Barrier

    int count = 0;
    bool sense = true;

    void central_barrier() {
        static bool local_sense = true;   /* private to each thread, kept across calls */
        local_sense = !local_sense;       /* value to wait for in this episode */
        lock(L);
        count++;
        if (count == PROCESSORS) {        /* last arrival resets and releases */
            count = 0;
            sense = local_sense;
        }
        unlock(L);
        spin(sense == local_sense);       /* spin until sense == local_sense */
    }

It will wait correctly, since the spin target can be either from the previous barrier (old local_sense) or from the current barrier (local_sense).
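A compilable C11 variant of the reverse sense barrier is sketched below. It departs from the pseudocode in two labeled ways: the lock-protected `count++` is replaced by an atomic fetch-and-add, and each thread passes in a pointer to its own `local_sense` (initially `true`) instead of relying on a `static` variable. The names `sr_barrier_t` and `sr_barrier_*` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;     /* arrivals in the current episode */
    atomic_bool sense;    /* global sense, flipped by the last arrival */
    int participants;
} sr_barrier_t;

void sr_barrier_init(sr_barrier_t *b, int participants) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, true);
    b->participants = participants;
}

/* local_sense is owned by the calling thread and must start out true. */
void sr_barrier_wait(sr_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;               /* value to wait for in this episode */
    if (atomic_fetch_add(&b->count, 1) + 1 == b->participants) {
        atomic_store(&b->count, 0);             /* reset before releasing anyone */
        atomic_store(&b->sense, *local_sense);  /* release all waiting threads */
    } else {
        while (atomic_load(&b->sense) != *local_sense) { /* spin */ }
    }
}
```

Because the count is reset before the sense flips, a fast thread that re-enters the barrier cannot have its arrival erased by a slower one.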

Centralized Barrier

Performance

Suppose there are 10 processors on a bus that each try to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from the barrier and exit the barrier. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take?

Hennessy and Patterson, p. 598

Centralized Barrier

• Steps through the barrier (for the i-th processor)
  – Assume that an ll-sc lock is used
  – LL the lock: i times
  – SC the lock: i times
  – Load count: 1 time
  – LL the lock again: i - 1 times
  – Store count: 1 time
  – Store lock: 1 time
  – Load sense: 2 times
  – Total transactions for the i-th processor: 3i + 4
  – Total: (3n^2 + 11n)/2 - 1
  – 204 bus cycles, or 20,400 clock cycles
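The totals above can be checked step by step. The per-processor count follows from adding the listed operations, and summing over all n processors gives the closed form; the slide's total is one transaction less than the plain sum, a correction it states without explanation.

```latex
\begin{align*}
i + i + 1 + (i - 1) + 1 + 1 + 2 &= 3i + 4 \\
\sum_{i=1}^{n} (3i + 4) &= \frac{3n(n+1)}{2} + 4n = \frac{3n^2 + 11n}{2} \\
\frac{3n^2 + 11n}{2} - 1 \;\Big|_{n = 10} &= \frac{300 + 110}{2} - 1 = 204 \text{ bus transactions} \\
204 \times 100 &= 20400 \text{ clock cycles}
\end{align*}
```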

Tree Type Barriers

• The software combining tree barrier
  – A shared variable becomes a tree of accesses
  – Each parent node combines the results of its children
  – A group of processors per leaf
  – The last processor updates the leaf and then moves up
  – A two-pass scheme:
    • Bottom up: update count
    • Top down: update sense and resume
  – Objectives
    • Reduces memory contention
  – Disadvantages
    • Spins on memory locations whose positions cannot be statically determined

Tree Type Barriers

• Butterfly barrier
  – Based on the butterfly network scheme for broadcasting and reduction
  – Pairwise synchronizations
    • At step k: processor i signals processor i xor 2^k
  – If the number of processors is not a power of two, existing processors stand in for the missing ones
  – Max synchronizations: 2 floor(log2 P)

[Figure: butterfly communication pattern; columns labeled 1 to 7, rounds R0 to R3]

Tree Type Barriers

• Dissemination barrier
  – Similar to the butterfly barrier, but with fewer synchronization operations at most: ceil(log2 P)
  – At step k: processor i signals processor (i + 2^k) mod P
  – Advantages:
    • The flags that each processor spins on are statically assigned (better locality)
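The signalling patterns of the butterfly and dissemination barriers reduce to simple index arithmetic, sketched below. The function names are hypothetical, and the round count uses ceil(log2 P), as in the Mellor-Crummey and Scott paper's description of the dissemination barrier.

```c
/* Number of rounds for p processors: ceil(log2 p). */
int barrier_rounds(int p) {
    int rounds = 0;
    while ((1 << rounds) < p) rounds++;
    return rounds;
}

/* Butterfly: at step k, processor i pairs with processor i xor 2^k. */
int butterfly_partner(int i, int k) {
    return i ^ (1 << k);
}

/* Dissemination: at step k, processor i signals processor (i + 2^k) mod p. */
int dissemination_partner(int i, int k, int p) {
    return (i + (1 << k)) % p;
}
```

The dissemination formula works for any P because of the modulo, whereas the xor pairing only stays inside the range 0 to P - 1 when P is a power of two.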

Tree Type Barriers

• Tournament barriers
  – A tree-style barrier
  – A round of the tournament
    • A level of the tree
  – Winners are statically decided
    • No Fetch and Φ operations are needed
  – Processor i sets a flag that is being awaited by processor j; then processor i drops out of the tournament and j continues
  – The final processor wakes all the others
  – Types
    • CREW (concurrent read, exclusive write): a global variable to signal back
    • EREW (exclusive read, exclusive write): separate flags on which each processor spins separately

Bibliography

• Hennessy and Patterson. “Chapter 6: Multiprocessors and Thread-Level Parallelism”. 3rd Ed.
• Mellor-Crummey, John; Scott, Michael. “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”. January 1991.