10/20/2006 ELEG652-06F 1
Topic 5
Synchronization and Costs for Shared Memory
“... You will be assimilated. Resistance is futile.” (Star Trek)
Synchronization
• The orchestration of two or more threads (or processes) to complete a task in a correct manner and avoid any data races
• Data Race or Race Condition
  – "An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write"
• Achieved via atomicity and / or serializability
Atomicity
• Atomic: from the Greek "atomos," which means indivisible
• An "all or none" scheme
• An instruction (or a group of them) will appear as if it was (they were) executed in a single step
  – All side effects of the instruction(s) in the block are seen in their totality or not at all
• Side effects: writes and (causal) reads of the variables inside the atomic block
Atomicity
• Word-aligned loads and stores are atomic in almost all architectures
• Unaligned and bigger-than-word accesses are usually not atomic
• What happens when a non-atomic operation goes wrong?
  – The final result will be a garbled combination of values
  – Complete operations might be lost in the process
• Strong versus weak atomicity
Synchronization
• Applied to shared variables
• Synchronization may or may not enforce ordering
• High-level synchronization types
  – Semaphores
  – Mutexes
  – Barriers
  – Critical Sections
  – Monitors
  – Condition Variables
Semaphores
• Intelligent counters of resources
  – Zero means not available
• An abstract data type with two operations
  – P (Dutch "probeer te verlagen": "try to decrease"): waits (busy-waits or sleeps) if the resource is not available
  – V (Dutch "verhoog": "increase"): frees the resource
• Binary vs. Blocking vs. Counting semaphores
  – Binary: the initial value will allow threads to obtain it
  – Blocking: the initial value will block the threads
  – Counting: the initial value is not zero
• Note: P and V are atomic operations!
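As a concrete illustration (not from the original slides), here is a minimal Python sketch of P and V using the standard library's counting semaphore; the non-blocking probes make the counter behavior visible:

```python
import threading

# A counting semaphore initialized with 2 resources.
sem = threading.Semaphore(2)

assert sem.acquire(blocking=False)      # P: take resource 1
assert sem.acquire(blocking=False)      # P: take resource 2
assert not sem.acquire(blocking=False)  # count is 0: P would have to wait
sem.release()                           # V: free one resource
assert sem.acquire(blocking=False)      # P succeeds again
```

With `blocking=True` (the default), a failed P would sleep instead of returning False, which matches the "waits if the resource is not available" behavior above.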
Mutex
• Mutual Exclusion Lock
• A binary semaphore used to ensure that one thread (and only one) will access the resource
  – P: lock the mutex
  – V: unlock the mutex
• It doesn't enforce ordering
• Fine vs. coarse grained
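A small Python sketch (not from the slides) of why the mutex matters: without the lock, concurrent `counter += 1` updates can be lost; with it, every update survives:

```python
import threading

counter = 0
mutex = threading.Lock()

def worker():
    global counter
    for _ in range(50_000):
        with mutex:        # P: lock -- only one thread inside at a time
            counter += 1   # critical section
        # V: unlock happens automatically when leaving the "with" block

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 200_000  # no increments were lost
```

Note that the mutex guarantees exclusion but not ordering: which of the four threads enters next is unspecified.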
Barriers
• A high-level programming construct
• Ensures that all participating threads will wait at a program point for all other (participating) threads to arrive before they can continue
• Types of barriers
  – Tree Barriers (software assisted)
  – Centralized Barriers
  – Tournament Barriers
  – Fine-grained Barriers
  – Butterfly-style Barriers
  – Consistency Barriers (e.g. #pragma omp flush)
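The waiting behavior can be demonstrated with Python's built-in barrier (an illustration added here, not part of the slides): no thread passes the barrier until every participant has arrived.

```python
import threading

barrier = threading.Barrier(3)   # 3 participating threads
log = []                         # list.append is thread-safe in CPython

def worker(name):
    log.append((name, "arrived"))
    barrier.wait()               # block until all 3 threads call wait()
    log.append((name, "resumed"))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()

# Every "arrived" entry precedes every "resumed" entry.
tags = [tag for _, tag in log]
assert tags == ["arrived"] * 3 + ["resumed"] * 3
```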
Critical Sections
• A piece of code that is executed by one and only one thread at any point in time
• If thread T1 finds the critical section in use, it waits until the critical section is free
• Special case:
  – Conditional critical sections: threads wait on a "given" signal to resume execution
  – Better implemented with lock-free techniques (e.g. Transactional Memory)
Monitors and Condition Variables

• A monitor consists of:
  – A set of procedures that work on shared variables
  – A set of shared variables
  – An invariant
  – A lock to protect the state from access by other threads
• Condition Variables
  – Tied to the monitor's invariant (but they can be used in other schemes)
  – A signal placeholder for other threads' activities
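A minimal monitor sketch in Python (an illustration added here; the class name `Cell` is invented for the example): a lock protects the shared state, and a condition variable lets a consumer sleep until a producer signals.

```python
import threading

class Cell:
    """A one-slot monitor: shared state + lock + condition variable."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ready = threading.Condition(self._lock)
        self._value = None            # the shared variable

    def put(self, v):
        with self._lock:              # monitor procedures hold the lock
            self._value = v
            self._ready.notify()      # signal a waiting thread

    def take(self):
        with self._lock:
            while self._value is None:  # re-check: guards against spurious wakeups
                self._ready.wait()      # releases the lock while sleeping
            v, self._value = self._value, None
            return v

cell = Cell()
result = []
t = threading.Thread(target=lambda: result.append(cell.take()))
t.start()        # consumer blocks inside take() ...
cell.put(42)     # ... until the producer signals
t.join()
assert result == [42]
```

The `while` (rather than `if`) around `wait()` is the standard monitor idiom: the invariant is re-checked after every wakeup.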
Much More …
• However, all of these are abstractions
• Major elements
  – A synchronization element that ensures atomicity
    • Locks!
  – A synchronization element that ensures ordering
    • Barriers!
• Implementations and types
  – Common types of atomic primitives
  – Read-Modify-Write cycles
• Synchronization overhead may break a system
  – Unnecessary consistency actions
  – Communication cost between threads
• Why do Distributed Memory Machines have "implicit" synchronization?
Topic 5a
Locks
Implementation
• Atomic Primitives
  – Fetch and Φ operations
    • Read-Modify-Write cycles
    • Test and Set
    • Fetch and Store
      – Exchange a register and a memory location
    • Fetch and Add
    • Compare and Swap
      – Conditionally exchange the value of a memory location
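To make the Fetch and Φ family concrete, here is a Python emulation (added for illustration; real CPUs execute each of these as one indivisible instruction, whereas here a hidden lock plays that role):

```python
import threading

class AtomicCell:
    """Software emulation of hardware read-modify-write primitives."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def fetch_and_add(self, delta):
        with self._lock:
            old, self._value = self._value, self._value + delta
            return old                  # Fetch and Add returns the OLD value

    def fetch_and_store(self, new):
        with self._lock:                # atomic exchange (Fetch and Store)
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._lock:                # write only if the value matches
            if self._value == expected:
                self._value = new
                return True
            return False

c = AtomicCell(5)
assert c.fetch_and_add(3) == 5            # value is now 8
assert c.fetch_and_store(10) == 8         # value is now 10
assert c.compare_and_swap(10, 0) is True  # value is now 0
assert c.compare_and_swap(10, 99) is False  # no match: value stays 0
```

Test and Set is the special case `fetch_and_store(1)` on a Boolean cell.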
Implementation
• Used by programmers to implement more complex synchronization constructs
• Waiting behavior
  – Scheduler based: the process / thread is de-scheduled and will be scheduled again at a future time
  – Busy wait: the process / thread polls the resource until it is available
  – Dependent on the hardware / OS / scheduler behavior
Types of (Software) Locks: The Spin Lock Family

• The Simple Test-and-Set Lock
  – Polls a shared Boolean variable: a binary semaphore
  – Uses Fetch and Φ operations to operate on the binary semaphore
  – Expensive!
    • Wastes bandwidth
    • Generates extra bus transactions
• The test-and-test-and-set approach
  – Just poll (with ordinary reads) while the lock is in use; only attempt the atomic operation when it looks free
Types of (Software) Locks: The Spin Lock Family

• Delay-based Locks
  – Spin locks in which a delay has been introduced in testing the lock
  – Constant delay
  – Exponential back-off
    • Best results
  – The test-and-test-and-set scheme is not needed
Types of (Software) Locks: The Spin Lock Family

Pseudo code:

enum lock_t { UNLOCKED = 0, LOCKED = 1 };

void acquire_lock(lock_t *L) {
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {  // it was already held
        sleep(delay);
        delay *= 2;                              // exponential back-off
    }
}

void release_lock(lock_t *L) {
    *L = UNLOCKED;
}
Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only one per lock acquisition
• Strongly fair lock
  – No starvation
  – A FIFO service
• Implementation: two counters
  – A Request counter and a Release counter
Types of (Software) Locks: The Ticket Lock

Walkthrough (five threads T1-T5; a Request counter and a Release counter):

1. Request = 0, Release = 0: T1 acquires the lock
2. Request = 1, Release = 0: T2 requests the lock
3. Request = 2, Release = 0: T3 requests the lock
4. Request = 3, Release = 1: T1 releases the lock, T2 gets the lock, T4 requests the lock
5. Request = 4, Release = 1: T5 requests the lock
6. Request = 5, Release = 1: T1 requests the lock again
7. Request = 5, Release = 2: T2 releases the lock, T3 acquires the lock
Types of (Software) Locks: The Ticket Lock

• Reduces the number of Fetch and Φ operations
  – Only read operations on the release counter while waiting
• However, a lot of memory and network bandwidth is still wasted
• Back-off techniques are also used
  – Exponential back-off
    • A bad idea here
  – Constant delay
    • Set to the minimum time a lock is held
  – Proportional back-off
    • Dependent on how many threads are waiting for the lock
Types of (Software) Locks: The Ticket Lock

Pseudocode:

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock() {
    unsigned int my_ticket = fetch_and_increment(&next_ticket);
    while (1) {
        sleep(my_ticket - now_serving);   // proportional back-off
        if (now_serving == my_ticket) return;
    }
}

void release_lock() {
    now_serving = now_serving + 1;
}
Types of (Software) Locks: The Array-Based Queue Lock

• Problem with the ticket lock: contention on the release counter
• Cache coherence and memory traffic
  – Invalidation of the counter variable, and all requests hit a single memory bank
• Two elements
  – An array and a tail pointer that indexes it
  – The array is as big as the number of processors
  – Fetch and store: address of the array element
  – Fetch and increment: tail pointer
• FIFO ordering
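A Python sketch of the array-based (Anderson-style) queue lock, added for illustration; `_hw` again emulates the atomic tail increment, and the array must have at least as many slots as there are contending threads:

```python
import threading

class ArrayQueueLock:
    """Array-based queue lock: each waiter spins on its own slot."""
    def __init__(self, nslots):
        self._hw = threading.Lock()   # emulates atomic fetch-and-increment
        self.flags = [True] + [False] * (nslots - 1)  # slot 0 starts enabled
        self.tail = 0
        self.nslots = nslots

    def _fetch_and_increment_tail(self):
        with self._hw:
            t, self.tail = self.tail, self.tail + 1
            return t

    def acquire(self):
        slot = self._fetch_and_increment_tail() % self.nslots
        while not self.flags[slot]:   # spin on a private slot, not a shared counter
            pass
        self.flags[slot] = False      # consume the permission
        return slot

    def release(self, slot):
        self.flags[(slot + 1) % self.nslots] = True  # enable the successor

lock = ArrayQueueLock(4)   # 4 slots >= 3 threads
counter = 0

def worker():
    global counter
    for _ in range(30):
        slot = lock.acquire()
        counter += 1
        lock.release(slot)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 90
```

Because each waiter polls a different array element, a release invalidates only the successor's slot rather than a counter that everyone is reading.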
Types of (Software) Locks: The Array-Based Queue Lock

Walkthrough (five processors T1-T5; one array slot per processor and a tail pointer indexing the array):

1. Initial state: the tail pointer points to the beginning of the array; every array element except the first is marked "wait"
2. T1 gets the lock (its slot reads "enter")
3. T2 requests: fetch-and-increment advances the tail to the next slot, where T2 spins
4. T3 requests: the tail advances again
5. T1 releases; the next slot is set to "enter" and T2 gets the lock
6. T4 requests: the tail advances
7. T1 requests again: the tail advances
8. T2 releases; T3 gets the lock
Types of (Software) Locks: The Queue Locks

• They use too much memory
  – Linear space (relative to the number of processors) per lock
• Array
  – Easy to implement
• Linked list (QNODE)
  – Better cache management
Types of (Software) Locks: The MCS Lock

• Characteristics
  – FIFO ordering
  – Spins on locally accessible flag variables
  – Small amount of space per lock
  – Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  – QNODEs are assigned to local memory
  – Threads spin on local memory
MCS: How Does It Work?

• Each processor enqueues its own private lock variable into a queue and spins on it
  – Key: spin locally
    • CC model: spin in the local cache
    • DSM model: spin in local private memory
  – No contention
• On lock release, the releaser unlocks the next lock in the queue
  – Bus/network contention occurs only on the actual unlock
  – No starvation (the order of lock acquisitions is defined by the list)
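The description above can be sketched in Python (added for illustration; the `_hw` lock emulates the atomic fetch-and-store and compare-and-swap on the tail pointer, which a real MCS lock gets from hardware):

```python
import threading

class QNode:
    def __init__(self):
        self.locked = False   # the flag this thread will spin on
        self.next = None

class MCSLock:
    def __init__(self):
        self._hw = threading.Lock()   # emulates hardware atomicity on tail
        self.tail = None

    def _swap_tail(self, node):          # fetch-and-store on tail
        with self._hw:
            prev, self.tail = self.tail, node
            return prev

    def _cas_tail(self, expected, new):  # compare-and-swap on tail
        with self._hw:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        prev = self._swap_tail(node)
        if prev is not None:          # someone is ahead of us
            node.locked = True        # set our flag BEFORE linking in
            prev.next = node
            while node.locked:        # spin on OUR OWN node only
                pass

    def release(self, node):
        if node.next is None:
            if self._cas_tail(node, None):  # no successor: lock is free
                return
            while node.next is None:        # a successor is mid-enqueue
                pass
        node.next.locked = False            # hand the lock to the successor

lock = MCSLock()
counter = 0

def worker():
    global counter
    node = QNode()                # each thread's private queue node
    for _ in range(30):
        lock.acquire(node)
        counter += 1
        lock.release(node)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 90
```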
MCS Lock
• Requires an atomic instruction:
  – compare-and-swap
  – fetch-and-store
• If there is no compare-and-swap
  – an alternative release algorithm exists, with:
    • extra complexity
    • loss of strict FIFO ordering
    • a theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
MCS: Example
Four CPUs, each with a QNODE holding a flag and a next pointer, plus a shared tail pointer:

• Init: the tail pointer is null
• Proc 1 gets the lock: the tail points to CPU 1's node
• Proc 2 tries: it enqueues its node (flag = 1) behind CPU 1's node and spins

• CPU 1 holds the "real" lock
• CPU 2, CPU 3 and CPU 4 spin on their own flags
• When CPU 1 releases, it releases the lock and changes the flag variable of the next node in the list
Implementation: Modern Alternatives

• Fetch and Φ operations
  – They are restrictive
  – Not all architectures support all of them
• Problem: one general atomic operation is hard to implement!
• Solution: provide two primitives from which atomic operations can be built
• Load Linked and Store Conditional
  – Remember the PowerPC lwarx and stwcx instructions
An Example: Swap

try: mov  R3, R4      ; copy the new value
     ld   R2, 0(R1)   ; load the old value from memory
     st   R3, 0(R1)   ; store the new value
     mov  R4, R2      ; return the old value in R4

Exchange the contents of register R4 with the memory location pointed to by R1.

Not atomic! Another processor can write to 0(R1) between the load and the store.
An Example: Atomic Swap

try: mov  R3, R4      ; copy the new value
     ll   R2, 0(R1)   ; load linked: load and set the reservation
     sc   R3, 0(R1)   ; store conditional: fails if the reservation is lost
     beqz R3, try     ; retry if the store conditional failed
     mov  R4, R2      ; return the old value in R4

Swap (Fetch and Store) using ll and sc.

If another processor writes to the location pointed to by R1 before the sc can complete, the reservation (usually kept in a register) is lost. This means the sc will fail and the code will loop back and try again.
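A rough Python emulation of the ll/sc retry pattern (added for illustration; real reservations live in hardware, so here a version stamp that is bumped on every successful store stands in for the reservation):

```python
import threading

class LLSCCell:
    """Emulated load-linked / store-conditional memory cell."""
    def __init__(self, value=0):
        self._hw = threading.Lock()
        self._value = value
        self._stamp = 0          # bumped on every successful store

    def load_linked(self):
        with self._hw:
            return self._value, self._stamp   # value + "reservation"

    def store_conditional(self, new, stamp):
        with self._hw:
            if self._stamp != stamp:
                return False     # reservation lost: someone stored meanwhile
            self._value = new
            self._stamp += 1
            return True

def atomic_swap(cell, new):
    """Fetch-and-store built from ll/sc: retry until the sc succeeds."""
    while True:
        old, stamp = cell.load_linked()
        if cell.store_conditional(new, stamp):
            return old

cell = LLSCCell(7)
assert atomic_swap(cell, 9) == 7   # returns the old value
assert atomic_swap(cell, 1) == 9
```

The same retry loop with `old + 1` instead of a fixed new value yields fetch-and-increment, mirroring the next slide.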
Another Example: Fetch and Increment, and a Spin Lock

Fetch and Increment using ll-sc:

try: ll   R2, 0(R1)
     addi R2, R2, #1
     sc   R2, 0(R1)
     beqz R2, try

Spin Lock using exch:

        li   R2, #1
lockit: exch R2, 0(R1)
        bnez R2, lockit

The exch instruction is equivalent to the Atomic Swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 = unlocked; 1 = locked.
Performance Penalty
Example

Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much). Determine the performance penalty.
Answer
It takes over 12,000 cycles total for all processors to pass through the lock!

Note the contention for the lock and the serialization of the bus transactions.

See the example on p. 596, Hennessy and Patterson, 3rd Ed.
Performance Penalty
• Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queue lock which only updates on a miss

Hennessy and Patterson, p. 603
Performance Penalty
• Answer:
  – First access: n + 1 bus transactions
  – Subsequent accesses: 2(n - 1) transactions
  – Total: 3n - 1
  – For n = 10: 29 bus transactions, or 2,900 clock cycles
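The arithmetic above can be checked directly (a small helper added for illustration; the formula is the one stated on the slide, with n = number of processors):

```python
def queue_lock_bus_transactions(n):
    first = n + 1              # initial acquisition (per the slide)
    subsequent = 2 * (n - 1)   # all later hand-offs together
    return first + subsequent  # = 3n - 1

assert queue_lock_bus_transactions(10) == 29
assert queue_lock_bus_transactions(10) * 100 == 2900  # clock cycles
```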
Implementing Locks Using Coherence
Spin lock with coherent caches (test-and-test-and-set):

lockit: ld   R2, 0(R1)    ; spin by reading the cached copy
        bnez R2, lockit   ; lock still held?
        li   R2, #1
        exch R2, 0(R1)    ; try the atomic swap
        bnez R2, lockit   ; lost the race: spin again

Cache coherence steps for three processors contending for the lock:

Step | P0            | P1                                        | P2                                        | Lock state     | Bus activity
1    | Has lock      | Spins                                     | Spins                                     | Shared         | None
2    | Sets lock = 0 | Invalidate received                       | Invalidate received                       | Exclusive      | Write invalidate from P0
3    |               | Cache miss                                | Cache miss                                | Shared         | Write-back from P0
4    |               | Waits                                     | Reads lock = 0                            | Shared         | Cache miss for P2 satisfied
5    |               | Reads lock = 0                            | Executes swap, gets cache miss            | Shared         | Cache miss for P1 satisfied
6    |               | Executes swap, gets cache miss            | Swap completes: returns 0 and sets lock = 1 | Exclusive (P2) | Invalidate from P2
7    |               | Swap completes: returns 1 and sets lock = 1 | Enters critical section                 | Exclusive (P1) | Write-back from P2
8    |               | Spins, testing if lock = 0                |                                           |                | None

The same lock with ll-sc:

lockit: ll   R2, 0(R1)
        bnez R2, lockit
        li   R2, #1
        sc   R2, 0(R1)
        beqz R2, lockit
Some Graphs

Performance of spin locks on a sixty-processor Butterfly, showing the increase in network latency. The x-axis represents processors and the y-axis represents time in microseconds.

Extracted from "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." John M. Mellor-Crummey and Michael L. Scott. January 1991.
Topic 5b
Barriers
The Barrier Construct
• The idea behind software barriers
  – A program point at which all participating threads wait for each other before continuing
• Difficulty
  – Overhead of synchronizing the threads
  – Network and memory bandwidth issues
• Implementation
  – Centralized
    • Simple to implement with locks
  – Tree based
    • Better with bandwidth
Centralized Barriers
• A barrier in which all threads / processors wait for each other "serially"
• Typical implementation: a lock, a counter, and a release flag
  – The counter keeps a tally of the arrived threads
  – The flag is what waiting threads spin on until everyone arrives
• A thread arrives at the barrier and increments the counter by one (atomically)
• It then checks whether it is the last one
  – If it isn't, it waits
  – If it is, it unblocks (awakes) the rest of the threads
Centralized Barrier: Pseudo Code

int count = 0;
int sense = 1;

void central_barrier() {
    lock(L);
    if (count == 0) sense = 0;
    count++;
    unlock(L);
    if (count == PROCESSORS) {
        sense = 1;
        count = 0;
    } else
        spin_until(sense == 1);
}

It may deadlock or malfunction when the barrier is reused (see the next slide).
Centralized Barrier
Failure scenario with three threads (T1, T2, T3) and two consecutive barriers separated by work:

1. T1 arrives at barrier 1, increments count and spins
2. T2 arrives at barrier 1, increments count and spins
3. T3 arrives at barrier 1, increments count and changes sense
4. T3 is delayed before resetting count, while T1 does its work
5. T1 reaches barrier 2, increments count, and is then delayed
6. T3 resumes and resets count, losing T1's increment
7. T2 and T3 arrive at barrier 2 and spin forever
Centralized Barrier: Pseudo Code: Sense-Reversing Barrier

int count = 0;
int sense = 1;

void central_barrier() {
    static int local_sense = 1;   // one private copy per thread
    local_sense = !local_sense;
    lock(L);
    count++;
    if (count == PROCESSORS) {
        count = 0;
        sense = local_sense;
    }
    unlock(L);
    spin_until(sense == local_sense);
}

It waits correctly because the spin target distinguishes barrier episodes: a thread's local_sense for the current barrier differs from the sense left over from the previous barrier, so resetting count can never strand a thread.
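A runnable Python version of the sense-reversing barrier (added for illustration; `threading.local` provides the per-thread `local_sense` that the pseudocode's "static" variable is meant to be):

```python
import threading

class SenseBarrier:
    """Sense-reversing centralized barrier (safe for repeated reuse)."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = True
        self.lock = threading.Lock()
        self.local = threading.local()   # per-thread local_sense

    def wait(self):
        if not hasattr(self.local, "sense"):
            self.local.sense = True
        self.local.sense = not self.local.sense
        with self.lock:
            self.count += 1
            last = (self.count == self.n)
            if last:
                self.count = 0                 # safe: nobody spins on count
                self.sense = self.local.sense  # releases the other threads
        if not last:
            while self.sense != self.local.sense:  # spin on the global sense
                pass

PHASES, N = 3, 3
b = SenseBarrier(N)
log = []

def worker():
    for phase in range(PHASES):
        log.append(phase)   # record which phase this thread is in
        b.wait()

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads: t.start()
for t in threads: t.join()

# The barrier keeps phases strictly separated: N zeros, then N ones, ...
assert log == [p for p in range(PHASES) for _ in range(N)]
```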
Centralized Barrier
Performance

Suppose there are 10 processors on a bus that each try to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from the barrier, and exit the barrier. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take?

Hennessy and Patterson, p. 598
Centralized Barrier
• Steps through the barrier (assume an ll-sc lock is used); for the ith processor:
  – LL the lock: i times
  – SC the lock: i times
  – Load count: 1 time
  – LL the lock again: i - 1 times
  – Store count: 1 time
  – Store lock: 1 time
  – Load sense: 2 times
• Total transactions for the ith processor: 3i + 4
• Total over n processors: (3n^2 + 11n)/2 - 1
• For n = 10: 204 bus transactions, or 20,400 clock cycles
Tree Type Barriers
• The software combining tree barrier
  – A single shared variable becomes a tree of accesses
  – Each parent node combines the results of its children
  – A group of processors per leaf
  – The last processor to arrive updates the leaf and then moves up
  – A two-pass scheme:
    • Bottom-up: update counts
    • Top-down: update sense and resume
  – Objective
    • Reduces memory contention
  – Disadvantage
    • Spins on memory locations whose positions cannot be statically determined
Tree Type Barriers
• Butterfly Barrier
  – Based on the butterfly network scheme for broadcasting and reduction
  – Pairwise synchronizations
    • At step k: processor i signals processor i xor 2^k
  – If the number of processors is not a power of two, existing processors stand in for the missing ones
  – Max synchronizations: 2 * floor(log2 P)

(The original slide shows rounds R0 to R3 of this pairwise signalling among eight processors.)
Tree Type Barriers
• Dissemination Barrier
  – Similar to the butterfly barrier but with fewer synchronization operations: ceil(log2 P) rounds
  – At step k: processor i signals processor (i + 2^k) mod P
  – Advantages:
    • The flags on which each processor spins are statically assigned (better locality)
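The signalling pattern above can be tabulated directly (a small helper added for illustration, listing each round's partner for every processor):

```python
import math

def dissemination_rounds(P):
    """Partner signalled by each processor in each round, for P processors."""
    rounds = []
    for k in range(math.ceil(math.log2(P))):
        rounds.append([(i + 2**k) % P for i in range(P)])
    return rounds

# With P = 5 processors, ceil(log2 5) = 3 rounds suffice.
r = dissemination_rounds(5)
assert len(r) == 3
assert r[0] == [1, 2, 3, 4, 0]   # round 0: processor i signals (i + 1) mod 5
assert r[1] == [2, 3, 4, 0, 1]   # round 1: processor i signals (i + 2) mod 5
assert r[2] == [4, 0, 1, 2, 3]   # round 2: processor i signals (i + 4) mod 5
```

Note how the pattern works for any P, not just powers of two, which is one advantage over the butterfly scheme.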
Tree Type Barriers
• Tournament Barriers
  – A tree-style barrier
  – A round of the tournament = a level of the tree
  – Winners are statically decided
    • No Fetch and Φ operations are needed
  – Processor i sets a flag awaited by processor j; processor i then drops out of the tournament and j continues
  – The final winner wakes all the others
  – Types
    • CREW (concurrent read, exclusive write): a global variable is used to signal back
    • EREW (exclusive read, exclusive write): separate flags on which each processor spins
Bibliography
• Hennessy, John; Patterson, David. Computer Architecture: A Quantitative Approach, 3rd Ed. "Chapter 6: Multiprocessors and Thread-Level Parallelism."
• Mellor-Crummey, John; Scott, Michael. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." January 1991.