Programming Multi-Core Processors based
Embedded Systems
A Hands-On Experience on Cavium Octeon based Platforms
Lecture 3 (Complexities of Parallelism)
KICS, UET. Copyright © 2009.

Course Outline
- Introduction
- Multi-threading on multi-core processors
- Multi-core applications and their complexities
- Multi-core parallel applications
- Complexities of multi-threading and parallelism
- Application layer computing on multi-core
- Performance measurement and tuning

Agenda for Today
- Multi-core parallel applications space
  - Scientific/engineering applications
  - Commercial applications
- Complexities due to parallelism
  - Threading related issues
  - Memory consistency and cache coherence
  - Synchronization

Parallel Applications
Science/engineering, general-purpose, and desktop applications
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Parallel Application Trends
- There is an ever-increasing demand for high performance computing in a number of application areas
- Scientific and engineering applications:
  - Computational fluid dynamics
  - Weather modeling
  - A number of applications from physics, chemistry, biology, etc.
- General-purpose computing applications:
  - Video encoding/decoding, graphics, games
  - Database management
  - Networking applications

Application Trends (2)
- Demand for cycles fuels advances in hardware, and vice versa
  - Demand for cycles drives exponential increase in microprocessor performance
  - The most demanding applications drive parallel architecture the hardest
- Range of performance demands
  - Need a range of system performance with progressively increasing cost (the platform pyramid)
- Goal of applications in using multi-core machines: speedup

    Speedup(p cores) = Performance(p cores) / Performance(1 core)

- For a fixed problem size (input data set), performance = 1/time

    Speedup_fixed_problem(p cores) = Time(1 core) / Time(p cores)
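The two speedup definitions above can be computed directly. A minimal sketch, using hypothetical timings (the numbers are illustrative, not from the lecture):

```python
def speedup(time_1core, time_pcores):
    """Fixed-problem-size speedup: Time(1 core) / Time(p cores)."""
    return time_1core / time_pcores

# Hypothetical timings for a fixed input data set, in seconds.
t1, t8 = 64.0, 10.0
s = speedup(t1, t8)      # 6.4x on 8 cores
efficiency = s / 8       # fraction of ideal (linear) speedup achieved
```

Efficiency below 1.0 reflects serial fractions, synchronization, and memory contention, which the rest of this lecture examines.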
Scientific Computing Demand

Engineering Application Demands
- Large parallel machines are a mainstay in many industries:
  - Petroleum (reservoir analysis)
  - Automotive (crash simulation, drag analysis, combustion efficiency)
  - Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  - Computer-aided design
  - Pharmaceuticals (molecular modeling)
- Visualization
  - In all of the above
  - Entertainment (films like Toy Story)
  - Architecture (walk-throughs and rendering)

Application Trends Example: ASCI
- Accelerated Strategic Computing Initiative (ASCI) is a US DoE program that proposes the use of high performance computing for 3-D modeling and simulation
- Promised to provide 5 orders of magnitude greater computing power in 8 years (1996 to 2004) than state-of-the-art (1 GFLOPS to 100 TFLOPS)

Application Trends Example (2)
- Platforms
  - ASCI Red: 3.1 TFLOPS peak performance; developed by Intel with 4,510 nodes
  - ASCI Blue Mountain: 3 TFLOPS peak performance; developed by SGI with 48 128-node Origin2000s
  - ASCI White: 12 TFLOPS peak performance; developed by IBM as a cluster of SMPs

Commercial Applications
- Databases, online transaction processing, decision support, data mining
- Also rely on parallelism for the high end
- Scale not so large, but use much more widespread
- High performance means performing more work (transactions) in a fixed time

Commercial Applications (2)
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - Explicit scaling criteria provided
  - Size of enterprise scales with size of system
  - Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute, or tpm)
- Desktop applications
  - Video applications
  - Secure computing and web services

Parallel Applications Landscape
- Embedded applications (wireless and mobile devices, PDAs, consumer electronics)
- Desktop applications (WWW browser, office, multimedia applications)
- Data center applications (search, e-commerce, enterprise, SOA)
- HPCC (science/engineering)

Summary of Application Trends
- The transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress is under way in commercial computing
- Desktop also uses multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
  - Greatest use of small-scale multiprocessors
  - Currently employ multi-core processors
- Solid application demand exists and will increase

Solutions to Common Parallel Programming Problems using Multiple Threads
Chapter 7, Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

Common Problems
- Too many threads
- Data races, deadlocks, and livelocks
- Heavily contended locks
- Non-blocking algorithms
- Thread-safe functions and libraries
- Memory issues
  - Cache related issues
  - Pipeline stalls
  - Data organization

Too Many Threads
- "If a little threading is good, many threads will be great" — not always true
  - Excessive threading can degrade performance
- Two kinds of impact from excessive threads:
  - Too little work per thread
    - Overhead of starting and maintaining threads dominates
    - Too fine a granularity of work hides any performance benefits
  - Excessive contention for hardware resources
    - OS uses time-slicing for fair scheduling
    - May result in excessive context-switching overhead
    - Thrashing at the virtual memory level
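One common remedy for the problems above is to size a thread pool to the hardware rather than to the number of tasks. A minimal sketch using Python's standard library (the `work` function is a placeholder of my own):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # Placeholder task; real work should be coarse enough to amortize
    # the cost of scheduling the thread that runs it.
    return n * n

# Cap the pool at the number of hardware threads instead of spawning
# one thread per task; the OS would otherwise time-slice the excess
# threads and add context-switching overhead.
workers = os.cpu_count() or 4
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(work, range(1000)))
```

The pool queues the 1000 tasks onto a fixed, small set of threads, so per-thread overhead stays bounded regardless of task count.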

Data Races, Deadlocks, and Livelocks
- Race condition
  - Due to unsynchronized accesses to shared data
  - Program results are non-deterministic: they depend on relative timings of threads
  - Can be handled through locking
- Deadlock
  - A problem due to incorrect locking
  - Results from a cyclic dependence that stops forward progress by threads
- Livelock
  - Threads continuously conflict with each other and back off
  - No thread makes any progress
  - Solution: back off with release of acquired locks, to allow at least one thread to make progress
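The race/locking point can be illustrated with a shared counter. This is a sketch of my own, not code from the slides; with the lock the result is deterministic, while removing the `with lock:` line reintroduces the lost-update race:

```python
import threading

counter = 0
lock = threading.Lock()

def deposit(times):
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write below is not atomic:
        # two threads can read the same old value and lose an update.
        with lock:
            counter += 1

threads = [threading.Thread(target=deposit, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock the result is always 4 * 10_000.
```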
Races among Unsynchronized Threads
Race Conditions Hiding Behind Language Syntax

A Higher-Level Race Condition Example
- Race conditions are possible even with synchronization, if the synchronization is at too low a level: the higher level may still have data races
- Example: each key should occur only once in the list
  - Individual list operations have locks
  - Problem: two threads may simultaneously find that a key does not exist and then insert the same key into the list, one after the other
  - Solution: lock the list as a whole as well, to protect against key repetition
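The fix described above — making the whole check-then-insert sequence atomic with a list-level lock — can be sketched as follows (class and method names are my own, for illustration):

```python
import threading

class UniqueList:
    """Keys occur at most once; one list-level lock covers check + insert."""
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def insert_if_absent(self, key):
        # Locking only the individual find/insert operations would leave
        # a window between "not found" and "insert" where another thread
        # can insert the same key. Hold the list lock across both steps.
        with self._lock:
            if key not in self._items:
                self._items.append(key)
                return True
            return False

ul = UniqueList()
threads = [threading.Thread(target=lambda: [ul.insert_if_absent(k) for k in range(100)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All four threads attempt every key, yet each key appears exactly once.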
Deadlock Caused by Cycle

Conditions for a Deadlock
Deadlock can occur only if the following four conditions are true:
- Access to each resource is exclusive;
- A thread is allowed to hold one resource while requesting another;
- No thread is willing to relinquish a resource that it has acquired; and
- There is a cycle of threads trying to acquire resources, where each resource is held by one thread and requested by another.

Locks Ordered by their Addresses
- Consistent ordering of lock acquisition prevents deadlock
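The idea above can be sketched by sorting locks on a global key before acquiring them (here Python's `id()` stands in for the lock's address; the transfer scenario is my own illustration):

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
balance = {"a": 100, "b": 100}

def transfer(first, second, update):
    # Acquire locks in a globally consistent order (here: by id()) so a
    # cycle of waiting threads can never form, whatever order callers
    # name the locks in.
    lo, hi = sorted((first, second), key=id)
    with lo:
        with hi:
            update()

def a_to_b():
    balance["a"] -= 10
    balance["b"] += 10

def b_to_a():
    balance["b"] -= 5
    balance["a"] += 5

# The two threads name the locks in opposite orders; without the
# sorting step this is the classic deadlock-prone pattern.
t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, a_to_b))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, b_to_a))
t1.start(); t2.start(); t1.join(); t2.join()
```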

Try and Backoff Logic
- One reason for deadlocks: no thread is willing to give up a resource
- Solution: a thread gives up the resource it holds if it cannot acquire another one
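A try-and-backoff acquisition can be sketched with a non-blocking `acquire` plus a randomized backoff (my own sketch; the random delay is what prevents the livelock mentioned earlier):

```python
import random
import threading
import time

def acquire_both(first, second, max_wait=0.01):
    # Take the first lock, then *try* the second. On failure, release
    # everything and back off for a random interval so at least one
    # thread can make progress (avoids both deadlock and livelock).
    while True:
        first.acquire()
        if second.acquire(blocking=False):
            return
        first.release()
        time.sleep(random.uniform(0, max_wait))

done = []
la, lb = threading.Lock(), threading.Lock()

def worker(x, y, name):
    acquire_both(x, y)
    try:
        done.append(name)
    finally:
        y.release()
        x.release()

t1 = threading.Thread(target=worker, args=(la, lb, "t1"))
t2 = threading.Thread(target=worker, args=(lb, la, "t2"))  # opposite order
t1.start(); t2.start(); t1.join(); t2.join()
```

Both threads complete even though they request the locks in opposite orders.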

Heavily Contended Locks
- Locks ensure correctness
  - By preventing race conditions
  - By preventing deadlocks
- Performance impact when locks become heavily contended among threads
  - Threads try to acquire the lock at a rate faster than the rate at which a thread can execute the corresponding critical section
  - If the thread holding the lock falls asleep, all threads have to wait for it
Priority Inversion Scenario
Solution: Spreading out Contention

Hash Table with Fine-Grained Locking
- A mutex protects each bucket
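A per-bucket locking scheme can be sketched as below (class and method names are illustrative, not from the book); threads touching different buckets proceed in parallel, and only same-bucket accesses contend:

```python
import threading

class StripedHashTable:
    """Hash table with one mutex per bucket (fine-grained locking sketch)."""
    def __init__(self, n_buckets=16):
        self._buckets = [dict() for _ in range(n_buckets)]
        self._locks = [threading.Lock() for _ in range(n_buckets)]

    def put(self, key, value):
        i = hash(key) % len(self._buckets)
        with self._locks[i]:           # lock only this key's bucket
            self._buckets[i][key] = value

    def get(self, key, default=None):
        i = hash(key) % len(self._buckets)
        with self._locks[i]:
            return self._buckets[i].get(key, default)

table = StripedHashTable()
# Four threads insert disjoint key sets concurrently.
threads = [threading.Thread(
               target=lambda base=b: [table.put(base + k, k) for k in range(0, 1000, 10)])
           for b in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```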

Non-Blocking Algorithms
- How about not using locks at all, to resolve the locking problems?
- Such algorithms are called non-blocking: stopping one thread does not prevent the rest of the system from making progress
- Non-blocking guarantees:
  - Obstruction freedom: a thread makes progress as long as there is no contention; livelock is possible, so exponential backoff is used to avoid it
  - Lock freedom: the system as a whole makes progress
  - Wait freedom: every thread makes progress even when faced with contention; practically difficult to achieve

Thread-Safe Functions
- A function is thread-safe when it can be concurrently called on different objects
- The implementer should ensure thread safety of hidden shared state

Memory Issues
- Speed disparity
  - Processing is fast; memory access is slow
  - Multiple cores can exacerbate the problem
- Specific memory issues:
  - Bandwidth
  - Working in the cache
  - Memory contention
  - Memory consistency
Bandwidth
Working in the Cache

Memory Contention
- Types of memory accesses:
  - Between a core and main memory
  - Between two cores
- Two types of data dependences:
  - Read-write dependency: a core writes a cache line and then a different core reads it
  - Write-write dependency: a core writes a cache line and then a different core writes it
- Interactions among cores:
  - Consume bandwidth
  - Are avoided when multiple cores only read from cache lines
  - Can be avoided by minimizing the number of shared locations

False Sharing
- Cache blocks may also introduce artifacts: two distinct variables in the same cache block
- Technique: allocate data used by each processor contiguously, or at least avoid interleaving in memory
- Example problem: an array of ints, one element written frequently by each processor (many ints per cache line)
Performance Impact of False Sharing
KICS, UETCopyright © 2009 3-38
What is Memory Consistency?
Itanium Architecture
Shared Memory without a Lock
Memory Consistency and Cache Coherence
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A
Hardware/Software Approach, Morgan Kaufmann, 1998
(Advanced Topics—can be skipped)

Memory Consistency for Multi-Core Architectures
- Memory consistency issue:
  - Programs are written for a conceptual sequential machine with memory
  - Programs for parallel architectures are written for multiple concurrent instruction streams; memory accesses may occur in any order and may result in incorrect computation
- This is a well-known problem
  - Traditional parallel architectures deal with it
  - Multi-core architectures inherit this complexity
  - Presented in this section for the sake of completeness
- More relevant for HPCC applications; not as complex for multi-threading, where thread-level solutions exist

Memory Consistency
- Consistency requirement: writes to a location become visible to all processes in the same order
- But when does a write become visible?
- How to establish an order between a write and a read by different processes?
  - Typically use event synchronization
  - By using more than one location

Memory Consistency (2)
- We sometimes expect memory to respect order between accesses to different locations issued by a given processor, and to preserve orders among accesses to the same location by different processes
- Coherence doesn't help: it pertains only to a single location

    P1                          P2
    /* Assume initial value of A and flag is 0 */
    A = 1;                      while (flag == 0); /* spin idly */
    flag = 1;                   print A;

An Example of Orders
- We need an ordering model with clear semantics across different locations as well, so programmers can reason about what results are possible
- This is the memory consistency model

    P1                          P2
    /* Assume initial values of A and B are 0 */
    (1a) A = 1;                 (2a) print B;
    (1b) B = 2;                 (2b) print A;

Memory Consistency Model
- Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another
  - What orders are preserved?
  - Given a load, constrains the possible values returned by it
- Without it, we can't tell much about a shared address space (SAS) program's execution

Memory Consistency Model (2)
- Implications for both programmer and system designer
  - The programmer uses it to reason about correctness and possible results
  - The system designer can use it to constrain how much accesses can be reordered by compiler or hardware
- A contract between programmer and system

Sequential Consistency
- (As if there were no caches, and a single memory)
- Processors issue memory references in program order
- Conceptual picture: processors P1 ... Pn connect through a single "switch" to one memory; the switch is randomly set after each memory reference

Sequential Consistency (2)
- Total order achieved by interleaving accesses from different processes
  - Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others
  - Programmer's intuition is maintained
- "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

What Really is Program Order?
- Intuitively, the order in which operations appear in source code
  - Straightforward translation of source code to assembly
  - At most one memory operation per instruction
- But not the same as the order presented to hardware by the compiler
- So which is program order? Depends on which layer, and who's doing the reasoning
- We assume the order as seen by the programmer

Sequential Consistency: Example

    P1                          P2
    /* Assume initial values of A and B are 0 */
    (1a) A = 1;                 (2a) print B;
    (1b) B = 2;                 (2b) print A;

- Possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
  - We know 1a -> 1b and 2a -> 2b by program order
  - A = 0 implies 2b -> 1a, which implies 2a -> 1b
  - B = 2 implies 1b -> 2a, which leads to a contradiction
- BUT, the actual execution 1b -> 1a -> 2b -> 2a is SC even though it is not in program order: as visible from the results, it appears just like 1a -> 1b -> 2a -> 2b
- The actual execution 1b -> 2a -> 2b -> 1a is not SC
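The argument above can be checked mechanically: under SC, an execution is some interleaving of the two program orders. A small sketch of my own that enumerates all six interleavings of this example and collects the printed (A, B) pairs:

```python
from itertools import combinations

# Operations of each processor, in program order, for the example above.
P1 = [("write", "A", 1), ("write", "B", 2)]   # (1a), (1b)
P2 = [("read", "B"), ("read", "A")]           # (2a), (2b)

def sc_outcomes():
    """Enumerate every SC interleaving; return the set of printed (A, B) pairs."""
    results = set()
    n = len(P1) + len(P2)
    # Choose which slots P1's operations occupy; order within each
    # processor is preserved, as SC requires.
    for slots in combinations(range(n), len(P1)):
        mem = {"A": 0, "B": 0}
        printed = {}
        it1, it2 = iter(P1), iter(P2)
        for pos in range(n):
            op = next(it1) if pos in slots else next(it2)
            if op[0] == "write":
                mem[op[1]] = op[2]
            else:                     # a "read" here models the print
                printed[op[1]] = mem[op[1]]
        results.add((printed["A"], printed["B"]))
    return results
```

Enumerating confirms exactly the three outcomes listed on the slide, and that (0,2) never occurs.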

Implementing SC
- Two kinds of requirements:
- Program order
  - Memory operations issued by a process must appear to become visible (to others and itself) in program order
- Atomicity
  - In the overall total order, one memory operation should appear to complete with respect to all processes before the next one is issued
  - Needed to guarantee that the total order is consistent across processes
  - The tricky part is making writes atomic

Write Atomicity
- Write atomicity: the position in the total order at which a write appears to perform should be the same for all processes
- Nothing a process does after it has seen the new value produced by a write W should be visible to other processes until they too have seen W
- In effect, extends write serialization to writes from multiple processes

Write Atomicity (2)
- Transitivity implies A should print as 1 under SC
- Problem if P2 leaves its loop, writes B, and P3 sees the new B but the old A (from its cache, say)

    P1              P2                  P3
    A=1;            while (A==0);
                    B=1;                while (B==0);
                                        print A;

Formal Definition of SC
- Each process's program order imposes a partial order on the set of all operations
- Interleaving of these partial orders defines a total order on all operations
- Many total orders may be SC (SC does not define a particular interleaving)

Formal Definition of SC (2)
- SC execution: an execution of a program is SC if the results it produces are the same as those produced by some possible total order (interleaving)
- SC system: a system is SC if any possible execution on that system is an SC execution

Sufficient Conditions for SC
- Every process issues memory operations in program order
- After a write operation is issued, the issuing process waits for the write to complete before issuing its next operation
- After a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (provides write atomicity)

Sufficient Conditions for SC (2)
- Sufficient, not necessary, conditions
- Clearly, compilers should not reorder for SC, but they do!
  - Loop transformations, register allocation (eliminates accesses!)
- Even if issued in order, hardware may violate SC for better performance
  - Write buffers, out-of-order execution
  - Reason: uniprocessors care only about dependences to the same location
- This makes the sufficient conditions very restrictive for performance

Summary of SC Implementation
- Assume for now that the compiler does not reorder
- Hardware needs mechanisms to:
  - Detect write completion (read completion is easy)
  - Ensure write atomicity
- For all protocols and implementations, we will see:
  - How they satisfy coherence, particularly write serialization
  - How they satisfy the sufficient conditions for SC (write completion and write atomicity)
  - How they can ensure SC but not through the sufficient conditions
- We will see that a centralized bus interconnect makes this easier

Cache Coherence
- CC for SMP architectures: one memory location in multiple caches
- Not a problem for read accesses
  - No need to update the memory address
  - Computation can continue on the local processor
- Write accesses drive coherence requirements
  - Memory needs to be updated
  - Need to invalidate cache copies in other processors
- Multiple ways to deal with updates
  - Update memory immediately: write-through caches
  - Update later: write-back caches

Cache Coherence (2)
- CC is a well-known problem
  - For traditional SMP-style multiprocessors
  - Inherited by multi-core processors
- Multiple solutions
  - Can be resolved in software; however, traditionally resolved in hardware
  - Hardware supports CC protocols: a mechanism to detect cache coherence related events, and mechanisms to keep the caches coherent
- Presented here for the sake of completeness
  - The programmer does not have to worry about it
  - However, it is a key consideration for a multi-core architecture

SC in Write-Through Caches
- Provides SC, not just coherence
- Extend the arguments used for coherence
  - Writes and read misses to all locations are serialized by the bus into bus order
  - If a read obtains the value of write W, W is guaranteed to have completed, since it caused a bus transaction
  - When write W is performed w.r.t. any processor, all previous writes in bus order have completed

Design Space for Snooping Protocols
- No need to change processor, main memory, cache...
  - Extend the cache controller and exploit the bus (provides serialization)
- Focus on protocols for write-back caches
- Dirty state now also indicates exclusive ownership
  - Exclusive: the only cache with a valid copy
  - Owner: responsible for supplying the block upon a request for it
- Design space
  - Invalidation- versus update-based protocols
  - Set of states

Invalidation-based Protocols
- Exclusive means the block can be modified without notifying anyone else, i.e. without a bus transaction
  - Must first get the block in exclusive state before writing into it
  - Even if already in valid state, a transaction is needed, so this is called a write miss
- A store to non-dirty data generates a read-exclusive bus transaction

Invalidation-based Protocols (2)
- The read-exclusive bus transaction (cont'd)
  - Tells others about the impending write and obtains exclusive ownership
    - Makes the write visible, i.e. the write is performed
    - May actually be observed (by a read miss) only later
    - A write hit is made visible (performed) when the block is updated in the writer's cache
  - Only one RdX can succeed at a time for a block: serialized by the bus
- Read and read-exclusive bus transactions drive coherence actions
  - Writeback transactions do too, but they are not caused by a memory operation and are quite incidental to the coherence protocol
  - Note: a replaced block that is not in modified state can be dropped

Update-based Protocols
- A write operation updates values in other caches
  - New update bus transaction
- Advantages
  - Other processors don't miss on their next access: reduced latency
    - In invalidation protocols, they would miss and cause more transactions
  - A single bus transaction updates several caches, which can save bandwidth
    - Also, only the word written is transferred, not the whole block

Update-based Protocols (2)
- Disadvantages
  - Multiple writes by the same processor cause multiple update transactions
    - In invalidation, the first write gets exclusive ownership; the others are local
- Detailed tradeoffs are more complex

Invalidate versus Update
- Basic question of program behavior: is a block written by one processor read by others before it is rewritten?
- Invalidation:
  - Yes => readers will take a miss
  - No => multiple writes without additional traffic; also clears out copies that won't be used again
- Update:
  - Yes => readers will not miss if they had a copy previously; a single bus transaction updates all copies
  - No => multiple useless updates, even to dead copies
- Invalidation protocols are much more popular
  - Some systems provide both, or even hybrids

Protocols
- 3-state write-back invalidation protocol
- 4-state write-back invalidation protocol
- 4-state write-back update protocol

Basic MSI Writeback Invalidation Protocol
- States
  - Invalid (I)
  - Shared (S): one or more caches
  - Dirty or Modified (M): one cache only
- Processor events:
  - PrRd (read)
  - PrWr (write)
- Bus transactions
  - BusRd: asks for a copy with no intent to modify
  - BusRdX: asks for a copy with intent to modify
  - BusWB: updates memory
- Actions
  - Update state, perform bus transaction, flush value onto bus

State Transition Diagram
- Write to a shared block: already have the latest data, so an upgrade transaction (BusUpgr) can be used instead of BusRdX
- Replacement changes the state of two blocks: outgoing and incoming
- Transitions (state: event/action -> next state):
  - I: PrRd/BusRd -> S; PrWr/BusRdX -> M
  - S: PrRd/—; BusRd/—; PrWr/BusRdX -> M; BusRdX/— -> I
  - M: PrRd/—; PrWr/—; BusRd/Flush -> S; BusRdX/Flush -> I
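The MSI transitions above can be made concrete with a toy snooping simulator (a sketch I wrote for illustration: a single cache line, no BusUpgr and no replacement):

```python
class MSIBus:
    """Toy snooping MSI simulator: one cache line, N caches, states I/S/M."""
    def __init__(self, n_caches):
        self.state = ["I"] * n_caches

    def read(self, i):
        # PrRd: a hit in S or M needs no transaction; a miss issues BusRd.
        if self.state[i] == "I":
            for j, s in enumerate(self.state):
                if j != i and s == "M":
                    self.state[j] = "S"   # BusRd observed in M: flush, go to S
            self.state[i] = "S"

    def write(self, i):
        # PrWr: in M no transaction; otherwise BusRdX invalidates others.
        if self.state[i] != "M":
            for j in range(len(self.state)):
                if j != i:
                    self.state[j] = "I"   # BusRdX observed: invalidate
            self.state[i] = "M"

bus = MSIBus(3)
bus.read(0)    # P0: I -> S via BusRd
bus.write(1)   # P1: I -> M via BusRdX; P0's copy invalidated
bus.read(2)    # P2 misses: P1 flushes and goes M -> S; P2 -> S
```

After the three operations the line is shared by P1 and P2, and P0 holds no valid copy.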

Satisfying Coherence
- Write propagation is clear
- Write serialization?
  - All writes that appear on the bus (BusRdX) are ordered by the bus
  - A write is performed in the writer's cache before it handles other transactions, so it is ordered in the same way even w.r.t. the writer
  - Reads that appear on the bus are ordered w.r.t. these

Satisfying Coherence (2)
- Write serialization? (cont'd)
- Writes that don't appear on the bus:
  - A sequence of such writes between two bus transactions for the block must come from the same processor, say P
  - In the serialization, the sequence appears between these two bus transactions
  - Reads by P will see them in this order w.r.t. other bus transactions
  - Reads by other processors are separated from the sequence by a bus transaction, which places them in the serialized order w.r.t. the writes
  - So reads by all processors see the writes in the same order

Satisfying Sequential Consistency
- Appeal to the definition:
  - The bus imposes a total order on bus transactions for all locations
  - Between transactions, processors perform reads/writes locally in program order
  - So any execution defines a natural partial order: Mj is subsequent to Mi if (i) it follows in program order on the same processor, or (ii) Mj generates a bus transaction that follows the memory operation for Mi
- In a segment between two bus transactions, any interleaving of operations from different processors leads to a consistent total order
- In such a segment, writes observed by processor P are serialized as follows:
  - Writes from other processors, by the previous bus transaction P issued
  - Writes from P, by program order

Satisfying Sequential Consistency (2)
- Show that the sufficient conditions are satisfied
  - Write completion: can detect when a write appears on the bus
  - Write atomicity: if a read returns the value of a write, that write has already become visible to all others (can reason through the different cases)

Lower-level Protocol Choices
- BusRd observed in M state: what transition to make? Depends on expectations of access patterns
- S: assumes I'll read again soon, rather than that the other processor will write
  - Good for mostly-read data
- What about "migratory" data?
  - I read and write, then you read and write, then X reads and writes...
  - Better to go to I state, so I don't have to be invalidated on your write
  - Synapse transitioned to I state
  - Sequent Symmetry and MIT Alewife use adaptive protocols
- Choices can affect the performance of the memory system

MESI (4-state) Invalidation Protocol
- Problem with the MSI protocol
  - Reading and then modifying data takes 2 bus transactions, even if no one is sharing
    - e.g. even in a sequential program: BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M)
- Add an exclusive state: write locally without a transaction, but not modified
  - Main memory is up to date, so the cache is not necessarily the owner

MESI (4-state) Invalidation Protocol (2)
- States:
  - Invalid (I)
  - Exclusive or exclusive-clean (E): only this cache has a copy, but it is not modified
  - Shared (S): two or more caches may have copies
  - Modified (M): dirty
- I -> E on PrRd if no one else has a copy
  - Needs a "shared" signal on the bus: a wired-OR line asserted in response to BusRd

MESI State Transition Diagram
- BusRd(S) means the shared line was asserted on the BusRd transaction; BusRd(S̄) means it was not
- Flush': if cache-to-cache sharing (see next), only one cache flushes data
- MOESI protocol adds an Owned state: exclusive, but memory not valid
- Transitions (state: event/action -> next state):
  - I: PrRd/BusRd(S) -> S; PrRd/BusRd(S̄) -> E; PrWr/BusRdX -> M
  - E: PrRd/—; PrWr/— -> M; BusRd/Flush -> S; BusRdX/Flush -> I
  - S: PrRd/—; BusRd/Flush'; PrWr/BusRdX -> M; BusRdX/Flush' -> I
  - M: PrRd/—; PrWr/—; BusRd/Flush -> S; BusRdX/Flush -> I

Lower-level Protocol Choices
- Who supplies the data on a miss when the block is not in M state: memory or cache?
- Original Illinois MESI: the cache, since it was assumed faster than memory
  - Cache-to-cache sharing
  - Not true in modern systems: intervening in another cache is more expensive than getting the data from memory

Lower-level Protocol Choices (2)
- Cache-to-cache sharing also adds complexity
  - How does memory know it should supply the data? (must wait for caches)
  - A selection algorithm is needed if multiple caches have valid data
- But valuable for cache-coherent machines with distributed memory
  - May be cheaper to obtain data from a nearby cache than from distant memory
  - Especially when constructed out of SMP nodes (Stanford DASH)

Dragon Write-back Update Protocol
- 4 states
  - Exclusive-clean or exclusive (E): I and memory have it
  - Shared-clean (Sc): I, others, and maybe memory, but I'm not the owner
  - Shared-modified (Sm): I and others but not memory, and I'm the owner
    - Sm and Sc can coexist in different caches, with only one Sm
  - Modified or dirty (M): I and no one else

Dragon Write-back Update Protocol (2)
- No invalid state
  - If a block is in the cache, it cannot be invalid
  - If not present in the cache, it can be viewed as being in a not-present or invalid state
- New processor events: PrRdMiss, PrWrMiss
  - Introduced to specify actions when the block is not present in the cache
- New bus transaction: BusUpd
  - Broadcasts the single word written on the bus; updates other relevant caches

Dragon State Transition Diagram
- Transitions (state: event/action -> next state):
  - E: PrRd/—; PrWr/— -> M; BusRd/— -> Sc
  - Sc: PrRd/—; BusUpd/Update; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(S̄) -> M
  - Sm: PrRd/—; BusRd/Flush; BusUpd/Update -> Sc; PrWr/BusUpd(S); PrWr/BusUpd(S̄) -> M
  - M: PrRd/—; PrWr/—; BusRd/Flush -> Sm
- On a miss:
  - PrRdMiss/BusRd(S) -> Sc; PrRdMiss/BusRd(S̄) -> E
  - PrWrMiss/(BusRd(S); BusUpd) -> Sm; PrWrMiss/BusRd(S̄) -> M

Lower-level Protocol Choices
- Can the shared-modified state be eliminated?
  - Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  - The Dragon protocol doesn't (it assumes DRAM memory is slow to update)
- Should the replacement of an Sc block be broadcast?
  - Would allow the last copy to go to E state and not generate updates
  - The replacement bus transaction is not in the critical path; a later update may be

Lower-level Protocol Choices (2)
- Shouldn't update the local copy on a write hit before the controller gets the bus
  - Can mess up serialization
- Coherence and consistency considerations are much like the write-through case
- In general, there are many subtle race conditions in protocols
- But first, let's illustrate quantitative assessment at the logical level
Synchronization
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A
Hardware/Software Approach, Morgan Kaufmann, 1998
(Advanced Topic—can be skipped)

Synchronization
- Synchronization is a fundamental concept of parallel computing:
  "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Types
  - Mutual exclusion
  - Event synchronization
    - Point-to-point
    - Group
    - Global (barriers)

Synchronization (2)
- Synchronization is a well-known problem
  - In traditional parallel computing
  - Inherited by multi-core architectures
- Resolution requires hardware and software
  - The processor instruction set needs to provide an atomic test-and-set instruction
  - System software uses it to provide synchronization mechanisms
- Presented here for the sake of completeness
  - To provide exposure to the idea behind it
  - Multithreading software provides synchronization primitives

History and Perspectives
- Much debate over hardware primitives over the years
- Conclusions depend on technology and machine style
  - Speed vs flexibility
- Most modern methods use a form of atomic read-modify-write
  - IBM 370: included atomic compare&swap for multiprogramming
  - x86: any instruction can be prefixed with a lock modifier

History and Perspectives (2)
- Atomic read-modify-write (cont'd)
  - High-level language advocates want hardware locks/barriers, but that goes against the "RISC" flow
  - SPARC: atomic register-memory ops (swap, compare&swap)
  - MIPS, IBM Power: no atomic operations, but a pair of instructions: load-locked, store-conditional
    - Later used by PowerPC and DEC Alpha too
- Rich set of tradeoffs

Components of a Synchronization Event
- Acquire method
  - Acquire the right to the synch (enter the critical section, go past the event)
- Waiting algorithm
  - Wait for the synch to become available when it isn't
- Release method
  - Enable other processors to acquire the right to the synch
- The waiting algorithm is independent of the type of synchronization

Waiting Algorithms
- Blocking
  - Waiting processes are descheduled
  - High overhead
  - Allows the processor to do other things
- Busy-waiting
  - Waiting processes repeatedly test a location until it changes value
  - The releasing process sets the location
  - Lower overhead, but consumes processor resources
  - Can cause network traffic

Waiting Algorithms (2)
- Busy-waiting is better when
  - Scheduling overhead is larger than the expected wait time
  - Processor resources are not needed for other tasks
  - Scheduler-based blocking is inappropriate (e.g. in the OS kernel)
- Hybrid methods: busy-wait a while, then block
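The hybrid method can be sketched as a short spin loop followed by a blocking wait (my own sketch; the spin count and timeout are arbitrary illustrative values):

```python
import threading
import time

def hybrid_wait(event, spin_count=1000):
    # Busy-wait for a short while: cheap if the wait turns out to be
    # short, since no descheduling happens.
    for _ in range(spin_count):
        if event.is_set():
            return True              # satisfied while spinning
    # Fall back to blocking so the core is freed for other work.
    return event.wait(timeout=5.0)

flag = threading.Event()
releaser = threading.Thread(target=lambda: (time.sleep(0.05), flag.set()))
releaser.start()
ok = hybrid_wait(flag)
releaser.join()
```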
Role of System and User User wants to use high-level
synchronization operations Locks, barriers... Doesn’t care about implementation
System designer: how much hardware support in implementation? Speed versus cost and flexibility Waiting algorithm difficult in hardware, so
provide support for others
KICS, UETCopyright © 2009 3-96
Role of System and User (2)
Popular trend:
- System provides simple hardware primitives (atomic operations)
- Software libraries implement lock and barrier algorithms using these
But some propose and implement full-hardware synchronization.
Challenges
The same synchronization may have different needs at different times:
- A lock may be accessed with low or high contention
- Different performance requirements: low latency or high throughput
- Different algorithms are best for each case, and need different primitives
Multiprogramming can change synchronization behavior and needs:
- Process scheduling and other resource interactions
- May need more sophisticated algorithms, which are not as good in the dedicated case
Challenges (2)
A rich area of software-hardware interactions:
- Which primitives are available affects what algorithms can be used
- Which algorithms are effective affects what primitives to provide
Need to evaluate using workloads.
Mutual Exclusion
Mutual exclusion = the lock-unlock operations; there is a wide range of algorithms to implement them.
Role of contention for locks:
- Simple algorithms are fast when contention for locks is low
- Sophisticated algorithms deal with contention better, but have higher cost
Types of locks: hardware locks, simple lock algorithms, advanced lock algorithms.
Hardware Locks
- Separate lock lines on the bus: the holder of a lock asserts the line; a priority mechanism arbitrates among multiple requestors
- Locking algorithm: busy-wait with timeout
- Lock registers (Cray XMP): a set of registers shared among processors
- Inflexible, so not popular for general-purpose use: few locks can be in use at a time (one per lock line), and the waiting algorithm is hardwired
- Primarily used to provide atomicity for higher-level software locks
First Attempt at Simple Software Lock

lock:   ld  register, location  /* copy location to register */
        cmp location, #0        /* compare with 0 */
        bnz lock                /* if not 0, try again */
        st  location, #1        /* store 1 to mark it locked */
        ret                     /* return control to caller */

unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */
First Attempt at Simple Software Lock (2)
Problem: the lock needs atomicity in its own implementation; the read (test) and write (set) of the lock variable by a process are not atomic.
Solution: atomic read-modify-write or exchange instructions, which atomically test the value of a location and set it to another value, returning success or failure somehow.
Atomic Exchange Instruction
Specifies a location and a register. In one atomic operation:
- The value in the location is read into the register
- Another value (possibly a function of the value read) is stored into the location
Many variants exist, with varying degrees of flexibility in the second part.
Atomic Exchange Instruction (2)
Simple example: test&set
- The value in the location is read into a specified register
- The constant 1 is stored into the location
- Successful if the value loaded into the register is 0
- Other constants could be used instead of 1 and 0
Can be used to build locks.
Simple Test&Set Lock

lock:   t&s register, location
        bnz lock                /* if not 0, try again */
        ret                     /* return control to caller */

unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */
Simple Test&Set Lock (2)
Other read-modify-write primitives can be used too:
- Swap
- Fetch&op
- Compare&swap: three operands (location, register to compare with, register to swap with); not commonly supported by RISC instruction sets
The lock variable can be cacheable or uncacheable (we assume cacheable).
Simple Test&Set Lock (3)
Microbenchmark on the SGI Challenge. Code:

    lock; delay(c); unlock;

Same total number of lock calls as p increases; measure time per transfer.
T&S Lock Microbenchmark Performance (2)
[Figure: time (µs) versus number of processors for test&set (c = 0), test&set with exponential backoff (c = 3.64), test&set with exponential backoff (c = 0), and the ideal case.]
Performance degrades because unsuccessful test&sets generate traffic.
Enhancements to Simple Lock Algorithm
Reduce the frequency of issuing test&sets while waiting:
- Test&set lock with backoff
- Don't back off too much, or you will be backed off when the lock becomes free
- Exponential backoff works quite well empirically: ith delay = k*c^i
Enhancements to Simple Lock Algorithm (2)
Busy-wait with read operations rather than test&set: the test-and-test&set lock.
- Keep testing with an ordinary load; the cached lock variable will be invalidated when a release occurs
- When the value changes (to 0), try to obtain the lock with test&set
- Only one attemptor will succeed; the others will fail and start testing again
Performance Criteria (T&S Lock)
- Uncontended latency: very low if repeatedly accessed by the same processor; independent of p
- Traffic: lots if many processors compete; poor scaling with p (each t&s generates invalidations, and all rush out again to t&s)
- Storage: very small (a single variable); independent of p
Performance Criteria (2)
- Fairness: poor; can cause starvation
Test&set with backoff is similar, but with less traffic.
Test-and-test&set: slightly higher latency and much less traffic, but still all waiters rush out to read-miss and test&set on release; traffic for p processors to acquire once each is O(p^2).
Luckily, better hardware primitives as well as better algorithms exist.
Improved Hardware Primitives: LL-SC
Goals:
- Test with reads
- Failed read-modify-write attempts don't generate invalidations
- Nice if a single primitive can implement a range of r-m-w operations
Two instructions: Load-Locked (or Load-Linked) and Store-Conditional. LL reads the variable into a register.
Improved Hardware Primitives (2)
- Follow the LL with arbitrary instructions to manipulate the value
- SC tries to store back to the location if and only if no one else has written to the variable since this processor's LL
- If the SC succeeds, all three steps happened atomically
- If it fails, it doesn't write or generate invalidations (need to retry from the LL)
- Success is indicated by condition codes
Simple Lock with LL-SC

lock:   ll   reg1, location  /* LL location to reg1 */
        sc   location, reg2  /* SC reg2 into location */
        beqz reg2, lock      /* if failed, start again */
        ret

unlock: st   location, #0    /* write 0 to location */
        ret
Simple Lock with LL-SC (2)
- Can do fancier atomic ops by changing what's between the LL and the SC, but keep it small so the SC is likely to succeed
- Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting a transaction on the bus) if it detects an intervening write even before trying to get the bus, or if it tries to get the bus but another processor's SC gets the bus first
- LL and SC are not lock and unlock, respectively: they only guarantee that there was no conflicting write to the lock variable between them
- But they can be used directly to implement simple operations on shared variables
More Efficient SW Locking Algorithms
Problems with the simple LL-SC lock:
- No invalidations on failure, but read misses by all waiters after both the release and the winner's successful SC
- No test-and-test&set analog, but backoff can be used to reduce burstiness
- Doesn't reduce traffic to a minimum, and is not a fair lock
More Efficient SW Locking (2)
Better SW algorithms for a bus (for r-m-w instructions or LL-SC):
- Only one process tries to get the lock upon release (valuable when using test&set instructions; LL-SC does this already)
- Only one process has a read miss upon release (valuable with LL-SC too)
The ticket lock achieves the first; the array-based queueing lock achieves both. Both are also fair (FIFO) locks.
Ticket Lock
Only one r-m-w (from only one processor) per acquire. Works like a waiting line at a bank:
- Two counters per lock (next_ticket, now_serving)
- Acquire: fetch&inc next_ticket; wait for now_serving to equal it (the atomic op happens when you arrive at the lock, not when it's free, so there is less contention)
- Release: increment now_serving
FIFO order; low latency under low contention if fetch&inc is cacheable.
Ticket Lock (2)
Works like a waiting line at a bank (cont'd):
- Still O(p) read misses at release, since all spin on the same variable (like the simple LL-SC lock, but with no invalidation when the SC succeeds, and fair)
- Can be difficult to find a good amount to delay on backoff: exponential backoff is not a good idea due to the FIFO order; backoff proportional to (my_ticket - now_serving) may work well
Wouldn't it be nice to poll different locations ...
Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p.
Acquire:
- fetch&inc to obtain the address on which to spin (the next array element)
- ensure that these addresses are in different cache lines or memories
Release:
- set the next location in the array, thus waking up the process spinning on it
- O(1) traffic per acquire with coherent caches
Array-based Queuing Locks (2)
Waiting processes poll on different locations in an array of size p (cont'd):
- FIFO ordering, as in the ticket lock
- But O(p) space per lock
- Good performance for bus-based machines
- Not so great for non-cache-coherent machines with distributed memory: the array location I spin on is not necessarily in my local memory
Lock Performance on SGI Challenge
[Figure: time (µs) versus number of processors (1-15) for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks, in three cases: (a) null (c = 0, d = 0), (b) critical-section (c = 3.64 µs, d = 0), (c) delay (c = 3.64 µs, d = 1.29 µs).]
Loop: lock; delay(c); unlock; delay(d);
Lock Performance on SGI Challenge (2)
- The simple LL-SC lock does best at small p, due to unfairness; not so with a delay between unlock and the next lock
- Need to be careful with backoff
- The ticket lock with proportional backoff scales well, as does the array lock
- Methodologically challenging; need to look at real workloads
Point-to-Point Event Synchronization
Software methods:
- Interrupts
- Busy-waiting: use ordinary variables as flags
- Blocking: use semaphores
Full hardware support: a full-empty bit with each word in memory
- Set when the word is "full" with newly produced data (i.e. when written)
- Unset when the word is "empty" due to being consumed (i.e. when read)
Point-to-Point Event Synchronization (2)
Full hardware support (cont'd):
- Natural for word-level producer-consumer synchronization: the producer writes if empty and sets to full; the consumer reads if full and sets to empty
- Hardware preserves the atomicity of the bit manipulation with the read or write
- Problem: flexibility. What about multiple consumers, or multiple writes before the consumer reads? Needs language support to specify when to use it. What about composite data structures?
Barriers
Software algorithms are implemented using locks, flags, and counters.
Hardware barriers:
- A wired-AND line separate from the address/data bus
- Set your input high when you arrive; wait for the output to be high before leaving
- In practice, multiple wires allow reuse
- Useful when barriers are global and very frequent
Barriers (2)
Hardware barriers (cont'd):
- Difficult to support an arbitrary subset of processors (even harder with multiple processes per processor)
- Difficult to dynamically change the number and identity of participants (e.g. the latter due to process migration)
- Not common today on bus-based machines
Let's look at software algorithms with simple hardware primitives.
A Simple Centralized Barrier
A shared counter maintains the number of processes that have arrived: increment when you arrive (under a lock), then check until it reaches numprocs.

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;              /* reset flag if first to reach */
    mycount = bar_name.counter++;       /* mycount is private */
    UNLOCK(bar_name.lock);
    if (bar_name.counter == p) {        /* last to arrive */
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = 1;              /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};  /* busy wait for release */
}

Problem?
A Working Centralized Barrier
Consecutively entering the same barrier doesn't work: we must prevent a process from entering until all have left the previous instance.
- Could use another counter, but that increases latency and contention
- Sense reversal: wait for the flag to take a different value in consecutive instances; toggle this value only when all processes have reached the barrier
A Working Centralized Barrier (2)

BARRIER (bar_name, p) {
    local_sense = !(local_sense);       /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;       /* mycount is private */
    if (bar_name.counter == p) {
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = local_sense;    /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
Centralized Barrier Performance
- Latency: want a short critical path in the barrier; the centralized barrier's critical path length is at least proportional to p
- Traffic: barriers are likely to be highly contended, so want traffic to scale well; about 3p bus transactions in the centralized barrier
- Storage cost: very low (a centralized counter and flag)
Centralized Barrier Performance (2)
- Fairness: the same processor should not always be last to exit the barrier; the centralized barrier has no such bias
The key problems for the centralized barrier are latency and traffic, especially with distributed memory, where all the traffic goes to the same node.
Improved Barrier Algorithms for a Bus
Software combining tree: only k processors access the same location, where k is the degree of the tree.
- Flat: contention
- Tree-structured: little contention
Improved Barrier Algorithms for a Bus (2)
- Separate arrival and exit trees, and use sense reversal
- Valuable in a distributed network: communicate along different paths
- On a bus, all traffic goes on the same bus, and there is no less total traffic
- Higher latency (log p steps of work, and O(p) serialized bus transactions)
- The advantage on a bus is the use of ordinary reads/writes instead of locks
Barrier Performance on SGI Challenge
[Figure: time (µs) versus number of processors (1-8) for centralized, combining tree, tournament, and dissemination barriers.]
The centralized barrier does quite well.
Synchronization Summary
- Rich interaction of hardware-software tradeoffs
- Must evaluate hardware primitives and software algorithms together: the primitives determine which algorithms perform well
- Evaluation methodology is challenging: use of delays and microbenchmarks; should use both microbenchmarks and real workloads
- Simple software algorithms with common hardware primitives do well on a bus
Key Takeaways for this Session
- Multi-core processors are here; these are multiprocessor/MIMD systems
- We need to understand parallel programming: its strengths, weaknesses, opportunities, and threats; there is no "free lunch" for performance improvement
- System support for multi-core is available: OSes (both Linux and Windows support them) and compilers/language support (gcc, C#, Java)
- Two types of development tracks: high performance computing and high throughput computing; both have their unique challenges
Key Takeaways (2)
High performance computing:
- Most scientific/engineering applications
- Available programming models: message-passing (MPI) or shared-memory processing (OpenMP)
- Challenge: performance scalability with cores and problem size while dealing with data/function partitioning
High throughput computing:
- Most business applications
- Available programming model: multi-threading (shared-memory processing)
- Challenge: performance scalability while dealing with deadlocks, locking, cache, and memory issues