Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors
Mellor-Crummey and Scott
Presented by Robert T. Bauer
Problem
• Efficient SMMP Reader/Writer Synchronization
Basics
• Readers can “share” a data structure
• Writers need exclusive access
– A write appears to be atomic
• Issues:
– Fairness: fair means every “process” eventually runs
– Preference:
• Reader preference – a writer can starve
• Writer preference – a reader can starve
Organization
• Algorithm 1 – simple mutual exclusion
• Algorithm 2 – RW with reader preference
• Algorithm 3 – a fair lock
• Algorithm 4 – local-only spinning (fair)
• Algorithm 5 – local-only spinning, reader preference
• Algorithm 6 – local-only spinning, writer preference
• Conclusions
Paper’s Contributions
Algorithm 1 – just a spin lock
• Idea is that processors spin on their own lock record
• Lock records form a linked list
• When a lock is released, the “next” processor waiting on the lock is signaled by passing the lock
• By using “compare-and-swap” when releasing, the algorithm guarantees FIFO ordering
• Spinning is “local” by design
Algorithm 1
• Acquire Lock
pred := fetch_and_store(L, I)
if pred /= null
I->locked := true
pred->next := I
repeat while I->locked
• Release Lock
if I->next == null
if compare_and_swap(L, I, null) return
repeat while I->next == null
I->next->locked := false
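A minimal C11 sketch of this list-based queue lock (the MCS lock), assuming sequentially consistent atomics; `I` is the caller’s own queue record, as on the slide:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each processor spins only on its own record, so spinning stays local. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;         /* L points at the queue tail */

void mcs_acquire(mcs_lock *L, mcs_node *I) {
    atomic_store(&I->next, NULL);
    mcs_node *pred = atomic_exchange(L, I);   /* fetch_and_store(L, I)   */
    if (pred != NULL) {                       /* queue was non-empty     */
        atomic_store(&I->locked, true);
        atomic_store(&pred->next, I);         /* link behind predecessor */
        while (atomic_load(&I->locked))       /* spin on our own flag    */
            ;
    }
}

void mcs_release(mcs_lock *L, mcs_node *I) {
    if (atomic_load(&I->next) == NULL) {
        mcs_node *expected = I;
        /* No visible successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong(L, &expected, NULL))
            return;
        /* A successor is mid-enqueue; wait for it to link itself in. */
        while (atomic_load(&I->next) == NULL)
            ;
    }
    mcs_node *succ = atomic_load(&I->next);
    atomic_store(&succ->locked, false);       /* pass the lock on (FIFO) */
}
```

With no contention, acquire is one atomic swap and release is one compare-and-swap; the swap order on the tail is what gives the FIFO guarantee mentioned above.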
Algorithm 2 – Simple RW lock with reader preference
Bit 0 – writer active?
Bits 31:1 – count of interested readers
start_write – repeat until compare_and_swap(L, 0, 0x1)
start_read – atomic_add(L, 2); repeat until ((L & 0x1) = 0)
end_write – atomic_add(L, -1)
end_read – atomic_add(L, -2)
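In C11 atomics this reader-preference lock is only a few lines (a sketch; `2` is the reader increment so that bit 0 stays the writer flag, as on the slide):

```c
#include <stdatomic.h>
#include <stdint.h>

/* One lock word: bit 0 = writer active, bits 31:1 = interested readers. */
typedef atomic_uint_fast32_t rw_lock;

void start_write(rw_lock *L) {
    uint_fast32_t expected = 0;
    /* Enter only from the all-clear state: no readers, no writer. */
    while (!atomic_compare_exchange_weak(L, &expected, 0x1u))
        expected = 0;                 /* CAS rewrote expected; reset it */
}

void end_write(rw_lock *L) { atomic_fetch_sub(L, 0x1u); }

void start_read(rw_lock *L) {
    atomic_fetch_add(L, 2);           /* announce reader interest first */
    while (atomic_load(L) & 0x1u)     /* then wait out an active writer */
        ;
}

void end_read(rw_lock *L) { atomic_fetch_sub(L, 2); }
```

Because readers bump the count before checking the writer bit, a steady stream of readers keeps the count non-zero and the writer’s CAS keeps failing – exactly the reader preference (and writer starvation) described above.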
Algorithm 3 – Fair Lock
Lock layout: two words, Requests and Completions, each holding a writer count and a reader count.
start_write –
prev = fetch_clear_then_add(L->requests, MASK, 1) // ++ write requests
repeat until L->completions = prev // wait for previous readers and writers to go first
end_write – clear_then_add(L->completions, MASK, 1) // ++ write completions
start_read –
prev_writers = fetch_clear_then_add(L->requests, MASK, 1) & MASK // ++ read requests; get count of previous writers
repeat until (L->completions & MASK) = prev_writers // wait for previous writers to go first
end_read – clear_then_add(L->completions, MASK, 1) // ++ read completions
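A simplified C11 sketch of the two-counter fair lock. Assumptions on my part: each 32-bit word packs a 16-bit reader count (low half) and writer count (high half), and the paper’s `fetch_clear_then_add` primitive (which also handles counter wraparound) is replaced by a plain fetch-and-add:

```c
#include <stdatomic.h>
#include <stdint.h>

/* requests/completions each pack: low 16 bits = readers, high 16 = writers. */
enum { READER_INC = 0x1, WRITER_INC = 0x10000, READER_MASK = 0xFFFF };

typedef struct {
    atomic_uint_fast32_t requests;
    atomic_uint_fast32_t completions;
} fair_rw_lock;

void fair_start_write(fair_rw_lock *L) {
    /* ++ write requests; wait for ALL previous readers and writers. */
    uint_fast32_t prev = atomic_fetch_add(&L->requests, WRITER_INC);
    while (atomic_load(&L->completions) != prev)
        ;
}

void fair_end_write(fair_rw_lock *L) {
    atomic_fetch_add(&L->completions, WRITER_INC);   /* ++ write completions */
}

void fair_start_read(fair_rw_lock *L) {
    /* ++ read requests; snapshot the count of previous WRITERS only. */
    uint_fast32_t prev_writers =
        atomic_fetch_add(&L->requests, READER_INC) & ~(uint_fast32_t)READER_MASK;
    while ((atomic_load(&L->completions) & ~(uint_fast32_t)READER_MASK)
           != prev_writers)
        ;
}

void fair_end_read(fair_rw_lock *L) {
    atomic_fetch_add(&L->completions, READER_INC);   /* ++ read completions */
}
```

Readers wait only for earlier writers, so concurrent readers still overlap; writers wait for everything earlier. That is the fairness condition restated on the Algorithm 4 slide.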
So far so good, but …
• Algorithm 2 and 3 spin on a shared memory location.
• What we want is for the algorithms to spin on processor local variables.
• Note – results weren’t presented for Algorithms 2 and 3. We can guess their performance, though, since we know the general characteristics of contention.
Algorithm 4 – Fair R/W Lock: Local-Only Spinning
• Fairness Algorithm
– read request granted when all previous write requests have completed
– write request granted when all previous read and write requests have completed
Lock and Local Data Layout
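The layout referred to here can be sketched as C structs (field names follow the slides and the paper; the exact types and widths are my assumptions):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { READING, WRITING } req_class;          /* kind of request  */
typedef enum { NONE, READER, WRITER } succ_class;     /* kind of follower */

/* Per-request queue record; each processor spins on its OWN blocked flag. */
typedef struct qnode {
    req_class class;                     /* read or write?                  */
    _Atomic(struct qnode *) next;        /* successor in the queue          */
    /* "state" on the slides is the (blocked, successor_class) pair */
    atomic_bool blocked;                 /* local spin target               */
    _Atomic succ_class successor_class;  /* what kind of request follows us */
} qnode;

typedef struct {
    _Atomic(qnode *) tail;          /* Lock.tail – last queued request         */
    atomic_uint reader_count;       /* Lock.reader_count – active readers      */
    _Atomic(qnode *) next_writer;   /* Lock.next_writer – first waiting writer */
} rw_qlock;
```

Keeping `blocked` in the requester’s own record, rather than in the shared lock word, is what makes all the spinning in Cases 1–4 local.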
Case 1: Just a Read
pred == nil
Lock.tail → I
Upon exit:
Lock.tail → I
Lock.reader_count == 1
Case 1: Exit Read
next == nil
Lock.tail → I, so the CAS returns true
Lock.reader_count == 1
Lock.next_writer == nil
Upon exit:
Lock.tail == nil
Lock.reader_count == 0
Case 2: Overlapping Reads
After the first read:
Lock.tail → I1
Lock.reader_count == 1
When the second read arrives, pred is not nil:
pred->class == reading
pred->state == [false, none]
Lock.reader_count == 2
Case 2: Overlapping Reads
After the 2nd read enters:
Lock.tail → I2
I1->next == I2
Case 2: Overlapping Reads
When I1 finishes: next != nil
When I2 finishes: Lock.tail = nil
reader_count goes to zero after I1 and I2 finish
Case 3: Read Overlaps Write
• The previous cases weren’t interesting, but they did help us get familiar with the data structures and (some of) the code.
• Now we need to consider the case where a “write” has started, but a read is requested. The read should block (spin) until the write completes.
• We need to “prove” that the spinning occurs on a locally cached memory location.
Case 3: Read Overlaps Write – The Write
Upon exit:
Lock.tail → I
Lock.next_writer = nil
I.class = writing, I.next = nil
I.blocked = false, I.successor_class = none
pred == nil, so blocked is reset to false
Case 3: Read Overlaps Write – The Read
pred->class == writing
wait here for the write to complete
Case 3: Read Overlaps Write – The Write Completes
I.next → the read
unlock the reader
Yes! This works, but it is “uncomfortable” because concerns aren’t separated
Case 3: What if there were more than 1 reader?
change the predecessor reader
wait here
Yes! Changed by the successor
unblock the successor
Case 4: Write Overlaps Read
• Overlapping reads form a chain
• The overlapping write “spins,” waiting for the read chain to complete
• Reads that “enter” after the write “enters,” but before the write completes (even while the write is “spinning”), form a chain following the write (as with case 3)
Case 4: Write Overlaps Read
wait here
Algorithm 5 – Reader Preference R/W Lock: Local-Only Spinning
• We’ll look at the Reader-Writer-Reader case and demonstrate that the second Reader completes before the Writer is signaled to start.
1st Reader
++reader_count
WAFLAG == 0, so no writer is active
The 1st reader just runs!
Overlapping Write
queue the write
Register writer interest; the result is not zero, since there is a reader
We have a reader, so the CAS fails.
The writer blocks here, waiting for a reader to set blocked = false
2nd Reader
Still no active writer
++reader_count
Reader Completes
Only the last reader will satisfy the equality
The last reader to complete will set WAFLAG and unblock the writer
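This Reader-Writer-Reader hand-off can be sketched as the decision logic over one counter word. Heavy assumptions on my part: the flag layout, the `rp_` names, and the elision of the spin loops and the queues of waiting processes; each function just reports whether the caller may proceed (or, for `rp_reader_exit`, whether it handed the lock to the writer):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 = WIFLAG (writer interested), bit 1 = WAFLAG (writer active),
 * bits 2+ = reader count. Layout is illustrative, not the paper's. */
enum { WIFLAG = 0x1, WAFLAG = 0x2, RP_RC_INC = 0x4 };

typedef atomic_uint_fast32_t rp_word;

/* Readers never wait for a merely *interested* writer, only an active one. */
bool rp_reader_enter(rp_word *c) {
    uint_fast32_t old = atomic_fetch_add(c, RP_RC_INC);  /* ++reader_count */
    return (old & WAFLAG) == 0;          /* WAFLAG == 0 -> reader just runs */
}

/* A writer registers interest; it may start at once only if the word held
 * no readers and no other writer (the CAS from the slides). */
bool rp_writer_enter(rp_word *c) {
    atomic_fetch_or(c, WIFLAG);
    uint_fast32_t expected = WIFLAG;
    return atomic_compare_exchange_strong(c, &expected, WAFLAG);
}

/* The LAST reader out converts WIFLAG to WAFLAG; this is the point where
 * the real algorithm clears the waiting writer's blocked flag. */
bool rp_reader_exit(rp_word *c) {
    uint_fast32_t old = atomic_fetch_sub(c, RP_RC_INC);
    if (old == (RP_RC_INC | WIFLAG)) {   /* we were the last reader */
        uint_fast32_t expected = WIFLAG;
        return atomic_compare_exchange_strong(c, &expected, WAFLAG);
    }
    return false;
}
```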
Algorithm 6 – Writer Preference R/W Lock: Local-Only Spinning
• We’ll look at the Writer-Reader-Writer case and demonstrate that the second Writer completes before the Reader is signaled to start.
1st Writer
“set_next_writer”
1st writer: writer interested or active
no readers, just the writer – the writer should run
blocked = false, so the writer starts
Reader
put the reader on the queue
“register” the reader; see if there are writers
wait here for the writer to complete
2nd Writer
queue this write behind the other write
and wait
Writer Completes
start the queued write
Last Writer Completes
clear the write flags; signal the readers
Unblock Readers
++reader_count; clear “readers interested”
no writers waiting or active
empty the “waiting” reader list
when this reader continues, it will unblock the “next” reader – which will unblock the “next” reader, etc.; the reader count gets bumped
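The Writer-Reader-Writer hand-off can be sketched the same way (again my assumptions: the `wp_` names, a single word plus a waiting-writer counter standing in for the paper’s explicit reader and writer queues, and spin loops elided):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 = "writer interested or active"; bits 1+ = active reader count. */
enum { WP_WFLAG = 0x1, WP_RC_INC = 0x2 };

typedef struct {
    atomic_uint_fast32_t counts;
    atomic_uint waiting_writers;   /* writers queued behind the active one */
} wp_lock;

/* Readers defer to ANY interested or active writer (writer preference). */
bool wp_reader_enter(wp_lock *L) {
    uint_fast32_t old = atomic_fetch_add(&L->counts, WP_RC_INC);
    if (old & WP_WFLAG) {                      /* writer present: back out */
        atomic_fetch_sub(&L->counts, WP_RC_INC);
        return false;                          /* would go on reader queue */
    }
    return true;
}

/* The first writer claims the flag; later writers queue behind it. */
bool wp_writer_enter(wp_lock *L) {
    uint_fast32_t old = atomic_fetch_or(&L->counts, WP_WFLAG);
    if (old & WP_WFLAG) {
        atomic_fetch_add(&L->waiting_writers, 1);
        return false;
    }
    return old == 0;            /* run at once only if no active readers */
}

/* A finishing writer starts the next queued write if any; only the LAST
 * writer clears the flag, which is when the waiting readers are released. */
bool wp_writer_exit(wp_lock *L) {
    if (atomic_load(&L->waiting_writers) > 0) {
        atomic_fetch_sub(&L->waiting_writers, 1);
        return true;                     /* flag stays set for successor */
    }
    atomic_fetch_and(&L->counts, ~(uint_fast32_t)WP_WFLAG);
    return false;
}
```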
Results & Conclusion
• The authors reported results for a different algorithm than was presented here.
• The algorithms used for comparison were more costly in a multiprocessor environment, so they’re claiming that the algorithms presented here would be “better.”
Timing Results
Latency is costly because of the number of atomic operations.