acoherent shared memory derek r. hower ph.d. defense july 16, 2012

Acoherent

Shared Memory

Derek R. Hower, Ph.D. Defense, July 16, 2012

Executive Summary

Coherent: Simple abstraction?

- Complex implementation
- Hides caches (bad?!)
- High overhead

Acoherent: Simple abstraction

- Simple implementation
- Abstracts caches
- Low overhead

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Trends

We must change

We can change

Energy Matters
- Dark Silicon/Mobile/Datacenter
- < 50% of processor powered by 2024 [1]

Complexity Matters
- Lower barrier to entry for accelerators

Area Matters
- New tech nodes are not cheaper [2]

Memory: may be difficult to turn off
- e.g., S-NUCA

Compatibility doesn’t matter
- Vertical integration is the new black

[1] Esmaeilzadeh et al., ISCA 2011
[2] ExtremeTech 2012

The Problem With Coherence

Wrong abstraction
- Optimized for fine-grained, share-everything
  • Programs aren’t!

Makes SW isolation hard
- Hypothesis: SW will want control over data placement

Impedes HW specialization
- Does your multicore ASIC need a coherence controller?
- Coherent GPUs?

Efficiency problems
- Directories take space; broadcasts take energy
  • e.g., 14% of cache is dedicated to the directory on a 4-core die [1]

[1] Stackhouse et al., ISSCC 2008

Rethinking Coherence: Goals

Maintain programmer sanity
- Keep shared memory
- Minimal compatibility change

Expose hardware capabilities
- Let SW guide memory management -> semantics

Simple hardware
- Lower cost of entry for accelerators

Solution: Acoherent Shared Memory

Outline

ASM Model Basics

- Replace black box with simple hierarchy
- Still flat, linear address space
- SW gets private storage
- Manage with CVS-like checkout/checkin

Checkout/Checkin

Checkout: Pull data into private storage

Checkin: Publish local updates globally

Checkout/Checkin are not synchronization primitives - Closer to a FENCE
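The checkout/checkin semantics above can be modeled in software. The following is a minimal C sketch, not the ASM hardware mechanism: it assumes a single fixed-size segment and an eager copy on checkout, and the names `global_seg_t`, `private_seg_t`, `SEG_WORDS`, `store`, and `load` are hypothetical and do not appear in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SEG_WORDS 8  /* hypothetical segment size, in words */

/* Globally visible copy of one segment. */
typedef struct { int data[SEG_WORDS]; } global_seg_t;

/* One thread's private working copy, plus a record of what it wrote. */
typedef struct {
    int  data[SEG_WORDS];
    bool dirty[SEG_WORDS];
} private_seg_t;

/* Checkout: pull global data into private storage.
 * Local updates that were never checked in are discarded. */
void checkout(private_seg_t *p, const global_seg_t *g) {
    memcpy(p->data, g->data, sizeof p->data);
    memset(p->dirty, 0, sizeof p->dirty);
}

/* Stores and loads touch only the private copy. */
void store(private_seg_t *p, int idx, int value) {
    assert(idx >= 0 && idx < SEG_WORDS);
    p->data[idx] = value;
    p->dirty[idx] = true;
}

int load(const private_seg_t *p, int idx) {
    assert(idx >= 0 && idx < SEG_WORDS);
    return p->data[idx];
}

/* Checkin: publish only the locally written words globally. */
void checkin(private_seg_t *p, global_seg_t *g) {
    for (int i = 0; i < SEG_WORDS; i++) {
        if (p->dirty[i]) {
            g->data[i] = p->data[i];
            p->dirty[i] = false;
        }
    }
}
```

In this model a store by one thread stays invisible to another thread until the writer checks in and the reader checks out again, and a second checkout without an intervening checkin drops the local updates.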

Granularity?

Compromise: Memory Segments
– Linear partition of address space
– CO/CI segments at a time

Observation: Programs are already segmented (e.g., BSS, Data)
- Can re-use layout
- Typical CO/CI granularity in existing C code

Segment Types

- Not all memory wants/needs acoherence
- Segment types give different “views”
- Communicate semantic information to HW

[Figure: address-space layout with segments labeled Acoherent, Private (Stack), BSS, Data, and Coherent RO; legend: Shared, Private, Shared Read-Only]

Available Types

Private

Coherent-RW

Coherent-RO

Acoherent

Device

Managing Finite Resources

Model so far is strong acoherence
- Likely requires prohibitive HW resources

Also weak acoherence and best-effort acoherence
- Still useful to software/hardware

Weak acoherence: Data visible early (before checkin)
- Synchronized => not a problem

Best-effort acoherence: Spontaneous checkouts at any time
  • + SW notification
  • All-or-nothing
- Hybrid Runtimes => not a problem

Case Study: pthreads

pthread_barrier_t barrier;
char* shared_data;

void* worker(void* arg);

int main(int argc, char* argv[]) {
    int i, j, k;
    pthread_t sib;
    shared_data = malloc(PROBLEM_SIZE);
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&sib, NULL, worker, (void*) 1);
    worker((void*) 0);
    pthread_join(sib, NULL);
    return 0;
}

void* worker(void* arg) {
    while (work remains) {
        <split work>
        <do work>
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}

Task: Convert to ASM

Step 1: Assign Segments (Automatic: Runtime)
• Global, Heap in acoherent segment (shared_data)
• Stack in private segment (argc, argv, i, j, k, sib)
• Synch. in coherent-RW segment (barrier)
• Text in coherent-RO segment

Step 2: Checkout/Checkin (Automatic: Library)
• CI/CO Global, Heap at synchronization
• Communication Point: the barrier

Application code works as is; the library changes:

int pthread_barrier_init(…) {
    …
    _barrier = coherent_malloc(sizeof(int));
    …
}

int pthread_barrier_wait(…) {
    …
    checkin(heap, data);
    <barrier>
    checkout(heap, data);
    …
}

Memory Consistency Model

Option 1: The Details (6 slides + really ugly equations)

Option 2: The Highlights (2 slides)

Memory Consistency Model

Defined in style of SPARC TSO/RMO
- Memory Order: Total order of memory ops
  • Restricted by consistency model
- Processor Order: Local dependencies
- Value of load: defined via memory + processor order

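Weak Acoherence

Notation: $<_p$ is processor order and $<_m$ is memory order. $L_i^S[a]$ and $S_i^S[a]$ are a load and a store by processor $i$ to address $a$ in segment $S$; a prime marks a second, distinct operation. $CX$ is either CO or CI.

1. Define Memory Order

(a) Load -> Load to same address:
$L_i^S[a] <_p {L'}_i^S[a] \Rightarrow L_i^S[a] <_m {L'}_i^S[a]$

(b) Load -> Store to same address:
$L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$

(c) Store -> Store to same address:
$S_i^S[a] <_p {S'}_i^S[a] \Rightarrow S_i^S[a] <_m {S'}_i^S[a]$

(d) Paired CI-CO act as distributed fence (CI-CO pair => fence):
$S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S[a] <_m L_j^S$

(e) CI/CO -> CI/CO (total order of CO/CI):
$CX <_p CX' \Rightarrow CX <_m CX'$

2. Define legal value of loads (same as TSO, etc.)

$$value(L_i^S[a]) = value\left(\max_{<_m}\{\, S^S[a] \mid S^S[a] <_m L_i^S[a] \ \text{or}\ S_i^S[a] <_p L_i^S[a] \,\}\right)$$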

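Strong Acoherence

Notation: $<_p$ is processor order and $<_m$ is memory order. $L_i^S[a]$ and $S_i^S[a]$ are a load and a store by processor $i$ to address $a$ in segment $S$; a prime marks a second, distinct operation. $CX$ is either CO or CI. $\mathrm{next}_{<_p}(X, Y)$ is the first $X$ after $Y$ in processor order, and $\max_{<_p}(X, Y)$ is the latest $X$ before $Y$ in processor order.

1. Define Memory Order

(a) Load -> Load to same address:
$L_i^S[a] <_p {L'}_i^S[a] \Rightarrow L_i^S[a] <_m {L'}_i^S[a]$

(b) Load -> Store to same address:
$L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$

(c) Store -> Store to same address:
$S_i^S[a] <_p {S'}_i^S[a] \Rightarrow S_i^S[a] <_m {S'}_i^S[a]$

(d) Paired CI-CO act as distributed fence:
$S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S[a] <_m L_j^S$

(e) CI/CO -> CI/CO:
$CX <_p CX' \Rightarrow CX <_m CX'$

(f) Store not visible until CI (normally):
$S_i^S <_p \mathrm{next}_{<_p}(CI_i^S, S_i^S) <_p \mathrm{next}_{<_p}(CO_i^S, S_i^S) \Rightarrow \mathrm{next}_{<_p}(CI_i^S, S_i^S) <_m S_i^S$
In short: $S_i^S <_p CI_i^S \Rightarrow CI_i^S <_m S_i^S$

Stores can be clobbered (can “lose” data):
$S_i^S <_p CO_i^S <_p \mathrm{next}_{<_p}(CI_i^S, S_i^S) \Rightarrow S_i^S <_m \max_{<_p}(CO_i^S, S_i^S)$

2. Define legal value of loads

$$value(L_i^S[a]) = value\left(\max_{<_p}\{\, S_i^S[a] \mid \max_{<_p}(CO_i^S, L_i^S[a]) <_p S_i^S[a] <_p L_i^S[a] \,\}\right)$$

or, if that store does not exist,

$$value(L_i^S[a]) = value\left(\max_{<_m}\{\, S^S[a] \mid S^S[a] <_m \max_{<_p}(CO_i^S, L_i^S[a]) \,\}\right)$$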

Other Segment Types

Coherent
- Like weak, but:
  • Loads implicitly paired with (atomic) CO
  • Stores implicitly paired with (atomic) CI
- SC w.r.t. each other

Private
- Like weak

Analysis

CO/CI not atomic. Subtleties:

(a) Lazy checkout (initially, A = 0)

    Thread 0            Thread 1
    00: CHECKOUT        10: A = 1
    01: R0 = A          11: CHECKIN

    Strong: R0 = 0 or 1
    Weak:   R0 = 0 or 1

(b) Isolation (initially, A = 0)

    Thread 0            Thread 1
    02: CHECKOUT        12: A = 1
    03: R0 = A          13: CHECKIN
    04: R1 = A

    Strong: R0 = 0, R1 = 0
    Weak:   R0 = 0, R1 = 0 or 1

(c) Leaky stores (initially, A = 0)

    Thread 0            Thread 1
    05: CHECKOUT        14: A = 1
    06: R0 = A

    Strong: R0 = 0
    Weak:   R0 = 0 or 1

ASM = SC for DRF

ASM = SC for lossless and properly paired

Lossless: No clobbering checkouts, i.e.,

$$\forall S_i^S[a],\ \forall CO_i^S:\ \text{if}\ S_i^S[a] <_p CO_i^S\ \text{then}\ \exists\, CI_i^S : S_i^S[a] <_p CI_i^S <_p CO_i^S$$

Properly Paired: All conflicting store -> load pairs separated by CI/CO, i.e.,

$$\forall L_j^S[a], S_i^S[a] :\ value(L_j^S[a]) = value(S_i^S[a]),\ i \neq j \ \Rightarrow\ \exists\, CO_j^S, CI_i^S : S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S[a]$$

Proof sketch:
- LL+PP executions defined by CO/CI order, program order only
- CO/CI, program order same in ASM, SC
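The lossless condition (no clobbering checkouts) can be checked mechanically on a per-thread trace. This is a minimal C sketch under stated assumptions: one thread, one segment, and a trace given in program order. The `op_t` encoding and the function name `is_lossless` are hypothetical and are not part of ASM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of one thread's operations on a single segment. */
typedef enum { OP_LOAD, OP_STORE, OP_CHECKOUT, OP_CHECKIN } op_t;

/* Returns true if no store in the trace is followed by a checkout
 * without a checkin in between (i.e., no checkout clobbers a store). */
bool is_lossless(const op_t *trace, size_t n) {
    assert(trace != NULL || n == 0);
    bool unpublished = false;  /* a store has happened since the last CI */
    for (size_t i = 0; i < n; i++) {
        switch (trace[i]) {
        case OP_STORE:
            unpublished = true;
            break;
        case OP_CHECKIN:
            unpublished = false;
            break;
        case OP_CHECKOUT:
            if (unpublished) return false;  /* this CO would clobber */
            break;
        case OP_LOAD:
            break;
        }
    }
    return true;
}
```

A trace that ends with a store and no later checkout still counts as lossless here, because the condition only constrains stores that are followed by a checkout.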

CO/CI Semantics

- CO/CI like fence
- Lazy checkouts
- Non-atomic, non-blocking checkins
  • Updates can interleave

Lazy checkout (initially, A = 0)

    Thread 0            Thread 1
    00: CHECKOUT        10: A = 1
    01: R0 = A          11: CHECKIN

    Finally: R0 = 0 or 1

Interleaved checkins (initially, A = 0)

    Thread 0            Thread 1
    00: A = 1           10: A = 10
    01: B = 2           11: B = 20
    02: CHECKIN         12: CHECKIN

    Finally, any combo of: A = 1 or 10, B = 2 or 20
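The "any combo" outcome follows from checkins being non-atomic: the per-word write-backs of the two threads can interleave. The C sketch below enumerates those interleavings. It assumes each checkin writes back A and then B as two separate steps, which is an illustrative simplification and not a statement about ASM-CMP. `mem_t`, `apply`, and `count_outcomes` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int a, b; } mem_t;

/* One write-back step: thread 0 publishes A = 1 then B = 2,
 * thread 1 publishes A = 10 then B = 20. */
void apply(mem_t *m, int thread, int step) {
    assert(step == 0 || step == 1);
    if (step == 0) m->a = (thread == 0) ? 1 : 10;
    else           m->b = (thread == 0) ? 2 : 20;
}

/* Enumerate every interleaving of the two 2-step checkins.
 * Bit k of `order` says which thread performs the k-th write-back.
 * seen[2*(A==10) + (B==20)] is set for each final state reached.
 * Returns the number of distinct final states. */
int count_outcomes(bool seen[4]) {
    int distinct = 0;
    for (int i = 0; i < 4; i++) seen[i] = false;
    for (int order = 0; order < 16; order++) {
        int ones = 0;
        for (int k = 0; k < 4; k++) ones += (order >> k) & 1;
        if (ones != 2) continue;  /* each thread must take exactly 2 steps */
        mem_t m = {0, 0};
        int step[2] = {0, 0};
        for (int k = 0; k < 4; k++) {
            int t = (order >> k) & 1;
            apply(&m, t, step[t]++);
        }
        int idx = (m.a == 10) * 2 + (m.b == 20);
        if (!seen[idx]) { seen[idx] = true; distinct++; }
    }
    return distinct;
}
```

Under these assumptions all four combinations of A and B are reachable, matching the slide.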

Consistency Highlights

- Coherent accesses have implicit CO/CI
- CO/CI are totally ordered
  • Transitivity hides non-atomicity
- Sequentially consistent for data-race-free
  • Lossless & Properly Paired

Example (implicit CO/CI of each coherent access shown indented):

    Thread 0                      Thread 1
    ST critical                   LL lock
    CI critical_segment               CO lock_segment
    STsync lock                       LD lock
        ST lock                   SC lock
        CI lock_segment               ST lock
                                      CI lock_segment
                                  CO critical_segment
                                  LD critical

Outline

ASM-CMP Overview

Based on MIPS
- + special insns, e.g., checkout, checkin

Uses segments, no paging
  • Maintains flat address space

Coherence protocol -> Acoherence Engine
- DMA for caches
  • Selectively move data

Skipping the Details

[Figure: ASM-CMP block diagram. Labels: Baseline; Core with L1I and L1; acoherence engines (AE); Switch; Memory Controller; Non-inclusive L2; Exclusive L2. Segment types shown: Acoherent, Coherent-RW, Private]

Acoherence Engine

Three main responsibilities:
- Checkout (Lazy Flash Invalidate):
  • Invalidate all segment data
- Checkin (Track write set):
  • Write back all dirty segment data
- Order (Timestamp based):
  • Detect CI-CO pairs

FSM like coherence, but few races, no directory

Decoupled Metastate Cache

Decoupled Metastate Cache

- All L1 Caches
- Decouple metastate from data
- Quick access to aggregate state
- Track V/D per-segment
- Checkout: XOR global/segment valid
- Checkin: Walk segment dirty state
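The point of decoupled metastate is that a checkout can invalidate a whole segment without touching every line. The C sketch below illustrates that idea with a swapped-in technique: a per-segment epoch counter instead of the XOR of global/segment valid bits named on the slide. It assumes one segment and ignores counter wrap-around. `metastate_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 4  /* hypothetical number of cache lines tracked */

typedef struct {
    uint32_t epoch[NUM_LINES];   /* segment epoch when each line was filled */
    bool     filled[NUM_LINES];  /* line has ever been filled */
    uint32_t seg_epoch;          /* current epoch of the segment */
} metastate_t;

void ms_init(metastate_t *ms) {
    for (int i = 0; i < NUM_LINES; i++) {
        ms->epoch[i] = 0;
        ms->filled[i] = false;
    }
    ms->seg_epoch = 0;
}

/* A line is valid only if it was filled in the current epoch. */
bool line_valid(const metastate_t *ms, int line) {
    assert(line >= 0 && line < NUM_LINES);
    return ms->filled[line] && ms->epoch[line] == ms->seg_epoch;
}

/* Filling a line stamps it with the current epoch. */
void fill_line(metastate_t *ms, int line) {
    assert(line >= 0 && line < NUM_LINES);
    ms->filled[line] = true;
    ms->epoch[line] = ms->seg_epoch;
}

/* Checkout: one counter update invalidates every line of the segment. */
void flash_checkout(metastate_t *ms) {
    ms->seg_epoch++;
}
```

The checkout cost is constant in the number of lines; stale lines are detected lazily, on their next access.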

Order

Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn’t completed

Timestamp algorithm (per segment): Two phase CO/CI
1. Acquire timestamp
   1. Invalidate/Flush
2. Wait for previous CO/CI to complete

Implemented in firmware
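The two-phase timestamp scheme resembles a ticket discipline: take a number, do the local work, then complete only after every earlier number has completed. The following C sketch is a sequential model of that bookkeeping under stated assumptions (one segment, no real concurrency, no atomics); it is not the ASM-CMP firmware. `seg_order_t` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-segment ordering state. */
typedef struct {
    unsigned next_ts;    /* next timestamp to hand out */
    unsigned completed;  /* number of CO/CI operations fully completed */
} seg_order_t;

/* Phase 1: acquire a timestamp. The caller then invalidates (CO)
 * or flushes (CI) its local data. */
unsigned acquire_timestamp(seg_order_t *s) {
    return s->next_ts++;
}

/* Phase 2 condition: every CO/CI with a smaller timestamp has completed. */
bool previous_complete(const seg_order_t *s, unsigned ts) {
    return s->completed == ts;
}

/* Mark this CO/CI complete; only legal once all earlier ones are done. */
void complete(seg_order_t *s, unsigned ts) {
    assert(previous_complete(s, ts));
    s->completed = ts + 1;
}
```

A load that follows a CO would be delayed until `previous_complete` holds for the CO's timestamp, which covers the case where an earlier CI has not finished.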

Multiple Writer Support

- Keep per-byte dirty bitmask in L1s
- Allows multiple writers with false sharing
- 12.5% larger L1 cache
- Bitmask accompanies data to L2
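A per-byte dirty bitmask costs one bit per eight data bits, which is where the 12.5% figure comes from. The C sketch below shows how such a mask lets two writers to the same line (false sharing) both survive a write-back. It assumes a 64-byte line, and the names `l1_line_t`, `l1_store_byte`, and `l2_merge` are hypothetical rather than taken from ASM-CMP.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64  /* assumed line size; 64 mask bits per 512 data bits = 12.5% */

typedef struct {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty;  /* bit i set => byte i was written locally */
} l1_line_t;

/* A store marks exactly the bytes it touches. */
void l1_store_byte(l1_line_t *l, unsigned off, uint8_t v) {
    assert(off < LINE_BYTES);
    l->data[off] = v;
    l->dirty |= (uint64_t)1 << off;
}

/* Checkin write-back: the mask accompanies the data to L2,
 * and only dirty bytes overwrite the L2 copy. */
void l2_merge(uint8_t l2[LINE_BYTES], l1_line_t *l) {
    for (unsigned i = 0; i < LINE_BYTES; i++) {
        if ((l->dirty >> i) & 1) l2[i] = l->data[i];
    }
    l->dirty = 0;
}
```

If two L1s write different bytes of the same line and both merge, neither update is lost, regardless of merge order.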

Simple?

[Figure: message flow with Directory / L2 (REQ, RESP), marked as the source of races / complexity]

Outline

- Motivation and Goals
- ASM Model
- ASM-1 Prototype
- Evaluation and Results
- Conclusions and Future Work

Methodology

Simulation-based
- Enhanced-User Mode

Workloads:
- Class-1: SPLASH
- Class-2: Task-Q

Three memory modules
- ASM-CMP
- CC from gem5-Ruby
  • MESI (Inclusive)
  • MOESI (Non-inclusive)

Performance

[Figure: normalized results per workload; series: moesi, mesi, asm]

- Comparable performance
- Checkout too much
- False Sharing/Migratory Sharing

Perfect Checkout

[Figure: normalized results per workload; series: asm_base, asm_ideal; Baseline marked]

Energy

[Figure: normalized energy per workload, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb]

Less Energy (Same Performance)

Checkout Characteristics

[Figure: histogram for Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water); x-axis: # blocks invalidated, in bins 0-7 through 120-127; y-axis: share of checkouts, 0% to 80%]

[Figure: checkout invalidations per workload, with elided invalidations marked]

- Most checkout invalidations affect dead blocks
- Checkouts usually small; can be large (> 25% of …)

Checkin Characteristics

[Figure: histogram for Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water); x-axis: # blocks invalidated, in bins 0-7 through 120-127; y-axis: share of checkins]

- Checkin latency is hidden
- Checkins usually small; can be large (> 25% of …)

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions/Other Work

Conclusions

Going forward:
- HW designs must find efficiency
- SW will want to see caches/control placement

ASM: viable alternative to coherent shared memory
- Semantic cooperation between HW/SW

ASM-CMP: build components w/o coherence engine
- Make custom integration easier

Practically:
- Will the next x86 core use ASM? No
- Will a heterogeneous accelerator? Maybe

Related Work

ASM Model

ASM-CMP

Alternatives/Detractors

Related Work – ASM Model

Relaxed consistency models
- Release Consistency (ISCA 1990)
  • Acquire/Release ≈ CO/CI
- DRF-0 (ISCA 1990), DRF-1 (PDS 1993)
  • SC for DRF
- Weak ordering (ISCA 1998)

Semantic Segmentation
- Cohesion (ISCA 2011)
- Entry consistency (CMU-TR 1991)

Related Work – ASM-CMP

Rigel: IEEE Micro 2011
- Differentiates coherent/incoherent

Treadmarks: ISCA 1992
- Twinning and diffing

Related Work – Alternatives

Reduce directory overhead
- Cuckoo directory (HPCA 2011)
- Tagless directory (MICRO 2009, PACT 2011)
- Waypoint (PACT 2010)
- Region coherence (IEEE Micro 2006)
- SW controlled coherence (…)

Simplify coherence design
- Denovo (PACT 2011)

Coherence is here to stay
- CACM 2012

Future Work

ASM Model

ASM Implementations

ASM Software

Future Work – ASM Model

1. Use CO/CI for synchronization
   - Return timestamp with CO/CI
   - Blocking CO
2. Only guarantee transitivity across coherent accesses
   - Would eliminate need for timestamps
3. Hierarchical ASM
   - Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
   - Acoherent/coherent components in same system

Future Work: ASM Implementation

ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
   • e.g., HW copy-on-write support on eviction
   • Use ASM as foundation for TM/Determinism/etc.
3. Low overhead byte-diffing
   • False sharing is rare/pattern reuse is common
4. More segment control
   • Non-contiguous
   • Remap-able

Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
   • Private/shared

Future Work – ASM Software

1. Message passing on ASM
   - More efficient than coherence (think: migratory)
2. Software speculation
   - Use working memory for isolation
3. Programming language integration
   - CO/CI first-class operations
   - Work already exists:
     • Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS

Previous Work

Rerun: ISCA 2008 and CACM 2009
- Race recorder for deterministic replay
- vs. state of the art:
  • SAME logging performance, > 10x state reduction

Calvin: HPCA 2011
- Coherence for deterministic execution
  • i.e., zero-log-size deterministic replay
- Selective determinism to match program requirements

Hobbes: WoDet 2011
- Strong acoherence in SW runtime

Backup Slides

What I would do differently

Focus on more specific target system

Stop building new infrastructure!
- Why did I?
  • gem5 wasn’t ready
  • Started more radical/not clear it would have helped

Step back more often
- Easy to get sucked into details; they usually don’t matter
- Functional specification of consistency -> yuck!

Case Study: Cilk

- Work-stealing task queue
- Distributed design

ASM Segments Benefit

[Figure: normalized energy per workload, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb; y-axis ticks 0.2 to 1.8]

[Figure: normalized results per workload; series: moesi_tlb_0, moesi_tlb_32, moesi_tlb_64]

History

1980: CPU Era
2000: Multicore Era
2010: Dark Era

- Moore of the same
- Everything is general purpose
- ??

Navigating the Darkness

Solution #1: Wait for CMOS replacement
- Don’t hold your breath

Solution #2: Rethink everything
- Deep integration
- HW Specialization/Heterogeneity
- Efficiency
- Take compatibility off its pedestal

Coherence?

Rethinking Coherence: Why Now?

Dennard Scaling is over; Moore’s Law continues:
- Need efficient components/reduced waste

Heterogeneity/Specialization
- Different memory access patterns
- Multicore ASICs
- Important workloads don’t use it

Compatibility not a show stopper
- Mobile -> fast design cycles, controlled SW stacks
- Datacenter -> economy of scale in single location

Missing opportunities

Case Study 2: Software Speculation

begin_speculation() {
    <copy state>
    <setup>
}

end_speculation() {
    if (success)
        <free copies>
    else
        abort_speculation();
}

abort_speculation() {
    <revert to copy>
    <cleanup>
}

Task: Convert to ASM
- Use private storage: checkout(…)
- Checkin: commit updates: checkin(…)
- Multiple checkouts: “forget” updates
- SW can use memory in new ways

New Software Potential

- Evaluate ability to write speculation software
- Microbenchmark: Fill array with speculative data, then commit
  • Vary size of array

[Figure: normalized results for ASM and MESI; x-axis: # of Blocks in Isolation Region (16, 64, 256, 1K, 16K, 32K, 64K, 128K)]

Using Weak Acoherence

global array;    (weak acoherent)

func producer(…)
    checkout(array);
    array[0] = x;
    array[1] = y;
    checkin(array);
    signal(consumer);
end func

func consumer(…)
    waitfor(producer);
    checkout(array);
    …
end func

- Stores may become globally visible before checkin(array)
- Synch hides early checkin
- Synchronized -> Early visibility OK

Using Best-Effort Acoherence

begin_tx
    checkout(array)
    array[0] = x
    checkout(array)    <- spontaneous checkout: Exception!
    array[1] = y
    checkin(array)
end_tx

SW handles resource limitations

Simulator Design

Two Goals
- Functionally evaluate ASM system
  • programming model, kernel management
- Performance comparison to CMP

Enhanced User Mode simulator
- Emulate non-timing critical components (e.g., disks)
- Simulate the rest (e.g., virtual memory)

Qualitative Data

Is ASM a reasonable model? YES
- Almost no changes to application software
  • Unsynchronized flags
  • Stack sharing
- Functioning Kernel, same tricks
  • Heavier use of coherent segments

Three Questions

[Figure: processors (P) over the hardware layout, with Coherent, Private, and Acoherent views]

1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?

ASM-CMP Segments

- Uses true memory segments
  • e.g., all pointers are long (segment + offset)
- BUT, address space still appears flat!

Long Pointer Propagation
- Segment pointers propagate through datapath
- Add lp/sp + register sidecars
- Languages/SW remain segment-oblivious

ASM-CMP Segments

Segment pointers propagate with datapath

memcpy(dst, src, len);

      lp   $t0, 0(dst)
      lp   $t1, 0(src)
      mov  $t2, $a2        ; cnt <- len
loop: beqz $t2, exit
      lb   $t3, 0($t1)     ; ld src
      sb   $t3, 0($t0)     ; st dst
      addi $t0, $t0, 1     ; inc. dst
      addi $t1, $t1, 1     ; inc. src
      subi $t2, $t2, 1     ; dec. cnt
      b    loop
exit:

[Figure: memory holds dst ptr and src ptr as (Seg. Ptr., Offset) pairs; register file entries $t0 and $t1 carry an Offset plus a Seg sidecar, $t2 holds len, $t3 holds data; addi produces Offset+1 with the same Seg]

- Pointers are long
- Segment propagates src -> dst
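The long-pointer propagation in the memcpy example can be mimicked in software. This is a minimal C sketch under stated assumptions: a small table of fixed-size segments stands in for memory, and pointer arithmetic changes only the offset while the segment part rides along like the register sidecar. The type `long_ptr_t`, the `segments` table, and the `lp_*` functions are hypothetical and are not the ASM-CMP ISA.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGS  4   /* hypothetical number of segments */
#define SEG_BYTES 16  /* hypothetical segment size */

/* Simulated segmented memory. */
uint8_t segments[NUM_SEGS][SEG_BYTES];

/* A long pointer: segment + offset. */
typedef struct {
    uint32_t seg;
    uint32_t off;
} long_ptr_t;

/* Pointer arithmetic touches the offset only; the segment propagates. */
long_ptr_t lp_add(long_ptr_t p, int32_t delta) {
    p.off += (uint32_t)delta;
    return p;
}

/* Resolve a long pointer to a byte in the simulated memory. */
uint8_t *lp_deref(long_ptr_t p) {
    assert(p.seg < NUM_SEGS && p.off < SEG_BYTES);
    return &segments[p.seg][p.off];
}

/* Byte-wise copy, mirroring the loop in the assembly above. */
void lp_memcpy(long_ptr_t dst, long_ptr_t src, uint32_t len) {
    while (len-- > 0) {
        *lp_deref(dst) = *lp_deref(src);
        dst = lp_add(dst, 1);
        src = lp_add(src, 1);
    }
}
```

The copy loop never names a segment explicitly, which is the sense in which software stays segment-oblivious.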

The Problem

[Figure: Coherent Shared Memory sits between the Hardware Layout and the Software View]

Hardware Policy – Software Can’t Change!

All Data Are Created Equal?

Location := 1;

Assume: CMP MESI protocol, inclusive LLC

Missed Opportunities

begin_tx
    cpLocation := Location;
    Location := 1;
end_tx

SW Makes Redundant Copy

All Data Are NOT Created Equal

func foo()
    var Location;
    Location := 1;

Private: Wasting Space

ASM-1 Hardware

[Figure: cores P0 through P15 with per-core 32KB L1 (per-line bitmask) and 256KB L2, and an 8MB L3; Baseline shown for comparison]

- In-order, single thread
- Ring interconnect

Storage Overhead

[Figure: storage overhead vs. # Cores (2, 4, 8, 16, 32, 64, 128) for ASM-1, MESI-1 Level, MESI-2 Level, MESI-3 Level]

- More indirection -> longer latency
- No Indirection

Rethinking Coherence: Why Now?

Dennard Scaling is over; Moore’s Law continues
- Need scalable, energy efficient components

Accelerators are here
- How should they see memory?

Shared-little workloads in important markets

All Data Are NOT Created Equal

func CUDAKernel(…)
    …
    Location := 1;

Not clear accelerators want/need coherence
