Rethinking Parallel Execution

Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison
Outline
• From sequential to multicore
• Reminiscing: Instruction Level Parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic Serialization
• Consequences of Dynamic Serialization
• Wrap up
April 27, 2010 Mason Wells 2
Microprocessor Generations
• Generation 1: Serial
• Generation 2: Pipelined
• Generation 3: Instruction-level Parallel (ILP)
• Generation 4: Multiple processing cores
Microprocessor Generations
Gen 1: Sequential (1970s) Gen 2: Pipelined (1980s)
Gen 3: ILP (1990s)
Gen 4: Multicore (2000s)
From One Generation to Next
• Significant debate and research
  – New solutions proposed
  – Old solutions adapt in interesting ways to become viable or even better than new solutions
• Solutions that involve changes “under the hood” end up winning over others
From One Generation to Next
• From Sequential to Pipelined
  – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)
  – CISC architectures learned and employed RISC innovations
• From Pipelined to Instruction-Level Parallel
  – Statically scheduled VLIW/EPIC
  – Dynamically scheduled superscalar
From One Generation to Next
• From ILP to Multicore
  – Parallelism based upon canonical parallel execution model
  – Overcome constraints to canonical parallelization
    • Thread-level speculation (TLS)
    • Transactional memory (TM)
Reminiscing about ILP
• Late 1980s to mid 1990s
• Search for “post RISC” architecture
  – More accurately, an instruction processing model
• Desire to do more than one instruction per cycle: exploit ILP
• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar
VLIW/EPIC School
• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by compiler)
• Parallel execution expressed in static program
• Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism
VLIW/EPIC School
• Creating effective parallel representations (statically) introduces several problems
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
• Lots of research addressing these problems
• Intel and HP pushed it as their future (Itanium)
OOO Superscalar
• Create dynamic parallel execution from a sequential static representation
  – Dynamic dependence information is accurate
  – Execution schedule is flexible
• None of the problems associated with trying to create a parallel representation statically
• Natural growth path with no demands on software
Lessons from ILP Generation
• Significant consequences of trying to statically detect and express parallelism
• Techniques that make “under the hood” changes are the winners
  – Even though they may have some drawbacks/overheads
The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Solution critical to the long-term health of the computer and information technology industry
  – And thus the economy and society as we know it
(Slides 14-16: figures only.)
The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Over four decades of conventional wisdom in parallel processing
  – Mostly in the scientific application/HPC arena
  – Use this as basis

Parallel Execution Requires a Parallel Representation
Canonical Parallel Execution Model
A: Analyze program to identify independence in program
  – Independent portions executed in parallel
B: Create static representation of independence
  – Synchronization to satisfy independence assumption
C: Dynamic parallel execution unwinds as per static representation
  – Potential consequences due to static assumptions
Canonical Parallel Execution Model
• Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research
  – Identifying independence
  – Creating static representation
  – Dynamic unwinding
Identifying Independence
• Static program analysis
  – Over four decades of work
• Hard to identify statically
  – Inherently dynamic properties
  – Must be conservative statically
• Need to identify dependence in order to identify independence
Creating Static Representation
• Parallel representation for guaranteed independent work
• Insert synchronization for potential dependences
  – Conservative synchronization moves parallel execution towards sequential execution
Dynamic Unwinding
• Non-determinism
  – Changes to program state may not be repeatable
• Race conditions
• Several startup companies to deal with this problem
Conventional Wisdom
Parallel Execution Requires a Parallel Representation
Consequences:
• Must create parallel representation
• For correct execution, must statically identify:
  – Independence for parallel representation
  – Dependence for synchronization
• Source of enormous difficulty and complexity
  – Generally functions of input to program
  – Inherently dynamic properties
Current Approaches
• Stick with canonical model and try to overcome limitations
• Thread Level Speculation (TLS) and Transactional Memory (TM)
• Techniques to allow programmer to program sequentially but automatically generate parallel representation
• Techniques to handle non-determinism and race conditions
TLS and TM
• Overcome major constraint to creating static parallel representation
• Likely in several upcoming microprocessors
  – Our work in mid 1990s will be key enabler
  – Already in Sun MAJC, NEC Merlot, Sun Rock
Static Program Representation

| Issues             | Sequential | Parallel   |
|--------------------|------------|------------|
| Bugs               | Yes        | Yes (more) |
| Data races         | No         | Yes        |
| Locks/Synch        | No         | Yes        |
| Deadlock           | No         | Yes        |
| Nondeterminism     | No         | Yes        |
| Parallel Execution | ?          | Yes        |
• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
Serialization Sets: What?
• Sequential program representation and dynamic parallel execution
  – No static representation of independence
  – No locks and no explicit synchronization
• “Under the hood” runtime system dynamically determines and orders dependent computations
  – Independence, and thus parallelism, falls out as a side effect
• Comparable or better performance than conventional parallel models
How? Big Picture
• Write program in object-oriented style
  – Method operates on data of associated object (ver. 1)
• Identify parts of program for potential parallel execution
  – Make suitable annotations as needed
• Dynamically determine data object touched by selected code
  – Identify dependence
• Program thread assigns selected code to bins
How? Big Picture
• Serialize computations to same object
  – Enforce dependence
  – Assign them to same bin; delegate thread executes computations in same bin sequentially
• Do not look for/represent independence
  – Falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
• Updates to given state in same order as in sequential program
  – Determinism
  – No races
  – If sequential is correct, parallel execution is correct (same input)
Big Picture
(Figure: a program thread delegating work to Delegate Threads 0, 1, and 2.)
Serialization Sets: How?
• Sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with objects to express dependence
• Serializer groups dependent method invocations into a serialization set
  – Runtime executes in order to honor dependences
• Independent method invocations in different sets
  – Runtime opportunistically parallelizes execution
Example: Debit/Credit Transactions
```cpp
trans_t* trans;
while ((trans = get_trans()) != NULL) {
    account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->deposit(trans->amount);
    else if (trans->type == WITHDRAW)
        account->withdraw(trans->amount);
}
```

Several static unknowns!
• # of transactions?
• Points to?
• Loop-carried dependence?
Multithreading Strategy
```cpp
trans_t* trans;
while ((trans = get_trans()) != NULL) {
    account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->deposit(trans->amount);
    else if (trans->type == WITHDRAW)
        account->withdraw(trans->amount);
}
```

1) Read all transactions into an array
2) Divide chunks of the array among multiple threads

Oblivious to what accounts each thread may access!
→ Methods must lock the account to ensure mutual exclusion
Example with Serialization Sets

```cpp
private<account_t> private_account_t;  // declare wrapped account type

begin_nest();                          // initiate nesting level
trans_t* trans;
while ((trans = get_trans()) != NULL) {
    private_account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->delegate(deposit, trans->amount);
    else if (trans->type == WITHDRAW)
        account->delegate(withdraw, trans->amount);
}
end_nest();                            // end nesting level, implicit barrier
```

delegate indicates potentially-independent operations. At execution, delegate:
1) Creates method invocation structure
2) Gets serializer pointer from base class
3) Enqueues invocation in serialization set
(Figure: program context. Eight delegate calls are enqueued into serialization sets by account:)
  SS #100: deposit $2000, withdraw $50, withdraw $20, deposit $300
  SS #200: withdraw $1000, withdraw $1000
  SS #300: withdraw $350, deposit $5000
(Figure: delegate context. The program thread's delegate calls feed delegate threads; each serialization set executes in program order on a single delegate thread, and different sets run in parallel across Delegate 0 and Delegate 1.)

Race-free, determinate execution without synchronization!
Prometheus: C++ Library for SS
• Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
• Runtime orchestrates parallel execution
• Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris
Prometheus Runtime
• Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
• Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled
    • Work-stealing scheduler
  – Supports nested parallelism
Network Packet Classification
```cpp
packet_t* packet;
classify_t* classifier;
vector<int> ruleCount(num_rules);
vector<packet_queue_t> packet_queues;
int packetCount = 0;

for (i = 0; i < packet_queues.size(); i++) {
    while ((packet = packet_queues[i].get_pkt()) != NULL) {
        ruleID = classifier->softClassify(packet);
        ruleCount[ruleID]++;
        packetCount++;
    }
}
```
Example with Serialization Sets
```cpp
private<classify_t> private_classify_t;  // one wrapped classifier per queue
vector<private_classify_t> classifiers;
int packetCount = 0;
vector<int> ruleCount(numRules, 0);
int size = packet_queues.size();

begin_nest();
for (i = 0; i < size; i++) {
    classifiers[i].delegate(&classify_t::softClassify,
                            packet_queues[i]);
}
end_nest();

for (i = 0; i < size; i++) {
    ruleCount += classifiers[i].getRuleCount();
    packetCount += classifiers[i].getPacketCount();
}
```
Packet Classification (No Locks!)
Network Intrusion Detection
• Very common networking application
• Most common program used: Snort
  – Open source version (like Linux)
  – But also commercial versions (Sourcefire)
• Basic structure of computation also found in many other deep packet inspection applications
  – E.g., packet de-duplication (Riverbed)
Other Applications
• Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional parallelization
  – pthreads, OpenMP
• Prometheus versions
  – Port program to sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets
Statically Scheduled Results
4 Socket AMD Barcelona (4-way multicore) = 16 total cores
Statically Scheduled Results
Summary
• Sequential program with annotations
  – No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
• Dependence-based model
  – Determinate, race-free parallel execution
• Do as well or better than incumbents, but without their negatives
• Can do things that are very hard for incumbents