Task and Data Parallelism: Real-World Examples
DESCRIPTION
This presentation begins by reviewing the Task Parallel Library APIs, introduced in .NET 4.0 and expanded in .NET 4.5 -- the Task class, Parallel.For and Parallel.ForEach, and even Parallel LINQ. Then we look at patterns and practices for extracting concurrency and managing dependencies, with real examples such as the Levenshtein edit distance algorithm, the Fast Fourier Transform, and others.
TRANSCRIPT
Sasha Goldshtein
CTO, Sela Group
@goldshtn | blog.sashag.net
Task and Data Parallelism: Real-World Examples
www.devconnections.com
GARBAGE COLLECTION PERFORMANCE TIPS
AGENDA
Multicore machines have been a cheap commodity for >10 years
Adoption of concurrent programming is still slow
Patterns and best practices are scarce
We discuss the APIs first…
…and then turn to examples, best practices, and tips
TPL EVOLUTION
2008 • Incubated for 3 years as “Parallel Extensions for .NET”
2010 • Released in full glory with .NET 4.0
2012 • DataFlow in .NET 4.5 (NuGet)
     • Augmented with language support (await, async methods)
The Future
TASKS
A task is a unit of work
May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
Much more than threads, and yet much cheaper
Task<string> t = Task.Factory.StartNew(() => {
    return DnaSimulation(…);
});
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try { // The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}
PARALLEL LOOPS
Ideal for parallelizing work over a collection of data
Easy porting of for and foreach loops
Beware of inter-iteration dependencies!

Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});
PARALLEL LINQ
Mind-bogglingly easy parallelization of LINQ queries
Can introduce ordering into the pipeline, or preserve order of the original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));
MEASURING CONCURRENCY
Visual Studio Concurrency Visualizer to the rescue
RECURSIVE PARALLELISM EXTRACTION
Divide-and-conquer algorithms are often parallelized through the recursive call
Be careful with the parallelization threshold, and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        // Combine the two halves in O(n) time
    }
}

// Extracting parallelism at the recursive call site:
Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));
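The threshold point deserves emphasis: spawning tasks for tiny subproblems costs more than it saves. Here is a minimal sketch of the idea on a hypothetical recursive parallel sum (an illustration, not code from the talk) that falls back to a sequential loop below a cutoff:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class RecursiveParallelism
{
    // Below the threshold, recursion switches to a plain sequential loop;
    // above it, the two halves run as parallel tasks via Parallel.Invoke.
    public static long ParallelSum(int[] a, int lo, int hi, int threshold)
    {
        if (hi - lo <= threshold)
        {
            long s = 0;
            for (int i = lo; i < hi; ++i) s += a[i];
            return s;
        }
        int mid = (lo + hi) / 2;
        long left = 0, right = 0;
        Parallel.Invoke(
            () => left = ParallelSum(a, lo, mid, threshold),
            () => right = ParallelSum(a, mid, hi, threshold));
        return left + right;
    }

    static void Main()
    {
        var data = Enumerable.Range(1, 100).ToArray();
        Console.WriteLine(ParallelSum(data, 0, data.Length, 8)); // 5050
    }
}
```

Tuning the threshold is empirical: too low and task overhead dominates, too high and cores sit idle.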
SYMMETRIC DATA PROCESSING
For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});
UNEVEN WORK DISTRIBUTION
With non-uniform data items, use custom partitioning or manual distribution
Primes: 7 is easier to check than 10,320,647

var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

vs.

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));
COMPLEX DEPENDENCY MANAGEMENT
Must extract all dependencies and incorporate them into the algorithm
Typical scenarios: 1D loops, dynamic programming algorithms
Edit distance: each task depends on 2 predecessors -- a wavefront computation from (0,0) to (m,n)

C = x[i-1] == y[i-1] ? 0 : 1;
D[i, j] = min(D[i-1, j] + 1,
              D[i, j-1] + 1,
              D[i-1, j-1] + C);
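One way to realize the wavefront is to sweep the anti-diagonals of the matrix in order: every cell on a diagonal depends only on earlier diagonals, so the cells within one diagonal can be computed in parallel. A sketch (per-cell parallelism shown for clarity; a real implementation would partition each diagonal into blocks to amortize the loop overhead):

```csharp
using System;
using System.Threading.Tasks;

class Wavefront
{
    public static int EditDistance(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i;
        for (int j = 0; j <= n; ++j) D[0, j] = j;

        // Sweep anti-diagonals: cell (i, j) with i + j == d depends only on
        // diagonals d-1 and d-2, so all cells of diagonal d are independent.
        for (int d = 2; d <= m + n; ++d)
        {
            int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }

    static void Main()
    {
        Console.WriteLine(EditDistance("kitten", "sitting")); // 3
    }
}
```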
SYNCHRONIZATION > AGGREGATION
Excessive synchronization brings parallel code to its knees
Try to avoid shared state, or minimize access to it
Aggregate thread- or task-local state and merge later
Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),                 // initial local state
    (range, pls, localPrimes) => {         // aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i))
                localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                       // combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });
CREATIVE SYNCHRONIZATION
We implement a collection of stock prices, initialized with 10^5 name/price pairs
10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
Many reader threads, many writer threads

GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
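In C#, the trick can be sketched with two dictionaries: a “safe” one whose key set is frozen after initialization, read and overwritten without a lock (values are boxed so that only an atomic reference write occurs), and an “unsafe” one guarded by a lock for the rare additions. This relies on the undocumented observation that overwriting an existing Dictionary key does not change its structure -- treat it as an illustration of the pattern, not a supported guarantee:

```csharp
using System;
using System.Collections.Generic;

class StockPrices
{
    private readonly Dictionary<string, object> _safe;    // key set frozen after construction
    private readonly Dictionary<string, object> _unsafe = new Dictionary<string, object>();
    private readonly object _mutex = new object();

    public StockPrices(IDictionary<string, decimal> initial)
    {
        _safe = new Dictionary<string, object>();
        foreach (var kv in initial)
            _safe[kv.Key] = (object)kv.Value;             // box once up front
    }

    public decimal Get(string key)
    {
        object value;
        if (_safe.TryGetValue(key, out value))
            return (decimal)value;                        // lock-free fast path
        lock (_mutex) return (decimal)_unsafe[key];
    }

    public void Put(string key, decimal value)
    {
        // Overwriting an existing key swaps in a new boxed value without
        // mutating the dictionary's structure, so concurrent readers stay safe.
        if (_safe.ContainsKey(key)) { _safe[key] = value; return; }
        lock (_mutex) _unsafe[key] = value;               // rare "add" path
    }
}
```

With 10^7 reads/s hitting the lock-free path and only the daily handful of additions taking the lock, contention all but disappears.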
LOCK-FREE PATTERNS (1)
Try to avoid Windows synchronization and use hardware synchronization
Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;       // old value
        r = t * y;   // new value
        // CompareExchange(ref location, new value, comparand) stores r
        // only if x still equals t, and returns the value it observed
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}
LOCK-FREE PATTERNS (2)
User-mode spinlocks (the SpinLock class) can replace locks that are acquired very often and protect tiny computations

class __DontUseMe__SpinLock {
    private int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0) ;
    }
    public void Exit() {
        Thread.MemoryBarrier(); // flush the critical section's writes before releasing
        _lck = 0;
    }
}
MISCELLANEOUS TIPS (1)
Don’t mix several concurrency frameworks in the same process
Some parallel work is best organized in pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
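A sketch of the pictured pipeline using TPL DataFlow (the System.Threading.Tasks.Dataflow NuGet package). The download and parse stages here are stand-ins that just round-trip the Uri text; a real head block could be a BroadcastBlock<Uri> to fan each Uri out to several consumers:

```csharp
using System;
using System.Text;
using System.Threading.Tasks.Dataflow;

class Pipeline
{
    static void Main()
    {
        // Stand-in stages: a real pipeline would download and parse pages.
        var download = new TransformBlock<Uri, byte[]>(
            uri => Encoding.UTF8.GetBytes(uri.ToString()));
        var extract = new TransformBlock<byte[], string>(
            bytes => Encoding.UTF8.GetString(bytes));
        var display = new ActionBlock<string>(s => Console.WriteLine(s));

        // PropagateCompletion lets Complete() flow down the whole chain.
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(extract, linkOptions);
        extract.LinkTo(display, linkOptions);

        download.Post(new Uri("http://example.org/"));
        download.Complete();
        display.Completion.Wait();   // wait for the last stage to drain
    }
}
```

Each block runs its delegate on thread-pool tasks, so the stages overlap automatically -- the pipeline parallelism comes from the structure, not from explicit Task plumbing.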
MISCELLANEOUS TIPS (2)
Some parallel work can be offloaded to the GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float, 1> avX(n, x), avY(n, y);
    array_view<float, 1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
MISCELLANEOUS TIPS (3)
Invest in SIMD parallelization of heavy math or data-parallel algorithms
Make sure to take cache effects into account, especially on MP systems
START:
    movups xmm0, [esi+4*ecx]
    addps  xmm0, [edi+4*ecx]
    movups [ebx+4*ecx], xmm0
    sub    ecx, 4
    jns    START
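One cache effect worth knowing by name is false sharing: counters in adjacent array slots share a cache line, so parallel increments force that line to ping-pong between cores even though no data is logically shared. A hedged sketch of the standard padding fix (the 64-byte line size is a common assumption, not a universal one):

```csharp
using System;
using System.Threading.Tasks;

class FalseSharing
{
    const int Pad = 16; // 16 ints = 64 bytes, one typical cache line

    // Each thread increments only its own counter; the padding keeps
    // the counters on distinct cache lines so the writes do not contend.
    public static long CountPadded(int threads, int iterations)
    {
        var counters = new int[threads * Pad];
        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < iterations; ++i)
                counters[t * Pad]++;
        });
        long total = 0;
        for (int t = 0; t < threads; ++t)
            total += counters[t * Pad];
        return total;
    }

    static void Main()
    {
        Console.WriteLine(CountPadded(4, 1000000)); // 4000000
    }
}
```

Dropping Pad to 1 gives the same answer but markedly worse scaling on multiprocessor systems -- the correctness is unchanged, only the cache traffic differs.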
SUMMARY
Avoid shared state and synchronization
Parallelize judiciously and apply thresholds
Measure and understand performance gains or losses
Concurrency and parallelism are still hard
A body of best practices, tips, patterns, and examples is being built
THANK YOU!
Sasha Goldshtein
@goldshtn