using parallel computing platform - nhdnug
DESCRIPTION
Slides from Phil Pennington's talk on Using Parallel Computing with Visual Studio 2010 and .NET 4.0, originally presented at the North Houston .NET Users Group (facebook.com/nhdnug).
TRANSCRIPT
Agenda
• What’s new with Windows?
• Parallel Computing Tools in Visual Studio
• Using .NET Parallel Extensions
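First, An Example: Monte Carlo Approximation of Pi
Area of the square: S = 4*r*r
Area of the inscribed circle: C = Pi*r*r
Therefore Pi = 4*(C/S)
For each random point P(x,y) in the square, d(P) = SQRT((x * x) + (y * y))
if (d < r) then P(x,y) is in C
The fraction of points that land in C approximates C/S.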
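The estimate above can be sketched with the .NET 4 Parallel.For overload that carries per-worker state. This is a minimal sketch, not the code from the talk: the names MonteCarloPi, Worker, and Estimate are hypothetical, r is assumed to be 1, and the squared distance is compared against r*r so the square root can be skipped.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MonteCarloPi
{
    // Per-worker state: a private random generator and a private hit counter.
    // System.Random is not thread-safe, so it must not be shared across workers.
    private sealed class Worker
    {
        public readonly Random Rng;
        public long Hits;
        public Worker(int seed) { Rng = new Random(seed); }
    }

    public static double Estimate(int samples, int seed)
    {
        if (samples <= 0) throw new ArgumentOutOfRangeException("samples");

        const double r = 1.0;   // assumed radius; the square spans [-r, r) on each axis
        long inCircle = 0;      // shared total, updated once per worker
        int nextSeed = seed;

        Parallel.For(0, samples,
            // Partition: each worker creates its own state with a distinct seed.
            () => new Worker(Interlocked.Increment(ref nextSeed)),
            // Execute: sample one point and test d*d < r*r (same test as d < r).
            (i, loopState, worker) =>
            {
                double x = (2.0 * worker.Rng.NextDouble() - 1.0) * r;
                double y = (2.0 * worker.Rng.NextDouble() - 1.0) * r;
                if (x * x + y * y < r * r) worker.Hits++;
                return worker;
            },
            // Collate: fold each worker's private count into the shared total.
            worker => { Interlocked.Add(ref inCircle, worker.Hits); });

        // Pi = 4 * (C / S), with C/S approximated by the hit fraction.
        return 4.0 * inCircle / samples;
    }

    public static void Main()
    {
        Console.WriteLine("Pi is approximately {0}", Estimate(1000000, 42));
    }
}
```

The result varies slightly from run to run because the split of iterations across workers is not fixed; more samples narrow the error.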
Windows and Maximum Processors
• Before Win7/R2, the maximum number of Logical Processors (LPs) was dictated by processor integral word size
  – LP state (e.g. idle, affinity) represented in word-sized bitmask
  – 32-bit Windows: 32 LPs
  – 64-bit Windows: 64 LPs
[Figure: 32-bit Idle Processor Mask, bit positions 0, 16, and 31 marked; legend: Idle, Busy]
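To make the word-size limit concrete, here is a small sketch of a 32-bit idle mask with one bit per logical processor. It is illustrative only and is not how Windows implements its masks: the class IdleMask, its method names, and the convention that a set bit means idle are all assumptions.

```csharp
using System;

public static class IdleMask
{
    public const int MaxLogicalProcessors = 32;   // one bit per LP in a 32-bit word

    // Reject LP numbers that have no bit in the word. Without this check,
    // C# masks the shift count, so (1u << 32) would silently alias bit 0.
    private static void Check(int lp)
    {
        if (lp < 0 || lp >= MaxLogicalProcessors)
            throw new ArgumentOutOfRangeException("lp");
    }

    // Assumed convention: bit set = idle, bit clear = busy.
    public static uint MarkIdle(uint mask, int lp) { Check(lp); return mask | (1u << lp); }
    public static uint MarkBusy(uint mask, int lp) { Check(lp); return mask & ~(1u << lp); }
    public static bool IsIdle(uint mask, int lp)   { Check(lp); return (mask & (1u << lp)) != 0; }

    public static void Main()
    {
        uint mask = 0;
        mask = MarkIdle(mask, 0);
        mask = MarkIdle(mask, 16);
        mask = MarkIdle(mask, 31);
        Console.WriteLine("mask = 0x{0:X8}", mask);
        Console.WriteLine("LP 16 idle? {0}", IsIdle(mask, 16));
    }
}
```

A 33rd processor simply has no bit to occupy, which is the limitation that processor groups address.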
Processor Groups: New with Windows 7 and Windows Server 2008 R2
[Figure: topology hierarchy. A Group contains NUMA Nodes, a NUMA Node contains Sockets, a Socket contains Cores, and a Core contains Logical Processors (LPs)]
Processor Groups: Example with 2 Groups, 4 nodes, 8 sockets, 32 cores, 128 LPs
[Figure: 2 Groups, each with 2 NUMA Nodes; each NUMA Node has 2 Sockets; each Socket has 4 Cores; each Core has 4 LPs]
Many-Core Topology APIs: Discovery
Many-Core Topology APIs: Resource Localization
Many-Core Topology APIs: Memory Management
User Mode Scheduling: Architectural Perspective
[Figure labels: Application and Kernel layers; Your Scheduler Logic; Scheduler Threads S1 and S2 on CPU 1 and CPU 2; worker threads W1 to W4; Blocked Worker Threads; UMS Scheduler’s Ready List; UMS Completion List; transition reasons: Created, Blocked, Yield; Wait]
Task Scheduling with a UMS Scheduler: Maximize Quantum, Minimize Blocking Effects
• Tasks are run by worker threads, which the scheduler controls
[Figure labels: timelines for worker threads WT0 to WT3, shown without UMS (signal-and-wait) and with UMS (UMS yield); Dead Zone]
Load-Balancing, Work Stealing Scheduler
• Dynamic scheduling improves performance by distributing work efficiently at runtime.
[Figure: Static Scheduling and Dynamic Scheduling, each shown across CPU0, CPU1, CPU2, and CPU3]
Demos
The Platform
– Topology
– Schedulers
Visual Studio 2010: .NET Developer Tools, Programming Models, Runtimes
• Tools
  – Debugger
  – Profiler
• Programming Models – Structured Parallelism
  – .NET Parallel Extensions: Parallel LINQ (PLINQ), Task Parallel Library (TPL), Data Structures
• .NET Runtime
  – Managed Library
  – Thread Pools
  – Task Scheduler
  – Resource Manager
Thread-Pool Scheduler in .NET 4.0
• Global queue is shared by legacy ThreadPool API and TPL
• Local work queues and work stealing scheduler (TPL only)
[Figure: tasks T1 to T8; Enqueue into the Global Queue (FIFO); Thread 1, Thread 2, ... Thread N each run a Dispatch Loop with a Local Queue (LIFO); arrows labeled Enqueue, Dequeue, and Steal]
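The two entry points named on this slide can be exercised side by side. The sketch below is an illustration under assumptions, not a view into the scheduler: the class and method names are hypothetical, and which queue each work item actually lands in is taken from the slide's description rather than observed by the code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class QueueEntryPoints
{
    // Runs one legacy work item plus a parent task that spawns child tasks,
    // and returns how many units of work completed.
    public static int RunBoth(int childCount)
    {
        int completed = 0;
        var legacyDone = new ManualResetEventSlim(false);

        // Legacy ThreadPool API: per the slide, this goes through the global queue.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            Interlocked.Increment(ref completed);
            legacyDone.Set();
        });

        // TPL: children started from inside a running task are the candidates
        // for the worker's local queue and for stealing by other workers.
        Task parent = Task.Factory.StartNew(() =>
        {
            var children = new Task[childCount];
            for (int i = 0; i < childCount; i++)
            {
                children[i] = Task.Factory.StartNew(() => { Interlocked.Increment(ref completed); });
            }
            Task.WaitAll(children);
        });

        parent.Wait();
        legacyDone.Wait();
        return completed;
    }

    public static void Main()
    {
        Console.WriteLine("Completed work items: {0}", RunBoth(4));
    }
}
```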
Task Parallel Library (TPL): Task Concepts
• Task: An asynchronous operation
• Task<TResult>: A Task that returns a result
• Continuation: A Task that starts when another completes
• FromAsync: A Task that wraps an existing APM implementation
• TaskCompletionSource: A Task that represents another operation
• TaskScheduler: An extensible scheduler that executes Tasks
Common Functionality: waiting, cancellation, continuations, parent/child relationships
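Three of the concepts listed above fit in a short sketch: a Task<TResult>, a continuation, and a TaskCompletionSource that surfaces a legacy ThreadPool callback as a Task. The class and method names are hypothetical and the example uses only .NET 4.0 APIs (Task.Factory.StartNew and ContinueWith).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskConcepts
{
    // Task<TResult> followed by a continuation that starts when it completes.
    public static Task<string> SquareThenFormat(int n)
    {
        Task<int> square = Task.Factory.StartNew(() => n * n);
        return square.ContinueWith(t => string.Format("{0} squared is {1}", n, t.Result));
    }

    // TaskCompletionSource: represent an operation that is not itself a Task
    // (here, a legacy thread-pool work item) as a Task<T>.
    public static Task<T> FromLegacyCallback<T>(Func<T> compute)
    {
        var tcs = new TaskCompletionSource<T>();
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try { tcs.SetResult(compute()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    public static void Main()
    {
        Console.WriteLine(SquareThenFormat(7).Result);
        Console.WriteLine(FromLegacyCallback(() => 21 * 2).Result);

        // Failures surface as AggregateException when the task is waited on.
        try
        {
            FromLegacyCallback<int>(() => { throw new InvalidOperationException("boom"); }).Wait();
        }
        catch (AggregateException ae)
        {
            Console.WriteLine("Caught: {0}", ae.InnerException.Message);
        }
    }
}
```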
Primitives and Structures
• Thread-safe, scalable collections
  – IProducerConsumerCollection<T>
    • ConcurrentQueue<T>
    • ConcurrentStack<T>
    • ConcurrentBag<T>
  – ConcurrentDictionary<TKey,TValue>
• Phases and work exchange
  – Barrier
  – BlockingCollection<T>
  – CountdownEvent
• Partitioning
  – {Orderable}Partitioner<T>
    • Partitioner.Create
• Exception handling
  – AggregateException
• Initialization
  – Lazy<T>
    • LazyInitializer.EnsureInitialized<T>
  – ThreadLocal<T>
• Locks
  – ManualResetEventSlim
  – SemaphoreSlim
  – SpinLock
  – SpinWait
• Cancellation
  – CancellationToken{Source}
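As one example of the "work exchange" primitives listed above, the sketch below passes integers from a producer to several consumers through a bounded BlockingCollection<T>. The class name, method name, and capacity of 16 are arbitrary choices for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ProducerConsumer
{
    // Sends 1..itemCount through a bounded queue and returns the sum the consumers saw.
    public static long SumThroughQueue(int itemCount, int consumerCount)
    {
        // Capacity 16: Add blocks while the collection is full.
        // With no backing collection specified, a ConcurrentQueue<T> is used.
        using (var queue = new BlockingCollection<int>(16))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= itemCount; i++) queue.Add(i);
                queue.CompleteAdding();   // lets consumers finish once the queue drains
            });

            var consumers = new Task<long>[consumerCount];
            for (int c = 0; c < consumerCount; c++)
            {
                consumers[c] = Task.Factory.StartNew(() =>
                {
                    long local = 0;
                    // Blocks while empty; ends after CompleteAdding and drain.
                    foreach (int item in queue.GetConsumingEnumerable()) local += item;
                    return local;
                });
            }

            producer.Wait();
            Task.WaitAll(consumers);

            long total = 0;
            foreach (Task<long> consumer in consumers) total += consumer.Result;
            return total;
        }
    }

    public static void Main()
    {
        Console.WriteLine("Sum seen by consumers: {0}", SumThroughQueue(100, 3));
    }
}
```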
Parallel Debugging
• Two new debugger toolwindows
  – Support both native and managed
• “Parallel Tasks”
• “Parallel Stacks”
Parallel Tasks
− What threads are executing my tasks?
− Where are my tasks running (location, call stack)?
− Which tasks are blocked?
− How many tasks are waiting to run?
Parallel Stacks
[Figure callouts: Zoom control, Bird’s eye view]
− Multiple call stacks in a single view
− Task-specific view (Task status)
− Easy navigation to any executing method
− Rich UI (zooming, panning, bird’s eye view, flagging, tooltips)
Parallel Profiling
CPU Utilization
• Number of cores
• Your Process
• Idle time
• Other processes
Threads
• Usage Hints
• Detailed thread analysis (one channel per thread)
• Active Legend
• Hide uninteresting threads
• Measure time for interesting segments
• Zoom in and out
• Call Stacks
Cores
• Each logical core in a swim lane
• One color per thread
• Cross-core migration details
• Migration visualization
Demo
– Libraries
– Languages
– Debuggers
– Profilers
Thinking Parallel - “Task” vs. “Data” Parallelism
Task Parallelism
Parallel.Invoke(
    () => { Console.WriteLine("Begin first task..."); },
    () => { Console.WriteLine("Begin second task..."); },
    () => { Console.WriteLine("Begin third task..."); }
);
Data Parallelism
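// Candidates 2..98; Enumerable.Range takes (start, count), not (start, end).
IEnumerable<int> numbers = Enumerable.Range(2, 100 - 3);
var myQuery =
    from n in numbers.AsParallel()
    // Trial-divide by 2..floor(sqrt(n)). The count is floor(sqrt(n)) - 1, so 2 is kept as prime.
    where Enumerable.Range(2, (int)Math.Sqrt(n) - 1).All(i => n % i > 0)
    select n;
int[] primes = myQuery.ToArray();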
Thinking Parallel – How to Partition Work?
Several partitioning schemes built-in
– Chunk
  • Works with any IEnumerable<T>
  • Single enumerator shared; chunks handed out on-demand
– Range
  • Works only with IList<T>
  • Input divided into contiguous regions, one per partition
– Stripe
  • Works only with IList<T>
  • Elements handed out round-robin to each partition
– Hash
  • Works with any IEnumerable<T>
  • Elements assigned to partition based on hash code
Custom partitioning available through Partitioner<T>
– Partitioner.Create available for tighter control over built-in partitioning schemes
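As a sketch of the Partitioner.Create route mentioned above, the example below splits an index range into contiguous blocks and hands each block to Parallel.ForEach as a Tuple<int,int>. The class name, method name, and the choice of rangeSize are assumptions for illustration; the talk's own demo code is not in the transcript.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangePartitioning
{
    // Sums the squares of data, processing contiguous index ranges in parallel.
    public static long SumOfSquares(int[] data, int rangeSize)
    {
        if (data.Length == 0) return 0;   // Partitioner.Create rejects an empty range

        long total = 0;

        // Each element handed to the loop body is one range [Item1, Item2).
        Parallel.ForEach(Partitioner.Create(0, data.Length, rangeSize), range =>
        {
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                local += (long)data[i] * data[i];
            }
            // One synchronized update per range instead of one per element.
            Interlocked.Add(ref total, local);
        });

        return total;
    }

    public static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        Console.WriteLine("Sum of squares: {0}", SumOfSquares(data, 3));
    }
}
```

Larger ranges cut per-item delegate and synchronization overhead; smaller ranges give the scheduler more pieces to balance across cores.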
Thinking Parallel – How to Execute Tasks?
Thinking Parallel – How to Collate Results?
Demos
– Partition
– Execute
– Collate
Resources
• Native APIs/runtimes (Visual C++ 10)
  – Tasks, loops, collections, and Agents
  – http://msdn.microsoft.com/en-us/library/dd504870(VS.100).aspx
• Tools (in the VS2010 IDE)
  – Debugger and profiler
  – http://msdn.microsoft.com/en-us/library/dd460685(VS.100).aspx
• Managed APIs/runtimes (.NET 4)
  – Tasks, loops, collections, and PLINQ
  – http://msdn.microsoft.com/en-us/library/dd460693(VS.100).aspx
General VS2010 Parallel Computing Developer Center
http://msdn.microsoft.com/en-us/concurrency/default.aspx