Phil Pennington, Microsoft, session WSV317


New NUMA Support with Windows Server 2008 R2 and Windows 7

Phil Pennington

What will you look for?
Overall Solution Scalability

[Chart: "Your Application: Speed-Up vs. Cores". X-axis: Number of Cores (1 to 32). Y-axis: Speedup (0 to 35). Series: Speedup, with a polynomial trend line. Data labels: 1.00, 1.47, 2.57, 4.87, 7.44, 8.59, 8.29.]
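One thing to look for in a chart like this is the point where adding cores stops helping. The following is a minimal sketch, assuming the chart's data labels are listed in order of increasing core count (the extracted text does not preserve which core count each label belongs to). The helper names `first_regression` and `peak_speedup` are hypothetical and not part of the talk.

```python
def first_regression(speedups):
    """Return the index of the first measurement that is slower than the
    one before it, or None if the speedup never drops."""
    for i in range(1, len(speedups)):
        if speedups[i] < speedups[i - 1]:
            return i
    return None


def peak_speedup(speedups):
    """Return (index, value) of the best speedup observed."""
    best = max(range(len(speedups)), key=lambda i: speedups[i])
    return best, speedups[best]


# Data labels from the chart, assumed to be in order of increasing core count.
labels = [1.00, 1.47, 2.57, 4.87, 7.44, 8.59, 8.29]

print(first_regression(labels))  # 6
print(peak_speedup(labels))      # (5, 8.59)
```

The last label (8.29) is lower than the one before it (8.59), which is the kind of scaling ceiling the slide asks you to look for.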

Agenda
Windows Server 2008 R2

New NUMA APIs
New User-Mode Scheduling APIs
New C++ Concurrency Runtime

Example NUMA Hardware Today

A 256 logical processor system: HP SuperDome, 64 dual-core hyper-threaded "Montvale" 1.6 GHz Itanium 2
A 64 logical processor system: Unisys ES7000, 32 dual-core hyper-threaded "Tulsa" 3.4 GHz Xeon

NUMA Hardware Tomorrow

2, 4, 8 cores-per-socket "commodity" CPU architectures
Expect systems with 128-256 logical processors

[Diagram: four Nehalem processors and two I/O hubs, each hub labeled PCI Express*.]

NUMA Node Groups
New with Win7 and R2

[Diagram: hierarchy of Group > NUMA node > socket > core > logical processor (LP).]

NUMA Node Groups
Example: 2 Groups, 4 Nodes, 8 Sockets, 32 Cores, 4 LPs/Core = 128 LPs

[Diagram: 2 groups, each with 2 NUMA nodes; each node has 2 sockets; each socket has 4 cores; each core has 4 LPs.]
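The arithmetic behind the 128-LP example can be checked with a short sketch. This is only an illustration: the `Topology` class, its field names, and the linear LP numbering in `locate` are assumptions made for the example, not the enumeration Windows actually uses. The 64-LP limit per processor group is the Windows 7 / Server 2008 R2 group size.

```python
from dataclasses import dataclass

MAX_LPS_PER_GROUP = 64  # a processor group holds at most 64 logical processors


@dataclass(frozen=True)
class Topology:
    groups: int
    nodes_per_group: int
    sockets_per_node: int
    cores_per_socket: int
    lps_per_core: int

    def lps_per_group(self):
        return (self.nodes_per_group * self.sockets_per_node
                * self.cores_per_socket * self.lps_per_core)

    def total_lps(self):
        return self.groups * self.lps_per_group()

    def locate(self, lp_index):
        """Map a linear LP index (hypothetical numbering) to
        (group, node, socket, core, lp) coordinates."""
        if not 0 <= lp_index < self.total_lps():
            raise ValueError("lp_index out of range")
        rest, lp = divmod(lp_index, self.lps_per_core)
        rest, core = divmod(rest, self.cores_per_socket)
        rest, socket = divmod(rest, self.sockets_per_node)
        group, node = divmod(rest, self.nodes_per_group)
        return group, node, socket, core, lp


# The slide's example: 2 groups, 4 nodes, 8 sockets, 32 cores, 4 LPs per core.
example = Topology(groups=2, nodes_per_group=2, sockets_per_node=2,
                   cores_per_socket=4, lps_per_core=4)

print(example.total_lps())                            # 128
print(example.lps_per_group() <= MAX_LPS_PER_GROUP)   # True
print(example.locate(64))                             # (1, 0, 0, 0, 0)
```

Each group in the example holds exactly 64 LPs, which is why 128 LPs need two groups.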

Sample SQL Server Scaling
64P to 128P

[Chart: scaling from 64P to 128P, with data labels 1.7X and 1.3X.]

Windows Server Performance team sample test lab results
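To put the two data labels in context, a scaling factor can be compared against ideal linear scaling. This is a minimal sketch; `scaling_efficiency` is a hypothetical helper, and the extracted text does not say which configuration each label belongs to.

```python
def scaling_efficiency(speedup, procs_before, procs_after):
    """Fraction of ideal (linear) scaling actually achieved."""
    ideal = procs_after / procs_before
    return speedup / ideal


# Going from 64P to 128P doubles the processors, so ideal scaling is 2.0X.
for label in (1.7, 1.3):
    print(f"{label}X -> {scaling_efficiency(label, 64, 128):.0%} of linear")
# 1.7X -> 85% of linear
# 1.3X -> 65% of linear
```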

Bad Case Disk Write
Software and Hardware Locality NOT Optimal

[Diagram: processors P1-P4 with caches Cache1-Cache4, memories MemA and MemB joined by a node interconnect, and disks DiskA and DiskB. Numbered steps (0) through (7) trace the write across the I/O initiator, the ISR, the DPC, and the I/O buffer home. Two callouts read "Locked out for I/O Initiation".]

Windows Server 2008 R2
Optimization for NUMA Topology

[Diagram: the same system (P1-P4, Cache1-Cache4, MemA, MemB, DiskA, DiskB, node interconnect). The labels I/O Initiator, ISR, and DPC remain, and only steps (2) and (3) are marked.]

NUMA Aware Applications
Non-Uniform Memory Architecture

Minimize Contention, Maximize Locality
Apps scaling beyond even 8-16 logical processors should be NUMA aware
A process or thread can set a preferred NUMA node
Use the Node Group scheme for Task or Process partitioning
Performance-optimize within Node Groups
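The partitioning advice can be sketched independently of any Windows API. The example below is a minimal illustration, assuming the work is an index range split into one contiguous chunk per NUMA node so that workers on a node touch only that node's chunk. `partition_by_node` is a hypothetical helper; actually pinning threads or memory to a node requires the NUMA APIs the talk demonstrates.

```python
def partition_by_node(n_items, n_nodes):
    """Split range(n_items) into n_nodes contiguous chunks whose sizes
    differ by at most one item."""
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    base, extra = divmod(n_items, n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)  # first 'extra' nodes get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


for node, chunk in enumerate(partition_by_node(10, 4)):
    print(node, list(chunk))
# 0 [0, 1, 2]
# 1 [3, 4, 5]
# 2 [6, 7]
# 3 [8, 9]
```

Contiguous chunks keep each node's working set in one place, which is the "maximize locality" half of the slide; giving each node its own chunk rather than a shared queue addresses the "minimize contention" half.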

NUMA APIs
“Minimize Contention and Maximize Locality”

demo

Agenda
Windows Server 2008 R2

New NUMA APIs
New User-Mode Scheduling APIs
New C++ Concurrency Runtime

User Mode Scheduling (UMS)
System Call Servicing

[Diagram: user and kernel layers over Core 1 and Core 2. Primary threads: UT(P1) and UT(P2) in user mode, KT(P1) and KT(P2) in kernel mode. UMS worker threads UT(1)-UT(4) with backing kernel threads KT(1)-KT(4), shown as Parked. Labels: USched ready list; SYSCALL; "Migrate request to appropriate KT"; Running; Blocked; "Wake primary to regain core"; UMS completion list.]

User Mode Context Switch

Benefit
Lower context switch time means scheduling finer-grained items
UMS-based yield: 370 cycles
Signal-and-wait: 2600 cycles

Direct impact
Synchronization-heavy fine-grained work speeds up

Indirect impact
Finer grains mean more workloads are candidates for parallelization
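The link between switch cost and grain size can be made concrete with the two cycle counts from the slide. This is a rough sketch under a simplifying assumption: each work item pays exactly one context switch, and overhead is measured as switch cycles divided by switch plus work cycles. `min_grain_cycles` is a hypothetical helper, not something from the talk.

```python
UMS_YIELD_CYCLES = 370          # from the slide
SIGNAL_AND_WAIT_CYCLES = 2600   # from the slide


def min_grain_cycles(switch_cycles, max_overhead_percent):
    """Smallest work item (in cycles) for which one context switch per item
    costs at most max_overhead_percent of the total (switch + work)."""
    if not 0 < max_overhead_percent < 100:
        raise ValueError("max_overhead_percent must be between 0 and 100")
    # switch / (switch + grain) <= p / 100  =>  grain >= switch * (100 - p) / p
    numerator = switch_cycles * (100 - max_overhead_percent)
    return -(-numerator // max_overhead_percent)  # integer ceiling division


print(f"{SIGNAL_AND_WAIT_CYCLES / UMS_YIELD_CYCLES:.1f}x")  # 7.0x
print(min_grain_cycles(UMS_YIELD_CYCLES, 10))               # 3330
print(min_grain_cycles(SIGNAL_AND_WAIT_CYCLES, 10))         # 23400
```

Under this model, holding switch overhead to 10% needs work items of about 3,330 cycles with a UMS yield versus about 23,400 cycles with signal-and-wait, which is why cheaper switches make finer-grained work worth parallelizing.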

Getting the Processor Back

Benefit
The scheduler keeps control of the processor when work blocks in the kernel

Direct impact
More deterministic scheduling and better use of a thread’s quantum

Indirect impact
Better cache locality when algorithmic libraries take advantage of the determinism to manage available resources
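The benefit of getting the processor back can be illustrated with a toy model. It is a deliberately simplified sketch, not a description of how Windows or the Concurrency Runtime is implemented: one scheduler slot, tasks written as strings of `C` (one tick of compute) and `B` (one tick blocked in the kernel), and two hypothetical functions that compare a scheduler that loses the slot while a worker blocks against one that regains it.

```python
def makespan_blocking_holds_slot(tasks):
    """Toy 'scheduler does not regain the slot' case: blocked ticks still
    occupy the slot, so tasks effectively run back to back."""
    return sum(len(task) for task in tasks)


def makespan_scheduler_regains_slot(tasks):
    """Toy 'scheduler regains the slot' case: when a worker blocks, another
    ready worker runs on the same slot while the block elapses."""
    pos = [0] * len(tasks)    # index of the next step of each task
    wake = [0] * len(tasks)   # tick at which each task is runnable again

    def skip_blocked(i, now):
        # Consume a run of 'B' steps; they elapse off the slot.
        blocked = 0
        while pos[i] < len(tasks[i]) and tasks[i][pos[i]] == "B":
            pos[i] += 1
            blocked += 1
        wake[i] = now + blocked

    for i in range(len(tasks)):
        skip_blocked(i, 0)

    now = 0
    while any(pos[i] < len(tasks[i]) for i in range(len(tasks))):
        for i in range(len(tasks)):
            if pos[i] < len(tasks[i]) and wake[i] <= now:
                pos[i] += 1               # one tick of compute on the slot
                skip_blocked(i, now + 1)
                break
        now += 1                          # the slot idles if nothing was runnable
    return max([now] + wake)


tasks = ["CBBC", "CCC"]  # hypothetical workload
print(makespan_blocking_holds_slot(tasks))     # 7
print(makespan_scheduler_regains_slot(tasks))  # 5
```

In the second case the two blocked ticks of the first task overlap with compute from the second task, so the slot is never left idle.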

Agenda
Windows Server 2008 R2

New NUMA APIs
New User-Mode Scheduling
New C++ Concurrency Runtime

Visual Studio 2010
Tools, Programming Models, Runtimes

[Diagram: layered stack, keyed by managed library and native library.
Tools: Parallel Debugger; Profiler and concurrency analyzer.
Programming models: Task Parallel Library and PLINQ (managed); Parallel Pattern Library and Agents library (native); data structures on both sides.
Concurrency runtime: thread pool, task scheduler, and resource manager (managed); task scheduler and resource manager (native).
Operating system: Threads/UMS.]

Task Scheduling
Tasks are run by worker threads, which the scheduler controls

[Diagram: two timelines for worker threads WT0-WT3. Without UMS (signal-and-wait), the timeline shows a "Dead Zone". With UMS (UMS yield), it does not.]

User-Mode Scheduling APIs and the C++ Concurrency Runtime

“Cooperative Thread-Scheduling”

demo

Summary
Call to Action

Consider how your solution will scale on NUMA systems

Utilize the NUMA APIs to maximize node locality

Leverage UMS for custom user-mode thread scheduling

Use the C++ Concurrency Runtime for most native parallel computing scenarios and gain the benefits of NUMA/UMS implicitly

Resources

MSDN Concurrency Dev-Center
http://msdn.microsoft.com/concurrency

MSDN Channel9
http://channel9.msdn.com/tags/w2k8r2

MSDN Code Gallery
http://code.msdn.microsoft.com/w2k8r2

MSDN Server Dev Center
http://msdn.microsoft.com/en-us/windowsserver

64+ LP and NUMA API Support
http://code.msdn.microsoft.com/64plusLP
http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx

Dev-Team Blogs
http://blogs.msdn.com/pfxteam
http://blogs.technet.com/winserverperformance


Related Content

DTL203 "The Manycore Shift: Making Parallel Computing Mainstream"
Monday 5/11, 2:45-4:00, Room 404, Stephen Toub

DTL310 "Parallel Computing with Native C++ in Microsoft Visual Studio 2010"
Friday 5/15, 2:45-4:00, Room 515A, Josh Phillips

DTL403 "Microsoft Visual C++ Library, Language, and IDE: Now and Next"
Thursday 5/14, 4:30-5:45, Room 408A, Kate Gregory

DTL06-INT "Task-Based Parallel Programming with the Microsoft .NET Framework 4"
Thursday 5/14, 1:00-2:15, Blue Thr 2, Stephen Toub


question & answer

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.