Understanding and Optimizing the Performance of Internet-based
Systems
Ben Zorn
Performance Monitoring and Analysis Group
Programmer Productivity Research Center (PPRC)
Microsoft Research
2
Who “We” Are
• Performance Monitoring and Analysis Group
  - Part of PPRC (directed by Amitabh Srivastava)
  - Developers, testers, and researchers
  - Recently formed with emphasis on .NET systems
• Approach
  - Provide solutions to MS product teams through ideas, technologies, tools, and prototypes
  - Actively participate in the external research community through papers, leadership in the professional community, and grants
3
My Dad’s View of an Internet System
“There’s a little person inside.” - George Zorn
“Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
4
Outline
• Introduction and motivation
  - “Powers-of-ten” drill down
• A framework to attack the problem
• Specific examples from a case study: optimizing the memory hierarchy
5
Why Performance is Interesting and Hard
• Context
  - Internet systems are probably the most complicated artifact ever created by man
  - They are currently immature
• Improvements possible in 3 areas
  - Functionality, correctness, and performance
• My focus is on performance
  - Efficiency is a central theme of the Internet revolution
  - Easier / cheaper to get to information, give information, and make informed decisions
6
Distributed System “Powers of Ten”
• Inspired by the film “Powers of Ten” by Charles and Ray Eames
  - They looked at 38 orders of magnitude (local galactic group down to proton in a nucleus)
• We’ll drill down into computer abstractions
  - Consider different logical abstraction layers
7
Back to My Dad…
What really goes on in there?
8
Network “Cloud” View
[Diagram labels: client (Dad); modem link; ISP; Internet; MSN server.]
- Distinct roles
- Differentiated components
- Fewer than 7 items to remember
- We seem to “get it”
9
Expanded Cloud View
[Diagram labels: client (Dad); ISP; DNS resolution; router; MSN; streaming media; server cluster.]
10
Inside the Web Site
[Diagram labels: IP “director”; Web servers (front ends); database servers (back ends); interconnection topology.]
11
Inside a Web Server
[Diagram labels: device driver (two); network protocol stack; filter, parse request; get static page; Extension API; generate HTML; server extension; DB API; Web server program; operating system; DB server.]
12
Inside the Server Extension
[Flowchart labels: enter; call checkData; data valid? (T / F); call SQL API; return.]

proc checkData
  …
  load rx, addr 36
  use rx
  …
  load rx, addr 110
  use rx
  …
  return
13
Inside the Memory Hierarchy
[Diagram labels: load; CPU; L1 cache; L2 cache; Main Memory; Virtual Memory (Disk).]
14
Inside the CPU
[Diagram of the CPU internals, courtesy of Artur Klauser]
15
A Sea of Gates
Photo of a Pentium die
Image from the Computer Info Center, http://bwrc.eecs.berkeley.edu/CIC/
???
So what does Dad think about all this?
What can he or anybody do about performance?
16
Outline
• Introduction and motivation
• A framework to attack the problem
  - Resource management and optimization
  - Data collection, analysis, and action
• Specific examples from a case study: optimizing the memory hierarchy
17
Information is Essential
• Optimization is really resource allocation
• Allocation requires good decision making
  - Time / space trade-off
  - Where should data be stored, cached?
• Challenges
  - What information do we need?
  - How do we get it?
  - How do we manage it?
  - What does it mean?
18
What Information Can We Get?
• Tag “interesting” events (like FedEx tracks packages)
• Associate time/resources with events
• Accumulate and analyze data
[Diagram labels: event repository; time, id.]
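Editor's sketch (not from the talk): the three steps above, tag events, associate time with them, then accumulate and analyze, could look roughly like this. The `EventRepository` class, the request ids, the event names, and the timestamps are all hypothetical.

```python
import time
from collections import defaultdict


class EventRepository:
    """Hypothetical in-memory store of (timestamp, id, event) records."""

    def __init__(self):
        self.records = []

    def tag(self, request_id, event, timestamp=None):
        # Record that `request_id` reached `event` at `timestamp` (seconds).
        if timestamp is None:
            timestamp = time.perf_counter()
        self.records.append((timestamp, request_id, event))

    def stage_durations(self):
        # Group records per id, then measure the time between consecutive events.
        by_id = defaultdict(list)
        for timestamp, request_id, event in sorted(self.records):
            by_id[request_id].append((timestamp, event))
        durations = defaultdict(list)
        for events in by_id.values():
            for (t0, e0), (t1, e1) in zip(events, events[1:]):
                durations[(e0, e1)].append(t1 - t0)
        return durations


repo = EventRepository()
# Made-up timestamps so the example is deterministic.
repo.tag("req-1", "received", 0.000)
repo.tag("req-1", "parsed", 0.002)
repo.tag("req-1", "db_done", 0.050)
repo.tag("req-2", "received", 0.010)
repo.tag("req-2", "parsed", 0.011)
repo.tag("req-2", "db_done", 0.031)

# Mean time per stage across all tagged requests.
summary = {stage: sum(ts) / len(ts) for stage, ts in repo.stage_durations().items()}
print(summary)
```

In this made-up data the stage between "parsed" and "db_done" dominates, which is the kind of answer the tagging is meant to surface.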
19
Information Management is Essential
• It is easy to gather too much data
  - Our capacity to generate data follows Moore’s Law
• Data without context is less valuable
  - How do we relate the data gathered to the problems experienced?
  - Systems change (new builds daily)
• Our abstractions are immature, current approaches are ad hoc
• Data mining is a large potential opportunity
20
Example: Netmon Monitoring Tool
• Netmon provides info about network packets
• Netmon has a rich, extensible architecture (parsing, reporting)
• Netmon provides data, but management, analysis, etc. have to be layered on top of it

Example output:
00000060  00 01 00 5F 00 00 5C 00 5C 00 52 00 45 00 44 00  ..._..\.\.R.E.D.
00000070  2D 00 44 00 43 00 2D 00 32 00 37 00 2E 00 52 00  -.D.C.-.2.7...R.
00000080  45 00 44 00 4D 00 4F 00 4E 00 44 00 2E 00 43 00  E.D.M.O.N.D...C.
00000090  4F 00 52 00 50 00 2E 00 4D 00 49 00 43 00 52 00  O.R.P...M.I.C.R.
000000A0  4F 00 53 00 4F 00 46 00 54 00 2E 00 43 00 4F 00  O.S.O.F.T...C.O.
000000B0  4D 00 5C 00 49 00 50 00 43 00 24 00 00 00 3F 3F  M.\.I.P.C.$...??
000000C0  3F 3F 3F 00                                      ???.
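Editor's sketch (not from the talk): an illustration of the analysis that has to be layered on top of raw packet bytes. The alternating `00` bytes in the dump suggest a UTF-16LE string, so the code below decodes it on that assumption. It uses only the Python standard library, not any Netmon API.

```python
# Byte values copied from the example output above, one string per dump row.
HEX_ROWS = [
    "00 01 00 5F 00 00 5C 00 5C 00 52 00 45 00 44 00",
    "2D 00 44 00 43 00 2D 00 32 00 37 00 2E 00 52 00",
    "45 00 44 00 4D 00 4F 00 4E 00 44 00 2E 00 43 00",
    "4F 00 52 00 50 00 2E 00 4D 00 49 00 43 00 52 00",
    "4F 00 53 00 4F 00 46 00 54 00 2E 00 43 00 4F 00",
    "4D 00 5C 00 49 00 50 00 43 00 24 00 00 00 3F 3F",
    "3F 3F 3F 00",
]
raw = bytes.fromhex(" ".join(HEX_ROWS))

# Assumption: the text is UTF-16LE and starts at the first double backslash.
start = raw.find("\\\\".encode("utf-16-le"))

# Walk forward two bytes at a time until the 00 00 terminator (or end of data).
end = start
while raw[end:end + 2] not in (b"\x00\x00", b""):
    end += 2

unc_path = raw[start:end].decode("utf-16-le")
print(unc_path)
```

The decoded text reads as a share path, which is far easier to relate to a problem report than the bytes themselves.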
21
Conceptual Monitoring Framework
[Diagram labels: Sensors; Path Trace; HW Perf Counters; Network Trace; Store; Filter; Intrusion; Alerting; Leak Detector; Site Monitor; Event Bus; Tools; Actuators; Cluster; Detection; Reboot System; Weekly Report; Store; Management; Analysis.]
22
Outline
• Introduction and motivation
• A framework to attack the problem
• Specific examples from a case study: optimizing the memory hierarchy
  - Hardware performance counters
  - Vulcan: binary transformation infrastructure
  - Daedalus: data locality optimization
23
Parts of the Big Picture
• Many different groups are working from similar frameworks
  - Commercial efforts (e.g., Windows WMI)
  - Many research efforts (e.g., Internet-scale caching)
• I will focus on the lowest levels (CPU arch.)
  - Hardware can generate 100 million events/sec.
  - Data collection, reduction are significant problems
  - Concretely illustrates different parts of approach: data gathering, data reduction, abstraction
24
Optimizing the Memory Hierarchy
Level                   Size            Latency            UOT
CPU (load)                                                 1 word
L1 cache                64 Kbytes       1-4 cycles         32 bytes
L2 cache                1 Mbyte         10-20 cycles       32 bytes
Main Memory             100 Mbytes      100 cycles         32 bytes
Virtual Memory (Disk)   50,000 Mbytes   1,000,000 cycles   8192 bytes

UOT = Unit of Transfer
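Editor's sketch (not from the talk): a worked calculation of why the lower levels dominate, using the cycle counts from this slide. The latencies are taken from the slide (midpoints where a range is given). The miss rates are invented for illustration.

```python
def average_load_cycles(latencies, miss_rates):
    """Expected cycles per load.

    latencies:  cycles to access each level, nearest level first.
    miss_rates: fraction of accesses reaching a level that miss it; one entry
                per level except the last, which is assumed to always hit.
    """
    cost = latencies[-1]
    for latency, miss in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        cost = latency + miss * cost
    return cost


# L1, L2, main memory, disk (cycles). Midpoints of the slide's ranges for L1/L2.
LATENCIES = [2, 15, 100, 1_000_000]

# Hypothetical miss rates: 5% miss L1, 20% of those miss L2, 0.01% page-fault.
base = average_load_cycles(LATENCIES, [0.05, 0.2, 0.0001])

# Same program with ten times as many page faults.
worse = average_load_cycles(LATENCIES, [0.05, 0.2, 0.001])

print(base, worse)
```

Under these assumed rates, a tenfold rise in an already tiny page-fault rate roughly triples the average cost of a load, which is why placement across the hierarchy matters.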
25
Finding a Memory Problem
• Problem
  - Some loads take a long time, but which ones?
• Solution
  - Hardware vendors provide performance counters
  - Counters can be read, also interrupt processor
  - Causing interrupts at costly operations allows them to be recorded

proc Foo
  …
  load rx, addr 36
  use rx
  …
  load rx, addr 110
  use rx
  …
  return

[Diagram annotations on the code: “This load takes too long”; “This use of rx is what stalls”.]
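Editor's sketch (not from the talk): a software simulation of the idea that a counter can interrupt the processor at costly operations so they can be recorded. No real hardware counter API is used. The trace, the sampling period, and the instruction labels are hypothetical.

```python
from collections import Counter


def sample_miss_pcs(events, period):
    """Simulate a miss counter that interrupts every `period` misses.

    events: iterable of (instruction_label, missed) pairs for executed loads.
    Returns how many interrupts landed on each load instruction.
    """
    samples = Counter()
    misses_since_interrupt = 0
    for label, missed in events:
        if not missed:
            continue
        misses_since_interrupt += 1
        if misses_since_interrupt == period:
            samples[label] += 1          # the "interrupt handler" records the load
            misses_since_interrupt = 0
    return samples


def loop_trace(iterations):
    # Made-up behavior for the two loads in proc Foo.
    for i in range(iterations):
        yield ("load addr 36", i % 10 == 0)   # misses on every 10th iteration
        yield ("load addr 110", True)          # misses on every iteration


samples = sample_miss_pcs(loop_trace(1000), period=7)
print(samples.most_common())
```

Only one miss in seven is recorded, yet the samples still single out the load that misses most often, at a fraction of the cost of recording every event.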
26
Exposing Performance Information
[Diagram: the memory hierarchy (CPU, L1 cache, L2 cache, Main Memory, Virtual Memory (Disk)) with a load and performance counters C1 and C2 (L1 hits, L2 misses). Values shown: 257; 1,346; 15,304. Addresses shown: addr 36; addr 110; addr 60; addr 116.]
27
Extracting More Information
• New problem
  - Why was I calling procedure Foo?
  - What fraction of the total time did I spend in Foo?
• Solution: binary transformation
  - Program API to transform binary code
  - Calls to arbitrary routines can be “spliced” in
  - PPRC Vulcan infrastructure [Srivastava et al. ’00]: x86, IA64 binaries
  - Instrumentation can be added on-the-fly
28
Example Transformation
As code is executing, transform this:

proc Foo
  …
  load rx, addr 36
  use rx
  …
  load rx, addr 110
  use rx
  …
  return

To this:

proc Foo
  call probe_enter_Foo()
  …
  load rx, addr 36
  use rx
  …
  load rx, addr 110
  use rx
  …
  call probe_exit_Foo()
  return
29
How Vulcan Works
• Hierarchical interface to structure of binary
    foreach procedure …
      foreach basic block
        foreach instruct…
• Calls to arbitrary functions can be inserted anywhere

[Diagram labels: Program; proc Foo (Block 1, Block 2; load, use); proc Bar (Block 1, Block 2; move, shift, mult, add); call probe_enter_Foo; call probe_exit_Foo.]
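Editor's sketch (not from the talk): the procedure / basic block / instruction traversal, written against a toy in-memory representation. This is not Vulcan's API. The `Procedure` and `Block` classes, the string instructions, and the `return` added to Bar are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    instructions: list = field(default_factory=list)


@dataclass
class Procedure:
    name: str
    blocks: list = field(default_factory=list)


def add_probes(program):
    """Insert enter/exit probe calls into every procedure, in place."""
    for proc in program:                       # foreach procedure
        if not proc.blocks:
            continue
        # Entry probe goes first in the first basic block.
        proc.blocks[0].instructions.insert(0, f"call probe_enter_{proc.name}")
        for block in proc.blocks:              # foreach basic block
            rewritten = []
            for instr in block.instructions:   # foreach instruction
                if instr == "return":
                    # Exit probe goes immediately before every return.
                    rewritten.append(f"call probe_exit_{proc.name}")
                rewritten.append(instr)
            block.instructions = rewritten
    return program


program = [
    Procedure("Foo", [Block(["load rx, addr 36", "use rx"]),
                      Block(["load rx, addr 110", "use rx", "return"])]),
    Procedure("Bar", [Block(["move", "shift"]),
                      Block(["mult", "add", "return"])]),
]
add_probes(program)

for proc in program:
    print(proc.name, [block.instructions for block in proc.blocks])
```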
30
Vulcan Tricks
• Optimization example: instruction scheduling
  - If “load rx” takes 100 cycles, find useful work to do between load and use
• Other Vulcan uses
  - Code obfuscation
  - Binary matching
  - Software watermarking
  - Software testing tools: coverage, fault injection

proc Foo
  …
  load rx, addr 36
  [useful work not dependent on rx inserted here, during the 100-cycle delay]
  use rx
  …
  load rx, addr 110
  use rx
  …
  return
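Editor's sketch (not from the talk): a toy version of "find useful work to do between load and use". It hoists later instructions into the gap after a load when they share no registers with the load or with anything they would jump over. It assumes a straight-line block of register-only instructions with the branch or return terminator excluded. The register names and the greedy strategy are illustrative, not Vulcan's scheduler.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset
    reads: frozenset


def independent(a, b):
    """True if a and b may be reordered: neither writes a register the other uses."""
    return not (a.writes & (b.reads | b.writes) or b.writes & a.reads)


def fill_load_delay(block, load_index):
    """Move independent later instructions to just after block[load_index]."""
    load = block[load_index]
    hoisted = []   # work that can overlap with the load
    waiting = []   # instructions that must keep their place after the gap
    for instr in block[load_index + 1:]:
        if independent(instr, load) and all(independent(instr, w) for w in waiting):
            hoisted.append(instr)
        else:
            waiting.append(instr)
    return list(block[:load_index + 1]) + hoisted + waiting


def ins(text, writes, reads):
    return Instr(text, frozenset(writes), frozenset(reads))


block = [
    ins("load rx, addr 36", {"rx"}, set()),
    ins("add ry, rx, 1", {"ry"}, {"rx"}),        # the use of rx: must wait
    ins("mul rz, ra, rb", {"rz"}, {"ra", "rb"}),
    ins("add rw, rz, ry", {"rw"}, {"rz", "ry"}),  # needs ry, so it stays behind
    ins("sub rv, ra, 1", {"rv"}, {"ra"}),
]
scheduled = fill_load_delay(block, 0)
for instr in scheduled:
    print(instr.text)
```

Here the `mul` and `sub` move into the gap, while both `add` instructions stay in their original relative order after it.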
31
New Abstractions are Central to Success
• Problem
  - How to reorganize data for better locality?
• Context
  - Code reorganization is well understood because code structures are static
  - But… OO data structures are dynamic
• Solution
  - New abstraction: sequences of “hot” objects
  - Daedalus Project [Chilimbi PLDI ’01]
32
Revisiting the Memory Hierarchy
[Diagram: the memory hierarchy (CPU, L1 cache, L2 cache, Main Memory, Virtual Memory (Disk)) populated with objects, including named objects A through H.]
Goal: place “hot” objects closer to CPU
Constraint: assume UOT = 2 objects
Load sequence: A F B C A F E E A F B C…
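Editor's sketch (not from the talk): a small simulation of why placement matters when the unit of transfer is 2 objects. It replays the load sequence from the slide against an LRU cache and compares two memory layouts. Both layouts, the cache sizes, and the LRU policy are assumptions, not the layout shown in the diagram.

```python
from collections import OrderedDict


def count_misses(sequence, layout, capacity_blocks, unit=2):
    """Count misses in an LRU cache holding `capacity_blocks` transfer units.

    layout: objects in memory order; each run of `unit` neighbors moves together.
    """
    block_of = {obj: index // unit for index, obj in enumerate(layout)}
    cache = OrderedDict()                  # block id -> True, oldest first
    misses = 0
    for obj in sequence:
        block = block_of[obj]
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return misses


loads = "A F B C A F E E A F B C".split()
alphabetical = list("ABCDEFGH")   # pairs: AB, CD, EF, GH
hot_pairs = list("AFBCEDGH")      # pairs: AF, BC, ED, GH (co-locates A-F and B-C)

for capacity in (1, 2):
    print(capacity,
          count_misses(loads, alphabetical, capacity),
          count_misses(loads, hot_pairs, capacity))
```

Placing objects that are loaded back to back in the same transfer unit means each miss also brings in the next object needed.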
33
Potential for Performance Improvement
[Bar chart: normalized miss rate (axis 0 to 100). Bar labels: Base, Perfect Optimization.]
34
Daedalus Project
Goal: Analyze and exploit data locality
• Analyze locality
  - Represent very large streams of references (SEQUITUR algorithm [Nevill-Manning, Witten ’97])
  - Define new abstractions (hot data streams)
• Exploit locality
  - Build customized heap allocators (malloc/new)
  - Insert prefetching instructions (PIII, etc. support)
  - Data restructuring tools
35
SEQUITUR (Example)
Input:
aaabac aaabac aaabac aaabac aaabad aaabad aaabad aaabad aaabad aa

SEQUITUR grammar:
S -> BBDDCaa
A -> aaabac
B -> AA
C -> aaabad
D -> CC

[Diagram: DAG representation of the grammar, with nodes S, B, D, A, C and terminals a, b, c, d.]
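Editor's sketch (not from the talk): a check that the grammar on this slide really encodes the input string, plus a comparison of sizes. This only expands the grammar; it does not implement the SEQUITUR inference algorithm itself. The `expand` helper is hypothetical.

```python
# Rules copied from the slide. Uppercase symbols are rules, lowercase are terminals.
GRAMMAR = {
    "S": "BBDDCaa",
    "A": "aaabac",
    "B": "AA",
    "C": "aaabad",
    "D": "CC",
}


def expand(symbol, grammar):
    """Recursively replace rule names by their right-hand sides."""
    if symbol not in grammar:
        return symbol                      # terminal
    return "".join(expand(s, grammar) for s in grammar[symbol])


# The input from the slide with spaces removed: 4 x aaabac, 5 x aaabad, then aa.
trace = "aaabac" * 4 + "aaabad" * 5 + "aa"

expanded = expand("S", GRAMMAR)
grammar_size = sum(len(rhs) for rhs in GRAMMAR.values())
print(expanded == trace, len(trace), grammar_size)
```

The grammar reproduces the trace exactly while using fewer symbols, and the saving grows as the trace becomes longer and more repetitive.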
36
Locality Analysis
[Diagram of the analysis pipeline. Labels: Program; Execute; program data reference trace; SEQUITUR; Whole Program Streams (grammar DAG); Pruning; Hot Program Streams; hot data stream analyses; Hot Data Streams.]
37
Daedalus Highlights
• Data reference representations
  - 100 to 10,000 times smaller than data reference trace
• Data restructuring recommendations
  - Improved execution time of several programs by 8-15% with small header file modifications
• Custom heap allocators
  - Automatically reduced working set size by up to 40% and TLB misses by up to 90%
• In progress
  - Automatic prefetching, smart copying garbage collection, scalability optimizations, dynamic on-line optimizations
38
What’s This Got to Do with the Internet?
• Approach remains the same
  - Record interesting behavior (e.g. network packets)
  - Reduce large data volumes: compression, summarization, presenting differences, etc.
  - Find interesting patterns that correspond to performance (security, correctness) issues
  - Display information using visualizations / abstractions that match the problem domain
• An easier problem?
  - It will take time to know for sure…
39
Summary
• What’s my Dad to think?
  - Internet systems are usable today… but extremely complex
  - Ability to understand existing systems is immature
  - Technology still rapidly changing, following Moore’s Law curve
• But…
  - Microsoft’s .NET initiative sets the stage for our opportunity and challenges
  - Our approach is pragmatic, effective
40
More Information
Related Resources
• MSR, PPRC, PMA: http://research.microsoft.com/pprc/pma.asp
• Vulcan
  - Srivastava et al., “Binary Transformation in a Distributed Environment”, MSR Technical Report.
  - Srivastava, “Emerging Opportunities for Binary Tools”, Keynote Talk, WBT 2000, October 2000.
• Daedalus Project
  - http://research.microsoft.com/users/trishulc/Daedalus.htm
  - Chilimbi, “Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality”, PLDI 2001, June 2001.
Contact me: [email protected]
41
Backup Slides
42
The Process of Optimizing Performance
Where’s the bottleneck?Who is at fault? How to find out?
What tool to use? How to use? How to understand?
Will my effort be worth it?
I’m happy now…but what about next time?
Suppose performance is poor…
43
A Framework for Monitoring Systems
• Goals
  - Collect data at all system levels
  - Approximate continuous monitoring closely
• Component classes
  - Sensors (gathering data)
  - Management (communicate, summarize, store)
  - Analysis (recognizing patterns and relationships)
  - Tools (human feedback)
  - Actuators (take action directly)
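Editor's sketch (not from the talk): one way the component classes could fit together around an event bus. A sensor publishes samples, a management component stores them, an analysis component looks for a pattern, and an actuator reacts. The `EventBus` class, the topic names, the memory readings, and the leak rule are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe hub connecting the components."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self.subscribers[topic]):
            handler(payload)


bus = EventBus()
store = []      # management: keeps every sample
actions = []    # actuator: records the actions it would take

bus.subscribe("memory_mb", store.append)


def leak_detector(sample, window=3):
    # Analysis: alert when the last `window` stored samples strictly increase.
    recent = store[-window:]
    if len(recent) == window and all(a < b for a, b in zip(recent, recent[1:])):
        bus.publish("alert", f"possible leak: {recent}")


bus.subscribe("memory_mb", leak_detector)   # runs after the store handler
bus.subscribe("alert", actions.append)

# Sensor: made-up memory readings in megabytes.
for sample in [100, 100, 104, 109, 115]:
    bus.publish("memory_mb", sample)

print(actions)
```

Because components only share the bus, a new sensor or analysis can be added without changing the others.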
44
Reference Skew (Code Vs. Data)
[Chart: Addr / PC reference skew. X axis: % of addr / load-store pc (0 to 10). Y axis: % of program data references (0 to 120). Series: twolf, perlbmk, eon, mcf, and sqlserver, each plotted for Addr and for PC.]