Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan
Microsoft Research Cambridge

James Bulpin
University of Cambridge

12 November 2003

Performance in distributed systems

• Faults in distributed systems are notoriously hard to diagnose

• Performance problems are even more subtle to debug
  • Often transient or affect only a subset of requests / users
  • Frequently involve complex interactions between multiple machines
  • Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach

• Track individual requests end to end
  • Observe control flow (causality)
  • Monitor resource consumption: CPU, bandwidth, disk
  • Debug performance “in the small”

• Build a probabilistic workload model from the aggregate requests
  • Cluster similar requests according to their observed behaviour
  • Debug performance “in the large”

How do we use this information?

• Performance debugging
  • Why did this request take much longer than that request?
  • Fault detection
  • Configuration and management

• Performance prediction
  • Realistic workload models for capacity planning
  • Obtain automatically on a “live” system

Magpie components

• Instrumentation
  • System activity recorded to logs

• Generic request parser
  • Extract individual requests from logs according to an event schema

• Model construction
  • Behavioural clusters
  • Probabilistic state machine

Outline

• Introduction
• What is a request?
• Instrumentation
• Request extraction
• Modelling
• Current status

What is a request?

• System activity which takes place in response to an action initiated by the application being traced
  • HTTP request
  • Database query
  • File open request

• We describe a request as
  • The sequence of application components involved in its processing
  • The resources consumed at each stage
    • CPU, bandwidth, disk transfer size, (latency)

A typical e-commerce site (1)

[Figure: site topology. Labels: Internet, Web Front Ends, SQL Servers, Storage.]

A typical e-commerce site (2)

[Figure: software stack of the Web Server and SQL Server machines. Labels: Filter, http.sys, IIS, CLR, ASP.NET, ADO.NET, Application Logic, Static Content, Stored procedures, Data, WinSock2 API, Kernel.]

HTTP request: detailed view

[Figure: timeline of one HTTP request between 10.051s and 10.155s, showing threads WEB.eec, WEB.398 and SQL.9c4 together with Disk, Net RX and Net TX activity. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other.]

Annotations on the timeline:
• HTTP request packet
• IIS worker thread picks up request from http.sys
• Sync WinSock send to SQL Server
• ASP.NET thread blocks after RPC to database
• ASP.NET worker thread takes over
• TDS request and reply packets sent and received
• SQL thread unblocks
• HTTP response packets sent back to client
• IIS worker thread wakes up to write log

Why is request tracking hard?

• Many components, multiple machines
  • Must track control flow across machines

• No globally unique request ID
  • Components are developed independently

• Multiple thread pools
  • Many threads participate in processing a request

• Asynchronous communication
  • Must match send/recvs between threads/machines

• Hand-rolled synchronization primitives
  • SQL Server has a user-mode scheduler

Event Tracing for Windows

• Low-overhead event mechanism
  • Events timestamped with cycle counter
  • Global ordering on events on a single machine
  • Can enable/disable sets of events at runtime

• Using ETW in Magpie
  • Each instrumentation point posts an event
  • Events are logged to disk
  • Logs are post-processed to extract requests
  • Can also consume events in real time

Instrumentation points

• Existing ETW event providers
  • IIS, kernel

• App-specific hooks
  • IIS, ASP.NET, SQL Server

• Detours
  • Wrap DLLs to trap Win32 and WinSock2 calls

• WinPcap
  • Capture packets on the wire

CPU usage from kernel events

• The ETW kernel logger records every context switch
  • How do we know which cycles are used for which request?

• We can attribute cycles to a request by
  • An application-specific event which occurs within a delimited sector of CPU time, or
  • The current context of execution, e.g. thread id

Example: protocol processing in a DPC

[Figure: events cswitch, DPC start, pkt recv and DPC end along a time axis, with the cycle count between events attributed to Request 1 or Request 2.]
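The two attribution rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Magpie's implementation: the event tuples, the kind names ("cswitch", "dpc_start", "pkt_recv", "dpc_end") and the static `thread_owner` map are hypothetical, and a real parser would update thread bindings as events arrive.

```python
from collections import defaultdict

def attribute_cycles(events, thread_owner):
    """Charge CPU cycles to requests from a time-ordered event list.

    events: (cycle_count, kind, value) tuples (hypothetical format), where
      "cswitch"   -> value is the thread id switched in
      "dpc_start" / "dpc_end" -> value is unused
      "pkt_recv"  -> value is the request the received packet belongs to
    thread_owner: thread id -> request id (assumed fixed for this sketch)
    """
    usage = defaultdict(int)
    current_tid = None   # thread running outside any DPC
    last_ts = None       # start of the interval not yet charged
    dpc_start = None     # cycle count at which the open DPC began
    dpc_request = None   # request identified by an event inside the DPC

    def charge_thread(now):
        # Rule 2: charge by current context of execution (thread id).
        request = thread_owner.get(current_tid)
        if request is not None and last_ts is not None:
            usage[request] += now - last_ts

    for ts, kind, value in events:
        if kind == "cswitch":
            charge_thread(ts)
            current_tid, last_ts = value, ts
        elif kind == "dpc_start":
            charge_thread(ts)          # interrupted thread pays up to here
            dpc_start, dpc_request = ts, None
        elif kind == "pkt_recv" and dpc_start is not None:
            dpc_request = value        # Rule 1: event inside delimited section
        elif kind == "dpc_end" and dpc_start is not None:
            if dpc_request is not None:
                usage[dpc_request] += ts - dpc_start
            dpc_start, last_ts = None, ts   # interrupted thread resumes
    return dict(usage)

example_events = [
    (0, "cswitch", 0xA),
    (100, "dpc_start", None),
    (120, "pkt_recv", "request2"),
    (150, "dpc_end", None),
    (200, "cswitch", 0xB),
    (260, "cswitch", 0xA),
]
example_usage = attribute_cycles(example_events, {0xA: "request1", 0xB: "request2"})
```

In this example the DPC runs while thread 0xa is scheduled, but its 50 cycles are charged to the request whose packet it processed rather than to the interrupted thread.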

Application and middleware events

• Cover points where flow of control moves between components

• Cover points where resources are multiplexed and demultiplexed
  • E.g. user-level scheduling primitives

• Propagation of a global request id is not required!
  • Magpie used to do this but not any more

Instrumenting a web service

[Figure: the Web Server and SQL Server software stacks with instrumentation points marked. Components shown: Filter, http.sys, IIS, CLR, ASP.NET, ADO.NET, Application Logic, Static Content, Stored procedures, Data, Kernel. Instrumentation shown: ISAPI Filter, HTTPModule, CLR profiler, Wrappers, Extended SPs, WinSock2 API Intercept, Event Tracing for Windows, Packet capture.]

Generic request extraction

• No inbuilt assumptions about the system or the application
  • No common unique identifier

• Schema specifies semantics of events
  • Easy to add new event types

• Parser stitches events into requests based on event semantics

Terminology

• Namespace
  • Event parameter which references an entity in the system, e.g. thread id

• Timeline
  • Instantiation of a namespace with a unique value, e.g. thread id = 0xa

• Events bind or unbind requests to timelines
  • Bindings capture the semantics of each event for a particular request type

Example: connecting events

[Figure: timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, with events Enter Recv, cswitch, DPC start, TCP pkt, DPC end and Recv returns joining the timelines that belong to Request 1 and Request 2.]
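A minimal sketch of schema-driven binding, assuming a much simpler schema than Magpie's actual one. The schema keys ("bind", "unbind", "start"), the event names and the dictionary event format are all hypothetical; the point is only that events sharing a timeline get stitched into the same request without any global request id.

```python
from collections import defaultdict

# Hypothetical schema: for each event type, the namespaces it binds together,
# the namespaces it releases afterwards, and whether it may start a request.
SCHEMA = {
    "tcp_pkt":      {"bind": ["connid"], "start": True},
    "recv_returns": {"bind": ["connid", "tid"]},
    "send":         {"bind": ["tid", "connid"], "unbind": ["tid"]},
}

def extract_requests(events, schema):
    """Group events into requests by following timeline bindings."""
    owner = {}                     # timeline (namespace, value) -> request id
    requests = defaultdict(list)   # request id -> event types, in order
    next_id = 1
    for ev in events:
        rule = schema[ev["type"]]
        timelines = [(ns, ev[ns]) for ns in rule["bind"]]
        # Reuse the request already bound to one of this event's timelines.
        # (A conflict between timelines is resolved by taking the first.)
        request = next((owner[t] for t in timelines if t in owner), None)
        if request is None:
            if not rule.get("start"):
                continue           # cannot be attributed to any request
            request, next_id = next_id, next_id + 1
        for t in timelines:
            owner[t] = request     # bind every referenced timeline
        requests[request].append(ev["type"])
        for ns in rule.get("unbind", []):
            owner.pop((ns, ev[ns]), None)
    return dict(requests)

example_log = [
    {"type": "tcp_pkt", "connid": 0xD},
    {"type": "recv_returns", "connid": 0xD, "tid": 0xA},
    {"type": "tcp_pkt", "connid": 0xE},
    {"type": "recv_returns", "connid": 0xE, "tid": 0xB},
    {"type": "send", "connid": 0xD, "tid": 0xA},
    {"type": "send", "connid": 0xE, "tid": 0xB},
]
example_requests = extract_requests(example_log, SCHEMA)
```

Adding a new event type in this sketch means adding one schema entry; the parser loop itself does not change.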

End-to-end request extraction

• An instance of the request parser runs on each machine in the distributed system
  • Online or offline mode

• Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
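The offline joining step can be pictured as grouping fragments that share a packet identifier. The sketch below uses a small union-find for this; the fragment format (a dict with a "packet_ids" set) is an assumption made for illustration. Note that the IPv4 identification field is only 16 bits wide, so a real implementation would presumably also need addresses or timestamps to disambiguate reused values.

```python
def stitch_fragments(fragments):
    """Group per-machine request fragments that share any packet identifier.

    fragments: list of dicts with a "packet_ids" set (hypothetical format).
    Returns a sorted list of groups, each a list of fragment indices.
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Follow parent links to the group representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}   # packet id -> index of first fragment that mentions it
    for idx, frag in enumerate(fragments):
        for pkt in frag["packet_ids"]:
            if pkt in first_seen:
                parent[find(idx)] = find(first_seen[pkt])   # merge groups
            else:
                first_seen[pkt] = idx

    groups = {}
    for idx in range(len(fragments)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

example_fragments = [
    {"machine": "web", "packet_ids": {101, 102}},
    {"machine": "sql", "packet_ids": {102, 103}},
    {"machine": "web", "packet_ids": {200}},
]
example_groups = stitch_fragments(example_fragments)
```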

Clustering for workload generation

• Target the Indy performance modelling tool
  • Calculates throughput, bottlenecks
  • Needs transaction mix, resource consumption

• Previously: microbenchmark approach
  • Run 10000 of each “transaction type” (URL)
  • Divide aggregate resource usage by 10000

• Aim: provide realistic workload models
  • From real, mixed workloads
  • Derive transaction “types” automatically

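Single request: cartoon view

• Partial ordering of events
  • Annotated with resource usage

[Figure: one request drawn as a partially ordered sequence of segments annotated with CPU times (1ms to 6ms), reads of 192k and 24k, and transfer sizes of 6k/1k and 12k/1k. Key: IIS CPU, ASP.NET CPU, SQL Server CPU, Disk, Network.]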

Behavioural clustering of requests

• Represent requests as event strings
  • “Flatten” out any concurrency

• Use Levenshtein string edit distance
  • Modified to factor in resource usage vectors

• Cluster requests based on this distance
  • Linear-time algorithm

• Each cluster is a request “type”
  • Select representative from near centroid
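The slide does not say how resource usage enters the edit distance or which linear-time clustering algorithm is used, so the sketch below makes two labelled assumptions: matching events pay a substitution cost proportional to their relative difference in resource usage, and clustering is a single-pass "leader" scheme that compares each request only against one representative per cluster. The weight and threshold values are arbitrary.

```python
def edit_distance(a, b, resource_weight=0.5):
    """Levenshtein distance over lists of (event, resource_vector) pairs.

    Assumption: insert/delete cost 1, mismatched events cost 1, and matching
    events cost resource_weight times their mean relative resource difference.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, (event_a, res_a) in enumerate(a, 1):
        cur = [float(i)]
        for j, (event_b, res_b) in enumerate(b, 1):
            if event_a == event_b:
                diffs = [abs(x - y) / max(x, y, 1e-9) for x, y in zip(res_a, res_b)]
                sub = resource_weight * sum(diffs) / max(len(diffs), 1)
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1,         # delete from a
                           cur[j - 1] + 1,      # insert into a
                           prev[j - 1] + sub))  # match or substitute
        prev = cur
    return prev[-1]

def leader_cluster(requests, threshold):
    """One pass: join the first cluster whose representative is close enough."""
    clusters = []   # list of (representative, members)
    for request in requests:
        for representative, members in clusters:
            if edit_distance(representative, request) <= threshold:
                members.append(request)
                break
        else:
            clusters.append((request, [request]))
    return clusters

# Events carry a one-element resource vector (CPU ms) in this toy example.
r1 = [("iis", (5,)), ("asp", (6,)), ("sql", (3,))]
r2 = [("iis", (5,)), ("asp", (12,)), ("sql", (3,))]
r3 = [("iis", (2,))]
example_clusters = leader_cluster([r1, r2, r3], threshold=0.5)
```

Here the first member of each cluster stands in as its representative; choosing a member near the centroid, as the slide describes, would need a second pass over each cluster.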

Build a workload model by clustering similar requests

Requests in the same cluster often have different URLs, and one URL may appear in many clusters

[Figure: five clusters, A to E, each shown as a representative request annotated with CPU times and transfer sizes.]

Cluster   Share
A         7%
B         10%
C         15%
D         5%
E         63%

Taking it further: work-in-progress

• Online and incremental modelling:
  • Detect component failure
  • Detect sudden shifts in workload

• More sophisticated models
  • Learn the probabilistic state machine for each request
  • cf. flowcharts annotated with performance information

• “Bayesian watchdogs”
  • Compute the likelihood of a request’s behaviour as it moves through the system
  • Deal with “unlikely” requests appropriately
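As a rough illustration of scoring a request's likelihood, the sketch below substitutes a first-order Markov chain over event names for the probabilistic state machine; that substitution, the `floor` probability for unseen transitions, and the event strings are all assumptions, not the model described in the talk.

```python
import math
from collections import defaultdict

START, END = "<start>", "<end>"

def train(event_strings):
    """Estimate transition probabilities from observed event sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for events in event_strings:
        prev = START
        for event in list(events) + [END]:
            counts[prev][event] += 1
            prev = event
    return {
        prev: {event: n / sum(nxt.values()) for event, n in nxt.items()}
        for prev, nxt in counts.items()
    }

def log_likelihood(model, events, floor=1e-6):
    """Sum of log transition probabilities; unseen transitions get `floor`."""
    total, prev = 0.0, START
    for event in list(events) + [END]:
        total += math.log(model.get(prev, {}).get(event, floor))
        prev = event
    return total

# Toy training set: nine requests behave as "abc", one as "abd".
example_model = train(["abc"] * 9 + ["abd"])
usual = log_likelihood(example_model, "abc")
rare = log_likelihood(example_model, "abd")
unseen = log_likelihood(example_model, "acb")
```

Because the score is a running sum, it can be updated event by event while a request is still in flight, and a request whose score drops below a chosen threshold can be flagged for closer inspection.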

Current status

• Recent focus has been developing a generic request extraction scheme
  • Prototype for 2-machine e-commerce site
  • TPC-W style workload

• Prototype for single machine SQL Server 2000
  • Challenge is user mode scheduler
  • TPC-C workload

• Other applications on the way
  • Large-scale
  • “Real” systems with “real” performance problems

Conclusion

• Magpie is a tool for performance analysis in a distributed system

• Bottom up, per-request approach
  • Complementary to existing techniques:
    • Performance counters
    • Program profiling

• Feeds into performance debugging and prediction tools

Work-in-progress: learning the probabilistic state machine

• Infer a stochastic regular grammar from a sample set of strings
  • Each state transition emits a character and has an associated probability
  • Use the Alergia algorithm (Carrasco & Oncina '94)
    • Construct a prefix tree from the sample set
    • Merge similar subtrees

• Apply to Magpie requests
  • “Just” event strings…
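The two Alergia steps named above can be sketched as follows. This shows only the frequency prefix tree and a Hoeffding-bound compatibility test in the form I understand the published algorithm to use; the actual merging of compatible states is omitted, and the class and function names, as well as the default `alpha`, are illustrative choices.

```python
import math

class Node:
    def __init__(self):
        self.count = 0      # number of sample strings passing through
        self.ends = 0       # number of sample strings ending here
        self.children = {}  # symbol -> Node

def build_prefix_tree(samples):
    """Build a prefix tree whose nodes carry visit and termination counts."""
    root = Node()
    for string in samples:
        node = root
        node.count += 1
        for symbol in string:
            node = node.children.setdefault(symbol, Node())
            node.count += 1
        node.ends += 1
    return root

def differ(f1, n1, f2, n2, alpha):
    """True if frequencies f1/n1 and f2/n2 differ beyond the Hoeffding bound."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound

def compatible(a, b, alpha=0.05):
    """Recursively compare termination and transition frequencies."""
    if differ(a.ends, a.count, b.ends, b.count, alpha):
        return False
    for symbol in set(a.children) | set(b.children):
        child_a, child_b = a.children.get(symbol), b.children.get(symbol)
        freq_a = child_a.count if child_a else 0
        freq_b = child_b.count if child_b else 0
        if differ(freq_a, a.count, freq_b, b.count, alpha):
            return False
        if child_a and child_b and not compatible(child_a, child_b, alpha):
            return False
    return True

example_tree = build_prefix_tree(["xa"] * 100 + ["ya"] * 100 + ["zb"] * 100)
after_x = example_tree.children["x"]
after_y = example_tree.children["y"]
after_z = example_tree.children["z"]
```

In this toy sample the states reached after "x" and "y" behave identically and would be candidates for merging, while the state after "z" would not. Dividing a child's count by its parent's count gives the transition probability attached to that edge.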

Ongoing work with Alergia

• Tuning the similarity criterion

• Factoring in resource usage information

• Can we identify event sequences with suspiciously low probability?
  • Run online for anomaly detection?