health & status monitoring (2010-v8)

Health & Status Monitoring: Two Case Studies, Robert Grossman, Open Data Group, February 18, 2010

DESCRIPTION

This is a variant of a talk that I gave at Predictive Analytics World in February 2010.

TRANSCRIPT

Page 1: Health & Status Monitoring (2010-v8)

Health & Status Monitoring: Two Case Studies

Robert Grossman Open Data Group

February 18, 2010

Page 2: Health & Status Monitoring (2010-v8)

1. Introduction

Page 3: Health & Status Monitoring (2010-v8)

Traditional Approach

• Two types of variation:
  – Common-cause variation (noise) occurs as a normal part of the manufacturing process
  – Special-cause variation represents a potential problem

Page 4: Health & Status Monitoring (2010-v8)

[Figure: Shewhart control chart used by NIST for calibrating the standard kilogram, with limits marked at 3 s. Source: NIST]
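The idea behind a Shewhart chart can be sketched in a few lines of Python. This is a minimal illustration and not the NIST procedure: the function names `control_limits` and `out_of_control` and the sample data are assumptions made up for the example. Limits are set at the baseline mean plus or minus three sample standard deviations, and any new point outside them is treated as a possible special cause.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control data."""
    center = mean(baseline)
    s = stdev(baseline)  # sample standard deviation of the baseline
    return center - k * s, center, center + k * s


def out_of_control(values, limits):
    """Return the indices of points that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(values) if x < lower or x > upper]


# Hypothetical measurements: a baseline period, then new observations.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(baseline)
new_points = [10.0, 10.3, 9.7, 11.5, 10.1]
flagged = out_of_control(new_points, limits)
```

Points inside the limits are treated as common-cause noise; only the flagged indices would be investigated.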

Page 5: Health & Status Monitoring (2010-v8)

Shewhart / Deming Cycle

• Plan – identify opportunity or problem and make a plan.

• Do – implement the change on a small scale and collect the data.

• Check – perform a statistical analysis and check if there was an impact.

• Act – if there was an impact, broaden the scale and continuously improve your results.

Page 6: Health & Status Monitoring (2010-v8)

Case Study 1. Data Center

• Thousands of servers

• Complex workloads

• Large variations are normal

• Problems make the front page

Page 7: Health & Status Monitoring (2010-v8)

Case Study 2. Payments Network

• Billion+ cards
• 100+ million terminals
• Millions of merchants
• Thousands of transactions per second
• Thousands of member banks
• Data highly heterogeneous
  – Variations among products
  – Variations among cardholders
  – Variations among merchants
  – Variations among banks
  – Variations among payment networks

Page 8: Health & Status Monitoring (2010-v8)

The Challenge Today

• Many sources and data feeds
• Data is complex and highly heterogeneous
• High volume, streaming data from around the world
• Multiple parties involved, each of which can modify the data in subtle ways

Page 9: Health & Status Monitoring (2010-v8)

Health & Status Monitoring Systems

                   | SQC                                             | HSM
What is monitored? | Single assembly line producing physical widgets | Digital system with thousands of data feeds
Type of model      | Exceed 3 standard deviations                    | ?
# of models        | Single model                                    | ?
Visualization      | Control chart                                   | ?
Process            | Plan-do-check-act                               | ?

Page 10: Health & Status Monitoring (2010-v8)

2. The Technology

Page 11: Health & Status Monitoring (2010-v8)

[Figure: Observed Model and Baseline Model]

• CUSUM models
• GLR models
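As an illustration of the change-detection models named on this slide, here is a minimal sketch of a two-sided tabular CUSUM in Python. It is a textbook form and not necessarily the variant used in the case studies; the function name `cusum_alarms`, the parameter names, and the sample stream are assumptions for the example. The detector accumulates deviations from a baseline target and raises an alarm once the accumulated drift exceeds a threshold.

```python
def cusum_alarms(values, target, slack, threshold):
    """Two-sided tabular CUSUM.

    Accumulates deviations above and below `target`, ignoring deviations
    smaller than `slack`, and returns the indices at which either
    cumulative sum exceeds `threshold`.
    """
    high = low = 0.0
    alarms = []
    for i, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            alarms.append(i)
            high = low = 0.0  # restart accumulation after an alarm
    return alarms


# Hypothetical stream: the level shifts upward from about 10.0 to about 10.6.
stream = [10.0, 10.1, 9.9, 10.0, 10.6, 10.7, 10.5, 10.8, 10.6]
alarms = cusum_alarms(stream, target=10.0, slack=0.25, threshold=1.0)
```

Unlike a single-point limit check, the cumulative sums respond to a small but sustained shift, which is why this family of models suits baseline-versus-observed comparisons.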

Page 12: Health & Status Monitoring (2010-v8)

Build more than 10^4 Models: One for Each Cell in Cube of Models

Modeling using Cubes of Models (MCM)

• Build separate model for each bank (1000+)
• Build separate model for each geographical region (6 regions)
• Build separate model for each different type of merchant (over 800 types of merchants)
• For each distinct cube, build a distinct model
• 15,000+ separate baselines

[Figure: cube of models with axes Bank, Geospatial region, and Type of Transaction]
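A cube of models can be sketched as a dictionary with one small baseline per cell. The following is a minimal illustration in Python, not the Augustus implementation: the class `RunningBaseline`, the function `is_unusual`, the key layout (bank, region, merchant type), and the sample records are all assumptions for the example.

```python
from collections import defaultdict
from math import sqrt


class RunningBaseline:
    """Running mean and variance for one cell (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def is_unusual(cube, key, x, k=3.0):
    """True if x is more than k standard deviations from the cell's mean."""
    cell = cube.get(key)
    if cell is None or cell.n < 2:
        return False  # not enough history for this cell
    return abs(x - cell.mean) > k * cell.std()


# One baseline per (bank, region, merchant type) cell; records are hypothetical.
cube = defaultdict(RunningBaseline)
records = [
    ("bank_a", "region_1", "grocery", 2.0),
    ("bank_a", "region_1", "grocery", 4.0),
    ("bank_a", "region_1", "grocery", 6.0),
    ("bank_b", "region_1", "grocery", 10.0),
]
for bank, region, merchant_type, value in records:
    cube[(bank, region, merchant_type)].update(value)
```

Because each cell keeps its own baseline, a value that is normal for one bank can still be flagged as unusual for another.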

Page 13: Health & Status Monitoring (2010-v8)

[Figure: system architecture in four stages; labels recovered from the diagram are listed below]

1. Data collection: operational systems, data feeds, warehouses, …; data updates
2. Off-line modeling: Data Mining Mart; learning sets; Data Mining System; PMML models
3. On-line scoring: Model Consumer; Entity/Feature Database; events; features; candidate alerts
4. Reporting: Rules; Dashboard engine; reports
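The separation between off-line modeling and on-line scoring can be sketched as follows. This is a minimal illustration under stated assumptions: JSON stands in for PMML (which is an XML format) only to keep the example short, and the function names `fit_baseline` and `score`, the field names, and the data are hypothetical.

```python
import json
from statistics import mean, stdev


def fit_baseline(learning_set):
    """Off-line step: estimate a baseline and serialize it as a model document."""
    model = {"mean": mean(learning_set), "std": stdev(learning_set)}
    return json.dumps(model)  # stand-in for exporting a PMML model


def score(model_document, events, k=3.0):
    """On-line step: load the serialized model and emit candidate alerts."""
    model = json.loads(model_document)
    return [
        {"index": i, "value": x}
        for i, x in enumerate(events)
        if abs(x - model["mean"]) > k * model["std"]
    ]


# The modeling system and the scorer share only the serialized model.
model_document = fit_baseline([50.0, 52.0, 48.0, 51.0, 49.0])
candidate_alerts = score(model_document, [50.0, 53.0, 60.0, 47.0])
```

Keeping the model as a document lets the baseline be rebuilt off-line on its own schedule while the scorer keeps running against the last published version.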

Page 14: Health & Status Monitoring (2010-v8)

Augustus

• Augustus is an open source data mining platform:
  – Used to estimate baselines for over 15,000 separate segmented models
  – Used to score high volume operational data and issue alerts for follow up investigations
• Augustus is PMML compliant
• Augustus scales with
  – Volume of data
  – Real time transaction streams (15,000/sec+)
  – Number of segmented models (10,000+)

Page 15: Health & Status Monitoring (2010-v8)

Greedy Meaningful/Manageable Balancing (GMMB) Algorithm

One model for each cell in data cube

• Fewer alerts
• Alerts more manageable
• To decrease alerts, consider removing each breakpoint, order the candidates by the number of alerts removed, & select one or more breakpoints to remove
• More alerts
• Alerts more meaningful
• To increase alerts, consider adding breakpoints that split cubes, order the candidates by the number of new alerts, & select one or more new breakpoints

[Figure: data cube with a breakpoint marked]
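The greedy step described on this slide can be sketched in Python as below. The slide does not give the details of GMMB, so this is only one plausible reading: the alert counter is supplied by the caller, and the names `rank_removals`, `rank_additions`, `rebalance`, the `low`/`high` alert targets, and the toy counter are all assumptions for the example.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

AlertCounter = Callable[[FrozenSet[int]], int]


def rank_removals(breakpoints: FrozenSet[int],
                  count_alerts: AlertCounter) -> List[Tuple[int, int]]:
    """(alerts removed, breakpoint) pairs, largest decrease first."""
    base = count_alerts(breakpoints)
    pairs = [(base - count_alerts(breakpoints - {b}), b) for b in breakpoints]
    return sorted(pairs, reverse=True)


def rank_additions(breakpoints: FrozenSet[int], candidates: Iterable[int],
                   count_alerts: AlertCounter) -> List[Tuple[int, int]]:
    """(new alerts, breakpoint) pairs, largest increase first."""
    base = count_alerts(breakpoints)
    pairs = [(count_alerts(breakpoints | {c}) - base, c)
             for c in candidates if c not in breakpoints]
    return sorted(pairs, reverse=True)


def rebalance(breakpoints: Iterable[int], candidates: Iterable[int],
              count_alerts: AlertCounter, low: int, high: int,
              max_steps: int = 10) -> FrozenSet[int]:
    """Greedily add or remove one breakpoint per step until low <= alerts <= high."""
    current = frozenset(breakpoints)
    candidates = list(candidates)
    for _ in range(max_steps):
        n = count_alerts(current)
        if n > high:  # too many alerts: merge cells
            ranked = rank_removals(current, count_alerts)
            if not ranked or ranked[0][0] <= 0:
                break
            current = current - {ranked[0][1]}
        elif n < low:  # too few alerts: split cells
            ranked = rank_additions(current, candidates, count_alerts)
            if not ranked or ranked[0][0] <= 0:
                break
            current = current | {ranked[0][1]}
        else:
            break
    return current


# Toy counter (hypothetical): each breakpoint contributes a fixed number of alerts.
ALERTS_PER_BREAKPOINT = {10: 5, 20: 1, 30: 3}


def toy_count(breakpoints: FrozenSet[int]) -> int:
    return sum(ALERTS_PER_BREAKPOINT[b] for b in breakpoints)


trimmed = rebalance({10, 20, 30}, [], toy_count, low=0, high=5)
grown = rebalance({20}, [10, 30], toy_count, low=4, high=10)
```

In practice the counter would re-score recent data under the proposed segmentation; the toy counter only makes the ordering easy to follow.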

Page 16: Health & Status Monitoring (2010-v8)

3. Case Studies

Page 17: Health & Status Monitoring (2010-v8)

Case Study 1

Open Cloud Testbed Monitor

Page 18: Health & Status Monitoring (2010-v8)

Results

• Dozens of separate statistical baseline models developed and deployed.
• Effective for discovering nodes that are hindering effective use of OCC’s large data cloud.
• Dead nodes are easy to identify and remove.
• Removing just one or two “slow” nodes from a pool of 100 nodes can improve overall performance by 15% to 20+%.
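One simple way to surface "slow" nodes is to compare each node with the rest of the pool using a robust score. The sketch below is an assumption about how this could be done, not a description of the Open Cloud Testbed Monitor: the function `slow_nodes`, the throughput figures, and the 3.5 cutoff are illustrative.

```python
from statistics import median


def slow_nodes(throughput, cutoff=3.5):
    """Return nodes whose throughput is far below the pool median.

    Uses a robust z-score based on the median absolute deviation (MAD),
    so one or two bad nodes do not distort the pool's own baseline.
    """
    values = list(throughput.values())
    pool_median = median(values)
    mad = median([abs(v - pool_median) for v in values])
    if mad == 0:
        return []  # no spread in the pool, nothing to compare against
    return sorted(
        node for node, v in throughput.items()
        if 0.6745 * (v - pool_median) / mad < -cutoff
    )


# Hypothetical per-node throughput (MB/s) for a small pool.
pool = {
    "node00": 100, "node01": 102, "node02": 98, "node03": 101, "node04": 99,
    "node05": 100, "node06": 103, "node07": 97, "node08": 100, "node09": 40,
}
slow = slow_nodes(pool)
```

Only nodes well below the pool are returned, so ordinary variation between healthy nodes does not generate alerts.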

Page 19: Health & Status Monitoring (2010-v8)

Dashboard

Page 20: Health & Status Monitoring (2010-v8)

Case Study 2

[Figure: diagram showing Account, Merchant, Issuing Bank, Acquiring Bank, and Payments Network]

Page 21: Health & Status Monitoring (2010-v8)

Program Structure

• Strategic objective identified early:
  – “Identify and ameliorate data interoperability issues to improve the approval rate of valid transactions and the disapproval rate of invalid transactions, ...”
  – Report quarterly to CIOs’ council with third-party endorsed monetary benefits summarized on an executive dashboard
• Introduced data governance program early in project
• Developed payment transaction monitor that produced candidate alerts
• Set up investigation process to screen alerts and investigate those of interest
• Developed reference models and appropriate standards

Page 22: Health & Status Monitoring (2010-v8)

Results

• ROI
  – 5.1x Year 1 (over 6 months)
  – 7.3x Year 2 (12 months)
  – 10.0x Year 3 (12 months)
• Over 15,500 separate statistical baseline models developed and deployed.
• Also developed appropriate rules-based models to make the work of analysts more efficient.

Page 23: Health & Status Monitoring (2010-v8)

4. Summary

Page 24: Health & Status Monitoring (2010-v8)

[Figure: program overview. Labels recovered from the diagram: Strategic Objective, Dashboard, Governance, Business Process, Modeling Process, Investigative Process, Reference Model, Monitor (produces candidate alerts), events, candidate alerts, program alerts]

Page 25: Health & Status Monitoring (2010-v8)

Some Lessons Learned

• Business Processes
  – Importance of “C”-level executive support, dashboard reports, and a data governance program
• Modeling Processes
  – Critical to build as many statistical models as the data required; used open source Augustus software for this
  – Architecture separated offline modeling and online scoring
  – Post processing with business rules to control workflow to analysts
• Investigative Processes
  – It is not about the models and alerts – it is about optimizing the analysts’ workload and derived business value
  – Small changes in report designs had a large impact on the effectiveness of the alerts

Page 26: Health & Status Monitoring (2010-v8)

Summary

                   | SQC                                             | HSM
What is monitored? | Single assembly line producing physical widgets | Digital system with thousands of data feeds
Type of model      | Exceed 3 standard deviations                    | Change detection model
# of models        | Single model                                    | Cube of models
Visualization      | Control chart                                   | Dashboard
Process            | Plan-do-check-act                               | Plan-do-check-act

Page 27: Health & Status Monitoring (2010-v8)

For More Information

• Robert Grossman
  – grossman.info at gmail.com
  – rgrossman.com (blog)

Learn about health and status monitoring

• Open Data Group
• www.opendatatgroup.com

Page 28: Health & Status Monitoring (2010-v8)

References

• Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke, Steve Vejcik, Detecting Changes in Large Data Sets of Payment Card Data: A Case Study, Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), ACM, 2007.

• Joseph Bugajski and Robert L. Grossman, An Alert Management Approach to Data Quality: Lessons Learned from the Visa Data Authority Program, Proceedings of the 12th International Conference on Information Quality, (ICIQ 2007).

• Walter A. Shewhart, Statistical Method from the Viewpoint of Quality Control, Dover, 1986.

• H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2009.

• Augustus is an open source system available from augustus.googlecode.com.