
Slide 1/35

Dependability in a Connected World: Research in Action in the Operational Trenches

Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Purdue University

Presentation available at: engineering.purdue.edu/dcsl


Slide 2/35

Greetings come to you from …

Venue of SRDS ???

PC Chair said: “This year we broadened the scope to include important topics such as computer networks and computer security.”


Slide 3/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 4/35

Costly Software Bugs in History
• Facebook IPO: May 2012
  – NASDAQ earmarked $62 million for reimbursing investors and said it expected to incur significant costs beyond that for system upgrades and legal battles. The WSJ pegged the total loss to investors at $500M.
  – Investors were unsure how much of Facebook they had bought. 12 million postponed share orders were suddenly filled between 1:49 p.m. and 1:51 p.m. without being properly marked ‘late sale’, which exaggerated the impression that people were trying to dump Facebook shares.
  – Diagnosis: Computer systems used to establish the opening price were overwhelmed by order cancellations and updates during the "biggest IPO cross in the history of mankind," Nasdaq Chief Executive Officer Robert Greifeld said Sunday. Nasdaq's systems fell into a "loop" that prevented it from opening the shares on schedule.


Slide 5/35

Costly Software Bugs in History
• US Power Grid Blackout: August 14, 2003
  – 50 million people lost power for up to two days in the biggest blackout in North American history. The event contributed to at least 11 deaths and cost an estimated $6 billion.
  – Diagnosis: The task force responsible for investigating the cause of the blackout concluded that a software failure at FirstEnergy Corp. "may have contributed significantly" to the outage.
  – FirstEnergy's Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, "key personnel may not have been aware of the need to take preventive measures at critical times."
  – Internet links to Supervisory Control and Data Acquisition (SCADA) software weren't properly secured, and some operators lacked the ability to view the status of electric systems outside their immediate control.


Slide 6/35

Costly Software Bugs in History
• Medical Devices: Ongoing
  – It has been shown to be possible to reprogram a combined heart defibrillator and pacemaker to shut down or to deliver jolts of electricity that could be fatal (demonstrated on a device called Maximo from Medtronic, the industry's #1 company).
  – It is also possible to glean personal patient data by eavesdropping on signals from the tiny wireless radio embedded in the implant, which exists to let doctors monitor and adjust the device without surgery.
  – 1983-1997: There were 2,792 quality problems that resulted in recalls of medical devices, 383 of which were related to computer software (14%), according to a 2001 study analyzing FDA reports of medical devices voluntarily recalled by manufacturers.
  – 2015: The FDA has had 24 recalls of medical devices; Class 1 recalls are categorized as ones where there is "reasonable probability that use of these products will cause serious adverse health consequences or death."


Slide 7/35

Ways to Increase System Availability

• To increase availability:

1. Increase MTTF

2. Decrease MTTR

3. Decrease DL

Availability = MTTF/(MTTF + MTTR + DL)

Mean Time To Failure (MTTF) = E[t2 − t0]

Mean Time To Repair (MTTR) = E[t4 − t3]

Mean Time Between Failures (MTBF) = E[t6 − t2]

(t0…t6 are event timestamps on the failure/repair timeline shown in the slide's figure.)
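A minimal sketch of the availability arithmetic above, in Python (the numbers are hypothetical):

    # Availability = MTTF / (MTTF + MTTR + DL), per the formula on this slide
    def availability(mttf_hours: float, mttr_hours: float, dl_hours: float) -> float:
        """Fraction of time the system is up."""
        return mttf_hours / (mttf_hours + mttr_hours + dl_hours)

    # Example with hypothetical values: MTTF = 1000 h, MTTR = 2 h, DL = 0.5 h
    print(f"{availability(1000, 2, 0.5):.4%}")   # -> 99.7506%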


Slide 8/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 9/35

Importance of Field Studies of Failures
• Progress in dependable system design can be made faster if researchers are exposed to problems through quantitative data
• Theories in the lab and small demonstrations in prototypes can be transitioned to the demanding realities of large computer systems if researchers can validate their inventions with real system failure and attack data
• There are some very influential studies of failures in the field
• They are hard to do, harder to write up in a way that is of general interest, and it is almost impossible to share the data


Slide 10/35

Examples of Field Failure Studies
1. Windows NT Server failures (1999): Xu, Kalbarczyk, Iyer, PRDC.
  – Scale: 503 servers, 4-month period
  – Data source: Event log information for system reboots
  – Insights:
    • System software and hardware failures are the two major contributors to total system downtime (22% and 10%), but they are responsible for only 3% and 1% of all observed reboots
    • Recovery from application software failures is usually quick
    • In many cases, more than one reboot is required to recover from a failure
    • Planned maintenance is responsible for 31% of all reboots and 24% of downtime
  – Data availability: Not available
  – Google citation count (normalized by #years): 9.6


Slide 11/35

Examples of Field Failure Studies
2. HPC computing failures (2006): Schroeder, Gibson, DSN. (Expanded version in TDSC 2010)
  – Scope: Node failures at two clusters of HPC machines
  – Scale: Los Alamos, 9 years, 24K processors (1996-2005)
  – Data source: Manually generated failure records
  – Insights:
    • The assumption of i.i.d. exponentially distributed time between failures is not correct; rather, the hazard rate is decreasing
    • Failures exhibit significant levels of temporal correlation at both short and long time lags; there is strong autocorrelation for hardware and software failures
    • There is correlation between the failure rate of a machine and the type and intensity of the workload running on it
  – Data availability: Available as the Computer Failure Data Repository
  – Google citation count (normalized by #years): 138.2


Slide 12/35

Examples of Field Failure Studies
3. System failures in the petascale machine Blue Waters (2014): Martino, Kalbarczyk, Iyer, DSN.
  – Scope: Blue Waters machine at UIUC
  – Scale: 724K cores, 3K GPU nodes (Mar-Nov 2013)
  – Data source: Manual system failure reports, machine check errors
  – Insights:
    • Software was the cause of most (74.4%) system-wide outages; software failures often affect multiple nodes
    • ECC techniques corrected 99.997% of memory and processor errors
    • Parallel file system (Lustre) problems were the most common software-caused failures
    • The uncorrectable error rate in GPU memory is ~100X that in CPU memory
  – Data availability: Not available
  – Google citation count (normalized by #years): 22.0


Slide 13/35

System Usage Repository at Purdue
• Objectives
  – Enable systems research in dependability that relies on system usage and failure records from large-scale systems
  – Provide synchronized workload information, system usage information, some user information, and hardware status information
  – Provide this repository for a diversity of workloads and a diversity of computing systems
• Steps
  – National Science Foundation sponsored project from June 2014: "Computer System Failure Data Repository to Enable Data-Driven Dependability Research"
  – Seeded with data from Purdue's centralized computing cluster Conte, with information on 500K jobs from Oct 2014-Mar 2015

URL: https://github.com/purdue-dcsl/fresco


Slide 14/35

Gap Identification

• Many other communities apart from ours (broadly, computing systems) do a far better job of creating and maintaining open data repositories
  – Example: The genomics community has the 1000 Genomes Project and "The Cancer Genome Atlas" (TCGA)
  – Community effort with federal funding
• We still lack a comprehensive system usage and failure repository for large-scale computing infrastructure
  – CFDR [DSN '06, TDSC '10] is primarily for high performance computing (HPC) workloads
  – Lacks workload context for its data
  – Outdated and does not reflect characteristics of recent workloads

URL: https://github.com/purdue-dcsl/fresco


Slide 15/35

Sample Questions for the Repository

• What is the resource utilization (CPU, memory, network, storage) of jobs with a certain characteristic?
  – The characteristic could be the application domain, the libraries being used, parallel or serial, and if parallel, how many cores
  – We want to know this especially if resource utilization is anomalously high
• What is the profile of users submitting jobs to the cluster?
  – Are they asking for too many or too few resources in the submission script?
  – Are they making appropriate use of the parallelism?
  – What kinds of problem tickets do they submit, and how many rounds are needed to resolve these tickets?

URL: https://github.com/purdue-dcsl/fresco


Slide 16/35

Wake up! I Have Questions for You
• Consider some sample results that we obtained from Purdue's centrally managed cluster (called "Conte")
• These are shared clusters called "community clusters"
  – They have become common in big organizations: academic, industrial, and government labs
  – A diverse set of sub-units within the organization buy assets in a cluster, and these are then put together by central IT
  – Flexible usage policies, such that partners in the cluster have ready access to the capacity they purchased and potentially to much more when there are idle resources
• Jobs span a wide spectrum – difficult to create optimized hardware-software configurations for each job class


Slide 17/35

Cluster Details: Conte
• Production cluster in continuous operation since June 2013
• 580 homogeneous nodes
  – Each node contains two 8-core Intel Xeon E5-2670 Sandy Bridge processors running at 2.6 GHz
  – Two Xeon Phi 5110P accelerator cards, each with 60 cores
  – Memory: 64 GB of DDR3, 1.6 GHz RAM
  – Networking: non-blocking 40 Gbps FDR10 InfiniBand interconnect
• Aggregate bandwidth: IP over InfiniBand aggregate of 80 GB/s
• Filesystems: NFS-mounted directories which can sustain 2 GB/s
  – Lustre parallel file system
  – Total capacity 1.4 PB with 49% utilization
• Software and scheduling: Red Hat Enterprise Linux on all nodes
  – System has some common pre-compiled libraries and applications
  – Portable Batch System (PBS) scheduling


Slide 18/35

Cluster Details: Conte
• Scheduling:
  – Each job requests a certain time duration, a number of nodes, and in some cases the amount of memory needed
  – When a job exceeds the specified time limit, it is killed
  – Jobs are also killed by out-of-memory (OOM) killer scripts if they exhaust the available physical memory and swap space
• Node sharing:
  – By default only a single job is scheduled on an entire node, giving it dedicated access to all the resources
  – However, a user can enable sharing via a setting in the job submission script


Slide 19/35

Result: Hot Libraries
• Extract all dynamically linked libraries being used by the applications
1. Out of a total of 3,629 unique libraries, some are used much more often
2. Each of the top 50 libraries is used by more than 80% of the users

[Figure: sorted histogram of the top 500 libraries as used by the jobs]
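As an illustration only (not the repository's actual collection pipeline), a minimal sketch that lists the shared libraries mapped into a running Linux process by reading /proc/<pid>/maps:

    import sys

    def shared_libraries(pid: int) -> set:
        """Return the set of .so files mapped into the process with the given pid."""
        libs = set()
        with open(f"/proc/{pid}/maps") as maps:
            for line in maps:
                fields = line.split()
                path = fields[-1] if len(fields) >= 6 else ""
                if path.endswith(".so") or ".so." in path:
                    libs.add(path)
        return libs

    if __name__ == "__main__":
        for lib in sorted(shared_libraries(int(sys.argv[1]))):
            print(lib)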


Slide 20/35

Result: Resource Request Patterns

1. Almost 45% of jobs actually used less than 10% of requested time

2. But the scheduler, during busy periods, gives higher priority to shorter jobs

[Figure: percentage of the user-requested time actually used by the jobs]


Slide 21/35

Result: Memory Thrashing
• Memory-usage-related issues are a big reason for unsatisfactory application performance
• The most severe impact shows up as memory thrashing

[Figures: CDF of peak major page faults across all jobs; percentage of each user's jobs crossing the memory-thrashing threshold]

1. Most of the jobs from the top few users suffered from major page faults and needed help from IT support staff
2. We contacted 10 of these users and, with 1-5 hours of discussion and debugging, figured out the reasons for the poor memory usage
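As a side note, per-job major page fault counts of the kind plotted above can be read on Linux from getrusage; a minimal sketch (not the monitoring agent actually used on Conte):

    import resource
    import subprocess
    import sys

    # Run the command given on the command line and report the major page faults
    # incurred by it and its children.
    subprocess.run(sys.argv[1:], check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print(f"major page faults: {usage.ru_majflt}")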


Slide 22/35

Result: Outlier Analysis
• How many jobs behave anomalously in terms of runtime (wall-time and CPU-time) and memory usage (physical and virtual memory)?
• An outlier is defined as a value above the 95th percentile within the same application class (a sketch of this rule follows the findings below)

[Figures: Venn diagram of outliers for the different metrics and their intersections; correlation between outliers in the different dimensions]

1. Only 0.15% of jobs were anomalous in all dimensions
2. Outliers w.r.t. the CPU metric were the most numerous – reliable result?
3. Little correlation among the metrics w.r.t. outlier determination
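A minimal sketch of the outlier rule described above (95th percentile within an application class); the file and column names are hypothetical:

    import pandas as pd

    # Assumed columns: app_class, walltime, cputime, physical_mem, virtual_mem
    jobs = pd.read_csv("jobs.csv")
    metrics = ["walltime", "cputime", "physical_mem", "virtual_mem"]

    for m in metrics:
        # A job is an outlier in metric m if it exceeds the 95th percentile of m
        # among jobs of the same application class.
        threshold = jobs.groupby("app_class")[m].transform(lambda s: s.quantile(0.95))
        jobs[m + "_outlier"] = jobs[m] > threshold

    # Fraction of jobs that are outliers in every dimension simultaneously
    print(jobs[[m + "_outlier" for m in metrics]].all(axis=1).mean())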


Slide 23/35

Conclusion: System Usage Repository
• It is important to analyze how resources are being utilized by users
  – "Errare humanum est"
  – Use this analysis for fine-tuning the scheduler, resource provisioning, and educating users
• It is important to look at workload information together with failure events
  – Workload affects the kinds of hardware-software failures that are triggered
  – Performance failures can be used as teachable moments to optimize the system
• It is crucial that this kind of data be openly available for different researchers to do different kinds of analyses


Slide 24/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 25/35

Building a Dependable Cyberinfrastructure for Earthquake Engineers
• To reduce the impact of earthquake events, the National Science Foundation created the NEES network as a national research infrastructure
  – Started in 2004; Purdue ran it for 2009-14
  – 14 advanced laboratories at different universities (sites)
  – NSF investment in 2009-14: $100M
• Purdue provided the cyberinfrastructure connecting these sites
  – Linking the NEES experimental facilities to each other, to NEEScomm, and to off-site users
  – Enabled researchers participating on-site or remotely to collect, view, process, and store data from NEES experiments at a central repository
  – Using the cyberinfrastructure, researchers conducted numerical simulation studies and performed hybrid (combined experimental and numerical) testing

URL: http://nees.org


Slide 26/35

NEES CI

[Map of NEES sites:] University of California, Santa Barbara; University of Nevada, Reno; University of California, San Diego; University of Texas at Austin; University of California, Los Angeles; University of California, Davis; Lehigh University; Rensselaer Polytechnic Institute; Cornell University; University at Buffalo; University of Minnesota; University of Illinois at Urbana-Champaign; Oregon State University; University of California, Berkeley

nees.org
• Data Repository
• Computational Simulation
• HPC Resources

NEES Cyberinfrastructure


Slide 27/35

NEES Cyberinfrastructure
• Earthquake engineers demanded "always on, always available, accessible anywhere" resources
  – Resources are: software tools, simulation environment, data, and collaboration tools
• Some key features of NEES cyberinfrastructure
  – NEEShub provides "software as a service," which allows users to run tools, share data, communicate, and collaborate through a web portal without the need to install software or download data to their computer
  – NEEShub allows users to run software tools with direct access to NEES experimental data


Slide 28/35

NEES Cyberinfrastructure
• Leveraging virtual machine technology, the NEEShub provided a secure and configurable computing environment for tools running in Linux or Windows.
• Running detailed computational simulations at high resolution is a highly complex, processor-intensive operation. The NEEShub provided direct access to supercomputing resources at Purdue University, the NSF TeraGrid and XSEDE, and the Open Science Grid, bringing more powerful computing resources to the most challenging computational problems.

[Figure: data pipeline – Data Model, Data Ingest, Data Tools, Data Presentation]


Slide 29/35

Data and User Community Size
• NEES by the numbers
  – 25 terabytes of data in 2M files
  – More than 70 applications
  – Shared by 112,862 users from 208 countries


Slide 30/35

Dependability in Action
• Dependability in numbers (over 2009-14)
  – 99.63% availability
  – Not a single "high severity" security outage
• NEES developed a comprehensive dependability and cybersecurity approach that included
  – Best-practice policies and mechanisms at Purdue
  – An annual security audit and continual low-intensity scans at each of the NEES sites
• Goal: Walk the delicate line between
  – Prompt science and an open community, AND
  – Preserving the integrity and availability of our compute and data resources


Slide 31/35

Unique Challenges for NEES Cybersecurity

1. Different universities have different cybersecurity policies and our policies had to “play nice” with them

2. We were responsible for doing security audits of equipment which we did not own or have root access on

3. How do we test for external threats as well as internal threats? The notion of perimeter security is ingrained in both the technology and the admins.


Slide 32/35

How We Solved These Challenges

1. Varied cybersecurity policies
  • We sometimes enforced stronger rules for machines that are part of the NEES network
  • An "inner shell" of machines and other equipment (routers, PLC controllers, etc.) was created within the university IT resources

2. No ownership or root access on some equipment
  • Negotiated elevated privileges for the purpose of security scanning
  • Success of human relationship building: a feeling of being part of the same team

3. Test for external and internal threats
  • VPN access to sites for mimicking internal attacks
  • More stringent security controls at Purdue HQ


Slide 33/35

The Twain between Cybersecurity Research & Practice

• As cybersecurity researchers, we have often bemoaned the slow adoption of our best research technologies
• Some successful topics where academic research has resulted in usable tools for security practitioners
  – Network/host intrusion detection: Bro, Tripwire
  – Vulnerability scanners: Nessus/OpenVAS, ZAP
  – Web server vulnerability assessment: Nikto
  – Static code analysis: Coverity
  – Blocking connections: fail2ban
• Areas where top-quality research has not resulted in useful tools
  – Security configuration management: e.g., for firewalls, SIM
  – False positives with signature-based systems


Slide 34/35

Take-Aways
• Achieving dependability in a federated cyberinfrastructure has unique challenges
  – No uniform policies and practices
  – Wide variety of computational platforms
  – Wide variety of users
• The solution involved all of the following
  – Adapting usable technology from the research space
  – Some commercial tools
  – User education and community building with system administrators from all organizations


Slide 35/35

Presentation available on the research group web page: engineering.purdue.edu/dcsl


Slide 36/35

Backup Slides


Slide 37/35

Effect of Major Network Outages on Large Business Customers

[Figure: October 2000 data]

• A survey by Meta Group Inc. of 21 industrial sectors in 2000 found the mean loss of revenue due to computer system downtime was $1.01M/hour


Slide 38/35

Dependable Computing Terminology

DEPENDABILITY
• Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
• Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting
• Impairments: Faults, Errors, Failures

FAULT TOLERANCE
• Prediction, Detection, Containment, Diagnosis, Recovery


Slide 39/35

Thrust #1: Distributed Intrusion Tolerant System
• Distributed systems are subjected to malicious attacks on services, including hitherto unknown attacks
• The objective is to tolerate intrusions, not just detect them
• Different phases: Detection, Diagnosis, Containment, Response
• Solution approach:
  – Attack-graph-based modeling of multi-stage attacks
  – Algorithms for containment and survivability computation and semi-optimal response selection
  – Semi-optimal selection and configuration of intrusion detection sensors for gaining visibility into the security state of the system
  – Dealing with zero-day attacks


Slide 40/35

Thrust #1: Distributed Intrusion Tolerant System

[Figure: attack scenario with a hacker targeting services S1–S4]


Slide 41/35

Thrust #1: Distributed Intrusion Tolerant System


Slide 42/35

Debugging in the Large: Large-Scale Distributed Applications
• Goal: Provide highly available applications (e.g., a web service) in a distributed environment
  – Perform failure prediction
  – Perform bug localization
• Challenges in today's distributed systems
  – Large number of entities and large amount of data
  – Interactions between entities cause error propagation
  – High throughput or low latency requirements for the application


Slide 43/35

Metric-based Bug Localization

How can we use these metrics to localize the root cause of problems?

Metric sources:
• Application: request rate, transactions, DB reads/writes, etc.
• Middleware: virtual machine and container statistics
• Operating system: CPU, memory, I/O, network statistics
• Hardware: CPU performance counters

Tivoli Operations Manager


Slide 44/35

Metric-based Bug Localization: Solution Approach

• Look for abnormal time patterns [Ozonat-DSN08, Gao-ICDCS09, Sharma-DSN13, …]

• Pinpoint code regions that are correlated with these abnormal patterns


Slide 45/35

Bugs Cause Metric Correlations to Break
• Hadoop DFS file-descriptor leak in version 0.17 (2008)
• Correlations are different when the bug manifests itself:
  – Metrics: open file descriptors, characters written to disk

[Figure: metric correlation in a normal run vs. a failed run – the correlations are different]

Ignacio Laguna, Subrata Mitra, Fahad A. Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P. Midkiff, Mike Kistler (IBM Research), and Ahmed Gheith (IBM Research), "Automatic Problem Localization in Distributed Applications via Multi-dimensional Metric Profiling," 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 121-132, Braga, Portugal, September 30-October 3, 2013.
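A minimal sketch of the correlation check this slide illustrates; the metric traces below are hypothetical:

    import numpy as np

    def corr(x, y):
        """Pearson correlation between two metric time series."""
        return float(np.corrcoef(x, y)[0, 1])

    # Hypothetical per-window samples of two metrics
    normal_fds   = np.array([10, 12, 11, 13, 12, 14])          # open file descriptors
    normal_chars = np.array([1.0, 1.2, 1.1, 1.3, 1.2, 1.4])    # characters written (GB)
    failed_fds   = np.array([10, 25, 60, 120, 250, 500])       # leak: descriptors keep growing
    failed_chars = np.array([1.2, 1.0, 1.3, 1.1, 1.4, 1.2])

    print("normal run corr:", corr(normal_fds, normal_chars))  # close to 1.0
    print("failed run corr:", corr(failed_fds, failed_chars))  # much weaker (~0.3 here)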


Slide 46/35

Approach Overview


Slide 47/35

Selecting Abnormal Window via Nearest-Neighbor (NN)

[Figure: traces from a normal run and a faulty run are divided into windows; each window is a sample of all metrics, annotated with the code region. For every window a Correlation Coefficient Vector (CCV) [cc1,2, cc1,3, …, ccn-1,n] is computed, and nearest-neighbor distances over the CCVs are used to find outlier windows.]
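A minimal sketch of this step, assuming each window is an (n_samples × n_metrics) array; the simple mean + k·σ cutoff on nearest-neighbor distances stands in for the paper's outlier criterion:

    import numpy as np

    def ccv(window):
        """Correlation Coefficient Vector: upper triangle of the metric correlation matrix."""
        c = np.corrcoef(window, rowvar=False)          # metrics are columns
        iu = np.triu_indices_from(c, k=1)
        return np.nan_to_num(c[iu])

    def outlier_windows(windows, k_sigma=2.0):
        """Flag windows whose nearest-neighbor distance in CCV space is unusually large."""
        vecs = np.array([ccv(w) for w in windows])
        dist = np.array([min(np.linalg.norm(v - u) for j, u in enumerate(vecs) if j != i)
                         for i, v in enumerate(vecs)])
        return np.where(dist > dist.mean() + k_sigma * dist.std())[0]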


Slide 48/35

Selecting Abnormal Metrics by Frequency of Occurrence

Steps:
1. Get the abnormal windows
2. Rank the correlation coefficients (CC) based on their contribution to the distance Distance(CCV1, CCV2)
3. Select the most frequent metric(s)

[Figure: example with abnormal windows X, Y, and Z; in each window the CC contributing most to the distance is identified (e.g., CC5,1, CC5,2, CC19,5), and since metric 5 appears most frequently, the abnormal metric is 5.]
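A minimal sketch of steps 2-3 above; the inputs (CCVs of abnormal windows, a reference CCV, and the metric-pair index) are hypothetical:

    from collections import Counter
    import numpy as np

    def abnormal_metrics(abnormal_ccvs, reference_ccv, pairs, top_k=1):
        """pairs[i] = (a, b): the two metrics behind the i-th correlation coefficient."""
        votes = Counter()
        for v in abnormal_ccvs:
            # Contribution of each CC to the squared distance from the reference CCV
            contrib = (np.asarray(v) - np.asarray(reference_ccv)) ** 2
            for idx in np.argsort(contrib)[::-1][:top_k]:
                a, b = pairs[idx]
                votes[a] += 1
                votes[b] += 1
        # The most frequently implicated metrics are reported as abnormal
        return votes.most_common(3)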


Slide 49/35

Case 1: Hadoop DFS
• File-descriptor leak bug
  – Sockets are left open in the DFSClient Java class
  – 45 classes and 358 methods instrumented (as code regions)

Output of the tool:
  – The 2nd abnormal metric correlates with the origin of the problem
  – The Java class of the bug site is correctly identified


Slide 50/35

Case 2: HBase
• Deadlock in version 0.20.3 of HBase (2010)
  – Incorrect use of locks
  – Bug site is the HRegion class

Output of the tool:
  – The abnormal metrics don't provide much insight
  – HRegion appears as the abnormal code region


Slide 51/35

Modeling Scale-dependent Behavior

[Figure, previous models: number of times a loop executes (y-axis) vs. run # (x-axis) for training runs and production runs.] Is there a bug in one of the production runs?


Slide 52/35

Modeling Scale-dependent Behavior

[Figure: number of times a loop executes (y-axis) vs. scale (x-axis) for training runs and production runs.] Accounting for scale makes trends clear and errors at large scales obvious.


Slide 53/35

Difficulty in Detecting Scale-Dependent Bugs
• What are they?
  – Bugs that arise often in large-scale runs while staying invisible in small-scale ones
  – Examples: race conditions, integer overflow, etc.
  – Severe impact on distributed systems: performance degradation, silent data corruption
• Difficult to detect, let alone localize
  – Large-scale computing environments are not available to developers
  – Large-scale data is not available at development time

Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi, "WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales," 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC), pp. 131-142, New York City, New York, June 17-21, 2013.


Slide 54/35

When Statistical Debugging Meets Scale-dependent Bugs

[Figure: communication (y-axis) vs. scale (x-axis) for normal, buggy, and training runs, contrasting previous statistical debugging with our proposed method.]

• To address scale-dependent bugs with previous methods
  – Either be very restricted in your feature selection
  – Or have access to a bug-free run on the large-scale system
• Our method solves the problems of previous methods
  – Capable of modeling the scaling behavior of distributed programs
  – Only needs access to bug-free runs from small-scale systems


Slide 55/35

Vrisha

[Figure: a control feature (scale of execution) and an observational/behavioral feature are mapped through learned functions g(·) and f(·); a production run is flagged as a BUG when corr(f(x), g(y)) < 0.]

Kernel Canonical Correlation Analysis takes the observational feature X and the control feature Y and finds f and g such that f(X) and g(Y) are highly correlated.
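A minimal sketch of the idea, using ordinary (linear) CCA from scikit-learn as a stand-in for the kernelized version; the training data, production run, and slack factor are hypothetical:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Hypothetical training runs: X = scale of execution, Y = behavioral features
    X_train = np.array([[8], [16], [32], [64], [128]], dtype=float)
    Y_train = np.array([[82, 9], [158, 18], [325, 32], [636, 66], [1275, 128]], dtype=float)

    cca = CCA(n_components=1).fit(X_train, Y_train)
    fx_tr, gy_tr = cca.transform(X_train, Y_train)
    train_gap = np.abs(fx_tr - gy_tr).max()     # largest f(X)/g(Y) gap seen in training

    def flag_bug(x_run, y_run, slack=2.0):
        """Flag a production run whose f(x) and g(y) projections diverge far more than in training."""
        fx, gy = cca.transform(np.atleast_2d(x_run), np.atleast_2d(y_run))
        return abs(fx.item() - gy.item()) > slack * train_gap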


Slide 56/35

WuKong
• Predicts the expected value in a large-scale run for each feature separately
• Prunes unpredictable features to improve localization quality
• Provides a shortlist of suspicious features in its localization roadmap


Slide 57/35

The Workflow

[Figure: each training run (APP instrumented with PIN, runs 1…N) produces a (SCALE, FEATURE) record; the records are used to train a MODEL. For a production run, the observed SCALE and FEATURE values are compared against the model's prediction ("= ?").]


Slide 58/35

Feature

    void foo(int a) {
      1: if (a > 0) {
         } else {
         }
      2: if (a > 100) {
           int i = 0;
      3:   while (i < a) {
      4:     if (i % 2 == 0) {
             }
             ++i;
           }
         }
    }

(The numbered branch and loop sites 1–4 are the features whose execution counts are recorded.)


Slide 59/35

Model
• X ~ vector of scale parameters X1…XN
• Y ~ number of occurrences of a feature
• The model to predict Y from X [equation shown on the slide]
• Compute the relative prediction error [equation shown on the slide]
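A minimal sketch of this modeling step, using a simple log-log linear fit as a stand-in for the per-feature model on the slide (the training data are hypothetical):

    import numpy as np

    # Hypothetical training runs: scale (e.g., number of processes) and a feature's count
    scales = np.array([8, 16, 32, 64, 128], dtype=float)
    counts = np.array([130, 260, 515, 1030, 2050], dtype=float)

    # Fit log(count) = a * log(scale) + b
    a, b = np.polyfit(np.log(scales), np.log(counts), 1)

    def predicted(scale):
        return float(np.exp(a * np.log(scale) + b))

    def relative_error(observed, scale):
        """Relative prediction error of the feature at the given scale."""
        p = predicted(scale)
        return abs(observed - p) / p

    # Production run at a larger scale than any training run
    print(relative_error(observed=16800, scale=1024))   # small if the trend holds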


Slide 60/35

Inference: Bug Localization
• First, we need to determine if the production run is buggy: feature i is flagged when Ei > γ · Mi, where Ei is the prediction error of feature i in the production run, Mi is the max error of feature i in all training runs, and γ is a constant parameter
• If there is a bug in this run, we rank all the features by their prediction errors
  – Output the top N features as a roadmap for locating the bug
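A minimal sketch of this inference step; the feature names and the γ value are hypothetical:

    def localize(prod_errors, max_train_errors, gamma=1.15, top_n=5):
        """prod_errors / max_train_errors map feature -> relative prediction error."""
        buggy = any(prod_errors[f] > gamma * max_train_errors[f] for f in prod_errors)
        if not buggy:
            return []
        # Rank features by production-run prediction error; the top N form the roadmap
        return sorted(prod_errors, key=prod_errors.get, reverse=True)[:top_n]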


Slide 61/35

Optimization: Feature Pruning
• Some noisy features cannot be effectively predicted by the above model
  – Not correlated with scale, e.g., random
  – Discontinuous
• How to remove noisy features?
  – If we cannot predict them well for the training runs, we cannot predict them for the large-scale runs


Slide 62/35

Evaluation

• Large-scale study of LLNL Sequoia AMG2006
  – Up to 1024 processes
• Two case studies of real bugs
  – Integer overflow in MPI_Allgather
  – Infinite loop in Transmission, a popular P2P file-sharing application


Slide 63/35

AMG2006: Modeling Accuracy

• Trained on 8-128 processes
• Compared predicted behavior at 256, 512, and 1024 processes with actual (non-buggy) behavior

Scale of Run | Mean Prediction Error
256          | 6.55%
512          | 8.33%
1024         | 7.77%


Slide 64/35

AMG2006: Fault Injection
• Fault
  – Injected at rank 0
  – Randomly pick a branch to flip
• Data
  – Training: no fault, 110 runs @ 8-128 processes
  – Testing: with fault, 100 runs @ 1024 processes
• Result

Total              100
Non-Crashing        57
Detected            53
Localized           49
Localization Ratio  49/53 = 92.5%



Slide 65/35

Case Study: An Infinite Loop in Transmission



Slide 66/35

Case Study: An Infinite Loop in Transmission



Slide 67/35

Case Study: An Infinite Loop in Transmission

[Figure: localization output highlighting Feature 53, 66]
