
Slide 1/35

Dependability in a Connected World: Research in Action in the Operational Trenches

Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Purdue University

Presentation available at: engineering.purdue.edu/dcsl


Slide 2/35

Greetings come to you from …

Venue of SRDS ???

PC Chair said: “This year we broadened the scope to include important topics such as computer networks and computer security.”


Slide 3/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 4/35

Costly Software Bugs in History
• Facebook IPO: May 2012
  – NASDAQ earmarked $62 million for reimbursing investors and said it expected to incur significant costs beyond that for system upgrades and legal battles. The WSJ pegged the total loss to investors at $500M.
  – Investors were unsure how much of Facebook they had bought. 12 million postponed share orders were suddenly filled between 1:49 p.m. and 1:51 p.m. without being properly marked ‘late sale’, which exaggerated the impression that people were trying to dump Facebook shares.
  – Diagnosis: Computer systems used to establish the opening price were overwhelmed by order cancellations and updates during the "biggest IPO cross in the history of mankind," Nasdaq Chief Executive Officer Robert Greifeld said Sunday. Nasdaq's systems fell into a "loop" that prevented it from opening the shares on schedule.


Slide 5/35

Costly Software Bugs in History
• US Power Grid Blackout: August 14, 2003
  – 50 million people lost power for up to two days in the biggest blackout in North American history. The event contributed to at least 11 deaths and cost an estimated $6 billion.
  – Diagnosis: The task force responsible for investigating the cause of the blackout concluded that a software failure at FirstEnergy Corp. "may have contributed significantly" to the outage.
  – FirstEnergy's Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, "key personnel may not have been aware of the need to take preventive measures at critical times."
  – Internet links to Supervisory Control and Data Acquisition (SCADA) software weren't properly secured, and some operators lacked the ability to view the status of electric systems outside their immediate control.


Slide 6/35

Costly Software Bugs in History
• Medical Devices: Ongoing
  – It has been shown to be possible to reprogram a combined heart defibrillator and pacemaker to shut down or to deliver jolts of electricity that could be fatal (demonstrated on a device called Maximo from Medtronic, the industry's #1 company).
  – It is also possible to glean personal patient data by eavesdropping on signals from the tiny wireless radio embedded in the implant, which exists to let doctors monitor and adjust the device without surgery.
  – 1983-1997: There were 2,792 quality problems that resulted in recalls of medical devices, 383 of which were related to computer software (14%), according to a 2001 study analyzing FDA reports of medical devices voluntarily recalled by manufacturers.
  – 2015: The FDA has had 24 recalls of medical devices; Class 1 recalls are categorized as ones where there is "reasonable probability that use of these products will cause serious adverse health consequences or death."


Slide 7/35

Ways to Increase System Availability

• To increase availability:

1. Increase MTTF

2. Decrease MTTR

3. Decrease DL

Availability = MTTF/(MTTF + MTTR + DL)

Mean Time To Failure (MTTF) = E[t2 − t0]

Mean Time To Repair (MTTR) = E[t4 − t3]

Mean Time Between Failures (MTBF) = E[t6 − t2]

(t0…t6 are event timestamps on the failure/repair timeline shown in the slide's figure.)
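A minimal sketch of the availability arithmetic above, in Python (the numbers are hypothetical):

    # Availability = MTTF / (MTTF + MTTR + DL), per the formula on this slide
    def availability(mttf_hours: float, mttr_hours: float, dl_hours: float) -> float:
        """Fraction of time the system is up."""
        return mttf_hours / (mttf_hours + mttr_hours + dl_hours)

    # Example with hypothetical values: MTTF = 1000 h, MTTR = 2 h, DL = 0.5 h
    print(f"{availability(1000, 2, 0.5):.4%}")   # -> 99.7506%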


Slide 8/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 9/35

Importance of Field Studies of Failures
• Progress in dependable system design can be made faster if researchers are exposed to problems through quantitative data
• Theories in the lab and small demonstrations in prototypes can be transitioned to the demanding realities of large computer systems if researchers can validate their inventions with real system failure and attack data
• There are some very influential studies of failures in the field
• They are hard to do, harder to write up in a way that is of general interest, and it is almost impossible to share the data


Slide 10/35

Examples of Field Failure Studies
1. Windows NT Server failures (1999): Xu, Kalbarczyk, Iyer, PRDC.
  – Scale: 503 servers, 4-month period
  – Data source: Event log information for system reboots
  – Insights:
    • System software and hardware failures are the two major contributors to total system downtime (22% and 10%), but they are responsible for only 3% and 1% of all observed reboots
    • Recovery from application software failures is usually quick
    • In many cases, more than one reboot is required to recover from a failure
    • Planned maintenance is responsible for 31% of all reboots and 24% of downtime
  – Data availability: Not available
  – Google citation count (normalized by #years): 9.6


Slide 11/35

Examples of Field Failure Studies
2. HPC computing failures (2006): Schroeder, Gibson, DSN. (Expanded version in TDSC 2010)
  – Scope: Node failures at two clusters of HPC machines
  – Scale: Los Alamos, 9 years, 24K processors (1996-2005)
  – Data source: Manually generated failure records
  – Insights:
    • The assumption of i.i.d. exponentially distributed time between failures is not correct; rather, the hazard rate is decreasing
    • Failures exhibit significant levels of temporal correlation at both short and long time lags; there is strong autocorrelation for hardware and software failures
    • There is correlation between the failure rate of a machine and the type and intensity of the workload running on it
  – Data availability: Available as the Computer Failure Data Repository
  – Google citation count (normalized by #years): 138.2


Slide 12/35

Examples of Field Failure Studies
3. System failures in the petascale machine Blue Waters (2014): Martino, Kalbarczyk, Iyer, DSN.
  – Scope: Blue Waters machine at UIUC
  – Scale: 724K cores, 3K GPU nodes (Mar-Nov 2013)
  – Data source: Manual system failure reports, machine check errors
  – Insights:
    • Software was the cause of most (74.4%) system-wide outages; software failures often affect multiple nodes
    • ECC techniques corrected 99.997% of memory and processor errors
    • Parallel file system (Lustre) problems were the most common software-caused failures
    • The uncorrectable error rate in GPU memory is ~100X that in CPU memory
  – Data availability: Not available
  – Google citation count (normalized by #years): 22.0


Slide 13/35

System Usage Repository at Purdue
• Objectives
  – Enable systems research in dependability that relies on system usage and failure records from large-scale systems
  – Provide synchronized workload information, system usage information, some user information, and hardware status information
  – Provide this repository for a diversity of workloads and a diversity of computing systems
• Steps
  – National Science Foundation sponsored project from June 2014: "Computer System Failure Data Repository to Enable Data-Driven Dependability Research"
  – Seeded with data from Purdue's centralized computing cluster Conte, with information on 500K jobs from Oct 2014-Mar 2015

URL: https://github.com/purdue-dcsl/fresco


Slide 14/35

Gap Identification

• Many other communities apart from ours (broadly, computing systems) do a far better job of creating and maintaining open data repositories
  – Example: The genomics community has the 1000 Genomes Project and "The Cancer Genome Atlas" (TCGA)
  – Community effort with federal funding
• We still lack a comprehensive system usage and failure repository for large-scale computing infrastructure
  – CFDR [DSN '06, TDSC '10] is primarily for high performance computing (HPC) workloads
  – Lacks workload context for its data
  – Outdated and does not reflect characteristics of recent workloads

URL: https://github.com/purdue-dcsl/fresco


Slide 15/35

Sample Questions for the Repository

• What is the resource utilization (CPU, memory, network, storage) of jobs with a certain characteristic?
  – The characteristic could be the application domain, the libraries being used, parallel or serial, and if parallel, how many cores
  – We want to know this especially if resource utilization is anomalously high
• What is the profile of users submitting jobs to the cluster?
  – Are they asking for too many or too few resources in the submission script?
  – Are they making appropriate use of the parallelism?
  – What kinds of problem tickets do they submit, and how many rounds are needed to resolve these tickets?

URL: https://github.com/purdue-dcsl/fresco


Slide 16/35

Wake up! I Have Questions for You
• Consider some sample results that we obtained from Purdue's centrally managed cluster (called "Conte")
• These are shared clusters called "community clusters"
  – They have become common in big organizations: academic, industrial, and government labs
  – A diverse set of sub-units within the organization buy assets in a cluster, and these are then put together by central IT
  – Flexible usage policies, such that partners in the cluster have ready access to the capacity they purchased and potentially to much more when there are idle resources
• Jobs span a wide spectrum – difficult to create optimized hardware-software configurations for each job class


Slide 17/35

Cluster Details: Conte
• Production cluster in continuous operation since June 2013
• 580 homogeneous nodes
  – Each node contains two 8-core Intel Xeon E5-2670 Sandy Bridge processors running at 2.6 GHz
  – Two Xeon Phi 5110P accelerator cards, each with 60 cores
  – Memory: 64 GB of DDR3, 1.6 GHz RAM
  – Networking: non-blocking 40 Gbps FDR10 InfiniBand interconnect
• Aggregate bandwidth: IP over InfiniBand aggregate of 80 GB/s
• Filesystems: NFS-mounted directories which can sustain 2 GB/s
  – Lustre parallel file system
  – Total capacity 1.4 PB with 49% utilization
• Software and scheduling: Red Hat Enterprise Linux on all nodes
  – System has some common pre-compiled libraries and applications
  – Portable Batch System (PBS) scheduling


Slide 18/35

Cluster Details: Conte
• Scheduling:
  – Each job requests a certain time duration, a number of nodes, and in some cases the amount of memory needed
  – When a job exceeds the specified time limit, it is killed
  – Jobs are also killed by out-of-memory (OOM) killer scripts if they exhaust the available physical memory and swap space
• Node sharing:
  – By default only a single job is scheduled on an entire node, giving it dedicated access to all the resources
  – However, a user can enable sharing via a setting in the job submission script


Slide 19/35

Result: Hot Libraries
• Extract all dynamically linked libraries being used by the applications
1. Out of a total of 3,629 unique libraries, some are used much more often
2. Each of the top 50 libraries is used by more than 80% of the users

[Figure: sorted histogram of the top 500 libraries as used by the jobs]
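As an illustration only (not the repository's actual collection pipeline), a minimal sketch that lists the shared libraries mapped into a running Linux process by reading /proc/<pid>/maps:

    import sys

    def shared_libraries(pid: int) -> set:
        """Return the set of .so files mapped into the process with the given pid."""
        libs = set()
        with open(f"/proc/{pid}/maps") as maps:
            for line in maps:
                fields = line.split()
                path = fields[-1] if len(fields) >= 6 else ""
                if path.endswith(".so") or ".so." in path:
                    libs.add(path)
        return libs

    if __name__ == "__main__":
        for lib in sorted(shared_libraries(int(sys.argv[1]))):
            print(lib)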


Slide 20/35

Result: Resource Request Patterns

1. Almost 45% of jobs actually used less than 10% of requested time

2. But the scheduler, during busy periods, gives higher priority to shorter jobs

[Figure: percentage of the user-requested time actually used by the jobs]


Slide 21/35

Result: Memory Thrashing
• Memory-usage-related issues are a big reason for unsatisfactory application performance
• The most severe impact shows up as memory thrashing

[Figures: CDF of peak major page faults across all jobs; percentage of each user's jobs crossing the memory-thrashing threshold]

1. Most of the jobs from the top few users suffered from major page faults and needed help from IT support staff
2. We contacted 10 of these users and, with 1-5 hours of discussion and debugging, figured out the reasons for the poor memory usage
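As a side note, per-job major page fault counts of the kind plotted above can be read on Linux from getrusage; a minimal sketch (not the monitoring agent actually used on Conte):

    import resource
    import subprocess
    import sys

    # Run the command given on the command line and report the major page faults
    # incurred by it and its children.
    subprocess.run(sys.argv[1:], check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print(f"major page faults: {usage.ru_majflt}")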


Slide 22/35

Result: Outlier Analysis
• How many jobs behave anomalously in terms of runtime (wall-time and CPU-time) and memory usage (physical and virtual memory)?
• An outlier is defined as a value above the 95th percentile within the same application class (a sketch of this rule follows the findings below)

[Figures: Venn diagram of outliers for the different metrics and their intersections; correlation between outliers in the different dimensions]

1. Only 0.15% of jobs were anomalous in all dimensions
2. Outliers w.r.t. the CPU metric were the most numerous – reliable result?
3. Little correlation among the metrics w.r.t. outlier determination
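A minimal sketch of the outlier rule described above (95th percentile within an application class); the file and column names are hypothetical:

    import pandas as pd

    # Assumed columns: app_class, walltime, cputime, physical_mem, virtual_mem
    jobs = pd.read_csv("jobs.csv")
    metrics = ["walltime", "cputime", "physical_mem", "virtual_mem"]

    for m in metrics:
        # A job is an outlier in metric m if it exceeds the 95th percentile of m
        # among jobs of the same application class.
        threshold = jobs.groupby("app_class")[m].transform(lambda s: s.quantile(0.95))
        jobs[m + "_outlier"] = jobs[m] > threshold

    # Fraction of jobs that are outliers in every dimension simultaneously
    print(jobs[[m + "_outlier" for m in metrics]].all(axis=1).mean())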


Slide 23/35

Conclusion: System Usage Repository
• It is important to analyze how resources are being utilized by users
  – "Errare humanum est"
  – Use this analysis for fine-tuning the scheduler, resource provisioning, and educating users
• It is important to look at workload information together with failure events
  – Workload affects the kinds of hardware-software failures that are triggered
  – Performance failures can be used as teachable moments to optimize the system
• It is crucial that this kind of data be openly available for different researchers to do different kinds of analyses


Slide 24/35

Roadmap
• Dependability basics
  – Costly software bugs
  – System design principles
• Open data repository of system usage and failures
  – Historical examples of field failure data analysis
  – Design details for the repository
  – Preliminary insights from analysis of Purdue cluster
• Large-scale cyberinfrastructure NEES
  – Cyberinfrastructure features
  – Dependability solutions and insights


Slide 25/35

Building a Dependable Cyberinfrastructure for Earthquake Engineers
• To reduce the impact of earthquake events, the National Science Foundation created the NEES network as a national research infrastructure
  – Started in 2004; Purdue ran it for 2009-14
  – 14 advanced laboratories at different universities (sites)
  – NSF investment in 2009-14: $100M
• Purdue provided the cyberinfrastructure connecting these sites
  – Linking the NEES experimental facilities to each other, to NEEScomm, and to off-site users
  – Enabled researchers participating on-site or remotely to collect, view, process, and store data from NEES experiments at a central repository
  – Using the cyberinfrastructure, researchers conducted numerical simulation studies and performed hybrid (combined experimental and numerical) testing

URL: http://nees.org


Slide 26/35

NEES CI

[Map of NEES sites:] University of California, Santa Barbara; University of Nevada, Reno; University of California, San Diego; University of Texas at Austin; University of California, Los Angeles; University of California, Davis; Lehigh University; Rensselaer Polytechnic Institute; Cornell University; University at Buffalo; University of Minnesota; University of Illinois at Urbana-Champaign; Oregon State University; University of California, Berkeley

nees.org
• Data Repository
• Computational Simulation
• HPC Resources

NEES Cyberinfrastructure


Slide 27/35

NEES Cyberinfrastructure
• Earthquake engineers demanded "always on, always available, accessible anywhere" resources
  – Resources are: software tools, simulation environment, data, and collaboration tools
• Some key features of NEES cyberinfrastructure
  – NEEShub provides "software as a service," which allows users to run tools, share data, communicate, and collaborate through a web portal without the need to install software or download data to their computer
  – NEEShub allows users to run software tools with direct access to NEES experimental data


Slide 28/35

NEES Cyberinfrastructure
• Leveraging virtual machine technology, the NEEShub provided a secure and configurable computing environment for tools running in Linux or Windows.
• Running detailed computational simulations at high resolution is a highly complex, processor-intensive operation. The NEEShub provided direct access to supercomputing resources at Purdue University, the NSF TeraGrid and XSEDE, and the Open Science Grid, bringing more powerful computing resources to the most challenging computational problems.

[Figure: data pipeline – Data Model, Data Ingest, Data Tools, Data Presentation]


Slide 29/35

Data and User Community Size
• NEES by the numbers
  – 25 terabytes of data in 2M files
  – More than 70 applications
  – Shared by 112,862 users from 208 countries


Slide 30/35

Dependability in Action
• Dependability in numbers (over 2009-14)
  – 99.63% availability
  – Not a single "high severity" security outage
• NEES developed a comprehensive dependability and cybersecurity approach that included
  – Best-practice policies and mechanisms at Purdue
  – An annual security audit and continual low-intensity scans at each of the NEES sites
• Goal: Walk the delicate line between
  – Prompt science and an open community, AND
  – Preserving the integrity and availability of our compute and data resources


Slide 31/35

Unique Challenges for NEES Cybersecurity

1. Different universities have different cybersecurity policies and our policies had to “play nice” with them

2. We were responsible for doing security audits of equipment which we did not own or have root access on

3. How do we test for external threats as well as internal threats? The notion of perimeter security is ingrained in both the technology and the admins.


Slide 32/35

How We Solved These Challenges

1. Varied cybersecurity policies
  • We sometimes enforced stronger rules for machines that are part of the NEES network
  • An "inner shell" of machines and other equipment (routers, PLC controllers, etc.) was created within the university IT resources

2. No ownership or root access on some equipment
  • Negotiated elevated privileges for the purpose of security scanning
  • Success of human relationship building: a feeling of being part of the same team

3. Test for external and internal threats
  • VPN access to sites for mimicking internal attacks
  • More stringent security controls at Purdue HQ


Slide 33/35

The Twain between Cybersecurity Research & Practice

• As cybersecurity researchers, we have often bemoaned the slow adoption of our best research technologies
• Some successful topics where academic research has resulted in usable tools for security practitioners
  – Network/host intrusion detection: Bro, Tripwire
  – Vulnerability scanners: Nessus/OpenVAS, ZAP
  – Web server vulnerability assessment: Nikto
  – Static code analysis: Coverity
  – Blocking connections: fail2ban
• Areas where top-quality research has not resulted in useful tools
  – Security configuration management: e.g., for firewalls, SIM
  – False positives with signature-based systems


Slide 34/35

Take-Aways
• Achieving dependability in a federated cyberinfrastructure has unique challenges
  – No uniform policies and practices
  – Wide variety of computational platforms
  – Wide variety of users
• The solution involved all of the following
  – Adapting usable technology from the research space
  – Some commercial tools
  – User education and community building with system administrators from all organizations


Slide 35/35

Presentation available on the research group web page: engineering.purdue.edu/dcsl


Slide 36/35

Backup Slides


Slide 37/35

Effect of Major Network Outages on Large Business Customers

[Figure: October 2000 data]

• A survey by Meta Group Inc. of 21 industrial sectors in 2000 found the mean loss of revenue due to computer system downtime was $1.01M/hour


Slide 38/35

Dependable Computing Terminology

DEPENDABILITY
• Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
• Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting
• Impairments: Faults, Errors, Failures

FAULT TOLERANCE
• Prediction, Detection, Containment, Diagnosis, Recovery


Slide 39/35

Thrust #1: Distributed Intrusion Tolerant System
• Distributed systems are subjected to malicious attacks on services, including hitherto unknown attacks
• The objective is to tolerate intrusions, not just detect them
• Different phases: Detection, Diagnosis, Containment, Response
• Solution approach:
  – Attack-graph-based modeling of multi-stage attacks
  – Algorithms for containment and survivability computation and semi-optimal response selection
  – Semi-optimal selection and configuration of intrusion detection sensors for gaining visibility into the security state of the system
  – Dealing with zero-day attacks


Slide 40/35

Thrust #1: Distributed Intrusion Tolerant System

[Figure: attack scenario with a hacker targeting services S1–S4]


Slide 41/35

Thrust #1: Distributed Intrusion Tolerant System


Slide 42/35

Debugging in the Large: Large-Scale Distributed Applications
• Goal: Provide highly available applications (e.g., a web service) in a distributed environment
  – Perform failure prediction
  – Perform bug localization
• Challenges in today's distributed systems
  – Large number of entities and large amount of data
  – Interactions between entities cause error propagation
  – High throughput or low latency requirements for the application


Slide 43/35

Metric-based Bug Localization

How can we use these metrics to localize the root cause of problems?

Metric sources:
• Application: request rate, transactions, DB reads/writes, etc.
• Middleware: virtual machine and container statistics
• Operating system: CPU, memory, I/O, network statistics
• Hardware: CPU performance counters

Tivoli Operations Manager


Slide 44/35

Metric-based Bug Localization: Solution Approach

• Look for abnormal time patterns [Ozonat-DSN08, Gao-ICDCS09, Sharma-DSN13, …]

• Pinpoint code regions that are correlated with these abnormal patterns


Slide 45/35

Bugs Cause Metric Correlations to Break
• Hadoop DFS file-descriptor leak in version 0.17 (2008)
• Correlations are different when the bug manifests itself:
  – Metrics: open file descriptors, characters written to disk

[Figure: metric correlation in a normal run vs. a failed run – the correlations are different]

Ignacio Laguna, Subrata Mitra, Fahad A. Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P. Midkiff, Mike Kistler (IBM Research), and Ahmed Gheith (IBM Research), "Automatic Problem Localization in Distributed Applications via Multi-dimensional Metric Profiling," 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 121-132, Braga, Portugal, September 30-October 3, 2013.
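A minimal sketch of the correlation check this slide illustrates; the metric traces below are hypothetical:

    import numpy as np

    def corr(x, y):
        """Pearson correlation between two metric time series."""
        return float(np.corrcoef(x, y)[0, 1])

    # Hypothetical per-window samples of two metrics
    normal_fds   = np.array([10, 12, 11, 13, 12, 14])          # open file descriptors
    normal_chars = np.array([1.0, 1.2, 1.1, 1.3, 1.2, 1.4])    # characters written (GB)
    failed_fds   = np.array([10, 25, 60, 120, 250, 500])       # leak: descriptors keep growing
    failed_chars = np.array([1.2, 1.0, 1.3, 1.1, 1.4, 1.2])

    print("normal run corr:", corr(normal_fds, normal_chars))  # close to 1.0
    print("failed run corr:", corr(failed_fds, failed_chars))  # much weaker (~0.3 here)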


Slide 46/35

Approach Overview


Slide 47/35

Selecting Abnormal Window via Nearest-Neighbor (NN)

[Figure: traces from a normal run and a faulty run are divided into windows; each window is a sample of all metrics, annotated with the code region. For every window a Correlation Coefficient Vector (CCV) [cc1,2, cc1,3, …, ccn-1,n] is computed, and nearest-neighbor distances over the CCVs are used to find outlier windows.]
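A minimal sketch of this step, assuming each window is an (n_samples × n_metrics) array; the simple mean + k·σ cutoff on nearest-neighbor distances stands in for the paper's outlier criterion:

    import numpy as np

    def ccv(window):
        """Correlation Coefficient Vector: upper triangle of the metric correlation matrix."""
        c = np.corrcoef(window, rowvar=False)          # metrics are columns
        iu = np.triu_indices_from(c, k=1)
        return np.nan_to_num(c[iu])

    def outlier_windows(windows, k_sigma=2.0):
        """Flag windows whose nearest-neighbor distance in CCV space is unusually large."""
        vecs = np.array([ccv(w) for w in windows])
        dist = np.array([min(np.linalg.norm(v - u) for j, u in enumerate(vecs) if j != i)
                         for i, v in enumerate(vecs)])
        return np.where(dist > dist.mean() + k_sigma * dist.std())[0]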


Slide 48/35

Selecting Abnormal Metrics by Frequency of Occurrence

Steps:
1. Get the abnormal windows
2. Rank the correlation coefficients (CC) based on their contribution to the distance Distance(CCV1, CCV2)
3. Select the most frequent metric(s)

[Figure: example with abnormal windows X, Y, and Z; in each window the CC contributing most to the distance is identified (e.g., CC5,1, CC5,2, CC19,5), and since metric 5 appears most frequently, the abnormal metric is 5.]
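A minimal sketch of steps 2-3 above; the inputs (CCVs of abnormal windows, a reference CCV, and the metric-pair index) are hypothetical:

    from collections import Counter
    import numpy as np

    def abnormal_metrics(abnormal_ccvs, reference_ccv, pairs, top_k=1):
        """pairs[i] = (a, b): the two metrics behind the i-th correlation coefficient."""
        votes = Counter()
        for v in abnormal_ccvs:
            # Contribution of each CC to the squared distance from the reference CCV
            contrib = (np.asarray(v) - np.asarray(reference_ccv)) ** 2
            for idx in np.argsort(contrib)[::-1][:top_k]:
                a, b = pairs[idx]
                votes[a] += 1
                votes[b] += 1
        # The most frequently implicated metrics are reported as abnormal
        return votes.most_common(3)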


Slide 49/35

Case 1: Hadoop DFS
• File-descriptor leak bug
  – Sockets are left open in the DFSClient Java class
  – 45 classes and 358 methods instrumented (as code regions)

Output of the tool:
  – The 2nd abnormal metric correlates with the origin of the problem
  – The Java class of the bug site is correctly identified


Slide 50/35

Case 2: HBase
• Deadlock in version 0.20.3 of HBase (2010)
  – Incorrect use of locks
  – Bug site is the HRegion class

Output of the tool:
  – The abnormal metrics don't provide much insight
  – HRegion appears as the abnormal code region


Slide 51/35

Modeling Scale-dependent Behavior

[Figure, previous models: number of times a loop executes (y-axis) vs. run # (x-axis) for training runs and production runs.] Is there a bug in one of the production runs?


Slide 52/35

Modeling Scale-dependent Behavior

[Figure: number of times a loop executes (y-axis) vs. scale (x-axis) for training runs and production runs.] Accounting for scale makes trends clear and errors at large scales obvious.


Slide 53/35

Difficulty in Detecting Scale-Dependent Bugs
• What are they?
  – Bugs that arise often in large-scale runs while staying invisible in small-scale ones
  – Examples: race conditions, integer overflow, etc.
  – Severe impact on distributed systems: performance degradation, silent data corruption
• Difficult to detect, let alone localize
  – Large-scale computing environments are not available to developers
  – Large-scale data is not available at development time

Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi, "WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales," 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC), pp. 131-142, New York City, New York, June 17-21, 2013.


Slide 54/35

When Statistical Debugging Meets Scale-dependent Bugs

[Figure: communication (y-axis) vs. scale (x-axis) for normal, buggy, and training runs, contrasting previous statistical debugging with our proposed method.]

• To address scale-dependent bugs with previous methods
  – Either be very restricted in your feature selection
  – Or have access to a bug-free run on the large-scale system
• Our method solves the problems of previous methods
  – Capable of modeling the scaling behavior of distributed programs
  – Only needs access to bug-free runs from small-scale systems


Slide 55/35

Vrisha

[Figure: a control feature (scale of execution) and an observational/behavioral feature are mapped through learned functions g(·) and f(·); a production run is flagged as a BUG when corr(f(x), g(y)) < 0.]

Kernel Canonical Correlation Analysis takes the observational feature X and the control feature Y and finds f and g such that f(X) and g(Y) are highly correlated.
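A minimal sketch of the idea, using ordinary (linear) CCA from scikit-learn as a stand-in for the kernelized version; the training data, production run, and slack factor are hypothetical:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Hypothetical training runs: X = scale of execution, Y = behavioral features
    X_train = np.array([[8], [16], [32], [64], [128]], dtype=float)
    Y_train = np.array([[82, 9], [158, 18], [325, 32], [636, 66], [1275, 128]], dtype=float)

    cca = CCA(n_components=1).fit(X_train, Y_train)
    fx_tr, gy_tr = cca.transform(X_train, Y_train)
    train_gap = np.abs(fx_tr - gy_tr).max()     # largest f(X)/g(Y) gap seen in training

    def flag_bug(x_run, y_run, slack=2.0):
        """Flag a production run whose f(x) and g(y) projections diverge far more than in training."""
        fx, gy = cca.transform(np.atleast_2d(x_run), np.atleast_2d(y_run))
        return abs(fx.item() - gy.item()) > slack * train_gap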


Slide 56/35

WuKong
• Predicts the expected value in a large-scale run for each feature separately
• Prunes unpredictable features to improve localization quality
• Provides a shortlist of suspicious features in its localization roadmap


Slide 57/35

The Workflow

[Figure: each training run (APP instrumented with PIN, runs 1…N) produces a (SCALE, FEATURE) record; the records are used to train a MODEL. For a production run, the observed SCALE and FEATURE values are compared against the model's prediction ("= ?").]


Slide 58/35

Feature

    void foo(int a) {
      1: if (a > 0) {
         } else {
         }
      2: if (a > 100) {
           int i = 0;
      3:   while (i < a) {
      4:     if (i % 2 == 0) {
             }
             ++i;
           }
         }
    }

(The numbered branch and loop sites 1–4 are the features whose execution counts are recorded.)


Slide 59/35

Model
• X ~ vector of scale parameters X1…XN
• Y ~ number of occurrences of a feature
• The model to predict Y from X [equation shown on the slide]
• Compute the relative prediction error [equation shown on the slide]
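A minimal sketch of this modeling step, using a simple log-log linear fit as a stand-in for the per-feature model on the slide (the training data are hypothetical):

    import numpy as np

    # Hypothetical training runs: scale (e.g., number of processes) and a feature's count
    scales = np.array([8, 16, 32, 64, 128], dtype=float)
    counts = np.array([130, 260, 515, 1030, 2050], dtype=float)

    # Fit log(count) = a * log(scale) + b
    a, b = np.polyfit(np.log(scales), np.log(counts), 1)

    def predicted(scale):
        return float(np.exp(a * np.log(scale) + b))

    def relative_error(observed, scale):
        """Relative prediction error of the feature at the given scale."""
        p = predicted(scale)
        return abs(observed - p) / p

    # Production run at a larger scale than any training run
    print(relative_error(observed=16800, scale=1024))   # small if the trend holds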


Slide 60/35

Inference: Bug Localization
• First, we need to determine if the production run is buggy: feature i is flagged when Ei > γ · Mi, where Ei is the prediction error of feature i in the production run, Mi is the max error of feature i in all training runs, and γ is a constant parameter
• If there is a bug in this run, we rank all the features by their prediction errors
  – Output the top N features as a roadmap for locating the bug
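A minimal sketch of this inference step; the feature names and the γ value are hypothetical:

    def localize(prod_errors, max_train_errors, gamma=1.15, top_n=5):
        """prod_errors / max_train_errors map feature -> relative prediction error."""
        buggy = any(prod_errors[f] > gamma * max_train_errors[f] for f in prod_errors)
        if not buggy:
            return []
        # Rank features by production-run prediction error; the top N form the roadmap
        return sorted(prod_errors, key=prod_errors.get, reverse=True)[:top_n]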


Slide 61/35

Optimization: Feature Pruning
• Some noisy features cannot be effectively predicted by the above model
  – Not correlated with scale, e.g., random
  – Discontinuous
• How to remove noisy features?
  – If we cannot predict them well for the training runs, we cannot predict them for the large-scale runs


Slide 62/35

Evaluation

• Large-scale study of LLNL Sequoia AMG2006
  – Up to 1024 processes
• Two case studies of real bugs
  – Integer overflow in MPI_Allgather
  – Infinite loop in Transmission, a popular P2P file-sharing application


Slide 63/35

AMG2006: Modeling Accuracy

• Trained on 8-128 processes
• Compared predicted behavior at 256, 512, and 1024 processes with actual (non-buggy) behavior

Scale of Run | Mean Prediction Error
256          | 6.55%
512          | 8.33%
1024         | 7.77%


Slide 64/35

AMG2006: Fault Injection
• Fault
  – Injected at rank 0
  – Randomly pick a branch to flip
• Data
  – Training: no fault, 110 runs @ 8-128 processes
  – Testing: with fault, 100 runs @ 1024 processes
• Result

Total              100
Non-Crashing        57
Detected            53
Localized           49
Localization Ratio  49/53 = 92.5%



Slide 65/35

Case Study: An Infinite Loop in Transmission



Slide 66/35

Case Study: An Infinite Loop in Transmission



Slide 67/35

Case Study: An Infinite Loop in Transmission

[Figure: localization output highlighting Feature 53, 66]
