TRANSCRIPT
Survival Analysis for Risk-Ranking of ESP System Performance
Teddy Petrou, Rice University
August 17, 2005
Presentation Outline
• ESP Overview
• Survival Analysis Review
• Dataset Explanation
• Problems
• Modeling Process and Improvements
• NPV Calculations for ESPs
• Conclusions
ESP overview
• More than 60 percent (and rising) of
producing oil wells require some type of
assisted lift to produce the recoverable oil.
• ESPs are typically used where there is
insufficient pressure to lift the fluids to the
surface (typically in older, more watered-
out wells).
• Provide cost-effective production by
boosting fluid production from these less
efficient, older reservoirs.
Survival Analysis (SA) Review
• Survival Analysis refers to the statistical procedures for modeling the time until an
event occurs.
• Censoring occurs when a pump has yet to fail at the time of data
collection.
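In code, right-censored run-life data is usually stored as a duration paired with an event flag (1 = failure observed, 0 = pump still running at data collection). A minimal sketch — the pump records below are made up for illustration:

```python
# Each record: (run time in days, event flag). event = 1 means the pump
# failed at that time; event = 0 means it was still running when the data
# were collected, i.e. right-censored. Values are illustrative only.
records = [(120, 1), (340, 0), (95, 1), (400, 0), (210, 1)]

n_failures = sum(e for _, e in records)
n_censored = sum(1 - e for _, e in records)
censored_fraction = n_censored / len(records)

print(n_failures, n_censored, censored_fraction)
```

Censored records still carry information (the pump survived at least that long), which is why they cannot simply be dropped.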
Survival Analysis Benefits
• Capable of providing insight into which explanatory variables
significantly affect run times.
• Predict run times of ESPs given various values of the explanatory
variables.
• Generate estimated survival curves
- Produce a bond-type risk ranking scheme
- Provide annuity-type NPV calculations for ESP value
- Simulate sample reservoir ESP usage
Survivor and Hazard Functions
• Survivor function S(t)
• Gives the probability that an individual survives longer than time t
• Hazard function h(t)
• Gives the instantaneous potential per unit time for failure given that the pump has survived up to time t
• The models applied are defined in terms of the hazard function.
Generating Survival Curves
Three main methods:
• Non-parametric (Kaplan-Meier)
• Parametric (exponential, Weibull, etc…)
• Semi-parametric (Cox Proportional Hazards)
– Factors and covariates are compared to a baseline hazard
function
– Allows us to determine which combination of potential explanatory
variables is most significant
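The non-parametric Kaplan-Meier method listed above fits in a few lines of plain Python; the durations below are made-up run times, used only to show the product-limit calculation:

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate of S(t).

    durations: observed time for each pump (failure or censoring time)
    events:    1 = failure observed, 0 = right-censored
    Returns (time, S(t)) pairs, one per distinct failure time.
    """
    data = sorted(zip(durations, events))
    surv, curve = 1.0, []
    for t in sorted({d for d, e in data if e == 1}):
        at_risk = sum(1 for d, _ in data if d >= t)   # still running just before t
        deaths = sum(1 for d, e in data if d == t and e == 1)
        surv *= (at_risk - deaths) / at_risk          # product-limit step
        curve.append((t, surv))
    return curve

# five pumps: failures at times 2, 3, 5; censored at 3 and 7
print(kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0]))
# steps: S(2) = 4/5, S(3) = 4/5 * 3/4, S(5) = that * 1/2
```

Note how the pump censored at time 3 leaves the risk set without counting as a death — this is how censored records contribute information.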
Formulation of Cox Proportional Hazards Model
Given two pumps (R and C), made by two different manufacturers, their hazard
functions would be h_R(t) = ψ h_C(t), where ψ is a constant known
as the relative risk. If ψ is less than 1 then pump R would be less likely to
fail at any given time.
Since the relative risk cannot be negative, we let ψ = exp(β).
The comparative baseline level can be arbitrarily chosen. If a different baseline
level is chosen, the parameters would change but all statistical significance
tests would remain the same.
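Under the proportional-hazards assumption, the ratio of the two pumps' hazards is the same at every point in time. A sketch with a constant (exponential) baseline hazard — β and the baseline rate are illustrative numbers, not fitted values from the talk:

```python
import math

beta = -0.5              # illustrative manufacturer coefficient
psi = math.exp(beta)     # relative risk; exp() guarantees psi > 0
baseline_rate = 0.02     # illustrative hazard of baseline pump C (per day)

h_C = lambda t: baseline_rate   # baseline hazard h_C(t)
h_R = lambda t: psi * h_C(t)    # proportional hazards: h_R(t) = psi * h_C(t)

# the hazard ratio is constant, whatever time we evaluate it at
ratios = [h_R(t) / h_C(t) for t in (1, 50, 500)]
print(psi, ratios)  # psi < 1: pump R is less likely to fail at any time
```

Swapping which manufacturer is the baseline flips the sign of β but leaves every significance test unchanged, as the slide notes.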
Step-wise modeling overview
1. Data transformation with expert collaboration
2. Step-wise model selection with factor collapsing
3. Model verification and validation
4. Model implementation
Once all steps are complete, an automated process can then be
set up for quick statistical ESP analysis.
Data Introduction
The data contains nearly 25,000 different records of ESPs from around the
world. There are 58 explanatory variables consisting of factors and
covariates.
Problems with large data:
• Difficult to find the correct model
• Very time consuming
• Inconsistencies abound
Problems with this data:
• High correlation (multicollinearity)
• Low failure occurrences
• Missing data
Pragmatic approach: different subsets of the data were chosen.
Highly Correlated Data
The best way to alleviate multicollinearity issues is to work with someone who has expert knowledge of the database and can identify redundant explanatory variables. In the absence of an expert, sifting through the data by hand is a must.
Producing a cross-table of the data is one method to find variables that are highly correlated.
[Cross-table of SYSMFG against PM and FG, omitted in transcription]
A perfect one-to-one correlation is found. Removing one of the variables is
necessary.
Removing Data
Variables exhibiting the near one-to-one correlation were removed. There were
also many other variables that were subsets of one another.
Some variables might be replaceable by the variables that are subsets of them:
knowing the level of one variable can give information about 15 others.
Reducing the data will help with model interpretation as well as computing time.
Transforming Low Counts and Missing Data
Each factor in the data comprises several levels. Levels with low counts can
severely skew the model-building process. To alleviate this problem, all levels
were required to have at least 15 records.
Missing data was also an issue. Several variables had more than half their
values recorded as ‘NA’.
If the NA group contained more than 15 entries, this group was changed to a
level named ‘Unknown’.
Again, collaboration with an expert is needed to investigate the cause of the
missing entries.
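The two rules on this slide (relabel NA as 'Unknown', enforce the 15-record minimum) can be sketched as one pass over a factor column. 'Other' here is a hypothetical pooled level — the talk recommends choosing the merge target with an expert instead:

```python
from collections import Counter

MIN_COUNT = 15  # slide's threshold: every level needs at least 15 records

def clean_factor(values, min_count=MIN_COUNT):
    """Relabel one factor column: 'NA' entries become 'Unknown', then any
    level with fewer than min_count records is pooled into 'Other'
    ('Other' is a stand-in; expert-guided merging is preferred)."""
    relabeled = ['Unknown' if v == 'NA' else v for v in values]
    counts = Counter(relabeled)
    return [v if counts[v] >= min_count else 'Other' for v in relabeled]

# toy column: 'NA' is frequent enough to keep as 'Unknown'; 'B' is not
col = ['A'] * 20 + ['NA'] * 16 + ['B'] * 3
print(Counter(clean_factor(col)))
```

Counting before relabeling NA would wrongly pool a large 'Unknown' group, which is why the NA step runs first.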
Data With No Failures
• Right censored data can make for difficult analysis
• A factor level with no failures essentially implies that an ESP will never fail; it provides no information about the failure rate.
• To alleviate this problem, such levels can be eliminated from the data altogether or combined with another level with help from an expert.
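A quick way to spot the problematic levels is to tally observed failures per level; any level whose failure count is zero must be dropped or merged before modeling. The column names below are hypothetical:

```python
from collections import defaultdict

def zero_failure_levels(levels, events):
    """Return factor levels that appear in the data but have no observed
    failures (event = 1); these carry no failure-rate information."""
    failures = defaultdict(int)
    for lvl, e in zip(levels, events):
        failures[lvl] += e
    return sorted(lvl for lvl, n in failures.items() if n == 0)

# toy data: manufacturer 'X' has only censored (still-running) records
mfg   = ['X', 'X', 'Y', 'Y', 'Z']
event = [ 0,   0,   1,   0,   1 ]
print(zero_failure_levels(mfg, event))  # ['X']
```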
Model Selection
Once a ‘good’ set of data is produced, a step-wise procedure adds or removes variables one at a time until a statistically ‘best’ model is found. Different combinations of explanatory variables will affect the selection procedure. The step-wise procedure is conservative and tends to keep variables in the model that might not be necessary.
Once this model is found, each variable is looked at individually and a decision is made whether or not to drop the variable.
Factor Collapsing
• Once a final model is chosen, a procedure to combine levels with similar hazards is begun.
Model Validation
A valid model is one that is consistent, reliable, and not sensitive to
small changes in the data.
Methods to check validity:
• Randomly split the data, retrieve a new model for each half, and compare.
• Randomly split the data, use the model found for the first half to model the
second half, and compare coefficients.
• Use a bootstrapping method to obtain many different sets of data and apply
the model-building procedure.
• Obtain new data, repeat the model-building procedure, and compare. This
method could be useful to see how the model changes over time.
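The first split-half check can be sketched with the simplest parametric fit, the exponential model, whose maximum-likelihood failure rate is failures divided by total observed time. The data and seed here are illustrative, not the talk's dataset:

```python
import random

def exp_rate(records):
    """MLE of the exponential failure rate: failures / total exposure time.
    Each record is (duration, event flag); censored time still counts as
    exposure."""
    failures = sum(e for _, e in records)
    exposure = sum(t for t, _ in records)
    return failures / exposure

random.seed(0)  # fixed seed so the split is reproducible
# illustrative run-life records: (duration in days, event flag)
records = [(random.uniform(10, 500), random.random() < 0.6)
           for _ in range(200)]

random.shuffle(records)
half = len(records) // 2
rate_a = exp_rate(records[:half])
rate_b = exp_rate(records[half:])

# a stable model should give similar estimates on both halves
print(rate_a, rate_b, abs(rate_a - rate_b) / rate_a)
```

The same split-and-compare loop applies to the Cox model's coefficients; bootstrapping just repeats it over resampled datasets.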
Conclusions
• Pragmatic risk-ranking and valuation tools for ESPs have been created.
• Pragmatic tools were developed for dealing with large, sparse, and
inconsistent data, and for modeling this data in a consistent fashion.