TRANSCRIPT
Survival Analysis for Risk-Ranking of ESP System Performance
Teddy Petrou, Rice University
August 17, 2005
Presentation Outline
• ESP Overview
• Survival Analysis Review
• Dataset Explanation
• Problems
• Modeling Process and Improvements
• NPV Calculations for ESPs
• Conclusions
ESP overview
• More than 60 percent (and rising) of
producing oil wells require some type of
assisted lift to produce the recoverable oil.
• ESPs are typically used where there is
insufficient pressure to lift the fluids to the
surface (typically in older, more watered-
out wells).
• Provide cost-effective production by
boosting fluid production from these less
efficient, older reservoirs.
Survival Analysis (SA) Review
• Survival Analysis refers to the statistical procedures for modeling the time until an
event occurs.
• Censoring occurs when a pump has yet to fail at the time of data
collection.
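In code, right-censored run-life data is usually stored as a duration paired with an event flag (1 = failure observed, 0 = pump still running at data collection). A minimal sketch — the pump records below are made up for illustration:

```python
# Each record: (run time in days, event flag). event = 1 means the pump
# failed at that time; event = 0 means it was still running when the data
# were collected, i.e. right-censored. Values are illustrative only.
records = [(120, 1), (340, 0), (95, 1), (400, 0), (210, 1)]

n_failures = sum(e for _, e in records)
n_censored = sum(1 - e for _, e in records)
censored_fraction = n_censored / len(records)

print(n_failures, n_censored, censored_fraction)
```

Censored records still carry information (the pump survived at least that long), which is why they cannot simply be dropped.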
Survival Analysis Benefits
• Capable of providing insight into which explanatory variables
significantly affect run times.
• Predict run times of ESPs given various values of the explanatory
variables.
• Generate estimated survival curves
- Produce a bond-type risk ranking scheme
- Provide annuity-type NPV calculations for ESP value
- Simulate sample reservoir ESP usage
Survivor and Hazard Functions
• Survivor function S(t)
• Gives the probability that an individual survives longer than time t
• Hazard function h(t)
• Gives the instantaneous potential per unit time for failure given that the pump has survived up to time t
• The models applied are defined in terms of the hazard function.
Generating Survival Curves
Three main methods:
• Non-parametric (Kaplan-Meier)
• Parametric (exponential, Weibull, etc…)
• Semi-parametric (Cox Proportional Hazards)
– Factors and covariates are compared to a baseline hazard
function
– Allows us to determine which combination of potential explanatory
variables is most significant
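The non-parametric Kaplan-Meier method listed above fits in a few lines of plain Python; the durations below are made-up run times, used only to show the product-limit calculation:

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate of S(t).

    durations: observed time for each pump (failure or censoring time)
    events:    1 = failure observed, 0 = right-censored
    Returns (time, S(t)) pairs, one per distinct failure time.
    """
    data = sorted(zip(durations, events))
    surv, curve = 1.0, []
    for t in sorted({d for d, e in data if e == 1}):
        at_risk = sum(1 for d, _ in data if d >= t)   # still running just before t
        deaths = sum(1 for d, e in data if d == t and e == 1)
        surv *= (at_risk - deaths) / at_risk          # product-limit step
        curve.append((t, surv))
    return curve

# five pumps: failures at times 2, 3, 5; censored at 3 and 7
print(kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0]))
# steps: S(2) = 4/5, S(3) = 4/5 * 3/4, S(5) = that * 1/2
```

Note how the pump censored at time 3 leaves the risk set without counting as a death — this is how censored records contribute information.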
Formulation of Cox Proportional Hazards Model
Given two pumps (R and C), made by two different manufacturers, their hazard
functions would be h_R(t) = ψ h_C(t), where ψ is a constant known
as the relative risk. If ψ is less than 1 then pump R would be less likely to
fail at any given time.
Since the relative risk cannot be negative, we let ψ = exp(β).
The comparative baseline level can be arbitrarily chosen. If a different baseline
level is chosen, the parameters would change but all statistical significance
tests would remain the same.
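Under the proportional-hazards assumption, the ratio of the two pumps' hazards is the same at every point in time. A sketch with a constant (exponential) baseline hazard — β and the baseline rate are illustrative numbers, not fitted values from the talk:

```python
import math

beta = -0.5              # illustrative manufacturer coefficient
psi = math.exp(beta)     # relative risk; exp() guarantees psi > 0
baseline_rate = 0.02     # illustrative hazard of baseline pump C (per day)

h_C = lambda t: baseline_rate   # baseline hazard h_C(t)
h_R = lambda t: psi * h_C(t)    # proportional hazards: h_R(t) = psi * h_C(t)

# the hazard ratio is constant, whatever time we evaluate it at
ratios = [h_R(t) / h_C(t) for t in (1, 50, 500)]
print(psi, ratios)  # psi < 1: pump R is less likely to fail at any time
```

Swapping which manufacturer is the baseline flips the sign of β but leaves every significance test unchanged, as the slide notes.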
Step-wise modeling overview
1. Data transformation with expert collaboration
2. Step-wise model selection with factor collapsing
3. Model verification and validation
4. Model implementation
Once all steps are complete, an automated process can then be
set up for quick statistical ESP analysis.
Data Introduction
The data contains nearly 25,000 different records of ESPs from around the
world. There are 58 explanatory variables consisting of factors and
covariates.
Problems with large data:
• Difficult to find the correct model
• Very time consuming
• Inconsistencies abound
Problems with this data:
• High correlation (multicollinearity)
• Low failure occurrences
• Missing data
Pragmatic approach: different subsets of the data were chosen.
Highly Correlated Data
The best way to alleviate multicollinearity issues is to work with someone who has expert knowledge of the database and can identify redundant explanatory variables. In the absence of an expert, sifting through the data by hand is a must.
Producing a cross-table of the data is one method to find variables that are highly correlated.
[Cross-table of SYSMFG against PM and FG, omitted in transcription]
A perfect one-to-one correlation is found. Removing one of the variables is
necessary.
Removing Data
Variables exhibiting the near one-to-one correlation were removed. There were
also many other variables that were subsets of one another.
Some variables might be replaceable by the variables that are subsets of them:
knowing the level of one variable can give information about 15 others.
Reducing the data will help with model interpretation as well as computing time.
Transforming Low Counts and Missing Data
Each factor in the data comprises several levels. Levels with low counts can
severely skew the model-building process. To alleviate this problem, all levels
were required to have at least 15 records.
Missing data was also an issue. Several variables had more than half their
values recorded as ‘NA’.
If the NA group contained more than 15 entries, this group was changed to a
level named ‘Unknown’.
Again, collaboration with an expert is needed to investigate the cause of the
missing entries.
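The two rules on this slide (relabel NA as 'Unknown', enforce the 15-record minimum) can be sketched as one pass over a factor column. 'Other' here is a hypothetical pooled level — the talk recommends choosing the merge target with an expert instead:

```python
from collections import Counter

MIN_COUNT = 15  # slide's threshold: every level needs at least 15 records

def clean_factor(values, min_count=MIN_COUNT):
    """Relabel one factor column: 'NA' entries become 'Unknown', then any
    level with fewer than min_count records is pooled into 'Other'
    ('Other' is a stand-in; expert-guided merging is preferred)."""
    relabeled = ['Unknown' if v == 'NA' else v for v in values]
    counts = Counter(relabeled)
    return [v if counts[v] >= min_count else 'Other' for v in relabeled]

# toy column: 'NA' is frequent enough to keep as 'Unknown'; 'B' is not
col = ['A'] * 20 + ['NA'] * 16 + ['B'] * 3
print(Counter(clean_factor(col)))
```

Counting before relabeling NA would wrongly pool a large 'Unknown' group, which is why the NA step runs first.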
Data With No Failures
• Right censored data can make for difficult analysis
• A factor level with no failures essentially implies that an ESP will never fail; it provides no information about the failure rate.
• To alleviate this problem, such levels can be eliminated from the data altogether or combined with another level with help from an expert.
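A quick way to spot the problematic levels is to tally observed failures per level; any level whose failure count is zero must be dropped or merged before modeling. The column names below are hypothetical:

```python
from collections import defaultdict

def zero_failure_levels(levels, events):
    """Return factor levels that appear in the data but have no observed
    failures (event = 1); these carry no failure-rate information."""
    failures = defaultdict(int)
    for lvl, e in zip(levels, events):
        failures[lvl] += e
    return sorted(lvl for lvl, n in failures.items() if n == 0)

# toy data: manufacturer 'X' has only censored (still-running) records
mfg   = ['X', 'X', 'Y', 'Y', 'Z']
event = [ 0,   0,   1,   0,   1 ]
print(zero_failure_levels(mfg, event))  # ['X']
```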
Model Selection
Once a ‘good’ set of data is produced, a step-wise procedure adds or removes variables one at a time until a statistically ‘best’ model is found. Different combinations of explanatory variables will affect the selection procedure. The step-wise procedure is conservative and tends to keep variables in the model that might not be necessary.
Once this model is found, each variable is looked at individually and a decision is made whether or not to drop the variable.
Factor Collapsing
• Once a final model is chosen, a procedure to combine levels with similar hazards is begun.
Model Validation
A valid model is one that is consistent, reliable, and not sensitive to
small changes in the data.
Methods to check validity:
• Randomly split the data, retrieve a new model for each half, and compare.
• Randomly split the data, use the model found for the first half to model the
second half, and compare coefficients.
• Use a bootstrapping method to obtain many different sets of data and apply
the model-building procedure.
• Obtain new data, repeat the model-building procedure, and compare. This
method could be useful to see how the model changes over time.
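The first split-half check can be sketched with the simplest parametric fit, the exponential model, whose maximum-likelihood failure rate is failures divided by total observed time. The data and seed here are illustrative, not the talk's dataset:

```python
import random

def exp_rate(records):
    """MLE of the exponential failure rate: failures / total exposure time.
    Each record is (duration, event flag); censored time still counts as
    exposure."""
    failures = sum(e for _, e in records)
    exposure = sum(t for t, _ in records)
    return failures / exposure

random.seed(0)  # fixed seed so the split is reproducible
# illustrative run-life records: (duration in days, event flag)
records = [(random.uniform(10, 500), random.random() < 0.6)
           for _ in range(200)]

random.shuffle(records)
half = len(records) // 2
rate_a = exp_rate(records[:half])
rate_b = exp_rate(records[half:])

# a stable model should give similar estimates on both halves
print(rate_a, rate_b, abs(rate_a - rate_b) / rate_a)
```

The same split-and-compare loop applies to the Cox model's coefficients; bootstrapping just repeats it over resampled datasets.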
Conclusions
• Pragmatic risk-ranking and valuation tools for ESPs have been created.
• Pragmatic tools were developed for dealing with large, sparse, and
inconsistent data, and for modeling this data in a consistent fashion.