1 © 2016 Pivotal Software, Inc. All rights reserved. 1
Planes, Trains, and Automobiles
A Data Scientist’s Guide to Modeling Engine Degradation
April Song @aprilsongg Sarah Aerni @itweetsarah
All industries need technology to process and store data
• Gene Sequencing: the cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
• Social Media: Facebook users upload 250 million photos each day
• Oil Exploration: oil rigs generate 25,000 data points per second
• Stock Market, Video Surveillance, Medical Imaging, Mobile Sensors
How can connected devices in our home be smart enough to
make daily life easier?
How can we know a tree has fallen on a power line before the
residents complain?
How can we use data to prevent airplane accidents?
Aerospace Industry is Embracing IoT
• Engines are being fitted with more and more sensors
• Aircraft data networks are improving data transfer speeds
• Real-time analytics is improving efficiency and performance
Pratt & Whitney’s Geared Turbofan Engine
• 5,000 sensors
• 10 GB of data per second
• 12 hours of flight = 844 TB of data
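A quick back-of-envelope check of these figures, as a sketch only: one engine at 10 GB per second for 12 hours gives 432,000 GB. The 844 TB total is reproduced if we assume the figure covers two engines and counts 1 TB as 1,024 GB. Both of those are assumptions, not something stated on the slide.

```python
# Back-of-envelope check of the slide's data-volume figures.
# ASSUMPTIONS (not stated on the slide): the 844 TB total covers a
# twin-engine aircraft, and 1 TB is counted as 1,024 GB.
GB_PER_SECOND = 10   # per engine, from the slide
FLIGHT_HOURS = 12    # from the slide
ENGINES = 2          # assumption
GB_PER_TB = 1024     # assumption

seconds = FLIGHT_HOURS * 3600
per_engine_gb = GB_PER_SECOND * seconds
total_tb = ENGINES * per_engine_gb / GB_PER_TB

print(per_engine_gb)  # 432000
print(total_tb)       # 843.75
```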
WHY IS THIS A DATA SCIENCE PROBLEM?
How does this…
…become this?
By recognizing this
HOW CAN IT SOLVE JET ENGINE CHALLENGES?
But what can we do with this much data?
• Predict thrust demands of an engine: reduction in fuel consumption
• Monitor engine health and degradation: reduced maintenance costs with increased performance, efficiency, and engine lifetime
• Detect faults and anomalies during a flight: prevention of equipment failures and accidents
What We Will Cover Today
• Jet Engine Sensor Data
• Enabling Technologies for Data Science
• Building Models on Large-Scale Datasets
  – Detecting Engine “end-of-life” Signal via Clustering
  – Tracing Engine Health Degradation using Classification
Introduction to C-MAPSS
Commercial Modular Aero-Propulsion System Simulation
• C-MAPSS is a MATLAB program that simulates a large high-bypass commercial turbofan engine capable of ~90,000 lbf of thrust
  – GUI allows point-and-click operation of engine models
  – Simulates deterioration and faults
[Figure: simplified diagram of the 90k engine]
Overview of Flights
• 6,875 flights
  – 5,244 flights from nominal engines
  – 1,631 flights from fault engines
• Flight lengths range from 74 to 85 minutes
• Average length of flight is ~80 minutes
[Figure: histogram of number of flights by length of flight (seconds)]
Flight Parameters
Parameter   Description                          Units
Flight Conditions
time        Flight time                          sec
alt         Altitude                             ft
MN          Mach number                          pct
TRA         Throttle resolver angle              deg
Wf          Fuel flow                            pps
Fn          Net thrust                           lbf
Measurement Temperatures
T48         Total temperature at HPT outlet      °R
T2          Total temperature at fan inlet       °R
T24         Total temperature at LPC outlet      °R
T30         Total temperature at HPC outlet      °R
T50         Total temperature at LPT outlet      °R
Pressure Measurements
P2          Pressure at fan inlet                psia
P15         Total pressure in bypass-duct        psia
P30         Total pressure at HPC outlet         psia
Other Measurements
Nf          Physical fan speed                   rpm
Nc          Physical core speed                  rpm
epr         Engine pressure ratio (P50/P2)       --
phi         Ratio of fuel flow to Ps30           pps/psi
Ps30        Static pressure at HPC outlet        psia
NfR         Corrected fan speed                  rpm
NcR         Corrected core speed                 rpm
BPR         Bypass ratio                         --
farB        Burner fuel-air ratio                --
htBleed     Bleed enthalpy                       --
PCNfRdmd    Percent corrected fan speed          pct
W31         HPT coolant bleed                    lbm/s
W32         LPT coolant bleed                    lbm/s
Health Indicators
SmHPC       HPC stall margin                     --
SmLPC       LPC stall margin                     --
SmFan       Fan stall margin                     --
LARGE DATASETS REQUIRE NEW TECHNOLOGIES
At-Scale Modeling
Need for new environments to process big data?
• FLEXIBLE (variety/velocity): HDFS storage and MPP architectures distribute storage and prevent data movement
• SCALABLE (petabytes of data): distributed computation for parallelization
• ENABLING (rapidly evolving field of data science and tools): open-source library for machine learning at scale and framework to access common languages
• ACCESSIBLE (many existing libraries, tools and expertise): SQL engine and ODBC/JDBC connections to Hadoop
Analytics with Pivotal: a single address for everything analytics
Time-to-Insights: forecasting, clustering, regression, classification, optimization
Pivotal Greenplum MPP DB
• Think of it as multiple PostgreSQL servers
• Rows are distributed across segments by a particular field (or randomly)
[Diagram: one master and multiple segments/workers]
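A toy illustration of distributing rows by a field, as a sketch under stated assumptions: the segment count, the row layout, the `flight_id` key and the use of CRC32 are all hypothetical, and Greenplum uses its own internal hash function rather than this one.

```python
# Toy model of hash-distributing rows across MPP segments.
# ASSUMPTIONS: the segment count, the row layout and the use of CRC32 are
# illustrative; Greenplum uses its own internal hash function.
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field])].append(row)
    return segments


# Hypothetical sensor rows keyed by flight_id.
rows = [{"flight_id": f, "time": t, "alt": 1000 * t}
        for f in range(8) for t in range(3)]
segments = distribute(rows, "flight_id")

# All rows of one flight land on the same segment, so per-flight work
# needs no data movement between segments.
for seg, seg_rows in sorted(segments.items()):
    print(seg, sorted({r["flight_id"] for r in seg_rows}))
```

Choosing the flight identifier as the distribution key in this sketch keeps each flight's time series together, which is what makes per-flight feature extraction embarrassingly parallel.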
Greenplum Database Features for Data Scientists
• Window functions: perform calculations across a set of table rows that are somehow related to the current row
• Analytics extensions: in-database machine learning at scale using MADlib
• Procedural language extensions: extended functionality using non-SQL programming languages and packages (e.g. Python and R)
• Client access: ODBC and JDBC access to support connections to third-party tools
* Only a subset of Greenplum Database features
MADlib: Scalable, In-database ML
• Open source: https://github.com/madlib/madlib
• Works on Greenplum DB, HAWQ and PostgreSQL
• In active development by Pivotal
• Downloads and docs: http://madlib.net/
Data Parallelism through PL/X
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
[Diagram: the master host (with a standby master) passes SQL over the interconnect to four segment hosts, each running two segments]
User Defined Functions
-- SQL wrapper
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  # Source language code
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;  -- Source language declaration
What does a typical flight look like?
• A flight consists of a series of ascents, cruises, and descents
• The average cruise at 35,000 ft lasts ~21 minutes
  – Engine health is calculated from a snapshot of parameters during this cruise
[Figure: altitude over time for some example flights]
Time Series: Pressure Parameters
• P2, P15, and P30 appear to be positively correlated except during the middle cruise
  – Correlation may differ depending on regime
Legend: P2 = pressure at fan inlet; P15 = total pressure in bypass-duct; P30 = total pressure at HPC outlet
Life of a Nominal Engine
• Engine health is modeled to degrade exponentially over time
• 5,244 flights from 25 nominal engines
• Median number of flights for a nominal engine is 201
• Median health score of nominal engines across all flights is ~0.81
Opportunity for Clustering of Engines
• Nominal engines seem to degrade in at least 4 different ways
  – Cluster engines based on degradation trend
  – Caveat: small sample size (35 engines)
• Additional modeling opportunity:
  – Predict engine health score
Life of a Fault Engine
• A significant drop in engine health is apparent after a fault flight
• 1,631 flights from 10 fault engines
• Median number of flights for a fault engine is 137
• Median health score of fault engines across all flights is ~0.72
[Figure annotation: fault flight]
What happens when there is a fault?
• At first glance, the fault’s effects are not noticeable
  – Need to zoom in to see the effects of a fault
[Figure: engine pressure ratio (EPR) for flight 32-15, a flight with a fan fault]
Feature Engineering: Transforming Time Series
• Many modeling approaches require feature extraction
  – Clustering of engines
  – Regression to reverse-engineer engine health
Engineering Features From Time Series
• Goal: represent time series data as variables
• Approach:
  1. Identify the different phases of the flight: takeoff, climbs, cruises, descents, landing
  2. For each phase and parameter calculate: mean, min, max, stddev, max – min, median
  3. Summary stats on rate of change for features
Example values for two phases:
  mean: 13,674 stddev: 0 max: 13,674 min: 13,674 max-min: 0 median: 13,674
  mean: 33,596 stddev: 5,732 max: 45,575 min: 25,959 max-min: 19,616 median: 32,556
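The approach above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the phase rule (sign of the altitude change between samples, with a hypothetical tolerance `tol`), the function names and the toy altitude trace are all assumptions, since the slides do not say how phases were identified.

```python
# Sketch: split one altitude trace into phases and summarise each phase.
# ASSUMPTIONS: the phase rule (sign of the altitude change between samples,
# with a hypothetical tolerance) and the toy trace are illustrative only.
import statistics


def label_phases(alt, tol=1.0):
    """Label each sample 'climb', 'cruise' or 'descent' from the altitude delta."""
    labels = []
    for i in range(len(alt)):
        delta = alt[i] - alt[i - 1] if i > 0 else alt[1] - alt[0]
        if delta > tol:
            labels.append("climb")
        elif delta < -tol:
            labels.append("descent")
        else:
            labels.append("cruise")
    return labels


def split_runs(values, labels):
    """Yield (label, values) for each run of consecutive identical labels."""
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            yield labels[start], values[start:i]
            start = i


def summarise(values):
    """Summary statistics listed on the slide for one phase of one parameter."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "stddev": statistics.pstdev(values),
        "max-min": max(values) - min(values),
        "median": statistics.median(values),
    }


alt = [0, 5000, 10000, 10000, 10000, 10000, 6000, 2000, 0]  # toy trace, ft
features = [(phase, summarise(vals))
            for phase, vals in split_runs(alt, label_phases(alt))]
for phase, stats in features:
    print(phase, stats)
```

As in the first example row on the slide, a level cruise yields a standard deviation and max-min of zero. The same `summarise` helper could be applied to the sample-to-sample deltas to cover step 3.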
Calculating Correlations between Sensors
• How correlated are two sensors?
• Are correlations between the sensors different flight to flight?
• Approach:
  – 1) Calculate correlations over the entire flight data set and observe trends
  – 2) Calculate correlations over each flight and observe trends
Sensor Parameter Correlations
• Correlations calculated on entire flight data set
• 435 total unique parameter pairs
  – 162 pairs are strongly positively correlated (> 0.8)
  – 45 pairs are strongly negatively correlated (< -0.8)
  – 228 pairs are weakly correlated
[Figure: histogram of number of measurement pairs by correlation]
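A small sketch of the pairwise calculation and the bucketing by the ±0.8 thresholds quoted above. The three toy columns and the helper names are made up for illustration; the talk computed correlations over the full flight data set in the database. The last line checks that 30 parameters give the 435 unique pairs quoted above.

```python
# Sketch: Pearson correlation for every unique parameter pair, then bucketing
# with the slide's +/-0.8 thresholds.
# ASSUMPTIONS: the three toy columns are made up for illustration.
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bucket(r, threshold=0.8):
    """Classify a correlation as strong positive, strong negative or weak."""
    if r > threshold:
        return "strong positive"
    if r < -threshold:
        return "strong negative"
    return "weak"


# Toy data: pressure falls as altitude rises; 'noise' is unrelated.
data = {
    "alt":   [0, 10, 20, 30, 40, 50],
    "p2":    [14.7, 10.1, 6.8, 4.4, 2.7, 1.7],
    "noise": [1, -1, 1, -1, 1, -1],
}
pairs = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}
for pair, r in pairs.items():
    print(pair, round(r, 3), bucket(r))

# 30 parameters give C(30, 2) unique pairs.
print(math.comb(30, 2))  # 435
```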
Top Correlated Parameter Pairs
Negatively Correlated
Parameter 1   Parameter 2   Correlation
p2            alt           -0.985
t2            alt           -0.974
p15           alt           -0.972
w31           alt           -0.931
w32           alt           -0.931
Positively Correlated
Parameter 1   Parameter 2   Correlation
nc            htbleed       0.999
t30           nc            0.999
t30           htbleed       0.999
ps30          p30           0.999
w31           w32           0.999
Legend: p2 = pressure at fan inlet; t2 = total temperature at fan inlet; p15 = total pressure in bypass-duct; w31 = HPT coolant bleed; w32 = LPT coolant bleed; nc = physical core speed; htbleed = bleed enthalpy; t30 = total temperature at HPC outlet; ps30 = static pressure at HPC outlet; p30 = total pressure at HPC outlet
Top Negatively Correlated Sensors
• Potential analysis: calculating correlations at a regime level may reveal anomalies
Top Positively Correlated Sensors
• Potential analysis: calculating correlations at a regime level may reveal anomalies
Correlation Between Altitude and P2
[Figure: correlation between altitude and P2 by flight ID]
Clustering Flights: Insights on engine degradation and end of life
K-Means Clustering Algorithm
Objective: group flights based on their parameter time series
• Time series for single sensor data
• Extract summary statistics for all phases
• Feature reduction using VIF
• Cluster using the k-means algorithm in MADlib with summary statistics as feature vector
• Repeat process for 29 parameters
Source: http://www.naftaliharris.com/
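To make the clustering step concrete, here is a minimal sketch of Lloyd's k-means on per-flight feature vectors. The talk used MADlib's in-database k-means; this plain-Python stand-in, the toy two-dimensional features, the names `healthy` and `degraded`, and `k=2` are assumptions for illustration. The VIF feature-reduction step is not shown.

```python
# Sketch: Lloyd's k-means on per-flight feature vectors.
# ASSUMPTIONS: the talk used MADlib's in-database k-means; this plain-Python
# version, the toy 2-D features and k=2 are illustrative stand-ins.
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignments) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Toy feature vectors (e.g. mean and stddev of one parameter in one phase).
healthy = [(1.0 + 0.01 * i, 0.10) for i in range(5)]
degraded = [(0.5 + 0.01 * i, 0.30) for i in range(5)]
centroids, labels = kmeans(healthy + degraded, k=2)
print(labels)
```

In this toy setup the two groups are well separated, so the first five and last five flights end up in different clusters regardless of which points seed the centroids.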
Flights in Cluster 4 Indicate Engine’s End of Life
[Figure: SmFan time series features and clustering results]
Classification-Based Similarity Metric: Understanding similarities between flights
Classification-Based Distance Metric
• Binary classification methods build models to differentiate between two groups using available attributes
  – Algorithms allow us to use the optimal subset of attributes to differentiate classes (feature selection)
  – Ability to differentiate becomes a proxy for dissimilarity
[Figure: Class 1, Class 2, Class 3]
• Classes differentiated by size and color: model accuracy HIGH, able to predict class
• Classes that are indistinguishable: model accuracy LOW, unable to predict classes
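The idea of using classifier accuracy as a dissimilarity score can be sketched as follows. Everything specific here is an assumption made for illustration: a one-feature logistic regression trained by plain gradient descent, synthetic Gaussian data, and an even/odd train/test split. It is not the pipeline used in the talk.

```python
# Sketch: classification accuracy as a dissimilarity score between two groups.
# ASSUMPTIONS: a one-feature logistic regression trained by plain gradient
# descent, synthetic Gaussian features and an even/odd train/test split are
# illustrative choices, not the pipeline used in the talk.
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def accuracy_score(group_a, group_b, epochs=200, lr=0.1):
    """Train on half the samples to tell the groups apart; return test accuracy."""
    data = [(x, 0) for x in group_a] + [(x, 1) for x in group_b]
    train, test = data[0::2], data[1::2]
    center = sum(x for x, _ in train) / len(train)  # centre on training data
    w = b = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in train:
            err = sigmoid(w * (x - center) + b) - y
            grad_w += err * (x - center)
            grad_b += err
        w -= lr * grad_w / len(train)
        b -= lr * grad_b / len(train)
    correct = sum((w * (x - center) + b >= 0.0) == (y == 1) for x, y in test)
    return correct / len(test)


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
similar = [rng.gauss(0.0, 1.0) for _ in range(200)]   # same distribution
shifted = [rng.gauss(5.0, 1.0) for _ in range(200)]   # clearly different

print(accuracy_score(reference, similar))  # should sit near chance level
print(accuracy_score(reference, shifted))  # should be close to 1
```

Accuracy near chance means the classifier cannot tell the two groups apart (similar); accuracy near 1 means they are easy to separate (dissimilar).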
Classification-Based Flight Similarity Metric
• For a given pre-takeoff phase
  – Create a non-overlapping set of all 5-second windows
  – Extract features
    ▪ Summary statistic (402) for each parameter in the time-window
    ▪ Correlations between all pairs of parameters in the time-window (used for propulsion data only)
[Diagram: for Engine 1, a classifier is trained for each flight pair (Flight 1, Flight 2), (Flight 1, Flight 3), …, (Flight m, Flight n), each yielding a classification accuracy score]
Expected Results
• 745,281 total models built
  – For each flight, a classifier is trained against each other flight for the same engine
  – Modeling run-time ~11 min on 128-segment cluster
• As engines begin to degrade, adjacent flights should be similar (low accuracy)
  – Model accuracy HIGH: able to distinguish flights that occur after degradation
  – Model accuracy LOW: unable to distinguish adjacent flights (little difference)
[Figure: model accuracy by flight number, relative to the reference flight]
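A sketch of how the per-engine flight pairs could be enumerated, one classifier per pair. The engine IDs and flight counts are made up, and treating pairs as unordered is an assumption; the slide does not say how its total of 745,281 models was counted.

```python
# Sketch: enumerate the flight pairs that each need their own classifier.
# ASSUMPTIONS: the engine IDs and flight counts below are made up, and pairs
# are treated as unordered; the slide does not say how its total was counted.
from itertools import combinations

flights_per_engine = {"engine_1": 4, "engine_2": 3}  # hypothetical counts


def flight_pairs(flights_per_engine):
    """Yield (engine, flight_i, flight_j) for every pair within one engine."""
    for engine, n_flights in flights_per_engine.items():
        for i, j in combinations(range(1, n_flights + 1), 2):
            yield engine, i, j


pairs = list(flight_pairs(flights_per_engine))
print(len(pairs))  # 4*3/2 + 3*2/2 = 9
print(pairs[:3])
```

Each pair is independent of the others, which is what makes the model-building step embarrassingly parallel across segments.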
Engine 1 results
• Model accuracy HIGH: able to distinguish flights that occur before and after degradation
• Model accuracy LOW: unable to distinguish adjacent flights (little difference)
[Figure: Engine 1 results, with the reference flight marked]
Logistic Regression Results
• Earlier flights are more similar to each other
• Earlier flights are more dissimilar to later flights
• Flights up to the 50th are similar to each other
• Flights after the 50th are similar only to neighboring flights and start to differ from earlier flights
• Indicates change/degradation over time
[Figure legend: similar to dissimilar]
Examining Engine Degradation Over Time
• Summary statistics over flights provide insights into degradation patterns
  – Median/mean accuracies over PRECEDING flights indicate what degradation occurred since the engine start
  – Observations over adjacent windows may be of interest
  – Detecting anomalies
[Figure: model accuracy by flight number for two reference flights]
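The "median accuracy over preceding flights" summary can be sketched like this. The `accuracy` lookup of pairwise scores for one engine and all of its values are hypothetical, as is the helper name.

```python
# Sketch: median accuracy of each flight against all of its preceding flights.
# ASSUMPTIONS: `accuracy[(i, j)]` is a hypothetical lookup of pairwise
# classification accuracy for flights i < j of one engine; values are made up.
from statistics import median

accuracy = {
    (1, 2): 0.52, (1, 3): 0.55, (1, 4): 0.90,
    (2, 3): 0.51, (2, 4): 0.88,
    (3, 4): 0.85,
}


def median_to_preceding(accuracy, n_flights):
    """Map flight j -> median accuracy over pairs (i, j) with i < j."""
    return {
        j: median(accuracy[(i, j)] for i in range(1, j))
        for j in range(2, n_flights + 1)
    }


scores = median_to_preceding(accuracy, n_flights=4)
print(scores)
```

In this made-up example the score for flight 4 jumps well above those of flights 2 and 3, which is the kind of change the summary statistic is meant to surface.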
Engine Health and Engine Classification-based Similarity
• Median accuracy score of a flight to prior flights increases as engine health decreases
• Abrupt changes in engine health can be found using future flights (to find an inflection)
Accuracy Scores Show both Time and Degradation
• With many more flights median accuracy increases
• Degradation in engine causes median accuracy to drop faster
Example of Fault: engine 32
• Flight before the fault occurs
• Average score of flights before the fault flight is slightly higher
• Flight after the fault: more flights with score > 0.8
Classification-Based Similarity Changes at Faults
[Figure: HPT fault]
[Figure: LPC fault, low engine health change still detected]
Engine Health and Median Accuracy Correlations
[Figure: HPT fault flights]
[Figure panels: fan, HPC, HPT, LPC, LPT]
What Did We Learn? What is next?
• Through technology, data exploration and feature generation become easier
  – What we learned: rapidly transforming large volumes of sensor data
  – What’s next: time series analysis, interpolation on missing data
• Experimentation with building models to predict engine decay and faults
  – What we learned: unsupervised techniques for clustering and distance metrics enable us to discover signals of decay
  – What’s next: supervised approaches to detect known faults
Opportunities in the Digital Brain
• Connected cars
• Personalized medicine
• Smart meters
• Security
• Predictive maintenance
• Sport tracking
• Optimization and efficiency
Appendix
• Propulsion dataset can be downloaded at: https://c3.nasa.gov/dashlink/resources/140/