handling data gaps in time series using imputation...for k-nearest neighbor imputation, the missing...
TRANSCRIPT
-
H A N D L I N G D A T A G A P S I N T I M E S E R I E S
USING IMPUTATION
PRES
ENTE
D BY
CLARE JEON
ALF WHITEHEAD
-
HELLO!
CLARE JEON
ALF WHITEHEAD
-
1 What is time-series dataset & how do we use the data?
2 Missing values and imputation - What can we do about that missing data?
3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon
C O N T E N T S
-
1 What is time-series dataset & how do we use the data?
2 Missing values and imputation - What can we do about that missing data?
3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon
C O N T E N T S
-
T I M E S E R I E S D A T A
https://www.kaggle.com/team-ai/bitcoin-price-prediction/version/1
Time series data is:• A sequence of values of variables• Measures the same thing over time and stores them in time order
-
T I M E S E R I E S A R E E V E R Y W H E R E I N S I G N A L P R O C E S S I N G
y = f(t)
f(t) is implicitly defined everywhere, but the data is only sampled.
A common error is to use time series to track discrete events (e.g. events on a purchasing journey). This isn’t helpful because there is no “value” between the samples - use a sequence representation instead.
Interest of hockey over time
-
Our study system: Glucose (Pathogenesis of Diabetes Mellitus)
DOI: 10.1109/TBME.2015.2512276
F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S
Daily Change of Glucose Level
-
DOI: 10.1109/TBME.2015.2512276
F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S
-
T Y P E 1 D I A B E T E S
Insulin therapy is the only way for patients.
Accurate blood glucose level prediction can help to figure out when insulin should be injected.
-
DOI: 10.2210/rcsb_pdb/GH/DM/monitoring/complications
F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S
-
M A N A G E T Y P E 1 D I A B E T E S
Photo courtesy: U.S. Food and Drug Administration
CGM = Continuous Glucose Monitoring
-
R E A L W O R L D M I S S I N G D A T A I N T Y P E 1 D I A B E T E S P A T I E N T S
Faulty sensors, incomplete records, mistakes in data collection, unavailability to report certain information (at a given time point) and others...
Continuous Glucose Monitoring device is not perfect!
-
1 What is time-series dataset & how do we use the data?
2 Missing values and imputation - What can we do about that missing data?
3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon
C O N T E N T S
-
T H R E E Q U E S T I O N S
1 Which imputation methods are available?
2 How do we use them?
3 How do we evaluate imputation methods?
-
Q 1 . I M P U T A T I O N M E T H O D S F O R T I M E S E R I E S D A T A
UNIVARIATE TIME SERIES IMPUTATION
Mean (Median)
Last Observation Carried Forward
Linear Interpolation
Polynomial Interpolation
Kalman Filter
Moving Average
Random
MULTIVARIATE TIME SERIES IMPUTATION
K-Nearest Neighbors
Random Forest
Multiple Singular Spectral Analysis
Expectation-Maximization
Multiple Imputation with Chained Equations
-
Q 1 . I M P U T A T I O N M E T H O D S F O R T I M E S E R I E S D A T A
Canonical ML/DL modeling Use both past and future values
Past (head) Future (tail)
Real-time prediction (Online mode) Use only past value
Past (head)
Linear InterpolationPolynomial Interpolation
Kalman Smoothing
Moving AverageK-Nearest Neighbors
Random ForestMultiple Singular Spectral Analysis
Expectation Maximisation
Multiple Imputation with Chained Equations
Last Observed Carried ForwardMean (Median)
RandomEmpirical Based
-
Q 2 . H O W D O W E U S E I M P U T A T I O N M E T H O D S
Today’s Sample Data: 24 hours Blood glucose levels
-
S E T U P
PYTHON
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from fancyimpute import knn
import rpy2.robjects.packages import importr
imputeTS = importr(“imputeTS”)
kalman_StructTs = robjects.r['na.kalman’]
kalman_auto_arima = robjects.r['na.kalman']
df = pd.read_excel('glucose.xlsx')
df.set_index('Minute_of_Day', inplace=True)
missing_minutes = \ list(df[df['Inferred_Glucose'].isna()].index)
R
library (xlsx)
library (imputeTS)
library (pracm)
library(spatialEco)
library (VIM)
df
-
“ T R A D I T I O N A L M E T H O D ”
Remove entire missing values.
When there are small fraction of missing values.
This is valid if the data is missing completely at random.
PYTHON
df[...].dropna()
R
df
-
L A S T O B S E R V A T I O N C A R R I E D F O R W A R D
Last observed value of a variable is used for all subsequent.
Can be either conservative or liberal depending on the context.
PYTHON
df[...].fillna(method = 'ffill')
R
imputeTS::na_locf(x)(x, option = ‘locf’)
-
M E A N V A L U E
On average, the signal’s value is the mean.
PYTHON
df[...].fillna(df[...].mean())
R
imputeTS::na_mean(x, option = “mean”)
-
L I N E A R I N T E R P O L A T I O N
What the graphing libraries do by default - draw a line connecting the two points at either end.
Can you do this while data is being collected? Nope.
PYTHON
df[...].interpolate(method='linear')
R
imputeTS::na_interpolation(x, option = “linear”)
-
N E A R E S T N E I G H B O R
We can use the measured point nearest to the missing sample as the estimate.
PYTHON
df[...].interpolate(method='nearest')
R
pracma::interpl(x, y, method = ‘nearest’)
-
P O L Y N O M I A L I N T E R P O L A T I O N
What if we take into account the trajectory of the signal at loss and reacquisition? We could fit a curve using
the derivatives, which would be a polynomial.
Stineman is a special type of polynomial fit.
PYTHON
df[...].interpolate(method='polynomial', order=2)
df[...].interpolate(method='polynomial', order=3)
R
spatialEco::poly.regression(df$Inferred_Glucose, s = 0.2, impute = TRUE, na.only = TRUE)
-
S P L I N E I N T E R P O L A T I O N
Splines are an alternate way to fit “curvy lines” to data. They assume the use of “knots,” or fixed points, around
which a semi-rigid curve is bent. This comes from pencil-and-paper drafting techniques.
PYTHON
df[...].interpolate(method='spline', order=2)
df[...].interpolate(method='spline', order=3)
R
imputeTS::na_interpolation(x, option = “spline”)
-
M O V I N G A V E R A G E I M P U T A T I O N
Missing values are replaced by moving average values. The mean in this implementation taken from an equal
number of observations on either side of a central value.
PYTHON
df[...].rolling(4).mean()
R
imputeTS::na_ma(df[...], k = 4)
-
K A L M A N S M O O T H I N G I M P U T A T I O N
Filling missing values by estimating a joint probability distribution over the variables for each timeframe.
PYTHON
kalman_StructTs(df[...], model = "StructTS")
kalman_StructTs(df[...], model = "auto.arima")
R
imputeTS::na_kalman(df[...],model = 'StructTS')
imputeTS::na_kalman(df[...],model = 'auto.arima')
-
K N N I M P U T A T I O N
For k-Nearest Neighbor imputation, the missing values are based on a kNN algorithm. In this method, k
neighbors are chosen based on some distance measure and their average is used as an imputation
estimate.
PYTHON
KNN(k = 5).fit_transform(df)[:, 0]
R
VIM::kNN(df[...], variable = ‘Inferred_Glucose’, k = 5)
-
Assumption: Patients have regular life schedule
E M P I R I C A L B A S E D I M P U T A T I O N
-
Q 3 . H O W D O W E E V A L U A T E I M P U T A T I O N M E T H O D S
Geometric mean of the ranks of imputation methods determined by MAE, RMSE and R2
Generalize performance evaluation
1 Mean Absolute Error (MAE)
2 Root Mean Squared Error (RMSE)
3 Pearson’s Correlation Coefficient (R2)
4 Rank Product
-
1 What is time-series dataset & how do we use the data?
2 Missing values and imputation - What can we do about that missing data?
3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon
C O N T E N T S
-
P R E D I C T I N G B L O O D G L U C O S E L E V E L S O F D I A B E T E S P A T I E N T SE X P E R I M E N T S I N D A T A I M P U T A T I O N
-
B U I L D I N G P R E D I C T I V E M O D E L S F O R B L O O D G L U C O S E L E V E L
-
B U I L D I N G P R E D I C T I V E M O D E L S F O R B L O O D G L U C O S E L E V E L
-
C O M P A R I S O N O F M E T H O D S F O R M I S S I N G D A T A I M P U T A T I O N
-
T H E E F F E C T O F M I S S I N G D A T A I M P U T A T I O N O N B L O O D G L U C O S E L E V E L P R E D I C T I O N
-
E N S E M B L E M O D E L I N G
-
E N S E M B L E M O D E L I N G
-
1. We’ve shown 11 approaches for missing data that you can chose from
2. Important questions to ask:Is your dataset univariate or multivariate?Do you need the ability to make real-time predictions with missing data?
3. Pick error metric(s) and test the appropriate imputation methods using those metrics
4. For best results, ensemble the best-performing methods
F O R Y O U R W O R K
-
K L I C K – W H O W E A R E
-
T H A N K Y O U .
PRES
ENTE
D BY
CLARE JEON
ALF WHITEHEAD
https://github.com/KlickInc/datasci-strata-talk-missing-data
https://github.com/KlickInc/datasci-strata-talk-missing-data
-
Rate today’s session
Session page on conference website O’Reilly Events App