
HANDLING DATA GAPS IN TIME SERIES

USING IMPUTATION

PRESENTED BY

CLARE JEON

ALF WHITEHEAD

HELLO!

CLARE JEON

ALF WHITEHEAD

CONTENTS

1 What is a time-series dataset and how do we use the data?

2 Missing values and imputation: what can we do about missing data?

3 Real-world example: predicting blood glucose levels of Type 1 diabetes patients at a 30-minute horizon

TIME SERIES DATA

https://www.kaggle.com/team-ai/bitcoin-price-prediction/version/1

Time series data is:

• A sequence of values of variables

• Measurements of the same thing over time, stored in time order

TIME SERIES ARE EVERYWHERE IN SIGNAL PROCESSING

y = f(t)

f(t) is implicitly defined everywhere, but the data is only sampled.

A common error is to use time series to track discrete events (e.g. events on a purchasing journey). This isn’t helpful because there is no “value” between the samples - use a sequence representation instead.

Interest in hockey over time

Our study system: Glucose (Pathogenesis of Diabetes Mellitus)

DOI: 10.1109/TBME.2015.2512276

FORECASTING IN HEALTHCARE: GLUCOSE IN TYPE 1 DIABETES

Daily Change of Glucose Level

DOI: 10.1109/TBME.2015.2512276


TYPE 1 DIABETES

Insulin therapy is the only treatment option for patients.

Accurate blood glucose level prediction can help determine when insulin should be injected.

DOI: 10.2210/rcsb_pdb/GH/DM/monitoring/complications


MANAGE TYPE 1 DIABETES

Photo courtesy: U.S. Food and Drug Administration

CGM = Continuous Glucose Monitoring

REAL WORLD MISSING DATA IN TYPE 1 DIABETES PATIENTS

Faulty sensors, incomplete records, mistakes in data collection, inability to report certain information (at a given time point) and others...

Continuous Glucose Monitoring devices are not perfect!





THREE QUESTIONS

1 Which imputation methods are available?

2 How do we use them?

3 How do we evaluate imputation methods?

Q1. IMPUTATION METHODS FOR TIME SERIES DATA

UNIVARIATE TIME SERIES IMPUTATION

Mean (Median)

Last Observation Carried Forward

Linear Interpolation

Polynomial Interpolation

Kalman Filter

Moving Average

Random

MULTIVARIATE TIME SERIES IMPUTATION

K-Nearest Neighbors

Random Forest

Multiple Singular Spectral Analysis

Expectation-Maximization

Multiple Imputation with Chained Equations

Q1. IMPUTATION METHODS FOR TIME SERIES DATA

Canonical ML/DL modeling: use both past (head) and future (tail) values

Linear Interpolation

Polynomial Interpolation

Kalman Smoothing

Moving Average

K-Nearest Neighbors

Random Forest

Multiple Singular Spectral Analysis

Expectation-Maximization

Multiple Imputation with Chained Equations

Real-time prediction (online mode): use only past (head) values

Last Observation Carried Forward

Mean (Median)

Random

Empirical Based

Q2. HOW DO WE USE IMPUTATION METHODS

Today’s sample data: 24 hours of blood glucose levels

SETUP

PYTHON

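import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from fancyimpute import KNN
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# Load the R package imputeTS so its functions can be called from Python
imputeTS = importr("imputeTS")

# Both handles point to the same R function; the model is chosen per call
kalman_StructTs = robjects.r['na_kalman']
kalman_auto_arima = robjects.r['na_kalman']

# Glucose readings indexed by minute of day
df = pd.read_excel('glucose.xlsx')
df.set_index('Minute_of_Day', inplace=True)

# Minutes at which the glucose value is missing
missing_minutes = list(df[df['Inferred_Glucose'].isna()].index)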

R

library(xlsx)
library(imputeTS)
library(pracma)
library(spatialEco)
library(VIM)

df

“TRADITIONAL METHOD”

Remove every record that has a missing value.

Appropriate when only a small fraction of the values is missing.

This is valid if the data is missing completely at random.

PYTHON

df[...].dropna()

R

df
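A minimal sketch of what deletion does to a time series, assuming a toy frame with made-up glucose values (only the column and index names come from the setup slide). It shows that the remaining samples are no longer evenly spaced, which matters for any model that expects a fixed sampling interval.

```python
import numpy as np
import pandas as pd

# Hypothetical readings every 5 minutes; NaN marks a gap.
df = pd.DataFrame(
    {"Inferred_Glucose": [110.0, 112.0, np.nan, np.nan, 120.0, 118.0]},
    index=pd.Index([0, 5, 10, 15, 20, 25], name="Minute_of_Day"),
)

# Drop every row whose glucose value is missing.
complete = df.dropna(subset=["Inferred_Glucose"])

# The index now jumps from minute 5 straight to minute 20.
print(complete.index.tolist())
```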

LAST OBSERVATION CARRIED FORWARD

The last observed value of a variable is used for all subsequent missing values.

Can be either conservative or liberal depending on the context.

PYTHON

df[...].ffill()

R

imputeTS::na_locf(x, option = "locf")
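A small sketch of last observation carried forward on a made-up series, assuming pandas is available. It also shows one limitation worth knowing: a gap at the very start has no earlier observation to carry, so it stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and an interior gap.
s = pd.Series([np.nan, 110.0, np.nan, np.nan, 120.0])

# Carry the most recent observed value forward into each gap.
locf = s.ffill()

print(locf.tolist())
```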

MEAN VALUE

Each missing value is replaced with the mean of the observed values, that is, the signal’s average level.

PYTHON

df[...].fillna(df[...].mean())

R

imputeTS::na_mean(x, option = "mean")
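A short sketch of mean imputation on made-up values. The second variant is an assumption about how one might adapt it to online use, not something taken from the slides: it fills each gap with the mean of the values observed so far, so it never looks into the future.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps.
s = pd.Series([100.0, np.nan, 120.0, np.nan, 140.0])

# Offline: fill with the mean of all observed values.
offline = s.fillna(s.mean())

# Online variant (assumption): fill with the running mean of past observations.
online = s.fillna(s.expanding().mean())

print(offline.tolist())
print(online.tolist())
```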

LINEAR INTERPOLATION

This is what graphing libraries do by default: draw a straight line connecting the two points at either end of the gap.

Can you do this while data is being collected? Nope: the line needs the first value after the gap.

PYTHON

df[...].interpolate(method='linear')

R

imputeTS::na_interpolation(x, option = "linear")
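A minimal sketch of linear interpolation across a two-sample gap, using made-up values. With `method='linear'` pandas treats the samples as equally spaced, so the gap is split into equal steps between its two neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical series with an interior gap of two samples.
s = pd.Series([100.0, np.nan, np.nan, 130.0, 128.0])

# Fill the gap along the straight line joining 100 and 130.
filled = s.interpolate(method="linear")

print(filled.tolist())
```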

NEAREST NEIGHBOR

We can use the measured point nearest to the missing sample as the estimate.

PYTHON

df[...].interpolate(method='nearest')

R

pracma::interp1(x, y, method = "nearest")
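A sketch of nearest-neighbor filling that needs only pandas, assuming a made-up series indexed by minute. `interpolate(method='nearest')` delegates to SciPy, so this version instead reindexes the observed points back onto the full index and lets pandas pick the closest observed timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical readings at minutes 0, 5, 10, 15 with a gap in the middle.
s = pd.Series([100.0, np.nan, np.nan, 140.0], index=[0, 5, 10, 15])

# Keep only the observed points, then look up the nearest one for every index value.
observed = s.dropna()
nearest = observed.reindex(s.index, method="nearest")

print(nearest.tolist())
```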

POLYNOMIAL INTERPOLATION

What if we take into account the trajectory of the signal at loss and reacquisition? We could fit a curve using the derivatives, which would be a polynomial.

Stineman is a special type of polynomial fit.

PYTHON

df[...].interpolate(method='polynomial', order=2)

df[...].interpolate(method='polynomial', order=3)

R

spatialEco::poly.regression(df$Inferred_Glucose, s = 0.2, impute = TRUE, na.only = TRUE)
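To illustrate why curvature matters, here is a sketch that fits one quadratic through three observed samples with NumPy and evaluates it inside the gap. This is a simplified stand-in, not the piecewise scheme that pandas or spatialEco use, and the data (samples of t squared) are made up so the answer is easy to check against a straight-line fill.

```python
import numpy as np

t = np.arange(5.0)                               # sample times 0..4
y = np.array([0.0, 1.0, np.nan, np.nan, 16.0])   # samples of t**2 with a gap
observed = ~np.isnan(y)

# Three observed points determine a unique degree-2 polynomial.
coeffs = np.polyfit(t[observed], y[observed], deg=2)
poly_filled = y.copy()
poly_filled[~observed] = np.polyval(coeffs, t[~observed])

# Straight-line fill across the same gap, for comparison.
linear_filled = np.interp(t, t[observed], y[observed])

print(poly_filled)
print(linear_filled)
```

On this curved signal the quadratic recovers the missing samples, while the straight line overshoots them.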

SPLINE INTERPOLATION

Splines are an alternate way to fit “curvy lines” to data. They assume the use of “knots,” or fixed points, around which a semi-rigid curve is bent. This comes from pencil-and-paper drafting techniques.

PYTHON

df[...].interpolate(method='spline', order=2)

df[...].interpolate(method='spline', order=3)

R

imputeTS::na_interpolation(x, option = "spline")

MOVING AVERAGE IMPUTATION

Missing values are replaced by moving average values. The mean in this implementation is taken from an equal number of observations on either side of a central value.

PYTHON

df[...].fillna(df[...].rolling(window=9, center=True, min_periods=1).mean())

R

imputeTS::na_ma(df[...], k = 4)
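A sketch of centered moving-average imputation on made-up values, assuming a half-width `k` (a name chosen here for illustration) so the window spans `k` samples on each side of the gap. It also shows why a plain trailing `rolling(...).mean()` is not enough on its own: with the default `min_periods`, any window that contains a missing value returns a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing sample in the middle.
s = pd.Series([10.0, 12.0, 14.0, np.nan, 18.0, 20.0, 22.0])
k = 2  # samples taken on each side of the central value

# Centered window of 2k + 1 samples; NaNs inside the window are skipped.
centered_mean = s.rolling(window=2 * k + 1, center=True, min_periods=1).mean()

# Only the missing positions are replaced; observed values are kept.
filled = s.fillna(centered_mean)

# A trailing window with default min_periods still leaves the gap unfilled.
trailing = s.rolling(2 * k).mean()

print(filled.tolist())
```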

KALMAN SMOOTHING IMPUTATION

Fills missing values by estimating a joint probability distribution over the variables at each time step.

PYTHON

kalman_StructTs(df[...], model = "StructTS")

kalman_auto_arima(df[...], model = "auto.arima")

R

imputeTS::na_kalman(df[...], model = 'StructTS')

imputeTS::na_kalman(df[...], model = 'auto.arima')
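To make the idea concrete without the R bridge, here is a NumPy-only sketch of Kalman smoothing for a local-level (random walk plus noise) model. It is not what imputeTS does internally: imputeTS estimates a structural or ARIMA model from the data, whereas this sketch assumes fixed, hand-picked noise variances `q` and `r`. The function name `kalman_smooth_impute` is hypothetical.

```python
import numpy as np

def kalman_smooth_impute(y, q=1.0, r=1.0):
    """Local-level Kalman filter plus RTS smoother; NaNs in y are treated as missing.

    q: assumed process-noise variance, r: assumed measurement-noise variance.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    missing = np.isnan(y)
    x_pred, p_pred = np.zeros(n), np.zeros(n)   # one-step-ahead predictions
    x_filt, p_filt = np.zeros(n), np.zeros(n)   # filtered estimates

    # Start at the first observed value with a very uncertain prior.
    x, p = y[~missing][0], 1e6
    for t in range(n):
        if t > 0:
            p = p + q                  # predict: random walk keeps x, grows variance
        x_pred[t], p_pred[t] = x, p
        if not missing[t]:             # update only when a measurement exists
            gain = p / (p + r)
            x = x + gain * (y[t] - x)
            p = (1.0 - gain) * p
        x_filt[t], p_filt[t] = x, p

    # Backward (Rauch-Tung-Striebel) pass lets future samples inform the gap.
    x_smooth = x_filt.copy()
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])

    out = y.copy()
    out[missing] = x_smooth[missing]   # keep observed values, fill only the gaps
    return out

# Hypothetical series with a two-sample gap.
imputed = kalman_smooth_impute([100.0, np.nan, np.nan, 130.0])
print(imputed)
```

The forward pass alone would hold the gap at the last estimate; the backward pass is what pulls the filled values toward the sample observed after the gap, which is why smoothing cannot be used in online mode.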

KNN IMPUTATION

For k-Nearest Neighbor imputation, missing values are estimated with a kNN algorithm. In this method, k neighbors are chosen based on some distance measure and their average is used as the imputation estimate.

PYTHON

KNN(k = 5).fit_transform(df)[:, 0]

R

VIM::kNN(df[...], variable = "Inferred_Glucose", k = 5)
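The kNN idea can be sketched in a few lines of NumPy. This is an illustration only, not the fancyimpute or VIM implementation: the helper `knn_impute` is hypothetical, it uses Euclidean distance on the other columns, it assumes those columns have no gaps, and the data are made up (with `Minute_of_Day` used as an ordinary feature column).

```python
import numpy as np
import pandas as pd

def knn_impute(df, target, k=2):
    """Fill NaNs in `target` with the mean of the k rows nearest in the other columns."""
    features = df.drop(columns=[target]).to_numpy(dtype=float)
    values = df[target].to_numpy(dtype=float)
    observed = ~np.isnan(values)
    filled = values.copy()
    for i in np.flatnonzero(~observed):
        # Distance from the incomplete row to every row with an observed target.
        dist = np.linalg.norm(features[observed] - features[i], axis=1)
        nearest = np.argsort(dist)[:k]
        filled[i] = values[observed][nearest].mean()
    return pd.Series(filled, index=df.index, name=target)

# Hypothetical data: glucose plus one fully observed feature.
df = pd.DataFrame({
    "Inferred_Glucose": [100.0, 104.0, np.nan, 150.0, 154.0],
    "Minute_of_Day": [0.0, 5.0, 10.0, 600.0, 605.0],
})

imputed = knn_impute(df, "Inferred_Glucose", k=2)
print(imputed.tolist())
```

The missing reading at minute 10 is filled from the two readings closest in time (minutes 0 and 5), not from the distant ones.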

EMPIRICAL BASED IMPUTATION

Assumption: patients follow a regular daily schedule.

Q3. HOW DO WE EVALUATE IMPUTATION METHODS

1 Mean Absolute Error (MAE)

2 Root Mean Squared Error (RMSE)

3 Pearson’s Correlation Coefficient (R2)

4 Rank Product: the geometric mean of the ranks of the imputation methods as determined by MAE, RMSE and R2, used to generalize the performance evaluation across metrics
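The four measures can be sketched as follows. The true values and the three sets of imputed values are made up for illustration, and R2 is computed here as the squared Pearson correlation, which is an assumption about the intended definition. Lower is better for MAE and RMSE, higher is better for R2, and the rank product is the geometric mean of a method's three ranks.

```python
import numpy as np
import pandas as pd

def mae(truth, est):
    return float(np.mean(np.abs(truth - est)))

def rmse(truth, est):
    return float(np.sqrt(np.mean((truth - est) ** 2)))

def r2(truth, est):
    # Squared Pearson correlation (assumed definition of R2).
    return float(np.corrcoef(truth, est)[0, 1] ** 2)

# Hypothetical held-out true values and the estimates from three methods.
truth = np.array([100.0, 110.0, 120.0, 130.0])
estimates = {
    "linear": np.array([101.0, 109.0, 121.0, 129.0]),
    "locf": np.array([95.0, 95.0, 125.0, 125.0]),
    "mean": np.array([115.0, 115.0, 115.0, 116.0]),
}

scores = pd.DataFrame({
    name: {"MAE": mae(truth, est), "RMSE": rmse(truth, est), "R2": r2(truth, est)}
    for name, est in estimates.items()
}).T

# Rank 1 is best: smallest error, largest correlation.
ranks = pd.DataFrame({
    "MAE": scores["MAE"].rank(),
    "RMSE": scores["RMSE"].rank(),
    "R2": scores["R2"].rank(ascending=False),
})

# Rank product: geometric mean of the ranks across the three metrics.
rank_product = np.exp(np.log(ranks).mean(axis=1))
best = rank_product.idxmin()

print(scores.round(3))
print(rank_product.sort_values())
```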





PREDICTING BLOOD GLUCOSE LEVELS OF DIABETES PATIENTS

EXPERIMENTS IN DATA IMPUTATION

BUILDING PREDICTIVE MODELS FOR BLOOD GLUCOSE LEVEL

COMPARISON OF METHODS FOR MISSING DATA IMPUTATION

THE EFFECT OF MISSING DATA IMPUTATION ON BLOOD GLUCOSE LEVEL PREDICTION

ENSEMBLE MODELING

FOR YOUR WORK

1. We’ve shown 11 approaches for handling missing data that you can choose from

2. Important questions to ask:

• Is your dataset univariate or multivariate?

• Do you need the ability to make real-time predictions with missing data?

3. Pick error metric(s) and test the appropriate imputation methods using those metrics

4. For best results, ensemble the best-performing methods

KLICK – WHO WE ARE

THANK YOU.

PRESENTED BY

CLARE JEON

ALF WHITEHEAD

https://github.com/KlickInc/datasci-strata-talk-missing-data

