handling data gaps in time series using imputation...for k-nearest neighbor imputation, the missing...

42
HANDLING DATA GAPS IN TIME SERIES USING IMPUTATION PRESENTED BY CLARE JEON ALF WHITEHEAD

Upload: others

Post on 27-Jan-2021

23 views

Category:

Documents


0 download

TRANSCRIPT

  • H A N D L I N G D A T A G A P S I N T I M E S E R I E S

    USING IMPUTATION

    PRES

    ENTE

    D BY

    CLARE JEON

    ALF WHITEHEAD

  • HELLO!

    CLARE JEON

    ALF WHITEHEAD

  • 1 What is time-series dataset & how do we use the data?

    2 Missing values and imputation - What can we do about that missing data?

    3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon

    C O N T E N T S

  • 1 What is time-series dataset & how do we use the data?

    2 Missing values and imputation - What can we do about that missing data?

    3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon

    C O N T E N T S

  • T I M E S E R I E S D A T A

    https://www.kaggle.com/team-ai/bitcoin-price-prediction/version/1

    Time series data is:• A sequence of values of variables• Measures the same thing over time and stores them in time order

  • T I M E S E R I E S A R E E V E R Y W H E R E I N S I G N A L P R O C E S S I N G

    y = f(t)

    f(t) is implicitly defined everywhere, but the data is only sampled.

    A common error is to use time series to track discrete events (e.g. events on a purchasing journey). This isn’t helpful because there is no “value” between the samples - use a sequence representation instead.

    Interest of hockey over time

  • Our study system: Glucose (Pathogenesis of Diabetes Mellitus)

    DOI: 10.1109/TBME.2015.2512276

    F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S

    Daily Change of Glucose Level

  • DOI: 10.1109/TBME.2015.2512276

    F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S

  • T Y P E 1 D I A B E T E S

    Insulin therapy is the only way for patients.

    Accurate blood glucose level prediction can help to figure out when insulin should be injected.

  • DOI: 10.2210/rcsb_pdb/GH/DM/monitoring/complications

    F O R E C A S T I N G I N H E A L T H C A R E : G L U C O S E I N T Y P E 1 D I A B E T E S

  • M A N A G E T Y P E 1 D I A B E T E S

    Photo courtesy: U.S. Food and Drug Administration

    CGM = Continuous Glucose Monitoring

  • R E A L W O R L D M I S S I N G D A T A I N T Y P E 1 D I A B E T E S P A T I E N T S

    Faulty sensors, incomplete records, mistakes in data collection, unavailability to report certain information (at a given time point) and others...

    Continuous Glucose Monitoring device is not perfect!

  • 1 What is time-series dataset & how do we use the data?

    2 Missing values and imputation - What can we do about that missing data?

    3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon

    C O N T E N T S

  • T H R E E Q U E S T I O N S

    1 Which imputation methods are available?

    2 How do we use them?

    3 How do we evaluate imputation methods?

  • Q 1 . I M P U T A T I O N M E T H O D S F O R T I M E S E R I E S D A T A

    UNIVARIATE TIME SERIES IMPUTATION

    Mean (Median)

    Last Observation Carried Forward

    Linear Interpolation

    Polynomial Interpolation

    Kalman Filter

    Moving Average

    Random

    MULTIVARIATE TIME SERIES IMPUTATION

    K-Nearest Neighbors

    Random Forest

    Multiple Singular Spectral Analysis

    Expectation-Maximization

    Multiple Imputation with Chained Equations

  • Q 1 . I M P U T A T I O N M E T H O D S F O R T I M E S E R I E S D A T A

    Canonical ML/DL modeling Use both past and future values

    Past (head) Future (tail)

    Real-time prediction (Online mode) Use only past value

    Past (head)

    Linear InterpolationPolynomial Interpolation

    Kalman Smoothing

    Moving AverageK-Nearest Neighbors

    Random ForestMultiple Singular Spectral Analysis

    Expectation Maximisation

    Multiple Imputation with Chained Equations

    Last Observed Carried ForwardMean (Median)

    RandomEmpirical Based

  • Q 2 . H O W D O W E U S E I M P U T A T I O N M E T H O D S

    Today’s Sample Data: 24 hours Blood glucose levels

  • S E T U P

    PYTHON

    import pandas as pd

    import numpy as np

    import scipy as sp

    import matplotlib.pyplot as plt

    from fancyimpute import knn

    import rpy2.robjects.packages import importr

    imputeTS = importr(“imputeTS”)

    kalman_StructTs = robjects.r['na.kalman’]

    kalman_auto_arima = robjects.r['na.kalman']

    df = pd.read_excel('glucose.xlsx')

    df.set_index('Minute_of_Day', inplace=True)

    missing_minutes = \ list(df[df['Inferred_Glucose'].isna()].index)

    R

    library (xlsx)

    library (imputeTS)

    library (pracm)

    library(spatialEco)

    library (VIM)

    df

  • “ T R A D I T I O N A L M E T H O D ”

    Remove entire missing values.

    When there are small fraction of missing values.

    This is valid if the data is missing completely at random.

    PYTHON

    df[...].dropna()

    R

    df

  • L A S T O B S E R V A T I O N C A R R I E D F O R W A R D

    Last observed value of a variable is used for all subsequent.

    Can be either conservative or liberal depending on the context.

    PYTHON

    df[...].fillna(method = 'ffill')

    R

    imputeTS::na_locf(x)(x, option = ‘locf’)

  • M E A N V A L U E

    On average, the signal’s value is the mean.

    PYTHON

    df[...].fillna(df[...].mean())

    R

    imputeTS::na_mean(x, option = “mean”)

  • L I N E A R I N T E R P O L A T I O N

    What the graphing libraries do by default - draw a line connecting the two points at either end.

    Can you do this while data is being collected? Nope.

    PYTHON

    df[...].interpolate(method='linear')

    R

    imputeTS::na_interpolation(x, option = “linear”)

  • N E A R E S T N E I G H B O R

    We can use the measured point nearest to the missing sample as the estimate.

    PYTHON

    df[...].interpolate(method='nearest')

    R

    pracma::interpl(x, y, method = ‘nearest’)

  • P O L Y N O M I A L I N T E R P O L A T I O N

    What if we take into account the trajectory of the signal at loss and reacquisition? We could fit a curve using

    the derivatives, which would be a polynomial.

    Stineman is a special type of polynomial fit.

    PYTHON

    df[...].interpolate(method='polynomial', order=2)

    df[...].interpolate(method='polynomial', order=3)

    R

    spatialEco::poly.regression(df$Inferred_Glucose, s = 0.2, impute = TRUE, na.only = TRUE)

  • S P L I N E I N T E R P O L A T I O N

    Splines are an alternate way to fit “curvy lines” to data. They assume the use of “knots,” or fixed points, around

    which a semi-rigid curve is bent. This comes from pencil-and-paper drafting techniques.

    PYTHON

    df[...].interpolate(method='spline', order=2)

    df[...].interpolate(method='spline', order=3)

    R

    imputeTS::na_interpolation(x, option = “spline”)

  • M O V I N G A V E R A G E I M P U T A T I O N

    Missing values are replaced by moving average values. The mean in this implementation taken from an equal

    number of observations on either side of a central value.

    PYTHON

    df[...].rolling(4).mean()

    R

    imputeTS::na_ma(df[...], k = 4)

  • K A L M A N S M O O T H I N G I M P U T A T I O N

    Filling missing values by estimating a joint probability distribution over the variables for each timeframe.

    PYTHON

    kalman_StructTs(df[...], model = "StructTS")

    kalman_StructTs(df[...], model = "auto.arima")

    R

    imputeTS::na_kalman(df[...],model = 'StructTS')

    imputeTS::na_kalman(df[...],model = 'auto.arima')

  • K N N I M P U T A T I O N

    For k-Nearest Neighbor imputation, the missing values are based on a kNN algorithm. In this method, k

    neighbors are chosen based on some distance measure and their average is used as an imputation

    estimate.

    PYTHON

    KNN(k = 5).fit_transform(df)[:, 0]

    R

    VIM::kNN(df[...], variable = ‘Inferred_Glucose’, k = 5)

  • Assumption: Patients have regular life schedule

    E M P I R I C A L B A S E D I M P U T A T I O N

  • Q 3 . H O W D O W E E V A L U A T E I M P U T A T I O N M E T H O D S

    Geometric mean of the ranks of imputation methods determined by MAE, RMSE and R2

    Generalize performance evaluation

    1 Mean Absolute Error (MAE)

    2 Root Mean Squared Error (RMSE)

    3 Pearson’s Correlation Coefficient (R2)

    4 Rank Product

  • 1 What is time-series dataset & how do we use the data?

    2 Missing values and imputation - What can we do about that missing data?

    3 Real world example: Predicting blood glucose level of Type 1 Diabetes patients at 30 minutes horizon

    C O N T E N T S

  • P R E D I C T I N G B L O O D G L U C O S E L E V E L S O F D I A B E T E S P A T I E N T SE X P E R I M E N T S I N D A T A I M P U T A T I O N

  • B U I L D I N G P R E D I C T I V E M O D E L S F O R B L O O D G L U C O S E L E V E L

  • B U I L D I N G P R E D I C T I V E M O D E L S F O R B L O O D G L U C O S E L E V E L

  • C O M P A R I S O N O F M E T H O D S F O R M I S S I N G D A T A I M P U T A T I O N

  • T H E E F F E C T O F M I S S I N G D A T A I M P U T A T I O N O N B L O O D G L U C O S E L E V E L P R E D I C T I O N

  • E N S E M B L E M O D E L I N G

  • E N S E M B L E M O D E L I N G

  • 1. We’ve shown 11 approaches for missing data that you can chose from

    2. Important questions to ask:Is your dataset univariate or multivariate?Do you need the ability to make real-time predictions with missing data?

    3. Pick error metric(s) and test the appropriate imputation methods using those metrics

    4. For best results, ensemble the best-performing methods

    F O R Y O U R W O R K

  • K L I C K – W H O W E A R E

  • T H A N K Y O U .

    PRES

    ENTE

    D BY

    CLARE JEON

    ALF WHITEHEAD

    https://github.com/KlickInc/datasci-strata-talk-missing-data

    https://github.com/KlickInc/datasci-strata-talk-missing-data

  • Rate today’s session

    Session page on conference website O’Reilly Events App