GDG DevFest Seoul 2017: Codelab - Time Series Analysis for Kaggle Using TensorFlow

Taegyun Jeon (전태균) and Seunghyun Jeon (전승현)

Developer of Satrec Initiative

Time Series Analysis: Build It in TensorFlow and Take On Kaggle

Contents

● Time Series Analysis

● Introduction to Kaggle

● KaggleZeroToAll

After finishing this codelab, you will:

1. Understand time series problems!

2. Be able to solve problems on Kaggle!

3. Upload your own model to the Kaggle Leaderboard!

Time Series Analysis

● Time Series Analysis

● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN

● TensorFlow TimeSeries API (TFTS)

Time Series Data

● Stock values

● Economic variables

● Weather

● Sensor: Internet-of-Things

● Energy demand

● Signal processing

● Sales forecasting

The Problem

● Standard Supervised Learning

○ IID assumption

○ Same distribution for training and test data

○ Distributions fixed over time (stationarity)

● Time Series

○ None of these assumptions hold!

Autoregressive (AR) Models

● AR(p) model

$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$ : Linear generative model based on the pth order Markov assumption

○ $\epsilon_t$ : zero mean uncorrelated random variables with variance $\sigma^2$

○ $a_1, \ldots, a_p$ : autoregressive coefficients

○ $Y_t$ : observed stochastic process
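The AR(p) definition above can be tried out numerically. The following is a minimal sketch, assuming Gaussian noise and an ordinary least-squares fit on lagged values; `simulate_ar` and `fit_ar` are hypothetical helper names and are not part of the codelab.

```python
import numpy as np

def simulate_ar(coeffs, n, sigma=1.0, seed=0):
    """Simulate Y_t = sum_i a_i * Y_{t-i} + eps_t with Gaussian noise (assumption)."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n + p)                      # first p entries are zero start-up values
    eps = rng.normal(0.0, sigma, size=n + p)
    for t in range(p, n + p):
        # y[t-p:t][::-1] is (Y_{t-1}, ..., Y_{t-p})
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
    return y[p:]

def fit_ar(y, p):
    """Estimate a_1..a_p by least squares on the lagged series."""
    # Column i holds Y_{t-(i+1)} for t = p .. len(y)-1
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    target = y[p:]
    a_hat, *_ = np.linalg.lstsq(X, target, rcond=None)
    return a_hat

true_a = np.array([0.6, -0.3])
series = simulate_ar(true_a, n=5000)
a_hat = fit_ar(series, p=2)
print("true:", true_a, "estimated:", a_hat)
```

With a few thousand samples the estimated coefficients should land close to the true ones, which is the sense in which AR is a "linear generative model".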

Moving Average (MA)

● MA(q) model

$Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$ : Linear generative model for the noise term based on the qth order Markov assumption

○ $b_1, \ldots, b_q$ : moving average coefficients
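A small simulation can illustrate what the MA(q) model implies: the series only "remembers" the last q noise terms, so its autocorrelation vanishes beyond lag q. This is a sketch under the assumption of Gaussian noise; `simulate_ma` and `autocorr` are hypothetical helpers.

```python
import numpy as np

def simulate_ma(coeffs, n, sigma=1.0, seed=0):
    """Simulate Y_t = eps_t + sum_j b_j * eps_{t-j} with Gaussian noise (assumption)."""
    rng = np.random.default_rng(seed)
    q = len(coeffs)
    eps = rng.normal(0.0, sigma, size=n + q)
    # Convolving the noise with (1, b_1, ..., b_q) applies the MA filter.
    kernel = np.concatenate(([1.0], coeffs))
    return np.convolve(eps, kernel, mode="valid")

def autocorr(y, lag):
    """Sample autocorrelation at a positive lag."""
    y = y - y.mean()
    return float(np.dot(y[:-lag], y[lag:]) / np.dot(y, y))

b = 0.8
y = simulate_ma([b], n=20000)
r1 = autocorr(y, 1)   # theory for MA(1): b / (1 + b^2)
r2 = autocorr(y, 2)   # theory for MA(1): 0
print(r1, b / (1 + b ** 2), r2)
```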

ARMA Model

● ARMA(p,q) model

$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$ : generative linear model that combines AR(p) and MA(q) models

Stationarity

● Definition: a sequence of random variables is stationary if its distribution is invariant to shifting in time.

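Lag Operator

● Definition: Lag operator $\mathcal{L}$ is defined by $\mathcal{L} Y_t = Y_{t-1}$

● ARMA model in terms of the lag operator: $\left(1 - \sum_{i=1}^{p} a_i \mathcal{L}^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j \mathcal{L}^j\right) \epsilon_t$

● Characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ can be used to study properties of this stochastic process.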
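One property the characteristic polynomial exposes is stationarity: an AR(p) process has a causal stationary solution when every root of P(z) lies strictly outside the unit circle. A minimal sketch of that check with NumPy follows; `ar_char_roots` and `is_stationary` are hypothetical names, and the small tolerance is an arbitrary choice.

```python
import numpy as np

def ar_char_roots(a):
    """Roots of P(z) = 1 - a_1 z - ... - a_p z^p."""
    a = np.asarray(a, dtype=float)
    # np.roots expects coefficients ordered from the highest degree down.
    coeffs = np.concatenate((-a[::-1], [1.0]))
    return np.roots(coeffs)

def is_stationary(a, tol=1e-8):
    """True when all roots lie strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_char_roots(a)) > 1.0 + tol))

print(is_stationary([0.5]))        # AR(1) with a_1 = 0.5, root at z = 2
print(is_stationary([1.0]))        # random walk, unit root at z = 1
print(is_stationary([0.6, -0.3]))  # AR(2) with complex roots
```

The random-walk case is the one that motivates differencing in the ARIMA model.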

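ARIMA Model

● Definition: Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.

● Characteristic polynomial with unit roots can be factored: $P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots.

● ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - \mathcal{L})^D Y_t$, where $\mathcal{L}$ is the lag operator:

$\left(1 - \sum_{i=1}^{p} a_i \mathcal{L}^i\right) (1 - \mathcal{L})^D Y_t = \left(1 + \sum_{j=1}^{q} b_j \mathcal{L}^j\right) \epsilon_t$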
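The factor $(1 - \mathcal{L})^D$ is plain differencing. As a small illustration (assuming a Gaussian random walk as the non-stationary series), differencing once with `np.diff` recovers the stationary noise, and a cumulative sum undoes it:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=1000)

# Random walk: Y_t = Y_{t-1} + eps_t, i.e. (1 - L) Y_t = eps_t, so D = 1.
walk = np.cumsum(eps)

# Applying (1 - L)^D with D = 1 gives back the noise terms eps_1, eps_2, ...
diffed = np.diff(walk, n=1)

# Integrating (cumulative sum plus the first value) restores the original series.
restored = walk[0] + np.concatenate(([0.0], np.cumsum(diffed)))

print(np.allclose(diffed, eps[1:]), np.allclose(restored, walk))
```

An ARIMA forecast works the same way: the ARMA part is fitted on the differenced series, and predictions are integrated back to the original scale.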

Other Extensions

● Further variants:

○ Models with seasonal components (SARIMA)

○ Models with side information (ARIMAX)

○ Models with long-memory (ARFIMA)

○ Multivariate time series models (VAR)

○ Models with time-varying coefficients

○ Other non-linear models

Recurrent Neural Networks

Is there an easy way to implement time series analysis?

TensorFlow TimeSeries

● tf.contrib.timeseries

○ Classic models (state space, autoregressive)

○ Flexible infrastructure

○ Data management

■ Chunking

■ Batching

■ Saving model

■ Truncated backpropagation
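To make chunking and batching concrete, here is a minimal NumPy sketch of cutting one long series into random fixed-length windows and stacking them into a batch. It only illustrates the idea and is not how tf.contrib.timeseries implements it; `random_windows` and its parameters are hypothetical names.

```python
import numpy as np

def random_windows(times, values, window_size, batch_size, seed=0):
    """Sample batch_size contiguous windows of length window_size."""
    rng = np.random.default_rng(seed)
    n = len(values)
    if window_size > n:
        raise ValueError("window_size is larger than the series")
    # Each window starts at a random position that still fits in the series.
    starts = rng.integers(0, n - window_size + 1, size=batch_size)
    idx = starts[:, None] + np.arange(window_size)[None, :]
    return times[idx], values[idx]

times = np.arange(100)
values = np.sin(times / 5.0)
t_batch, v_batch = random_windows(times, values, window_size=40, batch_size=32)
print(t_batch.shape, v_batch.shape)
```

Training on short windows instead of the whole series is also what keeps backpropagation through time truncated to a fixed length.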

But is it really that easy?

Let's start by looking at an example

Introduction to Kaggle

https://www.kaggle.com/

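What is Kaggle?

A data playground where you can play with data to your heart's content

How to play on Kaggle

1. Pick a competition

2. Review and analyze the problem and the data

3. See how other people approach it

4. Build your own solution

Types of competitions

1. Featured: companies and organizations put up prize money for the competition

2. Research: competitions for research purposes

3. Playground: practice problems

4. Getting Started: practice problems

Some general competition rules

1. The number of submissions per day is limited

2. Only a fixed fraction of the test set counts toward the Public Score

3. The final score is revealed when the competition ends

4. The dataset stays accessible even after the competition ends!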

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

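Forecasting sales at brick-and-mortar grocery stores

If it seems complicated…

Use an analysis someone else has already done well: https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda

In most competitions, the most upvoted kernels are EDA kernels.

When you first join a competition, start by reading the EDA.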

https://www.kaggle.com/towever/devfest

KaggleZeroToAll

# -*- coding: utf-8 -*-
# Requires TensorFlow 1.x: tf.contrib was removed in TensorFlow 2.0.
import datetime
from datetime import timedelta

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Prepare

dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1,2,3,4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip the early dates
                    )
train.loc[(train.unit_sales < 0), 'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = train['unit_sales'].apply(np.log1p)  # log1p transform (pd.np is deprecated)
train['dow'] = train['date'].dt.dayofweek  # day of week, Monday = 0
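The transform used here is reversed with `expm1` before submission. As a small standalone illustration with made-up numbers, clipping negatives to zero and applying `log1p` keeps zero sales at zero and compresses large values, and `expm1` restores the clipped values:

```python
import numpy as np

sales = np.array([-2.0, 0.0, 1.0, 9.0, 99.0])  # made-up unit sales, one negative

clipped = np.clip(sales, 0, None)   # same effect as setting negatives to 0
logged = np.log1p(clipped)          # log(1 + x), defined at x = 0
restored = np.expm1(logged)         # exp(x) - 1, the inverse of log1p

print(logged)
print(restored)
```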

Read Dataset

# Create records for all items, in all stores, on all dates
# so that daily unit sales averages are calculated correctly.
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date','store_nbr','item_nbr']
    )
)
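Why the reindexing matters is easier to see on a toy table. In the sketch below (made-up data, not the competition files) store 2 has no row for the second day; without the dense index its mean would be 5.0, while with the missing day filled as zero sales it becomes 2.5.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [10, 10, 10],
    "unit_sales": [3.0, 5.0, 4.0],
})

# Every (date, store, item) combination, including the ones with no sales row.
full_index = pd.MultiIndex.from_product(
    [sales["date"].unique(), sales["store_nbr"].unique(), sales["item_nbr"].unique()],
    names=["date", "store_nbr", "item_nbr"],
)

dense = (
    sales.set_index(["date", "store_nbr", "item_nbr"])
    .reindex(full_index)            # missing combinations become NaN rows
    .fillna({"unit_sales": 0.0})    # treat "no record" as zero sales
    .reset_index()
)

store_means = dense.groupby("store_nbr")["unit_sales"].mean()
print(dense)
print(store_means)
```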

Preprocess data

train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs (assign back instead of inplace on a slice)
train.reset_index(inplace=True)  # reset index, restoring the key columns
lastdate = train.iloc[train.shape[0]-1].date  # last day in the data
train.head()

Preprocess data

# Mean sales per item, store and day of week
tmp = train[['item_nbr','store_nbr','dow','unit_sales']]
ma_dw = tmp.groupby(['item_nbr','store_nbr','dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()

Preprocess data

# Weekly mean: average of the day-of-week means per item and store
tmp = ma_dw[['item_nbr','store_nbr','madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()

Preprocess data

# Mean over the whole loaded period per item and store
tmp = train[['item_nbr','store_nbr','unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')

Moving Average using Pandas

# Means over the last 112, 56, ..., 1 days, joined as extra columns
for i in [112,56,28,14,7,3,1]:
    tmp = train[train.date>lastdate-timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr','store_nbr'])['unit_sales'].mean().to_frame('mais'+str(i))
    ma_is = ma_is.join(tmpg, how='left')

del tmp,tmpg

ma_is['mais']=ma_is.median(axis=1)  # median of all the moving averages
ma_is.reset_index(inplace=True)
ma_is.head()
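For a single item/store series, the `mais` value computed above is the median of the full-period mean and the trailing means over several window lengths. A NumPy-only sketch of the same idea on a made-up series (`median_of_trailing_means` is a hypothetical helper, and the series is assumed to be at least as long as the largest window):

```python
import numpy as np

def median_of_trailing_means(values, windows=(112, 56, 28, 14, 7, 3, 1)):
    """Median of the overall mean and the means of the last w observations."""
    values = np.asarray(values, dtype=float)
    means = [values.mean()]                       # plays the role of 'mais226'
    means += [values[-w:].mean() for w in windows]
    return float(np.median(means))

# 200 days of steadily growing sales: 1, 2, ..., 200
series = np.arange(1, 201)
forecast = median_of_trailing_means(series)
print(forecast)
```

Taking the median across window lengths makes the estimate less sensitive to any single window being dominated by a spike or a gap.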


def data_to_npreader(store_nbr: int, item_nbr: int):
    """Return (times, values, NumpyReader) for one store/item series."""
    unit_sales = train[np.logical_and(train['store_nbr'] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.asarray(range(len(unit_sales)))  # time steps 0, 1, 2, ...
    y = np.asarray(unit_sales)              # log1p-transformed sales
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader

Make data trainable

x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
# window_size must equal input_window_size + output_window_size (30 + 10)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)

TensorFlow TimeSeries - ARRegressor

evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
# Forecast 16 steps past the end of the evaluated series
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))


plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()


TensorFlow TimeSeries - LSTM

Get the _LSTMModel class from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py

TensorFlow TimeSeries - LSTM

x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
# _LSTMModel is the class defined in the lstm.py example linked above
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)

(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))

TensorFlow TimeSeries - LSTM

plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()

Forecasting test data

# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek

Forecasting test data

# Moving Average
test = pd.merge(test, ma_is, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr','store_nbr','dow'])
test['unit_sales'] = test.mais

# Autoregressive: overwrite the forecast for store 1, item 105574
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
ar_mask = np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574)
test.loc[ar_mask, 'unit_sales'] = ar_predictions['mean'].reshape(-1)

# LSTM: overwrite the forecast for store 2, item 105574
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
lstm_mask = np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574)
test.loc[lstm_mask, 'unit_sales'] = lstm_predictions['mean'].reshape(-1)


# Scale by the day-of-week ratio wherever the weekly mean is positive
pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = test['unit_sales'].apply(np.expm1)  # restore unit values (inverse of log1p)
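The day-of-week adjustment above multiplies a base forecast by `madw / mawk`, the ratio of that weekday's mean to the weekly mean. A small sketch with made-up numbers (on the log1p scale, as in the codelab) shows that the ratio reshapes the week without changing its average level:

```python
import numpy as np

base = 2.0  # made-up base forecast for one item/store (the 'mais' value)

# Made-up day-of-week means, Monday..Sunday, with a stronger weekend
madw = np.array([1.0, 1.0, 1.0, 1.0, 1.5, 2.5, 2.0])
mawk = madw.mean()  # weekly mean of the day-of-week means

if mawk > 0:
    # Same guard as pos_idx: only rescale when the weekly mean is positive
    forecast = base * madw / mawk
else:
    forecast = np.full_like(madw, base)

print(forecast)
print(forecast.mean())
```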


holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not transferred
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # dates with no holiday row get no boost
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # holiday boost
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # promotion boost
test[['id','unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')

Thank You!
