GDG DevFest Seoul 2017: Codelab - Time Series Analysis for Kaggle Using TensorFlow
TRANSCRIPT
전태균, 전승현
Developer of Satrec Initiative
Taegyun Jeon and Seunghyun Jeon
Time Series Analysis: Build It in TensorFlow and Take On Kaggle
Time Series Analysis
Introduction to Kaggle
KaggleZeroToAll
Contents
After finishing this codelab, you will:
1. Understand time series problems
2. Be able to solve problems on Kaggle
3. Upload your own model to the Kaggle Leaderboard
Time Series Analysis
● Time Series Analysis
● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN
● TensorFlow TimeSeries API (TFTS)
Time Series Data
● Stock values
● Economic variables
● Weather
● Sensor: Internet-of-Things
● Energy demand
● Signal processing
● Sales forecasting
Problems
● Standard Supervised Learning
○ IID assumption
○ Same distribution for training and test data
○ Distributions fixed over time (stationarity)
● Time Series
○ None of these assumptions hold!
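Because the IID assumption fails, a random train/test split would leak future values into training. A minimal sketch of a time-ordered split, assuming a plain Python list that is already sorted by time; the name `chronological_split` and the sample numbers are illustrative and not part of the codelab:

```python
def chronological_split(series, test_fraction=0.2):
    """Split a time-ordered sequence so every test point comes after every training point."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(series) * test_fraction))
    split_at = len(series) - n_test
    return series[:split_at], series[split_at:]


# Hypothetical daily sales, oldest first.
daily_sales = [12, 15, 14, 18, 21, 19, 25, 27, 26, 30]
train_part, test_part = chronological_split(daily_sales, test_fraction=0.2)
print(train_part)  # the first 8 days
print(test_part)   # the last 2 days
```

The model never sees values that come after the period it is asked to predict, which mirrors how a forecasting competition is scored.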
Autoregressive (AR) Models
● AR(p) model: linear generative model based on the pth order Markov assumption
$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$
○ $\epsilon_t$: zero mean uncorrelated random variables with variance $\sigma^2$
○ $a_1, \dots, a_p$: autoregressive coefficients
○ $Y_t$: observed stochastic process
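The AR(p) recursion can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (Gaussian noise, zero initial values); the function names and coefficients are made up for the example:

```python
import random


def predict_next(coeffs, history):
    """One-step AR(p) forecast: sum_i a_i * Y_{t-i} (the noise term has mean zero)."""
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # history[-1] is Y_{t-1}, history[-2] is Y_{t-2}, and so on.
    return sum(a * history[-i] for i, a in enumerate(coeffs, start=1))


def simulate_ar(coeffs, n_steps, sigma=1.0, seed=0):
    """Generate n_steps values of Y_t = sum_i a_i * Y_{t-i} + eps_t from zero initial values."""
    rng = random.Random(seed)
    series = [0.0] * len(coeffs)  # assumed zero pre-history
    for _ in range(n_steps):
        series.append(predict_next(coeffs, series) + rng.gauss(0.0, sigma))
    return series[len(coeffs):]


# Forecast from the last two observations: 0.5 * 2.0 + 0.2 * 1.0
print(predict_next([0.5, 0.2], [1.0, 2.0]))
print(len(simulate_ar([0.5, 0.2], n_steps=5)))  # 5
```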
Moving Average (MA)
● MA(q) model: linear generative model for the noise term based on the qth order Markov assumption
$Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
○ $b_1, \dots, b_q$: moving average coefficients
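To see what the moving average coefficients do, it helps to feed a single shock through the model. A small sketch, assuming the noise before the first time step is zero; the function name is illustrative:

```python
def moving_average_process(b, eps):
    """Compute Y_t = eps_t + sum_j b_j * eps_{t-j}, treating noise before t = 0 as zero."""
    series = []
    for t in range(len(eps)):
        value = eps[t]
        for j, b_j in enumerate(b, start=1):
            if t - j >= 0:
                value += b_j * eps[t - j]
        series.append(value)
    return series


# A unit shock at t = 0 echoes for q = 2 steps, then disappears completely.
print(moving_average_process([0.5, 0.25], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.0]
```

Unlike an AR model, the effect of a shock in an MA(q) model is gone after exactly q steps.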
ARMA Model
● ARMA(p,q) model: generative linear model that combines AR(p) and MA(q) models
$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
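The combination of the two parts can be sketched as one recursion. Again a toy illustration with zero pre-history assumed; the function name and coefficient values are not from the codelab:

```python
def arma_process(a, b, eps):
    """Compute Y_t = sum_i a_i*Y_{t-i} + eps_t + sum_j b_j*eps_{t-j} with zero pre-history."""
    series = []
    for t in range(len(eps)):
        ar_part = sum(a_i * series[t - i] for i, a_i in enumerate(a, start=1) if t - i >= 0)
        ma_part = sum(b_j * eps[t - j] for j, b_j in enumerate(b, start=1) if t - j >= 0)
        series.append(ar_part + eps[t] + ma_part)
    return series


# ARMA(1,1) response to a unit shock: the MA term adds a bump, the AR term makes it decay.
print(arma_process([0.5], [0.5], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 1.0, 0.5, 0.25]
```

Passing an empty `b` gives a pure AR model and an empty `a` gives a pure MA model.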
Stationarity
● Definition: a sequence of random variables is stationary if its distribution is invariant to shifting in time.
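Lag Operator
● Definition: the lag operator $L$ is defined by $L Y_t = Y_{t-1}$
● ARMA model in terms of the lag operator: $\left(1 - \sum_{i=1}^{p} a_i L^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j L^j\right) \epsilon_t$
● Characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ can be used to study properties of this stochastic process.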
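A lag polynomial is just a recipe for combining a series with shifted copies of itself. The sketch below applies the AR side of the model, one minus the sum of a_i times the i-step lag, to a list of values and evaluates the characteristic polynomial at a point. The function names are illustrative:

```python
def apply_ar_polynomial(a, series):
    """Apply (1 - sum_i a_i L^i) to a series, where L^i shifts the series back i steps.

    The first p outputs are skipped because they would need values before the start.
    """
    p = len(a)
    return [
        series[t] - sum(a_i * series[t - i] for i, a_i in enumerate(a, start=1))
        for t in range(p, len(series))
    ]


def characteristic_polynomial(a, z):
    """Evaluate P(z) = 1 - sum_i a_i * z**i."""
    return 1 - sum(a_i * z ** i for i, a_i in enumerate(a, start=1))


# Y_t = 0.5 * Y_{t-1} exactly (no noise), so the filtered series is all zeros.
print(apply_ar_polynomial([0.5], [8.0, 4.0, 2.0, 1.0]))  # [0.0, 0.0, 0.0]
# With a_1 = 1 the polynomial is (1 - L), and z = 1 is a root (a unit root).
print(characteristic_polynomial([1.0], 1.0))  # 0.0
```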
ARIMA Model
● Definition: Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.
● Characteristic polynomial with unit roots can be factored: $P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots.
● ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - L)^D Y_t$, where $L$ is the lag operator.
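In practice the "I" in ARIMA means differencing the series D times before fitting an ARMA model, then undoing the differencing on the forecasts. A minimal sketch with illustrative function names and a made-up trending series:

```python
def difference(series, order=1):
    """Apply (1 - L)^order: take first differences `order` times."""
    result = list(series)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


def undo_difference(diffs, first_value):
    """Invert one round of differencing given the first value of the original series."""
    restored = [first_value]
    for step in diffs:
        restored.append(restored[-1] + step)
    return restored


trend = [1, 3, 6, 10, 15]
print(difference(trend, order=1))  # [2, 3, 4, 5]
print(difference(trend, order=2))  # [1, 1, 1]
print(undo_difference(difference(trend), first_value=trend[0]))  # [1, 3, 6, 10, 15]
```

Here two rounds of differencing turn a curved trend into a constant series, which is much easier to model.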
Other Extensions
● Further variants:
○ Models with seasonal components (SARIMA)
○ Models with side information (ARIMAX)
○ Models with long-memory (ARFIMA)
○ Multi-variate time series model (VAR)
○ Models with time-varying coefficients
○ Other non-linear models
Recurrent Neural Networks
Is there an easy way to implement this?
TensorFlow TimeSeries
● tf.contrib.timeseries
○ Classic model (state space, autoregressive)
○ Flexible infrastructure
○ Data management
■ Chunking
■ Batching
■ Saving model
■ Truncated backpropagation
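Chunking and batching can be pictured as cutting random fixed-length windows out of one long series. The following is a conceptual sketch in plain Python, not the tf.contrib.timeseries implementation; the function name and sizes are assumptions for illustration:

```python
import random


def sample_window_batch(values, window_size, batch_size, seed=0):
    """Draw batch_size contiguous windows of length window_size at random start positions."""
    if window_size > len(values):
        raise ValueError("window_size is longer than the series")
    rng = random.Random(seed)
    last_start = len(values) - window_size
    batch = []
    for _ in range(batch_size):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        batch.append(values[start:start + window_size])
    return batch


series = list(range(100))  # stand-in for 100 time steps of observations
batch = sample_window_batch(series, window_size=40, batch_size=4)
print(len(batch), len(batch[0]))  # 4 40
```

Each window keeps its internal time order, while different windows in a batch may overlap or come from distant parts of the series.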
But is it really that easy?
Let's start by looking at an example.
Introduction to Kaggle
https://www.kaggle.com/
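What is Kaggle?
A data playground where you can play with data as much as you like
How to Play on Kaggle
1. Pick a competition
2. Review and analyze the problem and the data
3. See how other people approach it
4. Build your own solution
Types of Competitions
1. Featured: companies and organizations put up prize money for the competition
2. Research: competitions for research purposes
3. Playground: practice problems
4. Getting Started: practice problems
Some Common Competition Rules
1. Limited number of submissions per day
2. Only a fixed fraction of the test set counts toward the Public Score
3. Final scores are revealed when the competition ends
4. The dataset stays accessible even after the competition ends!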
https://www.kaggle.com/c/favorita-grocery-sales-forecasting
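Forecasting sales at brick-and-mortar grocery stores
If it looks complicated…
Build on someone else's good analysis: https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda
In most competitions, the most upvoted kernels are EDA kernels.
When you first join a competition, reading the EDA first is recommended.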
https://www.kaggle.com/towever/devfest
KaggleZeroToAll
Prepare
# -*- coding: utf-8 -*-
# Note: tf.contrib.timeseries is only available in TensorFlow 1.x.
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Read Dataset
dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1, 2, 3, 4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip initial dates
                    )
train.loc[(train.unit_sales < 0), 'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = train['unit_sales'].apply(np.log1p)  # log(1 + x) conversion; pd.np is deprecated
train['dow'] = train['date'].dt.dayofweek
Preprocess data
# creating records for all items, in all markets on all dates
# for correct calculation of daily unit sales averages.
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date', 'store_nbr', 'item_nbr']
    )
)
Preprocess data
train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs (assign back instead of chained inplace)
train.reset_index(inplace=True)  # reset index and restore the key columns
lastdate = train.iloc[train.shape[0]-1].date  # get last day in the data
train.head()
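Why create a record for every date, store and item? Days without a row are days with no sales, and leaving them out would inflate the averages. A pure-Python sketch of the same idea, using a hypothetical dictionary of sparse records instead of the real DataFrame:

```python
from itertools import product


def densify(records, dates, stores, items):
    """Return unit sales for every (date, store, item) combination, using 0.0 where no row exists."""
    return {key: records.get(key, 0.0) for key in product(dates, stores, items)}


# Hypothetical sparse data: only days with recorded sales have a row.
sparse = {
    ("2017-08-14", 1, 105574): 2.0,
    ("2017-08-15", 2, 105574): 1.0,
}
dense = densify(sparse, ["2017-08-14", "2017-08-15"], [1, 2], [105574])
print(len(dense))                        # 4
print(dense[("2017-08-15", 1, 105574)])  # 0.0
```

With the zero filled in, the two-day average for store 1 is 1.0 rather than 2.0.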
Preprocess data
# average unit sales per item, store and day of week
tmp = train[['item_nbr', 'store_nbr', 'dow', 'unit_sales']]
ma_dw = tmp.groupby(['item_nbr', 'store_nbr', 'dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()
Preprocess data
# weekly average: mean of the day-of-week averages per item and store
tmp = ma_dw[['item_nbr', 'store_nbr', 'madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()
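The two groupby steps compute a per-day-of-week average and then the average of those averages. For a single item in a single store, the same calculation looks roughly like this in plain Python (the rows are hypothetical and the function names are illustrative):

```python
from collections import defaultdict
from statistics import mean


def day_of_week_means(rows):
    """Average unit sales per day of week (the role of 'madw' for one item and store)."""
    by_dow = defaultdict(list)
    for dow, unit_sales in rows:
        by_dow[dow].append(unit_sales)
    return {dow: mean(values) for dow, values in by_dow.items()}


def weekly_mean(dow_means):
    """Average of the day-of-week averages (the role of 'mawk')."""
    return mean(list(dow_means.values()))


# Hypothetical (day_of_week, unit_sales) rows: two Mondays (0) and two Saturdays (5).
rows = [(0, 2.0), (0, 4.0), (5, 8.0), (5, 10.0)]
madw = day_of_week_means(rows)
mawk = weekly_mean(madw)
print(madw)  # {0: 3.0, 5: 9.0}
print(mawk)  # 6.0
```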
Moving Average using Pandas
# average over the whole loaded period per item and store
tmp = train[['item_nbr', 'store_nbr', 'unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')
Moving Average using Pandas
# averages over the most recent i days, joined as extra columns
for i in [112, 56, 28, 14, 7, 3, 1]:
    tmp = train[train.date > lastdate - timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais' + str(i))
    ma_is = ma_is.join(tmpg, how='left')
del tmp, tmpg
Moving Average using Pandas
# final estimate: median across all the window averages
ma_is['mais'] = ma_is.median(axis=1)
ma_is.reset_index(inplace=True)
ma_is.head()
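The idea behind the `mais` column is to average the same series over several recent windows and take the median, so that no single window dominates. A simplified sketch for one series, with shortened illustrative windows and a hypothetical function name:

```python
from statistics import mean, median


def recent_window_forecast(series, windows=(112, 56, 28, 14, 7, 3, 1)):
    """Median of the full-history mean and the means over the most recent k observations."""
    candidates = [mean(series)]  # plays the role of the full-period average
    for k in windows:
        candidates.append(mean(series[-k:]))  # plays the role of 'mais' + str(k)
    return median(candidates)


# Hypothetical log-scale sales that have been rising recently.
sales = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
# Candidates: full mean 1.5, last 4 -> 2.5, last 2 -> 3.0, last 1 -> 3.0
print(recent_window_forecast(sales, windows=(4, 2, 1)))  # 2.75
```

The median sits between the long-run average and the latest value, which makes the estimate less sensitive to a single unusual day.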
Make data trainable
def data_to_npreader(store_nbr: int, item_nbr: int):
    """Return (times, values, NumpyReader) for one store/item series."""
    unit_sales = train[np.logical_and(train["store_nbr"] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.asarray(range(len(unit_sales)))
    y = np.asarray(unit_sales)
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader
TensorFlow TimeSeries - ARRegressor
x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
# window_size (40) = input_window_size (30) + output_window_size (10)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)
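In the training code, the window of 40 steps matches 30 input steps plus 10 output steps. One way to picture this, assuming each sampled window is split into a part the model reads and a part it must predict, is the sketch below. It is a plain-Python illustration, not the library's internal code, and `split_window` is a hypothetical helper:

```python
def split_window(window, input_window_size, output_window_size):
    """Split one training window into the values the model sees and the values it must predict."""
    expected = input_window_size + output_window_size
    if len(window) != expected:
        raise ValueError("window length must equal input + output window sizes")
    return window[:input_window_size], window[input_window_size:]


window = list(range(40))  # one window of 40 consecutive time steps
inputs, targets = split_window(window, input_window_size=30, output_window_size=10)
print(len(inputs), len(targets))  # 30 10
```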
TensorFlow TimeSeries - ARRegressor
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
# continue from the end of the evaluated series and predict 16 more steps
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
TensorFlow TimeSeries - ARRegressor
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
TensorFlow TimeSeries - LSTM
Get the LSTM class: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py
TensorFlow TimeSeries - LSTM
x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)
TensorFlow TimeSeries - LSTM
(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
TensorFlow TimeSeries - LSTM
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
Forecasting test data
# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek
Forecasting test data
# Moving Average
test = pd.merge(test, ma_is, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr', 'store_nbr', 'dow'])
test['unit_sales'] = test.mais
# Autoregressive (predictions are 2-D, so flatten them before assigning)
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574),
         'unit_sales'] = ar_predictions['mean'].reshape(-1)
# LSTM
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574),
         'unit_sales'] = lstm_predictions['mean'].reshape(-1)
Forecasting test data
# scale each forecast by the day-of-week average relative to the weekly average
pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = test['unit_sales'].apply(np.expm1)  # restore unit values (inverse of log1p)
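The adjustment above works on the log scale: the forecast is multiplied by the ratio of the day-of-week average to the weekly average, and only then converted back to units. A single-value sketch of that step, with a hypothetical function name and made-up numbers:

```python
import math


def adjust_for_day_of_week(log_forecast, madw, mawk):
    """Scale a log-scale forecast by madw / mawk, then convert back to unit sales."""
    if mawk > 0:  # rows with a zero weekly average are left unscaled
        log_forecast = log_forecast * madw / mawk
    return math.expm1(log_forecast)


# A day that sells 1.5x the weekly average (on the log scale) lifts the forecast.
busy_day = adjust_for_day_of_week(1.0, madw=3.0, mawk=2.0)
plain_day = adjust_for_day_of_week(1.0, madw=2.0, mawk=2.0)
print(busy_day > plain_day)  # True
```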
Forecasting test data
holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not moved
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # no holiday row: treat as a normal day
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # boost holidays by 20%
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # boost promoted items by 15%
test[['id', 'unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')
Thank You!