Page 1: Development of software for scalable anomaly detection modeling of time-series data using Apache Spark

Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe, IBM Research – Tokyo

2016/02/08, Spark Conference Japan

Page 5

©2015 IBM Corporation

How we detect anomalies

• System under monitoring (e.g., a factory plant) with Sensor A, Sensor B, Sensor C, and Sensor D (temperature, acceleration, pressure, density)

• Sensor values are correlated

• The correlation changes in an anomalous situation

• Prediction model of correct behavior
– The value of Sensor A is predicted from the other sensors B, C, and D
– The predicted sensor value is compared with the observed value
– It is an anomaly if the two are different

• Motivation: the prediction model is computed in advance by machine learning. This takes a very long time and requires much memory. Improve the scalability with Spark!

References
• T. Idé, et al., SDM 2009.
• T. Idé, IBM ProVISION No. 78, 2013.
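The detection idea on this slide can be sketched in a few lines of plain Scala. This is only an illustration under assumptions: the names `AnomalySketch`, `predict`, `score`, and `isAnomaly`, and the use of a fixed `threshold`, are hypothetical and do not come from the slides.

```scala
// Hypothetical sketch: a linear model predicts one sensor from the others.
object AnomalySketch {
  /** Predict the target sensor as a weighted sum of the other sensors' values. */
  def predict(coeffs: Array[Double], others: Array[Double]): Double = {
    require(coeffs.length == others.length, "one coefficient per explanatory sensor")
    coeffs.zip(others).map { case (a, x) => a * x }.sum
  }

  /** Anomaly score: absolute gap between the observed and the predicted value. */
  def score(coeffs: Array[Double], others: Array[Double], observed: Double): Double =
    math.abs(observed - predict(coeffs, others))

  /** Flag a sample when the score exceeds a threshold (the threshold is an assumed parameter). */
  def isAnomaly(coeffs: Array[Double], others: Array[Double],
                observed: Double, threshold: Double): Boolean =
    score(coeffs, others, observed) > threshold
}
```

For example, with coefficients (0.5, 0.25, 0.0) and readings (2.0, 4.0, 8.0) from sensors B, C, and D, the predicted value of Sensor A is 2.0, so an observed value of 5.0 gives a score of 3.0.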

Page 9

How we applied Spark (before)

• Time-series x_tj
– T ~ 10^6 or more samples (time)
– D ~ 10^2 sensors (dimensions)
– (i.e., T >> D)

• Training: a linear model using LASSO regression (least squares + L1 regularization)
– Hyper-parameter λ (tuned later to achieve the best prediction accuracy)

• Evaluation: cross validation of prediction accuracy
– Other data is used to test the model

• Search loop of hyper-parameter λ: training → model → evaluation

• Original time-series data x_tj (big, T × D)

• Computed in advance (small, D × D):

$S_{jk} = \frac{1}{T} \sum_{t=1}^{T} x_{tj} x_{tk}$
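The precomputation of S_jk fits a map-reduce pattern: each sample contributes one D × D outer product, and the contributions are summed. The sketch below uses plain Scala collections so that it runs without a cluster; the names `CovarianceSketch`, `outer`, `add`, and `computeS` are hypothetical. On a Spark `RDD[Array[Double]]` the same `map` and `reduce` calls could presumably be used, but that is an assumption about the authors' code, not something the slides show.

```scala
object CovarianceSketch {
  type Row = Array[Double] // one sample: D sensor values at a single time step

  /** Map step: outer product of one sample with itself, as a D x D array. */
  def outer(row: Row): Array[Array[Double]] =
    row.map(xj => row.map(xk => xj * xk))

  /** Reduce step: element-wise sum of two D x D arrays. */
  def add(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (u, v) => u + v } }

  /** S_jk = (1/T) * sum over t of x_tj * x_tk, computed as map (outer) then reduce (add). */
  def computeS(rows: Seq[Row]): Array[Array[Double]] = {
    require(rows.nonEmpty, "need at least one sample")
    val t = rows.length.toDouble
    rows.map(outer).reduce(add).map(_.map(_ / t))
  }
}
```

The result has D × D entries regardless of T, which is why it is small enough to compute once and reuse inside the hyper-parameter search loop.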

Page 11

How we applied Spark (after)

• Training is parallelized by sensors: training for sensor 1, sensor 2, ..., sensor D-1, sensor D
– Input is S_jk (D × D)
– The small data is copied to all the nodes

• Evaluation is parallelized by time (map-reduce)
– Input is x_tj (T × D)
– The model is copied to all the nodes
– Big data is not copied or moved

• Search loop of hyper-parameter λ: training → model → evaluation
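The evaluation step can be pictured as a map over samples followed by a reduce, with the small model shared by every worker. The sketch below is plain Scala, not the authors' implementation: `EvaluationSketch`, the `model(i)(j)` layout (coefficient of sensor j when predicting sensor i), and the choice of mean squared error as the accuracy measure are all assumptions made for illustration.

```scala
object EvaluationSketch {
  type Row = Array[Double]

  /** Prediction of sensor i from all other sensors; model(i)(j) is the coefficient of sensor j. */
  def predict(model: Array[Array[Double]], row: Row, i: Int): Double =
    row.indices.filter(_ != i).map(j => model(i)(j) * row(j)).sum

  /** Map step: squared prediction error of every sensor for one sample, plus a count of 1. */
  def sampleError(model: Array[Array[Double]], row: Row): (Array[Double], Long) = {
    val errors = row.indices.map { i =>
      val e = row(i) - predict(model, row, i)
      e * e
    }.toArray
    (errors, 1L)
  }

  /** Reduce step: add up per-sensor errors and sample counts. */
  def merge(a: (Array[Double], Long), b: (Array[Double], Long)): (Array[Double], Long) =
    (a._1.zip(b._1).map { case (u, v) => u + v }, a._2 + b._2)

  /** Mean squared error per sensor over all samples. */
  def meanSquaredError(model: Array[Array[Double]], rows: Seq[Row]): Array[Double] = {
    require(rows.nonEmpty, "need at least one sample")
    val (sum, n) = rows.map(sampleError(model, _)).reduce(merge)
    sum.map(_ / n)
  }
}
```

Because each sample is processed independently and only a D-length error vector is merged, the rows can stay where they are stored while the model travels to them.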

Page 16

Why we did not use Spark MLlib

| | Spark MLlib | Our method | Decision | Reason |
| LASSO regression | SGD | Shooting algorithm | Implement by ourselves using RDD | (maybe) better accuracy when T >> D |
| Cross validation framework | Random split | Block split | Implement by ourselves using RDD | To avoid overfitting (specific to time-series) |

• Cross validation for usual data (random sampling): train and test samples are scattered over x_tj

• Cross validation for time-series data (block sampling): train and test samples are contiguous blocks of x_tj

• Balance optimization of CV
– RDD (original): x_tj divided into test 1, test 2, test 3, test 4
– map: model 1 to model 4 produce Pred1 to Pred4, giving RDD (prediction)
– reduce: average prediction accuracy
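A block split can be sketched as follows. The rationale given here is an interpretation of the slide: neighbouring samples in a time-series tend to be very similar, so a random split places near-copies of the test samples in the training set. `BlockSplitSketch`, `blocks`, and `folds` are hypothetical names, and the near-equal block sizes are an assumption.

```scala
object BlockSplitSketch {
  /** Split time indices 0 until t into k contiguous blocks of near-equal size. */
  def blocks(t: Int, k: Int): Seq[Range] = {
    require(k > 0 && t >= k, "need at least one sample per block")
    // Long arithmetic avoids overflow when t is large.
    (0 until k).map { b =>
      val start = (b.toLong * t / k).toInt
      val end = ((b + 1).toLong * t / k).toInt
      start until end
    }
  }

  /** For each fold, one block is the test set and the remaining blocks form the training set. */
  def folds(t: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val bs = blocks(t, k)
    bs.indices.map { f =>
      val test = bs(f).toSeq
      val train = bs.indices.filter(_ != f).flatMap(bs(_))
      (train, test)
    }
  }
}
```

With t = 10 and k = 4, the second fold tests on indices 2, 3, 4 and trains on 0, 1, 5, 6, 7, 8, 9.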

Page 17

Performance

[Figure: Model computation time with various data sizes. Bars for 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores; series by number of samples (10000, 20000, 40000, 80000, 160000); vertical axis: execution time (s).]

[Figure: Model computation time, 50 sensors, 10k samples. Bars for 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores; vertical axis: execution time (seconds).]

• Speed up by 7.8 times

• 16 times larger data can be handled within the same time

| Item | Specification |
| Processor | Intel(R) Xeon(R) E5-2680 0, 2.70GHz |
| Cores / node | 32 (2 processors x 8 cores x 2 hyper-threads) |
| Memory / node | 32GB |
| NW | 1Gb Ethernet |
| OS | Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64 |
| JVM | IBM® Java 1.8.0 |
| Spark | Version 1.5.0, standalone scheduler |
| Hadoop (HDFS) | Version 2.6.0 |

Page 21

Lessons learned (in time-series handling)

• Sliding window is not in RDD
– It is available through MLlib's RDDFunctions:

import org.apache.spark.mllib.rdd.RDDFunctions._

val x = sc.parallelize(1 to 1000).sliding(3)

– Example: 1, 2, 3, ... becomes (1,2,3), (2,3,4), (3,4,5), ...

• Pitfall: order preservation in RDD operations
– join (not preserved): Bug!
– zip (preserved): OK
– sliding window, map-reduce: OK

• Alternative APIs
– DataFrame (Spark MLlib)
– DStream (Spark Streaming)
– TimeSeriesRDD (Cloudera Spark TS)

• Is it better to use a higher-level API for future extensions instead of RDD?

But in most cases, Spark programming is easy and fun. Thank you!
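One common way to guard against the ordering pitfall is to carry an explicit time index with every element and sort on it after any operation that may reorder data. The sketch below shows the idea on plain Scala collections so that it runs without Spark; `OrderSketch`, `windows`, `indexed`, and `joinInTimeOrder` are hypothetical names. In Spark, `RDD.zipWithIndex` and `sortByKey` are the likely counterparts, though whether the authors used them is not stated in the slides.

```scala
object OrderSketch {
  /** Sliding windows over an ordered sequence, e.g. 1,2,3,4,5 -> (1,2,3),(2,3,4),(3,4,5). */
  def windows(xs: Seq[Int], size: Int): Seq[Seq[Int]] =
    xs.sliding(size).map(_.toSeq).toSeq

  /** Attach an explicit time index so that order can be restored after any reshuffling. */
  def indexed[A](xs: Seq[A]): Seq[(Long, A)] =
    xs.zipWithIndex.map { case (x, i) => (i.toLong, x) }

  /** Join two keyed collections by index, then sort by the key to recover time order. */
  def joinInTimeOrder[A, B](left: Seq[(Long, A)], right: Seq[(Long, B)]): Seq[(A, B)] = {
    val lookup = right.toMap
    left
      .flatMap { case (i, a) => lookup.get(i).map(b => (i, (a, b))) }
      .sortBy(_._1)
      .map(_._2)
  }
}
```

Even if the left side arrives in a shuffled order, sorting on the carried index restores the original pairing and sequence.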

Page 23

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries.

Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Page 25

Data

• Data is a high-dimensional time-series generated by sensors

• Typical sizes (long in the vertical direction)
– D: number of sensors < 1k
– T: number of samples ~ 1M or more
– File size: ~ 1GB or more

• Data is processed in batch

Layout (T rows, D sensor columns):

| Time | Sensor 1 | … | Sensor D |
| 01:10:23 456 | 0.10 | … | -0.91 |
| 01:10:23 556 | 0.15 | … | -0.99 |
| 01:10:23 656 | 0.12 | … | -0.87 |
| 01:10:23 756 | 0.17 | … | -0.54 |
| … | … | … | … |
| 23:59:59 956 | -0.49 | … | -0.29 |

Page 26

Architecture

• Physical architecture
– Client PC
– Master server
– Worker servers
– Storages

• Logical architecture
– Model creation tool GUI
– Driver: model creation tool server
– Executors
– Connections: Java RMI, Spark, HDFS

• Frameworks / Middleware
– OS
– JVM (JRE)
– HDFS, other libraries
– Spark (standalone scheduler)
– Model creation tool server
– Model creation engine (ML)

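Page 27

Design of parallelization

• Properties of the computation
– Training: depends only on the matrix S (D×D), not on the large original data x (T×D)
– Evaluation: a map-reduce whose elements are the samples (one row, D values) of the original data x (T×D)
– Both can be computed independently for each sensor (the variable to be predicted)

• If the hyper-parameter search loop is parallelized
– A copy of the original data is needed on every node
– The data may not fit in the memory of one node

• If one whole iteration is parallelized by sensor
– A copy of the original data is needed on every node
– The data may not fit in the memory of one node

• Training is parallelized by sensor, Evaluation is parallelized by time
– The matrix S and the model are shared by all nodes; this is possible because they are small
– Evaluation is a typical map-reduce; the original data can be stored in a distributed way

• Hyper-parameter search loop: training (S_jk, D × D) → model → evaluation (x_tj, T × D)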

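Page 28

Modeling method

• Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy

• Training: a linear regression model is built from the data using LASSO regression (least squares + L1 regularization)
– Variable i is the response variable (prediction target); the variables other than i are the explanatory variables
– The coefficients {a_ji} are determined by the Shooting algorithm so as to minimize g_i
– The hyper-parameter λ is a suitably small number (decided later)
– In addition, the following optimization is applied (S_jk is computed in advance, outside the loop)
– Computational cost: roughly O(D^3) per variable

• Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on other data)
– Computational cost: O(TD) per variable

• Hyper-parameter search loop: training (S_jk, D × D) → model → evaluation (x_tj, T × D)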
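The slide's formulas did not survive in the transcript, so the following is a generic sketch of a shooting (coordinate-descent) LASSO update that reads only the D × D matrix S. It assumes the common objective $g_i = \frac{1}{2}\left(S_{ii} - 2\sum_{j \ne i} a_j S_{ji} + \sum_{j,k \ne i} a_j a_k S_{jk}\right) + \lambda \sum_{j \ne i} |a_j|$, which may differ in scaling from the authors' definition. The names `ShootingSketch`, `softThreshold`, `fit`, `maxIter`, and `tol` are hypothetical.

```scala
object ShootingSketch {
  /** Soft-thresholding operator used by coordinate-wise LASSO updates. */
  def softThreshold(rho: Double, lambda: Double): Double =
    if (rho > lambda) rho - lambda
    else if (rho < -lambda) rho + lambda
    else 0.0

  /**
   * Coefficients a_j (j != i) that predict sensor i from the other sensors,
   * using only the D x D matrix s. The entry at index i stays 0.
   * maxIter and tol are assumed stopping parameters.
   */
  def fit(s: Array[Array[Double]], i: Int, lambda: Double,
          maxIter: Int = 1000, tol: Double = 1e-10): Array[Double] = {
    val d = s.length
    val a = Array.fill(d)(0.0)
    var iter = 0
    var maxChange = Double.MaxValue
    while (iter < maxIter && maxChange > tol) {
      maxChange = 0.0
      for (j <- 0 until d if j != i && s(j)(j) > 0.0) {
        // Correlation of sensor j with the target, minus the part explained by the other coefficients.
        var rho = s(j)(i)
        for (k <- 0 until d if k != i && k != j) rho -= a(k) * s(j)(k)
        val updated = softThreshold(rho, lambda) / s(j)(j)
        maxChange = math.max(maxChange, math.abs(updated - a(j)))
        a(j) = updated
      }
      iter += 1
    }
    a
  }
}
```

Each sweep touches only entries of S, so the cost of training does not grow with T once S has been computed.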

Page 29

Conclusion

• We have developed scalable modeling software for anomaly detection of time-series using Spark
– Modeling is done in batch
– We implemented our own LASSO regression algorithm with RDD
– It is optimized for time-series with T >> D

• Performance improvements (2 nodes x 32 cores)
– Speed up by 7.8 times
– A 16 times larger data set can be handled within the same time