Development of software for scalable anomaly detection modeling of time-series data using Apache Spark
Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe (IBM Research – Tokyo)
2016/02/08, Spark Conference Japan
Apache Spark を用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
©2015 IBM Corporation

How we detect anomaly
• System under monitoring (e.g., a factory plant), with sensors A–D measuring temperature, acceleration, pressure, and density.
• Sensor values are correlated; the correlation changes in an anomalous situation.
• Prediction model of correct behavior: the value of sensor A is predicted from the other sensors B, C, and D.
• Compare the predicted sensor value with the observed value: it is an anomaly if the two differ.
• References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
Motivation: the prediction model is computed in advance by machine learning. This takes a very long time and requires much memory. Improve the scalability with Spark!
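The detection rule described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the talk's implementation: the linear model, its coefficients, and the threshold are all hypothetical stand-ins.

```python
# Illustrative sketch of the detection rule: predict one sensor from the
# others with a (given) linear model, and flag an anomaly when the
# prediction and the observation differ by more than a threshold.
# Coefficients and threshold here are made up for illustration.
def predict(coeffs, others):
    """Linear prediction of one sensor's value from the remaining sensors."""
    return sum(a * v for a, v in zip(coeffs, others))

def is_anomaly(coeffs, others, observed, threshold):
    """Anomaly if predicted and observed values differ too much."""
    return abs(observed - predict(coeffs, others)) > threshold
```

In the talk's setting the coefficients come from a LASSO model learned in advance; here they are just example numbers.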
How we applied Spark (before)
• Time-series x_tj
  – T ~ 10^6 or more samples (time)
  – D ~ 10^2 sensors (dimensions)
  – (i.e., T >> D)
• Training: a linear model using LASSO regression (least squares + L1 regularization)
  – Hyper-parameter λ, tuned later to achieve the best prediction accuracy
• Evaluation: cross-validation of prediction accuracy; other data is used to test the model
• Overall structure: a search loop over the hyper-parameter λ, each iteration running training → model → evaluation
• The matrix S_jk = (1/T) Σ_{t=1}^{T} x_tj x_tk (D × D, small) is computed in advance from the original time-series data x_tj (T × D, big)
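The precomputation of S above is a natural map-reduce: each time sample maps to its outer product, and the reduce step sums them. A pure-Python sketch of that shape follows; function names (`outer`, `add`, `covariance_matrix`) are illustrative, not the talk's code, which runs on Spark RDDs.

```python
# Map-reduce sketch of precomputing the small D x D matrix
# S_jk = (1/T) * sum_t x_tj * x_tk from the big T x D time-series.
from functools import reduce

def outer(row):
    """map: one time sample (length-D row) -> its D x D outer product."""
    return [[xj * xk for xk in row] for xj in row]

def add(A, B):
    """reduce: element-wise sum of two D x D matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def covariance_matrix(x):
    """S = (1/T) * sum over samples of the outer product of each row."""
    T = len(x)
    S = reduce(add, (outer(row) for row in x))
    return [[s / T for s in row] for row in S]
```

Because the result is only D × D, it is cheap to broadcast to every node afterwards, which is exactly what the next slide exploits.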
How we applied Spark (after)
• Training is parallelized by sensor: training for sensor 1, sensor 2, …, sensor D runs independently, each followed by its evaluation.
• Evaluation is parallelized by time (map-reduce).
• Search loop over the hyper-parameter λ.
• The small data (S_jk, D × D) and the model are copied to all the nodes.
• The big data (x_tj, T × D) is not copied or moved.
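The layout above can be sketched in pure Python: the small precomputed S is shared with every task, and one independent training task runs per sensor. `train_sensor` is a hypothetical stand-in for the per-sensor LASSO fit; the real system does this with Spark executors, not threads.

```python
# Sketch of the "after" parallel layout: training split by sensor, with
# the small S matrix shared by all tasks. The big time-series data is
# never touched here, mirroring the slide's point that it is not moved.
from concurrent.futures import ThreadPoolExecutor

def train_sensor(i, S):
    """Hypothetical per-sensor training; depends only on the small S."""
    return ("model", i, S[i][i])

def train_all(S):
    """Run one independent training task per sensor, S shared by all."""
    D = len(S)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: train_sensor(i, S), range(D)))
```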
Why we did not use Spark MLlib
• LASSO regression: Spark MLlib offers SGD; our method uses the shooting algorithm. Decision: implement it ourselves using RDD. Reason: (maybe) better accuracy when T >> D.
• Cross-validation framework: Spark MLlib offers random split; our method uses block split. Decision: implement it ourselves using RDD. Reason: to avoid overfitting (specific to time-series).
• Cross-validation for usual data uses random sampling: train and test samples are interleaved over time.
• Cross-validation for time-series data uses block sampling: each test set is a contiguous block of time, and the rest is used for training.
• Balanced optimization of CV: models 1–4 are trained on their training blocks; the map step applies model i to test block i to produce predictions (RDD(original) → RDD(prediction)); the reduce step computes the average prediction accuracy.
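The block-split scheme above can be sketched as follows. This is a pure-Python illustration, not the talk's RDD implementation; `block_folds` is an illustrative name. Unlike random sampling, each fold holds out one contiguous block of time, so neighbouring (highly correlated) samples cannot leak from the test block into the training set.

```python
# Block-split cross-validation for a time series: each fold's test set
# is one contiguous block of time indices; everything else is training.
def block_folds(T, n_folds):
    """Yield (train_indices, test_indices) pairs with contiguous test blocks."""
    size = T // n_folds
    for f in range(n_folds):
        lo = f * size
        # the last fold absorbs any remainder so all T samples are covered
        hi = T if f == n_folds - 1 else lo + size
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, T))
        yield train, test
```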
Performance
[Chart: model computation time (50 sensors, 10k samples) for 1 node × 1 core, 1 node × 32 cores, and 2 nodes × 32 cores; y-axis: execution time (seconds)]
[Chart: model computation time with various data sizes (10,000 / 20,000 / 40,000 / 80,000 / 160,000 samples) for the same three configurations; x-axis: number of samples]
• Speed-up by 7.8 times.
• 16 times larger data can be handled within the same time.
Test environment:
• Processor: Intel(R) Xeon(R) E5-2680 0, 2.70 GHz; Memory / node: 32 GB
• Cores / node: 32 (2 processors × 8 cores × 2 hyper-threads); NW: 1 Gb Ethernet
• OS: Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64; JVM: IBM® Java 1.8.0
• Spark: version 1.5.0, standalone scheduler; Hadoop (HDFS): version 2.6.0
Lessons learned (in time-series handling)
• A sliding window is not in RDD; use the one from MLlib's RDDFunctions:
  import org.apache.spark.mllib.rdd.RDDFunctions._
  val x = sc.parallelize(1 to 1000).sliding(3)
  This turns the ordered elements 1, 2, 3, … into windows (1,2,3), (2,3,4), (3,4,5), …
• Pitfall: order preservation in RDD operations
  – join: order is not preserved; using it to re-attach keys after a map-reduce over a time-series is a bug
  – zip: order is preserved, so it is OK for this purpose
• Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API for future extensions instead of RDD?
• But in most cases, Spark programming is easy and fun. Thank you!
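For readers without a Spark cluster at hand, here is a pure-Python equivalent of what `RDDFunctions.sliding(3)` yields on an ordered sequence; the function name `sliding` mirrors the Scala API but this sketch is ours.

```python
# Pure-Python illustration of MLlib's sliding(n): all length-n windows
# over an ordered sequence, in order.
def sliding(xs, n):
    """Return every contiguous length-n window of xs as a tuple."""
    return [tuple(xs[i:i + n]) for i in range(len(xs) - n + 1)]
```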
Data
• Data is a high-dimensional time-series generated by sensors.
• Typical sizes (long in the vertical direction):
  – D: number of sensors < 1k
  – T: number of samples ~ 1M or more
  – File size: ~ 1 GB or more
• Data is processed in batch.
Example layout (T rows × D columns):
  Time          Sensor 1  …  Sensor D
  01:10:23.456   0.10     …  -0.91
  01:10:23.556   0.15     …  -0.99
  01:10:23.656   0.12     …  -0.87
  01:10:23.756   0.17     …  -0.54
  …              …        …  …
  23:59:59.956  -0.49     …  -0.29
Architecture
• Logical architecture: a model creation tool GUI on the client PC talks to the model creation tool server (the Spark driver) over Java RMI; the driver distributes work to the executors via Spark, and data is read from HDFS.
• Physical architecture: client PC, master server, worker servers, and storage.
• Frameworks / middleware stack: OS, JVM (JRE), Spark (standalone scheduler), HDFS, and other libraries; the model creation engine (ML) runs inside the model creation tool server on Spark.
Parallelization design
• Nature of the computation:
  – Training depends only on the matrix S (D × D), not on the large original data x (T × D)
  – Evaluation is a map-reduce over the samples (one row of length D) of the original data x (T × D)
  – Both can be computed independently per sensor (the variable being predicted)
• If the hyper-parameter search loop were parallelized:
  – A copy of the original data would be needed on every node
  – It might not fit in the memory of one node
• If one whole iteration were parallelized per sensor:
  – A copy of the original data would be needed on every node
  – It might not fit in the memory of one node
• Chosen design: parallelize training by sensor and evaluation by time
  – The matrix S and the model are shared by all nodes; this is feasible because they are small
  – Evaluation is a typical map-reduce; the original data can stay distributed
(Diagram: hyper-parameter search loop over training (S_jk, D × D) → model → evaluation (x_tj, T × D))
Modeling method
• Training: build a linear regression model from the data using LASSO regression (least squares + L1 regularization)
  – Variable i is the response variable (prediction target); all variables other than i are the explanatory variables
  – The coefficients {a_ji} are determined by the shooting algorithm so as to minimize the objective g_i; the hyper-parameter λ is set to some suitably small number (decided later)
  – A further optimization: S_jk is computed once, outside the loop, beforehand
  – Computational cost: roughly O(D^3) per variable
• Evaluation: cross-validation (the average per-sample prediction accuracy is evaluated on separate data)
  – Computational cost: O(TD) per variable
• Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy
(Diagram: hyper-parameter search loop over training (S_jk) → model → evaluation (x_tj))
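The shooting algorithm named above is cyclic coordinate descent for LASSO. A minimal pure-Python sketch follows; it is not the talk's Scala/RDD implementation, and the names (`soft_threshold`, `lasso_shooting`) are ours. Note how, once the Gram matrix S = XᵀX and the vector r = Xᵀy are precomputed, the updates no longer touch the big T × D data, which is the optimization the slide describes.

```python
# Shooting (cyclic coordinate-descent) algorithm for LASSO:
# minimize (1/2) a'Sa - r'a + lam * ||a||_1, given precomputed
# S = X'X (D x D) and r = X'y (length D). Illustrative sketch only.
def soft_threshold(rho, lam):
    """Soft-thresholding operator applied at each coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_shooting(S, r, lam, n_iter=100):
    """Cyclically update each coefficient, holding the others fixed."""
    D = len(r)
    a = [0.0] * D
    for _ in range(n_iter):
        for j in range(D):
            # correlation of the residual with feature j, excluding a[j]
            rho = r[j] - sum(S[j][k] * a[k] for k in range(D) if k != j)
            a[j] = soft_threshold(rho, lam) / S[j][j]
    return a
```

With λ = 0 this reduces to ordinary least squares; larger λ drives more coefficients exactly to zero, which is why the resulting per-sensor models are sparse.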
Conclusion
• We have developed scalable modeling software for anomaly detection of time-series data using Spark
  – Modeling is done in batch
  – Implemented our own LASSO regression algorithm with RDD
  – Optimized for the time-series situation where T >> D
• Performance improvements (2 nodes × 32 cores)
  – Speed-up by 7.8 times
  – A 16-times-larger data set can be handled within the same time