Development of software for scalable anomaly detection modeling of time-series data using Apache Spark
Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe (IBM Research – Tokyo)
2016/02/08, Spark Conference Japan
Apache Spark を用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
©2015 IBM Corporation

How we detect anomaly
• System under monitoring (e.g., a factory plant), with sensors A–D measuring temperature, acceleration, pressure, and density.
• Sensor values are correlated; the correlation changes in an anomalous situation.
• Prediction model of correct behavior: the value of sensor A is predicted from the other sensors B, C, and D.
• Compare the predicted sensor value with the observed value: it is an anomaly if the two differ.
• References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
Motivation: the prediction model is computed in advance by machine learning. This takes a very long time and requires much memory. Improve the scalability with Spark!
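The detection rule described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the talk's implementation: the linear model, its coefficients, and the threshold are all hypothetical stand-ins.

```python
# Illustrative sketch of the detection rule: predict one sensor from the
# others with a (given) linear model, and flag an anomaly when the
# prediction and the observation differ by more than a threshold.
# Coefficients and threshold here are made up for illustration.
def predict(coeffs, others):
    """Linear prediction of one sensor's value from the remaining sensors."""
    return sum(a * v for a, v in zip(coeffs, others))

def is_anomaly(coeffs, others, observed, threshold):
    """Anomaly if predicted and observed values differ too much."""
    return abs(observed - predict(coeffs, others)) > threshold
```

In the talk's setting the coefficients come from a LASSO model learned in advance; here they are just example numbers.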
How we applied Spark (before)
• Time-series x_tj
  – T ~ 10^6 or more samples (time)
  – D ~ 10^2 sensors (dimensions)
  – (i.e., T >> D)
• Training: a linear model using LASSO regression (least squares + L1 regularization)
  – Hyper-parameter λ, tuned later to achieve the best prediction accuracy
• Evaluation: cross-validation of prediction accuracy; other data is used to test the model
• Overall structure: a search loop over the hyper-parameter λ, each iteration running training → model → evaluation
• The matrix S_jk = (1/T) Σ_{t=1}^{T} x_tj x_tk (D × D, small) is computed in advance from the original time-series data x_tj (T × D, big)
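The precomputation of S above is a natural map-reduce: each time sample maps to its outer product, and the reduce step sums them. A pure-Python sketch of that shape follows; function names (`outer`, `add`, `covariance_matrix`) are illustrative, not the talk's code, which runs on Spark RDDs.

```python
# Map-reduce sketch of precomputing the small D x D matrix
# S_jk = (1/T) * sum_t x_tj * x_tk from the big T x D time-series.
from functools import reduce

def outer(row):
    """map: one time sample (length-D row) -> its D x D outer product."""
    return [[xj * xk for xk in row] for xj in row]

def add(A, B):
    """reduce: element-wise sum of two D x D matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def covariance_matrix(x):
    """S = (1/T) * sum over samples of the outer product of each row."""
    T = len(x)
    S = reduce(add, (outer(row) for row in x))
    return [[s / T for s in row] for row in S]
```

Because the result is only D × D, it is cheap to broadcast to every node afterwards, which is exactly what the next slide exploits.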
How we applied Spark (after)
• Training is parallelized by sensor: training for sensor 1, sensor 2, …, sensor D runs independently, each followed by its evaluation.
• Evaluation is parallelized by time (map-reduce).
• Search loop over the hyper-parameter λ.
• The small data (S_jk, D × D) and the model are copied to all the nodes.
• The big data (x_tj, T × D) is not copied or moved.
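The layout above can be sketched in pure Python: the small precomputed S is shared with every task, and one independent training task runs per sensor. `train_sensor` is a hypothetical stand-in for the per-sensor LASSO fit; the real system does this with Spark executors, not threads.

```python
# Sketch of the "after" parallel layout: training split by sensor, with
# the small S matrix shared by all tasks. The big time-series data is
# never touched here, mirroring the slide's point that it is not moved.
from concurrent.futures import ThreadPoolExecutor

def train_sensor(i, S):
    """Hypothetical per-sensor training; depends only on the small S."""
    return ("model", i, S[i][i])

def train_all(S):
    """Run one independent training task per sensor, S shared by all."""
    D = len(S)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: train_sensor(i, S), range(D)))
```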
Why we did not use Spark MLlib
• LASSO regression: Spark MLlib offers SGD; our method uses the shooting algorithm. Decision: implement it ourselves using RDD. Reason: (maybe) better accuracy when T >> D.
• Cross-validation framework: Spark MLlib offers random split; our method uses block split. Decision: implement it ourselves using RDD. Reason: to avoid overfitting (specific to time-series).
• Cross-validation for usual data uses random sampling: train and test samples are interleaved over time.
• Cross-validation for time-series data uses block sampling: each test set is a contiguous block of time, and the rest is used for training.
• Balanced optimization of CV: models 1–4 are trained on their training blocks; the map step applies model i to test block i to produce predictions (RDD(original) → RDD(prediction)); the reduce step computes the average prediction accuracy.
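The block-split scheme above can be sketched as follows. This is a pure-Python illustration, not the talk's RDD implementation; `block_folds` is an illustrative name. Unlike random sampling, each fold holds out one contiguous block of time, so neighbouring (highly correlated) samples cannot leak from the test block into the training set.

```python
# Block-split cross-validation for a time series: each fold's test set
# is one contiguous block of time indices; everything else is training.
def block_folds(T, n_folds):
    """Yield (train_indices, test_indices) pairs with contiguous test blocks."""
    size = T // n_folds
    for f in range(n_folds):
        lo = f * size
        # the last fold absorbs any remainder so all T samples are covered
        hi = T if f == n_folds - 1 else lo + size
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, T))
        yield train, test
```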
Performance
[Chart: model computation time (50 sensors, 10k samples) for 1 node × 1 core, 1 node × 32 cores, and 2 nodes × 32 cores; y-axis: execution time (seconds)]
[Chart: model computation time with various data sizes (10,000 / 20,000 / 40,000 / 80,000 / 160,000 samples) for the same three configurations; x-axis: number of samples]
• Speed-up by 7.8 times.
• 16 times larger data can be handled within the same time.
Test environment:
• Processor: Intel(R) Xeon(R) E5-2680 0, 2.70 GHz; Memory / node: 32 GB
• Cores / node: 32 (2 processors × 8 cores × 2 hyper-threads); NW: 1 Gb Ethernet
• OS: Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64; JVM: IBM® Java 1.8.0
• Spark: version 1.5.0, standalone scheduler; Hadoop (HDFS): version 2.6.0
Lessons learned (in time-series handling)
• A sliding window is not in RDD; use the one from MLlib's RDDFunctions:
  import org.apache.spark.mllib.rdd.RDDFunctions._
  val x = sc.parallelize(1 to 1000).sliding(3)
  This turns the ordered elements 1, 2, 3, … into windows (1,2,3), (2,3,4), (3,4,5), …
• Pitfall: order preservation in RDD operations
  – join: order is not preserved; using it to re-attach keys after a map-reduce over a time-series is a bug
  – zip: order is preserved, so it is OK for this purpose
• Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API for future extensions instead of RDD?
• But in most cases, Spark programming is easy and fun. Thank you!
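For readers without a Spark cluster at hand, here is a pure-Python equivalent of what `RDDFunctions.sliding(3)` yields on an ordered sequence; the function name `sliding` mirrors the Scala API but this sketch is ours.

```python
# Pure-Python illustration of MLlib's sliding(n): all length-n windows
# over an ordered sequence, in order.
def sliding(xs, n):
    """Return every contiguous length-n window of xs as a tuple."""
    return [tuple(xs[i:i + n]) for i in range(len(xs) - n + 1)]
```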
Data
• Data is a high-dimensional time-series generated by sensors.
• Typical sizes (long in the vertical direction):
  – D: number of sensors < 1k
  – T: number of samples ~ 1M or more
  – File size: ~ 1 GB or more
• Data is processed in batch.
Example layout (T rows × D columns):
  Time          Sensor 1  …  Sensor D
  01:10:23.456   0.10     …  -0.91
  01:10:23.556   0.15     …  -0.99
  01:10:23.656   0.12     …  -0.87
  01:10:23.756   0.17     …  -0.54
  …              …        …  …
  23:59:59.956  -0.49     …  -0.29
Architecture
• Logical architecture: a model creation tool GUI on the client PC talks to the model creation tool server (the Spark driver) over Java RMI; the driver distributes work to the executors via Spark, and data is read from HDFS.
• Physical architecture: client PC, master server, worker servers, and storage.
• Frameworks / middleware stack: OS, JVM (JRE), Spark (standalone scheduler), HDFS, and other libraries; the model creation engine (ML) runs inside the model creation tool server on Spark.
Parallelization design
• Nature of the computation:
  – Training depends only on the matrix S (D × D), not on the large original data x (T × D)
  – Evaluation is a map-reduce over the samples (one row of length D) of the original data x (T × D)
  – Both can be computed independently per sensor (the variable being predicted)
• If the hyper-parameter search loop were parallelized:
  – A copy of the original data would be needed on every node
  – It might not fit in the memory of one node
• If one whole iteration were parallelized per sensor:
  – A copy of the original data would be needed on every node
  – It might not fit in the memory of one node
• Chosen design: parallelize training by sensor and evaluation by time
  – The matrix S and the model are shared by all nodes; this is feasible because they are small
  – Evaluation is a typical map-reduce; the original data can stay distributed
(Diagram: hyper-parameter search loop over training (S_jk, D × D) → model → evaluation (x_tj, T × D))
Modeling method
• Training: build a linear regression model from the data using LASSO regression (least squares + L1 regularization)
  – Variable i is the response variable (prediction target); all variables other than i are the explanatory variables
  – The coefficients {a_ji} are determined by the shooting algorithm so as to minimize the objective g_i; the hyper-parameter λ is set to some suitably small number (decided later)
  – A further optimization: S_jk is computed once, outside the loop, beforehand
  – Computational cost: roughly O(D^3) per variable
• Evaluation: cross-validation (the average per-sample prediction accuracy is evaluated on separate data)
  – Computational cost: O(TD) per variable
• Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy
(Diagram: hyper-parameter search loop over training (S_jk) → model → evaluation (x_tj))
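The shooting algorithm named above is cyclic coordinate descent for LASSO. A minimal pure-Python sketch follows; it is not the talk's Scala/RDD implementation, and the names (`soft_threshold`, `lasso_shooting`) are ours. Note how, once the Gram matrix S = XᵀX and the vector r = Xᵀy are precomputed, the updates no longer touch the big T × D data, which is the optimization the slide describes.

```python
# Shooting (cyclic coordinate-descent) algorithm for LASSO:
# minimize (1/2) a'Sa - r'a + lam * ||a||_1, given precomputed
# S = X'X (D x D) and r = X'y (length D). Illustrative sketch only.
def soft_threshold(rho, lam):
    """Soft-thresholding operator applied at each coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_shooting(S, r, lam, n_iter=100):
    """Cyclically update each coefficient, holding the others fixed."""
    D = len(r)
    a = [0.0] * D
    for _ in range(n_iter):
        for j in range(D):
            # correlation of the residual with feature j, excluding a[j]
            rho = r[j] - sum(S[j][k] * a[k] for k in range(D) if k != j)
            a[j] = soft_threshold(rho, lam) / S[j][j]
    return a
```

With λ = 0 this reduces to ordinary least squares; larger λ drives more coefficients exactly to zero, which is why the resulting per-sensor models are sparse.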
Conclusion
• We have developed scalable modeling software for anomaly detection of time-series data using Spark
  – Modeling is done in batch
  – Implemented our own LASSO regression algorithm with RDD
  – Optimized for the time-series situation where T >> D
• Performance improvements (2 nodes × 32 cores)
  – Speed-up by 7.8 times
  – A 16-times-larger data set can be handled within the same time