Evaluation of distributed open source solutions in CERN database use cases
HEPiX, spring 2015
Kacper Surdy, IT-DB-DBF
M. Grzybek, D. L. Garcia, Z. Baranowski, L. Canali, E. Grancher

Motivations
• propose a better-suited solution for big volumes of read-only data
• open new possibilities for data analysis
• offload Oracle in data warehouse workloads
• our users ask for this!

Hadoop evaluation
• One interface (SQL)
• Different use cases from CERN
  • LHC log, Controls data, ATLAS jobs manager (PanDA)
• Different query engines
• Different approaches for storing data

Query engines

Apache Hive
• SQL -> MapReduce jobs

Cloudera Impala
• Custom SQL planner and executor
• Uses Hive metastore

Apache Spark SQL
• Spark lightweight threading execution
• Can use Hive metastore

Data formats
Data storage format and compression type make a difference
• data volume
• IO vs. CPU bound throughput
• benefits of columnar storage
• partitioning
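The volume vs. CPU trade-off can be illustrated with a minimal sketch that uses only Python's standard library. It is not the benchmark behind these slides: the CSV rows are synthetic, `make_csv` and `measure` are hypothetical helper names, and zlib stands in for a fast codec because snappy is not part of the standard library.

```python
import bz2
import time
import zlib


def make_csv(rows: int = 20000) -> bytes:
    """Build synthetic, repetitive CSV data (a made-up stand-in for log data)."""
    lines = [f"{i},device_{i % 50},{(i * 37) % 1000}" for i in range(rows)]
    return "\n".join(lines).encode("utf-8")


def measure(name, compress, data):
    """Return (codec name, compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(out), elapsed


raw = make_csv()
results = [
    ("none", len(raw), 0.0),
    measure("zlib", zlib.compress, raw),
    measure("bz2", bz2.compress, raw),
]

for name, size, seconds in results:
    print(f"{name:>5}: {size:>8} bytes, {seconds:.4f} s")
```

Smaller files mean less IO per query but more CPU spent on compression and decompression; which side dominates depends on the cluster and the codec.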

Data formats

CSV
• human readable

SequenceFile
• flat set of binary stored (key, value) pairs

Apache Avro
• binary serializing standard

Parquet
• binary columnar-storage
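Why a columnar layout helps queries that touch only a few columns can be shown with a small sketch. This is not Parquet's actual on-disk format: the table, the field names and the fixed-width encoding are assumptions chosen only to count how many bytes each layout has to read.

```python
import struct

# Hypothetical table: (timestamp, variable_id, value) records.
rows = [(1_420_000_000 + i, i % 10, float(i)) for i in range(1000)]

ROW_FORMAT = "<qid"  # int64 timestamp, int32 id, float64 value (20 bytes)


def row_layout(records):
    """Row-oriented: all fields of one record are stored next to each other."""
    return b"".join(struct.pack(ROW_FORMAT, *r) for r in records)


def column_layout(records):
    """Column-oriented: each column is stored as one contiguous block."""
    ts, ids, values = zip(*records)
    return {
        "timestamp": struct.pack(f"<{len(ts)}q", *ts),
        "variable_id": struct.pack(f"<{len(ids)}i", *ids),
        "value": struct.pack(f"<{len(values)}d", *values),
    }


def sum_values_row(blob):
    """Summing 'value' from the row layout scans every byte of every record."""
    total = sum(v for _, _, v in struct.iter_unpack(ROW_FORMAT, blob))
    return total, len(blob)


def sum_values_column(columns):
    """Summing 'value' from the column layout reads only that column's block."""
    block = columns["value"]
    total = sum(v for (v,) in struct.iter_unpack("<d", block))
    return total, len(block)


row_total, row_bytes = sum_values_row(row_layout(rows))
col_total, col_bytes = sum_values_column(column_layout(rows))
print(f"row layout:    sum={row_total}, bytes read={row_bytes}")
print(f"column layout: sum={col_total}, bytes read={col_bytes}")
```

Both layouts give the same answer, but the columnar one reads only the bytes of the column the query needs.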

Figure: ACCLOG 8 days - file size comparison (bar chart, size in GB by format and compression)
• CSV: 1240 GB no compression, 238 GB snappy, 109 GB bzip2
• SequenceFile: 1545 GB no compression, 265 GB snappy, 117 GB bzip2
• Avro: 542 GB no compression, 226 GB snappy, 171 GB bzip2
• Parquet: 558 GB no compression, 288 GB snappy
• Oracle: 649 GB
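To put the file sizes in perspective, the short sketch below expresses each one relative to the Oracle footprint. The numbers are the chart labels; pairing each label with a format and codec is inferred from the label order, so treat that pairing as an assumption.

```python
# File sizes in GB as labelled on the chart; the format/codec pairing is
# inferred from label order and is an assumption.
sizes_gb = {
    "CSV": {"none": 1240, "snappy": 238, "bzip2": 109},
    "SequenceFile": {"none": 1545, "snappy": 265, "bzip2": 117},
    "Avro": {"none": 542, "snappy": 226, "bzip2": 171},
    "Parquet": {"none": 558, "snappy": 288},
}
ORACLE_GB = 649


def ratio_vs_oracle(size_gb: float) -> float:
    """Size relative to the Oracle footprint (below 1.0 means smaller)."""
    return size_gb / ORACLE_GB


relative = {
    fmt: {codec: round(ratio_vs_oracle(gb), 2) for codec, gb in codecs.items()}
    for fmt, codecs in sizes_gb.items()
}

for fmt, codecs in relative.items():
    print(fmt, codecs)
```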

Figure: ACCLOG 8 days - union query execution time (bar chart, time in s by format and compression)
• SequenceFile: 682 s no compression, 572 s snappy, 1800 s bzip2
• Avro: 216 s no compression, 113 s snappy, 118 s bzip2
• Parquet: 328 s no compression, 117 s snappy

Potential columnar store benefit

Figure: Ntuples data execution comparison (bar chart, execution time [s])
• Parquet: 23.39
• Avro: 467.31
• Oracle: 117
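The gap between the three bars is easier to read as a speedup factor. The sketch below uses the chart labels; assigning them to Parquet, Avro and Oracle in axis order is an assumption, and `speedup` is a hypothetical helper.

```python
# Execution times in seconds as labelled on the chart; assigning labels to
# systems in axis order is an assumption.
times_s = {"Parquet": 23.39, "Avro": 467.31, "Oracle": 117.0}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


for name in ("Avro", "Oracle"):
    factor = speedup(times_s[name], times_s["Parquet"])
    print(f"Parquet vs {name}: {factor:.1f}x faster")
```

Under that reading, Parquet comes out roughly 20 times faster than Avro and 5 times faster than Oracle for this query.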

Scalability - Impala

Figure: ACCLOG data, execution time [s] vs. number of nodes (4, 8, 12, 16), measured and ideal curves

Scalability - Hive

Figure: ACCLOG data - Hive scalability, execution time [s] vs. number of nodes (4, 8, 12, 16), measured and ideal curves

Scalability - Spark SQL

Figure: ACCLOG data, execution time [s] vs. number of nodes (4, 8, 12, 16), measured and ideal curves
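The "ideal" curve in the scalability charts can be sketched as linear scaling from the smallest cluster, with parallel efficiency measuring how close the measured time gets to it. This is one common definition and an assumption about how the slides derive the curve; the measured values below are hypothetical, not read from the charts.

```python
def ideal_times(base_nodes: int, base_time_s: float, node_counts):
    """Ideal (linear) scaling: time is inversely proportional to node count."""
    return {n: base_time_s * base_nodes / n for n in node_counts}


def parallel_efficiency(base_nodes, base_time_s, nodes, measured_s):
    """Fraction of the ideal speedup actually achieved (1.0 = perfectly linear)."""
    ideal_s = base_time_s * base_nodes / nodes
    return ideal_s / measured_s


# Hypothetical measurements in seconds, NOT values read from the slides.
measured = {4: 240.0, 8: 130.0, 12: 95.0, 16: 75.0}

ideal = ideal_times(4, measured[4], measured)
for n in sorted(measured):
    eff = parallel_efficiency(4, measured[4], n, measured[n])
    print(f"{n:>2} nodes: measured {measured[n]:.0f} s, "
          f"ideal {ideal[n]:.0f} s, efficiency {eff:.2f}")
```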

Summary
• Data format choice is key
• Heterogeneous data in the same place
• No matter what interface -> it scales
