evaluation of distributed open source solutions in cern database use cases hepix, spring 2015 kacper...

Post on 19-Dec-2015

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Evaluation of distributed open source solutions in CERN database use casesHEPiX, spring 2015Kacper Surdy IT-DB-DBFM. Grzybek, D. L. Garcia, Z. Baranowski, L. Canali, E. Grancher

3

Motivations• propose better suited solution for big volumes of

read-only data• open new possibilities for data analysis• offload Oracle in data warehouse workloads• our users ask for this!

4

Hadoop evaluation• One interface (SQL)• Different use cases from CERN• LHC log, Controls data, Atlas jobs manager (PANDA)

• Different query engines• Different approaches for storing data

5

Query enginesApache Hive• SQL -> MapReduce jobs

Cloudera Impala• Custom SQL planner and executor• Uses Hive metastore

Apache Spark SQL• Spark lightweight threading execution• Can use Hive metastore

Data formatsData storage format and compression type make a difference• data volume• IO vs. CPU bound throughput• benefits of columnar storage• partitioning

Data formatsCSV• human readable

CSV SequenceFile Avro Parquet0 GB

200 GB

400 GB

600 GB

800 GB

1000 GB

1200 GB

1400 GB

1600 GB

1800 GB

1240 GB

238 GB

109 GB

649 GB

ACCLOG 8 days - file size comparison

no compression snappy bzip2 Oracle

Data formatsCSV• human readable

SequenceFile• flat set of binary stored (key, value) pairs

CSV SequenceFile Avro Parquet0 GB

200 GB

400 GB

600 GB

800 GB

1000 GB

1200 GB

1400 GB

1600 GB

1800 GB

1240 GB

1545 GB

238 GB 265 GB

109 GB 117 GB

649 GB

ACCLOG 8 days - file size comparison

no compression snappy bzip2 Oracle

Data formatsCSV• human readable

SequenceFile• flat set of binary stored (key, value) pairs

Apache Avro• binary serializing standard

CSV SequenceFile Avro Parquet0 GB

200 GB

400 GB

600 GB

800 GB

1000 GB

1200 GB

1400 GB

1600 GB

1800 GB

1240 GB

1545 GB

542 GB

238 GB 265 GB 226 GB

109 GB 117 GB171 GB

649 GB

ACCLOG 8 days - file size comparison

no compression snappy bzip2 Oracle

SequenceFile Avro Parquet0 s

200 s

400 s

600 s

800 s

1000 s

1200 s

1400 s

1600 s

1800 s

2000 s

682 s

216 s

572 s

113 s

1800 s

118 s

ACCLOG 8 days - union query execution time

no compression snappy bzip2

Data formatsCSV• human readable

SequenceFile• flat set of binary stored (key, value) pairs

Apache Avro• binary serializing standard

Parquet• binary columnar-storage

CSV SequenceFile Avro Parquet0 GB

200 GB

400 GB

600 GB

800 GB

1000 GB

1200 GB

1400 GB

1600 GB

1800 GB

1240 GB

1545 GB

542 GB 558 GB

238 GB 265 GB 226 GB288 GB

109 GB 117 GB171 GB

649 GB

ACCLOG 8 days - file size comparison

no compression snappy bzip2 Oracle

SequenceFile Avro Parquet0 s

200 s

400 s

600 s

800 s

1000 s

1200 s

1400 s

1600 s

1800 s

2000 s

682 s

216 s328 s

572 s

113 s 117 s

1800 s

118 s

ACCLOG 8 days - union query execution time

no compression snappy bzip2

Potential columnar store benefit

Parquet Avro Oracle0

50

100

150

200

250

300

350

400

450

500

23.39

467.31

117

Ntuples data exeuction comparison

Exec

ution

tim

e [s

]

Scalability - Impala

4 8 12 160

50

100

150

200

250

300

ACCLOG data

MeasuredIdeal

Number of nodes

Exec

ution

tim

e [s

]

Scalability - Hive

4 8 12 160

100

200

300

400

500

600

ACCLOG data – Hive scalability

MeasuredIdeal

Number of nodes

Exec

ution

tim

e [s

]

Scalability – Spark SQL

4 8 12 160

20

40

60

80

100

120

140

160

180

200

ACCLOG data

MeasuredIdeal

Number of nodes

Exec

ution

tim

e [s

]

Summary• Data format choice is a key• Heterogeneous data in the same place• No matter what interface -> it scales

top related