evaluation of distributed open source solutions in cern database use cases hepix, spring 2015 kacper...
Post on 19-Dec-2015
215 Views
Preview:
TRANSCRIPT
Evaluation of distributed open source solutions in CERN database use casesHEPiX, spring 2015Kacper Surdy IT-DB-DBFM. Grzybek, D. L. Garcia, Z. Baranowski, L. Canali, E. Grancher
3
Motivations• propose better suited solution for big volumes of
read-only data• open new possibilities for data analysis• offload Oracle in data warehouse workloads• our users ask for this!
4
Hadoop evaluation• One interface (SQL)• Different use cases from CERN• LHC log, Controls data, Atlas jobs manager (PANDA)
• Different query engines• Different approaches for storing data
5
Query enginesApache Hive• SQL -> MapReduce jobs
Cloudera Impala• Custom SQL planner and executor• Uses Hive metastore
Apache Spark SQL• Spark lightweight threading execution• Can use Hive metastore
Data formatsData storage format and compression type make a difference• data volume• IO vs. CPU bound throughput• benefits of columnar storage• partitioning
Data formatsCSV• human readable
CSV SequenceFile Avro Parquet0 GB
200 GB
400 GB
600 GB
800 GB
1000 GB
1200 GB
1400 GB
1600 GB
1800 GB
1240 GB
238 GB
109 GB
649 GB
ACCLOG 8 days - file size comparison
no compression snappy bzip2 Oracle
Data formatsCSV• human readable
SequenceFile• flat set of binary stored (key, value) pairs
CSV SequenceFile Avro Parquet0 GB
200 GB
400 GB
600 GB
800 GB
1000 GB
1200 GB
1400 GB
1600 GB
1800 GB
1240 GB
1545 GB
238 GB 265 GB
109 GB 117 GB
649 GB
ACCLOG 8 days - file size comparison
no compression snappy bzip2 Oracle
Data formatsCSV• human readable
SequenceFile• flat set of binary stored (key, value) pairs
Apache Avro• binary serializing standard
CSV SequenceFile Avro Parquet0 GB
200 GB
400 GB
600 GB
800 GB
1000 GB
1200 GB
1400 GB
1600 GB
1800 GB
1240 GB
1545 GB
542 GB
238 GB 265 GB 226 GB
109 GB 117 GB171 GB
649 GB
ACCLOG 8 days - file size comparison
no compression snappy bzip2 Oracle
SequenceFile Avro Parquet0 s
200 s
400 s
600 s
800 s
1000 s
1200 s
1400 s
1600 s
1800 s
2000 s
682 s
216 s
572 s
113 s
1800 s
118 s
ACCLOG 8 days - union query execution time
no compression snappy bzip2
Data formatsCSV• human readable
SequenceFile• flat set of binary stored (key, value) pairs
Apache Avro• binary serializing standard
Parquet• binary columnar-storage
CSV SequenceFile Avro Parquet0 GB
200 GB
400 GB
600 GB
800 GB
1000 GB
1200 GB
1400 GB
1600 GB
1800 GB
1240 GB
1545 GB
542 GB 558 GB
238 GB 265 GB 226 GB288 GB
109 GB 117 GB171 GB
649 GB
ACCLOG 8 days - file size comparison
no compression snappy bzip2 Oracle
SequenceFile Avro Parquet0 s
200 s
400 s
600 s
800 s
1000 s
1200 s
1400 s
1600 s
1800 s
2000 s
682 s
216 s328 s
572 s
113 s 117 s
1800 s
118 s
ACCLOG 8 days - union query execution time
no compression snappy bzip2
Potential columnar store benefit
Parquet Avro Oracle0
50
100
150
200
250
300
350
400
450
500
23.39
467.31
117
Ntuples data exeuction comparison
Exec
ution
tim
e [s
]
Scalability - Impala
4 8 12 160
50
100
150
200
250
300
ACCLOG data
MeasuredIdeal
Number of nodes
Exec
ution
tim
e [s
]
Scalability - Hive
4 8 12 160
100
200
300
400
500
600
ACCLOG data – Hive scalability
MeasuredIdeal
Number of nodes
Exec
ution
tim
e [s
]
Scalability – Spark SQL
4 8 12 160
20
40
60
80
100
120
140
160
180
200
ACCLOG data
MeasuredIdeal
Number of nodes
Exec
ution
tim
e [s
]
Summary• Data format choice is a key• Heterogeneous data in the same place• No matter what interface -> it scales
top related