lvc20-303 - state of big data and data · 2020. 12. 22. · bdds-54 - workload setup on database...

LVC20-303 - State of Big Data and Data Science on ARM - Ganesh Raju Tech Lead, Big Data and Data Science, Linaro

Post on 21-Jan-2021

LVC20-303 - State of Big Data and Data Science on ARM

- Ganesh Raju, Tech Lead, Big Data and Data Science, Linaro

Agenda
● Big Data Ecosystem
● High Level Goals
● Misconceptions on ARM
● Approach
● Team’s Achievements
● General Pain Points
● Current Status in ARM World
● Roadmap

Big Data and Data Science Ecosystem

Big Data in itself is a huge ecosystem. It is just too large, complex and redundant. It has too many standards, too many engines, too many vendors.

Categorizing Big Data Components

● Core Components
● Operational Components
● Data Ingestion
● Streaming
● Data Warehousing
● NoSQL
● File formats
● Dashboards
● Security/Governance
● Data Science Tools / Machine Learning Components
● Notebooks

High Level Goals
1. ARM is a first-class citizen in all Big Data and Data Science projects
    a. Build and port
    b. Set up CI on ARM hardware
    c. Automated tests
    d. Multi-arch Docker images
2. Benchmark against x86
3. Optimize with AArch64 advantages
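The "multi-arch Docker images" goal can be illustrated with a minimal sketch. A multi-arch image is published as an index that lists one manifest per platform, and the client picks the entry matching the host. The code below is only an illustration of that selection step, not how Docker itself is implemented; the alias table, the function names and the digests in `EXAMPLE_INDEX` are hypothetical.

```python
# Map machine names as reported by the OS (for example by platform.machine())
# to the architecture names used in image indexes. Illustrative, not exhaustive.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "ppc64le": "ppc64le",
}


def normalize_arch(machine):
    """Return the image-index architecture name for a machine string."""
    key = machine.lower()
    if key not in ARCH_ALIASES:
        raise ValueError(f"unknown machine type: {machine}")
    return ARCH_ALIASES[key]


def select_manifest(index, machine, os_name="linux"):
    """Pick the manifest entry matching the host, or None if there is none."""
    arch = normalize_arch(machine)
    for entry in index["manifests"]:
        plat = entry.get("platform", {})
        if plat.get("architecture") == arch and plat.get("os") == os_name:
            return entry
    return None


# Hypothetical index with two platform entries; the digests are placeholders.
EXAMPLE_INDEX = {
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:hypothetical-amd64",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:hypothetical-arm64",
         "platform": {"architecture": "arm64", "os": "linux"}},
    ],
}

if __name__ == "__main__":
    entry = select_manifest(EXAMPLE_INDEX, "aarch64")
    print(entry["digest"] if entry else "no image for this host")
```

An image that publishes only an `amd64` entry returns `None` here for an AArch64 host, which is the situation the goal is meant to eliminate.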

Misconceptions
● ARM is Raspberry Pi?
● Projects unfamiliar with the ARM platform
● ARM is not production ready.
● Unavailability of ARM hardware
● It’s Java, so it should run anywhere!
● Dependencies not having ARM support
● Additional effort required for testing.
● Lack of interest in working on ARM

Approach
➢ Top to Bottom Approach
    ○ Operational Component - Apache Ambari
    ○ Ambari Mpack
    ○ Apache Bigtop
➢ Bottom to Top Approach
    ○ Core components - Apache Hadoop, Spark, HBase, Hive
    ○ Other Apache Projects like Apache Arrow, Beam
    ○ Other Projects - NiFi, MiniFi, etc.
    ○ Data science Projects - TensorFlow, Anaconda, H2O, etc.

Apache Bigtop

Bigtop is a comprehensive project for packaging, testing, configuring and installing many Big Data components.

Originally, releases and CI were only available for x86 and PowerPC.

To run on Arm, a lot of hacks and manual configuration tuning were needed.
● Details: Linaro Big Data team webpage, https://collaborate.linaro.org/display/BDTS/Documentation

Bigtop - Supports >25 Hadoop Ecosystem Components

Bigtop - Foundation for many commercial Hadoop Distros/services

Bigtop - Achievements
Apache Bigtop Contributions (BDDS-11)
● BDDS-8 - Apache Ambari mpack
● BDDS-8 - Add ElasticSearch to Apache Bigtop
● A number of patches to upgrade components
● Upstream CI
● Integration tests and smoke tests
● Linaro leading the effort
● Recognition for contributions
    ○ Jun He is recognized as Chair of the Bigtop PMC
    ○ Jun He has filled the RM (Release Manager) role for Bigtop
    ○ Yuqi Gu has been recognized as a maintainer for Bigtop

Apache Bigtop on AArch64 Timeline

Dates on the timeline: 2016-04, 2017-03, 2017-11, 2018-03, 2018-11, 2019-06

Milestones on the timeline:
● Build setup in Linaro
● v1.2.1 released with a lot of AArch64 patches
● v1.3.0 - Officially, ARM is a First Class Citizen; Jun He - Release Manager
● Successful build on Ubuntu
● AArch64 CI nodes added
● v1.4 released
● v1.5

Bigtop Smoke Test CI matrix

Bigtop Distro Matrix and Components

Distro ARM x86 PPC

CentOS 7 & 8

Debian 9 & 10

Fedora 31

Ubuntu 16.04 & 18.04

OpenSuse 42.3

Hadoop Spark HBase Hive

Flink ElasticSearch LogStash Kibana

Kafka Solr Ambari Flume

Giraph Gpdb Ignite Alluxio

Livy Mahout Oozie Phoenix

Qfs Sqoop Tez YCSB

Zookeeper Zeppelin Hama Tajo

Apache Bigtop: v1.5
Upcoming in a few weeks!

New component additions:
- ElasticSearch v5.6.14, Logstash, Kibana v5.4.1

Version bumps:
- Hadoop 2.10.5, Spark 2.4.5, HBase v1.5.0, Hive v2.3.6, Kafka 2.4.0, Flume 1.9.0, Alluxio 1.8.2, Giraph v1.2.0, Ignite v2.7.6, Livy v0.7.0, Phoenix v4.15.0, Solr v6.6.6, Tez v0.9.2, Zeppelin v0.8.2, Zookeeper v3.4.13

Components removed:
- Apex, Hama, Tajo

New features:
- Integration Tests
- Smoke Tests
- More built-in test coverage
    - Hive, Flink, Giraph, Zeppelin, etc.
- A lot of improvements and bug fixes!

What is Apache Ambari

➢ Platform Independent
➢ Pluggable component
➢ Version Management and Upgrade
➢ Extensibility
➢ Failure Recovery
➢ Security

Usage of Apache Ambari

● Provisioning of Big Data cluster
● Monitoring of Hadoop Cluster
● Management of Hadoop Cluster
● Security of Hadoop Cluster

Achievements
● Build and Port - Majority of them already have ARM bits available
    ○ Apache Pulsar, Phoenix, NiFi, MiniFi, Airflow, Beam, etc.
● CI with upstream
    ○ Bigtop, Hadoop, Spark, HBase, Hive, Flink, etc.
● Workload setup and Demo
    ○ ELK Stack - ElasticSearch, Logstash and Kibana
    ○ H2O and Sparkling Water
    ○ Apache Ambari
    ○ Apache Drill
● Benchmarking
    ○ HiBench
● Optimization
    ○ E.g., Arrow CRC32 and ARM-specific optimization
● Helping University of Michigan
    ○ Cluster running Bigtop: petabyte size, Twitter data, 20 GB of tweets / day
    ○ Ambari and Bigtop
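For context on the CRC32 optimization item above: AArch64 provides CRC32 and CRC32C instructions that compute these checksums in hardware. The sketch below is a plain bit-by-bit reference of the two checksums such instructions accelerate. It is not the Arrow patch and says nothing about which variant Arrow uses; the function and constant names are chosen here for illustration.

```python
import zlib

# Reflected (LSB-first) forms of the two generator polynomials.
CRC32_POLY_REFLECTED = 0xEDB88320   # CRC-32 (IEEE 802.3, used by zlib)
CRC32C_POLY_REFLECTED = 0x82F63B78  # CRC-32C (Castagnoli)


def crc_bitwise(data, poly):
    """Compute a reflected 32-bit CRC one bit at a time (slow reference)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    payload = b"123456789"
    # The CRC-32 result agrees with the zlib implementation.
    assert crc_bitwise(payload, CRC32_POLY_REFLECTED) == zlib.crc32(payload)
    print(hex(crc_bitwise(payload, CRC32C_POLY_REFLECTED)))
```

The inner loop runs eight iterations per input byte, which is why table-driven or hardware-assisted versions matter for checksum-heavy data paths.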

CI setup with other projects

Project CI link

Apache Bigtop https://ci.bigtop.apache.org/computer/

Apache Hadoop https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-qbt-linux-ARM-trunk/

Apache Spark https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/

Apache HBase https://builds.apache.org/job/HBase-Nightly-ARM/

Apache Hive https://builds.apache.org/job/Hive-linux-ARM-trunk

Apache Flink https://status.openlabtesting.org/builds?job_name=flink-build-and-test-arm64-core-and-tests

Apache Kudu https://logs.openlabtesting.org/logs/periodic-kudu-mail/github.com/apache/kudu/master/kudu-build-test-arm64-in-docker/4df6de9/

ElasticSearch Stack https://ci.linaro.org/view/All/job/bigdata-elasticsearch/

Apache Arrow https://travis-ci.org/github/apache/arrow/jobs/728491410

Apache Drill https://ci.linaro.org/view/All/job/ldcg-bigdata-apache-drill/

Apache Impala http://status.openlabtesting.org/job/impala-build-test-arm64

Tensorflow http://status.openlabtesting.org/builds?job_name=tensorflow-arm64-release-build-v2.1.0-py36

PyTorch https://snapshots.linaro.org/hpc/python/pytorch/3/

Pain Points
● Dependency issues
    ○ Native binaries: protobuf, phantomjs, …
    ○ Jars with native binaries embedded: leveldb-jni, ignite-shmem, jffi, snappy-java …
    ○ Version mismatch: slf4j, log4j, log4j2, …
● Cyclic references take a lot of effort to fix
● It takes time to convince projects
    ○ Protobuf and PhantomJS issue
    ○ Bazel issue
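The "jars with native binaries embedded" problem above can be checked mechanically. The following is a minimal sketch, not a tool used by the team: it opens a JAR as a ZIP archive, reads the ELF header of each entry, and reports which CPU architecture every embedded Linux shared object targets. The function names are hypothetical, and only ELF files are recognized (Windows and macOS binaries are ignored).

```python
import io
import struct
import zipfile

# e_machine values from the ELF specification for a few common targets.
ELF_MACHINES = {3: "x86", 21: "ppc64", 40: "arm", 62: "x86_64", 183: "aarch64"}


def elf_machine(header):
    """Return the architecture named in an ELF header, or None if not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA) is 1 for little-endian files and 2 for big-endian.
    endian = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def native_archs_in_jar(jar):
    """Map each embedded ELF file in a JAR (path or file object) to its arch."""
    found = {}
    with zipfile.ZipFile(jar) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                arch = elf_machine(member.read(20))
            if arch is not None:
                found[info.filename] = arch
    return found


def missing_arch(jar, wanted="aarch64"):
    """True if the JAR ships native code but none of it targets `wanted`."""
    archs = set(native_archs_in_jar(jar).values())
    return bool(archs) and wanted not in archs
```

Running `missing_arch` over every JAR in a build's dependency directory gives a quick list of artifacts that carry x86-only natives and therefore need an AArch64 build or a replacement.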

Team’s Current Scope (Next 6 months)
● Building and porting Big Data and Data Science projects on ARM64
    ○ BDDS-7 - Apache Bigtop v1.5 Release
    ○ Start Apache Bigtop v1.6 work
        ■ Hadoop 3 upgrade
        ■ Ambari mpack as top level component
    ○ BDDS-54 - Workload setup on Database components like Kudu, Impala, Presto and Cassandra
    ○ BDDS-12 - Kerberos and Security components like Ranger, Knox and Atlas
● Utilize Apache Arrow in Apache Spark
● Arrow memory optimization and fix
● BDDS-262 - RocksDB performance issue fix
    ○ RocksDB v5.17+ has >8% performance regression
● BDDS-17 - Apache Airflow Workload end to end Setup and Demo
● BDDS-252 - Apache Pulsar Workload end to end Setup and Demo

Roadmap
● Bigtop
    ○ Hadoop 3 upgrade
    ○ JDK 11 integration
    ○ Ambari Mpack
    ○ Kubernetes support
    ○ Add Beam, Arrow, Storm, NiFi, MiniFi, Presto
    ○ Add Data science tools
● Build and Port:
    ○ Databases: ArangoDB, Hawq, Accumulo, Geode, Parquet-MR, Thrift, Gobblin, etc.
● ARM Optimization
    ○ Benchmarking
    ○ SVE and SIMD optimization
● Data Science
    ○ MLOps, Spark-ML, FlinkML, Horovod, Hopsml, BigDL, PyTorch, Scikit-Learn, NumPy, Keras, MxNet
    ○ Anaconda
● HPDA
    ○ Hadoop and Spark on RDMA. RoCE+Spark
    ○ Hadoop on Ceph
● End to End Use case

HPDA – High Performance Data Analysis
● 23% of HPC system usage is currently HPDA
    ○ Machine learning
    ○ Stochastic modeling / Monte Carlo – explore large problem spaces
    ○ MapReduce/Hadoop, graph analytics, knowledge discovery

RDMA Big Data Proposal
● RDMA could give over 40% performance boost for Big Data
● Develop and test plugins for Hadoop components, such as MapReduce and HDFS, to accelerate Hadoop by using RDMA (Remote Direct Memory Access) technology on the ARM64 platform

Thanks

Linaro BDDS team:
Ganesh Raju - Tech Lead, Linaro
Yuqi Gu - Assignee, ARM
Jun He - Member Engineer, ARM

Thanks to OpenEuler, Packet and ARM for their contributions