lvc20-303 - state of big data and data · 2020. 12. 22. · bdds-54 - workload setup on database...
TRANSCRIPT
LVC20-303 - State of Big Data and Data Science on ARM
- Ganesh Raju, Tech Lead, Big Data and Data Science, Linaro
Agenda
● Big Data Ecosystem
● High Level Goals
● Misconceptions on ARM
● Approach
● Team’s Achievements
● General Pain Points
● Current Status in ARM World
● Roadmap
Big Data and Data Science Ecosystem
Big Data is in itself a huge ecosystem. It is large, complex and redundant, with too many standards, too many engines and too many vendors.
Categorizing Big Data Components
● Core Components
● Operational Components
● Data Ingestion
● Streaming
● Data Warehousing
● NoSQL
● File formats
● Dashboards
● Security/Governance
● Data Science Tools / Machine Learning Components
● Notebooks
High Level Goals
1. ARM is a first-class citizen in all Big Data and Data Science projects
  a. Build and port
  b. Set up CI on ARM hardware
  c. Automated tests
  d. Multi-arch Docker images
2. Benchmark against x86
3. Optimize with AArch64 advantages
Misconceptions
● ARM is Raspberry Pi?
● Projects unfamiliar with the ARM platform
● ARM is not production ready.
● Unavailability of ARM hardware
● It’s Java, and it should run anywhere!!!
● Dependencies not having ARM support
● Additional effort required for testing.
● Lack of interest in working on ARM
Approach
➢ Top to Bottom Approach
  ○ Operational Component - Apache Ambari
  ○ Ambari Mpack
  ○ Apache Bigtop
➢ Bottom to Top Approach
  ○ Core components - Apache Hadoop, Spark, HBase, Hive
  ○ Other Apache projects like Apache Arrow, Beam
  ○ Other projects - NiFi, MiniFi, etc.
  ○ Data science projects - Tensorflow, Anaconda, H2O, etc.
Apache Bigtop
Bigtop is a comprehensive project for packaging, testing, configuring and installing many Big Data components.
Originally, releases and CI were only available for x86 and PowerPC.
To run on Arm, a lot of hacks and manual configuration tuning were needed.
● Details: Linaro Big Data team webpage, https://collaborate.linaro.org/display/BDTS/Documentation
Bigtop - Supports >25 Hadoop Ecosystem Components
Bigtop - Foundation for many commercial Hadoop distros/services
Bigtop - Achievements
Apache Bigtop Contributions (BDDS-11)
● BDDS-8 - Apache Ambari mpack
● BDDS-8 - Add ElasticSearch to Apache Bigtop
● A number of patches to upgrade components
● Upstream CI
● Integration tests and smoke tests
● Linaro leading the effort
● Recognition for contributions
  ○ Jun He is recognized as Chair of the Bigtop PMC
  ○ Jun He has filled the Release Manager (RM) role for Bigtop
  ○ Yuqi Gu has been recognized as a maintainer for Bigtop
Apache Bigtop on AArch64 Timeline
[Timeline figure]
Dates on the axis: 2016-04, 2017-03, 2017-11, 2018-03, 2018-11, 2019-06
Milestones shown:
● Build setup in Linaro
● v1.2.1 released with a lot of AArch64 patches
● v1.3.0 - ARM is officially a first-class citizen
● Jun He - Release Manager
● Successful build on Ubuntu
● AArch64 CI nodes added
● v1.4 released
● v1.5
Bigtop Smoke Test CI matrix
Bigtop Distro Matrix and Components
Distros (matrix columns: ARM, x86, PPC):
● CentOS 7 & 8
● Debian 9 & 10
● Fedora 31
● Ubuntu 16.04 & 18.04
● OpenSuse 42.3
Components:
Hadoop, Spark, HBase, Hive,
Flink, ElasticSearch, LogStash, Kibana,
Kafka, Solr, Ambari, Flume,
Giraph, Gpdb, Ignite, Alluxio,
Livy, Mahout, Oozie, Phoenix,
Qfs, Sqoop, Tez, YCSB,
Zookeeper, Zeppelin, Hama, Tajo
Apache Bigtop: v1.5
Upcoming in a few weeks!
New component additions:
- ElasticSearch v5.6.14, Logstash, Kibana v5.4.1
Version bumps:
- Hadoop 2.10.5, Spark 2.4.5, HBase v1.5.0, Hive v2.3.6, Kafka 2.4.0, Flume 1.9.0, Alluxio 1.8.2, Giraph v1.2.0, Ignite v2.7.6, Livy v0.7.0, Phoenix v4.15.0, Solr v6.6.6, Tez v0.9.2, Zeppelin v0.8.2, Zookeeper v3.4.13
Components removed:
- Apex, Hama, Tajo
New features:
- Integration tests
- Smoke tests
- More built-in test coverage: Hive, Flink, Giraph, Zeppelin, etc.
- A lot of improvements and bug fixes!
What is Apache Ambari
➢ Platform Independent
➢ Pluggable component
➢ Version Management and Upgrade
➢ Extensibility
➢ Failure Recovery
➢ Security
Usage of Apache Ambari
● Provisioning of Big Data cluster
● Monitoring of Hadoop Cluster
● Management of Hadoop Cluster
● Security of Hadoop Cluster
Achievements
● Build and Port - Majority of them already have ARM bits available
  ○ Apache Pulsar, Phoenix, NiFi, MiniFi, Airflow, Beam, etc.
● CI with upstream
  ○ Bigtop, Hadoop, Spark, HBase, Hive, Flink, etc.
● Workload setup and Demo
  ○ ELK Stack - ElasticSearch, Logstash and Kibana
  ○ H2O and Sparkling Water
  ○ Apache Ambari
  ○ Apache Drill
● Benchmarking
  ○ HiBench
● Optimization
  ○ E.g., Arrow CRC32 and ARM-specific optimization
● Helping University of Michigan
  ○ Cluster running Bigtop: petabyte size, Twitter data, 20 GB of tweets / day
  ○ Ambari and Bigtop
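The slide names an Arrow CRC32 optimization without giving details. As background only, the sketch below is a plain bit-by-bit software CRC-32, the kind of checksum loop that the optional ARMv8-A CRC32 instructions can replace with a single instruction per word. It assumes the standard CRC-32 polynomial (the one `zlib` uses); the slide does not say which CRC variant Arrow uses, and the function name `crc32_bitwise` is hypothetical, not Arrow code.

```python
import zlib

# Reflected (bit-reversed) form of the standard CRC-32 polynomial 0x04C11DB7.
# Assumption: the standard polynomial; the slide does not name the variant.
CRC32_POLY_REFLECTED = 0xEDB88320


def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Reference CRC-32, processing one bit at a time (slow but simple)."""
    crc ^= 0xFFFFFFFF                      # undo/apply the initial inversion
    for byte in data:
        crc ^= byte                        # fold the next byte into the low bits
        for _ in range(8):                 # shift out eight bits, one at a time
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                # final inversion


if __name__ == "__main__":
    payload = b"big data on AArch64"
    # The software loop and zlib's implementation agree on the result.
    print(hex(crc32_bitwise(payload)), hex(zlib.crc32(payload)))
```

The inner loop runs eight iterations per input byte, which is why moving the checksum to a hardware instruction is an attractive platform-specific optimization.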
CI setup with other projects
Project CI link
Apache Bigtop https://ci.bigtop.apache.org/computer/
Apache Hadoop https://builds.apache.org/view/H-L/view/Hadoop/job/Hadoop-qbt-linux-ARM-trunk/
Apache Spark https://amplab.cs.berkeley.edu/jenkins/label/spark-arm/
Apache HBase https://builds.apache.org/job/HBase-Nightly-ARM/
Apache Hive https://builds.apache.org/job/Hive-linux-ARM-trunk
Apache Flink https://status.openlabtesting.org/builds?job_name=flink-build-and-test-arm64-core-and-tests
Apache Kudu https://logs.openlabtesting.org/logs/periodic-kudu-mail/github.com/apache/kudu/master/kudu-build-test-arm64-in-docker/4df6de9/
ElasticSearch Stack https://ci.linaro.org/view/All/job/bigdata-elasticsearch/
Apache Arrow https://travis-ci.org/github/apache/arrow/jobs/728491410
Apache Drill https://ci.linaro.org/view/All/job/ldcg-bigdata-apache-drill/
Apache Impala http://status.openlabtesting.org/job/impala-build-test-arm64
Tensorflow http://status.openlabtesting.org/builds?job_name=tensorflow-arm64-release-build-v2.1.0-py36
PyTorch https://snapshots.linaro.org/hpc/python/pytorch/3/
Pain Points
● Dependency issues
  ○ Native binaries: protobuf, phantomjs, …
  ○ Jars with native binaries embedded: leveldb-jni, ignite-shmem, jffi, snappy-java, …
  ○ Version mismatch: slf4j, log4j, log4j2, …
● Cyclic references take a lot of effort to fix
● It takes time to convince projects
  ○ Protobuf and PhantomJS issue
  ○ Bazel issue
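The "jars with native binaries embedded" pain point is easy to miss because the JAR itself looks platform neutral. A minimal sketch of how one might check for it: treat the JAR as a zip archive, read the ELF header of each entry, and report the target architecture. The function names and the example paths are hypothetical; the `e_machine` values (62 for x86-64, 183 for AArch64) come from the ELF specification. This only covers Linux ELF libraries, not Windows or macOS binaries.

```python
import io
import struct
import zipfile
from typing import Dict, Optional

# e_machine values defined by the ELF specification.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> Optional[str]:
    """Return the architecture of an ELF image, or None if it is not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, "other({})".format(machine))


def native_libs_in_jar(jar) -> Dict[str, str]:
    """Map every ELF entry inside a JAR (path or file object) to its architecture."""
    found = {}
    with zipfile.ZipFile(jar) as zf:
        for name in zf.namelist():
            if name.endswith("/"):         # skip directory entries
                continue
            with zf.open(name) as entry:
                arch = elf_machine(entry.read(20))
            if arch is not None:
                found[name] = arch
    return found


if __name__ == "__main__":
    # Build a tiny in-memory JAR with one fake x86-64 library header.
    fake_so = b"\x7fELF" + bytes([2, 1, 1]) + b"\x00" * 9 + struct.pack("<HH", 3, 62)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("native/libdemo.so", fake_so)
    buf.seek(0)
    print(native_libs_in_jar(buf))
```

A JAR that reports only `x86_64` entries will load fine on an ARM JVM and then fail at the first native call, which is the failure mode behind "it's Java, it should run anywhere".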
Team’s Current Scope (Next 6 months)
● Building and porting Big Data and Data Science projects on ARM64
  ○ BDDS-7 - Apache Bigtop v1.5 release
  ○ Start Apache Bigtop v1.6 work
    ■ Hadoop 3 upgrade
    ■ Ambari mpack as top-level component
  ○ BDDS-54 - Workload setup on database components like Kudu, Impala, Presto and Cassandra
  ○ BDDS-12 - Kerberos and security components like Ranger, Knox and Atlas
● Utilize Apache Arrow in Apache Spark
● Arrow memory optimization and fix
● BDDS-262 - RocksDB performance issue fix
  ○ RocksDB v5.17+ has >8% performance regression
● BDDS-17 - Apache Airflow workload end-to-end setup and demo
● BDDS-252 - Apache Pulsar workload end-to-end setup and demo
Roadmap
● Bigtop
  ○ Hadoop 3 upgrade
  ○ JDK 11 integration
  ○ Ambari Mpack
  ○ Kubernetes support
  ○ Add Beam, Arrow, Storm, NiFi, MiniFi, Presto
  ○ Add Data Science tools
● Build and Port
  ○ Databases: ArangoDB, Hawq, Accumulo, Geode, Parquet-MR, Thrift, Gobblin, etc.
● ARM Optimization
  ○ Benchmarking
  ○ SVE and SIMD optimization
● Data Science
  ○ MLOps, Spark-ML, FlinkML, Horovod, Hopsml, BigDL, PyTorch, Scikit-Learn, NumPy, Keras, MxNet
  ○ Anaconda
● HPDA
  ○ Hadoop and Spark on RDMA. RoCE+Spark
  ○ Hadoop on Ceph
● End to End Use case
HPDA – High Performance Data Analysis
● 23% of HPC system usage is currently HPDA
  ○ Machine learning
  ○ Stochastic modeling / Monte Carlo – explore large problem spaces
  ○ MapReduce/Hadoop, graph analytics, knowledge discovery
RDMA Big Data Proposal
● RDMA could give over a 40% performance boost for Big Data
● Develop and test plugins for Hadoop components such as MapReduce and HDFS, to accelerate Hadoop using RDMA (Remote Direct Memory Access) technology on the ARM64 platform
Thanks
Linaro BDDS team:
Ganesh Raju - Tech Lead, Linaro
Yuqi Gu - Assignee, ARM
Jun He - Member Engineer, ARM
Thanks to OpenEuler, Packet and ARM for their contributions