Best Practices for Using Alluxio with Spark
Gene Pang, Alluxio, Inc. Spark Summit EU - October 2017
About Me
• Gene Pang
• Software engineer @ Alluxio, Inc.
• Alluxio open source PMC member
• Ph.D. from AMPLab @ UC Berkeley
• Worked at Google before UC Berkeley
• Twitter: @unityxx
• Github: @gpang
©2017 Alluxio, Inc. All Rights Reserved 2
Outline
1. Alluxio Overview
2. Alluxio + Spark Use Cases
3. Alluxio Architecture
4. Using Spark with Alluxio
5. Experiments
Data Ecosystem Yesterday
• One Compute Framework
• Single Storage System
• Co-located
Data Ecosystem Today
• Many Compute Frameworks
• Multiple Storage Systems
• Most not co-located
Data Ecosystem Issues
• Each application manages multiple data sources
• Adding/removing data sources requires application changes
• Storage optimizations require application changes
• Lower performance due to lack of locality
Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Memory Performance
[Diagram: Alluxio interface stack]
Application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface
Next Gen Analytics with Alluxio
[Diagram: Alluxio interface stack, as on the previous slide]
Apps, Data & Storage at Memory Speed
✓ Big Data/IoT
✓ AI/ML
✓ Deep Learning
✓ Cloud Migration
✓ Multi Platform
✓ Autonomous
Fastest Growing Big Data Open Source Projects
Fastest Growing open-source project in the big data ecosystem
Running in large production clusters
600+ Contributors from 100+ organizations
[Chart: GitHub Open Source Contributors by Month. Y-axis: number of contributors (0 to 500); x-axis ticks: 0 to 45. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive]
Big Data Case Study –
10/30/17 ©2017 Alluxio, Inc. All Rights Reserved
Challenge – Gain an end-to-end view of the business with a large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
[Diagram labels: Spark, Teradata]
Solution – ETL data from Teradata to Alluxio
Impact – Faster time to market – “Now we don’t have to work Sundays”
http://bit.ly/2oMx95W
Big Data Case Study –
Challenge – Gain an end-to-end view of the business with a large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
[Diagram labels: Spark, Baidu File System]
Solution – With Alluxio, data queries are 30x faster
Impact – Higher operational efficiency
http://bit.ly/2pDHS3O
Big Data Case Study –
Challenge – Gain an end-to-end view of the business with a large volume of data for a $5B travel site. Queries were slow / not interactive, resulting in operational inefficiency.
[Diagram labels: Spark, Flink, HDFS, Ceph]
Solution – With Alluxio, 300x improvement in performance
Impact – Increased revenue from immediate response to user behavior
Use case: http://bit.ly/2pDJdrq
Machine Learning Case Study –
Challenge – Disparate data both on-prem and in the cloud. Heterogeneous types of data. Scaling of exabyte-size data. Slow due to a disk-based approach.
[Diagram labels: Spark, HDFS, Minio, Mesos]
Solution – Using Alluxio to prevent I/O bottlenecks
Impact – Orders of magnitude higher performance than before.
http://bit.ly/2p18ds3
Alluxio Architecture
[Diagram: Application with an embedded Alluxio Client; Alluxio Master; Alluxio Workers; underlying Storage systems]
Alluxio Client
Applications interact with Alluxio via the Alluxio client
• Java Native Alluxio Filesystem Client
  • Alluxio-specific operations like [un]pin, [un]mount, [un]set TTL
• HDFS-Compatible Filesystem Client
  • No code change necessary
• S3 API
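Since the HDFS-compatible client needs no code changes, pointing Spark at Alluxio is mostly a configuration step. The following is a minimal sketch, not an official recipe. It assumes the Alluxio client jar sits at a hypothetical path, and it uses the `fs.alluxio.impl` / `alluxio.hadoop.FileSystem` names from the Alluxio documentation for the Hadoop-compatible file system. Spark forwards any property prefixed with `spark.hadoop.` to the Hadoop configuration. Some Alluxio releases may not need the `fs.alluxio.impl` entry at all.

```python
def alluxio_spark_conf(client_jar):
    """Return Spark properties that expose Alluxio's HDFS-compatible client."""
    return {
        # Put the Alluxio client jar on the driver and executor classpaths.
        "spark.driver.extraClassPath": client_jar,
        "spark.executor.extraClassPath": client_jar,
        # Map the alluxio:// scheme to Alluxio's Hadoop-compatible file system.
        "spark.hadoop.fs.alluxio.impl": "alluxio.hadoop.FileSystem",
    }


def to_spark_submit_args(conf):
    """Render the properties as --conf arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


# Hypothetical jar location (an assumption); adjust to the actual installation.
conf = alluxio_spark_conf("/opt/alluxio/client/alluxio-client.jar")
print(" ".join(to_spark_submit_args(conf)))
```

The same dictionary could instead be written to `spark-defaults.conf`; the application code itself stays unchanged.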
Alluxio Master
Master is responsible for managing metadata
• Filesystem namespace metadata
• Blocks / workers metadata
Primary master writes journal for durable operations
• Secondary masters replay journal entries
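The journal idea can be illustrated with a toy model. This is a conceptual sketch only: the class names, the entry format, and the in-memory list standing in for durable storage are all assumptions made for illustration, not Alluxio's actual journal implementation.

```python
class ToyMaster:
    """Toy namespace metadata: maps file paths to lists of block ids."""

    def __init__(self):
        self.namespace = {}

    def apply(self, entry):
        op, path, blocks = entry
        if op == "create":
            self.namespace[path] = list(blocks)
        elif op == "delete":
            self.namespace.pop(path, None)


class ToyPrimary(ToyMaster):
    def __init__(self, journal):
        super().__init__()
        # Append-only list standing in for a durable journal (an assumption).
        self.journal = journal

    def execute(self, op, path, blocks=()):
        entry = (op, path, tuple(blocks))
        self.journal.append(entry)  # record the operation in the journal first
        self.apply(entry)           # then update the in-memory metadata


def replay(journal):
    """What a secondary does here: rebuild metadata by replaying entries."""
    secondary = ToyMaster()
    for entry in journal:
        secondary.apply(entry)
    return secondary


journal = []
primary = ToyPrimary(journal)
primary.execute("create", "/data/a.parquet", [1, 3])
primary.execute("create", "/tmp/scratch", [2])
primary.execute("delete", "/tmp/scratch")

secondary = replay(journal)
assert secondary.namespace == primary.namespace
```

Because every metadata change goes through the journal, a secondary that replays the entries in order ends up with the same namespace as the primary.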
Alluxio Worker
Worker is responsible for managing block data
Worker stores block data on various storage media
• HDD, SSD, Memory
Reads and writes data to underlying storage systems
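A worker's role as a block cache in front of underlying storage can be pictured with a small toy. The sketch below is illustrative only: it models a single memory tier rather than the HDD/SSD/memory tiers listed above, the least-recently-used eviction is an assumption chosen for simplicity, and none of the names correspond to Alluxio's real classes.

```python
from collections import OrderedDict


class ToyWorker:
    """Toy worker: caches blocks in memory, falls back to under storage."""

    def __init__(self, under_storage, mem_capacity):
        self.under_storage = under_storage  # dict: block id -> bytes
        self.mem_capacity = mem_capacity    # max number of cached blocks
        self.mem = OrderedDict()            # cached blocks, oldest first
        self.under_reads = 0                # how often under storage was hit

    def read_block(self, block_id):
        if block_id in self.mem:
            # Cache hit: served from memory, mark as most recently used.
            self.mem.move_to_end(block_id)
            return self.mem[block_id]
        # Cache miss: read from the underlying storage system and cache it.
        data = self.under_storage[block_id]
        self.under_reads += 1
        self.mem[block_id] = data
        if len(self.mem) > self.mem_capacity:
            self.mem.popitem(last=False)  # evict the least recently used block
        return data


under = {1: b"block-1", 2: b"block-2", 3: b"block-3"}
worker = ToyWorker(under, mem_capacity=2)
worker.read_block(1)  # miss: fetched from under storage
worker.read_block(1)  # hit: served from memory
assert worker.under_reads == 1
```

Only the first read of a block touches the underlying storage; repeated reads are served from the worker until the block is evicted.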
Sharing Data via Memory
Storage Engine & Execution Engine Same Process
• Two copies of data in memory – double the memory used
• Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
Sharing Data via Memory
Storage Engine & Execution Engine Different process
• Half the memory used
• Sharing Data at Memory Speed
[Diagram: two Spark processes (Spark Compute, Spark Storage) sharing Alluxio, which holds blocks 1, 3, and 4; HDFS / Amazon S3 (disk) holds blocks 1 to 4]
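The "double" versus "half" memory claim is simple arithmetic, sketched below under stated assumptions. The model assumes every job caches the full dataset when storage lives inside the Spark process, and that a separate storage process holds one shared copy. The job count and the 50 GB figure are hypothetical inputs, not measurements.

```python
def memory_in_process(num_jobs, dataset_gb):
    """Storage inside each Spark process: every job keeps its own copy."""
    return num_jobs * dataset_gb


def memory_out_of_process(num_jobs, dataset_gb):
    """Storage in a separate process: one shared copy, whatever the job count."""
    return dataset_gb


jobs, size_gb = 2, 50  # hypothetical: two jobs sharing a 50 GB dataset
same = memory_in_process(jobs, size_gb)           # 100 GB in total
separate = memory_out_of_process(jobs, size_gb)   # 50 GB in total
assert separate * 2 == same  # two jobs: the shared copy uses half the memory
```

With more than two jobs the gap widens, since the in-process total grows with the job count while the shared copy stays constant.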
Data Resilience During Crash
Storage Engine & Execution Engine Same Process
[Diagram: Spark process with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH in place of Spark Compute; Spark Storage still shows block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH in place of the Spark process, with no Spark Storage blocks remaining; HDFS / Amazon S3 holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine Different process
[Diagram: Spark process with Spark Compute and Spark Storage; Alluxio holds blocks 1, 3, and 4; HDFS / Amazon S3 (disk) holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine Different process
• Process Crash – Data is Re-read at Memory Speed
[Diagram: CRASH in place of the Spark process; Alluxio still holds blocks 1, 3, and 4; HDFS / Amazon S3 (disk) holds blocks 1 to 4]
Accessing Alluxio Data From Spark
Writing Data: write to an Alluxio file
Reading Data: read from an Alluxio file
Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
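The write and read calls above can be combined into one round trip. This is a hedged sketch rather than the speaker's code: `alluxio_uri` and `save_and_reload` are hypothetical helper names, the host defaults to `localhost` as an assumption, and 19998 is Alluxio's default master RPC port. The `spark` argument is expected to be a `pyspark.sql.SparkSession` and `df` a DataFrame; they are passed in, so the block itself does not import PySpark.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI; host and port are assumptions to adjust."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def save_and_reload(spark, df, path):
    """Write df to Alluxio as Parquet, then read it back from the same URI.

    spark: a pyspark.sql.SparkSession (provided by the caller)
    df:    a Spark DataFrame
    """
    uri = alluxio_uri(path)
    df.write.parquet(uri)           # write through the Alluxio client
    return spark.read.parquet(uri)  # later reads can be served by Alluxio


print(alluxio_uri("/data/events.parquet"))
# prints alluxio://localhost:19998/data/events.parquet
```

A second Spark application could call `spark.read.parquet` on the same URI to share the data without each application caching its own copy.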
Deploying Alluxio with Spark
[Diagram: Spark / Alluxio / Storage stack, shown twice]
Colocate Alluxio Workers with Spark for optimal I/O performance
Deploy Alluxio between Spark and Storage
Experiments
Spark 2.2.0 + Alluxio 1.6.0
Single worker: Amazon r3.2xlarge
Compare reading cached parquet files
Reading Cached DataFrame (parquet)
New Context: 50 GB DataFrame (S3)
6x – 8x speedup
Conclusion
Easy to use Alluxio with Spark
Alluxio enables improved I/O performance
Easily interact with various storage systems with Alluxio
Thank you!
Gene Pang [email protected] Twitter: @unityxx
Social Media
Twitter.com/alluxio
Linkedin.com/alluxio
Website www.alluxio.com
E-mail [email protected]