Alluxio (formerly Tachyon): Getting Started with Alluxio + Spark + S3

Calvin Jia

June 15, 2016 @ Alluxio Meetup (hosted by Intel)

Related Blog Post: http://goo.gl/MUpL0O

Who Am I?

• Calvin Jia

• SWE @ Alluxio, Inc.

• Alluxio PMC Member

• Twitter: @JiaCalvin

Outline

• Technology Overview

• Alluxio + Spark + S3

• Demo

Alluxio Ecosystem

Why Alluxio?

• Data sharing between jobs

• Data resilience during application crashes

• Consolidate memory usage and alleviate GC issues

Data Sharing Between Jobs

(Diagram: In-Memory Storage holding blocks 1-4; storage engine and execution engine in the same process)

Inter-process sharing slowed down by network I/O

Data Sharing Between Jobs

(Diagram: blocks 1-4 on HDFS / disk and in memory; storage and execution engines separated)

Inter-process sharing can happen at memory speed

Data Resilience during Crashes

(Diagram: In-Memory Storage holding blocks 1-4)

Process crash requires network I/O to re-read the data

(Diagram: a process crash; blocks 1-4 remain on HDFS / disk and in memory; storage and execution engines separated)

Process crash only needs memory I/O to re-read the data

Consolidating Memory

(Diagram: two In-Memory Storage instances, each holding blocks)

Data duplicated at memory-level

Consolidating Memory

(Diagram: blocks 1-4 on HDFS / disk and a single copy in memory)

Data not duplicated at memory-level

Outline

• Demo

Visualizing the Stack

FAST 10^4 - 10^5 MB/s

MODERATE 10^3 - 10^4 MB/s

SLOW 10^2 - 10^3 MB/s

(Diagram labels: Only when necessary; Limited; SSD; HDD)
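A rough back-of-the-envelope calculation can make these tiers concrete. The sketch below computes an idealized sequential read time at the low end of each range (10^4, 10^3, and 10^2 MB/s). The 100 GB dataset size and the names `TierReadTime` and `secondsToRead` are assumptions for illustration, not figures or code from the talk.

```scala
object TierReadTime {
  // Lower bound of each throughput range, in MB/s.
  val tiersMBps: Seq[(String, Double)] = Seq(
    "FAST" -> 1e4,
    "MODERATE" -> 1e3,
    "SLOW" -> 1e2
  )

  // Idealized time to read sizeMB at a steady throughput
  // (ignores latency, contention, and caching effects).
  def secondsToRead(sizeMB: Double, throughputMBps: Double): Double =
    sizeMB / throughputMBps

  def main(args: Array[String]): Unit = {
    val datasetMB = 100 * 1000.0 // assumed 100 GB dataset (hypothetical)
    for ((tier, mbps) <- tiersMBps)
      println(f"$tier%-8s ${secondsToRead(datasetMB, mbps)}%.0f s")
  }
}
```

Under these assumptions the same read takes roughly 10 s, 100 s, and 1000 s respectively, which is why falling through to the slowest tier only when necessary matters.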

When to use Alluxio

• Two or more jobs access the same dataset

• Job(s) may not always succeed

• Dataset larger than Spark JVM

• Jobs are pipelined

• Resulting data does not need to be immediately persisted

Version Selection

• Alluxio 1.1.0
  – Latest released version
  – Many improvements, upgrade recommended

• Spark 1.6.1
  – Latest released version
  – Remember to use the Spark Alluxio client, i.e. -Pspark
  – Spark 2.0 is coming out soon; will recommend the best way to integrate with Alluxio

API Selection

• Access data directly through the FileSystem API, but change the scheme to alluxio://
  – Minimal code change
  – Do not need to reason about logic

• Example:
  – val file = sc.textFile("s3n://my-bucket/myFile")
  – val file = sc.textFile("alluxio://master:19998/myFile")
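The scheme change in the example above can be captured in a small helper, so that existing S3 paths in a job are rewritten in one place. This is only a sketch: the object `AlluxioPaths` and the method `toAlluxio` are hypothetical names, and the code assumes the S3 bucket is the root of the Alluxio namespace (so the object key becomes the Alluxio path unchanged), as the example implies. The host `master` and port `19998` are taken from the example.

```scala
import java.net.URI

object AlluxioPaths {
  // S3 schemes this sketch accepts (an assumption for illustration).
  private val s3Schemes = Set("s3", "s3n", "s3a")

  // Hypothetical helper: swaps the scheme and authority of an S3 URI for
  // alluxio://<master>:<port>, keeping the path as-is. Assumes the bucket
  // is the root of the Alluxio namespace.
  def toAlluxio(s3Path: String, master: String = "master", port: Int = 19998): String = {
    val uri = new URI(s3Path)
    require(s3Schemes.contains(uri.getScheme), s"not an S3 URI: $s3Path")
    new URI("alluxio", null, master, port, uri.getPath, null, null).toString
  }
}

// In a Spark shell, where `sc` is already defined:
//   val file = sc.textFile(AlluxioPaths.toAlluxio("s3n://my-bucket/myFile"))
```

Keeping the rewrite in one function means the rest of the job stays untouched, which matches the "minimal code change" point.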

Outline

• Demo
