getting started with alluxio + spark + s3
Post on 11-Jan-2017
447 Views
Preview:
TRANSCRIPT
Alluxio (formerly Tachyon):Getting Started with Alluxio + Spark + S3
Calvin Jia
June 15, 2016 @ Alluxio Meetup (hosted by Intel)
Related Blog Post: http://goo.gl/MUpL0O
Why Alluxio?
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
5
In-Memory Storage
block 1
block 3
In-Memory Storage
block 1
block 3
block 2
block 4
storage engine & execution enginesame process
Data Sharing Between Jobs
Inter-process sharing slowed down by network I/O6
Data Sharing Between Jobs
block 1
block 3
block 2
block 4
HDFSdisk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
storage & execution engineseparated
Inter-process sharing can happen at memory speed7
Data Resilience during Crashes
In-Memory Storageblock 1
block 3
block 1
block 3
block 2
block 4
storage engine & execution enginesame process
Process crash requires network I/O to re-read the data
8
Data Resilience during Crashes
Crash
In-Memory Storageblock 1
block 3
block 1
block 3
block 2
block 4
storage engine & execution enginesame process
Process crash requires network I/O to re-read the data
9
Data Resilience during Crashes
block 1
block 3
block 2
block 4
Crash
storage engine & execution enginesame process
Process crash requires network I/O to re-read the data
10
Data Resilience during Crashes
storage & execution engineseparated
HDFSdisk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
Process crash only needs memory I/O to re-read the data
11
Data Resilience during Crashes
Crashstorage & execution engineseparated
Process crash only needs memory I/O to re-read the data
HDFSdisk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
12
Consolidating Memory
In-MemoryStorage
block 1
block 3
In-MemoryStorage
block 3
block 1
block 1
block 3
block 2
block 4
storage engine & execution enginesame process
Data duplicated at memory-level
13
Consolidating Memory
block 1
block 3
block 2
block 4
storage & execution engineseparated
HDFSdisk
block 1
block 3
block 2
block 4 In-Memory
block 1
block 3 block 4
Data not duplicated at memory-level
14
Visualizing the Stack
16
FAST 104 - 105 MB/s
MODERATE 103 - 104 MB/s
SLOW 102 - 103 MB/s
Only when necessaryLimited
Often
SSDHDD
Mem
When to use Alluxio
•Two or more jobs access the same dataset•Job(s) may not always succeed•Dataset larger than Spark JVM•Jobs are pipelined•Resulting data does not need to be immediately persisted
17
Version Selection
• Alluxio 1.1.0–Latest released version–Many improvements, upgrade recommended
• Spark 1.6.1–Latest released version–Remember to use Spark Alluxio client, ie. -Pspark
–Spark 2.0 is coming out soon, will recommend the best way to integrate with Alluxio
18
API Selection• Access data directly through the FileSystem API, but
change scheme to alluxio://–Minimal code change–Do not need to reason about logic
•Example:–val file = sc.textFile(“s3n://my-bucket/myFile”)–val file = sc.textFile(“alluxio://master:19998/myFile”)
19
top related