Scalable Multi-Language Data Analysis on BEAM: The Erlang Experience

Jörgen Brandt

Humboldt-Universität zu Berlin

2016-09-08

Erlang User Conference 2016

Jorgen Brandt (HU Berlin) The Erlang Experience 2016-09-08 1 / 29

About me

PhD student at Humboldt (Berlin)
Creator of Cuneiform workflow language
Major area of application: Bioinformatics

Motivation

Introduction to Cuneiform
How to implement a large-scale data analysis PL
How in Erlang?
Why in Erlang?

Scientific Workflow Systems

Workflows as DAGs

Scientific workflows are DAGs
Nodes are tasks
Edges are data dependencies
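The DAG view can be made concrete with a small sketch. The task names and edges below are hypothetical and not taken from the talk; the point is only that, once tasks and data dependencies are written down as a graph, the sets of tasks that may run side by side fall out of a topological traversal.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task (node), each value is the set
# of tasks whose output it consumes (incoming data-dependency edges).
workflow = {
    "download": set(),
    "unpack": {"download"},
    "align": {"unpack"},
    "index": {"unpack"},
    "report": {"align", "index"},
}

def execution_waves(dag):
    """Group tasks into waves. Tasks in the same wave do not depend on
    each other, so they could be executed in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(workflow)
```

Here `align` and `index` end up in the same wave because neither consumes the other's output.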

The Next Generation Sequencing use case

Cuneiform is ...

Functional language for large-scale data analysis, implemented in Erlang

Different from systems like Spark
  Standalone syntax, no fluent API in Scala
  Integration of foreign PLs

Different from distributed workflow languages like Snakemake
  Reduction of functional expression, no static dependency graph

The main idea

Functional Programming + Foreign Language Interfacing

Functional Programming

Very expressive
Natural operations on lists (map, ...)
Natural iteration (recursion)
Gain: general data analysis

Parallelize independent sub-expressions
Lazy evaluation
Gain: automatic parallelism
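A minimal sketch of why independent sub-expressions can be parallelized without help from the programmer. The expression format `(op, left, right)` and the function names are assumptions made for this illustration, not Cuneiform's representation. Because the two operands of a node share no data, they can be handed to separate workers and joined when both are done.

```python
from concurrent.futures import ThreadPoolExecutor
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def evaluate(expr):
    """Sequentially evaluate a number or an (op, left, right) tuple."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left), evaluate(right))
    return expr

def evaluate_parallel(expr):
    """Evaluate the two operands of the root node concurrently.
    They are independent of each other, so the order in which
    they finish does not affect the result."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    with ThreadPoolExecutor(max_workers=2) as pool:
        left_future = pool.submit(evaluate, left)
        right_future = pool.submit(evaluate, right)
        return OPS[op](left_future.result(), right_future.result())

expr = ("add", ("mul", 2, 3), ("mul", 4, 5))
result = evaluate_parallel(expr)
```

The sketch only splits at the root; a real system would discover independent work throughout the expression and would dispatch it to processes or machines rather than threads.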

Cuneiform Example

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=output.file
  gzip -c -d $gz > $out
}*

gunzip( gz: 'myarchive1.gz');

gunzip( gz: 'myarchive1.gz' 'myarchive2.gz');

gunzip( gz: 'myarchive1.gz' 'myarchive2.gz' 'myarchive3.gz');
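The growing argument list suggests that applying a task to several values yields one independent invocation per value. The following sketch illustrates that reading under stated assumptions: `apply_task` is a hypothetical helper, not part of Cuneiform, and in-memory byte strings stand in for the archive files.

```python
import gzip

def gunzip(gz: bytes) -> bytes:
    """Stand-in for the bash task body: decompress one gzip archive."""
    return gzip.decompress(gz)

def apply_task(task, **params):
    """Apply `task` once per element of its single list-valued parameter.
    Each invocation is independent of the others."""
    (name, values), = params.items()  # exactly one parameter is expected
    return [task(**{name: value}) for value in values]

# Three small archives built in memory instead of myarchive1.gz etc.
archives = [gzip.compress(text) for text in (b"one", b"two", b"three")]
outputs = apply_task(gunzip, gz=archives)
```

Since the invocations do not share data, they are candidates for the automatic parallelism described earlier.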

Workflow Implementations Available

cuneiform-lang.org/examples

Available example workflows:
  Variant Calling (Varscan)
  Methylation
  RNA-Seq
  Variant Calling (GATK)
  etc.

[Figure: workflow graph with tasks samtools-faidx, fastq-dump, tophat-single, cufflinks, cuffmerge, cuffdiff, cummerbund]

Example: RNA-Seq

cuneiform-lang.org/examples

Biology-Inspired Analogy

Large-scale data analysis systems as a 2-tier system:
  Algorithm level (fast, local)
  Workflow level (organizational, distributed)

Analogy to the cell:
  DNA transcription (fast, local)
  Cell-to-cell signaling (organizational, distributed)

Erlang is a good fit for the organizational part

Distributed Workflow System Architecture

[Figure: architecture diagram, built up in steps, showing three Query components, a Cache, a Scheduler, and three nodes each labeled "FS" and "Work"]

Two Modeling Challenges (i)

Interpreter: reduction of query expression
Execution Environment: distributed system

How to model programming languages?

Modeling the Cuneiform Interpreter

[Figure: Program → Compiler → Abstract Program → Interpreter → Result; the Compiler is derived from a BNF via a parser generator (Leex/Yecc), the Interpreter from the Operational Semantics via transcription]

Parser is generated from BNF
Interpreter is transcribed from Operational Semantics

Abstract Syntax

An expression in Cuneiform is ...

[Figure: abstract syntax tree diagram. The constructors shown are listed below; in the figure each field is annotated with its type (String, N+, B, Expr, Var, Param, String→[Expr]).]

Expr x is one of:
  Str      (str, s)
  Var      (var, n)
  Select   (select, c, u)
  App      (app, c, λ, fa)
  Cnd      (cnd, [xc1, ...], [xt1, ...], [xe1, ...])
  Fut      (fut, r, [po1, ...])

Auxiliary constructs:
  Lam      (lam, s, b)
  Natbody  (natbody, fb)
  Forbody  (forbody, l, s)
  Sign     (sign, [po1, ...], [pi1, ...])
  Param    (param, n, pl)
  Correl   (correl, [n1, ...])

Foreign languages: bash, python, r

Expressions can contain other expressions
Semantics define how expressions are reduced

x0 → x1 → . . . → x∗
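The reduction chain above can be illustrated with a toy small-step reducer over tagged tuples. This is a sketch under assumptions, not the Cuneiform semantics: it covers only `str`, `var`, and `cnd`, it takes variable bindings from a plain dictionary, and it assumes that an empty condition list selects the else branch.

```python
def step(exprs, env):
    """Perform one reduction step on a list of expressions."""
    out = []
    for x in exprs:
        tag = x[0]
        if tag == "str":
            out.append(x)              # strings are already values
        elif tag == "var":
            out.extend(env[x[1]])      # replace a variable by its binding
        elif tag == "cnd":
            _, cond, then_xs, else_xs = x
            reduced = step(cond, env)
            if reduced != cond:
                # the condition is not yet in normal form: reduce it first
                out.append(("cnd", reduced, then_xs, else_xs))
            else:
                # assumption: an empty condition list means "false"
                out.extend(then_xs if cond else else_xs)
        else:
            raise ValueError(f"unsupported expression: {tag}")
    return out

def reduce_to_normal_form(exprs, env):
    """Return the whole chain x0, x1, ..., x* of intermediate results."""
    trace = [exprs]
    while True:
        nxt = step(trace[-1], env)
        if nxt == trace[-1]:
            return trace
        trace.append(nxt)

env = {"flag": [], "name": [("str", "world")]}
x0 = [("cnd", [("var", "flag")], [("str", "yes")], [("var", "name")])]
trace = reduce_to_normal_form(x0, env)
```

Each entry of `trace` is one element of the chain; the last entry is the normal form, on which another step changes nothing.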

Implementing an Operational Semantics in Erlang

Two Modeling Challenges (ii)

Interpreter: reduction of query expression
Execution Environment: distributed system

How to model distributed systems?

How to Model Distributed Systems

Petri Nets

Mature theory
  Liveness
  Invariants
  Traps/Cotraps
  ...

Local decision/synchronization
Parallel execution of independent transitions
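To make "local decision" concrete, here is a minimal place/transition net. The class, the place names, and the transition names are hypothetical and do not reproduce the net from the talk. A transition is enabled purely by the tokens on its own input places, so two transitions with disjoint inputs can fire independently of each other.

```python
from collections import Counter

class PetriNet:
    def __init__(self, marking):
        self.marking = Counter(marking)  # number of tokens per place
        self.transitions = {}            # name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self, name):
        """A transition is enabled if every input place holds enough tokens."""
        inputs, _ = self.transitions[name]
        return all(self.marking[place] >= n for place, n in inputs.items())

    def fire(self, name):
        """Consume tokens from the input places, produce them on the outputs."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking.subtract(inputs)
        self.marking.update(outputs)

# Two queued jobs compete for a single idle worker.
net = PetriNet({"queued": 2, "idle_worker": 1})
net.add_transition("schedule", {"queued": 1, "idle_worker": 1}, {"running": 1})
net.add_transition("reply", {"running": 1}, {"done": 1, "idle_worker": 1})

net.fire("schedule")
blocked = not net.enabled("schedule")  # the only worker is busy
net.fire("reply")                      # the worker becomes idle again
net.fire("schedule")
net.fire("reply")
```

The second `schedule` cannot fire until `reply` returns the worker token, which is the kind of synchronization the net expresses without any global coordinator.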

Modeling the Cuneiform Execution Environment

[Figure: Petri net of the execution environment. Labeled elements include submit, schedule, lookup, reply, monitor, resubmit, and decommission; tokens are tuples such as (a,r), (a,q), (p,s), (p,a), (p,a,r), and (p,s,a,q); one guard reads pmatch(s,q).]

Execution Environment: a ∈ App, p ∈ Pid, r ∈ Result ∪ Err, q ∈ Req, s ∈ Spec

How to do Petri Nets in Erlang

Flow-based programming revisited

System languages for heavy lifting

Large-scale data analysis is hard
  Because programming languages are hard
  Because distributed systems are hard

Erlang is a good fit
  Because FP is already close to operational-semantics-style notation
  Because the Erlang process model is already close to Petri nets

REPL

cuneiform-lang.org

Conclusion

Functional Programming + Foreign Languages
Distributed Execution: Local Multicore, HTCondor, Hadoop
Automatic Parallelization
Expressive data analysis workflows
Foreign Language Interface: Bash, Perl, ...

cuneiform-lang.org

Questions?
