Programming Cascading

Taewook Eom

Data Infrastructure Group SK planet

2014-09-25

Big Data Processing

Low automation, high flexibility / High automation, low flexibility

Cascading

http://www.cascading.org/

Since 2007, by Chris Wensel (CTO, founder of Concurrent, Inc.)

Cascading

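Provides an abstraction built on an analogy between basic database concepts and plumbing (pipes and operators).
A pattern language for business process management of enterprise data workflows.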

http://docs.cascading.org/cascading/2.5/userguide/pdf/userguide.pdf

http://docs.cascading.org/impatient/ https://github.com/Cascading/Impatient

Cascading for the Impatient




Cascading

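• The Flow Planner checks for errors in the up-front planning stage (compile time) p.23
  – Fields required by each operation p.31
  – Order of operations
  – Connections between pipes and taps
  – Builds the dependency graph -> DAG p.37

• DAG (Directed Acyclic Graph)
  – A natural fit for data workflows
  – Used by a variety of data processing engines: Microsoft Dryad, Apache Tez, Apache Spark

• Suited to enterprise environments
  – Predictable, because it works from a physical plan rather than a logical plan p.24
  – Deterministic strategy: the physical execution plan does not change from run to run p.36
  – One JAR file applies at any scale (Same JAR, any scale) p.33

• Covers business logic, system integration, unit tests, consistency checks, and exception handling
• Lowers operational complexity p.172
  – Aimed at high throughput, like Hive, rather than ad-hoc queries or fast responses, so it suits ETL p.142
  – Tools and procedures familiar to Java developers p.172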

Cascading Terminology

http://docs.cascading.org/cascading/2.5/userguide/html/ch03.html#N2013B 3.1 Terminology

• Pipe: Data stream

• Filter: Data operation

• Tuple: Data record

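• Branch: a simple chain of pipes with no splits or merges

• Pipe Assembly: an interconnected set of pipe branches

• Tuple Stream: the series of tuples passing through a pipe branch or assembly

• Tap: Data source/sink

• Flow: one or more pipe assemblies connected to their taps

• Cascade – a set of flows executed as a single process
  – A flow does not run until its data dependencies on other flows are satisfied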
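
The dependency rule for a Cascade can be pictured as a topological ordering over flows. The following is a minimal plain-Java sketch of that idea, not the Cascading API: the class name FlowOrderSketch and the flow names used in the usage note are hypothetical, and every flow is assumed to appear as a key in the dependency map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the Cascading API.
class FlowOrderSketch {

    // deps maps each flow name to the flows whose output it reads.
    // Every flow must appear as a key (with an empty list if it has no dependencies).
    // Returns one valid execution order, or throws if the graph is not a DAG.
    static List<String> executionOrder(Map<String, List<String>> deps) {
        Map<String, Integer> unmet = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            unmet.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Flows with no unmet dependencies can start right away.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.poll();
            order.add(flow);
            // Finishing a flow satisfies one dependency of each downstream flow.
            for (String next : dependents.getOrDefault(flow, new ArrayList<>())) {
                if (unmet.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != deps.size()) {
            throw new IllegalStateException("cycle or unknown dependency: not a DAG");
        }
        return order;
    }
}
```

With hypothetical flows where "report" reads from "tfidf" and "wc", and both of those read from "tokenize", the sketch runs "tokenize" first and "report" last.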

Pipe Types

http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20276 Types of Pipes

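• Each – applies a Filter or a Function
  – A Filter can only remove tuples
  – A Function can add or change fields and can emit multiple tuples
  – The default Output Selector for a Function is Fields.RESULTS

• Every – used only on the result of a GroupBy or CoGroup
  – Applies an Aggregator or a Buffer

• Always specify the Output Selector fields for Function, Aggregator, and Buffer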

http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20438 The Each and Every Pipes


Buffer vs. Aggregator

• In common
  – Both operate only on the result of a GroupBy or CoGroup
  – The default Output Selector for both Aggregator and Buffer is Fields.ALL

• Differences
  – Aggregators can be chained; Buffers cannot
  – A Buffer can emit multiple result tuples for a single group
  – A Buffer can implement anything an Aggregator does, so an Aggregator can be seen as a specially optimized form of Buffer

// group by mdn, with each group sorted by log_time
pipe = new GroupBy(pipe, new Fields("mdn"), new Fields("log_time"));
// chained Aggregators: each Every adds one result field per group
pipe = new Every(pipe, new Count(new Fields("count")));
pipe = new Every(pipe, new Fields("mdn"), new DistinctCount(new Fields("unique_mdn_cnt")));
pipe = new Every(pipe, new Fields("pay_amt"), new Sum(new Fields("sum"), long.class));
pipe = new Every(pipe, new Fields("log_time"), new Last(new Fields("last_time")));
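
To make the Aggregator/Buffer distinction concrete, here is a small plain-Java sketch that does not use the Cascading API. The class GroupOpsSketch and its methods are hypothetical names chosen for illustration: the first method behaves like an Aggregator (one result per group), the second like a Buffer (it walks the whole group and may emit several results).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not the Cascading API.
class GroupOpsSketch {

    // Aggregator-style: folds the values of one group into a single result.
    static long sum(List<Long> group) {
        long total = 0;
        for (long v : group) {
            total += v;
        }
        return total;
    }

    // Buffer-style: iterates over the whole group and may emit any number
    // of results, here one running total per input value.
    static List<Long> runningTotals(List<Long> group) {
        List<Long> out = new ArrayList<>();
        long total = 0;
        for (long v : group) {
            total += v;
            out.add(total);
        }
        return out;
    }
}
```

The running total is only one example of something an Aggregator cannot express, since it needs one output per input tuple rather than one per group.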

Pipe Types


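• Merge – Unsorted merge
  – Merges two or more pipes with the same fields and types into a single stream
  – Does no grouping, so it is faster than GroupBy (an Aggregator or Buffer cannot follow it)

• GroupBy – Sorted merge on the key fields
  – Can merge two or more pipes only if they have the same fields and types
  – Order within a group is arbitrary (a secondary sort is possible at the cost of speed)
  – Creates the grouping that Every needs
  – Slower than Merge because it sorts on the grouping fields
    • Grouping fields are sorted in natural order
  – Secondary sort is available
    • Without a secondary sort, order within a group is arbitrary but execution is faster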

Fields sortFields = new Fields("value1", "value2");
// sort value1 in descending order within each group
sortFields.setComparator("value1", Collections.reverseOrder());
Pipe groupBy = new GroupBy(assembly, groupFields, sortFields);
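
The effect of grouping with a reversed secondary sort can be sketched in plain Java without Cascading. SecondarySortSketch is a hypothetical name, and the sketch assumes a single String key and a single int sort value per record: keys come out in natural order and values inside each group in descending order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not the Cascading API.
class SecondarySortSketch {

    // keys[i] and values[i] together form one record.
    static Map<String, List<Integer>> groupAndSort(String[] keys, int[] values) {
        // TreeMap keeps the grouping keys in natural order.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        // Secondary sort: reverse order inside each group.
        for (List<Integer> group : groups.values()) {
            group.sort(Comparator.reverseOrder());
        }
        return groups;
    }
}
```

Skipping the inner sort would leave each group in arrival order, which corresponds to the faster, arbitrary-order case described above.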


Pipe Types


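• Join two or more streams that have different fields on the values of common fields

• CoGroup – similar to a SQL join (InnerJoin, OuterJoin, LeftJoin, RightJoin, MixedJoin)
  – For outer joins, fields that do not exist are filled with null
  – Every field of every stream appears in the result, so field names must not be duplicated across streams
    • Duplicate names can be changed with the declaredFields argument
  – Fields are paired by position; field names are only a convenience for the developer
  – For a fast join, it tries to keep in memory all of the right-hand stream's tuples for each unique key (the bag)
    • Beyond a configurable threshold it spills from memory to disk as it goes (slower)
    • Too large a threshold can cause out-of-memory errors
    • Best performance comes from putting the stream with the largest groups on the far left and tuning the threshold

• HashJoin – optimized for joining one large stream with small streams (map-side join)
    • Loads the entire right-hand stream into memory for fast comparison (there are no groups, so all of it is held in memory)
  – Needs no grouping, so it joins in arbitrary order and is faster than CoGroup
  – No groups exist, so an Aggregator or Buffer cannot follow it
  – A useful pattern is to place a Checkpoint right before the HashJoin so that the reduced stream is fully written to disk first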
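
The idea behind HashJoin, holding the small side in memory and streaming the large side past it, can be sketched in plain Java. This is not the Cascading API: ReplicatedJoinSketch is a hypothetical name, and the sketch assumes one String key per tuple and a left join, so unmatched keys get null on the right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the Cascading API.
class ReplicatedJoinSketch {

    // Left-joins a large stream of keys against a small key -> value table
    // that is held entirely in memory.
    static List<String[]> leftJoin(Iterable<String> largeStream, Map<String, String> smallSide) {
        List<String[]> joined = new ArrayList<>();
        for (String key : largeStream) {
            // One hash lookup per tuple; no sorting or grouping is needed.
            // Map.get returns null for a missing key, like an outer join.
            joined.add(new String[] {key, smallSide.get(key)});
        }
        return joined;
    }
}
```

Because nothing is grouped, the output keeps the arrival order of the large stream, which is why a grouping step is still needed before any aggregation.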


http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20630 CoGroup


String inPath = args[ 0 ];
String outPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// source and sink taps: tab-delimited text with a header row
Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

// a single pipe that copies the input to the output
Pipe copyPipe = new Pipe( "copy" );

FlowDef flowDef = FlowDef.flowDef()
  .addSource( copyPipe, inTap )
  .addTailSink( copyPipe, outTap );

// plan the flow and run it to completion
flowConnector.connect( flowDef ).complete();

https://github.com/Cascading/Impatient/blob/master/part1/src/main/java/impatient/Main.java

p.29 1.2 The simplest Cascading application


…
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
// split the text on spaces, brackets, parentheses, commas, and periods
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only the generated tokens are passed downstream
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// count the occurrences of each token
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
…


p.37 1.5 The ubiquitous word count


…
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
// split the text on spaces, brackets, parentheses, commas, and periods
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
Fields fieldSelector = new Fields( "doc_id", "token" );
Pipe docPipe = new Each( "token", text, splitter, fieldSelector );

// clean up each token with the custom ScrubFunction
Fields scrubArguments = new Fields( "doc_id", "token" );
docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

// keep only the token field, then count the occurrences of each token
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new Retain( wcPipe, token );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

Flow wcFlow = flowConnector.connect( flowDef );
// write the planned flow as a DOT graph for inspection
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();


doc_id  text
doc01   A rain shadow is a dry area on the lee back …
doc02   This sinking, dry air produces a rain shadow, …
doc03   A rain shadow is an area of dry land that lies …
…

p.55 2.2 Scrubbing tokens


public class ScrubFunction extends BaseOperation implements Function {

  public ScrubFunction( Fields fieldDeclaration ) {
    // expects 2 arguments (doc_id, token) and declares the output fields
    super( 2, fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall ) {
    TupleEntry argument = functionCall.getArguments();
    String doc_id = argument.getString( 0 );
    String token = scrubText( argument.getString( 1 ) );

    // emit a tuple only when the scrubbed token is not empty
    if( token.length() > 0 ) {
      Tuple result = new Tuple();
      result.add( doc_id );
      result.add( token );
      functionCall.getOutputCollector().add( result );
    }
  }

  // trim surrounding whitespace and lowercase the token
  public String scrubText( String text ) {
    return text.trim().toLowerCase();
  }
}

https://github.com/Cascading/Impatient/blob/master/part3/src/main/java/impatient/ScrubFunction.java

p.49 2.1 Custom operations

…
String stopPath = args[ 2 ];
…
Fields stop = new Fields( "stop" );
Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
…
Fields scrubArguments = new Fields( "doc_id", "token" );
docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

// left join the tokens with the stop word list, then keep only
// the tuples whose stop field is empty (tokens that are not stop words)
Pipe stopPipe = new Pipe( "stop" );
Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

Pipe wcPipe = new Pipe( "wc", tokenPipe );
wcPipe = new Retain( wcPipe, token );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addSource( stopPipe, stopTap )
  .addTailSink( wcPipe, wcTap );
…


stop
a
about
after
all
along
an
and
Any
…

p.57 2.3 Replicated joins


…
String tfidfPath = args[ 3 ];
…
Fields fieldSelector = new Fields( "doc_id", "token" );
tokenPipe = new Retain( tokenPipe, fieldSelector );
…
// TF: count of each token within each document
Pipe tfPipe = new Pipe( "TF", tokenPipe );
Fields tf_count = new Fields( "tf_count" );
tfPipe = new CountBy( tfPipe, new Fields( "doc_id", "token" ), tf_count );
Fields tf_token = new Fields( "tf_token" );
tfPipe = new Rename( tfPipe, token, tf_token );

// D: total number of documents
Fields doc_id = new Fields( "doc_id" );
Fields tally = new Fields( "tally" );
Fields rhs_join = new Fields( "rhs_join" );
Fields n_docs = new Fields( "n_docs" );
Pipe dPipe = new Unique( "D", tokenPipe, doc_id );
dPipe = new Each( dPipe, new Insert( tally, 1 ), Fields.ALL );
dPipe = new Each( dPipe, new Insert( rhs_join, 1 ), Fields.ALL );
dPipe = new SumBy( dPipe, rhs_join, tally, n_docs, long.class );

// DF: number of documents that contain each token
Pipe dfPipe = new Unique( "DF", tokenPipe, Fields.ALL );
Fields df_count = new Fields( "df_count" );
dfPipe = new CountBy( dfPipe, token, df_count );
Fields df_token = new Fields( "df_token" );
Fields lhs_join = new Fields( "lhs_join" );
dfPipe = new Rename( dfPipe, token, df_token );
dfPipe = new Each( dfPipe, new Insert( lhs_join, 1 ), Fields.ALL );

// attach n_docs to every DF tuple by joining on the constant join fields
Pipe idfPipe = new HashJoin( dfPipe, lhs_join, dPipe, rhs_join );


p.71 3.1 Implementing TF-IDF

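Fields carried by each pipe, in the order of the code above:

tfPipe: ("doc_id", "token", "tf_count") after CountBy
tfPipe: ("doc_id", "tf_token", "tf_count") after Rename

dPipe: ("doc_id") after Unique
dPipe: ("doc_id", "tally") after Insert
dPipe: ("doc_id", "tally", "rhs_join") after Insert
dPipe: ("rhs_join", "n_docs") after SumBy

dfPipe: ("doc_id", "token") after Unique
dfPipe: ("token", "df_count") after CountBy
dfPipe: ("df_token", "df_count") after Rename
dfPipe: ("df_token", "df_count", "lhs_join") after Insert

idfPipe: ("df_token", "df_count", "lhs_join", "rhs_join", "n_docs") after HashJoin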


// join TF with IDF on the token
Pipe tfidfPipe = new CoGroup( tfPipe, tf_token, idfPipe, df_token );

// compute the TF-IDF score with a Java expression
Fields tfidf = new Fields( "tfidf" );
String expression = "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
ExpressionFunction tfidfExpression = new ExpressionFunction( tfidf, expression, Double.class );
Fields tfidfArguments = new Fields( "tf_count", "df_count", "n_docs" );
tfidfPipe = new Each( tfidfPipe, tfidfArguments, tfidfExpression, Fields.ALL );

fieldSelector = new Fields( "tf_token", "doc_id", "tfidf" );
tfidfPipe = new Retain( tfidfPipe, fieldSelector );
tfidfPipe = new Rename( tfidfPipe, tf_token, token );

// word count branch, sorted by count
Pipe wcPipe = new Pipe( "wc", tfPipe );
Fields count = new Fields( "count" );
wcPipe = new SumBy( wcPipe, tf_token, tf_count, count, long.class );
wcPipe = new Rename( wcPipe, tf_token, token );
wcPipe = new GroupBy( wcPipe, count, count );

FlowDef flowDef = FlowDef.flowDef()
  .setName( "tfidf" )
  .addSource( docPipe, docTap )
  .addSource( stopPipe, stopTap )
  .addTailSink( tfidfPipe, tfidfTap )
  .addTailSink( wcPipe, wcTap );
…



tfidfPipe: ("doc_id", "tf_token", "tf_count", "df_token", "df_count", "lhs_join", "rhs_join", "n_docs") after CoGroup
tfidfPipe: ("doc_id", "tf_token", "tf_count", "df_token", "df_count", "lhs_join", "rhs_join", "n_docs", "tfidf") after Each
tfidfPipe: ("tf_token", "doc_id", "tfidf") after Retain
tfidfPipe: ("token", "doc_id", "tfidf") after Rename
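
The expression string passed to ExpressionFunction is ordinary Java arithmetic, so it can be checked on its own. The sketch below restates it as a plain method; TfIdfSketch and the parameter names are hypothetical, and only the formula comes from the slide.

```java
// Plain-Java restatement of the expression string used in the slide.
class TfIdfSketch {

    // tfCount: occurrences of the token in one document
    // dfCount: number of documents containing the token
    // nDocs:   total number of documents
    static double tfidf(long tfCount, long dfCount, long nDocs) {
        return (double) tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }
}
```

Because of the 1.0 added to df_count, a token that appears in every document gets a slightly negative score, and a token found in all but one document scores exactly zero.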


Programming Tips

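• Local Mode
  – Develop, test, and explore data with local files before moving to Hadoop
  – Does not use the Hadoop API and runs in memory only (bounded by available memory)
  – Local testing works, but there are subtle API differences between local mode and Hadoop
    • Use cascading-local-2.0.x.jar instead of cascading-hadoop-2.0.x.jar
    • Use FileTap and LocalFlowConnector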

• Test p.80
  – CascadingTestCase
  – Debug http://docs.cascading.org/cascading/2.5/userguide/html/ch09s02.html
  – Assert
    • http://docs.cascading.org/cascading/2.5/userguide/html/ch08s02.html
    • http://docs.cascading.org/cascading/2.5/userguide/html/ch09s09.html
  – Trap http://docs.cascading.org/cascading/2.5/userguide/html/ch08s03.html
  – Sample http://docs.cascading.org/cascading/2.5/userguide/html/ch09s03.html
  – Checkpoint
    • http://docs.cascading.org/cascading/2.5/userguide/html/ch08s04.html
    • http://docs.cascading.org/cascading/2.5/userguide/html/ch08s05.html











Programming Tips

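• Build SubAssemblies and Flows in small, meaningful units and connect them with a Cascade

• Connecting flows
  – Heads, tails, and assemblies are connected by "name", so naming them explicitly matters
  – Because the flow is a DAG, connectivity is checked backward from the final sinks
  – A Pipe inherits the name of the previous Pipe, so distinct explicit names prevent runtime errors

• Use only _, lowercase letters, and digits in field names
  – Korean characters or a hyphen (-) cause errors in Expression functions, which use the Janino compiler
  – "first-name" is a valid field name, but inside an Expression it is parsed like first-name.trim() and Janino raises a runtime error
  – Implementing a Function instead of an Expression function avoids the Janino problem and is easier to reuse

• Match the class type of GroupBy sort fields first
  – After data is stored in HDFS and read back, every field becomes a String

• To make operations reusable, minimize the use of global variables and properties and pass arguments through the operation's constructor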
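
Two of the tips above lend themselves to a quick plain-Java check that does not involve Cascading: whether a field name is safe to use inside an expression, and how sort order changes when numeric values come back as Strings. FieldTipsSketch and its methods are hypothetical; the rule that a name must not start with a digit is an added assumption so that the name is also a legal Java identifier.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is not the Cascading API.
class FieldTipsSketch {

    // Lowercase letters, digits, and underscore only. Disallowing a leading
    // digit is an assumption added here, not a rule stated in the slides.
    private static final Pattern SAFE_NAME = Pattern.compile("[a-z_][a-z0-9_]*");

    static boolean isSafeFieldName(String name) {
        return SAFE_NAME.matcher(name).matches();
    }

    // Sorts values as Strings, the type they have after a round trip through text files.
    static String[] sortAsStrings(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts the same values after converting them back to long.
    static String[] sortAsLongs(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return copy;
    }
}
```

For the values "9", "10", and "100", the String sort yields 10, 100, 9 while the numeric sort yields 9, 10, 100, which is why the sort field's type should be fixed before the GroupBy.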

Programming Tips

• Setting the number of reducers
  – Number of intermediate reducers

Properties properties = new Properties();
properties.put("mapred.reduce.tasks", "10");
properties.put("mapred.map.tasks.speculative.execution", "true");
properties.put("mapred.reduce.tasks.speculative.execution", "false");
properties.put("mapred.job.priority", "HIGH");
AppProps.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new HadoopFlowConnector(properties);

  – Number of final reducers

TextDelimited scheme = new TextDelimited(new Fields("key", "value"), true, "\t");
// write the sink as a single part file
scheme.setNumSinkParts(1);
Tap sinkTap = new Hfs(scheme, outputPath, SinkMode.REPLACE);

• http://docs.cascading.org/cascading/2.5/userguide/html/ch09.html 9. Built-In Operations
  – Identity Function
  – Text Functions
  – Regular Expression Operations
  – Java Expression Operations
  – Buffers

• http://docs.cascading.org/cascading/2.5/userguide/html/ch10.html 10. Built-in Assemblies
  – AggregateBy (AverageBy, CountBy, SumBy, FirstBy)
  – Rename
  – Retain
  – Unique

• http://docs.cascading.org/cascading/2.5/userguide/html/ch13.html 13. Cookbook






Questions? Questions.foreach( answer(_) )

// Custom Aggregator that counts the distinct argument tuples in each group
public class DistinctCount extends BaseOperation<HashSet<String>>
    implements Aggregator<HashSet<String>> {

  public DistinctCount(Fields fieldDeclaration) {
    super(fieldDeclaration);
  }

  @Override
  public void start(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    // create the set for the first group, then reuse it for later groups
    if (aggregatorCall.getContext() == null) {
      aggregatorCall.setContext(new HashSet<String>());
    } else {
      aggregatorCall.getContext().clear();
    }
  }

  @Override
  public void aggregate(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    TupleEntry argument = aggregatorCall.getArguments();
    HashSet<String> context = aggregatorCall.getContext();
    // the set keeps one entry per distinct tuple value
    context.add(argument.getTuple().toString());
  }

  @Override
  public void complete(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    // emit the number of distinct values seen in this group
    aggregatorCall.getOutputCollector().add(new Tuple(aggregatorCall.getContext().size()));
  }
}
