Trecul – Data Flow Processing using Hadoop and LLVM
David Blair

Page 1: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul – Data Flow Processing using Hadoop and LLVM
David Blair

Page 2: Trecul – Data Flow Processing using Hadoop and LLVM

Agenda

• Problem Statement
• Trecul User Level Overview
• Trecul Architecture and Design

Page 3: Trecul – Data Flow Processing using Hadoop and LLVM

Advertising Decision Solutions

The Akamai Advertising Network
• Full Funnel Approach: Awareness, Prospecting, Remarketing

Data Coop
• 500+ sites' browse & buy data
• 300M monthly active cookies

Data Collection
• 600 million site events per day
• 50 million impressions per day

Page 4: Trecul – Data Flow Processing using Hadoop and LLVM

Making Data into Decisions

[Diagram: Ad Serving and Data Collection → Data Coop → Modeling & Scoring → Attribution & Billing]

Page 5: Trecul – Data Flow Processing using Hadoop and LLVM

Problem Statement

Had a working system, but much pain
• Commercial parallel RDBMS, MySQL, Perl

Functional Requirements
• Natural partitioning key = User/Cookie
• Most processing aligns with that key
• Handling of structured data only (e.g. no text analysis)

Non-Functional Requirements
• Fault tolerance
• Performance/cost
• Must be deployable in the Akamai network

Reach Goals
• Ease of use
• Ad-hoc queries

Page 6: Trecul – Data Flow Processing using Hadoop and LLVM

Hadoop to the Rescue (Almost)

HDFS
• Good enough performance
• Hooks to customize data placement
• Handles most single-node failures

Map Reduce
• Cluster and resource management
• Partition-parallel computing model
• Shuffles for cases when we need it
• Handles most single-node failures

Mystery guest
• Ad-hoc Java – an anti-pattern
• Hive or Pig – too slow for our needs
• or …

Page 7: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program

Page 8: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Operators

Page 9: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Arguments

Page 10: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Arrows

Page 11: Trecul – Data Flow Processing using Hadoop and LLVM

$ ads-df --file - << EOF
> g = generate[output="'Hello World!' AS greeting",
>     numRecords=1];
> p = print[limit=10];
> g -> p;
> d = devNull[];
> p -> d;
> EOF
Hello World!

Streaming pipes & filters without threads or processes

Anatomy of a Trecul Program
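The phrase "streaming pipes & filters without threads or processes" is the key point here: connected operators execute as ordinary function calls in a single thread, each pushing records to its downstream port. Below is a minimal C++ sketch of that idea; the Operator, Print, and DevNull types are illustrative stand-ins, not Trecul's actual classes.

#include <cstdio>

// Illustrative sketch only: operators connected by arrows run as plain
// function calls in one thread, with no inter-operator queues.
struct Operator {
    Operator* next = nullptr;                  // the "->" arrow
    virtual void push(const char* record) = 0; // receive one record
    virtual ~Operator() = default;
};

struct Print : Operator {
    void push(const char* record) override {
        std::puts(record);                     // print, then forward downstream
        if (next) next->push(record);
    }
};

struct DevNull : Operator {
    void push(const char*) override {}         // swallow records
};

int main() {
    Print p;
    DevNull d;
    p.next = &d;                               // p -> d
    p.push("Hello World!");                    // the single generated record
}

Because each record flows through the whole chain before the next is produced, the dataflow streams without materializing intermediate results.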

Page 12: Trecul – Data Flow Processing using Hadoop and LLVM

Basic Trecul Map Reduce Program

map.tql:
m = map[format="cre_date DATETIME, event_id INTEGER, greeting VARCHAR"];
e = emit[key="greeting"];
m -> e;

reduce.tql:
r = reduce[];
gb = group_by[sortKey="greeting", output="greeting, SUM(1) AS greeting_cnt"];
r -> gb;
w = write[file="hdfs://default:0/users/dblair/demo_mr"];
gb -> w;

$ ads-df --map map.tql --reduce reduce.tql --input /my/test/data

Page 13: Trecul – Data Flow Processing using Hadoop and LLVM

Example with branching and merging

r = read[file="hdfs://default:0/foo",
    format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
c = copy[output="akid", output="input.*"];
r -> c;
g = group_by[sortKey="akid", output="akid, SUM(1) AS activity"];
c -> g;
j = merge_join[leftKey="akid", rightKey="akid", where="activity > 5", output="l.*"];
c -> j;
g -> j;

[Diagram: read → copy → join, with a second branch copy → group → join]

Page 14: Trecul – Data Flow Processing using Hadoop and LLVM

Scope of Trecul

File IO
• Read, write
• Simple parser/printer
• Local filesystem and HDFS
• Bucketed mode

Merging
• Hash Join & Merge Join (inner, outer, semi, anti-semi)
• Union All
• Sorted Union
• Switch

Transformation
• Generate, Copy
• Filter

Aggregation
• Hash Group By
• Sort Group By
• Hybrid Group By
• Sort Running Total

Sort
• External sort
• Supports presorted keys

Map Reduce Integration
• Emit
• Map
• Reduce

MySQL Integration
• Select
• Insert

Page 15: Trecul – Data Flow Processing using Hadoop and LLVM

Limits of Trecul

Relational data only
• Primitive types: INTEGER, BIGINT, DOUBLE PRECISION, DECIMAL, DATE, DATETIME, CHAR(N), VARCHAR
• No container types: list, set, bag, map
• No Unicode, no code page support

No metadata management
• ADS has special operators that encapsulate specific datasets
• Formats may be stored in files

No optimizer
• We write very complex queries
• No barrier to construction of optimal plans
• Predictable performance in production

Page 16: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Architecture

C++, Linux

Expression Language
• Parser
• Semantic Analysis
• Codegen

Operator Library

Dataflow Runtime
• OS Services
• Graph Semantic Analysis
• Operator Scheduler

Harness Integration
• Single Machine
• Hadoop
• MPI (experimental)

Page 17: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul and LLVM

LLVM
• Open source compiler and toolchain project
• Used extensively by Apple and Google
• Supports static and JIT compilation
• http://www.llvm.org

Trecul Expressions
• Transforms, predicates, aggregates
• Expressions & data structures compiled using LLVM
• Operators are parameterized with expressions

Most operator code, plus the scheduler etc., is statically compiled
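To make "operators are parameterized with expressions" concrete, here is a hypothetical C++ sketch of the static/JIT split. The jit_compile_predicate function is a stand-in for Trecul's codegen path (parse, lower to LLVM IR, JIT-compile), not a real Trecul or LLVM API; it returns a hand-written function so the sketch is self-contained.

#include <cstdint>
#include <string>

// Record layout matching "a INTEGER, b BIGINT, c DATE" from the next slides;
// date is modeled as days since the epoch purely for illustration.
struct Record { int32_t a; int64_t b; int32_t c; };

// Signature of a JIT-compiled predicate over a record.
using Predicate = bool (*)(const Record*);

// Stand-in for codegen: the real system JITs the where-clause to machine
// code; here we return a hand-written equivalent.
static bool after_2012(const Record* r) { return r->c >= 15340; } // 2012-01-01
Predicate jit_compile_predicate(const std::string& /*whereClause*/) {
    return &after_2012;
}

// Statically compiled operator shell, parameterized at query-compile time
// with a pointer to the JIT-produced code.
struct FilterOperator {
    Predicate where;
    explicit FilterOperator(const std::string& clause)
        : where(jit_compile_predicate(clause)) {}
    bool accept(const Record* rec) const { return where(rec); }
};

Only the predicate body changes from query to query, so the filter's inner loop calls straight into freshly generated machine code with no interpretation overhead.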

Page 18: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };

Note: The use of pseudo-C is for illustration only; we transform Trecul directly to LLVM IR and then to machine code.

Page 19: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }

Page 20: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }
struct _G { date c; int64_t s; };
void _G_init(_G * out, _R * in) { out->c = in->c; out->s = 0LL; }
void _G_upd(_G * out, _R * in) { out->s += in->a*in->b; }
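A hypothetical sketch of how a statically compiled hash group-by shell could drive the generated _G_init/_G_upd functions above. The hash map and driver loop are illustrative, not Trecul's actual runtime; date is modeled as days since the epoch so it can key the map directly.

#include <cstdint>
#include <unordered_map>
#include <vector>

using date = int32_t;  // illustrative stand-in for the date type

struct _R { int32_t a; int64_t b; date c; };
struct _G { date c; int64_t s; };

// In Trecul these two functions are JIT-generated from the group_by
// arguments; they are written by hand here only to show the calling
// convention the operator relies on.
void _G_init(_G* out, const _R* in) { out->c = in->c; out->s = 0; }
void _G_upd (_G* out, const _R* in) { out->s += static_cast<int64_t>(in->a) * in->b; }

// Statically compiled shell: one _G accumulator per key, initialized on
// first sight and updated for every record in the group.
std::vector<_G> hash_group_by(const std::vector<_R>& input) {
    std::unordered_map<date, _G> groups;
    for (const _R& rec : input) {
        auto [it, inserted] = groups.try_emplace(rec.c);
        if (inserted) _G_init(&it->second, &rec);  // first record of this key
        _G_upd(&it->second, &rec);                 // accumulate SUM(a*b)
    }
    std::vector<_G> result;
    result.reserve(groups.size());
    for (auto& kv : groups) result.push_back(kv.second);
    return result;
}

The shell never inspects the record layout itself; everything type-specific lives in the two generated functions, which is what lets one operator implementation serve every group_by expression.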

Page 21: Trecul – Data Flow Processing using Hadoop and LLVM

Integration with Hadoop

[Diagram: the query is compiled and submitted to the cluster via Hadoop Pipes; each Task JVM hosts a Trecul executor, and the executors read from and write to HDFS]

Page 22: Trecul – Data Flow Processing using Hadoop and LLVM

Performance Testing

All tests performed on an 82-node Hadoop cluster
• 16 GB memory
• 1 x 4-core SMT Xeon
• 2 x 2 TB 7200 RPM SATA disks

Two datasets in use
• Site Events: cookie sorted; 2048 buckets; 640 GB; 100B rows
• Impressions: cookie sorted; 2048 buckets; 700 GB; 17B rows
• Buckets gzip compressed

Running Hadoop 0.21 and Hive 0.9.0
• Had to implement a shim layer to get Hive to run on 0.21

Page 23: Trecul – Data Flow Processing using Hadoop and LLVM

Performance on Simple Queries

[Bar chart comparing Trecul and Hive runtimes on two simple queries: count distinct cookies, and join purchases & impressions; scale 0–20]

Page 24: Trecul – Data Flow Processing using Hadoop and LLVM

Performance on Complex Queries

[Bar chart comparing Trecul and Hive runtimes on three complex queries: join + aggregation, modeling feature generation, and a third query (labeled "Query #3"); scale 0–200]

Page 25: Trecul – Data Flow Processing using Hadoop and LLVM

Thanks!

https://github.com/akamai-tech/trecul