Trecul – Data Flow Processing using Hadoop and LLVM
David Blair

Page 1: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul – Data Flow Processing using Hadoop and LLVM
David Blair

Page 2: Trecul – Data Flow Processing using Hadoop and LLVM

Agenda

• Problem Statement
• Trecul User Level Overview
• Trecul Architecture and Design

Page 3: Trecul – Data Flow Processing using Hadoop and LLVM

Advertising Decision Solutions

The Akamai Advertising Network
• Full Funnel Approach: Awareness, Prospecting, Remarketing

Data Coop
• 500+ sites' browse & buy data
• 300M monthly active cookies

Data Collection
• 600 million site events per day
• 50 million impressions per day

Page 4: Trecul – Data Flow Processing using Hadoop and LLVM

Making Data into Decisions

[Diagram: Ad Serving and Data Collection → Data Coop → Modeling & Scoring → Attribution & Billing]

Page 5: Trecul – Data Flow Processing using Hadoop and LLVM

Problem Statement

Had a working system, but much pain
• Commercial parallel RDBMS, MySQL, Perl

Functional Requirements
• Natural partitioning key = User/Cookie
• Most processing aligns with that key
• Handling of structured data only (e.g. no text analysis)

Non-Functional Requirements
• Fault tolerance
• Performance/cost
• Must be deployable in the Akamai network

Reach Goals
• Ease of use
• Ad-hoc queries

Page 6: Trecul – Data Flow Processing using Hadoop and LLVM

Hadoop to the Rescue (Almost)

HDFS
• Good enough performance
• Hooks to customize data placement
• Handles most single-node failures

Map Reduce
• Cluster and resource management
• Partition-parallel computing model
• Shuffles for cases when we need it
• Handles most single-node failures

Mystery guest
• Ad-hoc Java – an anti-pattern
• Hive or Pig – too slow for our needs
• or …

Page 7: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program

Page 8: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Operators

Page 9: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Arguments

Page 10: Trecul – Data Flow Processing using Hadoop and LLVM

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Anatomy of a Trecul Program: Arrows

Page 11: Trecul – Data Flow Processing using Hadoop and LLVM

$ ads-df --file - << EOF
> g = generate[output="'Hello World!' AS greeting",
>     numRecords=1];
> p = print[limit=10];
> g -> p;
> d = devNull[];
> p -> d;
> EOF
Hello World!

Streaming pipes & filters without threads or processes

Anatomy of a Trecul Program
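The phrase "streaming pipes & filters without threads or processes" is the key point here: connected operators execute as ordinary function calls in a single thread, each pushing records to its downstream port. Below is a minimal C++ sketch of that idea; the Operator, Print, and DevNull types are illustrative stand-ins, not Trecul's actual classes.

#include <cstdio>

// Illustrative sketch only: operators connected by arrows run as plain
// function calls in one thread, with no inter-operator queues.
struct Operator {
    Operator* next = nullptr;                  // the "->" arrow
    virtual void push(const char* record) = 0; // receive one record
    virtual ~Operator() = default;
};

struct Print : Operator {
    void push(const char* record) override {
        std::puts(record);                     // print, then forward downstream
        if (next) next->push(record);
    }
};

struct DevNull : Operator {
    void push(const char*) override {}         // swallow records
};

int main() {
    Print p;
    DevNull d;
    p.next = &d;                               // p -> d
    p.push("Hello World!");                    // the single generated record
}

Because each record flows through the whole chain before the next is produced, the dataflow streams without materializing intermediate results.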

Page 12: Trecul – Data Flow Processing using Hadoop and LLVM

Basic Trecul Map Reduce Program

map.tql:
m = map[format="cre_date DATETIME, event_id INTEGER, greeting VARCHAR"];
e = emit[key="greeting"];
m -> e;

reduce.tql:
r = reduce[];
gb = group_by[sortKey="greeting", output="greeting, SUM(1) AS greeting_cnt"];
r -> gb;
w = write[file="hdfs://default:0/users/dblair/demo_mr"];
gb -> w;

$ ads-df --map map.tql --reduce reduce.tql --input /my/test/data

Page 13: Trecul – Data Flow Processing using Hadoop and LLVM

Example with branching and merging

r = read[file="hdfs://default:0/foo",
    format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
c = copy[output="akid", output="input.*"];
r -> c;
g = group_by[sortKey="akid", output="akid, SUM(1) AS activity"];
c -> g;
j = merge_join[leftKey="akid", rightKey="akid", where="activity > 5", output="l.*"];
c -> j;
g -> j;

[Diagram: read → copy → join, with a second branch copy → group → join]

Page 14: Trecul – Data Flow Processing using Hadoop and LLVM

Scope of Trecul

File IO
• Read, write
• Simple parser/printer
• Local filesystem and HDFS
• Bucketed mode

Merging
• Hash Join & Merge Join (inner, outer, semi, anti-semi)
• Union All
• Sorted Union
• Switch

Transformation
• Generate, Copy
• Filter

Aggregation
• Hash Group By
• Sort Group By
• Hybrid Group By
• Sort Running Total

Sort
• External sort
• Supports presorted keys

Map Reduce Integration
• Emit
• Map
• Reduce

MySQL Integration
• Select
• Insert

Page 15: Trecul – Data Flow Processing using Hadoop and LLVM

Limits of Trecul

Relational data only
• Primitive types: INTEGER, BIGINT, DOUBLE PRECISION, DECIMAL, DATE, DATETIME, CHAR(N), VARCHAR
• No container types: list, set, bag, map
• No Unicode, no code page support

No metadata management
• ADS has special operators that encapsulate specific datasets
• Formats may be stored in files

No optimizer
• We write very complex queries
• No barrier to construction of optimal plans
• Predictable performance in production

Page 16: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Architecture

C++, Linux

Expression Language
• Parser
• Semantic Analysis
• Codegen

Operator Library

Dataflow Runtime
• OS Services
• Graph Semantic Analysis
• Operator Scheduler

Harness Integration
• Single Machine
• Hadoop
• MPI (experimental)

Page 17: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul and LLVM

LLVM
• Open source compiler and toolchain project
• Used extensively by Apple and Google
• Supports static and JIT compilation
• http://www.llvm.org

Trecul Expressions
• Transforms, predicates, aggregates
• Expressions & data structures compiled using LLVM
• Operators are parameterized with expressions

Most operator code, plus the scheduler etc., is statically compiled
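To make "operators are parameterized with expressions" concrete, here is a hypothetical C++ sketch of the static/JIT split. The jit_compile_predicate function is a stand-in for Trecul's codegen path (parse, lower to LLVM IR, JIT-compile), not a real Trecul or LLVM API; it returns a hand-written function so the sketch is self-contained.

#include <cstdint>
#include <string>

// Record layout matching "a INTEGER, b BIGINT, c DATE" from the next slides;
// date is modeled as days since the epoch purely for illustration.
struct Record { int32_t a; int64_t b; int32_t c; };

// Signature of a JIT-compiled predicate over a record.
using Predicate = bool (*)(const Record*);

// Stand-in for codegen: the real system JITs the where-clause to machine
// code; here we return a hand-written equivalent.
static bool after_2012(const Record* r) { return r->c >= 15340; } // 2012-01-01
Predicate jit_compile_predicate(const std::string& /*whereClause*/) {
    return &after_2012;
}

// Statically compiled operator shell, parameterized at query-compile time
// with a pointer to the JIT-produced code.
struct FilterOperator {
    Predicate where;
    explicit FilterOperator(const std::string& clause)
        : where(jit_compile_predicate(clause)) {}
    bool accept(const Record* rec) const { return where(rec); }
};

Only the predicate body changes from query to query, so the filter's inner loop calls straight into freshly generated machine code with no interpretation overhead.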

Page 18: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };

Note: The use of pseudo-C is for illustration only; we transform Trecul directly to LLVM IR and then to machine code.

Page 19: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }

Page 20: Trecul – Data Flow Processing using Hadoop and LLVM

Trecul Expression Compilation

R = read[file="/home/dblair/example",
    format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }
struct _G { date c; int64_t s; };
void _G_init(_G * out, _R * in) { out->c = in->c; out->s = 0LL; }
void _G_upd(_G * out, _R * in) { out->s += in->a*in->b; }
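A hypothetical sketch of how a statically compiled hash group-by shell could drive the generated _G_init/_G_upd functions above. The hash map and driver loop are illustrative, not Trecul's actual runtime; date is modeled as days since the epoch so it can key the map directly.

#include <cstdint>
#include <unordered_map>
#include <vector>

using date = int32_t;  // illustrative stand-in for the date type

struct _R { int32_t a; int64_t b; date c; };
struct _G { date c; int64_t s; };

// In Trecul these two functions are JIT-generated from the group_by
// arguments; they are written by hand here only to show the calling
// convention the operator relies on.
void _G_init(_G* out, const _R* in) { out->c = in->c; out->s = 0; }
void _G_upd (_G* out, const _R* in) { out->s += static_cast<int64_t>(in->a) * in->b; }

// Statically compiled shell: one _G accumulator per key, initialized on
// first sight and updated for every record in the group.
std::vector<_G> hash_group_by(const std::vector<_R>& input) {
    std::unordered_map<date, _G> groups;
    for (const _R& rec : input) {
        auto [it, inserted] = groups.try_emplace(rec.c);
        if (inserted) _G_init(&it->second, &rec);  // first record of this key
        _G_upd(&it->second, &rec);                 // accumulate SUM(a*b)
    }
    std::vector<_G> result;
    result.reserve(groups.size());
    for (auto& kv : groups) result.push_back(kv.second);
    return result;
}

The shell never inspects the record layout itself; everything type-specific lives in the two generated functions, which is what lets one operator implementation serve every group_by expression.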

Page 21: Trecul – Data Flow Processing using Hadoop and LLVM

Integration with Hadoop

[Diagram: the query is compiled and submitted to the cluster via Hadoop Pipes; each Task JVM hosts a Trecul executor, and the executors read from and write to HDFS]

Page 22: Trecul – Data Flow Processing using Hadoop and LLVM

Performance Testing

All tests performed on an 82-node Hadoop cluster
• 16 GB memory
• 1 x 4-core SMT Xeon
• 2 x 2 TB 7200 RPM SATA disks

Two datasets in use
• Site Events: cookie sorted; 2048 buckets; 640 GB; 100B rows
• Impressions: cookie sorted; 2048 buckets; 700 GB; 17B rows
• Buckets gzip compressed

Running Hadoop 0.21 and Hive 0.9.0
• Had to implement a shim layer to get Hive to run on 0.21

Page 23: Trecul – Data Flow Processing using Hadoop and LLVM

Performance on Simple Queries

[Bar chart comparing Trecul and Hive runtimes on two simple queries: count distinct cookies, and join purchases & impressions; scale 0–20]

Page 24: Trecul – Data Flow Processing using Hadoop and LLVM

Performance on Complex Queries

[Bar chart comparing Trecul and Hive runtimes on three complex queries: join + aggregation, modeling feature generation, and a third query (labeled "Query #3"); scale 0–200]

Page 25: Trecul – Data Flow Processing using Hadoop and LLVM

Thanks!

https://github.com/akamai-tech/trecul