Trecul – Data Flow Processing using Hadoop and LLVM
DESCRIPTION
Trecul – Data Flow Processing using Hadoop and LLVM. David Blair. Agenda: Problem Statement; Trecul User Level Overview; Trecul Architecture and Design. Advertising Decision Solutions: the Akamai Advertising Network's full funnel approach (awareness, prospecting, remarketing).
TRANSCRIPT
Trecul – Data Flow Processing using Hadoop and LLVM
David Blair
©2012 AKAMAI | FASTER FORWARD™
Agenda
Problem Statement
Trecul User Level Overview
Trecul Architecture and Design
Advertising Decision Solutions
The Akamai Advertising Network
Full Funnel Approach: Awareness, Prospecting, Remarketing
Data Coop: 500+ sites' browse & buy data; 300M monthly active cookies
Data Collection: 600 million site events per day; 50 million impressions per day
Making Data into Decisions
Ad Serving and Data Collection
Data Coop
Modeling and Scoring
Attribution and Billing
Problem Statement
Had a working system, but much pain: commercial parallel RDBMS, MySQL, Perl
Functional Requirements
- Natural partitioning key = User/Cookie
- Most processing aligns with that key
- Handling of structured data only (e.g. no text analysis)
Non-Functional Requirements
- Fault tolerance
- Performance/cost
- Must be deployable in the Akamai network
Reach Goals
- Ease of use
- Ad-hoc queries
Hadoop to the Rescue (Almost)
HDFS
- Good-enough performance
- Hooks to customize data placement
- Handles most single-node failures
Map Reduce
- Cluster and resource management
- Partition-parallel computing model
- Shuffles for the cases when we need them
- Handles most single-node failures
Mystery guest
- Ad-hoc Java – an anti-pattern
- Hive or Pig – too slow for our needs
- or …
g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;
Anatomy of a Trecul Program
g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;
Anatomy of a Trecul Program : Operators
g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;
Anatomy of a Trecul Program : Arguments
g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;
Anatomy of a Trecul Program : Arrows
$ ads-df --file - << EOF
> g = generate[output="'Hello World!' AS greeting",
>              numRecords=1];
> p = print[limit=10];
> g -> p;
> d = devNull[];
> p -> d;
> EOF
Hello World!
Streaming pipes & filters without threads or processes
Anatomy of a Trecul Program
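The "streaming pipes & filters without threads" idea can be sketched in miniature. Below is a hypothetical Python analogue (not the Trecul runtime, which is C++ with a custom operator scheduler): each operator is a generator that pulls from its upstream, so the whole g -> p -> d graph runs on a single thread with no processes or pipes between operators.

```python
# A minimal single-threaded pipes & filters sketch of the Hello World
# program above. Operator names mirror the TQL operators; the record
# representation (a dict per row) is purely illustrative.

def generate(output, num_records):
    # Source operator: emit `num_records` copies of a constant record.
    for _ in range(num_records):
        yield {"greeting": output}

def print_op(upstream, limit):
    # Pass-through operator: print at most `limit` records, forward all.
    for i, rec in enumerate(upstream):
        if i < limit:
            print(rec["greeting"])
        yield rec

def dev_null(upstream):
    # Sink operator: consume everything, produce nothing.
    for _ in upstream:
        pass

# Wire the graph exactly as the arrows do: g -> p, p -> d.
g = generate("Hello World!", 1)
p = print_op(g, limit=10)
dev_null(p)   # driving the sink pulls the whole pipeline
```

Because only the sink drives execution, no buffering or synchronization is needed between operators; Trecul's scheduler generalizes this to graphs with branches and merges.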
Basic Trecul Map Reduce Program

map.tql:
m = map[format="cre_date DATETIME, event_id INTEGER, greeting VARCHAR"];
e = emit[key="greeting"];
m -> e;

reduce.tql:
r = reduce[];
gb = group_by[sortKey="greeting", output="greeting, SUM(1) AS greeting_cnt"];
r -> gb;
w = write[file="hdfs://default:0/users/dblair/demo_mr"];
gb -> w;

$ ads-df --map map.tql --reduce reduce.tql --input /my/test/data
Example with branching and merging
r = read[file="hdfs://default:0/foo",
         format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
c = copy[output="akid", output="input.*"];
r -> c;
g = group_by[sortKey="akid", output="akid, SUM(1) AS activity"];
c -> g;
j = merge_join[leftKey="akid", rightKey="akid",
               where="activity > 5", output="l.*"];
c -> j;
g -> j;

[Diagram: read -> copy; copy feeds the join directly and also feeds group, whose output flows into the join]
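The merge_join in this example exploits the fact that both branches are sorted on akid. A hypothetical Python sketch of that logic (not Trecul's implementation; it assumes, as in the example, that the right side has one row per key because it comes from a group_by):

```python
# Merge join over two key-sorted inputs, mirroring the example above:
# the left side carries full event records, the right side carries one
# activity count per akid, and only rows with activity > 5 survive.
# output="l.*" means only left-side columns are emitted.

def merge_join(left, right, left_key, right_key, predicate):
    li, ri = 0, 0
    out = []
    while li < len(left) and ri < len(right):
        lk, rk = left[li][left_key], right[ri][right_key]
        if lk < rk:
            li += 1            # left key too small, advance left
        elif lk > rk:
            ri += 1            # right key too small, advance right
        else:
            # Keys match: apply the join predicate over both sides.
            joined = {**left[li], **right[ri]}
            if predicate(joined):
                out.append(left[li])   # emit l.* only
            li += 1            # right keys are unique, keep right fixed
    return out
```

A single forward pass over each input suffices, which is why Trecul offers merge join alongside hash join for presorted data.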
Scope of Trecul
File IO
- Read, write
- Simple parser/printer
- Local filesystem and HDFS
- Bucketed mode
Merging
- Hash Join & Merge Join (inner, outer, semi, anti-semi)
- Union All
- Sorted Union
- Switch
Transformation
- Generate, Copy
- Filter
Aggregation
- Hash Group By
- Sort Group By
- Hybrid Group By
- Sort Running Total
Sort
- External sort
- Supports presorted keys
Map Reduce Integration
- Emit
- Map
- Reduce
MySQL Integration
- Select
- Insert
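Of the aggregation strategies listed, hash group-by is the one that needs no sorted input: it keeps one aggregate slot per distinct key. A sketch of that strategy (record layout and the SUM(1) aggregate are illustrative, not Trecul's internals):

```python
# Hash group-by: one pass over unsorted input, one accumulator per
# distinct key. `init` builds a fresh aggregate row for a new key;
# `update` folds each input record into its key's accumulator.

def hash_group_by(records, key, init, update):
    groups = {}
    for rec in records:
        slot = groups.get(rec[key])
        if slot is None:
            groups[rec[key]] = slot = init(rec)
        update(slot, rec)
    return list(groups.values())

# greeting, SUM(1) AS greeting_cnt -- as in the map/reduce example.
rows = hash_group_by(
    [{"greeting": "hi"}, {"greeting": "hi"}, {"greeting": "yo"}],
    key="greeting",
    init=lambda r: {"greeting": r["greeting"], "greeting_cnt": 0},
    update=lambda s, r: s.__setitem__("greeting_cnt", s["greeting_cnt"] + 1),
)
```

Sort group-by instead streams presorted input with a single accumulator, and hybrid group-by falls between the two; the trade-off is memory per distinct key versus the cost of a sort.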
Limits of Trecul
Relational data
- Primitive types: INTEGER, BIGINT, DOUBLE PRECISION, DECIMAL, DATE, DATETIME, CHAR(N), VARCHAR
- No container types: list, set, bag, map
- No Unicode, no code page support
No metadata management
- ADS has special operators that encapsulate specific datasets
- Formats may be stored in files
No optimizer
- We write very complex queries
- No barrier to construction of optimal plans
- Predictable performance in production
Trecul Architecture
C++, Linux
Expression Language
- Parser
- Semantic Analysis
- Codegen
Operator Library
Dataflow Runtime
- OS Services
- Graph Semantic Analysis
- Operator Scheduler
Harness Integration
- Single Machine
- Hadoop
- MPI (experimental)
Trecul and LLVM
LLVM
- Open source compiler and toolchain project
- Used extensively by Apple and Google
- Supports static and JIT compilation
- http://www.llvm.org
Trecul Expressions
- Transforms, predicates, aggregates
- Expressions & data structures compiled using LLVM
- Operators are parameterized with expressions
- Most operator code, the scheduler, etc. are statically compiled
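"Operators are parameterized with expressions" means the statically compiled operator receives a freshly compiled callable for its expression text. A rough Python analogue (Python's `compile`/`eval` stand in for LLVM codegen here; names are illustrative): compile the where-clause once, then hand the result to a generic filter operator.

```python
# Sketch of expression-parameterized operators: the filter operator is
# generic and statically written; only the predicate is "codegen'd"
# from the expression text, once, before any records flow.

def compile_predicate(expr):
    # One-time compilation: expr becomes a code object, so it is not
    # re-parsed per record (in Trecul this step is LLVM JIT codegen).
    code = compile(expr, "<where>", "eval")
    return lambda rec: eval(code, {}, rec)

def filter_op(upstream, predicate):
    # Generic operator body: knows nothing about the expression.
    for rec in upstream:
        if predicate(rec):
            yield rec

where = compile_predicate("a * b > 10")
survivors = list(filter_op([{"a": 2, "b": 3}, {"a": 4, "b": 5}], where))
```

The same pattern covers transforms and aggregate update functions: one compiled function per expression, plugged into a small library of fixed operator skeletons.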
Trecul Expression Compilation
R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;
struct _R { int32_t a; int64_t b; date c; };
Note: The use of pseudo-C is for illustration only; we transform Trecul directly to LLVM IR and then to machine code.
Trecul Expression Compilation
R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;
struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }
Trecul Expression Compilation
R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;
struct _R { int32_t a; int64_t b; date c; };
bool _F(_R * rec) { return rec->c >= date(1,1,2012); }
struct _G { date c; int64_t s; };
void _G_init(_G * out, _R * in) { out->c = in->c; out->s = 0LL; }
void _G_upd(_G * out, _R * in) { out->s += in->a * in->b; }
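To make the group-by half of the pseudo-C concrete, here is a runnable transcription (illustration only: the real system emits LLVM IR, not Python). `g_init` seeds one aggregate row per distinct hashKey value of c; `g_upd` folds SUM(a*b) into it; the driver loop is what the hash group-by operator does around the two compiled functions.

```python
# Runnable analogue of _G_init / _G_upd from the slide, plus the
# operator-side hash loop that invokes them per record.
from datetime import date

def g_init(rec):
    # _G_init: copy the key, zero the SUM accumulator.
    return {"c": rec["c"], "s": 0}

def g_upd(out, rec):
    # _G_upd: s += a * b
    out["s"] += rec["a"] * rec["b"]

def run_group_by(records):
    groups = {}
    for rec in records:
        # hashKey="c": one accumulator per distinct date.
        out = groups.setdefault(rec["c"], g_init(rec))
        g_upd(out, rec)
    return list(groups.values())

rows = run_group_by([
    {"a": 2, "b": 3, "c": date(2012, 1, 5)},
    {"a": 4, "b": 1, "c": date(2012, 1, 5)},
    {"a": 7, "b": 2, "c": date(2012, 2, 1)},
])
```

In the compiled version these two functions operate directly on packed record structs (_R in, _G out), so there is no per-record interpretation overhead at all.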
Integration with Hadoop
[Diagram: the query goes through the Trecul compiler and is submitted via Hadoop Pipes; each task JVM hosts a Trecul executor, and the executors read and write HDFS]
Performance Testing
All tests performed on an 82-node Hadoop cluster
- 16 GB memory
- 1 x 4-core SMT Xeon
- 2 x 2 TB 7200 RPM SATA disks
Two datasets in use
- Site Events: cookie sorted; 2048 buckets; 640 GB; 100B rows
- Impressions: cookie sorted; 2048 buckets; 700 GB; 17B rows
- Buckets gzip compressed
Running Hadoop 0.21 and Hive 0.9.0
- Had to implement a shim layer to get Hive to run on 0.21
Performance on Simple Queries
[Bar chart, y-axis 0–20, comparing Trecul and Hive on two simple queries: count distinct cookies, and join of purchases & impressions]
Performance on Complex Queries
[Bar chart, y-axis 0–200, comparing Trecul and Hive on three complex queries: join + aggregation, modeling feature generation, and query #3]
Thanks!
https://github.com/akamai-tech/trecul