Cascalog Workshop
Example query
Execution
1. Pre-aggregation
2. Aggregation
3. Post-aggregation
Variable dependencies
Pre-aggregation
• Start from generator variables
• Resolve as many variables as possible using:
• Joins
• Functions
• Use as many filters as possible
• Join all sources into one set of tuples
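The "resolve as many variables as possible" step above can be read as a fixed-point loop: keep applying any operation whose input variables are already available until nothing more applies. The following is a minimal Python sketch of that idea, not Cascalog's actual planner. The `Op` record, the `resolve` function, and the two example operations are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    inputs: frozenset   # variables the operation needs
    outputs: frozenset  # variables it resolves (empty for a filter)


def resolve(available, ops):
    """Apply every op whose inputs are available until nothing changes."""
    available = set(available)
    pending = list(ops)
    applied = []
    changed = True
    while changed:
        changed = False
        for op in list(pending):
            if op.inputs <= available:
                available |= op.outputs
                applied.append(op.name)
                pending.remove(op)
                changed = True
    return available, applied, [op.name for op in pending]


# Hypothetical operations; the real ones depend on the query.
ops = [
    Op("delta", frozenset({"?age1", "?age2"}), frozenset({"?delta"})),
    Op("double", frozenset({"?age2"}), frozenset({"?double-age2"})),
]
available, applied, pending = resolve({"?person2", "?age2"}, ops)
print(sorted(available))
print(applied, pending)
```

In this sketch "delta" stays pending because ?age1 is not available yet; a join that brings in ?age1 would let a later pass apply it.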
Aggregation
• Group by resolved output variables
• Apply all aggregators to each group
Post-aggregation
• Resolve the rest of the variables
• Apply rest of filters
Example query
Query planner
1. Start with generators
2. Add functions and filters until fixed point
[?person2 ?age2 ?double-age2]
3. Do a join
[?person1 ?person2 ?age2 ?double-age2]
4. Add functions and filters until fixed point
5. Do a join
[?person1 ?age1 ?person2 ?age2 ?double-age2]
6. Add functions and filters until fixed point
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
7. Group by already satisfied output vars
Group by ?delta
8. Execute aggregators on each group
[?delta ?count]
9. Add functions and filters until fixed point
10. Project fields to [?delta ?count]
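The planner walk-through can be mimicked in plain Python to see how the tuple shape grows at each stage. This is only a sketch: the query itself is not in the transcript, so the `age` and `follows` data, the doubling function, and the definition of the delta as ?age2 minus ?age1 are all assumptions inferred from the variable names.

```python
from collections import Counter

# Hypothetical generators; the slides do not show the actual data.
age = {"alice": 28, "bob": 33, "chris": 38}
follows = [("alice", "bob"), ("bob", "chris"),
           ("alice", "chris"), ("chris", "alice")]

# Generator plus a function: [?person2 ?age2 ?double-age2]
step1 = [{"person2": p, "age2": a, "double_age2": 2 * a}
         for p, a in age.items()]

# Join with follows: [?person1 ?person2 ?age2 ?double-age2]
step2 = [dict(t, person1=p1)
         for t in step1
         for p1, p2 in follows if p2 == t["person2"]]

# Join with age again: [?person1 ?age1 ?person2 ?age2 ?double-age2]
step3 = [dict(t, age1=age[t["person1"]])
         for t in step2 if t["person1"] in age]

# Function resolving ?delta (assumed here to be ?age2 - ?age1)
step4 = [dict(t, delta=t["age2"] - t["age1"]) for t in step3]

# Group by ?delta and run the count aggregator on each group
counts = Counter(t["delta"] for t in step4)

# Project fields to [?delta ?count]
result = sorted(counts.items())
print(result)  # [(-10, 1), (5, 2), (10, 1)]
```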
Cascading pipes
• Each: can occur in Map or Reduce
• GroupBy: Causes a Reduce step
• Every: One or more follow GroupBy
• CoGroup: Join implementation, causes Reduce step
To Cascading
• [?person2 ?age2 ?double-age2]: Each
• [?person1 ?person2 ?age2 ?double-age2]: CoGroup
• [?person1 ?age1 ?person2 ?age2 ?double-age2]: CoGroup
• [?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]: Each, Each
• Group by ?delta: GroupBy
• Execute aggregators on each group, [?delta ?count]: Every
• Functions and filters after aggregation: Each
• Project fields to [?delta ?count]: Each
To MapReduce
[?person2 ?age2 ?double-age2]
[?person1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2]
[?person1 ?age1 ?person2 ?age2 ?double-age2 ?delta]
Group by ?delta
[?delta ?count]
Project fields to [?delta ?count]
The plan runs as three MapReduce jobs: Job 1, Job 2, Job 3
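One way to see why the example needs three jobs: each CoGroup or GroupBy causes a reduce step, and a single MapReduce job has one reduce step. The sketch below splits a linear pipe sequence on that rule. It is a deliberate simplification and an assumption, not Cascading's real planner: the actual plan is a graph with two inputs per join, and where each Each lands relative to a job boundary is the planner's choice. `split_into_jobs` is a hypothetical helper.

```python
REDUCE_PIPES = {"GroupBy", "CoGroup"}


def split_into_jobs(pipes):
    """Split a linear pipe sequence so each job has at most one reduce-causing pipe."""
    jobs, current, has_reduce = [], [], False
    for pipe in pipes:
        if pipe in REDUCE_PIPES and has_reduce:
            # A second reduce step cannot fit in the same job: start a new one.
            jobs.append(current)
            current, has_reduce = [], False
        current.append(pipe)
        has_reduce = has_reduce or pipe in REDUCE_PIPES
    if current:
        jobs.append(current)
    return jobs


# Pipe sequence as labeled on the "To Cascading" slides.
plan = ["Each", "CoGroup", "CoGroup", "Each", "Each",
        "GroupBy", "Every", "Each", "Each"]
jobs = split_into_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print("Job", number, job)
```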
defmapop
Appends fields to tuple
Input tuples:
[A1, B1, C1]
[A2, B2, C2]
[A3, B3, C3]
Output tuples:
[A1, B1, C1, D1, E1]
[A2, B2, C2, D2, E2]
[A3, B3, C3, D3, E3]
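The behavior on this slide, one output tuple per input tuple with new fields appended, can be modeled in a few lines of Python. This models the semantics only and is not the Cascalog `defmapop` macro; `map_op` and `sum_and_product` are hypothetical names.

```python
def map_op(tuples, fn):
    """Append the fields returned by fn to each tuple; one output per input."""
    return [t + list(fn(*t)) for t in tuples]


# Hypothetical function that produces two new fields (D and E on the slide).
def sum_and_product(a, b, c):
    return (a + b, a * b)


rows = [[1, 2, 3], [4, 5, 6]]
mapped = map_op(rows, sum_and_product)
print(mapped)  # [[1, 2, 3, 3, 2], [4, 5, 6, 9, 20]]
```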
deffilterop
Input tuples and predicate result:
[A1, B1, C1] true
[A2, B2, C2] false
[A3, B3, C3] true
Output tuples:
[A1, B1, C1]
[A3, B3, C3]
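As a sketch of the same idea in Python: a filter operation evaluates a predicate per tuple and keeps only the tuples for which it returns true, leaving the fields unchanged. The names `filter_op` and the example predicate are hypothetical and do not reflect the `deffilterop` macro's actual syntax.

```python
def filter_op(tuples, pred):
    """Keep only the tuples for which pred returns a truthy value."""
    return [t for t in tuples if pred(*t)]


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Hypothetical predicate: true, false, true for the three rows.
kept = filter_op(rows, lambda a, b, c: a != 4)
print(kept)  # [[1, 2, 3], [7, 8, 9]]
```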
defmapcatop
Map Concat
Input tuples:
[“a red dog”]
[“ ”]
[“hello”]
Output tuples:
[“a red dog”, “a”]
[“a red dog”, “red”]
[“a red dog”, “dog”]
[“hello”, “hello”]
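A map-concat operation returns a sequence per input tuple and emits one output tuple per element, so an input can yield zero, one, or many outputs. The Python model below reproduces the slide's example under the assumption that the operation splits a sentence on whitespace; `mapcat_op` and `split_words` are hypothetical names, not Cascalog's API.

```python
def mapcat_op(tuples, fn):
    """Emit one output tuple per element returned by fn for each input tuple."""
    return [t + [out] for t in tuples for out in fn(*t)]


def split_words(sentence):
    # str.split() with no argument drops empty strings, so " " yields nothing.
    return sentence.split()


rows = [["a red dog"], [" "], ["hello"]]
words = mapcat_op(rows, split_words)
for row in words:
    print(row)
```

The whitespace-only input produces no output tuples, which matches the four output tuples listed on the slide.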
Aggregators
Regular aggregators - all data goes to reducers
Map output (Map Task 1, Map Task 2):
[“key1”, 1]
[“key1”, 2]
[“key2”, 3]
[“key3”, 3]
[“key3”, 1]
Reduce input (Reduce Task 1, Reduce Task 2):
[“key1”, 1]
[“key1”, 2]
[“key2”, 3]
[“key3”, 3]
[“key3”, 1]
Reduce output:
[“key1”, 3]
[“key2”, 3]
[“key3”, 4]
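The point of this slide is that a regular aggregator does no work on the map side: every tuple crosses the shuffle unchanged. A small Python model makes the cost visible by counting shuffled tuples. The split of tuples between the two map tasks, the CRC-based `partition` function, and the use of `sum` as the aggregator are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def regular_aggregate(map_tasks, num_reducers, agg):
    """Send every tuple to a reducer by key, then aggregate on the reduce side."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    shuffled = 0
    for task in map_tasks:
        for key, value in task:
            reducers[partition(key, num_reducers)][key].append(value)
            shuffled += 1
    result = {k: agg(vs) for r in reducers for k, vs in r.items()}
    return result, shuffled


map_tasks = [
    [("key1", 1), ("key1", 2), ("key2", 3)],  # assumed contents of Map Task 1
    [("key3", 3), ("key3", 1)],               # assumed contents of Map Task 2
]
totals, shuffled = regular_aggregate(map_tasks, 2, sum)
print(totals, "tuples shuffled:", shuffled)
```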
defparallelagg
Parallel aggregators - partial aggregation done in mappers
Input:
Map Task 1: [“nathan”] [“alice”] [“nathan”]
Map Task 2: [“nathan”] [“sally”]
Init:
Map Task 1: [“nathan”, 1] [“alice”, 1] [“nathan”, 1]
Map Task 2: [“nathan”, 1] [“sally”, 1]
Combine (Map):
Map Task 1: [“nathan”, 2] [“alice”, 1]
Map Task 2: [“nathan”, 1] [“sally”, 1]
Combine (Reduce):
Reduce Task 1: [“nathan”, 3]
Reduce Task 2: [“sally”, 1] [“alice”, 1]
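The init and combine stages on this slide can be sketched in Python as follows. It mirrors the slide's counting example: `init` turns each input into a partial value, and the same associative `combine` function runs first inside each map task and then again on the reduce side. The function names are hypothetical and this is a model of the semantics, not the `defparallelagg` macro itself.

```python
def init(name):
    """Turn one input tuple into a (key, partial count) pair."""
    return (name, 1)


def combine(a, b):
    """Merge two partial counts; must be associative and commutative."""
    return a + b


def merge(pairs):
    """Fold (key, value) pairs into one partial value per key using combine."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


map_tasks = [["nathan", "alice", "nathan"], ["nathan", "sally"]]

# Combine (Map): each map task aggregates its own tuples first.
partials = [merge(init(name) for name in task) for task in map_tasks]

# Combine (Reduce): only the partial results cross the shuffle.
totals = merge(pair for partial in partials for pair in partial.items())

print(partials)
print(totals)
```

Only four partial tuples are shuffled here instead of the five input tuples; the saving grows with the number of repeated keys per map task.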
combine
Input tuples:
[1] [2] [3] [3] [4] [5]
Output tuples (duplicates kept):
[1] [2] [3] [4] [5] [3]
union
Input tuples:
[1] [2] [3] [3] [4] [5]
Output tuples (duplicates removed):
[1] [2] [3] [4] [5]
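The difference between the two slides is whether the duplicate [3] survives. A short Python model, under the assumption that the six input tuples come from two sources of three tuples each (the transcript does not show the split), and with function names chosen only to mirror the slide titles:

```python
def combine(*sources):
    """Concatenate all sources, keeping duplicate tuples."""
    return [t for source in sources for t in source]


def union(*sources):
    """Concatenate all sources, then drop duplicate tuples (first one wins)."""
    seen, out = set(), []
    for t in combine(*sources):
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


# Assumed split of the slide's input tuples into two sources.
source_a = [(1,), (2,), (3,)]
source_b = [(3,), (4,), (5,)]

combined = combine(source_a, source_b)
unioned = union(source_a, source_b)
print(len(combined), combined)
print(len(unioned), unioned)
```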
ElephantDB
Generation of domain of data
Key/Value pairs
Pre-shard and index in MapReduce
Distributed Filesystem: Shard 0, Shard 1, Shard 2, Shard 3, Shard 4, Shard 5
ElephantDB
Serving domain of data
DFS: Shard 0, Shard 1, Shard 2, Shard 3, Shard 4, Shard 5
ElephantDB Server, ElephantDB Server, ElephantDB Server
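The generate-then-serve flow on these two slides can be illustrated with a toy Python model: key/value pairs are assigned to a fixed number of shards up front, and a lookup only needs to consult the shard that owns the key. Everything here is an assumption for illustration: the CRC-based `shard_for` function, in-memory dicts standing in for indexed shard files, and the helper names are hypothetical and do not describe ElephantDB's real sharding scheme or API.

```python
import zlib

NUM_SHARDS = 6  # Shard 0 through Shard 5, as on the slides


def shard_for(key, num_shards=NUM_SHARDS):
    """Assumed sharding scheme; ElephantDB's real scheme may differ."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shards(pairs, num_shards=NUM_SHARDS):
    """Pre-shard key/value pairs; each dict stands in for one indexed shard."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key] = value
    return shards


def lookup(shards, key):
    """Serve a read by consulting only the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)


# Hypothetical domain of data.
pairs = [("nathan", 3), ("sally", 1), ("alice", 1)]
shards = build_shards(pairs)
print(lookup(shards, "nathan"), lookup(shards, "bob"))
```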