DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon Valley
Presented by: TD (Tathagata Das)
Designing a general-purpose language for writing distributed data-parallel programs for a compute cluster
• General purpose
• Single-thread abstraction
• Familiar language / environment
??? : Dryad : Cluster ≈ Shell script : Shell : Machine
Dryad = Execution Engine
• Nebula – limited to existing binaries
• Scope – SQL-ish, not general purpose
• Can we do better?
– Can we get the general-purpose-ness of C#/Java and the conciseness of SQL?
– And at the same time, be efficient too?
Can I have my cake and eat it too?
Language Integrated Query (LINQ)
• The creamy goodness of SQL-like declarative queries within a general-purpose programming language
• Basic abstraction - collections
“All the world’s a collection, and all the men and women merely iterate on collections” - implied by Shakespeare
Collections, Iterators and LINQ
IEnumerable<T> + LINQ =>
    using System.Linq;
    var result = from num in numbers
                 where num % 2 == 0
                 orderby num
                 select num;
IEnumerable<T> =>
    List<int> result = new List<int>();
    foreach (int num in numbers) {
        if (num % 2 == 0) result.Add(num);
    }
    result.Sort();
Syntactical sweetness of LINQ
Query Style
    var result = from num in numbers
                 where num % 2 == 0
                 orderby num
                 select num;
Method Style
    var result = numbers
        .Where(num => num % 2 == 0)
        .OrderBy(n => n);
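The two styles are interchangeable: the C# compiler translates query syntax into the corresponding method calls. A minimal sketch on an in-memory array that checks both forms produce the same sequence. The class name, method names, and input values are made up for illustration.

```csharp
using System;
using System.Linq;

public static class QueryVsMethodStyle
{
    // Query style: even numbers in ascending order.
    public static int[] EvensQueryStyle(int[] numbers)
    {
        var result = from num in numbers
                     where num % 2 == 0
                     orderby num
                     select num;
        return result.ToArray();
    }

    // Method style: the same query written as chained calls.
    public static int[] EvensMethodStyle(int[] numbers)
    {
        return numbers
            .Where(num => num % 2 == 0)
            .OrderBy(n => n)
            .ToArray();
    }

    public static void Main()
    {
        int[] numbers = { 7, 4, 10, 3, 2 };       // illustrative input
        int[] a = EvensQueryStyle(numbers);
        int[] b = EvensMethodStyle(numbers);
        Console.WriteLine(string.Join(", ", a));  // 2, 4, 10
        Console.WriteLine(a.SequenceEqual(b));    // True
    }
}
```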
LINQ Functionality
• Select / SelectMany: Map (1-to-1 / 1-to-many)
• Where: Filter
• GroupBy: Reduce
• OrderBy: Sort
• Join: Join
• Union / Intersect / Except: Set operations
• …
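To make the operator-to-concept mapping concrete, here is a small sketch that runs one query per operator family with plain LINQ-to-Objects. The data, the `OperatorTour` class, and the `Show` helper are illustrative assumptions, not part of the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OperatorTour
{
    // Formats a sequence as "name: item,item,...".
    static string Show<T>(string name, IEnumerable<T> items)
    {
        return name + ": " + string.Join(",", items);
    }

    // Runs one small query per operator family and returns printable results.
    public static List<string> Run()
    {
        string[] lines = { "a b", "b c c" };   // illustrative data
        int[] xs = { 1, 2, 3, 4 };
        int[] ys = { 3, 4, 5 };

        return new List<string>
        {
            // Map, 1-to-1: one output per input.
            Show("Select", lines.Select(l => l.Length)),                   // 3,5
            // Map, 1-to-many: each line yields several words.
            Show("SelectMany", lines.SelectMany(l => l.Split(' '))),       // a,b,b,c,c
            // Filter.
            Show("Where", xs.Where(x => x % 2 == 0)),                      // 2,4
            // Reduce: group by parity, then sum each group.
            Show("GroupBy", xs.GroupBy(x => x % 2).Select(g => g.Sum())),  // 4,6
            // Sort.
            Show("OrderBy", xs.OrderByDescending(x => x)),                 // 4,3,2,1
            // Join on equal values, emitting the sum of each matched pair.
            Show("Join", xs.Join(ys, x => x, y => y, (x, y) => x + y)),    // 6,8
            // Set operation.
            Show("Union", xs.Union(ys)),                                   // 1,2,3,4,5
        };
    }

    public static void Main()
    {
        foreach (string line in Run()) Console.WriteLine(line);
    }
}
```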
LINQ Providers
SQL, XML, …, Google, Wikipedia, Twitter
• Select / SelectMany
• Where
• GroupBy
• OrderBy
• Join
• Union / Intersect / Except
• …
LINQ System Architecture
[Figure: a .NET program passes queries to the LINQ provider interface and gets objects back. Providers: LINQ-to-SQL, LINQ-to-XML, PLINQ, DryadLINQ.]
Parallel Collections
[Figure: a collection split into partitions]
Simplest example: GFS/HDFS file
Dryad + LINQ = DryadLINQ
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);
    var lengths = input.Select(line => line.ToString().Length);
Word Count with DryadLINQ
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    // SplitLineRecord: user-defined helper that splits one line on the separator
    var words = input.SelectMany(x => SplitLineRecord(x, separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
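Running the query above needs a DryadLINQ cluster, but its shape can be tried locally with LINQ-to-Objects. The sketch below is an approximation under stated assumptions: a `string` sequence stands in for `PartitionedTable<LineRecord>`, `KeyValuePair<string, int>` stands in for `Pair`, and `TopWords` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalWordCount
{
    // Returns the k most frequent words, most frequent first.
    // Ties keep first-appearance order because OrderByDescending is a stable sort.
    public static List<KeyValuePair<string, int>> TopWords(
        IEnumerable<string> lines, char separator, int k)
    {
        var words = lines.SelectMany(line => line.Split(separator));
        var groups = words.GroupBy(w => w);
        var counts = groups.Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
        var ordered = counts.OrderByDescending(p => p.Value);
        return ordered.Take(k).ToList();
    }

    public static void Main()
    {
        string[] lines = { "a,b,a", "c,a,b" };   // illustrative input
        foreach (var pair in TopWords(lines, ',', 2))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
        // Prints:
        // a: 3
        // b: 2
    }
}
```

The operator chain is the same as on the slide (SelectMany, GroupBy, Select, OrderByDescending, Take); only the input and output types differ.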
Execution Plan Graph: Get → SM → G → S → O → Take (SM = SelectMany, G = GroupBy, S = Select, O = OrderBy)
DryadLINQ Word Count Dryad
[Figure: the Execution Plan Graph (SM, G, S, O) expands into a Data Flow Graph and then into a Distributed Data Flow Graph, whose vertices (SM, D, MS, G, S) are replicated across partitions.]
DryadLINQ Architecture [1]
[Figure: on the client machine, a .NET program hands a query expression to DryadLINQ, which produces a distributed query plan, vertex code, and context. On the cluster, the Dryad JM runs the Dryad execution from input tables to output tables.]
DryadLINQ Code Generation
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    // SplitLineRecord: user-defined helper that splits one line on the separator
    var words = input.SelectMany(x => SplitLineRecord(x, separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
Conversion of subexpressions to code for Dryad vertices…
1. Local variables
2. Local libraries and functions
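Local variables matter because a LINQ provider receives the query as an expression tree, and that tree refers to whatever locals the lambda captured. The sketch below uses the standard `System.Linq.Expressions` API to find a captured variable and read its value without running the query. It only illustrates the idea; it is not DryadLINQ's code generator, and `CapturedContext`, `MakePredicate`, and `DescribeCapture` are hypothetical names.

```csharp
using System;
using System.Linq.Expressions;

public static class CapturedContext
{
    // Builds a query predicate that depends on a local variable.
    public static Expression<Func<string, bool>> MakePredicate()
    {
        string separator = ",";   // local variable captured by the lambda
        return line => line.Contains(separator);
    }

    // Reads the captured variable's name and value by inspecting the
    // expression tree rather than invoking the predicate.
    public static string DescribeCapture(Expression<Func<string, bool>> predicate)
    {
        var call = (MethodCallExpression)predicate.Body;      // line.Contains(...)
        var captured = (MemberExpression)call.Arguments[0];   // field on the closure object
        string value = Expression.Lambda<Func<string>>(captured).Compile()();
        return captured.Member.Name + " = \"" + value + "\"";
    }

    public static void Main()
    {
        var predicate = MakePredicate();
        Console.WriteLine(DescribeCapture(predicate));   // separator = ","
        Console.WriteLine(predicate.Compile()("a,b"));   // True
    }
}
```

A system that ships code to remote vertices has to carry such values along, since the closure object exists only on the client.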
DryadLINQ Architecture [2]
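[Figure: on the client machine, a .NET program invokes a query; DryadLINQ turns the query expression into a distributed query plan, vertex code, and context. On the cluster, the Dryad JM runs the Dryad execution from input tables to output tables. Results return to the client as an output PartitionedTable and are read as .NET objects.]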
Combining with LINQ-to-SQL
[Figure: DryadLINQ splits a query into subqueries, some of which are handed to LINQ-to-SQL.]
DryadLINQ Optimizations
• Some are similar to existing DB optimizations
– Eliminate redundant partitioning steps
– Aggregation steps moved up the graph, before partitioning steps
• Existing Dryad optimizations as well
– Dynamic reconfiguration of aggregation trees
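Moving aggregation ahead of partitioning can be illustrated with word counting: if each partition counts its own words first, only one (word, partial count) record per distinct word per partition has to cross the network. The sketch below simulates this in a single process. The partitions are in-memory arrays, "shuffled records" is just a counter, and all names are assumptions for illustration rather than anything DryadLINQ generates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartialAggregation
{
    // Baseline: every word is shuffled, then counted.
    public static Dictionary<string, int> CountAfterShuffle(
        List<string[]> partitions, out int recordsShuffled)
    {
        var shuffled = partitions.SelectMany(p => p).ToList();
        recordsShuffled = shuffled.Count;
        return shuffled.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
    }

    // Optimized: count inside each partition, shuffle only the partial
    // counts, then sum the partial counts per word.
    public static Dictionary<string, int> CountBeforeShuffle(
        List<string[]> partitions, out int recordsShuffled)
    {
        var partials = partitions
            .SelectMany(p => p.GroupBy(w => w)
                              .Select(g => new KeyValuePair<string, int>(g.Key, g.Count())))
            .ToList();
        recordsShuffled = partials.Count;
        return partials.GroupBy(kv => kv.Key)
                       .ToDictionary(g => g.Key, g => g.Sum(kv => kv.Value));
    }

    public static void Main()
    {
        var partitions = new List<string[]>   // illustrative data
        {
            new[] { "a", "a", "b", "a" },
            new[] { "b", "b", "a", "b" },
        };
        int naive, optimized;
        var r1 = CountAfterShuffle(partitions, out naive);
        var r2 = CountBeforeShuffle(partitions, out optimized);
        Console.WriteLine(naive + " vs " + optimized);         // 8 vs 4
        Console.WriteLine(r1["a"] == r2["a"] && r1["b"] == r2["b"]);  // True
    }
}
```

The rewrite is valid here because counting decomposes into per-partition counts followed by a sum.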
Thoughts [1]
• Easy to read, though reads more like a PL paper
• What are the system contributions that are different from Dryad's?
• Does the high level abstraction provide any extra information that allow
Thoughts [2]
Interesting anecdote…
DryadLINQ is inefficient for random-access workloads, yet for some of these workloads it outperformed systems customized for random access
HDD performance characteristics are such that a sequential read (even if you discard 99% of the data) can be faster than many small random accesses
Thoughts [3]
• How different is FlumeJava from this?