Download - Gl conference2014 scalabledata_yucheng
![Page 1: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/1.jpg)
The Technology
Behind
Yucheng Low, PhD Chief Architect
GraphLab Create
![Page 2: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/2.jpg)
GraphLab Philosophy Users-First Architecture
![Page 3: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/3.jpg)
Architecture
Systems
User
![Page 4: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/4.jpg)
Architecture
Systems
User
Systems-First Architectures
Systems define constraints. Optimize for performance.
PowerGraph
![Page 5: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/5.jpg)
Architecture
Systems
User
Users-First Architectures
Users define constraints. Optimize for user interaction.
![Page 6: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/6.jpg)
What is a Users-First Architecture for Data Science?
![Page 7: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/7.jpg)
SFrame and SGraph
Building on decades of database and systems research.
Built by data scientists, for data scientists.
![Page 8: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/8.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
![Page 9: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/9.jpg)
Enabling users to easily and efficiently translate between both representations to get the best of both worlds.
![Page 10: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/10.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
![Page 11: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/11.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
![Page 12: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/12.jpg)
SFrame Design
Jobs fail because: • Machine run out of memory • Did not set Java Heap Size correctly • Resource Configuration X needs to be
bigger.
Pain Point #1: Resource Limits
![Page 13: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/13.jpg)
SFrame Design • Graceful Degradation as 1st principle
• Always Works
Pain Point #2: Too Strict or Too Weak Schemas We want strong schema types. We also want weak schema types. Missing Values
![Page 14: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/14.jpg)
SFrame Design • Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes • Strong schema types: int, double, string...
• Weak schema types: list, dictionary
![Page 15: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/15.jpg)
SFrame Design • Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes • Strong schema types: int, double, string...
• Weak schema types: list, dictionary
Pain Point #3: Feature Manipulation Difficult or costly to inspect existing features and create new features. Hard to perform data exploration.
![Page 16: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/16.jpg)
SFrame Design • Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes • Strong schema types: int, double, string...
• Weak schema types: list, dictionary
• Columnar Architecture • Easy feature engineering + Vectorized feature operations.
• Immutable columns + Lazy evaluation • Statistics + visualization + sketches
![Page 17: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/17.jpg)
SFrame Python API Example Make a little SFrame of 1 column and 5 values: sf = gl.SFrame({‘x’:[1,2,3,4,5]}) Normalizes the column x: sf[‘x’] = sf[‘x’] / sf[‘x’].sum() Uses a python lambda to create a new column: sf[‘x-‐squared’] = sf[‘x’].apply(lambda x: x*x) Create a new column using a vectorized operator: sf[‘x-‐cubed’] = sf[‘x-‐squared’] * sf[‘x’] Create a new SFrame taking only 2 of the columns: sf2 = sf[[‘x’,’x-‐squared’]]
![Page 18: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/18.jpg)
SFrame Querying Supports most typical SQL SELECT operations using a Pythonic syntax.
SQL SELECT Book.title AS title, COUNT(*) AS authors
FROM Book JOIN Book_author ON Book.isbn = Book_author.isbn GROUP BY Book.title;
SFrame Python
Book.join(Book_author, on=‘isbn’) .groupby(‘title’, {‘authors’:gl.aggregate.COUNT})
![Page 19: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/19.jpg)
SFrame Columnar Encoding
user movie ra+ng
Ne#lix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed
![Page 20: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/20.jpg)
SFrame Columnar Encoding
user movie ra+ng Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dic+onary Encode • General Purpose LZ4
Ne#lix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed
SFrame File
![Page 21: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/21.jpg)
SFrame Columnar Encoding
user movie ra+ng Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dic+onary Encode • General Purpose LZ4
Ne#lix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed
User à 176 MB 14.2 bits/int
SFrame File
0.02 bits/int Movie à 257 KB 3.8 bits/int Ra+ng à 47 MB
-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ Total à 223MB
![Page 22: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/22.jpg)
SFrame Columnar Encoding
user movie ra+ng Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dic+onary Encode • General Purpose LZ4
Ne#lix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed
User à 176 MB 14.2 bits/int
SFrame File
0.02 bits/int Movie à 257 KB 3.8 bits/int Ra+ng à 47 MB
-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ Total à 223MB
10s
![Page 23: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/23.jpg)
SFrames Distributed
• Distributed Dataflow • Columnar Query Optimizations • Communicate columnar compressed blocks
rather than row tuples.
The choice of distributed or local execuFon is a quesFon of query opFmizaFon.
![Page 24: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/24.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
![Page 25: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/25.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
![Page 26: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/26.jpg)
SGraph
• SFrame backed graph representation. Inherits SFrame properties. • Data types, External Memory, Columnar,
compression, etc.
• Layout optimized for batch external memory computation.
![Page 27: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/27.jpg)
SGraph Layout
1
2
3
4
Vertex SFrames
VerFces ParFFoned into p = 4 SFrames.
![Page 28: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/28.jpg)
SGraph Layout
1
2
3
4
Vertex SFrames
__id Name Address ZipCode 1011 John … 98105 2131 Jack … 98102
VerFces ParFFoned into p = 4 SFrames.
![Page 29: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/29.jpg)
SGraph Layout
1
2
3
4
Vertex SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge SFrames
Edges parFFoned into p^2 = 16 SFrames.
![Page 30: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/30.jpg)
SGraph Layout
1
2
3
4
Vertex SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge SFrames
Edges parFFoned into p^2 = 16 SFrames.
![Page 31: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/31.jpg)
SGraph Layout
1
2
3
4
Vertex SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge SFrames
Edges parFFoned into p^2 = 16 SFrames.
![Page 32: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/32.jpg)
SGraph Layout Vertex SFrames
Edge SFrames
![Page 33: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/33.jpg)
SGraph Layout Vertex SFrames
Edge SFrames
![Page 34: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/34.jpg)
SGraph Layout Vertex SFrames
Edge SFrames
![Page 35: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/35.jpg)
SGraph Layout Vertex SFrames
Edge SFrames
![Page 36: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/36.jpg)
SGraph Layout Vertex SFrames
Edge SFrames
![Page 37: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/37.jpg)
Deep Integration of SFrames and SGraphs • Seamless interaction between graph data
and table data. • Queries can be performed easily across
graph and tables.
![Page 38: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/38.jpg)
Demo
![Page 39: Gl conference2014 scalabledata_yucheng](https://reader033.vdocuments.site/reader033/viewer/2022052900/55626328d8b42aed7d8b4d17/html5/thumbnails/39.jpg)
SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation
User Com.
Title Body
User Disc.
User-first architecture. Built by data scientists,
for data scientists.