1 © Copyright 2012 EMC Corporation. All rights reserved.
OpenChorus: Building a Tool-Chest for Big Data Science
Milind Bhandarkar
Chief Scientist, Machine Learning Platforms
EMC Greenplum
Agenda
! Tools for Data Science
! Data Science Workflow
! Greenplum OpenChorus
! How Chorus Works
Data Science Tools: Abundance of Riches
! Proliferation of tools
! Languages & Libraries
  – R, Matlab, Python
  – SciPy, NLTK, MADlib, Mahout
! Frameworks
  – GraphLab, Pregel (Giraph), Mesos, CEP
! Platforms/Data Stores
  – MPP Databases, Hadoop, NoSQL (HBase, Cassandra, MongoDB), SciDB
Choice of Tool(s)?
! Hammer?
  – Hadoop ought to be sufficient for most tasks
  – “If all you have is a hammer, throw away everything that is not a nail” – Jimmy Lin (http://arxiv.org/abs/1209.2191)
  – Operational complexity and learning curve of extra tools are not worth the efficiency gain
! Tool-Chest?
  – Use the right tool for the right job
  – How to reduce complexity?
Data Science Workload
! http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
! Obtain
! Scrub
! Explore
! Model
! Interpret
Obtain
! Corpus needs to be usable & sufficient
! Possibly from multiple independent sources
! Needs to be automated for streams
! Needs to have efficient ingestion for one-time data
Scrub
! Raw data is always messy
  – Missing data, inconsistent data, charsets(!)
  – NY, New York, NYC, Big Apple, etc.
! Growing Dictionaries
! Join with Crowdsourcing
  – Mechanical Turk, etc.
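The scrub step can be sketched as a lookup against a dictionary that grows over time. This is only an illustration of the idea on the slide, not anything Chorus provides: the alias table, the function names, and the "Gotham" record are all made up, and the `unresolved` list stands in for whatever review queue (for example a crowdsourcing task) a team actually uses.

```python
from typing import Optional

# Hypothetical alias dictionary: maps messy variants to one canonical label.
ALIASES = {
    "ny": "New York",
    "new york": "New York",
    "nyc": "New York",
    "big apple": "New York",
}

def _key(raw: str) -> str:
    """Case-fold and collapse whitespace so near-identical spellings match."""
    return " ".join(raw.lower().split())

def normalize_city(raw: Optional[str], aliases: dict, unresolved: list) -> Optional[str]:
    """Return the canonical label for raw, or None if it is missing or unknown."""
    if raw is None or not raw.strip():
        return None  # missing data stays missing; do not guess
    key = _key(raw)
    if key in aliases:
        return aliases[key]
    unresolved.append(raw)  # queue for human review
    return None

def learn_alias(raw: str, canonical: str, aliases: dict) -> None:
    """Grow the dictionary once a reviewer has labeled a new variant."""
    aliases[_key(raw)] = canonical

unresolved = []
records = ["NY", " new  york ", "Big Apple", "Gotham", None]
cleaned = [normalize_city(r, ALIASES, unresolved) for r in records]

# "Gotham" was unknown, so it sits in `unresolved` until someone labels it.
learn_alias("Gotham", "New York", ALIASES)
```

After the reviewer's label is added, the same record resolves on the next pass, which is the sense in which the dictionary is "growing".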
Explore
! Visualization, clustering, dimensionality reduction
  – Feature correlations (scatter plots)
  – Single feature histograms
! Challenge: How not to lose these observations
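As a rough sketch of the two exploration aids named above, the following computes a single-feature histogram and a pairwise feature correlation with the standard library only, then stores each observation next to the evidence that produced it. The sample data, the function names, and the `observations` structure are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def histogram(values, bin_width):
    """Count values per bin; each key is the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up sample: user age and weekly visits.
ages = [21, 25, 32, 38, 44, 47, 53, 59]
visits = [9, 8, 7, 6, 4, 4, 2, 1]

age_hist = histogram(ages, bin_width=10)
r = pearson(ages, visits)

# Keep each observation together with its evidence so it is not lost later.
observations = [
    {"feature": "age", "note": "counts per decade", "evidence": dict(age_hist)},
    {"feature": ("age", "visits"), "note": "Pearson r", "evidence": round(r, 3)},
]
```

Recording the note and the number together is one simple answer to the challenge on the slide; a shared workspace would play the same role for a team.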
Model
! Find correlation of past data and outcome
  – Find good training set
  – Label the training set
  – Derive model parameters
  – Apply model, and validate
! Ensemble Models: Model of models
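The modeling steps above can be sketched end to end with a deliberately tiny model. This is an assumption-laden toy, not a recommended method: each "model" is a one-feature threshold (a decision stump) whose single parameter is derived from the labeled training set, and the ensemble is a majority vote over the stumps. All data and names are invented.

```python
def fit_stump(rows, labels, feature):
    """Derive one parameter: the threshold midway between the class means."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return {
        "feature": feature,
        "threshold": (mean_pos + mean_neg) / 2,
        "positive_above": mean_pos > mean_neg,
    }

def predict_stump(model, row):
    """Predict 1 when the row falls on the positive side of the threshold."""
    above = row[model["feature"]] > model["threshold"]
    return int(above == model["positive_above"])

def predict_ensemble(models, row):
    """Model of models: majority vote over the individual stumps."""
    votes = sum(predict_stump(m, row) for m in models)
    return int(votes * 2 > len(models))

def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

# Made-up labeled data: (visits, pages_per_visit, days_since_signup) -> purchased?
train_rows = [(9, 5, 30), (8, 6, 45), (7, 4, 20), (2, 1, 300), (1, 2, 250), (3, 1, 400)]
train_labels = [1, 1, 1, 0, 0, 0]
valid_rows = [(8, 5, 40), (2, 2, 350), (6, 4, 60), (1, 1, 200)]
valid_labels = [1, 0, 1, 0]

# Derive parameters on the training set, then validate on held-out rows.
models = [fit_stump(train_rows, train_labels, f) for f in range(3)]
valid_accuracy = accuracy(lambda r: predict_ensemble(models, r), valid_rows, valid_labels)
```

Validation uses rows the stumps never saw; scoring on the training rows alone would overstate how well the model generalizes.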
Interpret
! Models are built for prediction and interpretation
! Check that there are no “surprises”
! Reason about models
! Improve models
Data Science Data Flow
! Raw Data (Timed, Partitioned, Crowdsourced, De-duped etc)
! Derived data (simple aggregates, other statistics)
! Models (Feature weights, decision trees)
! Indexes
Data Diversity
! Natural Language Text, and Annotations
! (Bags of words) : Concept
! Graphs (sparse matrices)
! Dense Matrices
! Location (proximity)
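To make the differences between these data shapes concrete, here is a small sketch of a bag of words, a graph held as a sparse structure, and the same graph as a dense matrix. The tokenizer is intentionally naive and every name and value is illustrative only.

```python
from collections import Counter

def bag_of_words(text):
    """Token counts, ignoring case and surrounding punctuation."""
    tokens = (t.strip(".,!?;:").lower() for t in text.split())
    return Counter(t for t in tokens if t)

# A graph stored sparsely: only existing edges, as {node: {neighbor: weight}}.
edges = [("ann", "bob", 1.0), ("bob", "cy", 2.0)]
graph = {}
for src, dst, w in edges:
    graph.setdefault(src, {})[dst] = w

# The same graph as a dense matrix stores every possible pair, mostly zeros.
nodes = sorted({n for s, d, _ in edges for n in (s, d)})
dense = [[graph.get(r, {}).get(c, 0.0) for c in nodes] for r in nodes]

bow = bag_of_words("Big data, big tools.")
```

The sparse form holds two entries where the dense form holds nine, which is why graphs and dense matrices tend to call for different storage and tools.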
Too Many Tools for One Data Science Project
Data Exploration and Sharing
• Data marts
• Excel spreadsheets
• Flat files
Analytics Tools
• Data mining
• BI/Visualization
• Data integration
Communications
• Emails
• Meetings
• IMs
Content and File Management
• Share drives
• Wiki
• Content mgmt app
Process Documentation
• Some project plans, no up-to-date team collaborative documentation
High Cost of Knowledge Sharing
! Data science process breaks when organization structure changes
! Very difficult knowledge transfer
! No “insurance policy” for the data science intellectual assets
Delayed Time-to-Market
1. Find the data
2. Get access to data
3. Learn about the data
4. Move to sandbox
5. Analysis (finally!)
6. Operationalize the model
6-9 Months
Greenplum Chorus
! Collaborative analytics
! Powerful extensibility
! The freedom of open source
Greenplum’s Social Platform for Collaborative Data Science
Chorus Enables Collaborative Data Science
! Collaborate within projects, share data, content, and findings across teams
! Make projects more transparent
! Iterate faster for accelerated insights with real-time social collaboration
[Diagram labels: Self-Service Provisioning; Data Discovery & Exploration; Analysis & Modeling; Publish & Share; Collaboration]
Powerful Extensibility
! Integrated development environment for analytics
! Expand insights with simple access to third-party data
! Fusion with leading analytics and visualization tools
The Freedom of Open Source
www.openchorus.org
! Modify and extend to any environment
! Promotes an ecosystem of applications, startups, and a community of data scientists
How Chorus Works
How Chorus Works
[Diagram labels: Chorus Workspace (Activities & Insights, Data, Work Files); Greenplum DB Sandbox; Chorus View; Source Data (EDW, DB, GPDB, Non-GPDB, GPDB External Table)]
Summary of next few slides, with animation built in
Data Exploration: Search and Data Discovery
! Automatic indexing of meta-data, work files, comments, and insights
! Quickly find data across the enterprise regardless of location
Data Exploration: Data Preview and Visualization
! Data preview for instant understanding
! Quick and easy data visualizations
  – Visualize data for faster insight into datasets
  – No need to export to third-party applications like R
  – Not a replacement for advanced visualization tools
Data Exploration: Living Data Dictionary
! Bring everything about the data to the data
  – Attach documents
  – Ask questions
  – Add comments
! Build a living data dictionary
  – Everything is current
  – No more spreadsheets
Workspace – Streamlines Collaboration
! Chorus includes unlimited workspaces, each representing an individual project
! Streamlines complex user-user and user-data interactions
Multi-level Secure Collaboration
! Authentication
  – Integrates with LDAP and AD for password management
! Application access control
  – User roles: Admin vs. general user
  – Workspace types: Public or private
! Data access control
  – Chorus enforces database rules and permissions
Data – Dataset Types
1. Source Dataset
  – Pointer to the source data
  – Both internal and external data
  – Native connectivity for both GPDB and flat files
  – Use GPDB External Tables for Non-GPDB databases and Hadoop
2. Sandbox Dataset
  – Copy of the source data to be used for analytics
  – Data generated from analytics
Data – Sandbox
! Container of all the analytics data
! Ease of self-service provisioning of sandboxes
  – Free up IT bandwidth
  – Minimize data proliferation to uncontrolled/unmanaged data marts
Data – Populating Sandbox
Import data easily from anywhere:
! Directly from sources
! Through Chorus View
! Flat file import
[Diagram labels: Source Data (Non-GPDB, EDW, GPDB, GPDB External Table); Chorus View; Greenplum DB Sandbox; Chorus Workspace]
Data – Chorus View Utility
! Single-view GUI utility for exploring, filtering, aggregating, and moving the desired data from sources to sandbox
! Data exploration and visualization prior to bringing the data into sandbox
! Derive variations of the basic source datasets without bringing the data into sandbox
Data – Chorus View
SELECT a.userid, a.customer_name, a.gender, a.customer_state, b.ipaddress, b.device
FROM customers AS a
INNER JOIN weblog_2012q1 AS b ON a.userid = b.userid
[Diagram labels: Source Dataset (EDW, GPDB, GPDB External Table); Chorus View; Sandbox Dataset]
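The join on this slide can be tried locally with a few made-up rows. In the sketch below SQLite stands in for Greenplum DB, a plain SQL view stands in for a Chorus View, and a `CREATE TABLE ... AS` stands in for importing the view into the sandbox. Those substitutions, the sample rows, and the names `customer_weblog` and `sandbox_customer_weblog` are assumptions for illustration; how Chorus implements these steps is not stated in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Greenplum DB here
conn.executescript("""
    CREATE TABLE customers (
        userid INTEGER, customer_name TEXT, gender TEXT, customer_state TEXT
    );
    CREATE TABLE weblog_2012q1 (userid INTEGER, ipaddress TEXT, device TEXT);

    INSERT INTO customers VALUES
        (1, 'Ada', 'F', 'NY'), (2, 'Bob', 'M', 'CA'), (3, 'Cy', 'M', 'TX');
    INSERT INTO weblog_2012q1 VALUES
        (1, '10.0.0.1', 'phone'), (1, '10.0.0.2', 'laptop'), (2, '10.0.0.3', 'tablet');

    -- The slide's query saved as a view (hypothetical name): no data is copied yet.
    CREATE VIEW customer_weblog AS
    SELECT a.userid, a.customer_name, a.gender, a.customer_state,
           b.ipaddress, b.device
    FROM customers AS a
    INNER JOIN weblog_2012q1 AS b ON a.userid = b.userid;

    -- Materializing the view plays the role of importing it into the sandbox.
    CREATE TABLE sandbox_customer_weblog AS SELECT * FROM customer_weblog;
""")

rows = conn.execute(
    "SELECT userid, customer_name, device "
    "FROM sandbox_customer_weblog ORDER BY userid, device"
).fetchall()
```

Because the join is an inner join, the customer with no web log entries does not appear in the sandbox copy.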
Data – Automated Data Services
! Subscribe to receive automatic updates
– Schedule imports from multiple data sources
– Define and share data sets within the data science team
– Removes manual data refresh activities
Work Files
! Work files are non-data assets
  – SQL query statements with code editor interface
  – Execution of in-database analytics, e.g. MADlib, PL/R
  – Third-party tool files
  – PowerPoint, Word doc, etc.
! Analytics asset management: version, compare, and archive work files
Integration with Analytics Tools
! Third-party tools
  – Execute in-database analytics functions (e.g. MADlib, R) from Chorus work files
  – Publish and execute Alpine Miner Workflow from Chorus native interface
  – Data preparation for analysis using SAS and other analytics tools
! Code-design UI for SQL
Insight and Data Sharing
! Post comments and ask questions on any analytics artifacts
! Share and publish any activities or insights
! Promote fast iteration on data and ideas
Activities and Insights
! Build a living library of activities and insights
– Define, publish, and share new insights
– Discover and learn from existing insights
! Iterate faster, model less