1 © Copyright 2013 EMC Corporation. All rights reserved.
Building Data Science Teams
David Dietrich, Advisory Technical Education Consultant, EMC Education Services
@imdaviddietrich
Boston Data Scientist Meetup, Leading Analytics Series
September 3, 2013
Quick Profile
• Architect of EMC’s Data Science curriculum
• Co-author of 3 courses on Big Data and Data Science (with FOSS)
• Filed 9 patents on data science, data privacy, and cloud computing
• Advisor to universities on analytic programs (Babson, Harvard…)
Twitter: @imdaviddietrich
Today’s Discussion Questions
• When do you need to build a team? How do you know? What are the needs? What kind of people do you need on the team?
• Organizational model? Where should the data science team live within an organization?
• Is it always necessary to build a team? When to build, buy or partner?
• Who are the sponsors in large companies?
Creating Reports, Dashboards, and Databases… Do You Need A Data Science Team For This?
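[Bar chart: Registrant and Offered counts by region for five courses; vertical axis from 0 to 16]
Courses: NetWorker ICA, RP Sys Deploy, Symm Config Mgmt, vSphere ICM, VNX US Impl
Total Registrant: 3, 10, 1, 16, 6
Total Offered: 2, 7, 1, 2, 1
Regional rows (values in order; column positions not preserved):
Registrant - Americas: 3, 4, 12
Offered - Americas: 2, 3, 1
Registrant - APJ: 2, 6
Offered - APJ: 2, 1
Registrant - EMEA: 2, 4
Offered - EMEA: 1, 1
Registrant - Unprovided: 2, 1
Offered - Unprovided: 1, 1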
Key:
- EMC NetWorker Installation, Configuration and Administration
- RecoverPoint System Deployment
- Symmetrix Configuration Management
- VMware vSphere: Install, Configure, Manage [V5.1]
- VNX Unified Storage Implementation
How About For This? Creating a Map of the Internet
Example: Output From a Data Science Team Mapping The Spread of Innovation Ideas Using Social Graphs
Big Data Requires New Approaches to Analytics
Business Intelligence Versus Data Science
[Diagram: Business Intelligence and Data Science positioned on two axes, Time (Past to Future) and Analytical Approach (Explanatory to Exploratory)]

Business Intelligence
Typical Techniques and Data Types
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable data sets
Common Questions
• What happened last quarter?
• How many did we sell?
• Where is the problem? In which situations?

Predictive Analytics and Data Mining (Data Science)
Typical Techniques and Data Types
• Optimization, predictive modeling, forecasting, statistical analysis
• Structured/unstructured data, many types of sources, very large data sets
Common Questions
• What if…?
• What’s the optimal scenario for our business?
• What will happen next? What if these trends continue?
• Why is this happening?
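To make the contrast on this slide concrete, here is a minimal sketch in Python. It is not from the talk: the sales figures, the function names, and the choice of a simple least-squares trend line are all illustrative assumptions. The first function answers a BI-style question ("What happened last quarter?"); the second answers a data-science-style question ("What will happen next if these trends continue?").

```python
# Hypothetical quarterly unit sales, oldest first (illustrative numbers only).
quarterly_sales = [100, 110, 120, 130]


def last_quarter_total(sales):
    """BI-style question: report what happened in the most recent quarter."""
    return sales[-1]


def linear_trend_forecast(sales):
    """Data-science-style question: what happens next if the trend continues?

    Fits an ordinary least-squares line through (quarter index, sales)
    and extrapolates it one quarter past the end of the data.
    """
    n = len(sales)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sales) / n
    # Slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


print("Last quarter:", last_quarter_total(quarterly_sales))
print("Next quarter (trend forecast):", linear_trend_forecast(quarterly_sales))
```

A straight-line trend is only the simplest possible stand-in for the "predictive modeling, forecasting" techniques the slide lists; a real project would choose a model suited to the data.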
“Companies Are Always Looking To Reinvent Themselves….But It’s A Mistake To Treat Data Science Teams Like Any Old Product Group.
To Build Teams That Create Great Data Products, You Have To Find People With The Skills And The Curiosity To Ask The Big Questions.”
- DJ Patil, Data Scientist in Residence at Greylock Partners
Framework for Developing Data Science Teams
Data Science Team
Data Scientist
BI Analyst
Project Sponsor
Project Manager
Business User
Data Engineer
DBA
Data Science Teams
Data Scientist: An Emerging Career
SPOTLIGHT ON BIG DATA
by Thomas H. Davenport and D.J. Patil
Comparing Two Data Analysts
Two analysts at ACME Healthcare: John and Janet

Traditional BI Analyst
Sample Tasks
• Report Regional Sales For Last Quarter
• Perform Customer Feedback Surveys
• Identify Average Cost Per Supplier

Data Scientist
Sample Tasks
• Predict Regional Sales For Next Quarter
• Discover Customer Opinions Via Social Media
• Identify Ways to Maximize Sales Campaign ROI
Skills Matrix, Based on Recent Students
[Chart: groups positioned by Technical Ability and Quantitative Skills]
• Recent STEM Grads
• Business Intelligence Professionals, IT
• Quantitative Analysts, Statisticians, Business and data analysts
• Data Scientists
Profile of a Data Scientist
• Curious & Creative
• Technical
• Quantitative
• Communicative & Collaborative
• Skeptical
Interpreting the Resume of a Senior Data Scientist
Data Scientist Job Description

Responsibilities:
• Work with business owners to map business requirements into technical solutions
• Analyze and extract relevant information from large amounts of data to help identify key revenue-driven features
• Perform ad-hoc statistical and data mining analyses
• Design and implement scalable and repeatable solutions, and establish scalable, efficient, automated processes for large scale data analyses
• Work closely with the software engineering team to drive new feature creation
• Design multi-factor experiments and validate hypothesis

Qualifications:
• A proven passion for generating insights from data, with a strong familiarity with the higher-level trends in data growth, open-source platforms, and public data sets
• Experience with statistical languages and packages, including R, S-Plus, SAS and Matlab, and/or Mahout
• Experience working with relational databases and/or distributed computing platforms, and their query interfaces, such as SQL, MapReduce, Hadoop, Cassandra, PIG, and Hive
• Strong communication skills, with ability to communicate at all levels of the organization
• Masters/PhD degree in mathematics, statistics, computer science or a similar quantitative field
• Experience in designing and implementing scalable data mining solutions
• Preferably experience with additional programming languages, including Python, Java, and C/C++
• Ability to travel as-needed to meet with customers

Sample Data Scientist Resume

John Smith [email protected]

Skills
R, SAS, Java, data mining, statistics, ontology, bioinformatics, human-computer interaction, research

Experience
2009-Present, Senior Data Scientist, ABC Analytics
2007-2009, Founder & CEO, Genome
Genome specializes in consumer health information. The main product is InherithHealth, a tool for acquisition of family medical histories that provides familial disease risk assessment.
2005-2007, Knowledge Engineer, ScienceExperts.com
Managed technical outsourcing efforts. Developed criterion and evaluated engineering outsourcing agencies and individuals …
2004-2006, Research Scientist, University of Washington
Developed rigorous statistical and computational models for addressing primary shortcomings of observational data analysis in the context of disease risk and drug response.
2000-2004, Research Developer, Nat’l Inst. of Standards and Technology
Designed and implemented prototypes. Evaluated tools for representing rules of autonomous on-road navigation.

Education
Ph.D, Biomedical Informatics, University of Washington, 2011
Dissertation: Detection of Protein–protein Interaction in Living Cells by Flow Cytometry
BS, Computer Science, University of Texas at Austin, 2004

Callouts: Statistics, Data Mining, Programming, Advanced STEM Degrees
Successful Analytic Projects Require Breadth of Roles
• Business User
• Project Sponsor
• Project Manager
• Business Intelligence Analyst
• Data Engineer
• Database Administrator (DBA)
• Data Scientist
Break 1
Discussion Questions for the Break
• Is it always necessary to build a team?
• When to build, buy or partner?
Framework for Developing Data Science Teams
Data Science Team
Data Scientist
BI Analyst
Project Sponsor
Project Manager
Business User
Data Engineer
DBA
Developing Data Science Capabilities: Transforming, Creating, As-a-Service, Crowdsourcing
Developing Data Science Capabilities
Four Approaches to Developing Data Science Capabilities
• Transforming
• Creating
• As a Service
• Crowdsourcing
Approaches to Developing Data Science Capabilities: Transforming Teams
• Industries Requiring Deep Domain Knowledge (Such As Genetics And DNA Sequencing)
• Established Companies Who Wish To Introduce Data Science Into Their Business
• Companies Who Wish To Enrich The In-house Skill Sets
Transforming And Realignment With Minimal Change To The Current Organizational Structure
Approaches to Developing Data Science Capabilities: Creating Teams
• Start-up Companies
• Companies Who Wish To …
– Increase Their Focus On Data Analytics
– Start New Data Science Projects
• Companies Where Data Is The Product
• Deep Domain Knowledge Is Less Critical For The Analytics
Developing A New Team From Scratch
Approaches to Developing Data Science Capabilities: Data Science as a Service
• When To Engage DSaaS Providers
– Prefer Not To Change Existing Organizational Structure
– When Creating Or Transforming Are Not Viable Options
• Consider Service-level Agreements (SLAs) When Determining Whether To Engage Internal Resources Or External Providers
Engaging Data Science as a Service (DSaaS)
Approaches to Developing Data Science Capabilities: Crowdsourcing Data Science
• When To Crowdsource
– The Problem Is “Open” In Nature
– Willing To Accept Opinions From Distributed And Diverse Groups Of People
– There’s A Back-up Plan In Case Of “Crowd Failures”
• Examples: Wikipedia, Netflix’s $1,000,000 Prize
Outsource Data Science Project To Distributed Groups Of People
Approaches to Developing Data Science Capabilities: Crowdsourcing Data Science (Cont’d)
• Different Crowdsourcing Models
– Wisdom Of Crowds
– Swarm Creativity (Collective Intelligence)
• Crowdsourcing Platforms
– Kaggle.com, Innocentive.com
– Amazon Mechanical Turk
• Crowd Failures: When The Turnout Of Crowdsourcing Is Unsatisfactory
Outsource Data Science Projects To Distributed Groups Of People
Benefits and Drawbacks of the Four Approaches
Transforming
Benefits:
• Strong Domain Knowledge
• Knowledge of Business Processes
• New Talent Raises Level of Team Performance
• Gradually Increases the Quality of Service
Drawbacks:
• Risk of Homogeneous Thinking
• May Struggle With Quality of Service
• Some Team Members May Resist Change

Creating
Benefits:
• Control Over Skill-sets
• More Flexibility
• High Quality of Service
Drawbacks:
• Hiring and Knowledge Transfer Are Time-consuming
• Time Required to Find and Hire Right Team Members

DSaaS
Benefits:
• Able to Scale on Demand
• May Get Better Service Levels Than In-house
• Learn From Outside Experts
Drawbacks:
• Provider May Not Understand Company’s Unique Processes
• Difficult to Bring Expertise Back In-house
• Decreasing Quality of Service Over Time

Crowdsourcing
Benefits:
• Leverage Wisdom of the Crowds
• Diverse Perspectives
• Lower Cost
• Fast Results
Drawbacks:
• No SLA; Value Not Guaranteed
• Difficult to Design the “Open” Problem
• Difficult For Domain Intensive Tasks
• Crowd Failure May Happen (Adds Cost)
Break 2
Discussion Questions for the Break
• Organizational model?
– Where should the data science team live within an organization?
– What are some options?
• Who are the sponsors in large companies?
Framework for Developing Data Science Teams
Data Science Team
Data Scientist
BI Analyst
Project Sponsor
Project Manager
Business User
Data Engineer
DBA
Developing Data Science Capabilities: Transforming, Creating, As-a-Service, Crowdsourcing
Organizational Model: Centralized, Decentralized, Hybrid
Organizational Model
Organizational Models for Data Science Teams
[Diagrams: a DS Team and business units (BU) arranged according to each model]
• Centralized: The Data Science Team Functions As A Hub And Spoke Model, In Which They Are A Central Provider Of Analytics To Multiple Business Units
• Decentralized: Each Business Unit Has Its Own Data Science Capabilities
• Hybrid: There Is A Centralized Data Science Team, But Business Units Also Have Data Science Capabilities
Regardless Of Which Approach, They All Need Executive Sponsorship To Succeed
Framework for Developing Data Science Teams
Executive Engagement: Data-driven CEO, Chief Data Officer
Data Science Team
Data Scientist
BI Analyst
Project Sponsor
Project Manager
Business User
Data Engineer
DBA
Developing Data Science Capabilities: Transforming, Creating, As-a-Service, Crowdsourcing
Organizational Model: Centralized, Decentralized, Hybrid
Executive Engagement
Analytics Requires Executive Level Engagement
Executive Boardroom
CEO
“Executive Sponsorship Is So Vital To Analytical Competition…” -- Tom Davenport (Competing on Analytics)
• Chief Finance Officer: Use Time Series Analysis Over Historical Data to Predict KPIs to Project Earnings
• Chief Security Officer: Collect and Mine Log Data Within and Outside of the Company to Detect Unknown Threats
• Chief Operating Officer: Mine Customer Opinions and Competitor Behaviors to Predict Inventory Demands
• Chief Strategy Officer: Simulate Outcomes for Acquiring Our Top 3 Competitors
• Chief Product Officer: Conduct Social Media Analyses to Identify Customer Opinions
• Chief Marketing Officer: Conduct Behavior Analyses to Predict If Customers Are Going to Churn
Executive Engagement: Data-Driven CEO
Key Focus Areas of a Data-driven CEO:
• Strategic Data Planning
• Analytic Understanding
• Technology Awareness
Procter & Gamble Business Sphere
“… If Your Organization Can Arrange It … Have Someone In A Key Operational Role -- Business Unit Head, Chief Operations Officer, Even CEO -- To Be An Enthusiastic Advocate Of Matters Quantitative.”
-- Tom Davenport (HBR Blog Network)
Executive Engagement: Chief Data Officer (CDO)
• Promote Data-driven Decision Making To Support Company’s Key Initiatives
• Ensure The Company Collects The Right Data
• Oversee And Drive Analytics Company-wide
“… It's Time For Corporations To Embrace A New Functional Member Of The C-suite: The Chief Data Officer (CDO).”
-- Anthony Goldbloom and Merav Bloch, Kaggle
25% of organizations will have a Chief Data Officer by 2015.
-- Gartner Blog Network
Executive Boardroom
Executive-level Advisor On Data Analytics
EMC Courses on Data Science & Big Data Analytics
90 min
1 day
5 days
Aspiring Data Scientists
Business Leaders
Heads of Data Science Teams
Data Science and Big Data Analytics
Data Science and Big Data Analytics for Business Transformation
Introducing Data Science and Big Data Analytics for Business Transformation
New
New
Closing Thoughts….
Now You Know How To Develop Data Science Teams…What Next?
• Determine How You Would Like To Develop Data Science Capabilities
• Hire People To Fill Out Your Data Science Team
• Consider Which Organizational Model Will Work Best For Your Situation
• Assess How Much Executive Engagement You Have Or Need
• Map Out Potential Projects -- Balance Quick Wins With Longer-term Wins
SPOTLIGHT ON BIG DATA
Questions?
Twitter: @imdaviddietrich
Additional Resources:
1. EMC Education Services curriculum on Data Science and Big Data Analytics for Business Transformation: http://education.emc.com/guest/campaign/data_science.aspx
2. My Blog on Data Science & Big Data Analytics: http://infocus.emc.com/author/david_dietrich/
3. Blog on applying Data Analytics Lifecycle to measuring innovation data: http://stevetodd.typepad.com/my_weblog/data-science-and-big-data-curriculum/