From Developer to Data Scientist - Gaines Kergosien

Post on 18-Mar-2018

Page 1: From Developer to Data Scientist - Gaines Kergosien

@GAINESK

@ITCAMPRO #ITCAMP17 Community Conference for IT Professionals

From Developer to Data Scientist

Gaines Kergosien

Executive Director, Music City Code

Associate Director, UBS

Page 2: From Developer to Data Scientist - Gaines Kergosien

Many thanks to our sponsors & partners!

Page 3: From Developer to Data Scientist - Gaines Kergosien

GAINES KERGOSIEN

Leader, Speaker, Problem Solver

Page 4: From Developer to Data Scientist - Gaines Kergosien
Page 5: From Developer to Data Scientist - Gaines Kergosien

• National gap for analytical expertise at 140k+ by 2018. –McKinsey 2011

• Shortage of 100k Data Scientists by 2020. –Gartner 2012

• 90% of clients need expertise, 40% cite lack of talent. –Accenture 2014

• Survey finds 83% of data scientists see a shortage. –CrowdFlower 2016

• “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” –Google’s Chief Economist

• Data Scientist is the #1 job in America for 2016 AND 2017! –Glassdoor

The Demand

Page 6: From Developer to Data Scientist - Gaines Kergosien

The Maturity

http://www.burtchworks.com/files/2016/04/Burtch-Works-Study_DS-2016-final.pdf

Page 7: From Developer to Data Scientist - Gaines Kergosien

The Salary

Page 8: From Developer to Data Scientist - Gaines Kergosien

The Industry

Page 9: From Developer to Data Scientist - Gaines Kergosien

A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.

–WhatIs.com

The Definition

Page 10: From Developer to Data Scientist - Gaines Kergosien

The Definition

Page 11: From Developer to Data Scientist - Gaines Kergosien

The Recipe

Page 12: From Developer to Data Scientist - Gaines Kergosien

• Classification – Is this A or B?

• Anomaly Detection – Is this weird?

• Regression – How much -or- how many?

• Clustering – How is this organized?

• Reinforcement Learning – What should I do next?

The Five Questions
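
As one illustration of the questions above, "Is this weird?" can be sketched with a simple z-score rule. This is a minimal sketch, not the speaker's method: the `find_anomalies` helper, the threshold of 2.0, and the sample readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
anomalies = find_anomalies(readings)
print(anomalies)  # [25.0]
```

A z-score rule assumes roughly bell-shaped data; other anomaly detection techniques may suit skewed or multi-dimensional data better.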

Page 13: From Developer to Data Scientist - Gaines Kergosien

• Educate the business

• Look for problems to solve

• Research new techniques

• Collate data for analysis (ETL)*

• Implement algorithms

• Design big data-capable architecture

• Present insights

The Job

Page 14: From Developer to Data Scientist - Gaines Kergosien

• Big Data

• Fast Data

• Dark Data

• Unstructured Data

• Data Mining

• Data Visualization

• Predictive Analytics

• [Deep] Neural Network

The Buzzwords

Page 15: From Developer to Data Scientist - Gaines Kergosien

Big Data

Volume

Page 16: From Developer to Data Scientist - Gaines Kergosien

The Volume

Page 17: From Developer to Data Scientist - Gaines Kergosien

Big Data

Volume

• Records

• Transactions

• Tables & Files

Variety

• Structured

• Unstructured

• Semi-structured

Page 18: From Developer to Data Scientist - Gaines Kergosien

Unstructured Text

• Books

• Blog Posts

• Comments

• Tweets

• Photos

• Video

• Audio

The Variety

Page 19: From Developer to Data Scientist - Gaines Kergosien

Big Data

Volume

• Records

• Transactions

• Tables & Files

Variety

• Structured

• Unstructured

• Semi-structured

Velocity

• Real Time

• Near Time

• Batch

• Streams

Page 20: From Developer to Data Scientist - Gaines Kergosien

The Velocity

Twitter

• 6,000 tweets per second

• 500 million tweets/day

Facebook

• 300 million photos/day

NY Stock Exchange

• Captures 1 TB of trade information each session
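
The two Twitter figures above are roughly consistent with each other, as a quick back-of-the-envelope check suggests. The sketch assumes the 6,000 tweets per second is a sustained daily average, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Assumption: the per-second rate holds as an average over a whole day.
tweets_per_second = 6_000
tweets_per_day = tweets_per_second * SECONDS_PER_DAY

print(f"{tweets_per_day:,}")  # 518,400,000
```

That is about 518 million tweets per day, in line with the rounded figure of 500 million.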

Page 22: From Developer to Data Scientist - Gaines Kergosien

The Breakdown

Data

• Define

• Collect

• Store

• Explore

Science

• Hypothesis

• Plan Approach

• Analysis

• Report Results

Page 23: From Developer to Data Scientist - Gaines Kergosien

The Skills

Subject Matter Expertise

• Values

• Goals

• Constraints

Statistics

• Choose Procedures

• Diagnose Problems

• Develop Procedures

Hacking Expertise

• Technical Skills

• Creativity

[Venn diagram overlaps: Machine Learning, Traditional Research, Traditional Software, Data Science]

Page 24: From Developer to Data Scientist - Gaines Kergosien

The Skills

[Venn diagram of four skill areas: Subject Matter Expertise, Hacking Expertise, Social Sciences, Statistics. Overlap labels: Machine Learning, Traditional Software, Traditional Research, Holistic Research, Holistic Software, Socially Unaware, Domain Unaware, Data Science]

Page 25: From Developer to Data Scientist - Gaines Kergosien

The Overlap

[Diagram: Data Science and Big Data overlap as Big Data Science. Big Data is characterized by Volume, Variety, and Velocity.]

Page 26: From Developer to Data Scientist - Gaines Kergosien

The Analysis Tools

Page 27: From Developer to Data Scientist - Gaines Kergosien

The Tool Trends

Python

KNIME

RapidMiner

R

SPSS

SAS

Hadoop

Page 28: From Developer to Data Scientist - Gaines Kergosien

• SQL

• Excel

• Python

• R

• MySQL

The Top Tools

Page 29: From Developer to Data Scientist - Gaines Kergosien

• SPSS

• Matlab

• Julia

• Kafka/Storm

• R

• Python

• Java/Scala

• Stata

• SAS

The Languages

http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html

Page 30: From Developer to Data Scientist - Gaines Kergosien

The Languages – SAS, Python or R?

http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

Page 31: From Developer to Data Scientist - Gaines Kergosien

The Languages – Trends

http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

Page 32: From Developer to Data Scientist - Gaines Kergosien

The Languages – Industries

http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

Page 33: From Developer to Data Scientist - Gaines Kergosien

The Languages – Education

Page 34: From Developer to Data Scientist - Gaines Kergosien

The Languages – Trends

Page 35: From Developer to Data Scientist - Gaines Kergosien

The Languages – The Future

Page 36: From Developer to Data Scientist - Gaines Kergosien

• R Statistical Programming Language

• Based on the S programming language

• R Development Environment

• Statistical and Visual Analysis

• Cross-Platform

• Free Open Source

• Active User Community

• Over 9,000 Extension Packages

The R

Page 37: From Developer to Data Scientist - Gaines Kergosien

• Created in 1991 to emphasize productivity and code readability

• Easier learning curve than R

• Free Open Source

• Active User Community

The Python

Page 38: From Developer to Data Scientist - Gaines Kergosien

• Hadoop Distributed File System (HDFS)

• MapReduce vs. YARN

• Pig

• Hive

• HBase

• Storm

• Spark

• etc.

The Hadoop Collective
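
The MapReduce programming model named above can be sketched in a single Python process. This is only an illustration of the map, shuffle, and reduce steps on a word count; it does not use any Hadoop API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input: each string stands in for one file or block.
documents = ["big data big insight", "fast data"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'fast': 1}
```

In a real cluster the map and reduce calls run in parallel on different machines, and the shuffle moves data across the network.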

Page 39: From Developer to Data Scientist - Gaines Kergosien

The Wrangling

Sample the Data

• Random

• Stratified

Reconcile Missing Data

• Discard

• Infer

Normalize Numeric Values

• Standard Unit of Measure

• Subtract Average (Mean = 0)

• Divide by Standard Deviation

Reduce Dimensionality

• Irrelevant Input Variables

• Redundant Input Variables

Add Derivative Values

• Generalize Attributes

• Discretize Attributes to Categories

• Binarize Categorical Attributes

Design Training Data

• Select

• Combine

• Aggregate

Power and Log Transformation

• Approximate Normal Distribution
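
A few of the wrangling steps above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the helper names and sample values are hypothetical, missing values are represented as `None`, and "infer" is implemented as simple mean imputation, which is only one of several possible strategies.

```python
import math
from statistics import mean, pstdev


def impute_mean(values):
    """Infer missing entries (None) as the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Subtract the mean, then divide by the standard deviation.

    Assumes the values are not all identical (non-zero deviation).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Natural log of strictly positive values, to reduce right skew."""
    return [math.log(v) for v in values]


def binarize(categories):
    """One-hot encode a categorical attribute (columns in sorted order)."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


ages = [20, None, 40, 30]              # hypothetical column with a gap
filled = impute_mean(ages)             # [20, 30, 40, 30]
scaled = standardize(filled)           # mean 0, standard deviation 1
logged = log_transform([1.0, 10.0, 100.0])
colors = binarize(["red", "blue", "red"])
print(filled, colors)
```

In practice these steps are usually done with a data frame library, but the arithmetic is the same.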

Page 40: From Developer to Data Scientist - Gaines Kergosien

• basic statistics (e.g., p-value)

• statistical modeling

• statistical tests

• experiment design

• distributions

• maximum likelihood estimators

• probability theory

• linear algebra

• multivariable calculus

The Math

Page 41: From Developer to Data Scientist - Gaines Kergosien

• Tableau (enterprise visualization products) - www.tableau.com

• ggvis (R visualization package) - ggvis.rstudio.com

• ggplot (plotting system) - ggplot.yhathq.com

• D3.js (declarative DOM manipulation) - d3js.org

• Vega (visualization grammar) - trifacta.github.com/vega

• Rickshaw (charting library) - code.shutterstock.com/rickshaw

• Modest Maps (map library) - modestmaps.com

• Chart.js (plotting library) - www.chartjs.org

The Visualization Tools

Page 42: From Developer to Data Scientist - Gaines Kergosien

Concepts

• k-nearest neighbors

• random forests

• ensemble methods

• …use Python libraries!

Tools

• Weka - www.cs.waikato.ac.nz/ml/weka/

The Machine Learning
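
To show what k-nearest neighbors does, here is a from-scratch sketch in plain Python. The slide's advice is to use Python libraries (scikit-learn's `KNeighborsClassifier` is one such option), so treat this only as an illustration of the idea; the `knn_predict` helper and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: (features, label).
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_predict(training, (1.1, 1.0)))  # A
print(knn_predict(training, (5.0, 4.8)))  # B
```

Sorting every training point per query is fine for a toy example; libraries use spatial indexes to make this scale.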

Page 43: From Developer to Data Scientist - Gaines Kergosien

• Report

• Presentation

• Demo

• Prototype

• Component

The Results

Page 44: From Developer to Data Scientist - Gaines Kergosien

• Data Analyst (A)

• Data Engineer (B)

• Academic (Ab)

• Generalist (AB)

The Skills

Page 45: From Developer to Data Scientist - Gaines Kergosien

The Expertise

Page 46: From Developer to Data Scientist - Gaines Kergosien

The Degree

Page 47: From Developer to Data Scientist - Gaines Kergosien

The Degree – Trending

Page 48: From Developer to Data Scientist - Gaines Kergosien
Page 49: From Developer to Data Scientist - Gaines Kergosien

1. Fundamentals

2. Statistics

3. Programming

4. ML

5. Text Mining

6. Visualization

7. Big Data

8. Data Munging

9. Toolbox

The Path

Page 50: From Developer to Data Scientist - Gaines Kergosien

1. Matrices & Linear Algebra

2. Hash Functions, Binary Tree

3. Relational Algebra, DB Basics

4. Inner, Outer, Cross, Theta Join

5. CAP Theorem

6. Tabular Data

7. Data Frames & Series

8. Sharding

9. OLAP

The Fundamentals
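
The join types in item 4 can be illustrated without a database. The sketch below implements an inner join and a left outer join over lists of dictionaries; the table contents and helper names are hypothetical, and a real system would express this in SQL or a data frame library.

```python
def build_index(rows, key):
    """Group rows by join key so each lookup is a dictionary access."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index


def inner_join(left, right, key):
    """Keep only left rows that have at least one match on the right."""
    index = build_index(right, key)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


def left_outer_join(left, right, key):
    """Keep every left row; unmatched right-hand columns become None."""
    index = build_index(right, key)
    right_columns = {col for row in right for col in row if col != key}
    joined = []
    for left_row in left:
        matches = index.get(left_row[key])
        if matches:
            joined.extend({**left_row, **right_row} for right_row in matches)
        else:
            joined.append({**left_row, **dict.fromkeys(right_columns)})
    return joined


# Hypothetical tables.
employees = [
    {"name": "Ana", "dept_id": 1},
    {"name": "Ben", "dept_id": 2},
    {"name": "Cy", "dept_id": 9},  # no matching department
]
departments = [{"dept_id": 1, "dept": "Data"}, {"dept_id": 2, "dept": "Web"}]

inner = inner_join(employees, departments, "dept_id")
outer = left_outer_join(employees, departments, "dept_id")
print(len(inner), len(outer))  # 2 3
```

The only difference between the two results is the unmatched row for "Cy", which the outer join keeps with `dept` set to `None`.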

Page 51: From Developer to Data Scientist - Gaines Kergosien

10. Multidimensional Data Model

11. ETL

12. Reporting vs BI vs Analytics

13. JSON & XML

14. NoSQL

15. Regex

16. Vendor Landscape

17. Environment Setup

The Fundamentals
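
Items 13 and 15 often appear together when cleaning semi-structured data. As a small sketch, the following parses a JSON record and pulls hashtags out of its text with a regular expression; the record itself is made up for illustration.

```python
import json
import re

# Hypothetical JSON record, e.g. one line of an exported feed.
raw = '{"user": "gainesk", "text": "Slides from #ITCAMP17 are up! #datascience"}'

record = json.loads(raw)

# "#" followed by one or more word characters; the group drops the "#".
hashtags = re.findall(r"#(\w+)", record["text"])
print(hashtags)  # ['ITCAMP17', 'datascience']
```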

Page 52: From Developer to Data Scientist - Gaines Kergosien

1. Pick a Dataset

2. Descriptive Statistics

3. Exploratory Data Analysis

4. Histograms

5. Percentiles and Outliers

6. Probability Theory

7. Bayes' Theorem

8. Random Variables

9. Cumul Dist Fn (CDF)

The Statistics
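
Bayes' theorem (item 7) is easiest to appreciate with numbers. The sketch below computes the probability of a condition given a positive test; the prior, sensitivity, and false positive rate are invented values chosen only to illustrate the formula P(A|B) = P(B|A) P(A) / P(B).

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive) from the prior and the two test error rates."""
    # Total probability of a positive result, true or false.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with an accurate test, a positive result here implies only about a 16% chance of the condition, because the condition is rare.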

Page 53: From Developer to Data Scientist - Gaines Kergosien

The Statistics

10. Continuous Distr.

11. Skewness

12. ANOVA

13. Prob Den Fn (PDF)

14. Central Limit Theorem

15. Monte Carlo Method

16. Hypothesis Testing

17. p-Value
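
The Monte Carlo method, hypothesis testing, and the p-value can be tied together in one sketch: a permutation test that estimates how often a random relabelling of the data produces a difference in means at least as large as the observed one. The helper name, the sample measurements, the trial count, and the seed are all assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by simulation."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # random relabelling under the null hypothesis
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials


# Hypothetical measurements from two clearly separated groups.
group_a = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9]
group_b = [10.2, 10.5, 9.9, 10.1, 10.4, 10.0]

p_value = permutation_p_value(group_a, group_b)
print(p_value < 0.05)  # True
```

A small p-value says the observed gap would be rare if the group labels did not matter; it does not by itself say how large or important the effect is.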

Page 54: From Developer to Data Scientist - Gaines Kergosien

1. Python Basics

2. Working in Excel

3. R Setup / RStudio

4. R Basics

5. Expressions

6. Variables

7. Vectors

8. Matrices

9. Arrays

The Programming

Page 55: From Developer to Data Scientist - Gaines Kergosien

10. Factors

11. Lists

12. Data Frames

13. Reading CSV Data

14. Reading Raw Data

15. Subsetting Data

16. Manipulate Data Frames

17. Functions

18. Factor Analysis

19. Install Packages

The Programming
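
The list above is oriented toward R, but reading CSV data and subsetting it (items 13 and 15) look similar in Python. The following sketch uses only the standard library; the column names and rows are hypothetical, and the in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from open("file.csv").
csv_text = """name,language,years
Ana,R,3
Ben,Python,5
Cy,SAS,8
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values arrive as strings, so convert numeric columns explicitly.
for row in rows:
    row["years"] = int(row["years"])

# Subset: keep the names of people with at least five years of experience.
experienced = [row["name"] for row in rows if row["years"] >= 5]
print(experienced)  # ['Ben', 'Cy']
```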

Page 56: From Developer to Data Scientist - Gaines Kergosien
Page 57: From Developer to Data Scientist - Gaines Kergosien

• Coursera - www.coursera.org

• edX - www.edx.org

• Udacity - www.udacity.com

• Kaggle - www.kaggle.com

• YouTube - projects.iq.harvard.edu/stat110/youtube

• Boot Camps

The Training

Page 58: From Developer to Data Scientist - Gaines Kergosien

Q & A

Slides at DotNetDude.net
