
Demystifying Data Science

Alyson Wilson, Ph.D., PStat

Department of Statistics
Laboratory for Analytic Sciences
North Carolina State University

[email protected]

March 22, 2018

A. Wilson (NCSU Statistics) Demystifying Data Science March 22, 2018 1 / 33

Page 2: Demystifying Data Science · 2018-03-20 · Demystifying Data Science Alyson Wilson, Ph.D., PStat Department of Statistics Laboratory for Analytic Sciences North Carolina State University

Objectives

• Lots of “Data” Definitions

• Classes of Algorithms (with descriptions)


It’s All About Data

• Big data

• Data engineering

• Data science

• Data analytics

• Data mining


Data Engineering

Data engineers design, build, and manage the infrastructure to support data collection, storage, and analysis.

One of the key functions is managing extract, transform, load (ETL).

• Extract data from a variety of sources.

• Transform to the proper format or structure to support querying and analysis.

• Load the data into the final database, for example, a data store, data mart, or data warehouse.
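The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the column names, and the `people` table are hypothetical, and an in-memory SQLite database stands in for the "final database."

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a CSV export with inconsistent formatting.
RAW_CSV = """name,height_in,weight_lb
 alice ,66,135
BOB,70,165
carol,,155
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert types, drop rows missing a height."""
    clean = []
    for row in rows:
        if not row["height_in"]:
            continue  # skip incomplete records
        clean.append(
            (row["name"].strip().title(), int(row["height_in"]), int(row["weight_lb"]))
        )
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE people (name TEXT, height_in INTEGER, weight_lb INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data store or warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT name, height_in, weight_lb FROM people ORDER BY name"
).fetchall()
```

Real pipelines differ mainly in scale and in the variety of sources; the extract/transform/load separation is the part that carries over.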


Is “Big Data” the same as “Data Science”?


Five Data Science Skills

• Technical

  • Algorithmic/computational/predictive and data/statistical/inferential methods

  • Mathematics (particularly modeling and linear algebra)

  • Obtaining, wrangling, curating, managing and processing data, exploring data

• Communication

• Collaboration

• Tools

• Subject Matter Expertise

Adapted from Michael Rappa, NCSU Institute for Advanced Analytics and Curriculum Guidelines for Undergraduate Programs in Data Science, American Statistical Association


Data Product

A data product is the production output from a statistical analysis. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference.

• interactive analytics (e.g., R Shiny)

• packages of analysis tools (e.g., an R package)

• interactive graphics

The idea is to use technology to tell a story about data to a broad audience.


Data Science is a Team Sport


A Few Words about Data Wrangling (Munging)

Data is messy. Much (but not all) of what falls in data wrangling is what we have traditionally called data cleaning.

• Data scraping is the use of a program to extract data from human-readable sources. Think web sites, online datasets, etc.

• There is an emerging standard for data wrangling embodied in the R package dplyr.

  • select() take a subset of columns/features/variables

  • filter() take a subset of rows/observations

  • mutate() add or modify existing columns

  • arrange() sort rows

  • summarize() aggregate data across rows

• See also tidyr, an R package which describes a way to think about storing and formatting data. http://vita.had.co.nz/papers/tidy-data.html
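To make the five verbs concrete, here is a rough sketch of what each one does, written in plain Python rather than dplyr itself. The records reuse the small Gender/Height/Weight table from this talk; the `lb_per_in` column and the height cutoff are hypothetical choices made only for illustration.

```python
from statistics import mean

# Records from the Gender/Height/Weight example table in this talk.
records = [
    {"gender": "F", "height": 66, "weight": 135},
    {"gender": "M", "height": 70, "weight": 165},
    {"gender": "F", "height": 70, "weight": 155},
    {"gender": "M", "height": 72, "weight": 200},
    {"gender": "F", "height": 62, "weight": 140},
]

# select(): keep a subset of columns
selected = [{"gender": r["gender"], "weight": r["weight"]} for r in records]

# filter(): keep a subset of rows (hypothetical cutoff of 70 inches)
tall = [r for r in records if r["height"] >= 70]

# mutate(): add a derived column (hypothetical: pounds per inch of height)
mutated = [{**r, "lb_per_in": r["weight"] / r["height"]} for r in records]

# arrange(): sort rows by a column
arranged = sorted(records, key=lambda r: r["height"])

# summarize(): aggregate across rows, here mean weight within each gender
summary = {
    g: mean(r["weight"] for r in records if r["gender"] == g)
    for g in sorted({r["gender"] for r in records})
}
```

In R the same steps would be chained with dplyr; the point here is only what each verb does to the rows and columns.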


Data Analytics

Data analytics are essentially statistics with a lower-case “s”. Analytics are computations that one makes with data to answer questions. They are often described by data type, by application area, or by method class.

• Geospatial analytics

• Text analytics

• Network analytics

• Forecasting = “time series analytics”

• Business analytics

• Visualization

• Neural nets

• Deep learning


Data Mining

“Statistics at scale and speed” (and simplicity)

D. Pregibon (1999). 2001: A statistical odyssey. Invited talk at The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, NY.


More Definitions

• Machine learning: focused on prediction, based on known properties learned from the training data.

• Data mining: focused on the discovery of (previously) unknown properties in the data.

Data mining + machine learning are currently being rebranded as artificial intelligence.


Algorithms

• Supervised learning: Providing an algorithm with labeled records, in which an output variable of interest is known, so that the algorithm learns how to predict this value for new records where the output is unknown.

• Unsupervised learning: Providing an algorithm with unlabeled records; the goal is to draw inferences from the input data alone.


Prediction and Classification

• Prediction: supervised learning when the response is a continuous variable

• Classification: supervised learning when the response is a categorical variable

The goal is to predict the value of the response using the explanatory variables.


Data

Gender  Height  Weight
F       66      135
M       70      165
F       70      155
M       72      200
F       62      140

Response? Explanatory?


Predictive Analytics

Predictive analytics is a general term encompassing classification and prediction (and sometimes association analysis).

Common Algorithms:

• K-Nearest Neighbors

• Linear Regression

• Logistic Regression

• Classification Trees

• Regression Trees

• Neural Networks


Explanatory v Predictive Models

• The typical explanatory model is for the “small data case” (classical statistics); a typical predictive model is for the “large data case” (data mining).

• A good explanatory model fits the data closely; a good predictive model predicts new cases accurately.

• In explanatory models, the whole dataset is used for estimating the coefficients and picking the “best model.” Performance measures assess how well the model fits the data.

• In predictive models, the training data is used to estimate the model, and the validation data set is used to assess performance (more in a minute). Performance measures assess predictive accuracy.

We are focused on predictive models.


Assessing Performance

When we are choosing a predictive analytic (method, model, algorithm),we typically divide (partition) our data into three parts.

• Training Partition: Usually the largest, used to build the model(s).

• Validation (Test) Partition: Assess performance of each model.

• Test (Holdout, Evaluation) Partition: Assess the performance of the chosen model with new data.
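A three-way partition might be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions for illustration; the slides do not prescribe particular fractions.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records, then split into training, validation, and holdout parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    holdout = shuffled[n_train + n_valid:]  # whatever remains is the test partition
    return train, valid, holdout


train, valid, holdout = partition(range(100))
```

Each candidate model is fit on `train` and compared on `valid`; only the chosen model is scored on `holdout`, so that final number reflects data the model selection never saw.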


K-Nearest Neighbors

• Nonparametric technique: no assumption about the form of the relationship between the response and explanatory variables

• Can be used either for classification or prediction

• Idea: classify/predict a new record by finding “similar” records in the training data. These “neighbors” are used to derive a classification/prediction by voting (classification) or averaging (prediction)


K-Nearest Neighbors

Algorithm:

• Compute the distance from each record to each other record, using only the explanatory variables

• Using the k nearest neighbors, classify each record into the category that has the most of the k neighbors

• OR, using the k nearest neighbors, predict the response of each record as the average response of the k nearest neighbors

Implementation:

• Choose k

• Normalize data

• Choose distance metric
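The algorithm and the three implementation choices can be sketched together. This is a minimal version under stated assumptions: z-score normalization, Euclidean distance, and k = 3 are choices made here for illustration, and the training records are the Gender/Height/Weight table from this talk.

```python
import math
from collections import Counter


def make_scaler(rows):
    """Build a function that rescales each variable to mean 0, std. dev. 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]

    def scale(row):
        return tuple((v - m) / s for v, m, s in zip(row, means, sds))

    return scale


def k_nearest(train_x, new_x, k):
    """Indices of the k training records closest to new_x (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], new_x))
    return order[:k]


def knn_classify(train_x, train_y, new_x, k):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_x, new_x, k))
    return votes.most_common(1)[0][0]


def knn_predict(train_x, train_y, new_x, k):
    """Prediction: average response of the k nearest neighbors."""
    return sum(train_y[i] for i in k_nearest(train_x, new_x, k)) / k


genders = ["F", "M", "F", "M", "F"]
heights = [66, 70, 70, 72, 62]
weights = [135, 165, 155, 200, 140]

# Classification: gender from (height, weight).
xy = list(zip(heights, weights))
scale_xy = make_scaler(xy)
label = knn_classify([scale_xy(r) for r in xy], genders, scale_xy((71, 190)), k=3)

# Prediction: weight from height alone.
hx = [(h,) for h in heights]
scale_h = make_scaler(hx)
predicted_weight = knn_predict([scale_h(r) for r in hx], weights, scale_h((71,)), k=3)
```

Normalizing first matters because height and weight are on different scales; without it, the variable with the larger spread dominates the distance.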


K-Nearest Neighbors

What changes if the response is continuous?

L. da F Costa, P. Boas, F. Silva, F. Rodrigues (2010). A pattern recognition approach to complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2010(11), P11015.


Assess Performance (Classification)

Classification/Confusion Matrix

                    Predicted Class = 1    Predicted Class = 0
Actual Class = 1    n_{11}                 n_{10}
Actual Class = 0    n_{01}                 n_{00}

Overall error rate = Estimated misclassification rate =

$$\frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$$

Overall accuracy = 1 - overall error
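As a small worked example, the four counts and the two summary rates could be computed like this. The actual and predicted labels are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (n11, n10, n01, n00); first index is actual class, second is predicted."""
    pairs = list(zip(actual, predicted))
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))
    n01 = pairs.count((0, 1))
    n00 = pairs.count((0, 0))
    return n11, n10, n01, n00


def overall_error_rate(actual, predicted):
    """Misclassified records divided by all records."""
    n11, n10, n01, n00 = confusion_counts(actual, predicted)
    return (n10 + n01) / (n11 + n10 + n01 + n00)


# Hypothetical validation-set labels, invented for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
error = overall_error_rate(actual, predicted)
accuracy = 1 - error
```

Here two of the eight records fall off the diagonal, so the overall error rate is 2/8.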


Assess Performance (Prediction)

Root mean squared error =

$$\sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

where

$y_i$ = observed response variable

$\hat{y}_i$ = predicted response variable

$n$ = sample size
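The formula translates almost directly into code. The observed and predicted values below are made up so that every error is exactly one unit.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted responses."""
    n = len(observed)
    squared_errors = ((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical values: each prediction is off by exactly 1.
example_rmse = rmse([3, 5, 7, 9], [4, 4, 8, 8])
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.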


Prediction: Linear Regression

This is perhaps the most popular model for making predictions.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + \varepsilon$$

The response variable ($Y$) is equal to a weighted sum of the explanatory variables ($X$) plus noise ($\varepsilon$).
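For a single explanatory variable, the least-squares estimates of the intercept and slope have a simple closed form. The sketch below applies it to the Height/Weight columns of the example table from this talk; choosing weight as the response and height as the explanatory variable is an assumption made for illustration.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept (b0) and slope (b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


heights = [66, 70, 70, 72, 62]
weights = [135, 165, 155, 200, 140]

b0, b1 = fit_simple_regression(heights, weights)
predicted_weight = b0 + b1 * 68  # prediction at the mean height
```

With more than one explanatory variable the same idea holds, but the coefficients are usually obtained with matrix methods from a statistics library.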


Classification: Logistic Regression

In linear regression, we model the response as a function of the explanatory variables. In logistic regression, the response variable is binary, and we model the probability that the response = 1 ($p$) as a function of the explanatory variables.

$$\log\frac{p}{1 - p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n$$
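To see how the equation yields a probability, one can invert the log-odds. The coefficients below are hypothetical values chosen only to make the arithmetic easy; they are not fitted to any data.

```python
import math


def logit(p):
    """Log-odds: the left-hand side of the logistic regression equation."""
    return math.log(p / (1 - p))


def inverse_logit(eta):
    """Map a linear predictor back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-eta))


def predict_probability(coefs, x):
    """coefs = [b0, b1, ..., bn]; x = [x1, ..., xn]."""
    eta = coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))
    return inverse_logit(eta)


# Hypothetical coefficients, for illustration only.
coefs = [-2.0, 0.5]
p_at_4 = predict_probability(coefs, [4])  # linear predictor is 0, so p is 0.5
p_at_8 = predict_probability(coefs, [8])  # linear predictor is 2
```

A record is then classified as 1 when its estimated probability exceeds a cutoff, commonly 0.5.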


Prediction: Regression Tree

Explode  Size  Age  Manufacturer
1        25    5    A
1        30    5    A
1        35    10   A
1        40    10   A
0        40    10   A
0        35    10   B
0        40    10   B
1        50    10   B
0        60    15   B
0        55    15   B


Classification: Classification Tree


Clustering

Cluster analysis is an unsupervised learning technique.

Unsupervised learning techniques are often not ends in themselves, but are methods for finding relationships and patterns that might be used for subsequent predictive analysis.


K-Means Clustering

Algorithm

• Pick a number of clusters (k)

• Assign each record to one of the k clusters

• Calculate the centroid (vector mean) for each cluster

• At each step, each record is reassigned to the cluster with the “closest” centroid

• Recompute the centroids

• Stop when moving any more records increases cluster dispersion
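The steps above can be sketched in plain Python. Several details are assumptions made here: the initial assignment simply sends record i to cluster i mod k, distance is Euclidean, and the loop stops when no record changes cluster, which is a common practical stand-in for the dispersion-based stopping rule. The points are made up for illustration.

```python
import math


def centroid(points):
    """Vector mean of a list of points."""
    return tuple(sum(c) / len(c) for c in zip(*points))


def k_means(points, k, max_iter=100):
    """Assumes len(points) >= k so that every cluster starts non-empty."""
    # Initial assignment: record i goes to cluster i mod k (an arbitrary choice).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Recompute the centroid of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Reassign each record to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no record moved, so stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two visibly separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
```

The result depends on the starting assignment, so in practice the algorithm is often run from several random starts and the solution with the lowest dispersion is kept.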


Association Analysis

Examples

• We have information on what items were purchased by each consumer at Harris Teeter. We would like to use this information to generate coupons.

• We are an online merchant. We see what the customer is purchasing and recommend another item (and potentially offer it at a discount).

Details

• Also called affinity analysis or market basket analysis

• Goal is to identify item clusterings in transaction-type databases (“what goes with what”)

• The classic algorithm is the Apriori algorithm.
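A flavor of "what goes with what" can be had by counting item pairs. The sketch below is not the full classic algorithm; it only computes support and confidence for pairs, which are the quantities such algorithms build on. The baskets, the `min_support` threshold, and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_rules(transactions, min_support=0.4):
    """Map (antecedent, consequent) to (support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence: of the baskets with the antecedent, the share with the consequent.
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


rules = pair_rules(transactions)
```

In this toy data every basket with butter also contains bread, so the rule "butter, then bread" has confidence 1, which is the kind of pattern a coupon or recommendation could be built on.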


My Go-To Books for Teaching Data Science

• G. Shmueli, P. Bruce, I. Yahav, N. Patel, K. Lichtendahl (2018). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. John Wiley & Sons.

• B. Baumer, D. Kaplan, N. Horton (2017). Modern Data Science with R. CRC Press.

• D. Nolan, D. Temple Lang (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Chapman & Hall.
