demystifying data science · 2018-03-20 · demystifying data science alyson wilson, ph.d., pstat...

Demystifying Data Science

Alyson Wilson, Ph.D., PStat

Department of Statistics
Laboratory for Analytic Sciences
North Carolina State University

March 22, 2018

A. Wilson (NCSU Statistics) Demystifying Data Science March 22, 2018

Objectives

• Lots of “Data” Definitions

• Classes of Algorithms (with descriptions)


It’s All About Data

• Big data

• Data engineering

• Data science

• Data analytics

• Data mining


Data Engineering

Data engineers design, build, and manage the infrastructure to support data collection, storage, and analysis.

One of the key functions is managing extract, transform, load (ETL).

• Extract data from a variety of sources.

• Transform to the proper format or structure to support querying and analysis.

• Load the data into the final database, for example, a data store, data mart, or data warehouse.
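
The three ETL steps can be sketched in a few lines of Python. This is a minimal illustration only, not a production pipeline: the inline CSV text stands in for a real source, the table name people and the in-memory SQLite database are assumptions made for the example, and the values are the small Gender/Height/Weight table used elsewhere in these slides.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real source might be files, APIs, or logs.
RAW_CSV = """gender,height,weight
F,66,135
M,70,165
F,70,155
M,72,200
F,62,140
"""

def extract(text):
    """Extract: read raw rows (all strings) from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert strings to numbers so the data can be queried."""
    return [(r["gender"], int(r["height"]), int(r["weight"])) for r in rows]

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE people (gender TEXT, height INTEGER, weight INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumed target; stands in for a data warehouse
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
avg_height_f = conn.execute(
    "SELECT AVG(height) FROM people WHERE gender = 'F'"
).fetchone()[0]
print(row_count, avg_height_f)
```

Once the data is loaded, analysis questions become ordinary queries against the table.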


Is “Big Data” the same as “Data Science”?


Five Data Science Skills

• Technical
  • Algorithmic/computational/predictive and data/statistical/inferential methods
  • Mathematics (particularly modeling and linear algebra)
  • Obtaining, wrangling, curating, managing and processing data, exploring data

• Communication

• Collaboration

• Tools

• Subject Matter Expertise

Adapted from Michael Rappa, NCSU Institute for Advanced Analytics, and Curriculum Guidelines for Undergraduate Programs in Data Science, American Statistical Association


Data Product

A data product is the production output from a statistical analysis. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference.

• interactive analytics (e.g., R Shiny)

• packages of analysis tools (e.g., an R package)

• interactive graphics

The idea is to use technology to tell a story about data to a broad audience.


Data Science is a Team Sport


A Few Words about Data Wrangling (Munging)

Data is messy. Much (but not all) of what falls in data wrangling is what we have traditionally called data cleaning.

• Data scraping is the use of a program to extract data from human-readable sources. Think web sites, online datasets, etc.

• There is an emerging standard for data wrangling embodied in the R package dplyr.
  • select(): take a subset of columns/features/variables
  • filter(): take a subset of rows/observations
  • mutate(): add or modify existing columns
  • arrange(): sort rows
  • summarize(): aggregate data across rows

• See also tidyr, an R package which describes a way to think about storing and formatting data. http://vita.had.co.nz/papers/tidy-data.html
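
To make the five verbs concrete, here is a rough plain-Python analogue of each one applied to a list of records. It is not dplyr and does not use any dplyr API; the variable names and the derived column weight_per_height are made up for illustration, and the values come from the small Gender/Height/Weight table in these slides.

```python
from statistics import mean

rows = [
    {"gender": "F", "height": 66, "weight": 135},
    {"gender": "M", "height": 70, "weight": 165},
    {"gender": "F", "height": 70, "weight": 155},
    {"gender": "M", "height": 72, "weight": 200},
    {"gender": "F", "height": 62, "weight": 140},
]

# select(): keep a subset of columns
selected = [{"gender": r["gender"], "weight": r["weight"]} for r in rows]

# filter(): keep a subset of rows
tall = [r for r in rows if r["height"] >= 70]

# mutate(): add a column computed from existing ones (hypothetical column name)
with_ratio = [{**r, "weight_per_height": r["weight"] / r["height"]} for r in rows]

# arrange(): sort rows
by_weight = sorted(rows, key=lambda r: r["weight"])

# summarize(): aggregate across rows (here, mean weight within each gender)
genders = sorted({r["gender"] for r in rows})
mean_weight = {g: mean(r["weight"] for r in rows if r["gender"] == g) for g in genders}

print(len(tall), by_weight[0], mean_weight)
```

Each verb takes a table and returns a table, which is what lets them be chained into a wrangling pipeline.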


Data Analytics

Data analytics are essentially statistics with a lower-case “s”. Analytics are computations that one makes with data to answer questions. They are often described by data type, by application area, or by method class.

• Geospatial analytics
• Text analytics
• Network analytics
• Forecasting = “time series analytics”
• Business analytics
• Visualization
• Neural nets
• Deep learning


Data Mining

“Statistics at scale and speed” (and simplicity)

D. Pregibon (1999). 2001: A statistical odyssey. Invited talk at The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, NY.


More Definitions

• Machine learning: focused on prediction, based on known properties learned from the training data.

• Data mining: focused on the discovery of (previously) unknown properties in the data.

Data mining + machine learning are currently being rebranded as artificial intelligence.


Algorithms

• Supervised learning: Providing an algorithm with labeled records, in which an output variable of interest is known, so that the algorithm learns how to predict this value for new records where the output is unknown.

• Unsupervised learning: Providing an algorithm with unlabeled records, where the goal is to draw inferences from the input data alone.


Prediction and Classification

• Prediction: supervised learning when the response is a continuous variable

• Classification: supervised learning when the response is a categorical variable

The goal is to predict the value of the response using the explanatory variables.


Data

Gender  Height  Weight
F       66      135
M       70      165
F       70      155
M       72      200
F       62      140

Response? Explanatory?


Predictive Analytics

Predictive analytics is a general term encompassing classification and prediction (and sometimes association analysis).

Common Algorithms:

• K-Nearest Neighbors

• Linear Regression

• Logistic Regression

• Classification Trees

• Regression Trees

• Neural Networks


Explanatory v Predictive Models

• The typical explanatory model is for the “small data case” (classical statistics); a typical predictive model is for the “large data case” (data mining).

• A good explanatory model fits the data closely; a good predictive model predicts new cases accurately.

• In explanatory models, the whole dataset is used for estimating the coefficients and picking the “best model.” Performance measures assess how well the model fits the data.

• In predictive models, the training data is used to estimate the model, and the validation data set is used to assess performance (more in a minute). Performance measures assess predictive accuracy.

We are focused on predictive models.


Assessing Performance

When we are choosing a predictive analytic (method, model, algorithm), we typically divide (partition) our data into three parts.

• Training Partition: Usually the largest, used to build the model(s).

• Validation (Test) Partition: Assess performance of each model.

• Test (Holdout, Evaluation) Partition: Assess the performance of the chosen model with new data.
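
A three-way partition might be sketched as below. The 60/20/20 split, the fixed seed, and the function name partition are assumptions for illustration; the slides do not prescribe particular proportions.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records, then split into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test

records = list(range(100))  # stand-in for 100 data records
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Shuffling before splitting matters: if the records arrive sorted (by date, say), an unshuffled split would give partitions that differ systematically.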


K-Nearest Neighbors

• Nonparametric technique: no assumption about the form of the relationship between the response and explanatory variables

• Can be used either for classification or prediction

• Idea: classify/predict a new record by finding “similar” records in the training data. These “neighbors” are used to derive a classification/prediction by voting (classification) or averaging (prediction)


K-Nearest Neighbors

Algorithm:

• Compute the distance from each record to each other record, using only the explanatory variables

• Using the k nearest neighbors, classify each record into the category that has the most of the k neighbors

• OR, using the k nearest neighbors, predict the response of each record as the average response of the k nearest neighbors

Implementation:

• Choose k

• Normalize data

• Choose distance metric
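
The algorithm and the three implementation choices can be sketched together. This is one possible set of choices, not the only one: k = 3, z-score normalization, and Euclidean distance are all assumptions, and the function names are made up for the example. The training records are the Gender/Height/Weight table from these slides.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev

# (gender, height, weight) training records
train = [("F", 66, 135), ("M", 70, 165), ("F", 70, 155), ("M", 72, 200), ("F", 62, 140)]

def zscore_columns(points):
    """Normalize each explanatory variable to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*points)]
    scaled = [tuple((v - m) / s for v, (m, s) in zip(p, stats)) for p in points]
    return scaled, stats

def knn_classify(new_point, k=3):
    """Classify a (height, weight) record by majority vote of its k nearest neighbors."""
    features = [(h, w) for _, h, w in train]
    scaled, stats = zscore_columns(features)
    # Scale the new record with the training means and standard deviations.
    target = tuple((v - m) / s for v, (m, s) in zip(new_point, stats))
    nearest = sorted(zip(scaled, train), key=lambda pair: dist(pair[0], target))[:k]
    votes = Counter(record[0] for _, record in nearest)
    return votes.most_common(1)[0][0]

def knn_predict_weight(height, k=3):
    """Predict weight as the average weight of the k records nearest in height."""
    nearest = sorted(train, key=lambda r: abs(r[1] - height))[:k]
    return mean(r[2] for r in nearest)

print(knn_classify((64, 138)), knn_classify((72, 190)), knn_predict_weight(71))
```

Normalizing first keeps weight, which has the larger numeric spread, from dominating the distance.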


K-Nearest Neighbors

What changes if the response is continuous?

L. da F Costa, P. Boas, F. Silva, F. Rodrigues (2010). A pattern recognition approach to complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2010(11), P11015.


Assess Performance (Classification)

Classification/Confusion Matrix

                  Predicted Class = 1   Predicted Class = 0
Actual Class = 1  n11                   n10
Actual Class = 0  n01                   n00

Overall error rate = Estimated misclassification rate = $\frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$

Overall accuracy = 1 - overall error rate
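
As a small worked example, the four cell counts and the two summary rates can be computed from paired lists of actual and predicted classes. The eight labels below are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (n11, n10, n01, n00), indexed as n[actual][predicted]."""
    pairs = list(zip(actual, predicted))
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))  # actual 1, predicted 0
    n01 = pairs.count((0, 1))  # actual 0, predicted 1
    n00 = pairs.count((0, 0))
    return n11, n10, n01, n00

# Hypothetical validation-set labels
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

n11, n10, n01, n00 = confusion_counts(actual, predicted)
error_rate = (n10 + n01) / (n11 + n10 + n01 + n00)
accuracy = 1 - error_rate
print(n11, n10, n01, n00, error_rate, accuracy)
```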


Assess Performance (Prediction)

Root mean squared error = $\sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$

where

$y_i$ = observed response variable

$\hat{y}_i$ = predicted response variable

$n$ = sample size
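
The root mean squared error is short enough to compute directly. The observed and predicted values below are made up for illustration.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted responses."""
    n = len(observed)
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / n)

# Hypothetical validation-set values
observed = [3, 5, 7, 9]
predicted = [2, 5, 8, 11]
error = rmse(observed, predicted)  # squared errors are 1, 0, 1, 4
print(error)
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.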


Prediction: Linear Regression

This is perhaps the most popular model for making predictions.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$

The response variable (Y) is equal to a weighted sum of the explanatory variables (X) plus noise (ε).
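
For the special case of a single explanatory variable, the least-squares weights have a closed form. The sketch below fits Weight on Height using the small table from these slides; treating Weight as the response is a choice made for this example, and the function name is hypothetical.

```python
from statistics import mean

heights = [66, 70, 70, 72, 62]       # explanatory variable X1
weights = [135, 165, 155, 200, 140]  # response Y

def fit_simple_regression(x, y):
    """Least-squares estimates of beta0 and beta1 for Y = beta0 + beta1 * X1 + noise."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

beta0, beta1 = fit_simple_regression(heights, weights)
predicted_weight = beta0 + beta1 * 68  # prediction for a new record with height 68
print(beta0, beta1, predicted_weight)
```

With more than one explanatory variable the same idea applies, but the estimates are usually obtained with matrix methods from a statistics library rather than by hand.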


Classification: Logistic Regression

In linear regression, we model the response as a function of the explanatory variables. In logistic regression, the response variable is binary, and we model the probability that the response = 1 (p) as a function of the explanatory variables.

$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
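
The model is stated on the log-odds scale, so turning a fitted model into a probability means inverting the log-odds. The sketch below shows that conversion only; it does not estimate the coefficients, and the coefficient values used are made up for illustration.

```python
from math import exp, log

def log_odds(p):
    """Left-hand side of the model: log(p / (1 - p))."""
    return log(p / (1 - p))

def probability(beta0, betas, xs):
    """Solve the logistic model for p given coefficients and explanatory values."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))  # the linear predictor
    return 1 / (1 + exp(-eta))

# Hypothetical coefficients: beta0 = -1.0, beta1 = 0.5, evaluated at X1 = 3.0
p = probability(-1.0, [0.5], [3.0])
print(p, log_odds(p))  # log_odds(p) recovers the linear predictor, 0.5
```

Whatever value the linear predictor takes, the resulting p stays strictly between 0 and 1, which is why the log-odds scale is used.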


Prediction: Regression Tree

Explode  Size  Age  Manufacturer
1        25    5    A
1        30    5    A
1        35    10   A
1        40    10   A
0        40    10   A
0        35    10   B
0        40    10   B
1        50    10   B
0        60    15   B
0        55    15   B


Prediction: Regression Tree


Classification: Classification Tree


Clustering

Cluster analysis is an unsupervised learning technique.

Unsupervised learning techniques are often not ends in themselves, but are methods for finding relationships and patterns that might be used for subsequent predictive analysis.


K-Means Clustering

Algorithm

• Pick a number of clusters (k)

• Assign each record to one of the k clusters

• Calculate the centroid (vector mean) for each cluster

• At each step, each record is reassigned to the cluster with the “closest” centroid

• Recompute the centroids

• Stop when moving any more records increases cluster dispersion
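
The steps above can be sketched in plain Python. Several details here are assumptions rather than part of the slides: the initial assignment is round-robin, “closest” means Euclidean distance, and the loop stops when no record changes cluster, which is a common practical stand-in for the dispersion-based stopping rule. The six points are made up.

```python
from math import dist

def centroid(points):
    """Vector mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    # Initial assignment: round-robin, an arbitrary deterministic choice for this sketch.
    labels = [i % k for i in range(len(points))]
    for _ in range(max_iter):
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # If a cluster is empty, reuse an arbitrary point as its centroid.
            centroids.append(centroid(members) if members else points[j])
        # Reassign every record to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # nothing moved, so stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-dimensional records forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

The result can depend on the initial assignment, so in practice the algorithm is often run from several starting points.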


Association Analysis


Association Analysis

Examples

• We have information on what items were purchased by each consumer at Harris Teeter. We would like to use this information to generate coupons.

• We are an online merchant. We see what the customer is purchasing and recommend another item (and potentially offer it at a discount).

Details

• Also called affinity analysis or market basket analysis

• Goal is to identify item clusterings in transaction-type databases (“what goes with what”)

• The classic algorithm is the Apriori algorithm.
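
A small sketch of the “what goes with what” idea follows. It counts item pairs only and is not a full implementation of the classic algorithm, though it borrows its key pruning step: a pair can be frequent only if both of its items are frequent. The transactions, the 0.4 support threshold, and the function names are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_pairs(transactions, min_support=0.4):
    """Return {pair: support} for item pairs found in at least min_support of baskets."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    # Pruning step: drop infrequent items before counting pairs.
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t & frequent_items), 2)
    )
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent."""
    has_antecedent = [t for t in transactions if antecedent in t]
    return sum(consequent in t for t in has_antecedent) / len(has_antecedent)

pairs = frequent_pairs(transactions)
rule_confidence = confidence("diapers", "beer", transactions)
print(pairs[("beer", "diapers")], rule_confidence)
```

Support says how often a pair occurs at all; confidence says how reliably one item accompanies the other, which is the number a coupon or recommendation rule would lean on.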


My Go-To Books for Teaching Data Science

• G. Shmueli, P. Bruce, I. Yahav, N. Patel, K. Lichtendahl (2018). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. John Wiley & Sons.

• B. Baumer, D. Kaplan, N. Horton (2017). Modern Data Science with R. CRC Press.

• D. Nolan, D. Temple Lang (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Chapman & Hall.

