About the Presenter: David J Corliss

• PhD in statistical astrophysics; formerly part-time faculty at Wayne State University

• Analytics Architect in the automotive industry

• Work focuses on bringing university research in big data and time series analysis to the private sector

• Founder of Peace-Work, a volunteer cooperative of statisticians, data scientists and other researchers applying analytics to issues in poverty, education and social justice

Best Practices in Big Data

David J Corliss, PhD

Peace-Work

4/27/2016

IHBI
The Institute for Health and Business Insight

OUTLINE

Data Management

Sampling and Coding for Big Data

Tests For Model Performance

Distributed Computing

Summary

Data Management for Big Data

• Pre-screen records and variables

• Process only the records and variables needed

• Efficient Data Step Coding

• Use less computationally intensive methods

Bad Data Management 101

proc sort data=applicants;
  by demographic_seg ID;

proc genmod data=applicants;
  class demographic_seg;
  model accept = var1-var221 /
        dist = bin
        link = logit
        lrci;
run;

• Unnecessary sort

• Doesn’t screen variables first

• Models all variables

• Computationally intensive but not needed

Managing Big Data

proc glmselect data=applicants(where=(ranuni(0) le 0.001));
  model accept = var1-var221 / selection=lasso(stop=none choose=sbc);
run;

proc logistic data=applicants;
  class demographic_seg;
  model accept = var12 var57 var125 var203;
run;

• Test on a sample

• Select candidate variables

• Computationally lightest sufficient method

• Model only screened variables

Sampling for Big Data

• Develop analytic processes using sample

• Sample Size

• Representative Samples

• Testing Sample Quality
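
A minimal sketch of how these sampling points might look in SAS. It assumes the applicants data set and the accept and demographic_seg variables used on the earlier slides; the 0.1% rate, the seed, and the output name app_sample are illustrative assumptions, not values from the talk. PROC SURVEYSELECT draws a reproducible simple random sample, and PROC FREQ offers one quick way to see whether the sample's category mix resembles the full data.

```sas
/* Draw a reproducible 0.1% simple random sample.
   The rate, seed, and output name are assumptions for illustration. */
proc surveyselect data=applicants out=app_sample
                  method=srs samprate=0.001 seed=20160427;
run;

/* Sample quality check: category shares in the full data ... */
proc freq data=applicants;
  tables demographic_seg accept / nocum;
run;

/* ... compared with the same shares in the sample */
proc freq data=app_sample;
  tables demographic_seg accept / nocum;
run;
```

If the percentages in the two PROC FREQ outputs differ noticeably, a larger or stratified sample may be worth considering before developing the analytic process on it.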

Efficient Coding for Big Data

• Read only the variables needed for analysis

• Pass the data as few times as possible

• Use formats instead of new variables

• Shorten records by using codes instead of text

• Trim unnecessary decimal places

• Computationally light processes where possible
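
A short sketch of several of these points in one DATA step, assuming the applicants data set from the earlier slides. The format name $segfmt, the codes 'A', 'B' and 'C', the assumption that demographic_seg is a character variable and accept is coded 0/1, and the output name approved are all hypothetical and chosen only for illustration.

```sas
/* Hypothetical format: show a short stored code as descriptive text,
   instead of adding a long text variable to every record */
proc format;
  value $segfmt 'A' = 'Urban'
                'B' = 'Suburban'
                'C' = 'Rural';
run;

/* One pass over the data: read only the variables needed and
   filter records as they are read */
data approved;
  set applicants(keep=ID demographic_seg accept var12 var57
                 where=(accept = 1));
  format demographic_seg $segfmt.;  /* display text; no new variable created */
run;
```

The KEEP= and WHERE= data set options are applied as the data are read, so unneeded variables and records never enter the step.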

Coding for Big Data: Hash Object

An Ordinary Customer List

Name                       Street_Address         City         State  Zip_Code  prod_42  prod_44
Magnify Analytics          1 Kennedy Square       Detroit      MI     48226     4        3
Fedex Office               2609 Plymouth Road #7  Ann Arbor    MI     48105     4        2
Hyatt Regency Minneapolis  1300 Nicollet Mall     Minneapolis  MN     55403     1        5
Wrigley Field              1060 W. Addison St     Chicago      IL     60613     2        3
...

The Same Data in a Hash Table

Hash_ID   Zip_Code  prod_42  prod_44
00042540  48226     4        3
00063640  48105     4        2
00146328  55403     1        5
00243466  60613     2        3
...

Coding for Big Data: Hash Object

The Hash Object Process

y = w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5

1. Read the hash key for the given record

2. Look up the value of x1 by the key

3. Multiply by w1 and save it in a buffer

4. Repeat for each component of the model

5. Add all the components to calculate y

6. Release the buffer and go to the next record

7. Repeat for each record
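
The steps above could be sketched with the DATA step hash object roughly as follows. This is an illustration under stated assumptions rather than the presenter's code: the data sets model_inputs (one row per key holding x1-x5) and customers (the records to score), the key variable hash_id, and the weight values are all hypothetical.

```sas
data scored(keep=hash_id y);
  /* Define the lookup variables so the hash object can bind to them */
  length hash_id $8 x1-x5 8;
  array x{5} x1-x5;
  /* Placeholder weights w1-w5; not values from the talk */
  array w{5} _temporary_ (0.2 0.1 0.3 0.25 0.15);

  if _n_ = 1 then do;
    /* Load the hypothetical lookup table into memory once */
    declare hash h(dataset: 'model_inputs');
    h.defineKey('hash_id');
    h.defineData('x1', 'x2', 'x3', 'x4', 'x5');
    h.defineDone();
    call missing(of x1-x5);
  end;

  set customers(keep=hash_id);   /* step 1: read the key for this record */

  if h.find() = 0 then do;       /* step 2: look up x1-x5 by the key */
    y = 0;
    do i = 1 to 5;               /* steps 3-5: weight and add components */
      y = y + w{i} * x{i};
    end;
    output;                      /* steps 6-7: move on to the next record */
  end;
run;
```

Because the lookup table sits in memory, each record is scored with a keyed lookup instead of a sort and merge, which is the main appeal of the approach for scoring large data sets.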

Testing Model Performance

The Problem of p-values and Big Data

Explanatory Variable  Estimate     Pr(>|z|)
Var1                  0.271503909  < 0.001
Var2                  0.998361223  < 0.001
...
Var25                 0.244677914  < 0.001
Var26                 0.387859652  < 0.001
...
Var100                0.561703993  < 0.001
Var101                0.479482516  0.002
Var102                0.35656757   0.003

Testing Model Performance

ASA Statement on p-values, 3/7/2016:

“The p-value was never intended to be a substitute for scientific reasoning…Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.’”

Ron Wasserstein, ASA Executive Director

Testing Model Performance

New Statistical Tests for Big Data

• Bonferroni Correction

• False Discovery Rate

• False Coverage Rate

• Per-Comparison Error Rate (PCER)

• Bayesian, including Bayesian FCR
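
As one illustration of the first two adjustments in the list, PROC MULTTEST can adjust a set of p-values that have already been computed. The sketch below is hedged: the data set names pvals and adjusted, the term labels, and the p-values themselves are invented for illustration and are not results from the talk. By default the procedure looks for the unadjusted p-values in a variable named raw_p.

```sas
/* Made-up p-values, one row per model term (illustrative only) */
data pvals;
  input term $ raw_p;
  datalines;
A 0.0004
B 0.0009
C 0.002
D 0.003
E 0.04
;
run;

/* BONFERRONI controls the family-wise error rate;
   FDR applies the Benjamini-Hochberg false discovery rate adjustment */
proc multtest inpvalues=pvals bonferroni fdr out=adjusted;
run;

proc print data=adjusted;
run;
```

With hundreds of candidate variables, comparing the adjusted columns against the raw p-values shows how many apparent effects survive once the number of tests is taken into account.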

Distributed Computing

Traditional Server Computing

[Diagram: USER WORK STATIONS connected to a SERVER]

Need More Resources? >> Get a Bigger Server

[Diagram: USER WORK STATIONS connected to a bigger SERVER]

Distributed Computing

Scalable Distributed Computing

[Diagram: USER WORK STATIONS connected to a SERVER NODE NETWORK]

Need More Resources? >> Add More Nodes

[Diagram: USER WORK STATIONS connected to a SERVER NODE NETWORK with additional nodes]

Summary of Big Data Best Practices

• Use best practices for managing large data sets, with efficient coding

• Pre-screen records and variables, only processing the data needed

• Use sampling where appropriate

• Consider Hash Object Programming to apply scoring models to big data

• Learn and use multi-threaded and distributed statistical procedures

• Use tests for model performance that have been designed for big data

• Look into grid computing for large analytic systems

References and Additional Materials

Programming for Job Security, Arthur Carpenter and Tony Payne

http://www2.sas.com/proceedings/sugi23/Training/p275.pdf

Secrets of Efficient SAS® Coding Techniques

http://support.sas.com/resources/papers/proceedings16/11741-2016.pdf

The SAS Data Step: Where Your Input Matters

http://www.pharmasug.org/proceedings/2012/TF/PharmaSUG-2012-TF04.pdf

Maximizing the Power of Hash Tables, David J Corliss

http://support.sas.com/resources/papers/proceedings13/037-2013.pdf

Questions

[email protected]

IHBI
The Institute for Health and Business Insight