orie 4741: learning with big messy data [2ex] introduction · 29 f ct $53,000 college yes clinton...

ORIE 4741: Learning with Big Messy Data

Introduction

Professor Udell

Operations Research and Information EngineeringCornell

September 8, 2020

1 / 39

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

2 / 39

ORIE 4741: Learning with Big Messy Data

want to take this class?

I ASAP:I enroll (or drop) (or get on wait list)I fill out course survey (provides access campuswire Q&A)I sign up for iClicker REEF

I Thursday 9/10/2020: homework 0

links on course website:https://people.orie.cornell.edu/mru8/orie4741/

3 / 39

https://people.orie.cornell.edu/mru8/orie4741/

Course staff

I Prof. Madeleine Udell

I TA: Chengrun Yang (ECE PhD)

I TA: Yuxuan Chen (Statistics PhD)

I TA: Anusha Avyukt (Statistics MPS)

I TA: Juliet Zhong (ORIE MEng + Undergraduate)

I TA: Allison Grimsted (ORIE Undergraduate)

I TA: Carrie Rucker (ORIE Undergraduate)

4 / 39

Tech stack

I Zoom for lectures

I Course website for course materials(syllabus, schedule, homework, project, etc)

I iClicker REEF for polls

I Campuswire for Q&A and announcements

I Gradescope for quizzes, submitting homework, grades,solutions

I Github for code (demos, projects, and hw starter code)

Zoom contingencies

I If I get logged off (eg, due to connectivity issues), your TAswill stay on and provide further instructions

I If the Zoom platform fails (eg, Zoom-bombing or Zoomoutage), look on Campuswire for further instructions

5 / 39

Course requirements and grading

course website:(grading, course requirements, lectures, homework, etc)https://people.orie.cornell.edu/mru8/orie4741/

I (15%) Participation: for every lecture (after this one), useI iClicker REEF for sync lecturesI participation form for async lectures

I (30%) HomeworkI due every two weeks or soI first one due next Thursday

I (15%) QuizzesI 30 min quiz every week or so

I (40%) Project

FAQ:

I yes, you can take the class async in any timezoneI yes, you can take section async, or not take the section

6 / 39


Questions

during lecture:

I ask out loud

I zoom chat to TA Carrie Rucker

outside of lecture:

I ask at office hours

I ask on campuswire

I don’t send email

7 / 39

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

8 / 39

Oh, you work with big messy data? Maybe you couldhelp us out...?

9 / 39

My career in big data

academic

I B.S. in Mathematics and Physics at YaleI Ph.D. in Computational and Mathematical Engineering at

StanfordI postdoctoral fellow at the Center for the Mathematics of

Information at CaltechI professor in ORIE at Cornell

applied work

I finance: Goldman Sachs, BlackRock, Capital One,Schonfeld, Two Sigma, . . .

I cybersecurity: DARPA, Expanse (formerly Qadium)I healthcare: Apixio, OntarioI clean energy: AuroraI commerce: Retina.ai, Marketing AttributionI politics: Obama 2012

10 / 39

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Biden · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?

11 / 39

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Biden · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?11 / 39

How to get data for politics?

two major sources of data:

I census

I voter registration

data quality is critical!

I for hw0, you’ll respond to census + register to voteI if eligible:

I census eligibility: college students (including foreign) shouldbe counted at the on-campus or off-campus residence wherethey sleep most of the time (even if you went home beforeCensus Day last spring, and even if you’re a foreign citizen)

I voter eligibility: US citizen 18+

12 / 39

https://www.census.gov/content/dam/Census/programs-surveys/decennial/2020-census/2020-Census-Residence-Criteria.pdf



https://www.vote.org/

Medicine

13 / 39

Data table: medicine

age gender heart disease statins? · · ·29 F yes no · · ·57 ? no no · · ·? M no no · · ·

41 F yes yes · · ·...

......

...

I find similar patients?

I understand systemic healthcare needs?

I use symptoms to detect which patients have COVID-19?

I detect patients who had series of mini-strokes?

14 / 39

https://www.symptomchallenge.org/

COVID projections

https://covid19-projections.com/

15 / 39

https://covid19-projections.com/

COVID projections: Cornell data

https://covid.cornell.edu/testing/dashboard/

16 / 39

https://covid.cornell.edu/testing/dashboard/

Poll

(click “participants” and use Zoom reactions to respond to poll)

So far .2% of Cornell population has had COVID-19. I think X%of Cornell population will get COVID this semester, where X is

I (yes) < .5%

I (no) .5− 1%

I (go slower) 1− 5%

I (go faster) 5− 10%

I (coffee) > 10%

17 / 39

Poll


I think

I (yes) I will get COVID

I (no) I will not get COVID

I (coffee) I’ve already had COVID

18 / 39

Poll


I think a vaccine will be widely available by

I (yes) November

I (no) January

I (slower) Spring 2021

I (faster) Summer 2021

I (coffee) later

I (down) never

19 / 39

Data for projections: simulation

I simulate model given parameter values

I learn parameter values to match data

https:

//github.com/ORIE4741/demos/blob/master/SIR.ipynb

20 / 39

https://github.com/ORIE4741/demos/blob/master/SIR.ipynb

https://github.com/ORIE4741/demos/blob/master/SIR.ipynb

Simulation: results

by varying parameters, we see

I with infrequent testing (weekly+), pandemic is nearlyimpossible to control

I people who go to big parties get COVIDI if too many people go to big parties, Cornell shuts downI if moderate (e.g. 10%) of people go to big parties, classes

shut downI if < 1% go to big parties, Cornell stays open

I if PPE is effective,I people who just go to classes are nearly always okI people who go to small parties might get COVID

I if PPE is not effective,I people who just go to classes might get COVIDI people who go to small parties have even odds of getting

COVID

21 / 39

Simulation: assumptions

simulation is evocative, not realistic. assumes

I fluid limit: large population, no randomness

I only 3 groups that mix internally

I only 3 modes of contact: parties, classes, external(eg grocery shopping)

I no latency period

I actions don’t depend on infection rates

I quarantine is effective

I . . .

22 / 39

Application areas

I health

I politics

I governance

I advertising

I retail

I ecommerce

I finance

I . . .

23 / 39

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

24 / 39

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value

1image courtesy of Kim Minor @ IBM25 / 39

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value1image courtesy of Kim Minor @ IBM

25 / 39

Big: our definition

Definition

An algorithm for big data is one with computational andmemory requirements that scale linearly (or nearly linearly) inthe size of the data.

why this definition? independent of

I hardware

I business

if you use only algorithms for big data, then you’re workingwith big data

26 / 39

Messy

I noisy: some (or all) values suffer errors, inaccuracies, ormalicious corruption

I missing: some values are missing, inconsistent, notrecorded, or lost

I heterogeneous: values of many different typesI continuous values (e.g., 4.2, π)I discrete values (e.g., 0, 4, 994)I nominal values (e.g., apple, banana, pear)I ordinal values (e.g., rarely, sometimes, often)I graphs or networks (e.g., person 1 is friends with person 2)I text (e.g., doctor’s note describing symptoms)I sets (e.g., items purchased)

27 / 39

Learning

I machine learning?

I human learning?

I when data is big and messy,machine help is essential for human learning!

28 / 39

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

29 / 39

Data table

n examples (patients, respondents, households, assets)d features (tests, questions, sensors, times) A

=

a11 · · · a1d...

. . ....

an1 · · · and

I ai is ith row of A: feature vector for ith example

I a:j is jth column of A: values for jth feature across allexamples

I aij is jth feature of ith example

30 / 39

Supervised learning

I identify one column of data that we want to predict A

=

x11 · · · x1 d−1 y1...

. . ....

...xn1 · · · xn d−1 yn

=

X y

I xi ∈ X for i = 1, . . . , n are rows of X

I yi ∈ Y for i = 1, . . . , n are entries of y

I we believe there is a mapping f : X → Y

yi ≈ f (xi )

I our goal is to learn f

31 / 39

Example: supervised learning for credit decisioning

I goal: decide which credit card applicants should beapproved

I input space: entries of X ∈ Rd correspond to fields incredit applicationI e.g., salary, years in residence, outstanding debt,

number of credit lines, . . .I output space: Y = {+1,−1}

I +1 means approveI −1 means reject

I data: D = (x1, y1), . . . , (xn, yn)applications of previous customers, and credit approvaldecisions made by humans

Q: what are potential problems with using a model built withthis data?A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39







Q: what are potential problems with using a model built withthis data?

A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39







Q: what are potential problems with using a model built withthis data?A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39

Exercise: formalizing real problems

I identify a prediction goal

I identify the input space XI identify the output space YI identify the data D = (x1, y1), . . . , (xn, yn) you’d like to use

I what kinds of noise do you expect in the data?

33 / 39

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

34 / 39

Course objectives (I)

I plot

I predict

I cluster

I impute

I denoise

I recommend

I understand

35 / 39

Course objectives (II)

this course is about

I algorithms for big messy data

I learning to ask the right questions

at the end of the course, you should have learned

I at least one method to solve any problem

I machine learning is not magic; it’s math

I when not to trust your solution

the rest you can learn online. . .

36 / 39

Next steps

I ASAP:I enroll (or drop) (or get on wait list)I fill out course survey (provides access to campuswire Q&A)I sign up for iClicker REEF

I Thursday 9/10/2020 9:30am: homework 0

links on course website:https://people.orie.cornell.edu/mru8/orie4741/

37 / 39


Questions?

https://docs.google.com/spreadsheets/d/

1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk

38 / 39

https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk

https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk

orie 4741: learning with big messy data [2ex] introduction · 29 f ct $53,000 college yes clinton...

Documents