orie 4741: learning with big messy data [2ex] introduction · 29 f ct $53,000 college yes clinton...

56
ORIE 4741: Learning with Big Messy Data Introduction Professor Udell Operations Research and Information Engineering Cornell September 8, 2020 1 / 39

Upload: others

Post on 11-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

ORIE 4741: Learning with Big Messy Data

Introduction

Professor Udell

Operations Research and Information EngineeringCornell

September 8, 2020

1 / 39

Page 2: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

2 / 39

Page 3: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

ORIE 4741: Learning with Big Messy Data

want to take this class?

I ASAP:I enroll (or drop) (or get on wait list)I fill out course survey (provides access campuswire Q&A)I sign up for iClicker REEF

I Thursday 9/10/2020: homework 0

links on course website:https://people.orie.cornell.edu/mru8/orie4741/

3 / 39

Page 4: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course staff

I Prof. Madeleine Udell

I TA: Chengrun Yang (ECE PhD)

I TA: Yuxuan Chen (Statistics PhD)

I TA: Anusha Avyukt (Statistics MPS)

I TA: Juliet Zhong (ORIE MEng + Undergraduate)

I TA: Allison Grimsted (ORIE Undergraduate)

I TA: Carrie Rucker (ORIE Undergraduate)

4 / 39

Page 5: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Tech stack

I Zoom for lectures

I Course website for course materials(syllabus, schedule, homework, project, etc)

I iClicker REEF for polls

I Campuswire for Q&A and announcements

I Gradescope for quizzes, submitting homework, grades,solutions

I Github for code (demos, projects, and hw starter code)

Zoom contingencies

I If I get logged off (eg, due to connectivity issues), your TAswill stay on and provide further instructions

I If the Zoom platform fails (eg, Zoom-bombing or Zoomoutage), look on Campuswire for further instructions

5 / 39

Page 6: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Tech stack

I Zoom for lectures

I Course website for course materials(syllabus, schedule, homework, project, etc)

I iClicker REEF for polls

I Campuswire for Q&A and announcements

I Gradescope for quizzes, submitting homework, grades,solutions

I Github for code (demos, projects, and hw starter code)

Zoom contingencies

I If I get logged off (eg, due to connectivity issues), your TAswill stay on and provide further instructions

I If the Zoom platform fails (eg, Zoom-bombing or Zoomoutage), look on Campuswire for further instructions

5 / 39

Page 7: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course requirements and grading

course website:(grading, course requirements, lectures, homework, etc)https://people.orie.cornell.edu/mru8/orie4741/

I (15%) Participation: for every lecture (after this one), useI iClicker REEF for sync lecturesI participation form for async lectures

I (30%) HomeworkI due every two weeks or soI first one due next Thursday

I (15%) QuizzesI 30 min quiz every week or so

I (40%) Project

FAQ:

I yes, you can take the class async in any timezoneI yes, you can take section async, or not take the section

6 / 39

Page 8: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course requirements and grading

course website:(grading, course requirements, lectures, homework, etc)https://people.orie.cornell.edu/mru8/orie4741/

I (15%) Participation: for every lecture (after this one), useI iClicker REEF for sync lecturesI participation form for async lectures

I (30%) HomeworkI due every two weeks or soI first one due next Thursday

I (15%) QuizzesI 30 min quiz every week or so

I (40%) Project

FAQ:

I yes, you can take the class async in any timezoneI yes, you can take section async, or not take the section

6 / 39

Page 9: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Questions

during lecture:

I ask out loud

I zoom chat to TA Carrie Rucker

outside of lecture:

I ask at office hours

I ask on campuswire

I don’t send email

7 / 39

Page 10: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

8 / 39

Page 11: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Oh, you work with big messy data? Maybe you couldhelp us out...?

9 / 39

Page 12: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

My career in big data

academic

I B.S. in Mathematics and Physics at YaleI Ph.D. in Computational and Mathematical Engineering at

StanfordI postdoctoral fellow at the Center for the Mathematics of

Information at CaltechI professor in ORIE at Cornell

applied work

I finance: Goldman Sachs, BlackRock, Capital One,Schonfeld, Two Sigma, . . .

I cybersecurity: DARPA, Expanse (formerly Qadium)I healthcare: Apixio, OntarioI clean energy: AuroraI commerce: Retina.ai, Marketing AttributionI politics: Obama 2012

10 / 39

Page 13: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

My career in big data

academic

I B.S. in Mathematics and Physics at YaleI Ph.D. in Computational and Mathematical Engineering at

StanfordI postdoctoral fellow at the Center for the Mathematics of

Information at CaltechI professor in ORIE at Cornell

applied work

I finance: Goldman Sachs, BlackRock, Capital One,Schonfeld, Two Sigma, . . .

I cybersecurity: DARPA, Expanse (formerly Qadium)I healthcare: Apixio, OntarioI clean energy: AuroraI commerce: Retina.ai, Marketing AttributionI politics: Obama 2012

10 / 39

Page 14: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Biden · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?

11 / 39

Page 15: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Biden · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?11 / 39

Page 16: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

How to get data for politics?

two major sources of data:

I census

I voter registration

data quality is critical!

I for hw0, you’ll respond to census + register to voteI if eligible:

I census eligibility: college students (including foreign) shouldbe counted at the on-campus or off-campus residence wherethey sleep most of the time (even if you went home beforeCensus Day last spring, and even if you’re a foreign citizen)

I voter eligibility: US citizen 18+

12 / 39

Page 17: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Medicine

13 / 39

Page 18: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Data table: medicine

age gender heart disease statins? · · ·29 F yes no · · ·57 ? no no · · ·? M no no · · ·

41 F yes yes · · ·...

......

...

I find similar patients?

I understand systemic healthcare needs?

I use symptoms to detect which patients have COVID-19?

I detect patients who had series of mini-strokes?

14 / 39

Page 19: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

COVID projections

https://covid19-projections.com/

15 / 39

Page 20: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

COVID projections: Cornell data

https://covid.cornell.edu/testing/dashboard/

16 / 39

Page 21: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Poll

(click “participants” and use Zoom reactions to respond to poll)

So far .2% of Cornell population has had COVID-19. I think X%of Cornell population will get COVID this semester, where X is

I (yes) < .5%

I (no) .5− 1%

I (go slower) 1− 5%

I (go faster) 5− 10%

I (coffee) > 10%

17 / 39

Page 22: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Poll

(click “participants” and use Zoom reactions to respond to poll)

I think

I (yes) I will get COVID

I (no) I will not get COVID

I (coffee) I’ve already had COVID

18 / 39

Page 23: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Poll

(click “participants” and use Zoom reactions to respond to poll)

I think a vaccine will be widely available by

I (yes) November

I (no) January

I (slower) Spring 2021

I (faster) Summer 2021

I (coffee) later

I (down) never

19 / 39

Page 24: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Data for projections: simulation

I simulate model given parameter values

I learn parameter values to match data

https:

//github.com/ORIE4741/demos/blob/master/SIR.ipynb

20 / 39

Page 25: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Simulation: results

by varying parameters, we see

I with infrequent testing (weekly+), pandemic is nearlyimpossible to control

I people who go to big parties get COVIDI if too many people go to big parties, Cornell shuts downI if moderate (e.g. 10%) of people go to big parties, classes

shut downI if < 1% go to big parties, Cornell stays open

I if PPE is effective,I people who just go to classes are nearly always okI people who go to small parties might get COVID

I if PPE is not effective,I people who just go to classes might get COVIDI people who go to small parties have even odds of getting

COVID

21 / 39

Page 26: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Simulation: assumptions

simulation is evocative, not realistic. assumes

I fluid limit: large population, no randomness

I only 3 groups that mix internally

I only 3 modes of contact: parties, classes, external(eg grocery shopping)

I no latency period

I actions don’t depend on infection rates

I quarantine is effective

I . . .

22 / 39

Page 27: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Application areas

I health

I politics

I governance

I advertising

I retail

I ecommerce

I finance

I . . .

23 / 39

Page 28: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

24 / 39

Page 29: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value

1image courtesy of Kim Minor @ IBM25 / 39

Page 30: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value

1image courtesy of Kim Minor @ IBM25 / 39

Page 31: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value

1image courtesy of Kim Minor @ IBM25 / 39

Page 32: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big

I NASA, 1997: “taxing the capacities of main memory, localdisk, and even remote disk”

I OED, 2015: “data of a very large size, typically to theextent that its manipulation and management presentsignificant logistical challenges”

I 4 Vs:

1

I 5th V: value1image courtesy of Kim Minor @ IBM

25 / 39

Page 33: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big: our definition

Definition

An algorithm for big data is one with computational andmemory requirements that scale linearly (or nearly linearly) inthe size of the data.

why this definition? independent of

I hardware

I business

if you use only algorithms for big data, then you’re workingwith big data

26 / 39

Page 34: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big: our definition

Definition

An algorithm for big data is one with computational andmemory requirements that scale linearly (or nearly linearly) inthe size of the data.

why this definition? independent of

I hardware

I business

if you use only algorithms for big data, then you’re workingwith big data

26 / 39

Page 35: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Big: our definition

Definition

An algorithm for big data is one with computational andmemory requirements that scale linearly (or nearly linearly) inthe size of the data.

why this definition? independent of

I hardware

I business

if you use only algorithms for big data, then you’re workingwith big data

26 / 39

Page 36: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Messy

I noisy: some (or all) values suffer errors, inaccuracies, ormalicious corruption

I missing: some values are missing, inconsistent, notrecorded, or lost

I heterogeneous: values of many different typesI continuous values (e.g., 4.2, π)I discrete values (e.g., 0, 4, 994)I nominal values (e.g., apple, banana, pear)I ordinal values (e.g., rarely, sometimes, often)I graphs or networks (e.g., person 1 is friends with person 2)I text (e.g., doctor’s note describing symptoms)I sets (e.g., items purchased)

27 / 39

Page 37: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Messy

I noisy: some (or all) values suffer errors, inaccuracies, ormalicious corruption

I missing: some values are missing, inconsistent, notrecorded, or lost

I heterogeneous: values of many different typesI continuous values (e.g., 4.2, π)I discrete values (e.g., 0, 4, 994)I nominal values (e.g., apple, banana, pear)I ordinal values (e.g., rarely, sometimes, often)I graphs or networks (e.g., person 1 is friends with person 2)I text (e.g., doctor’s note describing symptoms)I sets (e.g., items purchased)

27 / 39

Page 38: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Messy

I noisy: some (or all) values suffer errors, inaccuracies, ormalicious corruption

I missing: some values are missing, inconsistent, notrecorded, or lost

I heterogeneous: values of many different typesI continuous values (e.g., 4.2, π)I discrete values (e.g., 0, 4, 994)I nominal values (e.g., apple, banana, pear)I ordinal values (e.g., rarely, sometimes, often)I graphs or networks (e.g., person 1 is friends with person 2)I text (e.g., doctor’s note describing symptoms)I sets (e.g., items purchased)

27 / 39

Page 39: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Learning

I machine learning?

I human learning?

I when data is big and messy,machine help is essential for human learning!

28 / 39

Page 40: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Learning

I machine learning?

I human learning?

I when data is big and messy,machine help is essential for human learning!

28 / 39

Page 41: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Learning

I machine learning?

I human learning?

I when data is big and messy,machine help is essential for human learning!

28 / 39

Page 42: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Learning

I machine learning?

I human learning?

I when data is big and messy,machine help is essential for human learning!

28 / 39

Page 43: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

29 / 39

Page 44: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Data table

n examples (patients, respondents, households, assets)d features (tests, questions, sensors, times) A

=

a11 · · · a1d...

. . ....

an1 · · · and

I ai is ith row of A: feature vector for ith example

I a:j is jth column of A: values for jth feature across allexamples

I aij is jth feature of ith example

30 / 39

Page 45: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Supervised learning

I identify one column of data that we want to predict A

=

x11 · · · x1 d−1 y1...

. . ....

...xn1 · · · xn d−1 yn

=

X y

I xi ∈ X for i = 1, . . . , n are rows of X

I yi ∈ Y for i = 1, . . . , n are entries of y

I we believe there is a mapping f : X → Y

yi ≈ f (xi )

I our goal is to learn f

31 / 39

Page 46: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Supervised learning

I identify one column of data that we want to predict A

=

x11 · · · x1 d−1 y1...

. . ....

...xn1 · · · xn d−1 yn

=

X y

I xi ∈ X for i = 1, . . . , n are rows of X

I yi ∈ Y for i = 1, . . . , n are entries of y

I we believe there is a mapping f : X → Y

yi ≈ f (xi )

I our goal is to learn f

31 / 39

Page 47: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Example: supervised learning for credit decisioning

I goal: decide which credit card applicants should beapproved

I input space: entries of X ∈ Rd correspond to fields incredit applicationI e.g., salary, years in residence, outstanding debt,

number of credit lines, . . .I output space: Y = {+1,−1}

I +1 means approveI −1 means reject

I data: D = (x1, y1), . . . , (xn, yn)applications of previous customers, and credit approvaldecisions made by humans

Q: what are potential problems with using a model built withthis data?A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39

Page 48: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Example: supervised learning for credit decisioning

I goal: decide which credit card applicants should beapproved

I input space: entries of X ∈ Rd correspond to fields incredit applicationI e.g., salary, years in residence, outstanding debt,

number of credit lines, . . .I output space: Y = {+1,−1}

I +1 means approveI −1 means reject

I data: D = (x1, y1), . . . , (xn, yn)applications of previous customers, and credit approvaldecisions made by humans

Q: what are potential problems with using a model built withthis data?

A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39

Page 49: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Example: supervised learning for credit decisioning

I goal: decide which credit card applicants should beapproved

I input space: entries of X ∈ Rd correspond to fields incredit applicationI e.g., salary, years in residence, outstanding debt,

number of credit lines, . . .I output space: Y = {+1,−1}

I +1 means approveI −1 means reject

I data: D = (x1, y1), . . . , (xn, yn)applications of previous customers, and credit approvaldecisions made by humans

Q: what are potential problems with using a model built withthis data?A: wrong objective: human decision may not be correct decision;covariate shift: future data may look unlike past data; . . .

32 / 39

Page 50: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Exercise: formalizing real problems

I identify a prediction goal

I identify the input space XI identify the output space YI identify the data D = (x1, y1), . . . , (xn, yn) you’d like to use

I what kinds of noise do you expect in the data?

33 / 39

Page 51: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Outline

Logistics

Stories

Definitions

Kinds of learning

Syllabus

34 / 39

Page 52: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course objectives (I)

I plot

I predict

I cluster

I impute

I denoise

I recommend

I understand

35 / 39

Page 53: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course objectives (II)

this course is about

I algorithms for big messy data

I learning to ask the right questions

at the end of the course, you should have learned

I at least one method to solve any problem

I machine learning is not magic; it’s math

I when not to trust your solution

the rest you can learn online. . .

36 / 39

Page 54: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Course objectives (II)

this course is about

I algorithms for big messy data

I learning to ask the right questions

at the end of the course, you should have learned

I at least one method to solve any problem

I machine learning is not magic; it’s math

I when not to trust your solution

the rest you can learn online. . .

36 / 39

Page 55: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Next steps

I ASAP:I enroll (or drop) (or get on wait list)I fill out course survey (provides access to campuswire Q&A)I sign up for iClicker REEF

I Thursday 9/10/2020 9:30am: homework 0

links on course website:https://people.orie.cornell.edu/mru8/orie4741/

37 / 39

Page 56: ORIE 4741: Learning with Big Messy Data [2ex] Introduction · 29 F CT $53,000 college yes Clinton 57 ? NY $19,000 high school yes ? ? M CA $102,000 masters no Trump ... (KPI) I carbon

Questions?

https://docs.google.com/spreadsheets/d/

1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk

38 / 39