ORIE 4741: Learning with Big Messy Data
Introduction
Professor Udell
Operations Research and Information Engineering, Cornell
September 8, 2020
1 / 39
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
2 / 39
ORIE 4741: Learning with Big Messy Data
want to take this class?
- ASAP:
  - enroll (or drop) (or get on wait list)
  - fill out course survey (provides access to Campuswire Q&A)
  - sign up for iClicker REEF
- Thursday 9/10/2020: homework 0
links on course website: https://people.orie.cornell.edu/mru8/orie4741/
3 / 39
Course staff
- Prof. Madeleine Udell
- TA: Chengrun Yang (ECE PhD)
- TA: Yuxuan Chen (Statistics PhD)
- TA: Anusha Avyukt (Statistics MPS)
- TA: Juliet Zhong (ORIE MEng + Undergraduate)
- TA: Allison Grimsted (ORIE Undergraduate)
- TA: Carrie Rucker (ORIE Undergraduate)
4 / 39
Tech stack
- Zoom for lectures
- Course website for course materials (syllabus, schedule, homework, project, etc)
- iClicker REEF for polls
- Campuswire for Q&A and announcements
- Gradescope for quizzes, submitting homework, grades, solutions
- GitHub for code (demos, projects, and hw starter code)
Zoom contingencies
- If I get logged off (e.g., due to connectivity issues), your TAs will stay on and provide further instructions
- If the Zoom platform fails (e.g., Zoom-bombing or Zoom outage), look on Campuswire for further instructions
5 / 39
Course requirements and grading
course website (grading, course requirements, lectures, homework, etc): https://people.orie.cornell.edu/mru8/orie4741/
- (15%) Participation: for every lecture (after this one), use
  - iClicker REEF for sync lectures
  - participation form for async lectures
- (30%) Homework
  - due every two weeks or so
  - first one due next Thursday
- (15%) Quizzes
  - 30 min quiz every week or so
- (40%) Project
FAQ:
- yes, you can take the class async in any timezone
- yes, you can take section async, or not take the section
6 / 39
Questions
during lecture:
- ask out loud
- Zoom chat to TA Carrie Rucker
outside of lecture:
- ask at office hours
- ask on Campuswire
- don’t send email
7 / 39
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
8 / 39
Oh, you work with big messy data? Maybe you could help us out...?
9 / 39
My career in big data
academic
- B.S. in Mathematics and Physics at Yale
- Ph.D. in Computational and Mathematical Engineering at Stanford
- postdoctoral fellow at the Center for the Mathematics of Information at Caltech
- professor in ORIE at Cornell
applied work
- finance: Goldman Sachs, BlackRock, Capital One, Schonfeld, Two Sigma, ...
- cybersecurity: DARPA, Expanse (formerly Qadium)
- healthcare: Apixio, Ontario
- clean energy: Aurora
- commerce: Retina.ai, Marketing Attribution
- politics: Obama 2012
10 / 39
Data table: politics
age  gender  state  income    education    voted?  support  ...
29   F       CT     $53,000   college      yes     Biden    ...
57   ?       NY     $19,000   high school  yes     ?        ...
?    M       CA     $102,000  masters      no      Trump    ...
41   F       NV     $23,000   ?            yes     Trump    ...
...  ...     ...    ...       ...          ...     ...      ...

goals:
- detect demographic groups?
- find typical responses?
- identify related features?
- impute missing entries?
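A minimal sketch of the last goal, assuming the four visible rows of the table are stored as Python dictionaries with `None` standing in for each "?". The helper names (`missing_counts`, `impute_numeric_median`) and the choice of median imputation are illustrative assumptions, not the method used in the course.

```python
from statistics import median

# The four visible rows of the slide's table; None marks a "?" entry.
rows = [
    {"age": 29, "gender": "F", "state": "CT", "income": 53000,
     "education": "college", "voted": "yes", "support": "Biden"},
    {"age": 57, "gender": None, "state": "NY", "income": 19000,
     "education": "high school", "voted": "yes", "support": None},
    {"age": None, "gender": "M", "state": "CA", "income": 102000,
     "education": "masters", "voted": "no", "support": "Trump"},
    {"age": 41, "gender": "F", "state": "NV", "income": 23000,
     "education": None, "voted": "yes", "support": "Trump"},
]


def missing_counts(table):
    """Count the missing (None) entries in each column."""
    counts = {}
    for row in table:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def impute_numeric_median(table, column):
    """Return a copy of the table with None in `column` replaced by the median
    of the observed values (a simple baseline, chosen here for illustration)."""
    observed = [row[column] for row in table if row[column] is not None]
    fill = median(observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in table]


counts = missing_counts(rows)
imputed = impute_numeric_median(rows, "age")
print(counts)
print([row["age"] for row in imputed])
```

A single-column median ignores the other features; using related features to fill in a missing entry is the harder problem the slide is pointing at.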
11 / 39
How to get data for politics?
two major sources of data:
- census
- voter registration
data quality is critical!
- for hw0, you’ll respond to census + register to vote
- if eligible:
  - census eligibility: college students (including foreign) should be counted at the on-campus or off-campus residence where they sleep most of the time (even if you went home before Census Day last spring, and even if you’re a foreign citizen)
  - voter eligibility: US citizen 18+
12 / 39
Medicine
13 / 39
Data table: medicine
age  gender  heart disease  statins?  ...
29   F       yes            no        ...
57   ?       no             no        ...
?    M       no             no        ...
41   F       yes            yes       ...
...  ...     ...            ...       ...

- find similar patients?
- understand systemic healthcare needs?
- use symptoms to detect which patients have COVID-19?
- detect patients who had series of mini-strokes?
14 / 39
COVID projections: Cornell data
https://covid.cornell.edu/testing/dashboard/
16 / 39
Poll
(click “participants” and use Zoom reactions to respond to poll)
So far 0.2% of Cornell population has had COVID-19. I think X% of Cornell population will get COVID this semester, where X is
- (yes) < 0.5%
- (no) 0.5-1%
- (go slower) 1-5%
- (go faster) 5-10%
- (coffee) > 10%
17 / 39
Poll
(click “participants” and use Zoom reactions to respond to poll)
I think
- (yes) I will get COVID
- (no) I will not get COVID
- (coffee) I’ve already had COVID
18 / 39
Poll
(click “participants” and use Zoom reactions to respond to poll)
I think a vaccine will be widely available by
- (yes) November
- (no) January
- (slower) Spring 2021
- (faster) Summer 2021
- (coffee) later
- (down) never
19 / 39
Data for projections: simulation
- simulate model given parameter values
- learn parameter values to match data
https://github.com/ORIE4741/demos/blob/master/SIR.ipynb
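The two steps above can be sketched with a deterministic SIR model. This is not the code from the linked notebook: the function names, the forward-Euler integration, the grid search over `beta`, and every parameter value are assumptions chosen for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Deterministic SIR model on population fractions, integrated with
    forward Euler steps. Returns one (s, i, r) tuple per day, including day 0."""
    s, i, r = s0, i0, r0
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r)]
    for _day in range(days):
        for _step in range(steps_per_day):
            new_infections = beta * s * i * dt   # susceptible -> infected
            new_recoveries = gamma * i * dt      # infected -> recovered
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


def fit_beta(observed_infected, gamma, s0, i0, r0, candidates):
    """Grid search: return the candidate beta whose simulated infected curve
    has the smallest squared error against the observed curve."""
    def loss(beta):
        simulated = simulate_sir(beta, gamma, s0, i0, r0,
                                 days=len(observed_infected) - 1)
        return sum((i - obs) ** 2
                   for (_, i, _), obs in zip(simulated, observed_infected))
    return min(candidates, key=loss)


# Hypothetical parameter values; 0.002 echoes the 0.2% figure from the poll.
true_beta, gamma = 0.4, 0.1
traj = simulate_sir(true_beta, gamma, s0=0.998, i0=0.002, r0=0.0, days=60)

# Pretend the simulated infected fractions are the data, then recover beta.
observed = [i for _, i, _ in traj]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
beta_hat = fit_beta(observed, gamma, 0.998, 0.002, 0.0, candidates)
print(beta_hat)
```

Here the "data" come from the same model, so the fit is exact; with real case counts the loss would never reach zero and the choice of loss function starts to matter.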
20 / 39
Simulation: results
by varying parameters, we see
- with infrequent testing (weekly+), pandemic is nearly impossible to control
- people who go to big parties get COVID
  - if too many people go to big parties, Cornell shuts down
  - if a moderate fraction (e.g., 10%) of people go to big parties, classes shut down
  - if < 1% go to big parties, Cornell stays open
- if PPE is effective,
  - people who just go to classes are nearly always ok
  - people who go to small parties might get COVID
- if PPE is not effective,
  - people who just go to classes might get COVID
  - people who go to small parties have even odds of getting COVID
21 / 39
Simulation: assumptions
simulation is evocative, not realistic. assumes
- fluid limit: large population, no randomness
- only 3 groups that mix internally
- only 3 modes of contact: parties, classes, external (e.g., grocery shopping)
- no latency period
- actions don’t depend on infection rates
- quarantine is effective
- ...
22 / 39
Application areas
- health
- politics
- governance
- advertising
- retail
- ecommerce
- finance
- ...
23 / 39
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
24 / 39
Big
- NASA, 1997: “taxing the capacities of main memory, local disk, and even remote disk”
- OED, 2015: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”
- 4 Vs: [figure] [1]
- 5th V: value
[1] image courtesy of Kim Minor @ IBM
25 / 39
Big: our definition
Definition
An algorithm for big data is one with computational and memory requirements that scale linearly (or nearly linearly) in the size of the data.
why this definition? independent of
- hardware
- business
if you use only algorithms for big data, then you’re working with big data
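As one illustration of the definition, here is a sketch of an algorithm whose cost is linear in the size of the data: a single pass that computes the mean and sample variance with Welford's running update. The example and the function name are assumptions added for illustration; the slide does not name a specific algorithm.

```python
def streaming_mean_variance(values):
    """One pass over the data: O(n) time and O(1) extra memory,
    so the values never need to fit in memory all at once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Works on any iterable, including a generator that reads a file line by line.
stream = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, mean, variance = streaming_mean_variance(stream)
print(count, mean, variance)
```

By contrast, anything that compares every pair of examples costs on the order of n^2 operations and falls outside the definition.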
26 / 39
Messy
- noisy: some (or all) values suffer errors, inaccuracies, or malicious corruption
- missing: some values are missing, inconsistent, not recorded, or lost
- heterogeneous: values of many different types
  - continuous values (e.g., 4.2, π)
  - discrete values (e.g., 0, 4, 994)
  - nominal values (e.g., apple, banana, pear)
  - ordinal values (e.g., rarely, sometimes, often)
  - graphs or networks (e.g., person 1 is friends with person 2)
  - text (e.g., doctor’s note describing symptoms)
  - sets (e.g., items purchased)
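A small sketch of why heterogeneous types need different treatment, using the slide's own examples. The encodings shown (one-hot for nominal values, integer rank for ordinal values, `None` for missing) are common conventions assumed here for illustration, and the record and function names are hypothetical.

```python
ORDINAL_LEVELS = ["rarely", "sometimes", "often"]  # order carries meaning
NOMINAL_LEVELS = ["apple", "banana", "pear"]       # order is arbitrary


def encode_ordinal(value, levels=ORDINAL_LEVELS):
    """Map an ordinal value to its rank (0, 1, 2, ...); keep missing as None."""
    return None if value is None else levels.index(value)


def encode_nominal(value, levels=NOMINAL_LEVELS):
    """One-hot encode a nominal value; a missing value stays missing everywhere."""
    if value is None:
        return [None] * len(levels)
    return [1 if value == level else 0 for level in levels]


# Hypothetical record mixing nominal, ordinal, continuous, and missing values.
record = {"fruit": "banana", "frequency": "often", "amount": 4.2, "count": None}

encoded = (
    encode_nominal(record["fruit"])
    + [encode_ordinal(record["frequency"]), record["amount"], record["count"]]
)
print(encoded)
```

Ranking the fruit as 0, 1, 2 would invent an order that is not in the data, which is why the nominal column gets one indicator per level instead.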
27 / 39
Learning
- machine learning?
- human learning?
- when data is big and messy, machine help is essential for human learning!
28 / 39
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
29 / 39
Data table
n examples (patients, respondents, households, assets)
d features (tests, questions, sensors, times)

\[
A = \begin{bmatrix}
a_{11} & \cdots & a_{1d} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nd}
\end{bmatrix}
\]

- $a_i$ is the ith row of $A$: feature vector for the ith example
- $a_{:j}$ is the jth column of $A$: values for the jth feature across all examples
- $a_{ij}$ is the jth feature of the ith example
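The notation maps directly onto a list of rows. A minimal sketch, assuming a hypothetical 3 × 2 table (ages and incomes borrowed from the politics example) and 1-based indices to match the slide; the helper names are made up for illustration.

```python
# Hypothetical data table: n = 3 examples (rows), d = 2 features (columns).
A = [
    [29, 53000],
    [57, 19000],
    [41, 23000],
]
n, d = len(A), len(A[0])


def row(A, i):
    """a_i: feature vector of the ith example (1-indexed, as on the slide)."""
    return A[i - 1]


def column(A, j):
    """a_{:j}: values of the jth feature across all examples (1-indexed)."""
    return [example[j - 1] for example in A]


def entry(A, i, j):
    """a_{ij}: the jth feature of the ith example (1-indexed)."""
    return A[i - 1][j - 1]


print(n, d)
print(row(A, 2), column(A, 1), entry(A, 3, 2))
```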
30 / 39
Supervised learning
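- identify one column of data that we want to predict

\[
A = \begin{bmatrix}
x_{11} & \cdots & x_{1,d-1} & y_1 \\
\vdots & \ddots & \vdots & \vdots \\
x_{n1} & \cdots & x_{n,d-1} & y_n
\end{bmatrix}
= \begin{bmatrix} X & y \end{bmatrix}
\]

- $x_i \in \mathcal{X}$ for $i = 1, \ldots, n$ are rows of $X$
- $y_i \in \mathcal{Y}$ for $i = 1, \ldots, n$ are entries of $y$
- we believe there is a mapping $f : \mathcal{X} \to \mathcal{Y}$ with $y_i \approx f(x_i)$
- our goal is to learn $f$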
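A sketch of this setup in code: split $A$ into $X$ and $y$, then build some $f$ with $y_i \approx f(x_i)$. The 1-nearest-neighbor rule used for $f$ is only a stand-in picked for brevity, and the toy data and function names are assumptions; the slide does not commit to any particular learning method.

```python
def split_features_target(A):
    """Split A = [X y]: the last column is the target y, the rest is X."""
    X = [row[:-1] for row in A]
    y = [row[-1] for row in A]
    return X, y


def nearest_neighbor_predictor(X, y):
    """Return f, where f(x) is the label of the closest training row
    under squared Euclidean distance (a stand-in for a learned mapping)."""
    def f(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x_i, x)) for x_i in X]
        return y[distances.index(min(distances))]
    return f


# Hypothetical table: two features per row, then the label to predict.
A = [
    [1.0, 1.0, 0],
    [1.5, 0.5, 0],
    [4.0, 4.5, 1],
    [5.0, 4.0, 1],
]
X, y = split_features_target(A)
f = nearest_neighbor_predictor(X, y)

print(f([1.2, 0.9]), f([4.4, 4.4]))
```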
31 / 39
Example: supervised learning for credit decisioning
- goal: decide which credit card applicants should be approved
- input space: entries of $x \in \mathbb{R}^d$ correspond to fields in credit application
  - e.g., salary, years in residence, outstanding debt, number of credit lines, ...
- output space: $\mathcal{Y} = \{+1, -1\}$
  - +1 means approve
  - −1 means reject
- data: $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$: applications of previous customers, and credit approval decisions made by humans
Q: what are potential problems with using a model built with this data?
A: wrong objective: human decision may not be correct decision; covariate shift: future data may look unlike past data; ...
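To make the input and output spaces concrete, here is a sketch that learns a linear rule from past decisions with the perceptron update. The perceptron is an assumed choice (the slide does not name a method), the four applicants are invented, and the features are limited to salary and outstanding debt in units of $10k.

```python
def predict(w, b, x):
    """Linear classifier: +1 (approve) if w . x + b > 0, else -1 (reject)."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score > 0 else -1


def train_perceptron(data, max_epochs=100):
    """Perceptron rule: after each mistake on (x, y), move w by y * x and b by y.
    Stops early once a full pass over the data makes no mistakes."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _epoch in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, b, x) != y:
                w = [w_j + y * x_j for w_j, x_j in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


# Hypothetical past applications: ([salary, outstanding debt], human decision).
D = [
    ([9.0, 1.0], +1),
    ([7.0, 2.0], +1),
    ([3.0, 4.0], -1),
    ([2.0, 6.0], -1),
]
w, b = train_perceptron(D)
print(w, b, [predict(w, b, x) for x, _ in D])
```

A rule fit this way reproduces the past human decisions, so both problems named in the answer above carry over to it unchanged.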
32 / 39
Exercise: formalizing real problems
- identify a prediction goal
- identify the input space $\mathcal{X}$
- identify the output space $\mathcal{Y}$
- identify the data $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$ you’d like to use
- what kinds of noise do you expect in the data?
33 / 39
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
34 / 39
Course objectives (I)
- plot
- predict
- cluster
- impute
- denoise
- recommend
- understand
35 / 39
Course objectives (II)
this course is about
- algorithms for big messy data
- learning to ask the right questions
at the end of the course, you should have learned
- at least one method to solve any problem
- machine learning is not magic; it’s math
- when not to trust your solution
the rest you can learn online...
36 / 39
Next steps
- ASAP:
  - enroll (or drop) (or get on wait list)
  - fill out course survey (provides access to Campuswire Q&A)
  - sign up for iClicker REEF
- Thursday 9/10/2020 9:30am: homework 0
links on course website: https://people.orie.cornell.edu/mru8/orie4741/
37 / 39
Questions?
https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk
38 / 39