ORIE 4741: Learning with Big Messy Data

Exploratory Data Analysis

Professor Udell

Operations Research and Information Engineering, Cornell

October 1, 2020

Announcements

- If you’re taking lecture async: remember to submit a participation post after each class!

- Sections start Wednesday. They are optional; attend any one you prefer. Section this week is a Julia tutorial.

- Office hours: Zoom links and times are posted on the course website.

- Gradescope is open for submission of hw0, due Thursday 9-9-19 9:30am.

- First quiz this week! It should occupy about 20 minutes; you’ll have up to half an hour to complete it. Start it anytime between 6:15pm Thursday and 9pm Friday.

(All times ET)

Questions from campuswire

- search for your question before posting a new question

- approximate date of voter registration is fine

Why julia?

- the two language problem

- julia is fast (JIT-compiled)

- julia has pleasant syntax, esp for linear algebra (MATLAB-like, but more principled)

- julia supports efficient parallelism (including multithreading)

- the julia ecosystem

for this class: you can use any language you’d like (which your TAs can read), but the course staff will only support Julia.

Topics to review

We will cover (most of) these in section, too:

- Linear algebra: invertible matrices, rank, norm, basic matrix identities. When is a matrix invertible?

- QR factorization

- Gradients (multivariate derivative)

- Projections

- SVD

- Maximum likelihood estimation

- Union bound

- Computational complexity

Why look at the data?

- detect errors in data

- check assumptions

- select appropriate models

- understand relationships among the features

- understand relationships between features and labels

How to look at the data?

- inspect raw data

- summary statistics

- visualize

American community survey

2013 ACS:

- 3M respondents, 87 economic/demographic survey questions
  - income
  - cost of utilities (water, gas, electric)
  - weeks worked per year
  - hours worked per week
  - home ownership
  - looking for work
  - use foodstamps
  - education level
  - state of residence
  - . . .

- 1/3 of responses missing

find it at https://people.orie.cornell.edu/mru8/orie4741/data/acs_2013.csv
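
A missing-data rate like the one above can be checked by streaming through the file row by row instead of loading it. What follows is a minimal sketch in Python (the course staff supports Julia; Python is used here only for illustration). It assumes missing entries appear as empty fields or the string "NA", which is an assumption and not something stated about acs_2013.csv. The function name `missing_fraction` and the inline sample rows are hypothetical.

```python
import csv
import io

# Assumption: missing entries are empty fields or the string "NA".
MISSING_TOKENS = {"", "NA"}

def missing_fraction(lines):
    """Stream rows from an iterable of CSV lines and return the
    fraction of cells that are missing, without storing the rows."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    total = 0
    missing = 0
    for row in reader:
        total += len(row)
        missing += sum(1 for cell in row if cell.strip() in MISSING_TOKENS)
    return missing / total if total else 0.0

# Hypothetical sample; the values are made up for illustration.
sample = io.StringIO(
    "HHINCOME,COSTELEC,FOODSTMP\n"
    "52000,120,0\n"
    ",95,NA\n"
    "31000,,1\n"
)
sample_fraction = missing_fraction(sample)
print(f"{sample_fraction:.2f} of cells are missing")
```

For a file on disk, an open file handle can be passed in place of the sample, for example `with open("acs_2013.csv", newline="") as f: missing_fraction(f)`.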

How do computers work?

on a laptop:

- hard disk: usually ≤ 500 GB

- memory (RAM): usually ≤ 16 GB

- many programs (e.g., Excel): substantially more limited

don’t load a giant file into memory. your computer will crash.

how big is ACS data? 3M respondents × 100 questions = 300M numbers ≈ 300MB
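
The 300MB figure works out if each number takes about one byte. The same table held in memory as 64-bit floats would take 8 bytes per number. A quick check of the arithmetic in Python, where both bytes-per-number values are assumptions chosen for illustration:

```python
respondents = 3_000_000
questions = 100  # figure used in the estimate (the survey slide lists 87)
numbers = respondents * questions

def size_in_gb(count, bytes_per_number):
    """Size in gigabytes (1 GB = 10**9 bytes) of `count` numbers."""
    return count * bytes_per_number / 1e9

# Assumption: about 1 byte per number reproduces the 300 MB estimate.
gb_compact = size_in_gb(numbers, 1)
# Assumption: every value stored as a 64-bit float (8 bytes) in memory.
gb_float64 = size_in_gb(numbers, 8)

print(f"{numbers:,} numbers")
print(f"{gb_compact:.1f} GB at 1 byte each, {gb_float64:.1f} GB at 8 bytes each")
```

Either way the table fits in 16 GB of RAM, but the in-memory size can be several times the on-disk estimate, and programs with tighter limits may still fail.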

Inspect raw data

solution for large files: technology from the 70s!

bash shell:

- “how big are these files?”: ls -lh

- “show me some lines from the file”: head, tail, less

- “how many lines are in the file?”: wc -l
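
The same three questions can be answered from inside a program without reading the whole file into memory. A minimal Python sketch follows; the helper names `head`, `count_lines`, and `size_in_bytes` are hypothetical, and a small temporary file stands in for the ACS data so the sketch runs on its own.

```python
import itertools
import os
import tempfile

def head(path, n=5):
    """Return the first n lines of a file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

def count_lines(path):
    """Count lines by streaming through the file one line at a time."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def size_in_bytes(path):
    """File size on disk, as reported by the operating system."""
    return os.path.getsize(path)

# Hypothetical stand-in file with made-up values.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("HHINCOME,COSTELEC\n52000,120\n31000,95\n")
    demo_path = tmp.name

demo_head = head(demo_path, n=2)
demo_count = count_lines(demo_path)
demo_size = size_in_bytes(demo_path)
os.remove(demo_path)

print(demo_head, demo_count, demo_size)
```

Each helper touches only as much of the file as it needs: `head` stops after n lines, and `count_lines` holds one line at a time.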

American Community Survey

Variable   Description                     Type
HHTYPE     household type                  categorical
STATEICP   state                           categorical
OWNERSHP   own home                        Boolean
COMMUSE    commercial use                  Boolean
ACREHOUS   house on ≥ 10 acres             Boolean
HHINCOME   household income                real
COSTELEC   monthly electricity bill        real
COSTWATR   monthly water bill              real
COSTGAS    monthly gas bill                real
FOODSTMP   on food stamps                  Boolean
HCOVANY    have health insurance           Boolean
SCHOOL     currently in school             Boolean
EDUC       highest level of education      ordinal
GRADEATT   highest grade level attained    ordinal
EMPSTAT    employment status               categorical
LABFORCE   in labor force                  Boolean
CLASSWKR   class of worker                 Boolean
WKSWORK2   weeks worked per year           ordinal
UHRSWORK   usual hours worked per week     real
LOOKING    looking for work                Boolean
MIGRATE1   migration status                categorical

Julia and Jupyter

- Julia is a programming language: it parses human-readable code to machine-readable code, executes it, returns the answer

- Jupyter is a protocol for interacting with a programming language.
  - Jupyter stores inputs and outputs as .ipynb files.
  - Jupyter notebooks display inputs and outputs in a browser
  - JuliaBox is an interface to a webserver running Julia

how to access?

- recommended: install anaconda distribution, then julia (go to section or see section materials for details)

- not recommended: use Google Colab https://discourse.julialang.org/t/julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/15319

Summary statistics

univariate

- mean, median, mode

- max, min, range

- variance

- . . .

explore via Julia + Jupyter notebook

https://github.com/ORIE4741/demos/blob/master/eda.ipynb

multi- (but usually just bi-)variate

- correlation, covariance

- . . .
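
The statistics listed above can be computed in a few lines. The sketch below uses Python's standard `statistics` module; it is not taken from the course notebook, and the two feature lists hold made-up values, not ACS data. The `covariance` and `correlation` helpers are written out by hand so the definitions are visible.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical values for two features (not real ACS data).
hours = [40, 35, 40, 20, 45]
income = [52, 41, 60, 18, 75]

univariate = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "mode": statistics.mode(hours),
    "min": min(hours),
    "max": max(hours),
    "range": max(hours) - min(hours),
    "variance": statistics.variance(hours),  # sample variance
}
bivariate = {
    "covariance": covariance(hours, income),
    "correlation": correlation(hours, income),
}
print(univariate)
print(bivariate)
```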

The perils of summary statistics

same mean, variance, correlation, line of best fit. . .
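
The figure for this slide is not reproduced in the transcript. A well-known illustration of the same point is Anscombe's quartet, and the sketch below assumes that kind of example: it takes two of the four Anscombe datasets, which share the same x values, and compares their summary statistics in Python. The helper names `pearson` and `best_fit` are hypothetical.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def best_fit(xs, ys):
    """Least-squares line: returns (slope, intercept)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Datasets I and II of Anscombe's quartet (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in [("I", y1), ("II", y2)]:
    slope, intercept = best_fit(x, y)
    print(
        f"dataset {name}: mean={statistics.mean(y):.2f} "
        f"var={statistics.variance(y):.2f} corr={pearson(x, y):.2f} "
        f"fit: y = {intercept:.2f} + {slope:.2f} x"
    )
```

The two printed rows agree to two decimal places, yet a scatter plot shows dataset I is roughly linear while dataset II follows a smooth curve. Only plotting reveals the difference.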

The perils of summary statistics: modern update

https://www.autodeskresearch.com/publications/samestats

What to visualize?

- plot examples across all features (usually not)

- plot features across all examples (much more common)

Best practices

- Always label your axes.

- Ensure all marks on the plot are meaningful.

- Beware of pie charts; bar charts are often easier to read.

- Beware of line plots; if your data is not continuous, try a scatter plot instead.

- Consider which curves to plot on the same axes. Make comparisons easy!

Beware of bad data

Take away

- always look at (some of) your data

- decide what question you want to answer

Questions?

https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk
