deconstructing data sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...computational...

41
Deconstructing Data Science David Bamman, UC Berkeley Info 290 Lecture 4: Regression overview Feb 1, 2016

Upload: others

Post on 24-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Deconstructing Data ScienceDavid Bamman, UC Berkeley

Info 290

Lecture 4: Regression overview

Feb 1, 2016

Page 2: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Regression

x = the empire state building y = 17444.5625”

A mapping from input data x (drawn from instance space 𝓧) to a point y in ℝ

(ℝ = the set of real numbers)

Page 3: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

task 𝓧 𝒴

predicting box office revenue movie ℝ

Page 4: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Experiment designtraining development testing

size 80% 10% 10%

purpose training models model selectionevaluation;

never look at it until the very

end

Page 5: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Metrics• Measure difference between the prediction ŷ and

the true y

1N

N�

i=1(yi � yi)2Mean squared error

(MSE)

1N

N�

i=1|yi � yi|Mean absolute error

(MAE)

Page 6: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Linear regression

y =F�

i=1xiβi

β ∈ ℝF

(F-dimensional vector of real numbers)

0

5

10

15

20

0 5 10 15 20x

y

Page 7: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Polynomial regression

y =F�

i=1xiβa,i +

F�

i=1x2i βb,i

βa, βb ∈ ℝF

(F-dimensional vector of real numbers)

0

100

200

300

400

-20 -10 0 10 20x

x^2

Page 8: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Polynomial regression

βa, βb, βc ∈ ℝF

(F-dimensional vector of real numbers)

y =F�

i=1xiβa,i +

F�

i=1x2i βb,i +

F�

i=1x3i βc,i

-5000

0

5000

-20 -10 0 10 20x

x^3

Page 9: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Support vector machines (regression)

Probabilistic graphical models

Networks

Neural networks

Deep learning

Decision trees

Random forests

Nonlinear regression

Page 10: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Number of Parametersorder 1

(linear reg.)

order 2

order 3 y =F�

i=1xiβa,i +

F�

i=1x2i βb,i +

F�

i=1x3i βc,i

y =F�

i=1xiβa,i +

F�

i=1x2i βb,i

y =F�

i=1xiβa,i

Page 11: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

Page 12: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

labeled data

labeled data

labeled data

𝓧instance space

Page 13: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 1, training MSE = 73.4

Page 14: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 2, training MSE = 71.9

Page 15: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 3, training MSE = 60.9

Page 16: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 4, training MSE = 60.6

Page 17: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 5, training MSE = 59.1

Page 18: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 6, training MSE = 50.2

Page 19: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 7, training MSE = 49.6

Page 20: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 8, training MSE = 46.8

Page 21: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 9, training MSE = 41.2

Page 22: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 10, training MSE = 35.8

Page 23: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 11, training MSE = 21.1

Page 24: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

degree 12, training MSE = 18.4

Page 25: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4

Page 26: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

Page 27: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

136.5

Page 28: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

136.5 87.7

Page 29: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

73.4 86.3

65.0 94.7

Page 30: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Overfitting• Memorizing the nuances (and noise) of the training

data that prevents generalizing to unseen data

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

Page 31: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Sources of error• Bias: Error due to mis-specifying the relationship

between input and the output. [too few parameters, or the wrong kinds]

• Variance: Error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions?[too many parameters]

Page 32: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

X XX

XX

X

X

X

X

X

X

X

X

X

X

X XX

XX

Low variance High variance

Low bias

High bias

Image from Flach 2012

Page 33: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

X XX

XX High bias, low variance: Always

predict “Berkeley”

X

X

X

X

X

High bias, high variance: Predict most frequent city in training data

X

X

X

X

X

Low bias, high variance: many features, some of which capture true signal but capture random noise

X XX

XX

Low bias, low variance: enough features to capture the true signal

Example: geolocation on

Twitter

Page 34: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Ordinal regression• In between classification and regression

• 𝒴 is categorical (e.g.,✩, ✩✩, ✩✩✩)

• Elements of 𝒴 are ordered

• ✩ < ✩✩ • ✩✩ < ✩✩✩ • ✩ < ✩✩✩

Page 35: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

task 𝓧 𝒴

predicting star ratings movie {✩, ✩✩, ✩✩✩}

Ordinal regression

Page 36: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

• Sarah Cohen, James T. Hamilton, and Fred Turner, “Computational Journalism,” Communications of the ACM (2011)

• Sylvain Parasie, “Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of ‘big data,’” Digital Journalism (2015)

Computational Journalism

Page 37: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Computational Journalism

• “Changing how stories are discovered, presented, aggregated, monetized and archived” (Cohen et al. 2012)

• Draws on earlier tradition of computer-assisted reporting and “precision journalism” (Meyer 1972)

Page 38: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

• Database linking, e.g.:

• voting records to the deceased • press releases from different members of

congress • indictments/settlements from U.S. attorneys • documents from SEC, Pentagon, defense

contractors to note movement to industry (Cohen 2012)

• DSA database of safety status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 2015)

Computational Journalism

Page 39: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

• Information extraction: need to pull out people, places, organizations and their relationship from large (often sudden) dumps of documents.

• Analyzing the relationship between entities

Computational Journalism

Page 40: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

Computational Journalism• Data-driven stories about large-scale trends

40

Change in insured Americans under the ACA, NY Times (Oct 29, 2014)

Relationship between birth year and political views NY Times (July 7, 2014)

Page 41: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans

• Data-driven lead generation; the outliers in analysis that point to a story

Computational Journalism