ES-2 Lecture: Fitting models to data
TRANSCRIPT
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Big picture: turning data into features (and ultimately, information)
• Problem statement: given a set of noisy measurements, create features (lines, curves, coefficients in an equation) by fitting data to a model
• Why this is helpful, #1: Using features, information can be stored / transmitted more efficiently than if we stored raw measurements
Example: an underwater robot is mapping the seabed, and sends back data with a low-speed acoustic modem
Example: you want to watch a movie on your phone
• Why this is helpful, #2: We may have a model we trust, but need to find some coefficients.
Example: equations of drag force are well known, but drag coefficients need to be estimated from data
(https://www.grc.nasa.gov/www/Wright/airplane/drageq.html)
• Why this is helpful, #3: Use models to understand or interpret our data
– Does cancer risk grow linearly with # cigarettes/day, or is another model better?
– If linear, what’s the slope – how fast does risk increase as # cigarettes increases?
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Fitting 2 points to a line (from middle school!)
• Model: y = m x + b
• Points: (3, 1) and (0.7, 3.1)
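The two points give two equations, 1 = 3m + b and 3.1 = 0.7m + b, which the middle-school method solves by elimination. A quick sketch (illustrative Python, not from the slides):

```python
# Fit y = m*x + b through (3, 1) and (0.7, 3.1).
# Subtracting the two equations cancels b, leaving m.
x1, y1 = 3.0, 1.0
x2, y2 = 0.7, 3.1

m = (y2 - y1) / (x2 - x1)   # slope from the two points
b = y1 - m * x1             # back-substitute to get the intercept

print(m, b)  # m ≈ -0.913, b ≈ 3.739
```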
A more flexible approach: matrices
• Model: y = m x + b
• Points: (3, 1) and (0.7, 3.1)
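In matrix form, each point contributes one row of [x 1][m; b] = [y], so here [3 1; 0.7 1][m; b] = [1; 3.1]. A sketch using Python/NumPy (the course examples themselves appear to use MATLAB, so this is just for illustration):

```python
import numpy as np

# Each data point (x, y) contributes one row [x, 1] on the left
# and one entry y on the right of  A @ [m, b] = y
A = np.array([[3.0, 1.0],
              [0.7, 1.0]])
y = np.array([1.0, 3.1])

m, b = np.linalg.solve(A, y)  # exact solution: 2 unknowns, 2 equations
```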
Warning: making a slight detour
• This week, we’re talking about line fits
• The next few slides will talk about polynomial fits, as a way to reinforce the idea from the last slide
Pair work: Fit a quadratic to 3 points: model rocket velocity vs. time

Time  Speed
5     106.8
8     167.2
12    280.4

Fit data to: v(t) = a t² + b t + c
Write down a matrix version of this problem (same approach as the line fit)
Clicker: what is the correct matrix equation?

Time  Speed
5     106.8
8     167.2
12    280.4

Fit data to: v(t) = a t² + b t + c

A) [1 5 25; 1 8 64; 1 12 144] [a; b; c] = [106.8; 167.2; 280.4]
B) [25 5 1; 144 12 1; 64 8 1] [a; b; c] = [106.8; 280.4; 167.2]
C) [25 5 1; 64 8 1; 144 12 1] [106.8; 167.2; 280.4] = [a; b; c]
A clicker question for planning upcoming lectures
Should I review what standard deviation is, before referring to it in lecture?
A. No, thanks – very familiar with it
B. Yes, please – it’s been a while
C. Yes, please – never covered this
Something to notice about fits
Solution is exact when # unknowns = # points
– If you draw a straight line through 2 points, it exactly goes through those points
– Same thing for a quadratic with 3 points
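The rocket pair-work is one such case: three unknowns (a, b, c), three points, so the fit is exact and the residuals are zero. A quick numerical check (illustrative Python/NumPy sketch):

```python
import numpy as np

# v(t) = a*t^2 + b*t + c through the three (time, speed) points
t = np.array([5.0, 8.0, 12.0])
v = np.array([106.8, 167.2, 280.4])

A = np.column_stack([t**2, t, np.ones_like(t)])  # rows [t^2, t, 1]
a, b, c = np.linalg.solve(A, v)  # 3 unknowns, 3 equations: exact

residual = v - (a * t**2 + b * t + c)  # all ~0 for an exact fit
```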
END OF DETOUR – Back to line fits
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
[Figure: scatter plot of car data – Miles/gallon (0–50) vs. Weight (1500–5000)]
Think about fitting a straight line to this data (car mpg vs. weight)
Observations:
1. More data is better – avoids errors
2. There is no line that perfectly matches this data
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Definition: Residual error
• The ‘residual’ is what’s left over after we subtract the fit from the data: $e_i = y_i - (m x_i + b)$
“Least-squares” approach
• Define the “best” fit to be the one that gives the smallest sum of squared errors
  – hence, ‘least squares’
• For a line, this means we want to pick slope m, intercept b to minimize:

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$
“Least-squares” approach
• Least-squares is a Big Idea. It shows up all over the place – data processing, stats, etc.
• We’ll spend this week and next on LS
  – how to find the best-fit solution
  – how to extend it to models besides straight lines
• It’s the basic tool you’ll use in HW8 – HW10
Clicker – Which fit is best, under the least-squares criterion?

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$

Data  Fit 1  Fit 2
1.5   1.0    1.7
2.1   2.0    1.9
3.0   3.9    2.8

A) Fit 1
B) Fit 2
C) Both equally good
D) Not sure
Solution approach
• How do we pick slope, intercept to minimize $S_r$?
• The most commonly used methods (efficient for big datasets) will be covered on Thursday
• Simple: exhaustive or brute-force search
  – Pick a range of likely values for b, m
  – Try out each possible fit
  – Tabulate $S_r$, keep the best answer

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$
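The brute-force recipe can be sketched in a few lines (illustrative Python; the dataset and the grid ranges below are made-up assumptions, not from the lecture):

```python
import numpy as np

# Made-up data, roughly y = 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

best_m, best_b, best_Sr = 0.0, 0.0, np.inf
for m in np.linspace(0.0, 4.0, 201):       # candidate slopes
    for b in np.linspace(-2.0, 4.0, 301):  # candidate intercepts
        Sr = np.sum((y - b - m * x) ** 2)  # sum of squared errors
        if Sr < best_Sr:
            best_m, best_b, best_Sr = m, b, Sr
# best_m, best_b land near the true least-squares fit (about 1.94 and 1.09)
```

Brute force is easy to write but scales badly: the grid must be fine enough to hit the answer, and the cost grows with every extra unknown.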
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares – and problems with LS
Problem: Least squares is sensitive to outliers
• An outlier is a data point that doesn’t match the usual pattern
• Often, it’s from bad data (an error in measurement)
[Figures: “Fit, no outliers” vs. “Fit with outlier”]
Problem: Least squares is sensitive to outliers
Because we square the error, the outlier has a big effect

          Fit ignoring outlier    Fit including outlier
Point #   err       err²          err       err²
1         0.07      0.005         -1.18     1.4
2         0.10      0.010         -0.70     0.49
3         -0.04     0.001         -0.60     0.26
4         -0.09     0.008         -1.08     1.17
5         4.42      19.53         3.5       12.25
Sr                  19.56                   15.7
Possible fix #1: Use a different definition of “best” fit
• Try minimizing the sum of absolute values (not squares!):

$S_{abs} = \sum_{i=1}^{n} |e_i|$

          Least absolute value fit    Least-squares fit
Point #   err                         err
1         0.05                        -1.18
2         0.14                        -0.70
3         -0.06                       -0.60
4         -0.00                       -1.08
5         4.5                         3.5
sum       4.7                         7.1
See outlierExample.m on Trunk
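outlierExample.m itself isn’t reproduced here, but the idea can be sketched in Python (the data below is made up for illustration, and a brute-force grid search stands in for whatever solver the original file uses): the same search works with Σ|eᵢ| in place of Σeᵢ².

```python
import numpy as np

# Made-up data: points on y = x, plus one outlier at x = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def grid_fit(x, y, cost):
    """Brute-force search over (m, b); keep the fit with the lowest cost."""
    best = (0.0, 0.0, np.inf)
    for m in np.linspace(0.0, 3.0, 151):
        for b in np.linspace(-2.0, 2.0, 201):
            c = cost(y - b - m * x)
            if c < best[2]:
                best = (m, b, c)
    return best[0], best[1]

m_ls, b_ls = grid_fit(x, y, lambda e: np.sum(e ** 2))       # least squares
m_abs, b_abs = grid_fit(x, y, lambda e: np.sum(np.abs(e)))  # least absolute value
# The absolute-value fit stays near y = x; the squared-error fit
# is pulled toward the outlier.
```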
[Figure: the Miles/gallon vs. Weight scatter plot, as before]
Possible fix #2: random sampling of the data
• Randomly pick 2 points, and fit a line to them
• Store the computed slope and intercept
• Repeat many times
• Estimated slope = median of all stored slopes; same for intercept
That works – and helps ignore outliers!
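The random-sampling recipe above can be sketched as follows (the synthetic data, trial count, and random seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data on y = 2x + 1, with one large outlier
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] += 30.0  # corrupt the last point

slopes, intercepts = [], []
for _ in range(500):
    i, j = rng.choice(len(x), size=2, replace=False)  # pick 2 points
    m = (y[j] - y[i]) / (x[j] - x[i])  # line through the pair
    b = y[i] - m * x[i]
    slopes.append(m)
    intercepts.append(b)

m_est = np.median(slopes)      # pairs that hit the outlier land in the
b_est = np.median(intercepts)  # tails, so the medians ignore them
```

Only a minority of the random pairs touch the outlier, so their skewed slopes end up in the tails of the list and the median recovers the clean fit, m ≈ 2 and b ≈ 1.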