ES-2 Lecture: Fitting models to data
TRANSCRIPT
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Big picture: turning data into features (and ultimately, information)
• Problem statement: given a set of noisy measurements, create features (lines, curves, coefficients in an equation) by fitting data to a model
• Why this is helpful, #1: Using features, information can be stored / transmitted more efficiently than if we stored raw measurements
Example: an underwater robot is mapping the seabed, and sends back data with a low-speed acoustic modem
Example: you want to watch a movie on your phone
• Why this is helpful, #2: We may have a model we trust, but need to find some coefficients.
Example: equations of drag force are well known, but drag coefficients need to be estimated from data
(https://www.grc.nasa.gov/www/Wright/airplane/drageq.html)
• Why this is helpful, #3: Use models to understand or interpret our data
– Does cancer risk grow linearly with # cigarettes/day, or is another model better?
– If linear, what’s the slope – how fast does risk increase as # cigarettes increases?
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Fitting 2 points to a line (from middle school!)
• Model: y = m x + b
• Points: (3, 1) and (0.7, 3.1)
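The two points give two equations, 1 = 3m + b and 3.1 = 0.7m + b, which the middle-school method solves by elimination. A quick sketch (illustrative Python, not from the slides):

```python
# Fit y = m*x + b through (3, 1) and (0.7, 3.1).
# Subtracting the two equations cancels b, leaving m.
x1, y1 = 3.0, 1.0
x2, y2 = 0.7, 3.1

m = (y2 - y1) / (x2 - x1)   # slope from the two points
b = y1 - m * x1             # back-substitute to get the intercept

print(m, b)  # m ≈ -0.913, b ≈ 3.739
```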
A more flexible approach: matrices
• Model: y = m x + b
• Points: (3, 1) and (0.7, 3.1)
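In matrix form, each point contributes one row of [x 1][m; b] = [y], so here [3 1; 0.7 1][m; b] = [1; 3.1]. A sketch using Python/NumPy (the course examples themselves appear to use MATLAB, so this is just for illustration):

```python
import numpy as np

# Each data point (x, y) contributes one row [x, 1] on the left
# and one entry y on the right of  A @ [m, b] = y
A = np.array([[3.0, 1.0],
              [0.7, 1.0]])
y = np.array([1.0, 3.1])

m, b = np.linalg.solve(A, y)  # exact solution: 2 unknowns, 2 equations
```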
Warning: making a slight detour
• This week, we’re talking about line fits
• The next few slides will talk about polynomial fits, as a way to reinforce the idea from the last slide
Pair work: Fit a quadratic to 3 points: model rocket velocity vs. time

Time  Speed
5     106.8
8     167.2
12    280.4

Fit data to: v(t) = a t² + b t + c
Write down a matrix version of this problem (same approach as the line fit)
Clicker: what is the correct matrix equation?

Time  Speed
5     106.8
8     167.2
12    280.4

Fit data to: v(t) = a t² + b t + c

A) [1 5 25; 1 8 64; 1 12 144] [a; b; c] = [106.8; 167.2; 280.4]
B) [25 5 1; 144 12 1; 64 8 1] [a; b; c] = [106.8; 280.4; 167.2]
C) [25 5 1; 64 8 1; 144 12 1] [106.8; 167.2; 280.4] = [a; b; c]
A clicker question for planning upcoming lectures
Should I review what standard deviation is, before referring to it in lecture?
A. No, thanks – very familiar with it
B. Yes, please – it’s been a while
C. Yes, please – never covered this
Something to notice about fits
Solution is exact when # unknowns = # points
– If you draw a straight line through 2 points, it exactly goes through those points
– Same thing for a quadratic with 3 points
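The rocket pair-work is one such case: three unknowns (a, b, c), three points, so the fit is exact and the residuals are zero. A quick numerical check (illustrative Python/NumPy sketch):

```python
import numpy as np

# v(t) = a*t^2 + b*t + c through the three (time, speed) points
t = np.array([5.0, 8.0, 12.0])
v = np.array([106.8, 167.2, 280.4])

A = np.column_stack([t**2, t, np.ones_like(t)])  # rows [t^2, t, 1]
a, b, c = np.linalg.solve(A, v)  # 3 unknowns, 3 equations: exact

residual = v - (a * t**2 + b * t + c)  # all ~0 for an exact fit
```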
END OF DETOUR – Back to line fits
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
[Figure: scatter plot of car data – Miles/gallon (0–50) vs. Weight (1500–5000)]
Think about fitting a straight line to this data (car mpg vs. weight)
Observations:
1. More data is better – avoids errors
2. There is no line that perfectly matches this data
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares
Definition: Residual error
• The ‘residual’ is what’s left over after we subtract the fit from the data: $e_i = y_i - (m x_i + b)$
“Least-squares” approach
• Define the “best” fit to be the one that gives the smallest sum of squared errors
  – hence, ‘least squares’
• For a line, this means we want to pick slope m, intercept b to minimize:

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$
“Least-squares” approach
• Least-squares is a Big Idea. It shows up all over the place – data processing, stats, etc.
• We’ll spend this week and next on LS
  – how to find the best-fit solution
  – how to extend it to models besides straight lines
• It’s the basic tool you’ll use in HW8 – HW10
Clicker – Which fit is best, under the least-squares criterion?

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$

Data  Fit 1  Fit 2
1.5   1.0    1.7
2.1   2.0    1.9
3.0   3.9    2.8

A) Fit 1
B) Fit 2
C) Both equally good
D) Not sure
Solution approach
• How do we pick slope, intercept to minimize $S_r$?
• The most commonly used methods (efficient for big datasets) will be covered on Thursday
• Simple: exhaustive or brute-force search
  – Pick a range of likely values for b, m
  – Try out each possible fit
  – Tabulate $S_r$, keep the best answer

$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b - m x_i)^2$
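The brute-force recipe can be sketched in a few lines (illustrative Python; the dataset and the grid ranges below are made-up assumptions, not from the lecture):

```python
import numpy as np

# Made-up data, roughly y = 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

best_m, best_b, best_Sr = 0.0, 0.0, np.inf
for m in np.linspace(0.0, 4.0, 201):       # candidate slopes
    for b in np.linspace(-2.0, 4.0, 301):  # candidate intercepts
        Sr = np.sum((y - b - m * x) ** 2)  # sum of squared errors
        if Sr < best_Sr:
            best_m, best_b, best_Sr = m, b, Sr
# best_m, best_b land near the true least-squares fit (about 1.94 and 1.09)
```

Brute force is easy to write but scales badly: the grid must be fine enough to hit the answer, and the cost grows with every extra unknown.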
Outline
• Motivation: why fit models to data?
• Special case (exact solution): # unknowns in model = # datapoints
• Typical case (approximate solution): # unknowns in model < # datapoints
  – Defining what “best” fit means
  – Least squares criterion
  – One way to solve – brute-force search
• Some alternatives to least-squares – and problems with LS
Problem: Least squares is sensitive to outliers
• An outlier is a data point that doesn’t match the usual pattern
• Often, it’s from bad data (an error in measurement)
[Figures: “Fit, no outliers” vs. “Fit with outlier”]
Problem: Least squares is sensitive to outliers
Because we square the error, the outlier has a big effect

          Fit ignoring outlier    Fit including outlier
Point #   err       err²          err       err²
1         0.07      0.005         -1.18     1.4
2         0.10      0.010         -0.70     0.49
3         -0.04     0.001         -0.60     0.26
4         -0.09     0.008         -1.08     1.17
5         4.42      19.53         3.5       12.25
Sr                  19.56                   15.7
Possible fix #1: Use a different definition of “best” fit
• Try minimizing the sum of absolute values (not squares!):

$S_{abs} = \sum_{i=1}^{n} |e_i|$

          Least absolute value fit    Least-squares fit
Point #   err                         err
1         0.05                        -1.18
2         0.14                        -0.70
3         -0.06                       -0.60
4         -0.00                       -1.08
5         4.5                         3.5
sum       4.7                         7.1
See outlierExample.m on Trunk
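outlierExample.m itself isn’t reproduced here, but the idea can be sketched in Python (the data below is made up for illustration, and a brute-force grid search stands in for whatever solver the original file uses): the same search works with Σ|eᵢ| in place of Σeᵢ².

```python
import numpy as np

# Made-up data: points on y = x, plus one outlier at x = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def grid_fit(x, y, cost):
    """Brute-force search over (m, b); keep the fit with the lowest cost."""
    best = (0.0, 0.0, np.inf)
    for m in np.linspace(0.0, 3.0, 151):
        for b in np.linspace(-2.0, 2.0, 201):
            c = cost(y - b - m * x)
            if c < best[2]:
                best = (m, b, c)
    return best[0], best[1]

m_ls, b_ls = grid_fit(x, y, lambda e: np.sum(e ** 2))       # least squares
m_abs, b_abs = grid_fit(x, y, lambda e: np.sum(np.abs(e)))  # least absolute value
# The absolute-value fit stays near y = x; the squared-error fit
# is pulled toward the outlier.
```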
[Figure: the Miles/gallon vs. Weight scatter plot, as before]
Possible fix #2: random sampling of the data
• Randomly pick 2 points, and fit a line to them
• Store the computed slope and intercept
• Repeat many times
• Estimated slope = median of all stored slopes; same for intercept
That works – and helps ignore outliers!
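The random-sampling recipe above can be sketched as follows (the synthetic data, trial count, and random seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data on y = 2x + 1, with one large outlier
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] += 30.0  # corrupt the last point

slopes, intercepts = [], []
for _ in range(500):
    i, j = rng.choice(len(x), size=2, replace=False)  # pick 2 points
    m = (y[j] - y[i]) / (x[j] - x[i])  # line through the pair
    b = y[i] - m * x[i]
    slopes.append(m)
    intercepts.append(b)

m_est = np.median(slopes)      # pairs that hit the outlier land in the
b_est = np.median(intercepts)  # tails, so the medians ignore them
```

Only a minority of the random pairs touch the outlier, so their skewed slopes end up in the tails of the list and the median recovers the clean fit, m ≈ 2 and b ≈ 1.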