Assignment 1 - Linear Regression
Durand Sinclair
01/04/2017
Definition
Linear regression is a mathematical technique for making predictions from continuous data. In contrast with an average, whose prediction is based on a single column of numbers, linear regression draws on additional columns, which reduces the unexplained variation. We can measure the predictive power of our regression with a metric called R squared, which tells us what percentage of the variation in the column we are predicting can be explained by the other columns, compared with simply using that column's mean.
Example, with pictures
Let's pretend we work at a restaurant, and want to predict what we’ll make in tips. We record tips from the last 6 customers to predict what the next tip will be. Here's our data:
## # A tibble: 6 × 2
##   `Meal ID` `Tips ($)`
##   <chr>          <dbl>
## 1 Meal 1             5
## 2 Meal 2            17
## 3 Meal 3            11
## 4 Meal 4             8
## 5 Meal 5            14
## 6 Meal 6             5
The best estimate for our next tip would be the average, which in this case is $10. Let's draw a line to represent this.
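We can confirm this in R. This is a minimal sketch: the vector name `tips` is an assumption, chosen to hold the six values from the table above.

```r
# Tips ($) from the six meals in the table above (vector name assumed)
tips <- c(5, 17, 11, 8, 14, 5)

# With no other information, the mean is our best single-number prediction
mean(tips)
## [1] 10
```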
But how good is this prediction? To figure this out, we could measure the distance from each point to the mean.
Let's do this on our chart. First, turn the bars into dots ...
... and then draw lines from each dot to the mean.
We now know how far away each dot is from the mean. How do we turn that into a single metric? Let's calculate the total distance by adding up these line segments.
Unfortunately, adding up all the signed distances gives us zero, because the points above the mean cancel out the points below it. A better solution is to add up the squared distances, so that we have no negative numbers.
In this case, the total amount of variation adds up to 120 units.
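The two calculations above can be checked directly in R (the vector name `tips` is an assumption):

```r
tips <- c(5, 17, 11, 8, 14, 5)
deviations <- tips - mean(tips)

# The signed deviations cancel out to exactly zero ...
sum(deviations)
## [1] 0

# ... so we square them first, giving the total variation of 120
sum(deviations^2)
## [1] 120
```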
To reduce the variation, we can add another column of data and plot it on another axis. Let's say we think that the cost of the meal affects the tip. Here's the data for meal cost:
## # A tibble: 6 × 3
##   id      tips  cost
##   <chr>  <dbl> <dbl>
## 1 Meal 1     5    34
## 2 Meal 2    17   108
## 3 Meal 3    11    64
## 4 Meal 4     8    88
## 5 Meal 5    14    99
## 6 Meal 6     5    51
Let's chart meal cost, with Cost on the x axis and Tip on the y axis ...
... and draw a line of best fit.
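The line of best fit is the one that minimises the sum of squared residuals, and its slope and intercept have closed-form least-squares formulas. A minimal sketch of that calculation, assuming vectors named `tips` and `cost` holding the table's values:

```r
tips <- c(5, 17, 11, 8, 14, 5)
cost <- c(34, 108, 64, 88, 99, 51)

# Least-squares slope: covariance of cost and tips over variance of cost
slope     <- sum((cost - mean(cost)) * (tips - mean(tips))) / sum((cost - mean(cost))^2)
# The line passes through the point of means, which pins down the intercept
intercept <- mean(tips) - slope * mean(cost)

# slope comes out to roughly 0.1462 and intercept to roughly -0.8203,
# matching the lm() output shown later
c(intercept = intercept, slope = slope)
```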
We can now see how far away each tip is from that line. (These distances are called "residuals".)
If you add up the squared distances, you get around 30:
The total variation has been reduced from 120 units to about 30 units, a reduction of 75%. This means that meal cost can explain 75% of the variation in tips.
## # A tibble: 7 × 3
##   Labels             `Tip Residual` `Tip and Cost Residual`
##   <chr>                       <dbl>                   <dbl>
## 1 Squared Residual 1             25                    0.72
## 2 Squared Residual 2             49                    4.11
## 3 Squared Residual 3              1                    6.06
## 4 Squared Residual 4              4                   16.38
## 5 Squared Residual 5             16                    0.12
## 6 Squared Residual 6             25                    2.68
## 7 TOTAL                         120                   30.07
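The percentage of variation explained can be computed from these two totals. A minimal sketch, assuming vectors named `tips` and `cost` holding the table's values:

```r
tips <- c(5, 17, 11, 8, 14, 5)
cost <- c(34, 108, 64, 88, 99, 51)

fit <- lm(tips ~ cost)

ss_total    <- sum((tips - mean(tips))^2)  # variation around the mean: 120
ss_residual <- sum(fit$residuals^2)        # variation left over after using cost: ~30

# R squared is the fraction of the original variation that the line explains
1 - ss_residual / ss_total
## [1] 0.7493759
```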
Adding a third dimension can often explain even more of the variation, making our prediction even more accurate.
How To Do It
Base R provides the lm() function for linear regression. You specify the dependent and independent variables as a formula (dependent ~ independent), along with the data set. The object lm() returns contains the slope and intercept of the line of best fit ...
fit <- lm(formula = Tip ~ Cost, data = tbl_tips2)
fit
## 
## Call:
## lm(formula = Tip ~ Cost, data = tbl_tips2)
## 
## Coefficients:
## (Intercept)         Cost  
##     -0.8203       0.1462
... but also other useful components, such as residuals and fitted values, which you can discover by typing the object's name followed by "$" and pressing Tab.
fit$coefficients
## (Intercept)        Cost 
##  -0.8202568   0.1462197
fit$residuals
##          1          2          3          4          5          6 
##  0.8487874  2.0285307  2.4621969 -4.0470756  0.3445078 -1.6369472
fit$fitted.values
##         1         2         3         4         5         6 
##  4.151213 14.971469  8.537803 12.047076 13.655492  6.636947
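A useful sanity check on these two components: each fitted value plus its residual recovers the original observation. A minimal sketch, assuming vectors named `tips` and `cost` holding the table's values:

```r
tips <- c(5, 17, 11, 8, 14, 5)
cost <- c(34, 108, 64, 88, 99, 51)
fit  <- lm(tips ~ cost)

# Every observation decomposes into the model's prediction plus its residual
all.equal(unname(fit$fitted.values + fit$residuals), tips)
## [1] TRUE
```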
R-squared can be calculated with the summary.lm() function.
rsquared <- summary.lm(fit)$r.squared
rsquared
## [1] 0.7493759
Conclusion
Regression is a useful mathematical technique that uses extra dimensions of continuous data to make more accurate predictions. In R, the lm() function fits the regression, and the summary.lm() function calculates R squared to quantify how well the regression fits.