
CS7267 MACHINE LEARNING

LINEAR REGRESSION

Mingon Kang, Ph.D.

Computer Science, Kennesaw State University

* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

Correlation (r)

Measures the linear association between two variables

Shows both the nature (direction) and the strength of the relationship between two variables

Correlation lies between −1 and +1

Zero correlation indicates that there is no linear relationship between the variables

Pearson correlation coefficient: the most familiar measure of dependence between two quantities


Correlation (r)

\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}

where E is the expected value operator, cov(·,·) denotes covariance, and corr(·,·) is a widely used alternative notation for the correlation coefficient

Reference: https://en.wikipedia.org/wiki/Correlation_and_dependence
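As a concrete illustration, here is a minimal NumPy sketch of the Pearson correlation coefficient; the data values are made up for the example:

import numpy as np

def pearson_r(x, y):
    # r = cov(x, y) / (std(x) * std(y)), computed from centered vectors
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(pearson_r(x, y))          # close to +1: strong positive linear association
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value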

Coefficient of Determination (r²)

The coefficient of determination lies between 0 and 1

Represented by r², the square of the correlation coefficient

A measure of how well the regression line represents the data

If r = 0.922, then r² ≈ 0.85

This means that 85% of the total variation in y can be explained by the linear relationship between x and y in linear regression

The other 15% of the total variation in y remains unexplained
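To make the r-to-r² relationship concrete, a small sketch on illustrative data, showing that for simple linear regression r² from the correlation equals the explained fraction of the total variation:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, deg=1)        # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r2 = 1 - ss_res / ss_tot

print(r ** 2, r2)                       # the two values agree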

Linear Regression

[Figure: example data with ONE independent variable and with TWO independent variables]

Linear Regression

How to represent the data as a vector/matrix?

We assume a model:

\mathbf{y} = b_0 + \mathbf{b}\mathbf{X} + \epsilon,

where b_0 and \mathbf{b} are the intercept and slope, known as coefficients or parameters, and \epsilon is the error term (typically assumed to follow \epsilon \sim N(0, \sigma^2))
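A minimal sketch of what this model assumes, generating synthetic data with illustrative (made-up) intercept, slope, and noise level:

import numpy as np

rng = np.random.default_rng(0)

b0, b1, sigma = 1.0, 2.0, 0.5         # illustrative true parameters
x = rng.uniform(0, 10, size=100)
eps = rng.normal(0, sigma, size=100)  # error term drawn from N(0, sigma^2)
y = b0 + b1 * x + eps                 # the assumed linear model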

Linear Regression

Simple linear regression

A single independent variable is used to predict the response

Multiple linear regression

Two or more independent variables are used to predict the response

Linear Regression

How to represent the data as a vector/matrix?

Include a bias constant (intercept) in the input vector

\mathbf{X} \in \mathbb{R}^{n \times (p+1)}, \; \mathbf{y} \in \mathbb{R}^{n}, \; \mathbf{b} \in \mathbb{R}^{p+1}, \; \mathbf{e} \in \mathbb{R}^{n}

\mathbf{y} = \mathbf{X} \cdot \mathbf{b} + \mathbf{e}

\mathbf{X} = [\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p], \quad \mathbf{b} = \{b_0, b_1, b_2, \ldots, b_p\}^T

\mathbf{y} = \{y_1, y_2, \ldots, y_n\}^T, \quad \mathbf{e} = \{e_1, e_2, \ldots, e_n\}^T

where \cdot is the dot product

Equivalent to y_i = 1 \cdot b_0 + x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{ip} b_p \quad (1 \le i \le n)
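A minimal NumPy sketch of this representation, prepending a column of ones so the intercept b0 is absorbed into the design matrix (the numbers are illustrative):

import numpy as np

X_raw = np.array([[2.0, 1.0],
                  [4.0, 3.0],
                  [6.0, 5.0]])          # n = 3 samples, p = 2 predictors

ones = np.ones((X_raw.shape[0], 1))     # bias constant for the intercept
X = np.hstack([ones, X_raw])            # X in R^{n x (p+1)}

b = np.array([0.5, 2.0, -1.0])          # {b0, b1, b2}^T
y = X @ b                               # y_i = b0 + x_i1*b1 + x_i2*b2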

Linear Regression

Find the optimal coefficient vector b that makes the predictions most similar to the observations:

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} +
\begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}

Ordinary Least Squares (OLS)

\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}

Estimates the unknown parameters (b) in the linear regression model

Minimizes the sum of the squares of the differences between the observed responses and the values predicted by the linear function

Sum of squared errors = \sum_{i=1}^{n} (y_i - \mathbf{x}_{i,*} \mathbf{b})^2

Ordinary Least Squares (OLS)

Optimization

Need to minimize the error:

\min_{\mathbf{b}} J(\mathbf{b}) = \sum_{i=1}^{n} (y_i - \mathbf{x}_{i,*} \mathbf{b})^2

To obtain the optimal set of parameters (b), the derivative of the error with respect to each parameter must be zero.

Optimization

๐ฝ = ๐žT๐ž = ๐ฒ โˆ’ ๐—๐› โ€ฒ ๐ฒ โˆ’ ๐—๐›= ๐ฒโ€ฒ โˆ’ ๐›โ€ฒ๐—โ€ฒ ๐ฒ โˆ’ ๐—๐›= ๐ฒโ€ฒ๐ฒ โˆ’ ๐ฒโ€ฒ๐—๐› โˆ’ ๐›โ€ฒ๐—โ€ฒ๐ฒ + ๐›โ€ฒ๐—โ€ฒ๐—๐›= ๐ฒโ€ฒ๐ฒ โˆ’ ๐Ÿ๐›โ€ฒ๐—โ€ฒ๐ฒ + ๐›โ€ฒ๐—โ€ฒ๐—๐›

๐œ•๐žโ€ฒ๐ž

๐œ•๐›= โˆ’2๐—โ€ฒ๐ฒ + 2๐—โ€ฒ๐—๐› = 0

๐—โ€ฒ๐— ๐› = ๐—โ€ฒ๐ฒ๐› = (๐—โ€ฒ๐—)โˆ’1๐—โ€ฒ๐ฒ

Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
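A minimal sketch of this closed-form solution on synthetic data; solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known coefficients: n = 100 samples, p = 2 predictors.
n, p = 100, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
b_true = np.array([1.0, 2.0, -3.0])
y = X @ b_true + rng.normal(0, 0.1, size=n)

# Normal equations: (X^T X) b = X^T y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)                              # close to b_true

# Equivalent least-squares solver, more robust when X^T X is ill-conditioned
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_lstsq)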

Linear regression for classification

For binary classification

Encode class labels as y ∈ {0, 1} or {−1, 1}

Apply OLS

Check which class the prediction is closer to (sketched in the code below)

If class 1 is encoded as 1 and class 2 as −1:

class 1 if f(x) ≥ 0
class 2 if f(x) < 0

Linear models are NOT optimized for classification → use logistic regression instead
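A minimal sketch of the OLS-as-classifier idea above, on made-up two-class data with labels encoded as ±1:

import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs: class 1 labeled +1, class 2 labeled -1.
n = 50
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n, 2))
X2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(n, 2))
X_raw = np.vstack([X1, X2])
y = np.concatenate([np.ones(n), -np.ones(n)])

X = np.hstack([np.ones((2 * n, 1)), X_raw])   # add intercept column
b = np.linalg.lstsq(X, y, rcond=None)[0]      # fit by OLS

f = X @ b                                     # f(x) for each sample
pred = np.where(f >= 0, 1, -1)                # class 1 if f(x) >= 0, else class 2
print((pred == y).mean())                     # training accuracy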

Assumptions in Linear regression

Linearity of the independent variables in the predictor

Normally a good approximation, especially for high-dimensional data

The error has a normal distribution, with mean zero and constant variance

Important for significance tests

Independent variables are independent of each other

Otherwise, it causes a multicollinearity problem: two or more predictor variables are highly correlated (a quick check is sketched below)

Such highly correlated predictors should be removed
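A quick way to screen for multicollinearity is to inspect pairwise correlations among the predictors; a minimal sketch, where the 0.9 cutoff is an illustrative choice rather than a fixed rule:

import numpy as np

rng = np.random.default_rng(0)

# Three predictors; x3 is nearly a copy of x1, so the pair (x1, x3) is collinear.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(0, 0.05, size=n)
X = np.column_stack([x1, x2, x3])

C = np.corrcoef(X, rowvar=False)          # predictor-by-predictor correlations
print(np.round(C, 2))

# Flag pairs whose |correlation| exceeds the cutoff (upper triangle only).
i, j = np.where(np.triu(np.abs(C) > 0.9, k=1))
print(list(zip(i, j)))                    # e.g., [(0, 2)] for x1 and x3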