correlacion y regresion 2

Upload: leonardo-torres

Post on 10-Apr-2018


  • 8/8/2019 Correlacion y Regresion 2


    Correlation & Regression

    Do heavier people burn more energy? Does wine consumption cause a decrease in heart disease?

    These questions reflect a desire to understand the

    relationship between two variables.

    What we need:

    1. A plot/graph to view the relationship

    2. Characteristics to describe

    3. Measures of the characteristics

    4. Method to make inferences about the relationship


    The graph: a Scatter Plot

    Y axis: Response variable (dependent variable)

    X axis: Explanatory variable (independent variable)


    Do heavier people burn more energy?

    Response: metabolic rate

    Explanatory: weight or mass

    Does wine consumption cause a decrease in heart

    disease?

    Response: death rate from heart disease

    Explanatory: wine consumption


    [Scatter plot: Do heavier people burn more energy? Lean body mass vs. metabolic rate. x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000]


    [Scatter plot: Is wine good for your heart? Wine consumption vs. heart disease death rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_deathrate, 100 to 300]


    Interpreting: characteristics to look for

    Patterns:

    Form (clusters, scatter, linear, ...)

    Direction (positive, negative)

    Strength (how closely points follow the form)

    Deviations:

    Outliers

    Interpret the last two scatter plots.


    Options to consider:

    Adding a categorical variable


    Scatter plot: relationship between quantitative variables

    Form: Linear is probably the most common form

    Strength: We measure the strength of a linear relationship numerically, because our eyes can deceive us!

    [Two scatter plots, each labelled "Strength?"]


    Correlation

    measures the direction and strength of a linear relationship

    Standardised value of each x

    Standardised value of each y

    Correlation is an average product of standardised values
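The slide's formula image did not survive, so the following is a minimal sketch of "an average product of standardised values". It assumes the usual textbook convention: sample standard deviations and a divisor of n - 1. The data are the 19 (Alcohol, hrt_deat) pairs from the residual table later in this deck; the function name `correlation` is illustrative only.

```python
from statistics import mean, stdev

# (Alcohol, hrt_deat) pairs copied from the residual table later in the deck.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def correlation(xs, ys):
    """Sum of products of standardised values, divided by n - 1 (assumed convention)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]    # standardised value of each x
    z_y = [(y - y_bar) / s_y for y in ys]    # standardised value of each y
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


r = correlation(wine, deaths)
print(f"r = {r:.3f}")
```

The result should be close to the Pearson correlation of -0.843 reported for these data further on.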


    Correlation = r

    Applies to quantitative variables

    Measures linear relationships only

    r has no units

    r can be between -1 and 1

    Positive r = positive association

    Negative r = negative association

    r = 0: no linear association

    r is influenced by outliers
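Two of these properties can be checked numerically. The sketch below uses the wine data from the residual table later in the deck. The rescaling factors are arbitrary choices made for illustration, not unit conversions taken from the slides.

```python
from math import sqrt

wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(wine, deaths)

# "r has no units": a change of units (positive linear rescaling) leaves r unchanged.
deaths_per_1000 = [d / 100 for d in deaths]     # per 100,000 -> per 1,000
wine_rescaled = [10 * w + 3 for w in wine]      # arbitrary illustrative rescaling
r_rescaled = pearson_r(wine_rescaled, deaths_per_1000)

# r does not depend on which variable is called x and which y.
r_swapped = pearson_r(deaths, wine)

print(f"{r:.4f} {r_rescaled:.4f} {r_swapped:.4f}")
```

All three printed values should agree, and all lie between -1 and 1.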


    Correlations: Mass (kg), Rate (cal)

    Pearson correlation of Mass(kg) and Rate(cal) = 0.865

    P-Value = 0.000

    [Scatter plot: Do heavier people burn more energy? Lean body mass vs. metabolic rate. x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000]


    [Scatter plot: Weight vs. metabolic rate, with separate symbols for Male (+) and Female (o). x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000]

    Correlations: Mass (kg)_F, Rate (cal)_F

    Pearson correlation of Mass(kg)_F and Rate(cal)_F = 0.876

    Correlations: Mass (kg)_M, Rate (cal)_M

    Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592


    Correlations: Alcohol, heart_death rate

    Pearson correlation of Alcohol and hrt_death rate = -0.843

    [Scatter plot: Is wine good for your heart? Wine consumption vs. heart disease death rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_death rate, 100 to 300]


    Correlations: Wine consumption, heart death rate

    Pearson correlation of Wine consumption and hrt death rate = -0.648

    [Scatter plot: heart disease death rate vs. wine consumption, outliers removed. x-axis: wine consumption, 1 to 4; y-axis: death rate, 150 to 300]
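The drop from -0.843 to -0.648 can be reproduced from the 19 observations in the residual table later in the deck. The slides do not say which points were removed. The sketch below assumes it was the four countries with consumption above 4, since the trimmed plot's x-axis stops at 4. The names `kept` and `r_trimmed` are illustrative.

```python
from math import sqrt

wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_all = pearson_r(wine, deaths)

# Assumption: "outliers removed" means dropping countries with consumption above 4.
kept = [(w, d) for w, d in zip(wine, deaths) if w <= 4]
r_trimmed = pearson_r([w for w, _ in kept], [d for _, d in kept])

print(f"all {len(wine)} countries: r = {r_all:.3f}")
print(f"{len(kept)} countries with consumption <= 4: r = {r_trimmed:.3f}")
```

Under that assumption the two values should land close to the -0.843 and -0.648 quoted on the slides, which shows how strongly a few extreme points can drive r.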


    Linear relationships: using a LINE

    [Scatter plot: Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_death rate, 100 to 300]

    We can summarise an overall linear form with a line; the best line is called the Regression Line.


    [Fitted line plot: death rate vs. wine consumption. death rate = 260.563 - 22.9688 wine consumption; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %. x-axis: wine consumption, 0 to 9; y-axis: death rate, 100 to 300]

    A regression line describes how a response variable changes as an explanatory variable changes. We can now predict a value of y when given an x.

    What would be the death rate

    due to heart disease if the

    average daily consumption of

    wine was 3 glasses?

    191.66 deaths per 100,000
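The arithmetic behind that answer, as a small sketch. The coefficients are the ones printed in the fitted-line output; the function name `predict_death_rate` is illustrative.

```python
def predict_death_rate(wine_consumption):
    """Predicted heart disease deaths per 100,000 from the fitted line on the slide."""
    intercept = 260.563
    slope = -22.9688
    return intercept + slope * wine_consumption


prediction = predict_death_rate(3)
print(f"{prediction:.2f} deaths per 100,000")
```

That is 260.563 - 22.9688 × 3 = 260.563 - 68.9064, which rounds to the 191.66 quoted above.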


    How do we determine the regression line?

    We want the vertical distances from the points (observed) to the line (predicted) to be as small as possible; this means our error in predicting y is small.


    Calculating the line

    We will use the method of least squares to calculate the line.

    The least-squares regression line is the line that makes the sum of the squares of the vertical distances as small as possible.

    Equation of the line: $\hat{y} = a + bx$ (read "y hat")

    $b = r \, \frac{s_y}{s_x}$

    $a = \bar{y} - b\bar{x}$

    b is the slope (rate of change in y when x increases)

    a is the y intercept (value of y when x is 0)
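As a sketch of these formulas in use, the code below computes the slope as r times s_y / s_x and the intercept as the mean of y minus b times the mean of x. The data are the 19 wine observations from the residual table later in the deck. The helper name `least_squares_line` is illustrative, and the sample standard deviations are assumed to use the n - 1 divisor.

```python
from statistics import mean, stdev

wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(wine, deaths)
print(f"death rate = {a:.3f} + ({b:.4f}) * wine consumption")
```

The coefficients should agree, up to rounding, with the fitted equation death rate = 260.563 - 22.9688 wine consumption shown in the deck.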


    [Fitted line plot: death rate vs. wine consumption, with the regression line. x-axis: wine consumption, 0 to 9; y-axis: death rate, 100 to 300]

    The regression equation is

    death rate = 260.563 - 22.9688 wine consumption

    S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

    Analysis of Variance

    Source      DF       SS       MS        F      P
    Regression   1  59813.6  59813.6  41.6881  0.000
    Error       17  24391.4   1434.8
    Total       18  84204.9
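The summary statistics in this output are tied together by simple arithmetic on the Analysis of Variance rows. The sketch below uses only the SS and DF values printed above; the variable names are illustrative.

```python
from math import sqrt

# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 59813.6, 1
ss_error, df_error = 24391.4, 17
ss_total, df_total = 84204.9, 18

ms_regression = ss_regression / df_regression    # MS = SS / DF
ms_error = ss_error / df_error                   # compare with 1434.8 in the table
f_stat = ms_regression / ms_error                # compare with F = 41.6881
s = sqrt(ms_error)                               # compare with S = 37.8786
r_sq = ss_regression / ss_total                  # compare with R-Sq = 71.0 %
r_sq_adj = 1 - ms_error / (ss_total / df_total)  # compare with R-Sq(adj) = 69.3 %

print(f"MS error = {ms_error:.1f}, F = {f_stat:.2f}, S = {s:.4f}")
print(f"R-Sq = {100 * r_sq:.1f} %, R-Sq(adj) = {100 * r_sq_adj:.1f} %")
```

Small differences in the last digit are expected because the table values are themselves rounded.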


    Facts about regression

    1. There is a clear distinction between the response variable and the explanatory variable.

    2. Correlation and slope: a change of one standard deviation in x corresponds to a change of r standard deviations in the predicted y.

    3. The least-squares regression line passes through the point $(\bar{x}, \bar{y})$.

    4. Some variation (spread) in y can be accounted for by changes in x when there is a linear relationship. The square of the correlation coefficient is the fraction of the variation in y values that is explained by changes in x.

    $r^2 = \frac{\text{variation in } y \text{ due to } x}{\text{total variation in observed } y}$ = coefficient of determination
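Facts 2 to 4 can be checked numerically on the wine data from the residual table later in the deck. This is a sketch only: "variation" is taken to mean a sum of squared deviations from the mean of y, which is an assumption about the slide's wording, and the variable names are illustrative.

```python
from statistics import mean, stdev

wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

n = len(wine)
x_bar, y_bar = mean(wine), mean(deaths)
s_x, s_y = stdev(wine), stdev(deaths)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, deaths)) / ((n - 1) * s_x * s_y)
b = r * s_y / s_x
a = y_bar - b * x_bar


def predict(x):
    return a + b * x


# Fact 2: moving x up by one standard deviation moves y-hat by r standard deviations of y.
change_in_prediction = predict(x_bar + s_x) - predict(x_bar)

# Fact 3: the line passes through (x_bar, y_bar).
prediction_at_mean = predict(x_bar)

# Fact 4: r squared is the explained variation over the total variation.
explained = sum((predict(x) - y_bar) ** 2 for x in wine)
total = sum((y - y_bar) ** 2 for y in deaths)
r_squared = explained / total

print(f"r^2 = {r_squared:.3f}")
```

The printed r² should be close to the R-Sq of 71.0 % in the regression output.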


    [Fitted line plot repeated: death rate vs. wine consumption]

    The regression equation is

    death rate = 260.563 - 22.9688 wine consumption

    S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

    R-sq can have a value between 0 and 1.


    VARIATION OF DEPENDENT Y


    Residuals

    The left-overs from least-squares regression

    Deviations from the overall pattern are important. The deviations in regression are the scatter of points about the line. The vertical distances from the line to the points are called residuals, and they are the left-over variation after a regression line is fit.

    Residual = observed y - predicted y

    $\text{residual} = y - \hat{y}$


    Obs  Alcohol  hrt_deat     Fit  SE Fit  Residual  St Resid
      1     2.50    211.00  203.14    8.89      7.86      0.21
      2     3.90    167.00  170.99    9.23     -3.99     -0.11
      3     2.90    131.00  193.95    8.70    -62.95     -1.71
      4     2.40    191.00  205.44    8.97    -14.44     -0.39
      5     2.90    220.00  193.95    8.70     26.05      0.71
      6     0.80    297.00  242.19   11.76     54.81      1.52
      7     9.10     71.00   51.55   23.29     19.45      0.65 X
      8     0.80    211.00  242.19   11.76    -31.19     -0.87
      9     0.70    300.00  244.49   12.00     55.51      1.55
     10     7.90    107.00   79.11   19.39     27.89      0.86
     11     1.80    167.00  219.22    9.72    -52.22     -1.43
     12     1.90    266.00  216.92    9.57     49.08      1.34
     13     0.80    227.00  242.19   11.76    -15.19     -0.42
     14     6.50     86.00  111.27   15.11    -25.27     -0.73
     15     1.60    207.00  223.81   10.06    -16.81     -0.46
     16     5.80    115.00  127.34   13.15    -12.34     -0.35
     17     1.30    285.00  230.70   10.64     54.30      1.49
     18     1.20    199.00  233.00   10.85    -34.00     -0.94
     19     2.70    172.00  198.55    8.77    -26.55     -0.72

    The regression equation is

    death rate = 260.563 - 22.9688 wine consumption

    s = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

    The residuals are.

    The mean of the least-squares residuals is always equal to 0.
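A sketch of how the Residual column and the zero-mean property can be checked from the Alcohol and hrt_deat columns. The line is refitted from the data so that rounding in the printed coefficients does not disturb the mean. The variable names are illustrative.

```python
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

n = len(wine)
x_bar = sum(wine) / n
y_bar = sum(deaths) / n

# Least-squares slope and intercept from sums of squares and cross-products.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, deaths))
sxx = sum((x - x_bar) ** 2 for x in wine)
b = sxy / sxx
a = y_bar - b * x_bar

fits = [a + b * x for x in wine]
residuals = [y - fit for y, fit in zip(deaths, fits)]  # observed y - predicted y
mean_residual = sum(residuals) / n

for obs, (fit, res) in enumerate(zip(fits[:3], residuals[:3]), start=1):
    print(f"obs {obs}: fit = {fit:.2f}, residual = {res:.2f}")
print(f"mean residual = {mean_residual:.2e}")
```

The first three fits and residuals should match the table rows to two decimals, and the mean residual should be zero up to floating-point error.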


    Residual Plots

    [Residual plot: Residuals Versus Alcohol (response is hrt_deat). x-axis: Alcohol, 0 to 9; y-axis: Residual, -50 to 50]

    Things to look for:

    1. A curved pattern means the relationship is not linear.

    2. Increasing/decreasing spread about the line

    3. Individual points with large residuals

    4. Individual points that are extreme in the x direction

    Do we have any influential points here?
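Items 3 and 4 can be screened numerically as well as by eye. The sketch below computes standardised residuals and leverages for the wine data in the residual table. The cutoffs are assumptions: an absolute standardised residual above 2, and leverage above 3p/n with p = 2 coefficients. Both are common rules of thumb; the slides do not state which rule produced the X flag on observation 7.

```python
from math import sqrt

wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

n = len(wine)
x_bar = sum(wine) / n
y_bar = sum(deaths) / n
sxx = sum((x - x_bar) ** 2 for x in wine)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, deaths))
b = sxy / sxx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(wine, deaths)]
s = sqrt(sum(e * e for e in residuals) / (n - 2))      # residual standard deviation

# Leverage measures how extreme each point is in the x direction.
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in wine]
st_resid = [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]

LEVERAGE_CUTOFF = 3 * 2 / n   # assumed rule of thumb: 3p/n with p = 2
RESIDUAL_CUTOFF = 2           # assumed rule of thumb

high_leverage = [i + 1 for i, h in enumerate(leverage) if h > LEVERAGE_CUTOFF]
large_residual = [i + 1 for i, z in enumerate(st_resid) if abs(z) > RESIDUAL_CUTOFF]

print("extreme in x (observation numbers):", high_leverage)
print("large standardised residuals:", large_residual)
```

Under these assumed cutoffs only observation 7 (Alcohol = 9.10) stands out as extreme in x, consistent with the X flag in the table, and no observation has a large standardised residual.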


    Ideal residual pattern

    Curvature: a linear fit is not appropriate

    Increasing variation


    [Four panels:

    Regression Plot: C6 = 280.215 - 33.7666 C5; S = 40.0879, R-Sq = 42.0 %, R-Sq(adj) = 37.5 %. x-axis: C5, 1 to 4; y-axis: C6, 150 to 300

    Fitted regression line, death rate vs. wine consumption: death rate = 260.563 - 22.9688 wine consumption; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %. x-axis: wine consumption, 0 to 9; y-axis: death rate, 100 to 300

    Residuals Versus Alcohol (response is hrt_deat). x-axis: Alcohol, 0 to 9; y-axis: Residual, -50 to 50

    Residuals Versus C5 (response is C6). x-axis: C5, 1 to 4; y-axis: Residual, -50 to 50]


    Attention!! Caution!!

    1. Correlation and regression describe only linear relationships.

    2. r and r-sq are not resistant: a few outliers can change them substantially.

    3. Do not extrapolate!!! What is extrapolation? Predicting y for x values outside the range of the observed data.

    4. Correlations based on averages are too high when applied to individuals: if the data have been averaged, the values of correlation and regression cannot be used with un-averaged values (e.g., average alcohol consumption per country, not individuals).

    5. Lurking variables: like the male/female variable in the weight vs. energy data and the possible Mediterranean variable in the wine data.

    6. Correlation/association is not causation.
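Caution 3 can be made concrete with the wine regression line. In the sketch below the coefficients come from the fitted equation in the deck, and the observed range 0.7 to 9.1 is read from the Alcohol column of the residual table. The value 12 is a hypothetical input chosen to fall outside that range.

```python
X_MIN, X_MAX = 0.7, 9.1   # smallest and largest Alcohol values in the residual table


def predict_death_rate(wine_consumption):
    """Fitted line from the deck: death rate = 260.563 - 22.9688 * wine consumption."""
    return 260.563 - 22.9688 * wine_consumption


def is_extrapolation(wine_consumption):
    """True when x lies outside the range of the data used to fit the line."""
    return not (X_MIN <= wine_consumption <= X_MAX)


inside = predict_death_rate(3)     # within the observed range
outside = predict_death_rate(12)   # hypothetical value beyond the observed range

print(f"x = 3:  {inside:.2f} (extrapolation: {is_extrapolation(3)})")
print(f"x = 12: {outside:.2f} (extrapolation: {is_extrapolation(12)})")
```

At x = 12 the line predicts a negative death rate, which is impossible: the linear pattern is only supported within the range of the observed data.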