Transcript

SUMMARY

Two-sided t-test

š‘”ā‰ˆš‘„1āˆ’ š‘„2š‘ š‘„1āˆ’š‘„2āˆšš‘›

This is not an exact formula! It just demonstrates the main ingredients.

š‘”=š‘„āˆ’šœ‡0š‘ 

āˆšš‘›

difference between means, i.e. variability between samples

variability within samples

Two-sided t-test

• The numerator indicates how much the means differ.
• This is an explained variation because it most likely results from differences due to the treatment, or simply from differences between the populations (recall beer prices: different brands are differently expensive).
• The denominator is a measure of error. It measures the individual differences of subjects.
• This is considered an error variation because we don't know why individual subjects in the same group are different.

š‘”ā‰ˆš‘„1āˆ’ š‘„2š‘ š‘„1āˆ’š‘„2āˆšš‘›

Explained variation

Error variation
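As a concrete illustration, here is a minimal R sketch with made-up beer prices for two brands (the vectors brand_a and brand_b are hypothetical):

brand_a <- c(32, 35, 31, 38, 34, 33)  # hypothetical prices of brand A
brand_b <- c(29, 27, 31, 26, 30, 28)  # hypothetical prices of brand B
t.test(brand_a, brand_b, alternative = "two.sided")  # difference of means over standard error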

[Figures: 3 samples, 3 samples]

ANOVA

• Compare as many means as you want with just one test.

$MS_W = \dfrac{SS_W}{df_W} = \dfrac{\sum_k \sum_i (x_i - \bar{x}_k)^2}{N - k}$

$MS_B = \dfrac{SS_B}{df_B} = \dfrac{\sum_k n_k (\bar{x}_k - \bar{x}_G)^2}{k - 1}$

where $\bar{x}_k$ is the mean of group k and $\bar{x}_G$ is the grand mean.

š‘ =āˆšāˆ‘ (š‘„š‘–āˆ’š‘„ )2

š‘›āˆ’1āŸ¹š‘ 2=

āˆ‘ (š‘„ š‘–āˆ’ š‘„ )2

š‘›āˆ’1=š‘†š‘†š‘‘š‘“

š‘”ā‰ˆš‘„1āˆ’ š‘„2š‘ š‘„1āˆ’š‘„2āˆšš‘›

š¹ š‘‘ š‘“ šµ ,š‘‘ š‘“š‘Š=š‘€š‘†šµš‘€š‘†š‘Š

Total variability

• What is the total number of degrees of freedom?
• Likewise, we have a total variation.

$df_B = k - 1, \quad df_W = N - k, \quad df_T = df_B + df_W = N - 1$

Hypothesis

• Let's compare three samples with ANOVA. Just try to guess what the hypotheses will be.
• $H_0$: all the population means are equal; $H_1$: at least one pair of samples is significantly different.
• Follow-up multiple comparison steps – see which means are different from each other.

$F = \dfrac{\text{between-group variability}}{\text{within-group variability}}$

Multiple comparisons problem

• And there is another (more serious) problem with many t-tests. It is called the multiple comparisons problem.

http://www.graphpad.com/guides/prism/6/statistics/index.htm?beware_of_multiple_comparisons.htm

NEW STUFF

Post hoc tests

• The F-test in ANOVA is the so-called omnibus test. It tests the means globally; it says nothing about which particular means are different.
• post hoc tests, multiple comparison tests
• Tukey Honestly Significant Differences

TukeyHSD(fit) # where fit comes from aov()
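A minimal sketch of the full workflow, assuming a data frame beer_brands with columns Price and Brand (the one-way example used below):

fit <- aov(Price ~ Brand, data = beer_brands)  # one-way ANOVA
summary(fit)   # omnibus F-test: do any means differ at all?
TukeyHSD(fit)  # post hoc: which pairs of brands differ?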

ANOVA assumptions

• normality – all populations the samples come from are normal
• homogeneity of variance – variances are equal
• independence of observations – the results found in one sample won't affect the others
• The most influential is the independence assumption; otherwise, ANOVA is relatively robust.
• We can sometimes violate:
• normality – large sample size
• variance homogeneity – equal sample sizes + the ratio of any two variances does not exceed four
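Standard checks of the first two assumptions in base R, again assuming the hypothetical beer_brands data frame:

fit <- aov(Price ~ Brand, data = beer_brands)
shapiro.test(residuals(fit))                      # normality of the residuals
bartlett.test(Price ~ Brand, data = beer_brands)  # homogeneity of variance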

ANOVA kinds

• one-way ANOVA (Czech: analýza rozptylu při jednoduchém třídění, jednofaktorová ANOVA)

aov(beer_brands$Price~beer_brands$Brand)

• two-way ANOVA (Czech: analýza rozptylu dvojného třídění, dvoufaktorová ANOVA)
• Example: engagement ratio, measure two educational methods (with and without a song) for men and women independently
• aov(engagement~method+sex) (engagement – dependent variable; method, sex – independent variables)
• interactions between factors (see the sketch below)
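A sketch of the two-way case with an interaction term; the data frame study and its columns are hypothetical stand-ins for the engagement example:

fit2 <- aov(engagement ~ method * sex, data = study)  # '*' expands to method + sex + method:sex
summary(fit2)  # main effects of method and sex, plus their interaction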

CORRELATION

Introduction

• Up to this point we've been working with only one variable.
• Now we are going to focus on two variables.
• Two variables that are probably related. Can you think of some examples?
• weight and height
• time spent studying and your grade
• temperature outside and ankle injuries

Car data

Miles on a car    Value of the car
60 000            $12 000
80 000            $10 000
90 000            $9 000
100 000           $7 500
120 000           $6 000

• x – predictor, explanatory, independent variable
• What do you think y is called? Think about opposites to the x names: outcome, determiner, response, stand-alone, dependent.

Car data (same table as above)

• How may we show these variables have a relationship?
• Tell me some of your ideas.

• scatterplot

Scatterplot

Stronger relationship?

Correlation

• Relation between two variables = correlation
• strong relationship = strong correlation, high correlation

Match these (four scatterplots): strong positive, strong negative, weak positive, weak negative

Correlation coefficient

• r (Pearson's r) – a number that quantifies the relationship:

$r = \dfrac{\mathrm{cov}(X, Y)}{s_X s_Y}$

• the numerator is the covariance of X and Y: a statistic for how much X and Y co-vary, in other words, how much they vary together.
• the denominator holds the standard deviations of X and Y: they describe how the two variables vary apart from each other, rather than with each other.
• r measures the strength of the relationship by looking at how closely the data fall along a straight line.

Covariance

• Watch the explanation video: http://www.youtube.com/watch?v=35NWFr53cgA

$\mathrm{cov}(X, Y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

• divide by n − 1 for a sample but by n for a population
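The n − 1 vs. n distinction, spelled out in R on a made-up pair of vectors:

x <- c(1, 2, 3, 4, 5)                         # hypothetical paired data
y <- c(2, 4, 5, 4, 5)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance; equals cov(x, y)
sum((x - mean(x)) * (y - mean(y))) / n        # population covariance divides by n
cov(x, y) / (sd(x) * sd(y))                   # Pearson's r; equals cor(x, y)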

Coefficient of determination

• The coefficient of determination (r²) is the percentage of variation in Y explained by the variation in X.
• Percentage of variance in one variable that is accounted for by the variance in the other variable.

[Figure: three scatterplots with r² = 0, r² = 0.25, and r² = 0.81, from http://www.sagepub.com/upm-data/11894_Chapter_5.pdf]

Match these correlation coefficients to scatterplots: +1, −1, +0.14, +0.93, −0.73

• If X is age in years and Y is age in months, what will the correlation coefficient be? +1.0
• If X is hours you're awake a day and Y is hours you're asleep a day? −1.0

Crickets

• Find a cricket, count the number of its chirps in 15 seconds, add 37, and you have just approximated the outside temperature in degrees Fahrenheit.
• National Weather Service Forecast Office:

http://www.srh.noaa.gov/epz/?n=wxcalc_cricketconvert

chirps in 15 sec    temperature (°F)
18                  57
20                  60
21                  64
23                  65
27                  68
30                  71
34                  74
39                  77
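The table above is enough to fit the relationship in R:

chirps <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp   <- c(57, 60, 64, 65, 68, 71, 74, 77)
cor(chirps, temp)  # close to +1: a strong positive correlation
lm(temp ~ chirps)  # fitted line; compare with the 'add 37' rule of thumb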

Hypothesis testing

• Even when two variables describing a sample of data seem to have a relationship, this could be just due to chance. The situation in the population may be different.
• r … sample correlation coefficient, ρ … population correlation coefficient
• What will the hypotheses look like?

Quiz: is the null hypothesis $H_0: r = 0$ or $H_0: \rho = 0$? Hypotheses are statements about the population, so $H_0: \rho = 0$.

Hypothesis testing

• The test statistic has a t-distribution with df = n − 2.
• Example: we are measuring the relationship between two variables, we have 25 participants, and we get the t-statistic 2.71. Is there a significant relationship between X and Y?
• non-directional (two-tailed) test
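The slide leaves the significance level implicit; assuming the conventional α = 0.05, the decision can be checked in R:

n <- 25
t_obs <- 2.71
qt(0.975, df = n - 2)                          # two-tailed critical value, about 2.07
2 * pt(t_obs, df = n - 2, lower.tail = FALSE)  # two-tailed p-value, about 0.012

Since 2.71 exceeds the critical value, the relationship is significant.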

Confidence intervals

Statistics course from https://www.udacity.com

Try to guess: reject the null, or fail to reject it?

• 95% CI = (−0.3995, 0.0914) → the interval contains 0: fail to reject the null.
• 95% CI = (0.1369, 0.5733) → the interval excludes 0: reject the null.

Hypothesis testing

• A statistically correct way to decide about the relationship between two variables is, of course, hypothesis testing.
• In these two particular cases, the decisions follow from the confidence intervals above.

Correlation vs. causation

• causation – one variable causes another to happen
• e.g. the fact that it is raining causes people to take their umbrellas to work
• correlation – just means there is a relationship
• e.g. do happy people have more friends? Are they just happy because they have more friends? Or do they act a certain way which causes them to have more friends?

Correlation vs. causation

• There is a strong relationship between ice cream consumption and the crime rate.
• How could this be true?
• The two variables must have something in common with one another. It must be something that relates to both the level of ice cream consumption and the level of crime rate. Can you guess what that is?
• Outside temperature.

from causeweb.org

Correlation vs. causation

• If you stop selling ice cream, does the crime rate drop? What do you think?
• It doesn't, because of the simple principle that correlations express the association that exists between two or more variables; they have nothing to do with causality.
• In other words, just because the level of ice cream consumption and the crime rate increase/decrease together does not mean that a change in one necessarily results in a change in the other.
• You can't interpret associations as being causal.

Correlation vs. causation

• In the ice cream example, there exists a variable (outside temperature) we did not realize we should control for.
• Such a variable is called a third variable, confounding variable, or lurking variable.
• The methodologies of scientific studies therefore need to control for these factors to avoid a 'false positive' conclusion that the dependent variables are in a causal relationship with the independent variable.
• Let's have a look at the dependence of murder rate on temperature.

from http://www-personal.umich.edu/~bbushman/BWA05a.pdf (Journal of Personality and Social Psychology, 2005, Vol. 89, No. 1, 62–66)

[Figure: high assault period vs. low assault period]

http://xkcd.com/552/

Correlation and regression analysis

• Correlation analysis investigates the relationships between variables using graphs or correlation coefficients.
• Regression analysis answers questions like: what relationship exists between variables X and Y (linear, quadratic, …), is it possible to predict Y using X, and with what error?

Simple linear regression

• also single linear regression (Czech: jednoduchá lineární regrese)
• one y (dependent variable; Czech: závisle proměnná), one x (independent variable; Czech: nezávisle proměnná)
• the line is $y = a + bx$: a – y-intercept (constant), b – slope
• $\hat{y}$ is the estimated value; to distinguish it from the actual value y corresponding to the given x, statisticians use the hat symbol.

Data set

• Students in higher grades carry more textbooks.
• Weight of the textbooks depends on the weight of the student.

[Figure: scatterplot with an outlier; strong positive correlation, r = 0.926]

from Intermediate Statistics for Dummies

Build a model

• Find a straight line $y = a + bx$.
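Using the car data from earlier as a self-contained example (lm() computes the least-squares a and b):

miles <- c(60000, 80000, 90000, 100000, 120000)
value <- c(12000, 10000, 9000, 7500, 6000)
fit <- lm(value ~ miles)  # least-squares line: value = a + b * miles
coef(fit)                 # a (intercept) and b (slope)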

Interpretation

• y-intercept (3.69 in our case)
• it may or may not have a practical meaning
• Does it fall within the actual values in the data set? If yes, it is a clue that it may have a practical meaning.
• Does it fall within negative territory where negative y-values are not possible (e.g. weights can't be negative)?
• Does the value x = 0 have a practical meaning (a student weighing 0)?
• However, even if it has no meaning, it may be necessary (i.e. significantly different from zero)!
• slope (0.113 in our case)
• change in y due to a one-unit increase in x (i.e. if a student's weight increases by 1 pound, the textbook weight increases by 0.113 pounds)
• now you can use the regression line to estimate the y value for a new x

Regression model conditions

• After building a regression model you need to check whether the required conditions are met.
• What are these conditions?
• The y's have to have a normal distribution for each value of x.
• The y's have to have a constant spread (standard deviation) for each value of x.

Normal y's for every x

• For any value of x, the population of possible y-values must have a normal distribution.

from Intermediate Statistics for Dummies

Homoscedasticity condition

As you move from left to right on the x-axis, the spread of the y-values around the line remains the same.

source: wikipedia.org

Confidence and prediction limits

95% confidence limits – this interval includes the true regression line with 95% probability. (Czech: pás spolehlivosti)

95% prediction limits – this interval represents the 95% probability for the values of the dependent variable, i.e. 95% of data points lie within these lines. (Czech: pás predikce)
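Both kinds of limits can be obtained from predict() in R; here is a sketch using the car-data fit, with 70,000 miles as an arbitrary new x:

miles <- c(60000, 80000, 90000, 100000, 120000)
value <- c(12000, 10000, 9000, 7500, 6000)
fit <- lm(value ~ miles)
new <- data.frame(miles = 70000)
predict(fit, new, interval = "confidence")  # 95% confidence limits for the mean value
predict(fit, new, interval = "prediction")  # wider 95% prediction limits for one car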

Residuals

• To check the normality of y-values you need to measure how far off your predictions were from the actual data, and to explore these errors.
• residual (Czech: residuum, reziduální hodnota predikce)

from Intermediate Statistics for Dummies

residual = actual value − predicted value, i.e. $e_i = y_i - \hat{y}_i$

Residuals

• The residuals are data just like any other, so you can find their mean (which is zero!) and their standard deviation.
• Residuals can be standardized, i.e. converted to Z-scores, so you can see where each falls on the standard normal distribution.
• Plotting residuals on a graph gives residual plots.
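A sketch of these ideas on the car-data fit:

miles <- c(60000, 80000, 90000, 100000, 120000)
value <- c(12000, 10000, 9000, 7500, 6000)
fit <- lm(value ~ miles)
res <- residuals(fit)   # actual minus predicted values
mean(res)               # zero, up to floating-point error
rstandard(fit)          # standardized (Z-score) residuals
plot(fitted(fit), res)  # residual plot: look for random scatter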

Using r² to measure model fit

• r² measures what percentage of the variability in y is explained by the model.
• The y-values of the data you collect have a great deal of variability in and of themselves.
• You look for another variable (x) that helps you explain that variability in the y-values.
• After you put that x variable into the model and find it's highly correlated with y, you want to find out how well the model did at explaining why the values of y are different.

Interpreting r²

• high r² (80–90% is extremely high, 70% is fairly high)
• A high percentage of variability means that the line fits well, because there is not much left to explain about the value of y other than using x and its relationship to y.
• small r² (0–30%)
• The model containing x doesn't help much in explaining the differences in the y-values.
• The model does not fit well. You need another variable to explain y other than the one you already tried.
• middle r² (30–70%)
• x does help somewhat in explaining y, but it doesn't do the job well enough on its own.
• Add one or more variables to the model to help explain y more fully as a group.
• Textbook example: r = 0.93, r² = 0.8649. Approximately 86% of the variability in textbook weights is explained by the average student weight. A fairly good model.
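The two routes to r² agree, shown here on the car data rather than the textbook data set (which is not reproduced in the transcript):

miles <- c(60000, 80000, 90000, 100000, 120000)
value <- c(12000, 10000, 9000, 7500, 6000)
cor(miles, value)^2                   # r squared from the correlation
summary(lm(value ~ miles))$r.squared  # the same value from the fitted model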

