
Page 1: Applied Statistics - MIT

Dr. Elizabeth Newton

Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT) and Gordon Kaufman (MIT).

1

Page 2: Applied Statistics - MIT

15.075, Applied Statistics
Lecture: M, W 10-11:30

Recitation: R 4-5

Text: Statistics and Data Analysis by Tamhane and Dunlop

Computing: S-Plus

Exams: Mid-term (in class) and Final during exam week

Prerequisites: Calculus, Probability, Linear Algebra

2

Page 3: Applied Statistics - MIT

15.075, Applied Statistics, Course Outline

• Collecting Data
• Summarizing and Exploring Data
• Review of Probability
• Sampling Distributions of Statistics
• Inference: Point and CI Estimation, Hypothesis Testing
• Linear Regression
• Analysis of Variance
• Nonparametric Methods
• Special Topics (Data Mining?)

3

Page 4: Applied Statistics - MIT

Statistics

“The science of collecting and analyzing data for the purpose ofdrawing conclusions and making decisions.” from Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, 2000, pp. 1.

“Statistics are no substitute for judgment.” Henry Clay

4

Page 5: Applied Statistics - MIT

How is the meter defined?

One ten-millionth of a quarter meridian(distance from pole to equator).

BUT – it isn’t exactly.

Why?

5

Page 6: Applied Statistics - MIT

The Measure of All Things, by Ken Alder, describes the attempt of 2 French astronomers,Delambre and Mechain, to determine the circumference of the earth during the time of the French Revolution.

Determined the distance between Barcelona and Dunkirk by triangulation.

Needed to know latitude at each end (by measuring heights of stars).

Seven months stretched to seven years.

Mechain obtained conflicting information and suppressed some of his data.

6

Page 7: Applied Statistics - MIT

Page 214 (Measure of All Things):

“What counts as an error? Who is to say when you have made a mistake? How close is close enough? Neither Mechain nor his colleagues could have answered these questions with any degree of confidence. They were completely innocent of statistical method.”

- Quote from Alder, Ken. The Measure of All Things: The Seven-Year Odyssey and Hidden Error that Transformed the World. Free Press, 2003.

7

Page 8: Applied Statistics - MIT

Data: A set of measurements

Character
  Nominal, e.g. color: red, green, blue
  Binary, e.g. (M,F), (H,T), (0,1)
  Ordinal, e.g. attitude to war: agree, neutral, disagree

Numeric
  Discrete, e.g. number of children
  Continuous, e.g. distance, time, temperature

also:
  Interval, e.g. Fahrenheit temperature
  Ratio (real zero), e.g. distance, number of children

8

Page 9: Applied Statistics - MIT

S-Plus Data Set: cu.summary

9

Page 10: Applied Statistics - MIT

Concepts

Population:The set of all units of interest (finite or infinite). E.g. all students at MIT

Sample:A subset of the population actually observed. E.g. students in this room.

Variable: A property or attribute of each unit, e.g age, height

Observation:Values of all variables for an individual unit

A dataset is often organized as a matrix with rows corresponding to observations and columns to variables.

10

Page 11: Applied Statistics - MIT

Concepts (continued)

Parameter:Numerical characteristic of population, defined for each variable, e.g. proportion opposed to war

Statistic:Numerical function of sample used to estimate population parameter.

Precision: Spread of estimator of a parameter

Accuracy: How close estimator is to true value - opposite of

Bias: Systematic deviation of estimate from true value

11

Page 12: Applied Statistics - MIT

Accuracy and Precision

accurate and precise

accurate, not precise

precise, not accurate

not accurate, not precise

12

Diagram courtesy of MIT OpenCourseWare

Page 13: Applied Statistics - MIT

Steps in Study Design and Implementation

1. Background research and literature review.

2. Define the goals and specific hypotheses of the study.

3. Determine what variables should be measured and how.

4. Develop a plan to collect the data: sampling design, sample size, inclusions and exclusions.

5. Train personnel.

6. Gather data.

7. Analyze data.

8. Report results.

13

Page 14: Applied Statistics - MIT

Ethical Issues

For human subjects:

For animal subjects:

(See Hulley & Cummings, Designing Clinical Research.)

14

Page 15: Applied Statistics - MIT

Statistical Studies

Descriptive: One group, e.g. survey, poll

Comparative: 2 or more groups, e.g. compare effectiveness of different teaching methods.

Experimental: Investigator actively intervenes to control study conditions. Looks at the relationship between predictor (explanatory) and response (outcome) variables. Can establish causation, e.g. drug trial.

Observational: Investigator records data without intervening. Difficult to distinguish effects of predictors and confounding variables (lurking variables). Can establish association, e.g. Framingham Heart Study.

15

Page 16: Applied Statistics - MIT

Observational Studies:

Cross-sectional: Look at a sample at a single point in time. E.g. census, sample survey.

Prospective (expensive!): Follow a sample (cohort) forward in time. E.g. Framingham Heart Study, Nurses' Health Study.

Retrospective (case-control): Look back in time.

16

Page 17: Applied Statistics - MIT

Sources of Error in Observational Studies

Sampling Error – sample differs from population

Measurement Bias – poorly worded questions

Self-Selection Bias – refusal to participate

Response Bias – incorrect or untruthful responses

17

Page 18: Applied Statistics - MIT

Types of Samples

Probability Sample (every element in population has known non-zero probability of inclusion)

• Simple Random Sample (SRS)
• Stratified Random Sample
• Multi-Stage Cluster Sample
• Systematic Sample

Non-Probability Sample (estimates may be biased, but frequently used as only feasible method)

• Convenience Sample, e.g. supermarket survey
• Judgment Sample – chosen by investigator

18

Page 19: Applied Statistics - MIT

Simple Random Sample (SRS)

Requires a Sampling Frame, a list of all the units in a finite population

Sample of size n is drawn without replacement from a population of size N, such that each sample (there are C(N,n) = N!/(n!(N-n)!) of them) has the same chance of being chosen.

Each unit in the population has the same chance of being chosen: n/N (the sampling fraction).

Generate random numbers to select from the sampling frame.

19

Page 20: Applied Statistics - MIT

Stratified Random Sample

Divide a diverse population into homogeneous subpopulations (strata).

Draw simple random sample from each one.

Advantages:

Separate estimates for strata obtained in addition to overall estimates.

Precision of estimates higher than for simple random sample

Disadvantage: Requires a sampling frame

20

Page 21: Applied Statistics - MIT

Multistage Cluster Sampling

Used to survey large populations when sampling frame not available, e.g. USA

For instance, in an educational survey, draw a sample of states, then towns within states, then schools within towns.

Prepare a sampling frame of students from selected schools and use SRS.

21

Page 22: Applied Statistics - MIT

Systematic Sampling

Useful when list of units exists or when units arrive sequentially (cars through a toll booth).

Select first unit at random, then every kth unit.

In a finite population, each unit has the same probability of selection (n/N); however, not all samples are equally likely.

Must avoid choosing k to coincide with regular cyclic variations in the data

22

Page 23: Applied Statistics - MIT

Questionnaire Design

Structured questions: responses should be mutually exclusive and collectively exhaustive.

E.g. How many glasses of water do you drink per day?
_____ 0 to 2
_____ 3 to 5
_____ 6 or more

Non-structured:
E.g. How many glasses of water do you drink per day?
Allows a more individualized response, but more prone to data entry errors.

23

Page 24: Applied Statistics - MIT

Attitude questions

1. The homework load in this course is reasonable.

Strongly Disagree | Disagree | Neither Agree nor Disagree | Agree | Strongly Agree

Usually 5 to 9 categories.
(Should we assign numbers to these categories?)
(High to low or low to high?)

24

Page 25: Applied Statistics - MIT

Problems with Question Wording

Double-barreled question

Leading question

One-sided question

Ambiguous question

Pretest! Pretest! Pretest!

(For more information, see Johnson & Wichern, Business Statistics)

25

Page 26: Applied Statistics - MIT

26

Sensitive Questions

E.g. Have you ever used heroin?

Randomized Response may elicit more accurate responses. The interviewer does not know which question the respondent is answering.

E.g. Roll a die. If less than 3, say whether statement 1 is true or false. Otherwise say whether statement 2 is true or false.

Statement 1: I have used heroin.
Statement 2: I have not used heroin.

Let p = proportion of people who have used heroin,
q = proportion of people answering question 1 (can't be 0.5).

P(True) = P(True|1)P(1) + P(True|2)P(2) = p q + (1-p)(1-q)

Solve for p.
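As a quick check of the algebra, here is a small S-Plus/R-style sketch (not from the course materials; the observed fraction t is a made-up value):

# Randomized response: P(True) = p*q + (1-p)*(1-q), so p = (t - (1-q)) / (2*q - 1)
q <- 2/6       # P(respondent answers statement 1): die shows 1 or 2
t <- 0.55      # hypothetical observed fraction of "true" answers
p <- (t - (1 - q)) / (2*q - 1)
p              # estimated proportion who have used heroin (about 0.35 here)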

Page 27: Applied Statistics - MIT

Question Sequencing

1. Demographics at end

2. Sensitive questions nearer to end

3. Same topic questions appear together

4. Go from general to specific

5. Avoid skipping around.27

Page 28: Applied Statistics - MIT

28

Experimental Studies

Purpose: Evaluate how a set of predictor variables (factors) affect a response variable.

Treatment Factors are of primary interest. Values (Levels) are controlled.

Nuisance Factors also affect response.

Treatment: particular combination of levels of treatment factors.

Experimental units (EU’s): subjects to which treatments applied.

Treatment group: all EU’s receiving same treatment

Run: observation on an EU under particular treatment condition.

Replicate: another independent run.

Page 29: Applied Statistics - MIT

Sources of Error in Experimental Studies

Systematic Error: differences among EU’s caused by Confounding Factors

Random Error: inherent variability in responses of EU’s.

Measurement Error: due to imprecision of measuring instruments.

29

Page 30: Applied Statistics - MIT

Strategies to Control Error in Experimental Studies

Blocking: Divide sample into groups of similar EU's (same value for nuisance factors). E.g. in agricultural trials the effect of nutrient and moisture gradients can be controlled for by blocking on agricultural plots.

Matching: EU's can be matched on nuisance factors, then each member of a match can be randomly assigned to a different treatment (each match is a block).

Regression Analysis: If value of nuisance factor is known can include as covariate in final model.

Randomization: Randomly assign EU’s to treatments.

Basic Idea: Block over those nuisance factors that can be easily controlled and randomize over the rest

30

Page 31: Applied Statistics - MIT

Basic Experimental Designs

Completely Randomized Design (CRD): EU's assigned at random to treatments.

Randomized Block Design (RBD): EU's divided into homogeneous blocks; treatments assigned randomly within blocks.

Randomized Complete Block Design (RCBD): Blocks contain all treatments.

Randomized Incomplete Block Design (RIBD): Blocks do not contain all treatments.

31

Page 32: Applied Statistics - MIT

Chapter 4: Summarizing & Exploring Data (Descriptive Statistics)

Graphics! Graphics! Graphics!(and some numbers)

Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).

1

Page 33: Applied Statistics - MIT

Graphical Excellence

“Complex ideas communicated with clarity, precision, and efficiency”

Shows the data
Makes you think about substance rather than method, graphic design, or something else
Many numbers in a small space
Makes large data sets coherent
Encourages the eye to compare different pieces of the data

2

Page 34: Applied Statistics - MIT

Charles Joseph Minard

Graphic Depicting Exports of Wine from France (1864)

Available at http://www.math.yorku.ca/SCS/Gallery/

Source: Minard, C. J. Carte figurative et approximative des quantités de vin français exportés par mer en 1864. ENPC (École Nationale des Ponts et Chaussées), 1865.

Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.

3

Page 35: Applied Statistics - MIT

Summarizing Categorical Data

A frequency table shows the number of occurrences of each category. Relative frequency is the proportion of the total in each category.

Bar charts and pie charts are used to graph categorical data. A Pareto chart is a bar chart with categories arranged from the highest to lowest (QC: “vital few from the trivial many”).

Popularity of attractions at an amusement park:

Attraction         Frequency   Relative Frequency (%)
Vertical Drop        101        15.1
Roller Coaster A      54         8.1
Roller Coaster B      77        11.5
Water Park           155        23.1
Spinners              35         5.2
Tea Cups              81        12.1
Haunted House         79        11.8
Log Drop              88        13.1
Total                670       100.0

[Bar chart of relative frequency (%) by attraction]

4

Page 36: Applied Statistics - MIT

Pie Chart and Bar Chart of Attraction Popularity at an Amusement Park

5

[Pie chart and bar chart of relative frequency (%) for the eight attractions]

Page 37: Applied Statistics - MIT

Charles Joseph Minard

Graph showing quantities of meat sent from various regions of France to Paris using pie charts overlaid on a map of France (1864)

Available at http://www.math.yorku.ca/SCS/Gallery/

Source: Minard, C. J. Carte figurative et approximative des quantités de viande de boucherie envoyées sur pied par les départments et consommées à Paris. ENPC (École Nationale des Ponts et Chaussées), 1858, pp. 44.

6

Page 38: Applied Statistics - MIT

Plots for Numerical Univariate Data

Scatter plot (vs. observation number)

Histogram

Stem and Leaf

Box Plot (Box and Whiskers)

QQ Plot (Normal probability plot)

7

Page 39: Applied Statistics - MIT

Scatter Plot of Iris Data

[Scatter plot of iris sepal width (Setosa) vs. observation number]

8

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 40: Applied Statistics - MIT

9

Scatter Plot of Iris Data with Observation Number Indicated

[Scatter plot of iris sepal width vs. observation number, with each point labeled by its observation number]

plot(iris21)
text(iris21)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 41: Applied Statistics - MIT

Plot of data using jitter function in S-Plus

[Scatter plots of x and jitter(x) vs. observation number]

10

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 42: Applied Statistics - MIT

Run Chart

For time series data, it is often useful to plot the data in time sequence. A run chart graphs the data against time.

[Run chart of compression frequency vs. production order]

11

Always Plot Your Data Appropriately - Try Several Ways!

Page 43: Applied Statistics - MIT

Histogram

Data: n=24 Gas Mileage {31,13,20,21,24,25,25,27,28,40,29,30,31,23,31,32,35,28,36,37,38,40,50,17}

Gives a picture of the distribution of data.

• Area under the histogram represents sample proportion.

• Use approx. sqrt(n) “bins” - if too many, too jagged; if too few, too smooth (no detail).

• Shows if the distribution is:
  – Symmetric or skewed
  – Unimodal or bimodal

• Gaps in the data may indicate a problem with the measurement process.

• Many quality control applications
  – Are there two processes?
  – Detection of rework or cheating
  – Tells if process meets the specifications

[Histogram of miles per gallon: count vs. bins from 10 to 55]

Note: Bars touch for continuous data, but do NOT touch for discrete data.

12

Page 44: Applied Statistics - MIT

Histogram of Iris Data

[Histogram of iris sepal width, values 2.5 to 4.0]

13

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 45: Applied Statistics - MIT

Histogram of Iris Data with Density Curve

[Histogram of iris sepal width with a superimposed density curve]

14

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 46: Applied Statistics - MIT

Stem and Leaf Diagram / Cumulative Distribution Function

Data: Gas Mileage

Stem | Leaf     Count
  5  | 0          1
  4  | 00         2
  3  | 5678       4
  3  | 01112      5
  2  | 557889     6
  2  | 0134       4
  1  | 7          1
  1  | 3          1

[CDF plot: cumulative proportion vs. miles per gallon, 10 to 55]

Shows the distribution of the data similar to a histogram, but preserves the actual data. Can see numerical patterns in the data (like 40's and 50).

Step occurs at each data value (higher for more values at the same data point).

15

Page 47: Applied Statistics - MIT

Stem and Leaf Diagram for Iris Data

Decimal point is 1 place to the left of the colon

23 : 0
24 :
25 :
26 :
27 :
28 :
29 : 0
30 : 000000
31 : 0000
32 : 00000
33 : 00
34 : 000000000
35 : 000000
36 : 000
37 : 000
38 : 0000
39 : 00
40 : 0
41 : 0
42 : 0
43 :
44 : 0

16

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 48: Applied Statistics - MIT

Summary Statistics for Numerical Data

Measures of Location:

Mean (“average”): x̄ = (x1 + x2 + … + xn)/n = (1/n) Σ xi

Median: middle of the ordered sample (like θ.5 for a distribution)

xmin = x(1) ≤ x(2) ≤ … ≤ x(n) = xmax

median = x((n+1)/2)                  if n is odd
median = [x(n/2) + x(n/2+1)] / 2     if n is even

Median of {0,1,2} is 1 : n=3 so n+1=4 & (n+1)/2=2 (2nd value)

Median of {0,1,2,3} is 1.5 (assumes data is continuous): n=4

Mode: The most common value 17

Page 49: Applied Statistics - MIT

Mean or Median?

Appropriate summary of the center of the data?
– Mean if the data has a symmetric distribution with light tails (i.e. a relatively small proportion of the observations lie away from the center of the data).
– Median if the distribution has heavy tails or is asymmetric.

Extreme values that are far removed from the main body of the data are called outliers.
– Large influence on the mean but not on the median.

Right and left skewness (asymmetry):
[Sketches: for a RIGHT-skewed distribution the order along the axis is mode (high point), median, mean (reverse alphabetic); for a LEFT-skewed distribution it is mean, median, mode (alphabetic)]

18

Page 50: Applied Statistics - MIT

Quantiles, Fractiles, Percentiles

For a theoretical distribution:
The pth quantile is the value xp of a random variable X such that P(X ≤ xp) = p.
For the normal dist'n in S-Plus: qnorm(p), 0<p<1, gives the quantile; pnorm(q) gives the probability.

For a sample:
The order statistics are the sample values in ascending order, denoted X(1), …, X(n).
The pth quantile is the data value in the sorted sample such that a fraction p of the data is less than or equal to that value.

19

Page 51: Applied Statistics - MIT

20

Normal CDF

[Plot of the standard normal CDF, pnorm(x), for x from -3 to 3]

qnorm(0.8)=0.8416212

pnorm(0.8416212)=0.8

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 52: Applied Statistics - MIT

An algorithm for finding sample quantiles:

1) Arrange observations from smallest to largest.
2) For a given proportion p, compute the sample size × p = np.
3) If np is NOT an integer, round up to the next integer (ceiling(np)) and set the corresponding ordered observation = xp.
4) If np IS an integer k, average the kth and (k + 1)st ordered values. This average is then xp.

– Text has a different algorithm

21

Page 53: Applied Statistics - MIT

Quantiles, continued (pth quantile is 100pth percentile)

Example:
Data: {0, 1, 2, 3, 4, 5, 6} = {x(1), x(2), x(3), x(4), x(5), x(6), x(7)}

n = 7
Q1: ceiling(0.25*7) = 2 ⇒ Q1 = x(2) = 1 = 25th percentile
Q2: ceiling(0.50*7) = 4 ⇒ Q2 = x(4) = 3 = median (50th percentile)
Q3: ceiling(0.75*7) = 6 ⇒ Q3 = x(6) = 5 = 75th percentile

S-Plus gives different answers! Different methods for calculating quantiles.
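A short S-Plus/R-style sketch of this algorithm, applied to the data above (the function name sample.quantile is made up for illustration):

sample.quantile <- function(x, p) {
  xs <- sort(x)
  np <- length(xs) * p
  if (np != floor(np)) xs[ceiling(np)]   # np not an integer: round up
  else (xs[np] + xs[np + 1]) / 2         # np an integer k: average kth and (k+1)st values
}

x <- c(0, 1, 2, 3, 4, 5, 6)
sample.quantile(x, 0.25)   # 1, as computed above
quantile(x, 0.25)          # 1.5 in S-Plus/R, which uses a different (interpolating) method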

22

Page 54: Applied Statistics - MIT

Measures of Dispersion (Spread, Variability):

Two data sets may have the same center but quite different dispersions around it.

Two ways to summarize variability:

1. Give the values that divide the data into equal parts.
   – Median is the 50th percentile
   – The 25th, 50th, and 75th percentiles are called quartiles (Q1, Q2, Q3) and divide the data into four equal parts.
   – The minimum, maximum, and three quartiles are called the “five number summary” of the data.

2. Compute a single number, e.g., range, interquartile range, variance, and standard deviation.

23

Page 55: Applied Statistics - MIT

Measures of Dispersion, continued

Range = maximum - minimum
Interquartile range (IQR) = Q3 – Q1

Sample variance:
s² = (1/(n-1)) Σ (xi - x̄)² = (1/(n-1)) [ Σ xi² - n x̄² ]

Sample standard deviation: s = sqrt(s²)

Sample mean, variance, and standard deviation are sample analogs of the population mean, variance, and standard deviation (µ, σ², σ)

24

Page 56: Applied Statistics - MIT

Other Measures of Dispersion

Sample Average of Absolute Deviations from the Mean:
(1/n) Σ |xi - x̄|,  i = 1, …, n

Sample Median of Absolute Deviations from the Median:
Median of {|xi − x.5|, i = 1, …, n}

25

Page 57: Applied Statistics - MIT

Computations for Measures of Dispersion

Example:

Data: {0, 1, 2, 3, 4, 5, 6} = {x(1), x(2), x(3), x(4), x(5), x(6), x(7)}

mean = (0+1+2+3+4+5+6)/7 = 21/7 = 3
min = 0, max = 6
Q1 = x(2) = 1 = 25th percentile
Q2 = x(4) = 3 = median (50th percentile)
Q3 = x(6) = 5 = 75th percentile
Range = max - min = 6 - 0 = 6
IQR = Q3 - Q1 = 5 - 1 = 4
s² = [(0²+1²+2²+3²+4²+5²+6²) - 7(3²)]/(7-1) = [91-63]/6 = 4.67
s = sqrt(4.67) = 2.16
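The same quantities can be checked with built-in functions; a minimal S-Plus/R-style sketch (quantile-based values may differ slightly because of the method used):

x <- c(0, 1, 2, 3, 4, 5, 6)
mean(x)                          # 3
range(x)                         # 0 6
quantile(x, c(0.25, 0.5, 0.75))  # quartiles (method may differ from the hand calculation)
var(x)                           # 4.666667, sample variance with the n-1 divisor
sqrt(var(x))                     # 2.160247, sample standard deviation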

26

Page 58: Applied Statistics - MIT

Sample Variance and Standard Deviation

s² and s should only be used to summarize dispersion with symmetric distributions.

For asymmetric distributions, a more detailed breakup of the dispersion must be given in terms of quartiles.

For normal data and large samples:
– 50% of the data values fall between mean ± 0.67s
– 68% of the data values fall between mean ± 1s
– 95% of the data values fall between mean ± 2s
– 99.7% of the data values fall between mean ± 3s

For normally distributed data:
IQR = (mean + 0.67s) - (mean - 0.67s) = 1.34s

27

Page 59: Applied Statistics - MIT

28

Standard Normal Density

[Plot of the standard normal density, dnorm(x), for x from -4 to 4, with the central 68% region marked]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 60: Applied Statistics - MIT

Box (and Whiskers) Plots

Visual display of a summary of the data (more than five numbers). Outlier box plot / quantile box plot. Data: Gas Mileage

[Diagram of a box plot, annotated as below:]

median

Q3

Q1

IQR = Q3 - Q1

Upper Fence = Q3 + 1.5 x IQR

Lower Fence = Q1 – 1.5 x IQR

Two lines are called whiskers and extend to the most extreme data values that are still inside the fences.

Observations outside the fences are regarded as possible outliers and are denoted by dots and circles or asterisks.

Rectangle (box): extends from Q1 to Q3. The quantile box plot also marks additional percentiles such as the 10th and 90th.

29

Page 61: Applied Statistics - MIT

Box Plot for Iris Data

[Box plot of iris sepal width, values roughly 2.5 to 4.0]

30

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 62: Applied Statistics - MIT

QQ Plots
Compare Sample to Theoretical Distribution

Order the data. The ith ordered data value is the pth quantile, where p = (i - 0.5)/n, 0<p<1. Text uses i/(n+1). (Why can't we just say i/n?)

Obtain quantiles from the theoretical distribution corresponding to the values for p. E.g. qnorm(p) in S-Plus for the normal distribution.

Plot theoretical quantiles vs. empirical quantiles (sorted data).
S-Plus: n <- length(y); plot(qnorm(((1:n) - 0.5)/n), sort(y))

Fit a line through the first and third quartiles of each distribution.
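A sketch of the whole recipe in S-Plus/R style, using the gas mileage data from the earlier slide:

y <- c(31,13,20,21,24,25,25,27,28,40,29,30,31,23,31,32,35,28,36,37,38,40,50,17)
n <- length(y)
p <- ((1:n) - 0.5) / n
plot(qnorm(p), sort(y), xlab="Quantiles of Standard Normal", ylab="Sample quantiles")
qy <- quantile(y, c(0.25, 0.75))      # sample quartiles
qx <- qnorm(c(0.25, 0.75))            # theoretical quartiles
slope <- (qy[2] - qy[1]) / (qx[2] - qx[1])
abline(a = qy[1] - slope * qx[1], b = slope)   # line through the two quartile points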

31

Page 63: Applied Statistics - MIT

32

QQ (Normal) Plot for Iris Data

[Normal QQ plot of iris sepal width: sample quantiles vs. quantiles of standard normal]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 64: Applied Statistics - MIT

Normalizing Transformations

Data can be non-normal in a number of ways, e.g., the distribution may not be bell shaped or may be heavier tailed than the normal distribution or may not be symmetric.

Only the departure from symmetry can be easily corrected by transforming the data.

If the distribution is positively skewed, then the right tail needs to be shrunk inward. The most common transformation used for this purpose is the log transformation: x → log x (e.g., decibels, Richter, and Beaufort (?) scales); see Figure 4.11.

The square-root (√x) transformation provides a weaker shrinking effect; it is frequently used for (Poisson) count data.

For negatively skewed data, use the exponential (ex) or squared (x2) transformations.

33

Page 65: Applied Statistics - MIT

34

Normal Probability Plot of data generated from a certain distribution

[Normal probability plot of x vs. quantiles of standard normal]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 66: Applied Statistics - MIT

35

Normal probability plot of log of same data

[Normal probability plot of log(x) vs. quantiles of standard normal]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 67: Applied Statistics - MIT

Histogram of the same data

36

[Histogram of x, values 0 to 10]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 68: Applied Statistics - MIT

Summarizing Multivariate Data

37

When two or more variables are measured on each sampling unit, the result is multivariate data.

If only two variables are measured the result is bivariate data. One variable may be called the x variable and the other the y variable.

We can analyze the x and y variable separately with the methods we have learned so far, but these methods would NOT answer questions about the relationship between x and y.

– What is the nature of the relationship between x and y (if any)?

– How strong is the relationship?

– How well can one variable be predicted from the other?

Page 69: Applied Statistics - MIT

Summarizing Bivariate Categorical Data
Two-way Table

                            Overall Job Satisfaction
Annual Salary        Very          Slightly      Slightly    Very        Row
                     Dissatisfied  Dissatisfied  Satisfied   Satisfied   Sum
Less than $10,000        81            64            29         10        184
$10,000-25,000           73            79            35         24        211
$25,000-50,000           47            59            75         58        239
More than $50,000        14            23            84         69        190
Column Sum              215           225           223        161        824

38

The numbers in the cells are the frequencies of each possible combination of categories.

Cell, row and column percentages can be computed to assess distribution.

Page 70: Applied Statistics - MIT

Column Percentages for Income and Job Satisfaction Table

                            Overall Job Satisfaction
Annual Salary        Very          Slightly      Slightly    Very
                     Dissatisfied  Dissatisfied  Satisfied   Satisfied
Less than $10,000       37.7          28.4          13.0        6.2
$10,000-25,000          34.0          35.1          15.7       14.9
$25,000-50,000          21.9          26.2          33.6       36.0
More than $50,000        6.5          10.2          37.7       42.9

39

Page 71: Applied Statistics - MIT

Simpson’s Paradox

“Lurking variables [excluded from consideration] can change or reverse a relation between two categorical variables!”

40

Page 72: Applied Statistics - MIT

Doctors’ Salaries

• The interpreter of a survey of doctors’ salaries in 1990 and again in 2000 concluded that their average income actually declined from $97,000 in 1990 to $91,000 in 2000.

• Income is measured here in nominal (not adjusted for inflation) dollars.

41

Page 73: Applied Statistics - MIT

What about the “Rest of the Story”?

• What deductive piece of logic might clarify the real meaning of this particular pair of statistics?

• Look more deeply: Is there a piece missing?

• Here is a very simple breakdown of “the numbers” that may help.

42

Page 74: Applied Statistics - MIT

Doctors’ Salaries by Age

                 1980                        1990
Age       fraction, f1   Income       fraction, f2   Income
<=45          0.5        $60,000          0.7        $70,000
>45           0.5        $120,000         0.3        $130,000

Mean                     $90,000                     $88,000

43

Page 75: Applied Statistics - MIT

Conclusion

• If MD salaries are broken into two categories by age:
  – Doctors younger than 45 constituted 50% of the MD population in 1980 and 70% in 1990
  – Younger doctors tend to earn less than older, more experienced doctors
  – Parsed by age, MD salaries increased in both age categories!
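A quick numerical check of the salary table on the previous slide (S-Plus/R-style sketch):

f1 <- c(0.5, 0.5); inc1 <- c(60000, 120000)   # 1980 age fractions and incomes
f2 <- c(0.7, 0.3); inc2 <- c(70000, 130000)   # 1990 age fractions and incomes
sum(f1 * inc1)    # 90000: overall mean in 1980
sum(f2 * inc2)    # 88000: overall mean in 1990, lower despite higher incomes in each age group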

44

Page 76: Applied Statistics - MIT

Gender Bias in Graduate Admissions

For this example, see Johnson and Wichern, Business Statistics: Decision Making with Data. Wiley, First Edition, 1997.

45

Page 77: Applied Statistics - MIT

Statistical Ideal: Randomized Study

Gender should be randomly assigned to applicants! This would automatically balance out the departmental factor, which is not controlled for in the original plaintiff (observational) study.

Practical Reality

Gender cannot be assigned randomly. Control for the department factor by comparing admission within department, i.e. controlling for the confounding factor after completion of the study.

46

Page 78: Applied Statistics - MIT

“There are lies, damn lies and then there are statistics!”

Benjamin Disraeli

47

Page 79: Applied Statistics - MIT

Summarizing Bivariate Numerical Data

No.   Method 1 (xi)   Method 2 (yi)
 1        88              86
 2        78              81
 3        90              87
 4        91              90
 5        89              89
 6        79              80
 7        76              74
 8        80              78
 9        78              76
10        90              86

[Scatter plot of Method 2 vs. Method 1]

Is it easier to grasp the relationship in the data between Method A and Method B from the Table or from the Figure (scatter plot)?

48

Page 80: Applied Statistics - MIT

Labeled Scatter Plot

Year   Country A   Country B   Country C   Country D
1965      64.7        64.8        61.1        86.2
1970      65.0        65.2        61.2        86.5
1975      66.8        66.3        63.0        87.4
1980      66.9        67.4        62.8        87.0
1985      67.9        68.5        63.1        89.2
1990      68.3        69.1        63.5        89.4
1995      70.8        69.4        64.3        90.1
2000      71.7        70.0        65.1        90.5

Can you see the improvements in the literacy rates for these four countries more easily in the Table or in the Figure?

[Labeled scatter plot of literacy rate vs. year for Countries A–D]

49

Page 81: Applied Statistics - MIT

Sample Correlation Coefficient

A single numerical summary statistic which measures the strength of a linear relationship between x and y.

r = covar(x,y) / (stddev(x) * stddev(y))

Properties similar to the population correlation coefficient ρ:
– Unitless quantity
– Takes values between –1 and 1
– The extreme values are attained if and only if the points (xi, yi) fall exactly on a straight line (r = -1 for a line with negative slope and r = +1 for a line with positive slope)
– Takes values close to zero if there is no linear relationship between x and y

• See Figures 4.15, 4.16, 4.17 (a) and (b)

r = sxy / (sx sy),  where  sxy = (1/(n-1)) Σ (xi - x̄)(yi - ȳ)
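A sketch computing r for the Method 1 / Method 2 data shown a few slides back (S-Plus/R style):

x <- c(88, 78, 90, 91, 89, 79, 76, 80, 78, 90)   # Method 1
y <- c(86, 81, 87, 90, 89, 80, 74, 78, 76, 86)   # Method 2
sxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
sxy / (sqrt(var(x)) * sqrt(var(y)))              # r from the definition
cor(x, y)                                        # same value from the built-in function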

50

Page 82: Applied Statistics - MIT

What is the correlation?

[Scatter plot of y vs. x]

51

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 83: Applied Statistics - MIT

What is the correlation?

[Scatter plot of y vs. x]

52

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 84: Applied Statistics - MIT

What is the correlation?

[Scatter plot of y vs. x]

53

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 85: Applied Statistics - MIT

Correlation and CausationHigh correlation is frequently mistaken for a cause and effect relationship. Such a conclusion may not be valid in observational studies, where the variables are not controlled.

– A lurking variable may be affecting both variables.– One can only claim association, not causation.

Countries with high fat diets tend to have higher incidences of cancer. Can we conclude causation?

A common lurking variable in many studies is time order.
– Wealth and health problems go up with age. Does wealth cause health problems?

Sometimes correlations can be found without any plausible explanation, e.g., sun spots and economic cycles.

54

Page 86: Applied Statistics - MIT

Plots for Multivariate Data

• Side by side box plots
• Scatter plot matrix
• Three dimensional plots
• Brush and spin plots – add motion
• Maps for spatial data

55

Page 87: Applied Statistics - MIT

56

Box Plots of Auto Data
(widths indicate number of each type)

[Side-by-side box plots of fuel.frame[, "Mileage"] by fuel.frame[, "Type"]: Compact, Large, Medium, Small, Sporty, Van]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 88: Applied Statistics - MIT

57

Scatter plot matrix: Iris (Versicolor)

[Scatter plot matrix of Sepal.L., Sepal.W., Petal.L., and Petal.W.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 89: Applied Statistics - MIT

58

• galaxy (S-PLUS Language Reference): Radial Velocity of Galaxy NGC7531

• SUMMARY: The galaxy data frame records the radial velocity of a spiral galaxy measured at 323 points in the area of sky which it covers. All the measurements lie within seven slots crossing at the origin. The positions of the measurements are given by four variables (columns).

• ARGUMENTS:
  • east.west – the east-west coordinate. The origin, (0,0), is near the center of the galaxy; east is negative, west is positive.
  • north.south – the north-south coordinate. The origin, (0,0), is near the center of the galaxy; south is negative, north is positive.
  • angle – degrees of counter-clockwise rotation from the horizontal of the slot within which the observation lies.
  • radial.position – signed distance from origin; negative if the east-west coordinate is negative.
  • velocity – radial velocity measured in km/sec.

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 90: Applied Statistics - MIT

Galaxy Data

[Scatter plot matrix of east.west, north.south, radial.position, and velocity]

59

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 91: Applied Statistics - MIT

Galaxy 3D

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

60

Page 92: Applied Statistics - MIT

61

Earthquake Data

[Scatter plot matrix of longitude, latitude, and magnitude for the earthquake data]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 93: Applied Statistics - MIT

62

Earthquake 3D

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 94: Applied Statistics - MIT

Narrative Graphics of Space and Time

• Adding spatial dimensions to a graph so that the data are moving over space and time can enhance the explanatory power of time series displays

• The Classic of Charles Joseph Minard (1781-1870) shows the terrible fate of Napoleon’s army during his Russian campaign of 1812. A copy of the map is available at http://www.math.yorku.ca/SCS/Gallery/

63

Map Source: Minard, C. J. Carte figurative des pertes successives en hommes de l'armée qu'Annibal conduisit d'Espagne en Italie en traversant les Gaules (selon Polybe). Carte figurative des pertes successives en hommes de l'armée française dans la campagne de Russie, 1812-1813. École Nationale des Ponts et Chaussées (ENPC), 1869. Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.

Page 95: Applied Statistics - MIT

Beginning at the left on the Polish-Russian border near the Niemen River, the thick band shows the size of the army (422,000) as it invaded Russia in June 1812.
– The width of the band indicates the size of the army.
– The army reached a sacked and deserted Moscow with 100,000 men.
– Napoleon's retreat path from Moscow is depicted by a dark, lower band, linked to a temperature scale and dates at the bottom.
– The men struggled into Poland with only 10,000 troops remaining.

64

Page 96: Applied Statistics - MIT

• Minard’s graphic tells a rich, coherent story with its multivariate data, far more enlightening than just a single number

• SIX variables are plotted:
  – The army's location on a two-dimensional surface
  – Direction of the army's movement
  – Temperature as a function of time during the retreat
  – The size of the army

• “It may well be the best statistical graphic ever drawn.” Edward Tufte (The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001, pp. 40)

65

Page 97: Applied Statistics - MIT

Scatter plot matrix of air data set in S-Plus

66

[Scatter plot matrix of ozone, radiation, temperature, and wind]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 98: Applied Statistics - MIT

67

plot(temperature,ozone)

[Scatter plot of ozone vs. temperature]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 99: Applied Statistics - MIT

Fitting Lines

We often try to fit a straight line to bivariate data as a way to summarize it:

y = data = fit + residual

fit = a + bx

The parameters (coefficients) a and b can be found in many ways. Least-squares is commonly used:

min over (a, b) of Σ (yi - a - b xi)²,  i = 1, …, n

which gives b = Sxy / Sxx and a = ȳ - b x̄.

The fit is often denoted by ŷi = a + b xi. The residuals are yi - ŷi.

What about curvature and outliers?
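A minimal S-Plus/R-style sketch of the least-squares formulas (x and y here are made-up illustration data, not from the course):

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b <- Sxy / Sxx
a <- mean(y) - b * mean(x)
c(a, b)
coef(lm(y ~ x))     # the built-in least-squares fit gives the same coefficients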

68

Page 100: Applied Statistics - MIT

Resistant Line

Divide the x data into thirds. Find the median of x in each third, and the median of the y's that correspond to the x's in each third. Call these three pairs (xa, ya), (xb, yb), (xc, yc). Fit a least-squares line to these three points.

Or consider other metrics, such as

min over (a, b) of Σ |yi - a - b xi|,  i = 1, …, n

min over (a, b) of median of |yi - a - b xi|

These are alternatives to least-squares.
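A rough sketch of the three-group resistant line described above (the function name resistant.line is invented for illustration; this is not a standard library routine):

resistant.line <- function(x, y) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]
  n <- length(x)
  third <- cut(1:n, 3, labels = FALSE)    # split the ordered data into thirds
  xm <- tapply(x, third, median)          # median of x within each third
  ym <- tapply(y, third, median)          # median of the corresponding y's
  lm(ym ~ xm)                             # least-squares line through the three summary points
}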

69

Page 101: Applied Statistics - MIT

70

abline(lm(ozone~temperature))

[Scatter plot of ozone vs. temperature with the fitted least-squares line added]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 102: Applied Statistics - MIT

Prediction and Residuals

Fitted lines can be used to predict. If we go too far beyond the range of the x-data, we can expect poor results. Consider problems of interpolation and extrapolation.

Examination of residuals helps tell us how well our model (a line) fits the data. We also compute

s = sqrt[ (1/(n-2)) Σ (yi - ŷi)² ],  i = 1, …, n

and call s the standard deviation of the residuals. Note the use of n − 2 because two degrees of freedom are used to find a and b.

71

Page 103: Applied Statistics - MIT

Residual Plots

72

Plot the residuals:
1. against fitted values (ŷi)
2. against the explanatory variable
3. against other possible explanatory variables
4. against time, if applicable.

We want these pictures to look random — no pattern.

Outliers and Influence

Values of x far away from the line have a lot of leverage on the line. Values of y with large residuals at high leverage points will usually be quite influential on the fitted line.

We can check by setting influential points aside and comparing fits and residuals.

Page 104: Applied Statistics - MIT

73

Plot of residuals vs. observation number for ozone data

[Plot of resid(lmfit) vs. observation number]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 105: Applied Statistics - MIT

74

Residuals vs. Fitted Values for ozone data

[Plot of resid(lmfit) vs. fitted(lmfit)]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 106: Applied Statistics - MIT

Smoothing

• Fitting curves to data
• Separate signal from noise
• Fitted values, ŷ, are a weighted average of the response y.
• Weights are a function of the predictor x.
• Degrees of freedom indicate roughness
• Simple linear regression: df = 2

75

Page 107: Applied Statistics - MIT

76

plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=16.5))

[Scatter plot of ozone vs. temperature with a rough spline smooth (df=16.5)]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 108: Applied Statistics - MIT

77

plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=6))

[Scatter plot of ozone vs. temperature with a smoother spline fit (df=6)]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 109: Applied Statistics - MIT

Time-Series/Runs Chart

Plot of Compression vs. Time (Order of Production)

This is an example of a process not in “statistical control”, as seen from the downward drift.

[Run chart of compression vs. production order, drifting downward]

The usual statistics procedures (such as means, standard deviation, confidence interval, hypothesis testing) should NOT be applied until the process has been stabilized.

78

Page 110: Applied Statistics - MIT

Time-Series Data

Data obtained at successive time points for the same sampling unit(s).

A time series typically consists of the following components:
1. Stable component
2. Trend component
3. Seasonal component
4. Random component
5. Cyclic (long term) component

Univariate time series { xt, t = 1, 2, …, T }

Time-series plot: Xt vs. Time

79

Page 111: Applied Statistics - MIT

Data Smoothing and Forecasting

Two types of averages for time-series data:

1. Moving averages

2. Exponentially weighted averages

These should be used only if mean is constant (process is in “statistical control” or is stationary) or mean varies slowly.

Regression techniques can be used to model trends.

More advanced methods are needed to model seasonality and dependence between successive observations (autocorrelation).

80

Page 112: Applied Statistics - MIT

(Arithmetic) Moving Averages (MA)

The average of a set of w successive data values (called a window); the oldest data value is successively dropped off.

MAt = (x(t-w+1) + … + xt) / w,  for t = w, w+1, …, T

The bigger the window (w), the more the smoothing.

MA forecast:  x̂t = MA(t-1)

Forecast error:  et = xt - x̂t = xt - MA(t-1),  t = 2, …, T

Mean Absolute Percent Error:  MAPE = (1/(T-1)) Σ |et / xt| × 100%, summed over t = 2, …, T
(error in eqn 4.12 in textbook: x not y in the denominator)
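A small S-Plus/R-style sketch of the moving-average forecast (the series x is made up for illustration):

x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
w <- 3
ma <- rep(NA, length(x))
for (t in w:length(x)) ma[t] <- mean(x[(t - w + 1):t])   # MA_t over the last w values
xhat <- c(NA, ma[-length(ma)])                           # forecast: xhat_t = MA_(t-1)
e <- x - xhat                                            # forecast errors
mean(abs(e / x), na.rm = TRUE) * 100                     # mean absolute percent error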

81

Page 113: Applied Statistics - MIT

Exponentially Weighted Moving Averages

Uses all data, but the most recent data is weighted the heaviest.

EWMAt = w·xt + (1-w)·EWMA(t-1)

where 0 < w < 1 is the smoothing constant (usually 0.2 to 0.3).

EWMA forecast:  x̂t = EWMA(t-1)

Forecast error:  et = xt - x̂t = xt - EWMA(t-1)

Alternative formula:  EWMAt = EWMA(t-1) + w·et

Interpretation: If the forecast error is positive (forecast underestimated the actual value), the next period's forecast is adjusted upward by a fraction of the forecast error.
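The EWMA recursion in the same style (w and the series are made-up illustration values):

ewma <- function(x, w, start = x[1]) {
  out <- numeric(length(x))
  out[1] <- start
  for (t in 2:length(x)) out[t] <- w * x[t] + (1 - w) * out[t - 1]
  out
}
x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ewma(x, w = 0.3)          # smoothed series; the forecast for the next period is the last value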

82

Page 114: Applied Statistics - MIT

Autocorrelation Coefficient

For time-series data, observations separated by a specified time period (called a lag) are said to be lagged.

First-order autocorrelation or the serial correlation coefficient between observations with lag = 1:

r1 = [ Σ (xt - x̄)(x(t-1) - x̄), t = 2, …, T ] / [ Σ (xt - x̄)², t = 1, …, T ]

The k-th order autocorrelation coefficient:

rk = [ Σ (xt - x̄)(x(t-k) - x̄), t = k+1, …, T ] / [ Σ (xt - x̄)², t = 1, …, T ]
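A sketch of the lag-k formula (S-Plus/R style; the series is made up for illustration):

autocorr <- function(x, k) {
  n <- length(x)
  xb <- mean(x)
  sum((x[(k + 1):n] - xb) * (x[1:(n - k)] - xb)) / sum((x - xb)^2)
}
x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
autocorr(x, 1)    # first-order (serial) autocorrelation
autocorr(x, 2)    # second-order autocorrelation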

83

Page 115: Applied Statistics - MIT

84

Lag Plots in S-Plus

lag.plot(x)  or  plot(x[1:(n-i)], x[(i+1):n])

[Housing starts 1966:1974, lagged scatterplots: the series plotted against itself at lags 1 through 6]

These graphs were created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 116: Applied Statistics - MIT

John W. Tukey (1915 - 2000)

Statistician at Princeton Univ. and Bell Labs

Co-developer of Fast Fourier Transform

Coined terms “bit” (binary digit) and “software”

“An approximate answer to the right problem is worth a great deal more than a precise answer to the wrong problem.”

Developed new graphical displays (stem-and-leaf and box plots) to examine the data, as a reaction to the “mathematization of statistics.”

85

Page 117: Applied Statistics - MIT

Review of Probability

Corresponds to Chapter 2 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT),with some slides by Jacqueline Telford

(Johns Hopkins University)

1

Page 118: Applied Statistics - MIT

Concepts (Review)

A population is a collection of all units of interest.
A sample is a subset of a population that is actually observed.
A measurable property or attribute associated with each unit of a population is called a variable.
A parameter is a numerical characteristic of a population.
A statistic is a numerical characteristic of a sample. Statistics are used to infer the values of parameters.
A random sample gives a non-zero chance to every unit of the population to enter the sample.
In probability, we assume that the population and its parameters are known and compute the probability of drawing a particular sample.
In statistics, we assume that the population and its parameters are unknown and the sample is used to infer the values of the parameters.
Different samples give different estimates of population parameters (called sampling variability). Sampling variability leads to “sampling error”.
Probability is deductive (general -> particular). Statistics is inductive (particular -> general).

2

Page 119: Applied Statistics - MIT

Difference between Statistics and Probability

Statistics: Given the information in your hand, what is in the box?

Probability: Given the information in the box, what is in your hand?

Based on: Statistics, Norma Gilbert, W.B. Saunders Co., 1976. 3

Page 120: Applied Statistics - MIT

Probability Concepts

Random experiment – procedure whose outcome cannot be predicted in advance. E.g. toss a coin twice

Sample Space (S) – The finest grain, mutually exclusive, collectively exhaustive listing of all possible outcomes (Drake, Fundamentals of Applied Probability Theory) S={H,H},{H,T},{T,H},{T,T}

Event (A) a set of outcomes (subset of S). E.g. No heads A={T,T}

Union (or) E.g. A=heads on first, B=heads on second A U B= {H,T},{H,H},{T,H}

Intersection (and): E.g. A= heads on first, B=heads on second A ∩ B = {H,H}

Complement of Event A – set of all outcomes not in A. E.g. A={T,T}, Ac={H,H},{H,T},{T,H}

4

Page 121: Applied Statistics - MIT

5

Venn Diagram

[Venn diagram of two overlapping events A and B]

Page 122: Applied Statistics - MIT

Axioms of Probability

Associated with each event A in S is the probability of A, P(A). Axioms:

1. P(A) ≥ 0 2. P(S) = 1 where S is the sample space 3. P(A U B) = P(A) + P(B) if A and B are mutually exclusive

E.g. P(ace or king) = P(ace)+P(king)=1/13+1/13=2/13.

Theorems about probability can be proved using these axioms and these theorems can be used in probability calculations.

P(A) = 1 - P(Ac)   (see “birthday problem” on p. 13)
P(A U B) = P(A) + P(B) – P(A∩B)
E.g. P(ace or black) = P(ace) + P(black) – P(ace and black) = 4/52 + 26/52 – 2/52 = 28/52 = 7/13

6

Page 123: Applied Statistics - MIT

Conditional Probability:
P(A|B) = P(A∩B)/P(B)
P(A∩B) = P(A|B)P(B)

E.g. Drawing a card from a deck of 52 cards, P(Heart)=1/4.

However, if it is known that the card is red, P(Heart | Red) = ½.

Sample space has been reduced to the 26 red cards.

(See page 16)

7

Page 124: Applied Statistics - MIT

Independence P(A|B)=P(A)

There are situations in which knowing that event B occurred gives no information about event A, E.g. knowing that a card is black gives no information about whether it is an ace. P(ace | black) = 2/26 = 4/52 = P(ace).

If two events are independent then P(A∩B) = P(A)P(B):
P(A∩B) = P(A|B)P(B) = P(A)P(B)
E.g. P(ace of hearts) = P(ace) * P(hearts) = 4/52 * 13/52 = 1/52

Independent events are not the same as disjoint events. Strong dependence between disjoint events. E.g. card is red means can’t be black. P(A|B)=0.

8

Page 125: Applied Statistics - MIT

Summary

If A and B are disjoint:
P(A U B) = P(A) + P(B)
P(A ∩ B) = 0

If A and B are independent:
P(A ∩ B) = P(A) * P(B)
P(A U B) = P(A) + P(B) – P(A ∩ B)

9

Page 126: Applied Statistics - MIT

Bayes Theorem

• P(A∩B) = P(A|B) P(B) = P(B|A) P(A)
• P(B|A) = P(A|B) P(B) / P(A)
• P(B) = prior probability
• P(B|A) = posterior probability

• E.g. P(heart | red) = P(red | heart) * P(heart) / P(red) = 1 * 0.25 / 0.5 = 0.5

• Monty Hall problem (page 20)

10

Page 127: Applied Statistics - MIT

Sensor Problem

Assume that there are two chemical hazard sensors: A and B.

Let P(A falsely detecting a hazardous chemical)=0.05 and the same for B.

What is the probability of both sensors falsely detecting a hazardous chemical?

P (A ∩ B) = P(A|B)×P(B) = P(A) × P(B) = 0.05 × 0.05 = 0.0025

– only if A and B are independent (use different detection methods).

If A and B are both “fooled” by the same chemical substance, then P (A ∩ B) = P(A | B) × P(B) = 1 × 0.05 = 0.05 – which is 20 times the rate of false alarms (same type of sensor)

DON’T assume independence without good reason! 11

Page 128: Applied Statistics - MIT

HIV Testing Example (made-up data)

                    HIV +    HIV -    Total
Test positive (+)      95      495      590
Test negative (-)       5     9405     9410
Total                 100     9900    10000

P(HIV +) = 100/10000 = .01 (prevalence)

P(Test + | HIV +) = 95/100 = 0.95 (sensitivity)       <- want these to be high
P(Test - | HIV -) = 9405/9900 = .95 (specificity)     <- want these to be high
P(Test - | HIV +) = 5/100 = .05 (false negatives)     <- want these to be low
P(Test + | HIV -) = 495/9900 = .05 (false positives)  <- want these to be low

P(HIV + | Test +) = 95/590 = 0.16
This is one reason why we don't have mass HIV screening
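The same posterior probability follows from Bayes' theorem; a minimal S-Plus/R-style check:

prev <- 0.01     # P(HIV+), prevalence
sens <- 0.95     # P(Test+ | HIV+), sensitivity
spec <- 0.95     # P(Test- | HIV-), specificity
p.test.pos <- sens * prev + (1 - spec) * (1 - prev)   # P(Test+)
sens * prev / p.test.pos                              # P(HIV+ | Test+), about 0.16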

Page 129: Applied Statistics - MIT

Suggestions for Solving Probability Problems

Draw a picture – Venn diagram – Tree or event diagram (Probabilistic Risk Assessment) – Sketch

Write out all possible combinations if feasible

Do a smaller scale problem first – Figure out the algorithm for the solution

– Increment the size of the problem by one and check algorithm for correctness

– Generalize algorithm (mathematical induction)

13

Page 130: Applied Statistics - MIT

Counting Rules

Number of Possible Arrangements of Size r from n Objects:

              Without Replacement      With Replacement
Ordered:        n! / (n-r)!               n^r
Unordered:      C(n,r)                    C(n+r-1, r)

Source: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990, page 16. 14

Page 131: Applied Statistics - MIT

Counting rules (from Casella & Berger)

For these examples, see pages 15-16 of: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990.

15

Page 132: Applied Statistics - MIT

Birthday Problem

At a gathering of s randomly chosen students, what is the probability that at least 2 will have the same birthday?

P(at least 2 have same birthday) = 1 - P(all s students have different birthdays).

Assume 365 days in a year. Think of students' birthdays as a sample of these 365 days.

The total number of possible outcomes is: N = 365^s (ordered, with replacement)

The number of ways that s students can have different birthdays is M = 365!/(365-s)! (ordered, without replacement)

P(all s students have different birthdays) is M / N.

16
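A short S-Plus/R-style sketch of this calculation (the function name birthday is made up for illustration):

birthday <- function(s) {
  p.all.diff <- prod((365 - (0:(s - 1))) / 365)   # M/N = 365*364*...*(365-s+1) / 365^s
  1 - p.all.diff                                  # P(at least two share a birthday)
}
birthday(23)    # about 0.507: with 23 students a shared birthday is more likely than not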

Page 133: Applied Statistics - MIT

Probability that all students have different birthdays

[Plot of the probability that all s students have different birthdays vs. number of students, s = 1 to 80]

17

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 134: Applied Statistics - MIT

See “Harry Potter and the Sorcerer’s Stone” by J.K.

Rowling.

18

Page 135: Applied Statistics - MIT

Another Counting Rule

The number of ways of classifying n items into k groups with ri in group i, r1 + r2 + … + rk = n, is:

n! / (r1! r2! r3! ... rk!)

For example: How many ways are there to assign 100 incoming students to the 4 houses at Hogwarts (25 to each house)?

(1.6 * 10^57)
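A sketch of the arithmetic on the log scale, to avoid overflow (S-Plus/R style; assumes 25 students per house):

exp(lgamma(101) - 4 * lgamma(26))    # 100! / (25!)^4, about 1.6e57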

19

Page 136: Applied Statistics - MIT

Random Variables

A random variable (r.v.) associates a unique numerical value with each outcome in the sample space.

Example:
X = 1 if coin toss results in a head
X = 0 if coin toss results in a tail

Discrete random variables: number of possible values is finite or countably infinite: x1, x2, x3, x4, x5, x6, …

Probability mass function (p.m.f.):
f(x) = P(X = x)   (sum over all possible values = 1 always)

Cumulative distribution function (c.d.f.):
F(x) = P(X ≤ x) = Σ f(k), summed over k ≤ x

• See Table 2.1 on p. 21 (p.m.f. and c.d.f. for sum of two dice)
• See Figure 2.5 on p. 22 (p.m.f. and c.d.f. graphs for two dice)

20

Page 137: Applied Statistics - MIT

Continuous Random Variables

An r.v. is continuous if it can assume any value from one or more intervals of real numbers.

Probability density function (p.d.f.) f(x):

f(x) ≥ 0

∫ f(x) dx over (-∞, ∞) = 1   (area under the curve = 1 always)

P(a ≤ X ≤ b) = ∫ f(x) dx from a to b, for any a ≤ b

21

Page 138: Applied Statistics - MIT

P(0<X<1) for standard normal= area under curve between 0 and 1

[Plot of the standard normal density dnorm(x), x from -4 to 4, with the area between 0 and 1 shaded]

22

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 139: Applied Statistics - MIT

Cumulative Distribution Function

The cumulative distribution function (c.d.f.), denoted F(x), for a continuous random variable is given by:

F(x) = P(X ≤ x) = ∫ f(y) dy from -∞ to x

f(x) = dF(x)/dx

23

Page 140: Applied Statistics - MIT

P(0<Z<1) for standard normal = F(1) - F(0) = 0.8413 - 0.5 = 0.3413 (table page 674)

[Plot of the standard normal CDF pnorm(z), z from -4 to 4]

24

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 141: Applied Statistics - MIT

Expected Value

The expected value or mean of a discrete r.v. X, denoted by E(X), µX, or simply µ, is defined as:

E(X) = µ = Σ x f(x) = x1 f(x1) + x2 f(x2) + …

This is essentially a weighted average of the possible values the r.v. can assume, with weights = f(x).

The expected value of a continuous r.v. X is defined as:

E(X) = µ = ∫ x f(x) dx

25

Page 142: Applied Statistics - MIT

Variance and Standard Deviation

The variance of an r.v. X, denoted by Var(X), σX², or simply σ², is defined as:

Var(X) = σ² = E[(X - µ)²]

Var(X) = E[(X - µ)²] = E(X² - 2µX + µ²)
       = E(X²) - 2µE(X) + E(µ²)
       = E(X²) - 2µµ + µ²
       = E(X²) - µ² = E(X²) - [E(X)]²

The standard deviation (SD) is the square root of the variance. Note that the variance is in the square of the original units, while the SD is in the original units.

• See Example 2.17 on p. 26 (mean and variance of two dice)

26

Page 143: Applied Statistics - MIT

Quantiles and Percentiles

For 0 ≤ p ≤ 1 the pth quantile (or the 100pth percentile), denoted by θp, of a continuous r.v. X is defined by the following equation:

P(X ≤ θp) = F(θp) = p

θ.5 is called the median

• See Example 2.20 on p. 30 (exponential distribution)

Jointly distributed random variables and independent random variables: see pp. 30-33

27

Page 144: Applied Statistics - MIT

Joint Distributions

For a discrete distribution:

f(x,y) = P(X=x,Y=y)

f(x,y) ≥ 0 for all x and y ∑x ∑y f(x,y)=1

28

Page 145: Applied Statistics - MIT

Marginal Distributions

• g(x) = P(X=x) = ∑y f(x,y) • h(y) = P(Y=y) = ∑x f(x,y)

• Independent if joint distribution factors into product of marginal distributions

• f(x,y) = g(x) h(y)

29

Page 146: Applied Statistics - MIT

Conditional Distributions

f(y|x) = f(x,y) / g(x)

If X and Y are independent:

f(y|x) = g(x) h(y) / g(x) = h(y)

A conditional distribution is just a probability distribution defined on a reduced sample space. For every x, Σ_y f(y|x) = 1.

30

Page 147: Applied Statistics - MIT

Covariance and Correlation

Cov(X,Y) = σ_XY = E[(X - µ_X)(Y - µ_Y)] = E(XY) - E(X)E(Y) = E(XY) - µ_X µ_Y

If X and Y are independent, then E(XY) = E(X)E(Y), so the covariance is zero. The other direction is not true.

Note that: E(XY) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} x y f(x,y) dx dy

ρ_XY = Corr(X,Y) = Cov(X,Y) / √(Var(X)Var(Y)) = σ_XY / (σ_X σ_Y)

• See Examples 2.26 and 2.27 on pp. 37-38 (prob vs. stat grades)

31


Page 148: Applied Statistics - MIT

Example 2.25 in text: Y = X with probability 0.5 and Y = -X with probability 0.5.

Y is not independent of X, yet the covariance is zero.

[Figure: scatter plot of y against x, with points falling along the lines y = x and y = -x.]

32 This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 149: Applied Statistics - MIT

Two Famous Theorems

Chebyshev’s Inequality: Let c > 0 be a constant. Then, irrespective of the distribution of X,

P(|X - µ| ≥ c) ≤ σ²/c²

• See Example 2.29 on p. 41 (exact vs. Cheb. for two dice)

Weak Law of Large Numbers: Let X̄ be the sample mean of n i.i.d. observations from a population with finite mean µ and variance σ². Then, for any fixed c > 0,

P(|X̄ - µ| ≥ c) → 0 as n → ∞
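For a concrete comparison (a sketch using a normal r.v. rather than the two-dice example in the text), the exact probability and the Chebyshev bound for c = 2σ can be computed in S-PLUS:

2 * (1 - pnorm(2))   # exact P(|X - mu| >= 2*sigma) for a normal r.v.: 0.0455
1 / 2^2              # Chebyshev bound sigma^2/c^2 with c = 2*sigma: 0.25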

33

Page 150: Applied Statistics - MIT

Selected Discrete Distributions

Bernoulli trials: (single coin flip)

f(x) = P(X = x) = p if x = 1 (success); 1 - p if x = 0 (failure)

E(X) = p and Var(X) = p(1-p)

Binomial distribution: (multiple coin flips)

X = number of successes out of n trials

f(x) = P(X = x) = C(n,x) p^x (1-p)^(n-x)  for x = 0, 1, …, n

E(X) = np and Var(X) = np(1-p)

• See Example 2.30 on p. 43 (teeth)
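The binomial p.m.f. is available in S-PLUS as dbinom; a small sketch verifying the mean and variance formulas for n = 10, p = 0.3 (illustrative values, not from the text):

x <- 0:10
f <- dbinom(x, size = 10, prob = 0.3)   # binomial p.m.f.
sum(f)                                  # sums to 1
sum(x * f)                              # E(X) = np = 3
sum((x - 3)^2 * f)                      # Var(X) = np(1-p) = 2.1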

34

Page 151: Applied Statistics - MIT

Selected Discrete Distributions (cont)

Hypergeometric: drawing balls from the box without replacing the balls (as in the hand with the question mark)

Poisson: number of occurrences of a rare event

Geometric: number of failures before the first success

Multinomial: more than two outcomes

Negative Binomial: number of trials to get r successes

Uniform: N equally likely events 1 2 3 … N

• See Table 2.5, p. 59 for properties of these distributions

35

Page 152: Applied Statistics - MIT

Selected Continuous DistributionsUniform: equally likely over an interval

Exponential: lifetimes of devices with no wear-out (“memoryless”), interarrival times when the arrivals are at random

Gamma: used to model lifetimes, related to many other distributions

Lognormal: lifetimes (similar shape to Gamma but with longer tail)

Beta: not equally likely over an interval

• See Table 2.5, p. 59 for properties of these distributions36

Page 153: Applied Statistics - MIT

Normal Distribution

First discovered by de Moivre (1667-1754) in 1733.

Rediscovered by Laplace (1749-1827) and also by Gauss (1777-1855) in their studies of errors in astronomical measurements.

Often referred to as the Gaussian distribution.

37

Page 154: Applied Statistics - MIT

Carl Friedrich Gauss (1777 - 1855)

Photograph courtesy of John L. Telford, John Telford Photography. Used with permission. Currency from 1991.

38

Page 155: Applied Statistics - MIT

Karl Pearson (1857 - 1936)

“Many years ago I called the Laplace-Gauss curve the NORMAL curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ABNORMAL. That belief is, of course, not justifiable.”

Karl Pearson, 1920

39

Page 156: Applied Statistics - MIT

Normal Distribution (“Bell-curve”, Gaussian)

A continuous r.v. X has a normal distribution with parameters µ and σ² if its probability density function is given by:

f(x) = (1 / (σ√(2π))) exp[-(x - µ)² / (2σ²)]  for -∞ < x < ∞

E(X) = µ and Var(X) = σ²  (see Figure 2.12, p. 53)

Standard normal distribution: Z = (X - µ)/σ ~ N(0, 1)

• See Table A.3 on p. 673:  Φ(z) = P(Z ≤ z)

P(X ≤ x) = P(Z = (X - µ)/σ ≤ (x - µ)/σ = z) = Φ((x - µ)/σ)

• See Examples 2.37 and 2.38 on pp. 54-55 (computations)

40

Page 157: Applied Statistics - MIT

Percentiles of the Normal Distribution

Suppose that the scores on a standardized test are normally distributed with mean 500 and standard deviation 100. What is the 75th percentile score of this test?

P(X ≤ x) = P((X - 500)/100 ≤ (x - 500)/100) = Φ((x - 500)/100) = 0.75

From Table A.3, Φ(0.675) = 0.75

(x - 500)/100 = 0.675  ⇒  x = 500 + (100)(0.675) = 567.5

Useful Information about the Normal Distribution:

~68% of a normal population is within ±1σ of µ
~95% of a normal population is within ±2σ of µ
~99.7% of a normal population is within ±3σ of µ

41

Page 158: Applied Statistics - MIT

75th percentile for a test with scores which are normally distributed, mean=500, standard deviation=100

[Figure: normal c.d.f. pnorm(x, 500, 100) plotted for x from 200 to 800.]

qnorm(0.75, 500, 100) = 567.5

pnorm(567.5, 500, 100) = 0.75

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation. 42

Page 159: Applied Statistics - MIT

Linear Combinations of r.v.s

Xi ~ N(µi, σi²) for i = 1, …, n, and Cov(Xi, Xj) = σij for i ≠ j.

Let X = a1X1 + a2X2 + … + anXn, where the ai are constants.

Then X has a normal distribution with mean and variance:

E(X) = E(a1X1 + a2X2 + … + anXn) = a1µ1 + a2µ2 + … + anµn = Σ_{i=1}^{n} ai µi

Var(X) = Var(a1X1 + a2X2 + … + anXn) = Σ_{i=1}^{n} ai² σi² + 2 Σ_i Σ_{j≠i} ai aj σij

X̄ = (X1 + X2 + … + Xn) / n, so ai = 1/n.

Therefore, X̄ from n i.i.d. N(µ, σ²) observations ~ N(µ, σ²/n), since the covariances (σij) are zero (by independence).

43
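A simulation sketch of the last result (illustrative settings: 1000 samples of n = 25 observations each, σ = 2):

m <- matrix(rnorm(1000 * 25, mean = 0, sd = 2), ncol = 25)
xbar <- apply(m, 1, mean)   # 1000 sample means, each based on n = 25 observations
var(xbar)                   # close to sigma^2/n = 4/25 = 0.16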

Page 160: Applied Statistics - MIT

Sampling Distributions of Statistics

Corresponds to Chapter 5 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT),with some slides by Jacqueline Telford

(Johns Hopkins University)

1

Page 161: Applied Statistics - MIT

Sampling Distributions

2

Definitions and Key Concepts• A sample statistic used to estimate an unknown population parameter is called an estimate.

• The discrepancy between the estimate and the true parameter value is known as sampling error.

• A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.

• We use the sampling distribution of a statistic to assess the sampling error in an estimate.

Page 162: Applied Statistics - MIT

Random Sample• Definition 5.11, page 201, Casella and Berger.

• How is this different from a simple random sample?

• For mutual independence, population must be very large or must sample with replacement.

3

Page 163: Applied Statistics - MIT

Sample Mean and Variance

Sample Mean:  X̄ = (1/n) Σ_{i=1}^{n} Xi

Sample Variance:  S² = (1/(n-1)) Σ_{i=1}^{n} (Xi - X̄)²

How do the sample mean and variance vary in repeated samples of size n drawn from the population?

In general, it is difficult to find the exact sampling distribution. However, see the example of deriving the distribution when all possible samples can be enumerated (rolling 2 dice) in sections 5.1 and 5.2. Note errors on page 168.

4

Page 164: Applied Statistics - MIT

Properties of a sample mean and variance

See Theorem 5.2.2, page 268, Casella & Berger.

5

Page 165: Applied Statistics - MIT

Distribution of Sample Means• If the i.i.d. r.v.’s are

– Bernoulli– Normal– Exponential

The distributions of the sample means can be derived

Sum of n i.i.d. Bernoulli(p) r.v.’s is Binomial(n,p)

Sum of n i.i.d. Normal(µ,σ2) r.v.’s is Normal(nµ,nσ2)

Sum of n i.i.d. Exponential(λ) r.v.’s is Gamma(λ,n)

6

Page 166: Applied Statistics - MIT

Distribution of Sample Means

• Generally, the exact distribution is difficult to calculate.

• What can be said about the distribution of the sample mean when the sample is drawn from an arbitrary population?

• In many cases we can approximate the distribution of the sample mean when n is large by a normal distribution.

• The famous Central Limit Theorem

7

Page 167: Applied Statistics - MIT

Central Limit Theorem

Let X1, X2, … , Xn be a random sample drawn from an arbitrary distribution with a finite mean µ and variance σ².

As n goes to infinity, the sampling distribution of

(X̄ - µ) / (σ/√n)

converges to the N(0,1) distribution.

Sometimes this theorem is given in terms of the sums:

8

Page 168: Applied Statistics - MIT

Central Limit Theorem

Let X1, …, Xn be a random sample from an arbitrary distribution with finite mean µ and variance σ². As n increases,

(X̄ - µ) / (σ/√n) ≈ N(0, 1)

⇒  X̄ ≈ N(µ, ?)

⇒  Σ_{i=1}^{n} Xi ≈ N(nµ, ?)

What happens as n goes to infinity?

9

Page 169: Applied Statistics - MIT

10

Variance of means from the uniform distribution; sample size = 10 to 10^6; number of samples = 100.

[Figure: scatter plot of log10(variance) versus log10(sample.size).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 170: Applied Statistics - MIT

Example: Uniform Distribution

• f(x | a, b) = 1 / (b-a),  a ≤ x ≤ b
• E(X) = (b+a)/2
• Var(X) = (b-a)²/12

[Figure: histogram of runif(500, min = 0, max = 10).]

11 This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
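The standardized-means histograms on the following slides can be reproduced with a few lines of S-PLUS (a sketch under the same settings: 500 samples from Uniform(0,10)):

n <- 100; nsamp <- 500
x <- matrix(runif(nsamp * n, 0, 10), ncol = n)
z <- (apply(x, 1, mean) - 5) / sqrt((10^2 / 12) / n)   # standardize using E(X) = 5, Var(X) = 100/12
hist(z)      # approximately N(0,1) by the CLT
qqnorm(z)    # points near a straight line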

Page 171: Applied Statistics - MIT

12

Standardized Means, Uniform Distribution: 500 samples, n = 1

[Figure: histogram of the 500 standardized sample means; number of samples = 500, n = 1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 172: Applied Statistics - MIT

13

Standardized Means, Uniform Distribution: 500 samples, n = 2

[Figure: histogram of the 500 standardized sample means; number of samples = 500, n = 2.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 173: Applied Statistics - MIT

14

Standardized Means, Uniform Distribution: 500 samples, n = 100

[Figure: histogram of the 500 standardized sample means; number of samples = 500, n = 100.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 174: Applied Statistics - MIT

15

QQ (Normal) plot of means of 500 samples of size 100 from uniform distribution

[Figure: normal Q-Q plot of the 500 standardized means against quantiles of the standard normal.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 175: Applied Statistics - MIT

Bootstrap – sampling from the sample

• Previous slides have shown results for means of 500 samples (of size 100) from uniform distribution.

• Bootstrap takes just one sample of size 100 and then takes 500 samples (of size 100) with replacement from the sample.

• x <- runif(100)
• y <- mean(sample(x, 100, replace = T))   (one bootstrap mean; see the sketch below)

16
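A sketch that repeats the resampling 500 times to obtain the whole bootstrap distribution of the mean (the single line above gives only one bootstrap mean):

x <- runif(100)                  # the one observed sample
y <- numeric(500)
for (i in 1:500) y[i] <- mean(sample(x, 100, replace = T))
hist(y)                          # bootstrap distribution of the sample mean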

Page 176: Applied Statistics - MIT

17

Normal probability plot of sample of size 100 from exponential distribution

[Figure: normal Q-Q plot; quantiles of the standard normal on the x-axis, sample values on the y-axis.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 177: Applied Statistics - MIT

18

Normal probability plot of means of 500 bootstrap samples from sample of size 100

from exponential distribution

[Figure: normal Q-Q plot of the 500 bootstrap means; quantiles of the standard normal on the x-axis, bootstrap means (roughly 1.0 to 1.5) on the y-axis.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 178: Applied Statistics - MIT

Law of Large Numbers and Central Limit Theorem

Both are asymptotic results about the sample mean:

• Law of Large Numbers (LLN) says that as n → ∞, the sample mean converges to the population mean, i.e.,

X̄ - µ → 0  as  n → ∞

• Central Limit Theorem (CLT) says that as n → ∞, the distribution of the standardized sample mean also converges to Normal, i.e.,

(X̄ - µ) / (σ/√n) converges to N(0,1) as n → ∞

19

Page 179: Applied Statistics - MIT

Normal Approximation to the Binomial

A binomial r.v. is the sum of i.i.d. Bernoulli r.v.’s so the CLT can be used to approximate its distribution.

Suppose that X is B(n, p). Then the mean of X is np and the variance of X is np(1 - p) .

By the CLT, we have:  (X - np) / √(np(1 - p)) ≈ N(0, 1)

General formula:  (r.v. - E(r.v.)) / SD(r.v.)

How large a sample, n, do we need for the approximation to be good?

Rule of Thumb: np ≥ 10 and n(1-p) ≥ 10

For p=0.5, np = n(1-p) = n (0.5) = 10 ⇒ n should be 20. (symmetrical)

For p=0.1 or 0.9, np or n(1-p) = n (0.1) = 10 ⇒ n should be 100. (skewed)

• See Figures 5.2 and 5.3 and Example 5.3, pp.172-174

20

Page 180: Applied Statistics - MIT

Continuity Correction

See Figure 5.4 for motivation.

P(X ≤ x) ≈ Φ( (x + 0.5 - np) / √(np(1 - p)) )

P(X ≥ x) ≈ 1 - Φ( (x - 0.5 - np) / √(np(1 - p)) )

Exact Binomial Probability:

P(X ≤ 8) = 0.2517

Normal approximation without Continuity Correction:

P(X ≤ 8) = 0.1867

Normal approximation with Continuity Correction:

P(X ≤ 8.5) = 0.2514 (much better agreement with the exact calculation)

21
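The three numbers above can be reproduced with the sketch below, assuming the example uses X ~ B(20, 0.5) (values of n and p that match the quoted exact and corrected probabilities):

pbinom(8, 20, 0.5)                   # exact: 0.2517
pnorm((8 - 10) / sqrt(20 * 0.25))    # without continuity correction: about 0.186
pnorm((8.5 - 10) / sqrt(20 * 0.25))  # with continuity correction: 0.2514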

Page 181: Applied Statistics - MIT

Sampling Distribution of the Sample Variance

S² = (1/(n-1)) Σ_{i=1}^{n} (Xi - X̄)²  ~  ?

There is no analog to the CLT for S² which gives an approximation for large samples for an arbitrary distribution.

The exact distribution of S² can be derived for X ~ i.i.d. Normal.

Chi-square distribution: For ν ≥ 1, let Z1, Z2, …, Zν be i.i.d. N(0,1) and let Y = Z1² + Z2² + … + Zν².

The p.d.f. of Y can be shown to be

f(y) = (1 / (2^{ν/2} Γ(ν/2))) y^{ν/2 - 1} e^{-y/2}

This is known as the χ² distribution with ν degrees of freedom (d.f.), or Y ~ χ²_ν.

• See Figures 5.5 and 5.6, pp. 176-177 and Table A.5, p. 676

22

Page 182: Applied Statistics - MIT

Distribution of the Sample Variance in the Normal Case

If Z ~ N(0,1), then Z² ~ χ²_1

It can be shown that

(n - 1)S²/σ² ~ χ²_{n-1}

or equivalently, S² ~ (σ²/(n-1)) χ²_{n-1}, a scaled χ².

E(S²) = σ²  (S² is an unbiased estimator of σ²)

Var(S²) = 2σ⁴/(n - 1)

See Result 2 (p.179)

23

Page 183: Applied Statistics - MIT

Chi-square distribution

[Figure: chi-square densities for df = 5, 10, 20, 30, plotted for x from 0 to 50.]

24

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 184: Applied Statistics - MIT

Chi-Square DistributionInteresting Facts

• EX = ν (degrees of freedom)• Var X = 2ν• Special case of the gamma distribution

with scale parameter=2, shape parameter=v/2.

• Chi-square variate with v d.f. is equal to the sum of the squares of v independent unit normal variates.

25

Page 185: Applied Statistics - MIT

Student’s t-Distribution

Consider a random sample X1, X2, ..., Xn drawn from N(µ, σ²).

It is known that (X̄ - µ) / (σ/√n) is exactly distributed as N(0,1).

T = (X̄ - µ) / (S/√n) is NOT distributed as N(0,1).

A different distribution for each ν = n-1 degrees of freedom (d.f.).

T is the ratio of a N(0,1) r.v. and sq.rt.(independent χ2 divided by its d.f.) - for derivation, see eqn 5.13, p.180, and its messy p.d.f., eqn 5.14

See Figure 5.7, Student’s t p.d.f.’s for ν = 2, 10,and ∞, p.180• See Table A.4, t-distribution table, p. 675• See Example 5.6, milk cartons, p. 181

26

Page 186: Applied Statistics - MIT

27

Student’s t densities for df=1,100

[Figure: Student’s t p.d.f. for df = 1 and df = 100, plotted for x from -4 to 4.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 187: Applied Statistics - MIT

Student’s t DistributionInteresting Facts

• E(X) = 0, for v > 1
• Var(X) = v/(v-2), for v > 2
• Related to the F distribution (F_{1,v} = t_v²)
• As v tends to infinity, the t variate tends to the unit normal
• If v = 1, then the t variate is standard Cauchy

28

Page 188: Applied Statistics - MIT

29

Cauchy Distribution for center=0, scale=1 and center=1, scale=2

[Figure: Cauchy p.d.f. for (center = 0, scale = 1) and (center = 1, scale = 2), plotted for x from -4 to 4.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 189: Applied Statistics - MIT

Cauchy DistributionInteresting Facts

f(x | a, b) = { πb [ 1 + ((x - a)/b)² ] }^{-1}

• Parameters: a = center, b = scale
• Mean and variance do not exist (how could this be?)
• a = median
• Quartiles = a ± b
• Special case of Student’s t with 1 d.f.
• Ratio of 2 independent unit normal variates is a standard Cauchy variate
• Should not be thought of as “only a pathological case” (Casella & Berger), as we frequently (when?) calculate ratios of random variables.

30

Page 190: Applied Statistics - MIT

Snedecor-Fisher’s F-Distribution

Consider two independent random samples:

X1, X2, ..., X_{n1} from N(µ1, σ1²) and Y1, Y2, ..., Y_{n2} from N(µ2, σ2²).

Then

F = (S1²/σ1²) / (S2²/σ2²)

has an F-distribution with n1-1 d.f. in the numerator and n2-1 d.f. in the denominator.

• F is the ratio of two independent χ²’s divided by their respective d.f.’s
• Used to compare sample variances.
• See Table A.6, F-distribution, pp. 677-679

31

Page 191: Applied Statistics - MIT

32

Snedecor’s F Distribution

[Figure: F p.d.f. for df2 = 40 and df1 = 4, 10, 40, plotted for x from 0 to 3.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 192: Applied Statistics - MIT

Snedecor’s F DistributionInteresting Facts

• Parameters, v, w, referred to as degrees of freedom (df).• Mean = w/(w-2), for w>2• Variance = 2w2(v+w-2)/(v(w-2)2(w-4)), for w>4• As d.f., v and w increase, F variate tends to normal• Related also to Chi-square, Student’s t, Beta and Binomial• Reference for distributions:

Statistical Distributions 3rd ed. by Evans, Hastings and Peacock, Wiley, 2000

33

Page 193: Applied Statistics - MIT

Sampling Distributions - Summary

• For random sample from any distribution, standardized sample mean converges to N(0,1) as n increases (CLT).

• In normal case, standardized sample mean with S instead of sigma in the denominator ~ Student’s t(n-1).

• Sum of n squared unit normal variates ~ Chi-square (n)

• In the normal case, sample variance has scaled Chi-square distribution.

• In the normal case, ratio of sample variances from two different samples divided by their respective d.f. has F distribution.

34

Page 194: Applied Statistics - MIT

Sir Ronald A. Fisher George W. Snedecor(1890-1962) (1882-1974)

Taught at Iowa State Univ., where he wrote a college textbook (1937):

“Thank God for Snedecor;now we can understand Fisher.”

(named the distribution for Fisher)

Wrote the first books on statistical methods (1926 & 1936):

“A student should not be made to read Fisher’s books unless he has read them before.”

35

Page 195: Applied Statistics - MIT

Sampling Distributions for Order Statistics

Most sampling distribution results (except for the CLT) apply to samples from normal populations.

If the data do not come from a normal (or at least approximately normal) distribution, then statistical methods called “distribution-free” or “non-parametric” methods can be used (Chapter 14).

Non-parametric methods are often based on ordered data (called order statistics: X(1), X(2), …, X(n)) or just their ranks.

If X1, ..., Xn are from a continuous population with c.d.f. F(x) and p.d.f. f(x), then the p.d.f. of X(j) is:

f_{(j)}(x) = ( n! / ((j-1)!(n-j)!) ) f(x) [F(x)]^{j-1} [1 - F(x)]^{n-j}

Confidence intervals for percentiles can be derived using the order statistics and the binomial distribution.

36

Page 196: Applied Statistics - MIT

Basic Concepts of Inference

Corresponds to Chapter 6 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT)with some slides by Jacqueline Telford

(Johns Hopkins University) and Roy Welsch (MIT).1

Page 197: Applied Statistics - MIT

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” H. G. Wells

Statistical InferenceDeals with methods for making statements about a population based on a sample drawn from the population

Point Estimation: Estimate an unknown population parameter

Confidence Interval Estimation: Find an interval that contains the parameter with preassigned probability.

Hypothesis testing: Testing hypothesis about an unknown population parameter

2

Page 198: Applied Statistics - MIT

Examples

Point Estimation: estimate the mean package weight of a cereal box filled during a production shift

Confidence Interval Estimation: Find an interval [L,U] based on the data that includes the mean weight of the cereal box with a specified probability

Hypothesis testing: Do the cereal boxes meet the minimum mean weight specification of 16 oz?

3

Page 199: Applied Statistics - MIT

Two Levels of Statistical Inference

• Informal, using summary statistics (may only be descriptive statistics)

• Formal, which uses methods of probability and sampling distributions to develop measures of statistical accuracy

4

Page 200: Applied Statistics - MIT

Estimation Problems

• Point estimation: estimation of an unknown population parameter by a single statistic calculated from the sample data.

• Confidence interval estimation: calculation of an interval from sample data that includes the unknown population parameter with a pre-assigned probability.

5

Page 201: Applied Statistics - MIT

Point Estimation Terminology

Estimator = the random variable (r.v.) θ̂, a function of the Xi’s (the general formula of the rule to be computed from the data).

Estimate = the numerical value of θ̂ calculated from the observed sample data X1 = x1, ..., Xn = xn (the specific value calculated from the data).

Example: Xi ~ N(µ, σ²)

Estimator:  X̄ = (1/n) Σ_{i=1}^{n} Xi is an estimator of µ, i.e., µ̂ = X̄.

Estimate:  x̄ = (1/n) Σ_{i=1}^{n} xi (= 10.2, say) is an estimate of µ.

Other estimators of µ?

6

Page 202: Applied Statistics - MIT

Methods of Evaluating Estimators: Bias and Variance

Bias(θ̂) = E(θ̂) - θ

- The bias measures the accuracy of an estimator.
- An estimator whose bias is zero is called unbiased.
- An unbiased estimator may, nevertheless, fluctuate greatly from sample to sample.

Var(θ̂) = E{ [θ̂ - E(θ̂)]² }

- The lower the variance, the more precise the estimator.
- A low-variance estimator may be biased.
- Among unbiased estimators, the one with the lowest variance should be chosen. “Best” = minimum variance.

7

Page 203: Applied Statistics - MIT

Accuracy and Precision

accurate and precise

accurate, not precise

precise, not accurate

not accurate, not precise

8Diagram courtesy of MIT OpenCourseWare

Page 204: Applied Statistics - MIT

Mean Squared Error

- To choose among all estimators (biased and unbiased), minimize a measure that combines both bias and variance.
- A “good” estimator should have low bias (accurate) AND low variance (precise).

MSE(θ̂) = E{ [θ̂ - θ]² } = Var(θ̂) + [Bias(θ̂)]²   (eqn 6.2)

MSE = expected squared error loss function

where Bias(θ̂) = E(θ̂) - θ and Var(θ̂) = E{ [θ̂ - E(θ̂)]² }.

9

Page 205: Applied Statistics - MIT

Example: estimators of variance

Two estimators of variance:

S1² = Σ_{i=1}^{n} (Xi - X̄)² / (n - 1)  is unbiased (Example 6.3)

S2² = Σ_{i=1}^{n} (Xi - X̄)² / n  is biased but has smaller MSE (Example 6.4)

In spite of its larger MSE, we almost always use S1².

10
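A simulation sketch comparing the two estimators when sampling from N(0,1) with n = 10 (illustrative values; the true variance is 1):

n <- 10; nsim <- 10000
x <- matrix(rnorm(nsim * n), ncol = n)
s1 <- apply(x, 1, var)      # S1^2: divides by n-1 (unbiased)
s2 <- (n - 1) / n * s1      # S2^2: divides by n (biased)
mean((s1 - 1)^2)            # estimated MSE of S1^2
mean((s2 - 1)^2)            # estimated MSE of S2^2, somewhat smaller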

Page 206: Applied Statistics - MIT

Example - Poisson

(See example in Casella & Berger, page 308)

11

Page 207: Applied Statistics - MIT

Standard Error (SE)

- The standard deviation of an estimator is called the standard error of the estimator (SE).
- The estimated standard error is also called the standard error (se).
- The precision of an estimator is measured by the SE.

Examples for the normal and binomial distributions:

1. X̄ is an unbiased estimator of µ.  SE(X̄) = σ/√n and se(x̄) = s/√n are called the standard error of the mean.

2. p̂ is an unbiased estimator of p.  se(p̂) = √( p̂(1 - p̂)/n )

12

Page 208: Applied Statistics - MIT

Precision and Standard Error

• A precise estimate has a small standard error, but exactly how are the precision and standard error related?

• If the sampling distribution of an estimator is normal with mean equal to the true parameter value (i.e., unbiased). Then we know that about 95% of the time the estimator will be within two SE’s from the true parameter value.

13

Page 209: Applied Statistics - MIT

Methods of Point Estimation

•Method of Moments (Chapter 6)

•Maximum Likelihood Estimation (Chapter 15)

•Least Squares (Chapter 10 and 11)

14

Page 210: Applied Statistics - MIT

Method of Moments

• Equate sample moments to population moments (as we did with Poisson).

• Example: for the continuous uniform distribution, f(x|a,b)=1/(b-a), a≤x≤b

• E(X) = (b+a)/2, Var(X)=(b-a)2/12

• Set X̄ = (b+a)/2 and S² = (b-a)²/12

• Solve for a and b (can be a bit messy); a sketch follows below.

15
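Solving the two equations gives a = X̄ - √3 S and b = X̄ + √3 S; a minimal sketch with simulated data (a = 2, b = 7 are illustrative true values):

x <- runif(200, min = 2, max = 7)
s <- sqrt(var(x))
mean(x) - sqrt(3) * s   # method-of-moments estimate of a
mean(x) + sqrt(3) * s   # method-of-moments estimate of b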

Page 211: Applied Statistics - MIT

Maximum Likelihood Parameter Estimation

• By far the most popular estimation method! (Casella & Berger).

• MLE is the parameter point for which observed data is most likely under the assumed probability model.

• Likelihood function: L(θ |x) = f(x| θ), where x is the vector of sample values, θ also a vector possibly.

• When we consider f(x| θ), we consider θ as fixed and x as the variable.

• When we consider L(θ |x), we are considering x to be the fixed observed sample point and θ to be varying over all possible parameter values.

16

Page 212: Applied Statistics - MIT

MLE (continued)•If X1….Xn are iid then

L(θ|x)=f(x1…xn| θ) = ∏ f(xi| θ)

•The MLE of θ is the value which maximizes the likelihood function (assuming it has a global maximum).

•Found by differentiating when possible.

•Usually work with log of likelihood function (∏→∑).

•Equations obtained by setting partial derivatives of ln L(θ) = 0 are called the likelihood equations.

• See text page 616 for example – normal distribution (a sketch follows below).

17
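For the normal example, the likelihood equations have the closed-form solution µ̂ = x̄ and σ̂² = Σ(xi - x̄)²/n; a minimal sketch (simulated data with illustrative true values µ = 10, σ = 2):

x <- rnorm(50, mean = 10, sd = 2)
mu.hat <- mean(x)                               # MLE of mu
sigma2.hat <- sum((x - mean(x))^2) / length(x)  # MLE of sigma^2 (divides by n, not n-1)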

Page 213: Applied Statistics - MIT

Confidence Interval EstimationWe want an interval [ L, U ] where L and U are two statistics calculated from X1, X2, …, Xn such that

P[ L ≤ θ ≤ U] = 1 - α Note: L and U are random and θ is fixed but unknown

regardless of the true value of θ.

• [ L, U ] is called a 100(1-α)% confidence interval (CI).

• 1-α is called the confidence level of the interval.

• After the data is observed X1 = x1, ..., Xn = xn, the confidence limits L = l and U = u can be calculated.

18

Page 214: Applied Statistics - MIT

95% Confidence Interval: Normal, σ² known

Consider a random sample X1, X2, …, Xn ~ N(µ, σ²), where σ² is assumed to be known and µ is an unknown parameter to be estimated. Then

P[ -1.96 ≤ (X̄ - µ)/(σ/√n) ≤ 1.96 ] = 0.95

By the CLT, even if the sample is not normal, this result is approximately correct.

⇒ P[ L = X̄ - 1.96 σ/√n  ≤  µ  ≤  U = X̄ + 1.96 σ/√n ] = 0.95

⇒ l = x̄ - 1.96 σ/√n  ≤  µ  ≤  x̄ + 1.96 σ/√n = u  is a 95% CI for µ (two-sided)

• See Example 6.7, Airline Revenues, p. 204

19

Page 215: Applied Statistics - MIT

Normal Distribution, 95% of area under curve is between -1.96 and 1.96

[Figure: standard normal density dnorm(x) for x from -3 to 3, with 95% of the area shaded between -1.96 and 1.96.]

20 This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 216: Applied Statistics - MIT

Frequentist Interpretation of CI’sIn an infinitely long series of trials in which repeated samples of size n are drawn from the same population and 95% CI’s for µ are calculated using the same method, the proportion of intervals that actually include µ will be 95% (coverage probability).

However, for any particular CI, it is not known whether or not the CI includes µ; the probability that it includes µ is either 0 or 1, that is, either it does or it doesn’t.

It is incorrect to say that the probability is 0.95 that the true µ is in a particular CI.

• See Figure 6.2, p. 205

21

Page 217: Applied Statistics - MIT

22

95% CI, 50 samples from unit normal distribution

[Figure: 95% confidence intervals for each of 50 samples from the unit normal distribution, plotted against sample index from 0 to 50.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
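A sketch reproducing the idea of this figure: 50 samples of size 20 from N(0,1) with σ known, counting how many of the 95% intervals cover the true mean 0:

nsim <- 50; n <- 20
cover <- logical(nsim)
for (i in 1:nsim) {
  x <- rnorm(n)
  ci <- mean(x) + c(-1.96, 1.96) / sqrt(n)   # sigma = 1 is known here
  cover[i] <- ci[1] <= 0 & ci[2] >= 0
}
sum(cover)   # typically about 0.95 * 50 = 47 or 48 intervals cover 0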

Page 218: Applied Statistics - MIT

Arbitrary Confidence Level for CI: σ² known

100(1-α)% two-sided CI for µ based on the observed sample mean:

x̄ - z_{α/2} σ/√n ≤ µ ≤ x̄ + z_{α/2} σ/√n

For 99% confidence, z_{α/2} = 2.576.

The price paid for a higher confidence level is a wider interval.

For large samples, these CIs can be used for data from any distribution, since by the CLT, X̄ ≈ N(µ, σ²/n).

23

Page 219: Applied Statistics - MIT

One-sided Confidence Intervals

µ ≥ x̄ - z_α σ/√n   (Lower one-sided CI)

µ ≤ x̄ + z_α σ/√n   (Upper one-sided CI)

For 95% confidence, z_α = 1.645 vs. z_{α/2} = 1.96.

One-sided CIs are tighter for the same confidence level.

24

Page 220: Applied Statistics - MIT

Hypothesis Testing

The objective of hypothesis testing is to assess the validity of a claim against a counterclaim using sample data.

• The claim to be “proved” is the alternative hypothesis (H1).

• The competing claim is called the null hypothesis (H0).

• One begins by assuming that H0 is true. If the data fails to contradict H0 beyond a reasonable doubt, then H0 is not rejected. However, failing to reject H0 does not mean that we accept it as true. It simply means that H0 cannot be ruled out as a possible explanation for the observed data. A proof by insufficient data is not a proof at all.

25

Page 221: Applied Statistics - MIT

Testing Hypotheses“The process by which we use data to answer questions about parametersis very similar to how juries evaluate evidence about a defendant.” – from Geoffrey Vining, Statistical Methods for Engineers, Duxbury, 1st edition, 1998. For more information, see that textbook.

26

Page 222: Applied Statistics - MIT

Hypothesis Tests• A hypothesis test is a data-based rule to decide between H0and H1.

• A test statistic calculated from the data is used to make this decision.

• The values of the test statistics for which the test rejects H0 comprise the rejection region of the test.

• The complement of the rejection region is called the acceptance region.

• The boundaries of the rejection region are defined by one or more critical constants (critical values).

• See Examples 6.13(acc. sampling) and 6.14(SAT coaching), pp. 210-211.

27

Page 223: Applied Statistics - MIT

Hypothesis Testing as a Two-Decision Problem

28

Framework developed by Neyman and Pearson in 1933.

When a hypothesis test is viewed as a decision procedure, two types of errors are possible:

Decision:                    Do not reject H0                              Reject H0
Reality: H0 True      Correct Decision, “Confidence”, 1 - α        Type I Error, “Significance Level”, α          (row total = 1)
Reality: H0 False     Type II Error, “Failure to Detect”, β        Correct Decision, “Prob. of Detection”, 1 - β  (row total = 1)
                            (column total ≠ 1)                            (column total ≠ 1)

Page 224: Applied Statistics - MIT

Probabilities of Type I and II Errorsα = P{Type I error} = P{Reject H0 when H0 is true} = P{Reject H0|H0}

also called α-risk or producer’s risk or false alarm rate

β = P{Type II error} = P{Fail to reject H0 when H1 is true} = P{Fail to reject H0|H1}

also called β-risk or consumer’s risk or prob. of not detecting

π = 1 - β = P{Reject H0|H1} is prob. of detection or power of the test

We would like to have low α and low β (or equivalently, high power).

α and 1-β are directly related, can increase power by increasing α.

These probabilities are calculated using the sampling distributions from either the null hypothesis (for α) or alternative hypothesis (for β).

29

Page 225: Applied Statistics - MIT

Example 6.17 (SAT Coaching)

See Example 6.17, “SAT Coaching,” in the course textbook.

30

Page 226: Applied Statistics - MIT

Power Function and OC Curve

The operating characteristic function of a test is the probability that the test fails to reject H0 as a function of θ, where θ is the test parameter.

OC(θ) = P{test fails to reject H0 | θ}

For θ values included in H1 the OC function is the β –risk.

The power function is:

π(θ) = P{Test rejects H0 | θ} = 1 – OC(θ)

Example: In SAT coaching, for the test that rejects the null hypothesis when mean change is 25 or greater, the power = 1-pnorm(25,mean=0:50,sd=40/sqrt(20))

31

Page 227: Applied Statistics - MIT

Level of SignificanceThe practice of test of hypothesis is to put an upper bound on the P(Type I error) and, subject to that constraint, find a test with the lowest possible P(Type II error).

The upper bound on P(Type I error) is called the level of significance of the test and is denoted by α (usually some small number such as 0.01, 0.05, or 0.10).

The test is required to satisfy:

P{ Type I error } = P{ Test Rejects H0 | H0 } ≤ α

Note that α is now used to denote an upper bound on P(Type I error).

Motivated by the fact that the Type I error is usually the more serious.

A hypothesis test with a significance level α is called an α-level test.

32

Page 228: Applied Statistics - MIT

Choice of Significance Level

What α level should one use?

Recall that as P(Type I error) decreases P(Type II error) increases.

A proper choice of α should take into account the relative costs of Type I and Type II errors. (These costs may be difficult to determine in practice, but must be considered!)

Fisher said: α =0.05

Today α = 0.10, 0.05, 0.01 depending on how much proof against the null hypothesis we want to have before rejecting it.

P-values have become popular with the advent of computer programs.

33

Page 229: Applied Statistics - MIT

Observed Level of Significance or P-valueSimply rejecting or not rejecting H0 at a specified α level does not fully convey the information in the data.

Example: H0: µ = 15 vs. H1: µ > 15 is rejected at α = 0.05 when

x̄ > 15 + 1.645 × 40/√20 = 29.71

Is a sample with a mean of 30 equivalent to a sample with a mean of 50? (Note that both lead to rejection at the α-level of 0.05.)

More useful to report the smallest α-level for which the data would reject (this is called the observed level of significance or P-value).

Reject H0 if P-value < α.

34

Page 230: Applied Statistics - MIT

Example 6.23 (SAT Coaching: P-Value)

See Example 6.23, “SAT Coaching,” on page 220 of the course textbook.

35

Page 231: Applied Statistics - MIT

One-sided and Two-sided TestsH0 : µ = 15 can have three possible alternative hypotheses:

H1 : µ > 15 , H1 : µ < 15 , or H1 : µ ≠ 15

(upper one-sided) (lower one-sided) (two-sided)

Example 6.27 (SAT Coaching: Two-sided testing)

See Example 6.27 in the course textbook.

36

Page 232: Applied Statistics - MIT

Example 6.27 continued

See Example 6.27, “SAT Coaching,” on page 223 of the course textbook.

37

Page 233: Applied Statistics - MIT

Relationship Between Confidence Intervals and Hypothesis Tests

An α-level two-sided test rejects a hypothesis H0 : µ = µ0 if and only if the (1- α)100% confidence interval does not contain µ0.

Example 6.7 (Airline Revenues)

See Example 6.7, “Airline Revenues,” on page 207 of the course textbook.

38

Page 234: Applied Statistics - MIT

Use/Misuse of Hypothesis Tests in Practice

• Difficulties of Interpreting Tests on Non-random samples and observational data

• Statistical significance versus Practical significance– Statistical significance is a function of sample size

• Perils of searching for significance

• Ignoring lack of significance

•Confusing confidence (1 - α) with probability of detecting a difference (1 - β)

39

Page 235: Applied Statistics - MIT

Jerzy Neyman Egon Pearson(1894-1981) (1895-1980)

Carried on a decades-long feud with Fisher over the foundations of statistics (hypothesis testing and confidence limits) - Fisher never recognized Type II error and developed fiducial limits.

40

Page 236: Applied Statistics - MIT

Inference for Single Samples

Corresponds to Chapter 7 of

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)

1

Page 237: Applied Statistics - MIT

Inference About the Mean and Variance of a Normal Population

Applications:• Monitor the mean of a manufacturing process to determine

if the process is under control• Evaluate the precision of a laboratory instrument measured

by the variance of its readings• Prediction intervals and tolerance intervals which are

methods for estimating future observations from a population.

By using the central limit theorem (CLT), inference procedures for the mean of a normal population can be extended to the mean of a non-normal population when a large sample is available

2

Page 238: Applied Statistics - MIT

Inferences on Mean (Large Samples)

Inferences on µ will be based on the sample mean X̄, which is an unbiased estimator of µ with variance σ²/n.

For large sample size n, the CLT tells us that X̄ is approximately N(µ, σ²/n) distributed, even if the population is not normal.

Also, for large n, the sample variance s² may be taken as an accurate estimator of σ² with negligible sampling error. If n ≥ 30, we may assume that s = σ in the formulas.

3

Page 239: Applied Statistics - MIT

Pivots

• Definition: Casella & Berger, p. 413

• E.g.  Z = (X̄ - µ)/(σ/√n) ~ N(0,1)

• Pivots allow us to construct confidence intervals on parameters.

4

Page 240: Applied Statistics - MIT

Confidence Intervals on the Mean: Large Samples

P[ -z_{α/2} ≤ Z = (X̄ - µ)/(σ/√n) ≤ z_{α/2} ] = 1 - α

Note: zα/2 = -qnorm(α/2)

(See Figure 2.15 on page 56 of the course textbook.)

5

Page 241: Applied Statistics - MIT

Confidence Intervals on the Mean

x̄ - z_{α/2} σ/√n ≤ µ ≤ x̄ + z_{α/2} σ/√n   (Two-Sided CI)

x̄ - z_α σ/√n ≤ µ   (Lower One-Sided CI)

µ ≤ x̄ + z_α σ/√n   (Upper One-Sided CI)

σ/√n is the standard error of the mean.

6

Page 242: Applied Statistics - MIT

Confidence Intervals in S-Plus

t.test(lottery.payoff)

One-sample t-Test

data:  lottery.payoff
t = 35.9035, df = 253, p-value = 0
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 274.4315 306.2850
sample estimates:
 mean of x
  290.3583

7

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 243: Applied Statistics - MIT

Sample Size Determination for a z-interval

Suppose that we require a (1-α)-level two-sided CI for µ of the form [x̄ - E, x̄ + E], with a margin of error E.

Set E = z_{α/2} σ/√n and solve for n, obtaining  n = [ z_{α/2} σ / E ]².

• The calculation is done at the design stage, so a sample estimate of σ is not available.
• An estimate for σ can be obtained by anticipating the range of the observations and dividing by 4.

This is based on assuming normality, since then 95% of the observations are expected to fall in [µ - 2σ, µ + 2σ].

8

Page 244: Applied Statistics - MIT

Example 7.1 (Airline Revenue)

See Example 7.1, “Airline Revenue,” on page 239 of the course textbook.

9

Page 245: Applied Statistics - MIT

Example 7.2 – Strength of Steel Beams

See Example 7.2 on page 240 of the course textbook.

10

Page 246: Applied Statistics - MIT

Power Calculation for One-sided Z-tests

π(µ) = P[Test rejects H0 | µ]

Testing H0: µ ≤ µ0 vs. H1: µ > µ0.

For the derivation of the power function of the α-level upper one-sided z-test, see Equation 7.7 in the course textbook.

Illustration of the calculation on the next page.

[Sketch: standard normal density with the points -z and z marked; Φ(-z) = 1 - Φ(z).]

11

Page 247: Applied Statistics - MIT

Power Calculation for One-sided Z-tests

p.d.f. curves of X̄ ~ N(µ, σ²/n)

(See Figure 7.1 on page 243 of the course textbook.)

12

Page 248: Applied Statistics - MIT

Power Functions Curves

See Figure 7.2 on page 243 of the course textbook.

Notice how it is easier to detect a big difference from µ0.

13

Page 249: Applied Statistics - MIT

Example 7.3 (SAT Couching: Power Calculation)

See Example 7.3 on page 244 of the course textbook.

π(µ) = Φ( -z_α + (µ - µ0)√n / σ )

14

Page 250: Applied Statistics - MIT

Power Calculation Two-Sided

Test(See Figure 7.3 on page 245 of the course textbook.)

15

Page 251: Applied Statistics - MIT

Power Curve for Two-sided TestIt is easier to detect large differences from the null hypothesis(See Figure 7.4 on

page 246 of the course textbook.)

Larger samples lead to more powerful tests

16

Page 252: Applied Statistics - MIT

17

Power as a function of µ and n, with µ0 = 0 and σ = 1. Uses the function persp in S-Plus.

[Figure: 3-D perspective plot of power (0 to 1) over n (20 to 100) and mu (-1 to 1).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 253: Applied Statistics - MIT

Sample Size Determination for a One-Sided z-Test

• Determine the sample size so that a study will have sufficient power to detect an effect of practically important magnitude

• If the goal of the study is to show that the mean response µ under a treatment is higher than the mean response µ0 without the treatment, then µ - µ0 is called the treatment effect.

• Let δ > 0 denote a practically important treatment effect and let 1-β denote the minimum power required to detect it. The goal is to find the minimum sample size n which would guarantee that an α-level test of H0 has at least 1-β power to reject H0 when the treatment effect is at least δ.

18

Page 254: Applied Statistics - MIT

Sample Size Determination for a One-sided Z-test

Because Power is an increasing function of µ−µ0, it is only necessary to find n that makes the power 1− β at µ = µ0+δ.

π(µ0 + δ) = Φ( -z_α + δ√n/σ ) = 1 - β   [See Equation (7.7), Slide 11]

Since Φ(z_β) = 1 - β, we have -z_α + δ√n/σ = z_β.

Solving for n, we obtain

n = [ (z_α + z_β) σ / δ ]²

19

Page 255: Applied Statistics - MIT

Example 7.5 (SAT Coaching: Sample Size Determination

See Example 7.5 on page 248 of the course textbook.

20

Page 256: Applied Statistics - MIT

Sample Size Determination for a Two-Sided z-Test

n = [ (z_{α/2} + z_β) σ / δ ]²

Read on your own the derivation on pages 248-249.

See Example 7.6 on page 249 of the course textbook.

Read on your own Example 7.4 (page246)

21

Page 257: Applied Statistics - MIT

Power and Sample Size in S-Plus

normal.sample.size(mean.alt = 0.3) mean.null sd1 mean.alt delta alpha power n1

0 1 0.3 0.3 0.05 0.8 88

> normal.sample.size(mean.alt = 0.3,n1=100) mean.null sd1 mean.alt delta alpha power n1

0 1 0.3 0.3 0.05 0.8508 100

22

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 258: Applied Statistics - MIT

Inference on Mean (Small Samples)

The sampling variability of s² may be sizable if the sample is small (less than 30). Inference methods must take this variability into account when σ² is unknown.

Assume that X1, ..., Xn is a random sample from an N(µ, σ²) distribution. Then

T = (X̄ - µ) / (S/√n)

has a t-distribution with n-1 degrees of freedom (d.f.).

(Note that T is a pivot.)

23

Page 259: Applied Statistics - MIT

Confidence Intervals on Mean

1 - α = P[ -t_{n-1,α/2} ≤ T = (X̄ - µ)/(S/√n) ≤ t_{n-1,α/2} ]
      = P[ X̄ - t_{n-1,α/2} S/√n ≤ µ ≤ X̄ + t_{n-1,α/2} S/√n ]

X̄ - t_{n-1,α/2} S/√n ≤ µ ≤ X̄ + t_{n-1,α/2} S/√n   [Two-Sided 100(1-α)% CI]

t_{n-1,α/2} > z_{α/2}  ⇒  the t-interval is wider on average than the z-interval.

24

Page 260: Applied Statistics - MIT

Example 7.7, 7.8, and 7.9

See Examples 7.7, 7.8, and 7.9 from the course textbook.

25

Page 261: Applied Statistics - MIT

Inference on Variance

Assume that X1, ..., Xn is a random sample from an N(µ, σ²) distribution.

χ² = (n - 1)S²/σ²  has a chi-square distribution with n-1 d.f.

(See Figure 7.8 on page 255 of the course textbook.)

1 - α = P[ χ²_{n-1,1-α/2} ≤ (n - 1)S²/σ² ≤ χ²_{n-1,α/2} ]

26

Page 262: Applied Statistics - MIT

CI for σ2 and σ

The 100(1-α)% two-sided CI for σ2 (Equation 7.17 in course textbook):

(n - 1)s² / χ²_{n-1,α/2} ≤ σ² ≤ (n - 1)s² / χ²_{n-1,1-α/2}

The 100(1-α)% two-sided CI for σ (Equation 7.18 in course textbook):

s √( (n - 1) / χ²_{n-1,α/2} ) ≤ σ ≤ s √( (n - 1) / χ²_{n-1,1-α/2} )

27

Page 263: Applied Statistics - MIT

Hypothesis Test on Variance

See Equation 7.21 on page 256 of the course textbook for an explanation of the chi-square statistic:

χ² = (n - 1)s² / σ0²

28

Page 264: Applied Statistics - MIT

Prediction Intervals• Many practical applications call for an interval estimate of

– an individual (future) observation sampled from a population – rather than of the mean of the population.

• An interval estimate for an individual observation is called a prediction interval

Prediction Interval Formula:

x̄ - t_{n-1,α/2} s √(1 + 1/n) ≤ X ≤ x̄ + t_{n-1,α/2} s √(1 + 1/n)

29
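A sketch contrasting the confidence interval and the prediction interval on simulated data (illustrative values: n = 20 observations from N(50, 5²)):

x <- rnorm(20, mean = 50, sd = 5)
n <- length(x); tc <- qt(0.975, n - 1); s <- sqrt(var(x))
mean(x) + c(-1, 1) * tc * s / sqrt(n)         # 95% CI for mu
mean(x) + c(-1, 1) * tc * s * sqrt(1 + 1/n)   # 95% prediction interval (wider)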

Page 265: Applied Statistics - MIT

Confidence vs. Prediction Interval

Prediction interval for a single future observation:

x̄ - t_{n-1,α/2} s √(1 + 1/n) ≤ X ≤ x̄ + t_{n-1,α/2} s √(1 + 1/n)

As n → ∞, the interval converges to [µ - z_{α/2} σ, µ + z_{α/2} σ].

Confidence interval for µ:

x̄ - t_{n-1,α/2} s/√n ≤ µ ≤ x̄ + t_{n-1,α/2} s/√n

As n → ∞, the interval converges to the single point µ.

30

Page 266: Applied Statistics - MIT

Example 7.12: Tear Strength of Rubber

See Example 7.12 on page 259 of the course textbook.

Run chart shows process is predictable.

31

Page 267: Applied Statistics - MIT

Tolerance Intervals

Suppose we want an interval which will contain at least 0.90 = 1-γ of the strengths of the future batches (observations) with 95% = 1-α confidence.

Using Table A.12 in the course textbook: 1-α = 0.95, 1-γ = 0.90, n = 14. So the critical value we want is K = 2.529.

[x̄ - Ks, x̄ + Ks] = 33.712 ± 2.529 × 0.798 = [31.694, 35.730]

Note that this statistical interval is even wider than the prediction interval.

32

Page 268: Applied Statistics - MIT

Inferences for Two Samples

Corresponds to Chapter 8 ofTamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León

(University of Tennessee) 1

Page 269: Applied Statistics - MIT

Introductory Remarks• A majority of statistical studies, whether experimental or

observational, are comparative• Simplest type of comparative study compares two

populations• Two principal designs for comparative studies

– Using independent samples– Using matched pairs

• Graphical methods for informal comparisons• Formal comparisons of means and variances of normal

populations– Confidence intervals– Hypothesis tests

2

Page 270: Applied Statistics - MIT

Independent Samples Design

Example: Compare Control Group to Treatment Group. See page 270 in course textbook.

Independent samples design:
Sample 1: x1, x2, ..., x_{n1}
Sample 2: y1, y2, ..., y_{n2}   (the sample sizes n1 and n2 may differ)

• The two samples are independent.
• The independent samples design relies on random assignment to make the two groups equal (on the average) on all attributes except for the treatment used (treatment factor).

3

Page 271: Applied Statistics - MIT

Graphical Methods for Comparing Two Independent Samples

See Table 8.1 and Figure 8.1, which is a Q-Q plot. The plot suggests that treatment group costs are less than control group costs. But is it true?

Plot of the ordered pairs of order statistics (x_(i), y_(i)), which are the i/(n+1) quantiles of the respective samples.

Book discusses how to prepare this graph when the two samples are not of the same size (interpolation).

4

Page 272: Applied Statistics - MIT

5

Box plots of hospitalization cost data

[Figure: box plots of hcc and hct; cost scale from 0 to 30000.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 273: Applied Statistics - MIT

Box plots of logs of hospitalization cost data

[Figure: box plots of lhcc and lhct; log-cost scale from 6 to 10.]

6

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 274: Applied Statistics - MIT

Graphical Displays of Data from Matched Pairs

7

• Plot the pairs (xi, yi) in a scatter plot. Using the 45° line as a reference, one can judge whether the two sets of values are similar or whether one set tends to be larger than the other

• Plots of the differences or the ratios of the pairs may prove to be useful

• A Q-Q plot is meaningless for paired data because the same quantiles based on the ordered observations do not, in general, come from the same pair.

Page 275: Applied Statistics - MIT

Comparing Means of Two Populations:Independent Samples Design

(Large Samples Case)

Suppose that the observations x1, x2, ..., x_{n1} and y1, y2, ..., y_{n2} are random samples from two populations with means µ1 and µ2 and variances σ1² and σ2². Both means and variances are assumed to be unknown.

The goal is to compare µ1 and µ2 in terms of their difference µ1 - µ2. We assume that n1 and n2 are large (say > 30).

8

Page 276: Applied Statistics - MIT

Comparing Means of Two Populations:Independent Samples Design

E(X̄ - Ȳ) = E(X̄) - E(Ȳ) = µ1 - µ2

Var(X̄ - Ȳ) = Var(X̄) + Var(Ȳ) = σ1²/n1 + σ2²/n2

Therefore the standardized r.v.

Z = [ (X̄ - Ȳ) - (µ1 - µ2) ] / √( σ1²/n1 + σ2²/n2 )

has mean 0 and variance 1.

If n1 and n2 are large, then Z is approximately N(0,1) by the Central Limit Theorem, though we did not assume the samples came from normal populations. (We also use the fact that the difference of independent normal r.v.'s is also normal.)

9

Page 277: Applied Statistics - MIT

Large Sample (Approximate) 100(1-α)% CI for µ1−µ2

(x̄ - ȳ) - z_{α/2} √( s1²/n1 + s2²/n2 ) ≤ µ1 - µ2 ≤ (x̄ - ȳ) + z_{α/2} √( s1²/n1 + s2²/n2 )

Note that si² has been substituted for σi² because the samples are large, i.e., bigger than 30.

Example 8.2: See Example 8.2 in course textbook.

10

Page 278: Applied Statistics - MIT

Large Sample (Approximate) Test of Hypothesis

H0: µ1 - µ2 = δ0 vs. H1: µ1 - µ2 ≠ δ0   (typically δ0 = 0)

Test statistic:  z = [ (x̄ - ȳ) - δ0 ] / √( s1²/n1 + s2²/n2 )

11

Page 279: Applied Statistics - MIT

Inference for Small Samples

Case 1: Variances σ1² and σ2² assumed equal.

The assumption of normal populations is important since we cannot invoke the CLT.

Pooled estimate of the common variance:

S² = [ Σ(Xi - X̄)² + Σ(Yi - Ȳ)² ] / (n1 + n2 - 2) = [ (n1 - 1)S1² + (n2 - 1)S2² ] / (n1 + n2 - 2)

Note: S² = (S1² + S2²)/2 if the sample sizes are equal.

T = [ (X̄ - Ȳ) - (µ1 - µ2) ] / [ S √(1/n1 + 1/n2) ]  has a t-distribution with n1 + n2 - 2 d.f.

12

Page 280: Applied Statistics - MIT

Inference for Small Sample: Confidence Intervals and Hypothesis Tests

Case 1: Variances σ1² and σ2² assumed equal.

Two-sided 100(1-α)% CI:

(x̄ - ȳ) - t_{n1+n2-2,α/2} s √(1/n1 + 1/n2) ≤ µ1 - µ2 ≤ (x̄ - ȳ) + t_{n1+n2-2,α/2} s √(1/n1 + 1/n2)

Test of hypothesis: H0: µ1 - µ2 = δ0 vs. H1: µ1 - µ2 ≠ δ0

Test statistic:  t = [ (x̄ - ȳ) - δ0 ] / [ s √(1/n1 + 1/n2) ]

Reject H0 if |t| > t_{n1+n2-2,α/2}.

13

Page 281: Applied Statistics - MIT

Hospitalization Cost Example•See Example 8.2 on page 276 of course textbook.

Contrast this conclusion with apparent difference seen on the Q-Q plot in Figure 8.1

14

Page 282: Applied Statistics - MIT

t.test in S-Plus to test difference in means of logs of hospitalization cost data

t.test(lhcc,lhct)

Standard Two-Sample t-Test

data:  lhcc and lhct
t = 0.6181, df = 58, p-value = 0.5389
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3731277  0.7064981
sample estimates:
 mean of x mean of y 
  8.250925   8.08424

15

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 283: Applied Statistics - MIT

Interpretation of Difference in Means on the Log Scale

Mean (log Cost) = Median (log Cost) = log (Median Cost)

Because distribution of log cost is symmetric

Because the log preserves ordering

-0.373 ≤ Mean(log Cost_C) - Mean(log Cost_T) ≤ 0.707
-0.373 ≤ log(Median Cost_C) - log(Median Cost_T) ≤ 0.707
-0.373 ≤ log(Median Cost_C / Median Cost_T) ≤ 0.707
0.689 = exp(-0.373) ≤ Median Cost_C / Median Cost_T ≤ exp(0.707) = 2.028

This gives a 95% confidence interval for the ratio of median costs. (This interpretation is not in your textbook.)

16

Page 284: Applied Statistics - MIT

Inference for Small Samples

Case 2: Variances σ1² and σ2² unequal.

T = [ (X̄ - Ȳ) - (µ1 - µ2) ] / √( S1²/n1 + S2²/n2 )  does not have a Student-t distribution.

It can be shown that the distribution of T depends on the ratio of the unknown variances, hence T is not a pivotal quantity. However, when n1 and n2 are large, T has an approximate N(0,1) distribution.

17

Page 285: Applied Statistics - MIT

Inference for Small Samples

Case 2: Variances σ1² and σ2² unequal.

For small samples,

T = [ (X̄ - Ȳ) - (µ1 - µ2) ] / √( S1²/n1 + S2²/n2 )

has approximately a t-distribution with degrees of freedom

ν = (w1 + w2)² / [ w1²/(n1 - 1) + w2²/(n2 - 1) ]

where w1 = s1²/n1 = [SEM(x̄)]² and w2 = s2²/n2 = [SEM(ȳ)]².

Note: d.f. are estimated from the data and are not a function of the samples sizes alone

Note: ν is not usually an integer but is rounded down to the nearest integer

18

Page 286: Applied Statistics - MIT

Inference for Small Samples

Case 2: Variances σ1² and σ2² unequal.

Approximate 100(1-α)% two-sided CI for µ1 - µ2:

(x̄ - ȳ) - t_{ν,α/2} √( s1²/n1 + s2²/n2 ) ≤ µ1 - µ2 ≤ (x̄ - ȳ) + t_{ν,α/2} √( s1²/n1 + s2²/n2 )

Test statistic for H0: µ1 - µ2 = δ0 vs. H1: µ1 - µ2 ≠ δ0:

t = [ (x̄ - ȳ) - δ0 ] / √( s1²/n1 + s2²/n2 )

Reject H0 if |t| > t_{ν,α/2}.

19

Page 287: Applied Statistics - MIT

Hospitalization Costs: Inference Using Separate Variances

See Example 8.4 on page 280 of course textbook.

20

Page 288: Applied Statistics - MIT

t.test in S-Plus to test differences in means of hospitalization data, unequal variances

t.test(lhcc,lhct,var.equal=F)

Welch Modified Two-Sample t-Test

data:  lhcc and lhct
t = 0.6181, df = 54.61, p-value = 0.5391
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3738420  0.7072124
sample estimates:
 mean of x mean of y 
  8.250925   8.08424

21

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 289: Applied Statistics - MIT

Testing for the Equality of Variances

Section 8.4 covers the classical F test for the equality of two variances and associated confidence intervals. However, this method is not robust against departures from normality. For example, p-values can be off by a factor of 10 if the distributions have shorter or longer tails than the normal.

A robust alternative is Levene’s test. His test applies the two-sample t-test to the absolute value of the difference of each observation and the group mean:

|Y_{1i} - Ȳ1|,  i = 1, 2, ..., n1
|Y_{2i} - Ȳ2|,  i = 1, 2, ..., n2

This method works well even though these absolute deviations are not independent (a minimal sketch follows below).

In the Brown-Forsythe test the response is the absolute value of the difference of each observation and the group median.

22
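A minimal sketch of Levene's idea in S-PLUS, using hypothetical samples y1 and y2 (not data from the text):

y1 <- rnorm(30, sd = 1); y2 <- rnorm(30, sd = 2)   # hypothetical samples
d1 <- abs(y1 - mean(y1))                           # absolute deviations from the group means
d2 <- abs(y2 - mean(y2))
t.test(d1, d2)                                     # two-sample t-test on the absolute deviations

For the Brown-Forsythe variant, replace mean by median in the deviations.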

Page 290: Applied Statistics - MIT

Independent Sample Design: Sample Size Determination Assuming Equal Variances

H0: µ1 - µ2 = 0 vs. H1: µ1 - µ2 ≠ 0

n1 = n2 = n = 2 [ (z_{α/2} + z_β) σ / δ ]²

Because we assume a known variance, this n is a slight underestimate of the sample size.

δ is the smallest difference of practical importance that we want to detect.

23

Page 291: Applied Statistics - MIT

Using S-Plus to compute sample size

normal.sample.size(mean2=.693,power=0.9)mean1 sd1 mean2 sd2 delta alpha power n1 n2 prop.n2

0 1 0.693 1 0.693 0.05 0.9 44 44 1

24

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 292: Applied Statistics - MIT

Matched Pairs Design

Example:See Section 8.3.2, page 283 in course textbook.

25

Page 293: Applied Statistics - MIT

Statistical Justification of Matched Pairs Design

See Section 8.3.2, page 283 in course textbook.

26

Page 294: Applied Statistics - MIT

Sample Size Determination

n = [ (z_α + z_β) σ_D / δ ]²   (One-Sided Test)

n = [ (z_{α/2} + z_β) σ_D / δ ]²   (Two-Sided Test)

• One needs a planning value for σ_D.
• These formulas come from the one-sample formulas applied to the differences.

27

Page 295: Applied Statistics - MIT

Comparing Variances of Two Populations

•Application arises when comparing instrument precision oruniformities of products.

•The methods discussed in the book are applicable only under theassumption of normality of the data. They are highly sensitiveto even modest departures from normality

• In case of nonnormal data there are nonparametric and other robust methods for comparing data dispersion.

28

Page 296: Applied Statistics - MIT

Comparing Variances of Two Populations

Independent sample design:
Sample 1: x1, x2, ..., x_{n1} is a random sample from N(μ1, σ1²)
Sample 2: y1, y2, ..., y_{n2} is a random sample from N(μ2, σ2²)

F = (S1²/σ1²) / (S2²/σ2²) has an F distribution with n1 - 1 and n2 - 1 d.f., respectively.

P{ f_{n1-1, n2-1, 1-α/2}  ≤  (S1²/σ1²)/(S2²/σ2²)  ≤  f_{n1-1, n2-1, α/2} } = 1 - α

which is equivalent to

P{ (1/f_{n1-1, n2-1, α/2}) (S1²/S2²)  ≤  σ1²/σ2²  ≤  (1/f_{n1-1, n2-1, 1-α/2}) (S1²/S2²) } = 1 - α

(1-α)-level CI (two-sided):

(1/f_{n1-1, n2-1, α/2}) (S1²/S2²)  ≤  σ1²/σ2²  ≤  (1/f_{n1-1, n2-1, 1-α/2}) (S1²/S2²)
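In S-Plus the F test and this confidence interval can be obtained with var.test; a minimal sketch, assuming two numeric samples x and y (hypothetical names):

# F test for equality of two variances; also reports the CI for sigma1^2/sigma2^2
var.test(x, y, conf.level = 0.95)

# the CI limits written out from the formula above
ratio <- var(x) / var(y)
c(ratio / qf(0.975, length(x) - 1, length(y) - 1),
  ratio / qf(0.025, length(x) - 1, length(y) - 1))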

29

Page 297: Applied Statistics - MIT

An Important Industrial Application: Example 8.8

(See Table 8.8 in course textbook.)

Do the two labs have equal measurement precision?

30

Page 298: Applied Statistics - MIT

Inferences for Proportions and Count Data

Corresponds to Chapter 9 of

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León

(University of Tennessee) 1

Page 299: Applied Statistics - MIT

Inference for Proportions

• Data = {0, 1, 1, 1, 0, ..., 1, 0}, Bernoulli(p)
• Goal – estimate p, the probability of success (or the proportion of the population with a certain attribute)
• p̂ = x/n, where x = number of successes in n trials
• Var(p̂) = p(1-p)/n = pq/n
• The variance depends on the mean.

2

Page 300: Applied Statistics - MIT

Large Sample Confidence Interval for Proportion

Recall that (p̂ - p)/√(pq/n) ≈ N(0,1) if n is large  (q = 1 - p, np̂ ≥ 10 and n(1 - p̂) ≥ 10).

It follows that:

P( -z_{α/2}  ≤  (p̂ - p)/√(p̂q̂/n)  ≤  z_{α/2} ) ≈ 1 - α

Confidence interval for p:

p̂ - z_{α/2} √(p̂q̂/n)  ≤  p  ≤  p̂ + z_{α/2} √(p̂q̂/n)

3

Page 301: Applied Statistics - MIT

A Better Confidence Interval for Proportion

Use this probability statement

P( -z_{α/2}  ≤  (p̂ - p)/√(pq/n)  ≤  z_{α/2} ) ≈ 1 - α

Solve for p using the quadratic formula.

CI for p:

[ p̂ + z²/(2n) - z √(p̂q̂/n + z²/(4n²)) ] / (1 + z²/n)  ≤  p  ≤  [ p̂ + z²/(2n) + z √(p̂q̂/n + z²/(4n²)) ] / (1 + z²/n)

where z = z_{α/2}
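A short sketch that evaluates both the simple (Wald) interval and this quadratic-solution interval, assuming x successes in n trials (hypothetical inputs):

prop.ci <- function(x, n, alpha = 0.05)
{
    p <- x / n
    q <- 1 - p
    z <- qnorm(1 - alpha / 2)
    wald <- p + c(-1, 1) * z * sqrt(p * q / n)
    quad <- (p + z^2 / (2 * n) + c(-1, 1) * z * sqrt(p * q / n + z^2 / (4 * n^2))) /
            (1 + z^2 / n)
    list(simple = wald, quadratic = quad)
}
# prop.ci(360, 800)   # hypothetical counts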

4

Page 302: Applied Statistics - MIT

Example

See Example 9.1 on page 301 of the course textbook.

5

Page 303: Applied Statistics - MIT

Binomial CI

In S-Plus:
> qbinom(.975, 800, 0.45)
[1] 388
> qbinom(.025, 800, 0.45)
[1] 332

95% CI for the proportion of gun owners is:
332/800 ≤ p ≤ 388/800
0.415 ≤ p ≤ 0.485

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

6

Page 304: Applied Statistics - MIT

Sample Size Determination for a Confidence Interval for Proportion

Want (1-α)-level two-sided CI:

p̂ ± E, where E is the margin of error. Then E = z_{α/2} √(p̂q̂/n).

Solving for n gives  n = (z_{α/2}/E)² p̂q̂.

The largest value of p̂q̂ is (1/2)(1/2) = 1/4, so the conservative sample size is:

n = (z_{α/2}/E)² (1/4)    (Formula 9.5)
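For illustration (hypothetical values): a 95% CI with margin of error E = 0.03 using the conservative bound:

E <- 0.03
ceiling((qnorm(0.975) / E)^2 / 4)   # 1068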

7

Page 305: Applied Statistics - MIT

Example 9.2: Presidential Poll

See Example 9.2 on page 302 of the course textbook.

Threefold increase in precision requires ninefold increase in sample size

8

Page 306: Applied Statistics - MIT

Large Sample Hypothesis Test on Proportion

H0: p = p0 vs. H1: p ≠ p0

Test statistic: z = (p̂ - p0) / √(p0 q0 / n)

Acceptance region: p0 ± cd, where c = z_{α/2} and d = √(p0 q0 / n).

9

Page 307: Applied Statistics - MIT

Basketball Problem: z-test

See Example 9.3 on page 303 of the course textbook.

(Illustration of the P-value region; the observed test statistic is z = 2.182.)

10

Page 308: Applied Statistics - MIT

Exact Binomial Test in S-Plus

> 1 - pbinom(299, 400, .7)
[1] 0.01553209

(Plot: dbinom(x, 400, 0.7) for x from 240 to 320.)

11

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 309: Applied Statistics - MIT

Sample Size for Z-Test of Proportion

H0: p ≤ p0 vs. H1: p > p0

Suppose that the power for rejecting H0 must be at least 1 - β when the true proportion is p = p1 > p0. Let δ = p1 - p0. Then

n = [ (z_α √(p0 q0) + z_β √(p1 q1)) / δ ]²

Test based on: z = (p̂ - p0) / √(p0 q0 / n)

Replace z_α by z_{α/2} for the two-sided test sample size.

12

Page 310: Applied Statistics - MIT

Example 9.4: Pizza Testing

See Example 9.4 on page 305 of the course textbook.

n = [ (z_{α/2} √(p0 q0) + z_β √(p1 q1)) / δ ]²

13

Page 311: Applied Statistics - MIT

Comparing Two Proportions: Independent Sample Design

If n1 p̂1, n1 q̂1, n2 p̂2, n2 q̂2 ≥ 10, then

Z = [ p̂1 - p̂2 - (p1 - p2) ] / √(p̂1 q̂1/n1 + p̂2 q̂2/n2) ≈ N(0,1)

Confidence Interval:

p̂1 - p̂2 - z_{α/2} √(p̂1 q̂1/n1 + p̂2 q̂2/n2)  ≤  p1 - p2  ≤  p̂1 - p̂2 + z_{α/2} √(p̂1 q̂1/n1 + p̂2 q̂2/n2)

14

Page 312: Applied Statistics - MIT

Test for Equality of Proportions (Large n) Independent Sample Design – pooled estimate of p

H0: p1 = p2 vs. H1: p1 ≠ p2

Test statistic: z = (p̂1 - p̂2) / √( p̂q̂ (1/n1 + 1/n2) )

where p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2) = (x + y)/(n1 + n2)
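In S-Plus this pooled test can be obtained with prop.test (in its equivalent chi-square form); a minimal sketch with hypothetical counts x1, x2 out of n1, n2 trials:

# two-sample test for equality of proportions
prop.test(c(x1, x2), c(n1, n2), correct = F)

# the pooled z statistic written out directly
p1 <- x1 / n1; p2 <- x2 / n2
p  <- (x1 + x2) / (n1 + n2)
z  <- (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))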

15

Page 313: Applied Statistics - MIT

Example 9.6 –Comparing Two Leukemia Therapies

See Example 9.6 on page 310 of the course textbook.

16

Page 314: Applied Statistics - MIT

Inference for Small Samples Fisher’s Exact Test

• Calculates the probability of obtaining the observed 2x2 table, or any table more extreme, with the margins fixed.

• Uses hypergeometric distribution

P(X = x | N, M, K) = [ C(M, x) · C(N - M, K - x) ] / C(N, K)
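A minimal sketch of the exact test in S-Plus, using the 2x2 counts from the leukemia trial shown later in these slides (14, 7 in row 1 and 38, 4 in row 2):

tab <- matrix(c(14, 38, 7, 4), nrow = 2)   # rows: (14, 7) and (38, 4)
fisher.test(tab)
# one tail from the hypergeometric distribution directly:
# P(X <= 14), where X = count in cell (1,1), row totals 21 and 42, column 1 total 52
phyper(14, 21, 42, 52)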

17

Page 315: Applied Statistics - MIT

Inference for Count Data

Data = cell counts = number of observations in each of several (>2) categories, ni, i = 1..c, Σni = n

Joint distribution of corresponding r.v.’s is multinomial.

Goal – determine if the probabilities of belonging to each of the categories are equal to hypothesized values, pi0.

Test statistic: χ² = Σ (observed - expected)²/expected, where observed = ni and expected = npi0.

χ² has a chi-square distribution when the sample size is large.
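A minimal sketch of this one-way (multinomial) chi-square test, computed directly with hypothetical counts and hypothesized probabilities:

n.obs <- c(20, 30, 25, 25)            # hypothetical cell counts
p0    <- c(0.25, 0.25, 0.25, 0.25)    # hypothesized cell probabilities
e     <- sum(n.obs) * p0              # expected counts
chisq <- sum((n.obs - e)^2 / e)       # the test statistic above
1 - pchisq(chisq, length(n.obs) - 1)  # large-sample P-value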

18

Page 316: Applied Statistics - MIT

Multinomial Test of Proportions

See Example 9.10 on page 316 of the course textbook.

19

Page 317: Applied Statistics - MIT

Inferences for Two-Way Count Datay: Job Satisfaction

x: Annual Very Slightly Slightly Very Satisfied Row Sum Salary Dissatisfied Dissatisfied Satisfied

Less than $10,000

81 64 29 10 184

$10,000-25,000

73 79 35 24 211

$25,000-50,000

47 59 75 58 239

More than $50,000

14 23 84 69 190

Column Sum 215 225 223 161 824

Sampling Model 1: Multinomial Model (Total Sample Size Fixed)
Sample of 824 from a single population that is then cross-classified.

The null hypothesis is that X and Y are independent:
H0: pij = P(X = i, Y = j) = P(X = i) P(Y = j) = p_i. p_.j  for all i, j

20

Page 318: Applied Statistics - MIT

Sampling Model 1 (Total Sample Size Fixed)
Based on Table 9.10 in the course textbook

y: Job Satisfaction

x: Annual Salary     Very Dissatisfied  Slightly Dissatisfied  Slightly Satisfied  Very Satisfied  Row Sum
Less than $10,000            81                  64                   29                10           184
$10,000-25,000               73                  79                   35                24           211
$25,000-50,000               47                  59                   75                58           239
More than $50,000            14                  23                   84                69           190
Column Sum                  215                 225                  223               161           824

Estimated Expected Frequency (cell 1,1) = n p̂_1. p̂_.1 = 824 (215/824)(184/824) = (215 × 184)/824 = 48.01

21

Page 319: Applied Statistics - MIT

Chi-Square Statistics

See Example 9.13, page 324 for instructions on calculating the chi-square statistic.

χ² = Σ_{i=1}^{c} (n_i - e_i)² / e_i

22

Page 320: Applied Statistics - MIT

Chi-Square Test Critical Value

Based on Table A.5 (critical values χ²_{ν,α} of the chi-square distribution) in the course textbook:

The d.f. for this χ² statistic is (4-1)(4-1) = 9. Since χ²_{9,.05} = 16.919, the calculated χ² = 11.989 is not sufficiently large to reject the hypothesis of independence at the α = .05 level.

(Excerpt from Table A.5, upper critical values of the chi-square distribution; the ν = 9, α = .05 entry is 16.919.)
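The critical value and the P-value can also be checked directly in S-Plus:

qchisq(0.95, 9)          # 16.919, the critical value used above
1 - pchisq(11.989, 9)    # P-value of the observed chi-square, about 0.21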

23

Page 321: Applied Statistics - MIT

S-Plus – job satisfaction example

Call:
crosstabs(formula = c(jobsat) ~ c(row(jobsat)) + c(col(jobsat)))
901 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(jobsat))|c(col(jobsat))
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 20    | 24    | 80    | 82    |206    |
       |0.097  |0.12   |0.39   |0.4    |0.23   |
       |0.32   |0.22   |0.25   |0.2    |       |
       |0.022  |0.027  |0.089  |0.091  |       |
-------+-------+-------+-------+-------+-------+
2      | 22    | 38    |104    |125    |289    |
       |0.076  |0.13   |0.36   |0.43   |0.32   |
       |0.35   |0.35   |0.33   |0.3    |       |
       |0.024  |0.042  |0.12   |0.14   |       |
-------+-------+-------+-------+-------+-------+
3      | 13    | 28    | 81    |113    |235    |
       |0.055  |0.12   |0.34   |0.48   |0.26   |
       |0.21   |0.26   |0.25   |0.27   |       |
       |0.014  |0.031  |0.09   |0.13   |       |
-------+-------+-------+-------+-------+-------+
4      | 7     | 18    | 54    | 92    |171    |
       |0.041  |0.11   |0.32   |0.54   |0.19   |
       |0.11   |0.17   |0.17   |0.22   |       |
       |0.0078 |0.02   |0.06   |0.1    |       |
-------+-------+-------+-------+-------+-------+
ColTotl|62     |108    |319    |412    |901    |
       |0.069  |0.12   |0.35   |0.46   |       |
-------+-------+-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 11.98857 d.f.= 9 (p=0.2139542)
Yates' correction not used
>

24

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 322: Applied Statistics - MIT

Product Multinomial Model: Row Totals Fixed

(See Table 9.2 in the course textbook.)

Sampling Model 2: Product Multinomial Total number of patients in each drug group is fixed.

•The null hypothesis is that the probability of column response (success or failure) is the same, regardless of the row population:

H0: P(Y = j | X = i) = p_j

25

Page 323: Applied Statistics - MIT

S-Plus – leukemia trial

Call:
crosstabs(formula = c(leuk) ~ c(row(leuk)) + c(col(leuk)))
63 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(leuk))|c(col(leuk))
       |1      |2      |RowTotl|
-------+-------+-------+-------+
1      |14     | 7     |21     |
       |0.67   |0.33   |0.33   |
       |0.27   |0.64   |       |
       |0.22   |0.11   |       |
-------+-------+-------+-------+
2      |38     | 4     |42     |
       |0.9    |0.095  |0.67   |
       |0.73   |0.36   |       |
       |0.6    |0.063  |       |
-------+-------+-------+-------+
ColTotl|52     |11     |63     |
       |0.83   |0.17   |       |
-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 5.506993 d.f.= 1 (p=0.01894058)
Yates' correction not used
Some expected values are less than 5, don't trust stated p-value
>

26

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 324: Applied Statistics - MIT

Remarks About Chi-Square Test

• The distribution of the chi-square statistic under the null hypothesis is approximately chi-square only when the sample sizes are large:
  – The rule of thumb is that all expected cell counts should be greater than 1, and
  – no more than 1/5th of the expected cell counts should be less than 5.

• Combine sparse cells (having small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.

• Never stop with the chi-square test. Look at cells with large values of (O-E), as in job satisfaction example.

27

Page 325: Applied Statistics - MIT

Odds Ratio as a Measure of Association for a 2x2 Table

Sampling Model I: Multinomial

ψ = (p11/p12) / (p21/p22) = (p11 p22)/(p12 p21)

The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2, hence the name “odds ratio”

28

Page 326: Applied Statistics - MIT

Odds Ratio as a Measure of Association for a 2x2 Table

Sampling Model II: Product Multinomial

ψ = [ p1/(1 - p1) ] / [ p2/(1 - p2) ]

If the two column outcomes are labeled "success" and "failure," then ψ is the odds of success for the row 1 population vs. the odds of success for the row 2 population.

29

Page 327: Applied Statistics - MIT

Inference in a Nutshell

Slides prepared by Elizabeth Newton (MIT)

Corresponds to Chapters 6-9 of Tamhane and Dunlop

1

Page 328: Applied Statistics - MIT

Outline

Chapter 6: Basic Concepts of Inference
  Mean Square Error
  Confidence Interval
  Hypothesis Test

Chapter 7: Inference for Single Samples
  Mean - Large Sample - z
  Mean - Small Sample - t
  Variance - Chi-square
  Prediction and Tolerance Intervals

2

Page 329: Applied Statistics - MIT

Outline (continued)

Chapter 8 - Inference for Two Samples
  Comparing Means, Independent, Large Sample - z
  Comparing Means, Independent, Small Sample
    Variances equal - t
    Variances not equal - t with df from SEM
  Matched Pairs - test differences - t
  Comparing Variances - F

3

Page 330: Applied Statistics - MIT

Outline (continued)

Chapter 9 - Inferences for Proportions and Count Data
  Proportion, Large sample - z
  Proportion, Small sample - binomial
  Comparing 2 Proportions, large - z or Chi-square
  Comparing 2 Proportions, small - Fisher's Exact
  Matched Pairs - McNemar's Test
  One-way Count - Chi-square
  Two-way Count - Chi-square
  Goodness of Fit - Chi-square
  Odds ratio - z

4

Page 331: Applied Statistics - MIT

Confidence Interval on the Mean

û ± cd is a two-sided CI for the mean u, where:
  û = estimator of u = sample mean
  d = standard deviation of û
  c = critical constant, for instance z_{α/2} or t_{n-1,α/2}
z_{α/2} is such that P(Z > z_{α/2}) = α/2.
z_{α/2} = Φ⁻¹(1-α/2) = qnorm(1-α/2) = -qnorm(α/2)
If α = 0.05 then z_{α/2} = 1.96.
If we draw many samples and construct 95% CIs from them, 95% would contain the true value of u.

5

Page 332: Applied Statistics - MIT

Confidence Intervals

(See Figure 6.2 on page 205 of the course textbook.)

6

Page 333: Applied Statistics - MIT

Hypothesis Tests

• H0: null hypothesis, no change, no effect, for instance u = u0
• H1: alternative hypothesis, u ≠ u0
• α = P(Type I error) = P(reject H0 | H0 true)
• β = P(Type II error) = P(accept H0 | H0 false)
• Power = function of u = P(reject H0 | u)
• A two-sided hypothesis test rejects H0 when
  |û - u0|/d > c  ↔  |û - u0| > cd  ↔  û < u0 - cd or û > u0 + cd

7

Page 334: Applied Statistics - MIT

Level α Tests

(See Table 7.1 on page 240 of the course textbook.)

8

Page 335: Applied Statistics - MIT

P-Values

• P-Value is the probability of obtaining the observed result or one more extreme

• Two-sided P-Value = P(|Z| > |û - u0|/d) = 2[1 - Φ(|û - u0|/d)] = 2*(1 - pnorm(abs(û - u0)/d)) in S-Plus

9

Page 336: Applied Statistics - MIT

P-Values

(See Table 7.2 on page 241 of the course textbook.)

10

Page 337: Applied Statistics - MIT

Power Function

Power is the probability of rejecting H0 for a given value of u.

π(u) = P(û<u0-cd | u) + P(û>u0+cd |u)

= Φ[-c+(u0-u)/d] + Φ[-c+(u-u0)/d]
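This power function is easy to evaluate and plot in S-Plus; a minimal sketch for the two-sided test (all numeric values are hypothetical):

# power of the two-sided test |u.hat - u0|/d > c, as a function of u
power.fn <- function(u, u0, d, alpha = 0.05)
{
    c <- qnorm(1 - alpha / 2)
    pnorm(-c + (u0 - u) / d) + pnorm(-c + (u - u0) / d)
}
u <- seq(-3, 3, length = 101)
plot(u, power.fn(u, u0 = 0, d = 1), type = "l", ylab = "power")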

11

Page 338: Applied Statistics - MIT

Power

(See Figure 7.3 on page 245 of the course textbook.)

12

Page 339: Applied Statistics - MIT

Reject H0

(1) If u0 falls outside interval û ± cd.

(2) if û falls outside interval u0 ± cd.

(3) if p-value is small.

13

Page 340: Applied Statistics - MIT

Simple Linear Regression and Correlation.

Corresponds to Chapter 10

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by

Jacqueline Telford (Johns Hopkins University)

1

Page 341: Applied Statistics - MIT

Simple linear regression analysis estimates the relationship between two variables.

One of the variables is regarded as a response or outcome variable (y).

The other variable is regarded as predictor or explanatory variable (x).

Sometimes it is not clear which of the two variables should be the response (e.g. height and weight). In this case, correlation analysis may be used.

Simple linear regression estimates relationships of the form y = a + bx.

2

Page 342: Applied Statistics - MIT

Scatter plot of ozone concentration by temperature

(Plot: air$ozone versus air$temperature.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.3

Page 343: Applied Statistics - MIT

A Probabilistic Model for Simple Linear Regression

Let x1, x2,..., xn be specific settings of the predictor variable.

Let y1, y2,..., yn be the corresponding values of the response variable.

Assume that yi is the observed value of a random variable (r.v.) Yi, which depends on X according to the following model:

Yi = β0 + β1 xi + εi (i = 1, 2, …, n)

Here εi is the random error with E(εi)=0 and Var(εi)=σ2 .

Thus, E(Yi) = µi = β0 + β1 xi (true regression line).

The xi’s usually are assumed to be fixed (not random variables).

4

Page 344: Applied Statistics - MIT

A Probabilistic Model for Simple Linear Regression

See Figure 10.1, p. 348 and also see page 348 for the four assumptions of a simple linear regression model.

5

Page 345: Applied Statistics - MIT

Least Square Line Mathematics (invented by Gauss)

Find the line, i.e., the values of β0 and β1, that minimizes the sum of squared deviations:

Q = Σ_{i=1}^{n} [ yi - (β0 + β1 xi) ]²

How?

Solve for the values of β0 and β1 for which

∂Q/∂β0 = 0 and ∂Q/∂β1 = 0

6

Page 346: Applied Statistics - MIT

Finding Regression Coefficients

∂Q/∂β0 = -2 Σ_{i=1}^{n} [ yi - (β0 + β1 xi) ]

∂Q/∂β1 = -2 Σ_{i=1}^{n} xi [ yi - (β0 + β1 xi) ]

7

Page 347: Applied Statistics - MIT

Normal Equations

n β0 + β1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi

β0 Σ_{i=1}^{n} xi + β1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi

8

Page 348: Applied Statistics - MIT

Solution to Normal Equations

β̂1 = Σ_{i=1}^{n} (xi - x̄)(yi - ȳ) / Σ_{i=1}^{n} (xi - x̄)² = S_xy / S_xx

β̂0 = ȳ - β̂1 x̄

Note that the least squares line goes through (x̄, ȳ).
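A minimal sketch that evaluates these formulas directly in S-Plus and compares with lm; it assumes the air data used elsewhere in these slides:

x <- air$temperature; y <- air$ozone   # response and predictor from the air data
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1  <- Sxy / Sxx               # beta1.hat
b0  <- mean(y) - b1 * mean(x)  # beta0.hat
coef(lm(y ~ x))                # should agree with c(b0, b1)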

9

Page 349: Applied Statistics - MIT

Fitted regression line

(Plot: air$ozone versus air$temperature with the fitted least squares line.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.10

Page 350: Applied Statistics - MIT

Fitted values of y: ŷi = β̂0 + β̂1 xi, i = 1, 2, ..., n

Residuals: ei = yi - ŷi = yi - (β̂0 + β̂1 xi), i = 1, 2, ..., n

temperature  ozone  fitted  resid
    67        3.45    2.49    0.96
    72        3.30    2.84    0.46
    74        2.29    2.98   -0.69
    62        2.62    2.14    0.48
    65        2.84    2.35    0.50

11This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 351: Applied Statistics - MIT

Matrix Approach to Simple Linear Regression (what your regression package is really doing)

The model: y=Xβ + ε

y is n by 1
X is n by 2
β is 2 by 1
ε is n by 1

12

Page 352: Applied Statistics - MIT

Y=Xβ + ε

[ y1 ]   [ 1  x1 ]            [ ε1 ]
[ y2 ] = [ 1  x2 ]  [ β0 ]  + [ ε2 ]
[ y3 ]   [ 1  x3 ]  [ β1 ]    [ ε3 ]
[ y4 ]   [ 1  x4 ]            [ ε4 ]

(illustrated here for n = 4)

13

Page 353: Applied Statistics - MIT

Solution of linear equations

In linear algebra:Find x which solves Ax=b.

In regression analysis:Find β which solves Xβ=y Why can’t we do this?

14

Page 354: Applied Statistics - MIT

Least Squares

Q = (y-Xβ)'(y-Xβ) = y'y - β'X'y - y'Xβ + β'X'Xβ
  = y'y - 2β'X'y + β'X'Xβ

∂Q/∂β = -2X'y + 2X'Xβ

∂Q/∂β = 0 → X'y = X'Xb, where b = β̂

15

Page 355: Applied Statistics - MIT

Least Squares continued

For simple linear regression:

X'X = [ n      Σxi  ]        X'y = [ Σyi    ]
      [ Σxi    Σxi² ]              [ Σxi yi ]

16

Page 356: Applied Statistics - MIT

Least Squares continued

X’Xb = X’y

⎥⎥⎦

⎢⎢⎣

⎡=

⎥⎥⎦

⎢⎢⎣

∑∑

∑ ∑∑

ii

i

i yx

y

x

n b

x

x 2i

i

The Normal Equations as before17

Page 357: Applied Statistics - MIT

Least Squares continued

X’Xb = X’yb= (X’X)-1X’y (if X has linearly

independent columns)Solution by QR decompositionX=QR, Q orthonormal, R upper triangular

and invertibleb=(X’X)-1X’y = (R’Q’QR)-1R’Q’y=(R’R)-1R’Q’y = R-1Q’y

18

Page 358: Applied Statistics - MIT

The Hat Matrix

b=(X’X)-1 X’y=Xb = X(X’X)-1X’y =Hy

H (n by n) is the Hat matrixTakes y toH is symmetric and idempotent HH=HDiagonal elements of the hat matrix are

useful in detecting influential observations.

y

y

19

Page 359: Applied Statistics - MIT

Expected value of b

E(b) = E((X’X)-1X’y]= E[(X’X)-1X’(Xβ+ε)]= E[(X’X)-1X’X β+ (X’X)-1X’ε]= β

Hence b is an unbiased estimator of β.

20

Page 360: Applied Statistics - MIT

Covariance of b

The covariance matrix of y is σ²I.
b = (X'X)⁻¹X'y = Ay  (where A is k by n)
Cov(b) = A Cov(y) A' = A σ²I A' = σ²AA'
       = σ²(X'X)⁻¹X'X(X'X)⁻¹
       = σ²(X'X)⁻¹

21

Page 361: Applied Statistics - MIT

Covariance of b

For simple linear regression,

σ²(X'X)⁻¹ = [ σ² / ( nΣxi² - (Σxi)² ) ] [  Σxi²   -Σxi ]
                                        [ -Σxi      n  ]

SD(β̂0) = σ √( Σxi² / (n S_xx) );   SD(β̂1) = σ / √S_xx

22

Page 362: Applied Statistics - MIT

Estimation of σ2

s² = Σ_{i=1}^{n} ei² / (n - 2) = Σ_{i=1}^{n} (yi - ŷi)² / (n - 2)

Note: The denominator is n - 2 since two parameters are being estimated (β0 and β1).

E[S2]=σ2 (See proof in Seber, Linear Regression Analysis)

23

Page 363: Applied Statistics - MIT

Statistical Inference for β0 and β1

SE(β̂0) = s √( Σxi² / (n S_xx) )   and   SE(β̂1) = s / √S_xx

For the ozone example:

Coefficients:
                Value Std. Error  t value Pr(>|t|)
(Intercept)   -2.2260     0.4614  -4.8243   0.0000
temperature    0.0704     0.0059  11.9511   0.0000

24This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 364: Applied Statistics - MIT

Sums of Squares

Sum of Squares Total (SST): Σ_{i=1}^{n} (yi - ȳ)²

Sum of Squares for Error (SSE): Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi - ŷi)²

Sum of Squares for Regression (SSR): Σ_{i=1}^{n} (ŷi - ȳ)²

25

Page 365: Applied Statistics - MIT

Geometry of the Sums of Squares

yi - ȳ = (ŷi - ȳ) + (yi - ŷi)

SST = SSR + SSE, see derivation on p. 354

26J. Telford

Page 366: Applied Statistics - MIT

Coefficient of Determination (R-squared)

r² = SSR/SST = 1 - SSE/SST

= proportion of the variance in y that is accounted for by the regression on x

= square of the correlation between y and ŷ

For the ozone example: Multiple R-Squared: 0.5672

27

Page 367: Applied Statistics - MIT

Analysis of Variance (ANOVA)

H0: β1 = 0 vs. H1: β1 ≠ 0

F = (SSR/1) / (SSE/(n-2)) = MSR/MSE = t²

For the ozone example: summary.aov(tmp)

            Df Sum of Sq  Mean Sq  F Value Pr(F)
temperature  1  49.46178 49.46178 142.8282     0
Residuals  109  37.74698  0.34630

28This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 368: Applied Statistics - MIT

Regression Diagnostics: residual vs. observation number

(Plot: resid(ozone.lm) versus observation number.)

29This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 369: Applied Statistics - MIT

Regression Diagnostics: residual vs. fitted value

(Plot: resid(ozone.lm) versus fitted(ozone.lm).)

30This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 370: Applied Statistics - MIT

Regression Diagnostics: residual vs. x

(Plot: resid(ozone.lm) versus air$temperature.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.31

Page 371: Applied Statistics - MIT

Regression Diagnostics: QQ plot of residuals

(Plot: resid(ozone.lm) versus quantiles of the standard normal.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.32

Page 372: Applied Statistics - MIT

Hat Matrix Diagonals

(Plot: hat(model.matrix(ozone.lm)) versus observation number.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.33

Page 373: Applied Statistics - MIT

Some useful S-Plus commands

my.lm <- lm(y~x, data=mydata, na.action=na.omit)
  includes an intercept term by default

summary(my.lm)
  gives coefficients, correlation of coefficients, R-square, F-statistic, residual standard error

summary.aov(my.lm) gives ANOVA table

resid(my.lm) gives residuals

fitted(my.lm) gives fitted values

model.matrix(my.lm) gives model matrix

34This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 374: Applied Statistics - MIT

Multiple Linear Regression

Corresponds to Chapter 11 of Tamhane & Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT).

Page 375: Applied Statistics - MIT

Linear Regression

Review:
Linear Model: y = Xβ + ε
y ~ N(Xβ, σ²I)
Least squares: β̂ = (X'X)⁻¹X'y
ŷ = fitted value of y = Xβ̂ = X(X'X)⁻¹X'y = Hy
e = error = residuals = y - ŷ = y - Hy = (I - H)y

2

Page 376: Applied Statistics - MIT

Properties of the Hat matrix

• Symmetric: H' = H
• Idempotent: HH = H
• Trace(H) = sum(diag(H)) = k+1 = number of columns in the X matrix
• 1'H = vector of 1's (hence y and ŷ have the same mean)
• 1'(I-H) = vector of 0's (hence the mean of the residuals is 0)
• What is H when X is only a column of 1's?

3

Page 377: Applied Statistics - MIT

Variance-Covariance Matrices

Cov(β̂) = σ²(X'X)⁻¹  (as we saw last time)

Cov(ŷ) = Cov(Hy) = H Cov(y) H' = σ²HH = σ²H

Cov(e) = Cov((I-H)y) = (I-H) Cov(y) (I-H)' = σ²(I-H)(I-H) = σ²(I-H)

4

Page 378: Applied Statistics - MIT

Confidence and Prediction Intervals

Mean response at x0: ŷ0 = x0'β̂, with variance
Var(ŷ0) = Var(x0'β̂) = σ² x0'(X'X)⁻¹x0 = σ²v0

New observation at x0: y0 = x0'β + ε0, and the variance of the prediction error is
Var(y0 - ŷ0) = σ² x0'(X'X)⁻¹x0 + σ² = σ²(v0 + 1)

An estimate of σ2 is s2 = MSE = y’(I-H)y /(n-k-1)

5

Page 379: Applied Statistics - MIT

Confidence and Prediction Intervals

(1-α) Confidence Interval on Mean Response at x0:

ŷ0 ± cd, where c = t_{n-(k+1), α/2} and d = s√v0

(1-α) Prediction Interval on New Observation at x0:

ŷ0 ± cd, where c = t_{n-(k+1), α/2} and d = s√(v0 + 1)

6

Page 380: Applied Statistics - MIT

Sums of Squares

Sum of Squares Total (SST): Σ_{i=1}^{n} (yi - ȳ)²

Sum of Squares for Error (SSE): Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi - ŷi)²

Sum of Squares for Regression (SSR): Σ_{i=1}^{n} (ŷi - ȳ)²

SSR = SST - SSE

7

Page 381: Applied Statistics - MIT

Overall Significance Test

To see if there is any linear relationship we test:
H0: β1 = β2 = . . . = βk = 0
H1: βj ≠ 0 for some j.

Compute
SSE = Σ(yi - ŷi)²,  SST = Σ(yi - ȳ)²,  SSR = SST - SSE

The F statistic is:
F = (SSR/k) / (SSE/(n-k-1)) = MSR/MSE

with F based on k and (n - k - 1) degrees of freedom.

Reject H0 when F exceeds F_{k, n-k-1}(α).

8

Page 382: Applied Statistics - MIT

Sequential Sums of Squares

SSR(x1) = SST - SSE(x1)

SSR(x2|x1) = SSR(x1,x2) - SSR(x1) =SSE(x1) - SSE(x1,x2)

SSR(x3|x1 x2) = SSE(x1,x2) - SSE(x1,x2,x3)

9

Page 383: Applied Statistics - MIT

ANOVA Table
Type 1 (sequential) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)     3
  x1                  SSR(x1)           1
  x2|x1               SSR(x2|x1)        1
  x3|x2,x1            SSR(x3|x2,x1)     1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1

10

Page 384: Applied Statistics - MIT

ANOVA Table
Type 3 (partial) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)     3
  x1|x2,x3            SSR(x1|x2,x3)     1
  x2|x1,x3            SSR(x2|x1,x3)     1
  x3|x1,x2            SSR(x3|x1,x2)     1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1

11

Page 385: Applied Statistics - MIT

12

Scatter plot Matrix of the Air Data Set in S-Plus pairs(air)

(Scatter plot matrix of ozone, radiation, temperature, and wind from the air data set.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 386: Applied Statistics - MIT

air.lm<-lm(y~x1+x2+x3)

> summary(air.lm)$coef
                   Value   Std. Error    t value      Pr(>|t|)
(Intercept) -0.297329634 0.5552138923 -0.5355227 5.933998e-001
x1           0.002205541 0.0005584658  3.9492854 1.407070e-004
x2           0.050044325 0.0061061612  8.1957098 5.848655e-013
x3          -0.076021950 0.0157548357 -4.8253090 4.665124e-006

> summary.aov(air.lm)
           Df Sum of Sq  Mean Sq  F Value         Pr(F)
x1          1  15.53144 15.53144  59.6761 6.000000e-012
x2          1  37.76939 37.76939 145.1204 0.000000e+000
x3          1   6.05985  6.05985  23.2836 4.665124e-006
Residuals 107  27.84808  0.26026

> summary.aov(air.lm, ssType=3)
Type III Sum of Squares
           Df Sum of Sq  Mean Sq  F Value        Pr(F)
x1          1   4.05928  4.05928 15.59685 0.0001407070
x2          1  17.48174 17.48174 67.16966 0.0000000000
x3          1   6.05985  6.05985 23.28361 0.0000046651
Residuals 107  27.84808  0.26026
>

13

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 387: Applied Statistics - MIT

Polynomial Models

y = β0 + β1x + β2x² + … + βkxᵏ

Problems:
  Powers of x tend to be large in magnitude
  Powers of x tend to be highly correlated

Solutions:
  Centering and scaling of the x variables
  Orthogonal polynomials (poly(x,k) in S-Plus; see Seber for methods of generating them)

14

Page 388: Applied Statistics - MIT

15

Plot of mpg vs. weight for 74 autos (S-Plus dataset auto.stats)

(Plot: mpg versus wt.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 389: Applied Statistics - MIT

summary(lm(mpg~wt+wt^2+wt^3))

Call: lm(formula = mpg ~ wt + wt^2 + wt^3)Residuals:

Min 1Q Median 3Q Max -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) 68.1797 21.4515 3.1783 0.0022wt -0.0309 0.0214 -1.4430 0.1535

I(wt^2) 0.0000 0.0000 0.9586 0.3410I(wt^3) 0.0000 0.0000 -0.7449 0.4588

Residual standard error: 3.209 on 70 degrees of freedomMultiple R-Squared: 0.705 F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:(Intercept) wt I(wt^2)

wt -0.9958 I(wt^2) 0.9841 -0.9961 I(wt^3) -0.9659 0.9846 -0.9961

16

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 390: Applied Statistics - MIT

wts<-(wt-mean(wt))/sqrt(var(wt))

summary(lm(mpg~wts+wts^2+wts^3))

Call: lm(formula = mpg ~ wts + wts^2 + wts^3)Residuals:

Min 1Q Median 3Q Max -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) 20.2331 0.5676 35.6470 0.0000wts -4.4466 0.7465 -5.9567 0.0000

I(wts^2) 1.1241 0.4682 2.4007 0.0190I(wts^3) -0.2521 0.3385 -0.7449 0.4588

Residual standard error: 3.209 on 70 degrees of freedomMultiple R-Squared: 0.705 F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:(Intercept) wts I(wts^2)

wts -0.2800 I(wts^2) -0.7490 0.4558 I(wts^3) 0.3925 -0.8596 -0.6123

17

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 391: Applied Statistics - MIT

Orthogonal Polynomials

Generation is similar to Gram-Schmidt orthogonalization (see Strang, Linear Algebra)

Resulting vectors are orthonormal: X'X = I.
Hence (X'X)⁻¹ = I and the coefficients b = (X'X)⁻¹X'y = X'y.
Addition of a higher degree term does not affect the coefficients for lower degree terms.
Correlation of coefficients = I.
SE of coefficients = s = sqrt(MSE).

18

Page 392: Applied Statistics - MIT

summary(lm(mpg~poly(wt,3)))

Call: lm(formula = mpg ~ poly(wt, 3))Residuals:

Min 1Q Median 3Q Max -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) 21.2973 0.3730 57.0912 0.0000poly(wt, 3)1 -40.6769 3.2090 -12.6758 0.0000poly(wt, 3)2 7.8926 3.2090 2.4595 0.0164poly(wt, 3)3 -2.3904 3.2090 -0.7449 0.4588

Residual standard error: 3.209 on 70 degrees of freedomMultiple R-Squared: 0.705 F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:(Intercept) poly(wt, 3)1 poly(wt, 3)2

poly(wt, 3)1 0 poly(wt, 3)2 0 0 poly(wt, 3)3 0 0 0 19

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 393: Applied Statistics - MIT

20

Plot of mpg by weight with fitted regression line

(Plot: mpg versus wt with the fitted regression curve.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 394: Applied Statistics - MIT

Indicator Variables

• Sometimes we might want to fit a model with a categorical variable as a predictor. For instance, automobile price as a function of where the car is made (Germany, Japan, USA).

• If there are c categories, we need c-1 indicator (0,1) variables as predictors. For instance j=1 if car is made in Japan, 0 otherwise, u=1 if car is made in USA, 0 otherwise.

• If there are just 2 categories and no other predictors, we could just do a t-test for difference in means.

21

Page 395: Applied Statistics - MIT

22

Boxplots of price by country for S-Plus dataset cu.summary

(Boxplots of price by cntry: Germany, Japan, USA.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 396: Applied Statistics - MIT

23

Histogram of automobile prices for S-Plus dataset cu.summary

(Histogram of price.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 397: Applied Statistics - MIT

24

Histogram of log of automobile prices for S-Plus dataset cu.summary

(Histogram of log(price).)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 398: Applied Statistics - MIT

summary(lm(price~u+j))

Call: lm(formula = price ~ u + j)Residuals:

Min 1Q Median 3Q Max -15746 -4586 -2071 2374 22495

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) 25741.3636 2282.2729 11.2788 0.0000u -10520.5473 2525.4871 -4.1657 0.0001j -10236.0088 2656.5095 -3.8532 0.0002

Residual standard error: 7569 on 88 degrees of freedomMultiple R-Squared: 0.1723 F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is

0.0002435

Correlation of Coefficients:(Intercept) u

u -0.9037 j -0.8591 0.7764 25

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 399: Applied Statistics - MIT

summary(lm(price~u+g))

Call: lm(formula = price ~ u + g)Residuals:

Min 1Q Median 3Q Max -15746 -4586 -2071 2374 22495

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) 15505.3548 1359.5121 11.4051 0.0000u -284.5385 1737.1208 -0.1638 0.8703g 10236.0088 2656.5095 3.8532 0.0002

Residual standard error: 7569 on 88 degrees of freedomMultiple R-Squared: 0.1723 F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is

0.0002435

Correlation of Coefficients:(Intercept) u

u -0.7826 g -0.5118 0.4005 26

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 400: Applied Statistics - MIT

Regression Diagnostics

Goal: identify remarkable observations and unremarkable predictors.

Problems with observations:
  Outliers
  Influential observations

Problems with predictors:
  A predictor may not add much to the model.
  A predictor may be too similar to another predictor (collinearity).
  Predictors may have been left out.

27

Page 401: Applied Statistics - MIT

28

Plot of standardized residuals vs. fitted values for air dataset

(Plot: standardized residual versus fitted value, with points labeled by observation number 1-111.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 402: Applied Statistics - MIT

Plot of residual vs. fit for air data set with all interaction terms

(Plot: resid(tmp) versus fitted(tmp).)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

29

Page 403: Applied Statistics - MIT

30

Plot of residual vs. fit for air model with x3*x4 interaction

(Plot: resid(tmp) versus fitted(tmp).)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 404: Applied Statistics - MIT

Call: lm(formula = air[, 1] ~ air[, 2] + air[, 3] + air[, 4] + air[, 3] * air[, 4])Residuals:

Min 1Q Median 3Q Max -1.088 -0.3542 -0.07242 0.3436 1.47

Coefficients:Value Std. Error t value Pr(>|t|)

(Intercept) -3.6465 1.1684 -3.1209 0.0023 air[, 2] 0.0023 0.0005 4.3223 0.0000 air[, 3] 0.0920 0.0143 6.4435 0.0000 air[, 4] 0.2523 0.1031 2.4478 0.0160

air[, 3]:air[, 4] -0.0042 0.0013 -3.2201 0.0017

Residual standard error: 0.4892 on 106 degrees of freedomMultiple R-Squared: 0.7091 F-statistic: 64.61 on 4 and 106 degrees of freedom, the p-value is 0

Correlation of Coefficients:(Intercept) air[, 2] air[, 3] air[, 4]

air[, 2] -0.0361 air[, 3] -0.9880 -0.0495 air[, 4] -0.9268 0.0620 0.9313

air[, 3]:air[, 4] 0.8902 -0.0661 -0.9119 -0.9892 >

31

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 405: Applied Statistics - MIT

Remarkable Observations

Residuals are the key.

Standardized residuals:
  ei* = ei / SE(ei) = ei / ( s √(1 - hii) )
  Outlier if |ei*| > 2.

Hat matrix diagonals, hii:
  Influential if hii > 2(k+1)/n.

Cook's Distance:
  di = ( ei*² / (k+1) ) · ( hii / (1 - hii) )
  Influential if di > 1.

32
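These quantities are easy to compute in S-Plus for a fitted lm object; a minimal sketch using the ozone fit from earlier (ozone.lm) and the thresholds above:

X <- model.matrix(ozone.lm)
h <- hat(X)                                 # hat matrix diagonals h_ii
e <- resid(ozone.lm)
k <- ncol(X) - 1
s <- sqrt(sum(e^2) / (length(e) - k - 1))   # residual standard error
estar <- e / (s * sqrt(1 - h))              # standardized residuals
cook  <- (estar^2 / (k + 1)) * h / (1 - h)  # Cook's distance
which(abs(estar) > 2)                       # possible outliers
which(h > 2 * (k + 1) / length(e))          # influential by the hat-diagonal rule
which(cook > 1)                             # influential by Cook's distance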

Page 406: Applied Statistics - MIT

33

Plot of standardized residual vs. observation number for air dataset

(Plot: standardized residual versus observation number, with points labeled 1-111.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 407: Applied Statistics - MIT

34

Hat matrix diagonals

(Plot: hat matrix diagonals versus observation number, with points labeled 1-111.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 408: Applied Statistics - MIT

35

Plot of wind vs. ozone

(Plot: ozone versus wind, with points labeled by observation number.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 409: Applied Statistics - MIT

36

Cook's Distance

(Plot: Cook's distance versus observation number; observations 17, 30, and 77 are labeled.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 410: Applied Statistics - MIT

37

Plot of ozone vs. wind including fitted regression lines with and without observation 30

(simple linear regression)

(Plot: ozone versus wind, with fitted lines computed with and without observation 30.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 411: Applied Statistics - MIT

Remedies for Outliers

• Nothing?
• Data transformation?
• Remove outliers?
• Robust regression – weighted least squares: b = (X'WX)⁻¹X'Wy
• Minimize median absolute deviation

38

Page 412: Applied Statistics - MIT

Collinearity

High correlation among the predictors can cause problems with least squares estimates (wrong signs, low t-values, unexpected results).

If the predictors are centered and scaled to unit length, then X'X is the correlation matrix.

Diagonal elements of the inverse of the correlation matrix are called VIFs (variance inflation factors):

VIF_j = 1 / (1 - R_j²), where R_j²

is the coefficient of determination for the regression of the jth predictor on the remaining predictors

39

Page 413: Applied Statistics - MIT

When Rj2 = .90, VIF is about 10 and caution is advised. (Some authors

say VIF = 5.) A large VIF indicates there is redundant information in the explanatory variables.

Why is this called the variance inflation factor? We can show that

Var(β̂_j) = [ 1/(1 - R_j²) ] · σ² / Σ_{i=1}^{n} (x_ij - x̄_j)²  =  VIF_j · [ Var(β̂_j) in simple regression ]

Thus VIF_j represents the variance inflation caused by adding all the variables other than x_j to the model.

R Welsch 40

Page 414: Applied Statistics - MIT

Remedies for collinearity

1. Identify and eliminate redundant variables (there is a large literature on this).

2. Modified regression techniques

a. ridge regression, b=(X’X+cI)-1X’y

3. Regress on orthogonal linear combinations of the explanatory variables

a. principal components regression

4. Careful variable selection

R Welsch 41

Page 415: Applied Statistics - MIT

Correlation and inverse of correlation matrix for air data set.

r<-cor(model.matrix(air.lm)[,-1])

> r
           x1         x2         x3
x1  1.0000000  0.2940876 -0.1273656
x2  0.2940876  1.0000000 -0.4971459
x3 -0.1273656 -0.4971459  1.0000000

> solve(r)
            x1         x2          x3
x1  1.09524102 -0.3357220 -0.02740677
x2 -0.33572201  1.4312012  0.66875638
x3 -0.02740677  0.6687564  1.32897882
>

42

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 416: Applied Statistics - MIT

Correlation and inverse of correlation matrix for mpg data set

r<-cor(model.matrix(auto1.lm)[,-1])

> r
               wt   I(wt^2)   I(wt^3)
wt      1.0000000 0.9917756 0.9677228
I(wt^2) 0.9917756 1.0000000 0.9918939
I(wt^3) 0.9677228 0.9918939 1.0000000

> solve(r)
               wt   I(wt^2)   I(wt^3)
wt       2000.377 -3951.728  1983.884
I(wt^2) -3951.728  7868.535 -3980.575
I(wt^3)  1983.884 -3980.575  2029.459

43

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 417: Applied Statistics - MIT

Variable Selection

• We want a parsimonious model – as few variables as possible to still provide reasonable accuracy in predicting y.

• Some variables may not contribute much to the model.

• SSE will never increase if we add more variables to the model; however, MSE = SSE/(n-k-1) may.

• Minimum MSE is one possible optimality criterion. However, we must fit all possible subsets (2^k of them) and find the one with minimum MSE.

44

Page 418: Applied Statistics - MIT

Backward Elimination

1. Fit the full model (with all candidate predictors).

2. If P-values for all coefficients < α then stop.

3. Delete predictor with highest P-value4. Refit the model5. Go to Step 2.

45

Page 419: Applied Statistics - MIT

Logistic Regression

References: Applied Linear Statistical Models, Neter et al.

Categorical Data Analysis, Agresti

Slides prepared by Elizabeth Newton (MIT)

Page 420: Applied Statistics - MIT

Logistic Regression

• Nonlinear regression model when the response variable is qualitative.
• 2 possible outcomes: success or failure, diseased or not diseased, present or absent
• Examples: CAD (y/n) as a function of age, weight, gender, smoking history, blood pressure
• Smoker or non-smoker as a function of family history, peer group behavior, income, age
• Purchase an auto this year as a function of income, age of current car, age

E Newton 2

Page 421: Applied Statistics - MIT

Response Function for Binary Outcome

Yi = β0 + β1 Xi + εi

E{Yi} = β0 + β1 Xi

P(Yi = 1) = πi,   P(Yi = 0) = 1 - πi

E{Yi} = 1(πi) + 0(1 - πi) = πi = β0 + β1 Xi

E Newton 3

Page 422: Applied Statistics - MIT

Special Problems when Response is Binary

Constraints on the response function:
  0 ≤ E{Y} = π ≤ 1

Non-normal error terms:
  When Yi = 1: εi = 1 - β0 - β1Xi
  When Yi = 0: εi = -β0 - β1Xi

Non-constant error variance:
  Var{Yi} = Var{εi} = πi(1 - πi)

E Newton 4

Page 423: Applied Statistics - MIT

Logistic Response Function

E{Y} = π = exp(β0 + β1X) / [1 + exp(β0 + β1X)]

π [1 + exp(β0 + β1X)] = exp(β0 + β1X)

π = (1 - π) exp(β0 + β1X)

π/(1 - π) = exp(β0 + β1X)

log( π/(1 - π) ) = β0 + β1X

E Newton 5

Page 424: Applied Statistics - MIT

Example of Logistic Response Function

(Plot: probability of CAD versus age, following an S-shaped logistic curve from 0 to 1.)

E Newton 6

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 425: Applied Statistics - MIT

Properties of Logistic Response Function

log(π/(1-π))=logit transformation, log odds

π/(1-π) = odds

Logit ranges from -∞ to ∞ as x varies from -∞ to ∞

E Newton 7

Page 426: Applied Statistics - MIT

Likelihood Function

P(Yi = 1) = πi,  P(Yi = 0) = 1 - πi

pdf: fi(Yi) = πi^{Yi} (1 - πi)^{1 - Yi},  Yi = 0, 1;  i = 1, 2, ..., n

Since the Yi are independent, the joint pdf is

g(Y1, ..., Yn) = Π_{i=1}^{n} fi(Yi) = Π_{i=1}^{n} πi^{Yi} (1 - πi)^{1 - Yi}

log g(Y1, ..., Yn) = Σ_{i=1}^{n} Yi log( πi/(1 - πi) ) + Σ_{i=1}^{n} log(1 - πi)

E Newton 8

Page 427: Applied Statistics - MIT

Likelihood Function (continued)

log( πi/(1 - πi) ) = β0 + β1Xi,   1 - πi = 1 / [1 + exp(β0 + β1Xi)]

log L(β0, β1) = Σ_{i=1}^{n} Yi(β0 + β1Xi) - Σ_{i=1}^{n} log[1 + exp(β0 + β1Xi)]

E Newton 9

Page 428: Applied Statistics - MIT

Likelihood for Multiple Logistic Regression

log L(β) = Σ_i yi Σ_j βj xij - Σ_i log[ 1 + exp(Σ_j βj xij) ]

∂ log L / ∂βk = Σ_i yi xik - Σ_i xik exp(Σ_j βj xij) / [1 + exp(Σ_j βj xij)]

Likelihood equations:
Σ_i yi xik = Σ_i xik π̂i,   i.e.,   X'y = X'π̂

E Newton 10

Page 429: Applied Statistics - MIT

Solution of Likelihood Equations

No closed form solution; use the Newton-Raphson algorithm
(iteratively reweighted least squares, IRLS):

Start with the OLS solution for β at iteration t = 0, β⁰.

πi^t = 1 / (1 + exp(-Xi'β^t))

β^(t+1) = β^t + (X'VX)⁻¹ X'(y - π^t),  where V = diag(πi^t (1 - πi^t))

Usually only takes a few iterations
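A minimal sketch of this iteration in S-Plus (written for a design matrix X that includes the intercept column and a 0/1 response y, both hypothetical):

# iteratively reweighted least squares for logistic regression
irls.logit <- function(X, y, niter = 10)
{
    beta <- solve(t(X) %*% X) %*% t(X) %*% y       # OLS start
    for (it in 1:niter) {
        pi.t <- 1 / (1 + exp(-X %*% beta))
        V    <- diag(as.vector(pi.t * (1 - pi.t)))
        beta <- beta + solve(t(X) %*% V %*% X) %*% t(X) %*% (y - pi.t)
    }
    beta
}
# e.g. X <- model.matrix(kyph.glm); y <- as.numeric(kyphosis$Kyphosis == "present")
# irls.logit(X, y) should reproduce the glm coefficients shown on the following slides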

E Newton 11

Page 430: Applied Statistics - MIT

Interpretation of logistic regression coefficients

• Log(π/(1-π)) = Xβ
• So each βj is the effect of a unit increase in Xj on the log odds of success, with the values of the other variables held constant

• Odds Ratio=exp(βj)

E Newton 12

Page 431: Applied Statistics - MIT

Example: Spinal Disease in Children Data SUMMARY: The kyphosis data frame has 81 rows representing data on 81 children

who have had corrective spinal surgery. The outcome Kyphosis is a binary variable, the other three variables (columns) are numeric.

ARGUMENTS: Kyphosis

a factor telling whether a postoperative deformity (kyphosis) is "present" or "absent" .

Agethe age of the child in months.

Numberthe number of vertebrae involved in the operation.

Startthe beginning of the range of vertebrae involved in the operation.

SOURCE: John M. Chambers and Trevor J. Hastie, Statistical Models in S,

Wadsworth and Brooks, Pacific Grove, CA 1992, pg. 200.

E Newton 13

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 432: Applied Statistics - MIT

Observations 1:16 of kyphosis data set

kyphosis[1:16,]
   Kyphosis Age Number Start
1    absent  71      3     5
2    absent 158      3    14
3   present 128      4     5
4    absent   2      5     1
5    absent   1      4    15
6    absent   1      2    16
7    absent  61      2    17
8    absent  37      3    16
9    absent 113      2    16
10  present  59      6    12
11  present  82      5    14
12   absent 148      3    16
13   absent  18      5     2
14   absent   1      4    12
16   absent 168      3    18

E Newton 14

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 433: Applied Statistics - MIT

Variables in kyphosissummary(kyphosis)

Kyphosis Age Number Start absent:64 Min.: 1.00 Min.: 2.000 Min.: 1.00 present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00

Median: 87.00 Median: 4.000 Median:13.00 Mean: 83.65 Mean: 4.049 Mean:11.49

3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00 Max.:206.00 Max.:10.000 Max.:18.00

E Newton 15

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 434: Applied Statistics - MIT

Scatter plot matrix kyphosis data set

(Scatter plot matrix of Kyphosis, Age, Number, and Start.)

E Newton 16

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 435: Applied Statistics - MIT

Boxplots of predictors vs. kyphosis

(Boxplots: Age, Number, and Start by Kyphosis status, absent vs. present.)

E Newton 17

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 436: Applied Statistics - MIT

Smoothing spline fits, df=3

(Plots: smoothed kyphosis status versus jitter(age), jitter(num), and jitter(sta).)

E Newton 18

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 437: Applied Statistics - MIT

Summary of glm fit

Call: glm(formula = Kyphosis ~ Age + Number + Start,
    family = binomial, data = kyphosis)

Deviance Residuals:
      Min         1Q     Median         3Q     Max
-2.312363 -0.5484308 -0.3631876 -0.1658653 2.16133

Coefficients:
                  Value Std. Error   t value
(Intercept) -2.03693225 1.44918287 -1.405573
Age          0.01093048 0.00644419  1.696175
Number       0.41060098 0.22478659  1.826626
Start       -0.20651000 0.06768504 -3.051043

E Newton 19

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 438: Applied Statistics - MIT

Summary of glm fit

Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 61.37993 on 77 degrees of freedom
Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:
       (Intercept)        Age    Number
Age     -0.4633715
Number  -0.8480574  0.2321004
Start   -0.3784028 -0.2849547 0.1107516

E Newton 20

This code7 was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 439: Applied Statistics - MIT

Residuals

• Response Residuals: yi-πi

• Pearson Residuals: (yi-πi)/sqrt(πi(1-πi))

• Deviance Residuals: sqrt(-2log(|1-yi-πi|))

E Newton 21

Page 440: Applied Statistics - MIT

Model Deviance

• Deviance of fitted model compares log-likelihood of fitted model to that of saturated model.

• Log likelihood of saturated model=0

DEV = -2 Σ_{i=1}^{n} [ Yi log(π̂i) + (1 - Yi) log(1 - π̂i) ]

di = sign(Yi - π̂i) { -2 [ Yi log(π̂i) + (1 - Yi) log(1 - π̂i) ] }^{1/2}

Σ_i di² = DEV

E Newton 22

Page 441: Applied Statistics - MIT

Covariance Matrix> x<-model.matrix(kyph.glm)

> xvx<-t(x)%*%diag(fi*(1-fi))%*%x

> xvx(Intercept) Age Number Start

(Intercept) 9.620342 907.8887 43.67401 86.49845Age 907.888726 114049.8308 3904.31350 9013.14464

Number 43.674014 3904.3135 219.95353 378.82849Start 86.498450 9013.1446 378.82849 1024.07328

> xvxi<-solve(xvx)> xvxi

(Intercept) Age Number Start (Intercept) 2.101402986 -0.00433216784 -0.2764670205 -0.0370950612

Age -0.004332168 0.00004155736 0.0003368969 -0.0001244665Number -0.276467020 0.00033689690 0.0505664221 0.0016809996Start -0.037095061 -0.00012446655 0.0016809996 0.0045833534

> sqrt(diag(xvxi))[1] 1.44962167 0.00644650 0.22486979 0.06770047

E Newton 23

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 442: Applied Statistics - MIT

Change in Deviance resulting from adding terms to model

> anova(kyph.glm)
Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev
  NULL                    80   83.23447
   Age  1  1.30198        79   81.93249
Number  1 10.30593        78   71.62656
 Start  1 10.24663        77   61.37993

E Newton 24

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 443: Applied Statistics - MIT

Summary for kyphosis model with age^2 added

Call: glm(formula = Kyphosis ~ poly(Age, 2) + Number + Start, family = binomial, data = kyphosis)

Deviance Residuals:Min 1Q Median 3Q Max

-2.235654 -0.5124374 -0.245114 -0.06111367 2.354818

Coefficients:Value Std. Error t value

(Intercept) -1.6502939 1.40171048 -1.177343poly(Age, 2)1 7.3182325 4.66933068 1.567298poly(Age, 2)2 -10.6509151 5.05858692 -2.105512

Number 0.4268172 0.23531689 1.813798Start -0.2038329 0.07047967 -2.892080

E Newton 25

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 444: Applied Statistics - MIT

Summary of fit with age^2 addedNull Deviance: 83.23447 on 80 degrees of freedom

Residual Deviance: 54.42776 on 76 degrees of freedom

Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:(Intercept) poly(Age, 2)1 poly(Age,

2)2 Number poly(Age, 2)1 -0.2107783 poly(Age, 2)2 0.2497127 -0.0924834

Number -0.8403856 0.3070957 -0.0988896 Start -0.4918747 -0.2208804 0.0911896

0.0721616

E Newton 26

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 445: Applied Statistics - MIT

Analysis of Deviance> anova(kyph.glm2)Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)Df Deviance Resid. Df Resid. Dev

NULL 80 83.23447poly(Age, 2) 2 10.49589 78 72.73858

Number 1 8.87597 77 63.86261Start 1 9.43485 76 54.42776

E Newton 27

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 446: Applied Statistics - MIT

Kyphosis data, 16 obs, with fit and residuals

cbind(kyphosis, round(p,3), round(rr,3), round(rp,3), round(rd,3))[1:16,]
   Kyphosis Age Number Start   fit     rr     rp     rd
1    absent  71      3     5 0.257 -0.257 -0.588 -0.771
2    absent 158      3    14 0.122 -0.122 -0.374 -0.511
3   present 128      4     5 0.493  0.507  1.014  1.189
4    absent   2      5     1 0.458 -0.458 -0.919 -1.107
5    absent   1      4    15 0.030 -0.030 -0.175 -0.246
6    absent   1      2    16 0.011 -0.011 -0.105 -0.148
7    absent  61      2    17 0.017 -0.017 -0.131 -0.185
8    absent  37      3    16 0.024 -0.024 -0.157 -0.220
9    absent 113      2    16 0.036 -0.036 -0.193 -0.271
10  present  59      6    12 0.197  0.803  2.020  1.803
11  present  82      5    14 0.121  0.879  2.689  2.053
12   absent 148      3    16 0.076 -0.076 -0.288 -0.399
13   absent  18      5     2 0.450 -0.450 -0.905 -1.094
14   absent   1      4    12 0.054 -0.054 -0.239 -0.333
16   absent 168      3    18 0.064 -0.064 -0.261 -0.363
17   absent   1      3    16 0.016 -0.016 -0.129 -0.181

E Newton 28

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 447: Applied Statistics - MIT

Plot of response residual vs. fit

(Plot: response residual, y minus the fitted probability, versus the fitted probability.)

E Newton 29

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 448: Applied Statistics - MIT

Plot of deviance residual vs. index

(Plot: deviance residuals of kyph.glm versus observation index.)

E Newton 30

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 449: Applied Statistics - MIT

Plot of deviance residuals vs. fitted value

(Plot: deviance residuals of kyph.glm2 versus fitted(kyph.glm2).)

E Newton 31

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 450: Applied Statistics - MIT

Summary of bootstrap for kyphosis model

E Newton 32

Call:bootstrap(data = kyphosis, statistic = coef(glm(Kyphosis ~

poly(Age, 2) + Number + Start, family = binomial,data = kyphosis)), trace = F)

Number of Replications: 1000

Summary Statistics:Observed Bias Mean SE

(Intercept) -1.6503 -0.85600 -2.5063 5.1675poly(Age, 2)1 7.3182 4.33814 11.6564 22.0166poly(Age, 2)2 -10.6509 -7.48557 -18.1365 37.6780

Number 0.4268 0.17785 0.6047 0.6823Start -0.2038 -0.07825 -0.2821 0.4593

Empirical Percentiles:2.5% 5% 95% 97.5%

(Intercept) -8.52922 -7.247145 1.1760 2.27636poly(Age, 2)1 -6.13910 -1.352143 27.1515 34.64701poly(Age, 2)2 -48.86864 -38.993192 -4.9585 -4.13232

Number -0.07539 -0.003433 1.4756 1.82754Start -0.58795 -0.470139 -0.1159 -0.08919

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 451: Applied Statistics - MIT

Summary of bootstrap (continued)

BCa Confidence Limits:
                  2.5%       5%      95%    97.5%
(Intercept)    -6.4394  -5.3043  2.39707  3.56856
poly(Age, 2)1 -18.2205 -10.1003 18.34192 21.56654
poly(Age, 2)2 -24.2382 -20.3911 -1.75701 -0.19269
Number         -0.7653  -0.1694  1.14036  1.27858
Start          -0.3521  -0.3167 -0.03478  0.01461

Correlation of Replicates:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2  Number   Start
(Intercept)        1.0000       -0.4204        0.5082 -0.5676 -0.1839
poly(Age, 2)1     -0.4204        1.0000       -0.8475  0.4368 -0.6478
poly(Age, 2)2      0.5082       -0.8475        1.0000 -0.3739  0.5983
Number            -0.5676        0.4368       -0.3739  1.0000 -0.4174
Start             -0.1839       -0.6478        0.5983 -0.4174  1.0000

E Newton 33

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 452: Applied Statistics - MIT

Histograms of coefficient estimates

[Figure: histograms (density vs. value) of the bootstrap replicates for each coefficient: (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, and Start.]

E Newton 34

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 453: Applied Statistics - MIT

QQ Plots of coefficient estimates

[Figure: normal QQ plots (quantiles of replicates vs. quantiles of standard normal) of the bootstrap replicates for each coefficient: (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, and Start.]

E Newton 35

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 454: Applied Statistics - MIT

Regression Reviewand Robust Regression

Slides prepared by Elizabeth Newton (MIT)

Page 455: Applied Statistics - MIT

S-Plus Oil City Data Frame
Monthly Excess Returns of Oil City Petroleum, Inc. Stocks and the Market

SUMMARY: The oilcity data frame has 129 rows and 2 columns. The sample runs from April 1979 to December 1989. This data frame contains the following columns:

VALUE:
Oil - monthly excess returns of Oil City Petroleum, Inc. stocks.
Market - monthly excess returns of the market.

E Newton 2

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 456: Applied Statistics - MIT

Oil City Data (continued)
• Returns = relative change in the stock price over a one-month interval.
• Excess returns are computed relative to the monthly return of a 90-day US Treasury bill at the risk-free rate.
• Financial economists use least squares to fit a straight line predicting a particular stock return from the market return (a sketch of this fit follows below).
• Beta = estimated coefficient of the market return. Measures the riskiness of the stock in terms of standard deviation and expected returns.
• Large beta -> the stock is risky compared to the market, but expected returns from the stock are also large.
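A minimal S-PLUS/R-style sketch of that straight-line fit, using the oilcity data frame; it produces the oil.lm model whose summary appears a few slides below.

# Least squares fit of Oil City excess returns on market excess returns
oil.lm <- lm(Oil ~ Market, data = oilcity)
coef(oil.lm)     # the "Market" coefficient is the estimated beta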

E Newton 3

Page 457: Applied Statistics - MIT

Plot of Market returns vs. month

[Figure: scatterplot of oilcity$Market vs. month (1-129).]

E Newton 4

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 458: Applied Statistics - MIT

Plot of Oil City Petroleum return vs. month

[Figure: scatterplot of Oil (Oil City excess return) vs. month (1-129).]

E Newton 5

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 459: Applied Statistics - MIT

Histogram of Market Returns

[Figure: histogram of Market excess returns, roughly -0.3 to 0.1.]

E Newton 6

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 460: Applied Statistics - MIT

Histogram of Oil City Returns

[Figure: histogram of Oil City excess returns, roughly -1 to 5.]

E Newton 7

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 461: Applied Statistics - MIT

Plot of Oil City vs. Market Returns

[Figure: scatterplot of Oil City vs. Market excess returns, points labeled by observation number (1-129).]

E Newton 8

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 462: Applied Statistics - MIT

Plot of Oil City vs. Market Returns without observation 94

[Figure: scatterplot of Oil City vs. Market excess returns with observation 94 removed, points labeled by observation number.]

E Newton 9

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 463: Applied Statistics - MIT

> summary(oilcity)
      Oil                    Market
 Min.   :-0.55667260   Min.   :-0.27857020
 1st Qu.:-0.23968330   1st Qu.:-0.10557534
 Median :-0.10049000   Median :-0.07277544
 Mean   :-0.07221215   Mean   :-0.07689209
 3rd Qu.:-0.05821000   3rd Qu.:-0.03973828
 Max.   : 5.19292000   Max.   : 0.07131940

E Newton 10

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 464: Applied Statistics - MIT

Summary oil.lm

Call: lm(formula = Oil ~ Market, data = oilcity)
Residuals:
     Min      1Q   Median      3Q   Max
 -0.6952 -0.1732 -0.05444 0.08407 4.842

Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept) 0.1474     0.0707  2.0849   0.0391
     Market 2.8567     0.7318  3.9040   0.0002

Residual standard error: 0.4867 on 127 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 15.24 on 1 and 127 degrees of freedom, the p-value is 0.0001528

Correlation of Coefficients:
       (Intercept)
Market      0.7956

E Newton 11

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 465: Applied Statistics - MIT

Plot of residual vs. fit for oil.lm

[Figure: residuals vs. fitted values for oil.lm (Fitted : Market); observations 65, 79, and 94 are labeled.]

E Newton 12

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 466: Applied Statistics - MIT

E Newton 13

Plot of Cook's Distance vs. Index
[Figure: Cook's distance vs. observation index for oil.lm; a few observations (including 94) are labeled, with a maximum near 3.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 467: Applied Statistics - MIT

Plot of hat matrix diagonals for oil.lm

[Figure: diagonal elements of the hat matrix, hat(model.matrix(oil.lm)), vs. month, points labeled by observation number; values range from about 0.02 to 0.10.]

E Newton 14

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 468: Applied Statistics - MIT

Summary of model without observation 94

Call: lm(formula = Oil ~ Market, data = oilcity94)

Residuals:
     Min      1Q    Median      3Q   Max
 -0.5169 -0.1174  -0.01959 0.06864 0.859

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.0247     0.0304 -0.8139   0.4173
     Market  1.1355     0.3137  3.6202   0.0004

Residual standard error: 0.2033 on 126 degrees of freedom
Multiple R-Squared: 0.09422
F-statistic: 13.11 on 1 and 126 degrees of freedom, the p-value is 0.0004249

Correlation of Coefficients:
       (Intercept)
Market      0.8061

E Newton 15

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 469: Applied Statistics - MIT

Plot of residual vs fit for model without observation 94

[Figure: residuals vs. fitted values for the model without observation 94 (Fitted : Market); observations 8, 105, and 79 are labeled.]

E Newton 16

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 470: Applied Statistics - MIT

Weighted Least Squares

Used when observations y_i have unequal variances:

y = Xβ + ε,   E(ε) = 0,   Var(ε) = σ²V

V is non-singular and positive definite; V is diagonal if the errors are uncorrelated.
V is always symmetric, so there exists an n×n non-singular symmetric matrix R such that R'R = RR = V.
R is sometimes called the square root of V.

E Newton 17

Page 471: Applied Statistics - MIT

Weighted least squares (continued)

Define new variables:

y* = R^-1 y,   X* = R^-1 X,   ε* = R^-1 ε

Then y = Xβ + ε becomes R^-1 y = R^-1 Xβ + R^-1 ε, or y* = X*β + ε*,

with E(ε*) = R^-1 E(ε) = 0.

E Newton 18

Page 472: Applied Statistics - MIT

Weighted least squares (continued)

Var(ε*) = E{[ε* − E(ε*)][ε* − E(ε*)]'} = E(ε*ε*')
        = E(R^-1 εε' R^-1) = R^-1 E(εε') R^-1
        = σ² R^-1 V R^-1 = σ² R^-1 R R R^-1 = σ² I

E Newton 19

Page 473: Applied Statistics - MIT

Weighted Least Squares (continued)

Q(β) = ε*'ε* = (R^-1 ε)'(R^-1 ε) = ε'V^-1 ε,   W = V^-1 = weights

Least squares normal equations are (X'WX)β̂ = X'Wy.

The solution is:  β̂ = (X'WX)^-1 X'Wy

Var(β̂) = (X'WX)^-1 X'W Var(y) WX (X'WX)^-1 = σ² (X'WX)^-1 X'WVWX (X'WX)^-1 = σ² (X'WX)^-1
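A minimal sketch of the closed-form solution above; X, y, and the weight vector w are illustrative placeholders (in practice w would be proportional to 1/Var(y_i)).

# Weighted least squares via the normal equations (X'WX) beta-hat = X'Wy
wls.fit <- function(X, y, w) {
  W <- diag(w)                                        # diagonal weight matrix
  beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)   # beta-hat = (X'WX)^{-1} X'Wy
  beta                                                # Var(beta-hat) = sigma^2 (X'WX)^{-1}
}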

E Newton 20

Page 474: Applied Statistics - MIT

Robust Regression

Used to reduce influence of outliers.

LAR Regression: minimize L1 = Σ_{i=1}^{n} |y_i − x_i'β| = Σ_{i=1}^{n} |e_i|

LMS Regression: minimize median{[y_i − x_i'β]²} = median{e_i²}

M estimators: minimize Σ_{i=1}^{n} g(y_i − x_i'β) = Σ_{i=1}^{n} g(e_i),  g a function of the residuals

E Newton 21

Page 475: Applied Statistics - MIT

Robust Regression (continued)

IRLS: iteratively reweighted least squares. Minimize e'We, where W is a diagonal matrix of weights, inversely proportional to the magnitude of the scaled residuals u_i:
u_i = e_i/s,   s = MAD = median{|e_i − median(e_i)|}

Procedure (a sketch in code follows below):
1. Obtain initial coefficient estimates from OLS.
2. Obtain weights from the scaled residuals.
3. Obtain coefficient estimates from WLS.
4. Return to 2.
Convergence is usually rapid.
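A hedged sketch of the IRLS loop just described (not the exact S-PLUS robust-regression code); the weight function used here is one simple illustrative choice, and X is assumed to be the model matrix including the intercept column.

# Iteratively reweighted least squares, following the 4-step procedure above
irls <- function(X, y, maxit = 20) {
  beta <- solve(t(X) %*% X, t(X) %*% y)               # step 1: OLS start
  for (it in 1:maxit) {
    e <- y - X %*% beta                               # residuals
    s <- median(abs(e - median(e))) * 1.4826          # MAD scale (factor as in the later oil.rreg code)
    u <- as.vector(e / s)                             # scaled residuals
    w <- ifelse(abs(u) <= 1, 1, 1 / abs(u))           # step 2: weights shrink for large |u| (illustrative choice)
    W <- diag(w)
    beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y) # step 3: WLS; step 4: repeat
  }
  list(coef = beta, weights = w)
}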

E Newton 22

Page 476: Applied Statistics - MIT

(See Figure 10.4, and Equations 10.44 and 10.45 in Neter et al. Applied Linear Statistical Models.)

Neter et al. Applied Linear Statistical Models

23

Page 477: Applied Statistics - MIT

Plot of residuals in oil.rreg
[Figure: oil.rreg$resid vs. observation index (1-129).]

E Newton 24

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 478: Applied Statistics - MIT

Plot of weights in robust regression for oil city data set

[Figure: robust regression weights (0 to 1) vs. month, points labeled by observation number.]

E Newton 25

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 479: Applied Statistics - MIT

Plot of sqrt(weights)*resid/s in oil.rreg
[Figure: sqrt(oil.rreg$w) * oil.rreg$resid / s vs. observation index (1-129).]

E Newton 26

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 480: Applied Statistics - MIT

Coefficient table for oil.rreg

> x<-cbind(1,Market)
> beta<-solve(t(x)%*%diag(w)%*%x)%*%t(x)%*%diag(w)%*%Oil
> r<-Oil-x%*%beta
> s<- median(abs(r-median(r)))*1.4826
> covm<-solve(t(x)%*%diag(w)%*%x)*s^2
> se<-sqrt(diag(covm))
> tvalue=beta/se
> prob<-2*(1-pt(abs(tvalue),127))
> cbind(beta,se,tvalue,prob)
                   beta         se    tvalue         prob
(Intercept) -0.06779903 0.02451469 -2.765649 0.0065285939
x            0.89895511 0.24902845  3.609849 0.0004394276

Covariance matrix is approximate.

E Newton 27

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 481: Applied Statistics - MIT

Plots of fitted regression lines for oil city data

[Figure: Oil City vs. Market excess returns with the three fitted lines overlaid: oil.lm, oil.lm94, and oil.rreg; points labeled by observation number.]

E Newton 28

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 482: Applied Statistics - MIT

Least Trimmed Squares Regression

Minimizes  Σ_{i=1}^{q} e_(i)²,  the sum of the q smallest squared residuals, where q is chosen to be between n/2 and n.

Based on a genetic algorithm for finding a subset of data with minimum SSE.

High breakdown point: fits the bulk of the data well, even if bulk is only a little more than half the data.

Resulting weights are 1 or 0
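A hedged sketch of how an LTS fit such as the oil.lts summarized on the next slide might be produced; ltsreg is the S-PLUS function named in that summary (a similar function is available in R's MASS package).

attach(oilcity)
oil.lts <- ltsreg(Oil ~ Market)   # matches the Call shown on the next slide
summary(oil.lts)
detach("oilcity")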

E Newton 29

Page 483: Applied Statistics - MIT

E Newton 30

> summary(oil.lts)
Method:
[1] "Least Trimmed Squares Robust Regression."

Call:
ltsreg(formula = Oil ~ Market)

Coefficients:
 Intercept  Market
   -0.0864  0.7907

Scale estimate of residuals: 0.1468

Robust Multiple R-Squared: 0.09863

Total number of observations: 129

Number of observations that determine the LTS estimate: 116

Residuals:
   Min. 1st Qu. Median 3rd Qu.  Max.
 -0.454  -0.088  0.032   0.097 5.223

Weights:
  0   1
 10 119

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 484: Applied Statistics - MIT

Single Factor ANOVA Models

Corresponds to Chapter 12 ofTamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford

(Johns Hopkins University).

1

Page 485: Applied Statistics - MIT

Chapter 8: How to compare two treatments

Chapter 12: How to compare more than two treatments (or just two).

Example: yields of several varieties of barley. Variety is the treatment factor (predictor)Yield is the response

2

Page 486: Applied Statistics - MIT

Experimental Designs

3

Page 487: Applied Statistics - MIT

S-Plus barley data set (observations 13:30)
> barley.small
      yield  variety year            site
13 35.13333 Svansota 1931 University Farm
14 47.33333 Svansota 1931          Waseca
15 25.76667 Svansota 1931          Morris
16 40.46667 Svansota 1931       Crookston
17 29.66667 Svansota 1931    Grand Rapids
18 25.70000 Svansota 1931          Duluth
19 39.90000   Velvet 1931 University Farm
20 50.23333   Velvet 1931          Waseca
21 26.13333   Velvet 1931          Morris
22 41.33333   Velvet 1931       Crookston
23 23.03333   Velvet 1931    Grand Rapids
24 26.30000   Velvet 1931          Duluth
25 36.56666    Trebi 1931 University Farm
26 63.83330    Trebi 1931          Waseca
27 43.76667    Trebi 1931          Morris
28 46.93333    Trebi 1931       Crookston
29 29.76667    Trebi 1931    Grand Rapids
30 33.93333    Trebi 1931          Duluth

4This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 488: Applied Statistics - MIT

Completely Randomized Design Notation

If the sample sizes are equal the design is balanced; otherwise the design is unbalanced

See Table 12.1, page 458 in the course textbook.

N = Σ_{i=1}^{a} n_i

5

Page 489: Applied Statistics - MIT

S-Plus barley dataset (observations 13:30)

Variety        Svansota     Velvet      Trebi
               35.13333   39.90000   36.56666
               47.33333   50.23333   63.83330
               25.76667   26.13333   43.76667
               40.46667   41.33333   46.93333
               29.66667   23.03333   29.76667
               25.70000   26.30000   33.93333
Variety Mean   34.01111   34.48889   42.46666

6This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 490: Applied Statistics - MIT

Plot of yield by variety for S-Plus barley data set

[Figure: barley.small$yield (roughly 25-65) plotted by barley.small$variety: Svansota, Velvet, Trebi.]

7This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 491: Applied Statistics - MIT

8

S-plus plot.design function

[Figure: plot.design output showing the mean of yield and the median of yield for each level of variety (Svansota, Velvet, Trebi).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 492: Applied Statistics - MIT

CRD: Model and Estimation (cell means model)

See Section 12.1.1 and Figure 12.2 on page 460 of the course textbook.

9

Page 493: Applied Statistics - MIT

CRD: Treatment Effects Model

Alternative Formulation of the Model:

Formula from 12.1.1, page 460 in the course textbook.

Y_ij = µ + τ_i + ε_ij   (i = 1, 2, ..., a;  j = 1, 2, ..., n_i)

10

Page 494: Applied Statistics - MIT

CRD parameter estimates

grand mean µ, estimated by ȳ = (1'y)/n
mean of the ith treatment, µ_i, estimated by ȳ_i = (1'y_i)/n_i
ŷ = vector of fitted values = treatment means
e = error = y − ŷ
σ² estimated by s² = e'e/(n − a)
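A small S-PLUS/R-style sketch of these estimates, computed directly for the barley.small subset used in this example (rather than through aov):

grand.mean <- mean(barley.small$yield)                                              # estimate of mu
trt.means  <- tapply(barley.small$yield, as.character(barley.small$variety), mean)  # estimates of mu_i
fit <- trt.means[as.character(barley.small$variety)]                                # fitted values (treatment means)
e   <- barley.small$yield - fit                                                     # residuals
a   <- length(trt.means)                                                            # number of treatments
s2  <- sum(e^2) / (length(e) - a)                                                   # e'e/(n - a)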

11

Page 495: Applied Statistics - MIT

Fitted values and residuals for barley example

> cbind(barley.small[,1:2], fitted(tmp), resid(tmp))
      yield  variety   fitted      resid
13 35.13333 Svansota 34.01111   1.122218
14 47.33333 Svansota 34.01111  13.322218
15 25.76667 Svansota 34.01111  -8.244442
16 40.46667 Svansota 34.01111   6.455558
17 29.66667 Svansota 34.01111  -4.344442
18 25.70000 Svansota 34.01111  -8.311112
19 39.90000   Velvet 34.48889   5.411113
20 50.23333   Velvet 34.48889  15.744443
21 26.13333   Velvet 34.48889  -8.355557
22 41.33333   Velvet 34.48889   6.844443
23 23.03333   Velvet 34.48889 -11.455557
24 26.30000   Velvet 34.48889  -8.188887
25 36.56666    Trebi 42.46666  -5.900000
26 63.83330    Trebi 42.46666  21.366640
27 43.76667    Trebi 42.46666   1.300010
28 46.93333    Trebi 42.46666   4.466670
29 29.76667    Trebi 42.46666 -12.699990
30 33.93333    Trebi 42.46666  -8.533330

12This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 496: Applied Statistics - MIT

X matrix?
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1

13

Page 497: Applied Statistics - MIT

Model.matrix in S-Plus
> round(model.matrix(barley.small.aov), 3)
   (Intercept) variety.L variety.Q
13           1    -0.707     0.408
14           1    -0.707     0.408
15           1    -0.707     0.408
16           1    -0.707     0.408
17           1    -0.707     0.408
18           1    -0.707     0.408
19           1     0.000    -0.816
20           1     0.000    -0.816
21           1     0.000    -0.816
22           1     0.000    -0.816
23           1     0.000    -0.816
24           1     0.000    -0.816
25           1     0.707     0.408
26           1     0.707     0.408
27           1     0.707     0.408
28           1     0.707     0.408
29           1     0.707     0.408
30           1     0.707     0.408

14This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 498: Applied Statistics - MIT

Model Coefficients

15

> summary.lm(barley.small.aov)

Call: aov(formula = yield ~ variety, data = barley.small)
Residuals:
   Min     1Q Median    3Q   Max
 -12.7 -8.294 -1.611 6.194 21.37

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 36.9889     2.5207 14.6741   0.0000
variety.L    5.9790     4.3660  1.3695   0.1910
variety.Q    3.0619     4.3660  0.7013   0.4939

Residual standard error: 10.69 on 15 degrees of freedom
Multiple R-Squared: 0.1363
F-statistic: 1.184 on 2 and 15 degrees of freedom, the p-value is 0.3332

Correlation of Coefficients:
          (Intercept) variety.L
variety.L           0
variety.Q           0         0

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 499: Applied Statistics - MIT

S-plus model.tables command gives treatment means or effects

> model.tables(barley.small.aov, type = "mean")
Warning messages:
  Model was refit to allow projection in: model.tables(tmp, type = "mean")

Tables of means
Grand mean

36.989

variety
 Svansota Velvet  Trebi
   34.011 34.489 42.467

16This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 500: Applied Statistics - MIT

S-plus model.tables command gives treatment means or effects

> model.tables(barley.small.aov)
Warning messages:
  Model was refit to allow projection in: model.tables(barley.small.aov)

Tables of effects

variety
 Svansota  Velvet  Trebi
  -2.9778 -2.5000 5.4778

17This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 501: Applied Statistics - MIT

Analysis of Variance (ANOVA)

Homogeneity Hypothesis:

H0: µ1 = µ2 = ... = µa  vs.  H1: not all the µi are equal,
or equivalently
H0: τ1 = τ2 = ... = τa = 0  vs.  H1: at least some τi ≠ 0.

Source of Variation   Sum of Squares                 Degrees of Freedom   Mean Square         F
Treatments (A)        SSA = Σ_i n_i(ȳ_i − ȳ)²        a − 1                MSA = SSA/(a − 1)   MSA/MSE
Error (E)             SSE = Σ_i Σ_j (y_ij − ȳ_i)²    N − a                MSE = SSE/(N − a)
Total (T)             SST = Σ_i Σ_j (y_ij − ȳ)²      N − 1

Note SSR = SSA = Treatment sum of squares.

18

Page 502: Applied Statistics - MIT

ANOVA table for model with 3 varieties of barley, year 1

> summary(aov(yield ~ variety, barley.small))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
  variety  2   270.739 135.3694 1.183614 0.3332005
Residuals 15  1715.544 114.3696

ANOVA table for model with all 10 varieties of barley, year 1

> summary(aov(yield ~ variety, barley1))
          Df Sum of Sq  Mean Sq   F Value    Pr(F)
  variety  9   646.262  71.8069 0.5963671 0.793823
Residuals 50  6020.357 120.4071

19This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 503: Applied Statistics - MIT

F-statistic for One-way ANOVA

F = MSA/MSE ~ F_{a−1, N−a} under H0

E(MSE) = σ²

E(MSA) = σ² + (Σ_i n_i τ_i²)/(a − 1)
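As a quick numerical check of the F statistic above, using the 3-variety barley ANOVA table on the previous slide (a = 3, N = 18):

MSA <- 135.3694; MSE <- 114.3696   # mean squares from summary(aov(yield ~ variety, barley.small))
F <- MSA / MSE                     # 1.1836, matching the table
1 - pf(F, 3 - 1, 18 - 3)           # p-value, about 0.333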

20

Page 504: Applied Statistics - MIT

Fitting model with continuous vs. character predictor

> summary(aov(barley.small$yield ~ varnum))
          Df Sum of Sq  Mean Sq F Value     Pr(F)
   varnum  1   214.489 214.4889 1.93692 0.1830502
Residuals 16  1771.794 110.7371

> summary(aov(barley.small$yield ~ as.factor(varnum)))
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
as.factor(varnum)  2   270.739 135.3694 1.183614 0.3332005
        Residuals 15  1715.544 114.3696

21This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 505: Applied Statistics - MIT

Equivalence of T test and ANOVA for model with single factor with 2 levels

> t.test(y[1:6],y[7:12])

Standard Two-Sample t-Test

data:  y[1:6] and y[7:12]
t = -1.194, df = 10, p-value = 0.26
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -22.864726   6.909179
sample estimates:
mean of x mean of y
 34.48889  42.46666

> summary(aov(yield ~ variety, barley.vsmall))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
  variety  1   190.935 190.9346 1.425727 0.2600178
Residuals 10  1339.209 133.9209

22This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 506: Applied Statistics - MIT

23

Model Diagnostics, residual vs. fitted value(all 10 varieties, year 1)

[Figure: resid(barley1.aov) vs. fitted(barley1.aov).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 507: Applied Statistics - MIT

24

Model Diagnostics, residual vs. observation number(all 10 varieties, year 1)

[Figure: resid(barley1.aov) vs. observation number (1-60).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 508: Applied Statistics - MIT

Model Diagnostics, normal plot of residuals(all 10 varieties, year 1)

25
[Figure: normal QQ plot of resid(barley1.aov) vs. quantiles of standard normal.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 509: Applied Statistics - MIT

26

Model Diagnostics, histogram of residuals(all 10 varieties, year 1)

[Figure: histogram of resid(barley1.aov).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 510: Applied Statistics - MIT

Random Effects Model for a One-way LayoutWhen the treatment levels are determined by the experimenter (or those are the only levels of interest), the design is a fixed effects model.

• Goal is to measure the treatment effects or means (“pick the winner”).

When the treatment levels are a random sample from a population of possible treatment levels (e.g. workers in a factory) and the particular levels used in the experiment are not of any interest, the design is a random effects model.

• Goal is to measure the treatment variability (estimate the expected variability among workers).

27

Page 511: Applied Statistics - MIT

Random Effects Model for a One-way Layout

Model: Y_ij = µ_i + ε_ij = µ + τ_i + ε_ij (looks similar to the fixed effects model), where

ε_ij ~ N(0, σ²),   µ_i ~ N(µ, σ_A²)  or  τ_i ~ N(0, σ_A²)   (constants in the fixed effects model)

Var(Y_ij) = Var(µ_i) + Var(ε_ij) = σ_A² + σ²

σ_A² = variance among, σ² = variance within.

With a balanced one-way layout, n observations per treatment:

E(MSE) = σ²
E(MSA) = σ² + nσ_A²

Can estimate σ_A² as (MSA − MSE)/n (if you are lucky!)
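A tiny illustration of that variance-component estimate, reusing the 3-variety barley mean squares purely as placeholder numbers (n = 6 observations per variety):

MSA <- 135.3694; MSE <- 114.3696; n <- 6
sigma2.hat  <- MSE                        # estimate of sigma^2 (within)
sigma2A.hat <- max((MSA - MSE) / n, 0)    # estimate of sigma_A^2 (among), truncated at 0 if MSA < MSE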

28

Page 512: Applied Statistics - MIT

Randomized Block Design

See Figure 3.2 on page 99 of the course textbook.

29

Page 513: Applied Statistics - MIT

Barley Example10 varieties, 6 sites

> ym
                 University Farm   Waseca   Morris Crookston Grand Rapids   Duluth Variety Mean
Manchuria               27.00000 48.86667 27.43334  39.93333     32.96667 28.96667     34.19445
Glabron                 43.06666 55.20000 28.76667  38.13333     29.13333 29.66667     37.32778
Svansota                35.13333 47.33333 25.76667  40.46667     29.66667 25.70000     34.01111
Velvet                  39.90000 50.23333 26.13333  41.33333     23.03333 26.30000     34.48889
Trebi                   36.56666 63.83330 43.76667  46.93333     29.76667 33.93333     42.46666
No. 457                 43.26667 58.10000 28.70000  45.66667     32.16667 33.60000     40.25000
No. 462                 36.60000 65.76670 30.36667  48.56666     24.93334 28.10000     39.05556
Peatland                32.76667 48.56666 29.86667  41.60000     34.70000 32.00000     36.58333
No. 475                 24.66667 46.76667 22.60000  44.10000     19.70000 33.06666     31.81667
Wisconsin No. 38        39.30000 58.80000 29.46667  49.86667     34.46667 31.60000     40.58333
Site Mean               35.82667 54.34667 29.28667  43.66000     29.05334 30.29333     37.07778

30This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 514: Applied Statistics - MIT

Randomized Block Design (RBD)Method

Y_ij = µ + τ_i + β_j + ε_ij   (i = 1, ..., a;  j = 1, ..., b)

Σ_{i=1}^{a} τ_i = 0,   Σ_{j=1}^{b} β_j = 0

a − 1 independent treatment effects
b − 1 independent block effects

For more information, see 12.4, page 482 in course textbook.

31

Page 515: Applied Statistics - MIT

No Interactions Between Treatments and Blocks

µ_ij − µ_i'j = (µ + τ_i + β_j) − (µ + τ_i' + β_j) = τ_i − τ_i'

Formula from page 483 in the course textbook.

32

Page 516: Applied Statistics - MIT

RBD: Sums of Squares

See formulas 12.17, 12.18, and 12.19 on pages 484-5 in the course textbook.

33

Page 517: Applied Statistics - MIT

ANOVA tables for models for barley data set

> summary(aov(yield ~ variety, barley1))
          Df Sum of Sq  Mean Sq   F Value    Pr(F)
  variety  9   646.262  71.8069 0.5963671 0.793823
Residuals 50  6020.357 120.4071

> summary(aov(yield ~ variety + site, barley1))
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
  variety  9   646.262   71.807  3.67995 0.001612103
     site  5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513

34This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 518: Applied Statistics - MIT

Type 1 and Type 3 Sums of Squares for barley example (balanced design)

> summary(barley12.aov)
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
  variety  9   646.262   71.807  3.67995 0.001612103
     site  5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513

> summary(barley12.aov, ssType = 3)
Type III Sum of Squares
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
  variety  9   646.262   71.807  3.67995 0.001612103
     site  5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513

35This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 519: Applied Statistics - MIT

Degrees of Freedom

36

Page 520: Applied Statistics - MIT

Effects in barley model
> model.tables(barley12.aov, type = "effects")
Warning messages:
  Model was refit to allow projection in: model.tables(barley12.aov, type = "effects")

Tables of effects

variety
  Svanso No. 462   Manch No. 475  Velvet  Peatla Glabron No. 457 Wisc No. 38   Trebi
 -3.0667  1.9778 -2.8833 -5.2611 -2.5889 -0.4944  0.2500  3.1722      3.5056  5.3889

site
 Grand Rapids Duluth University Farm Morris Crookston Waseca
       -8.024 -6.784          -1.251 -7.791     6.582 17.269

37This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 521: Applied Statistics - MIT

Analysis of Multifactor Experiments

Corresponds to Chapter 13 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford

(Johns Hopkins University) 1

Page 522: Applied Statistics - MIT

Analysis of Multifactor Experiments

(See Table 13.1 on page 505 of the course textbook.)

2

Page 523: Applied Statistics - MIT

Model and estimates

Y_ijk = µ + τ_i + β_j + (τβ)_ij + ε_ijk

µ̂ = ȳ...        τ̂_i = ȳ_i.. − ȳ...        β̂_j = ȳ_.j. − ȳ...

(τβ)̂_ij = ȳ_ij. − ȳ_i.. − ȳ_.j. + ȳ...

ŷ_ijk = ȳ_ij.        e_ijk = y_ijk − ŷ_ijk = y_ijk − ȳ_ij.

3

Page 524: Applied Statistics - MIT

For any model

y = vector of observed response values
ŷ = vector of fitted values
ȳ = vector of grand mean

SST = SSTotal = (y − ȳ)'(y − ȳ)
SSM = SSModel = (ŷ − ȳ)'(ŷ − ȳ)
SSE = SSError = (y − ŷ)'(y − ŷ)
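A short sketch of this decomposition for any fitted lm/aov object; fit and y are placeholders for a fitted model and its response vector.

ss.decomp <- function(fit, y) {
  ybar <- mean(y)
  SST <- sum((y - ybar)^2)               # total sum of squares
  SSM <- sum((fitted(fit) - ybar)^2)     # model sum of squares
  SSE <- sum((y - fitted(fit))^2)        # error sum of squares
  c(SST = SST, SSM = SSM, SSE = SSE)     # SST = SSM + SSE (up to rounding)
}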

4

Page 525: Applied Statistics - MIT

• Biochemical Reactions of Cells Treated with Puromycin

• SUMMARY: • The “Balanced” Puromycin data frame has 24 rows

representing the measurement of initial velocity of a biochemical reaction for 6 different concentrations of substrate and two different cell treatments. This data frame contains the following variables (columns):

• ARGUMENTS: • conc

– the concentration of the substrate. • vel

– the initial velocity of the reaction. • state

– a factor telling whether the cells involved were treated or untreated.

5

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 526: Applied Statistics - MIT

6

Scatterplot matrix for puromycin data set

[Figure: scatterplot matrix (pairs) of conc, state (untr, trtd), and vel for the puromycin data.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 527: Applied Statistics - MIT

7

plot.factor(conc, vel)
[Figure: vel (roughly 50-200) plotted by level of conc: 0.02, 0.06, 0.11, 0.22, 0.56, 1.1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 528: Applied Statistics - MIT

8

plot.factor(state, vel)
[Figure: vel (roughly 50-200) plotted by state: untreated, treated.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 529: Applied Statistics - MIT

Velocity in “Balanced” puromycin data set

conc    treated      untreated
0.02     76  47       67  51
0.06     97 107       84  86
0.11    123 139       98 115
0.22    159 152      131 124
0.56    191 201      144 158
1.10    207 200      160 162

9

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 530: Applied Statistics - MIT

Histogram of velocity
[Figure: histogram of vel.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 531: Applied Statistics - MIT

interaction.plot(pyb$state,pyb$conc,pyb$vel)

11
[Figure: interaction plot of mean pyb$vel vs. pyb$state (untreated, treated), one line per level of pyb$conc.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 532: Applied Statistics - MIT

12

interaction.plot(pyb$conc,pyb$state,pyb$vel)

[Figure: interaction plot of mean pyb$vel vs. pyb$conc, one line per level of pyb$state (treated, untreated).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 533: Applied Statistics - MIT

Summaries of puromycin model

Residuals:
   Min  1Q      Median  3Q  Max
 -14.5  -5 -4.441e-016   5 14.5

Residual standard error: 9.559 on 12 degrees of freedom
Multiple R-Squared: 0.9784
F-statistic: 49.5 on 11 and 12 degrees of freedom, the p-value is 2.919e-008

           Df Sum of Sq  Mean Sq  F Value      Pr(F)
     state  1   4240.04 4240.042 46.40264 0.00001871
      conc  5  44243.71 8848.742 96.83985 0.00000000
state:conc  5   1270.71  254.142  2.78130 0.06803651
 Residuals 12   1096.50   91.375

13

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 534: Applied Statistics - MIT

Observed velocity and fitted values for puromycin model with interaction

        Observed                     Fitted Values
conc    treated     untreated        treated         untreated
0.02     76  47      67  51          61.5  61.5      59.0  59.0
0.06     97 107      84  86         102.0 102.0      85.0  85.0
0.11    123 139      98 115         131.0 131.0     106.5 106.5
0.22    159 152     131 124         155.5 155.5     127.5 127.5
0.56    191 201     144 158         196.0 196.0     151.0 151.0
1.10    207 200     160 162         203.5 203.5     161.0 161.0

14

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 535: Applied Statistics - MIT

model.tables
Tables of means
Grand mean

128.29

state
 untreated treated
    115.00  141.58

conc
  0.02  0.06   0.11   0.22   0.56    1.1
 60.25 93.50 118.75 141.50 173.50 182.25

state:conc
Dim 1 : state
Dim 2 : conc
           0.02  0.06  0.11  0.22  0.56   1.1
untreated  59.0  85.0 106.5 127.5 151.0 161.0
treated    61.5 102.0 131.0 155.5 196.0 203.5

15

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 536: Applied Statistics - MIT

multicomp(pyb.aov,focus=“concf”)

95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method

critical point: 3.3595 response variable: vel

intervals excluding 0 are flagged by '****'

          Estimate Std.Error Lower Bound Upper Bound
0.02-0.06   -33.20      6.76       -56.0    -10.5000 ****
0.02-0.11   -58.50      6.76       -81.2    -35.8000 ****
0.02-0.22   -81.20      6.76      -104.0    -58.5000 ****
0.02-0.56  -113.00      6.76      -136.0    -90.5000 ****
0.02-1.1   -122.00      6.76      -145.0    -99.3000 ****
0.06-0.11   -25.30      6.76       -48.0     -2.5400 ****
0.06-0.22   -48.00      6.76       -70.7    -25.3000 ****
0.06-0.56   -80.00      6.76      -103.0    -57.3000 ****
0.06-1.1    -88.70      6.76      -111.0    -66.0000 ****
0.11-0.22   -22.70      6.76       -45.5     -0.0425 ****
0.11-0.56   -54.70      6.76       -77.5    -32.0000 ****
0.11-1.1    -63.50      6.76       -86.2    -40.8000 ****
0.22-0.56   -32.00      6.76       -54.7     -9.2900 ****
0.22-1.1    -40.70      6.76       -63.5    -18.0000 ****
0.56-1.1     -8.75      6.76       -31.5     14.0000

16

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 537: Applied Statistics - MIT

17

Residual vs. fit for puromycin model

[Figure: resid(pyb.aov) vs. fitted(pyb.aov).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 538: Applied Statistics - MIT

18

qqplot of residuals for puromycin model

[Figure: normal QQ plot of resid(pyb.aov).]

Page 539: Applied Statistics - MIT

Summaries of puromycin model without interaction

Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04

Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010

          Df Sum of Sq  Mean Sq  F Value         Pr(F)
     conc  5  44243.71 8848.742 63.54684 0.00000000021
    state  1   4240.04 4240.042 30.44967 0.00003762498
Residuals 17   2367.21  139.248

19

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 540: Applied Statistics - MIT

Observed velocity and fitted values for puromycin model without interaction

        Observed                     Fitted
conc    treated     untreated        treated            untreated
0.02     76  47      67  51          73.542  73.542     46.958  46.958
0.06     97 107      84  86         106.792 106.792     80.208  80.208
0.11    123 139      98 115         132.042 132.042    105.458 105.458
0.22    159 152     131 124         154.792 154.792    128.208 128.208
0.56    191 201     144 158         186.792 186.792    160.208 160.208
1.10    207 200     160 162         195.542 195.542    168.958 168.958

20

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 541: Applied Statistics - MIT

21

Plot of residual vs. fit for puromycin model without interaction

[Figure: residuals vs. fitted values for the no-interaction model (Fitted : conc + state); observations 21, 13, and 2 are labeled.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 542: Applied Statistics - MIT

22

Plot of velocity vs. concentration

[Figure: scatterplot of vel vs. conc.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 543: Applied Statistics - MIT

Call: aov(formula = vel ~ conc + conc^2 + state)
Residuals:
   Min     1Q Median    3Q   Max
 -45.4  -6.93  4.227 7.902 23.94

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)  73.0885     6.0136 12.1539   0.0000
conc        304.9581    37.3027  8.1752   0.0000
I(conc^2)  -188.9327    32.5953 -5.7963   0.0000
state        13.2917     3.4172  3.8897   0.0009

Residual standard error: 16.74 on 20 degrees of freedom
Multiple R-Squared: 0.8898
F-statistic: 53.82 on 3 and 20 degrees of freedom, the p-value is 9.291e-010

> summary(pyb2.aov)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
     conc  1  31590.27 31590.27 112.7215 0.0000000011
I(conc^2)  1   9415.64  9415.64  33.5972 0.0000113551
    state  1   4240.04  4240.04  15.1295 0.0009104989
Residuals 20   5605.01   280.25

23

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 544: Applied Statistics - MIT

24

Plot of residual vs. fit for pyb2.aov

[Figure: residuals vs. fitted values (Fitted : conc + conc^2 + state); observations 18, 21, and 2 are labeled.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 545: Applied Statistics - MIT

25

qqplot of residuals for pyb2.aov

[Figure: normal QQ plot of the residuals from pyb2.aov.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 546: Applied Statistics - MIT

26

Call: aov(formula = vel ~ conc + conc^2 + conc^3 + conc^4 + conc^5 + state)

Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04

Coefficients:
Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010

> summary(pyb5.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     conc  1  31590.27 31590.27 226.8641 0.0000000
I(conc^2)  1   9415.64  9415.64  67.6180 0.0000003
I(conc^3)  1   2603.71  2603.71  18.6984 0.0004604
I(conc^4)  1    631.13   631.13   4.5324 0.0481759
I(conc^5)  1      2.96     2.96   0.0213 0.8857934
    state  1   4240.04  4240.04  30.4497 0.0000376
Residuals 17   2367.21   139.25

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 547: Applied Statistics - MIT

27

Plot of residual vs. fit for pyb5.aov

[Figure: residuals vs. fitted values (Fitted : conc + conc^2 + conc^3 + conc^4 + conc^5 + state); observations 21, 13, and 2 are labeled.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 548: Applied Statistics - MIT

Guayule data set

• Rate of Germination of Treated Guayule Seeds
• SUMMARY:
• The guayule data frame, a design object, has 96 rows and 5 columns. The guayule is a Mexican plant from which rubber is manufactured. Batches of 100 seeds of eight varieties (variety) of guayule were given one of four treatments (treatment) and planted; the number of plants that came up in each batch (plants) was recorded.
• ARGUMENTS:
• variety
  – factor with levels V1 through V8 labeling the variety of guayule.
• treatment
  – factor with levels T1 through T4 labeling the treatment given to the seeds.
• plants
  – numeric vector giving the number of seeds out of a batch of 100 that germinated.

28

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 549: Applied Statistics - MIT

pairs(gy)

29
[Figure: scatterplot matrix (pairs) of variety, treatment, and plants for the guayule data.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 550: Applied Statistics - MIT

plot.factor(gy$variety,gy$plants)

30
[Figure: gy$plants (roughly 20-80) plotted by gy$variety (V1-V8).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 551: Applied Statistics - MIT

plot.factor(gy$treatment,gy$plants)

31
[Figure: gy$plants plotted by gy$treatment (T1-T4).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 552: Applied Statistics - MIT

interaction.plot(gy$variety,gy$treatment,gy$plants)

32
[Figure: interaction plot of mean gy$plants vs. gy$variety (V1-V8), one line per gy$treatment (T1-T4).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 553: Applied Statistics - MIT

interaction.plot(gy$treatment,gy$variety,gy$plants)

33
[Figure: interaction plot of mean gy$plants vs. gy$treatment (T1-T4), one line per gy$variety (V1-V8).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 554: Applied Statistics - MIT

hist(gy$plants)

34
[Figure: histogram of gy$plants.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 555: Applied Statistics - MIT

Summaries of gy.aov

Call: aov(formula = plants ~ variety * treatment, data = gy)
Residuals:
    Min     1Q     Median   3Q Max
 -16.33 -2.667 1.494e-015 2.75  16

Residual standard error: 6.348 on 64 degrees of freedom
Multiple R-Squared: 0.9298
F-statistic: 27.35 on 31 and 64 degrees of freedom, the p-value is 0

> summary(gy.aov)
                  Df Sum of Sq  Mean Sq  F Value      Pr(F)
          variety  7    763.16   109.02   2.7058 0.01604076
        treatment  3  30774.28 10258.09 254.5959 0.00000000
variety:treatment 21   2620.14   124.77   3.0966 0.00026666
        Residuals 64   2578.67    40.29

35

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 556: Applied Statistics - MIT

Plot of residual vs. fit for gy data set

36
[Figure: residuals vs. fitted values (Fitted : variety * treatment); observations 3, 35, and 34 are labeled.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 557: Applied Statistics - MIT

model.tables(gy.aov,type="mean")

Tables of means
Grand mean

25.302

variety
     V1     V2     V3     V4     V5     V6     V7     V8
 24.667 26.833 28.833 21.000 21.917 28.167 23.250 27.750

treatment
     T1     T2     T3     T4
 55.833 13.917 20.042 11.417

37

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 558: Applied Statistics - MIT

model.tables(gy.aov,type="mean")

variety:treatment
Dim 1 : variety
Dim 2 : treatment
        T1     T2     T3     T4
V1  66.333 11.667 12.333  8.333
V2  63.333 18.333 14.333 11.333
V3  65.000 12.667 26.333 11.333
V4  50.333 10.000 14.000  9.667
V5  49.333 16.333 10.333 11.667
V6  58.000  8.000 29.667 17.000
V7  46.333 14.667 22.000 10.000
V8  48.000 19.667 31.333 12.000

38

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 559: Applied Statistics - MIT

multicomp(gy.aov,focus="treatment")

95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method

critical point: 2.6378 response variable: plants

intervals excluding 0 are flagged by '****'

      Estimate Std.Error Lower Bound Upper Bound
T1-T2    41.90      1.83       37.10       46.80 ****
T1-T3    35.80      1.83       31.00       40.60 ****
T1-T4    44.40      1.83       39.60       49.30 ****
T2-T3    -6.12      1.83      -11.00       -1.29 ****
T2-T4     2.50      1.83       -2.33        7.33
T3-T4     8.62      1.83        3.79       13.50 ****

39

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 560: Applied Statistics - MIT

Guayule ANOVA with variety random

> gyr.tab
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
        treatment  3  30774.28 10258.09 82.21711 0.0000000
          variety  7    763.16   109.02  0.87380 0.5428964
treatment:variety 21   2620.14   124.77  3.09663 0.0002667
        Residuals 64   2578.67    40.29

40

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 561: Applied Statistics - MIT

Random if:

• Not interested in those particular factor levels (e.g. batches)

• Levels of factor are randomly chosen from a larger population of factor levels (e.g. 10 universities selected from all universities in country).

• Want to generalize to a larger population of factor levels.

41

Page 562: Applied Statistics - MIT

EMS for 2-factor models(See Table 24.5 on page 981 of Neter et al. Applied Linear Statistical Models.)

Nested vs. Crossed Design(See Figure 28.1 in Neter et al. Applied Linear Statistical Models.)

Nested Fixed Factors(See Table 28.3 on page 1129 of Neter et al. Applied Linear Statistical Models.)

42

Page 563: Applied Statistics - MIT

Nested Mixed Factors(See Table 28.5 on page 1133 of Neter et al. Applied Linear Statistical Models.)

Cross-Nested Models(See Table 28.11 on page 1151 of Neter et al. Applied Linear Statistical Models.)

43

Page 564: Applied Statistics - MIT

Images of book covers:

Patrick O’Brian, The Commodore.

Patrick O’Brian, The Fortune of War.

44

Page 565: Applied Statistics - MIT

Nested Factors
• Speed of Firing Naval Guns
• SUMMARY:
• The gun data frame, a design object, has 36 rows representing runs of a team of 3 men loading and firing naval guns, attempting to get off as many rounds per minute as possible. The three predictor variables (columns) specify the team, the physique of the men on it, and the loading method used; the outcome variable is the rounds fired per minute.

• ARGUMENTS: • Method

– factor giving one of two methods for loading rounds into Naval guns. Levels are M1 and M2 .

• Physique– an ordered factor giving the physique of the men: S for slight, A for

average, and H for heavy. • Team

– factor with levels T1 , T2 or T3 . In fact there are nine teams, three of each physique, i.e. a slight T1 , an average T1 , and a heavy T1 , etc.

• Rounds– numeric vector giving the number of rounds per minute fired by a team.

45

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 566: Applied Statistics - MIT

gun
   Method Physique Team Rounds
1      M1        S   T1   20.2
2      M2        S   T1   14.2
3      M1        A   T1   22.0
4      M2        A   T1   14.1
5      M1        H   T1   23.1
6      M2        H   T1   14.1
7      M1        S   T2   26.2
8      M2        S   T2   18.0
9      M1        A   T2   22.6
10     M2        A   T2   14.0
11     M1        H   T2   22.9
12     M2        H   T2   12.2
13     M1        S   T3   23.8
14     M2        S   T3   12.5
15     M1        A   T3   22.9
16     M2        A   T3   13.7
17     M1        H   T3   21.8
18     M2        H   T3   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

46

Page 567: Applied Statistics - MIT

gun
   Method Physique Team Rounds
1      M1        S   T1   20.2
2      M2        S   T1   14.2
3      M1        A   T2   22.0
4      M2        A   T2   14.1
5      M1        H   T3   23.1
6      M2        H   T3   14.1
7      M1        S   T4   26.2
8      M2        S   T4   18.0
9      M1        A   T5   22.6
10     M2        A   T5   14.0
11     M1        H   T6   22.9
12     M2        H   T6   12.2
13     M1        S   T7   23.8
14     M2        S   T7   12.5
15     M1        A   T8   22.9
16     M2        A   T8   13.7
17     M1        H   T9   21.8
18     M2        H   T9   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

47

Page 568: Applied Statistics - MIT

Speed of firing of naval guns

           Slight              Average             Heavy
Method 1   T1: 20.2, 24.1      T2: 22.0, 23.5      T3: 23.1, 22.9
           T4: 26.2, 26.9      T5: 22.6, 24.6      T6: 22.9, 23.7
           T7: 23.8, 24.9      T8: 22.9, 25.0      T9: 21.8, 23.5
Method 2   T1: 14.2, 16.2      T2: 14.1, 16.1      T3: 14.1, 16.1
           T4: 18.0, 19.1      T5: 14.0, 18.1      T6: 12.2, 13.8
           T7: 12.5, 15.4      T8: 13.7, 16.0      T9: 12.7, 15.1

48

Page 569: Applied Statistics - MIT

pairs(gun2)

49
[Figure: scatterplot matrix (pairs) of method, physique, team, and rounds for gun2.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 570: Applied Statistics - MIT

50

Method Effect

[Figure: mean of rounds for method M1 vs. M2.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 571: Applied Statistics - MIT

51

Physique Effect

[Figure: mean of rounds by physique: S, A, H.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 572: Applied Statistics - MIT

52

Team Effect

[Figure: mean of rounds by team (1-9).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 573: Applied Statistics - MIT

53

Method-Physique Interaction

[Figure: interaction plot of mean rounds vs. method (M1, M2), one line per physique (S, A, H).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 574: Applied Statistics - MIT

ANOVA tables for firing of naval guns example(with teams numbered 1-9)

54

> summary(aov(rounds ~ phys*meth*team))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     phys  2   16.0517   8.0258   3.4736 0.0529995
     meth  1  651.9511 651.9511 282.1621 0.0000000
     team  6   39.2583   6.5431   2.8318 0.0403140
phys:meth  2    1.1872   0.5936   0.2569 0.7762240
meth:team  6   10.7217   1.7869   0.7734 0.6009376
Residuals 18   41.5900   2.3106

> summary(aov(rounds ~ phys*meth*team %in% phys))
                      Df Sum of Sq  Mean Sq  F Value     Pr(F)
                 phys  2   16.0517   8.0258   3.4736 0.0529995
                 meth  1  651.9511 651.9511 282.1621 0.0000000
            phys:meth  2    1.1872   0.5936   0.2569 0.7762240
       team %in% phys  6   39.2583   6.5431   2.8318 0.0403140
meth:(team %in% phys)  6   10.7217   1.7869   0.7734 0.6009376
            Residuals 18   41.5900   2.3106

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 575: Applied Statistics - MIT

> model.tables(gunaov, type = "mean")
Tables of means
Grand mean

19.333

Method
     M1     M2
 23.589 15.078

Physique
      S      A      H
 20.125 19.383 18.492

Team %in% Physique
Dim 1 : Physique
Dim 2 : Team
       T1     T2     T3
S  18.675 22.550 19.150
A  18.925 19.825 19.400
H  19.050 18.150 18.275

55

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 576: Applied Statistics - MIT

56

Tables of meansGrand mean

19.333

method M1 M2

23.589 15.078rep 18.000 18.000

physique S A H

20.125 19.383 18.492rep 12.000 12.000 12.000

team %in% physique Dim 1 : physiqueDim 2 : team

1 2 3 4 5 6 7 8 9 S 18.675 22.550 19.150

rep 4.000 0.000 0.000 4.000 0.000 0.000 4.000 0.000 0.000A 18.925 19.825 19.400

rep 0.000 4.000 0.000 0.000 4.000 0.000 0.000 4.000 0.000H 19.050 18.150 18.275

rep 0.000 0.000 4.000 0.000 0.000 4.000 0.000 0.000 4.000

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 577: Applied Statistics - MIT

Summaries of firing of naval guns example (without interaction)

57

Call: aov(formula = Rounds ~ Method + Physique/Team, data = gun)

Residuals:
    Min      1Q     Median     3Q   Max
 -2.731 -0.7368 2.498e-016 0.9972 2.531

Residual standard error: 1.434 on 26 degrees of freedom
Multiple R-Squared: 0.9297
F-statistic: 38.19 on 9 and 26 degrees of freedom, the p-value is 9.602e-013

> summary(gunaov)
                   Df Sum of Sq  Mean Sq  F Value      Pr(F)
            Method  1  651.9511 651.9511 316.8426 0.00000000
          Physique  2   16.0517   8.0258   3.9005 0.03300457
Team %in% Physique  6   39.2583   6.5431   3.1799 0.01782181
         Residuals 26   53.4989   2.0576

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 578: Applied Statistics - MIT

Plot of residual vs fit for gun.aov

58
[Figure: residuals vs. fitted values (Fitted : Method + Physique/Team); observations 14, 28, and 1 are labeled.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 579: Applied Statistics - MIT

2^k Factorial Designs

• Exploratory experimental studies.
• Multifactor experiment in which each factor is studied at two levels.
• Used to screen a large number of factors to identify the most important.
• Sometimes 2 levels occur naturally, e.g. present or absent, smoker or non-smoker.
• k factors => 2^k treatment combinations (a sketch of the design matrix follows below).
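A small S-PLUS/R-style sketch of the 2^3 design matrix in -1/+1 coding; the run order is illustrative and need not match the nw.df example used on the following slides.

design <- expand.grid(a = c(-1, 1), b = c(-1, 1), c = c(-1, 1))   # the 8 treatment combinations of a 2^3 design
design <- design[rep(1:8, 3), ]                                   # 3 replicates -> 24 runs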

59

Page 580: Applied Statistics - MIT

2^k Factorial Design Example

Example: 13.19, page 553 of the course textbook.

60

Page 581: Applied Statistics - MIT

61

pairs(nw.df)

[Figure: scatterplot matrix (pairs) of y, a, b, and c for nw.df; a, b, and c take values -1 and +1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 582: Applied Statistics - MIT

62

hist(y)
[Figure: histogram of y.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 583: Applied Statistics - MIT

Effect of a

63
[Figure: mean of y at a = -1 vs. a = +1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 584: Applied Statistics - MIT

64

Effect of b

[Figure: mean of y at b = -1 vs. b = +1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 585: Applied Statistics - MIT

65

Effect of c

[Figure: mean of y at c = -1 vs. c = +1.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 586: Applied Statistics - MIT

66

interaction.plot(a,b,y)

[Figure: interaction plot of mean y vs. a, one line per level of b.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 587: Applied Statistics - MIT

67

interaction.plot(a,c,y)

[Figure: interaction plot of mean y vs. a, one line per level of c.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 588: Applied Statistics - MIT

68

interaction.plot(b,c,y)

[Figure: interaction plot of mean y vs. b, one line per level of c.]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 589: Applied Statistics - MIT

summary.lm(nw.aov)
Call: aov(formula = y ~ a * b * c, data = nw.df)
Residuals:
    Min     1Q Median    3Q   Max
 -37.67 -6.861  2.388 12.67 28.67

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) 171.1942     4.6675  36.6780   0.0000
a           -17.6942     4.6675  -3.7909   0.0016
b           -76.5833     4.6675 -16.4078   0.0000
c            13.3333     4.6675   2.8566   0.0114
a:b         -14.8050     4.6675  -3.1719   0.0059
a:c          16.6667     4.6675   3.5708   0.0026
b:c           4.9442     4.6675   1.0593   0.3052
a:b:c       -25.0558     4.6675  -5.3682   0.0001

Residual standard error: 22.87 on 16 degrees of freedom
Multiple R-Squared: 0.9556
F-statistic: 49.21 on 7 and 16 degrees of freedom, the p-value is 1.209e-009

Effect (of going from low to high level) is 2*regression coefficient.
69
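A one-line illustration of that statement, assuming the fitted object nw.aov from the summary above:

effects.nw <- 2 * coef(nw.aov)[-1]   # drop the intercept; e.g. the "a" effect is 2*(-17.6942) = -35.39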

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 590: Applied Statistics - MIT

model.matrix(nw.aov)
   (Intercept)  a  b  c a:b a:c b:c a:b:c
1            1 -1 -1 -1   1   1   1    -1
2            1  1 -1 -1  -1  -1   1     1
3            1 -1  1 -1  -1   1  -1     1
4            1 -1 -1  1   1  -1  -1     1
5            1  1  1 -1   1  -1  -1    -1
6            1  1 -1  1  -1   1  -1    -1
7            1 -1  1  1  -1  -1   1    -1
8            1  1  1  1   1   1   1     1
9            1 -1 -1 -1   1   1   1    -1
10           1  1 -1 -1  -1  -1   1     1
11           1 -1  1 -1  -1   1  -1     1
12           1 -1 -1  1   1  -1  -1     1
13           1  1  1 -1   1  -1  -1    -1
14           1  1 -1  1  -1   1  -1    -1
15           1 -1  1  1  -1  -1   1    -1
16           1  1  1  1   1   1   1     1
17           1 -1 -1 -1   1   1   1    -1
18           1  1 -1 -1  -1  -1   1     1
19           1 -1  1 -1  -1   1  -1     1
20           1 -1 -1  1   1  -1  -1     1
21           1  1  1 -1   1  -1  -1    -1
22           1  1 -1  1  -1   1  -1    -1
23           1 -1  1  1  -1  -1   1    -1
24           1  1  1  1   1   1   1     1
70

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 591: Applied Statistics - MIT

X'X Matrix
t(X) %*% X
            (Intercept)  a  b  c a:b a:c b:c a:b:c
(Intercept)          24  0  0  0   0   0   0     0
a                     0 24  0  0   0   0   0     0
b                     0  0 24  0   0   0   0     0
c                     0  0  0 24   0   0   0     0
a:b                   0  0  0  0  24   0   0     0
a:c                   0  0  0  0   0  24   0     0
b:c                   0  0  0  0   0   0  24     0
a:b:c                 0  0  0  0   0   0   0    24

71

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 592: Applied Statistics - MIT

n*(X'X)^(-1)*X'

> solve(t(X)%*%X)%*%t(X)*24
             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
(Intercept)  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
a           -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1
b           -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1
c           -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1
a:b          1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1
a:c          1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1
b:c          1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1
a:b:c       -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1

72

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 593: Applied Statistics - MIT

> summary(nw.aov)

          Df Sum of Sq  Mean Sq  F Value     Pr(F)
a          1    7514.0   7514.0  14.3712 0.0016031
b          1  140760.2 140760.2 269.2166 0.0000000
c          1    4266.7   4266.7   8.1604 0.0114229
a:b        1    5260.5   5260.5  10.0612 0.0059164
a:c        1    6666.7   6666.7  12.7506 0.0025519
b:c        1     586.7    586.7   1.1221 0.3052037
a:b:c      1   15067.1  15067.1  28.8171 0.0000628
Residuals 16    8365.6    522.9

73

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 594: Applied Statistics - MIT

Plot of residual vs. fit for nw.aov

74

(Figure: residuals versus fitted values for nw.aov; points 6, 22, and 23 are labeled.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 595: Applied Statistics - MIT

Nonparametric Statistical Methods

Corresponds to Chapter 14 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT)

1

Page 596: Applied Statistics - MIT

Nonparametric Methods

• Most NP methods are based on ranks instead of original data

• Reference: Hollander & Wolfe, Nonparametric Statistical Methods

E Newton 2

Page 597: Applied Statistics - MIT

E Newton 3

Histogram of 100 gamma(1,1) r.v.’s

(Figure: histogram of g, the 100 gamma(1,1) random variables.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 598: Applied Statistics - MIT

Histogram of ranks of 100 r.v.’s

(Figure: histogram of rank(g), which is roughly flat since the ranks are spread evenly from 1 to 100.)

E Newton 4

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 599: Applied Statistics - MIT

Parametric and Nonparametric Tests

E Newton 5

Type of test              Parametric     Nonparametric
Single sample             z and t tests  Sign test
                                         Wilcoxon Signed Rank Test
Two independent samples   z and t tests  Wilcoxon Rank Sum Test
                                         Mann-Whitney U Test

Page 600: Applied Statistics - MIT

E Newton 6

Type of test                 Parametric  Nonparametric
Several independent samples  ANOVA CRD   Kruskal-Wallis Test
Several matched samples      ANOVA RBD   Friedman Test
Correlation                  Pearson     Spearman Rank Correlation
                                         Kendall's Rank Correlation

Page 601: Applied Statistics - MIT

Sign Test

• Inference on the median (u) for a single sample of size n
• H0: u = u0 vs. H1: u ≠ u0
• Count the number of xi's that are greater than u0 and denote this s+
• The number of xi's less than u0 is s- = n - s+
• Reject H0 if s+ is large (equivalently, if s- is small)
• Under H0, s+ (and s-) has a binomial(n, 1/2) distribution
• Large sample: z test based on the normal approximation to the binomial (see the sketch below)
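A minimal sketch of the large-sample z version, assuming a data vector x and hypothesized median u0 (hypothetical names):

# sign test, normal approximation (continuity correction omitted)
splus <- sum(x > u0)               # number of observations above u0
n <- sum(x != u0)                  # ties with u0 are dropped (a common convention)
z <- (splus - n/2) / sqrt(n/4)     # standardized under H0: s+ ~ binomial(n, 1/2)
pval <- 2 * (1 - pnorm(abs(z)))    # two-sided p-value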

E Newton 7

Page 602: Applied Statistics - MIT

Histogram of thermostat data

(Figure: histogram of the thermostat data x.)

E Newton 8

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 603: Applied Statistics - MIT

Sign Test in S-Plus

> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0

> thermostat<200
[1] F F F F F T F F F T

> sum(thermostat<200)
[1] 2

> 2*pbinom(sum(thermostat<200),10,0.5)
[1] 0.109375

E Newton 9

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 604: Applied Statistics - MIT

Wilcoxon Signed Rank Test
• Inference on the median (u) for a single sample of size n
• Assumes the population distribution is symmetric
• H0: u = u0 vs. H1: u ≠ u0
• di = xi - u0
• Rank order the |di|
• W+ = sum of the ranks of the positive differences (see the sketch below)
• W- = sum of the ranks of the negative differences
• Wmax = maximum(W+, W-)
• Reject H0 if Wmax is large
• Null distribution: see text
• Large sample z test
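A minimal sketch of W+ and W- (x and u0 as above, assuming no zero or tied |di|); the thermostat example on a later slide computes W+ in a similar way:

# Wilcoxon signed rank statistics
d <- x - u0
r <- rank(abs(d))         # ranks of the absolute differences
Wplus <- sum(r[d > 0])    # sum of ranks of positive differences
Wminus <- sum(r[d < 0])   # sum of ranks of negative differences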

E Newton 10

Page 605: Applied Statistics - MIT

S-Plus wilcox.test for thermostat data

E Newton 11

> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0

> sum(rank(abs(thermostat-200))[-c(6,10)])
[1] 47

> wilcox.test(thermostat,mu=200)

	Exact Wilcoxon signed-rank test

data:  thermostat
signed-rank statistic V = 47, n = 10, p-value = 0.0488
alternative hypothesis: true mu is not equal to 200

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 606: Applied Statistics - MIT

S-Plus parametric t-test for thermostat data

> t.test(thermostat, mu=200)

One-sample t-Test

data:  thermostat
t = 2.3223, df = 9, p-value = 0.0453
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
 200.0459 203.4941
sample estimates:
mean of x
   201.77

E Newton 12

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 607: Applied Statistics - MIT

Location-Scale Families

• See course textbook, page 575.

E Newton 13

Page 608: Applied Statistics - MIT

2 normal pdf’s with location parameters = -1 and 1, scale parameter =1

(Figure: the two normal density curves plotted against x from -4 to 4.)

E Newton 14

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 609: Applied Statistics - MIT

Wilcoxon Rank Sum Test

• Inference on location of distribution of 2 independent random samples X and Y (e.g. from control and treatment population).

• Assume X ~ Y + ∆ (a shift in location)
• H0: ∆ = 0 vs. H1: ∆ ≠ 0
• Rank all N = n1 + n2 observations together
• W = sum of the ranks assigned to the Y's (or the X's, whichever sample is smaller); see the sketch below
• Reject H0 if W is extreme
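A minimal sketch of W for two sample vectors x and y (hypothetical names), summing the ranks of the y's:

# Wilcoxon rank sum statistic: ranks of the y's in the combined sample
r <- rank(c(x, y))
W <- sum(r[(length(x) + 1):(length(x) + length(y))])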

E Newton 15

Page 610: Applied Statistics - MIT

Mann-Whitney U test

• Equivalent to the Wilcoxon rank sum test
• Compare each xi with each yj
• There are nx*ny such comparisons
• U = number of pairs in which xi < yj
• It can be shown that W = U + n(n+1)/2 when there are no ties, where n is the size of the sample whose ranks are summed
• Reject H0 if U is extreme (see the sketch below)
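A minimal sketch of U from its definition (x and y as above, assuming no ties):

# Mann-Whitney U: count the pairs with xi < yj
U <- sum(outer(x, y, "<"))
# relationship to the rank sum when there are no ties: W = U + n*(n+1)/2,
# with n = size of the sample whose ranks are summed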

E Newton 16

Page 611: Applied Statistics - MIT

Boxplots of times to failure for control and stressed capacitors

(Figure: side-by-side boxplots of time to failure for the control group, cg, and the stressed group, sg.)

E Newton 17

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 612: Applied Statistics - MIT

S-Plus wilcox.test

> wilcox.test(cg, sg)

Exact Wilcoxon rank-sum test

data:  cg and sg
rank-sum statistic W = 95, n = 8, m = 10, p-value = 0.1011
alternative hypothesis: true mu is not equal to 0

E Newton 18

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 613: Applied Statistics - MIT

S-Plus parametric t-test

> t.test(cg,sg)

Standard Two-Sample t-Test

data:  cg and sg
t = 1.8105, df = 16, p-value = 0.089
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.103506 14.018506
sample estimates:
mean of x mean of y
  15.5375      9.08

E Newton 19

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 614: Applied Statistics - MIT

Kolmogorov-Smirnov Tests

The Kolmogorov-Smirnov test detects any kind of difference between two distributions (location, scale, skewness, and so on). The two-sample version compares two empirical cumulative distribution functions (step functions).

There is also a one-sample version for testing the distance between the empirical distribution of the observed data and a specified (ideal) distribution.

(Figure, two panels. Two-sample Test: cumulative frequency curves for Distribution 1 and Distribution 2, with the maximum gap between them marked. One-sample Test: cumulative frequency curves for the Observed Distribution and the Ideal Distribution, with the maximum gap marked.)

The test statistic is the maximum gap between the two cumulative distribution functions; its significance is judged against critical values that depend on the sample size(s) (tables or p-values).
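As a rough illustration, the two-sample statistic can be computed directly from its definition; a sketch only, assuming two sample vectors x and y (hypothetical names) with no tied values:

# maximum gap between the two empirical cdfs (no ties assumed)
z <- c(x, y)
jumps <- c(rep(1/length(x), length(x)), rep(-1/length(y), length(y)))
D <- max(abs(cumsum(jumps[order(z)])))   # two-sample KS statistic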

J Telford 20

Page 615: Applied Statistics - MIT

E Newton 21

Histograms of 100 random normal (2,1) deviates and 100 random gamma(4,2) deviates

(Figure: histogram of x, the normal deviates, and histogram of y, the gamma deviates, both on a density scale.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 616: Applied Statistics - MIT

Kolmogorov-Smirnov Tests

> ks.gof(x,y)

	Two-Sample Kolmogorov-Smirnov Test

data:  x and y
ks = 0.15, p-value = 0.2112
alternative hypothesis: cdf of x does not equal the cdf of y for at least one sample point.

> ks.gof(y)

	One sample Kolmogorov-Smirnov Test of Composite Normality

data:  y
ks = 0.0969, p-value = 0.0216
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
 mean of x standard deviation of x
  1.865857               0.9421928

E Newton 22

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 617: Applied Statistics - MIT

Kruskal-Wallis Test
• Inference for several independent samples
• Assume the distributions of the samples differ only (possibly) in location
• Xij = θ + τj + eij
• H0: τ1 = τ2 = ... = τa vs. H1: τi ≠ τj for some i ≠ j
• Rank all N = n1 + n2 + ... + na observations together
• Calculate the rank sums and rank averages in each group
• Calculate the KW test statistic, kw (see text and the sketch below)
• Reject H0 for large values of kw
• For large ni's, the null distribution of kw is approximately chi-square with a-1 degrees of freedom
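A sketch of the statistic for equal group sizes, mirroring the hand computation a few slides below (scm is the 7 x 4 matrix of scores, one column per teaching method; tie corrections are ignored):

# Kruskal-Wallis statistic: rank all N observations, then use the group rank sums
N <- length(scm)                                      # total number of observations
n <- nrow(scm)                                        # observations per group
R <- apply(matrix(rank(scm), n, ncol(scm)), 2, sum)   # rank sum in each group
kw <- (12/(N*(N+1))) * sum(R^2/n) - 3*(N+1)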

E Newton 23

Page 618: Applied Statistics - MIT

Test scores for four different teaching methods (page 582)

> scm<-matrix(score,7,4)
> scm
      [,1]  [,2]  [,3]  [,4]
[1,] 14.06 14.71 23.32 26.93
[2,] 14.26 19.49 23.42 29.76
[3,] 14.59 20.20 24.92 30.43
[4,] 18.15 20.27 27.82 33.16
[5,] 20.82 22.34 28.68 33.88
[6,] 23.44 24.92 32.85 36.43
[7,] 25.43 26.84 33.90 37.04

E Newton 24

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 619: Applied Statistics - MIT

plot.factor(f(grp),score)

(Figure: boxplots of test scores for each teaching method, f(grp) levels 1 to 4.)

E Newton 25

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 620: Applied Statistics - MIT

Ranks of Test Scores

E Newton 26

> scmr<-matrix(rank(score),7,4)
> scmr
     [,1] [,2] [,3] [,4]
[1,]    1  4.0 11.0   18
[2,]    2  6.0 12.0   21
[3,]    3  7.0 14.5   22
[4,]    5  8.0 19.0   24
[5,]    9 10.0 20.0   25
[6,]   13 14.5 23.0   27
[7,]   16 17.0 26.0   28

> tmp<-apply(scmr,2,sum)
> tmp
[1]  49.0  66.5 125.5 165.0

> (12/(28*29))*sum((tmp^2)/7)-3*29
[1] 18.13406

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 621: Applied Statistics - MIT

Kruskal-Wallis test in S-Plus

> kruskal.test(scm, col(scm))

Kruskal-Wallis rank sum test

data:  scm and col(scm)
Kruskal-Wallis chi-square = 18.139, df = 3, p-value = 0.0004
alternative hypothesis: two.sided

E Newton 27

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 622: Applied Statistics - MIT

ANOVA for test scores

> summary(aov(score~f(grp)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 15.93607 6.509182e-006
Residuals 24  416.7609  17.3650

E Newton 28

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 623: Applied Statistics - MIT

Friedman Test

• Inference for several matched samples
• a treatments, b blocks
• H0: τ1 = τ2 = ... = τa vs. H1: τi ≠ τj for some i ≠ j
• Rank the observations separately within each block
• Calculate the rank sum for each treatment
• Calculate the Friedman statistic, fr (see text and the sketch below)
• Reject H0 for large values of fr
• For b large, fr is approximately chi-square with a-1 degrees of freedom
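A sketch of fr for the same data, mirroring the hand computation on the next slides (scm has b = 7 blocks as rows and a = 4 treatments as columns; tie corrections are ignored):

# Friedman statistic: rank within each block (row), then sum ranks by treatment
a <- ncol(scm)
b <- nrow(scm)
Rb <- t(apply(scm, 1, rank))     # within-block ranks
R <- apply(Rb, 2, sum)           # rank sum for each treatment
fr <- (12/(a*b*(a+1))) * sum(R^2) - 3*b*(a+1)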

E Newton 29

Page 624: Applied Statistics - MIT

Ranks within Blocks (rows)

> scmrb<-t(apply(scm,1,rank))
> scmrb
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    1    2    3    4
[3,]    1    2    3    4
[4,]    1    2    3    4
[5,]    1    2    3    4
[6,]    1    2    3    4
[7,]    1    2    3    4

> tmp<-apply(scmrb,2,sum)
[1]  7 14 21 28

> (12/(4*7*5))*sum(tmp^2)-3*7*5
[1] 21

E Newton 30

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 625: Applied Statistics - MIT

Friedman test in S-Plus

> friedman.test(scm, col(scm), row(scm))

	Friedman rank sum test

data:  scm and col(scm) and row(scm)
Friedman chi-square = 21, df = 3, p-value = 0.0001
alternative hypothesis: two.sided

E Newton 31

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 626: Applied Statistics - MIT

ANOVA test score data with blocks

> summary(aov(score~f(grp)+f(blk)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 260.4768 5.220000e-015
   f(blk)  6  397.6377  66.2729  62.3804 4.558276e-011
Residuals 18   19.1232   1.0624

E Newton 32

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 627: Applied Statistics - MIT

Correlation Methods

• Pearson Correlation: measures only linear association.

• Spearman Correlation: correlation of the ranks

• Kendall’s Tau: based on number of concordant and discordant pairs.

E Newton 33

Page 628: Applied Statistics - MIT

Kendall’s Tau

• Assume: the n bivariate observations (X1,Y1),…,(Xn,Yn) are a random sample from a continuous bivariate population.

• H0: X and Y are independent, i.e. F(x,y) = F(x)F(y)
• Measure dependence by counting the number of concordant and discordant pairs
• Population correlation coefficient:
  τ = 2*P{(X2-X1)(Y2-Y1) > 0} - 1

E Newton 34

Page 629: Applied Statistics - MIT

Kendall’s Tau

For 1 ≤ i < j ≤ n:

Q((Xi,Yi),(Xj,Yj)) =  1  if (Xi-Xj)(Yi-Yj) > 0
                   =  0  if (Xi-Xj)(Yi-Yj) = 0
                   = -1  if (Xi-Xj)(Yi-Yj) < 0

K = sum over all pairs with i < j of Q((Xi,Yi),(Xj,Yj))

τ̂ = 2K / (n(n-1))
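A small sketch of K and τ̂ computed directly from this definition, for paired vectors x and y (hypothetical names):

# Kendall's K and tau-hat from the Q scores of all pairs i < j
s <- sign(outer(x, x, "-") * outer(y, y, "-"))   # sign of (xi-xj)(yi-yj) for every pair
K <- sum(s[upper.tri(s)])                        # sum of Q over pairs with i < j
tau.hat <- 2 * K / (length(x) * (length(x) - 1))

For x = (1,2,3,4) and y = (1,3,2,4) this gives K = 4 and τ̂ = 0.667, matching the example on the next slide.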

E Newton 35

Page 630: Applied Statistics - MIT

Kendall’s Tau example

E Newton 36

> m
     1  3  2  4
1   NA  1  1  1
2   NA NA -1  1
3   NA NA NA  1
4   NA NA NA NA

> 2*sum(m,na.rm=T)/12
[1] 0.6666667

> cor.test(c(1,2,3,4),c(1,3,2,4),method="k")

	Kendall's rank correlation tau

data:  c(1, 2, 3, 4) and c(1, 3, 2, 4)
normal-z = 1.3587, p-value = 0.1742
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.6666667

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 631: Applied Statistics - MIT

x=1:10
y=exp(x)

(Figure: scatterplot of y versus x.)

E Newton 37

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 632: Applied Statistics - MIT

Pearson Correlation

> cor.test(x,y,method="p")

Pearson's product-moment correlation

data:  x and y
t = 2.9082, df = 8, p-value = 0.0196
alternative hypothesis: true coef is not equal to 0
sample estimates:
      cor
0.7168704

E Newton 38

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 633: Applied Statistics - MIT

Spearman Correlation

> cor.test(x,y,method="s")

Spearman's rank correlation

data:  x and y
normal-z = 2.9818, p-value = 0.0029
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
  1

E Newton 39

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 634: Applied Statistics - MIT

Kendall Correlation

> cor.test(x,y,method="k")

Kendall's rank correlation tau

data:  x and y
normal-z = 4.0249, p-value = 0.0001
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
  1

E Newton 40

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 635: Applied Statistics - MIT

E Newton 41

Example - Environmental Data - Censored below the LOD (limit of detection)

(Figure: histograms of the two environmental variables, g and h.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 636: Applied Statistics - MIT

Resampling Methods

• Parametric methods – Inference based on assumed population distribution

• Resampling methods – No assumption about functional form of population distribution.

• Permutation tests - 2 sample problem
• Jackknife - delete one observation at a time
• Bootstrap - resample with replacement

E Newton 42

Page 637: Applied Statistics - MIT

Permutation Tests
• Goal: test a difference in means (2 sample problem)
• (x1, x2, ..., xn1) and (y1, y2, ..., yn2) are independent samples drawn from F1 and F2
• H0: F1 = F2, so all assignments of the labels x and y to the pooled observations are equally likely
• Choose an SRS of size n1 from the n1+n2 observations and label it x; label the rest y
• Calculate the value of the test statistic (e.g. difference in means) for each assignment -> the permutation distribution
• There are (n1+n2) choose (n1) possible distinct assignments (capacitor data set, Ex 14.7: n1 = 8, n2 = 10, number of assignments = 43,758), so in practice the distribution is often approximated by random sampling (see the sketch below)
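A minimal sketch of a Monte Carlo approximation to the permutation distribution of the difference in means, assuming the two sample vectors cg and sg from the capacitor example (the choice of nperm and the use of random relabeling instead of full enumeration are assumptions):

# approximate permutation test for the difference in means
nperm <- 1000                               # number of random label assignments
pooled <- c(cg, sg)
n1 <- length(cg)
obs <- mean(cg) - mean(sg)                  # observed statistic
perm <- numeric(nperm)
for (i in 1:nperm) {
  idx <- sample(length(pooled), n1)         # random relabeling
  perm[i] <- mean(pooled[idx]) - mean(pooled[-idx])
}
pval <- mean(abs(perm) >= abs(obs))         # two-sided permutation p-value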

E Newton 43

Page 638: Applied Statistics - MIT

Jackknife
• Goal: estimate the distribution and standard error of a statistic (e.g. the mean or median)
• Draw n samples of size n-1 from the original sample by deleting one observation at a time
• Calculate mj* = the mean (or median) of the j-th jackknife sample
• Jackknife standard error, where m̄* is the average of the mj*:

  JSE(m) = sqrt( ((n-1)/n) * sum over j = 1,...,n of (mj* - m̄*)^2 )

• JSE is exact for the mean, but not necessarily very good for the median (see the sketch below)
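A minimal sketch of the jackknife standard error of the median, assuming a sample vector x (hypothetical name):

# jackknife SE of the sample median
n <- length(x)
mj <- numeric(n)
for (j in 1:n) mj[j] <- median(x[-j])    # leave-one-out medians
jse <- sqrt(((n - 1)/n) * sum((mj - mean(mj))^2))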

E Newton 44

Page 639: Applied Statistics - MIT

Bootstrap
• Goal: estimate the distribution, standard error, and confidence interval of a statistic (e.g. mean, median, correlation)
• Draw B samples of size n, with replacement, from the original sample
• Calculate the statistic mj* from each bootstrap sample
• Bootstrap standard error, where m̄* is the average of the mj*:

  BSE(m) = sqrt( sum over j = 1,...,B of (mj* - m̄*)^2 / (B-1) )
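A minimal sketch of BSE for the median, done by hand with a sample vector x (hypothetical name); the bootstrap() function used on the following slides automates this and more:

# bootstrap SE of the sample median
B <- 1000
mb <- numeric(B)
for (j in 1:B) mb[j] <- median(sample(x, length(x), replace = T))
bse <- sqrt(sum((mb - mean(mb))^2) / (B - 1))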

E Newton 45

Page 640: Applied Statistics - MIT

Swiss Data Set in S-Plus

Fertility Data for Switzerland in 1888

SUMMARY: The swiss.fertility and swiss.x data sets contain fertility data for Switzerland in 1888.

ARGUMENTS:
swiss.fertility
  standardized fertility measure I[g] for each of 47 French-speaking provinces of Switzerland in approximately 1888.
swiss.x
  matrix with 5 columns that contain socioeconomic indicators for the provinces:
  1) percent of population involved in agriculture as an occupation;
  2) percent of "draftees" receiving highest mark on army examination;
  3) percent of population whose education is beyond primary school;
  4) percent of population who are Catholic; and,
  5) percent of live births who live less than 1 year (infant mortality).

SOURCE: Mosteller and Tukey (1977). Data Analysis and Regression. Addison-Wesley. Unpublished data used by permission of Francine van de Walle. Population Study Center, University of Pennsylvania, Philadelphia, PA.

E Newton 46

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 641: Applied Statistics - MIT

Bootstrap estimates and CI for variance of education

> educ<-swiss.x[,3]
> var(educ)
[1] 92.45606

> educ.boot<-bootstrap(educ,var,trace=F)
> summary(educ.boot)
Call:
bootstrap(data = educ, statistic = var, trace = F)

Number of Replications: 1000

Summary Statistics:
    Observed    Bias  Mean    SE
var    92.46 -0.5972 91.86 39.14

Empirical Percentiles:
     2.5%    5%   95% 97.5%
var 29.98 36.26 165.3   175

E Newton 47

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 642: Applied Statistics - MIT

Histogram of variance estimates obtained from 1000 bootstrap samples

(Figure: density histogram of the 1000 bootstrap replicates of var.)

E Newton 48

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 643: Applied Statistics - MIT

QQ plot of variance estimates

(Figure: normal QQ plot of the bootstrap replicates of var; quantiles of replicates versus quantiles of the standard normal.)

E Newton 49

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 644: Applied Statistics - MIT

Plot of LSAT scores by GPA for a sample of 15 schools

(Figure: scatterplot of lsat against gpa for the 15 schools.)

E Newton 50

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 645: Applied Statistics - MIT

Bootstrap estimates and CI for correlation between LSAT and GPA

> law.boot<-bootstrap(law.data, cor(lsat,gpa), trace=F)
> summary(law.boot)
Call:
bootstrap(data = law.data, statistic = cor(lsat, gpa), trace = F)

Number of Replications: 1000

Summary Statistics:
      Observed     Bias   Mean     SE
Param   0.7764 -0.00506 0.7713 0.1368

Empirical Percentiles:
       2.5%     5%   95%  97.5%
Param 0.449 0.5133 0.947 0.9623

BCa Confidence Limits:
        2.5%     5%    95%  97.5%
Param 0.2623 0.4138 0.9232 0.9413

E Newton 51

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 646: Applied Statistics - MIT

Histogram of correlation estimates obtained from 1000 bootstrap samples

(Figure: density histogram of the bootstrap replicates of the correlation, Param.)

E Newton 52

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 647: Applied Statistics - MIT

QQ Plot of correlation estimates

(Figure: normal QQ plot of the bootstrap replicates of the correlation, Param.)

E Newton 53

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 648: Applied Statistics - MIT

S-Plus Stack-loss data set

• Stack-loss Data
• SUMMARY: The stack.loss and stack.x data sets are from the operation of a plant for the oxidation of ammonia to nitric acid, measured on 21 consecutive days.
• ARGUMENTS:
  stack.loss - percent of ammonia lost (times 10).
  stack.x - matrix with 21 rows and 3 columns representing air flow to the plant, cooling water inlet temperature, and acid concentration as a percentage (coded by subtracting 50 and then multiplying by 10).
• SOURCE:
  Brownlee, K.A. (1965). Statistical Theory and Methodology in Science and Engineering. New York: John Wiley & Sons, Inc.
  Draper and Smith (1966). Applied Regression Analysis. New York: John Wiley & Sons, Inc.
  Daniel and Wood (1971). Fitting Equations to Data. New York: John Wiley & Sons, Inc.

E Newton 54

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 649: Applied Statistics - MIT

S-Plus stack loss data set

(Figure: scatterplot matrix of stack.loss, Air.Flow, Water.Temp, and Acid.Conc.)

E Newton 55

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 650: Applied Statistics - MIT

Summary of stack loss regression

> summary(tmp)

Call: lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stack)

Residuals:
    Min     1Q  Median    3Q   Max
 -7.238 -1.712 -0.4551 2.361 5.698

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960 -3.3557   0.0038
   Air.Flow   0.7156     0.1349  5.3066   0.0001
 Water.Temp   1.2953     0.3680  3.5196   0.0026
 Acid.Conc.  -0.1521     0.1563 -0.9733   0.3440

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-Squared: 0.9136
F-statistic: 59.9 on 3 and 17 degrees of freedom, the p-value is 3.016e-009

Correlation of Coefficients:
           (Intercept) Air.Flow Water.Temp
  Air.Flow      0.1793
Water.Temp     -0.1489  -0.7356
Acid.Conc.     -0.9016  -0.3389     0.0002

E Newton 56

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 651: Applied Statistics - MIT

Summary of stack loss bootstrap output

> summary(stack.boot)
Call:
bootstrap(data = stack, statistic = coef(lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., stack)), trace = F)

Number of Replications: 1000

Summary Statistics:
            Observed       Bias     Mean     SE
(Intercept) -39.9197  0.5691396 -39.3505 9.3731
   Air.Flow   0.7156  0.0016734   0.7173 0.1777
 Water.Temp   1.2953 -0.0264873   1.2688 0.4798
 Acid.Conc.  -0.1521 -0.0006978  -0.1528 0.1261

Empirical Percentiles:
                2.5%       5%       95%     97.5%
(Intercept) -56.0109 -53.4216 -21.92994 -18.75262
   Air.Flow   0.3903   0.4366   1.00261   1.04605
 Water.Temp   0.4004   0.5131   2.07381   2.23633
 Acid.Conc.  -0.4285  -0.3740   0.03282   0.05912

E Newton 57

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 652: Applied Statistics - MIT

Summary of stack loss bootstrap output

summary(stack.boot)

BCa Confidence Limits:
                2.5%       5%         95%     97.5%
(Intercept) -55.6465 -52.6606 -21.451125 -18.55810
   Air.Flow   0.3266   0.4120   0.992007   1.01855
 Water.Temp   0.5244   0.6193   2.264165   2.40956
 Acid.Conc.  -0.4629  -0.4101  -0.007724   0.04459

Correlation of Replicates:
            (Intercept) Air.Flow Water.Temp Acid.Conc.
(Intercept)     1.00000 -0.17636    0.09902   -0.80236
   Air.Flow    -0.17636  1.00000   -0.78822   -0.07635
 Water.Temp     0.09902 -0.78822    1.00000   -0.24463
 Acid.Conc.    -0.80236 -0.07635   -0.24463    1.00000

E Newton 58

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 653: Applied Statistics - MIT

Histograms of regression coefficients

E Newton 59

(Figure, four panels: density histograms of the bootstrap replicates of the (Intercept), Air.Flow, Water.Temp, and Acid.Conc. coefficients.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Page 654: Applied Statistics - MIT

QQ Plots of regression coefficients

E Newton 60

(Figure, four panels: normal QQ plots of the bootstrap replicates of the (Intercept), Air.Flow, Water.Temp, and Acid.Conc. coefficients.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.