Post on 20-Jan-2016

Page 1: Statistics Primer

Statistics Primer

Xiayu (Stacy) Huang

Bioinformatics Shared Resource

Email: [email protected]

Sanford | Burnham Medical Research Institute

Page 2: Statistics Primer

Outline

Overview of basic statistics
• Introduction
• Descriptive statistics
• Inferential statistics

Most common statistical tests and their applications
• T test
• Power analysis using t test

Page 3: Statistics Primer

What is statistics?

The American Statistical Association (ASA) website defines statistics as the science of collecting, analyzing, interpreting and presenting data.

Using statistics to make decisions can be a double-edged sword:

• In the 1980s, Marriott conducted an extensive survey of potential customers about their attitudes toward current hotel offerings. After analyzing the data, the company launched Courtyard by Marriott, which has been a huge success.

• Coca-Cola performed a major consumer study in 1985 and, based on the results, decided to reformulate Coke, its flagship drink. After a huge public outcry, Coca-Cola had to backtrack and bring the original formulation back to market.

Page 4: Statistics Primer

History of statistics

17th-18th century
• Jakob Bernoulli: Bernoulli number, Bernoulli trial, Bernoulli process
• Thomas Bayes: Bayes theorem

19th century
• Carl Friedrich Gauss: Gaussian distribution

20th century
• Karl Pearson: Pearson correlation, chi-square distribution
• William Gosset: Student’s t
• Ronald Aylmer Fisher: ANOVA, maximum likelihood

Page 5: Statistics Primer

Why is statistics important to biologists?

• Designing experiments

• Analyzing biological data and understanding analysis results

• Preparing manuscripts and grant applications

Typical questions and tasks:

• How many DEGs?

• How many replicates for my microarray experiment?

• No replicates = no statistics?

• Identifying outliers, normalization/transformation, statistical tests, etc.

Page 6: Statistics Primer

Study Scheme

Study Hypothesis

Design Study

Conduct Study and Collect data

Data Analysis

Make Conclusions

Data analysis consists of:

• Summarizing data using descriptive statistics

• Hypothesis testing using inferential statistics
  1. Choose statistical test
  2. Compute test statistic
  3. Compute p-value
  4. Compare p-value and α

Page 7: Statistics Primer

Branches of statistics

Descriptive statistics (summary statistics)
• Summarize data graphically or numerically
• Lead to hypothesis generation

Inferential statistics
• Distinguish true differences from random variation
• Allow hypothesis testing

Page 8: Statistics Primer

Types of data

Data can be qualitative or quantitative. Examples:

• Qualitative: gender, genotype, tumor location

• Qualitative or quantitative: performance, grade of toxicity, disease stage

• Quantitative: age, array intensities

Page 9: Statistics Primer

Descriptive statistics—central tendency

Mean—average

Median—middle value of sorted data

Mode—most frequently observed value

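Example: age (13 observations)

Raw data: 24, 27, 22, 25, 24, 23, 28, 23, 25, 26, 22, 29, 24

Sorted data: 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29

Mean = (24 + 27 + ... + 24)/13 = 24.8

Median = 24 (the 7th of the 13 sorted values)

Mode = 24, with a frequency of 3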
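The three measures above can be checked with a few lines of Python. This is a minimal sketch using only the standard library `statistics` module; the variable names are illustrative, and the ages are the 13 values from the slide.

```python
from statistics import mean, median, multimode

# Ages from the slide example (13 observations, in their original order).
ages = [24, 27, 22, 25, 24, 23, 28, 23, 25, 26, 22, 29, 24]

age_mean = mean(ages)         # arithmetic average
age_median = median(ages)     # middle value of the sorted data
age_modes = multimode(ages)   # most frequent value(s), returned as a list

print(f"mean={age_mean:.1f}, median={age_median}, mode={age_modes}")
```

`multimode` returns a list because a data set can have more than one most frequent value; here it contains only 24.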

Page 10: Statistics Primer

Descriptive statistics—dispersion

Example: age, sorted data: 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29

Range = highest value - lowest value = 29 - 22 = 7

Sample variance (s²) and standard deviation (s)

$$s^2 = \frac{(22-\text{mean})^2 + (22-\text{mean})^2 + \dots + (29-\text{mean})^2}{13-1} \approx 4.86, \qquad s = \sqrt{s^2} \approx 2.2$$

Values beyond two standard deviations from the mean can be considered “outliers” (> mean + 2s = 24.8 + 2 × 2.2 = 29.2 or < mean - 2s = 24.8 - 2 × 2.2 = 20.4)

Standard error of the mean (SEM)

$$\text{SEM} = \frac{s}{\sqrt{n}} = \frac{2.2}{\sqrt{13}} = 0.61$$
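The dispersion measures on this slide can be reproduced as follows. This is a sketch with the standard library only; the variable names are illustrative. Note that the unrounded sample variance comes out near 4.86, while squaring the rounded s = 2.2 gives 4.84.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Sorted ages from the slide example.
ages = [22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29]

data_range = max(ages) - min(ages)   # highest value minus lowest value
s2 = variance(ages)                  # sample variance (divides by n - 1)
s = stdev(ages)                      # sample standard deviation
sem = s / sqrt(len(ages))            # standard error of the mean

# Rule of thumb from the slide: flag values beyond mean +/- 2s.
m = mean(ages)
lower, upper = m - 2 * s, m + 2 * s
outliers = [x for x in ages if x < lower or x > upper]

print(f"range={data_range}, s2={s2:.2f}, s={s:.2f}, SEM={sem:.2f}")
print(f"outlier bounds=({lower:.1f}, {upper:.1f}), outliers={outliers}")
```

With these data no value falls outside the two-standard-deviation bounds, so the outlier list is empty.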

Page 11: Statistics Primer

Descriptive statistics—data distribution

Histogram (x: bin, y: frequency)
• Graphical representation showing the distribution of data
• Summary graph showing how many data points fall in various ranges

Sorted data: 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29

Frequency and percentage table (each bin includes its upper bound):

Bin      Frequency   Proportion
20-22    2           0.154
22-24    5           0.385
24-26    3           0.231
26-28    2           0.154
28-30    1           0.077

Plotting the frequencies gives the histogram (frequency distribution); plotting the proportions gives the histogram of the probability distribution.
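The frequency table can be rebuilt by counting how many values fall in each bin. The sketch below assumes each bin includes its upper edge (so 22 counts toward 20-22), which is the convention that reproduces the slide's frequencies; the names are illustrative.

```python
# Sorted ages from the slide example.
ages = [22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29]

# Bin edges from the slide. Assumption: a bin lo-hi covers lo < x <= hi.
edges = [20, 22, 24, 26, 28, 30]

table = []
for lo, hi in zip(edges, edges[1:]):
    count = sum(1 for x in ages if lo < x <= hi)
    table.append((f"{lo}-{hi}", count, count / len(ages)))

# Print bin label, frequency and proportion of all observations.
for label, count, proportion in table:
    print(f"{label:>6} {count:>3} {proportion:.3f}")
```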

Page 12: Statistics Primer

Different data distributions

Descriptive statistics—data distribution

Approximately normal distribution, e.g. height of people, length of dogs

Right skewed distribution, e.g. fold change (FC) of microarray data

Left skewed distribution, e.g. distribution of age at retirement

Page 13: Statistics Primer

Normal (or Gaussian) distribution

• Bell-shaped curve

• Symmetrical about the mean

• Mean, median and mode are equal

• ~68% of data points fall within 1 sd of the mean

• ~95% of data points fall within 2 sd of the mean

• ~99.7% of data points fall within 3 sd of the mean
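The 68/95/99.7 percentages follow from the normal cumulative distribution function. A minimal check, using `statistics.NormalDist` from the Python standard library (the helper name `share_within` is illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k sd of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

Because the result depends only on the distance from the mean in sd units, the same fractions hold for any mean and standard deviation.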

Page 14: Statistics Primer

Installing GraphPad Prism

You can install Prism on Institute supplied computers, including home and personal computers.

http://graphpad.com/paasl/index.cfm?sitecode=burnhm

SERIAL NUMBERS:

Macintosh version: contact IT ([email protected]) to get the serial number

Windows version: contact IT ([email protected]) to get the serial number

Page 15: Statistics Primer

Calculating descriptive statistics in Excel

Page 16: Statistics Primer

Calculating descriptive statistics in Prism

Page 17: Statistics Primer

Calculating descriptive statistics in Prism

Page 18: Statistics Primer

Graphically displaying descriptive statistics

Histogram

Mean error bar plot

Line plot w/o error bar

Page 19: Statistics Primer

Graphically displaying descriptive statistics in Prism

Mean error bar plot

Histogram and frequency distribution

Page 20: Statistics Primer

Graphically displaying descriptive statistics in Prism

Group line plot

Group line plot with error bar

Group line plot without error bar

Page 21: Statistics Primer

Choosing the right descriptive statistics

Normal distribution: mean and standard deviation

Skewed distribution: transform data to normal distribution
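As an illustration of transforming skewed data, the sketch below applies a log2 transform to right-skewed values. The slide does not name a specific transformation, so log2 is an assumption here (it is a common choice for ratio-like data), and the values are hypothetical, not taken from the slides.

```python
from math import log2
from statistics import mean, median

# Hypothetical right-skewed values (not from the slides), e.g. expression ratios.
raw = [1, 2, 2, 4, 4, 8, 16, 64]

# Assumption: a log2 transform is appropriate for these strictly positive values.
logged = [log2(x) for x in raw]

# In a right-skewed sample the mean sits well above the median;
# after the transform the two should be much closer together.
print(f"raw:  mean={mean(raw):.2f}, median={median(raw):.2f}")
print(f"log2: mean={mean(logged):.2f}, median={median(logged):.2f}")
```

A shrinking gap between mean and median is only a rough indicator of symmetry; a histogram or a normality check of the transformed data is still advisable.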

Page 22: Statistics Primer

Outline

Overview of basic statistics
• Brief introduction
• Descriptive statistics
• Inferential statistics

Most common statistical tests and their applications
• T test
• Power analysis using t test

Page 23: Statistics Primer

Inferential statistics

Parametric
• Interval or ratio measurements
• Continuous variables
• Usually assumes the data are normally distributed

Nonparametric
• Ordinal or nominal measurements
• Discrete variables
• Makes no assumption about how the data are distributed

Page 24: Statistics Primer

Inferential statistics-hypothesis

Null hypothesis (H0)
• e.g. new drug effect = old drug effect
• e.g. tumor growth of MT = tumor growth of WT

Alternative hypothesis (HA)
• is the opposite of the null hypothesis
• is generally the hypothesis that the researcher believes to be true
• e.g. new drug effect ≠ or > old drug effect
• e.g. tumor growth of MT ≠ or < tumor growth of WT

Page 25: Statistics Primer

Inferential statistics-one and two sided tests

Hypothesis tests can be one or two sided (tailed)

One sided tests are directional:

H0 : new drug effect ≤ old drug effect

HA : new drug effect > old drug effect

Two sided tests are not directional:

H0 : new drug effect = old drug effect

HA : new drug effect ≠ old drug effect

Page 26: Statistics Primer

Inferential statistics-type I and type II errors

General case:

“Measured”    “Actual situation”
              No difference (H0)            Difference (HA)
-             Correct decision (TN), 1-α    Type II error (FN), β
+             Type I error (FP), α          Correct decision (TP), 1-β

Example: FOB screening (bowel cancer)

“Measured”    “Actual situation”
              No difference (H0)    Difference (HA)    Total
-             1820                  10                 1830
+             180                   20                 200
Total         2000                  30

• Correct decision (TN): 1-α = 1820/2000 = 0.91
• Type I error (FP): α = 180/2000 = 0.09
• Type II error (FN): β = 10/30 = 0.33
• Correct decision (TP): 1-β = 20/30 = 0.67
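The four rates in the FOB screening example come straight from the counts in the table. A small sketch, with illustrative variable names:

```python
# Counts from the FOB screening table on the slide.
tn, fn = 1820, 10   # test negative: truly no disease / truly disease
fp, tp = 180, 20    # test positive: truly no disease / truly disease

alpha = fp / (fp + tn)         # type I error rate (false positive rate)
beta = fn / (fn + tp)          # type II error rate (false negative rate)
power = tp / (fn + tp)         # 1 - beta (true positive rate)
specificity = tn / (fp + tn)   # 1 - alpha (true negative rate)

print(f"alpha={alpha:.2f}, beta={beta:.2f}, "
      f"power={power:.2f}, 1-alpha={specificity:.2f}")
```

Each rate is computed within a column of the table: α and 1-α use the 2000 people without a true difference, β and 1-β use the 30 people with one.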

Page 27: Statistics Primer

Inferential statistics-type I and type II errors

• Control type I and type II errors
  • There is an inverse relationship between type I and type II errors
  • Choose which error is more important to control
    • e.g. controlling type I error (FP) is more important than type II error (FN) for microarray data
    • e.g. controlling type II error (FN) is more important than type I error (FP) for a cancer screening test

• How to choose type I and type II error rates for a statistical test?
  • Common choices (α = 5%, β = 20%)
  • Exploratory study (α = 10%, β = 10%)
  • Confirmatory study (α = 1%, β = 10%)

Page 28: Statistics Primer

Inferential Statistics-P-value

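• The probability of observing a difference at least as large as the one seen, by chance alone, if the null hypothesis is true

• Computed from the test statistic

• Compared with the cutoff α, which is the false positive rate accepted for the test (the p-value itself is not the false positive rate)

• A p-value below the cutoff (α) is referred to as “statistically significant”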

Page 29: Statistics Primer

Inferential Statistics-Power

Power (1-β, also known as the true positive rate (TP))

• Probability of detecting a significant scientific difference when it does exist

Power depends on:
• Sample size (n)
• Standard deviation (s)
• Size of the difference you want to detect (δ)
• False positive rate (α)

Effect size = δ/s

Page 30: Statistics Primer

Study scheme

Study Hypothesis

Design Study

Conduct Study and Collect data

Data Analysis

Make Conclusions

Data analysis consists of:

• Calculating and displaying descriptive statistics

• Hypothesis testing using inferential statistics
  1. Choose statistical test
  2. Compute test statistic
  3. Compute p-value
  4. Compare p-value and α

Page 31: Statistics Primer

How to choose an appropriate statistical test?

• Type of data: quantitative, qualitative

• Type of research question: association, correlation, comparison

• Data structure: independent, paired, matched

Page 32: Statistics Primer

Statistical test decision making tree

For qualitative or non-numerical data

For quantitative or numerical data

Page 33: Statistics Primer

Statistical test decision making tree

• Two sample comparison

• Multiple sample comparison

• Relationship between variables

Page 34: Statistics Primer

Outline

Overview of basic statistics
• Brief introduction
• Descriptive statistics
• Inferential statistics

Most common statistical tests and their applications
• T test
• Power analysis using t test

Page 35: Statistics Primer

Student’s t test

Guinness employee William Sealy Gosset published the 'Student's t-test' in 1908

Page 36: Statistics Primer

Types of t test

One sample t test: tests whether a sample mean differs significantly from a given known mean

Unpaired t test: tests whether two independent sample means differ significantly

Paired t test: tests whether two dependent sample means differ significantly (e.g. mean of pre and post treatment for the same set of patients)
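To make the one sample and paired variants concrete, here is a minimal sketch of their test statistics using only the standard library. The function names and the measurements are hypothetical, not from the slides, and the code stops at the t statistic (turning it into a p-value needs the t distribution, e.g. from a statistics package).

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for H0: population mean == mu0 (df = n - 1)."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

def paired_t(before, after):
    """Paired t statistic: a one sample t test on per-subject differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0)

# Hypothetical pre/post measurements for the same five subjects.
pre = [10, 12, 9, 11, 13]
post = [12, 15, 10, 14, 16]

print(f"one sample t (pre vs known mean 10): {one_sample_t(pre, 10):.3f}")
print(f"paired t (post vs pre): {paired_t(pre, post):.3f}")
```

The paired test reduces to a one sample test on the differences, which is why it requires the two samples to be matched subject by subject.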

Page 37: Statistics Primer

Application of t test in biology

Proteomics experiment: WT vs. MT

Microarray experiment: WT vs. MT

Replicates: technical reps, biological reps

You need at least two replicates in each condition to run a t test; otherwise the t test is invalid and you won’t have statistics.

Page 38: Statistics Primer

Two sample unpaired t test

Assumptions
• Data are approximately normally distributed
• The samples have been independently and randomly selected
• The groups being compared have similar variances

Hypothesis (two sided or one sided)

$$H_0: \mu_1 - \mu_2 = 0 \qquad H_A: \mu_1 - \mu_2 \neq 0$$

Test statistic

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{1/n_1 + 1/n_2}} \sim t_{n_1 + n_2 - 2}$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

where
• $\bar{X}_1, \bar{X}_2$ -- sample means
• $\mu_1, \mu_2$ -- population means
• $s_1, s_2$ -- sample standard deviations
• $n_1, n_2$ -- sample sizes
• $s_p^2$ -- pooled sample variance

Page 39: Statistics Primer

Sample data

1st question to be answered:

Do the two treatments have different effects on patients’ remission time from cancer?

Patient   Treatment   Remission time from cancer (years)
1         Drug        7
2         Drug        5
3         Drug        2
4         Drug        8
5         Drug        3
6         Drug        4
7         Drug        10
8         Drug        7
9         Drug        4
10        Drug        9
11        Placebo     4
12        Placebo     3
13        Placebo     1
14        Placebo     6
15        Placebo     2
16        Placebo     4
17        Placebo     9
18        Placebo     5
19        Placebo     3
20        Placebo     8
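The unpaired, pooled-variance t statistic for these remission times can be computed by hand as sketched below. The function name is illustrative, and the critical value 2.101 (two sided, α = 5%, 18 degrees of freedom) is taken from a standard t table rather than computed, so treat the significance check as an approximation of what Prism or Excel would report as a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Remission times (years) from the slide's sample data.
drug = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]

def pooled_t(x, y):
    """Two sample unpaired t statistic with pooled variance, plus its df."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t_stat, df = pooled_t(drug, placebo)

# Two sided 5% critical value for 18 df, from a standard t table (assumption).
T_CRIT_18 = 2.101
print(f"t={t_stat:.3f}, df={df}, significant at 5%: {abs(t_stat) > T_CRIT_18}")
```

With these data the statistic is about 1.20, below the critical value, so under the test's assumptions the difference between drug and placebo is not statistically significant at the 5% level.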

Page 40: Statistics Primer

Summarizing sample data using descriptive statistics

Page 41: Statistics Primer

Hypothesis testing of sample data using inferential statistics

Step 1: Choosing an appropriate statistical test

Step 2: Performing the statistical test in software

Step 3: Making conclusions

Page 42: Statistics Primer

Statistical test decision making tree

Page 43: Statistics Primer

Two sample t test in Prism-normality check

Page 44: Statistics Primer

Two sample t test in Prism

Page 45: Statistics Primer

Two sample t test in Excel

Page 46: Statistics Primer

Power analysis using two sample t test

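2nd question to be answered:

How many patients do we need in order to detect a significant difference between the two treatments?

The sample size (N) depends on: α, β, δ/s, test efficiency, and K:1 imbalance.

$$n = \frac{s^2 \left(t_{1-\alpha/2} + t_{1-\beta}\right)^2}{\delta^2} = \frac{\left(t_{1-\alpha/2} + t_{1-\beta}\right)^2}{(\delta/s)^2}$$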
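A rough sample size estimate can be sketched without G*Power. The code below is not the slide's exact method: it uses the textbook normal approximation for two equal-sized independent groups, which carries a factor of 2 and replaces the t quantiles with normal (z) quantiles. Both choices, and the function name, are assumptions of this sketch; G*Power's t-based answer is typically slightly larger.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per group for a two sided, two sample comparison.

    effect_size is delta / s. Normal quantiles stand in for t quantiles,
    so the result tends to be slightly below an exact t-based calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # quantile for the two sided alpha
    z_beta = z.inv_cdf(power)            # quantile for power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

for d in (0.5, 0.8, 1.0):
    print(f"effect size {d}: about {n_per_group(d)} per group")
```

Smaller effect sizes, a smaller α, or a higher target power all increase the required number of patients.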

Page 47: Statistics Primer

Power analysis of t test in G*Power

Page 48: Statistics Primer

Power analysis of t test in G*Power

Page 49: Statistics Primer

Basic statistics tools

Statistics software and packages:

1. Excel and add-ins: EZAnalyze, Analysis ToolPak
2. Prism (supported by our institute)
3. SPSS, Statistica (commercial)
4. SAS (commercial) and R
5. G*Power

Basic statistics books:

1. Intro Stats, SDSU, 2nd edition, De Veaux, Velleman, Bock
2. Choosing and Using Statistics: A Biologist's Guide
3. Introduction to Statistics for Biology
4. Biostatistical Analysis, fifth edition, Jerrold H. Zar

Statistics videos:

1. http://www.microbiologybytes.com/maths/videos
2. http://www.youtube.com: descriptive statistics, basic statistics, install 2007 Excel data analysis add-ins…

Page 50: Statistics Primer

Next.....

My presentation will be posted on website: http://bsrweb.burnham.org/

I am located in building 10, Office 2405, ext 3916

Feel free to come or call or send e-mail to ask questions ([email protected])

Group email: [email protected]