XIAO WU: DATA ANALYSIS & BASIC STATISTICS


Upload: shon-nelson

Post on 06-Jan-2018


Page 1: XIAO WU DATA ANALYSIS & BASIC STATISTICS

XIAO WU
[email protected]

DATA ANALYSIS & BASIC STATISTICS

Page 2: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PURPOSE OF THIS WORKSHOP

• Statistics as a useful tool to analyze results

• Basic terminology and most commonly used tests

• Exposure to more advanced statistical tools

Page 3: XIAO WU DATA ANALYSIS & BASIC STATISTICS

WHY DO WE NEED STATISTICS?

Page 4: XIAO WU DATA ANALYSIS & BASIC STATISTICS

WHY DO WE NEED STATISTICS?

• Summary
• Classification
• Interpretation
• Pattern searching
• Abnormality identification
• Prediction
  • Interpolation
  • Extrapolation

Page 5: XIAO WU DATA ANALYSIS & BASIC STATISTICS

SUMMARY

http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/

Page 6: XIAO WU DATA ANALYSIS & BASIC STATISTICS

SUMMARY

• Mean, median, mode
• Variance, standard deviation
• Max, min values and range
• Quartiles

http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/

Page 7: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE

Firm A
• Mean: $5,800

Firm B
• Mean: $5,000

Page 8: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE

Firm A
• Mean: $5,800
• Median: $4,000
• SD: $7,270
• 3rd Quartile: $4,000
• 1st Quartile: $500

Firm B
• Mean: $5,000
• Median: $5,000
• SD: $203
• 3rd Quartile: $5,175
• 1st Quartile: $4,825

Page 9: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE

Firm B
#    Salary ($)
1    4650
2    4700
3    4750
4    4800
5    4850
6    4900
7    4950
8    5000
9    5050
10   5100
11   5150
12   5200
13   5250
14   5300
15   5350

Firm A
#    Salary ($)
1    20000
2    4000
3    4000
4    500
5    500
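The summary figures quoted for the two firms can be recomputed from the salary tables. The sketch below uses Python's standard `statistics` module and assumes the five-salary table belongs to Firm A and the fifteen-salary table to Firm B (inferred from the means). It uses the population standard deviation, which reproduces the $7,270 quoted for Firm A. For Firm B it gives about $216 rather than the $203 on the slide, whose convention is not stated.

```python
import statistics

# Salaries transcribed from the two tables on this slide.
firm_a = [20000, 4000, 4000, 500, 500]
firm_b = list(range(4650, 5351, 50))  # 4650, 4700, ..., 5350


def summarize(salaries):
    """Return basic descriptive statistics for a list of salaries."""
    return {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "population_sd": statistics.pstdev(salaries),
        "min": min(salaries),
        "max": max(salaries),
        "range": max(salaries) - min(salaries),
    }


summary_a = summarize(firm_a)
summary_b = summarize(firm_b)

for name, summary in (("Firm A", summary_a), ("Firm B", summary_b)):
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

The means differ by only $800, but the medians and spreads show that most Firm A salaries sit well below its mean, which is pulled up by a single $20,000 salary.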

Page 10: XIAO WU DATA ANALYSIS & BASIC STATISTICS

CLASSIFICATION

Identification of variable
• Independent vs. dependent
• Numeric vs. categorical

Variable
• Categorical
  • Nominal
  • Ordinal
• Numeric
  • Continuous
  • Discrete

Page 11: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PATTERN SEARCHING

• Distribution of data
• Some commonly used distributions
  • Uniform
  • Binomial
  • Poisson
  • …
• Central limit theorem

http://www.mathwave.com/img/art/graphs_pdf2.gif

Page 12: XIAO WU DATA ANALYSIS & BASIC STATISTICS

UNIFORM

• Every outcome has an equal chance
• Examples:
  • Flipping a coin
  • Rolling a die
• What if you need to flip multiple times?

Page 13: XIAO WU DATA ANALYSIS & BASIC STATISTICS

BINOMIAL

• Two outcomes, probability p and 1-p
• Multiple trials: n
• Examples:
  • Flipping a coin 100 times
  • Germination of multiple seeds

https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
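The coin-flipping example can be made concrete with the binomial probability mass function, P(k) = C(n, k) p^k (1-p)^(n-k). The following is a minimal sketch using only the standard library. The helper name `binomial_pmf` is illustrative and not from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Fair coin flipped 100 times: chance of exactly 50 heads.
p_50_heads = binomial_pmf(50, 100, 0.5)

# The probabilities over all possible head counts sum to 1.
total = sum(binomial_pmf(k, 100, 0.5) for k in range(101))

print(f"P(50 heads in 100 flips) = {p_50_heads:.4f}")
print(f"Sum over k = 0..100      = {total:.4f}")
```

Even the single most likely outcome, exactly 50 heads, has a probability of only about 8%.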

Page 14: XIAO WU DATA ANALYSIS & BASIC STATISTICS

POISSON

• Counts of rare, independent events
• Each occurring with probability, or at an average rate, p
• Example: radioactive decay

http://kaffee.50webs.com/Science/images/alpha_decay.gif
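One way to see why the Poisson distribution suits counts of rare events is to compare it with a binomial distribution that has many trials and a small success probability. The sketch below assumes hypothetical values (1,000 trials, each with probability 0.003) chosen only for illustration. The function names are not from the slides.

```python
from math import comb, exp, factorial


def poisson_pmf(k, rate):
    """Probability of observing exactly k events at the given average rate."""
    return rate**k * exp(-rate) / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical setting: many trials, each with a small probability.
n, p = 1000, 0.003
rate = n * p  # expected number of events

for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, rate), 4))

# Largest pointwise difference between the two distributions for k = 0..10.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, rate)) for k in range(11))
print("largest gap:", max_gap)
```

The two columns agree closely, which is why a single rate parameter is enough to describe such counts.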

Page 15: XIAO WU DATA ANALYSIS & BASIC STATISTICS

THE MOST IMPORTANT DISTRIBUTION

Page 16: XIAO WU DATA ANALYSIS & BASIC STATISTICS

NORMAL DISTRIBUTION

• Central limit theorem
  • The mean of independent samples from almost any distribution (with finite variance) converges to a normal distribution
  • Large sample size → approximately normal distribution

Parameters:
• mean
• standard deviation

https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
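The central limit theorem can be checked with a small simulation. The sketch below draws from a uniform distribution on [0, 1), which is not bell-shaped, and examines the means of repeated samples. The sample size of 30, the 5,000 repetitions, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
REPETITIONS = 5000


def sample_mean(n):
    """Mean of n independent draws from the uniform distribution on [0, 1)."""
    return statistics.mean([random.random() for _ in range(n)])


means = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]

# Theory: uniform(0, 1) has mean 0.5 and SD sqrt(1/12), so the sample mean
# should have mean 0.5 and SD sqrt(1/12) / sqrt(n).
expected_sd = sqrt(1 / 12) / sqrt(SAMPLE_SIZE)
observed_mean = statistics.mean(means)
observed_sd = statistics.pstdev(means)

# For a normal distribution, about 68% of values lie within one SD of the mean.
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / REPETITIONS

print(f"mean of sample means: {observed_mean:.4f}")
print(f"SD of sample means:   {observed_sd:.4f} (theory: {expected_sd:.4f})")
print(f"share within one SD:  {within_one_sd:.3f}")
```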

Page 17: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PATTERN SEARCHING

Hypothesis testing
• Difference between two populations
  • Z-test or t-test?
  • What does p-value mean?
  • Family-wise error – Bonferroni correction
• More than two possibilities
  • Chi square test
  • Fisher’s exact test
• More than two variables
  • ANOVA
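The family-wise error problem and the Bonferroni correction can be illustrated with a short calculation. The sketch assumes 20 independent tests at a significance level of 0.05. Both numbers are hypothetical, and independence is an assumption made to keep the formula simple.

```python
def family_wise_error(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests


ALPHA = 0.05
NUM_TESTS = 20  # hypothetical number of comparisons

uncorrected = family_wise_error(ALPHA, NUM_TESTS)
per_test = bonferroni_threshold(ALPHA, NUM_TESTS)
corrected = family_wise_error(per_test, NUM_TESTS)

print(f"family-wise error without correction: {uncorrected:.3f}")
print(f"Bonferroni per-test threshold:        {per_test:.4f}")
print(f"family-wise error with correction:    {corrected:.3f}")
```

Without correction, the chance of at least one false positive is about 64%. Testing each hypothesis at 0.05 / 20 brings it back below 5%.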

Page 18: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE 1

SAT score is related to gender
• Null hypothesis
• Alternative hypothesis (3 possibilities)
• One or two tail?
• Z or T test?
• p=0.07, conclusion?
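The choice between a one-tailed and a two-tailed test changes the p-value for the same test statistic. The sketch below converts a z statistic into both p-values using the standard normal distribution. The value z = 1.81 is hypothetical, picked because it gives a two-tailed p-value close to the 0.07 in this example. No SAT data are assumed.

```python
from math import erfc, sqrt


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))


ALPHA = 0.05
z = 1.81  # hypothetical test statistic

one_tailed = normal_upper_tail(z)
two_tailed = 2 * normal_upper_tail(abs(z))

print(f"one-tailed p = {one_tailed:.4f}")
print(f"two-tailed p = {two_tailed:.4f}")
print("reject null (two-tailed)?", two_tailed < ALPHA)
```

With a two-tailed p-value of about 0.07 and a 0.05 significance level, the null hypothesis is not rejected. The same statistic would pass a one-tailed test, which is why the direction of the alternative hypothesis has to be fixed before looking at the data.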

Page 19: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE 2

Predictors of stroke
• Age
• Hypertension
• Gender
• …

Page 20: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE 3

Genome-wide association studies
• Scanning markers across the DNA of many people to find genetic variations associated with certain diseases

Page 21: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PATTERN SEARCHING

Hypothesis testing
• One variable
  • Z-test or t-test?
  • What does p-value mean?
  • Family-wise error – Bonferroni correction
• Compare two categorical variables
  • Chi square test
  • Fisher’s exact test
• More than two variables
  • ANOVA

Page 22: XIAO WU DATA ANALYSIS & BASIC STATISTICS

CHI SQUARE

Punnett Square
• A cross between two pea plants yields 880 plants: 639 green, 241 yellow
• Hypothesis: The green allele is dominant and both parents are heterozygous.

http://www2.lv.psu.edu/jxm57/irp/chisquar.html

Page 23: XIAO WU DATA ANALYSIS & BASIC STATISTICS

CHI SQUARE

     G            g
G    GG (green)   Gg (green)
g    Gg (green)   gg (yellow)

• 75% green
• 25% yellow

Page 24: XIAO WU DATA ANALYSIS & BASIC STATISTICS

CHI SQUARE

                          Green   Yellow
Observed (o)              639     241
Expected (e)              660     220
Deviation (d = o – e)     -21     21
Deviation squared (d^2)   441     441
d^2/e                     0.668   2.005
Sum                       2.673

Degrees of freedom: number of categories – 1 = 1
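The table above can be reproduced in a few lines, and with one degree of freedom the p-value has a closed form: a chi-square variable with one degree of freedom is a squared standard normal, so P(X > x) = erfc(sqrt(x / 2)). The following is a minimal sketch using only the standard library. The variable names are illustrative.

```python
from math import erfc, sqrt

# Observed counts and the 3:1 ratio predicted by the Punnett square.
observed = {"green": 639, "yellow": 241}
expected_ratio = {"green": 0.75, "yellow": 0.25}

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

# Valid only for one degree of freedom (two categories).
p_value = erfc(sqrt(chi_square / 2))

print(f"expected counts: {expected}")
print(f"chi-square = {chi_square:.3f}")
print(f"p-value    = {p_value:.3f}")
```

The statistic of roughly 2.67 is below 3.841, the usual critical value for one degree of freedom at the 0.05 level, so the data do not contradict the hypothesis of a dominant green allele with heterozygous parents.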

Page 25: XIAO WU DATA ANALYSIS & BASIC STATISTICS

CHI SQUARE

Page 26: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PREDICTION

• Regression
  • Linear regression
  • Multiple linear regression
• Accuracy vs. simplicity
• Validation
  • leave-k-out

http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXv-U/s1600/actnactn+1.png
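Linear regression and leave-k-out validation can be sketched together for the simplest case: one predictor and k = 1. The data points below are made up for illustration, and the function names are not from the slides. Each point is held out in turn, the line is fitted on the rest, and the error is measured on the held-out point.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def leave_one_out_mse(xs, ys):
    """Mean squared prediction error when each point is held out in turn."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, intercept = fit_line(xs, ys)
loo_mse = leave_one_out_mse(xs, ys)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"leave-one-out MSE = {loo_mse:.4f}")
```

The held-out error is a fairer estimate of predictive accuracy than the error on the training data, because a more complex model can always fit its own training points more closely.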

Page 27: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE

• Use brain structural measurements to predict a subject’s performance on a picture vocabulary test
• 144 total structural measurements
• 521 subjects
• First step: eliminate unnecessary variables
  • All zeros?
  • Highly correlated pairs
  • Variables that do not correlate well with performance score
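The three elimination criteria can be sketched as a simple screening pass. This is not the workshop's actual pipeline. The feature names, the toy values, and both cutoffs (0.95 for correlated pairs, 0.2 for correlation with the score) are hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def screen_features(features, target, pair_cutoff=0.95, target_cutoff=0.2):
    """Keep features that pass all three checks, in input order."""
    kept = []
    for name, values in features.items():
        if all(v == 0 for v in values):
            continue  # all zeros: carries no information
        if abs(pearson(values, target)) < target_cutoff:
            continue  # does not correlate well with the score
        if any(abs(pearson(values, features[k])) > pair_cutoff for k in kept):
            continue  # nearly duplicates a feature already kept
        kept.append(name)
    return kept


# Toy data: six subjects, five hypothetical measurements.
features = {
    "volume_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "volume_a_copy": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "all_zero": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "noise": [1.0, 0.0, -1.0, -1.0, 0.0, 1.0],
    "thickness": [3.0, 1.0, 4.0, 2.0, 6.0, 5.0],
}
score = [2.0, 3.0, 5.0, 6.0, 8.0, 9.0]

kept = screen_features(features, score)
print(kept)
```

For a highly correlated pair, this sketch keeps whichever feature appears first. A real analysis might instead keep the one that correlates better with the score.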

Page 28: XIAO WU DATA ANALYSIS & BASIC STATISTICS

EXAMPLE

• Run regression
• Validation: leave 1 out and leave 10 out
• Principal component analysis
• …

Page 29: XIAO WU DATA ANALYSIS & BASIC STATISTICS

PREDICTION

More complicated models:
• Bayesian approach
  • Use prior knowledge to update prediction
• Diffusion weights
  • Use local structure to predict neighboring values
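The Bayesian idea of using prior knowledge to update a prediction can be shown with the simplest conjugate case, a Beta prior on a success probability updated with binomial data. The scenario (seed germination), the Beta(2, 2) prior, and the counts are all hypothetical. This is a sketch of the general idea, not the model used in the workshop.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial outcomes."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior belief: germination rate around 50%, held weakly.
prior = (2, 2)

# Hypothetical experiment: 14 of 20 seeds germinate.
posterior = update_beta(*prior, successes=14, failures=6)

print(f"prior mean:     {beta_mean(*prior):.3f}")
print(f"observed rate:  {14 / 20:.3f}")
print(f"posterior mean: {beta_mean(*posterior):.3f}")
```

The posterior mean lies between the prior mean and the observed rate, and moves closer to the data as more observations accumulate.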

Page 30: XIAO WU DATA ANALYSIS & BASIC STATISTICS

STATISTICAL TOOLS

• Excel
• MATLAB
• R
• Minitab
• …

Page 31: XIAO WU DATA ANALYSIS & BASIC STATISTICS

QUESTIONS?

Page 32: XIAO WU DATA ANALYSIS & BASIC STATISTICS

MY OWN RESEARCH

• Cost-effectiveness analysis
• Mathematical modeling in medicine
• Simulate iterations rather than actual patients

Page 33: XIAO WU DATA ANALYSIS & BASIC STATISTICS

RECENT RESULTS

Page 34: XIAO WU DATA ANALYSIS & BASIC STATISTICS

RESULTS

Page 35: XIAO WU DATA ANALYSIS & BASIC STATISTICS

GROUP EXERCISE