xiao wu data analysis & basic statistics
XIAO WU
XIAO.WU@YALE.EDU
DATA ANALYSIS & BASIC STATISTICS
PURPOSE OF THIS WORKSHOP
• Statistics as a useful tool to analyze results
• Basic terminology and most commonly used tests
• Exposure to more advanced statistical tools
WHY DO WE NEED STATISTICS?
• Summary
• Classification
• Interpretation
• Pattern searching
• Abnormality identification
• Prediction
  • Interpolation
  • Extrapolation
SUMMARY
• Mean, median, mode
• Variance, standard deviation
• Max, min values and range
• Quartiles
http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
EXAMPLE
Firm A
• Mean: $5,800
Firm B
• Mean: $5,000
EXAMPLE
Firm A
• Mean: $5,800
• Median: $4,000
• SD: $7,270
• 3rd Quartile: $4,000
• 1st Quartile: $500
Firm B
• Mean: $5,000
• Median: $5,000
• SD: $203
• 3rd Quartile: $5,175
• 1st Quartile: $4,825
EXAMPLE
Firm B
#    Salary ($)
1    4650
2    4700
3    4750
4    4800
5    4850
6    4900
7    4950
8    5000
9    5050
10   5100
11   5150
12   5200
13   5250
14   5300
15   5350

Firm A
#    Salary ($)
1    20000
2    4000
3    4000
4    500
5    500
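The summary figures for the two firms can be recomputed from the salary tables above. The sketch below uses only Python's standard library. The choice of the population standard deviation and of the "inclusive" quartile method is an assumption, picked because it reproduces the Firm A figures; the function name `summarize` is illustrative.

```python
import statistics

# Salaries copied from the two tables above.
firm_a = [20000, 4000, 4000, 500, 500]
firm_b = list(range(4650, 5351, 50))  # 4650, 4700, ..., 5350 (15 values)


def summarize(salaries):
    """Return the descriptive statistics listed on the SUMMARY slide."""
    # method="inclusive" interpolates between data points; assumed here
    # because it matches the quartiles quoted for both firms.
    q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        # Population SD (divide by n); assumed because it matches Firm A.
        "sd": statistics.pstdev(salaries),
        "q1": q1,
        "q3": q3,
        "min": min(salaries),
        "max": max(salaries),
        "range": max(salaries) - min(salaries),
    }


summary_a = summarize(firm_a)
summary_b = summarize(firm_b)

for name, summary in (("Firm A", summary_a), ("Firm B", summary_b)):
    print(name, {key: round(value) for key, value in summary.items()})
```

The two means differ by only $800, but the spread tells the real story: one $20,000 salary pulls Firm A's mean well above its median. Under these assumptions the fifteen Firm B salaries as listed give an SD of roughly $216, so the $203 quoted on the slide may have been computed differently or from slightly different data.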
CLASSIFICATION
Identification of variables
• Independent vs. dependent
• Numeric vs. categorical
Variable
• Categorical
  • Nominal
  • Ordinal
• Numeric
  • Continuous
  • Discrete
PATTERN SEARCHING
• Distribution of data
• Some commonly used distributions
  • Uniform
  • Binomial
  • Poisson
  • …
• Central limit theorem
http://www.mathwave.com/img/art/graphs_pdf2.gif
UNIFORM
• Every outcome has an equal chance
• Example:
  • Flipping a coin
  • Rolling a die
• What if you need to flip multiple times?
BINOMIAL
• Two outcomes, with probabilities p and 1-p
• Multiple trials: n
• Example:
  • Flipping a coin 100 times
  • Germination of multiple seeds
https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
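For the coin example, the binomial probabilities can be computed directly from the formula P(k) = C(n, k) p^k (1-p)^(n-k). This is a minimal sketch that assumes a fair coin (p = 0.5); the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_flips = 100  # the slide's coin example
p_heads = 0.5  # assumes a fair coin

# Chance of exactly 50 heads, and of anything from 40 to 60 heads.
p_exactly_50 = binomial_pmf(50, n_flips, p_heads)
p_40_to_60 = sum(binomial_pmf(k, n_flips, p_heads) for k in range(40, 61))

# Sanity check: probabilities over all possible outcomes add up to 1.
total_probability = sum(binomial_pmf(k, n_flips, p_heads) for k in range(n_flips + 1))

print(f"P(exactly 50 heads) = {p_exactly_50:.4f}")
print(f"P(40 to 60 heads)   = {p_40_to_60:.4f}")
print(f"Sum over all k      = {total_probability:.4f}")
```

Exactly 50 heads is the single most likely outcome, yet it is still uncommon; most of the probability is spread over the nearby counts.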
POISSON
• Counts of rare, independent events
• Each with probability, or average rate, p
• Example: radioactive decay
http://kaffee.50webs.com/Science/images/alpha_decay.gif
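The link between "rare events" and the Poisson distribution can be seen numerically: a binomial with many trials and a small per-trial probability is very close to a Poisson with the same average count. The sketch below assumes a hypothetical average of 2 events per interval, split over 1000 trials; both numbers are chosen only for illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average count is rate."""
    return exp(-rate) * rate**k / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


rate = 2.0                 # hypothetical average number of events per interval
n_trials = 1000            # hypothetical number of tiny sub-intervals
p_event = rate / n_trials  # small chance of an event in each sub-interval

print("k  Poisson  Binomial")
for k in range(6):
    print(k, round(poisson_pmf(k, rate), 4), round(binomial_pmf(k, n_trials, p_event), 4))

# Largest disagreement between the two distributions for k = 0..10.
max_gap = max(
    abs(poisson_pmf(k, rate) - binomial_pmf(k, n_trials, p_event))
    for k in range(11)
)
print("largest difference:", max_gap)
```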
THE MOST IMPORTANT DISTRIBUTION
NORMAL DISTRIBUTION
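• Central limit theorem
  • The mean of a large sample is approximately normally distributed, whatever the distribution of the individual observations (provided its variance is finite)
  • Large sample size → normal distribution
Parameters:
• mean
• standard deviation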
https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
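The central limit theorem can be checked with a small simulation. A single die roll is uniform, not normal, but the mean of many rolls behaves like a normal variable with mean 3.5 and standard deviation sigma / sqrt(n). The sample size, number of repetitions, and random seed below are arbitrary choices for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

ROLLS_PER_SAMPLE = 30  # hypothetical sample size
N_SAMPLES = 5000       # hypothetical number of repeated samples


def sample_mean(n_rolls):
    """Average of n_rolls rolls of a fair six-sided die."""
    return statistics.fmean([random.randint(1, 6) for _ in range(n_rolls)])


sample_means = [sample_mean(ROLLS_PER_SAMPLE) for _ in range(N_SAMPLES)]

die_mean = 3.5
die_sd = (35 / 12) ** 0.5                       # SD of a single fair die roll
expected_se = die_sd / ROLLS_PER_SAMPLE ** 0.5  # SD predicted for the sample mean

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.pstdev(sample_means)

# For a normal distribution about 95% of values lie within 1.96 SD of the mean.
share_within = sum(
    abs(m - die_mean) <= 1.96 * expected_se for m in sample_means
) / N_SAMPLES

print("mean of sample means:", round(observed_mean, 3))
print("SD of sample means:  ", round(observed_sd, 3), "predicted:", round(expected_se, 3))
print("share within 1.96 SD:", round(share_within, 3))
```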
PATTERN SEARCHING
Hypothesis testing
• Difference between two populations
  • Z-test or t-test?
  • What does p-value mean?
  • Family-wise error – Bonferroni correction
• More than two possibilities
  • Chi-square test
  • Fisher’s exact test
• More than two variables
  • ANOVA
EXAMPLE 1
SAT score is related to gender
• Null hypothesis
• Alternative hypothesis (3 possibilities)
• One or two tail?
• Z or T test?
• p=0.07, conclusion?
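The one-tail versus two-tail question matters for the conclusion. The sketch below assumes that p = 0.07 came from a two-tailed Z-test and that the significance level is the conventional 0.05; neither is stated on the slide. It recovers the matching z statistic and shows what the same statistic gives under a one-tailed alternative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1
ALPHA = 0.05                    # conventional significance level (assumed)


def two_tailed_p(z):
    """p-value when the alternative is 'the two groups differ'."""
    return 2 * (1 - standard_normal.cdf(abs(z)))


def one_tailed_p(z):
    """p-value when the alternative is 'the first group scores higher'."""
    return 1 - standard_normal.cdf(z)


# z statistic that corresponds to a two-tailed p-value of 0.07.
z = standard_normal.inv_cdf(1 - 0.07 / 2)

p_two = two_tailed_p(z)
p_one = one_tailed_p(z)
reject_two_tailed = p_two < ALPHA
reject_one_tailed = p_one < ALPHA

print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two:.3f}, reject null: {reject_two_tailed}")
print(f"one-tailed p = {p_one:.3f}, reject null: {reject_one_tailed}")
```

With a two-tailed alternative the null hypothesis is not rejected at 0.05, while a one-tailed alternative in the observed direction would reject it. This is why the alternative hypothesis has to be fixed before looking at the data.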
EXAMPLE 2
Predictors of stroke
• Age
• Hypertension
• Gender
• …
EXAMPLE 3
Genome-wide association studies
• Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
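Scanning many markers means running many tests, which is where the family-wise error and the Bonferroni correction come in. The sketch below assumes independent tests and a hypothetical scan of one million markers; the function names are illustrative.

```python
ALPHA = 0.05  # per-test significance level (assumed)


def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


# Even 20 tests at 0.05 make a false positive more likely than not.
fwer_20_tests = family_wise_error_rate(ALPHA, 20)

n_markers = 1_000_000  # hypothetical number of markers in a scan
per_marker_threshold = bonferroni_threshold(ALPHA, n_markers)
fwer_after_correction = family_wise_error_rate(per_marker_threshold, n_markers)

print(f"FWER for 20 uncorrected tests: {fwer_20_tests:.3f}")
print(f"Bonferroni threshold per marker: {per_marker_threshold:.1e}")
print(f"FWER after correction: {fwer_after_correction:.3f}")
```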
PATTERN SEARCHING
Hypothesis testing
• One variable
  • Z-test or t-test?
  • What does p-value mean?
  • Family-wise error – Bonferroni correction
• Compare two categorical variables
  • Chi-square test
  • Fisher’s exact test
• More than two variables
  • ANOVA
CHI SQUARE
Punnett Square
• A cross between two pea plants yields 880 plants: 639 green, 241 yellow
• Hypothesis: The green allele is dominant and both parents are heterozygous.
http://www2.lv.psu.edu/jxm57/irp/chisquar.html
CHI SQUARE
       G             g
G      GG (green)    Gg (green)
g      Gg (green)    gg (yellow)
• 75% green
• 25% yellow
CHI SQUARE
                           Green    Yellow
Observed (o)               639      241
Expected (e)               660      220
Deviation (d = o – e)      -21      21
Deviation squared (d^2)    441      441
d^2/e                      0.668    2.005
Sum                        2.673
Degrees of freedom: number of categories – 1 = 1
CHI SQUARE
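The pea-plant calculation can be checked in a few lines, using the observed counts and the 3:1 ratio from the Punnett square. The sketch below reproduces a chi-square statistic of about 2.67 and converts it to a p-value. The 0.05 significance level is an assumption, and the p-value formula used here holds only for 1 degree of freedom.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
expected_share = {"green": 0.75, "yellow": 0.25}  # 3:1 ratio from the Punnett square

total = sum(observed.values())
expected = {color: share * total for color, share in expected_share.items()}

# Sum of (observed - expected)^2 / expected over the categories.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)
degrees_of_freedom = len(observed) - 1

# With 1 degree of freedom the chi-square variable is a squared standard
# normal, so its upper-tail probability is erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))

CRITICAL_VALUE = 3.841  # chi-square critical value for 1 df at the 0.05 level
reject_hypothesis = chi_square > CRITICAL_VALUE

print("expected counts:", expected)
print("chi-square:", round(chi_square, 3), "df:", degrees_of_freedom)
print("p-value:", round(p_value, 3))
print("reject hypothesis at 0.05:", reject_hypothesis)
```

The statistic is below the critical value, so the data are consistent with the hypothesis that green is dominant and both parents are heterozygous.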
PREDICTION
• Regression
  • Linear regression
  • Multiple linear regression
• Accuracy vs. simplicity
• Validation
  • leave-k-out
http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXv-U/s1600/actnactn+1.png
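Linear regression and leave-k-out validation can be sketched with plain Python. The data below are made up for illustration (roughly y = 2x + 1 plus small noise), and holding out consecutive blocks of k points is a simplification of leave-k-out, which strictly considers every possible subset of size k. With k = 1 the two coincide.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_k_out_mse(xs, ys, k):
    """Mean squared prediction error when blocks of k points are held out in turn."""
    squared_errors = []
    for start in range(0, len(xs), k):
        held_out = range(start, min(start + k, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        # Score the model only on points it did not see during fitting.
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data, not taken from the slides.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9]

slope, intercept = fit_line(xs, ys)
mse_leave_1_out = leave_k_out_mse(xs, ys, k=1)
mse_leave_2_out = leave_k_out_mse(xs, ys, k=2)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("leave-1-out MSE:", round(mse_leave_1_out, 4))
print("leave-2-out MSE:", round(mse_leave_2_out, 4))
```

A model that is too flexible can fit the training points closely and still predict held-out points badly; comparing validation error across models is one way to weigh accuracy against simplicity.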
EXAMPLE
• Use brain structural measurements to predict a subject’s performance on a picture vocabulary test
• 144 total structural measurements
• 521 subjects
• First step: eliminate unnecessary variables
  • All zeros?
  • Highly correlated pairs
  • Variables that do not correlate well with performance score
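The three screening steps can be sketched as a small filter. Everything specific here is hypothetical: the variable names, the toy values, and the two correlation cut-offs (0.9 for "highly correlated pairs", 0.3 for "correlates well with the score") are illustrative choices, not the values used in the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def screen_variables(features, outcome, max_pair_r=0.9, min_outcome_r=0.3):
    """Return the names of the variables that survive the three screening steps."""
    # Step 1: drop variables that are zero for every subject.
    candidates = {
        name: values
        for name, values in features.items()
        if any(value != 0 for value in values)
    }
    # Step 2: of each highly correlated pair, keep only the first variable seen.
    kept = []
    for name, values in candidates.items():
        if all(abs(pearson(values, candidates[other])) < max_pair_r for other in kept):
            kept.append(name)
    # Step 3: drop variables that barely correlate with the outcome.
    return [
        name for name in kept
        if abs(pearson(candidates[name], outcome)) >= min_outcome_r
    ]


# Toy data for six hypothetical subjects.
score = [1, 2, 3, 4, 5, 6]
features = {
    "volume_1": [1, 2, 3, 4, 5, 7],
    "volume_2": [2, 4, 6, 8, 10, 14],   # exactly twice volume_1
    "empty": [0, 0, 0, 0, 0, 0],
    "unrelated": [1, -1, 0, 0, -1, 1],  # uncorrelated with the score
}

selected = screen_variables(features, score)
print("variables kept:", selected)
```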
EXAMPLE
• Run regression
• Validation: leave 1 out and leave 10 out
• Principal component analysis
• …
PREDICTION
More complicated models:
• Bayesian approach
  • Use prior knowledge to update prediction
• Diffusion weights
  • Use local structure to predict neighboring values
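As a small illustration of the Bayesian idea of updating a prediction with prior knowledge, the sketch below estimates a success probability (for example, a seed germination rate) with a Beta prior. The prior and the observed counts are invented for the example, and the Beta-binomial model is one simple choice among many.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, used as the predicted success rate."""
    return a / (a + b)


# Beta(1, 1) is a flat prior: every success rate is equally plausible.
prior = (1, 1)

# Hypothetical first batch: 7 seeds germinate, 3 do not.
posterior_1 = update_beta(*prior, successes=7, failures=3)

# Hypothetical second batch: the earlier posterior now acts as the prior.
posterior_2 = update_beta(*posterior_1, successes=3, failures=7)

print("prior mean:          ", beta_mean(*prior))
print("after first batch:   ", round(beta_mean(*posterior_1), 3))
print("after second batch:  ", round(beta_mean(*posterior_2), 3))
```

Each new batch of data shifts the estimate, and the more data already absorbed into the prior, the less any single batch moves it.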
STATISTICAL TOOLS
• Excel
• MATLAB
• R
• Minitab
• …
QUESTIONS?
MY OWN RESEARCH
• Cost-effectiveness analysis
• Mathematical modeling in medicine
• Simulate iterations rather than actual patients
RECENT RESULTS
RESULTS
GROUP EXERCISE