R: A Statistics Program for Teaching & Research
Josué Guzmán
11 Nov. 2007
© J. Guzmán, 2007 R: Stat. Prog. for Teach. & Res. 2
Some Useful R Links
• R Home Page www.r-project.org
• CRAN http://cran.r-project.org
• Precompiled Binary Distributions
• Windows (95 and later)
• R Manuals
R Installation
• R: Statistical Analysis & Graphics
• Freely Available Under GPL
• Binary Distributions
• Installation – Standard Steps
Running R
Statistical Programming with R
• Learn Language Basics
• Learn Documentation / Help System
• Learn Data Manipulation & Graphics
• Perform Basic Statistical Analysis
First Steps: Interacting with R
• Type a Command & Press Enter
• R Executes (printing the result if relevant)
• R waits for more input
Some Examples
2 * 2
[1] 4
exp(-2)
[1] 0.1353353
rdmnorm = rnorm(1000)
R Functions
• exp, log and rnorm are functions
• Function calls are indicated by the presence of parentheses
Example: hist(rdmnorm, col = "magenta")
Variables and Assignments
The = operator; the <- operator also works
x = 2.2
y = x + 3.5
sqrt(x)
y
x ^ y
Variables and Assignments [cont.]
• Variable names cannot start with a digit
• Names are Case-Sensitive
• Some common names are already used by R
• Examples: c, q, t, C, D, F, I, T
• Should be avoided
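A minimal sketch of why these points matter, using made-up names (Total, total) that do not come from the slides. Names that differ only in case are separate variables, and assigning to a built-in name such as T hides its usual meaning until the assignment is removed.

```r
Total = 10
total = 3                 # a different variable: names are case-sensitive
difference = Total - total

T = 0                     # legal, but masks the built-in shorthand for TRUE
masked = isTRUE(T)        # FALSE while our own T is in place
rm(T)                     # remove our T; the built-in one is visible again
restored = isTRUE(T)
```

Writing TRUE and FALSE in full avoids this problem, since those two words cannot be reassigned.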
Vectorized Arithmetic
• Elementary data types in R are all vectors
• The c(...) construct used to create vectors:
• Bolstad, 2004, exercise 13.2, page 253
fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
fertilizer
Vectorized Arithmetic [cont.]
• Arithmetic operations (+, -, *, /, ^) and mathematical functions (sin, cos, log, …) work element-wise on vectors
yield = c(25, 31, 27, 28, 36, 35, 32, 34)
log(yield)
Vectorized Arithmetic [cont.]
sum.yield = sum(yield)
sum.yield
n = length(yield)
n
avg.yield = sum.yield/n
avg.yield
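As an illustrative check that is not part of the original slides, the hand-computed average can be compared with the built-in mean function. The last step subtracts a single number from the whole vector at once.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

avg.yield = sum(yield) / length(yield)     # same steps as on the slide
same = all.equal(avg.yield, mean(yield))   # the built-in mean should agree

# Subtracting one number applies to every element of the vector
centered = yield - avg.yield
centered
```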
Graphics
• plot(x, y) function – simple way to produce R graphics:
plot(fertilizer, log(yield), main = "Fertilizer vs. Yield")
Getting Help
• help.start( ) Starts a browser window with an HTML help interface. Links to the manual An Introduction to R, as well as topic-wise listings.
• help(topic) Help page for a particular topic or function. Every R function has a help page.
• help.search("search string") Subject/keyword search
Getting Help [cont.]
• Shortcut: question mark (?)
help(plot)
?plot
• To learn about a specific subject, use the help.search function. Example:
help.search("logarithm")
apropos( )
• apropos function - list of topics that partially match its argument:
apropos("plot")[1:10]
[1] ".__C__recordedplot" "biplot"
[3] "interaction.plot" "lag.plot"
[5] "monthplot" "plot.TukeyHSD"
[7] "plot.density" "plot.ecdf"
[9] "plot.lm" "plot.mlm"
R Packages
• R makes use of a system of packages
• Each package is a collection of routines with a common theme
• The core of R itself is a package called base
• A collection of packages is called a library
• Some packages are already loaded when R starts up
• Other packages need to be loaded using the library function
R Packages [cont.]
Several packages come pre-installed with R:
installed.packages( )[, 1]
 [1] "ISwR" "KernSmooth" "MASS" "base"
 [5] "boot" "class" "cluster" "foreign"
 [9] "graphics" "grid" "lattice" "methods"
[13] "mgcv" "nlme" "nnet" "rpart"
[17] "spatial" "splines" "stats" "stats4"
[21] "survival" "tcltk" "tools" "utils"
Contributed Packages
• Many packages are available from CRAN
• Some packages are already loaded when R starts up. To list the currently loaded packages, use search:
search( )
[1] ".GlobalEnv" "package:tools" "package:methods"
[4] "package:stats" "package:graphics" "package:utils"
[7] "Autoloads" "package:base"
R Packages
• Can be loaded by the user. Example: UsingR package
library(UsingR)
• New packages are downloaded using the install.packages function:
install.packages("UsingR")
library(help = UsingR)
Data Types
• vector – Set of elements in a specified order
• matrix – Two-dimensional array of elements of the same mode
• factor – Vector of categorical data
• data frame – Two-dimensional array whose columns may represent data of different modes
• list – Set of components that can be any other object type
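A short sketch that builds one object of each type listed above. All object names and values here (doses, counts, level, trial, bundle) are invented for illustration.

```r
doses  = c(1.5, 2, 2.5)                         # vector
counts = matrix(1:6, nrow = 2)                  # matrix: 2 rows, 3 columns, one mode
level  = factor(c("low", "high", "low"))        # factor: categorical data
trial  = data.frame(dose = doses, level = level)  # data frame: columns of different modes
bundle = list(vec = doses, mat = counts, frame = trial)  # list: any mix of objects

str(bundle)   # compact overview of what the list holds
```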
Editing Data Sets
• Can create and modify data sets on the command line
xx = seq(from = 1, to = 5)
xx
x2 = 1 : 5
x2
yy = scan( )
5 8 10 4 2 6 20
11 21 32 43 55
yy
• Can edit a data set once it is created
edit(mydata)
data.entry(mydata)
Built-in Data
Data from a library:
library(UsingR)
attach(cfb) # Consumer-Finances Survey
cfb$INCOME
cfb$EDUC
educ.fac = factor(EDUC)
plot(INCOME ~ educ.fac, xlab = "EDUCATION", ylab = "INCOME")
detach(cfb)
Data Modes
• logical – Binary mode, values represented as TRUE or FALSE
• numeric – Numeric mode [integer, single, & double precision]
• complex – Complex numeric values
• character – Character values represented as strings
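The mode function reports which of these modes a value has. A small sketch with arbitrary example values; typeof is included only to show the storage type that sits behind the numeric mode.

```r
# One value per mode
modes = c(mode(TRUE), mode(3.5), mode(2L), mode(1 + 2i), mode("text"))
modes

# Integers and doubles share the numeric mode but are stored differently
types = c(typeof(2L), typeof(3.5))
types
```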
Data Frames
• read.table( ) – Reads in data from an external file
read.table("data.txt" , header = T)
read.table(file = file.choose( ), header = T)
• data.frame – Binds R objects of various kinds
read.table Function
• Reads ASCII file, creates a data frame
• Data in tables of rows and columns
• If first line contains column labels: use argument header = T
• Field separator is white space
• Also read.csv and read.csv2
– Assume , and ; separations, respectively
• Treats characters as factors (the default before R 4.0.0; later versions keep character columns unless stringsAsFactors = TRUE)
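A self-contained sketch of read.table that does not depend on a file on disk: it writes a few lines to a temporary file and reads them back. The file contents and the names tmp.file and dat are made up for this example.

```r
# Write a small white-space separated table with a header line
tmp.file = tempfile(fileext = ".txt")
writeLines(c("fertilizer yield",
             "1.0 25",
             "1.5 31",
             "2.0 27"), tmp.file)

# header = TRUE tells read.table that the first line holds column labels
dat = read.table(tmp.file, header = TRUE)
str(dat)

unlink(tmp.file)   # remove the temporary file
```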
save( ) and load( )
• Used for R Functions and Objects
• The saved file can be read back only with load( )
x = 23
y = 44
save(x, y, file = "xy.Rdata")
load("xy.Rdata")
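A round-trip sketch of the same idea, using a temporary file instead of "xy.Rdata" so nothing is left behind. The name rdata.file is chosen just for this example.

```r
x = 23
y = 44

rdata.file = tempfile(fileext = ".RData")
save(x, y, file = rdata.file)

rm(x, y)                       # the objects are gone from the workspace
restored = load(rdata.file)    # load brings them back and returns their names
restored

unlink(rdata.file)             # remove the temporary file
```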
Comparison Operators
!= Not Equal To
< Less Than
<= Less Than or Equal To
== Exactly Equal To
> Greater Than
>= Greater Than or Equal To
Some Logical Operators
! Not
| Or (For Calculating Vectors and Arrays of Logicals)
& And (For Calculating Vectors and Arrays of Logicals)
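A brief sketch combining comparison and logical operators on the yield vector from the earlier slides. The cut-offs 32 and 27 are arbitrary values picked for illustration.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

high = yield >= 32        # element-wise comparison gives a logical vector
low  = yield < 27

extreme = high | low      # TRUE where either condition holds
selected = yield[extreme] # logical vectors can be used to pick elements
selected

n.middle = sum(!extreme)  # TRUE counts as 1, so this counts the rest
n.middle
```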
Some Mathematical Functions
abs Absolute Value
ceiling Next Larger Integer
floor Next Smallest Integer
cos, sin, tan Trigonometric Functions
exp(x) e^x [e = 2.71828 …]
log Natural Logarithm
log10 Logarithm Base 10
sqrt Square Root
Statistical Summary Functions
length Length of Object
max Maximum Value
mean Arithmetic Mean
median Median
min Minimum Value
prod Product of Values
quantile Empirical Quantiles
sum Sum
var Variance - Covariance
sd Standard Deviation
cor Correlation Between Vectors or Matrices
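A quick sketch applying several of these functions to the fertilizer and yield vectors defined on the earlier slides. The result names (centre, spread, r) are made up here.

```r
fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
yield      = c(25, 31, 27, 28, 36, 35, 32, 34)

centre = c(mean = mean(yield), median = median(yield))
spread = c(min = min(yield), max = max(yield),
           var = var(yield), sd = sd(yield))
r = cor(fertilizer, yield)   # correlation between the two vectors

centre
spread
r
quantile(yield)              # empirical quantiles (0%, 25%, 50%, 75%, 100%)
```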
Sorting and Other Functions
rev Put Values of Vectors in Reverse Order
sort Sort Values of Vector
order Permutation of Elements to Produce Sorted Order
rank Ranks of Values in Vector
match Detect Occurrences in a Vector
cumsum Cumulative Sums of Values in Vector
cumprod Cumulative Products
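The difference between sort, order and rank is easiest to see side by side. A sketch on the yield vector from the earlier slides; the result names are invented for this example.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

sorted   = sort(yield)        # the values themselves, smallest first
ord      = order(yield)       # positions that would sort the vector
ranks    = rank(yield)        # rank of each value in its original position
backward = rev(yield)         # original order reversed
running  = cumsum(yield)      # running total
where36  = match(36, yield)   # position of the first occurrence of 36

identical(yield[ord], sorted) # indexing by order reproduces sort
```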
Plotting Functions Useful for One-Dimensional Data
barplot Bar plot
boxplot Box & Whisker plot
hist Histogram
dotchart Dot plot
pie Pie chart
Plotting Functions Useful for Two-Dimensional Data
plot Creates a scatter plot: plot(x, y)
qqnorm Quantile-quantile plot sample vs. N(0, 1): qqnorm(x)
qqplot Plot quantile-quantile plot for two samples: qqplot(x , y)
pairs Creates a pairs or scatter plot matrix:
attach(babies)
pairs(babies[ , c("gestation", "wt", "age", "inc" ) ] )
Three-Dimensional Plotting Functions
contour Contour plot
persp Perspective plot
image Image plot
Probability Distributions Using R
• Pseudo-random sampling
sample(0:20, 5) # select 5 without replacement (WOR)
sample(0:20, 5, replace = T) # select 5 with replacement (WR)
• Coin toss simulation [0 = tail; 1 = head] 20 tosses:
sample(c(0, 1), 20, replace=T)
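Because sample draws pseudo-random numbers, each run gives different tosses. A sketch of making the simulation repeatable with set.seed; the seed value 2007 is arbitrary, and the names tosses, heads and prop.heads are made up here.

```r
set.seed(2007)                                 # fix the generator state
tosses = sample(c(0, 1), 20, replace = TRUE)   # 0 = tail, 1 = head
heads = sum(tosses)                            # number of heads
prop.heads = mean(tosses)                      # proportion of heads

set.seed(2007)                                 # same seed, same sequence
again = sample(c(0, 1), 20, replace = TRUE)
identical(tosses, again)
```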
For Any Probability Distribution
ddist density or probability
pdist cumulative probability
qdist quantiles [percentiles]
rdist pseudo-random selection
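A sketch of the four prefixes using the standard normal distribution as the example (dist = norm). The particular arguments, and the seed for the random draw, are arbitrary choices.

```r
dens = dnorm(0)          # density at 0, equal to 1 / sqrt(2 * pi)
prob = pnorm(1.96)       # P(X <= 1.96), close to 0.975
quan = qnorm(0.975)      # the value with 97.5% of the probability below it

# p and q are inverses of each other
round.trip = qnorm(pnorm(1.3))

set.seed(1)
draws = rnorm(5)         # five pseudo-random values
draws
```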
Binomial Distribution
X ~ Binomial(n , p) ; x = 0, 1, …, n
dbinom(x , n , p ) Density or point probability
pbinom(x , n , p ) Cumulative distribution
qbinom(q , n , p ) Quantiles [ 0 < q < 1 ]
rbinom(m , n , p ) Pseudo-random numbers
Binomial Distribution
Coin toss simulation:
x = 0:20 # num. of heads in 20 tosses
px = dbinom(x , size = 20, prob = 0.5)
plot(x , px, type = "h") # graph display
curve(dnorm(x, 10, sqrt(20*.5*.5)), col=2, add=T)
[Figure: px plotted against x, for x from 0 to 20]
Normal Distribution
X ~ Normal(µ, σ)
dnorm(x , µ, σ) Density
pnorm(x , µ, σ) Cumulative probability
qnorm(q , µ, σ) Quantiles
rnorm(m , µ, σ) Random numbers
Standard Normal
x = seq(-3.5,3.5,0.1) # x ~ N(0,1)
prx = dnorm(x) # M = 0 , SD = 1
plot(x , prx , type = "l" )
Or using: curve(dnorm(x), from = -3.5 , to = 3.5)
Cumulative Normal & Quantiles
curve(pnorm(x), from=-3.5,to=3.5)
qnorm(.25) #Percentile 25, x~N(0,1)
qnorm(.75, mean = 50, sd = 2) # M=50,SD=2
qnorm(c(.1,.3,.7,.9), mean = 65, sd = 3)
Poisson Distribution
X ~ Poisson( λ ) ; X = 0, 1, 2, 3, …
x = 0:20 # Suppose λ = 3.5
prx = dpois(x, lambda = 3.5)
plot(x , prx, type = "h", main = "Poisson Distribution")
text(10, .10, "Lambda = 3.5")
[Figure: "Poisson Distribution"; prx plotted against x, for x from 0 to 20; annotated "Lambda = 3.5"]
Sampling Distributions
n = 25; curve(dnorm(x , 0, 1/sqrt(n)), -3, 3,
xlab = "Mean", ylab = "Densities of Sample Mean", bty = "l" )
n=5 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)
n=1 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)
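The curves above use the fact that the mean of n standard normal values has standard deviation 1/sqrt(n). A simulation sketch of that fact for n = 25; the seed and the 5000 replications are arbitrary choices, and the names are invented here.

```r
set.seed(2007)
n = 25

# Draw 5000 samples of size n and keep each sample mean
sample.means = replicate(5000, mean(rnorm(n)))

empirical   = sd(sample.means)   # spread of the simulated means
theoretical = 1 / sqrt(n)        # 0.2 for n = 25

c(empirical = empirical, theoretical = theoretical)
```

The two numbers should be close but not identical, since the empirical value comes from random draws.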
[Figure: x-axis "Mean" from -3 to 3; y-axis "Densities of Sample Mean"]
t-Distribution as df Increases
curve(dnorm(x), -4, 4, main = "Normal & t Distributions", ylab = "Densities")
k=3; curve(dt(x , df = k ), lty = k, add = T)
k=5; curve(dt(x , df = k ), lty = k, add = T)
k=15; curve(dt(x , df = k ), lty = k, add = T)
k=100; curve(dt(x , df = k ), lty = k, add = T)
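The same convergence can be seen numerically. A sketch comparing the 97.5th percentile of the t-distribution with that of the standard normal, for the same degrees of freedom as the curves; the choice of the 0.975 quantile is ours, not the slides'.

```r
dfs = c(3, 5, 15, 100)

t.quant = qt(0.975, df = dfs)   # one quantile per value of df
z.quant = qnorm(0.975)          # normal counterpart

# t quantiles shrink toward the normal value as df grows
round(rbind(df = dfs, t = t.quant, normal = z.quant), 3)
```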
[Figure: "Normal & t Distributions"; x-axis x from -4 to 4; y-axis "Densities"]
Binomial-Normal Approximation
• Coin toss example: n = 100, p = .5
• P(X ≤ 40)?
Using Larget’s prob.R file:
source(file.choose( ) )
gbinom(100, .5, b = 40 )
Normal approximation: µ = 50, σ = 5
gnorm(50, 5, b = 40.5)
[Figure: "Binomial Distribution n = 100 , p = 0.5"; x-axis "Possible Values" from 30 to 70; y-axis "Probability"; P(0 <= Y <= 40) = 0.028444]
[Figure: "Normal Distribution with 50, 5"; x-axis "Possible Values" from 30 to 70; y-axis "Probability Density"; P( X < 40.5 ) = 0.0287, P( X > 40.5 ) = 0.9713]
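The gbinom and gnorm functions come from the separate prob.R file, so they are not available in a plain R session. As a sketch, the same two probabilities can be computed with base R alone; the names exact and approx are made up here.

```r
# Exact binomial probability P(X <= 40) for n = 100, p = 0.5
exact = pbinom(40, size = 100, prob = 0.5)

# Normal approximation with continuity correction:
# mean = n * p = 50, sd = sqrt(n * p * (1 - p)) = 5
approx = pnorm(40.5, mean = 50, sd = 5)

round(c(exact = exact, approx = approx), 6)
```

These should agree with the values reported in the figures, 0.028444 and 0.0287.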
One-Sample t-test
Ho: µ = µ0 Null Hypothesis
Ha: µ ≠ µ0 Two-sided
Ha: µ > µ0 One-sided
Ha: µ < µ0 One-sided
R One-Sample t.test
x = c(x1, x2, …, xn) # data set
t.test(x, mu = Mo) # two-sided
t.test(x, mu = Mo, alt = "g") # one-sided
t.test(x, mu = Mo, alt = "l") # one-sided
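A runnable sketch of the calls above on simulated data. The sample (30 values from a normal distribution with mean 52 and sd 5), the seed, and the hypothesized mean of 50 are all invented for illustration. The t statistic is also computed by hand to show what t.test reports.

```r
set.seed(1)
x = rnorm(30, mean = 52, sd = 5)   # hypothetical data set

# One-sided test of Ho: mu = 50 against Ha: mu > 50
res = t.test(x, mu = 50, alternative = "greater")
res

# The same statistic by hand: (sample mean - mu0) / standard error
t.manual = (mean(x) - 50) / (sd(x) / sqrt(length(x)))
c(from.t.test = unname(res$statistic), by.hand = t.manual)
```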
R One-Sample t.test [cont.]
Example: Text, Problem 8.11, page 226
library(UsingR)
attach(stud.recs)
x = sat.m # Math SAT Scores
hist(x) # Visual display
qqnorm(x) # Normal quantile plot
qqline(x, col=2) # Add equality line
t.test(x, mu = 500)
detach(stud.recs)
Normality Test
Shapiro-Wilk test:
Ho: X ~ Normal
Ha: X !~ Normal
Command: shapiro.test(x)
# Examine p-value
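A sketch of what examining the p-value looks like, using two simulated samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The seed, sample sizes and names are arbitrary.

```r
set.seed(42)
normal.x = rnorm(100)   # data that really are normal
skewed.x = rexp(100)    # data that are clearly not normal

p.normal = shapiro.test(normal.x)$p.value
p.skewed = shapiro.test(skewed.x)$p.value

# A small p-value is evidence against Ho: X ~ Normal
c(normal = p.normal, skewed = p.skewed)
```

A large p-value does not prove normality; it only means the test found no evidence against it.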
Normality Test [cont.]
Example: On Base %
data(OBP)
summary(OBP)
boxplot(OBP)
[Figure: box plot of OBP; axis from 0.2 to 0.5]
Normality Test [cont.]
qqnorm(OBP)
qqline(OBP, col=2)
shapiro.test(OBP)
wilcox.test(OBP, mu=.330)
[Figure: "Normal Q-Q Plot"; x-axis "Theoretical Quantiles" from -3 to 3; y-axis "Sample Quantiles" from 0.2 to 0.5]
One-Sample Proportion Test
x = total successes; n = sample size
prop.test(x, n, p = Po) # two-sided
prop.test(x, n, p = Po, alt= "g")
prop.test(x, n, p = Po, alt= "l")
Or Using Binomial “Exact” Test
binom.test(x, n, p = Po)
binom.test(x, n, p = Po, alt = "g")
binom.test(x, n, p = Po, alt = "l")
Proportion Test
Text, Example 8.3: Survey US Poverty Rate
Ho: P = 0.113 # Year 2000 Rate
Ha: P > 0.113 # Year 2001 Rate Increased
x = 5850 # Sampled people under the poverty line (UPL)
n = 50000 # Sample size
prop.test(x, n, p = 0.113, alt = "g")
binom.test(x, n, p = 0.113, alt = "g")
Some Modeling Functions/Packages
Linear Models: anova, car, lm, glm
Graphics: graphics, grid, lattice
Multivariate: mva, cluster
Survey: survey
SQC: qcc
Time Series: tseries
Bayesian: BRugs, MCMCpack, …
Simulation: boot, bootstrap, Zelig
You Perform An Experiment
In Order To Learn, Not To Prove.
W. Edwards Deming