Exploring data with R: introduction
Philippe Glaziou
Berlin, April 2009
Getting started
install.packages ("ggplot2")  # once per computer
library (ggplot2)             # every time you open R
What is R?
• A free implementation of a dialect of the S language (S-Plus is the commercial implementation)
• R is a high-level programming language with similarities to Python and Scheme
• R is slow and memory-hungry, but computers are fast and memory is cheap
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
What does R do?
• The same things as mainstream commercial statistical packages
• Data management
• Stochastic and deterministic modelling
• Publication-quality graphics
• And more:
– R can generate and solve Sudoku games
Diamonds
http://www.diamondse.info
Colour and Cut
> head (diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
About 54,000 diamonds (53,940 rows)
qplot (carat, data = diamonds)
qplot (carat, data = diamonds, binwidth=0.1)
qplot (carat, data = diamonds, binwidth=0.01)
What questions could be answered with these data?
Variables:
• price
• quality: carat, cut, clarity, colour
• geometry: x, y, z, depth, table
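Before choosing a plot, it helps to know which of these variables are numeric and which are ordered factors. The lines below are a minimal sketch, assuming ggplot2 is installed; they use only base R functions on the diamonds data.

```r
library(ggplot2)  # provides the diamonds data set

# Class of each column: numeric measurements versus (ordered) factors
var_class <- sapply(diamonds, function(col) class(col)[1])
print(var_class)

# Levels of the quality factors, in the order stored in the data
print(levels(diamonds$cut))
print(levels(diamonds$color))
print(levels(diamonds$clarity))

# Summaries of price and weight
print(summary(diamonds[, c("price", "carat")]))
```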
Does price vary with weight?
• How would you like to explore the relationship?
qplot (carat, price, data = diamonds)
qplot (log(carat), log(price), data = diamonds)
qplot (log(carat), log(price), data = diamonds, xlim=c(1,2))
qplot (log(carat), log(price), data=diamonds, colour=color)
qplot (log(carat), log(price), data=diamonds, fill=color, geom='hex', bins=250) + scale_fill_brewer (palette="Blues")
Data quality
• What simple checks could we do?
• Can we check whether the density (weight/volume) of diamonds is approximately constant?
– Volume: V ~ x * y * z
qplot (carat, x * y * z, data = diamonds)
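The same check can be done numerically. This is a sketch under stated assumptions: x * y * z is only a bounding-box volume, and the factor of 2 used to flag suspect rows is an arbitrary threshold chosen for illustration.

```r
library(ggplot2)

d <- as.data.frame(diamonds)
d$volume <- d$x * d$y * d$z   # bounding-box volume in mm^3, not the true volume

# Rows with a zero dimension cannot be real measurements
n_zero <- sum(d$volume == 0)

# Carat per unit of bounding-box volume; roughly constant if the data are consistent
ok <- d$volume > 0
ratio <- d$carat[ok] / d$volume[ok]
print(summary(ratio))

# Flag diamonds whose ratio is more than a factor of 2 away from the median
# (the factor of 2 is an assumption, not a rule from the data source)
suspect <- sum(ratio > 2 * median(ratio) | ratio < median(ratio) / 2)
cat("zero-volume rows:", n_zero, " suspect rows:", suspect, "\n")
```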
Are there diamonds with a square table?
qplot (y, x, data = diamonds)
qplot (y, x, data = diamonds, xlim=c(4,5), ylim=c(4,5))
qplot (y, x, data = diamonds, xlim=c(4,5), ylim=c(4,5)) + geom_abline (colour='red')
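Points on the red line have x equal to y. A count complements the picture; the sketch below assumes "square" means equal length and width, and the 0.05 mm tolerance is a hypothetical threshold.

```r
library(ggplot2)

d <- as.data.frame(diamonds)

# Exactly equal length and width
n_equal <- sum(d$x == d$y)

# Nearly equal: within 0.05 mm (an arbitrary tolerance chosen for illustration)
tol <- 0.05
n_close <- sum(abs(d$x - d$y) <= tol)

cat("x == y:", n_equal, " |x - y| <= tol:", n_close, " of", nrow(d), "\n")
```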
Interpreting a scatterplot
• Big patterns
– form: linear, curved
– direction: increasing, decreasing
– strength: is there a lot of variation?
– groups: are there multiple groups?
• Small patterns
• Deviations
– outliers
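Direction, strength and form can also be summarised with correlations. This is a rough sketch, not a substitute for looking at the plot: it compares the Pearson correlation on the raw and log scales with the Spearman rank correlation for carat and price.

```r
library(ggplot2)

d <- as.data.frame(diamonds)

# Direction and strength: sign and size of the correlation
r_raw <- cor(d$carat, d$price)             # raw scale
r_log <- cor(log(d$carat), log(d$price))   # log-log scale

# Form: the rank correlation is unchanged by monotone transformations such as
# log, so a gap between it and the raw Pearson value hints at curvature
r_rank     <- cor(d$carat, d$price, method = "spearman")
r_rank_log <- cor(log(d$carat), log(d$price), method = "spearman")

print(round(c(raw = r_raw, log = r_log, rank = r_rank), 3))
```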
Your turn, experiment:
qplot (carat, price, data=diamonds)
qplot (log(carat), log(price), data=diamonds)
qplot (carat, price/carat, data=diamonds)
Facetting
# Row variables ~ column variables (. for none)
qplot (price, carat, data=diamonds, facets = . ~ color)
qplot (price, carat, data=diamonds, facets = color ~ clarity)
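The facets formula reads rows ~ columns. The sketch below writes the same two layouts with ggplot() and facet_grid(), and counts the panels each one produces from the factor levels. The plots are built but not drawn, since drawing all points takes a while.

```r
library(ggplot2)

# One row, one column per colour ("." means no row variable)
p_cols <- ggplot(diamonds, aes(price, carat)) +
  geom_point() +
  facet_grid(. ~ color)

# One row per colour, one column per clarity
p_grid <- ggplot(diamonds, aes(price, carat)) +
  geom_point() +
  facet_grid(color ~ clarity)

# Number of panels in each layout
n_cols <- nlevels(diamonds$color)
n_grid <- nlevels(diamonds$color) * nlevels(diamonds$clarity)
cat("panels:", n_cols, "and", n_grid, "\n")

# print(p_cols) or print(p_grid) draws the plot
```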
Your turn
• Try the examples
• Try other plots to answer your questions
Generate a subset of the data
set.seed (1234)
# to ensure replicability of sampling
dsmall <- diamonds [sample (nrow (diamonds), 100), ]
# sample 100 diamonds randomly
qplot (carat, price, data = dsmall, colour = color)
qplot (carat, price, data = dsmall, shape = cut)
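Setting the seed is what makes the sample repeatable. A small sketch, using a hypothetical helper named draw_sample, shows that the same seed returns the same rows:

```r
library(ggplot2)

# Hypothetical helper: draw n random rows after fixing the seed
draw_sample <- function(seed, n = 100) {
  set.seed(seed)   # fix the state of the random number generator
  diamonds[sample(nrow(diamonds), n), ]
}

a     <- draw_sample(1234)
b     <- draw_sample(1234)   # same seed: same rows
other <- draw_sample(4321)   # different seed: a different sample

print(identical(a, b))
print(identical(a$price, other$price))
```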
Map other variables to size or colour
qplot (carat, price, data=dsmall, colour=color)
qplot (carat, price, data=dsmall, size=carat)
qplot (carat, price, data=dsmall, shape=cut)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
qplot(carat, price, data = diamonds, geom = c("point", "smooth"))
Is price related to colour?
• But what if colour is also related to weight (carat)?
qplot (color, price / carat, data = diamonds, geom = "jitter")
qplot (color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/10))
qplot (color, price / carat, data = diamonds, geom = "boxplot")
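To see whether colour and weight go together, a table of medians by colour can sit next to the plots. This is a minimal sketch using base R tapply(); the column names of the summary table are made up for illustration.

```r
library(ggplot2)

d <- as.data.frame(diamonds)

# Median weight, price and price per carat within each colour grade
by_color <- data.frame(
  color              = levels(d$color),
  median_carat       = as.numeric(tapply(d$carat, d$color, median)),
  median_price       = as.numeric(tapply(d$price, d$color, median)),
  median_price_carat = as.numeric(tapply(d$price / d$carat, d$color, median))
)
print(by_color)
```

If median weight differs between colours, a raw comparison of price by colour mixes the two effects, which is why the plots above use price / carat.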
Your turn, experiment with binwidth
qplot (price, data=diamonds, geom="histogram")
qplot (price, data=diamonds, geom="histogram", binwidth=500)
qplot (price, data=diamonds, geom="histogram", binwidth=100)
qplot (price, data=diamonds, geom="histogram", binwidth=10)
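A rough way to judge a binwidth is to work out how many bars it produces and how many diamonds fall in each bar on average. The sketch below does this for the three binwidths used above; the counts are approximate because the histogram may align its bins differently.

```r
library(ggplot2)

binwidths <- c(500, 100, 10)
span <- diff(range(diamonds$price))   # price range covered by the histogram

# Approximate number of bars for each binwidth
n_bins <- ceiling(span / binwidths)

print(data.frame(
  binwidth           = binwidths,
  approx_bins        = n_bins,
  mean_count_per_bin = round(nrow(diamonds) / n_bins, 1)
))
```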
qplot (depth, data=diamonds, binwidth=1, xlim=c(58,68))
qplot (depth, data=diamonds, binwidth=0.1, xlim=c(58,68))
qplot (depth, data=diamonds, binwidth=0.1, xlim=c(58,68), fill=cut)
qplot (depth, data=diamonds, binwidth=1, xlim=c(58,68), facets = cut ~ .)
qplot (cut, depth, data=diamonds, geom="boxplot")
qplot(cut (carat, seq(0, 3, 0.2)), depth, data=diamonds, geom="boxplot")
More on R
R's homepage
• http://www.r-project.org
ggplot2 homepage
• http://had.co.nz/ggplot2
Statistical modelling and graphics
Exploring effects
• Modelling and graphing effect modification
• Multiple comparisons
Add variables to a dataset
diamonds$lcarat <- log10 (diamonds$carat)
diamonds$lprice <- log10 (diamonds$price)
qplot (lcarat, lprice, data=diamonds, colour=color)
De-trend
detrend <- lm (lprice ~ lcarat, data = diamonds)
diamonds$lprice2 <- resid (detrend)
mod <- lm (lprice2 ~ lcarat * color, data = diamonds)
qplot(lcarat, lprice2, data=diamonds, colour=color)
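De-trending leaves residuals that average zero and are uncorrelated with lcarat, so what remains in lprice2 is price variation not explained by weight alone. The sketch below checks this on a copy of the data (named d here, so the diamonds data frame is left untouched).

```r
library(ggplot2)

d <- as.data.frame(diamonds)
d$lcarat <- log10(d$carat)
d$lprice <- log10(d$price)

detrend <- lm(lprice ~ lcarat, data = d)
d$lprice2 <- resid(detrend)

# Least-squares residuals average zero and are uncorrelated with the predictor
mean_resid <- mean(d$lprice2)
cor_resid  <- cor(d$lprice2, d$lcarat)

# Slope of the removed trend on the log-log scale
slope <- unname(coef(detrend)["lcarat"])

print(c(mean = mean_resid, cor = cor_resid, slope = slope))
```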
Modelling effects
library(effects)
effectdf <- function(...) {
  # effect() output as a data frame, with warnings silenced
  suppressWarnings(as.data.frame(effect(...)))
}
color <- effectdf ("color", mod)
both2 <- effectdf ("lcarat:color", mod, default.levels = 3)
qplot (color, fit, data=color)+ geom_errorbar (aes(ymin=lower, ymax=upper))
qplot (color, fit, data=both2, colour=lcarat) + geom_line (aes(group=lcarat)) + geom_errorbar (aes (ymin=lower, ymax=upper))
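If the effects package is not available, a similar table of fitted values with confidence limits can be built with base R predict(). This is a substitute sketch, not a reproduction of effect(): the grid of three weights (quartiles of lcarat) and the name fits are assumptions, and predict() names the limits lwr and upr rather than lower and upper.

```r
library(ggplot2)

d <- as.data.frame(diamonds)
d$lcarat  <- log10(d$carat)
d$lprice2 <- resid(lm(log10(price) ~ lcarat, data = d))
mod <- lm(lprice2 ~ lcarat * color, data = d)

# Grid: three weights (quartiles of lcarat) crossed with every colour
lv <- levels(d$color)
grid <- expand.grid(
  lcarat = unname(quantile(d$lcarat, c(0.25, 0.5, 0.75))),
  color  = lv,
  stringsAsFactors = FALSE
)
grid$color <- factor(grid$color, levels = lv, ordered = is.ordered(d$color))

# Fitted values with 95% confidence limits for the mean response
pred <- predict(mod, newdata = grid, interval = "confidence")
fits <- cbind(grid, as.data.frame(pred))   # adds columns fit, lwr, upr

print(head(fits))

# To plot, for example:
# qplot(color, fit, data = fits, colour = lcarat) +
#   geom_line(aes(group = lcarat)) +
#   geom_errorbar(aes(ymin = lwr, ymax = upr))
```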