Exploring data with R: introduction

Philippe Glaziou

Berlin, April 2009

Getting started

install.packages ("ggplot2")

# once per computer

library (ggplot2)

# every time you open R

What is R?

• A free implementation of a dialect of the S language (S-Plus is the commercial implementation)

• R is a high-level programming language with similarities to Python and Scheme

• R is slow and memory-hungry, but computers are fast and memory is cheap

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

What does R do?

• The same things as mainstream commercial statistical packages

• Data management

• Stochastic and deterministic modelling

• Publication-quality graphics

• And more:

– R can generate and solve Sudoku games

Diamonds

http://www.diamondse.info

Colour and Cut

> head (diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

54,000 diamonds

qplot (carat, data = diamonds)

qplot (carat, data = diamonds, binwidth=0.1)

qplot (carat, data = diamonds, binwidth=0.01)

What questions could be answered with these data?

Variables:

• price

• quality: carat, cut, clarity, colour

• geometry: x, y, z, depth, table

Does price vary with weight?

• How would you like to explore the relationship?

qplot (carat, price, data = diamonds)

qplot (log(carat), log(price), data = diamonds)
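
One common reason to look at both axes on a log scale: if price follows a power law in weight, price = a * carat^b, then log(price) = log(a) + b * log(carat), which is a straight line with slope b. The sketch below checks this on made-up numbers. The values a = 4000 and b = 1.7 are assumptions chosen for illustration, not estimates from the diamonds data.

```r
# Synthetic power-law data (a and b are illustrative assumptions)
a <- 4000
b <- 1.7
carat <- seq(0.2, 3, by = 0.1)
price <- a * carat^b

# On the log-log scale the relationship is linear,
# so a straight-line fit recovers log(a) and b
fit <- lm(log(price) ~ log(carat))
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])
```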

qplot (log(carat), log(price), data = diamonds, xlim=c(1,2))

qplot (log(carat), log(price), data=diamonds, colour=color)

qplot (log(carat), log(price), data=diamonds, fill=color, geom='hex', bins=250) + scale_fill_brewer (palette="Blues")

Data quality

• What simple checks could we do?

• Can we check whether the density (weight/volume) of diamonds is approximately constant?

– Volume: V ~ x * y * z
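
A numeric version of this check, sketched on the six rows printed by head (diamonds) so that it runs without any package. It assumes x, y and z are lengths in mm, so x * y * z is the volume of the bounding box rather than of the stone itself. The 20% threshold is an arbitrary assumption.

```r
# First six rows of diamonds, copied from the head() output
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  x     = c(3.95, 3.89, 4.05, 4.20, 4.34, 3.94),
  y     = c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96),
  z     = c(2.43, 2.31, 2.31, 2.63, 2.75, 2.48)
)

# Bounding-box volume and weight per unit of that volume
d$volume <- d$x * d$y * d$z
d$ratio  <- d$carat / d$volume

# If density is roughly constant, the ratio should barely vary.
# Flag rows more than 20% away from the median ratio (arbitrary cut-off).
d$suspect <- abs(d$ratio / median(d$ratio) - 1) > 0.2
```

On the full dataset, the same ratio could be plotted or summarised to spot rows with zero or implausible dimensions.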

qplot (carat, x * y * z, data = diamonds)

Are there diamonds with a square table?

qplot (y, x, data = diamonds)

qplot (y, x, data = diamonds, xlim=c(4,5), ylim=c(4,5))

qplot (y, x, data = diamonds, xlim=c(4,5), ylim=c(4,5)) + geom_abline (colour='red')
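
Points on the red line have x equal to y. The same question can be asked numerically by counting stones whose x and y differ by less than a tolerance. This sketch uses the six rows from head (diamonds). The tolerance of 0.025 is an assumption, not a value from the slides.

```r
# x and y for the first six diamonds, copied from the head() output
d <- data.frame(
  x = c(3.95, 3.89, 4.05, 4.20, 4.34, 3.94),
  y = c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96)
)

# Distance from the line y = x
d$diff <- abs(d$x - d$y)

# Call a stone "nearly square" when x and y differ by less than tol
tol <- 0.025  # assumed tolerance
d$nearly_square <- d$diff < tol
n_square <- sum(d$nearly_square)
```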

Interpreting a scatterplot

• Big patterns

– form: linear, curved

– direction: increasing, decreasing

– strength: is there a lot of variation?

– are there multiple groups?

• Small patterns

• Deviations

– outliers

Your turn, experiment:

qplot (carat, price, data=diamonds)

qplot (log(carat), log(price), data=diamonds)

qplot (carat, price/carat, data=diamonds)

Facetting

# Row variables ~ column variables (. for none)

qplot (price, carat, data=diamonds, facets = . ~ color)

qplot (price, carat, data=diamonds, facets = color ~ clarity)
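
For reference, a sketch of the second plot written with ggplot () and facet_grid (), which takes the same rows ~ columns formula. It assumes ggplot2 is installed. The panel count at the end only illustrates how the grid is laid out.

```r
library(ggplot2)

# Rows of panels follow color, columns follow clarity
p <- ggplot(diamonds, aes(price, carat)) +
  geom_point() +
  facet_grid(color ~ clarity)

# The grid has one panel per combination of factor levels
n_panels <- nlevels(diamonds$color) * nlevels(diamonds$clarity)
```

Printing p draws the grid. With this many panels, each one is small, so a single faceting variable is often easier to read.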

qplot (price, carat, data=diamonds, facets = color ~ clarity)

Your turn

• Try the examples

• Try other plots to answer your questions

Generate a subset of the data

set.seed (1234)

# to ensure replicability of sampling

dsmall <- diamonds [sample (nrow (diamonds), 100), ]

# sample 100 diamonds randomly
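
What set.seed () buys you, shown on a small example: resetting the seed before sampling reproduces exactly the same draw. The population size of 1000 and sample size of 5 are arbitrary choices for illustration.

```r
# Same seed, same draw
set.seed(1234)
first <- sample(1000, 5)

set.seed(1234)
second <- sample(1000, 5)

# Without resetting the seed, sampling continues along the random stream
third <- sample(1000, 5)

same <- identical(first, second)
```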

qplot (carat, price, data = dsmall, colour = color)

qplot (carat, price, data = dsmall, shape = cut)

Map other variables to size or colour

qplot (carat, price, data=dsmall, colour=color)

qplot (carat, price, data=dsmall, size=carat)

qplot (carat, price, data=dsmall, shape=cut)

qplot(carat, price, data = dsmall, geom = c("point", "smooth"))

qplot(carat, price, data = diamonds, geom = c("point", "smooth"))

Is price related to colour?

• But what if colour is also related to weight (carat)?
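
One simple way to take weight into account is to compare price per carat across colours, alongside the typical weight in each colour. The sketch below does this on the six rows from head (diamonds). Six rows are far too few to conclude anything; the point is only the mechanics.

```r
# carat, color and price for the first six diamonds (from the head() output)
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  color = c("E", "E", "E", "I", "J", "J"),
  price = c(326, 326, 327, 334, 335, 336)
)

# Price per carat removes the most direct effect of weight
d$ppc <- d$price / d$carat

# Median price per carat and median weight within each colour
med_ppc   <- tapply(d$ppc, d$color, median)
med_carat <- tapply(d$carat, d$color, median)
```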

qplot (color, price / carat, data = diamonds, geom = "jitter")

qplot (color, price / carat, data = diamonds, geom = "jitter", alpha=1/10)

qplot (color, price / carat, data = diamonds, geom = "boxplot")

Your turn, experiment with binwidth

qplot (price, data=diamonds, geom="histogram")

qplot (price, data=diamonds, geom="histogram", binwidth=500)

qplot (price, data=diamonds, geom="histogram", binwidth=100)

qplot (price, data=diamonds, geom="histogram", binwidth=10)
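
To see what binwidth changes without drawing anything, the bins can be counted by hand. This is a sketch in base R on synthetic prices; the log-normal parameters are assumptions standing in for the real data, and bin_counts is a hypothetical helper, not a ggplot2 function.

```r
# Synthetic, right-skewed "prices" (stand-in for diamonds$price)
set.seed(1)
price <- round(rlnorm(1000, meanlog = 8, sdlog = 1))

# Count observations in left-closed bins of a given width
bin_counts <- function(x, binwidth) {
  lo <- floor(min(x) / binwidth) * binwidth
  hi <- (floor(max(x) / binwidth) + 1) * binwidth
  breaks <- seq(lo, hi, by = binwidth)
  table(cut(x, breaks, right = FALSE))
}

wide   <- bin_counts(price, 500)  # few bins, high counts, smooth shape
narrow <- bin_counts(price, 10)   # many bins, low counts, fine detail
```

Wide bins hide small features such as spikes at round values; narrow bins reveal them at the cost of a noisier picture.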

qplot (depth, data=diamonds, binwidth=1, xlim=c(58,68))

qplot (depth, data=diamonds, binwidth=0.1, xlim=c(58,68))

qplot (depth, data=diamonds, binwidth=0.1, xlim=c(58,68), fill=cut)

qplot (depth, data=diamonds, binwidth=1, xlim=c(58,68), facets = cut ~ .)

qplot (cut, depth, data=diamonds, geom="boxplot")

qplot(cut (carat, seq(0, 3, 0.2)), depth, data=diamonds, geom="boxplot")
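
Here cut () is the base R function that turns a numeric vector into a factor of intervals, not the cut column of diamonds. A small sketch of what it returns, using the six carat values from head (diamonds) plus a hypothetical 3.5-carat stone:

```r
# Carat values of the first six diamonds (from the head() output)
carat <- c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24)

# Break points every 0.2 carat from 0 to 3
breaks <- seq(0, 3, 0.2)

# Each value is assigned to a right-closed interval such as (0.2,0.4]
bins <- cut(carat, breaks)
n_levels <- nlevels(bins)

# A value beyond the last break gets no interval and becomes NA
outside <- cut(3.5, breaks)
```

Because of that last point, stones heavier than 3 carats do not fall into any of the weight intervals in the boxplot above.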

More on R

R's homepage

• http://www.r-project.org

ggplot2 homepage

• http://had.co.nz/ggplot2

Statistical modelling and graphics

Exploring effects

• Modelling and graphing effect modification

• Multiple comparisons

Add variables to a dataset

diamonds$lcarat <- log10 (diamonds$carat)

diamonds$lprice <- log10 (diamonds$price)

qplot (lcarat, lprice, data=diamonds, colour=color)

De-trend

detrend <- lm (lprice ~ lcarat, data = diamonds)

diamonds$lprice2 <- resid (detrend)

mod <- lm (lprice2 ~ lcarat * color, data = diamonds)
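
Why the residuals count as de-trended: least-squares residuals from a model with an intercept average to zero and have zero linear correlation with the predictor. The sketch below checks this on synthetic data. The intercept 3.7, slope 1.7 and noise level are assumptions, not estimates from diamonds.

```r
# Synthetic log10 weight and log10 price with a linear trend plus noise
set.seed(42)
n <- 200
lcarat <- runif(n, -0.7, 0.5)
lprice <- 3.7 + 1.7 * lcarat + rnorm(n, sd = 0.1)

# Remove the linear trend, keep what is left over
detrend <- lm(lprice ~ lcarat)
lprice2 <- resid(detrend)

# Both quantities are zero up to rounding error
mean_resid <- mean(lprice2)
cor_resid  <- cor(lprice2, lcarat)
```

What remains in lprice2 is the variation that weight alone does not explain, which is what the second model then relates to colour.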

qplot(lcarat, lprice2, data=diamonds, colour=color)

Modelling effects

library(effects)

effectdf <- function(...) {
  suppressWarnings(as.data.frame(effect(...)))
}

color <- effectdf ("color", mod)

both2 <- effectdf ("lcarat:color", mod, default.levels = 3)
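
A rough base-R analogue of the kind of table these calls produce (a fitted value with lower and upper bounds per colour), using predict () with confidence intervals at the mean weight. This is only a sketch on synthetic data; it is not how effect () works internally, and the colour shifts and noise level are assumptions.

```r
# Synthetic de-trended log prices with a small shift per colour
set.seed(7)
color  <- factor(rep(c("D", "G", "J"), each = 100))
lcarat <- runif(300, -0.7, 0.5)
shift  <- c(D = 0.05, G = 0, J = -0.08)  # assumed colour effects
lprice2 <- unname(shift[as.character(color)]) + rnorm(300, sd = 0.05)

mod <- lm(lprice2 ~ lcarat * color)

# Predict for each colour at the mean weight, with 95% confidence bounds
newdata <- data.frame(
  color  = factor(levels(color), levels = levels(color)),
  lcarat = mean(lcarat)
)
pred <- predict(mod, newdata, interval = "confidence")

color_fit <- data.frame(
  color = newdata$color,
  fit   = pred[, "fit"],
  lower = pred[, "lwr"],
  upper = pred[, "upr"]
)
```

A data frame shaped like color_fit can be passed to qplot () with geom_errorbar () in the same way as the output of effectdf ().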

qplot (color, fit, data=color) + geom_errorbar (aes(ymin=lower, ymax=upper))

qplot (color, fit, data=both2, colour=lcarat) + geom_line (aes(group=lcarat)) + geom_errorbar (aes (ymin=lower, ymax=upper))