an introduction to ggplot2
TRANSCRIPT
An Introduction to ggplot2
Laurel Stell
May 1, 2019
Introduction
To run the examples yourself
1 http://web.stanford.edu/˜lstell/2 Links at bottom of page to R Markdown file and slides
May also need to install the following packages:
library(ggplot2)library(gridExtra)library(plyr)library(reshape2)
Objectives
How to make some common types of plots with ggplot2Demonstrate the paradigm for advanced customizationsTips on learning more
Prerequisites
Basic knowledge of R:
Factors, data frames, etcInstalling and loading packagesBase graphics functions such as plot
Note: ggplot2 is based on grid package. Do not mix with base graphicssuch as par(), split.screen(), axis(), legend().
First Steps
The paradigm
ggplot2 is part of tidyverse by Hadley WickhamSyntax is based on The Grammar of Graphics by Leland Wilkinson
$140 new on AmazonAvailable online for Stanford affiliates at https://ebookcentral.proquest.com/lib/stanford-ebooks/detail.action?docID=302755But I’ve never looked at this book.
Basically, build sentences:Call ggplot() to specify dataAdd “layers” with geom_point(), geom_histogram(), etcAdd customizations
Example: simple scatter plot
ggplot(data = mpg) +geom_point(mapping = aes(x = displ, y = hwy))
20
30
40
2 3 4 5 6 7
displ
hwy
Basic plot specification
First argument to ggplot() is a data framempg is actually a tibble (tbl_df)Tibbles are part of tidyverseTibbles inherit from data.frame
Columns in data frame referred to simply by name in remainder ofggplot2 sentenceAesthetics use data:
x and y axesColor, shape, size, etcGroupingEtc
Jitter
Some points on top of each other:
nrow(mpg)
## [1] 234
nrow(unique(mpg[ , c("displ","hwy") ]))
## [1] 126
Jitter (continued)
ggplot(data = mpg) +geom_jitter(mapping = aes(x = displ, y = hwy))
20
30
40
2 3 4 5 6 7
displ
hwy
Custom color
ggplot(data = mpg) +geom_jitter(mapping = aes(x = displ, y = hwy),
color="orange") # Outside aes()
20
30
40
2 3 4 5 6 7
displ
hwy
Color by type of drivetrain
ggplot(mpg) +geom_jitter(aes(x = displ, y = hwy,
color = drv)) # Inside aes()
20
30
40
2 3 4 5 6 7
displ
hwy
drv
4
f
r
Marker shape indicates year of manufacture
ggplot(mpg) +geom_jitter(aes(x = displ, y = hwy,
color = drv, shape = year))
Error: A continuous variable can not be mapped to shape
class(mpg$year)
## [1] "integer"
R thinks integers are continuous.
Turn year into a factor
table(mpg$year)
#### 1999 2008## 117 117
mpg.mod <- transform(mpg, year = as.factor(year))class(mpg.mod$year)
## [1] "factor"
Marker shape indicates year of manufacture (CORRECT)
ggplot(mpg.mod) +geom_jitter(aes(x = displ, y = hwy,
color = drv, shape = year))
20
30
40
2 3 4 5 6 7
displ
hwy
year
1999
2008
drv
4
f
r
Multiple Layers
What if we want more than one layer?
For example, add smoother to scatter plot:
ggplot(...) + geom_point(aes(...)) + geom_smooth(aes(...))
Specifying same aesthetics in each geom call is hard to read and proneto errorInstead:
Specify shared aesthetics in call to ggplot()When creating a layer, add aesthetics specific to that layerCan also override aesthetics with inherit.aes in geom call
Single smoother for all data
ggplot(mpg.mod,aes(x = displ, y = hwy)) +
geom_jitter(aes(group = drv, color = drv, shape = year)) +geom_smooth(method = "loess", color = "black")
20
30
40
2 3 4 5 6 7
displ
hwy
year
1999
2008
drv
4
f
r
All aesthetics shared
h.basic <- ggplot(mpg.mod,aes(x = displ, y = hwy,
color = drv, group = drv,shape = year)) +
geom_jitter()h.basic + geom_smooth(method="loess")
20
30
40
2 3 4 5 6 7
displ
hwy
year
1999
2008
drv
4
f
r
Indicate number of observations in each group
plyr package (by H. Wickham) makes it easy to compute group sizes:
(mpg.N <- ddply(mpg, .(drv), nrow))
## drv V1## 1 4 103## 2 f 106## 3 r 25
Also specify locations and prettify labels:
mpg.N <- cbind(mpg.N, x = c(2.2, 3, 6), y = c(15, 35, 31))mpg.N <- transform(mpg.N, V1 = paste("N =", V1))
Using inherit.aes
h.basic +geom_label(data = mpg.N,
aes(x = x, y = y, label = V1, color = drv),inherit.aes = F, show.legend = F) +
guides(color = guide_legend(override.aes = list(shape = 15)))
N = 103
N = 106
N = 25
20
30
40
2 3 4 5 6 7
displ
hwy
year
1999
2008
drv
4
f
r
Without show.legend argument, color legend shows “a” instead of shape.
Alternative approach
tmp <- cbind(ddply(mpg, .(drv), nrow),displ = c(2.2, 3, 6), hwy = c(15, 35, 31))
h.basic + geom_label(data = tmp, aes(label = V1))
Error in FUN(X[[i]], ...) : object 'year' not found
Shared aesthetic “shape = year” must be available even though geom_labeldoesn’t understand shape aesthetic.
Alternative approach (CORRECT)
tmp <- cbind(ddply(mpg, .(drv), nrow),displ = c(2.2, 3, 6), hwy = c(15, 35, 31))
ggplot(mpg.mod,aes(x = displ, y = hwy, color = drv, group = drv)) +
geom_jitter(aes(shape = year)) +geom_label(data = tmp, aes(label = V1), show.legend = F)
103
106
25
20
30
40
2 3 4 5 6 7
displ
hwy
year
1999
2008
drv
4
f
r
Some other layer types
geom_histogram, geom_densitygeom_bar, geom_boxplotgeom_linegeom_ablinegeom_errorbar
To discover more:
Cheat sheetSearch R help for “geom_”Google
Documentation for a geom function
Explains argumentsExplains aesthetics; boldface ⇒ requiredExamplesCombines similar functions on same page (eg geom_label andgeom_text)Available via:
Your favorite R help interfacehttps://ggplot2.tidyverse.org/
Appears in Google searchesShows results of examples
Additional Customization
Change axis titles
(h.basic <- h.basic +labs(x = "Displacement", y = "Highway MPG"))
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
G
year
1999
2008
drv
4
f
r
Change group colors
clr.drive <- c(f = "orchid1",r = "lightskyblue","4" = "plum4")
If you do not explicitly specify factor levels in vector, colors will bemapped by order.Can also use color hex codes (eg “#8c1515”)
Change group colors (continued)
h.basic + scale_color_manual(values = clr.drive)
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
Gyear
1999
2008
drv
4
f
r
Change legend labels
lbl.drive <- c(f = "Front wheel",r = "Rear wheel","4" = "All-wheel")
Now it’s VERY IMPORTANT to get the mapping correct!
Change legend labels (continued)
(h.basic <- h.basic +labs(shape = "Year") +scale_color_manual(name = "Drivetrain",
values = clr.drive,labels = lbl.drive))
10
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
G
Year
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Change background
h.basic + theme_bw()
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
GYear
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Change grid
h.basic + theme_bw() +theme(panel.grid.minor.x = element_blank())
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
GYear
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
theme() must be called after theme_bw()
Use theme_bw() for future plots in R session
theme.default <- theme_set(theme_bw())h.basic + theme(panel.grid.minor.x = element_blank())
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
G
Year
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Multiple Plots in a Figure
FacetsSplit plot by number of cylinders
h.basic + facet_wrap(vars(cyl))
6 8
4 5
2 3 4 5 6 7 2 3 4 5 6 7
20
30
40
20
30
40
Displacement
Hig
hway
MP
G
Year
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Changing facet panel strip labels
lbl.cyl <- paste(unique(mpg.mod$cyl), "cylinders")names(lbl.cyl) <- unique(mpg.mod$cyl) # REQUIREDh.basic +
facet_wrap(vars(cyl), nrow = 1,labeller = as_labeller(lbl.cyl))
4 cylinders 5 cylinders 6 cylinders 8 cylinders
2 3 4 5 6 7 2 3 4 5 6 7 2 3 4 5 6 7 2 3 4 5 6 710
20
30
40
Displacement
Hig
hway
MP
G
Year
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Changing facet panel strip labels (ALTERNATIVE)
Labeller can also be a function:
lbl.cyl <- function(string) paste(string, "cylinders")# Same facet_wrap() call as preceding slide
Using grid layout
# HOMEWORK: Define h.city with mpg$cty on y-axis.grid.arrange(h.basic + theme(legend.position="none"),
h.city,nrow=1, widths = c(1, 1.55))
10
20
30
40
2 3 4 5 6 7
Displacement
Hig
hway
MP
G
10
15
20
25
30
35
2 3 4 5 6 7
Displacement
City
MP
G
Year
1999
2008
Drivetrain
All−wheel
Front wheel
Rear wheel
Using list of ggplot objects with grid.arrange
h <- list()for (y in yvars) {
htmp <- ggplot(...) + ...h <- append(h, list(htmp))
}do.call(grid.arrange, append(h, list(...)))
# grid.arrange optional args in ...
Revisiting facets
Using grid.arrange required fiddling to get widths correct.Couldn’t use facets because ggplot2 uses a single column for y.melt() in reshape2 package (by H. Wickham) makes it easy tocombine M columns in data frame into single column
Result has M times as many rowsThe M columns are replaced by 2:
Original column nameValue in that column
Do not combine factors with continuous values
Example: Using melt
tmp <- mpg.mod[ , c("displ", "drv", "year", "hwy", "cty") ]tmp <- melt(tmp, measure.vars = c("hwy", "cty"))names(tmp)
## [1] "displ" "drv" "year" "variable" "value"
summary(tmp$variable)
## hwy cty## 234 234
tmp$value contains values previously in columns hwy and cty
Example: Using facets after melt
lbls <- c(hwy="Highway", cty="City")ggplot(tmp,
aes(x = displ, y = value,color = drv, group = drv, shape = year)) +
geom_jitter() +facet_wrap(vars(variable), nrow = 1,
labeller = as_labeller(lbls))
Highway City
2 3 4 5 6 7 2 3 4 5 6 7
10
20
30
40
displ
valu
e
year
1999
2008
drv
4
f
r
Allowing different y-axis limits
ggplot(tmp,aes(x = displ, y = value,
color = drv, group = drv, shape = year)) +geom_jitter() +facet_wrap(vars(variable), nrow = 1, scales = "free_y",
labeller = as_labeller(lbls))
Highway City
2 3 4 5 6 7 2 3 4 5 6 7
10
15
20
25
30
35
20
30
40
displ
valu
e
year
1999
2008
drv
4
f
r
Carrying On
Some other ways to get started
https://ggplot2.tidyverse.org/ has links to:Cheat sheetWickham’s recommended places to startTips for getting helphttps://ggplot2.tidyverse.org/reference/index.html
Google “ggplot2 tutorial”; some top hits:http://r-statistics.co/ggplot2-Tutorial-With-R.htmlhttps://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
Becoming an expert
Function documentationGoogle (eg “ggplot2 strip color”); best information usually attidyverse.org or Stack OverflowFurther explore the tidyverse, particularly:
dplyr instead of plyrtidyr instead of reshape2
Often multiple ways to achieve an effect