statistical graphics with - duke university
TRANSCRIPT
Statistical graphics withStatistical graphics withggplot2ggplot2
Statistical Computing &Statistical Computing &ProgrammingProgramming
Shawn SantoShawn Santo
1 / 611 / 61
Supplementary materials
Full video lecture available in Zoom Cloud Recordings
Additional resources
Chapter 3, R for Data Scienceggplot2 Referenceggplot2 cheat sheetcolor brewer 2
2 / 61
ggplot2
ggplot2 is a plotting system for R, based on the grammar of graphics
using the good parts of base and lattice
It takes care of many of the fiddly details that make plotting a hassle
such as drawing legends and facetingparticularly helpful for plotting multivariate data
Package ggplot2 is available in package tidyverse. Let's load that now.
library(tidyverse)
3 / 61
The Grammar of Graphics
Visualization concept created by Leland Wilkinson (1999)
to define the basic elements of a statistical graphic
Adapted for R by Wickham (2009)
consistent and compact syntax to describe statistical graphicshighly modular as it breaks up graphs into semantic components
It is not meant as a guide to which graph to use and how to best convey your data (moreon that later).
4 / 61
Today's data: MLB
Object teams is a data frame that contains yearly statistics and standings for MLB teamsfrom 2009 to 2018.
The data has 300 rows and 56 variables.
teams <- read_csv("http://www2.stat.duke.edu/~sms185/data/mlb/teams.csv")
5 / 61
teams
#> # A tibble: 300 x 56#> yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin#> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>#> 1 2009 NL ARI ARI W 5 162 81 70 92 N N #> 2 2009 NL ATL ATL E 3 162 81 86 76 N N #> 3 2009 AL BAL BAL E 5 162 81 64 98 N N #> 4 2009 AL BOS BOS E 2 162 81 95 67 N Y #> 5 2009 AL CHA CHW C 3 162 81 79 83 N N #> 6 2009 NL CHN CHC C 2 161 80 83 78 N N #> 7 2009 NL CIN CIN C 4 162 81 78 84 N N #> 8 2009 AL CLE CLE C 4 162 81 65 97 N N #> 9 2009 NL COL COL W 2 162 81 92 70 N Y #> 10 2009 AL DET DET C 2 163 81 86 77 N N #> # … with 290 more rows, and 44 more variables: LgWin <chr>, WSWin <chr>,#> # R <dbl>, AB <dbl>, H <dbl>, X2B <dbl>, X3B <dbl>, HR <dbl>, BB <dbl>,#> # SO <dbl>, SB <dbl>, CS <dbl>, HBP <dbl>, SF <dbl>, RA <dbl>, ER <dbl>,#> # ERA <dbl>, CG <dbl>, SHO <dbl>, SV <dbl>, IPouts <dbl>, HA <dbl>,#> # HRA <dbl>, BBA <dbl>, SOA <dbl>, E <dbl>, DP <dbl>, FP <dbl>, name <chr>,#> # park <chr>, attendance <dbl>, BPF <dbl>, PPF <dbl>, teamIDBR <chr>,#> # teamIDlahman45 <chr>, teamIDretro <chr>, TB <dbl>, WinPct <dbl>, rpg <dbl>,#> # hrpg <dbl>, tbpg <dbl>, kpg <dbl>, k2bb <dbl>, whip <dbl>
6 / 61
Plot comparisonPlot comparison
7 / 617 / 61
Using ggplot()
8 / 61
Using plot()
9 / 61
Code comparison
Using ggplot()
ggplot(teams, mapping = aes(x = R - RA, y = WinPct, color = DivWin)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Win Percentage", y = "Run Differential")
Using plot()
teams$RD <- teams$R - teams$RAteams_div <- teams[teams$DivWin == "Y", ]teams_no_div <- teams[teams$DivWin == "N", ]
mod1 <- lm(WinPct ~ RD, data = teams_div)mod2 <- lm(WinPct ~ RD, data = teams_no_div)
plot(x = (teams$R - teams$RA), y = teams$WinPct, col = adjustcolor(as.integer(factor(teams$DivWin))), pch = 16, xlab = "Run Differential", ylab = "Win Percentage")abline(mod1, col = 2, lwd=2)abline(mod2, col = 1, lwd=2)
10 / 61
What's in a What's in a ggplot()ggplot()??
11 / 6111 / 61
Terminology
A statistical graphic is a...
mapping of data
which may be statistically transformed (summarized, log-transformed, etc.)
to aesthetic attributes (color, size, xy-position, etc.)
using geometric objects (points, lines, bars, etc.)
and mapped onto a specific facet and coordinate system.
12 / 61
What do I "need"?
1) Some data (preferably in a data frame)
ggplot(data = teams)
13 / 61
2) A set of variable mappings
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W))
14 / 61
3) A geom with arguments, or multiple geoms with arguments connected by +
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) + geom_point(color = "blue")
15 / 61
4) Some options on changing scales or adding facets
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) + geom_point(color = "blue") + facet_wrap(~yearID, nrow = 2)
16 / 61
5) Some labels
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) + geom_point(color = "blue") + facet_wrap(~yearID, nrow = 2) + labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands")
17 / 61
6) Other options
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) + geom_point(color = "blue") + facet_wrap(~yearID, nrow = 2) + labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands") theme_bw(base_size = 16) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
18 / 61
Anatomy of a ggplot
ggplot( data = [dataframe],
aes( x = [var_x], y = [var_y], color = [var_for_color], fill = [var_for_fill], shape = [var_for_shape], size = [var_for_size], alpha = [var_for_alpha], ...#other aesthetics )) + geom_<some_geom>([geom_arguments]) + ... # other geoms scale_<some_axis>_<some_scale>() + facet_<some_facet>([formula]) + ... # other options
To visualize multivariate relationships we can add variables to our visualization byspecifying aesthetics: color, size, shape, linetype, alpha, or fill; we can also add facets basedon variable levels.
19 / 61
Scatter plotsScatter plots
20 / 6120 / 61
Base plot
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^2 )), y = WinPct)) + geom_point()
21 / 61
Altering aesthetic color
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^2 )), y = WinPct)) + geom_point(color = "#E81828")
22 / 61
Altering aesthetic color
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^2 )), y = WinPct, color = lgID)) + geom_point(show.legend = FALSE)
23 / 61
Altering aesthetic color
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^2 )), y = WinPct, color = lgID)) + geom_point()
24 / 61
Base plot
teams %>% filter(yearID == 2018) %>% ggplot(mapping = aes(x = BB + H, y = SO)) + geom_point()
25 / 61
Altering multiple aesthetics
teams %>% filter(yearID == 2018) %>% ggplot(mapping = aes(x = BB + H, y = SO)) + geom_point(size = 3, shape = 2, color = "#E81828")
26 / 61
Altering multiple aesthetics
teams %>% filter(yearID == 2018) %>% ggplot(mapping = aes(x = BB + H, y = SO, color = factor(Rank), shape = factor(Rank))) + geom_point(size = 4, alpha = .8, show.legend = FALSE)
27 / 61
Altering multiple aesthetics
teams %>% filter(yearID == 2018) %>% ggplot(mapping = aes(x = BB + H, y = SO, color = factor(Rank), shape = factor(Rank))) + geom_point(size = 4, alpha = .8)
28 / 61
Inside or outside aes()?
When does an aesthetic go inside function aes()?
If you want an aesthetic to be reflective of a variable's values, it must go inside aes.
If you want to set an aesthetic manually and not have it convey information about avariable, use the aesthetic's name outside of aes and set it to your desired value.
Aesthetics for continuous and discrete variables are measured on continuous and discretescales, respectively.
29 / 61
Faceting
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) + geom_point(alpha = .8) + facet_grid(lgID~ .)
30 / 61
Faceting
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) + geom_point(alpha = .8) + facet_grid(. ~lgID)
31 / 61
Faceting
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) + geom_point(alpha = .8) + facet_grid(divID~lgID)
32 / 61
Faceting
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) + geom_point(alpha = .8) + facet_wrap(~yearID)
33 / 61
Facet grid or wrap?
Use facet_wrap() to wrap a one dimensional sequence into two dimensionalpanels.
Use facet_grid() when you have two discrete variables and you want panels ofplots to represent all possible combinations.
34 / 61
Exercise
Let's explore the relationship between runs and strikeouts for division winners and non-division winners. Use tibble teams to re-create the plot below.
How can we improve this visualization if our goal is investigate the role strikeouts andruns play for division winners and non-division winners? 35 / 61
A more effective visualization
36 / 61
Other geomsOther geoms
37 / 6137 / 61
Caution
The following plots are not well-polished. They are designed to demonstrate the variousgeoms and options that exist within ggplot2.
You should always have a well-labelled and polished visualization if it will be seen byan outside audience.
38 / 61
Box plots
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) + geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7)
39 / 61
Box plots: flipped coordinates
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) + geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7) + coord_flip()
40 / 61
Box plots: custom colors
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg, fill = lgID)) + geom_boxplot(color = "grey", alpha = .7) + scale_fill_manual(values = c("#E81828", "#002D72")) + coord_flip() + theme_bw()
41 / 61
Bar plots
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) + geom_bar(stat = "identity")
42 / 61
Bar plots: angled text
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
43 / 61
Bar plots: sorted
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = forcats::fct_reorder(franchID, W, .desc = TRUE))) + geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .7) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
44 / 61
Bar plots: granular scale
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = forcats::fct_reorder(franchID, W, .desc = TRUE))) + geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .7) + scale_y_continuous(breaks = seq(0, 120, 15), labels = seq(0, 120, 15), limits = c(0, 120)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
45 / 61
Histograms
ggplot(teams, mapping = aes(x = WinPct)) + geom_histogram(binwidth = .025, fill = "#E81828", color = "#002D72", alpha = .7)
46 / 61
Density plots
ggplot(teams, mapping = aes(x = WinPct)) + geom_density(fill = "#E81828", color = "#002D72", alpha = .7)
47 / 61
Density plots: custom colors
ggplot(teams, mapping = aes(x = WinPct, fill = lgID)) + geom_density(alpha = .5) + scale_fill_manual(values = c("#E81828", "#002D72"))
48 / 61
Heat maps
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) + geom_raster()
49 / 61
Heat maps: color palette
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) + geom_raster() + scale_fill_gradientn(colours = terrain.colors(10))
50 / 61
Heat maps: color palette
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) + geom_raster() + scale_fill_gradient(low = "#fef0d9", high = "#b30000")
51 / 61
Effective visualization tips
Provide a title that tells a story.
Strive to have your visualization function in a closed environment.
Be mindful of color and scale choices.
Generally, color is better than shape to make things "pop".
Not everything has to have a color, shape, transparency, etc.
Add labels and annotation.
Use your visualization to support your story.
Use chunk options fig.width, fig.height, fig.align, and fig.show tomanipulate your plot's size and placement.
53 / 61
ExerciseExercise
56 / 6156 / 61
Energy data
energy <- read_csv("http://www2.stat.duke.edu/~sms185/data/energy/energy.csv")
energy
#> # A tibble: 105 x 6#> MWhperDay name type location note boe#> <dbl> <chr> <chr> <chr> <chr> <dbl>#> 1 3 Chernobyl Solar Solar Ukraine "On the site of the former… 0#> 2 637 Solarpark Meuro Solar Germany <NA> 55#> 3 920 Tesla's propos… Solar South Aust… "50,000 homes with solar p… 79#> 4 1280 Quaid-e-Azam Solar Pakistan "Named in honor of Quaid-e… 110#> 5 1760 Topaz Solar USA <NA> 152#> 6 2025 Agua Caliente Solar USA "Arizona" 175#> 7 2466 Kamuthi Solar India "\"150,000\" homes" 213#> 8 2720 Longyangxia Solar China <NA> 234#> 9 3840 Kurnool Solar India <NA> 331#> 10 4950 Tengger Desert Solar China "Covers 3.2% of the land a… 427#> # … with 95 more rows
57 / 61
Data dictionary
The power sources represent the amount of energy a power source generates each day asrepresented in daily MWh.
MWhperDay: MWh of energy generated per dayname: energy source nametype: type of energy sourcelocation: country of energy sourcenote: more details on energy sourceboe: barrel of oil equivalent
Daily megawatt hour (MWh) is a measure of energy output.1 MWh is, on average, enough power for 28 people in the USA
58 / 61
Objective
Re-create the plot on the following slide.
A few notes:
base font size is 18
hex colors: c("#9d8b7e", "#315a70", "#66344c", "#678b93","#b5cfe1", "#ffcccc")
59 / 61
60 / 61
References
1. Grolemund, G., & Wickham, H. (2021). R for Data Science. R4ds.had.co.nz.https://r4ds.had.co.nz/data-visualisation.html
2. https://ggplot2.tidyverse.org/reference/
3. Lahman, S. (2021) Lahman's Baseball Database, 1871-2018, Main page,http://www.seanlahman.com/baseball-archive/statistics/
4. https://www.visualcapitalist.com/worlds-largest-energy-sources/
61 / 61