Session 5 - bootstrappers.umassmed.edu



Session 5

Nick Hathaway; [email protected]

Contents

Adding Text To Plots
    Line graph
    Bar graph
Dividing Data into Quantiles
Part 1 Exercises
RMarkdown
    R Code Chunks

Adding Text To Plots

Line graph

Reading and processing data

library(tidyverse)
#
ts_longFormat = read_tsv("time.series.data.txt") %>%
  rename(gene = X1) %>%
  gather(Condition, expression, 2:ncol(.)) %>%
  separate(Condition, c("exposure", "time")) %>%
  mutate(time = as.numeric(gsub("h", "", time)))

# you can also use the %in% operator that R offers
ts_longFormat_SOD2_CD74 = ts_longFormat %>%
  filter(gene %in% c("SOD2", "CD74"))

# create a grouping variable to make plotting easier
ts_longFormat_SOD2_CD74 = ts_longFormat_SOD2_CD74 %>%
  mutate(grouping = paste0(gene, "-", exposure))

Here is a plot from the last Session

# using group = grouping to separate out the different genes and the exposure but still color by exposure
geneLinetypes = c("dotted", "solid")
names(geneLinetypes) = c("CD74", "SOD2")
# make the points larger, the value given to size is a relative number
ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, color = exposure, group = grouping)) +
  geom_point(aes(shape = gene), size = 3) +
  geom_line(aes(linetype = gene)) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(1, 3)) +
  scale_linetype_manual(values = geneLinetypes)

[Figure: line graph of expression (y) versus time (x) for CD74 and SOD2; color indicates exposure (Ctrl, Ifnb, Lps, R848), point shape and line type indicate gene.]

Now imagine we want to add text to each line so we can have a label for what each line represents. This could be accomplished in several ways; one way is to first create a data frame that holds the data point at the max time point for each grouping variable.

ts_longFormat_SOD2_CD74_summary = ts_longFormat_SOD2_CD74 %>%
  filter("Ctrl" != exposure) %>%
  group_by(grouping) %>%
  mutate(maxTime = max(time)) %>%
  filter(time == maxTime)

Once we have this data frame, we can use it to add a label at the furthest time point, which will be at the end of each line. This can be done by utilizing the fact that when adding geom_[LAYER] layers we can assign a new data frame for the layer to base its layout on by using data=.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, color = exposure, group = grouping)) +
  geom_point(aes(shape = gene), size = 3) +
  geom_line(aes(linetype = gene)) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(1, 3)) +
  scale_linetype_manual(values = geneLinetypes) +
  geom_text(aes(label = grouping), data = ts_longFormat_SOD2_CD74_summary)

[Figure: the same line graph with the labels CD74-Lps, SOD2-Lps, CD74-R848, SOD2-R848, CD74-Ifnb, and SOD2-Ifnb centered on the last point of each line.]
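The same idea can be sketched on a small made-up data set. The toy data frame, its column values, and the names toy and toy_last are assumptions for illustration, not the session's data: we build a summary with one row per line, then hand it to geom_text through data= while the other layers keep using the full data frame.

```r
library(dplyr)
library(ggplot2)

# toy data (hypothetical values): two lines measured at three time points
toy = tibble(
  grouping   = rep(c("A-Lps", "B-Lps"), each = 3),
  time       = rep(c(0, 1, 2), times = 2),
  expression = c(1, 4, 9, 2, 3, 5)
)

# keep only the row at the last time point of each line
toy_last = toy %>%
  group_by(grouping) %>%
  filter(time == max(time)) %>%
  ungroup()

# the points and lines use `toy`; only the text layer uses `toy_last`
p = ggplot(toy, aes(x = time, y = expression, group = grouping)) +
  geom_point() +
  geom_line() +
  geom_text(aes(label = grouping), data = toy_last)
```

Printing p draws the plot; toy_last has exactly one row per line, so each line gets exactly one label.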

Here we used the geom_text layer, which adds text. It needs an x and a y (which were set in the top ggplot aes) and a label variable for what it’s going to add as text to the plot. Notice how the text is centered on the last point, so we can’t see it very well. To change the text alignment we use hjust=: 0 = start at the x,y coordinates, 0.5 (default) = center on the x,y coordinates, and 1 = end at the x,y coordinates.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, color = exposure, group = grouping)) +
  geom_point(aes(shape = gene), size = 3) +
  geom_line(aes(linetype = gene)) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(1, 3)) +
  scale_linetype_manual(values = geneLinetypes) +
  geom_text(aes(label = grouping), hjust = 0, data = ts_longFormat_SOD2_CD74_summary)

[Figure: the same line graph with each label starting at the last point of its line.]
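To compare the three hjust values side by side, here is a small sketch on made-up data (the anchors data frame is hypothetical). It maps hjust inside aes() so each label gets its own alignment relative to the same x position, which is marked with a red point.

```r
library(ggplot2)

# one anchor point per row (hypothetical data): same x, different y
anchors = data.frame(
  x     = 1,
  y     = c(3, 2, 1),
  label = c("hjust = 0", "hjust = 0.5", "hjust = 1"),
  hjust = c(0, 0.5, 1)
)

# mapping hjust inside aes() lets each label use its own alignment
p = ggplot(anchors, aes(x = x, y = y)) +
  geom_point(color = "red") +
  geom_text(aes(label = label, hjust = hjust)) +
  xlim(0, 2)
```

In the drawn plot the top label starts at the red point, the middle one is centered on it, and the bottom one ends at it.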

So now the text starts at the point, but it’s still over the point, so let’s nudge it a little to the right by using nudge_x= to move it over 1 x-axis unit.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, color = exposure, group = grouping)) +
  geom_point(aes(shape = gene), size = 3) +
  geom_line(aes(linetype = gene)) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(1, 3)) +
  scale_linetype_manual(values = geneLinetypes) +
  geom_text(aes(label = grouping), nudge_x = 1, hjust = 0, data = ts_longFormat_SOD2_CD74_summary)

[Figure: the same line graph with the labels nudged 1 x-axis unit to the right.]

Unfortunately, ggplot doesn’t take the text into account when determining the axis limits, so we have to change them to be able to see the text.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, color = exposure, group = grouping)) +
  geom_point(aes(shape = gene), size = 3) +
  geom_line(aes(linetype = gene)) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(1, 3)) +
  scale_linetype_manual(values = geneLinetypes) +
  geom_text(aes(label = grouping), nudge_x = 1, hjust = 0, data = ts_longFormat_SOD2_CD74_summary) +
  xlim(0, 30)

[Figure: the same line graph with the x-axis running from 0 to 30, leaving room for the labels.]
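As a possible alternative (a sketch, not what this session uses): xlim() sets both ends of the axis and drops any data that falls outside them, while ggplot2's expand_limits() only makes sure that a given value is included in the axis range. The toy data and the label text below are made up for illustration.

```r
library(ggplot2)

# hypothetical data: one line with three time points
toy = data.frame(time = c(0, 12, 24), expression = c(1, 5, 3))

# the last point of the line, with the text we want to show next to it
last_point = toy[toy$time == max(toy$time), ]
last_point$label = "toy-line"

p = ggplot(toy, aes(x = time, y = expression)) +
  geom_line() +
  geom_text(aes(label = label), data = last_point, hjust = 0, nudge_x = 1) +
  expand_limits(x = 30)   # make sure x = 30 is inside the axis range
```

With expand_limits() only the upper end is stated; the lower end still comes from the data.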

Bar graph

Now let’s try adding text to a barplot. Let’s define the time points as factors so we don’t have so much space between each bar.

ts_longFormat_SOD2_CD74 = read_tsv("time.series.data.txt") %>%
  rename(gene = X1) %>%
  gather(Condition, expression, 2:ncol(.)) %>%
  separate(Condition, c("exposure", "time")) %>%
  filter(gene %in% c("SOD2", "CD74")) %>%
  mutate(time = factor(time, levels = c("0h", "1h", "2h", "4h", "6h", "12h", "24h")))

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2")

[Figure: dodged bar graph of expression by time (0h to 24h), filled by gene (CD74, SOD2).]
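The levels = argument in the factor() call above is what keeps the bars in chronological order. A small sketch with a made-up vector of time points shows what would happen without it:

```r
times = c("0h", "1h", "2h", "12h", "24h")

# default levels are sorted as text, which puts "12h" before "1h"
default_levels = levels(factor(times))

# explicit levels keep the time points in chronological order
ordered_levels = levels(factor(times, levels = c("0h", "1h", "2h", "12h", "24h")))
```

Since ggplot places the bars of a factor in the order of its levels, the explicit ordering carries over to the x-axis.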

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = expression))

[Figure: the same bar graph with the full, unrounded expression values (for example 12037.00875 and 209.537083333333) printed as overlapping labels.]

These numbers are quite long, so let’s change them to show only 3 significant figures by using the signif() function.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = signif(expression, 3)))

[Figure: the same bar graph with labels rounded to 3 significant figures (for example 12000 and 210); the labels still overlap.]
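To see what signif() does on its own, here is a short sketch using three of the values that appeared as labels above. It is contrasted with round(), which counts decimal places rather than significant figures.

```r
values = c(12037.00875, 209.537083333333, 1310.058)

# keep 3 significant figures: 12000, 210, 1310
sig = signif(values, 3)

# keep 1 decimal place instead: 12037.0, 209.5, 1310.1
rounded = round(values, 1)
```

signif() shortens large and small numbers to the same number of digits, which is why it suits labels that span several orders of magnitude.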

Also, look how the numbers are all over the place. What’s happening? Well, we still haven’t taken into account the different exposures. We could handle this in a couple of ways, but let’s take advantage of the facet_wrap function in ggplot. By using the ~ symbol we tell facet_wrap which columns to use to create separate panels.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = signif(expression, 3))) +
  facet_wrap(~exposure)

[Figure: the bar graph split into one panel per exposure (Ctrl, Ifnb, Lps, R848), all panels sharing the same axis limits.]

Notice how the limits for each axis are the same across all panels. We can change this by setting scales= to "free" (different limits for each panel), "free_x" (different for just x), or "free_y" (different for just y).

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = signif(expression, 3)), position = "dodge") +
  facet_wrap(~exposure, scales = "free_x")

[Figure: the faceted bar graph with scales = "free_x", so each exposure panel has its own x-axis.]

Because the barplot is dodged, we have to dodge the geom_text as well; this is done with the position_dodge function rather than just "dodge".

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene, group = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = signif(expression, 3)), position = position_dodge(width = 0.9)) +
  facet_wrap(~exposure, scales = "free_x")

[Figure: the faceted bar graph with each label positioned over its own bar.]
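Why width = 0.9? Bars are drawn 0.9 units wide by default, and text has no width of its own, so position_dodge() has to be told how wide the dodged group is. The sketch below uses made-up data and geom_col() (which is the same as geom_bar(stat = "identity")); giving the bars and the text the same dodge width puts each label at the center of its bar.

```r
library(ggplot2)

# hypothetical data: two genes at two time points
toy = data.frame(
  time       = rep(c("0h", "1h"), each = 2),
  gene       = rep(c("CD74", "SOD2"), times = 2),
  expression = c(10, 2, 12, 5)
)

p = ggplot(toy, aes(x = time, y = expression, fill = gene, group = gene)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = expression), vjust = -0.5,
            position = position_dodge(width = 0.9))

# x positions of the bar centers and of the labels after dodging
bar_x  = sort(as.numeric(layer_data(p, 1)$x))
text_x = sort(as.numeric(layer_data(p, 2)$x))
```

If the bars were given a different width (for example width = 0.5 in geom_col), the dodge width for the text would need to change to match.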

Also let’s raise the labels a bit above the bars by adding 1000 to the y

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(y = expression + 1000, label = signif(expression, 3)), position = position_dodge(width = 0.9)) +
  facet_wrap(~exposure, scales = "free_x")

[Figure: the faceted bar graph with the labels raised above the bars; the y-axis now extends to 20000.]

Also let’s angle the text by setting angle = 45 to put the text at a slant to fit a bit better

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(y = expression + 1000, label = signif(expression, 3)), angle = 45, position = position_dodge(width = 0.9)) +
  facet_wrap(~exposure, scales = "free_x")

[Figure: the faceted bar graph with the labels drawn at a 45 degree angle.]

We can also give the plot a bit more room by putting the legend on the bottom.

ggplot(ts_longFormat_SOD2_CD74, aes(x = time, y = expression, fill = gene)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(y = expression + 1000, label = signif(expression, 3)), angle = 45, position = position_dodge(width = 0.9)) +
  facet_wrap(~exposure, scales = "free_x") +
  theme(legend.position = "bottom")

[Figure: the faceted bar graph with the gene legend (CD74, SOD2) placed below the panels.]

Another Example

Here is an example of adding text to a bar graph to indicate how big each group is when plotting relative proportions.

maln_protein_to_matrix_mat_pca_dat_samplesMeta = readr::read_tsv("maln_protein_to_matrix_mat_pca_dat_samplesMeta.tab.txt")
maln_protein_to_matrix_mat_pca_dat_samplesMeta

# A tibble: 2,398 x 6
   sample  country region  hdbcluster reads collection_year
   <chr>   <chr>   <chr>        <int> <chr>           <int>
 1 Ghana.~ Ghana   West A~          6 Ghan~            2013
 2 Ghana.~ Ghana   West A~          5 Ghan~            2013
 3 Guinea~ Guinea  West A~          5 Guin~            2011
 4 Malawi~ Malawi  East A~          5 Mala~            2011
 5 DRC.08~ DRC     Centra~          5 DRC.~            2013
 6 DRC.08~ DRC     Centra~          5 DRC.~            2013
 7 Ghana.~ Ghana   West A~          5 Ghan~            2013
 8 Gambia~ Gambia  West A~          5 Gamb~            2008
 9 Gambia~ Gambia  West A~          5 Gamb~            2008
10 Gambia~ Mali    West A~          5 Mali~            2010
# ... with 2,388 more rows

maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum = maln_protein_to_matrix_mat_pca_dat_samplesMeta %>%
  group_by(hdbcluster, collection_year) %>%
  summarise(n = n()) %>%
  group_by(hdbcluster) %>%
  mutate(clusterTotal = sum(n)) %>%
  mutate(clusterFrac = n/clusterTotal) %>%
  group_by(collection_year) %>%
  mutate(yearTotal = sum(n)) %>%
  mutate(yearFrac = n/yearTotal)

maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt = maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum %>%
  group_by() %>%
  filter("NA" != collection_year) %>%
  mutate(collection_year = as.integer(collection_year)) %>%
  filter(yearTotal > 10)

maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt_yearTotals = maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt %>%
  filter() %>%
  select(collection_year, yearTotal) %>%
  unique()

maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt_yearTotals

# A tibble: 9 x 2
  collection_year yearTotal
            <int>     <int>
1            2007        62
2            2008       104
3            2009       143
4            2010       287
5            2011       930
6            2012       499
7            2013       287
8            2002        15
9            2014        23

clusterColors = c("black", "#005AC8", "#AA0A3C", "#0AB45A", "#8214A0", "#FA7850", "#006E82", "#FA78FA",
                  "black", "#005AC8", "#AA0A3C", "#0AB45A", "#8214A0", "#FA7850",
                  "#005AC8", "#AA0A3C", "#0AB45A", "#8214A0", "#FA7850", "#14D2DC")
names(clusterColors) = c("0", "1", "2", "3", "4", "5", "6", "7",
                         "Lab", "central_africa", "e_africa", "se_asia", "w_africa", "south_america",
                         "Central Africa", "East Africa", "South East Asia", "West Africa", "South America", "India")

ggplot(maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt %>% filter(),
       aes(x = collection_year, y = yearFrac, fill = as.factor(hdbcluster))) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual("Cluster", values = clusterColors) +
  scale_x_continuous(breaks = seq(min(maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt$collection_year),
                                  max(maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt$collection_year))) +
  theme_bw() + ggtitle("") +
  theme(axis.text.x = element_text(family = "Helvetica", face = "plain", colour = "#000000", angle = 90, hjust = 1),
        axis.title = element_text(family = "Helvetica", face = "bold", colour = "#000000"),
        plot.title = element_text(family = "Helvetica", face = "bold", colour = "#000000", hjust = 0.5),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.ticks.x = element_blank()) +
  geom_text(data = maln_protein_to_matrix_mat_pca_dat_samplesMeta_sum_filt_yearTotals,
            aes(x = collection_year, y = 1.07, label = paste0("n=", yearTotal), angle = 45),
            inherit.aes = FALSE) +
  labs(x = "Collection Year", y = "Relative Proportions")

[Figure: stacked bar graph of relative proportions by collection year (2002 to 2014), filled by cluster (1 to 6), with the sample count for each year (for example n=930 for 2011) printed at an angle above its bar.]
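The bookkeeping behind this plot (fractions within each year, plus one "n=" label per year) can be sketched on a tiny made-up table. The samples data frame and its column names are hypothetical, and count() and distinct() are used here as shorter stand-ins for the group_by/summarise and select/unique steps above.

```r
library(dplyr)

# hypothetical samples: one row per sample
samples = tibble(
  cluster = c(1, 1, 2, 1, 2, 2, 2),
  year    = c(2010, 2010, 2010, 2011, 2011, 2011, 2011)
)

# count per cluster and year, then the share of each cluster within its year
year_fracs = samples %>%
  count(cluster, year) %>%
  group_by(year) %>%
  mutate(yearTotal = sum(n), yearFrac = n / yearTotal) %>%
  ungroup()

# one row per year, used for the "n=" labels
year_totals = year_fracs %>%
  distinct(year, yearTotal) %>%
  mutate(label = paste0("n=", yearTotal))
```

Because the fractions in each year sum to 1, every stacked bar has the same height, and the labels are the only place where the group sizes are visible.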

Dividing Data into Quantiles

Sometimes it might be helpful to split data into different bins by creating evenly sized quantiles. This can be done by using the ntile() function. Here we create quantiles based on the standard deviation (sd) of expression under the Lps exposure, to bin genes by how variable their expression is across the Lps time points.

ts_longFormat_lps_sum = ts_longFormat %>%
  filter(exposure == "Lps") %>%
  group_by(gene) %>%
  summarise(lps_sd = sd(expression))

ts_longFormat_lps_sum

# A tibble: 25,807 x 2
   gene       lps_sd
   <chr>       <dbl>
 1 A1BG       0.488
 2 A1BG-AS1   0.461
 3 A1CF       0.0353
 4 A2M      173.
 5 A2M-AS1    0.552
 6 A2ML1      0.0135
 7 A2MP1      0.0393
 8 A3GALT2    0.0261
 9 A4GALT     3.06
10 A4GNT      0.0519
# ... with 25,797 more rows


ntiles = 2000
ts_longFormat_lps_sum = ts_longFormat_lps_sum %>%
  mutate(lps_sd_quantile = ntile(lps_sd, ntiles))
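To get a feel for what ntile() returns, here is a sketch with six made-up values split into 3 bins. The values are ranked from smallest to largest and the ranks are cut into equally sized groups, so bin 1 holds the smallest values and the last bin holds the largest.

```r
library(dplyr)

sds = c(5, 1, 3, 2, 4, 6)   # hypothetical standard deviations

# 3 bins of 2 values each; the result is 3 1 2 1 2 3
bins = ntile(sds, 3)
```

With 2000 bins over 25,807 genes, as above, each bin holds about a dozen genes.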

We can add this quantile information back to the original data set by using the left_join() function. left_join() takes two data frames that share one or more columns. For each row of the first data frame, it looks up the rows of the second data frame with matching values in the shared columns, adds whatever columns the first data frame doesn’t have, and fills them in from the matching rows.

ts_longFormat = ts_longFormat %>%
  left_join(ts_longFormat_lps_sum)

ts_longFormat

# A tibble: 490,333 x 6
   gene   exposure  time expression  lps_sd lps_sd_quantile
   <chr>  <chr>    <dbl>      <dbl>   <dbl>           <int>
 1 A1BG   Ctrl        0.     5.41   4.88e-1            1050
 2 A1BG-~ Ctrl        0.     1.72   4.61e-1            1040
 3 A1CF   Ctrl        0.     0.0504 3.53e-2             627
 4 A2M    Ctrl        0.   708.     1.73e+2            1977
 5 A2M-A~ Ctrl        0.     1.38   5.52e-1            1070
 6 A2ML1  Ctrl        0.     0.0275 1.35e-2             483
 7 A2MP1  Ctrl        0.     0.0329 3.93e-2             642
 8 A3GAL~ Ctrl        0.     0.0229 2.61e-2             578
 9 A4GALT Ctrl        0.     3.85   3.06e+0            1386
10 A4GNT  Ctrl        0.     0.120  5.19e-2             694
# ... with 490,323 more rows
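Here is left_join() on two small made-up tables (the names measurements and per_gene are hypothetical). The call above lets left_join() work out the shared column on its own; this sketch names it explicitly with by =, and it also shows what happens to a gene that has no match in the second table.

```r
library(dplyr)

# several rows per gene, like the long-format expression data
measurements = tibble(
  gene       = c("A", "A", "B", "C"),
  expression = c(1.0, 2.0, 5.0, 7.0)
)

# one row per gene, like the per-gene summary; gene "C" is missing on purpose
per_gene = tibble(
  gene   = c("A", "B"),
  sd_bin = c(1L, 2L)
)

joined = measurements %>%
  left_join(per_gene, by = "gene")
```

joined keeps all four rows of measurements; both "A" rows get sd_bin 1, and the "C" row gets NA because per_gene has no entry for it.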

Let’s take the 2000th quantile (the genes with the most variable expression), along with the 1st quantile (the least variable).

top_ts_longFormat = ts_longFormat %>%
  filter(lps_sd_quantile == 2000)

bottom_ts_longFormat = ts_longFormat %>%
  filter(lps_sd_quantile == 1)

And plot all the genes by using facet_wrap to separate out the genes and allow their y-axes to differ between panels.

ggplot(top_ts_longFormat, aes(x = time, y = expression, color = exposure)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~gene, scales = "free_y")

[Figure: line graphs of expression versus time, colored by exposure (Ctrl, Ifnb, Lps, R848), with one panel per gene (B2M, CCL18, CCL22, CCL3, CCL4, CD74, FTH1, FTL, IL1RN, LIPA, SOD2, TGM2) in 4 columns, each panel with its own y-axis.]

The facet_wrap function also allows you to set how many columns to have by using the ncol= argument.

ggplot(top_ts_longFormat, aes(x = time, y = expression, color = exposure)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~gene, scales = "free_y", ncol = 3)

[Figure: the same per-gene panels arranged in 3 columns.]

facet_grid is another faceting function, which lays panels out in a grid pattern and is better for showing relationships. facet_wrap just creates a panel for each level and places the panels in the order of the levels, whereas facet_grid lays the panels out in a grid of rows and columns.

ggplot(top_ts_longFormat, aes(x = gene, y = expression, fill = gene)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_brewer(palette = "Paired") +
  facet_grid(time~exposure) +
  theme(axis.text.x = element_text(angle = -45, hjust = 0))

[Figure: grid of bar graphs of expression by gene, with one row per time point (0, 1, 2, 4, 6, 12, 24) and one column per exposure (Ctrl, Ifnb, Lps, R848).]
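The difference between the two faceting functions is easiest to see by counting panels. This sketch uses a made-up data frame (the toy values are hypothetical) with 2 time points and 3 exposures, and builds both layouts from the same base plot.

```r
library(ggplot2)

# hypothetical data: every combination of 2 genes, 2 time points, 3 exposures
toy = expand.grid(
  gene     = c("g1", "g2"),
  time     = c(0, 24),
  exposure = c("Ctrl", "Lps", "R848"),
  stringsAsFactors = FALSE
)
toy$expression = seq_len(nrow(toy))

base = ggplot(toy, aes(x = gene, y = expression)) +
  geom_col()

# one panel per exposure, wrapped into 2 columns
p_wrap = base + facet_wrap(~exposure, ncol = 2)

# rows come from the left of ~, columns from the right: 2 rows x 3 columns
p_grid = base + facet_grid(time ~ exposure)

n_wrap_panels = nrow(ggplot_build(p_wrap)$layout$layout)
n_grid_panels = nrow(ggplot_build(p_grid)$layout$layout)
```

facet_wrap gives 3 panels here and facet_grid gives 6, one for every time and exposure combination.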

Also, the cowplot library is great for arranging completely different plots in custom-sized panels, as in a figure.
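As a minimal sketch of what that can look like (assuming the cowplot package is installed; the toy data and plot names are made up), plot_grid() from cowplot places two unrelated plots side by side, labels them, and lets one panel be wider than the other.

```r
library(ggplot2)
library(cowplot)

toy = data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))   # hypothetical data

p_line = ggplot(toy, aes(x = x, y = y)) + geom_line()
p_bar  = ggplot(toy, aes(x = x, y = y)) + geom_col()

# two different plots side by side; the left panel is twice as wide as the right
figure = plot_grid(p_line, p_bar, labels = c("A", "B"), ncol = 2, rel_widths = c(2, 1))
```

Printing figure draws the combined layout, with "A" and "B" in the top left corner of each panel.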

Part 1 Exercises

Using the Temperature data frame from the last session, Average Temperatures USA

1. Create a bar plot of temperatures for 1995 for Boston and put the temperatures on top of the bars, x-axis = month, y-axis = temperature

2. Create a line graph for all years in Boston and put the name of the year next to the line after December, x-axis = month, y-axis = temperature

3. Create a bar plot for all years in Boston but facet the plot so each year has its own panel, x-axis = month, y-axis = temperature

4. Create quantiles with 100 bins for mean temperatures over all years for each Station, take the 100th bin, and create a bar graph with x-axis Station_Name, y-axis temperature, and using facet_grid plot month by year

RMarkdown

Markdown is the term for a way of writing plain text files with certain syntax that, when given to a program, will render the contents into a rich document, like an HTML document. Many different flavors of Markdown exist, but most follow similar rules. RMarkdown is a flavor of Markdown that allows for inserting R code into the document; the code is run, and its output is captured and placed into the final document. This is a great way to create an informational document for your R code or to create R examples, and because the final output is an HTML document, it can include interactive graphs and tables that R helps to create. In fact, all Session pages so far have been created by using RMarkdown; for example, here is the document that created this page itself: Session 6.

There are many features offered by RMarkdown. Here are a few cheatsheets that RStudio offers that are great reference guides: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf and https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.

Below is an example of what text looks like after rendering and how the syntax controls the output

Figure 1:

Within RStudio you can create a new RMarkdown document by clicking the + symbol in the top left corner. You normally just pick HTML for the output. When you do this, RStudio will ask to install the libraries needed to create RMarkdown documents.

Below is the default RMarkdown document created when creating a new Document

R Code Chunks

Below is an example of an R code chunk

Important notes about R chunks


Figure 2:


Figure 3:


Figure 4:

Figure 5:


• The whole document is run in a brand new R session, and therefore libraries need to be loaded at the beginning of the document

• Each R code chunk is run in the same R session, meaning all the R code is run as if you took all the R code, pasted it into one R script, and ran it

• When naming chunks, the name must always be unique (the name above for this chunk is pressure and cannot be used again)

• Options given to the chunk are separated by commas

• The working directory of the R code executed is the directory where the RMarkdown document is located

• The resulting output document is in the same directory as the RMarkdown document.

Some important and commonly used chunk options:

• echo - This will control if the R code itself is shown in the output document (by default it is)

• eval - This will control if the R code is executed; if this is set to FALSE, the code will be shown but not executed (this might be good when you are trying to show R examples but don’t want the code to execute)

• fig.width - This will affect the width of the captured output of the code chunk, important for plots

• fig.height - This will affect the height of the captured output of the code chunk, important for plots

And there are many more options, see the reference/cheatsheets for examples.

Once you want to create the output document, you hit the knit button.
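The notes above can be tried out from plain R as well. This is only a sketch under a few assumptions: the file name example.Rmd and its contents are made up, the chunk is named pressure like the default RStudio document, and rmarkdown::render() is used as a rough stand-in for pressing the knit button (it needs the rmarkdown package and pandoc, which RStudio bundles). The three backticks that open and close a chunk are built with strrep() only so they can be written inside this example.

```r
# three backticks open and close a chunk in an RMarkdown file
fence = strrep("`", 3)

# a tiny RMarkdown document (hypothetical contents)
rmd_lines = c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "# A plot",
  "",
  # chunk name first, then options separated by commas
  paste0(fence, "{r pressure, echo=FALSE, fig.width=5, fig.height=4}"),
  "plot(pressure)",
  fence
)

# write it to a temporary directory (hypothetical file name)
rmd_file = file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file = rmarkdown::render(rmd_file, quiet = TRUE)
}
```

When the render step runs, the HTML file is written next to example.Rmd, and because of echo=FALSE it shows the plot but not the plot(pressure) code.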

Part 2. Exercises

1. Create a directory and put the temperature dataset in it, then create a new RMarkdown document and save it in the same directory.

2. Take the code from Part 1 and put it in the RMarkdown document to create an HTML page of the plots you created; add a header for each plot (by using the # symbol).

3. By looking at the cheatsheets, try to figure out how to add a table of contents to the document.
