user2012!: high-throughput sequence analysis with r and ... · user2012!: high-throughput sequence...

48
useR2012! : High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan * June 2012 Abstract DNA sequence analysis generates large volumes of data presenting challenging bioinformatic and statistical problems. This tutorial intro- duces Bioconductor packages and work flows for the analysis of sequence data. We learn about approaches for efficiently manipulating sequences and alignments, and introduce common work flows and the unique sta- tistical challenges associated with ‘RNA-seq’, ‘ChIP-seq‘ and variant an- notation experiments. The emphasis is on exploratory analysis, and the analysis of designed experiments. The workshop assumes an intermedi- ate level of familiarity with R, and basic understanding of biological and technological aspects of high-throughput sequence analysis. The workshop emphasizes orientation within the Bioconductor milieu; we will touch on the Biostrings, ShortRead, GenomicRanges, edgeR, and DiffBind, and VariantAnnotation packages, with short exercises to illustrate the func- tionality of each package. Participants should come prepared with a mod- ern laptop with current R installed. Contents 1 Introduction 3 1.1 Rstudio ................................ 3 1.2 R .................................... 3 1.3 Bioconductor ............................. 5 1.4 Resources ............................... 7 2 Sequences and Ranges 8 2.1 Biostrings ............................... 8 2.2 GenomicRanges ............................ 8 2.3 Resources ............................... 14 * [email protected] 1

Upload: others

Post on 09-Sep-2019

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

useR2012 High-throughput sequence analysis

with R and Bioconductor

Valerie Obenchain Martin Morganlowast

June 2012

Abstract

DNA sequence analysis generates large volumes of data presentingchallenging bioinformatic and statistical problems This tutorial intro-duces Bioconductor packages and work flows for the analysis of sequencedata We learn about approaches for efficiently manipulating sequencesand alignments and introduce common work flows and the unique sta-tistical challenges associated with lsquoRNA-seqrsquo lsquoChIP-seqlsquo and variant an-notation experiments The emphasis is on exploratory analysis and theanalysis of designed experiments The workshop assumes an intermedi-ate level of familiarity with R and basic understanding of biological andtechnological aspects of high-throughput sequence analysis The workshopemphasizes orientation within the Bioconductor milieu we will touch onthe Biostrings ShortRead GenomicRanges edgeR and DiffBind andVariantAnnotation packages with short exercises to illustrate the func-tionality of each package Participants should come prepared with a mod-ern laptop with current R installed

Contents

1 Introduction 311 Rstudio 312 R 313 Bioconductor 514 Resources 7

2 Sequences and Ranges 821 Biostrings 822 GenomicRanges 823 Resources 14

lowastmtmorganfhcrcorg

1

3 Reads and Alignments 1531 The pasilla Data Set 1532 Reads and the ShortRead Package 1533 Alignments and the Rsamtools Package 19

4 RNA-seq 2441 Differential Expression with the edgeR Package 2442 Additional Steps in RNA-seq Work Flows 2943 Resources 30

5 ChIP-seq 3151 Initial Work Flow 3152 Motifs 34

6 Annotation 3761 Gene-Centric Annotations with AnnotationDbi 3762 Genome-Centric Annotations with GenomicFeatures 4063 Exploring Annotated Variants VariantAnnotation 43

2

Table 1 Schedule

Introduction Rstudio and the AMI R packages and help Bioconductor

Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates

Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)

RNA-seq A post-alignment workflow for differential representation

ChIP-seq Working with multiple ChIP-seq experiments

Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)

1 Introduction

This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive

The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1

11 RstudioExercise 1Log on to the Rstudio account set up for this course

Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)

Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server

12 R

The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list

can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list

3

of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type

R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-

ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps

method pays attention to whether ranges are on the same strand and chromo-some for instance)

Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using

gt source(httpbioconductororgbiocLiteR)

gt biocLite(c(ShortRead VariantAnnotation)) new packages

gt biocLite(character()) update packages

installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used

Find help using the R help system Start a web browser with

gt helpstart()

The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)

gt library(ShortRead)

gt readFastq

S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package

gt library(Biostrings)

gt showMethods(complement)

4

Function complement (package Biostrings)

x=DNAString

x=DNAStringSet

x=MaskedDNAString

x=MaskedRNAString

x=RNAString

x=RNAStringSet

x=XStringViews

Methods defined on the DNAStringSet class of Biostrings can be found with

gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))

Obtaining help on S4 classes and methods requires syntax such as

gt class DNAStringSet

gt method complementDNAStringSet

Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use

gt vignette(package=useR2012)

to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette

13 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation

The Bioconductor web site is at bioconductororg Features include

bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene

ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages

bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-

ting new packages

5

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 2: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

3 Reads and Alignments 1531 The pasilla Data Set 1532 Reads and the ShortRead Package 1533 Alignments and the Rsamtools Package 19

4 RNA-seq 2441 Differential Expression with the edgeR Package 2442 Additional Steps in RNA-seq Work Flows 2943 Resources 30

5 ChIP-seq 3151 Initial Work Flow 3152 Motifs 34

6 Annotation 3761 Gene-Centric Annotations with AnnotationDbi 3762 Genome-Centric Annotations with GenomicFeatures 4063 Exploring Annotated Variants VariantAnnotation 43

2

Table 1 Schedule

Introduction Rstudio and the AMI R packages and help Bioconductor

Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates

Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)

RNA-seq A post-alignment workflow for differential representation

ChIP-seq Working with multiple ChIP-seq experiments

Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)

1 Introduction

This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive

The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1

11 RstudioExercise 1Log on to the Rstudio account set up for this course

Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)

Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server

12 R

The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list

can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list

3

of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type

R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-

ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps

method pays attention to whether ranges are on the same strand and chromo-some for instance)

Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using

gt source(httpbioconductororgbiocLiteR)

gt biocLite(c(ShortRead VariantAnnotation)) new packages

gt biocLite(character()) update packages

installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used

Find help using the R help system Start a web browser with

gt helpstart()

The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)

gt library(ShortRead)

gt readFastq

S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package

gt library(Biostrings)

gt showMethods(complement)

4

Function complement (package Biostrings)

x=DNAString

x=DNAStringSet

x=MaskedDNAString

x=MaskedRNAString

x=RNAString

x=RNAStringSet

x=XStringViews

Methods defined on the DNAStringSet class of Biostrings can be found with

gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))

Obtaining help on S4 classes and methods requires syntax such as

gt class DNAStringSet

gt method complementDNAStringSet

Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use

gt vignette(package=useR2012)

to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette

13 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation

The Bioconductor web site is at bioconductororg Features include

bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene

ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages

bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-

ting new packages

5

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 3: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 1 Schedule

Introduction Rstudio and the AMI R packages and help Bioconductor

Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates

Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)

RNA-seq A post-alignment workflow for differential representation

ChIP-seq Working with multiple ChIP-seq experiments

Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)

1 Introduction

This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive

The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1

11 RstudioExercise 1Log on to the Rstudio account set up for this course

Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)

Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server

12 R

The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list

can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list

3

of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type

R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-

ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps

method pays attention to whether ranges are on the same strand and chromo-some for instance)

Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using

gt source(httpbioconductororgbiocLiteR)

gt biocLite(c(ShortRead VariantAnnotation)) new packages

gt biocLite(character()) update packages

installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used

Find help using the R help system Start a web browser with

gt helpstart()

The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)

gt library(ShortRead)

gt readFastq

S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package

gt library(Biostrings)

gt showMethods(complement)

4

Function complement (package Biostrings)

x=DNAString

x=DNAStringSet

x=MaskedDNAString

x=MaskedRNAString

x=RNAString

x=RNAStringSet

x=XStringViews

Methods defined on the DNAStringSet class of Biostrings can be found with

gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))

Obtaining help on S4 classes and methods requires syntax such as

gt class DNAStringSet

gt method complementDNAStringSet

Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use

gt vignette(package=useR2012)

to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette

13 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation

The Bioconductor web site is at bioconductororg Features include

bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene

ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages

bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-

ting new packages

5

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 4: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type

R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-

ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps

method pays attention to whether ranges are on the same strand and chromo-some for instance)

Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using

gt source(httpbioconductororgbiocLiteR)

gt biocLite(c(ShortRead VariantAnnotation)) new packages

gt biocLite(character()) update packages

installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used

Find help using the R help system Start a web browser with

gt helpstart()

The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)

gt library(ShortRead)

gt readFastq

S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package

gt library(Biostrings)

gt showMethods(complement)

4

Function complement (package Biostrings)

x=DNAString

x=DNAStringSet

x=MaskedDNAString

x=MaskedRNAString

x=RNAString

x=RNAStringSet

x=XStringViews

Methods defined on the DNAStringSet class of Biostrings can be found with

gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))

Obtaining help on S4 classes and methods requires syntax such as

gt class DNAStringSet

gt method complementDNAStringSet

Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use

gt vignette(package=useR2012)

to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette

13 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation

The Bioconductor web site is at bioconductororg Features include

bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene

ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages

bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-

ting new packages

5

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 5: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Function complement (package Biostrings)

x=DNAString

x=DNAStringSet

x=MaskedDNAString

x=MaskedRNAString

x=RNAString

x=RNAStringSet

x=XStringViews

Methods defined on the DNAStringSet class of Biostrings can be found with

gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))

Obtaining help on S4 classes and methods requires syntax such as

gt class DNAStringSet

gt method complementDNAStringSet

Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use

gt vignette(package=useR2012)

to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette

13 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation

The Bioconductor web site is at bioconductororg Features include

bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene

ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages

bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-

ting new packages

5

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 6: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 2 Selected Bioconductor packages for high-throughput sequence analysis

Concept PackagesData representation IRanges GenomicRanges GenomicFeatures

Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-

layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)

Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion

Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa

ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq

EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq

ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi

Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM

3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta

mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq

oneChannelGUI rnaSeqMapDatabase SRAdb

High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)

Exercise 2Scavenger hunt Spend five minutes tracking down the following information

a The package containing the readFastq function

b The author of the alphabetFrequency function defined in the Biostringspackage

c A description of the GappedAlignments class

6

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 7: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

d The number of vignettes in the GenomicRanges package

e From the Bioconductor web site instructions for installing or updatingBioconductor packages

f A list of all packages in the current release of Bioconductor

g The URL of the Bioconductor mailing list subscription page

Solution Possible solutions are found with the following R commands

gt readFastq

gt library(Biostrings)

gt alphabetFrequency

gt classGappedAlignments

gt vignette(package=GenomicRanges)

and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3

14 Resources

Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity

1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list

7

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 8: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures

Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-

ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data

GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name

GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes

Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences

BSgenome Representation and manipulation of large (eg whole-genome) sequences

2 Sequences and Ranges

Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3

21 Biostrings

The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages

Fixme Exercise trimLRPattern

22 GenomicRanges

Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed

8

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 9: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation

The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges

GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)

A complete definition of these genes as GRanges is

gt genes lt- GRanges(seqnames=c(3R X)

+ ranges=IRanges(

+ start=c(19967117 18962306)

+ end=c(19973212 18962925))

+ strand=c(+ -)

+ seqlengths=c(`3R`=27905053L `X`=22422827L))

The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as

gt genes

GRanges with 2 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] 3R [19967117 19973212] +

[2] X [18962306 18962925] -

---

seqlengths

3R X

27905053 22422827

9

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 10: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 1 Ranges

For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6

The GRanges class has many useful methods defined on it Consult the helppage

gt GRanges

and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)

gt vignette(package=GenomicRanges)

for a comprehensive introduction

Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges

gt ir lt- IRanges(start=c(7 9 12 14 2224)

+ end=c(15 11 12 18 26 27 28))

These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4

elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)

10

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 11: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 4 Common operations on IRanges GRanges and GRangesList

Category Function DescriptionAccessors start end width Get or s et the starts ends and widths

names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end

Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates

Arithmetic r + x r - x r x Shrink or expand ranges r by number x

shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end

Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints

Overlaps findOverlaps Find all overlaps for each x in y

countOverlaps Count overlaps of each x range in y

nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y

Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index

r[[i]] Get integer sequence from start[i] to end[i]

subsetByOverlaps Subset x for those that overlap in y

head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList

c Concatenate two or more range objects

11

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 12: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt elementMetadata(genes) lt-

+ DataFrame(EntrezId=c(42865 2768869)

+ Symbol=c(kal-1 CG34330))

metadata allows addition of information to the entire object The information isin the form of a list any data can be provided

gt metadata(genes) lt-

+ list(CreatedBy=A User Date=date())

GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements

GRangesList of length 1

$FBgn0039155

GRanges with 7 ranges and 2 elementMetadata cols

seqnames ranges strand | exon_id exon_name

ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt

[1] chr3R [19967117 19967382] + | 64137 ltNAgt

[2] chr3R [19970915 19971592] + | 64138 ltNAgt

[3] chr3R [19971652 19971770] + | 64139 ltNAgt

[4] chr3R [19971831 19972024] + | 64140 ltNAgt

[5] chr3R [19972088 19972461] + | 64141 ltNAgt

[6] chr3R [19972523 19972589] + | 64142 ltNAgt

[7] chr3R [19972918 19973212] + | 64143 ltNAgt

---

seqlengths

chr3R

27905053

The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above

12

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 13: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines

Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths

and table to summarize the number of exons in each gene for instance howmany single-exon genes are there

Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo

Solution

gt library(TxDbDmelanogasterUCSCdm3ensGene)

gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias

gt ex0 lt- exonsBy(txdb gene)

gt head(table(elementLengths(ex0)))

1 2 3 4 5 6

3182 2608 2070 1628 1133 886

gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)

gt ex lt- reduce(ex0[ids])

Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise

Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3

Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome

Use Views to create views on to the chromosome that span the start and endcoordinates of all exons

The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons

Look at the helper function and describe what it does

Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183

gt library(BSgenomeDmelanogasterUCSCdm3)

gt nm lt- ascharacter(unique(seqnames(ex[[1]])))

gt chr lt- Dmelanogaster[[nm]]

gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))

13

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 14: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Using the gcFunction helper function the subject GC content is

gt gcFunction(v)

[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333

[8] 05189394 05110132

The gcFunction is really straight-forward it invokes the function alphabetFre-

quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon

gt gcFunction

function (x)

alf lt- alphabetFrequency(x asprob = TRUE)

rowSums(alf[ c(G C)])

ltenvironment namespaceuseR2012gt

23 Resources

There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group

14

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 15: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 5 Selected Bioconductor packages for sequence reads and alignments

Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-

nipulating fastq files these classes rely heavily onBiostrings

GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads

Rsamtools Provides access to BAM alignment and other largesequence-related files

rtracklayer Input and output of bed wig and similar files

3 Reads and Alignments

The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis

31 The pasilla Data Set

As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences

In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package

32 Reads and the ShortRead Package

Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components

SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

15

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 16: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37

IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet

$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~

are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported

FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set

gt fastqDir lt- filepath(bigdata() fastq)

gt fastqFiles lt- dir(fastqDir full=TRUE)

gt fq lt- readFastq(fastqFiles[1])

gt fq

class ShortReadQ

length 1000000 reads width 37 cycles

The data are represented as an object of class ShortReadQ

gt head(sread(fq) 3)

A DNAStringSet instance of length 3

width seq

[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC

[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT

[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA

gt head(quality(fq) 3)

class FastqQuality

quality

A BStringSet instance of length 3

width seq

[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE

[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+

16

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 17: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance

gt abc lt- alphabetByCycle(sread(fq))

gt abc[14 18]

cycle

alphabet [1] [2] [3] [4] [5] [6] [7] [8]

A 78194 153156 200468 230120 283083 322913 162766 220205

C 439302 265338 362839 251434 203787 220855 253245 287010

G 397671 270342 258739 356003 301640 247090 227811 246684

T 84833 311164 177954 162443 211490 209142 356178 246101

FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample

gt sampler lt- FastqSampler(fastqFiles[1] 1000000)

gt yield(sampler) sample of 1000000 reads

class ShortReadQ

length 1000000 reads width 37 cycles

A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently

ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser

gt qas0 lt- Map(function(fl nm)

+ fq lt- FastqSampler(fl)

+ qa(yield(fq) nm)

+ fastqFiles

+ sub(_subsetfastq basename(fastqFiles)))

gt qas lt- docall(rbind qas0)

gt rpt lt- report(qas dest=tempfile())

gt browseURL(rpt)

A report from a larger subset of the experiment is available

gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml

+ package=useR2012)

gt browseURL(rpt)

17

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 18: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package

Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a

histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each

cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package

and mclapply to read and summarize the GC content of reads in two files inparallel

Solution Discovery

gt dir(bigdata())

[1] bam fastq

gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)

Input

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 4071585 2175 6193578 3308 6193578 3308

Vcells 125856983 9603 279142478 21297 279142148 21297

gt fq lt- readFastq(fls[1])

GC content

gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)

gt sum(alf0[c(G C)])

[1] 05457237

A histogram of the GC content of individual reads is obtained with

gt gc lt- gcFunction(sread(fq))

gt hist(gc)

Alphabet by cycle

gt abc lt- alphabetByCycle(sread(fq))

gt matplot(t(abc[c(A C G T)]) type=l)

Advanced (Mac Linux only) processing on multiple cores

18

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 19: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt library(parallel)

gt gc0 lt- mclapply(fls function(fl)

+ fq lt- readFastq(fl)

+ gc lt- gcFunction(sread(fq))

+ table(cut(gc seq(0 1 05)))

+ )

gt simplify list of length 2 to 2-D array

gt gc lt- simplify2array(gc0)

gt matplot(gc type=s)

33 Alignments and the Rsamtools Package

Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data

Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields

gt fl lt- systemfile(extdata ex1sam package=Rsamtools)

gt strsplit(readLines(fl 1) t)[[1]]

[1] B7_591496693509

[2] 73

[3] seq1

[4] 1

[5] 99

[6] 36M

[7]

[8] 0

[9] 0

[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7

[12] MFi18

[13] Aqi73

[14] NMi0

[15] UQi0

[16] H0i1

[17] H1i0

19

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 20: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R

Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)

gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)

gt aln lt- readGappedAlignments(alnFile)

gt head(aln 3)

GappedAlignments with 3 alignments and 0 elementMetadata cols

seqnames strand cigar qwidth start end width

ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt

[1] seq1 + 36M 36 1 36 36

[2] seq1 + 35M 35 3 37 35

[3] seq1 + 35M 35 5 39 35

ngap

ltintegergt

[1] 0

[2] 0

[3] 0

---

seqlengths

seq1 seq2

1575 1584

The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments

A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings

gt table(strand(aln))

+ -

1647 1624

20

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 21: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt table(width(aln))

30 31 32 33 34 35 36 38 40

2 21 1 8 37 2804 285 1 112

gt head(sort(table(cigar(aln)) decreasing=TRUE))

35M 36M 40M 34M 33M 14M4I17M

2804 283 112 37 6 4

Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes

Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from

The object ex created earlier contains coordinates of four genes Use coun-

tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to

Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation

Solution We discover the location of files using standard R commands

gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)

gt names(fls) lt- sub(_ basename(fls))

Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data

gt input

gt aln lt- readGappedAlignments(fls[1])

gt xtabs(~seqnames + strand asdataframe(aln))

strand

seqnames + -

chr3L 5402 5974

chrX 2278 2283

To count overlaps in regions defined in a previous exercise load the regions

gt data(ex) from an earlier exercise

Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known

21

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 22: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt strand(aln) lt- protocol not strand-aware

For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to

gt hits lt- countOverlaps(aln ex)

gt table(hits)

hits

0 1 2

772 15026 139

and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment

gt cnt lt- countOverlaps(ex aln[hits==1])

A simple function for counting reads is

gt counter lt-

+ function(filePath range)

+

+ aln lt- readGappedAlignments(filePath)

+ strand(aln) lt-

+ hits lt- countOverlaps(aln range)

+ countOverlaps(range aln[hits==1])

+

This can be applied to all files using sapply

gt counts lt- sapply(fls counter ex)

The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is

gt if (require(parallel))

+ simplify2array(mclapply(fls counter ex))

The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies

Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function

Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)

22

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 23: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 2 GC content in aligned reads

Solution

gt param lt- ScanBamParam(what=seq)

gt seqs lt- scanBam(fls[[1]] param=param)

gt readGC lt- gcFunction(seqs[[1]][[seq]])

gt hist(readGC)

23

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 24: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 6 Selected Bioconductor packages for RNA-seq analysis

Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-

rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data

also limmarsquos roast or camera after transformation withvoom or cqn

easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI

Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments

4 RNA-seq

Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments

The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial

Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use

41 Differential Expression with the edgeR Package

RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of

24

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 25: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance

Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis

The following exercise illustrates key steps in counting and filtering readsoverlapping known genes

Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata

Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-

tors functionA lesson from the microarray world is to discard genes that cannot be in-

formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples

Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix

gt data(counts)

gt dim(counts)

[1] 14470 7

25

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 26: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt grps lt- factor(sub([1-4] colnames(counts))

+ levels=c(untreated treated))

gt pairs lt- factor(c(single paired paired

+ single single paired paired))

gt pData lt- dataframe(Group=grps PairType=pairs

+ rownames=colnames(counts))

We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model

gt library(edgeR)

gt dge lt- DGEList(counts group=pData$Group)

gt dge lt- calcNormFactors(dge)

To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance

gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2

gt table(ridx) number filtered retained

ridx

FALSE TRUE

6476 7994

gt dge lt- dge[ridx]

Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula

describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects

26

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 27: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment

Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance

gt (design lt- modelmatrix(~ Group pData))

(Intercept) Grouptreated

treated1fb 1 1

treated2fb 1 1

treated3fb 1 1

untreated1fb 1 0

untreated2fb 1 0

untreated3fb 1 0

untreated4fb 1 0

attr(assign)

[1] 0 1

attr(contrasts)

attr(contrasts)$Group

[1] contrtreatment

The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed

Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion

gt dge lt- estimateTagwiseDisp(dge)

gt mean(sqrt(dge$tagwisedispersion))

[1] 01778359

27

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 28: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion

Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList

Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion

Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test

Create a lsquotop tablersquo of differentially represented genes using topTags

Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate

gt fit lt- glmFit(dge design)

The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit

gt lrTest lt- glmLRT(dge fit coef=2)

Here the topTags function summarizes results across the experiment

gt tt lt- topTags(lrTest n=10)

gt tt[13]

28

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 29: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Coefficient Grouptreated

logFC logCPM LR PValue FDR

FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116

FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52

FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44

As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different

gt sapply(rownames(tt$table)[14]

+ function(x) tapply(counts[x] pData$Group mean))

FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736

untreated 1576 554 6447000 38225

treated 64 31 1482667 3600

42 Additional Steps in RNA-seq Work Flows

The forgoing provides an elementary work flow There are many interestingadditional opportunities including

Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes

Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq

Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring

29

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 30: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

43 Resources

The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis

30

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 31: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Table 7 Selected Bioconductor packages for RNA-seq analysis

Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR

BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM

5 ChIP-seq

ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]

The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)

Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation

51 Initial Work Flow

We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines

31

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 32: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-

sesssement of the data an example is available

gt rpt lt- systemfile(GSE30263_qa_report indexhtml

+ package=useR2012 mustWork=TRUE)

gt if (interactive())

+ browseURL(rpt)

The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

gt stam

class SummarizedExperiment

dim 369674 96

exptData(0)

assays(2) Tags PVals

rownames NULL

rowData values names(0)

colnames(96) A549_1 A549_2 Wi38_1 Wi38_2

colData names(10) CellLine Replicate PeaksDate PeaksFile

Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they

Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)

gt head(colData(stam) 3)

DataFrame with 3 rows and 10 columns

CellLine Replicate TotTags TotPeaks Tags Peaks

ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt

A549_1 A549 1 1857934 50144 1569215 43119

A549_2 A549 2 2994916 77355 2881475 73062

Ag04449_1 Ag04449 1 5041026 81855 4730232 75677

32

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 33: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

FastqDate FastqSize PeaksDate

ltDategt ltnumericgt ltDategt

A549_1 2011-06-25 463 2011-06-25

A549_2 2011-06-25 703 2011-06-25

Ag04449_1 2010-10-22 368 2010-10-22

PeaksFile

ltcharactergt

A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz

A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz

Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz

gt head(rowData(stam) 3)

GRanges with 3 ranges and 0 elementMetadata cols

seqnames ranges strand

ltRlegt ltIRangesgt ltRlegt

[1] chr1 [10100 10370]

[2] chr1 [15640 15790]

[3] chr1 [16100 16490]

---

seqlengths

chr1 chr2 chr3 chr4 chr22 chrX chrY

249250621 243199373 198022430 191154276 51304566 155270560 59373566

gt xtabs(~Replicate + CellLine colData(stam))[15]

CellLine

Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319

1 1 1 1 1 1

2 1 1 1 1 1

Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples

gt m lt- assays(stam)[[Tags]] gt 0 peaks detected

gt peaksPerSample lt- table(rowSums(m))

gt head(peaksPerSample)

1 2 3 4 5 6

174574 35965 18939 12669 9143 7178

gt tail(peaksPerSample)

91 92 93 94 95 96

1226 1285 1542 2082 2749 14695

33

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 34: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 3 Hierarchical clustering of ENCODE samples

To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function

gt library(bioDist) for cordist

gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts

gt d lt- cordist(t(m)) correlation distance

gt h lt- hclust(d) hierarchical clustering

Plot the result as in Figure 3

gt plot(h cex=8 ann=FALSE)

52 Motifs

Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery

Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained

34

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 35: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens

Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization

Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package

Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo

As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes

Solution Here we load the required packages and retrieve the position weightmatrix for CTCF

gt library(Biostrings)

gt library(BSgenomeHsapiensUCSChg19)

gt library(seqLogo)

gt library(lattice)

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt

gt chrid lt- chr1

gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand

gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))

The distribution of scores can be visualized with eg densityplot from thelattice package

gt densityplot(scores xlim=range(scores) pch=|)

consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected

gt cm lt- consensusMatrix(hits)[14]

gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))

35

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 36: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)

36

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 37: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

6 Annotation

Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include

bull Organism level eg orgMmegdb

bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf

bull Homology level eg homDminpdb

bull System biology level GOdb KEGGdb Reactomedb

Examples of genome-centric packages include

bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources

bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track

bull BSgenome for whole genome sequence representation and manipulation

bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build

Web-based resources include

bull biomaRt to query biomart resource for genes sequence SNPs and etc

bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser

61 Gene-Centric Annotations with AnnotationDbi

Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months

37

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 38: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

GENE ID

PLATFORM PKGS

GENE ID

ONTO IDrsquoS

ORG PKGS

GENE ID

ONTO ID

TRANSCRIPT PKGS

SYSTEM BIOLOGY

(GO KEGG)

GENE ID

HOMOLOGY PKGS

Figure 5 Annotation Packages the big picture

Exercise 13What is the name of the org package for Drosophila Load it

Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable

Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map

Solution

gt library(orgDmegdb)

gt head(ls(packageorgDmegdb) 3)

[1] orgDmeg orgDmegdb orgDmegACCNUM

gt orgDmegUNIPROT

UNIPROT map for Fly (object of class AnnDbBimap)

gt class(orgDmegUNIPROT)

[1] AnnDbBimap

attr(package)

[1] AnnotationDbi

gt toTable(head(orgDmegUNIPROT 3))

gene_id uniprot_id

1 30970 Q8IRZ0

2 30970 Q95RP8

3 30971 Q95RU8

4 30972 Q9W5H1

38

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 39: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id

gt toTable(head(revmap(orgDmegUNIGENE) 3))

gene_id unigene_id

1 30970 Dm6474

2 30971 Dm9

3 30972 Dm12271

gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))

[1] TRUE

Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select

Exercise 14Display the OrgDb object for the orgDmegdb package

Use the cols method to discover which sorts of annotations can be extractedfrom it

Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each

Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310

Solution The OrgDb object is named orgDmegdb

gt cols(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] GENENAME MAP PATH PMID REFSEQ

[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE

[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT

[21] ENSEMBLTRANS GO

gt keytypes(orgDmegdb)

[1] ENTREZID ACCNUM ALIAS CHR ENZYME

[6] MAP PATH PMID REFSEQ SYMBOL

[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT

[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO

39

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 40: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))

gt cols lt- c(SYMBOL PATH)

gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)

UNIPROT SYMBOL PATH

1 Q8IRZ0 CG3038 ltNAgt

2 Q95RP8 CG3038 ltNAgt

3 Q95RU8 G9a 00310

4 Q9W5H1 CG13377 ltNAgt

5 P39205 cin ltNAgt

6 Q24312 ewg ltNAgt

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar

gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)

gt nrow(kegg)

[1] 32

gt head(kegg 3)

PATH UNIPROT SYMBOL

1 00310 Q95RU8 G9a

2 00310 Q9W5E0 Suv4-20

3 00310 Q9W3N9 CG10932

62 Genome-Centric Annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization

Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site

Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site

Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)

Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance

Summarize the locations of the peaks relative to the TSS

40

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 41: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Solution Read in the ENCODE ChIP peaks for all cell lines

gt stamFile lt- systemfile(data stamRda package=useR2012)

gt load(stamFile)

Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt peak lt- rowData(stam)[ridx]

Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite

gt peak lt- resize(peak width=1 fix=center)

Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand

gt library(TxDbHsapiensUCSChg19knownGene)

gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)

gt tss lt- resize(tx width=1)

The nearest function returns the index of the nearest subject to each query

element the distance between peak and nearest TSS is thus

gt idx lt- nearest(peak tss)

gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))

gt dist lt- (start(peak) - start(tss)[idx]) sgn

Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6

gt bound lt- 1000

gt ok lt- abs(dist) lt bound

gt dist lt- dist[ok]

gt table(sign(dist))

-1 0 1

1262 4 707

gt griddensityplot lt-

+ function()

+ panel function to plot a grid underneath density

+

+ panelgrid()

+ paneldensityplot()

41

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 42: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 6 Distance to nearest TSS amongst conserved peaks

+

gt print(densityplot(dist[ok] plotpoints=FALSE

+ panel=griddensityplot

+ xlab=Distance to Nearest TSS))

The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function

gt distToTss lt-

+ function(peak tx)

+

+ peak lt- resize(peak width=1 fix=center)

+ tss lt- resize(tx width=1)

+ idx lt- nearest(peak tss)

+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))

+ (start(peak) - start(tss)[idx]) sgn

+

Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery

Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks

42

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 43: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

gt library(BSgenomeHsapiensUCSChg19)

gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)

gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)

gt pk6 lt- rowData(stam)[ridx]

gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)

gt head(seqs 3)

A DNAStringSet instance of length 3

width seq

[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG

[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA

[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA

matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot

gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR

gt hits lt- lapply(seqs matchPWM pwm=pwm)

gt hasPwmMatch lt- sapply(hits length) gt 0

gt dist lt- distToTss(pk6 tx)

gt ok lt- abs(dist) lt bound

gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])

gt print(densityplot(~Distance group=HasPwmMatch df

+ plotpoints=FALSE panel=griddensityplot

+ autokey=list(

+ columns=2

+ title=Has Position Weight Matrix

+ cextitle=1)

+ xlab=Distance to Nearest Tss))

63 Exploring Annotated Variants VariantAnnotation

A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position

Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function

43

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 44: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel

Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT

(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build

(genome=hg19) on which the variants are annotated Take a peak at the rowData

to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas

the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes

The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership

Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results

Solution Explore the header

gt library(VariantAnnotation)

gt fl lt- systemfile(extdata chr22vcfgz

+ package=VariantAnnotation)

gt (hdr lt- scanVcfHeader(fl))

class VCFHeader

samples(5) HG00096 HG00097 HG00099 HG00100 HG00101

meta(1) fileformat

fixed(1) ALT

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

gt info(hdr)[c(VT RSQ)]

DataFrame with 2 rows and 3 columns

Number Type Description

ltcharactergt ltcharactergt ltcharactergt

VT 1 String indicates what type of variant the line represents

RSQ 1 Float Genotype imputation quality from MaCHThunder

Input the data and peak at their locations

gt (vcf lt- readVcf(fl hg19))

class VCF

dim 10376 5

genome hg19

exptData(1) header

44

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 45: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

fixed(4) REF ALT QUAL FILTER

info(22) LDAF AVGPOST VT SNPSOURCE

geno(3) GT DS GL

rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001

rowData values names(1) paramRangeID

colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101

colData names(1) Samples

gt head(rowData(vcf) 3)

GRanges with 3 ranges and 1 elementMetadata col

seqnames ranges strand | paramRangeID

ltRlegt ltIRangesgt ltRlegt | ltfactorgt

rs7410291 22 [50300078 50300078] | ltNAgt

rs147922003 22 [50300086 50300086] | ltNAgt

rs114143073 22 [50300101 50300101] | ltNAgt

---

seqlengths

22

NA

Rename chromosome levels

gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))

Discover whether SNPs are located in dbSNP

gt library(SNPlocsHsapiensdbSNP20101109)

gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)

gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)

gt table(inDbSNP)

inDbSNP

FALSE TRUE

6126 4250

Create a data frame summarizing SNP quality and dbSNP membership

gt metrics lt-

+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)

Finally visualize the data eg using ggplot2 (Figure 7)

gt library(ggplot2)

gt ggplot(metrics aes(RSQ fill=inDbSNP)) +

+ geom_density(alpha=05) +

+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +

+ scale_y_continuous(name=Density) +

+ opts(legendposition=top)

45

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 46: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP

46

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 47: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

References

[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010

[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011

[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008

[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008

[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008

[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]

[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]

[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009

[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009

[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010

[11] N Matloff The Art of R Programming No Starch Pess 2011

[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]

47

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation
Page 48: useR2012!: High-throughput sequence analysis with R and ... · useR2012!: High-throughput sequence analysis with R and Bioconductor Valerie Obenchain, Martin Morgan June 2012 Abstract

[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]

[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]

[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010

[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004

48

  • Introduction
    • Rstudio
    • R
    • Bioconductor
    • Resources
      • Sequences and Ranges
        • Biostrings
        • GenomicRanges
        • Resources
          • Reads and Alignments
            • The pasilla Data Set
            • Reads and the ShortRead Package
            • Alignments and the Rsamtools Package
              • RNA-seq
                • Differential Expression with the edgeR Package
                • Additional Steps in RNA-seq Work Flows
                • Resources
                  • ChIP-seq
                    • Initial Work Flow
                    • Motifs
                      • Annotation
                        • Gene-Centric Annotations with AnnotationDbi
                        • Genome-Centric Annotations with GenomicFeatures
                        • Exploring Annotated Variants VariantAnnotation