applications of high throughput nucleotide waage.pdf · applications of high throughput nucleotide...

APPLICATIONS OF HIGH THROUGHPUT NUCLEOTIDE

SEQUENCING by

JOHANNES WAAGE, M. SC.

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computational Biology and Bioinformatics

at

The PhD School of Science, Faculty of Science,

University of Copenhagen, Denmark.

Committee:

Assoc. Prof. JEPPE VINTHER, Chair, Uni. Of Copenhagen

Prof. ANDERS GORM PEDERSEN, Technical University of Denmark

Assoc. Prof. RICHARD SANDBERG, Karolinska Institutet, Sweden

Supervisors:

Prof. ALBIN SANDELIN (Principal), University of Copenhagen

Prof. BO PORSE, University of Copenhagen

THE BIOINFORMATICS CENTRE, DEPARTMENT OF BIOLOGY

AND

BIOTECH RESEARCH AND INNOVATION CENTRE

UNIVERSITY OF COPENHAGEN

September 1st, 201

Biology has at least 50 more interesting years.

- JAMES WATSON, 1989

“ ”

Preface

This work is presented as the requirement for obtaining the PhD degree at the

Bioinformatics Centre, Department of Biology, Faculty of Science, University of

Copenhagen. The work was carried out under the supervision of, in no particular

order, associate professor Albin Sandelin (Bioinformatics) and professor Bo Porse

(Biotech Research and Innovation Centre) from medio 2010 to medio 2013. A portion

of the work leading to article I was carried out from medio 2008 to medio 2010 during

my master’s thesis. The work was in part funded by the European Research Council

(EU FP7 framework programme/ERC grant agreement 204135), and in part by the

faculty of science, Uni. Of Copenhagen.

Table of Contents

Preface ..................................................................................................................... 5

English Summary ................................................................................................... 7

Dansk Resumé ........................................................................................................ 8

Acknowledgements ............................................................................................... 9

Author Publications Included In This Thesis .................................................... 10

Author Publications Not Included In This Thesis ............................................. 11

Abbreviations ........................................................................................................ 12

Introduction ........................................................................................................... 13

Introduction to articles ......................................................................................... 16

Sequencing and mapping issues .......................................................................... 20

RNA-sequencing .................................................................................................. 22

Milestones and history ..................................................................................... 22

Normalization issues ........................................................................................ 25

Isoform de-convolution and splice class detection .......................................... 27

Coding Potential Prediction ............................................................................. 31

Processing and analysis of CAGE-data ........................................................... 32

Chromatin Immunoprecipitation coupled with sequencing .............................. 36

Milestones and history ..................................................................................... 36

Peakfinding and normalization ........................................................................ 37

Motif discovery ................................................................................................. 38

Concluding Remarks .......................................................................................... 40

Original Articles ................................................................................................... 42

Article I ............................................................................................................. 43

Article II ............................................................................................................ 63

Article III ......................................................................................................... 66

Article IV ........................................................................................................... 97

References ............................................................................................................ 110

Supplemental Figures ......................................................................................... 116

ENGLISH SUMMARY | 7

English Summary

The recent advent of high throughput sequencing of nucleic acids (RNA and

DNA) has vastly expanded research into the functional and structural biology of the

genome of all living organisms (and even a few dead ones). With this enormous and

exponential growth in biological data generation come equally large demands in data

handling, analysis and interpretation, perhaps defining the modern challenge of the

computational biologist of the post-genomic era.

The first part of this thesis consists of a general introduction to the history,

common terms and challenges of next generation sequencing, focusing on oft

encountered problems in data processing, such as quality assurance, mapping,

normalization, visualization, and interpretation.

Presented in the second part are scientific endeavors representing solutions to

problems of two sub-genres of next generation sequencing.

For the first flavor, RNA-sequencing, a study of the effects on alternative RNA

splicing of KO of the nonsense mediated RNA decay system in Mus, using digital

gene expression and a custom-built exon-exon junction mapping pipeline is presented

(article I). Evolved from this work, a Bioconductor package, spliceR, for classifying

alternative splicing events and coding potential of isoforms from full isoform

deconvolution software, such as Cufflinks (article II), is presented. Finally, a study

using 5’-end RNA-seq for alternative promoter detection between healthy patients and

patients with acute promyelocytic leukemia is presented (article III).

For the second flavor, DNA-seq, a study presenting genome wide profiling of

transcription factor CEBP/A in liver cells undergoing regeneration after partial

hepatectomy (article IV) is included.

DANSK RESUMÉ | 8

Dansk Resumé

De nye nukleinsyresekventeringsteknologiers komme har i høj grad udvidet

horisonterne og mulighederne inden for forskning i funktionel og strukturel biologi af

alle levende (og et par få uddøde) organismers genom. Med denne enorme og

eksponentielle vækst af biologisk data opstår ligeledes store krav til behandling,

analyse og fortolkning af data, og således defineres den fremtidige udfordring for

bioinformatikkeren i den post-genomiske æra.

I den første del af denne afhandling præsenteres en general introduktion til

næste-generationssekvenserings historie, terminologi og udfordringer, med fokus på

typiske problemer indenfor datahåndtering, såsom kvalitetssikring, mapping

(placering af sekventerede aflæsningsfragmenter på kromosomet), normalisering og

fortolkning.

I den anden del præsenteres 4 videnskabelige arbejder der forsøger at løse

nuværende problemer indenfor to undertyper af sekvensering.

For den første type, RNA-sekvensering, præsenteres et studie der kortlægger

effekterne af knockout af nonsens-medieret RNA-nedbrydning på alternativ splejsning

i Mus ved hjælp af digitalt genudtryk og en specialbygget pipeline til mapping af exon-

exon-forbindelser. I forlængelse af dette, præsenteres en Bioconductor-pakke, spliceR,

der klassificerer alternative splejshændelser og proteinkodende potentiale af RNA-

isoformer genereret fra software som f.eks. Cufflinks. Endelig præsenteres et studie

der bruger RNA-sekvensering til at finde differentielt benyttede promotere mellem

raske patienter, og patienter diagnosticeret med akut promyelocytisk leukæmi.

For den anden type, DNA-sekvensering, præsenteres et studie omhandlende en

fuld genomisk profilering af transkriptionsfaktoren CEBP/As binding i murine

leverceller der undergår regenerering efter partiel hepatektomi.

ACKNOWLEDGEMENTS | 9

Acknowledgements

A sincere thank-you to my bosses Bo and Albin for being not just that when

things get rough.

Thank you to both the Porse group and the whole of Upper Binf for good

times!

AUTHOR PUBLICATIONS INCLUDED IN THIS THESIS | 10

Author Publications Included In This Thesis

Sorted by date (oldest first). Underlined articles are included in this thesis.

1. Waage, J*., Weischenfeldt, J*., Tian, G., Zhao, J., Damgaard, I., Jakobsen, J.

S., ... & Porse, B. T. (2012). Mammalian tissues defective in nonsense-mediated

mRNA decay display highly aberrant splicing patterns. Genome Biol,13, R35.

2. Jakobsen, J. S., Waage, J., Rapin, N., Bisgaard, H. C., Larsen, F. S., &

Porse, B. T. (2013). Temporal mapping of CEBPA and CEBPB binding

during liver regeneration reveals dynamic occupancy and specific regulatory

codes for homeostatic and cell cycle gene batteries. Genome research, 23(4),

592-603.

3. Knudsen, K., Porse, B. T., Sandelin A., Waage, J. (2013). spliceR: An R

package for classification of alternative splicing and prediction of coding

potential from RNA-seq data. Manuscript under preparation (Bioinformatics

format included)

4. Waage, J*., Boyd, M.*, Theilgaard-Mönch, K., Mora-Jensen, H., Carninci,

P., Rapin, N., … & Sandelin, A. (2013). Genome-wide mapping of transcription

start sites reveals deregulation of full-length transcripts in acute promyelocytic

leukemia. Under preparation.

*=shared first author

AUTHOR PUBLICATIONS NOT INCLUDED IN THIS THESIS | 11

Author Publications Not Included In This Thesis

1. Thoren, L. A., Nørgaard, G. A., Weischenfeldt, J., Waage, J., Jakobsen, J. S.,

Damgaard, I., ... & Porse, B. T. (2010). UPF2 is a critical regulator of liver

development, function and regeneration. PLoS one, 5(7), e11650.

2. Hasemann, M., Lauridsen, F. K. B., Waage, J., Jacobsen, J. S., Frank, A.,

Schuster, M., … & Porse, B.T. (2013) Submitted to PLoS Genetics.

3. Leucci, E., Patella, F., Waage, J., Holmstrøm, K., Lindow, M., Porse, B.T.,

… & Lund, A. H. (2013). MicroRNA-9 targets the long non-coding RNA

MALAT1 for degradation in the nucleus. Scientific Reports, 3, 2535

ABBREVIATIONS | 12

Abbreviations

(List of abbreviations used in thesis text, not including articles).

A3SS Alternative 3’ Splice Site

A5SS Alternative 5’ Splice Site

AFE Alternative First Exon

ALE Alternative Last Exon

AML Acute Myeloid Leukemia

APL Acute Promyelocytic Leukemia

AS Alternative Splicing

ATSS Alternative Transcription Start Site

ATTS Alternative Transcription Termination Site

BMM Bone Marrow-derived Macrophages

CAGE Cap Analysis of Gene Expression

ChIP Chromatin Immunoprecipitation

ENCODE ENCyclopedia Of DNA Elements

FANTOM Functional ANnotaTion Of the Mammalian genome

FPKM Fragments per Kilobase per Million mapped

IC Information Content

MES Multiple Exon Skipping

MX Mutually Exclusive exon

NMD Nonsense Mediated Decay

ORF Open Reading Frame

PTC Premature Termination Codon

PWM Position Weight Matrix

RPKM Reads per Kilobase per Million mapped

SES Single Exon Skipping

TMM Trimmed Mean of M-values

TPM Tags Per Million mapped

TSS Transcription Start Site

INTRODUCTION | 13

Introduction

The history of nucleic acid sequencing started with the complete sequencing of

the RNA gene of a bacteriophage coat protein in 1972 (1), based on the method of

incorporating dye-coupled chain-terminating nucleotides, published by Frederick

Sanger a few years later (2). An impressive feat at that time, this deduction of the full

387 nucleotides of sequence marked perhaps the earliest beginning of the era of the

genome – an era which is currently experiencing it’s renaissance in the 21th century

with the advent of the next generation of nucleotide sequencing.

From the early 1970s to the late 1990s, technological advances in sequencing

were significant but somewhat slow compared to the evolution of the microprocessor

(driving many other scientific advances), and sequencing of large full genome projects

required large machine-parks and extensive collaborative efforts. Following the first

fully sequenced genome, the 5386 nucleotides long DNA of the phiX174 phage (3) (and

later the first genome to be fully synthesized and assembled in vitro (4)), several small

genomes of prokaryotes were sequenced, an by the late eighties, an international

consortium of more than 74 laboratories and 600 scientists began a 7-year effort to

sequence the first eukaryotic and large genome of Saccharomyces cerevisea.

Completed in 1996 (5), this 12-megabase genome was a huge step forward in terms of

sheer size, but the 3000-megabase human genome was still a daunting, distant and

unfeasible task. But the foundation was laid, and from the yeast genome project (and

other early sequencing projects) sprang a plethora of technological and computational

disciplines and methods. And perhaps just as important, from these early sequencing

projects rose the cross-disciplinary field of bioinformatics, combining sequence analysis

with mathematical and statistical models of machine learning, allowing for sequence

alignment, gene prediction, structure prediction, and so on. An iconic example of an

early bioinformatics algorithm, BLAST, published in 1990 (6), was perhaps the

foundation of fast computational analysis of biological sequence.

The obvious culmination of the first era of sequencing was the complete

deduction of the human genomic sequence, an effort first planned in the early 80s,

INTRODUCTION | 14

initiated in 1991, and completed in 2001. A multi-party race to finish, private company

Celera intended to achieve the completion of both sequencing and assembly first,

raising fears that genes and genomic sequence could be patented and not available on

the open domain for research. Initial efforts to assemble the genome from the public

sequencing consortium were unsuccessful, but on June 22, 2000 GigAssembler (7), a

assembler written by then graduate student at University of Santa Cruz in California

(UCSC) Jim Kent completed the job just days before Celera, and a few weeks later,

the first draft of the human genome was made available at the newborn human genome

browser at the UCSC (8), releasing the genome into the public domain.

Concurrent with the human genome project, still based on the Sanger

sequencing methods invented in the 1960s, companies were trying to reinvent

nucleotide sequencing. Several full-fledged automated high-throughput sequencers

appeared on the scene around 2005, and underwent several iterative improvements to

both speed, accuracy and cost over the next 8 years. At time of writing, such machines

are relatively widespread, and used in a number of research disciplines.

Omicsmaps.com, a user-updated map of sequencers in academia, show 23 machines in

Denmark alone.

Three platforms currently dominate the market, based on different approaches

for deducing nucleotide sequence in parallel: Roches/454s pyrosequencing is based on

fixing DNA to micro-beads, followed by amplification and sequencing, and currently

supports read lengths up to 1000 nt, it’s main advantage. Applied Biosystems SOLiD

is based on ligation and two-base encoding (colorspace), and has high accuracy as its

primary advantage. Illuminas Solexa platform is based on sequencing-by-synthesis,

and has the highest throughput and the lowest cost of the three (9). As of 2013,

Illumina by far has the largest marketshare (10), and all of the work included in this

thesis is based on data from this platform. The different sequencing platforms will not

be covered more in detail, but for a comprehensive comparison and overview,

including advantages and disadvantages, please refer to (9).

INTRODUCTION | 15

Much of the work presented here was initiated between 2008 and 2010, a time

where the field of high-throughput sequencing and its sub-disciplines, such as RNA-

seq and ChIP-seq, were in their nascency. Working with the data presented an array of

interesting (and some not so interesting) problems with few ready-made tools and

pipelines available, necessitating a healthy dose of experimentation. Fortunately, the

bioinformatics discipline borrows a lot of its mentality and affinity for open-source

from computer science in general, and from early on, researchers and programmers

were sharing experiences, approaches and source code online.

INTRODUCTION TO ARTICLES | 16

Introduction to articles

The articles presented herein represent the main work of this dissertation, as

well as some additional work initiated during my master thesis (article I). Here, a brief

motivation, introduction to the subject matter as well as an overview of my

contributions and approaches to each work is presented. A more detailed treatise on

the methodologies used follows in each section of the general introduction, referencing

to each article as needed.

Obviously, the bioinformatic method cannot be wholly separated from the

underlying biology, but it should be noted that the aim of this thesis is not to carry the

reader through the different biological systems that are the focus point of the articles

included (although a brief introduction is given beforehand), but rather to investigate

the history, techniques, challenges and solutions in current appliances of next-

generation sequencing, and selected examples on how these were used in the included

works.


Paper I: NMD and alternative splicing – Genome Biology (2012)

In eukaryotes, most multi-exon genes undergo alternative splicing, the process

of regulated shuffling of genomic coding entities (exons) in a combinatorial manner,

creating several distinct mRNA molecules (and thereby protein products) from a

single gene unit, and this process is believed to be a significant contributor to

complexity of higher organisms. Such flexibility in any biological system often carries

with it a level of noise, and in this case, a potentially detrimental kind.

NMD or nonsense mediated decay is a general mRNA pathway in eukaryotes

that senses aberrantly spliced transcripts, or transcripts that would otherwise lead to

potentially dysfunctional protein products, and removes these by means of nucleolysis

and degradation.

For this paper, originally planned in early 2008, (and at the dawn of RNA-seq

technology) we sought to take advantage of the new digital platform to carry out the

first genome-wide and non-discriminative analysis of the effects of NMD KO

(facilitated by a murine KO of the essential NMD-factor Upf2) in an mammalian

system on the global splicing pattern as a whole. The study gave rise to several

bioinformatic challenges: dealing with the nascent RNA-seq data as a whole; creating

an artificial exhaustive genome of exon-exon splice junctions facilitating deconvolution

of full isoforms based on junction evidence (this was before paired-end sequencing was

widespread), predicting coding potential of each transcript, and downstream

interpretative analyses, including splice class classification, motif search, and sequence

conservation assessment. A challenge for this data was to meaningfully describe,

quantify and visualize “noisy” transcripts rescued by NMD KO, without setting

inhibitory filter thresholds that would otherwise remove these. Perhaps in some sense

a boon for this project, using biological replicates for sequencing was not feasible cost-

wise, and post-processing with replication would potentially have eliminated some of

the “stochastic” noisy splicing.

Overall, the study was successful, and we found that NMD KO upregulates a

large population of non-productive mRNAs through splice factor deregulation, and


extended the understanding of NMD from an mRNA quality pathway to a regulator

of several layers of gene expression. The study was published in Genome Biology in

2012 (11).

Paper II: a package for splice class detection – under preparation (formatted as a

Bioinformatics Application Note) (2013)

Our NMD-paper produced a more-or-less ready to run pipeline (RAINMAN)

for the community to run on their own RNA-seq data. But by 2012, paired-end RNA-

sequencing was the norm, and serious advances had been made in the field of full-

length isoform deconvolution based on robust statistical methods, including both

expectation–maximization and bayesian methods. One such tool published was

Cufflinks, part of the popular Tuxido suite for RNA-seq analysis; Cufflinks were

among many other studies used for the assembly of RNA-seq data in the large scale

ENCODE study of ultimo 2012.

As such, RAINMAN was outdated, and Kristoffer Knudsen (master student in

the Sandelin lab) and I set out to write a new tool for splice class detection and coding

potential prediction based on full length isoforms. From this emerged spliceR, an R

and Bioconductor package that allows for user-friendly annotation of assembled

RNA-seq data from Cufflinks (or any other full-length assembler), and outputs a range

of useful data, including isoform fraction switch scores and genomic locations of

differentially spliced elements for downstream analyses. spliceR was accepted into the

Bioconductor 2.13 release in October ‘13 (12), and the manuscript is currently under

preparation.

Paper III: Alternative promoter usage in acute promyelocytic leukemia – awaiting

experimental validation.

A different application for RNA-seq is CAGE-seq, or Cap Analysis of Gene

Expression. This method effectively captures small sequence fragments from the 5’ end

of capped RNA molecules, which are in turn sequenced, and processed to generate a

genome wide map of promoter usage. In humans (and other higher organisms), genes


often have more than one transcription start site, allowing for a diversification of the

resulting protein product, similar to and in concert with alternative splicing. For this

study, we took advantage of in-house cell-sorting facility and expertise combined with

access to samples from patients with acute promyelocytic leukemia (APL). The data

for this paper was particularly challenging, as isolatable material from the (liquid)

tumors was scarce, resulting in low input material for the method. This generated a lot

of technical noise, and necessitated rigorous filtering methods. After processing, we

found promoters with upregulated usage in APL, which were often located

downstream of annotated promoters, while promoters supporting the annotated

longest transcript variants were commonly downregulated in cancer. This manuscript

has been prepared, and is awaiting experimental validation of selected targets.

Paper IV: Transcription factor profiling in liver regeneration – Genome Research

(2013)

Another major advance for biology brought about by the high-throughput-era

was the coupling of traditional transcription factor binding site profiling by ChIP to

sequencing, a field previously limited by the microarray platform’s limited and pre-

selected number of probes. In this study we presented one the first genome wide

studies of temporal transcription factor dynamics – the binding patterns of key

hepatocyte factors CEBPA and CEBPB during liver regeneration following partial

hepatectomy in a mouse model.

Combined with DNA polymerase II binding data, as well as large array of in-

silico binding predictions of other transcription factors, we described a circuitry of

temporal codes, binding classes, and mutual exclusive motifs. Major challenges for this

study were many and included handling and processing (at that time nascent) ChIP-

seq data, peak calling, filtering and clustering techniques using unsupervised

approaches. In addition, considerable effort went into generating a high information /

low redundancy set of transcription factor motifs from three large repositories

(JASPAR, Transfac and UniProbe).

SEQUENCING AND MAPPING ISSUES | 20

Sequencing and mapping issues

A number of issues are shared between sequencing platforms and appliances

pertaining to read length, read quality and trimming and mapping. Although not one

of the most exciting issues of sequence analyses, this short section will briefly cover the

most common challenges, and how some of these were overcome in the presented

works.

Sequence data generated by current platforms is exported in the FASTQ

format, a sequence format that combines the raw sequence format FASTA with the

quality data. Each platform has its own idiosyncrasies in terms of sequencing quality;

the Illumina platform, for instance, generally has a low quality towards the edges of the

flow cell, in the very beginning of the sequencing run and trailing off towards the end

of the run (data not shown). Most common workflows include a read quality

visualization and trimming step for ensuring optimal mapping. In addition, this step

can include the removal of sequencing adapter sequences, and identical reads arising

from PCR amplification artifacts. Separated form the analytic and interpretive side, a

large amount of time is spent readying and “massaging” data in bioinformatics. For

article I, this was done manually in Perl and Python. For article III and IV, this was

done using FastX-tools (13) and FastQC (14).

Perhaps the pivotal step in sequence processing is the mapping of short

sequenced fragments back to the relevant genome, a field where significant progress

has been made to accommodate the much larger amounts of data since the early days

of BLAST (6). The seed-and-extend approach of BLAST is still applicable in some

cases, but modern mappers applying more efficient forms of genome indexing have

significantly increased the speed of mapping, a necessity when dealing with millions of

reads. Except for the ELAND mapper proprietary to the Illumina pipeline, MAQ

was one the earliest attempts at a fast short-read mapper (15), using a hash table of the

genome a look-up index, incorporating the quality scores into the mapping algorithm.

A bit on the slow side (typically requiring > 1000 CPU hours for larger sequencing

projects on workstations at the time of publishing), a big stride forward was made with

SEQUENCING AND MAPPING ISSUES | 21

Bowtie (16), utilizing the Burrows-Wheeler transform for encoding the reference

genome. This transform, published by computer company HP in 1994 as a lossless

compression algorithm, proved to be an ingenious way of compressing the genomic

index to a size manageable by most workstations (<2GB RAM), but at the same time

allowing for very rapid lookup of sequences. The bowtie suite of tools has later been

expanded with an RNA-seq assembler and isoform deconvolver (described later), and

is at the time of writing still the most widely used short read mapper in research (based

on number of citations).

RNA-SEQUENCING | 22

RNA-sequencing

Like the first fully sequenced biological polymer in 1972, the first two works

included in this theses is based on sequencing of RNA. This section will briefly cover

the history and important landmarks over the last few years of RNA-sequencing by

next generation technologies, and will subsequently dig deeper into common problems

and issues with the presented works as main focus.

Milestones and history

RNA-seq was introduced to the research community as a group of 5 papers

coming out at approximately the same time, but the first significant paper using RNA-

seq and paving the way for digital gene expression was by Mortazavi et al. in 2008. An

important benchmark paper, their results showed high inter-sample fidelity between

technical replicates (later confirmed in (17), arguing that technical replicates are per se

not needed in RNA-seq) , and, using in-vitro synthesized transcripts as spike-ins, high

sensitivity and dynamic range, establishing the method as a serious competitor to the

previously ubiquitous microarray-based gene expression platform (18). This work also

established the first simple normalization and quantification methods and metrics, and

showed, perhaps surprisingly, a correlation with arrays with a R2 at around only 0.7.

More recent advances in data processing and analysis have increased this correlation to

around 0.8 (19). Whether the remaining difference owes to the shortcomings of the

array platform, the immaturity of RNA-seq processing, or a combination of the two, is

not yet readily discernible.

RNA-SEQUENCING | 23

Later the same year, the first global study of alternative splicing using RNA-seq

was published in Nature by the Burge lab (20). Surprisingly, this study found that

more than 90% of all human genes undergo alternative splicing. Classifying eight

different types of alternative events, authors found strong tissue-specific and highly

conserved “switch-like” splicing events. To accomplish this extensive mRNA isoform

discovery, reads were mapped to an artificial custom genome based on a transcript

database of all possible exon-exon junctions, in turn allowing capture of all possible

splice events (between the annotated exons from the selected repositories, at least).

Until recent advances in full isoform deconvolution, this approach was popular, and

was also employed in a modified version by the RAINMAN pipeline presented in

paper I (as described below).

Much progress followed in the following years, not all relating to digital gene

expression and alternative splicing. RNA-seq found other interesting usages including

sequencing of non-coding RNA-species, assessment of allele-specific expression and

detection of fusion genes (one of the first works on this is (21)). Technical advances

were also made, including single cell RNA-seq (22) and strand-specific RNA-seq (23).

Similar advances were made on the software side (some will be discussed later),

Figure 1: Isofo rm expr es sio n results from the ENC ODE p roject , bas ed on isof orm as sembly from RNA-seq da ta . Adap ted fr om Djeba li et al , 201 2.

RNA-SEQUENCING | 24

including software packages for normalization, differential calling, and visualization,

and reference-free assembly (they are too numerous to be referenced here).

Perhaps the pinnacle of research in genome wide transcription profiles was the

publication of the ENCODE consortiums large scale sequencing of the transcriptome

of 15 human cell lines in 2012 (24). This study validated the previous claim of almost all

human multi-exon genes undergoing alternative splicing, and showed that at least 60%

of the full human genome is transcribed (but not necessarily translated), perhaps the

final nail in the coffin of “junk-DNA”. Using Tophats/Cufflinks for isoform assembly

(discussed below), ENCODE made several interesting findings – perhaps one of the

more curious ones showed, that as the number of annotated isoforms increase for a

given gene, the fraction of the most expressed isoform (the major isoform) decreases

(signifying biological importance of multiple isoforms, Error! Reference source

not found.a), but that most genes reach a plateau at 10-12 isoforms (perhaps

signifying a rate-limiting or energy-wise prohibitive value of expressing too many

differentially spliced isoforms, Error! Reference source not found.b). This is a

good example of a bias-free finding facilitated by sequencing, which would have been

impossible just 10 years ago with microarray-technology.

In the relatively short life of RNA-seq, the field has been undergoing rapid

advances, and methods are ever evolving. This sometimes translates to older methods

being called obsolete by newer, as will be evident of this follows in the section about

isoform deconvolution.

RNA-SEQUENCING | 25

Figure 2: RNA-seq normalization. Left: RNA-seq data from Upf2 KO mouse l iver (data from artic le I) after RPKM normalization. Red dots represent the 50 most expressed genes in WT, green and pink dots represent 50 random sampled genes from the second 2 nd and 4 th bin of an 8-bin quantization of expression, respectively . Note how neither subset straddles the black l ine of no DE. Right: same data after quanti le -normalization, one of many normalization methods avai lable. Note: this f igure is for i l lustrative purposes, and does not relate to method used in paper I .

Normalization issues

A pervasive RNA-seq challenge is the problem of normalization, an issue that has been

apparent in almost all data this author has been in touch with. In this context,

normalization is understood as the action of adjusting count values proxy of transcript

or gene expression in a way that adjusts for unwanted (often technical) bias, and makes

comparisons between and within samples meaningful. Normalization methods can

roughly be divided into two groups - the first group, intra-sample methods, normalize

each given data point (transcript or gene count) using a specific approach. One of the

earliest proposed intra-sample methods was that of RPKM (Reads Per Kilobase of

transcript length per Million mapped reads), simply normalizing coverage of a given

transcript to the transcript length and the sequencing depth (18). Even though the

RPKM method has been shown to have some pitfalls in relation to gene length bias (in

particular in the over-estimation of lowly expressed transcripts)(25,26), it is still a

popular method and readily gives a assessment of the relative levels of transcripts intra-

RNA-SEQUENCING | 26

sample-wise. Another method, borrowed from microarray analysis, is quantile

normalization.

RNA-seq doesn’t disclose any information about the total RNA levels in the

cell. This have sometimes been dubbed the “sequencing real estate” problem and stems

from the notion that in some experiments, as a given sample is sequenced at a given

depth, larger groups of RNA species can be more or less abundant in only one sample.

After the RPKM normalization method is applied, ranges of expression can be skewed

towards or away from this sample (over- or undersampling), although this does not

represent an actual difference in expression (but rather a difference in absolute RNA

abundances), and this can be potentially dangerous in downstream significance calling.

An example of a skewed RNA-seq expression profile is given in figure 2. Here, the

highly expressed range of genes is skewed towards the WT, resulting in under-

sampling of the remaining expression space. This phenomenon is somewhat akin to

issues of probe saturation found in microarrays, but additionally gives rise to more

peculiar distributions of inter-sample expression comparisons, where difference in

absolute abundances in certain ranges of expression may produce a local skew - e.g.

“banana-like” scatterplots (data not shown). The second group of normalization

methods, inter-sample methods, generally calculates normalization factors to be

applied equally to all data-points for a given sample to account for differences in the

true abundance of RNA. The TMM method, used in paper I, and part of the EdgeR

Bioconductor package (27), assumes that technical variation can be modeled using the

poisson distribution, and biological variation using the negative binomial distribution,

a assumption that also holds true for other seq-data, including ChIP-seq. Using this

variance as weights, TMM simply proposes an inter-sample normalization procedure

by calculating a normalization factor based on the trimmed mean of log expression

ratios, under the assumption that most genes are not differentially expressed (28).

RNA-SEQUENCING | 27

Since the publication of paper I, several methods have been proposed in

literature regarding RNA-seq normalization, and it’s now evident that this issue is just

as pertinent for RNA-seq as it was for microarray. The results from one recent

comparison study including several methods (29) confirmed earlier suspicions that the

RPKM (and by extension, the FPKM metric used by Cufflinks) metric is a bad choice

due to gene length bias issues, and found DESeq (30) and TMM normalization to be

most robust (Figure 3). A combination of RPKM and TMM normalization, as was

applied in paper I, wasn’t tested, but is expected to inherit some of the strengths and

weaknesses of both methods.

Isoform de-convolution and splice class detection

Initial approaches in detecting alternative splicing was based on creation of

artificial genomes, consisting of all possible exon-exon junctions generated from an

mRNA repository of choice. For article I, we developed our own pipeline based on

this approach, RAINMAN. This method, primarily motivated by single-end

sequencing being the available data, and by its relative simplicity, has a few inherent

pitfalls. Firstly, each alternative splicing event is only detected based on co-occurrence

of two exons, and the inter-dependence between multiple events cannot be

determined. In our case (article I), the observation that more than 95% of junctions

mapped to concurrent exons of annotated, productive and coding mRNAs, provided

Figure 3: A comp aris on inter - and intra sam ple norm aliza tion methods for RNA-seq. T C=to ta l count, UQ = upp er quart il e, Med = media n, DES eq = nor maliza tion b y the R-pa cka ge o f the s ame name, T MM = trimmed mean of M-val ues (edgeR), Q = quanti le , RPKM = r ea ds p er k ilob ase per mil l ion mapp ed, Raw Count = raw read counts (no norma lizat ion) . Diff erent normal izat ion m ethods p roduce w idely di fferent results depending on their a ppr oach, and w hether they ’ re intra- (Q a nd RPKM) or inter-sa mple ( the rest) . Ada pted from Dill ies et al ., 2012

RNA-SEQUENCING | 28

the argument that although we couldn’t detect multiple splice events within the same

messenger, those would statistically be relatively rare (0.05^2 = 0.0025), assuming no

dependency between events. We dared this normally dangerous assumption, as noisy

splicing was thought to happen in a random fashion.

Under this assumption, the remaining exons of the longest supported

annotated isoform from a array of public repositories (Refseq, UCSC knownGene,

Ensembl, etc…) was attached to each exon of the respective junction, and thus the full

length isoform inferred, and, later, its coding potential predicted. This exhaustive

exon-exon junction catalog also provided a simple but suitable platform for inferring

alternative splicing events, and the alterations of NMD KO on AS. We chose to detect

the classes of AS recognized in literature as the minimum informative set, namely

single and multiple exon-skipping (SES/MES), alternative 5’ and 3’ splice sites (A5SS

/A3SS), alternative first and last exons (AFE, ALE) and mutually exclusive exons

(MX). A graphical overview of these classes is given in Supplemental figure 1. The

detection algorithm was detected on an iterative approach – in the first step, for each

given splice junction, junctions with similar left or right exonic edges (or coordinates)

were isolated. Next, the connections and supporting exons of these junctions were

included in the model, and depending on the trace back to the original junction, and

the inclusion or exclusion of supportive or inhibitive junctions and exons, the splice

class was deduced. An example is given in Figure 4 for two splice classes. For (A),

two junctions (contJunc1 and contJunc2) with identical left and right coordinates with

testJunctestJuncleftcontJunc1left

testJuncrightcontJunc2right

skippedExon

contJunc1rightskippedExonstart

contJunc2leftskippedExonend

contJunc2contJunc1

testJunc

juncShareRightx

juncShareRighty

testJunc-Exon

juncShareExonx

juncShareExony

A B

Single exon skipping Alternative 5’ splice site

Figure 4: Cla ssi f icat ion o f alter na tive s plicing events ba sed on exon-exon junct ion evidence. Ref er to text for detai ls.

RNA-SEQUENCING | 29

the single junction testJunc, and with evidence supporting only one exon between

those two junctions, classifies as SES. For (B), a junction (testJunc), sharing right

coordinates with another junction (juncShareRighty) and left coordinates with an

exon, which itself has support for an additional exon sharing left coordinates, but with

different right coordinates, from which a junction (juncShareRightx) has identical right

coordinates with juncShareRightx, classifies as A5SS. Similar logic was applied for the

remaining splice classes. It has to be noted, that although desirable, exon read

coverage was not included in the classification, as time did not permit this feature to be

implemented.

As a nicety and convenient functionality, visualization of splice junctions color

coded by splice class for export to genome browser was included in the RAINMAN

software. Figure 5 shows a classical example exon-exon junction coverage of the

NMD-regulated splice factor Srsf7, color coded by splicing class – in NMD KO

tissues, isoforms including a PTC-containing exon (marked in yellow), normally

spliced in by the splicing factor itself in a auto-regulatory manner, are rescued.

For the work presented, this platform was sufficient and successful in detecting

a marked phenotype of unproductive PTC+ transcripts and a global alteration in

alternative splicing in the two NMD KO systems utilized.

Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7

Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7

chr17.2379

Mammal Cons

UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics

RefSeq Genes

Ensembl Gene Predictions

Genscan Gene Predictions

Exoniphy Mouse/Rat/Human/Dog

TROMER Transcriptome database

30-Way Multiz Alignment & Conservation

Sfrs7Sfrs7Sfrs7Sfrs7Sfrs7

Sfrs7

ENSMUST00000063417ENSMUST00000063403ENSMUST00000112421

chr2.38800001.10chr2.38800001.12

chr2.38800001.14 chr2.38800001.16chr2.38800001.18

chr2.38800001.19 chr2.38800001.21

MTR000672.17.2018.2MTR000672.17.2018.4MTR000672.17.2018.5MTR000672.17.2018.8MTR000672.17.2018.9

MTR000672.17.2018.7MTR000672.17.2018.0

MTR000672.17.2018.3MTR000672.17.2018.6

MTR000672.17.2018.1

0.47

0.13

0.13

0.13Junctions, KO

Junctions, WT

Figure 5 : S creens hot f rom the UCS C geno me b row ser, show ing bo nema rrow -derived ma cro phage (BMM ) RNA-seq coverage in Up f2 WT vs KO of the NMD -regulated sp lice factor Srs f7 (S fsr7 ). Gr een junct ions repr es ent alterna tive splice s ites and red junct ions indicate single exon skipping. Numbers ab ove junct ions incl uding o r skipping the yel lo w auto-regulatory S TOP -containing exon a re re la tive exp ression val ues.

RNA-SEQUENCING | 30

Around 2010 perhaps, paired-end sequencing was becoming the norm in RNA-

seq. Here, both ends of the library insert are sequenced and can be mapped to the

genome. This information, combined with the known insert size as well as exon

coverage / junction-evidence, enables modern RNA-seq assemblers to attempt full-

length transcript de-convolution. Perhaps the most widely used and best cited is the

Cufflinks assembler (31), part of the ubiquitous Tuxedo-suite from the hand of Cole

Trapnell. Cufflinks achieves de-convolution by a series of simple steps (Figure 6). After

initial mapping of reads to the genome, and de novo splice junction detection (A), all

mutually incompatible splice

fragments are reduced to the

parsimonious number of possible

paths through the splice graph (B,

C). For genes with several splice

events, this combinatorial approach

generates a large number of

potentially false positive isoforms,

and this initial step is similar to the

junction-based approach described

above and presented in paper I,

where “canonical”, annotated exons

are simply concatenated to each

junction. In perhaps the pivotal step,

each putative isoforms compatibility

with the evidence is determined,

using exon coverage and the

fragment length distribution (D)

(either determined the mate distance

in pre-annotated single exon genes

or given by the user), setting the prior

Figure 6: RNA Deconvolut ion b y Cuff l inks. Refer to main text fo r details . Adap ted fr om T rap nell et al . , 201 0

RNA-SEQUENCING | 31

abundances of each transcript. Finally, a minimum cost probability function is

maximized, finding the final set of correct parsimonious isoforms, and abundances are

reported (E). This maximizing approach vastly improves isoform inference (31).

Other approaches exist to RNA-seq assembly (for a recent review, see (32)), but

the Tuxedo suite has proved to be robust in benchmarks, and was among others used

as assembler in the ENCODE project, and for our splice class detection software

spliceR (paper II), implemented in R and available in the Bioconductor project, we

choose to build directly on Cufflinks-assembled trancripts (although other full-length

assemblers are supported). As several earlier efforts in splice detection from RNA-seq

data has been made, the main aim for spliceR was easily interoperability with R,

Bioconductor and Cufflinks. Another major point was complete classification of all

splice classes, as well as no requirement for pre-annotated splice event or transcript

data. In addition, the option to import the genomic locations of elements spliced in/out

for each alternative splicing type was included. This should be a powerful tool for

downstream analysis of systems where global alternative splicing is altered, including

sequence analysis and motif finding, and phylogenetic analyses. At the time of writing,

three collaboration projects take advantage of spliceR, and we hope to see a good

amount of interest in the package, when Bioconductor 2.13 is released.

Coding Potential Prediction

Coding potential prediction of mRNAs is in itself not a super-tough problem,

but a few different approaches exist. PhyloCSF (33) and CPC (34) have an

evolutionary approach, using aligments of potential coding regions combined with

machine learning approaches to scan for the frequency of synonymous and conservative

substitutions, scoring for coding potential. Coding potential prediction. CPAT (35)

bases its score on both the length of the longest ORF in all reading frames, as well as

other sequence linguistic features.

The approach chosen in article I to predict coding potential and position of

(potentially premature) STOP codons was based on the concept, that in the NMD

KO, events representing aberrant splicing are independent (as exemplified by 95% of

RNA-SEQUENCING | 32

junctions mapping to canonical junctions), and thus annotated ORFs can be applied

to transcripts generated by exon concatenation and scanned trough. The approach is

outlined in Supplemental figure 2

For spliceR (article II), the same approach was chosen, scanning full isoforms

with the most upstream compatible ORF, if any. Cufflinks, however, often finds many

novel genes and isoforms, and spliceR in general only annotates 30-35% of found

isoforms with ORF and STOP positions, and heuristic approaches, e.g. as presented

in CPAT, would be a desirable addition to the package.

Processing and analysis of CAGE-data

An entirely different flavor of RNA-seq is CAGE-seq, or Cap Analysis of Gene

Expression combined with high-throughput sequencing. CAGE is based on the

sequencing of 5’ ends of RNA, and when these short sequences, referred to as “tags”,

are mapped to back to the genome, a genome-wide map of transcription start sites

(TSSs) can be inferred, and the number of times each TSS is represented can be taken

as a proxy of that transcript’s expression (36). This technique was originally devised as

a technological platform for the FANTOM project for genome-wide profiling of

promoters (36). Obviously, regular RNA-seq intrinsically has the power to map 5’

transcription start sites, but with lower precision, both due to lower coverage as well

as coverage bias – all parts of the mRNA is not covered equally, and in some cases, the

5’UTR may not be covered at all.

For paper III, blood was taken from human patients diagnosed with acute

promyelocytic leukemia (APL, a cancer of the myeloid lineage of blood cells), as well as

from healthy patients, and thus very little biological material was available from the

purified and cell-sorted samples, To facilitate a promoterome analysis, we applied a

variant of the CAGE technique dubbed nanoCAGE, which allows for small library

input sizes (37,38). This was the first time this technique had been applied to human

clinical samples, and served as a proof-of-concept for the method in this setting.

In this low-input material scenario, PCR-amplification bias was expected to be

a significant issue, and after initial observations of the data, we had some doubts to

RNA-SEQUENCING | 33

whether any meaningful biological signals were to be deducted. Applying rigorous

filtering and thresholding, however, resulted in an interesting phenotype – a switch in

transcript-length preference between WT and cancer cells: After initial read quality

filtering, and mapping to the genome, the following four filters were applied:

1. PCR-bias: Removal of all tag “towers” with a width of one read (in normal CAGE,

this could be dangerous, removing several meaningful TSS clusters, but manual

inspection of the nanoCAGE data showed, that these towers cropped up in a more

random fashion.)

2. High-pass expression filter: in CAGE, the analog to RNA-seqs RPKM is tags per

million (TPM), but without the gene length normalization. To further weed out

technical noise, an expression of 2 TPM was required.

RNA-SEQUENCING | 34

3. Variance: as some heterogeneity in patients with APL is expected due to unique

cancer genotypes, some of the variance between the cancer samples may not be due

to noise. None the less, variance obfuscates downstream analyses and significance

calling, and clusters with a coefficient of variance above 1 were removed

4. Exon noise filtering: doing “birds-eye” analysis of the data, an amount of “exon-

painting”, or low-coverage reads spanning exonic areas, were evident, possibly due

to capture of degradation products. To avoid clusters being called in these regions,

clusters were required to be expressed more than 1.5 times the average exonic

coverage for the relevant gene, for some analyses. Variance filtering also

contributed to countering this problem.

Figure 7: Fo r na noC AGE da ta , r igid f i ltering is required to w eed out noise - low ly expressed and narrow c lusters are genera lly no t a ssociated w ith pre-annotated T SS s, motiva ting f i ltering on these pa rameter s. (Distance to near es t annotated TS S on the x-axis) .

RNA-SEQUENCING | 35

Some of these filters, and the motivation for the thresholds may seem arbitrary.

Figure 7 represents three variables, TPM (color coding), width of clusters (Y) and

distance from closest annotated UCSC knownGene transcription start site. Although

an intrinsic strength of the CAGE technique is promoter discovery, many promoters

detected in this experiment represent exisiting pre-annotated promoters, and the

distance to the nearest TSS can be used as a quality measure of different filtering

thresholds. From the plot, distance to nearest TSS is much tighter for higher

expressed clusters, and for wider clusters, legitimizing filters for both expression and

width.

In CAGE, tag clustering and filtering is just one component of the data

analysis, for which many parts are similar to those of RNA-seq, including gene

ontology analysis, overlap with other datasets, network analysis, etc. One specific

analysis of paper IV was specific to CAGE, however, and shows well how the ever-

growing amount of public data can often be taken advantage of. To deduce more

about the regulatory neighborhood of our putative TSSs (lacking any transcription

factor ChIP-seq data for these samples), we instead generalized the question by

downloading all Mus transcription factor binding sites from the ENCODE project

(across cell lines), comprising more than 125 data sets in replicates. The distance

distribution from putative TSSs to the nearest transcription factor binding site was

then measured and visualized. Although no mechanistic inference can be made

directly, as tissues and sample conditions are not comparable, this is an example of

large-scale data integration in genomics.

CHROMATIN IMMUNOPRECIPITATION COUPLED WITH SEQUENCING | 36

Chromatin Immunoprecipitation coupled with sequencing

The work done in the fourth article presented in this thesis is based largely on

data from sequencing of DNA-fragments bound by certain transcription factors (or

ChIP-seq), and this section will focus on common challenges in analyzing ChIP-seq

data and how these were countered in the presented work.

Milestones and history

The predecessor of ChIP-seq is ChIP-chip or ChIP-on-chip, using chromatin

pull downs coupled with DNA micro arrays. Before the first papers using high

throughput sequencing were published in ‘07/’08, this was the method of choice for

assaying global transcription factor binding profiles, but obviously had its limits in the

same way that microarray analyses for gene expression is limited, most apparent being

the restriction to a pre-defined set probes. As such, researchers could only probe a very

small fraction of a mammalian genome, resulting in arrays covering, for instance, only

known promoters, or exonic areas (39).

Harnessing the (theoretically) non-discriminatory power of sequencing, a

natural evolution of the method was given, and the first papers utilizing ChIP-seq was

published in Nature and Cell in 2007. In the first, profiling the global binding of

transcription factor STAT1 in interferon-treated versus normal HeLa cells, more than

40,000 putative binding regions were presented with, compared to todays standards,

relatively few mapped reads (40), and this paper established a methodological and

statistical framework for signal processing of ChIP-seq data, that is still in use today.

The second paper presented a genome-wide map of histone lysine and arginine

methylations, providing new insights in epigenetic regulation that was not readily

obtainable with older technologies, and likewise defining much of the data analysis

terminology (41).

Since then, a slew of papers have been published using ChIP-seq, and the

largest effort so far, similar to RNA-seq, is most likely that of the ENCODE project,

integrating data from thousands of experiments and sequencing runs into a combined


regulatory lexicon, describing the

transcriptional, regulatory, and

epigenetic landscape of the human

genome (24,42–44).

Peakfinding and normalization

Perhaps the primary challenge in

analazying ChIP-seq data of

transcription factors (as well as histone

modifications) is the detection of peaks,

i.e. DNA regions that can statistically

be induced to be bound by the protein

of interest, compared to the

background, defined either by the

signal-to-noise ratio in the sample, by a

mock IgG sample, or both. Since the

introduction of next-gen sequencing,

developing peak finder software have

been an active pursuit in genomics, and

to date, there is more than 20 published

attempts at solving this problem. Peak

finders generally differ in the signal

scanning process, statistical models

used, and in sensitivity for detecting

different regions (e.g. broad histone

marks vs. narrow transcription factor

peaks).

As mentioned above, a statistical framework based on modeling the

background genomic coverage (often referred to as lambda) using a Poisson

distribution, and calculating thresholds and significance levels based on this was

Figure 8: An example o f CH iP-seq pea ks fro m 16 sam ples; 8 t im e-po ints after pa rt ial l iver hep atectomy fo r two transcrip tion fa ctors, CE BPA and CE BPA. Black do tted l ines represent consens us p ea k r egions, in w hich ta gs f or each s ampl e are quantif ied.


presented in the first effort on ChIP-seq, as was subsequently carried through in one of

the earliest most popular peak finders, MACS (45). MACS refined peak calling by

introducing a local lambda (defined as either 1, 2 or 10k nt around the peak) to correct

for local biases, as well as the employment of a shift size modeling step, trying to

deduce the DNA library insert size for more accurate peak definition.

For article VI, MACS and uSeq (a peak caller with similar inner workings),

were employed for our sample pool size of 16 (two TFs, 8 time points). An example of

the binding landscape and peak definition is given in Figure 8. When quantifying

binding intensity for each sample, a common reference for inter-sample comparison is

required (akin to gene or transcript ids for RNA-seq). As each sample was peak-called

separately, the common reference was simply defined as the area under all overlapping

peaks for all samples – analog to the 1-dimensional “shadow” simultaneously cast by all

samples on the genome. Having these consensus peaks, regular downstream

bioinformatic analyses could follow, including clustering, gene-association, network

analysis, and ...

Motif discovery

Discovering transcription factor binding motifs has been a pursuit in genomics

for some time, and data from ChIP-on-chip and ChIP-seq methods is a natural

catalyst for this. Three large repositories containing binding motifs exist, and were the

basis of the combined high-confidence reduced motif bank generated for article IV:

JASPAR, open-source, and containing a small, non-redundant and literature-backed

set, primarily based on the SELEX method (46); TRANSFAC (47), a commercial set

of binding motifs, but with a limited open access set; and UniPROBE (48) , a open

access library based on protein-binding microarray data. Our purpose was to reduce

the data from these three sources into something smaller, more manageable and of

higher quality and lower redundancy. Aggregation of all available models resulted in

939 position weight matrices (PWMs), and the smaller, final high-confidence set was

generated as follows:

1. Removal of duplicates


2. End-trimming of models, removing all subsequent nucleotides with an

information content (IC) less than 0.5 bits.

3. Removal of all models with a total IC of less than 8 bits

4. Scoring of all models against each other using the spearman correlation

coefficient, and subsequent hierarchical clustering.

5. After inspection of the distance tree, a threshold of ~500 clusters were

chosen (figure 9) based on biologically meaningful clusters, and within

each cluster, the motifs were aligned using Smith-Waterman local

algorithm. The consensus alignment motif was extracted, and

subsequently end-trimmed and filtered as in step 2 and 3.

This approach resulted in a little less than 500 high confidence motifs, which

were the basis of the binding motif profiling used in article IV.

Figure 9 : A sect ion of a dendrogram representing the hierarchical clustering o f ~1000 pos it ion w eight matrices (PWM s) (nam es at the le ft side), after tr imming a nd al ignm ent. T he blue l ine represents a n arb itraty cuto ff ( a b io logical meaningful va lue set af ter manual inspection of the tree), reducing the set to a ppr ox. 500 PW Ms -

CONCLUDING REMARKS | 40

Concluding Remarks

In the 14 years that has passed so far of the 21th century, genomics can truly be

said to have taken centre stage, in no small part due to the advent of high throughput

sequencing. The genomic era, perhaps starting in 1972 with the first determination of

the sequence of a gene, and perhaps ending with the final draft of the human genome,

is now over, and the post-genomic era has begun. But as easy as it is to fling around

clichés about the marvels of this time of cutting edge research, evenly hard is it perhaps

to pinpoint defining key research discoveries, and this is an inherent challenge in

modern day genomics. As the amount of data grows larger and larger, research has a

tendency to move away from hypothesis-driven manipulation of simpler systems to

genome-wide, organism-wide profiling of any biological information system.

The articles presented in this thesis all present attempts to solve problems in

contemporary biology by applying the methodology of quantification, classification

and statistics on data from high throughput sequencers.

They were all challenging in the sense that no clear-cut hypothesis existed a

priori, in least in terms of analyzing the sequence data. This required an exploration-

heavy approach, which when most frustrating (and it will inevitably be at times) could

be likened to a “fishing expedition” and when most giving, an unbeatable feeling of

actually finding the needle in the haystack. I found, through trial-and-error and an

increasing confidence in dealing with big data, that a certain amount of

systematization and consistence in data handling and analysis should be a priority in

bioinformatics projects. A wide variety of tools and pipelines exist, and even though

some general approaches are agreed on in an implicit manner in the field, few standard

operating procedures exist (one example of such is the ENCODEs complete guideline

for ChIP-seq studies, from lab-bench to computer (49)) – the data is only as good as

the tools you use to analyze it with! The Bioconductor project is a good example of an

inter-operable, common (and open source) platform for genomics, and with spliceR

CONCLUDING REMARKS | 41

(presented in artile II), we hoped to present a valuable addition to that project. And as

applications and tools mature with the field as a whole, more standards will evolve.

As discussed in the introduction, sequencing has become an accepted and

widely used method in most laboratories that do some sort of genomic research, and a

lot more robust and well-documented tools and pipelines have been published since. It

is this author’s observation, that the gap between researchers from a classical biological

background (and with few skills in computer science), and researchers with a

background in computer science is decreasing, and most likely, future researchers in

genomics will be equally well trained in both disciplines. “Big Data” is not a

phenomenon restricted to genome biology, and undergraduate programs in science

(and other disciplines) are aware of this.

The future in sequencing is promising, especially as the clinical domain starts to

accept the new possibilities fully, in part also facilitated by decreasing costs and

improved opportunities for running hundreds of samples by the means of multiplexing.

With the increased processing and normalization quality of RNA-seq, this platform is

expected to completely replace microarrays, with its non-discriminative approach,

offering de-novo transcript detection, characterization of alternative splicing, etc. And

as reads get longer, and transcript assemblers more advanced, we will come closer to

solving the isoform problem fully. For ChiP-seq, the advantages over ChIP-on-chip

are just as apparent, and even though large-scale efforts are increasing our knowledge

of the regulatory landscape repeatedly, luckily there’s a few thousand more

transcription factors in the human genome, who hasn’t been described yet, and just as

many cell types! Combined with profiling of epigenetic marks, genome resequencing

and all the other wonderful appliances of sequencing, the unraveling of the complex

layers of life goes perhaps a little faster.

ORIGINAL ARTICLES | 42

Original Articles

ORIGINAL ARTICLES | 43

Article I

Mammalian tissues defective in nonsense-mediated mRNA decay display

highly aberrant splicing patterns

ARTICLE I | 44

RESEARCH Open Access

Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrantsplicing patternsJoachim Weischenfeldt1,2,3†, Johannes Waage1,2,4†, Geng Tian5, Jing Zhao5, Inge Damgaard1,2,3,Janus Schou Jakobsen1,2, Karsten Kristiansen6, Anders Krogh2,4,6, Jun Wang5,6 and Bo T Porse1,2,3*

Abstract

Background: Nonsense-mediated mRNA decay (NMD) affects the outcome of alternative splicing by degradingmRNA isoforms with premature termination codons. Splicing regulators constitute important NMD targets;however, the extent to which loss of NMD causes extensive deregulation of alternative splicing has not previouslybeen assayed in a global, unbiased manner. Here, we combine mouse genetics and RNA-seq to provide the first invivo analysis of the global impact of NMD on splicing patterns in two primary mouse tissues ablated for the NMDfactor UPF2.

Results: We developed a bioinformatic pipeline that maps RNA-seq data to a combinatorial exon database,predicts NMD-susceptibility for mRNA isoforms and calculates the distribution of major splice isoform classes. Wepresent a catalog of NMD-regulated alternative splicing events, showing that isoforms of 30% of all expressedgenes are upregulated in NMD-deficient cells and that NMD targets all major splicing classes. Importantly, NMD-dependent effects are not restricted to premature termination codon+ isoforms but also involve an abundance ofsplicing events that do not generate premature termination codons. Supporting their functional importance, thelatter events are associated with high intronic conservation.

Conclusions: Our data demonstrate that NMD regulates alternative splicing outcomes through an intricate web ofsplicing regulators and that its loss leads to the deregulation of a panoply of splicing events, providing novelinsights into its role in core- and tissue-specific regulation of gene expression. Thus, our study extends theimportance of NMD from an mRNA quality pathway to a regulator of several layers of gene expression.

BackgroundAlternative splicing (AS) involves the selective inclusionand exclusion of exons from a nascent pre-mRNA thatresults in various combinations of mature mRNAs withdifferent coding potential and thus protein sequence [1].Importantly, it has recently been estimated that nearly95% of all multi-exon genes in the mammalian cellundergo AS [2,3], suggesting a pivotal role for AS inregulating and expanding the repertoire of isoformsexpressed. By examining ESTs, it has been proposedthat one-third of all AS isoforms contain a premature

termination codon (PTC) [4], and these are expected tobe targeted for degradation by nonsense-mediatedmRNA decay (NMD). NMD is an mRNA quality controlmechanism, and the primary function of NMD was initi-ally thought to be in removal of aberrant transcriptsarising from mutations or faulty transcription, mRNAprocessing or translation, but it is now evident thatNMD impacts on both diverse physiological processes[5-7] as well as pathophysiological conditions (reviewedin [8]). The conserved core components of the NMDpathway are the UPF1, UPF2 and UPF3A/B proteins,and mutations or depletion of these factors inactivateNMD [9,10]. In mammalian cells, PTCs are distin-guished from normal stop codons by their position rela-tive to a downstream exon-exon junction, which ismarked by the deposition of the exon junction complex

* Correspondence: [email protected]† Contributed equally1The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, Universityof Copenhagen, DK2200 Copenhagen, DenmarkFull list of author information is available at the end of the article

Weischenfeldt et al. Genome Biology 2012, 13:R35http://genomebiology.com/2012/13/5/R35

© 2012 Weischenfeldt et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly cited.

ARTICLE III | 45

[11]. It has been generally established that for a stopcodon to be recognized by the NMD apparatus, it mustbe situated at least 50 nucleotides upstream of an exon-exon boundary (the 50 nucleotides rule) [12]. Thus,nearly all naturally occurring eukaryotic stop codons arefound downstream of the last intron, thereby renderingthem immune to NMD. Although recent data havedemonstrated that the proximity of the poly(A)-bindingprotein (PABP) to the PTC is inversely correlated withthe efficiency of NMD [13,14], the 50 nucleotides ruleapplies to almost all studied mammalian transcripts, tak-ing heed of a few noted exceptions [15,16]. Mechanisti-cally, AS can utilize NMD to selectively degradetranscripts by the selective inclusion of a PTC-contain-ing (PTC+) exon or exclusion of an exon, resulting in aPTC+ downstream exon. This coupling, initially discov-ered for serine/arginine-rich (SR) proteins in Caenor-habditis elegans [17], has been coined regulatedunproductive splicing and translation (RUST) or AScoupled to NMD (AS-NMD) [4,18]. Intriguingly, pro-teins involved in splicing processes utilize AS-NMD toautoregulate their own synthesis through a negativefeedback loop. The most well characterized splicing acti-vators, the SR proteins, bind to cis elements in the pre-mRNA, usually stimulating the inclusion of an exon.The SR proteins have been shown to utilize AS-NMD ina negative feedback loop to activate the inclusion of aPTC+ exon (PTC upon inclusion) in their own pre-mRNA, thus resulting in NMD [18-21]. The othermajor class of splice regulators, the heterogeneousnuclear ribonucleoproteins (hnRNPs), are a class ofRNA binding proteins with roles in mRNA splicing,export and translation [22,23]. The hnRNPs often, butnot always, bind to splice silencer elements and represssplicing at nearby splice sites. Splicing repressors, suchas hnRNPs, use AS-NMD to repress the inclusion of acoding exon in their own pre-mRNA that leads to anout-of-frame skipping event, consequently inducing adownstream PTC and thus NMD (PTC upon exclusion).Moreover, AS-NMD is also used to cross-regulateexpression of other splice factors, as described elegantlyfor PTBP1 and PTBP2 [24].AS is regulated by the selective recruitment of splice

regulators to pre-mRNAs. It is well established that spli-cing activators (such as SR proteins) compete with spli-cing repressors (such as hnRNPs) for binding to splicesites in an antagonistic manner, where the relative con-centration of the two classes regulates the level of AS[25,26]. Thus, the fate of an alternative exon is usuallydecided by the antagonism between SR proteins andhnRNPs and their concentration and activity (reviewedin [27]). Due to the autoregulatory feedback loopemployed by splice regulators, modulating NMD couldpotentially have widespread effects on the concentration

of splicing activators and repressors and thus AS andAS-NMD.Despite the potential of AS-NMD, it is presently not

known to what extent this pathway regulates the tran-scriptome on a global scale, and how AS homeostasis isaffected by NMD. A major problem in discovering thefull spectrum of PTC+ transcripts is that EST andcDNA repositories are biased against these isoforms dueto their unstable nature in normal cells. Additionally,most studies have primarily focused on microarrays toquery the transcriptome upon muting NMD [18,19,28],and the newer studies using sequencing [29] havefocused on single cassette exon events, thus disregardingmany other physiologically important splice classes suchas mutually exclusive exons and alternative 5’ and 3’splice site usage. Last but not least, the transcriptomicconsequences of genetically modulating NMD andthereby AS in the mammalian organism are largelyunanswered.In the present study, we have performed RNA-seq on

two different tissues from a Upf2 conditional knock-out(KO) mouse line. Thus, in addition to providing signifi-cantly novel insights into common and tissue-specificfunctions of NMD, our study represents the first com-prehensive and unbiased transcriptome analysis of adultgenetically modified mice. To facilitate a high-resolutionanalysis of all possible exon-exon combinations, we havegenerated a bioinformatic pipeline, named RAINMAN(RnAseq-based Isoform detection and NMd ANalysispipeline; available to the scientific community), thatmaps reads to a combinatorial database that incorpo-rates both known and in silico predicted exon-exonjunctions. The pipeline predicts NMD susceptibilitybased on junction evidence and groups AS events intoseven major splice isoform classes. Using this approach,our results reveal an unprecedented increase in ASupon ablating UPF2, and by inference NMD (althoughsmall additive effects from UPF3A/B interaction andderegulation of UPF1 phosphorylation cannot be whollyexcluded from analysis, we juxtapose UPF2 and NMDablation for all purposes in this work) and show that ahigh proportion of these upregulated AS events are notpredicted to contain a PTC. Hence, we find that only50% of the increase in AS is directly due to stabilizationof PTC+ isoforms. Our data demonstrate that mutingNMD results in deregulated levels of core splice regula-tors at both the mRNA and protein level, and furthersuggests that this contributes to the deregulation of gen-eral AS upon ablating NMD. Hence, our data support amodel where NMD ablation leads to deregulated levelsof core splicing factors that ultimately lead to aberrantglobal splicing, implying a novel intricate interplaybetween the NMD machinery and the splicing factors.Finally, our data analysis generates the first insights into


Page 2 of 19

ARTICLE I | 46

a putative functional role of the PTC and its surround-ing regions through analysis of their conservationpatterns.

ResultsA major task in analyzing AS by RNA-seq is to exploreand quantify the differential representation of varioussplice classes and the protein-coding potential of themapped reads. To this end, we have developed RAIN-MAN, a streamlined bioinformatic pipeline available tothe scientific community that maps RNA-seq reads to acomprehensive combinatorial database of exon-exonjunctions for unique junction discovery, PTC detectionand identification of splice isoforms. We used this pipe-line to investigate the complexity of the transcriptomeand global role of AS-NMD in vivo in mammalian cells,by taking advantage of our Upf2 conditional KO mouse,which we previously used to demonstrate the in vivoimportance of NMD [7,30]. To explore the effect ofNMD on global splicing and to generate and validate anattractive bioinformatic pipeline to study AS and NMD,we chose to analyze the transcriptomes of two differentmammalian organ systems with distinct phenotypesupon UPF2 deletion. In one end of the spectrum, weanalyzed liver, wherein removal of UPF2 results in fail-ure in liver metabolism and a high mortality rate [30],and in the other, we analyzed bone marrow-derivedmacrophages (BMMs). These macrophages are gener-ated in vitro from murine bone marrow cells, and Upf2deleted BMMs are completely devoid of NMD activitybut nevertheless show no morphological or functionalphenotype compared to wild-type (WT) controls [7].

Splice isoform inference and PTC detectionIn order to obtain biological material, we first generatedmice in which the NMD core factor Upf2 could beselectively inactivated in liver or in BMMs using ourpreviously reported strategy (see Materials and meth-ods). We performed whole transcriptome sequencing(single-end) on poly (A)-purified RNA, isolated frompoly-IC injected Upf2fl/fl; Mx1Cre and Upf2fl/fl livers(termed from now on ‘Liver KO’ and ‘Liver WT’, respec-tively) and Upf2fl/fl; LysMCre BMMs and Upf2fl/fl BMMs(henceforth ‘BMM KO’ and ‘BMM WT’, respectively).In order to minimize biological variation, we generatedlibraries from pools of poly(A)-purified RNA derivedfrom three individual animals. Due to the underrepre-sented nature of NMD-susceptible transcripts in ESTand cDNA databases [4], we generated a comprehensivecombinatorial database of exon-exon junctions in themurine genome, using sequences from exon modelsannotated in different repositories (RefSeq, Ensembl,UCSC Known Genes, GENSCAN and Exoniphy;

Supplemental Materials in Additional file 1 and FigureS1B in Additional file 2). To discover junctions betweenexons not previously recorded in the murine transcrip-tome, we also employed TopHat [31], a de novo map-ping algorithm, to map reads to junctions betweenunannotated exons (Table S1 in Additional file 1). Toutilize reads that cover three or more exons, thus align-ing to two or more exon-exon junctions, and not initi-ally mapped to our combinatorial database or byTopHat, we incorporated an extra read truncation step(Figure S1B in Additional file 2), trimming reads insteps of 10 bp and remapping, allowing us to recover anadditional 4 to 7% of total reads. As a result, 84% ofreads mapped to the genome or transcriptome, and outof these, 28% mapped to splice junctions (Figure 1a),corresponding to a combined total of 323,474 uniquesplice junctions across our two tissues and two geno-types. With a minimum requirement of three reads to asplice junction, minimizing sequencing and mappingartifacts, we mapped approximately 150,000 uniquejunctions. Of note, we found that the de novo mapperTopHat contributed significantly to our transcriptomeset, with 11 to 16% of all discovered junctions uniquelydefined by TopHat (Table S1 in Additional file 1). Map-ping of the KO samples benefited the most from theseTopHat predicted splice junctions, suggesting that thecurated transcript repositories are biased against theTopHat-predicted junctions due to their NMD suscept-ibility. Indeed, TopHat identified 20% and 24% of thePTC+ splice junctions above our minimum read cutoff,compared to only 4% and 11% of PTC- splice junctions(junctions that do result in the generation of a PTC, inBMM and liver samples, respectively (Table S2 in Addi-tional file 1). These data demonstrate that at least aquarter of NMD susceptible transcripts are not presentin the normal murine repositories. This has implicationsfor future splice isoform detection, since many tran-scripts that are below detectable levels under physiologi-cal conditions, and hence are absent in the repositories,will not be detected under conditions that could favortheir presence, for example, perturbed or disease states.Thus, taking advantage of our combinatorial databasethat includes a pipeline for splice isoform detection andPTC prediction is likely to considerably increase thenumber of isoforms detected under conditions thatfavor increased AS.In the next step of RAINMAN, the pipeline calculates

the distribution of seven different splice isoform classesand predicts PTCs (see Supplementary Methods inAdditional file 1 for details), and all data are combinedfor easy visualization and data mining. An illustrativeexample of a genome browser output from the mappingsteps is shown in Figure 1b, with the conditional Upf2


Page 3 of 19

ARTICLE III | 47

KO gene deleted for exons 2 and 3 in the KO sample.The UCSC-based visualization includes both evidence ofreads mapped to junctions and to exons, and, in thiscase, demonstrates increased AS in the liver KO com-pared to WT.We mapped a total of 150,000 unique junctions with

approximately 100,000 and 130,000 junctions in BMMand liver tissues, respectively (minimum of three readsto a junction; Figure 2a, top, and data not shown). Tally-ing up, we found that 6,256 and 7,997 unique genes har-bored 14,056 and 25,534 upregulated junction events inBMM and liver, respectively, an average of 2.3 and 3.2upregulated junction events per gene (Figure 2a),demonstrating that loss of UPF2, and by inferenceNMD, leads to a substantial deregulation of splicing.

Ablating NMD results in significant upregulation of PTC+junctionsIt has been shown that a stop codon is generally recog-nized as a PTC by the NMD machinery, if it is situatedat least 50 nucleotides upstream of an exon-exon junc-tion, termed the 50 nucleotides rule. We assayed andverified this distance requirement by simply plotting theposition of RefSeq gene model stop codons that residein the penultimate exon (Figure S2 in Additional file 3),observing that the vast majority of these stops indeedfall precisely within 50 nucleotides of the final intronboundary, and thereby avoiding the elicitation of NMDfor those transcripts. Incorporating this distance metricin our PTC prediction algorithm, we calculated the frac-tion of regulated PTC+ junctions as a function of UPF2

(a)

(b)

9%2%

27%62%

8%2%

27%

63%

8%2%

30%60%

8%2%

27%62%

ExonicJunctionsIntronicIntergenic

Reads, BMM WT Reads, BMM KO Reads, Liver WT Reads, Liver KOExonicJunctionIntronicIntergenic

5,124,062 5,794,917 10,795,989 10,421,2412,195,422 2,469,916 5,341,681 4,573,723202,137 226,777 350,573 418,607718,984 760,327 1,512,332 1,378,347

Total mapped 8,240,605 9,251,937 18,000,575 16,791,918

chr2:

BMM WT

BMM KO

Liver WT

Liver KO

Upf2 gene

Upf2 gene locus

Exoncoverage

Junctions

78.088.097.077.0

Exons deleted in Upf2 KOExons deleted in Upf2 KO2

Junctions

Junctions

Junctions

Exoncoverage

Exoncoverage

Exoncoverage

Mapping frequency

59700005880000

Figure 1 Mapping and downstream analysis of RNA-seq data. (a) Total number of RNA-seq reads mapped to exonic, junction, intronic andintergenic regions as well as the fraction of mapped reads for all four samples. Pie charts below visualize the distribution of all mapped reads.(b) An example of an output from the pipeline, visualized in the UCSC genome browser, showing the Upf2 gene locus, with junctions supportedby reads (horizontal bars in each sub-window), and exon coverage (vertical bars). Refer to Figure S1 in Additional file 2 for another example of agenome browser output. No minimum read cutoff was used for junction visualization. The very low exon coverage present for knocked outexons 2 and 3 in KO samples represents a miniscule amount of non-recombined tissue/cells.


Page 4 of 19

ARTICLE I | 48

ablation. As expected, this analysis demonstrated ahighly significant upregulation of PTC+ junctions in theKO samples (Figure 2a, all junctions versus PTC+ junc-tions, P-value 1 ! 10-16, Chi-square). In total, approxi-mately 44% of predicted PTC+ junctions wereupregulated in BMMs and Liver KO (Figure 2a),amounting to 16% (BMM) and 28% (liver) of expressedgenes containing at least one NMD-susceptible splicejunction regulated more than two-fold.Among the upregulated PTC+ splice junctions, 516

were shared between the two tissues, and these corre-sponded to 448 unique genes (Figure 2a; Table S3 in

Additional file 4), which we thus termed core NMD tar-gets. The group includes well-known NMD targets suchas Smg5, Hsf1, and Zcchc6 as well as many knownNMD-susceptible splicing factors [6,18,19]. Indeed,Gene Ontology (GO) analysis demonstrated that thehighest ranked cluster contained genes involved inmRNA processing and splicing (P-value of 9.0 ! 10-18,Bonferroni corrected; see the full list of GO terms andgenes in Table S3 in Additional file 4). All classical SRproteins have been shown to utilize AS-NMD to autore-gulate their own synthesis [20], and our finding thatcommon splicing factor junctions are upregulated upon

(a)

(b)

Upregulated Unregulated Downregulated

Junctions 14,056 (14 %) 72,499 (74 %) 11,311 (12 %)

Genes 6,256 (30 %) 9,156 (44 %) 5,501 (26 %)

Junctions 25,534 (20 %) 84,901 (68 %) 14,793 (12 %)

Genes 7,997 (33 %) 9,932 (41 %) 6,580 (27 %)

Junctions 2,257 (4 %) 49,583 (94 %) 989 (2 %)

Genes 1,689 (16 %) 7,736 (75 %) 885 (9 %)

Junctions 1,733 (44 %) 1,705 (43 %) 524 (13 %)

Genes 1,272 (45 %) 1,135 (40 %) 449 (16 %)

Junctions 5,962 (43 %) 5,725 (41 %) 2,180 (16 %)

Genes 3,062 (43 %) 2,581 (37 %) 1,401 (20 %)

Junctions 516 (41 %) 720 (57 %) 32 (3 %)

Genes 448 (45 %) 507 (51 %) 32 (3 %)

Liver

Common

BMM

PTC+ junctions

All junctions

Liver

Common

BMM

Major Gene Ontology for PTC+ genes

Liver KO

BMM KOmRNA processing (Common genes)

Mitochondrial function (Liver-specific genes)

GTPase regulator activity (BMM-specific genes)

Junctions Genes

Figure 2 Number of regulated splicing and PTC occurrences in Upf2 KO tissues. (a) Splice junctions with a minimum of three reads in WTor KO samples were grouped into upregulated (!2-fold up in KO), downregulated (!2-fold down in KO) or unregulated. Genes corresponding tothe junctions are also shown. Left: number of junctions and genes and percentage in parentheses for the relative fraction in each tissue. The tophalf of the table lists results for all junctions, the bottom half for PTC+ junctions only. Right: pie charts visualize the relative distribution ofregulated junctions and corresponding genes. (b) Venn diagram of major Gene Ontology terms associated with liver-unique genes (top), BMM-unique genes (bottom) and common genes (overlap).


Page 5 of 19

ARTICLE III | 49

muting NMD is in agreement with earlier studies[7,18,19,32]. We also determined the GO terms asso-ciated with genes containing upregulated PTC+ junc-tions unique to either liver or BMMs (Figure 2b). In theBMM-specific set, genes involved in G-protein coupledreceptor function were enriched (P-value 1.5 ! 10-4,Bonferroni corrected; Table S4 in Additional file 5),whereas the liver specific set was strongly associatedwith mitochondrion, among others (P-value 9.7 ! 10-56,Bonferroni corrected; Table S4 in Additional file 5).In summary, we find that 43 to 44% of all predicted

PTC+ junctions are upregulated in Upf2-ablated tissuesand that 516 of these junctions are common in BMMand liver, several of which are well-known NMD targets.Moreover, we show that NMD is predicted to downre-gulate isoforms of 16 to 28% of all expressed genes(excluding the well-described regulation of splicing fac-tors from the analysis). Moreover, NMD regulates genesinvolved in mitochondria and G-protein-coupled recep-tor functions in liver and BMM, respectively, suggestingthat NMD also serves tissue-specific functions.

Loss of NMD leads to the selective stabilization ofalternative splicing eventsWe next compared the impact of ablating NMD oncanonical versus AS in more detail. Here, a canonicaljunction is defined as a splicing junction between twoconsecutive exons of the longest RefSeq isoform foreach gene, and AS junctions are thus all other combina-tions. In total, our pipeline detected approximately10,000 unique AS junctions in BMMs compared toapproximately 35,000 in the liver (Figure 3). A signifi-cant challenge in analyzing expression changes uponNMD ablation is to distinguish primary from secondaryeffects. Using the sensitive PTC-detection in RAIN-MAN, we found that close to 50% of AS was predictedto generate a PTC (Figure 3a, bottom), which is in starkcontrast to canonical splicing junctions (Figure 3a, top).This suggests that NMD directly degrades 16% (0.47 !0.35) of all AS in BMMs and 18% in liver (0.45 ! 0.39),resulting in more than a two-fold reduction of theinvolved isoforms.The BMM and liver samples demonstrated essentially

similar fractions of up- and downregulated AS junctions(Figure 3a). However, the number of unique AS junc-tions were higher in liver samples, even after calibratingfor the number of mapped junction reads (and therebysequencing depth; Figure 3b, top). To investigatewhether this was a characteristic feature of the organsor whether it was due primarily to muting of NMD, weanalyzed the AS complexity per gene (Figure 3b). Thenumber of mapped junction reads was first normalizedto BMM WT (Figure 1a) by stochastically removingreads from BMM KO, liver WT and liver KO, to

exclude any skewing due to differences in sequencingdepth. Interestingly, these data demonstrate an inher-ently different AS profile between the two tissues.Whereas BMM WT and KO showed similar proportionsof AS, measured as the correlation between the numberof AS junctions and the total number of junctions pergene (Figure 3b, top), the proportion of AS of the liverKO sample increased considerably more compared toboth BMM KO and liver WT (Figure 3b, top). Thisstrongly suggests that the liver has a high level of ASrelative to BMM, both normally and through ablation ofNMD.Finally, we also compared the proportion of PTC+

junctions, measured as the number of PTC+ junctionsas a function of total splicing for each gene (Figure 3b,bottom). The PTC+ junctions followed the same trendas for AS junctions, with a steeper relative increase inPTC+ junctions in liver KO compared to liver WT andBMM. From linear interpolation, we found that, onaverage, 27% of all spliced junctions in a given gene inthe liver are predicted to result in a PTC (liver KOslope of 0.27; Figure 3b).In conclusion, these data demonstrate that approxi-

mately 17% of all AS is downregulated via NMD morethan two-fold. We furthermore show that liver is char-acterized by a high degree of AS compared to BMMsand that close to one-third of all uniquely spliced junc-tions in the liver are predicted to elicit a PTC, thus giv-ing the first in vivo evidence of the pervasive effect andimportance of AS-NMD in the mammalian organism.

Splice isoform classes are differently affected by NMDablationAs described above, we detected a significant increasein AS upon ablating UPF2. It has, however, not pre-viously been described to what extent NMD affects dif-ferent splicing classes. Transcriptome-wide studieswhere NMD has been muted have primarily looked atsingle cassette exon skipping (ES) events [28,32], thusdisregarding many physiologically relevant AS events.To this end, we implemented an AS isoform classifica-tion module that allowed us to infer the degree towhich all major splice isoform classes were affected byNMD. Hence, our splice isoform inference pipelineassessed the distribution of splice junctions to sevenmajor groups of AS events, namely single exon skip-ping (SES) and multiple exon skipping (MES), alterna-tive 5’ splice site (A5SS) and alternative 3’ splice site(A3SS), mutually exclusive exons (MXE), alternativefirst exon (AFE) and alternative last exon (ALE) (seeTable 1, Supplementary Materials in Additional file 1and Table S6 in Additional file 6 for pipeline perfor-mance). We narrowed our downstream splice isoformanalysis to those that demonstrated a change in


Page 6 of 19

ARTICLE I | 50

‘percent spliced in’ (PSI) between KO and WT (!PSI)higher than 20%. PSI is calculated from the ratio ofjunctions supporting a given feature (for example, theinclusion of an exon) versus junctions supporting thereciprocal event (for example, the skipping of the sameexon), and the !PSI is thus a metric of how much theinclusion changes upon NMD ablation (see Supplemen-tary Methods in Additional file 1 for further details).We found that approximately 40% of all ES events witha !PSI higher than 20% (inclusion) or lower than -20%(exclusion) in the KO tissues were upregulated due tostabilization of a PTC+ isoform (Table 1). Next, wesubdivided ES events into PTC upon inclusion andexclusion events. An example of a PTC upon exclusion

event that is stabilized upon NMD ablation in bothBMM (!PSI = -61%) and liver (!PSI = -63%) is Mgea5,which was also found upregulated in immortalizedmouse embryonic fibroblasts (MEFs) ablated for NMD[32]. We validated this Mgea5 isoform among others byRT-PCR (Figure 4b). In the liver, 55% of all single exoninclusion events generated a PTC, whereas 40% werefound in the BMM (compare PTC upon inclusion tototal inclusion events for single ES in Table 1).Tmem183a is an example of a highly skipped PTCupon inclusion that was found stabilized in both BMM(!PSI = 51%), liver (!PSI = 54%) and immortalizedMEFs ablated for NMD [32], and was similarly vali-dated by RT-PCR (Figure 4b).

(a) (b)

LiverBMM

AS

junc

tions

PTC

+ ju

nctio

ns

Number of junctions per gene

!=0.62

!=0.41

!=0.13!=0.11

!=0.37

!=0.35

KOWT

99% 99%

PTC+PTC-

UpregulatedUnregulatedDownregulated

BMM Liver CommonCanonical splice junctions

Alternative splice junctions

!=0.27

!=0.16

BMM WT: 8,107/2.195 = 3,693BMM KO: 9,447/2.470 = 3,824Liver WT: 25,390/5.341 = 4,753Liver KO: 31,041/4.573 = 6,78711%

77%

12%10%

76%

14% 2%

95%

3%

35%

99%

1% 1% 1%

Fraction of unique alternative spliced junctions

BMM Liver Common

PTC+PTC-

UpregulatedUnregulatedDownregulated

number of unique AS junctions per million mapped junction reads

48 %

17 %

35 %

43 %

39 %

17 %69 %

27 %4 %

53 %47 %

55 %45 %

37 %

63 %

0 50 100 150 200

050

100

150

200

0 50 100 150 200

050

100

150

200

0 50 100 150 200

050

100

150

200

0 50 100 150 200

050

100

150

200

Figure 3 Loss of UPF2 leads to increased AS-NMD. (a) Proportions of regulated canonical junctions versus junctions supporting AS events(red, upregulated; yellow, unregulated; green, downregulated), with additional distribution pie charts underneath displaying the predictedpercentage of PTC+ junctions versus PTC- junctions (junctions that do not elicit a PTC) of upregulated junctions (black, PTC+; grey, PTC-). (b) Thenumber of AS junctions and the number of PTC+ junctions are shown as a function of number of junctions per gene in the top and bottompanels, respectively. The slopes of the curves were calculated by linear interpolation. In total, our pipeline detected 10,061 unique AS junctions inBMMs (8,107 in WT and 9,447 in KO) compared to 33,164 in the liver (25,390 in WT and 31,041 in KO), and calculations in the table (top) showthe number of unique AS junctions per million mapped reads for each sample. No minimum read cutoff was used in (b).


Page 7 of 19

ARTICLE III | 51

Interestingly, for alternative usage of A5SSs and A3SSs,we saw an even higher proportion of PTC+ events,especially in the liver (compare the fourth and fifth rowsin BMM versus liver tissue in Table 1), suggesting a pre-viously unrecognized importance for these splice eventsin AS-NMD. The ribosomal protein Rps12 is an exam-ple of a gene that generates a PTC+ isoform as a resultof A5SS in both BMM (!PSI = 50%) and liver (!PSI =55%), and likewise for Smg5 in BMM (!PSI = 37%) andliver (!PSI = 50%) as a result of A3SS (Table 1).Using stringent criteria (Supplementary Methods in

Additional file 1), we detected 44 MXE events in theBMM KO, but more than 10 times as many in the moresplice-prone liver KO tissue (Table 1). Out of thedetected MXE events, few were predicted to elicit aPTC+ isoform, possibly due to the regulated nature of amutually exclusive splice event, which in most instanceswould not be predicted to undergo NMD. An exceptionis the pyruvate kinase gene Pkm2, which encodes twodifferent enzymatic active isoforms due to a single MXEevent [33]. In the KO samples, and this was particularlytrue for BMMs, aberrant splicing resulted in inclusionof both mutually exclusive exons, resulting in a PTC+isoform (validation in Figure 4b and schematic in Figure4d). This may indicate an important role for NMD inensuring the exclusive incorporation of exons in MXEevents.Finally, we quantified the number of isoforms with an

increase in AFE or ALE. From transcriptome data alone,it is often impossible to determine whether an AFE isthe result of alternative promoter usage or AS between

two mutually exclusive first exons and the second exon.Nevertheless, AFE results in alternative transcripts withpotential PTCs. Analysis of AFE showed that more thanone-third of the detected AFE events are predicted togenerate a PTC+ isoform (Table 1). For ALE categoriza-tion, we required AS to two mutually exclusive 3’ term-inal exons, and PTC+ prediction is thus not meaningful.Hence, changes in levels of ALE are therefore mostlikely secondary effects of altered splicing upon UPF2ablation. A functional mechanism for ALE is the regu-lated inclusion of microRNA target sites in the 3’ UTR,and this potentially adds another layer of complexitywhen studying conditions where splicing is perturbed,such as by removing NMD.To test the accuracy of our splice isoform detection

algorithm, we chose 50 RAINMAN-predicted AS events(!PSI > 20% (inclusion), or < -20% (exclusion)) for RT-PCR on independent liver and BMM material. We vali-dated AS events that are not predicted to result in aPTC+ isoform (Figure 4b, top row), isoforms stabilizedin the KO due to inclusion of a PTC (Figure 4b, middlerow) and isoforms that elicit a PTC due to a skippingevent (Figure 4b, bottom row). Out of the 50 tested ASevents, we were able to validate 49 of these (98%), andwe therefore conclude that our splice isoform detectionalgorithm is highly accurate in predicting different spliceisoform events.To validate the inferred expression changes and to

assess inter sample-replicate variability, we performedquantitative PCR experiments on biological replicates.This analysis showed a strong correlation with the

Table 1 Splice events and PTC+ classificationTissue Splice

eventTotalevents

PTC+events

Percentage PTC+

Exclusionevents

PTC uponexclusion

Inclusionevents

PTC uponinclusion

BMM Total ES 730 281 38% 428 201 302 80

Single ES 511 310 148 201 80

Multiple ES 219 118 53 101 -

A5SS 130 68 52%

A3SS 238 110 46%

MXE 44 5 11%

AFE 130 44 34%

ALE 49 NA NA

Liver Total ES 3,102 1,285 41% 1,926 965 1,176 320

Single ES 1,505 932 531 573 320

Multiple ES 1,597 994 434 603 -

A5SS 449 263 59%

A3SS 654 370 57%

MXE 475 66 14%

AFE 172 66 38%

ALE 143 NA NA

Splice isoform classes are shown for both single and multiple exon skipping (ES), alternative 5’ splice site (A5SS), alternative 3’ splice site (A3SS), mutuallyexclusive exons (MXE), alternative first exon (AFE) and alternative last exon (ALE). For ALE, no events were classified as PTC+, since the class relies on splicing totwo different last exons. Here, a !PSI > 20% was required for all classes. NA, not applicable.


Page 8 of 19

ARTICLE I | 52

Immunoblotting:

UPF2

b-Actin

*

(d)(c)

Pkm2 spliced isoforms

exon 8 exon 9 exon 10 exon 11

exon 8 exon 9 exon 11

exon 8 exon 9 exon 10 exon 11

exon 8 exon 10 exon 11

STOPM1 isoform M2 isoform

NMD-sensitive M1/M2 isoform

Srfs2

Mgea5

Hnrnph

1

U2af2

Hnrnpl

Srfs6

Ptbp2

Hnrnph

3Srfs

3

Pkm2

Jmjd6

Trub2

Tmem18

3aRpl3

NgfEpn1

Nelf Mff Hpn

A5SS+ES

MXE

Smg5

A3SSES ES ES ES ES A3SS

Rps12

A5SS A3SS

ES ES ES ES ES

ESA5SS ES ES A3SSES ESES

Cfi

ES ESCbs

ES

(b)

STOP

STOP

Exon Skipping (ES)

Alternative 5’ Splice Site (A5SS)

Alternative 3’ Splice Site (A3SS)

AlternativeFirst Exon (AFE)

AlternativeLast Exon (ALE)

Mutually ExclusiveExon (MXE)

(a)

Lpca

t3

Papss

2

Ndufv3

Brp44l

Figure 4 Splice isoform classes are differentially affected by loss of UPF2. (a) Schematic drawing of the main isoform classes detected inour pipeline. (b) RT-PCR validation of 25 splicing events predicted from our pipeline (see Table S9 in Additional file 12 for a list of the 49validated events out of 50 tested). Top: normal AS events. Middle: PTC upon inclusion events. Bottom: PTC upon exclusion by ES. (c) Westernblotting from two different liver pairs, showing UPF2 (rabbit a-UPF2), SRp55, SRp40 and SRp30 (mouse a-mAb104) and b-actin (rabbit a-actin).The asterisk denotes the truncated UPF2 isoform found in cells ablated for NMD. (d) MXE splicing of Pkm2 and the inclusion of both mutuallyexclusive exons in the KO sample.


Page 9 of 19

ARTICLE III | 53

RNA-seq inferred expression changes (R = 0.975, Pear-son’s correlation; Figure S3A in Additional file 7) and avery low sample replicate variability (Figure S3B inAdditional file 7). In addition, we assessed our ability tocorrectly infer PTC+ isoforms by cloning and sequen-cing the full-length isoforms of Srsf9, a gene with aPTC-upon inclusion event (!PSI = -40% in liver; iso-form not detectable in BMMs). The predicted PTC+isoform was precisely recapitulated in vivo (Figure S3Cin Additional file 7), further validating the usefulness ofour approach.

Loss of NMD leads to deregulation of PTC- isoformsthrough breakdown of splice factor homeostasisWe have demonstrated that only 50% of upregulated ASjunctions are predicted to be the direct result of stabi-lized PTC+ isoforms, suggesting that a major proportionof the increased AS is the indirect result of perturbedNMD (Figure 3a). As we have shown here, the liver hasa high degree of AS (Figures 1 to 3), and it has beenfound to have a prominent divergent expression patternof SR proteins and hnRNPs compared to other mamma-lian tissues [34]. Hence, we hypothesize that this couldmake the liver particularly susceptible to changes insplicing patterns upon ablation of NMD.Splicing factors are known to utilize AS-NMD to

autoregulate their own synthesis in a negative feedbackloop, through the selective inclusion or exclusion ofPTC+ exons [4,18]. Indeed, our RNA-seq analysis alsoidentifies a range of these factors among the core NMDtargets (Figure 2; Table S3 in Additional file 4). How-ever, as most of the NMD sensitive splicing factor iso-forms are predicted to encode truncated proteins, therescue of these upon NMD ablation is not expected toaffect AS per se (unless they encode proteins with domi-nant negative properties). Previous microarray-basedstudies have not addressed the formal possibility thatthe stabilization of PTC+ isoforms upon loss of NMDcould occur against the backdrop of changed levels ofthe canonical isoforms capable of encoding the full-length form of proteins such as splicing regulators. Totest this possibility, we scrutinized our datasets forchanges in the expression levels of 25 canonical mRNAisoforms capable of encoding full-length splicing regula-tors by correcting the changes in the gene expressionratios between KO and WT samples with the fractioncontributed by the PTC+ isoforms (Table S5 in Addi-tional file 1). Strikingly, for the 19 splice factors forwhich we have evidence for AS-NMD in the liver, 10displayed a > 1.5-fold deregulation, with 5 displayingincreased (Sfrs3, Srfs4, Tra2a, Srfs16, Hnrpdl) and 5decreased (Tra2b, Hnrnpf, Hnrnph3, Hnrnpr, Ptbp2)expression of the canonical mRNA isoforms. Moreover,western blot analysis revealed that SRp30 and SRp40

protein levels were elevated in KO liver, demonstratingthat the changes in canonical mRNA isoform levels forat least some splicing factors are indeed translated intofull-length protein.Collectively, these findings support a model where the

increase in global AS upon loss of NMD is at least inpart caused by a massive deregulation of key splicingfactors, and further help to explain the PTC- AS frac-tion deregulated upon muting NMD. A highly aberrantlevel of splice regulators would likely tilt the normal bal-ance between activators and repressors and deregulate alarge subset of exons [4,18].

Ablating NMD leads to skipping of highly conservedcassette exonsSES represents the best characterized AS event, and is afrequent PTC-generating mechanism. RAINMAN classi-fied 40% and 60% of all SES as PTC+ events in theBMM and liver KO samples (Table 1 - compare PTCupon exclusion with total exclusion events, and PTCupon inclusion with total inclusion events). Out of thePTC upon inclusion events, a total of 26 events wereupregulated in both KO tissues (!PSI > 20%) and 51PTC upon exclusion events were similarly upregulatedin both KO tissues (!PSI < -20%; manually curated listsare included in Table S7 in Additional file 8 and TableS8 in Additional file 9). In support of an important phy-siological function, almost all of the genes harboringthese highly upregulated PTC+ isoforms were also upre-gulated at the gene level in KO samples (Table S7 inAdditional file 8 and Table S8 in Additional file 9), sug-gesting that stabilization of the PTC+ isoforms led to anappreciable increase in total mRNA levels in KO tissues.It should be noted that most of the splicing factors andribosomal proteins known to use AS-NMD were indeedupregulated in both tissues, but had !PSI > -20% or <20% in BMMs (data not shown). Hence, the high fre-quency of PTC+ splice events suggests that these areregulated AS events. It is generally found that AS exonsare more conserved than constitutive exons, especiallyin the flanking introns [35]. The core members of theSR protein family and hnRNPs have been proposed touse AS-NMD in an autoregulatory loop, by binding tohighly conserved regions, termed ultraconserved ele-ments (UCEs), to elicit the PTC+ splice event [18,20].Upon inhibition of NMD in cell lines, it has been shownthat many PTC+ exons are surrounded by high intronicconservation [18,19]. As described above, our sensitiveapproach demonstrated a high proportion of PTC- ASevents upregulated in KO tissues, and we thereforeexamined the conservation of PTC+ and PTC- singleexon exclusion (!PSI < -20%) and inclusion (!PSI >20%) events in the KO samples compared to the unre-gulated skipping events. The latter control group


Page 10 of 19

ARTICLE I | 54

contains exons that undergo AS but are not differentlyskipped between WT and KO (-20% < !PSI " 20%). Wefirst examined exclusion events, and found that flankingintrons of the skipped exons are significantly more con-served compared to unregulated skipping events (Figure5a; P-value < 2.2 ! 10-16, Komogorov-Smirnov test), andthis was even more pronounced in BMMs (Figure S4 inAdditional file 10). Interestingly, flanking introns ofPTC+ upon exclusion events were less conserved thanPTC- exclusion events, but displayed a significant higherexonic conservation (P-value < 2.2 ! 10-16, Komogorov-Smirnov test). We next considered exons that demon-strated increased inclusion in KO samples. In thesecases, flanking introns were also significantly morehighly conserved compared to skipping events not regu-lated by NMD (Figure 5b). Again, PTC- inclusion eventswere more highly conserved compared to PTC+ eventsin the flanking introns.In summary, these results demonstrate that 40 to 60% ofupregulated SES events are predicted to undergo NMD,depending on tissue type, and that these splice eventsare highly conserved. Importantly, PTC- inclusion andexclusion events displayed higher intronic conservation

compared to PTC+ events, strongly suggesting that theUCE-containing splice regulator isoforms (which are allPTC+) are not the primary reason for the high intronicconservation surrounding skipped exons. Moreover,removal of UCE elements from the analysis did not alterthe intronic conservation profiles of either PTC- or PTC+ events (data not shown). Thus, the high conservationfor both included and excluded exons appears to be ageneral attribute of highly regulated cassette exons(!PSI > 20%) that are affected directly and indirectly byNMD. Thus, the deregulation of PTC- events is there-fore consistent with a mechanism where deregulatedlevels of splice regulators in the KO samples markedlyaffect regulated splicing of their cognate target exons.This again implies that ablating Upf2, and by inferenceNMD, has important secondary effects on splicing factorhomeostasis and that this impacts widely on global AS.

Muting NMD leads to increase in splicing by-productsWe have demonstrated that removing UPF2, and byinference NMD, causes a dramatic increase in AS, inpart by stabilizing PTC+ AS isoforms. Apart from AS-NMD, NMD has been implicated in removal of genomic

(a) (b)

STOP

PTC+ single exon skipping upregulatedPTC- single exon skipping upregulatedSingle exon skipping unregulatedMm9 refseq

0.0

0.2

0.4

0.6

0.8

1.0

Exon Exclusion Events upregulated in liver KO

Con

serv

atio

n

25 25 75 75757525 25

0.0

0.2

0.4

0.6

0.8

1.0

Con

serv

atio

n

25 25 75 75757525 25

STOP

Exon Inclusion Events upregulated in liver KO

Figure 5 Introns flanking regulated exons are highly conserved. (a, b) Mean per position phastCon conservation scores around SES eventsare shown for exclusion events upregulated in the KO sample (a), and for inclusion events upregulated in the KO sample (b). Shown are PTC+exclusion/inclusion events (red line), PTC- exclusion/inclusion events (green line) and unregulated skipping events (grey line). Yellow lines arescores for all mm9 RefSeq exons and 75 bp into surrounding introns. Data shown are for liver. Exclusion events: PTC+, 439; PTC-, 251. Inclusionevents: PTC+, 64; PTC-, 162. Unregulated events: 3,494. Mm9 RefSeq exons: 274,281. See Figure S4 in Additional file 10 for a graph with BMMdata. Numbers on the x-axis indicate nucleotide intervals - 25 and 75 nucleotides for exons and flanking introns, respectively. Curves represent acubic smoothing spline fitted to data.


Page 11 of 19

ARTICLE III | 55

noise and splice errors that might otherwise lead to apanoply of spurious isoforms [7,36-38]. The extent towhich NMD destroys such splice by-products has not,however, been studied on a transcriptome-wide basis inmammalian tissues. We therefore examined the relativeexpression level of PTC+ junctions, measured as thePTC+ fraction (PTC-generating junctions/All otherjunctions) against the junction expression level (Figure6), binned by the RNA-seq metric RPKM (reads per kbof gene model per Mb of mapped reads, refer to [39]).These data show that the PTC+ fraction is enriched inliver compared to BMM over the entire expression pro-file, further emphasizing that the liver is characterizedby an increased abundance of PTC+ isoforms comparedto BMM. Importantly, there is a distinct overrepresenta-tion of PTC+ isoforms in the lower expression range forboth liver (Figure 6 - compare dark red and light red forliver KO and WT, respectively) and BMM (Figure 6 -compare dark blue with light blue for BMM KO andWT, respectively) relative to more highly expressed iso-forms. These data thus confirm and quantify, on a glo-bal scale, another major role for NMD in removing low-abundance isoforms such as the products of erroneoussplicing.

PTC exons are distinct from normal stop codon exonsWe have shown that NMD-affected SES events demon-strate a high conservation score and that this is not dueto UCE-containing SR proteins and hnRNPs. The ques-tion then arises to what extent the PTC+ exon itself isaberrant compared to other expressed exons, and inparticular the normal stop codon-containing exon. Wethus examined the conservation of the exonic sequencesurrounding the PTC in upregulated PTC+ exons, asevidenced by upregulated PTC+ junctions. To normalizefor normal exonic conservation variation and for biasesin tissue-specific exon conservation, we subtracted ran-dom exonic sequences expressed in the same samples.Interestingly, we found that PTC+ exons had a markedlydifferent conservation profile compared to normal stopcodon exons (Figure 7a; for comparison, see unnorma-lized conservation score in Figure S5 in Additional file11). Hence, exonic sequence conservation increasedtowards the PTC, and displayed only a minor drop inrelative conservation score in nucleotides following thePTC. Importantly, the PTC and surrounding nucleotidesdisplayed an overall higher conservation score relative torandom exons, and this was particularly striking forPTCs in BMM. Notably, removal of the UCE-containing

1 2 5 10 20

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Isoform expression (Junction RPKM)

PTC

+ sp

licin

g fra

ctio

n

BMM WTBMM KOLiver WTLiver KO

Figure 6 Low-expressed junctions are enriched in Upf2 KO samples. PTC-content (y-axis) in junctions binned by expression (RPKM; x-axis).All junctions for samples were binned in increments of 1 RPKM and the PTC+/PTC- ratio for each bin was calculated. No minimum read cutoffwas applied.


Page 12 of 19

ARTICLE I | 56

¨SKDVW&RQ�VFRUH��VWRS�FRGRQ�H[RQLF�VFRUH��UDQGRP

�H[RQLF�VFRUH�

TAA

TAG

TGA

Intronic

TAA

TAG

TGA

Intergenic

TAA

TAG

TGA

RefSeq STOPs

TAA

TAG

TGA

Exonic

TAA

TAG

TGA

BMM PTC+

TAA

TAG

TGA

Liver PTC+

P = 8.104e-5

P = 7.166e-12

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Codo

n fre

quen

cy

0.7

6WRS�FRGRQ�GLVWULEXWLRQ

(a)

(b)

1XFOHRWLGH�FRQVHUYDWLRQ�DW�37&V�YV.�QRUPDO�67O3�FRGRQV

3RVLWLRQ�LQ�100ES�ZLQGRZ��67O3�FRGRQ�FHQWHUHG0 20 40 60 80 100

%00�37&�/LYHU�37&�1RUPDO�5HI6HT�67O3

�0.25

�0.20

�0.15

�0.10

�0.05

0.00

0.05

0.10

Figure 7 Conservation of regulated PTCs and surrounding exons. (a) Mean per-position phastCon scores are shown, centered on the PTC,for upregulated junctions in liver and BMM. To visualize conservation around PTCs in comparison to normal exonic areas, phastCon scores froma random sample (n = 4,000) of BMM and liver-expressed exons were subtracted from either sample. For normal STOPs, phastCon-scores fromrandom RefSeq exons (n = 4,000) were subtracted. Normal STOPs are based on all RefSeq transcript models. Ranges of scores do not extendinto introns, and may be shorter than 100 bp for individual PTCs. For PTC+ junctions, a KO/WT fold change of 2 was required. BMM PTC+positions: 884. Liver PTC+ positions: 3,091. Normal RefSeq STOP positions: 23,231. (b) Distribution of stop codons: for intronic, intergenic andexonic bins, all mm9 trinucleotides in all three reading frames were sampled. RefSeq STOPs represent normal stop codons for all 21,470 RefSeqtranscript models (for genes with multiple models, the longest was used). For BMM (n = 497), PTC-inducing junctions were required to have alog2(KO/WT) fold change > 2, and a minimum of 5 reads for both genotypes summed. For liver (n = 670), PTC-inducing junctions were requiredto have a log2(KO/WT) fold change > 2, and a minimum of 10 reads for both genotypes summed. Fisher’s exact test was used to test forsignificance.


Page 13 of 19

ARTICLE III | 57

core splicing factor genes did not influence the conser-vation profile in liver or BMM (data not shown). TheBMMs are characterized by a more stringent splice pat-tern, whereas the liver has a high degree of AS andincreased PTC+ junctions expressed at low levels (Fig-ures 2 to 4 and 6), pointing to an increased fraction ofrandom splice byproducts in the liver. Hence, theincreased PTC+ exonic conservation in BMMs com-pared to the liver is likely due to a higher proportion offunctionally relevant PTCs in BMMs, found in variousvertebrates. Moreover, the relatively high conservationscore at and after the PTC suggests that these sequencesmay possibly harbor un-recognized cis elements thatcould assist in recognizing the PTC.Finally, we also gauged whether PTCs associated with

highly regulated PTC-inducing junctions displayed anypreference for the identity of the stop codon (Figure7b). Surprisingly, whereas the frequencies of normalcanonical UAA, UAG and UGA RefSeq stop codonscorresponded closely to the distribution observed inexonic regions, regulated PTCs had a marked and highlysignificant preference for UGA in both tissues. Thesefindings suggest that the NMD eliciting PTCs have beenunder selective pressure in order to adapt to the regula-tory needs of AS-NMD.

DiscussionHere, we present the first RNA-seq study from geneti-cally modified adult mice accompanied by a thoroughcomparative study of the role of NMD in two distinctmurine tissues. Moreover, this is the first analysis todate of the global impact of NMD on all major classesof AS in untransformed mammalian cells.To facilitate a detailed analysis of AS, we have gener-

ated a streamlined bioinformatic pipeline, RAINMAN,available to the scientific community, that maps RNA-seq reads to a comprehensive combinatorial exon-exonjunction database to obtain maximum AS isoform infor-mation. This method allowed us to map 28% of all map-pable reads to junctions (all samples combined; Figure1) with a total discovery of 150,000 unique junctions,thus giving us high AS information. The junction dataare further processed in the pipeline to predict NMDsusceptibility and thus coding potential with high accu-racy at single junction resolution. Finally, seven majorsplice isoform classes are inferred and processed formultiple comparison purposes.

Loss of NMD leads to a pronounced deregulation ofalternative splicingUsing this pipeline, we have shown that ablating Upf2,and by inference NMD, impacts on several layers of ASin both BMMs and liver. By examining genes that harborthe exact same spliced junctions predicted to elicit a PTC

in both tissues, our data confirmed that genes involved inmRNA processing are highly overrepresented amongcommon NMD targets, as previously shown (Table S3 inAdditional file 4). From our data, we estimate that closeto one-third of all junction events supporting AS resultin PTC+ and NMD-sensitive transcripts in the liver (Fig-ure 3b), which is in agreement with earlier predictions[4]. Previous computational analyses of dbEST andSWISS-PROT have found that 8 to 12% of human genesare predicted to be targets of NMD [4,40], but theserepositories inherently underestimate the true number ofNMD-sensitive isoforms. Our data demonstrated that16% and 28% of expressed genes in BMM and liver,respectively, were predicted to undergo NMD (upregu-lated more than two-fold in KO) of one or more iso-forms. These numbers expand upon and underscore theglobal importance of NMD and in particular AS-NMD.Nearly all detected PTC+ events could be attributed toAS, and we found that approximately 50% of all AS inthe KO was in fact predicted to elicit a PTC (Figure 3a),with the liver as the tissue with the most AS (Figure 3b).This complements previous findings showing that theliver has one of the highest levels of AS compared toother organs [2,34]. The high level of AS in the liver wasalso reflected in the fraction of low-level PTC+ junctions(Figure 6), most likely consisting of splice errors. Thus,our data make a strong case for NMD playing an impor-tant role in removing ‘splicing’ noise also in untrans-formed mammalian tissues.In contrast to the splice error-generated PTC+ iso-

forms, our analysis of highly regulated splice classes(!PSI > 20% or !PSI < -20%), revealed that approxi-mately 40% of all highly skipped exons were degradedby NMD (Table 1). These ES events were further subdi-vided into SES and MES events. Whereas three-quartersof all ES events in BMMs were SES, the proportion ofSES and MES in liver was approximately equal. Wespeculate that the increased MES proportion reflects themore promiscuous splicing pattern in the liver, thusleading to novel AS between exons not normally splicedtogether.Importantly, our analysis of conservation patterns in

SES revealed a high conservation score of NMD-regu-lated skipping exons and their flanking introns com-pared to that of unregulated events. This suggests thatevents deregulated in the absence of NMD are undertight control in normal NMD-proficient cells and thatAS-NMD plays a crucial role in controlling the expres-sion of a vast number of genes.

The splice isoform inference analysis expands therepertoire of NMD targetsFrom our splice isoform inference, we found a high pro-portion of PTC+ events in most classes with the notable


Page 14 of 19

ARTICLE I | 58

exception of MXEs (ALE events were not included inPTC classification). Due to the regulatory nature ofMXEs, in which two coding exons compete for theselective inclusion, it is not surprising that this group isunderrepresented in the PTC+ events. One interestingexample was Pkm2, in which a well-characterized MXEevent either dictates whether the adult form (M1 iso-form containing exon 9) or the embryonic isoform (M2containing exon 10) is produced. In many tumors, theembryonic M2 isoform is switched on, leading toincreased aerobic glycolysis [33]. However, we foundthat ablating UPF2, and by inference NMD, led to inclu-sion of both mutually exclusive exons with a resultingdownstream PTC (Figure 4b, d), suggesting an interest-ing function for NMD in controlling MXE. We also pro-vide the first report of a surprisingly high frequency ofPTC+ A5SS and A3SS events, suggesting that the alter-native usage of splice acceptor or donor sites is oftenutilized to regulate expression by AS-NMD. Anotherpossibility could be that these events are more error-prone - for example, by the use of a strong and weaksplice site (the alternative splice site), leading to a lessstringent AS than between two strong splice sites suchas SES. Nevertheless, our study significantly expands theknown repertoire of NMD targets in all classes of AS,many previously uninvestigated on a global scale.

Common and tissue-specific functions of NMDThe impact of NMD on transcriptome composition hasup to now only been studied in whole organisms orindividual cell lines, which has therefore precluded anyinsights into conserved versus tissue-specific functionsof NMD. Focusing first on core NMD targets, our strin-gent comparative analysis found 50 genes with a com-mon highly upregulated PTC upon exon exclusion inBMMs and liver (Table S8 in Additional file 9). Almostall of these genes were upregulated in both tissues, sug-gesting a functional role for AS-NMD in regulating theabundance of transcripts with full coding potential fromthese genes (see below). Several of the known splicingfactors previously shown to utilize AS-NMD in an auto-regulatory loop, such as Ptbp1, demonstrated PTC uponexclusion in both tissues, but below !PSI of 20% inBMMs. Besides SR proteins and hnRNPs, most of thegenes found upregulated are unknown targets for AS-NMD, such as Soat2/and Acat2. The latter gene isinvolved in esterification of cholesterol, and is known tobe regulated in a tissue-specific manner, with a particu-larly high expression level in liver and intestine and to alesser extent in macrophages [41]. Here, we found thatthis gene was upregulated 8.4-fold in BMMs and 4.2-fold in liver upon muting NMD, suggesting an impor-tant regulatory function for AS-NMD in the synthesis ofcholesterol esters.

We found 26 genes with a common PTC upon inclu-sion event (!PSI > 20%) in both BMM and liver KO(Table S7 in Additional file 8). An example of a genewith a PTC upon inclusion isoform is Nktr, which isexclusively expressed in and required for natural killer(NK) cells. We found that inhibiting NMD caused ahigh upregulation of a PTC+ isoform and a concomitant2.4- to 3.9-fold upregulation at the gene level. It hasbeen shown that aberrant isoforms of the Nktr gene arepresent in cells not expressing Nktr [42]. Here, we showthat NMD mutes Nktr in BMMs and liver (and also inNMD ablated transformed MEFs [32]), suggesting thatAS-NMD serves an important regulatory function inregulating the functional expression of this NK cell-spe-cific protein. Interestingly, Nktr is involved in NK cellactivity but contains RS repeats and a cyclophilin-domain found in several splicing factors, and this couldconfer on the protein properties sufficient to facilitateregulation of its own expression through AS-NMD.Thus, it could be that several other proteins not directlyinvolved in splicing have gained RNA-binding propertiesthat would allow them to autoregulate their own synth-esis by AS-NMD.Apart from the core NMD targets, our analysis also

yielded the first insights into potential tissue-specificeffects of NMD. Through GO analysis of parent genesof PTC+ junctions that were exclusively upregulated ineither the BMM or liver datasets, we could show a tis-sue-specific role for NMD in the regulation of G-pro-tein-coupled receptors and mitochondria function,respectively. Interestingly, these GO classes mirror themain biological functions of monocytes/macrophagesand liver tissue, that is, in immunological reactions andenergy metabolism, respectively, and we predict thatNMD-dependent transcriptome analysis in other organswould uncover tissue-specific NMD targets in pathwaysof particular importance for the organ in question.

NMD controls the expression of a network of splicingfactorsIn contrast to the well-characterized role of AS-NMD inthe autoregulation of individual splicing factors, the glo-bal consequences of its disruption on broad splicing pat-terns have not been studied in detail. We were thereforeintrigued by finding that approximately 50% of the upre-gulated AS events in the NMD-deficient tissues weredevoid of PTCs, suggesting that they were indirect tar-gets of NMD and that the importance of NMD in ASextends well beyond AS-NMD.Using conventional microarray-based gene expression

analysis, we have previously shown that several splicefactors were upregulated in both UPF2-deficient BMMsand liver [30,43]. However, these studies could not dis-criminate between canonical mRNA isoforms and


Page 15 of 19

ARTICLE III | 59

stabilized PTC+ isoforms, which is important as manyof the latter would be predicted to encode truncatedand thereby (most likely) non-functional proteins. Usingour RNA-seq data, and correcting for the presence ofthe PTC+ isoforms, we can now show deregulatedexpression of the canonical mRNA isoform (encodingthe full-length protein) for a prominent number ofsplice factors, a finding that was also corroborated bychanges in protein levels for two of these. These find-ings suggest that the prominent deregulation of PTC-containing events, at least in part, can be explained bychanges in the protein levels of splicing factors. This isfurther supported by the observation that the intronicregions surrounding deregulated PTC- exons are highlyconserved, suggesting that they are subjected to splicefactor-mediated regulation.Our data therefore extend the importance of NMD to

the regulation of PTC-deficient mRNA isoforms throughthe deregulation of splice factor levels. The mechanisms(s) by which this occurs will be the subject of future stu-dies. They also provide a clue as to why the liver ismore affected by the loss of NMD than BMMs, as theformer organ displays much higher levels of AS and istherefore predicted to be more sensitive to alterations insplicing factor levels.Finally, and most importantly, these findings expand

the functional impact of NMD on transcriptome beha-vior to also include substantial indirect effects on PTC-splicing events and highlights the crucial importance ofthis pathway as a gatekeeper of transcriptome integrity.

A role for the stop codon and its flanking regions inregulating NMDThe importance of PTC identity and its flanking regionsin the regulation of NMD is a relatively unexplored areaand has not previously been addressed on a global scale.Here we were able to show that regions flanking regu-lated PTCs display a divergent and increased level of con-servation when compared to both canonical stop codonsand exonic regions, suggesting that they have been underselective pressure. Similarly, the distribution of regulatedPTCs was markedly different from that of normal termi-nation events with a strong preference for the NMDmachinery to use UGA. This is interesting in light ofrecent findings showing that sequences (including stopcodons) having positive impact on stop codon read-through lead to a reduction of NMD through the removalof UPF1 from the 3’ UTR through a translation-depen-dent mechanism [44]. These findings may suggest thatthe PTC and its surrounding sequences have been underselective pressure to optimize NMD efficiency in a pro-cess involving translational read-through, which in turnmay be used by the cell for regulatory purposes.

ConclusionsBy developing and applying a robust bioinformatic pipe-line mediating a high-resolution study of AS and itsdynamics, this whole-transcriptome analysis of twoNMD-deficient primary mouse tissues provides a com-prehensive quantification of the impact of NMD onuntransformed mammalian transcriptomes, providingcrucial novel insights into its role in both core and tis-sue-specific regulation of gene expression, significantlyextending the importance of NMD from an mRNAquality pathway to a regulator of several layers of geneexpression. Thus, in addition to removing low levels ofsplicing ‘errors’, and destabilizing targets of AS-NMD,our analysis reveals the potentially crucial importance ofNMD in maintaining splicing homeostasis. In particular,the observed deregulation of full-length splicing regula-tors suggests that the NMD pathway controls theirexpression in a manner distinct from direct AS-NMD,and that their deregulation is an important contributorto the global deregulation of AS that we observe inNMD-deficient tissues.Finally, we have provided a reference catalogue of

NMD-regulated AS events as well as an open sourcetool, RAINMAN, facilitating the process of splice iso-form inference and PTC prediction from RNA-seqreads. Future transcriptome studies in other organs andorganisms will further reveal the impact of NMD onsplicing patterns and how this pathway modulates biolo-gical read-out in a tissue- and pathway-specific manner.

Materials and methodsMiceFor the selective deletion of UPF2 in liver and BMM weused our conditional floxed Upf2 line and the LysMCreand Mx1Cre driver lines as described previously [7].Upf2 is recombined during macrophage differentiation,and the mature BMMs are devoid of any functionalUPF2 protein as well as NMD activity. To rescue theUpf2fl/fl; Mx1Cre from the previously reported hemato-poietic lethality, Upf2fl/fl; Mx1Cre and Upf2fl/fl weretransplanted with wild-type bone marrow cells prior topoly-I:C injection. As described previously, liver washarvested 3 weeks after Upf2 recombination to avoidany indirect effects from the poly-I:C treatment [30]. Allmouse work was performed according to national andinternational guidelines and approved by the DanishAnimal Ethical Committee. This study was approved bythe review board at the Faculty of Health, University ofCopenhagen (LT-P0658).

cDNA synthesis for RNA-seqTotal RNA was harvested in Trizol (Invitrogen; Carlsbad,CA, USA from Upf2fl/fl; LysMCre and Upf2fl/fl-derived


Page 16 of 19

ARTICLE I | 60

male BMMs, grown as previously described [7]. Briefly,bone marrow cells were grown in vitro in the presence ofmacrophage colony-stimulating factor-conditioned med-ium for 7 days, giving essentially a 100% pure macrophagepopulation. Total RNA from liver was harvested in Trizol21 days after injection of poly I:C from Upf2fl/fl; Mx1Creand Upf2fl/fl; Mx1Cre. We pooled 50 μg RNA from each ofthree age-matched males to get a total of 150 μg RNA,which was subjected to two rounds of mRNA purificationby hybridizing to oligo(dT) beads (Dynabeads mRNA puri-fication kit, Invitrogen). The resulting mRNA (1 μg) wasthen used as template to prepare cDNA. Double-strandedcDNA was essentially prepared as described by the manu-facturer (Superscript Double-Stranded cDNA Synthesiskit, Invitrogen), using 10 μM random hexamers to primefirst strand synthesis. Finally, the double-stranded cDNAwas purified using QiaQuick PCR columns (Qiagen, Hil-den, Germany) followed by phenol-chloroform extraction.The quality of the cDNA was verified using a Bioanalyzer.Library preparation for cDNA sequencing was per-

formed essentially as for the whole genome DNA libraryconstruction (DNA sample prep kit, Illumina; see Sup-plementary Methods in Additional file 1). Sequencingwas performed on an Illumina Genome Analyzer IIflowcell, generating 75 bp single-end reads. RNA-seqdata have been submitted to the NCBI Short ReadArchive database with accession number GSE26561.

RT-PCR and western blottingTotal RNA was purified using Trizol and 0.5 μg RNAwas used to synthesize single-stranded cDNA with oligod(T) primers, using a ProtoScript M-MULV First-StrandSynthesis kit (New England Biolabs, Ipswich, MA, USA)as described by the manufacturer. The cDNA was usedin standard PCR reactions using Taq Polymerase (Invi-trogen), with primers specific for the exons flanking thealternative exon (see Table S9 in Additional file 12 for alist of primers). For protein analysis, snap-frozen tissuewas dissected and lysed in ice-cold RIPA buffer supple-mented with proteinase inhibitors and separated bySDS-PAGE and subjected to western blotting. Antibo-dies used: affinity-purified rabbit a-UPF2 (kind gift fromDr Jens Lykke-Andersen); rabbit a-mAb104 serum(pan-SR protein antibody; kind gift from Dr JavierCaceres); mouse a-b actin (ab6276, Abcam, Cambridge,UK). Development of a-mAb104 immunoblotted liverprotein lysate required different exposure times depend-ing on the molecular weight band, and was thus treatedindividually for the different protein bands.

RAINMAN overviewFigure S1B in Additional file 2 outlines the sequentialbioinformatic processing steps making up the RAIN-MAN pipeline utilized in this study. Our PTC and splice

event detection and quantification pipeline is availableas a series of Python and R scripts, readily scalable andcustomizable for use with RNA-seq short reads of anylength. First, an index is generated, consisting of therelevant genome assembly combined with a combinator-ial exon-exon junction database, as described in detail inthe Supplementary Methods in Additional file 1. Readsare mapped to this index (using the Bowtie/TopHat[31,39] mappers as default) and unmapped reads aretruncated to rescue reads spanning more than twoexons. Gene expression is calculated by the RPKMmetric (analogous to fragments per kb of exon per mil-lion fragments mapped (FPKM)) using the Cufflinkspackage [45] and upper quartile normalization, andjunction expression levels are calculated by quantifyingreads spanning exon-exon boundaries, likewise using theRPKM metric, and are (optionally) normalized using“Trimmed mean of M component” (TMM) normaliza-tion [46]. Finally, junctions are processed by PTC andsplice event detection algorithms, and all data are out-put, allowing the researcher to visualize results in UCSCGenome Browser and spreadsheet software. See Supple-mentary Methods in Additional file 1 for a detaileddescription of RAINMAN and Table S6 in Additionalfile 6 for splice detection validation. RAINMAN scripts,documentation and complete junction and gene expres-sion lists are available online [47].

Statistical methodsThe statistical package R was used to calculate Chi-square and Kolmogorov-Smirnov tests.

Database accessionRNA-seq data are available at the NCBI Short ReadArchive database (accession number GSE26561). RAIN-MAN scripts, documentation and complete junction andgene expression lists are available online [47].

Additional material

Additional file 1: Supplementary Information and SupplementaryTables S1, S2 and S5. Supplementary Materials, Methods andReferences. Supplementary Table S1: number of mapped junctionscontributed uniquely by the combinatorial database (Comb DB only) orTopHat (TopHat only) and the number of mapped junctions detected byboth the combinatorial database and TopHat (Both). SupplementaryTable S2: the contribution of TopHat to the number of junctionspredicted to generate a PTC versus all junctions (minimum of three readsper junction). Supplementary Table S5: deregulation of core splice factors.Gene FC indicates the change in mRNA levels for all the isoforms for theparticular gene between KO and WT.

Additional file 2: Supplementary Figure S1. UCSC Genome browseroutput of Pion gene and schematic of the RAINMAN pipeline with stepsfor mapping and processing of reads.

Additional file 3: Supplementary Figure S2. Histogram showingdistance from normal stop codon to the 3’ end of RefSeq genes with


Page 17 of 19

ARTICLE III | 61

stops in final exon, and distances to nearest downstream exon-exonjunction for genes with stop codons in the second to last exon.

Additional file 4: Supplementary Table S3. Reads per unique junctionstatistics for all samples, split into junctions discovered by mapping tothe combinatorial database versus junctions discovered by TopHat.

Additional file 5: Supplementary Table S4. Table with Gene Ontologyterms associated with genes containing upregulated PTC+ junctions thatare unique for Upf2 KO liver or BMM.

Additional file 6: Supplementary Table S6. Results from validation bymanual inspection of output from isoform class inference.

Additional file 7: Supplementary Figure S3. Validation of expressionchange inference and isoform inference.

Additional file 8: Supplementary Table S7. PTC upon inclusionisoforms (SES) upregulated in both Upf2 KO liver and BMM (!PSI > 20%).

Additional file 9: Supplementary Table S8. PTC upon exclusionisoforms (SES) upregulated in both Upf2 KO liver and BMM (!PSI <-20%).

Additional file 10: Supplementary Figure S4. Mean per positionphastCon conservation score around single exon skipping events forBMMs. Numbers on x-axis indicate nucleotide intervals - 25 and 75nucleotides for exons and flanking introns, respectively.

Additional file 11: Supplementary Figure S5. Conservation aroundupregulated PTCs, with mean per-position phastCon scores centered onthe PTC for upregulated junctions in liver and BMMs.

Additional file 12: Supplementary Table S9. List of primers used in RT-PCR validation of splicing events.

AbbreviationsA3SS: alternative 3’ splice site; A5SS: alternative 5’ splice site; AFE: alternativefirst exon; ALE: alternative last exon; AS: alternative splicing; AS-NMD:alternative splicing coupled to nonsense-mediated mRNA decay; BMM: bonemarrow-derived macrophage; bp: base pair; ES: exon skipping; EST:expressed sequence tag; GO: Gene Ontology; hnRNP: heterogeneous nuclearribonucleoprotein; KO: knock-out; MEF: mouse embryonic fibroblast; MES:multiple exon skipping; MXE: mutually exclusive exon; NK: natural killer;NMD: nonsense-mediated mRNA decay; PCR: polymerase chain reaction; PSI:percent spliced in; PTC: premature termination codon; PTC-: PTC absence;PTC+: PTC containing; RAINMAN: RnAseq-based Isoform detection and NMdANalysis pipeline; RNA-seq: next generation RNA sequencing; RPKM: readsper kb of gene model per million mapped reads; SES: single exon skipping;SR: serine/arginine-rich; UCE: ultraconserved element; UTR: untranslatedregion; WT: wild type.

AcknowledgementsThis work was supported by grants from the Danish Natural ScienceResearch Council, the Novo Nordisk foundation, the Lundbeck Foundationand the European Research Council (financial support to J Waage; EU FP7framework programme/ERC grant agreement 204135).

Author details1The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, Universityof Copenhagen, DK2200 Copenhagen, Denmark. 2Biotech Research andInnovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen,Denmark. 3Section for Gene Therapy Research, Rigshospitalet, University ofCopenhagen, DK-2100 Copenhagen, Denmark. 4The Bioinformatics Centre,University of Copenhagen, DK-2200, Copenhagen, Denmark. 5BGI-Shenzhen,Shenzhen 518083, China. 6Department of Biology, University of Copenhagen,DK-2200 Copenhagen, Denmark.

Authors’ contributionsJoWe carried out experiments, drafted the manuscript and conceived thestudy. JoWa carried out the bioinformatic analysis and helped to draft themanuscript. GT, JZ, JuWa, and KK facilitated sequencing. ID and JSJ assistedwith experimental work. AK participated in study design. BP helped to draft

the manuscript and conceived the study. All authors have read andapproved the manuscript for publication.

Competing interestsThe authors declare that they have no competing or conflicting interests.

Received: 28 November 2011 Revised: 6 May 2012Accepted: 24 May 2012 Published: 24 May 2012

References1. Black DL: Mechanisms of alternative pre-messenger RNA splicing. Annu

Rev Biochem 2003, 72:291-336.2. Pan Q, Shai O, Lee L, Frey B, Blencowe BJ: Deep surveying of alternative

splicing complexity in the human transcriptome by high-throughputsequencing. Nat Genet 2008, 40:1413-1415.

3. Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S,Schroth G, Burge CB: Alternative isoform regulation in human tissuetranscriptomes. Nature 2008, 456:470-476.

4. Lewis BP, Green RE, Brenner SE: Evidence for the widespread coupling ofalternative splicing and nonsense-mediated mRNA decay in humans.Proc Natl Acad Sci USA 2003, 100:189-192.

5. Rehwinkel J, Letunic I, Raes J, Bork P, Izaurralde E: Nonsense-mediatedmRNA decay factors act in concert to regulate common mRNA targets.RNA 2005, 11:1530-1544.

6. Wittmann J, Hol EM, Jäck H-M: hUPF2 silencing identifies physiologicsubstrates of mammalian nonsense-mediated mRNA decay. Mol Cell Biol2006, 26:1272-1287.

7. Weischenfeldt J, Damgaard I, Bryder D, Theilgaard-Mönch K, Thoren LA,Nielsen FC, Jacobsen SEW, Nerlov C, Porse BT: NMD is essential forhematopoietic stem and progenitor cells and for eliminating by-products of programmed DNA rearrangements. Genes Dev 2008,22:1381-1396.

8. Gardner LB: Nonsense-mediated RNA decay regulation by cellular stress:implications for tumorigenesis. Mol Cancer Res 2010, 8:295-308.

9. Perlick HA, Medghalchi SM, Spencer FA, Kendzior RJ Jr, Dietz HC:Mammalian orthologues of a yeast regulator of nonsense transcriptstability. Proc Natl Acad Sci USA 1996, 93:10928-10932.

10. Lykke-Andersen J, Shu MD, Steitz JA: Human Upf proteins target an mRNAfor nonsense-mediated decay when bound downstream of atermination codon. Cell 2000, 103:1121-1131.

11. Le Hir H, Gatfield D, Izaurralde E, Moore MJ: The exon-exon junctioncomplex provides a binding platform for factors involved in mRNAexport and nonsense-mediated mRNA decay. EMBO J 2001, 20:4987-4997.

12. Nagy E, Maquat LE: A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. TrendsBiochem Sci 1998, 23:198-199.

13. Behm-Ansmant I, Gatfield D, Rehwinkel J, Hilgers V, Izaurralde E: Aconserved role for cytoplasmic poly(A)-binding protein 1 (PABPC1) innonsense-mediated mRNA decay. EMBO J 2007, 26:1591-1601.

14. Eberle AB, Stalder L, Mathys H, Orozco RZ, Mühlemann O:Posttranscriptional gene regulation by spatial rearrangement of the 3’untranslated region. PloS Biol 2008, 6:e92.

15. Wang J, Gudikote JP, Olivas OR, Wilkinson MF: Boundary-independentpolar nonsense-mediated decay. EMBO Rep 2002, 3:274-279.

16. Bühler M, Steiner S, Mohn F, Paillusson A, Mühlemann O: EJC-independentdegradation of nonsense immunoglobulin-mu mRNA depends on 3’UTR length. Nat Struct Mol Biol 2006, 13:462-464.

17. Morrison M, Harris KS, Roth MB: smg mutants affect the expression ofalternatively spliced SR protein mRNAs in Caenorhabditis elegans. ProcNatl Acad Sci USA 1997, 94:9782-9785.

18. Ni JZ, Grate L, Donohue JP, Preston C, Nobida N, O’Brien G, Shiue L,Clark TA, Blume JE, Ares M: Ultraconserved elements are associated withhomeostatic control of splicing regulators by alternative splicing andnonsense-mediated decay. Genes Dev 2007, 21:708-718.

19. Saltzman AL, Kim YK, Pan Q, Fagnani MM, Maquat LE, Blencowe BJ:Regulation of multiple core spliceosomal proteins by alternativesplicing-coupled nonsense-mediated mRNA decay. Mol Cell Biol 2008,28:4320-4330.

20. Lareau LF, Inada M, Green RE, Wengrod JC, Brenner SE: Unproductivesplicing of SR genes associated with highly conserved andultraconserved DNA elements. Nature 2007, 446:926-929.


Page 18 of 19

ARTICLE I | 62

21. Saltzman AL, Pan Q, Blencowe BJ: Regulation of alternative splicing bythe core spliceosomal machinery. Genes Dev 2011, 25:373-384.

22. Pinol-Roma S, Choi YD, Matunis MJ, Dreyfuss G: Immunopurification ofheterogeneous nuclear ribonucleoprotein particles reveals anassortment of RNA-binding proteins. Genes Dev 1988, 2:215-227.

23. Han SP, Tang YH, Smith R: Functional diversity of the hnRNPs: past,present and perspectives. Biochem J 2010, 430:379-392.

24. Boutz PL, Stoilov P, Li Q, Lin CH, Chawla G, Ostrow K, Shiue L, Ares M Jr,Black DL: A post-transcriptional regulatory switch in polypyrimidine tract-binding proteins reprograms alternative splicing in developing neurons.Genes Dev 2007, 21:1636-1652.

25. Hanamura A, Caceres JF, Mayeda A, Franza BR, Krainer AR: Regulatedtissue-specific expression of antagonistic pre-mRNA splicing factors. RNA1998, 4:430-444.

26. Bai Y, Lee D, Yu T, Chasin LA: Control of 3’ splice site choice in vivo byASF/SF2 and hnRNP A1. Nucleic Acids Res 1999, 27:1126-1134.

27. Chen M, Manley JL: Mechanisms of alternative splicing regulation:insights from molecular and genomics approaches. Nat Rev Mol Cell Biol2009, 10:741-754.

28. Pan Q, Saltzman AL, Kim YK, Misquitta C, Shai O, Maquat LE, Frey BJ,Blencowe BJ: Quantitative microarray profiling provides evidence againstwidespread coupling of alternative splicing with nonsense-mediatedmRNA decay to control gene expression. Genes Dev 2006, 20:153-158.

29. Ramani AK, Nelson AC, Kapranov P, Bell I, Gingeras TR, Fraser AG: Highresolution transcriptome maps for wild-type and nonsense-mediateddecay-defective Caenorhabditis elegans. Genome Biol 2009, 10:R101.

30. Thoren LA, Nørgaard GA, Weischenfeldt J, Waage J, Jakobsen JS,Damgaard I, Bergström FC, Blom AM, Borup R, Bisgaard HC, Porse BT: UPF2Is a Critical Regulator of Liver Development, Function and Regeneration.PLoS ONE 2010, 5:e11650.

31. Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions withRNA-Seq. Bioinformatics 2009, 25:1105-1111.

32. McIlwain DR, Pan Q, Reilly PT, Elia AJ, McCracken S, Wakeham AC, Itie-Youten A, Blencowe BJ, Mak TW: Smg1 is required for embryogenesis andregulates diverse genes via alternative splicing coupled to nonsense-mediated mRNA decay. Proc Natl Acad Sci USA 2010, 107:12186-12191.

33. Christofk HR, Vander Heiden MG, Harris MH, Ramanathan A, Gerszten RE,Wei R, Fleming MD, Schreiber SL, Cantley LC: The M2 splice isoform ofpyruvate kinase is important for cancer metabolism and tumour growth.Nature 2008, 452:230-233.

34. Yeo G, Holste D, Kreiman G, Burge CB: Variation in alternative splicingacross human tissues. Genome Biol 2004, 5:R74.

35. Kim E, Goren A, Ast G: Alternative splicing: current perspectives. Bioessays2008, 30:38-47.

36. Mitrovich QM, Anderson P: mRNA surveillance of expressed pseudogenesin C. elegans. Curr Biol 2005, 15:963-967.

37. Mendell JT, Sharifi NA, Meyers JL, Martinez-Murillo F, Dietz HC: Nonsensesurveillance regulates expression of diverse classes of mammaliantranscripts and mutes genomic noise. Nat Genet 2004, 36:1073-1078.

38. Sayani S, Janis M, Lee CY, Toesca I, Chanfreau GF: Widespread impact ofnonsense-mediated mRNA decay on the yeast intronome. Mol Cell 2008,31:360-370.

39. Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficientalignment of short DNA sequences to the human genome. Genome Biol2009, 10:R25.

40. Hillman RT, Green RE, Brenner SE: An unappreciated role for RNAsurveillance. Genome Biol 2004, 5:R8.

41. Cases S: ACAT-2, A second mammalian Acyl-CoA:cholesterolacyltransferase. Its cloning, expression, and characterization. J Biol Chem1998, 273:26755-26764.

42. Simons-Evelyn M, Young HA, Anderson SK: Characterization of the mouseNktr gene and promoter. Genomics 1997, 40:94-100.

43. Weischenfeldt J, Damgaard I, Bryder D, Theilgaard-Mönch K, Thoren LA,Nielsen FC, Jacobsen SE, Nerlov C, Porse BT: NMD is essential forhematopoietic stem and progenitor cells and for eliminating by-products of programmed DNA rearrangements. Genes Dev 2008,22:1381-1396.

44. Hogg JR, Goff SP: Upf1 Senses 3’UTR Length to Potentiate mRNA Decay.Cell 2010, 143:379-389.

45. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H,Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression

analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc2012, 7:562-578.

46. Robinson MD, Oshlack A: A scaling normalization method for differentialexpression analysis of RNA-seq data. Genome Biol 2010, 11:R25.

47. RAINMAN (RnAseq-based Isoform detection and NMd ANalysis pipeline)..[http://people.binf.ku.dk/jwaage/RAINMAN/].

doi:10.1186/gb-2012-13-5-r35Cite this article as: Weischenfeldt et al.: Mammalian tissues defective innonsense-mediated mRNA decay display highly aberrant splicingpatterns. Genome Biology 2012 13:R35.

Submit your next manuscript to BioMed Centraland take full advantage of:

• Convenient online submission

• Thorough peer review

• No space constraints or color figure charges

• Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar

• Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit


Page 19 of 19

ARTICLE II | 63

Article II

spliceR: An R package for classification of alternative splicing and prediction of

coding potential from RNA-seq data

ARTICLE II | 64

© Oxford University Press 2013 1

Gene expression spliceR: An R package for classification of alternative splicing and prediction of coding potential from RNA-seq data Kristoffer Knudsen1,2,3,4, Bo Torben Porse1,2,3, Albin Sandelin2,4,* and Johannes Waa-ge1,2,3,4,* 1The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen, DK2200 Copen-hagen, Denmark

2Biotech Research and Innovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen, Denmark 3!The Danish Stem Cell Centre (DanStem) Faculty of Health Sciences, University of Copenhagen, DK2200 Co-penhagen Denmark

4The Bioinformatics Centre, University of Copenhagen, DK2200, Copenhagen, Denmark Received on XXXXX; revised on XXXXX; accepted on XXXXX

Associate Editor: XXXXXXX

ABSTRACT Summary: With the advent of increasing depth and decreasing costs in digital gene expression technologies exemplified by RNA-sequencing, researchers are now able to profile the transcriptome with unprecedented detail. These advances not only allow for pre-cise approximation of gene expression levels, but also for character-ization of alternative isoform usage/switching between samples. Recent software improvements in full transcript deconvolution prompted us to develop spliceR, an R package for classification of alternative splicing. spliceR labels isoforms based on fully assem-bled transcripts, detecting single- and multiple exon skipping, alter-native donor or acceptor sites, intron retention, alternative first or last exon usage, and mutually exclusive exon events. Alongside, event spliced-in/out values are calculated for effective post-filtering, and genomic coordinates of differentially spliced elements are annotated for downstream sequence analysis. Furthermore, spliceR has the option to predict the coding potential and thereby the nonsense mediated decay (NMD) sensitivity of transcripts based on stop co-don position. Availability and Implementation: spliceR is implemented as an R package, is freely available from https://github.com/splicer-tool/splicer, and has been submitted to the Bioconductor repository. Contact: [email protected] Alternative splicing is an important part of the multi-layered pro-cess of RNA processing, elevating the potential number of unique products with orders of magnitude. More than 95% of all human genes undergo alternative splicing, and this is thought be a key element in driving the phenotypical complexity of mammals (Pan et al., 2008). Recent advances in sequencing technology of RNA (RNA-seq), combined with modern RNA-seq transcript assem-blers, such as Cufflinks (Roberts et al., 2011), now allows for de-scribing the diverse RNA landscape with high resolution.

*To whom correspondence should be addressed.

Here we present an easy-to-use tool, spliceR, which builds upon common RNA-seq assembly workflows by annotating multiple transcripts from the same gene with alternative splicing classes. For each gene entity, each alternative transcript is classified against either the gene’s hypothetical pre-mRNA (based on all isoforms) or against the gene’s most expressed transcript. Additionally, spliceR allows for the characterization of the protein coding potential of each transcript by translating the full exonic sequence of each transcript with supported annotated open reading frames (ORFs). spliceR is fully based on object types found in the Bioconductor (Gentleman et al., 2004) project, such as GRanges, allowing for full flexibility and modularity, and has been submitted to the Bio-conductor repository. Classification of alternative splicing: spliceR takes full-length transcript information either from Cufflinks, or from a data gener-ated by any RNA-seq assembler that outputs fully deconvoluted transcripts. spliceR supports a number of filters, letting the user choose to classify alternative splicing only from those transcripts that have passed set of qualifiers, either given by cufflinks (tran-script and/or gene model confidence), or based on NMD-sensitivity or expression thresholds. In some scenarios, especially when looking at changes between samples where splicing patters are expected to change, users may opt to configure spliceR to use the most expressed transcript as the reference transcript instead of the theoretical pre-RNA. Based on this comparison spliceR outputs a complete classification of splicing events as seen in figure 1a. Furthermore, percent spliced in (PSI) values are calculated for each of the transcripts, representing the percentage of the total parent gene expression originating from this transcript. Finally, delta-PSI values (dPSI) allow researchers to assess changes in splicing (i.e. isoform switching) between sam-ples.

ARTICLE II | 65

K Knudsen1et al.

2

The output of spliceR also facilitates various downstream anal-yses, including filtering on transcripts that have major changes between samples, filtering for specific splicing classes, or sequence analysis on elements that are spliced in or out between samples. The latter could include detection of enriched motifs in or sur-rounding such elements, or identification of protein domains that are spliced in/out. spliceR facilitates this type of analysis by outputting the genomic coordinates of each alternatively spliced element. Visualization: To analyze global trends in splicing, spliceR can produce a range of Venn diagrams, showing the overlap of splicing events between samples for either a specific type of alternative splicing or for all events. An example is shown in figure 1b. Analysis of coding potential: For assessment of coding potential, spliceR initially retrieves the genomic exon sequence using BSGenome objects, easily downloadable from Bioconductor di-rectly in R. Next, ORF annotation is retrieved from the UCSC Genome Browser repository from either Refseq, Known Gene or Ensembl. Alternatively, one can generate a custom ORF-table. Finally, the RNA sequences of input transcripts are assembled, and if a compatible ORF exists, translated, and positional data about the stop codon, including distance to final exon-exon junction, is recorded and returned to the user. Based on the generally accepted 50 nt rule in the NMD literature (Weischenfeldt et al., 2012), tran-scripts are marked NMD-sensitive if the stop codon falls more than 50 nt upstream of the final exon-exon junction, although this set-ting is user-configurable.

Conclusion: We present a Bioconductor package spliceR, which harnesses the power of current RNA-seq and assembly technologies, and provides a full overview of alternative splicing events and protein coding potential of transcripts. spliceR is flexible and easy integrated in existing workflows, supporting input and output of standard Bioconductor data types. To our knowledge, spliceR is the only software that facilitates compre-hensive splice class detection based on full-length transcripts.

ACKNOWLEDGEMENTS Funding: This work was supported in part through a center grant from the Novo Nordisk Foundation (The Novo Nordisk Founda-tion Section for Stem Cell Biology in Human Disease). Conflict of Interest: The authors have noting to declare.

REFERENCES Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5, R80. Pan,Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40, 1413–5. Roberts,A. et al. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27, 2325–9. Visconte,V. et al. (2012) SF3B1 haploinsufficiency leads to formation of ring sideroblasts in myelodysplastic syndromes. Blood, 120, 3173–86. Weischenfeldt,J. et al. (2012) Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome biology, 13, R35.

ARTICLE III | 66

Article III

Genome-wide mapping of transcription start sites reveals deregulation of full-

length transcripts in acute promyelocytic leuke mia

ARTICLE III | 67

Genome-wide mapping of transcription start sites reveals

deregulation of full-length transcripts in acute promyelocytic

leukemia.

Johannes Waage1,2,3,4, Mette Boyd1,2, Kim Theilgaard-Mönch2,3,5, Helena

Mora-Jensen6, Piero Carninci7, Nicolas Rapin1,2,3,4, Berit Lilje1,2, Peter Hokland8,

Niels Borregaard6, Bo Torben Porse2,3,4, Albin Sandelin1,2,*

1 The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen,

Denmark; 2 Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen,

Denmark; 3 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen,

Copenhagen, Denmark; 4 The Danish Stem Cell Centre (DanStem) Faculty of Health Sciences, University of

Copenhagen, Copenhagen. Denmark; 5 The Department of Hematology, Skanes University Hospital, Lund

University, Sweden, 6 The Granulocyte Research Laboratory, Department of Hematology, Rigshospitalet,

University of Copenhagen, Copenhagen, Denmark; 7 RIKEN Center for Life Science Technologies,

Yokohama, Japan; 8 Department of Hematology, Århus University Hospital, Århus, Denmark

* Corresponding author

ARTICLE III | 68

Abstract

Transcription start sites are the focal points of transcriptional regulation, where

information from regulatory elements is integrated to stabilize initiation of

transcription. In humans, most genes have more than one transcription start site, and

these often exhibit different tissue specificity, serving as distinct regulatory frameworks

for the same gene. Usage of such promoters can also result in differential gene function

manifested on the protein level. Alternative promoter usage is increased in several

disease states, including cancer, but its impact in most disease states is

uncharacterized.

Most methods for genome-wide detection of transcription start sites cannot be

applied to rare cell samples or highly purified cells because they need high amounts of

RNA . In this study, we have applied the nanoCAGE method to create a genome-

wide map of transcription start site usage in highly purified cancer cells from patients

with acute promyelocytic leukemia, and corresponding normal bone marrow cells (i.e.

promyelocytes) purified from healthy subjects, using as little as 50 ng total RNA.

We show that the nanoCAGE method gives similar results as experiments

made with microarrays in terms of expression on gene level, but in addition allows for

the identification of alternative promoter usage in cancer cells. We identify 2,162

putative promoters that are significantly differentially regulated between APL and

controls. Interestingly, promoters whose usage is upregulated in APL have an

increased propensity to be located within genes, downstream of annotated promoters.

Conversely, promoters supporting annotated longest transcript variants are commonly

downregulated in cancer. We show several examples of genes with upregulated

alternative promoters located downstream, many of which confer protein domain loss

that could contribute to leukemic transformation and maintenance. Moreover, we

show that these downstream promoters likely have different regulatory cues than

cancer-specific promoters corresponding to the longer RNA isoforms.

ARTICLE III | 69

Introduction

Messenger RNA (mRNA) heterogeneity is a hallmark of higher eukaryotes,

contributing to the plethora of tissue- and cell specific patterns of protein expression

and function. The underlying mechanisms contributing to the final selection of mRNA

isoforms include alternative promoter usage, alternative splicing, and alternative

termination (reviewed in refs [1–3]), potentially generating several isoforms from a

single gene.

There are several examples where alternative isoform usage can affect the

protein product of a gene, typically by excluding exons that code for important protein

residues, or domains (for a review, see [4]). Alternative promoters are particularly

interesting generators of isoform diversity, since they confer additional regulatory

inputs to the same gene. Indeed, there are many examples of alternative promoters that

confer different tissue specificity within the same gene. An example is the Dlgap1 gene

in mouse which has at least four promoters that are specific for different brain tissues

[5]. There are indications that alternative promoter usage is increased in disease states,

and particularly cancer[1].

Thus, mapping the promoter usage genome-wide in specific cell states,

including disease, is important. There are several methods for high-throughput

discovery of promoter usage, ranging from chromatin immunoprecipitation (ChIP) of

key components of the preinitiation complex, to sequencing RNA 5’ ends [6]. RNA-

based methods offer base-pair resolution, and are typically based on selection of cap

structures at mature 5’ ends of RNAs followed by sequencing of the first 20-30 nt of the

transcript from the 5’ end. The two most commonly used methods are Cap Analysis of

Gene Expression (CAGE) [7] and TSS Seq [8]. While these methods have high

sensitivity and specificity, they also require RNA from a substantial number of cells

(typically >1 µg total RNA of high quality). This makes it hard to analyze rare cell

types or highly purified cells from clinical tissue samples. The nanoCAGE method,

based on template switching instead of cap trapping, is an alternative which can

measure promoter usage with as little as 10ng total RNA, at the cost of lower

ARTICLE III | 70

sensitivity [9,10]. Previous studies have used the nanoCAGE method to investigate the

promoter landscape of cultured hepatocellular carcinoma cells [10,11], but it has not yet

been applied to clinical samples. As a proof of concept, we applied the nanoCAGE

method to comprehensively identify promoters specifically used in highly purified cells

from patients with acute promyelocytic leukemia (APL), compared to corresponding

purified controls, namely promyelocytes.

Acute myeloid leukemia (AML) encompasses a variety of clonal disorders,

whose malignant populations do not respond normally to regulatory cues and have lost

the ability to differentiate into fully mature blood cells [12]. APL represents a

genetically defined subclass according to the current WHO classification of AML, and

comprises 5-10% of all AMLs. APL is characterized by a block in terminal neutrophil

differentiation and the accumulation of “leukemic” promyelocytes (APL blast cells) in

the bone marrow and blood [13,14]. Over 98% of APLs harbor a t(15;17) translocation,

which juxtaposes the promyelocytic leukemia gene (PML) and the retinioc-acid

receptor alpha gene (RARA) resulting in the expression of the PML-RARA fusion

protein[14]. In normal cells, RARA forms heterodimers with the retinoid X receptor

(RXR) that binds the retinoic acid responsive elements (RARE) of their target gene

promoters. In the absence of retinoic acid (RA), RARA interacts with histone

deacetylase (HDAC)-containing co-repressor complexes and represses transcription,

whereas binding of RA triggers a conformational switch resulting in recruitment of

coactivator complexes and subsequently, activation of transcription [15]. In APL cells,

PML-RARA forms homodimers, multimeric complexes, and heterodimers with RXR

that all bind to RARE and repress transcription at physiological levels of RA through

recruitment of corepressor complexes containing HDACs, DNA methyltransferases

and polycomb complexes [16–18]. At pharmacological levels, however, RA reverses

PML-RARA mediated repression of RARA target genes resulting in terminal

neutrophil differentiation of APL cells.

Despite the tremendous progress in our understanding of how PML-RARA

regulates transcription at the molecular level, little is currently known to what extent

ARTICLE III | 71

PML-RARA changes the TSS landscape in APL. Alternative promoter usage has

been associated with cancer; hence, elucidating the promoter usage landscape in APL

is an important aspect of the regulatory state of the disease.

In the present study we used APL as a cancer model to investigate changes of

TSS usage associated with malignant transformation. For this we compare the TSS

landscapes of highly purified leukemic promyelocytes (iAPL blasts) from APL patients

with those of normal early promyelocytes (EPMs) purified from healthy subjects, using

as little as 50 ng total RNA.

We find that the cancer cells tend to use downstream alternative promoters

much more often than normal cells, thereby producing shorter transcripts, which lie at

a median distance of ~6400 nt from the TSS generating the longest pre-mRNA. We

demonstrate that usage of these downstream alternative promoters often leads to

transcripts lacking the potential to code for important protein domains that could

contribute to the cancer phenotype. Conversely, the usage of canonical TSSs

(corresponding to full-length genes) for genes in repair pathways is downregulated in

cancer cells. Furthermore, using transcription factor ChIP-seq data from the

ENCODE project, we demonstrate that the two categories of putative TSSs have

distinct regulatory profiles. We also report 90 cases of non-coding RNA promoters, in

particular snoRNAs, that are upregulated in the APL cells.

Materials and Methods

Sample preparation

Bone marrow was aspirated from the posterior iliac crest of healthy donors and

patients with newly diagnosed APL after informed consent had been given according

to guidelines established by the Danish National Committee on Health Research

Ethics, and cell sorted by fluorescence activated flow cytometry (FACS). Detailed

FACS protocols are described in [19], and the FACS strategy is depicted in Figure 2.

RNA was isolated from sorted cells using the RNeasy Micro Kit (Qiagen,

Valencia, CA, USA) according to manufacturer’s instructions. Briefly, BM

populations were sorted once and then resorted directly into 350 µl RLT lysis buffer/β-

ARTICLE III | 72

Mercaptoethanol (Sigma-Aldrich). Tubes containing sorted cells were vortexed

immediately to lyse all cells, then snap-frozen on dry ice and stored at -80ºC before

mRNA was purified according to the manufacturer’s instructions (RNeasy Micro Kit,

Qiagen, Valencia, CA, USA).

NanoCAGE library preparation

NanoCAGE was performed as described in [9], with 50 ng total RNA for all

samples. RNA was sequenced on the Illumina Genome Analyzer IIX in biological

duplicates for both EPM and APL samples, with an input of 2 nM in 10 uL per

sample, following standard protocol.

NanoCAGE tag mapping

Reads were mapped to the NCBI GRCh37/hg19 human reference genome

using Bowtie v. 0.12.8 [20] with standard parameters. See suppl. Figure S1 for

mapping results.

Tag clustering

Initial consensus clusters were created, consisting of overlapping and

neighboring tags for all samples merged, followed by tag quantification for each

sample in the consensus clusters. Clusters were end-trimmed for low tag content,

reducing each tag cluster to the minimum width required for at least 80% of tags to

remain in the cluster, and clusters of 1 nt width were removed to eliminate PCR

artifacts. Tag counts were quantified as TPM (tags per million mapped reads), and

clusters with less than 2 TPM in the highest sample were removed. Clusters with high

inter-replicate variance were removed using a coefficient of variance-threshold of 1,

resulting in a final set of 7,458 clusters.

Exon noise filtering

To further solidify our distal promoter discovery, the per nt coverage of a given

TC was divided by the average per nt coverage across exons for the respective gene

(minus exons containing any tag clusters), and a ratio of minimum 1.5 was required.

ARTICLE III | 73

Statistical testing for differential expression

To test for differential promoter usage between APL and EPM, we utilized

edgeR v. 3.13 [21], using standard settings, with FDR = 0.05, and resulting in 1,437

clusters downregulated, and 725 clusters upregulated in APL.

Promoter classification

Promoters (tag clusters defined above) were assigned to the closest UCSC

knownGene transcriptional unit, and were partitioned into four categories: Full-length

canonical (within 300 nt from the most upstream annotated TSS of the closest gene),

proximal (from 300 nt to within 1/10 of the total gene length), distal (from 1/10 to the 3’-

end of the gene), and intergenic. Ambiguous promoters (due to e.g. overlapping gene

units) were discarded.

Protein domain loss analysis

Transcript regions lost in due to usage of alternative promoters located within

genes and upregulated in APL were intersected with protein domain mappings from

the SUPERFAMILY repository [22], reporting both full and partial overlaps.

Regulatory neighborhood analysis

The full ENCODE transcription factor ChIP-seq binding site set [23] was

downloaded from the UCSC Genome Browser, and the distance from each TC to the

nearest binding site was determined for each TF. For a select subset of TFs, the

distance from each TC to the full TF binding site set was measured.

Data availability

Raw sequence data as well as the full TSS set is available at the NCBI Gene

Expression Omnibus and Short Read Archive under accession no. GSE46561.

Results

Figure 1 gives a full overview of the analysis flow, starting from FACS-sorting of

cells, followed by microarray and nanoCAGE experiments and concluded by

computational analysis and comparisons with existing datasets.

ARTICLE III | 74

Analysis of cancer phenotypes is often hampered by the lack of appropriate

normal controls, i.e. normal cells at a similar stage of differentiation. We have

previously established that the closest normal counterparts of APL blast cells in terms

of expression are early promyelocytes (EPM) cells [24]. Therefore, a FACS cell sorting

strategy was developed to purify these two cell types from bone marrow samples, as

described in Figure 2: reanalysis of double-sorted populations demonstrated a purity

between 90%-100% for all populations. Importantly, by comparing APL blast to

normal EPMs, we are therefore able to determine cancer specific transcriptional

changes as opposed to those arising from differences in degree of differentiation.

Following cell sorting, for each patient, RNA was extracted and used to

measure gene expression and promoter usage by nanoCAGE. Sequenced

nanoCAGE reads were mapped to the human genome as described in Methods

(suppl. Figure S1A). Using pooled data, neighboring tags were aggregated into tag

clusters (TCs), whose expression from each library was quantified by the normalized

number of tags within the cluster on the same strand (expressed as tags per million

mapped reads, TPM). For simplicity, we will refer to these clusters interchangeably as

TSSs or promoters in the text, as in [5]. Following initial filtering, the resulting TCs

were found to have an distance distribution showing an increase of clusters close to

previously annotated TSSs, and a decrease of clusters with similar distance to TSSs as

random genomic locations (suppl. Figure 1B).

Visual inspection of raw mapped data showed a tendency towards low intensity

tag clusters mapping uniformly over exons. These might represent cases where the

reverse transcriptase failed to reach the real 5’ end, or the capture of partially degraded

(and possibly recapped) mRNAs, as discussed in [10]. We also noted cases of PCR

clonal expansion, typically only present in one of the replicates; these are likely

attributed to the low amount of starting RNA and the high number of PCR cycles

employed in the NanoCAGE method. To take these issues into account, we employed

additional filtering on clusters, requiring low inter-replicate variance, as well as

constraints on cluster widths and expression levels. This produced a final set of 7,458

ARTICLE III | 75

TCs. Of these, 1,398 overlapped with annotated RefSeq TSSs, 1,795 overlapped with

GenCode V14 TSSs, and 5,663 were un-annotated.

Next, we identified differentially expressed promoters using edgeR [21] and

found 725 promoters to be upregulated and 1,437 promoters to be downregulated in

APL vs. control (Figure 3A) (P<0.05, FDR-adjusted).

To validate our findings in terms of expression, we compared our set of

differentially expressed promoters with previously published microarray-based gene

expression data [25]. Genes assigned as significantly downregulated by microarrays

(n=2,066, P < 0.05, FDR-adjusted) overlap the corresponding promoters sets from

CAGE significantly (n=563, P=1.34e-22, hypergeometric test) (Figure 3B, top).

Similarly, for all genes found upregulated in APL, compared to normal counterpart

cells (EPMs), by microarray (n=855, P<0.05, FDR-adjusted), we found a less marked,

but still significant overlap (n=165, P=2.14e-25, hypergeometric test) with promoters

upregulated in APL by nanoCAGE (Figure 3B, bottom). The combined microarray

data was based on samples from 37 APL patients. Owing to the sheer difference in

sample numbers between our APL nanoCAGE (2) and the array studies, some

precaution is necessary when comparing expression values between platforms, and

smaller overlap fractions were expected.

Next, we investigated the location of differentially expressed promoters in

relation to annotated genes. We partitioned all promoters across both samples into

bins of location relative to nearest gene body, grouping 16% as full-length canonical

(here defined to be within -150/150 bp of longest UCSC transcript TSS on the same

strand), 7% as intragenic proximal (from 150 bp to 1/10 th of the total gene length), 43%

as intragenic distal (placed somewhere along the remaining gene body), and 33% as

intergenic (figure 3, grey bars).

Interestingly, promoters upregulated in APL were overrepresented in the

category of TCs located downstream, distal of annotated TSSs (P<0.01,

hypergeometric test); conversely, downregulated promoters were overrepresented in

the bin of TCs overlapping canonical TSSs (P < 0.01, hypergeometric test) (figure 3).

ARTICLE III | 76

The distal upregulated promoters, by definition located between 1/10 of the gene

length and the transcription termination site, were not placed uniformly along the

gene, peaking around a median distance from the TSS of ~6400 nt (suppl. Figure S2).

In summary, TSSs preferentially used in APL are more often downstream alternative

promoters compared to promoters preferentially used in the EPM control samples.

Comparing to gene set signatures collected from the Molecular Signature

Database (http://www.broadinstitute.org/gsea/msigdb/index.jsp) [26], we observed a

significant overlap between our upregulated distal promoters, and genes

overexpressed in APLs ( P = 2.3e-12, hypergeometric test )[27], chronic myelogenous

leukemia (CML) (P = 2.44e-15, hypergeometric test)[28], as well as a number of other

cancers, indicating that promoters of oncogenic and cancer maintaining genes,

expressed in APL, produce shorter transcripts (Figure 4, overlay). Similarly, we

observed a significant overlap between down-regulated canonical promoters, and gene

sets for DNA repair pathways, including tumor suppressors CHEK2 (P<0.01,

hypergeometric test) and ATM (P <0.01, hypergeometric) both involved in activation

of the DNA damage checkpoint, leading to repair, cell cycle arrest, and/or apoptosis

(Figure 4, overlay) [29], further validating the observation.

Since degraded RNA has potential to be captured by the nanoCAGE protocol,

heightened RNA degradation activity in the cancer cells could potentially cause some

of these observations. We reasoned that if this were the case, then we would expect the

nanoCAGE tags to be uniformly distributed over the exons, and not congregate to

distinct and reproducible clusters over replicates. Even though our initial intra-

replicate filtering strategy would catch most of these exon tags, we further refined our

filter, requiring that the downstream, cancer-specific tag clusters of interest should

have higher tag coverage than expected over the exons of the respective gene (see

Methods). This filtering resulted in a list of 40 high confidence downstream promoters

upregulated in APLs (the full list is given in suppl. table T1), suitable for further

downstream functional analysis. Two examples of such are given in Figure 5. Both of

these are close to annotated alternative promoters (further validating the approach),

ARTICLE III | 77

but here we give evidence that the alternative promoters are used preferentially in the

APL cells compared to controls. Figure 5A shows an APL-specific alternative

promoter, which generates a shorter transcript version of ADAMTSL2. This is placed

downstream of several protein-coding exons, and usage of the alternative promoter will

produce a transcript which cannot encode the signal peptide present in the longer

variant of the RNA, as well as several regions in which mutations have been shown to

result in increased amounts of available TGF-beta [30]. TGF-beta, known to inhibit

proliferation of hematopoietic precursors, is normally found highly expressed in APLs

[31], and the selective downstream promoter usage of ADAMTSL2 could contribute

to this effect. Figure 5B shows a downstream alternative promoter in the GPR56 gene,

a GPCR family receptor recently shown to play a role in both hematopoietic stem cell-

as well as leukemic stem cell maintenance [32]. In this case, the alternative promoter is

not predicted to give a different protein product, as it is located in an extended 5’ UTR.

An overview of 24 selected genes with downstream promoters upregulated in APL is

presented in Figure 6, indicating novel promoters (15/24) and previously annotated

alternative promoters (9/24). To further investigate the functional effects of

downstream promoter usage, we extrapolated the lost mRNA exons to protein

domains. Several genes exhibited domain loss, which could be implicated in an

oncogenic phenotype, including: USP13, involved in deubiquitination and implied in

the stabilization of p53 [33], loses its proteinase domains. ADAMTSL2, in addition to

the regulation of TGF-beta described above, looses its TSP-1 type 1 repeat domains

(see Figure 5A), previously shown to be implicated in induction of apoptosis [34] and

inhibition of angiogenesis [35]. CTNNA3, a key player in cellular adhesion, loses its

cadherin associating alpha-catenin domains. A full list of domains lost is given in suppl.

table T2.

To further assess the regulatory differences between the promoter classes, we

investigated the binding of transcription factors in the genomic neighborhood of each

TSS class. To facilitate this, we measured the distance from each TC to the nearest

transcription factor binding site, based on ChIP-seq peaks from ENCODE (all cells)

ARTICLE III | 78

(suppl. Figure S3). We observed that transcriptional inhibitor members of the E2F

family were associated closer with APL-downregulated full-length promoters, while

early response transcription factors JUN and FOS (together forming the AP1

complex, and both expressed in APL cells measured by nanoCAGE (data not shown))

were closer associated with upregulated downstream promoters (figure 7). This

indicates that some regulatory cues differ between the downstream and canonical

promoters, even within AML cells.

A strength of nanoCAGE is that it is not limited to a pre-defined gene set, but

can potentially capture any RNA species in the cell, including ncRNAs. To categorize

putative non-coding TSSs, we extracted all non-coding RNA identifiers from RefSeq:

of these, 90 and 84 were up- and downregulated (figure 8A), respectively, with a

significantly larger fraction of non-coding RNAs being upregulated compare to non-

regulated (figure 8B). Suppl. table T3 lists the most upregulated ncRNAs.

SNORD114-1 has previously been found upregulated APL in a PML-RAR-alpha

context by microarray and ChIP-seq analyses, conferring cell growth.[36] Similarly,

miR-21, classically described as an “oncomir” [37], and with many tumor-supressor

targets, is upregulated. Curiously, MEG3 is high on the list of upregulated ncRNA

TSSs, normally thought to be a tumor suppressor not normally expressed in tumors

[38] and has previously been show to be hypermethylated in AML and myeloplastic

syndrome (MDS). Microarray data confirm this upregulation (7 times upregulated in

APL). The high expression of MEG3 in APL may point to a new and as yet

undescribed role in cancer pathology.

Discussion

In this paper we present the first investigation of promoter usage in APL cancer

cells, compared to their corresponding normal counterpart, both FACS-purified from

human bone marrow. The nanoCAGE method allows for analyses of promoters with

these rare cells – in our case with as little as 50 ng total RNA. . By applying stringent

filtering and statistical analyses, we identify 2,162 TSS clusters that are significantly

changed between the two states. On the gene level, the findings correspond well with

ARTICLE III | 79

microarray studies, but the method can pinpoint novel promoters in the set, which

array approaches cannot. Surprisingly, we see a shift of promoter usage: APL cells are

prone to use promoters that are downstream of canonical TSSs, downregulating

usage of full-length canonical transcripts. To further refine these findings, we

employed a stringent filter on the TC-to-exon-coverage, which resulted in a list of 40

high-confidence APL-specific promoters - followed by an extrapolation of mRNA to

protein domains, we demonstrated several examples of domain loss with clear

implications to protein function in the cancer cells, several of which can be linked

mechanistically with a cancer phenotype. Furthermore, we noted that the regulatory

landscape around these downstream promoters is different to those of full-length

transcripts, even if the full-length transcripts are preferentially expressed in APL. In

particular, AP1 transcription factors are preferentially bound closer. A caveat with

these results is that the ENCODE ChIP-seq data is based on multiple cell lines. On

the other hand, we could in many cases see corresponding support of expression of the

TF in question by nanoCAGE, which argues that the observed regulatory event

should also be relevant for the APL cells.

In summary, our findings indicate that alternative promoter usage is potentially

important yet often overlooked biological feature of APL cells. While these results are

indicative of functional impact, it is unclear what the causality is: we do not at present

know if the observed shorter isoforms are a result of disease progression or play a

causal role in malignant transformation. To address such questions, it would be

necessary to either knock down or overexpress the shorter variants and observe the

resulting phenotypes.

Regardless of cause and effect, these types of cancer-specific alternative

promoters could potentially be used as biomarkers for APL, since they have additional

power compared to microarray expression profiles, which cannot detect the placement

of alternative promoters. However, for a comprehensive delineation of APL-specific

biomarkers it would then be necessary to expand the study to cover multiple APL and

normal cancer states, as well as biological variation over many patients.

ARTICLE III | 80

Acknowledgments

The authors thank the National High-throughput DNA Sequencing Centre of

Copenhagen, Denmark for collaborative assistance with sequencing.

This study was supported by grants to AS from the Danish Cancer Society,

The Lundbeck Foundation, the Novo Nordisk Foundation and the European

Research Council (FP7/2007–2013/ERC grant agreement 204135). Work in the Porse

lab was supported by grants from the Danish Cancer Society, The Danish Research

Council for Strategic Research and through a center grant from the Novo Nordisk

Foundation (The Novo Nordisk Foundation Section for Stem Cell Biology in

Human Disease).

Author contributions

JW and AS carried out the computational analysis, MB and PC carried out

laboratory work. HMJ and KT collected cell samples and carried out cell sorting. NR

provided microarray data and analysis. BL contributed with domain loss analysis. PH

and NB contributed with clinical sample collection. BTP and AS supervised the

project. JW, BTP and AS wrote the paper.

References 1. Davuluri R V, Suzuki Y, Sugano S, Plass C, Huang TH-M (2008) The functional

consequences of alternative promoter use in mammalian genomes. Trends in genetics  : TIG 24: 167–

177. Available: http://www.ncbi.nlm.nih.gov/pubmed/18329129. Accessed 1 March 2013.

2. Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, et al. (2013) Alternative

splicing: a pivotal step between eukaryotic transcription and translation. Nature reviews Molecular

cell biology 14: 153–165. Available: http://europepmc.org/abstract/MED/23385723. Accessed 27

February 2013.

3. Sun Y, Fu Y, Li Y, Xu A (2012) Genome-wide alternative polyadenylation in animals:

insights from high-throughput technologies. Journal of molecular cell biology 4: 352–361. Available:

http://www.ncbi.nlm.nih.gov/pubmed/23099521. Accessed 13 March 2013.

4. Tress ML, Martelli PL, Frankish A, Reeves G a, Wesselink JJ, et al. (2007) The

implications of alternative splicing in the ENCODE protein complement. Proceedings of the

National Academy of Sciences of the United States of America 104: 5495–5500. Available:

ARTICLE III | 81

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1838448&tool=pmcentrez&rendertype

=abstract.

5. Valen E, Pascarella G, Chalk A, Maeda N, Kojima M, et al. (2009) Genome-wide

detection and analysis of hippocampus core promoters using DeepCAGE. Genome research 19: 255–

265. Available:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2652207&tool=pmcentrez&rendertype=

abstract. Accessed 5 March 2013.

6. Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, et al. (2007)

Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature reviews

Genetics 8: 424–436. Available: http://www.ncbi.nlm.nih.gov/pubmed/17486122. Accessed 5 March

2013.

7. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, et al. (2003) Cap analysis

gene expression for high-throughput analysis of transcriptional starting point and identification of

promoter usage. Proceedings of the National Academy of Sciences of the United States of America

100: 15776–15781. Available:


abstract.

8. Tsuchihara K, Suzuki Y, Wakaguri H, Irie T, Tanimoto K, et al. (2009) Massive

transcriptional start site analysis of human genes in hypoxia cells. Nucleic acids research 37: 2249–2263.

Available:



9. Salimullah M, Mizuho S, Plessy C, Carninci P (2011) NanoCAGE: A High-

Resolution Technique to Discover and Interrogate Cell Transcriptomes. Cold Spring Harbor

Protocols 2011: pdb.prot5559–pdb.prot5559. Available:

http://www.cshprotocols.org/cgi/doi/10.1101/pdb.prot5559. Accessed 4 November 2012.

10. Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, et al. (2010) Linking

promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nature

methods 7: 528–534. Available: http://dx.doi.org/10.1038/nmeth.1470. Accessed 10 November 2012.

11. Plessy C, Pascarella G, Bertin N, Akalin A, Carrieri C, et al. (2012) Promoter

architecture of mouse olfactory receptor genes. Genome research 22: 486–497. Available:


=abstract. Accessed 5 March 2013.

12. Döhner H, Estey EH, Amadori S, Appelbaum FR, Büchner T, et al. (2010)

Diagnosis and management of acute myeloid leukemia in adults: recommendations from an

ARTICLE III | 82

international expert panel, on behalf of the European LeukemiaNet. Blood 115: 453–474. Available:


13. Wang Z-Y, Chen Z (2008) Acute promyelocytic leukemia: from highly fatal to highly

curable. Blood 111: 2505–2515. Available:

http://bloodjournal.hematologylibrary.org/content/111/5/2505.short. Accessed 5 November 2012.

14. Döhner H, Estey EH, Amadori S, Appelbaum FR, Büchner T, et al. (2010)

Diagnosis and management of acute myeloid leukemia in adults: recommendations from an

international expert panel, on behalf of the European LeukemiaNet. Blood 115: 453–474.

doi:10.1182/blood-2009-07-235358.

15. Minucci S, Pelicci PG (2006) Histone deacetylase inhibitors and the promise of

epigenetic (and more) treatments for cancer. Nature reviews Cancer 6: 38–51. Available:


16. Lin RJ, Evans RM (2000) Acquisition of oncogenic potential by RAR chimeras in

acute promyelocytic leukemia through formation of homodimers. Molecular cell 5: 821–830. Available:

http://www.ncbi.nlm.nih.gov/pubmed/10882118.

17. Licht JD (2006) Reconstructing a disease: What essential features of the retinoic acid

receptor fusion oncoproteins generate acute promyelocytic leukemia? Cancer cell 9: 73–74. Available:

http://www.ncbi.nlm.nih.gov/pubmed/16473273. Accessed 23 April 2013.

18. Villa R, Pasini D, Gutierrez A, Morey L, Occhionorelli M, et al. (2007) Role of the

polycomb repressive complex 2 in acute promyelocytic leukemia. Cancer cell 11: 513–525. Available:


19. Mora-Jensen H, Jendholm J, Fossum A, Porse B, Borregaard N, et al. (2011)

Technical advance: immunophenotypical characterization of human neutrophil differentiation. Journal

of leukocyte biology 90: 629–634. Available: http://www.ncbi.nlm.nih.gov/pubmed/21653237. Accessed

22 May 2013.

20. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient

alignment of short DNA sequences to the human genome. Genome biology 10: R25. Available:


=abstract. Accessed 10 June 2011.

21. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for

differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26:

139–140. Available:


abstract. Accessed 2 November 2012.

ARTICLE III | 83

22. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to

genome sequences using a library of hidden Markov models that represent all proteins of known

structure. Journal of molecular biology 313: 903–919. Available:

http://www.ncbi.nlm.nih.gov/pubmed/11697912. Accessed 28 February 2013.

23. Encode T, Consortium P (2011) A user’s guide to the encyclopedia of DNA elements

(ENCODE). PLoS biology 9: e1001046. doi:10.1371/journal.pbio.1001046.

24. Marstrand TT, Borup R, Willer a, Borregaard N, Sandelin a, et al. (2010) A

conceptual framework for the identification of candidate drugs and drug targets in acute

promyelocytic leukemia. Leukemia 24: 1265–1275. Available:


25. Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie G Te, et al. (2010)

Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of

leukemia: report from the International Microarray Innovations in Leukemia Study Group. Journal of

clinical oncology  : official journal of the American Society of Clinical Oncology 28: 2529–2537.

Available: http://www.ncbi.nlm.nih.gov/pubmed/20406941. Accessed 5 March 2013.

26. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL (2005) Gene set

enrichment analysis  : A knowledge-based approach for interpreting genome-wide.

27. Ross ME, Mahfouz R, Onciu M, Liu H-C, Zhou X, et al. (2004) Gene expression

profiling of pediatric acute myelogenous leukemia. Blood 104: 3679–3687. Available:


28. Diaz-Blanco E, Bruns I, Neumann F, Fischer JC, Graef T, et al. (2007) Molecular

signature of CD34(+) hematopoietic stem and progenitor cells of patients with CML in chronic

phase. Leukemia 21: 494–504. Available: http://www.ncbi.nlm.nih.gov/pubmed/17252012. Accessed 3

April 2013.

29. Pujana MA, Han J-DJ, Starita LM, Stevens KN, Tewari M, et al. (2007) Network

modeling links breast cancer susceptibility and centrosome dysfunction. Nature genetics 39: 1338–1349.

Available: http://www.ncbi.nlm.nih.gov/pubmed/17922014. Accessed 27 February 2013.

30. Le Goff C, Morice-Picard F, Dagoneau N, Wang LW, Perrot C, et al. (2008)

ADAMTSL2 mutations in geleophysic dysplasia demonstrate a role for ADAMTS-like proteins in

TGF-beta bioavailability regulation. Nature genetics 40: 1119–1123. Available:



31. Masterson M, A Raza, N Yousuf, A Abbas, A Umerani, A Mehdi, SA Bokhari, Y

Sheikh, K Qadir JF and (2013) High expression of transforming growth factor-beta long cell cycle

ARTICLE III | 84

times and a unique clustering of S-phase cells in patients with acute promyelocytic leukemia [see

comments]: 1037–1048.

32. Saito Y, Kaneda K, Suekane a, Ichihara E, Nakahata S, et al. (2013) Maintenance of

the hematopoietic stem cell pool in bone marrow niches by EVI1-regulated GPR56. Leukemia.

Available: http://www.ncbi.nlm.nih.gov/pubmed/23478665. Accessed 4 April 2013.

33. Liu J, Xia H, Kim M, Xu L, Li Y, et al. (2011) Beclin1 controls the levels of p53 by

regulating the deubiquitination activity of USP10 and USP13. Cell 147: 223–234. Available:



34. Guo N, Krutzsch HC, Inman JK, Roberts DD (1997) Thrombospondin 1 and Type I

Repeat Peptides of Thrombospondin 1 Specifically Induce Apoptosis of Endothelial Cells

Thrombospondin 1 and Type I Repeat Peptides of Thrombospondin 1 Specifically Induce Apoptosis

of Endothelial Cells ’: 1735–1742.

35. Iruela-Arispe ML, Lombardo M, Krutzsch HC, Lawler J, Roberts DD (1999)

Inhibition of Angiogenesis by Thrombospondin-1 Is Mediated by 2 Independent Regions Within the

Type 1 Repeats. Circulation 100: 1423–1431. Available:

http://circ.ahajournals.org/cgi/doi/10.1161/01.CIR.100.13.1423. Accessed 23 April 2013.

36. Valleron W, Laprevotte E, Gautier E-F, Quelen C, Demur C, et al. (2012) Specific

small nucleolar RNA expression profiles in acute leukemia. Leukemia  : official journal of the Leukemia

Society of America, Leukemia Research Fund, UK 26: 2052–2060. Available:


37. Kent O a, Mendell JT (2006) A small piece in the cancer puzzle: microRNAs as

tumor suppressors and oncogenes. Oncogene 25: 6188–6196. Available:

http://www.ncbi.nlm.nih.gov/pubmed/17028598. Accessed 28 February 2013.

38. Zhang X (2003) A Pituitary-Derived MEG3 Isoform Functions as a Growth

Suppressor in Tumor Cells. Journal of Clinical Endocrinology & Metabolism 88: 5119–5126. Available:

http://jcem.endojournals.org/cgi/doi/10.1210/jc.2003-030222. Accessed 2 April 2013.

ARTICLE III | 85

Figure 1: Pipeline overview

Flowchart of the full data analysis process presented in this work. Tag cluster

data sets are depicted in green, experimental and computational steps to produce data

in blue, and analysis modules in red.

ARTICLE III | 86

Figure 2: FACS sorting strategy for the purification of APL blasts and their

corresponding normal bone marrow population.

Mononuclear cells (MNCs) were purified from BM aspirates of healthy

subjects and APL patients. MNCs were stained with a cocktail of fluorochrome-

conjugated MoAbs and the DNA dye 7AAD allowing for sorting of

immunophenotypical identical normal early promyelocytes (EPM) and APL blast as

described previously [19]. Purified EPM and APL blasts were defined as Lin-

CD34lo/negCD15+ cells. Wrigth Giemsa stains of sorted EPM and APL blasts

demonstrate a typical promyelocyte morpholgy.

ARTICLE III | 87

Figure 3: Characteristics of gene expression in APL by nanoCAGE

(A) MA-plot for normalized nanoCAGE-seq data, showing the combined

sample expression on the x-axis, and the sample fold change on the y-axis. Blue data

points represent candidate TSSs which are significantly different beteen APL cells

and control (with FDR <= 0.05), giving 1,437 downregulated and 725 upregulated

genes, respectively. (B) Venn diagrams comparing the overlap of

TSSs of genes down- or upregulated in APL using microarrays and

nanoCAGE.

chr16:57,653,181-57,677,232

GPR56

10 kb

Apl

Epm

chr9:136,395,254-136,442,102 10 kb

ADAMTSL2

Nor

mal

ized

tag

coun

t Apl

Epm

0

500

50

0

500

50

A

B

Nor

mal

ized

tag

coun

t

ARTICLE III | 88

UCSC longest transcript

APL promoter, supported

APL promoter, novel

KIAA0226L

ARGLU1

CTNNA3

RUVBL1

HSPA8

SNX20

ESYT2

GLG1

KIF24

ELF1

ERG

ZFR

PZP

ST5

USP53

USP13

TPM4

IFT88

AKAP9

GPR56

GCNT1

ROBO3

VPS13B

ADAMTSL2

ARTICLE III | 89

Figure 4: (Previous page) Classification of candidate transcription start sites

Barplot of genomic location classifications of candidate TSSs for pooled

samples (grey bars), or the subsets which are not changing (green), upregulated in

APL (blue), or downregulated in APL (red). For these subcategories, we investigated

their locations: overlapping known TSSs (canonical), novel TSSs within genes either

proximal or siatal to the TSS, or intragenic (See Methods for details).

TSSs downregulated (FDR<0.05, N=827) in APL are overrepresented in

canonical promoters , while TSSs upregulated in APL (FDR<0.05, N=364) are

overrepresented in intragenic promoters, supporting shorter transcripts of known

genes. Overlay boxes show gene sets from other studies enriched for either

downregulated canonical promoters (pink), or upregulated distal promoters (teal).

ARTICLE III | 90

Figure 5: Examples of AML-specific alternative promoters

Two examples of AML-specific alternative promoters identified in the study, at

the ADAMTSL2(A) and GPR56(B) loci. Top panel shows UCSC genes from the

UCSCS genome browser: grey panels show normalized mean TPM counts over

replicates for AML and NanoCAGE tags on the Y axis and the genomic location on

the X axis. For ADAMTSL2, the mRNA stretches corresponding to protein domains

lost by alternative promoter usage are indicated in colored boxes.

E2F4 E2F6

c-Fos p300

0

2*10-4

4*10-4

0

2*10-4

4*10-4

-1000

0-50

00 050

0010

000

-1000

0-50

00 050

0010

000

Distance to respective TSS [nt]

Den

sity

of C

hIP

pea

k oc

cura

nce

Location and expression of TSS

Downregulated in APL,canonical TSS

Upregulated in APL, canonical TSS

Upregulated in APL, distal alternative TSS within gene

6*10-4

6*10-4

ARTICLE III | 91

Figure 6: Visual representation of 24 genes with downstream promoters

upregulated in APL.

Gene models visualized are based on longest canonical UCSC gene models,

and are normalized to have the same lengths. Thick boxes indicate protein-coding

exons. Orange arrows indicate the TSSs inferred from the gene models. Red arrows

indicate AML-specific promoters not previously annotated by either UCSC Known

Gene, RefSeq or Ensembl. Green arrows indicate AML-specific promoters that are

supported by aforementioned databases. Note that the large majority of downstream

promoters are within or between coding exons.

-4

0

4

-2 0 2 4

log2[APL]+log2[EPM]2

log2

[APL

/EPM

]

0.00

0.25

0.50

0.75

Downregulated

Unregulated

Upregulated

A B

Freq

uenc

y

Coding RNAs

Non-coding RNAs

p<0.01

ARTICLE III | 92

Figure 7: Distance distributions from transcription factor binding sites to

candidate TSSs

Distribution of selected ENCODE transcription factor ChIP peaks (pooled

over all samples) in relation to TSS categories. For each TSS in all three categories,

the distance to the nearest respective TF was logged, and the total distance

distribution is shown. *** indicate p < 0.001, * indicates p < 0.05, NS indicates non-

significance.

0

5

10

15

20

E2F4

E2F6

E2F6_(H

-50)

c-Fos

c-Ju

n

p300_(N

-15)

TF

log

2(d

ista

nce

to

ne

are

st

TF

bin

din

g s

ite

)

Canonical, downregulated in APL

Distal, upregulated in APL

***

*** ****NS NS

ARTICLE III | 93

Figure 8: Characteristics of expression of non-coding RNAs

A) MA-plot of non-coding RNAs. Blue dots indicate ncRNAs upregulated in

APL (p< 0.05, n = 90), red dots indicate ncRNAs downregulated in APL (p < 0.05,

n=84). B) Compared to the total number of genes deregulated in APL, ncRNAs are

overrepresented in upregulated genes.

-4

0

4

-2 0 2 4

log2[APL]+log2[EPM]2

log2

[AP

L/E

PM

]

0.00

0.25

0.50

0.75

Downregulated

Unregulated

Upregulated

A

Fre

quen

cy

Coding

RNAs

Non-c

oding

RNAs

p<0.01

B

ARTICLE III | 94

Figure S1: A; Vigourous filtering ensures a reduced but high confidence set of

candidate promoters.

“Reads mapped” indicate reads mapped to mm9, after removing PhiX-specific

sequences. Common tag clusters are defined as overlapping reads, based on all

samples pooled. The filtering steps done to reach 5,748 candidate TSS’s are described

in the text.

B: Violin plots of distance to nearest RefSeq TSS of all (pink) and filtered

(blue) TSSs. The blue line indicates the mean of the distance of random genomic

locations to nearst RefSeq TSS.

ARTICLE III | 95

Figure S2: Density plot of distance in nucleotides from distal upregulated

promoters to the TSS of the longest refseq gene model.

The green line indicates the median.

0.00

0.05

0.10

0.15

5 10 15 20Distance from TSS [nt] (green line = median)

density

ARTICLE III | 96

[Figure too large for inclusion. Downloadable at

https://www.dropbox.com/s/2e50t51xnntpx5r/Figure_S3.pdf]

Figure S3: Distance boxplots from each tag cluster of a given category to the

closest ENCODE TF binding site.

ARTICLE IV | 97

Article IV

Temporal mapping of CEBPA and CEBPB binding during liver regeneration

reveals dynamic occupancy and specific regulatory codes for homeostatic and cell cycle

gene batteries

ARTICLE IV | 98

Research

Temporal mapping of CEBPA and CEBPB bindingduring liver regeneration reveals dynamic occupancyand specific regulatory codes for homeostatic and cellcycle gene batteriesJanus Schou Jakobsen,1,2,3,7 Johannes Waage,1,2,3,4 Nicolas Rapin,1,2,3,4

Hanne Cathrine Bisgaard,5 Fin Stolze Larsen,6 and Bo Torben Porse1,2,3,7

1The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, 2Biotech Research and Innovation Centre (BRIC), 3The Danish Stem

Cell Centre (DanStem), Faculty of Health Sciences, 4The Bioinformatics Centre, 5Department of Cellular and Molecular Medicine,

Faculty of Health Sciences, University of Copenhagen, DK-2100 Copenhagen, Denmark; 6Department of Hepatology, Rigshospitalet,

DK2200 Copenhagen, Denmark

Dynamic shifts in transcription factor binding are central to the regulation of biological processes by allowing rapidchanges in gene transcription. However, very few genome-wide studies have examined how transcription factor occu-pancy is coordinated temporally in vivo in higher animals. Here, we quantified the genome-wide binding patterns of twokey hepatocyte transcription factors, CEBPA and CEBPB (also known as C/EBPalpha and C/EBPbeta), at multiple timepoints during the highly dynamic process of liver regeneration elicited by partial hepatectomy in mouse. Combining theseprofiles with RNA polymerase II binding data, we find three temporal classes of transcription factor binding to beassociated with distinct sets of regulated genes involved in the acute phase response, metabolic/homeostatic functions, orcell cycle progression. Moreover, we demonstrate a previously unrecognized early phase of homeostatic gene expressionprior to S-phase entry. By analyzing the three classes of CEBP bound regions, we uncovered mutually exclusive sets ofsequence motifs, suggesting temporal codes of CEBP recruitment by differential cobinding with other factors. Thesefindings were validated by sequential ChIP experiments involving a panel of central transcription factors and/or bycomparison to external ChIP-seq data. Our quantitative investigation not only provides in vivo evidence for the in-volvement of many new factors in liver regeneration but also points to similarities in the circuitries regulating self-renewalof differentiated cells. Taken together, our work emphasizes the power of global temporal analyses of transcription factoroccupancy to elucidate mechanisms regulating dynamic biological processes in complex higher organisms.

[Supplemental material is available for this article.]

Most biological processes—such as embryonic development, dif-ferentiation, or cellular responses to external stimuli—are in es-sence dynamic, which requires the underlying transcriptionalnetworks to be tightly temporally coordinated. Several large-scaletemporal studies have focused on the comprehensive mapping ofdynamic changes of gene expression but have not revealed howcoordination is achieved on the transcriptional level (e.g., vanWageningen et al. 2010). To address this question globally, thebinding of regulatory proteins has been examined using methodssuch as chromatin immunoprecipitation with microarrays (ChIP-chip) or sequencing (ChIP-seq) (e.g., Sandmann et al. 2006a;Johnson et al. 2007; Vogel et al. 2007). Still, many such studieshave operated with just one or two conditions. The exceptionsinclude temporal examinations in the model systems Drosophilaand yeast (Sandmann et al. 2006b; Jakobsen et al. 2007; Ni et al.2009; Zinzen et al. 2009). So far, very few in vivo, multi-time pointinvestigations of transcription factor (TF) binding have been un-dertaken in higher vertebrates.

Mammalian liver regeneration is a well-studied process, inwhich the large majority of mature hepatocytes rapidly and ina highly synchronized manner re-enter the cell cycle upon injury(Fausto et al. 2006; Michalopoulos 2007; Malato et al. 2011). Serialtransplantation of liver tissue has demonstrated a very high‘‘repopulating’’ capacity of hepatocytes (Overturf et al. 1997),which lends hope to using these cells in regenerative medicine.The ability of mature liver cells to proliferate is reminiscent ofspecific, differentiated cells of the immune system (naive T cells, Bcells) that are kept in quiescence until exposed to specific stimuli(Glynne et al. 2000; Yusuf and Fruman 2003; Feng et al. 2008).However, it remains open whether similar programs control theproliferation of hepatocytes and immune cells. In the liver, anumber of studies have mapped temporal changes in mRNA levelsduring regeneration (e.g., White et al. 2005). This, coupled withfunctional studies, has led to the identification of several TFs in-volved in the regenerative response (for review, see Kurinna andBarton 2011). Still, knowledge about how these factors are coor-dinated temporally throughout the regenerative process is limited.

CEBPA (C/EBPalpha) and CEBPB (C/EBPbeta) are two keyhepatocyte TFs known to have divergent roles in liver function andregeneration. The two factors belong to the same basic regionleucine zipper-family (bZIP), and several studies have shown thatthey bind the same core DNA sequence, acting as either homo- or

7Corresponding authorsE-mail [email protected] [email protected] published online before print. Article, supplemental material, and publi-cation date are at http://www.genome.org/cgi/doi/10.1101/gr.146399.112.

592 Genome Researchwww.genome.org

23:592–603 ! 2013, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/13; www.genome.org

Cold Spring Harbor Laboratory Press on July 11, 2013 - Published by genome.cshlp.orgDownloaded from

ARTICLE IV | 99

heterodimers (Diehl and Yang 1994; Ranaet al. 1995; Osada et al. 1996). WhileCEBPA is highly expressed in the quies-cent condition (before injury) and regu-lates many metabolic liver genes, CEBPBis up-regulated during liver regrowth andis required for a full regenerative response(Greenbaum et al. 1995; Wang et al. 1995).In many tissues, CEBPA is observed to bean anti-proliferative factor facilitating dif-ferentiation, while CEBPB has been foundto be either pro- or anti-proliferative indifferent settings (Porse et al. 2001; Nerlov2007). In the skin, the two factors appearto act redundantly to limit epidermal stemcell activity (Lopez et al. 2009).

In the current study, we have ex-amined dynamic TF binding during liverregeneration in the mouse by performinga time course of ChIP-seq experiments tomap and quantify binding of CEBPA andCEBPB on a genome-wide scale at a highlevel of temporal resolution. We find thatCEBPA and CEBPB generally occupy thesame positions in the genome of hepato-cytes in vivo, but they do so with quanti-tatively divergent temporal patterns duringregeneration.

To further dissect the dynamic tran-scriptional network behind liver regen-eration, we interrogated the temporallydefined groups of CEBP-bound elementsfor regulatory properties, both with re-spect to sequence composition and dif-ferential expression of associated genes.

Results

A time course of CEBPA and CEBPBin vivo ChIP experiments reveals threedistinct patterns of binding

Liver regeneration in rodents has beenstudied in detail using partial hepatec-tomy, in which three of five lobes of theliver are resected. The hepatocytes in theremaining liver lobes undergo up to twocell cycles during the first week of liverregeneration, hereby reestablishing pre-surgical liver mass (for reviews, see Faustoet al. 2006; Michalopoulos 2007). To ex-amine the binding dynamics of CEBPAand CEBPB during liver regeneration, weharvested regenerating liver tissue ateight time points (0, 3, 8, 16, 24, 36, 48,and 168 h; Methods). These time pointscover the quiescent state (G0), severalstages of the first cell cycle growth phase(G1), the G1–S-phase transition at 36 h, a later time point at 48 h, aswell as the terminal phase point of 168 h (Fig. 1A; Matsuo et al.2003; Fausto et al. 2006). Livers from five mice for each time pointwere subjected to ChIP using specific CEBPA or CEBPB antibodies

(Methods) (Fig. 1B). We confirmed antibody specificity by doingChIP in mice deficient for Cebpa or Cebpb or by including epitopeblocking peptides (Supplemental Fig. S1). To minimize experi-mental variation, we pooled DNA precipitated from each of the five

Figure 1. A ChIP-seq time course of mouse liver regeneration. (A) Overview of the experimental timecourse. Eight time points covering the full week of the regenerative response; phases of the first cell cycleindicated. Previous studies show a prolonged decrease in the CEBPA to CEBPB ratio. (B) Experimentaloutline for ChIP experiments. Partial hepatectomy was carried out on five mice for each time point. Theregenerating liver tissue was harvested after variable time intervals, fixed, and subjected to ChIP-seq.(C ) Normalized coverage maps of four gene proximal regions with ChIP-seq peaks, marked by dashedvertical lines. Each row represents a time point for either CEBPA (top eight rows, dark purple) or CEBPB(bottom eight rows, dark pink). Gene loci are shown in black below. (D) CEBPA and CEBPB temporalbinding profiles based on genomic coverage, generated from the peaks in the C. (Alb-enh) Albuminproximal region; (Mdc1) mediator of DNA damage checkpoint 1; (Orm1) orosomucoid 1; (Cps1)carbamoyl-phosphate synthetase 1.

Genome Research 593www.genome.org

Regulation of liver regeneration


ARTICLE IV | 100

mice in equal amounts and sequenced the combined material. Wemapped the sequences to the mouse genome (mm9) and deter-mined base pair coverage after internal normalization to the totalmapped read count for each antibody (Supplemental Methods; Sup-plemental Fig. S2; Supplemental Table S1). Genomic regions bound byeither CEBPA or CEBPB were identified with the Useq peak-finder al-gorithm using a mock immunoprecipitation (IP) control (Supple-mental Methods, Supplemental Table S2). We validated IP consistencyof two series of three independent ChIPs at four different CEBP boundgenomic locations (Supplemental Fig. S3). For further validation, weperformed de novo motif searches on several data sets and a conser-vation score analysis centering on CEBP motifs, all of which con-firmed the quality of our ChIP-seq data (Supplemental Fig. S4).

Next, CEBPA or CEBPB peaks for each time point were used toconstruct a sum list of all enriched regions. Four examples ofenriched regions located just upstream of gene loci are shown inFigure 1, depicting genomic coverage for the CEBPA and CEBPBChIP time series (Fig. 1C).

After applying a stringent filtering regimen, we identifieda total of 11,314 high-confidence regions bound by CEBPA orCEBPB, many of which show highly dynamic occupancy duringthe course of regeneration (for full peak calling numbers and re-gion coverage, see Supplemental Tables S2, S3; for filtering, seeSupplemental Methods). We assembled temporal binding profilesfor all bound regions based on maximal genomic coverage. As il-lustrated in Figure 1 for the four regions mentioned above, distinctprofile patterns can be observed (Fig. 1D).

The high temporal resolution of the obtained binding profilesallowed us to query for groups of putative cis-regulatory regionswith similar occupancy dynamics. By hierarchical clustering, weidentified three prevalent clusters (Fig. 2A), containing roughlyequal numbers of bound regions (3549, 2818, and 3034 for the A,B, and C clusters, respectively). Two clusters (A and B) were definedby strong CEBPB binding, with maxima either at the 3- and 36-h(A) or at the 3-h (B) time points, which is also evident in a summedprofile of temporal coverage (Fig. 2B). As opposed to the A and B

Figure 2. Three distinct temporal binding patterns reflect the CEBPA and CEBPB protein level ratios. (A) Heatmap of hierarchical clustering on CEBPAand CEBPB binding profiles, based on genomic coverage in each region (row) and time point (column). White signifies low binding level (coverage 0); darkblue, high (cut-off at coverage 120). The most prominent clusters are marked A, B, and C. (B) Summarized coverage profiles of the three clusters. For eachprofile, summed time point levels are shown as normalized to the highest point (set to 1). (C ) Western blot showing CEBPA and CEBPB levels through thefirst 36 h of liver regeneration. Isoforms indicated (CEBPA: p42 and p30; CEBPB: LAP*, LAP, and LIP). Loading control is histone H3. Experiment isrepresentative of three independent repeats. (D) Representative quantification of protein levels based on signal intensities (using ImageJ). All isoformsincluded for protein level determination, normalized to a maximum value of 1.

Jakobsen et al.



ARTICLE IV | 101

clusters, the C cluster was characterized by a high degree of tem-poral correspondence between the CEBPA and CEBPB bindinglevels, with maxima at the quiescent state 0-h time point and at 24h (Fig. 2A,B). The detected robust binding of CEBPA at the 16- and24-h time points was unexpected, as previous studies have re-ported decreased protein levels for this TF throughout the re-generative process (Greenbaum et al. 1995, 1998). To clarify this,we examined the protein levels for CEBPA and CEBPB until the firstround of replication (time points, 0–36 h) by Western blotting(Fig. 2C). This highlighted a clear correlation between the CEBPAto CEBPB protein level ratios and the observed binding patterns.Specifically, we detected high levels of CEBPB at the 3- and 36-h timepoints, high CEBPA levels at the quiescent state (0 h), and a returnof CEBPA predominance at 16- and 24-h time points (Fig. 2C,D).

The three binding pattern clusters are associated with specificfunctional classes of genes

The temporally distinct CEBP binding patterns of putative cis-regulatory regions belonging to the three clusters suggest divergentregulatory roles through the regenerative process. To address thispossibility, we investigated whether each cluster was associatedwith specific sets of differentially expressed genes. To this end, weperformed ChIP-seq experiments with liver tissue from the eighttime points outlined above with an antibody specific to RNApolymerase II (POL2). In contrast to examining steady-state mRNAlevels, this allowed us to directly measure the transcriptional ac-tivity at any given time point as POL2 binding to each gene body(Supplemental Methods; Sandoval et al. 2004). To pinpoint genesregulated by CEBPA and/or CEBPB, a single putative target genewas assigned to each bound region, based on proximity to thetranscription start site (TSS) of neighboring genes (SupplementalMethods). Genes associated with regions belonging to one of thethree binding profile clusters (A, B, or C cluster genes) were in-spected for differential expression (for a full target gene list, seeSupplemental Table S4).

For an initial overview, we counted differentially expressedgenes from each cluster, defined as genes with a change in POL2gene body coverage above twofold, comparing each time point to0 h (Fig. 3A). This revealed that all clusters are associated withprominent gene expression changes at the 3-h time point. The Ccluster was associated with a large group of down-regulated genesat 3 h and, furthermore, displayed up-regulation of many genes at24 h. In contrast, the A and B clusters mostly exhibit down-regulationat the 24-h time point and up-regulation early in the time course(3 and 8 h), as well as a resurge at 36 h (mainly the A cluster).

Next, we focused the analysis on the 0-, 3-, 24-, and 36-h timepoints, as these displayed the most pronounced TF binding dif-ferences between the three clusters and therefore are likely torepresent the most distinct regulatory states. Groups of genes, up-or down-regulated from one time point to the next (i.e., from 0–3h, 3–24 h, and 24–36 h), were subjected to gene ontology (GO)analysis with the online DAVID tool (Supplemental Methods)(Huang da et al. 2009). As summarized in Table 1, we observed ex-tensive differences in functional classes (GO terms, Biological Process)of differentially expressed genes associated with the three bindingpatterns (for full GO analysis lists, see Supplemental Table S5).

At the 0- to 3-h transition, the A cluster displays almost equalnumbers of genes up- and down-regulated (554 vs. 682) (Table 1).In contrast, the B and C clusters showed mainly decreased targetgene activity, with 610 and 709 down-regulated genes (63% and70% of total regulated genes, respectively). The down-regulated

genes associated with both B and C regions are annotated withlipid metabolism or oxidation-reduction terms, while the activatedgenes are associated with acute phase or inflammatory response.This shift of gene activity is expected, as the liver shifts froma quiescent, homeostatic state to an acute stage as a response toinjury. By inspection of shared B and C target gene loci, we foundseveral examples of B and C peaks in close proximity to each other.These often bind CEBPs at mutually exclusive time points, as ex-emplified in Figure 3B (lower panels). This observation suggests thatCEBP complexes, possibly via interaction with specific co-activatorsor repressors at B or C regions, could have opposing actions ontranscription, leading to timed fine-tuning of gene activity.

Prominent gene expression changes from 3 to 24 h includea significant down-regulation of A cluster genes with the term‘‘transcriptional regulatory activity,’’ a shift that is paralleled bya decrease of A cluster CEBPA and CEBPB occupancy. This maysuggest that the expression of these regulatory factors is dependenton CEBP binding. Another finding was the up-regulation of genestargeted by C cluster regulatory regions (889 of 1209 differentially

Figure 3. The A, B, and C cluster regions target distinct sets of genes.(A) Heatmap of gene expression changes through liver regeneration. ‘‘Up’’and ‘‘down’’ bins count genes associated with a CEBP cluster (A, B, or C),exceeding a 2.0-fold change threshold (POL2 binding level change rela-tive to 0 h). Blue color intensity indicates number of genes going up; grayintensity, genes going down. (B) Representative examples of transcrip-tional activity (polymerase II [POL2] binding) of putative target genes.CEBPB binding illustrates gene association with either the C (Cps1) or A(Mdc1) cluster exclusively (upper left and right panel, respectively) or as-sociation with peaks belonging to both B and C clusters (Adh7 and Orm1;bottom panels). Gene loci are shown in black below; binding levels forCEBPB and POL2, in dark pink and red, respectively. Black numbers in-dicate POL2 gene body coverage for time point comparison. (Adh7) Al-cohol dehydrogenase 7; for additional gene symbols, see Figure 1.




ARTICLE IV | 102

Tab

le1

.T

arg

et

gen

eG

Oca

teg

ori

es

defi

ne

dis

tin

ctre

gu

lato

ryro

les

for

the

A,

B,

an

dC

bin

din

gcl

ust

ers

Clu

ster

Tim

ep

oin

t0–3

hP-v

alu

eT

ime

po

int

3–2

4h

P-v

alu

eT

ime

po

int

24–3

6h

P-v

alu

e

AU

PH

exose

met

abolic

pro

cess

0.2

55

UP

Purin

en

ucl

eotid

eb

ind

ing

0.0

82

UP

Cel

lcy

cle

0.0

01

554

Pro

gra

mm

edce

lld

eath

0.2

67

562

Ad

enyl

nucl

eoti

de

bin

din

g0.0

86

545

Reg

ula

tion

of

cell

cycl

e0.0

38

DO

WN

682

Am

ine

bio

syn

thet

icp

roce

ssC

ellu

lar

amin

oac

idb

iosy

nth

etic

pro

cess

0.0

12

0.0

14

DO

WN

749

Tra

nsc

rip

tion

reg

ula

tor

activi

tyD

NA

bin

din

g2.6

310!

6

2.2

310!

4D

OW

N73

Neg

ativ

ere

gula

tion

of

tran

scrip

tion

,D

NA

-dep

end

ent

Wn

tre

cep

tor

sig

nal

ing

pat

hw

ay

1.0

00

1.0

00

BU

PReg

ula

tion

of

infla

mm

atory

resp

on

se0.0

16

UP

Solu

te:c

atio

nsy

mp

ort

erac

tivi

ty0.2

77

UP

Cofa

ctor

bin

din

g0.1

52

360

Cofa

ctor

met

abolic

pro

cess

0.0

63

518

Sym

port

erac

tivi

ty0.2

85

272

Oxi

dore

duct

ase

activi

ty,

actin

gon

sulfu

rg

roup

of

don

ors

0.2

77

DO

WN

Org

anic

eth

erm

etab

olic

pro

cess

0.0

07

DO

WN

Tra

nsc

rip

tion

reg

ula

tor

activi

ty0.0

36

DO

WN

NA

610

Trig

lyce

rid

em

etab

olic

pro

cess

0.0

09

460

Lip

ase

activi

ty0.5

21

77

NA

CU

PA

cute

infla

mm

atory

resp

on

se0.0

08

UP

Oxi

dat

ion

red

uct

ion

2.4

310!

4U

PBio

log

ical

adh

esio

n0.0

12

309

Res

pon

seto

woun

din

g0.0

31

889

Ster

oid

met

abolic

pro

cess

1.0

310!

3177

Cel

lad

hes

ion

0.0

24

DO

WN

709

Oxi

dat

ion

red

uct

ion

Lip

idb

iosy

nth

etic

pro

cess

5.8

310!

8

8.9

310!

5D

OW

N320

Posi

tive

reg

ula

tion

of

mac

rom

ole

cule

bio

syn

thet

icp

roce

ssPosi

tive

reg

ula

tion

of

nitro

gen

com

poun

dm

etab

olic

pro

cess

0.0

64

0.0

64

DO

WN

196

Pat

tern

bin

din

gPoly

sacc

har

ide

bin

din

g0.0

59

0.0

59

Th

etw

oto

ph

its

of

GO

bio

log

ical

pro

cess

cate

gori

esen

rich

edar

esh

ow

n.

Gen

esas

sig

ned

toa

CEB

Pb

ind

ing

clust

erw

ere

incl

ud

edin

the

UP

and

DO

WN

bin

sw

hen

exce

edin

ga

fold

chan

ge

of

tran

scrip

tion

alac

tivi

tyab

ove

1.5

5.C

han

ges

are

mea

sure

das

PO

L2b

ind

ing

leve

lsre

lative

toa

pre

ced

ing

tim

ep

oin

tas

ind

icat

ed.P-

valu

esar

eBen

jam

inic

orr

ecte

dfo

rm

ultip

lete

stin

g.Fo

ra

com

ple

teG

Ote

rmlis

t,se

eSu

pp

lem

enta

lTab

leS5

.

Jakobsen et al.



ARTICLE IV | 103

expressed genes). Enriched C cluster GO terms were consistentwith a resurge of homeostatic and metabolic gene activity beforeS-phase entry (Table 1). One notable example is Cps1, carbamoyl-phosphate synthetase 1, which encodes a rate-limiting enzyme inthe urea-cycle (Jones 1965) and displays a pronounced up-regulationat the 24-h time point (Fig. 3B, upper left panel).

At the 36-h time point, the A cluster associated genes dis-played a strong bias toward increased transcriptional activity, with545 up-regulated against only 73 down-regulated, in contrast tothe C group target genes with 177 up and 196 down. In the large setof up-regulated A cluster genes, we found a strong overrep-resentation of cell cycle GO terms (Table 1). An example of a cellcycle regulator gene (Mdc1) targeted by an A cluster enhancer isshown in Figure 3 (Fig. 3B, upper right panel). The B cluster showsassociation with relatively few regulated genes at this time point(272 up and 77 down).

These data collectively show the three clusters to be associatedwith defined sets of genes with distinct biological functions andtiming of expression.

Motif analysis identifies specific and mutually exclusive setsof regulatory code

The targeting of distinct sets of genes and the distinct bindingpatterns suggest the presence of divergent regulatory features inthe regions constituting the three temporal clusters. Combinato-rial binding of multiple TFs has been proposed to explain dynamicchanges in occupancy and gene regulation during development(e.g., Zinzen et al. 2009).

To examine this possibility in our system, we carried out ananalysis of the collective set of sequences of each binding cluster(A, B, or C) versus a background set. We counted instances ofknown TF cognate sequences to generate probability scores usingthe ASAP software (Marstrand et al. 2008). In order to attain com-prehensive coverage with minimal redundancy, the analysis wasperformed using a condensate of three publicly available databasesof TF binding sequences or position weight matrixes (PWMs) (Sup-plemental Methods; Supplemental Table S6). Probability Z-scoresfor all 246 condensed PWMs with representation above a thresholdwere used for hierarchical clustering to examine the general differ-ences among the three clusters (Fig. 4A). Key examples are shownwith full Z-score information (Fig. 4B).

As expected, we found CEBP motifs to be among the mostenriched for all of the binding profile clusters (Fig. 4A,B; Supple-mental Table S6). From the hierarchical clustering, it was evidentthat the majority of motifs found in the A set were shared with theB set, while a small group of motifs was found to have robustoverrepresentation in all the three clusters (Fig. 4A). Prominentamong these were motifs recognized by CREB and HNF4A (Fig.4A,B), two TFs known to have a wide set of target genes in hepa-tocytes and central roles in liver metabolism and development(Costa et al. 2003; Montminy et al. 2004; Bolotin et al. 2010).

Notably, many motifs displayed a clear mutually exclusivepattern of overrepresentation (Fig. 4A). Specifically, many motifsfound in the A and B clusters of cis-regulatory elements were un-derrepresented in the C cluster and vice versa. Binding sequencesof the A and B regions belong to TFs related to stress response,proliferation, or cell cycle regulation, such as E2F, EGR1 and MYC(c-myc), the hypoxia response factor HIF1A, the metal responsefactor MTF1, and Kruppel-like factors (KLFs) (Fig. 4B; SupplementalTable S6; Lichtlen and Schaffner 2001; Blais and Dynlacht 2004;Liao et al. 2004; Eilers and Eisenman 2008; Majmundar et al. 2010;

McConnell and Yang 2010; Zwang et al. 2011). The C regions, onthe other hand, contain matches to PWMs of liver-specificTFs—such as FOXA2, ONECUT1 (previously known as HNF6), andHNF1A (Costa et al. 2003; Guillaumond et al. 2010)—or factorsfound to be associated with a quiescent state (G0), e.g., MAFB andFOXO3A (Greer and Brunet 2005; Sarrazin et al. 2009). Addition-ally, several different SOX motifs were identified as strongly over-represented within this cluster (Fig. 4B; Supplemental Table S6).

Overall, these observations suggest significant differences inthe regulatory mechanics of the A/B clusters versus the C cluster ofputative cis-regulatory elements. Moreover, the data point to setsof specific, dynamic TF binding partners of CEBPs, defining atemporal regulatory code.

Multi-level support of temporal cis-regulatory code predictions

To assess if the TFs predicted to interact with A or C cluster regions arepresent in the liver, we examined their expression levels (POL2 genebody reads) against all genes (Fig. 4C; Supplemental Table S7). Thisrevealed that the majority are highly expressed, being representedby either direct matches or top protein family members (e.g., Klf10,-13, -15 or Sox13, -15, -18). Strikingly, Cebpa, Cebpb, and Mafb are atthe top of the list (rank positions of 107, 21, and 61, respectively).

We further examined the indicated differences of the threeclusters by interrogating their genomic position relative to thenearest TSS (Fig. 4D). The three clusters clearly diverge in this re-spect, as the C cluster regions are distinct from A (or B) regions byshowing no proximity to TSSs. Several studies have shown differ-ences in preferred genomic positioning of TFs (e.g., Gerstein et al.2012). This suggests, based solely on the genomic difference inposition, that C cluster regions recruit other cobinding factors thanthe A (or B) regions. Specifically, E2Fs and MYC have been shownto bind at positions similar to the A regions, supporting the rele-vancy of enrichment for E2F and MYC cognate sequences in the Acluster (e.g., Eilers and Eisenman 2008).

Next, we took advantage of published mouse ChIP-seq datasets (Chen et al. 2008; Hoffman et al. 2010; Laudadio et al. 2012) totest if a panel of factors bound to A versus C regions as predicted byour computational analysis. We find our analysis to be supportedas the E2F1, KLF4, and MYC factor bound regions overlap signifi-cantly more with A regions, while FOXA2 and ONECUT1 prefer-entially bind C regions (Fig. 4E; Supplemental Methods).

Finally, we tested co-occupancy of CEBPs and several putativecofactors by performing sequential ChIP (Methods). This shows thatthe factor E2F3 binds to the two A regions Slbp and Cbx5 simul-taneously with CEBP factors, but none of the examined C regions,in accordance with our predictions. In contrast, the C cluster–associated factors ONECUT1, HNF1A, and MAFB interact with sev-eral C regions also occupied by CEBPs (Fig. 4F). Noticeably, a numberof target regions are shared among the three C region factors.

Two modes of transcriptional regulation by EGR1

EGR1 has been shown to be essential for a timely regenerative re-sponse in the mouse liver (Liao et al. 2004) and represents a classical‘‘immediate early TF,’’ which is up-regulated upon a growth stimulus(for review, see Thiel and Cibelli 2002). Recently, it was found to bepart of a growth-signal discriminatory circuit together with TP53(Zwang et al. 2011). Canonical EGR1 cognate sequences are amongthe most highly enriched in the ‘‘cell cycle’’ or ‘‘proliferation’’ set ofCEBP bound regulatory regions (the A cluster), together with E2Fand MYC motifs (Supplemental Table S6).




ARTICLE IV | 104

To investigate the role of EGR1 in liver regeneration in vivo,we performed ChIP-seq experiments with EGR1-specific antibodiesat the 24- and 36-h time points. Unexpectedly, we observed apreference of EGR1 for the C set over the A set regions (Fig. 5A).Accordingly, A and EGR1 peak summits were generally positioned

further apart than C and EGR1 peaks (Fig. 5B). Moreover, a GOanalysis of putative target genes suggested that EGR1 may be in-volved in all aspects of liver regeneration, as acute phase genes,metabolic genes, and cell cycle genes are found proximal to EGR1bound regions (Supplemental Table S9; data not shown).

Figure 4. The CEBP temporal binding patterns display two sets of cis-regulatory code. (A) Representation of motif frequencies in each temporal cluster.Hierarchical clustering based on Z-scores of 210 position weight matrices (PWMs). Each row indicates a specific motif (PWM), while columns represent eachbinding cluster (A, B, or C). Color scale indicates representation relative to background set (overrepresented, dark blue; underrepresented, dark gray; nodifference, white). Z-score scale is cut off at 640 for clarity. Example PWMs are shown in black text. (B) Selected transcription factors with clear binding sequenceoverrepresentation in A and B clusters (upper six), all clusters (middle three), or only the C cluster (lower seven). Clusters are denoted by color. (C ) Expression levelfrequency distribution (POL2 gene body read coverage) of all genes (reads per kilobase, sum of eight time points, normalized, above 0.5) with candidatetranscription factors indicated. (D) Distances from A, B, and C cluster peak summits or random positions to the most proximal RefSeq transcription start site(TSS). (E) External ChIP-seq data peak regions (yellow bars) showing overlaps with CEBP A and C cluster regions (white numbers indicate proportions) andP-value of hypergeometric test (Fisher’s one-tailed) of overlap similarity. (F) Sequential ChIP (reChIP) assessing co-occupancy at A and C cluster regions. Anti-CEBP was used as first-round antibody (recognizing both CEBPA and CEBPB), with second-round IgG or antibody against indicated TFs. Enrichments arenormalized to IgG levels. N = 2–5. Error bars, SEM. (*) P < 0.05; (**) P < 0.01, t-test versus IgG enrichments. For gene names, see Supplemental Table S10.

Jakobsen et al.



ARTICLE IV | 105

By manual inspection of tightly overlapping EGR1 and C re-gions bound by CEBPs, we found many peaks positioned preciselyat a CEBP sequence but lacking EGR1 cognate sequence (Fig. 5C,bottom panels). Conversely, EGR1 peaks that possessed EGR1cognate sequences did not overlap exactly with the CEBP peaks,and these peaks were consistently associated with A cluster regions(Fig. 5C, top panels). Sequences of target regions with positions ofCEBP and EGR1 PWM hits and logos of used PWMs can be found inSupplemental Figure S5.

To test if EGR1 depends on CEBPs for interacting with DNA atC regions (lacking EGR1 sequences), we performed EGR1-specificChIP with liver tissue from wild-type and Cebpb knockout mice.Our results show that EGR1 binding depends significantly more onCEBPB at C regions than at A regions, suggesting that EGR1 couldbind DNA in two modes, indirectly (via CEBPs) at C regions anddirectly at A regions (Fig. 5D).

Differential sets of transcriptional regulators targetedby the three clusters of cis-regulatory regions

To further explore the transcriptional network of liver regeneration,we focused on the 120 most highly expressed CEBP target genesannotated with the ‘‘transcriptional regulator, DNA-dependent’’

GO term (Supplemental Table S8). These we represent as targetsof either A, B, or C cluster regions or of any combination hereof(Fig. 6A).

We find that the large majority of regulators (110 of 120) aretargeted by either the A or B cluster regions, while the C clustertargets much fewer (38), suggesting a massive shift at the tran-scriptional network level upon transition from a quiescent to a re-generative state. Moreover, the injury response clusters (A/B) alsoimpinge on many of the regulators targeted at the quiescent stage(29 of 38 bound by the C cluster), hinting at tight integration of thegenetic programs at successive stages of the complex regenerativeprocess.

A number of genes are associated with a high number of CEBPbound regions belonging to two or three clusters (Fig. 6A), sug-gesting a complex cis-regulatory structure. This may suggest thatthey are important components, or ‘‘hubs,’’ of the network. Prom-inent examples are the genes for the bHLH DNA-binding dominantrepressor, ID2, as well as the cell cycle regulators JUN and EGR1(Fig. 6A,B). Also Cebpa and Cebpb themselves are targets of multiplebound regions, as are Mafb, Hnf4a, and several Klf genes (Fig. 6A).

Several genes of regulatory factors are targeted by regions ofwhich they may be coregulators based on the binding motif analysis(Figs. 4B, 6A). This can be interpreted as network feed-forward

Figure 5. Two modes of EGR1 cobinding at A and C cluster regions. (A) Bar diagram displaying overlap between peaks defined by EGR1 binding (24 h or36 h, yellow bars) and CEBP binding. A or C cluster CEBP peak overlaps are shown (white numbers indicate proportions) with P-values of hypergeometrictests (Fisher’s one-tailed) of similarity. (B) Diagram showing the distribution of distances between CEBP peak summits and the closest EGR1 (36-h) peak. A(gray) and C (blue) cluster peaks and a randomly distributed set (dashed line) are included. (C ) CEBP and EGR1 ChIP read coverage and PWM hit tracksfrom representative A (top panels) and C (bottom panels) cluster regions. Vertical red bars indicate CEBP and EGR1 PWM hits; height of bars identity score;and transparent black bars show position of peak summits. (D) EGR1-ChIP using livers of mice deficient for Cebpb and wild-type livers. (Y-axis) Ratio ofenrichment for Cebpb-null versus wild-type experiments. EGR1 bound regions overlapping with CEPB A cluster (gray) and C cluster (blue) peaks are shown.N = 3–4. Error bars, SEM. (**) P < 0.01, Mann-Whitney test, two-tailed. For gene names, see Supplemental Table S10.




ARTICLE IV | 106

or auto-regulatory loops (Fig. 6C). Examples include genesencoding E2Fs and KLFs for the A and B cluster regions and FOXfactors for the C regions. Hnf4a is a shared target of regions fromall three clusters, and the HNF4A binding sequence is alsoenriched in all clusters (Fig. 6A; Supplemental Tables S6, S7). Theseregulatory loops may be involved in adding robustness to thenetwork as suggested for other transcriptional hierarchies (Leeet al. 2002).

We detect the cholesterol Nr1h3 (also known as Lxr) and bileacid Nr1h4 (also known as Fxr) metabolism regulatory genes to be Aand C cluster targets, respectively (Fig. 6A; Supplemental Table S8;Kalaany and Mangelsdorf 2006). We also find the gene encoding

RXRA, the obligate binding partner of both TFs, to be targeted bymultiple CEBP bound elements. This suggests that these factorsmay be interconnected with the CEBPs in regulation of choles-terol and bile metabolism during liver regeneration.

Finally, we note that almost all core members of the circadianclock system (Zhang and Kay 2010) turn out to be putative CEBPtargets (Fig. 6A; Supplemental Table S8), e.g., the core TF genesClock and Bmal1 (also known as Arntl), as well as the downstreameffectors encoded by Per1/2/3 and Cry1/2, and secondary compo-nent genes Rora, Rorc, Nr1d2 (also known as Rev-erb beta), Dbp, andNfil3 (also known as E4bp4). This points to a high level of cross-talkbetween circadian clock and regenerative response regulation.

Figure 6. The A, B, and C cluster regions target distinct sets of transcriptional regulator encoding genes. (A) Regulatory network showing transcriptionfactors targeted by CEBP-bound regions belonging to the A, B, or C clusters. The total number of connections (peaks associated with a gene) is indicated bycolor scale (white is one, and dark blue is 11, the maximum), and the relative, total expression (POL2 binding throughout the eight time points) isindicated by circle size. (B) CEBPB 3-h ChIP genomic coverage in the vicinity of three putative targets. Jun, Id2, and Egr1 loci display several CEBP peaks, i.e.,‘‘connections’’ in A. (C ) Examples of putative feed-forward loops in the network. Arrows point from the transcription factor to the regulated gene. Forclarity, the gene product level of CEBP targets has been left out. ‘‘A’’ and ‘‘C’’ are cis-regulatory elements belonging to the A or C cluster.

Jakobsen et al.



ARTICLE IV | 107

DiscussionThe multi-time point global quantification of TF occupancy en-abled us to study the temporal dynamics of regulation in vivo. Wefind that the two central transcriptional regulators CEBPA andCEBPB interact with a large group of cis-regulatory elements ina highly dynamic manner through the progressive phases of liverregeneration. This group of genomic regions can be subdividedby CEBP binding dynamics into three clusters with distinctlyenriched sets of sequence motifs, pointing to molecular mecha-nisms governing differential CEBP binding. Guided by this ob-servation, we validate the predicted binding preference of severalkey cobinding TFs with sequential ChIP and published ChIP-seqdata.

An important aspect of our study is the integration of ob-served TF binding dynamics and expression of putative targetgenes. We find that each of the three binding clusters is associatedwith distinct sets of regulated target genes, i.e., acute phase genes,metabolic/homeostatic genes, and cell cycle–related genes (Table 1).This demonstrates the possibility of linking dynamic TF occupancypatterns with the shifting phases of the regenerative program(Fig. 7A). Moreover, the concomitant resurge of CEBP binding tothe C cluster regions and metabolic gene activity at 24 h providesevidence for a previously uncharacterized, early ‘‘homeostatic’’

phase of liver regeneration, taking place before the first round ofreplication.

The observed shifting phases of CEBP binding is closelytemporally coupled to the dynamic ratio of CEBPA to CEBPB (Fig.2C,D). One possible interpretation is that this ratio is involved indictating enhancer occupancy by CEBP complexes. As such, a highratio of CEBPA to CEBPB would lead to binding of C cluster cis-regulatory elements enhancing metabolic and repressing acutephase response genes. Conversely, a low ratio would direct bindingof B cluster elements that repress metabolic or activate acute phasegenes or of A cluster elements that activate cell cycle genes. Thisdifferential recruitment of CEBPs could be explained by in-teraction with distinct sets of TFs, depending on CEBP complexcomposition. This model is summarized in Figure 7B. Our obser-vations and model contradict the conventional view that CEBPAlevels are low while CEBPB levels are high throughout the re-generative process (Greenbaum et al. 1995, 1998), and emphasizesthe requirement for a high degree of temporal resolution in studiesof dynamic biological processes.

Many cognate sequences associated with TFs not previouslyknown to be involved in liver regeneration were found to beoverrepresented in the CEBP bound regions (Supplemental TableS6). In the C cluster regions, motifs of several SOX factors wereabundant (Fig. 4B). Recent findings implicate SOX factors in he-patocyte differentiation and stem cell biology (Duan et al. 2010;Furuyama et al. 2011). Our data find Sox18, -13, and -15 to behighly expressed (Fig. 4C; Supplemental Table S7), but exactlywhich SOX factors cooperate with CEBPs in the liver remains to bedetermined. In the A cluster, we found overrepresented KLF factormotifs, and several highly expressed Klfs (e.g., Klf10, -13, and -15)are targets of CEBP A cluster enhancers (Figs. 4B,C, 6A; Supple-mental Table S8). Liver functions were recently identified forKLF15 and KLF10 (Guillaumond et al. 2010; Takashima et al.2010). Moreover, KLF factors are central for self-renewal programsin embryonic stem cells (Takahashi and Yamanaka 2006; Jianget al. 2008), which suggests that they may play similar roles in theself-renewal circuitry of fully differentiated hepatocytes, very likelyas both binding partners and targets of CEBPs.

An unexpected observation of this study was that EGR1binding preferentially colocalizes with the C cluster regions (Fig.5A), which are almost depleted of EGR1 cognate sequence (Fig. 4B;Supplemental Table S6). Our data support a model of indirect or‘‘assisted’’ EGR1 binding to explain this observation (Fig. 5C,D).Only one example of indirect EGR1 binding via CEBPB has beenpublished (Zhang et al. 2003), but our genome-wide observationsindicate that EGR1 targeting of metabolic gene promoters (Ccluster regions) via CEBPs could be a general phenomenon in theliver. EGR1 has also been found to be important for interpretationof mitogenic signals (Zwang et al. 2011) and targets A cluster re-gions in this study (Fig. 5A,D; Supplemental Table S9). Hence, theobserved CEBPA-to-CEBPB shift in the CEBP pool could regulateEGR1 recruitment to A or C cluster cis-regulatory elements andthus ensure a tight temporal control of EGR1 action in line withthe progressive phases of liver regeneration.

In essence, liver regeneration upon partial hepatectomy iscompensatory growth, as the cells proliferating are fully differen-tiated hepatocytes. Previous studies have shown that liver re-generation does not involve loss of hepatocyte differentiation state(e.g., Malato et al. 2011; for reviews, see Fausto et al. 2006;Michalopoulos 2007). This is consistent with our finding thatmany hepatocyte-specific genes, e.g., involved in metabolic func-tions, are up-regulated rather than down-regulated. (Fig. 3B; Table

Figure 7. Protein level ratios of CEBPA versus CEBPB may define a dy-namic transcriptional switch. (A) Schematic diagram showing the di-vergent binding patterns and biological roles of the A, B, and C cluster cis-regulatory elements through the first cell cycle of liver regeneration. Apreviously uncharacterized early resurge of metabolic/homeostatic genes,associated with C cluster regions with binding peaking at 24 h, was ob-served. (B) Model of a transcriptional switch centered on the relative ratioof CEBPA and CEBPB determining the composition of the CEBP complexpool (homo- or heterodimers). Gene activating or repressive actions areindicated, as well as sets of transcription factors found to be associatedwith A or C putative enhancers. Metabolic genes are induced and acutephase genes repressed by binding in a CEBPA-high setting (C regions),while the opposite is true for the CEBPB-high setting (B regions). A-typeregions are only targeted by CEBPs when the CEBPB form is abundant.




ARTICLE IV | 108

1). Only few examples are known of differentiated cells capable ofhepatocyte-like ‘‘self-renewal’’ in the adult mammalian organism.It has recently been shown that the quiescence of naı̈ve T cells isactively enforced by a balance between FOXP1 and FOXO1,allowing swift expansion upon external cues (Feng et al. 2011).Similarly, deficiency of just two TFs lends long-term, non-tumorigenic expansion capacity to mature macrophages (Azizet al. 2009). These two factors are MAF (c-maf) and MAFB, both ofwhich we find to be putative CEBP targets in the liver (Fig. 6A;Supplemental Table S8). MAFB is among the most highly expressedTFs in the quiescent condition and is strongly down-regulatedfrom the 8-h time point (Supplemental Tables S4, S8). Further-more, we find MAFB and MAFK binding sequences enriched in thequiescence/metabolism-associated C cluster of enhancers, pre-dominantly bound by CEBPs at the 0- and 24-h time points. In theproliferation-associated A cluster, we find KLF and MYC sequences(Fig. 4B; Supplemental Table S6); KLF4 and MYC were demon-strated to be required for expansion of the differentiated macro-phages (Aziz et al. 2009). Moreover, CEBPA is a key determinant ofmacrophage as well as liver cell differentiation (e.g., Feng et al.2008). These observations may suggest that similar transcriptionalnetworks centered on a CEBP/MAF axis are enforcing quiescencein both fully differentiated hepatocytes and macrophages. Un-derstanding this circuitry has the potential to be of use in re-generative medicine as a step toward manipulation of hepatocytesor other fully differentiated cells for therapeutic purposes.

In conclusion, the present work shows how time-resolvedanalysis of two core TFs of liver regeneration can reveal specificaspects of the regulation operating at distinct phases during theprocess. In a broader sense, our approach of analyzing dynamic TFoccupancy globally in vivo should be applicable to other experi-mental systems and holds promise to aid in the elucidation ofcomplex transcriptional networks in higher vertebrates.

Methods

Mouse workThe partial hepatectomy was performed on wild-type mice(C57BL6/J, male, 7 wk old) by removing three of five liver lobesaccording to the method described earlier (Thoren et al. 2010).Mice were culled after varying amounts of time covering theregrowth phase of liver regeneration (0, 3, 8, 16, 24, 36, 48, and 168h) (Fig. 1A,B). Livers from five individual mice for each time pointwere harvested, generating 40 samples; directly snap-frozen inliquid nitrogen; and stored at !80°C. Animal experiments con-formed to institutional as well as Danish national guidelines.

Chromatin immunoprecipitationAfter thawing, tissue samples were homogenized by douncing(loose pestle, Wheaton 15-mL douncer) in cold PBS, and werecross-linked for 10 min in 1% formaldehyde using a rotator.Chromatin was fragmented by sonication (Bioruptor, Diagenode).The 40 samples were used in ChIP-seq experiments with antibodiesagainst CEBPA, CEBPB, EGR1 (Santa Cruz: sc-61, sc-150, sc-110x),RNA-POL2 subunit B1 antibody (AC-055-100, Diagenode), andMock IgG (Sigma I8140). ChIP was performed according to themethod described earlier (Sandmann et al. 2006a). Sequential ChIPwas done according to the method described previously (Truax andGreer 2012), with the modification of cross-linking antibody–Protein A beads (www.neb.com). Santa Cruz antibodies as abovewere used for sequential CEBPA-CEBPB ChIP (Supplemental Fig.S6). An in-house CEBP antibody ( JSJ#1052) recognizing both

CEBPA and CEBPB was used in the first round in sequential ChIPwith antibodies against ONECUT1 (HNF6), HNF1A, E2F3, MAFB,or IgG for the second round (Santa Cruz: sc-13050x, sc-8986x, sc-878x; Novus Bio: nb600-266). All sequential ChIP enrichmentswere normalized to IgG ratios. Enrichment was validated by qPCR(ABI Prism 7000 or Roche Lightcycler 480), using ratios of a posi-tive detector primer set versus a negative (Mamstr, Chr_12_desert1or Sfi2). All primer sets are listed in Supplemental Table S10.

Clustering of CEBPA and CEBPB peaksConsensus regions for CEBPA and CEBPB peaks for all time pointswere defined by merging overlapping regions between sets, re-quiring a called peak in at least one sample, producing a total of87,049 consensus regions. Subsequently, coverage, defined as themaximal coverage level for each region, was determined using thenormalized counts (Supplemental Material). Filtering was applied,requiring normalized coverage of a minimum of 50 reads in at leastone sample for any given region, reducing the set to 11,314. Hi-erarchical clustering was performed in MeV (http://www.tm4.org/mev/) (Saeed et al. 2006), and three predominant clusters wereidentified: A (3449 regions), B (2818), and C (3034). The remaining2013 regions were excluded from further analysis. Coverage readswere summed for each time point/cluster, normalized, and used fordisplaying sum coverage tracks (Fig. 2B). Consensus region cover-age data can be found in Supplemental Table S3.

Data accessThe timeseries ChIP-seq data generated for this work have beendeposited in the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) and are accessible through GEO Seriesaccession number GSE42321.

AcknowledgmentsWe thank Claus Nerlov and Agnes Zay for Cebpb knockout mouselivers, Mie Poulsen and Bjørg Krog for expert help in performingthe partial hepatectomy, and Thomas Sandmann, Federico DeMasi, and members of the Porse laboratory for critical reading ofthe manuscript. This study was supported by grants from TheNovo Foundation and the Danish Cancer Society.

References

Aziz A, Soucie E, Sarrazin S, Sieweke MH. 2009. MafB/c-Maf deficiency enablesself-renewal of differentiated functional macrophages. Science 326: 867–871.

Blais A, Dynlacht BD. 2004. Hitting their targets: An emerging picture of E2Fand cell cycle control. Curr Opin Genet Dev 14: 527–532.

Bolotin E, Liao H, Ta TC, Yang C, Hwang-Verslues W, Evans JR, Jiang T,Sladek FM. 2010. Integrated approach for the identification of humanhepatocyte nuclear factor 4a target genes using protein bindingmicroarrays. Hepatology 51: 642–653.

Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W,Jiang J, et al. 2008. Integration of external signaling pathways with thecore transcriptional network in embryonic stem cells. Cell 133: 1106–1117.

Costa RH, Kalinichenko VV, Holterman AX, Wang X. 2003. Transcriptionfactors in liver development, differentiation, and regeneration.Hepatology 38: 1331–1347.

Diehl AM, Yang SQ. 1994. Regenerative changes in C/EBPa and C/EBPbexpression modulate binding to the C/EBP site in the c-fos promoter.Hepatology 19: 447–456.

Duan Y, Ma X, Zou W, Wang C, Bahbahan IS, Ahuja TP, Tolstikov V, Zern MA.2010. Differentiation and characterization of metabolically functioninghepatocytes from human embryonic stem cells. Stem Cells 28: 674–686.

Eilers M, Eisenman RN. 2008. Myc’s broad reach. Genes Dev 22: 2755–2766.Fausto N, Campbell JS, Riehle KJ. 2006. Liver regeneration. Hepatology 43:

S45–S53.Feng R, Desbordes SC, Xie H, Tillo ES, Pixley F, Stanley ER, Graf T. 2008. PU.1

and C/EBPa/b convert fibroblasts into macrophage-like cells. Proc NatlAcad Sci 105: 6057–6062.

Jakobsen et al.



ARTICLE IV | 109

Feng X, Wang H, Takata H, Day TJ, Willen J, Hu H. 2011. Transcription factorFoxp1 exerts essential cell-intrinsic regulation of the quiescence of naiveT cells. Nat Immunol 12: 544–550.

Furuyama K, Kawaguchi Y, Akiyama H, Horiguchi M, Kodama S, Kuhara T,Hosokawa S, Elbahrawy A, Soeda T, Koizumi M, et al. 2011. Continuouscell supply from a Sox9-expressing progenitor zone in adult liver,exocrine pancreas and intestine. Nat Genet 43: 34–41.

Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ,Khurana E, Rozowsky J, Alexander R, et al. 2012. Architecture of the humanregulatory network derived from ENCODE data. Nature 489: 91–100.

Glynne R, Ghandour G, Rayner J, Mack DH, Goodnow CC. 2000. B-lymphocyte quiescence, tolerance and activation as viewed by globalgene expression profiling on microarrays. Immunol Rev 176: 216–246.

Greenbaum LE, Cressman DE, Haber BA, Taub R. 1995. Coexistence of C/EBPa, b, growth-induced proteins and DNA synthesis in hepatocytesduring liver regeneration. Implications for maintenance of thedifferentiated state during liver growth. J Clin Invest 96: 1351–1365.

Greenbaum LE, Li W, Cressman DE, Peng Y, Ciliberto G, Poli V, Taub R. 1998.CCAAT enhancer-binding protein b is required for normal hepatocyteproliferation in mice after partial hepatectomy. J Clin Invest 102: 996–1007.

Greer EL, Brunet A. 2005. FOXO transcription factors at the interfacebetween longevity and tumor suppression. Oncogene 24: 7410–7425.

Guillaumond F, Grechez-Cassiau A, Subramaniam M, Brangolo S, Peteri-Brunback B, Staels B, Fievet C, Spelsberg TC, Delaunay F, Teboul M. 2010.Kruppel-like factor KLF10 is a link between the circadian clock andmetabolism in liver. Mol Cell Biol 30: 3059–3070.

Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S,Soukhatcheva G, Li L, Wederell ED, Thiessen N, et al. 2010. Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate thefunctionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets andliver. Genome Res 20: 1037–1051.

Huang da W, Sherman BT, Lempicki RA. 2009. Systematic and integrativeanalysis of large gene lists using DAVID bioinformatics resources. NatProtoc 4: 44–57.

Jakobsen JS, Braun M, Astorga J, Gustafson EH, Sandmann T, Karzynski M,Carlsson P, Furlong EE. 2007. Temporal ChIP-on-chip reveals Biniou asa universal regulator of the visceral muscle transcriptional network.Genes Dev 21: 2448–2460.

Jiang J, Chan YS, Loh YH, Cai J, Tong GQ, Lim CA, Robson P, Zhong S, NgHH. 2008. A core Klf circuitry regulates self-renewal of embryonic stemcells. Nat Cell Biol 10: 353–360.

Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-wide mappingof in vivo protein-DNA interactions. Science 316: 1497–1502.

Jones ME. 1965. Amino acid metabolism. Annu Rev Biochem 34: 381–418.Kalaany NY, Mangelsdorf DJ. 2006. LXRS and FXR: The yin and yang of

cholesterol and fat metabolism. Annu Rev Physiol 68: 159–191.Kurinna S, Barton MC. 2011. Cascades of transcription regulation during

liver regeneration. Int J Biochem Cell Biol 43: 189–197.Laudadio I, Manfroid I, Achouri Y, Schmidt D, Wilson MD, Cordi S, Thorrez

L, Knoops L, Jacquemin P, Schuit F, et al. 2012. A feedback loop betweenthe liver-enriched transcription factor network and miR-122 controlshepatocyte differentiation. Gastroenterology 142: 119–129.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, HannettNM, Harbison CT, Thompson CM, Simon I, et al. 2002. Transcriptionalregulatory networks in Saccharomyces cerevisiae. Science 298: 799–804.

Liao Y, Shikapwashya ON, Shteyer E, Dieckgraefe BK, Hruz PW, Rudnick DA.2004. Delayed hepatocellular mitotic progression and impaired liverregeneration in early growth response-1-deficient mice. J Biol Chem 279:43107–43116.

Lichtlen P, Schaffner W. 2001. Putting its fingers on stressful situations: Theheavy metal-regulatory transcription factor MTF-1. Bioessays 23: 1010–1017.

Lopez RG, Garcia-Silva S, Moore SJ, Bereshchenko O, Martinez-Cruz AB,Ermakova O, Kurz E, Paramio JM, Nerlov C. 2009. C/EBPa and b coupleinterfollicular keratinocyte proliferation arrest to commitment andterminal differentiation. Nat Cell Biol 11: 1181–1190.

Majmundar AJ, Wong WJ, Simon MC. 2010. Hypoxia-inducible factors andthe response to hypoxic stress. Mol Cell 40: 294–309.

Malato Y, Naqvi S, Schurmann N, Ng R, Wang B, Zape J, Kay MA, Grimm D,Willenbring H. 2011. Fate tracing of mature hepatocytes in mouse liverhomeostasis and regeneration. J Clin Invest 121: 4850–4860.

Marstrand TT, Frellsen J, Moltke I, Thiim M, Valen E, Retelska D, Krogh A.2008. Asap: A framework for over-representation statistics fortranscription factor binding sites. PLoS One 3: e1623.

Matsuo T, Yamaguchi S, Mitsui S, Emi A, Shimoda F, Okamura H. 2003.Control mechanism of the circadian clock for timing of cell division invivo. Science 302: 255–259.

McConnell BB, Yang VW. 2010. Mammalian Kruppel-like factors in healthand diseases. Physiol Rev 90: 1337–1381.

Michalopoulos GK. 2007. Liver regeneration. J Cell Physiol 213: 286–300.Montminy M, Koo SH, Zhang X. 2004. The CREB family: Key regulators of

hepatic metabolism. Ann Endocrinol 65: 73–75.

Nerlov C. 2007. The C/EBP family of transcription factors: A paradigm forinteraction between gene expression and proliferation control. TrendsCell Biol 17: 318–324.

Ni L, Bruce C, Hart C, Leigh-Bell J, Gelperin D, Umansky L, Gerstein MB,Snyder M. 2009. Dynamic and complex transcription factor bindingduring an inducible response in yeast. Genes Dev 23: 1351–1363.

Osada S, Yamamoto H, Nishihara T, Imagawa M. 1996. DNA bindingspecificity of the CCAAT/enhancer-binding protein transcription factorfamily. J Biol Chem 271: 3891–3896.

Overturf K, al-Dhalimy M, Ou CN, Finegold M, Grompe M. 1997. Serialtransplantation reveals the stem-cell-like regenerative potential of adultmouse hepatocytes. Am J Pathol 151: 1273–1280.

Porse BT, Pedersen TA, Xu X, Lindberg B, Wewer UM, Friis-Hansen L, NerlovC. 2001. E2F repression by C/EBPa is required for adipogenesis andgranulopoiesis in vivo. Cell 107: 247–258.

Rana B, Xie Y, Mischoulon D, Bucher NL, Farmer SR. 1995. The DNA bindingactivity of C/EBP transcription factor is regulated in the G1 phase of thehepatocyte cell cycle. J Biol Chem 270: 18123–18132.

Saeed AI, Bhagabati NK, Braisted JC, Liang W, Sharov V, Howe EA, Li J,Thiagarajan M, White JA, Quackenbush J. 2006. TM4 microarraysoftware suite. Methods Enzymol 411: 134–193.

Sandmann T, Jakobsen JS, Furlong EE. 2006a. ChIP-on-chip protocol forgenome-wide analysis of transcription factor binding in Drosophilamelanogaster embryos. Nat Protoc 1: 2839–2855.

Sandmann T, Jensen LJ, Jakobsen JS, Karzynski MM, Eichenlaub MP, Bork P,Furlong EE. 2006b. A temporal map of transcription factor activity: mef2directly regulates target genes at all stages of muscle development. DevCell 10: 797–807.

Sandoval J, Rodriguez JL, Tur G, Serviddio G, Pereda J, Boukaba A, Sastre J,Torres L, Franco L, Lopez-Rodas G. 2004. RNAPol-ChIP: A novelapplication of chromatin immunoprecipitation to the analysis of real-time gene transcription. Nucleic Acids Res 32: e88.

Sarrazin S, Mossadegh-Keller N, Fukao T, Aziz A, Mourcin F, Vanhille L, KellyModis L, Kastner P, Chan S, Duprez E, et al. 2009. MafB restricts M-CSF-dependent myeloid commitment divisions of hematopoietic stem cells.Cell 138: 300–313.

Takahashi K, Yamanaka S. 2006. Induction of pluripotent stem cells frommouse embryonic and adult fibroblast cultures by defined factors. Cell126: 663–676.

Takashima M, Ogawa W, Hayashi K, Inoue H, Kinoshita S, Okamoto Y, SakaueH, Wataoka Y, Emi A, Senga Y, et al. 2010. Role of KLF15 in regulation ofhepatic gluconeogenesis and metformin action. Diabetes 59: 1608–1615.

Thiel G, Cibelli G. 2002. Regulation of life and death by the zinc fingertranscription factor Egr-1. J Cell Physiol 193: 287–292.

Thoren LA, Norgaard GA, Weischenfeldt J, Waage J, Jakobsen JS, DamgaardI, Bergstrom FC, Blom AM, Borup R, Bisgaard HC, et al. 2010. UPF2 isa critical regulator of liver development, function and regeneration.PLoS ONE 5: e11650.

Truax AD, Greer SF. 2012. ChIP and Re-ChIP assays: Investigatinginteractions between regulatory proteins, histone modifications, andthe DNA sequences to which they bind. Methods Mol Biol 809: 175–188.

van Wageningen S, Kemmeren P, Lijnzaad P, Margaritis T, Benschop JJ, deCastro IJ, van Leenen D, Groot Koerkamp MJ, Ko CW, Miles AJ, et al.2010. Functional overlap and regulatory links shape genetic interactionsbetween signaling pathways. Cell 143: 991–1004.

Vogel MJ, Peric-Hupkes D, van Steensel B. 2007. Detection of in vivoprotein-DNA interactions using DamID in mammalian cells. Nat Protoc2: 1467–1478.

Wang ND, Finegold MJ, Bradley A, Ou CN, Abdelsayed SV, Wilde MD, TaylorLR, Wilson DR, Darlington GJ. 1995. Impaired energy homeostasis in C/EBPa knockout mice. Science 269: 1108–1112.

White P, Brestelli JE, Kaestner KH, Greenbaum LE. 2005. Identification oftranscriptional networks during liver regeneration. J Biol Chem 280:3715–3722.

Yusuf I, Fruman DA. 2003. Regulation of quiescence in lymphocytes. TrendsImmunol 24: 380–386.

Zhang EE, Kay SA. 2010. Clocks not winding down: Unravelling circadiannetworks. Nat Rev Mol Cell Biol 11: 764–776.

Zhang F, Lin M, Abidi P, Thiel G, Liu J. 2003. Specific interaction of Egr1 andc/EBPb leads to the transcriptional activation of the human low densitylipoprotein receptor gene. J Biol Chem 278: 44246–44254.

Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EE. 2009.Combinatorial binding predicts spatio-temporal cis-regulatory activity.Nature 462: 65–70.

Zwang Y, Sas-Chen A, Drier Y, Shay T, Avraham R, Lauriola M, Shema E,Lidor-Nili E, Jacob-Hirsch J, Amariglio N, et al. 2011. Two phases ofmitogenic signaling unveil roles for p53 and EGR1 in elimination ofinconsistent growth signals. Mol Cell 42: 524–535.

Received July 20, 2012; accepted in revised form February 5, 2013.




REFERENCES | 110

References

1. Jou W, Haegeman G, Ysebaert M, Fiers W. Nucleotide Sequence of the Gene Coding for the Bacteriophage MS2 Coat Protein. Nature [Internet]. 1972 [cited 2013 Jul 15]; Available from: http://adsabs.harvard.edu/abs/1972Natur.237...82J

2. Sanger F, Coulson a R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of molecular biology [Internet]. 1975 May 25;94(3):441–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1100841

3. Sanger F, Coulson a R, Friedmann T, Air GM, Barrell BG, Brown NL, et al. The nucleotide sequence of bacteriophage phiX174. Journal of molecular biology [Internet]. 1978 Oct 25;125(2):225–46. Available from: http://www.ncbi.nlm.nih.gov/pubmed/714153

4. Cherwa JE, Organtini LJ, Ashley RE, Hafenstein SL, Fane B a. In VITRO ASSEMBLY of the øX174 procapsid from external scaffolding protein oligomers and early pentameric assembly intermediates. Journal of molecular biology [Internet]. Elsevier Ltd; 2011 Sep 23 [cited 2013 Jun 20];412(3):387–96. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21840317

5. Goffeau A, Barrell B, Bussey H, Davis R. life with 6000 genes. Science [Internet]. 1996 [cited 2013 Jul 15]; Available from: http://www.sciencemag.org/content/274/5287/546.short

6. Altschup SF, Gish W, Pennsylvania T, Park U. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(3):403–10.

7. Kent WJ, Haussler D. Assembly of the Working Draft of the Human Genome with GigAssembler. 2001;1541–8.

8. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science (New York, N.Y.) [Internet]. 2001 Feb 16 [cited 2013 May 21];291(5507):1304–51. Available from: http://www.ncbi.nlm.nih.gov/pubmed/11181995

9. Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of Next-Generation Sequencing Systems. 2012;2012.

10. Peterson TW. Next Gen Sequencing Survey. JP Morgan. 2013;(May 2010).

11. Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly

REFERENCES | 111

aberrant splicing patterns. Genome biology [Internet]. BioMed Central Ltd; 2012 Jan [cited 2013 Mar 13];13(5):R35. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3446288&tool=pmcentrez&rendertype=abstract

12. spliceR [Internet]. Available from: http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html

13. Hannon G. FASTX-toolkit [Internet]. Available from: http://hannonlab.cshl.edu/fastx_toolkit/

14. Andrews S. FastQC [Internet]. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

15. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008;18(11):1851–8.

16. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology [Internet]. 2009 Jan [cited 2011 Jun 10];10(3):R25. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2690996&tool=pmcentrez&rendertype=abstract

17. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research [Internet]. 2008 Sep [cited 2013 Aug 6];18(9):1509–17. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2527709&tool=pmcentrez&rendertype=abstract

18. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. 2008;5(7):1–8.

19. Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature methods [Internet]. Nature Publishing Group; 2010 Sep [cited 2013 May 22];7(9):709–15. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3005310&tool=pmcentrez&rendertype=abstract

20. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature [Internet]. 2008 Nov 27 [cited 2013 May 21];456(7221):470–6. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2593745&tool=pmcentrez&rendertype=abstract

REFERENCES | 112

21. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, Maguire J, et al. Integrative analysis of the melanoma transcriptome. Genome research [Internet]. 2010 Apr [cited 2013 Aug 15];20(4):413–27. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2847744&tool=pmcentrez&rendertype=abstract

22. Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nature protocols [Internet]. Nature Publishing Group; 2010 Mar [cited 2013 Aug 9];5(3):516–35. Available from: http://www.ncbi.nlm.nih.gov/pubmed/20203668

23. Perkins TT, Kingsley R a, Fookes MC, Gardner PP, James KD, Yu L, et al. A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS genetics [Internet]. 2009 Jul [cited 2013 Aug 6];5(7):e1000569. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2704369&tool=pmcentrez&rendertype=abstract

24. Djebali S, Davis C a, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature [Internet]. 2012 Sep 6 [cited 2013 May 21];489(7414):101–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22955620

25. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biology direct [Internet]. 2009 Jan [cited 2013 Aug 8];4:14. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&tool=pmcentrez&rendertype=abstract

26. Gao L, Fang Z, Zhang K, Zhi D, Cui X. Length bias correction for RNA-seq data in gene set analyses. Bioinformatics (Oxford, England) [Internet]. 2011 Mar 1 [cited 2011 Jul 15];27(5):662–9. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3042188&tool=pmcentrez&rendertype=abstract

27. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) [Internet]. 2010 Jan 1 [cited 2012 Nov 2];26(1):139–40. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2796818&tool=pmcentrez&rendertype=abstract

28. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology [Internet]. 2010 Jan;11(3):R25. Available from:

REFERENCES | 113

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2864565&tool=pmcentrez&rendertype=abstract

29. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics [Internet]. 2012 Sep 17 [cited 2013 Aug 6]; Available from: http://www.ncbi.nlm.nih.gov/pubmed/22988256

30. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biology [Internet]. BioMed Central Ltd; 2010 Jan [cited 2013 Aug 6];11(10):R106. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3218662&tool=pmcentrez&rendertype=abstract

31. Trapnell C, Williams B a, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology [Internet]. Nature Publishing Group; 2010 May [cited 2013 May 21];28(5):511–5. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3146043&tool=pmcentrez&rendertype=abstract

32. Martin J a, Wang Z. Next-generation transcriptome assembly. Nature reviews. Genetics [Internet]. Nature Publishing Group; 2011 Oct [cited 2013 Aug 6];12(10):671–82. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21897427

33. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics (Oxford, England) [Internet]. 2011 Jul 1 [cited 2013 Aug 8];27(13):i275–82. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3117341&tool=pmcentrez&rendertype=abstract

34. Kong L, Zhang Y, Ye Z-Q, Liu X-Q, Zhao S-Q, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic acids research [Internet]. 2007 Jul [cited 2013 Aug 12];35(Web Server issue):W345–9. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1933232&tool=pmcentrez&rendertype=abstract

35. Wang L, Park HJ, Dasari S, Wang S, Kocher J-P, Li W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic acids research [Internet]. 2013 Apr 1 [cited 2013 Aug 21];41(6):e74. Available from:

REFERENCES | 114

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3616698&tool=pmcentrez&rendertype=abstract

36. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences of the United States of America [Internet]. 2003 Dec 23;100(26):15776–81. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=307644&tool=pmcentrez&rendertype=abstract

37. Salimullah M, Mizuho S, Plessy C, Carninci P. NanoCAGE: A High-Resolution Technique to Discover and Interrogate Cell Transcriptomes. Cold Spring Harbor Protocols [Internet]. 2011 Jan 4 [cited 2012 Nov 4];2011(1):pdb.prot5559–pdb.prot5559. Available from: http://www.cshprotocols.org/cgi/doi/10.1101/pdb.prot5559

38. Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, Lassmann T, et al. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nature methods [Internet]. Nature Publishing Group; 2010 Jul [cited 2012 Nov 10];7(7):528–34. Available from: http://dx.doi.org/10.1038/nmeth.1470

39. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics [Internet]. 2004 Mar [cited 2013 Aug 15];83(3):349–60. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0888754303003628

40. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. 2007;4(8):651–7.

41. Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell [Internet]. 2007 May 18 [cited 2011 Jul 17];129(4):823–37. Available from: http://www.ncbi.nlm.nih.gov/pubmed/17512414

42. Encode T, Consortium P. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS biology [Internet]. 2011 Apr [cited 2013 Feb 27];9(4):e1001046. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3079585&tool=pmcentrez&rendertype=abstract

43. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. An expansive human regulatory lexicon encoded in transcription factor

REFERENCES | 115

footprints. Nature [Internet]. Nature Publishing Group; 2012 Sep 6 [cited 2013 Aug 7];489(7414):83–90. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22955618

44. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature [Internet]. Nature Publishing Group; 2012 Sep 6 [cited 2013 Aug 6];489(7414):75–82. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3721348&tool=pmcentrez&rendertype=abstract

45. Zhang Y, Liu T, Meyer C a, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome biology [Internet]. 2008 Jan [cited 2011 Jun 10];9(9):R137. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2592715&tool=pmcentrez&rendertype=abstract

46. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research [Internet]. Oxford University Press; 2010;38(Database issue):D105–D110. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19906716

47. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Research [Internet]. Oxford University Press; 2001;29(1):281–3. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=29801&tool=pmcentrez&rendertype=abstract

48. Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Research [Internet]. Oxford University Press; 2011;39(Database issue):D124–D128. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013812&tool=pmcentrez&rendertype=abstract

49. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research [Internet]. 2012 Sep [cited 2013 Aug 8];22(9):1813–31. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3431496&tool=pmcentrez&rendertype=abstract

SUPPLEMENTAL FIGURES | 116

Supplemental Figures

S up plemental f igure 1: A gr aphical overview of sp lice cla sses detected by RAINMAN and spliceR. In a ddition, spliceR detects intron retent ion events, not show n here.

Alternative 5’ splice site (A5SS) Alternative 3’ splice site (A3SS)

Mutually exclusive exon (MXE)

Single exon skipping (SES) Multiple exon skipping (MES)

Alternative !rst exon (AFE) Alternative last exon (ALE)

SUPPLEMENTAL FIGURES | 117

Left exon Right exon

Junction x

Refseq E2 Refseq E3 Refseq E4

Refseq E1 Refseq E2 Refseq E4 Refseq E5

Gene y

Putative isoform

Refseq E1 Refseq E2 Refseq E4 Refseq E5

ORF STOP

Putative polypeptide

Translation

With annotated CDS: Without annotated CDS:

Refseq E1 Refseq E2 Refseq E4 Refseq E5 Scan for ORFs

ORF ORF ORF


STOP


STOP

STOP


Select best ORF based on length of polypeptide vs.transcript length. Optimize this ratio, minimizing false positives, using genes with known CDS.

A

B

C

ED

Reads

Combinatorial databasejunction

Left exon Right exon

Refseq E1 Refseq E5

S up plemental f igure 2: Stra tegy for pr edict ing coding potent ia l of t ranscrip ts bas ed on exon-exon junct io n evidence. In A, s ho rt sequenced reads ar e ma pped to a ar ti f ic ia l genome consis ting of al l pos sible exon-exon combinat ions for each gene. I n B and C, each a djoining exo n fr om the longest suppor ted refseq (or another repos itory) i sofo rm is concatena ted to the exons sup ported b y the junct io n. In D, if an annotated C DS exists, the p utat iv e i sof orm is translated and the pos it ion of the S TOP co do n is stored. In E, if no C DS exists, t ransla te in al l f rames, choosing a combinat ion o f lo ngest s up ported polyp ep tide and mos t upstream ORF.

Formula for breakthroughs in research: Take young researchers, put them

together in virtual seclusion, give then an unprecedented degree of freedom and

turn up the pressure by fostering competition

- JAMES WATSON, 1989 ” “

applications of high throughput nucleotide waage.pdf · applications of high throughput nucleotide...

Documents