APPLICATIONS OF HIGH THROUGHPUT NUCLEOTIDE SEQUENCING
by
JOHANNES WAAGE, M. SC.
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computational Biology and Bioinformatics
at
The PhD School of Science, Faculty of Science,
University of Copenhagen, Denmark.
Committee:
Assoc. Prof. JEPPE VINTHER, Chair, University of Copenhagen
Prof. ANDERS GORM PEDERSEN, Technical University of Denmark
Assoc. Prof. RICHARD SANDBERG, Karolinska Institutet, Sweden
Supervisors:
Prof. ALBIN SANDELIN (Principal), University of Copenhagen
Prof. BO PORSE, University of Copenhagen
THE BIOINFORMATICS CENTRE, DEPARTMENT OF BIOLOGY
AND
BIOTECH RESEARCH AND INNOVATION CENTRE
UNIVERSITY OF COPENHAGEN
September 1st, 201
“Biology has at least 50 more interesting years.”
- JAMES WATSON, 1989
Preface
This work is presented as the requirement for obtaining the PhD degree at the
Bioinformatics Centre, Department of Biology, Faculty of Science, University of
Copenhagen. The work was carried out under the supervision of, in no particular
order, associate professor Albin Sandelin (Bioinformatics) and professor Bo Porse
(Biotech Research and Innovation Centre) from mid-2010 to mid-2013. A portion
of the work leading to article I was carried out from mid-2008 to mid-2010 during
my master’s thesis. The work was funded in part by the European Research Council
(EU FP7 framework programme/ERC grant agreement 204135), and in part by the
Faculty of Science, University of Copenhagen.
Table of Contents
Preface
English Summary
Dansk Resumé
Acknowledgements
Author Publications Included In This Thesis
Author Publications Not Included In This Thesis
Abbreviations
Introduction
Introduction to articles
Sequencing and mapping issues
RNA-sequencing
Milestones and history
Normalization issues
Isoform de-convolution and splice class detection
Coding Potential Prediction
Processing and analysis of CAGE-data
Chromatin Immunoprecipitation coupled with sequencing
Milestones and history
Peakfinding and normalization
Motif discovery
Concluding Remarks
Original Articles
Article I
Article II
Article III
Article IV
References
Supplemental Figures
English Summary
The recent advent of high throughput sequencing of nucleic acids (RNA and
DNA) has vastly expanded research into the functional and structural biology of the
genome of all living organisms (and even a few dead ones). With this enormous and
exponential growth in biological data generation come equally large demands in data
handling, analysis and interpretation, perhaps defining the modern challenge of the
computational biologist of the post-genomic era.
The first part of this thesis consists of a general introduction to the history,
common terms and challenges of next generation sequencing, focusing on frequently
encountered problems in data processing, such as quality assurance, mapping,
normalization, visualization, and interpretation.
The second part presents scientific work addressing problems in two sub-genres
of next generation sequencing.
For the first flavor, RNA-sequencing, a study is presented of the effects of
knockout (KO) of the nonsense-mediated RNA decay system on alternative RNA
splicing in Mus, using digital gene expression and a custom-built exon-exon
junction mapping pipeline (article I). Evolved from this work, a Bioconductor
package, spliceR, is presented for classifying alternative splicing events and
coding potential of isoforms from full isoform deconvolution software, such as
Cufflinks (article II). Finally, a study using 5’-end RNA-seq for alternative
promoter detection between healthy patients and patients with acute
promyelocytic leukemia is presented (article III).
For the second flavor, DNA-seq, a study presenting genome-wide profiling of
transcription factor CEBP/A in liver cells undergoing regeneration after partial
hepatectomy (article IV) is included.
Dansk Resumé (Danish Summary)
The arrival of the new nucleic acid sequencing technologies has greatly expanded
the horizons and possibilities of research into the functional and structural biology
of the genomes of all living (and a few extinct) organisms. This enormous and
exponential growth of biological data brings with it equally large demands on the
handling, analysis and interpretation of data, and thus defines the future challenge
for the bioinformatician of the post-genomic era.
The first part of this thesis presents a general introduction to the history,
terminology and challenges of next-generation sequencing, focusing on typical
problems in data handling, such as quality assurance, mapping (placing sequenced
read fragments on the chromosome), normalization and interpretation.
The second part presents 4 scientific works that attempt to solve current
problems within two subtypes of sequencing.
For the first type, RNA sequencing, a study is presented that charts the effects
of knockout of nonsense-mediated RNA decay on alternative splicing in Mus, using
digital gene expression and a custom-built pipeline for mapping exon-exon
junctions. Following on from this, a Bioconductor package, spliceR, is presented,
which classifies alternative splicing events and the protein-coding potential of
RNA isoforms generated by software such as Cufflinks. Finally, a study is presented
that uses RNA sequencing to find differentially used promoters between healthy
patients and patients diagnosed with acute promyelocytic leukemia.
For the second type, DNA sequencing, a study is presented concerning a full
genomic profiling of the binding of the transcription factor CEBP/A in murine
liver cells undergoing regeneration after partial hepatectomy.
Acknowledgements
A sincere thank-you to my bosses Bo and Albin for being not just that when
things get rough.
Thank you to both the Porse group and the whole of Upper Binf for good
times!
Author Publications Included In This Thesis
Sorted by date (oldest first). Underlined articles are included in this thesis.
1. Waage, J*., Weischenfeldt, J*., Tian, G., Zhao, J., Damgaard, I., Jakobsen, J.
S., ... & Porse, B. T. (2012). Mammalian tissues defective in nonsense-mediated
mRNA decay display highly aberrant splicing patterns. Genome Biol, 13, R35.
2. Jakobsen, J. S., Waage, J., Rapin, N., Bisgaard, H. C., Larsen, F. S., &
Porse, B. T. (2013). Temporal mapping of CEBPA and CEBPB binding
during liver regeneration reveals dynamic occupancy and specific regulatory
codes for homeostatic and cell cycle gene batteries. Genome Research, 23(4),
592-603.
3. Knudsen, K., Porse, B. T., Sandelin A., Waage, J. (2013). spliceR: An R
package for classification of alternative splicing and prediction of coding
potential from RNA-seq data. Manuscript under preparation (Bioinformatics
format included)
4. Waage, J*., Boyd, M.*, Theilgaard-Mönch, K., Mora-Jensen, H., Carninci,
P., Rapin, N., … & Sandelin, A. (2013). Genome-wide mapping of transcription
start sites reveals deregulation of full-length transcripts in acute promyelocytic
leukemia. Under preparation.
*=shared first author
Author Publications Not Included In This Thesis
1. Thoren, L. A., Nørgaard, G. A., Weischenfeldt, J., Waage, J., Jakobsen, J. S.,
Damgaard, I., ... & Porse, B. T. (2010). UPF2 is a critical regulator of liver
development, function and regeneration. PLoS ONE, 5(7), e11650.
2. Hasemann, M., Lauridsen, F. K. B., Waage, J., Jacobsen, J. S., Frank, A.,
Schuster, M., … & Porse, B.T. (2013) Submitted to PLoS Genetics.
3. Leucci, E., Patella, F., Waage, J., Holmstrøm, K., Lindow, M., Porse, B.T.,
… & Lund, A. H. (2013). MicroRNA-9 targets the long non-coding RNA
MALAT1 for degradation in the nucleus. Scientific Reports, 3, 2535
Abbreviations
(List of abbreviations used in thesis text, not including articles).
A3SS Alternative 3’ Splice Site
A5SS Alternative 5’ Splice Site
AFE Alternative First Exon
ALE Alternative Last Exon
AML Acute Myeloid Leukemia
APL Acute Promyelocytic Leukemia
AS Alternative Splicing
ATSS Alternative Transcription Start Site
ATTS Alternative Transcription Termination Site
BMM Bone Marrow-derived Macrophages
CAGE Cap Analysis of Gene Expression
ChIP Chromatin Immunoprecipitation
ENCODE ENCyclopedia Of DNA Elements
FANTOM Functional ANnotaTion Of the Mammalian genome
FPKM Fragments per Kilobase per Million mapped
IC Information Content
MES Multiple Exon Skipping
MX Mutually Exclusive exon
NMD Nonsense Mediated Decay
ORF Open Reading Frame
PTC Premature Termination Codon
PWM Position Weight Matrix
RPKM Reads per Kilobase per Million mapped
SES Single Exon Skipping
TMM Trimmed Mean of M-values
TPM Tags Per Million mapped
TSS Transcription Start Site
Introduction
The history of nucleic acid sequencing started with the complete sequencing of
the RNA gene of a bacteriophage coat protein in 1972 (1), a few years before
Frederick Sanger published the chain-termination sequencing method (2). An
impressive feat at that time, this deduction of the full
387 nucleotides of sequence marked perhaps the earliest beginning of the era of the
genome – an era which is currently experiencing its renaissance in the 21st century
with the advent of the next generation of nucleotide sequencing.
From the early 1970s to the late 1990s, technological advances in sequencing
were significant but somewhat slow compared to the evolution of the microprocessor
(driving many other scientific advances), and large full-genome sequencing projects
required large machine parks and extensive collaborative efforts. Following the first
fully sequenced genome, the 5386-nucleotide DNA of the phiX174 phage (3) (later
also the first genome to be fully synthesized and assembled in vitro (4)), several small
genomes of prokaryotes were sequenced, and by the late eighties, an international
consortium of more than 74 laboratories and 600 scientists began a 7-year effort to
sequence the first eukaryotic and large genome, that of Saccharomyces cerevisiae.
Completed in 1996 (5), this 12-megabase genome was a huge step forward in terms of
sheer size, but the 3000-megabase human genome was still a daunting, distant and
unfeasible task. But the foundation was laid, and from the yeast genome project (and
other early sequencing projects) sprang a plethora of technological and computational
disciplines and methods. Perhaps just as important, from these early sequencing
projects rose the cross-disciplinary field of bioinformatics, combining sequence analysis
with mathematical and statistical models of machine learning, allowing for sequence
alignment, gene prediction, structure prediction, and so on. An iconic example of an
early bioinformatics algorithm, BLAST, published in 1990 (6), was perhaps the
foundation of fast computational analysis of biological sequence.
The obvious culmination of the first era of sequencing was the complete
deduction of the human genomic sequence, an effort first planned in the early 80s,
initiated in 1991, and completed in 2001. In what became a multi-party race to the
finish, the private company Celera intended to complete both sequencing and
assembly first, raising fears that genes and genomic sequence could be patented and
not be available in the public domain for research. Initial efforts to assemble the
genome from the public sequencing consortium were unsuccessful, but on June 22,
2000 GigAssembler (7), an assembler written by Jim Kent, then a graduate student at
the University of California, Santa Cruz (UCSC), completed the job just days before
Celera, and a few weeks later, the first draft of the human genome was made available
at the newborn human genome browser at UCSC (8), releasing the genome into the
public domain.
Concurrent with the human genome project, still based on the Sanger
sequencing methods invented in the 1970s, companies were trying to reinvent
nucleotide sequencing. Several full-fledged automated high-throughput sequencers
appeared on the scene around 2005, and underwent several iterative improvements to
speed, accuracy and cost over the next 8 years. At the time of writing, such machines
are relatively widespread, and used in a number of research disciplines.
Omicsmaps.com, a user-updated map of sequencers in academia, shows 23 machines in
Denmark alone.
Three platforms currently dominate the market, based on different approaches
for deducing nucleotide sequence in parallel. Roche/454's pyrosequencing is based on
fixing DNA to micro-beads, followed by amplification and sequencing, and currently
supports read lengths up to 1000 nt, its main advantage. Applied Biosystems' SOLiD
is based on ligation and two-base encoding (colorspace), and has high accuracy as its
primary advantage. Illumina's Solexa platform is based on sequencing-by-synthesis,
and has the highest throughput and the lowest cost of the three (9). As of 2013,
Illumina has by far the largest market share (10), and all of the work included in this
thesis is based on data from this platform. The different sequencing platforms will not
be covered in more detail here; for a comprehensive comparison and overview,
including advantages and disadvantages, please refer to (9).
Much of the work presented here was initiated between 2008 and 2010, a time
when the field of high-throughput sequencing and its sub-disciplines, such as
RNA-seq and ChIP-seq, were in their infancy. Working with the data presented an array of
interesting (and some not so interesting) problems with few ready-made tools and
pipelines available, necessitating a healthy dose of experimentation. Fortunately, the
bioinformatics discipline borrows a lot of its mentality and affinity for open source
from computer science in general, and from early on, researchers and programmers
were sharing experiences, approaches and source code online.
Introduction to articles
The articles presented herein represent the main work of this dissertation, as
well as some additional work initiated during my master's thesis (article I). Here, a brief
motivation and introduction to the subject matter, as well as an overview of my
contributions and approaches to each work, are presented. A more detailed treatise on
the methodologies used follows in each section of the general introduction, referring
to each article as needed.
Obviously, the bioinformatic method cannot be wholly separated from the
underlying biology, but it should be noted that the aim of this thesis is not to carry the
reader through the different biological systems that are the focus point of the articles
included (although a brief introduction is given beforehand), but rather to investigate
the history, techniques, challenges and solutions in current applications of
next-generation sequencing, and selected examples of how these were used in the included
works.
Paper I: NMD and alternative splicing – Genome Biology (2012)
In eukaryotes, most multi-exon genes undergo alternative splicing, the process
of regulated shuffling of genomic coding entities (exons) in a combinatorial manner,
creating several distinct mRNA molecules (and thereby protein products) from a
single gene unit, and this process is believed to be a significant contributor to
complexity of higher organisms. Such flexibility in any biological system often carries
with it a level of noise, and in this case, a potentially detrimental kind.
NMD or nonsense mediated decay is a general mRNA pathway in eukaryotes
that senses aberrantly spliced transcripts, or transcripts that would otherwise lead to
potentially dysfunctional protein products, and removes these by means of nucleolysis
and degradation.
For this paper, originally planned in early 2008 (at the dawn of RNA-seq
technology), we sought to take advantage of the new digital platform to carry out the
first genome-wide and non-discriminative analysis of the effects of NMD KO
(facilitated by a murine KO of the essential NMD-factor Upf2) in a mammalian
system on the global splicing pattern as a whole. The study gave rise to several
bioinformatic challenges: dealing with the nascent RNA-seq data as a whole; creating
an artificial exhaustive genome of exon-exon splice junctions facilitating deconvolution
of full isoforms based on junction evidence (this was before paired-end sequencing was
widespread), predicting coding potential of each transcript, and downstream
interpretative analyses, including splice class classification, motif search, and sequence
conservation assessment. A challenge for this data was to meaningfully describe,
quantify and visualize “noisy” transcripts rescued by NMD KO, without setting
inhibitory filter thresholds that would otherwise remove these. Perhaps in some sense
a boon for this project, using biological replicates for sequencing was not feasible cost-
wise, and post-processing with replication would potentially have eliminated some of
the “stochastic” noisy splicing.
Overall, the study was successful, and we found that NMD KO upregulates a
large population of non-productive mRNAs through splice factor deregulation, and
extended the understanding of NMD from an mRNA quality pathway to a regulator
of several layers of gene expression. The study was published in Genome Biology in
2012 (11).
Paper II: a package for splice class detection – under preparation (formatted as a
Bioinformatics Application Note) (2013)
Our NMD-paper produced a more-or-less ready to run pipeline (RAINMAN)
for the community to run on their own RNA-seq data. But by 2012, paired-end
RNA-sequencing was the norm, and serious advances had been made in the field of
full-length isoform deconvolution based on robust statistical methods, including both
expectation–maximization and Bayesian methods. One such tool was
Cufflinks, part of the popular Tuxedo suite for RNA-seq analysis; Cufflinks was,
among many other studies, used for the assembly of RNA-seq data in the large-scale
ENCODE study of late 2012.
As such, RAINMAN was outdated, and Kristoffer Knudsen (master's student in
the Sandelin lab) and I set out to write a new tool for splice class detection and coding
potential prediction based on full-length isoforms. From this emerged spliceR, an R
and Bioconductor package that allows for user-friendly annotation of assembled
RNA-seq data from Cufflinks (or any other full-length assembler), and outputs a range
of useful data, including isoform fraction switch scores and genomic locations of
differentially spliced elements for downstream analyses. spliceR was accepted into the
Bioconductor 2.13 release in October 2013 (12), and the manuscript is currently under
preparation.
Paper III: Alternative promoter usage in acute promyelocytic leukemia – awaiting
experimental validation.
A different application for RNA-seq is CAGE-seq, or Cap Analysis of Gene
Expression. This method effectively captures small sequence fragments from the 5’ end
of capped RNA molecules, which are in turn sequenced, and processed to generate a
genome wide map of promoter usage. In humans (and other higher organisms), genes
often have more than one transcription start site, allowing for a diversification of the
resulting protein product, similar to and in concert with alternative splicing. For this
study, we took advantage of an in-house cell-sorting facility and expertise combined with
access to samples from patients with acute promyelocytic leukemia (APL). The data
for this paper was particularly challenging, as isolatable material from the (liquid)
tumors was scarce, resulting in low input material for the method. This generated a lot
of technical noise, and necessitated rigorous filtering methods. After processing, we
found promoters with upregulated usage in APL, which were often located
downstream of annotated promoters, while promoters supporting the annotated
longest transcript variants were commonly downregulated in cancer. This manuscript
has been prepared, and is awaiting experimental validation of selected targets.
Paper IV: Transcription factor profiling in liver regeneration – Genome Research
(2013)
Another major advance for biology brought about by the high-throughput era
was the coupling of traditional transcription factor binding site profiling by ChIP to
sequencing, a field previously limited by the microarray platform’s restricted and
pre-selected number of probes. In this study we presented one of the first genome-wide
studies of temporal transcription factor dynamics – the binding patterns of key
hepatocyte factors CEBPA and CEBPB during liver regeneration following partial
hepatectomy in a mouse model.
Combined with RNA polymerase II binding data, as well as a large array of
in-silico binding predictions of other transcription factors, we described a circuitry of
temporal codes, binding classes, and mutually exclusive motifs. The challenges for this
study were many, and included handling and processing (at that time nascent)
ChIP-seq data, peak calling, filtering, and clustering using unsupervised
approaches. In addition, considerable effort went into generating a high information /
low redundancy set of transcription factor motifs from three large repositories
(JASPAR, TRANSFAC and UniPROBE).
Sequencing and mapping issues
A number of issues pertaining to read length, read quality, trimming and
mapping are shared between sequencing platforms and applications. Although these
are not among the most exciting issues of sequence analysis, this short section briefly
covers the most common challenges, and how some of these were overcome in the
presented works.
Sequence data generated by current platforms is exported in the FASTQ
format, a sequence format that combines the raw sequence of the FASTA format with
per-base quality data. Each platform has its own idiosyncrasies in terms of sequencing quality;
the Illumina platform, for instance, generally has low quality towards the edges of the
flow cell, at the very beginning of the sequencing run, and trailing off towards the end
of the run (data not shown). Most common workflows include a read quality
visualization and trimming step for ensuring optimal mapping. In addition, this step
can include the removal of sequencing adapter sequences, and of identical reads arising
from PCR amplification artifacts. Separate from the analytic and interpretive side, a
large amount of time in bioinformatics is spent readying and “massaging” data. For
article I, this was done manually in Perl and Python. For articles III and IV, this was
done using FastX-tools (13) and FastQC (14).
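As an illustration of the quality trimming step described above, the following minimal Python sketch reads FASTQ records and removes low-quality bases from the 3' end of each read. It is not the code used for article I, nor a reimplementation of FastX-tools; the function names, the quality threshold of 20 and the Phred offset of 33 are assumptions made for the example (older Illumina pipelines used an offset of 64).

```python
from io import StringIO

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a stream of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical record: eight high-quality bases followed by two poor ones.
example = "@read1\nACGTACGTAC\n+\nIIIIIIII#!\n"
for name, seq, quals in read_fastq(StringIO(example)):
    trimmed_seq, _ = trim_3prime(seq, quals)
    print(name, trimmed_seq)
```

Trimming only from the 3' end reflects the typical decline in quality towards the end of a run; a real workflow would usually also remove adapters and discard reads that become too short to map uniquely.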
Perhaps the pivotal step in sequence processing is the mapping of short
sequenced fragments back to the relevant genome, a field where significant progress
has been made to accommodate the much larger amounts of data since the early days
of BLAST (6). The seed-and-extend approach of BLAST is still applicable in some
cases, but modern mappers applying more efficient forms of genome indexing have
significantly increased the speed of mapping, a necessity when dealing with millions of
reads. Apart from the ELAND mapper proprietary to the Illumina pipeline, MAQ
was one of the earliest attempts at a fast short-read mapper (15), using a hash table of the
genome as a look-up index and incorporating the quality scores into the mapping algorithm.
MAQ was a bit on the slow side (typically requiring > 1000 CPU hours for larger sequencing
projects on workstations at the time of publishing), and a big stride forward was made with
Bowtie (16), which utilizes the Burrows-Wheeler transform for encoding the reference
genome. This transform, published at computer company Digital Equipment Corporation
in 1994 as a lossless compression algorithm, proved to be an ingenious way of
compressing the genomic index to a size manageable by most workstations (<2GB RAM),
while at the same time allowing for very rapid lookup of sequences. The Bowtie suite of
tools has since been expanded with an RNA-seq assembler and isoform deconvolver
(described later), and is at the time of writing still the most widely used short read
mapper in research (based on number of citations).
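To make the idea of Burrows-Wheeler-based lookup concrete, the sketch below builds a toy index of a short sequence and finds exact matches by backward search. It is an illustration of the general technique only, not Bowtie's implementation: the construction is naive (it sorts all suffixes in memory), mismatches and quality scores are ignored, and the function names are made up for the example.

```python
def bwt_index(text):
    """Build suffix array, C table and Occ table for text (naive construction)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$')
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def find(pattern, index):
    """Return sorted start positions of exact matches, using backward search."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("ACGTACGTTACG")
print(find("ACG", index))
```

The search cost depends on the length of the read rather than on the length of the reference, which is what makes the approach attractive for millions of short reads. Practical indexes store only samples of the suffix array and of the occurrence table to keep memory use low.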
RNA-sequencing
Like the first fully sequenced biological polymer in 1972, the first two works
included in this thesis are based on sequencing of RNA. This section will briefly cover
the history and important landmarks over the last few years of RNA-sequencing by
next generation technologies, and will subsequently dig deeper into common problems
and issues, with the presented works as main focus.
Milestones and history
RNA-seq was introduced to the research community by a group of 5 papers
published at approximately the same time, but the first significant paper using
RNA-seq and paving the way for digital gene expression was by Mortazavi et al. in 2008. An
important benchmark paper, their results showed high inter-sample fidelity between
technical replicates (later confirmed in (17), arguing that technical replicates are per se
not needed in RNA-seq), and, using in-vitro synthesized transcripts as spike-ins, high
sensitivity and dynamic range, establishing the method as a serious competitor to the
previously ubiquitous microarray-based gene expression platform (18). This work also
established the first simple normalization and quantification methods and metrics, and
showed, perhaps surprisingly, a correlation with arrays with an R² of only around 0.7.
More recent advances in data processing and analysis have increased this correlation to
around 0.8 (19). Whether the remaining difference owes to the shortcomings of the
array platform, the immaturity of RNA-seq processing, or a combination of the two, is
not yet readily discernible.
Later the same year, the first global study of alternative splicing using RNA-seq
was published in Nature by the Burge lab (20). Surprisingly, this study found that
more than 90% of all human genes undergo alternative splicing. Classifying eight
different types of alternative events, authors found strong tissue-specific and highly
conserved “switch-like” splicing events. To accomplish this extensive mRNA isoform
discovery, reads were mapped to an artificial custom genome based on a transcript
database of all possible exon-exon junctions, in turn allowing capture of all possible
splice events (between the annotated exons from the selected repositories, at least).
Until recent advances in full isoform deconvolution, this approach was popular, and
was also employed in a modified version by the RAINMAN pipeline presented in
paper I (as described below).
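The junction-library idea described above can be sketched in a few lines of Python. This is an illustrative simplification and not the implementation used in (20) or in RAINMAN: the function name, the coordinate convention (0-based, half-open, plus strand only) and the toy sequence are all assumptions made for the example.

```python
from itertools import combinations


def junction_library(chrom_seq, exons, read_len):
    """Return {(upstream_exon, downstream_exon): junction sequence}.

    exons: (start, end) tuples for one gene, 0-based half-open, plus strand.
    Each side contributes at most read_len - 1 bases, so a read of length
    read_len can only align to an entry if it crosses the junction itself.
    """
    flank = read_len - 1
    library = {}
    # All ordered pairs of non-overlapping exons, not only annotated neighbours
    for (s1, e1), (s2, e2) in combinations(sorted(exons), 2):
        if e1 >= s2:  # overlapping or abutting exons: no intron to splice out
            continue
        left = chrom_seq[max(s1, e1 - flank):e1]
        right = chrom_seq[s2:min(e2, s2 + flank)]
        library[((s1, e1), (s2, e2))] = left + right
    return library


# Hypothetical locus: exons in upper case, introns and flanks in lower case.
chrom = "ttttAAAAAAggggCCCCCCggggTTTTTTtttt"
exons = [(4, 10), (14, 20), (24, 30)]
for pair, seq in junction_library(chrom, exons, read_len=4).items():
    print(pair, seq)
```

Because every pairwise combination of exons within a gene is included, exon skipping events appear as hits to junctions between non-adjacent exons. The number of entries grows quadratically with the number of exons per gene, which is one practical cost of this approach.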
Much progress followed in the following years, not all relating to digital gene
expression and alternative splicing. RNA-seq found other interesting usages including
sequencing of non-coding RNA-species, assessment of allele-specific expression and
detection of fusion genes (one of the first works on this is (21)). Technical advances
were also made, including single cell RNA-seq (22) and strand-specific RNA-seq (23).
Similar advances were made on the software side (some will be discussed later),
including software packages for normalization, differential calling, and visualization,
and reference-free assembly (they are too numerous to be referenced here).
Figure 1: Isoform expression results from the ENCODE project, based on isoform assembly from RNA-seq data. Adapted from Djebali et al., 2012.
Perhaps the pinnacle of research in genome-wide transcription profiles was the
publication of the ENCODE consortium's large-scale sequencing of the transcriptome
of 15 human cell lines in 2012 (24). This study validated the previous claim of almost all
human multi-exon genes undergoing alternative splicing, and showed that at least 60%
of the full human genome is transcribed (but not necessarily translated), perhaps the
final nail in the coffin of “junk DNA”. Using TopHat/Cufflinks for isoform assembly
(discussed below), ENCODE made several interesting findings. One of the
more curious ones showed that as the number of annotated isoforms increases for a
given gene, the fraction of the most expressed isoform (the major isoform) decreases
(signifying biological importance of multiple isoforms, figure 1a), but that most genes
reach a plateau at 10-12 isoforms (perhaps
signifying a rate-limiting or energy-wise prohibitive value of expressing too many
differentially spliced isoforms, figure 1b). This is a
good example of a bias-free finding facilitated by sequencing, which would have been
impossible just 10 years ago with microarray technology.
In the relatively short life of RNA-seq, the field has undergone rapid
advances, and methods are ever evolving. This sometimes translates to older methods
being rendered obsolete by newer ones, as will be evident in the section about
isoform deconvolution.
Figure 2: RNA-seq normalization. Left: RNA-seq data from Upf2 KO mouse liver (data from article I) after RPKM normalization. Red dots represent the 50 most expressed genes in WT, green and pink dots represent 50 randomly sampled genes from the 2nd and 4th bin of an 8-bin quantization of expression, respectively. Note how neither subset straddles the black line of no DE. Right: same data after quantile normalization, one of many normalization methods available. Note: this figure is for illustrative purposes, and does not relate to the method used in paper I.
Normalization issues
A pervasive RNA-seq challenge is the problem of normalization, an issue that has been
apparent in almost all data this author has worked with. In this context,
normalization is understood as adjusting the count values that serve as a proxy for
transcript or gene expression in a way that corrects for unwanted (often technical) bias
and makes comparisons between and within samples meaningful. Normalization methods
can roughly be divided into two groups. The first group, intra-sample methods, normalizes
each given data point (transcript or gene count) using a specific approach. One of the
earliest proposed intra-sample methods was RPKM (Reads Per Kilobase of
transcript length per Million mapped reads), which simply normalizes the coverage of a
given transcript to the transcript length and the sequencing depth (18). Even though the
RPKM method has been shown to have some pitfalls in relation to gene length bias (in
particular the over-estimation of lowly expressed transcripts) (25,26), it is still a
popular method and readily gives an assessment of the relative levels of transcripts
within a sample. Another method, borrowed from microarray analysis, is quantile
normalization.
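As a minimal sketch of the RPKM calculation described above (the function and argument names are illustrative and are not taken from any pipeline used in this thesis), the normalization amounts to dividing the read count by the transcript length in kilobases and by the library size in millions of mapped reads:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Hypothetical helper for illustration only.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_mapped


# 500 reads on a 2 kb transcript in a library of 10 million mapped reads
example_value = rpkm(500, 2_000, 10_000_000)
```

Because both divisors are per-sample or per-transcript constants, RPKM cannot correct for a shift in the overall composition of the RNA pool, which is the problem discussed next.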
RNA-seq doesn’t disclose any information about the total RNA levels in the
cell. This has sometimes been dubbed the “sequencing real estate” problem and stems
from the notion that in some experiments, as a given sample is sequenced at a given
depth, larger groups of RNA species can be more or less abundant in only one sample.
After RPKM normalization is applied, ranges of expression can be skewed
towards or away from this sample (over- or undersampling), although this does not
represent an actual difference in expression (but rather a difference in absolute RNA
abundances), and this is potentially dangerous in downstream significance calling.
An example of a skewed RNA-seq expression profile is given in figure 2. Here, the
highly expressed range of genes is skewed towards the WT, resulting in undersampling
of the remaining expression space. This phenomenon is somewhat akin to
issues of probe saturation found in microarrays, but additionally gives rise to more
peculiar distributions in inter-sample expression comparisons, where differences in
absolute abundances in certain ranges of expression may produce a local skew, e.g.
“banana-like” scatterplots (data not shown). The second group of normalization
methods, inter-sample methods, generally calculates normalization factors to be
applied equally to all data points for a given sample to account for differences in the
true abundance of RNA. The TMM method, used in paper I and part of the edgeR
Bioconductor package (27), assumes that technical variation can be modeled using the
Poisson distribution, and biological variation using the negative binomial distribution,
an assumption that also holds true for other sequencing data, including ChIP-seq. Using
this variance as weights, TMM proposes an inter-sample normalization procedure
that calculates a normalization factor based on the trimmed mean of log expression
ratios, under the assumption that most genes are not differentially expressed (28).
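The core idea of TMM can be sketched as follows. This is a simplified illustration and not the edgeR implementation: it takes an unweighted mean where the published method applies precision weights, and the function name and the trimming fractions (30% of the log ratios, 5% of the mean log expression) are assumptions chosen for the example.

```python
import math


def tmm_factor(sample, reference, trim_m=0.30, trim_a=0.05):
    """Simplified (unweighted) trimmed mean of M-values for one sample.

    `sample` and `reference` are lists of raw counts for the same genes.
    Hypothetical helper; real analyses should rely on an established package.
    """
    n_s, n_r = sum(sample), sum(reference)
    values = []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log ratios are undefined for zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m = math.log2(p_s / p_r)         # log expression ratio (M)
            a = 0.5 * math.log2(p_s * p_r)   # mean log expression (A)
            values.append((m, a))
    if not values:
        raise ValueError("no genes with non-zero counts in both samples")

    n = len(values)

    def kept_indices(column, trim):
        # Indices that survive trimming `trim` of the genes from each tail.
        order = sorted(range(n), key=lambda i: values[i][column])
        cut = int(math.floor(n * trim))
        return set(order[cut:n - cut])

    keep = kept_indices(0, trim_m) & kept_indices(1, trim_a)
    if not keep:
        raise ValueError("trimming removed all genes")
    mean_m = sum(values[i][0] for i in keep) / len(keep)
    return 2 ** mean_m
```

In this sketch, a factor below 1 indicates that a few highly abundant species occupy a large share of the sample's reads, so the remaining genes appear under-sampled relative to the reference; multiplying the library size by the factor gives an effective library size for downstream comparisons.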
Since the publication of paper I, several methods have been proposed in
the literature regarding RNA-seq normalization, and it’s now evident that this issue is
just as pertinent for RNA-seq as it was for microarrays. The results from one recent
comparison study including several methods (29) confirmed earlier suspicions that
RPKM (and by extension, the FPKM metric used by Cufflinks) is a bad choice
due to gene length bias issues, and found DESeq (30) and TMM normalization to be
the most robust (Figure 3). A combination of RPKM and TMM normalization, as was
applied in paper I, wasn’t tested, but is expected to inherit some of the strengths and
weaknesses of both methods.
Isoform de-convolution and splice class detection
Initial approaches to detecting alternative splicing were based on the creation of
artificial genomes, consisting of all possible exon-exon junctions generated from an
mRNA repository of choice. For article I, we developed our own pipeline based on
this approach, RAINMAN. This method, primarily motivated by single-end
sequencing being the available data, and by its relative simplicity, has a few inherent
pitfalls. Firstly, each alternative splicing event is only detected based on co-occurrence
of two exons, and the inter-dependence between multiple events cannot be
determined. In our case (article I), the observation that more than 95% of junctions
mapped to concurrent exons of annotated, productive and coding mRNAs provided
the argument that although we couldn’t detect multiple splice events within the same
messenger, those would statistically be relatively rare (0.05^2 = 0.0025), assuming no
dependency between events. We dared to make this normally dangerous assumption, as
noisy splicing was thought to happen in a random fashion.
Figure 3: A comparison of inter- and intra-sample normalization methods for RNA-seq. TC = total count, UQ = upper quartile, Med = median, DESeq = normalization by the R package of the same name, TMM = trimmed mean of M-values (edgeR), Q = quantile, RPKM = reads per kilobase per million mapped, Raw Count = raw read counts (no normalization). Different normalization methods produce widely different results depending on their approach, and whether they’re intra- (Q and RPKM) or inter-sample (the rest). Adapted from Dillies et al., 2012.
Under this assumption, the remaining exons of the longest supported
annotated isoform from an array of public repositories (RefSeq, UCSC knownGene,
Ensembl, etc.) were attached to each exon of the respective junction, and thus the
full-length isoform was inferred and, later, its coding potential predicted. This exhaustive
exon-exon junction catalog also provided a simple but suitable platform for inferring
alternative splicing events and the effects of NMD KO on AS. We chose to detect
the classes of AS recognized in the literature as the minimum informative set, namely
single and multiple exon-skipping (SES/MES), alternative 5’ and 3’ splice sites
(A5SS/A3SS), alternative first and last exons (AFE, ALE) and mutually exclusive exons
(MX). A graphical overview of these classes is given in Supplemental figure 1. The
detection algorithm was based on an iterative approach. In the first step, for each
given splice junction, junctions with similar left or right exonic edges (or coordinates)
were isolated. Next, the connections and supporting exons of these junctions were
included in the model, and depending on the trace back to the original junction, and
the inclusion or exclusion of supportive or inhibitive junctions and exons, the splice
class was deduced. An example is given in Figure 4 for two splice classes. For (A),
two junctions (contJunc1 and contJunc2) sharing their left and right coordinates,
respectively, with the single junction testJunc, and with evidence supporting only one
exon between those two junctions, classify as SES. For (B), a junction (testJunc) sharing
right coordinates with another junction (juncShareRighty) and left coordinates with an
exon, which itself has support for an additional exon sharing left coordinates but with
different right coordinates, from which a junction (juncShareRightx) has identical right
coordinates to juncShareRighty, classifies as A5SS. Similar logic was applied for the
remaining splice classes. It should be noted that, although desirable, exon read
coverage was not included in the classification, as time did not permit this feature to be
implemented.
Figure 4: Classification of alternative splicing events based on exon-exon junction evidence. (A) Single exon skipping. (B) Alternative 5’ splice site. Refer to text for details.
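The single exon skipping case (A) can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions and not the RAINMAN code: junctions are represented as (left, right) coordinate pairs, exons as (start, end) pairs, an exon is taken to start at the right coordinate of its upstream junction and end at the left coordinate of its downstream junction, and the function name is hypothetical.

```python
def is_single_exon_skip(test_junc, junctions, exons):
    """Return True if `test_junc` skips exactly one supported exon.

    test_junc: (left, right) coordinates of the junction being classified.
    junctions: iterable of (left, right) pairs with read support.
    exons:     iterable of (start, end) pairs with support.
    """
    left, right = test_junc
    exon_set = set(exons)
    skipped = set()
    for cont1 in junctions:
        # contJunc1 shares its left coordinate with testJunc and ends inside it
        if cont1[0] != left or cont1[1] >= right:
            continue
        for cont2 in junctions:
            # contJunc2 shares its right coordinate with testJunc
            if cont2[1] != right or cont2[0] <= cont1[1]:
                continue
            candidate = (cont1[1], cont2[0])  # exon bridging the two junctions
            if candidate in exon_set:
                skipped.add(candidate)
    return len(skipped) == 1


junctions = [(100, 500), (100, 200), (300, 500)]
exons = [(1, 100), (200, 300), (500, 600)]
ses_call = is_single_exon_skip((100, 500), junctions, exons)
```

Multiple exon skipping and the remaining classes would need additional traversal of the junction graph, which this sketch deliberately leaves out.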
As a convenience, visualization of splice junctions color-coded by splice class
for export to a genome browser was included in the RAINMAN
software. Figure 5 shows a classical example of exon-exon junction coverage of the
NMD-regulated splice factor Srsf7, color-coded by splicing class. In NMD KO
tissues, isoforms including a PTC-containing exon (marked in yellow), normally
spliced in by the splicing factor itself in an auto-regulatory manner, are rescued.
For the work presented, this platform was sufficient and successful in detecting
a marked phenotype of unproductive PTC+ transcripts and a global alteration in
alternative splicing in the two NMD KO systems utilized.
Figure 5: Screenshot from the UCSC genome browser, showing bone marrow-derived macrophage (BMM) RNA-seq coverage in Upf2 WT vs KO of the NMD-regulated splice factor Srsf7 (Sfrs7). Green junctions represent alternative splice sites and red junctions indicate single exon skipping. Numbers above junctions including or skipping the yellow auto-regulatory STOP-containing exon are relative expression values.
Around 2010, paired-end sequencing was becoming the norm in RNA-seq. Here, both
ends of the library insert are sequenced and can be mapped to the genome. This
information, combined with the known insert size as well as exon coverage and junction
evidence, enables modern RNA-seq assemblers to attempt full-length transcript
de-convolution. Perhaps the most widely used and best cited is the Cufflinks assembler
(31), part of the ubiquitous Tuxedo suite from the hand of Cole Trapnell. Cufflinks
achieves de-convolution by a series of simple steps (Figure 6). After initial mapping of
reads to the genome and de novo splice junction detection (A), all mutually incompatible
splice fragments are reduced to the parsimonious number of possible paths through the
splice graph (B, C). For genes with several splice events, this combinatorial approach
generates a large number of potentially false positive isoforms, and this initial step is
similar to the junction-based approach described above and presented in paper I, where
“canonical”, annotated exons are simply concatenated to each junction. In perhaps the
pivotal step, each putative isoform’s compatibility with the evidence is determined,
using exon coverage and the fragment length distribution (D) (either determined from
the mate distance in pre-annotated single-exon genes or given by the user), setting the
prior abundances of each transcript. Finally, a probability (likelihood) function is
maximized, finding the final set of parsimonious isoforms, and abundances are
reported (E). This maximizing approach vastly improves isoform inference (31).
Figure 6: RNA deconvolution by Cufflinks. Refer to main text for details. Adapted from Trapnell et al., 2010.
Other approaches to RNA-seq assembly exist (for a recent review, see (32)), but
the Tuxedo suite has proved to be robust in benchmarks and was, among others, used
as the assembler in the ENCODE project. For our splice class detection software
spliceR (paper II), implemented in R and available in the Bioconductor project, we
chose to build directly on Cufflinks-assembled transcripts (although other full-length
assemblers are supported). As several earlier efforts in splice detection from RNA-seq
data have been made, the main aim for spliceR was easy interoperability with R,
Bioconductor and Cufflinks. Another major point was complete classification of all
splice classes, as well as no requirement for pre-annotated splice event or transcript
data. In addition, the option to import the genomic locations of elements spliced in/out
for each alternative splicing type was included. This should be a powerful tool for
downstream analysis of systems where global alternative splicing is altered, including
sequence analysis and motif finding, and phylogenetic analyses. At the time of writing,
three collaboration projects take advantage of spliceR, and we hope to see a good
amount of interest in the package when Bioconductor 2.13 is released.
Coding Potential Prediction
Coding potential prediction of mRNAs is in itself not a particularly hard problem,
but a few different approaches exist. PhyloCSF (33) and CPC (34) take an
evolutionary approach, using alignments of potential coding regions combined with
machine learning approaches to scan for the frequency of synonymous and conservative
substitutions, scoring for coding potential. CPAT (35)
bases its score on the length of the longest ORF in all reading frames, as well as
other sequence linguistic features.
The approach chosen in article I to predict coding potential and the position of
(potentially premature) STOP codons was based on the concept that in the NMD
KO, events representing aberrant splicing are independent (as exemplified by 95% of
junctions mapping to canonical junctions), and thus annotated ORFs can be applied
to transcripts generated by exon concatenation and scanned through. The approach is
outlined in Supplemental figure 2.
For spliceR (article II), the same approach was chosen, scanning full isoforms
with the most upstream compatible ORF, if any. Cufflinks, however, often finds many
novel genes and isoforms, and spliceR in general only annotates 30-35% of found
isoforms with ORF and STOP positions. Heuristic approaches, e.g. as presented
in CPAT, would be a desirable addition to the package.
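The scanning step itself is simple and can be sketched as below. This is an illustration under stated assumptions and not the code of RAINMAN or spliceR: the transcript is given as a spliced DNA string, the annotated ORF start is a 0-based offset into that string, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(transcript, orf_start):
    """Return the 0-based position of the first in-frame STOP codon.

    transcript: spliced transcript sequence (DNA alphabet).
    orf_start:  0-based position of the annotated start codon.
    Returns None if no in-frame STOP codon is found.
    """
    sequence = transcript.upper()
    # Step through the sequence codon by codon in the annotated frame.
    for position in range(orf_start, len(sequence) - 2, 3):
        if sequence[position:position + 3] in STOP_CODONS:
            return position
    return None


stop_position = first_inframe_stop("ATGAAATAGCCC", 0)
```

Whether a STOP codon found this way counts as premature would then depend on its position relative to the exon-exon junctions of the concatenated transcript, which is outside the scope of this sketch.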
Processing and analysis of CAGE-data
An entirely different flavor of RNA-seq is CAGE-seq, or Cap Analysis of Gene
Expression combined with high-throughput sequencing. CAGE is based on the
sequencing of 5’ ends of RNA, and when these short sequences, referred to as “tags”,
are mapped back to the genome, a genome-wide map of transcription start sites
(TSSs) can be inferred, and the number of times each TSS is represented can be taken
as a proxy of that transcript’s expression (36). This technique was originally devised as
a technological platform for the FANTOM project for genome-wide profiling of
promoters (36). Obviously, regular RNA-seq intrinsically has the power to map 5’
transcription start sites, but with lower precision, due to both lower coverage and
coverage bias: not all parts of the mRNA are covered equally, and in some cases, the
5’UTR may not be covered at all.
For paper III, blood was taken from human patients diagnosed with acute
promyelocytic leukemia (APL, a cancer of the myeloid lineage of blood cells), as well as
from healthy individuals, and thus very little biological material was available from the
purified and cell-sorted samples. To facilitate a promoterome analysis, we applied a
variant of the CAGE technique dubbed nanoCAGE, which allows for small library
input sizes (37,38). This was the first time this technique had been applied to human
clinical samples, and it served as a proof-of-concept for the method in this setting.
In this low-input material scenario, PCR amplification bias was expected to be
a significant issue, and after initial observations of the data, we had some doubts as to
whether any meaningful biological signals could be deduced. Applying rigorous
filtering and thresholding, however, resulted in an interesting phenotype: a switch in
transcript-length preference between WT and cancer cells. After initial read quality
filtering and mapping to the genome, the following four filters were applied:
1. PCR-bias: Removal of all tag “towers” with a width of one read (in normal CAGE,
this could be dangerous, removing several meaningful TSS clusters, but manual
inspection of the nanoCAGE data showed that these towers cropped up in a more
random fashion).
2. High-pass expression filter: in CAGE, the analog to RNA-seq’s RPKM is tags per
million (TPM), but without the gene length normalization. To further weed out
technical noise, an expression of 2 TPM was required.
3. Variance: as some heterogeneity in patients with APL is expected due to unique
cancer genotypes, some of the variance between the cancer samples may not be due
to noise. Nonetheless, variance obfuscates downstream analyses and significance
calling, and clusters with a coefficient of variation above 1 were removed.
4. Exon noise filtering: in a “bird’s-eye” analysis of the data, an amount of “exon
painting”, or low-coverage reads spanning exonic areas, was evident, possibly due
to capture of degradation products. To avoid clusters being called in these regions,
clusters were required to be expressed more than 1.5 times the average exonic
coverage for the relevant gene, for some analyses. Variance filtering also
contributed to countering this problem.
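The first three filters can be sketched as a single predicate per tag cluster. This is an illustration and not the pipeline used in paper III: the function names are hypothetical, the 2 TPM threshold is applied to the mean across samples (an assumption, since the text does not state how the threshold was aggregated), and the exon noise filter is omitted because it requires gene annotation.

```python
import statistics


def tags_per_million(tag_count, library_size):
    """Scale a raw tag count to tags per million mapped tags (TPM)."""
    return tag_count / library_size * 1_000_000


def passes_filters(width, tag_counts, library_sizes,
                   min_width=2, min_tpm=2.0, max_cv=1.0):
    """Apply the width, expression and variance filters to one cluster.

    width:         cluster width in nucleotides.
    tag_counts:    raw tag counts for the cluster, one per sample.
    library_sizes: total mapped tags, one per sample.
    """
    if width < min_width:          # filter 1: single-position "towers"
        return False
    tpm = [tags_per_million(c, n) for c, n in zip(tag_counts, library_sizes)]
    mean_tpm = statistics.mean(tpm)
    if mean_tpm < min_tpm:         # filter 2: high-pass expression filter
        return False
    if len(tpm) < 2:               # variation is undefined for one sample
        return True
    cv = statistics.stdev(tpm) / mean_tpm
    return cv <= max_cv            # filter 3: coefficient of variation


libraries = [1_000_000, 1_000_000, 1_000_000]
kept = passes_filters(5, [10, 12, 9], libraries)
```

Keeping the thresholds as keyword arguments makes it straightforward to examine how the retained cluster set changes as each threshold is varied.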
Figure 7: For nanoCAGE data, rigid filtering is required to weed out noise: lowly expressed and narrow clusters are generally not associated with pre-annotated TSSs, motivating filtering on these parameters. (Distance to nearest annotated TSS on the x-axis).
Some of these filters, and the motivation for the thresholds, may seem arbitrary.
Figure 7 represents three variables: TPM (color coding), width of clusters (y-axis) and
distance from the closest annotated UCSC knownGene transcription start site (x-axis).
Although an intrinsic strength of the CAGE technique is promoter discovery, many
promoters detected in this experiment represent existing pre-annotated promoters, and
the distance to the nearest TSS can be used as a quality measure of different filtering
thresholds. From the plot, the distribution of distances to the nearest TSS is much
tighter for more highly expressed clusters, and for wider clusters, legitimizing filters
for both expression and width.
In CAGE, tag clustering and filtering are just one component of the data
analysis, many parts of which are similar to those of RNA-seq, including gene
ontology analysis, overlap with other datasets, network analysis, etc. One
analysis in paper IV was specific to CAGE, however, and shows well how the
ever-growing amount of public data can often be taken advantage of. To deduce more
about the regulatory neighborhood of our putative TSSs (lacking any transcription
factor ChIP-seq data for these samples), we instead generalized the question by
downloading all Mus transcription factor binding sites from the ENCODE project
(across cell lines), comprising more than 125 data sets in replicates. The distance
distribution from putative TSSs to the nearest transcription factor binding site was
then measured and visualized. Although no mechanistic inference can be made
directly, as tissues and sample conditions are not comparable, this is an example of
large-scale data integration in genomics.
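The distance measurement can be sketched with a binary search over sorted binding site positions. This is an illustrative helper and not the code used in the paper: it assumes that positions on a single chromosome are represented as integers, and the function name is hypothetical.

```python
import bisect


def nearest_distance(position, sorted_sites):
    """Distance from `position` to the closest entry of `sorted_sites`.

    sorted_sites must be a non-empty list of integer coordinates on the
    same chromosome, sorted in ascending order.
    """
    if not sorted_sites:
        raise ValueError("no binding sites supplied")
    index = bisect.bisect_left(sorted_sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = sorted_sites[max(index - 1, 0):index + 1]
    return min(abs(position - site) for site in neighbours)


binding_sites = [100, 500, 900]
distances = [nearest_distance(tss, binding_sites) for tss in (50, 520, 1000)]
```

Sorting the binding sites once and searching per TSS keeps the calculation fast even for a large number of downloaded data sets.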
CHROMATIN IMMUNOPRECIPITATION COUPLED WITH SEQUENCING | 36
Chromatin Immunoprecipitation coupled with sequencing
The work done in the fourth article presented in this thesis is based largely on
data from sequencing of DNA-fragments bound by certain transcription factors (or
ChIP-seq), and this section will focus on common challenges in analyzing ChIP-seq
data and how these were countered in the presented work.
Milestones and history
The predecessor of ChIP-seq is ChIP-chip or ChIP-on-chip, using chromatin
pull-downs coupled with DNA microarrays. Before the first papers using
high-throughput sequencing were published in ‘07/’08, this was the method of choice for
assaying global transcription factor binding profiles, but it obviously had its limits in the
same way that microarray analyses for gene expression are limited, the most apparent
being the restriction to a pre-defined set of probes. As such, researchers could only probe
a very small fraction of a mammalian genome, resulting in arrays covering, for instance,
only known promoters, or exonic areas (39).
Harnessing the (theoretically) non-discriminatory power of sequencing was a
natural evolution of the method, and the first papers utilizing ChIP-seq were
published in Nature and Cell in 2007. In the first, profiling the global binding of
transcription factor STAT1 in interferon-treated versus normal HeLa cells, more than
40,000 putative binding regions were presented with, compared to today’s standards,
relatively few mapped reads (40), and this paper established a methodological and
statistical framework for signal processing of ChIP-seq data that is still in use today.
The second paper presented a genome-wide map of histone lysine and arginine
methylations, providing new insights into epigenetic regulation that were not readily
obtainable with older technologies, and likewise defining much of the data analysis
terminology (41).
Since then, a slew of papers have been published using ChIP-seq, and the
largest effort so far, similar to RNA-seq, is most likely that of the ENCODE project,
integrating data from thousands of experiments and sequencing runs into a combined
regulatory lexicon, describing the transcriptional, regulatory, and epigenetic landscape
of the human genome (24,42–44).
Peakfinding and normalization
Perhaps the primary challenge in analyzing ChIP-seq data of transcription factors (as
well as histone modifications) is the detection of peaks, i.e. DNA regions that can
statistically be inferred to be bound by the protein of interest, compared to the
background, defined either by the signal-to-noise ratio in the sample, by a mock IgG
sample, or both. Since the introduction of next-gen sequencing, developing peak finder
software has been an active pursuit in genomics, and to date, there are more than 20
published attempts at solving this problem. Peak finders generally differ in the signal
scanning process, the statistical models used, and in sensitivity for detecting different
regions (e.g. broad histone marks vs. narrow transcription factor peaks).
As mentioned above, a statistical framework based on modeling the
background genomic coverage (often referred to as lambda) using a Poisson
distribution, and calculating thresholds and significance levels based on this, was
presented in the first effort on ChIP-seq, and was subsequently carried through in one of
the earliest and most popular peak finders, MACS (45). MACS refined peak calling by
introducing a local lambda (defined in windows of 1, 5 or 10 kb around the peak) to
correct for local biases, as well as by employing a shift size modeling step, which tries
to deduce the DNA library insert size for more accurate peak definition.
Figure 8: An example of ChIP-seq peaks from 16 samples; 8 time points after partial hepatectomy for two transcription factors, CEBPA and CEBPB. Black dotted lines represent consensus peak regions, in which tags for each sample are quantified.
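The Poisson framework with a local lambda can be sketched as below. This is a simplified illustration of the idea and not the MACS implementation: the function names, the representation of the control data as a mapping from window size to tag count, and the example numbers are all assumptions made for the sketch.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cumulative = term
    for i in range(1, k):      # add P(X = 1) ... P(X = k - 1)
        term *= lam / i
        cumulative += term
    return max(0.0, 1.0 - cumulative)


def local_lambda(genome_rate, window_counts, peak_width):
    """Expected background tags in a peak, taking the most conservative rate.

    genome_rate:   genome-wide background tags per nucleotide.
    window_counts: mapping of window size (nt) to background tags in that
                   window around the candidate peak.
    peak_width:    width of the candidate peak in nucleotides.
    """
    expected = [genome_rate * peak_width]
    expected += [count / size * peak_width
                 for size, count in window_counts.items()]
    return max(expected)


lam = local_lambda(0.001, {1_000: 5, 10_000: 20}, 200)
p_value = poisson_upper_tail(8, lam)
```

Taking the maximum over the genome-wide and windowed rates means that a peak in a locally tag-rich region has to clear a higher threshold, which is the correction for local bias described above.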
For article IV, MACS and uSeq (a peak caller with similar inner workings)
were employed for our sample pool of 16 (two TFs, 8 time points). An example of
the binding landscape and peak definition is given in Figure 8. When quantifying
binding intensity for each sample, a common reference for inter-sample comparison is
required (akin to gene or transcript ids for RNA-seq). As each sample was peak-called
separately, the common reference was simply defined as the area under all overlapping
peaks for all samples, analogous to the 1-dimensional “shadow” simultaneously cast by
all samples on the genome. Having these consensus peaks, regular downstream
bioinformatic analyses could follow, including clustering, gene-association, network
analysis, and ...
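The consensus region definition amounts to taking the union of overlapping intervals across samples, which can be sketched as follows. The function name and the representation of peaks as (start, end) pairs on a single chromosome are assumptions made for this illustration, not the code used in the article.

```python
def consensus_peaks(peaks_per_sample):
    """Merge overlapping peaks from all samples into consensus regions.

    peaks_per_sample: list with one list of (start, end) peaks per sample,
                      all on the same chromosome.
    Returns a sorted list of merged (start, end) regions.
    """
    intervals = sorted(peak for sample in peaks_per_sample for peak in sample)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend the "shadow".
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


samples = [[(100, 200), (500, 600)], [(150, 250)], [(590, 700), (900, 950)]]
regions = consensus_peaks(samples)
```

Tags from each sample can then be counted within every consensus region, giving one comparable intensity value per region and sample.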
Motif discovery
Discovering transcription factor binding motifs has been a pursuit in genomics
for some time, and data from ChIP-on-chip and ChIP-seq methods are a natural
catalyst for this. Three large repositories containing binding motifs exist, and were the
basis of the combined high-confidence reduced motif bank generated for article IV:
JASPAR, open-source, and containing a small, non-redundant and literature-backed
set, primarily based on the SELEX method (46); TRANSFAC (47), a commercial set
of binding motifs, but with a limited open access set; and UniPROBE (48), an open
access library based on protein-binding microarray data. Our purpose was to reduce
the data from these three sources into something smaller, more manageable and of
higher quality and lower redundancy. Aggregation of all available models resulted in
939 position weight matrices (PWMs), and the smaller, final high-confidence set was
generated as follows:
1. Removal of duplicates
2. End-trimming of models, removing consecutive terminal positions with an
information content (IC) of less than 0.5 bits.
3. Removal of all models with a total IC of less than 8 bits.
4. Scoring of all models against each other using the Spearman correlation
coefficient, and subsequent hierarchical clustering.
5. After inspection of the distance tree, a threshold of ~500 clusters was
chosen (figure 9) based on biologically meaningful clusters, and within
each cluster, the motifs were aligned using the Smith-Waterman local
alignment algorithm. The consensus alignment motif was extracted, and
subsequently end-trimmed and filtered as in steps 2 and 3.
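Steps 2 and 3 of the procedure above can be sketched as below. This is an illustration under stated assumptions and not the code used for article IV: each PWM is represented as a list of columns holding the probabilities of A, C, G and T, the information content is calculated against a uniform background without small-sample correction, and the function names are hypothetical.

```python
import math


def column_ic(probabilities):
    """Information content (bits) of one DNA motif column.

    Uses IC = 2 + sum(p * log2(p)), i.e. a uniform background.
    """
    return 2.0 + sum(p * math.log2(p) for p in probabilities if p > 0)


def trim_ends(pwm, min_ic=0.5):
    """Remove low-information columns from both ends of a motif."""
    start, end = 0, len(pwm)
    while start < end and column_ic(pwm[start]) < min_ic:
        start += 1
    while end > start and column_ic(pwm[end - 1]) < min_ic:
        end -= 1
    return pwm[start:end]


def passes_total_ic(pwm, min_total=8.0):
    """Keep only motifs whose summed information content is high enough."""
    return sum(column_ic(column) for column in pwm) >= min_total


uniform = [0.25, 0.25, 0.25, 0.25]
motif = [uniform, [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], uniform]
trimmed = trim_ends(motif)
```

A column in which all four nucleotides are equally likely carries 0 bits and a fully conserved column carries 2 bits, so a total of 8 bits corresponds to the equivalent of four fully conserved positions.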
This approach resulted in a little less than 500 high confidence motifs, which
were the basis of the binding motif profiling used in article IV.
Figure 9: A section of a dendrogram representing the hierarchical clustering of ~1000 position weight matrices (PWMs) (names at the left side), after trimming and alignment. The blue line represents an arbitrary cutoff (a biologically meaningful value set after manual inspection of the tree), reducing the set to approx. 500 PWMs.
CONCLUDING REMARKS | 40
Concluding Remarks
In the 14 years that have passed so far of the 21st century, genomics can truly be
said to have taken centre stage, in no small part due to the advent of high-throughput
sequencing. The genomic era, perhaps starting in 1972 with the first determination of
the sequence of a gene, and perhaps ending with the final draft of the human genome,
is now over, and the post-genomic era has begun. But as easy as it is to fling around
clichés about the marvels of this time of cutting-edge research, it is perhaps equally hard
to pinpoint defining key research discoveries, and this is an inherent challenge in
modern-day genomics. As the amount of data grows larger and larger, research has a
tendency to move away from hypothesis-driven manipulation of simpler systems to
genome-wide, organism-wide profiling of any biological information system.
The articles presented in this thesis all present attempts to solve problems in
contemporary biology by applying the methodology of quantification, classification
and statistics on data from high throughput sequencers.
They were all challenging in the sense that no clear-cut hypothesis existed a
priori, at least in terms of analyzing the sequence data. This required an
exploration-heavy approach, which when most frustrating (and it will inevitably be at
times) could be likened to a “fishing expedition” and, when most rewarding, gave an
unbeatable feeling of actually finding the needle in the haystack. I found, through trial
and error and an increasing confidence in dealing with big data, that a certain amount of
systematization and consistency in data handling and analysis should be a priority in
bioinformatics projects. A wide variety of tools and pipelines exist, and even though
some general approaches are implicitly agreed on in the field, few standard
operating procedures exist (one example is ENCODE’s complete guideline
for ChIP-seq studies, from lab bench to computer (49)). The data is only as good as
the tools you use to analyze it! The Bioconductor project is a good example of an
inter-operable, common (and open source) platform for genomics, and with spliceR
(presented in article II), we hoped to present a valuable addition to that project. And as
applications and tools mature with the field as a whole, more standards will evolve.
As discussed in the introduction, sequencing has become an accepted and
widely used method in most laboratories that do some sort of genomic research, and
many more robust and well-documented tools and pipelines have been published since. It
is this author’s observation that the gap between researchers from a classical biological
background (and with few skills in computer science) and researchers with a
background in computer science is decreasing, and most likely, future researchers in
genomics will be equally well trained in both disciplines. “Big Data” is not a
phenomenon restricted to genome biology, and undergraduate programs in science
(and other disciplines) are aware of this.
The future in sequencing is promising, especially as the clinical domain starts to
accept the new possibilities fully, in part also facilitated by decreasing costs and
improved opportunities for running hundreds of samples by the means of multiplexing.
With the increased processing and normalization quality of RNA-seq, this platform,
with its non-discriminative approach, is expected to completely replace microarrays,
offering de novo transcript detection, characterization of alternative splicing, etc. And
as reads get longer, and transcript assemblers more advanced, we will come closer to
solving the isoform problem fully. For ChIP-seq, the advantages over ChIP-on-chip
are just as apparent, and even though large-scale efforts are steadily increasing our
knowledge of the regulatory landscape, luckily there are a few thousand more
transcription factors in the human genome that haven’t been described yet, and just as
many cell types! Combined with profiling of epigenetic marks, genome resequencing
and all the other wonderful applications of sequencing, the unraveling of the complex
layers of life goes perhaps a little faster.
ORIGINAL ARTICLES | 42
Original Articles
Article I
Mammalian tissues defective in nonsense-mediated mRNA decay display
highly aberrant splicing patterns
ARTICLE I | 44
RESEARCH Open Access
Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrantsplicing patternsJoachim Weischenfeldt1,2,3†, Johannes Waage1,2,4†, Geng Tian5, Jing Zhao5, Inge Damgaard1,2,3,Janus Schou Jakobsen1,2, Karsten Kristiansen6, Anders Krogh2,4,6, Jun Wang5,6 and Bo T Porse1,2,3*
Abstract
Background: Nonsense-mediated mRNA decay (NMD) affects the outcome of alternative splicing by degrading mRNA isoforms with premature termination codons. Splicing regulators constitute important NMD targets; however, the extent to which loss of NMD causes extensive deregulation of alternative splicing has not previously been assayed in a global, unbiased manner. Here, we combine mouse genetics and RNA-seq to provide the first in vivo analysis of the global impact of NMD on splicing patterns in two primary mouse tissues ablated for the NMD factor UPF2.
Results: We developed a bioinformatic pipeline that maps RNA-seq data to a combinatorial exon database, predicts NMD-susceptibility for mRNA isoforms and calculates the distribution of major splice isoform classes. We present a catalog of NMD-regulated alternative splicing events, showing that isoforms of 30% of all expressed genes are upregulated in NMD-deficient cells and that NMD targets all major splicing classes. Importantly, NMD-dependent effects are not restricted to premature termination codon+ isoforms but also involve an abundance of splicing events that do not generate premature termination codons. Supporting their functional importance, the latter events are associated with high intronic conservation.
Conclusions: Our data demonstrate that NMD regulates alternative splicing outcomes through an intricate web of splicing regulators and that its loss leads to the deregulation of a panoply of splicing events, providing novel insights into its role in core- and tissue-specific regulation of gene expression. Thus, our study extends the importance of NMD from an mRNA quality pathway to a regulator of several layers of gene expression.
Background
Alternative splicing (AS) involves the selective inclusion and exclusion of exons from a nascent pre-mRNA that results in various combinations of mature mRNAs with different coding potential and thus protein sequence [1]. Importantly, it has recently been estimated that nearly 95% of all multi-exon genes in the mammalian cell undergo AS [2,3], suggesting a pivotal role for AS in regulating and expanding the repertoire of isoforms expressed. By examining ESTs, it has been proposed that one-third of all AS isoforms contain a premature termination codon (PTC) [4], and these are expected to be targeted for degradation by nonsense-mediated mRNA decay (NMD). NMD is an mRNA quality control mechanism, and the primary function of NMD was initially thought to be in removal of aberrant transcripts arising from mutations or faulty transcription, mRNA processing or translation, but it is now evident that NMD impacts on both diverse physiological processes [5-7] as well as pathophysiological conditions (reviewed in [8]). The conserved core components of the NMD pathway are the UPF1, UPF2 and UPF3A/B proteins, and mutations or depletion of these factors inactivate NMD [9,10]. In mammalian cells, PTCs are distinguished from normal stop codons by their position relative to a downstream exon-exon junction, which is marked by the deposition of the exon junction complex
* Correspondence: [email protected]
† Contributed equally
1 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen, DK2200 Copenhagen, Denmark
Full list of author information is available at the end of the article
Weischenfeldt et al. Genome Biology 2012, 13:R35
http://genomebiology.com/2012/13/5/R35
© 2012 Weischenfeldt et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
[11]. It has been generally established that for a stop codon to be recognized by the NMD apparatus, it must be situated at least 50 nucleotides upstream of an exon-exon boundary (the 50 nucleotides rule) [12]. Thus, nearly all naturally occurring eukaryotic stop codons are found downstream of the last intron, thereby rendering them immune to NMD. Although recent data have demonstrated that the proximity of the poly(A)-binding protein (PABP) to the PTC is inversely correlated with the efficiency of NMD [13,14], the 50 nucleotides rule applies to almost all studied mammalian transcripts, taking heed of a few noted exceptions [15,16]. Mechanistically, AS can utilize NMD to selectively degrade transcripts by the selective inclusion of a PTC-containing (PTC+) exon or exclusion of an exon, resulting in a PTC+ downstream exon. This coupling, initially discovered for serine/arginine-rich (SR) proteins in Caenorhabditis elegans [17], has been coined regulated unproductive splicing and translation (RUST) or AS coupled to NMD (AS-NMD) [4,18]. Intriguingly, proteins involved in splicing processes utilize AS-NMD to autoregulate their own synthesis through a negative feedback loop. The most well characterized splicing activators, the SR proteins, bind to cis elements in the pre-mRNA, usually stimulating the inclusion of an exon. The SR proteins have been shown to utilize AS-NMD in a negative feedback loop to activate the inclusion of a PTC+ exon (PTC upon inclusion) in their own pre-mRNA, thus resulting in NMD [18-21]. The other major class of splice regulators, the heterogeneous nuclear ribonucleoproteins (hnRNPs), are a class of RNA binding proteins with roles in mRNA splicing, export and translation [22,23]. The hnRNPs often, but not always, bind to splice silencer elements and repress splicing at nearby splice sites. Splicing repressors, such as hnRNPs, use AS-NMD to repress the inclusion of a coding exon in their own pre-mRNA that leads to an out-of-frame skipping event, consequently inducing a downstream PTC and thus NMD (PTC upon exclusion). Moreover, AS-NMD is also used to cross-regulate expression of other splice factors, as described elegantly for PTBP1 and PTBP2 [24].
AS is regulated by the selective recruitment of splice regulators to pre-mRNAs. It is well established that splicing activators (such as SR proteins) compete with splicing repressors (such as hnRNPs) for binding to splice sites in an antagonistic manner, where the relative concentration of the two classes regulates the level of AS [25,26]. Thus, the fate of an alternative exon is usually decided by the antagonism between SR proteins and hnRNPs and their concentration and activity (reviewed in [27]). Due to the autoregulatory feedback loop employed by splice regulators, modulating NMD could potentially have widespread effects on the concentration of splicing activators and repressors and thus AS and AS-NMD.
Despite the potential of AS-NMD, it is presently not known to what extent this pathway regulates the transcriptome on a global scale, and how AS homeostasis is affected by NMD. A major problem in discovering the full spectrum of PTC+ transcripts is that EST and cDNA repositories are biased against these isoforms due to their unstable nature in normal cells. Additionally, most studies have primarily focused on microarrays to query the transcriptome upon muting NMD [18,19,28], and the newer studies using sequencing [29] have focused on single cassette exon events, thus disregarding many other physiologically important splice classes such as mutually exclusive exons and alternative 5’ and 3’ splice site usage. Last but not least, the transcriptomic consequences of genetically modulating NMD and thereby AS in the mammalian organism are largely unknown.
In the present study, we have performed RNA-seq on two different tissues from a Upf2 conditional knock-out (KO) mouse line. Thus, in addition to providing significantly novel insights into common and tissue-specific functions of NMD, our study represents the first comprehensive and unbiased transcriptome analysis of adult genetically modified mice. To facilitate a high-resolution analysis of all possible exon-exon combinations, we have generated a bioinformatic pipeline, named RAINMAN (RnAseq-based Isoform detection and NMd ANalysis pipeline; available to the scientific community), that maps reads to a combinatorial database that incorporates both known and in silico predicted exon-exon junctions. The pipeline predicts NMD susceptibility based on junction evidence and groups AS events into seven major splice isoform classes. Using this approach, our results reveal an unprecedented increase in AS upon ablating UPF2, and by inference NMD (although small additive effects from UPF3A/B interaction and deregulation of UPF1 phosphorylation cannot be wholly excluded from analysis, we treat UPF2 ablation and NMD ablation as equivalent throughout this work) and show that a high proportion of these upregulated AS events are not predicted to contain a PTC. Hence, we find that only 50% of the increase in AS is directly due to stabilization of PTC+ isoforms. Our data demonstrate that muting NMD results in deregulated levels of core splice regulators at both the mRNA and protein level, and further suggest that this contributes to the deregulation of general AS upon ablating NMD. Hence, our data support a model where NMD ablation leads to deregulated levels of core splicing factors that ultimately lead to aberrant global splicing, implying a novel intricate interplay between the NMD machinery and the splicing factors. Finally, our data analysis generates the first insights into
a putative functional role of the PTC and its surrounding regions through analysis of their conservation patterns.
Results
A major task in analyzing AS by RNA-seq is to explore and quantify the differential representation of various splice classes and the protein-coding potential of the mapped reads. To this end, we have developed RAINMAN, a streamlined bioinformatic pipeline available to the scientific community that maps RNA-seq reads to a comprehensive combinatorial database of exon-exon junctions for unique junction discovery, PTC detection and identification of splice isoforms. We used this pipeline to investigate the complexity of the transcriptome and global role of AS-NMD in vivo in mammalian cells, by taking advantage of our Upf2 conditional KO mouse, which we previously used to demonstrate the in vivo importance of NMD [7,30]. To explore the effect of NMD on global splicing and to generate and validate an attractive bioinformatic pipeline to study AS and NMD, we chose to analyze the transcriptomes of two different mammalian organ systems with distinct phenotypes upon UPF2 deletion. At one end of the spectrum, we analyzed liver, wherein removal of UPF2 results in failure in liver metabolism and a high mortality rate [30], and at the other, we analyzed bone marrow-derived macrophages (BMMs). These macrophages are generated in vitro from murine bone marrow cells, and Upf2 deleted BMMs are completely devoid of NMD activity but nevertheless show no morphological or functional phenotype compared to wild-type (WT) controls [7].
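The idea of a combinatorial exon-exon junction database can be illustrated with a minimal sketch. It assumes that every exon of a gene may be joined to every downstream exon of the same gene; the function name, the toy sequences and the `flank` parameter are hypothetical and are not taken from RAINMAN, whose actual construction is described in the supplementary material.

```python
from itertools import combinations


def junction_database(exons, flank=4):
    """Enumerate every forward exon-exon junction for one gene.

    exons: list of (name, sequence) tuples in 5'-to-3' transcript order.
    flank: bases kept on each side of the junction (hypothetical setting;
           a real database would use a flank close to the read length).
    """
    db = {}
    # combinations() keeps input order, so the upstream exon always comes first.
    for (up_name, up_seq), (down_name, down_seq) in combinations(exons, 2):
        # Join the 3' end of the upstream exon to the 5' end of the downstream exon.
        db[(up_name, down_name)] = up_seq[-flank:] + down_seq[:flank]
    return db


# Toy gene with three exons (made-up sequences).
exons = [("e1", "ATGGCCAAGT"), ("e2", "GGTTCAGGAC"), ("e3", "TTGACCTGAA")]
db = junction_database(exons)
for pair, seq in db.items():
    print(pair, seq)
```

Besides the annotated junctions e1-e2 and e2-e3, the sketch also yields the exon-skipping junction e1-e3, which is the kind of junction that annotation-only databases would miss.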
Splice isoform inference and PTC detection
In order to obtain biological material, we first generated mice in which the NMD core factor Upf2 could be selectively inactivated in liver or in BMMs using our previously reported strategy (see Materials and methods). We performed whole transcriptome sequencing (single-end) on poly(A)-purified RNA, isolated from poly-IC injected Upf2^fl/fl; Mx1Cre and Upf2^fl/fl livers (termed from now on ‘Liver KO’ and ‘Liver WT’, respectively) and Upf2^fl/fl; LysMCre BMMs and Upf2^fl/fl BMMs (henceforth ‘BMM KO’ and ‘BMM WT’, respectively). In order to minimize biological variation, we generated libraries from pools of poly(A)-purified RNA derived from three individual animals. Due to the underrepresented nature of NMD-susceptible transcripts in EST and cDNA databases [4], we generated a comprehensive combinatorial database of exon-exon junctions in the murine genome, using sequences from exon models annotated in different repositories (RefSeq, Ensembl, UCSC Known Genes, GENSCAN and Exoniphy; Supplemental Materials in Additional file 1 and Figure S1B in Additional file 2). To discover junctions between exons not previously recorded in the murine transcriptome, we also employed TopHat [31], a de novo mapping algorithm, to map reads to junctions between unannotated exons (Table S1 in Additional file 1). To utilize reads that cover three or more exons, thus aligning to two or more exon-exon junctions, and not initially mapped to our combinatorial database or by TopHat, we incorporated an extra read truncation step (Figure S1B in Additional file 2), trimming reads in steps of 10 bp and remapping, allowing us to recover an additional 4 to 7% of total reads. As a result, 84% of reads mapped to the genome or transcriptome, and out of these, 28% mapped to splice junctions (Figure 1a), corresponding to a combined total of 323,474 unique splice junctions across our two tissues and two genotypes. With a minimum requirement of three reads to a splice junction, minimizing sequencing and mapping artifacts, we mapped approximately 150,000 unique junctions. Of note, we found that the de novo mapper TopHat contributed significantly to our transcriptome set, with 11 to 16% of all discovered junctions uniquely defined by TopHat (Table S1 in Additional file 1). Mapping of the KO samples benefited the most from these TopHat predicted splice junctions, suggesting that the curated transcript repositories are biased against the TopHat-predicted junctions due to their NMD susceptibility. Indeed, TopHat identified 20% and 24% of the PTC+ splice junctions above our minimum read cutoff, compared to only 4% and 11% of PTC- splice junctions (junctions that do not result in the generation of a PTC), in BMM and liver samples, respectively (Table S2 in Additional file 1). These data demonstrate that approximately a fifth to a quarter of NMD susceptible transcripts are not present in the normal murine repositories. This has implications for future splice isoform detection, since many transcripts that are below detectable levels under physiological conditions, and hence are absent in the repositories, will not be detected under conditions that could favor their presence, for example, perturbed or disease states. Thus, taking advantage of our combinatorial database that includes a pipeline for splice isoform detection and PTC prediction is likely to considerably increase the number of isoforms detected under conditions that favor increased AS.
In the next step of RAINMAN, the pipeline calculates the distribution of seven different splice isoform classes and predicts PTCs (see Supplementary Methods in Additional file 1 for details), and all data are combined for easy visualization and data mining. An illustrative example of a genome browser output from the mapping steps is shown in Figure 1b, with the conditional Upf2
KO gene deleted for exons 2 and 3 in the KO sample. The UCSC-based visualization includes both evidence of reads mapped to junctions and to exons, and, in this case, demonstrates increased AS in the liver KO compared to WT.
We mapped a total of 150,000 unique junctions with approximately 100,000 and 130,000 junctions in BMM and liver tissues, respectively (minimum of three reads to a junction; Figure 2a, top, and data not shown). Tallying up, we found that 6,256 and 7,997 unique genes harbored 14,056 and 25,534 upregulated junction events in BMM and liver, respectively, an average of 2.3 and 3.2 upregulated junction events per gene (Figure 2a), demonstrating that loss of UPF2, and by inference NMD, leads to a substantial deregulation of splicing.
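The read truncation step described earlier in this section (trimming unmapped reads in steps of 10 bp and remapping) could look roughly like the sketch below. The text does not state which end of the read is trimmed or what the minimum length is, so trimming from the 3' end and `min_len=20` are assumptions, and `toy_map` is a hypothetical stand-in for a real aligner call.

```python
def map_with_truncation(read, try_map, step=10, min_len=20):
    """Trim a read in fixed steps until it maps, or give up.

    try_map: callable returning a mapping position or None; it stands in
             for a real aligner and is purely hypothetical.
    Trimming from the 3' end and the value of min_len are assumptions.
    Returns (position, mapped_length), or (None, 0) if nothing maps.
    """
    length = len(read)
    while length >= min_len:
        hit = try_map(read[:length])
        if hit is not None:
            return hit, length
        length -= step  # shorten the read by one step and try again
    return None, 0


# Toy reference and exact-match "aligner" for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATC"


def toy_map(seq):
    pos = reference.find(seq)
    return pos if pos >= 0 else None


# The first 25 bases match the reference; the remaining 15 do not.
read = reference[:25] + "G" * 15
result = map_with_truncation(read, toy_map)
print(result)
```

In this toy case the full 40 bp read and the 30 bp truncation both fail, and the 20 bp truncation maps, which mirrors how a read spanning several junctions can be rescued once the unmappable tail is removed.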
Ablating NMD results in significant upregulation of PTC+ junctions
It has been shown that a stop codon is generally recognized as a PTC by the NMD machinery, if it is situated at least 50 nucleotides upstream of an exon-exon junction, termed the 50 nucleotides rule. We assayed and verified this distance requirement by simply plotting the position of RefSeq gene model stop codons that reside in the penultimate exon (Figure S2 in Additional file 3), observing that the vast majority of these stops indeed fall precisely within 50 nucleotides of the final intron boundary, thereby avoiding the elicitation of NMD for those transcripts. Incorporating this distance metric in our PTC prediction algorithm, we calculated the fraction of regulated PTC+ junctions as a function of UPF2
[Figure 1, panel (a): number of mapped reads per region and sample; pie charts omitted]
Region          BMM WT      BMM KO      Liver WT     Liver KO
Exonic          5,124,062   5,794,917   10,795,989   10,421,241
Junction        2,195,422   2,469,916    5,341,681    4,573,723
Intronic          202,137     226,777      350,573      418,607
Intergenic        718,984     760,327    1,512,332    1,378,347
Total mapped    8,240,605   9,251,937   18,000,575   16,791,918
Mapping frequency values as extracted (sample order not recoverable): 78.0, 88.0, 97.0, 77.0
[Figure 1, panel (b): UCSC genome browser view of the Upf2 gene locus on chr2 (5,880,000 to 5,970,000), with junction and exon coverage tracks for BMM WT, BMM KO, Liver WT and Liver KO; the exons deleted in the Upf2 KO are marked]
Figure 1 Mapping and downstream analysis of RNA-seq data. (a) Total number of RNA-seq reads mapped to exonic, junction, intronic and intergenic regions as well as the fraction of mapped reads for all four samples. Pie charts below visualize the distribution of all mapped reads. (b) An example of an output from the pipeline, visualized in the UCSC genome browser, showing the Upf2 gene locus, with junctions supported by reads (horizontal bars in each sub-window), and exon coverage (vertical bars). Refer to Figure S1 in Additional file 2 for another example of a genome browser output. No minimum read cutoff was used for junction visualization. The very low exon coverage present for knocked out exons 2 and 3 in KO samples represents a minuscule amount of non-recombined tissue/cells.
ablation. As expected, this analysis demonstrated a highly significant upregulation of PTC+ junctions in the KO samples (Figure 2a, all junctions versus PTC+ junctions, P-value 1 × 10^-16, Chi-square). In total, approximately 44% of predicted PTC+ junctions were upregulated in BMMs and Liver KO (Figure 2a), amounting to 16% (BMM) and 28% (liver) of expressed genes containing at least one NMD-susceptible splice junction regulated more than two-fold.
Among the upregulated PTC+ splice junctions, 516 were shared between the two tissues, and these corresponded to 448 unique genes (Figure 2a; Table S3 in Additional file 4), which we thus termed core NMD targets. The group includes well-known NMD targets such as Smg5, Hsf1, and Zcchc6 as well as many known NMD-susceptible splicing factors [6,18,19]. Indeed, Gene Ontology (GO) analysis demonstrated that the highest ranked cluster contained genes involved in mRNA processing and splicing (P-value of 9.0 × 10^-18, Bonferroni corrected; see the full list of GO terms and genes in Table S3 in Additional file 4). All classical SR proteins have been shown to utilize AS-NMD to autoregulate their own synthesis [20], and our finding that common splicing factor junctions are upregulated upon
[Figure 2, panel (a): regulated junctions and corresponding genes; pie charts omitted]
All junctions
Tissue    Feature     Upregulated     Unregulated     Downregulated
BMM       Junctions   14,056 (14%)    72,499 (74%)    11,311 (12%)
BMM       Genes        6,256 (30%)     9,156 (44%)     5,501 (26%)
Liver     Junctions   25,534 (20%)    84,901 (68%)    14,793 (12%)
Liver     Genes        7,997 (33%)     9,932 (41%)     6,580 (27%)
Common    Junctions    2,257 (4%)     49,583 (94%)       989 (2%)
Common    Genes        1,689 (16%)     7,736 (75%)       885 (9%)
PTC+ junctions
Tissue    Feature     Upregulated     Unregulated     Downregulated
BMM       Junctions    1,733 (44%)     1,705 (43%)       524 (13%)
BMM       Genes        1,272 (45%)     1,135 (40%)       449 (16%)
Liver     Junctions    5,962 (43%)     5,725 (41%)     2,180 (16%)
Liver     Genes        3,062 (43%)     2,581 (37%)     1,401 (20%)
Common    Junctions      516 (41%)       720 (57%)        32 (3%)
Common    Genes          448 (45%)       507 (51%)        32 (3%)
[Figure 2, panel (b): Venn diagram of Liver KO and BMM KO gene sets showing the major Gene Ontology terms for PTC+ genes: mRNA processing (common genes), mitochondrial function (liver-specific genes), GTPase regulator activity (BMM-specific genes)]
Figure 2 Number of regulated splicing and PTC occurrences in Upf2 KO tissues. (a) Splice junctions with a minimum of three reads in WT or KO samples were grouped into upregulated (≥2-fold up in KO), downregulated (≥2-fold down in KO) or unregulated. Genes corresponding to the junctions are also shown. Left: number of junctions and genes and percentage in parentheses for the relative fraction in each tissue. The top half of the table lists results for all junctions, the bottom half for PTC+ junctions only. Right: pie charts visualize the relative distribution of regulated junctions and corresponding genes. (b) Venn diagram of major Gene Ontology terms associated with liver-unique genes (top), BMM-unique genes (bottom) and common genes (overlap).
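The grouping rule in the caption (a minimum of three reads in WT or KO, and a two-fold threshold) can be sketched as follows. How read counts were normalized between samples and how zero counts were handled is not stated here, so the use of raw counts and a pseudocount of 1 are assumptions made only for this illustration, and the function name is hypothetical.

```python
def classify_junction(wt_reads, ko_reads, min_reads=3, fold=2.0, pseudocount=1.0):
    """Label a junction as up-, down- or unregulated in KO relative to WT.

    Returns None when neither sample reaches min_reads. The pseudocount and
    the use of raw (unnormalized) counts are assumptions of this sketch.
    """
    if max(wt_reads, ko_reads) < min_reads:
        return None  # junction fails the minimum read requirement
    ratio = (ko_reads + pseudocount) / (wt_reads + pseudocount)
    if ratio >= fold:
        return "upregulated"
    if ratio <= 1.0 / fold:
        return "downregulated"
    return "unregulated"


# Toy counts as (WT reads, KO reads).
examples = {"j1": (4, 20), "j2": (20, 4), "j3": (10, 12), "j4": (1, 2)}
labels = {name: classify_junction(wt, ko) for name, (wt, ko) in examples.items()}
print(labels)
```

A pseudocount keeps the ratio defined for junctions that are seen only in the KO sample, which is the situation expected for strongly NMD-sensitive isoforms.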
muting NMD is in agreement with earlier studies [7,18,19,32]. We also determined the GO terms associated with genes containing upregulated PTC+ junctions unique to either liver or BMMs (Figure 2b). In the BMM-specific set, genes involved in G-protein coupled receptor function were enriched (P-value 1.5 × 10^-4, Bonferroni corrected; Table S4 in Additional file 5), whereas the liver specific set was strongly associated with mitochondrion, among others (P-value 9.7 × 10^-56, Bonferroni corrected; Table S4 in Additional file 5).
In summary, we find that 43 to 44% of all predicted PTC+ junctions are upregulated in Upf2-ablated tissues and that 516 of these junctions are common in BMM and liver, several of which are well-known NMD targets. Moreover, we show that NMD is predicted to downregulate isoforms of 16 to 28% of all expressed genes (excluding the well-described regulation of splicing factors from the analysis). In addition, NMD regulates genes involved in mitochondria and G-protein-coupled receptor functions in liver and BMM, respectively, suggesting that NMD also serves tissue-specific functions.
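The 50 nucleotides rule that underlies the PTC predictions in this section can be written as a short check. This is a minimal sketch, not the RAINMAN implementation: it assumes transcript (spliced mRNA) coordinates, with the stop position taken as the last base of the stop codon, and the function name is hypothetical.

```python
def is_ptc(stop_pos, junction_positions, min_distance=50):
    """Return True if a stop codon is predicted to elicit NMD.

    stop_pos: transcript coordinate of the last base of the stop codon.
    junction_positions: transcript coordinates of the exon-exon junctions.
    A stop counts as premature when at least one junction lies min_distance
    or more nucleotides downstream of it (the 50 nucleotides rule).
    """
    return any(j - stop_pos >= min_distance for j in junction_positions)


# Toy transcript with junctions at positions 120, 250 and 420.
junctions = [120, 250, 420]
print(is_ptc(300, junctions))  # last junction is 120 nt downstream of the stop
print(is_ptc(400, junctions))  # last junction is only 20 nt downstream
print(is_ptc(500, junctions))  # stop lies in the last exon
```

Only the first stop would be called a PTC; the other two resemble normal stop codons, which sit either in the last exon or within 50 nucleotides of the final junction.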
Loss of NMD leads to the selective stabilization of alternative splicing events
We next compared the impact of ablating NMD on canonical versus alternative splicing in more detail. Here, a canonical junction is defined as a splicing junction between two consecutive exons of the longest RefSeq isoform for each gene, and AS junctions are thus all other combinations. In total, our pipeline detected approximately 10,000 unique AS junctions in BMMs compared to approximately 33,000 in the liver (Figure 3). A significant challenge in analyzing expression changes upon NMD ablation is to distinguish primary from secondary effects. Using the sensitive PTC-detection in RAINMAN, we found that close to 50% of AS was predicted to generate a PTC (Figure 3a, bottom), which is in stark contrast to canonical splicing junctions (Figure 3a, top). This suggests that NMD directly degrades 16% (0.47 × 0.35) of all AS in BMMs and 18% in liver (0.45 × 0.39), resulting in more than a two-fold reduction of the involved isoforms.
The BMM and liver samples demonstrated essentially similar fractions of up- and downregulated AS junctions (Figure 3a). However, the number of unique AS junctions was higher in liver samples, even after calibrating for the number of mapped junction reads (and thereby sequencing depth; Figure 3b, top). To investigate whether this was a characteristic feature of the organs or whether it was due primarily to muting of NMD, we analyzed the AS complexity per gene (Figure 3b). The number of mapped junction reads was first normalized to BMM WT (Figure 1a) by stochastically removing reads from BMM KO, liver WT and liver KO, to exclude any skewing due to differences in sequencing depth. Interestingly, these data demonstrate an inherently different AS profile between the two tissues. Whereas BMM WT and KO showed similar proportions of AS, measured as the correlation between the number of AS junctions and the total number of junctions per gene (Figure 3b, top), the proportion of AS of the liver KO sample increased considerably more compared to both BMM KO and liver WT (Figure 3b, top). This strongly suggests that the liver has a high level of AS relative to BMM, both normally and through ablation of NMD.
Finally, we also compared the proportion of PTC+ junctions, measured as the number of PTC+ junctions as a function of total splicing for each gene (Figure 3b, bottom). The PTC+ junctions followed the same trend as for AS junctions, with a steeper relative increase in PTC+ junctions in liver KO compared to liver WT and BMM. From linear interpolation, we found that, on average, 27% of all spliced junctions in a given gene in the liver are predicted to result in a PTC (liver KO slope of 0.27; Figure 3b).
In conclusion, these data demonstrate that approximately 17% of all AS is downregulated via NMD more than two-fold. We furthermore show that liver is characterized by a high degree of AS compared to BMMs and that close to one-third of all uniquely spliced junctions in the liver are predicted to elicit a PTC, thus giving the first in vivo evidence of the pervasive effect and importance of AS-NMD in the mammalian organism.
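The estimate of roughly 17% can be retraced from the two products quoted in this section. The sketch below assumes our reading of those products, namely that 0.47 and 0.45 are the PTC+ fractions among upregulated AS junctions and 0.35 and 0.39 are the upregulated fractions of all AS junctions in BMM and liver; the function name is hypothetical.

```python
def directly_degraded_fraction(ptc_fraction, upregulated_fraction):
    """Fraction of all AS junctions that are both upregulated in KO and PTC+."""
    return ptc_fraction * upregulated_fraction


# Factors as quoted in the text; the assignment of each factor is an assumption.
bmm = directly_degraded_fraction(0.47, 0.35)
liver = directly_degraded_fraction(0.45, 0.39)
mean = (bmm + liver) / 2

print(f"BMM:   {bmm:.1%}")
print(f"Liver: {liver:.1%}")
print(f"Mean:  {mean:.1%}")
```

The two tissue values round to the 16% and 18% given above, and their mean corresponds to the approximately 17% stated in the conclusion.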
Splice isoform classes are differently affected by NMD ablation
As described above, we detected a significant increase in AS upon ablating UPF2. It has, however, not previously been described to what extent NMD affects different splicing classes. Transcriptome-wide studies where NMD has been muted have primarily looked at single cassette exon skipping (ES) events [28,32], thus disregarding many physiologically relevant AS events. To this end, we implemented an AS isoform classification module that allowed us to infer the degree to which all major splice isoform classes were affected by NMD. Hence, our splice isoform inference pipeline assessed the distribution of splice junctions to seven major groups of AS events, namely single exon skipping (SES) and multiple exon skipping (MES), alternative 5’ splice site (A5SS) and alternative 3’ splice site (A3SS), mutually exclusive exons (MXE), alternative first exon (AFE) and alternative last exon (ALE) (see Table 1, Supplementary Materials in Additional file 1 and Table S6 in Additional file 6 for pipeline performance). We narrowed our downstream splice isoform analysis to those that demonstrated a change in
‘percent spliced in’ (PSI) between KO and WT (ΔPSI) higher than 20%. PSI is calculated from the ratio of junctions supporting a given feature (for example, the inclusion of an exon) versus junctions supporting the reciprocal event (for example, the skipping of the same exon), and the ΔPSI is thus a metric of how much the inclusion changes upon NMD ablation (see Supplementary Methods in Additional file 1 for further details). We found that approximately 40% of all ES events with a ΔPSI higher than 20% (inclusion) or lower than -20% (exclusion) in the KO tissues were upregulated due to stabilization of a PTC+ isoform (Table 1). Next, we subdivided ES events into PTC upon inclusion and exclusion events. An example of a PTC upon exclusion event that is stabilized upon NMD ablation in both BMM (ΔPSI = -61%) and liver (ΔPSI = -63%) is Mgea5, which was also found upregulated in immortalized mouse embryonic fibroblasts (MEFs) ablated for NMD [32]. We validated this Mgea5 isoform among others by RT-PCR (Figure 4b). In the liver, 55% of all single exon inclusion events generated a PTC, whereas 40% were found in the BMM (compare PTC upon inclusion to total inclusion events for single ES in Table 1). Tmem183a is an example of a highly skipped PTC upon inclusion event that was found stabilized in BMM (ΔPSI = 51%), liver (ΔPSI = 54%) and immortalized MEFs ablated for NMD [32], and was similarly validated by RT-PCR (Figure 4b).
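The PSI and ΔPSI metrics used above can be sketched in a few lines. The exact formula is given in the supplementary methods and is not reproduced here, so the common definition used below (inclusion reads divided by inclusion plus exclusion reads) is an assumption, as are the function names and the toy read counts.

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced in: share of junction reads supporting inclusion.

    Assumes PSI = inclusion / (inclusion + exclusion) * 100; returns None
    when no junction read supports either event.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return 100.0 * inclusion_reads / total


def delta_psi(ko, wt):
    """Change in PSI between KO and WT; ko and wt are (inclusion, exclusion)."""
    psi_ko, psi_wt = psi(*ko), psi(*wt)
    if psi_ko is None or psi_wt is None:
        return None
    return psi_ko - psi_wt


# Toy counts: the exon is included more often in KO than in WT.
d = delta_psi(ko=(30, 10), wt=(10, 30))
print(d, abs(d) > 20)
```

A positive ΔPSI above 20% would place this toy event among the inclusion events, and a value below -20% among the exclusion events.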
[Figure 3, panel (a): pie charts for BMM, Liver and Common junctions, shown separately for canonical and alternative splice junctions (upregulated, unregulated, downregulated), each with a second pie chart giving the PTC+ versus PTC- fraction of upregulated junctions; pie charts omitted]
[Figure 3, panel (b): number of AS junctions (top) and number of PTC+ junctions (bottom) plotted against the number of junctions per gene (both axes 0 to 200) for BMM and Liver, KO and WT, with fitted slopes]
Number of unique AS junctions per million mapped junction reads:
Sample     Unique AS junctions   Mapped junction reads (millions)   Per million
BMM WT      8,107                2.195                              3,693
BMM KO      9,447                2.470                              3,824
Liver WT   25,390                5.341                              4,753
Liver KO   31,041                4.573                              6,787
Figure 3 Loss of UPF2 leads to increased AS-NMD. (a) Proportions of regulated canonical junctions versus junctions supporting AS events (red, upregulated; yellow, unregulated; green, downregulated), with additional distribution pie charts underneath displaying the predicted percentage of PTC+ junctions versus PTC- junctions (junctions that do not elicit a PTC) of upregulated junctions (black, PTC+; grey, PTC-). (b) The number of AS junctions and the number of PTC+ junctions are shown as a function of number of junctions per gene in the top and bottom panels, respectively. The slopes of the curves were calculated by linear interpolation. In total, our pipeline detected 10,061 unique AS junctions in BMMs (8,107 in WT and 9,447 in KO) compared to 33,164 in the liver (25,390 in WT and 31,041 in KO), and calculations in the table (top) show the number of unique AS junctions per million mapped reads for each sample. No minimum read cutoff was used in (b).
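The depth-corrected values in the Figure 3 table can be recomputed from numbers reported in this article. The sketch below takes the unique AS junction counts from the caption above and the mapped junction read counts from Figure 1a; the variable names are hypothetical, and small differences to the figure values come from the figure using read counts given to three decimals.

```python
# Unique AS junctions per sample (Figure 3 caption).
as_junctions = {
    "BMM WT": 8107,
    "BMM KO": 9447,
    "Liver WT": 25390,
    "Liver KO": 31041,
}

# Reads mapped to junctions per sample (Figure 1a).
junction_reads = {
    "BMM WT": 2_195_422,
    "BMM KO": 2_469_916,
    "Liver WT": 5_341_681,
    "Liver KO": 4_573_723,
}

# Unique AS junctions per million mapped junction reads.
per_million = {
    sample: as_junctions[sample] / (junction_reads[sample] / 1e6)
    for sample in as_junctions
}

for sample, value in per_million.items():
    print(f"{sample}: {value:.0f}")
```

Even after this correction for sequencing depth, the liver KO sample shows the highest density of unique AS junctions, in line with the text.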
Interestingly, for alternative usage of A5SSs and A3SSs, we saw an even higher proportion of PTC+ events, especially in the liver (compare the fourth and fifth rows in BMM versus liver tissue in Table 1), suggesting a previously unrecognized importance for these splice events in AS-NMD. The ribosomal protein gene Rps12 is an example of a gene that generates a PTC+ isoform as a result of A5SS in both BMM (ΔPSI = 50%) and liver (ΔPSI = 55%), and likewise for Smg5 in BMM (ΔPSI = 37%) and liver (ΔPSI = 50%) as a result of A3SS (Table 1).
Using stringent criteria (Supplementary Methods in Additional file 1), we detected 44 MXE events in the BMM KO, but more than 10 times as many in the more splice-prone liver KO tissue (Table 1). Out of the detected MXE events, few were predicted to elicit a PTC+ isoform, possibly due to the regulated nature of a mutually exclusive splice event, which in most instances would not be predicted to undergo NMD. An exception is the pyruvate kinase gene Pkm2, which encodes two different enzymatically active isoforms due to a single MXE event [33]. In the KO samples, and this was particularly true for BMMs, aberrant splicing resulted in inclusion of both mutually exclusive exons, resulting in a PTC+ isoform (validation in Figure 4b and schematic in Figure 4d). This may indicate an important role for NMD in ensuring the exclusive incorporation of exons in MXE events.
Finally, we quantified the number of isoforms with an increase in AFE or ALE. From transcriptome data alone, it is often impossible to determine whether an AFE is the result of alternative promoter usage or AS between two mutually exclusive first exons and the second exon. Nevertheless, AFE results in alternative transcripts with potential PTCs. Analysis of AFE showed that more than one-third of the detected AFE events are predicted to generate a PTC+ isoform (Table 1). For ALE categorization, we required AS to two mutually exclusive 3’ terminal exons, and PTC+ prediction is thus not meaningful. Changes in levels of ALE are therefore most likely secondary effects of altered splicing upon UPF2 ablation. A functional mechanism for ALE is the regulated inclusion of microRNA target sites in the 3’ UTR, and this potentially adds another layer of complexity when studying conditions where splicing is perturbed, such as by removing NMD.
To test the accuracy of our splice isoform detection algorithm, we chose 50 RAINMAN-predicted AS events (ΔPSI > 20% (inclusion), or < -20% (exclusion)) for RT-PCR on independent liver and BMM material. We validated AS events that are not predicted to result in a PTC+ isoform (Figure 4b, top row), isoforms stabilized in the KO due to inclusion of a PTC (Figure 4b, middle row) and isoforms that elicit a PTC due to a skipping event (Figure 4b, bottom row). Out of the 50 tested AS events, we were able to validate 49 (98%), and we therefore conclude that our splice isoform detection algorithm is highly accurate in predicting different splice isoform events.
To validate the inferred expression changes and to assess inter-sample replicate variability, we performed quantitative PCR experiments on biological replicates. This analysis showed a strong correlation with the
Table 1 Splice events and PTC+ classification

Tissue  Splice event  Total   PTC+    Percentage  Exclusion  PTC upon   Inclusion  PTC upon
                      events  events  PTC+        events     exclusion  events     inclusion
BMM     Total ES      730     281     38%         428        201        302        80
        Single ES     511                         310        148        201        80
        Multiple ES   219                         118        53         101        -
        A5SS          130     68      52%
        A3SS          238     110     46%
        MXE           44      5       11%
        AFE           130     44      34%
        ALE           49      NA      NA
Liver   Total ES      3,102   1,285   41%         1,926      965        1,176      320
        Single ES     1,505                       932        531        573        320
        Multiple ES   1,597                       994        434        603        -
        A5SS          449     263     59%
        A3SS          654     370     57%
        MXE           475     66      14%
        AFE           172     66      38%
        ALE           143     NA      NA

Splice isoform classes are shown for both single and multiple exon skipping (ES), alternative 5’ splice site (A5SS), alternative 3’ splice site (A3SS), mutually exclusive exons (MXE), alternative first exon (AFE) and alternative last exon (ALE). For ALE, no events were classified as PTC+, since the class relies on splicing to two different last exons. Here, a ΔPSI > 20% was required for all classes. NA, not applicable.
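As a quick arithmetic check on Table 1, the following Python sketch recomputes the Total ES rows from the Single ES and Multiple ES rows. The dictionary layout and the function name are illustrative only and are not part of RAINMAN; the '-' entries for PTC upon inclusion in multiple ES are treated as 0, which is an assumption.

```python
# Exon-skipping counts transcribed from Table 1.
# Tuple layout: (total, exclusion, PTC upon exclusion, inclusion, PTC upon inclusion).
# Assumption: the '-' entry for multiple ES is counted as 0.
TABLE1_ES = {
    "BMM": {
        "single": (511, 310, 148, 201, 80),
        "multiple": (219, 118, 53, 101, 0),
    },
    "Liver": {
        "single": (1505, 932, 531, 573, 320),
        "multiple": (1597, 994, 434, 603, 0),
    },
}


def summarize_es(tissue):
    """Return (total ES events, PTC+ ES events, rounded percentage PTC+)."""
    rows = TABLE1_ES[tissue].values()
    total = sum(row[0] for row in rows)
    ptc_plus = sum(row[2] + row[4] for row in TABLE1_ES[tissue].values())
    return total, ptc_plus, round(100 * ptc_plus / total)


for tissue in TABLE1_ES:
    print(tissue, summarize_es(tissue))
```

Under these assumptions the recomputed totals agree with the Total ES rows of the table (730, 281 and 38% for BMM; 3,102, 1,285 and 41% for liver).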
Weischenfeldt et al. Genome Biology 2012, 13:R35 http://genomebiology.com/2012/13/5/R35
[Figure 4 image. Panels: (a) schematic of the splice isoform classes: exon skipping (ES), alternative 5’ splice site (A5SS), alternative 3’ splice site (A3SS), alternative first exon (AFE), alternative last exon (ALE) and mutually exclusive exon (MXE); (b) RT-PCR gels labeled by gene and event class; (c) immunoblotting for UPF2 and β-actin, with an asterisk marking one band; (d) Pkm2 spliced isoforms: M1 isoform (exons 8, 9, 11), M2 isoform (exons 8, 10, 11) and the NMD-sensitive M1/M2 isoform (exons 8, 9, 10, 11, with a STOP).]
Figure 4 Splice isoform classes are differentially affected by loss of UPF2. (a) Schematic drawing of the main isoform classes detected in our pipeline. (b) RT-PCR validation of 25 splicing events predicted from our pipeline (see Table S9 in Additional file 12 for a list of the 49 validated events out of 50 tested). Top: normal AS events. Middle: PTC upon inclusion events. Bottom: PTC upon exclusion by ES. (c) Western blotting from two different liver pairs, showing UPF2 (rabbit α-UPF2), SRp55, SRp40 and SRp30 (mouse α-mAb104) and β-actin (rabbit α-actin). The asterisk denotes the truncated UPF2 isoform found in cells ablated for NMD. (d) MXE splicing of Pkm2 and the inclusion of both mutually exclusive exons in the KO sample.
RNA-seq inferred expression changes (R = 0.975, Pearson’s correlation; Figure S3A in Additional file 7) and a very low sample replicate variability (Figure S3B in Additional file 7). In addition, we assessed our ability to correctly infer PTC+ isoforms by cloning and sequencing the full-length isoforms of Srsf9, a gene with a PTC-upon inclusion event (ΔPSI = -40% in liver; isoform not detectable in BMMs). The predicted PTC+ isoform was precisely recapitulated in vivo (Figure S3C in Additional file 7), further validating the usefulness of our approach.
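The ΔPSI thresholds used throughout this section can be illustrated with a minimal Python sketch. It assumes that PSI is the percentage of junction reads supporting exon inclusion and that ΔPSI = PSI(KO) - PSI(WT); the exact definition used by RAINMAN is not restated here, and the function names are hypothetical.

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced in: share of reads supporting inclusion (assumed definition)."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None  # event not covered by any junction reads
    return 100.0 * inclusion_reads / total


def classify_event(psi_wt, psi_ko, threshold=20.0):
    """Label an event by its change in PSI between WT and KO (in percentage points)."""
    delta_psi = psi_ko - psi_wt
    if delta_psi > threshold:
        return "inclusion"
    if delta_psi < -threshold:
        return "exclusion"
    return "unregulated"


# Hypothetical read counts: 10/90 inclusion/exclusion reads in WT, 50/50 in KO.
print(classify_event(psi(10, 90), psi(50, 50)))
```

With these made-up counts, PSI rises from 10% to 50%, so the event would fall in the inclusion class at the 20% threshold.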
Loss of NMD leads to deregulation of PTC- isoforms through breakdown of splice factor homeostasis
We have demonstrated that only 50% of upregulated AS junctions are predicted to be the direct result of stabilized PTC+ isoforms, suggesting that a major proportion of the increased AS is the indirect result of perturbed NMD (Figure 3a). As we have shown here, the liver has a high degree of AS (Figures 1 to 3), and it has been found to have a prominent divergent expression pattern of SR proteins and hnRNPs compared to other mammalian tissues [34]. Hence, we hypothesize that this could make the liver particularly susceptible to changes in splicing patterns upon ablation of NMD.
Splicing factors are known to utilize AS-NMD to autoregulate their own synthesis in a negative feedback loop, through the selective inclusion or exclusion of PTC+ exons [4,18]. Indeed, our RNA-seq analysis also identifies a range of these factors among the core NMD targets (Figure 2; Table S3 in Additional file 4). However, as most of the NMD-sensitive splicing factor isoforms are predicted to encode truncated proteins, the rescue of these upon NMD ablation is not expected to affect AS per se (unless they encode proteins with dominant negative properties). Previous microarray-based studies have not addressed the formal possibility that the stabilization of PTC+ isoforms upon loss of NMD could occur against the backdrop of changed levels of the canonical isoforms capable of encoding the full-length form of proteins such as splicing regulators. To test this possibility, we scrutinized our datasets for changes in the expression levels of 25 canonical mRNA isoforms capable of encoding full-length splicing regulators by correcting the changes in the gene expression ratios between KO and WT samples with the fraction contributed by the PTC+ isoforms (Table S5 in Additional file 1). Strikingly, for the 19 splice factors for which we have evidence for AS-NMD in the liver, 10 displayed a > 1.5-fold deregulation, with 5 displaying increased (Sfrs3, Srfs4, Tra2a, Srfs16, Hnrpdl) and 5 decreased (Tra2b, Hnrnpf, Hnrnph3, Hnrnpr, Ptbp2) expression of the canonical mRNA isoforms. Moreover, western blot analysis revealed that SRp30 and SRp40 protein levels were elevated in KO liver, demonstrating that the changes in canonical mRNA isoform levels for at least some splicing factors are indeed translated into full-length protein.
Collectively, these findings support a model where the increase in global AS upon loss of NMD is at least in part caused by a massive deregulation of key splicing factors, and further help to explain the PTC- AS fraction deregulated upon muting NMD. A highly aberrant level of splice regulators would likely tilt the normal balance between activators and repressors and deregulate a large subset of exons [4,18].
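The correction of gene-level expression ratios for the PTC+ contribution can be sketched as follows. This is one plausible reading of the procedure, not the authors' exact calculation: it assumes that the canonical isoform level equals the gene-level expression multiplied by the fraction of expression not contributed by PTC+ isoforms. All names and numbers below are hypothetical.

```python
def canonical_fold_change(gene_wt, gene_ko, ptc_fraction_wt, ptc_fraction_ko):
    """KO/WT ratio of canonical (PTC-) expression after removing the PTC+ share.

    Assumption: canonical level = gene-level expression * (1 - PTC+ fraction).
    """
    canonical_wt = gene_wt * (1.0 - ptc_fraction_wt)
    canonical_ko = gene_ko * (1.0 - ptc_fraction_ko)
    if canonical_wt == 0:
        raise ValueError("no canonical expression in WT; ratio undefined")
    return canonical_ko / canonical_wt


def deregulation_call(fold_change, cutoff=1.5):
    """Apply the > 1.5-fold criterion in both directions."""
    if fold_change > cutoff:
        return "increased"
    if fold_change < 1.0 / cutoff:
        return "decreased"
    return "unchanged"


# Hypothetical gene: total expression doubles in KO, but 80% of the KO signal
# comes from a stabilized PTC+ isoform, so the canonical isoform actually drops.
fc = canonical_fold_change(gene_wt=100.0, gene_ko=200.0,
                           ptc_fraction_wt=0.1, ptc_fraction_ko=0.8)
print(round(fc, 3), deregulation_call(fc))
```

The example shows why the correction matters: a gene that appears upregulated at the gene level can still have reduced levels of the isoform encoding the full-length protein.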
Ablating NMD leads to skipping of highly conserved cassette exons
SES represents the best characterized AS event, and is a frequent PTC-generating mechanism. RAINMAN classified 40% and 60% of all SES as PTC+ events in the BMM and liver KO samples, respectively (Table 1 - compare PTC upon exclusion with total exclusion events, and PTC upon inclusion with total inclusion events). Out of the PTC upon inclusion events, a total of 26 events were upregulated in both KO tissues (ΔPSI > 20%) and 51 PTC upon exclusion events were similarly upregulated in both KO tissues (ΔPSI < -20%; manually curated lists are included in Table S7 in Additional file 8 and Table S8 in Additional file 9). In support of an important physiological function, almost all of the genes harboring these highly upregulated PTC+ isoforms were also upregulated at the gene level in KO samples (Table S7 in Additional file 8 and Table S8 in Additional file 9), suggesting that stabilization of the PTC+ isoforms led to an appreciable increase in total mRNA levels in KO tissues. It should be noted that most of the splicing factors and ribosomal proteins known to use AS-NMD were indeed upregulated in both tissues, but had ΔPSI > -20% or < 20% in BMMs (data not shown). Hence, the high frequency of PTC+ splice events suggests that these are regulated AS events. It is generally found that AS exons are more conserved than constitutive exons, especially in the flanking introns [35]. The core members of the SR protein family and hnRNPs have been proposed to use AS-NMD in an autoregulatory loop, by binding to highly conserved regions, termed ultraconserved elements (UCEs), to elicit the PTC+ splice event [18,20]. Upon inhibition of NMD in cell lines, it has been shown that many PTC+ exons are surrounded by high intronic conservation [18,19].
As described above, our sensitive approach demonstrated a high proportion of PTC- AS events upregulated in KO tissues, and we therefore examined the conservation of PTC+ and PTC- single exon exclusion (ΔPSI < -20%) and inclusion (ΔPSI > 20%) events in the KO samples compared to the unregulated skipping events. The latter control group
contains exons that undergo AS but are not differentially skipped between WT and KO (-20% < ΔPSI ≤ 20%). We first examined exclusion events, and found that flanking introns of the skipped exons are significantly more conserved compared to unregulated skipping events (Figure 5a; P-value < 2.2 × 10^-16, Kolmogorov-Smirnov test), and this was even more pronounced in BMMs (Figure S4 in Additional file 10). Interestingly, flanking introns of PTC+ upon exclusion events were less conserved than PTC- exclusion events, but displayed a significantly higher exonic conservation (P-value < 2.2 × 10^-16, Kolmogorov-Smirnov test). We next considered exons that demonstrated increased inclusion in KO samples. In these cases, flanking introns were also significantly more highly conserved compared to skipping events not regulated by NMD (Figure 5b). Again, PTC- inclusion events were more highly conserved compared to PTC+ events in the flanking introns.
In summary, these results demonstrate that 40 to 60% of upregulated SES events are predicted to undergo NMD, depending on tissue type, and that these splice events are highly conserved. Importantly, PTC- inclusion and exclusion events displayed higher intronic conservation compared to PTC+ events, strongly suggesting that the UCE-containing splice regulator isoforms (which are all PTC+) are not the primary reason for the high intronic conservation surrounding skipped exons. Moreover, removal of UCE elements from the analysis did not alter the intronic conservation profiles of either PTC- or PTC+ events (data not shown). Thus, the high conservation for both included and excluded exons appears to be a general attribute of highly regulated cassette exons (ΔPSI > 20%) that are affected directly and indirectly by NMD. The deregulation of PTC- events is therefore consistent with a mechanism where deregulated levels of splice regulators in the KO samples markedly affect regulated splicing of their cognate target exons. This again implies that ablating Upf2, and by inference NMD, has important secondary effects on splicing factor homeostasis and that this impacts widely on global AS.
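The Kolmogorov-Smirnov comparisons reported above rest on the maximum distance between two empirical cumulative distributions. The Python sketch below computes only that two-sample D statistic, using the standard library; it does not compute a P-value. Which values were actually compared (for example, per-event mean intronic conservation scores) is an assumption, and the scores shown are invented.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every data point of either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-event mean conservation scores (0 to 1).
regulated = [0.5, 0.6, 0.9]
unregulated = [0.1, 0.4, 0.7]
print(ks_statistic(regulated, unregulated))
```

In practice a library routine such as `scipy.stats.ks_2samp` would return both the statistic and a P-value; the hand-written version is kept here only to make the definition explicit.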
Muting NMD leads to increase in splicing by-products
We have demonstrated that removing UPF2, and by inference NMD, causes a dramatic increase in AS, in part by stabilizing PTC+ AS isoforms. Apart from AS-NMD, NMD has been implicated in removal of genomic
[Figure 5 image. Panels: (a) Exon Exclusion Events upregulated in liver KO; (b) Exon Inclusion Events upregulated in liver KO. y-axis: Conservation (0.0 to 1.0); x-axis: nucleotide intervals across exons and flanking introns (ticks at 25 and 75). Legend: PTC+ single exon skipping upregulated; PTC- single exon skipping upregulated; single exon skipping unregulated; mm9 RefSeq.]
Figure 5 Introns flanking regulated exons are highly conserved. (a, b) Mean per-position phastCons conservation scores around SES events are shown for exclusion events upregulated in the KO sample (a), and for inclusion events upregulated in the KO sample (b). Shown are PTC+ exclusion/inclusion events (red line), PTC- exclusion/inclusion events (green line) and unregulated skipping events (grey line). Yellow lines are scores for all mm9 RefSeq exons and 75 bp into surrounding introns. Data shown are for liver. Exclusion events: PTC+, 439; PTC-, 251. Inclusion events: PTC+, 64; PTC-, 162. Unregulated events: 3,494. Mm9 RefSeq exons: 274,281. See Figure S4 in Additional file 10 for a graph with BMM data. Numbers on the x-axis indicate nucleotide intervals - 25 and 75 nucleotides for exons and flanking introns, respectively. Curves represent a cubic smoothing spline fitted to data.
noise and splice errors that might otherwise lead to a panoply of spurious isoforms [7,36-38]. The extent to which NMD destroys such splice by-products has not, however, been studied on a transcriptome-wide basis in mammalian tissues. We therefore examined the relative expression level of PTC+ junctions, measured as the PTC+ fraction (PTC-generating junctions/All other junctions) against the junction expression level (Figure 6), binned by the RNA-seq metric RPKM (reads per kb of gene model per million mapped reads, refer to [39]). These data show that the PTC+ fraction is enriched in liver compared to BMM over the entire expression profile, further emphasizing that the liver is characterized by an increased abundance of PTC+ isoforms compared to BMM. Importantly, there is a distinct overrepresentation of PTC+ isoforms in the lower expression range for both liver (Figure 6 - compare dark red and light red for liver KO and WT, respectively) and BMM (Figure 6 - compare dark blue with light blue for BMM KO and WT, respectively) relative to more highly expressed isoforms. These data thus confirm and quantify, on a global scale, another major role for NMD in removing low-abundance isoforms such as the products of erroneous splicing.
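The binning described above can be sketched in Python as follows. The RPKM formula is the standard one (reads per kb of feature per million mapped reads); how RAINMAN defines the effective length of a junction is not stated here, so the length argument is an assumption, as are the function names and the example junctions.

```python
from collections import defaultdict


def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def ptc_ratio_by_bin(junctions, bin_width=1.0):
    """Group junctions into RPKM bins and return the PTC+/PTC- ratio per bin.

    junctions: iterable of (rpkm_value, is_ptc_plus) pairs.
    Bins without any PTC- junction get None instead of a ratio.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [PTC+ count, PTC- count]
    for value, is_ptc_plus in junctions:
        bin_index = int(value // bin_width)
        counts[bin_index][0 if is_ptc_plus else 1] += 1
    return {
        bin_index: (plus / minus if minus else None)
        for bin_index, (plus, minus) in sorted(counts.items())
    }


# Hypothetical junctions: (RPKM, predicted to generate a PTC?)
example = [(0.5, True), (0.7, False), (0.9, False), (3.2, True), (3.8, True)]
print(ptc_ratio_by_bin(example))
```

Returning None for bins that lack PTC- junctions avoids a division by zero; with no minimum read cutoff, as in Figure 6, sparsely populated bins of this kind are to be expected at high expression levels.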
PTC exons are distinct from normal stop codon exons
We have shown that NMD-affected SES events demonstrate a high conservation score and that this is not due to UCE-containing SR proteins and hnRNPs. The question then arises to what extent the PTC+ exon itself is aberrant compared to other expressed exons, and in particular the normal stop codon-containing exon. We thus examined the conservation of the exonic sequence surrounding the PTC in upregulated PTC+ exons, as evidenced by upregulated PTC+ junctions. To normalize for normal exonic conservation variation and for biases in tissue-specific exon conservation, we subtracted the conservation scores of random exonic sequences expressed in the same samples. Interestingly, we found that PTC+ exons had a markedly different conservation profile compared to normal stop codon exons (Figure 7a; for comparison, see unnormalized conservation score in Figure S5 in Additional file 11). Hence, exonic sequence conservation increased towards the PTC, and displayed only a minor drop in relative conservation score in nucleotides following the PTC. Importantly, the PTC and surrounding nucleotides displayed an overall higher conservation score relative to random exons, and this was particularly striking for PTCs in BMM. Notably, removal of the UCE-containing
[Figure 6 image. x-axis: Isoform expression (Junction RPKM), ticks at 1, 2, 5, 10 and 20; y-axis: PTC+ splicing fraction (0.00 to 0.30). Series: BMM WT, BMM KO, Liver WT, Liver KO.]
Figure 6 Low-expressed junctions are enriched in Upf2 KO samples. PTC-content (y-axis) in junctions binned by expression (RPKM; x-axis). All junctions for samples were binned in increments of 1 RPKM and the PTC+/PTC- ratio for each bin was calculated. No minimum read cutoff was applied.
[Figure 7 image. (a) Nucleotide conservation at PTCs vs. normal STOP codons. x-axis: position in 100 bp window, STOP codon centered (0 to 100); y-axis: ΔphastCons score (stop codon exonic score - random exonic score), -0.25 to 0.10. Series: BMM PTC+, Liver PTC+, Normal RefSeq STOP. (b) Stop codon distribution. y-axis: codon frequency (0.0 to 0.7); TAA, TAG and TGA frequencies are shown for the Intronic, Intergenic, RefSeq STOPs, Exonic, BMM PTC+ and Liver PTC+ groups; annotated P-values: P = 8.104e-5 and P = 7.166e-12.]
Figure 7 Conservation of regulated PTCs and surrounding exons. (a) Mean per-position phastCons scores are shown, centered on the PTC, for upregulated junctions in liver and BMM. To visualize conservation around PTCs in comparison to normal exonic areas, phastCons scores from a random sample (n = 4,000) of BMM and liver-expressed exons were subtracted from either sample. For normal STOPs, phastCons scores from random RefSeq exons (n = 4,000) were subtracted. Normal STOPs are based on all RefSeq transcript models. Ranges of scores do not extend into introns, and may be shorter than 100 bp for individual PTCs. For PTC+ junctions, a KO/WT fold change of 2 was required. BMM PTC+ positions: 884. Liver PTC+ positions: 3,091. Normal RefSeq STOP positions: 23,231. (b) Distribution of stop codons: for intronic, intergenic and exonic bins, all mm9 trinucleotides in all three reading frames were sampled. RefSeq STOPs represent normal stop codons for all 21,470 RefSeq transcript models (for genes with multiple models, the longest was used). For BMM (n = 497), PTC-inducing junctions were required to have a log2(KO/WT) fold change > 2, and a minimum of 5 reads for both genotypes summed. For liver (n = 670), PTC-inducing junctions were required to have a log2(KO/WT) fold change > 2, and a minimum of 10 reads for both genotypes summed. Fisher’s exact test was used to test for significance.
core splicing factor genes did not influence the conservation profile in liver or BMM (data not shown). The BMMs are characterized by a more stringent splice pattern, whereas the liver has a high degree of AS and increased PTC+ junctions expressed at low levels (Figures 2 to 4 and 6), pointing to an increased fraction of random splice by-products in the liver. Hence, the increased PTC+ exonic conservation in BMMs compared to the liver is likely due to a higher proportion of functionally relevant PTCs in BMMs, found in various vertebrates. Moreover, the relatively high conservation score at and after the PTC suggests that these sequences may harbor unrecognized cis elements that could assist in recognizing the PTC.
Finally, we also gauged whether PTCs associated with highly regulated PTC-inducing junctions displayed any preference for the identity of the stop codon (Figure 7b). Surprisingly, whereas the frequencies of normal canonical UAA, UAG and UGA RefSeq stop codons corresponded closely to the distribution observed in exonic regions, regulated PTCs had a marked and highly significant preference for UGA in both tissues. These findings suggest that the NMD-eliciting PTCs have been under selective pressure in order to adapt to the regulatory needs of AS-NMD.
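A stop codon preference of this kind can be examined with a codon tally followed by Fisher's exact test on a 2 x 2 table (for example, UGA versus other stop codons, in PTCs versus normal stops). The standard-library Python sketch below illustrates the calculation; the exact contingency table used for Figure 7b is not specified in this section, and the example counts are invented.

```python
from collections import Counter
from math import comb


def codon_frequencies(codons):
    """Relative frequency of each stop codon in a list such as ['UGA', 'UAA', ...]."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: counts[codon] / total for codon in ("UAA", "UAG", "UGA")}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
        for x in range(lowest, highest + 1)
    }
    observed = probs[a]
    p_value = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Invented counts: rows = PTCs / normal stops, columns = UGA / other stop codon.
print(codon_frequencies(["UGA", "UGA", "UAA", "UAG"]))
print(fisher_exact_two_sided(3, 1, 1, 3))
```

The small tolerance factor in the comparison guards against floating-point ties between equally likely tables; an established implementation such as `scipy.stats.fisher_exact` would normally be preferred for real data.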
Discussion
Here, we present the first RNA-seq study from genetically modified adult mice accompanied by a thorough comparative study of the role of NMD in two distinct murine tissues. Moreover, this is the first analysis to date of the global impact of NMD on all major classes of AS in untransformed mammalian cells.
To facilitate a detailed analysis of AS, we have generated a streamlined bioinformatic pipeline, RAINMAN, available to the scientific community, that maps RNA-seq reads to a comprehensive combinatorial exon-exon junction database to obtain maximum AS isoform information. This method allowed us to map 28% of all mappable reads to junctions (all samples combined; Figure 1) with a total discovery of 150,000 unique junctions, thus giving us high AS information. The junction data are further processed in the pipeline to predict NMD susceptibility and thus coding potential with high accuracy at single junction resolution. Finally, seven major splice isoform classes are inferred and processed for multiple comparison purposes.
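The idea of a combinatorial exon-exon junction database can be illustrated with a short Python sketch: for one gene, every upstream exon is paired with every downstream exon, so that reads spanning skipped exons can also be mapped. This is only an illustration of the concept; RAINMAN's actual construction (flank lengths, treatment of overlapping exons, strand handling) is not described in this section, and the coordinates below are made up.

```python
from itertools import combinations


def junction_database(exons):
    """Enumerate all candidate junctions between non-overlapping exon pairs.

    exons: list of (start, end) coordinates for one gene on the forward strand.
    Returns a list of (donor_position, acceptor_position) tuples, where the
    donor is the end of the upstream exon and the acceptor is the start of
    the downstream exon.
    """
    ordered = sorted(exons)
    return [
        (upstream[1], downstream[0])
        for upstream, downstream in combinations(ordered, 2)
        if upstream[1] < downstream[0]  # skip overlapping exon pairs
    ]


# Hypothetical four-exon gene: 3 canonical junctions plus 3 exon-skipping junctions.
gene_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
for junction in junction_database(gene_exons):
    print(junction)
```

For a gene with n non-overlapping exons this yields n(n-1)/2 candidate junctions, which is why such a database captures both single and multiple exon skipping events without prior annotation of the isoforms.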
Loss of NMD leads to a pronounced deregulation of alternative splicing
Using this pipeline, we have shown that ablating Upf2, and by inference NMD, impacts on several layers of AS in both BMMs and liver. By examining genes that harbor the exact same spliced junctions predicted to elicit a PTC in both tissues, our data confirmed that genes involved in mRNA processing are highly overrepresented among common NMD targets, as previously shown (Table S3 in Additional file 4). From our data, we estimate that close to one-third of all junction events supporting AS result in PTC+ and NMD-sensitive transcripts in the liver (Figure 3b), which is in agreement with earlier predictions [4]. Previous computational analyses of dbEST and SWISS-PROT have found that 8 to 12% of human genes are predicted to be targets of NMD [4,40], but these repositories inherently underestimate the true number of NMD-sensitive isoforms. Our data demonstrated that 16% and 28% of expressed genes in BMM and liver, respectively, were predicted to undergo NMD (upregulated more than two-fold in KO) of one or more isoforms. These numbers expand upon and underscore the global importance of NMD and in particular AS-NMD. Nearly all detected PTC+ events could be attributed to AS, and we found that approximately 50% of all AS in the KO was in fact predicted to elicit a PTC (Figure 3a), with the liver as the tissue with the most AS (Figure 3b). This complements previous findings showing that the liver has one of the highest levels of AS compared to other organs [2,34]. The high level of AS in the liver was also reflected in the fraction of low-level PTC+ junctions (Figure 6), most likely consisting of splice errors. Thus, our data make a strong case for NMD playing an important role in removing ‘splicing’ noise also in untransformed mammalian tissues.
In contrast to the splice error-generated PTC+ isoforms, our analysis of highly regulated splice classes (ΔPSI > 20% or ΔPSI < -20%) revealed that approximately 40% of all highly skipped exons were degraded by NMD (Table 1). These ES events were further subdivided into SES and MES events. Whereas three-quarters of all ES events in BMMs were SES, the proportion of SES and MES in liver was approximately equal. We speculate that the increased MES proportion reflects the more promiscuous splicing pattern in the liver, thus leading to novel AS between exons not normally spliced together.
Importantly, our analysis of conservation patterns in
SES revealed a high conservation score of NMD-regulated skipping exons and their flanking introns compared to that of unregulated events. This suggests that events deregulated in the absence of NMD are under tight control in normal NMD-proficient cells and that AS-NMD plays a crucial role in controlling the expression of a vast number of genes.
The splice isoform inference analysis expands the repertoire of NMD targets
From our splice isoform inference, we found a high proportion of PTC+ events in most classes with the notable
exception of MXEs (ALE events were not included in PTC classification). Due to the regulatory nature of MXEs, in which two coding exons compete for the selective inclusion, it is not surprising that this group is underrepresented in the PTC+ events. One interesting example was Pkm2, in which a well-characterized MXE event dictates whether the adult form (M1 isoform containing exon 9) or the embryonic isoform (M2 containing exon 10) is produced. In many tumors, the embryonic M2 isoform is switched on, leading to increased aerobic glycolysis [33]. However, we found that ablating UPF2, and by inference NMD, led to inclusion of both mutually exclusive exons with a resulting downstream PTC (Figure 4b, d), suggesting an interesting function for NMD in controlling MXE. We also provide the first report of a surprisingly high frequency of PTC+ A5SS and A3SS events, suggesting that the alternative usage of splice acceptor or donor sites is often utilized to regulate expression by AS-NMD. Another possibility could be that these events are more error-prone - for example, by the use of a strong and weak splice site (the alternative splice site), leading to a less stringent AS than between two strong splice sites such as SES. Nevertheless, our study significantly expands the known repertoire of NMD targets in all classes of AS, many previously uninvestigated on a global scale.
Common and tissue-specific functions of NMD
The impact of NMD on transcriptome composition has up to now only been studied in whole organisms or individual cell lines, which has therefore precluded any insights into conserved versus tissue-specific functions of NMD. Focusing first on core NMD targets, our stringent comparative analysis found 50 genes with a common highly upregulated PTC upon exon exclusion in BMMs and liver (Table S8 in Additional file 9). Almost all of these genes were upregulated in both tissues, suggesting a functional role for AS-NMD in regulating the abundance of transcripts with full coding potential from these genes (see below). Several of the known splicing factors previously shown to utilize AS-NMD in an autoregulatory loop, such as Ptbp1, demonstrated PTC upon exclusion in both tissues, but below ΔPSI of 20% in BMMs. Besides SR proteins and hnRNPs, most of the genes found upregulated are unknown targets for AS-NMD, such as Soat2/Acat2. The latter gene is involved in esterification of cholesterol, and is known to be regulated in a tissue-specific manner, with a particularly high expression level in liver and intestine and to a lesser extent in macrophages [41]. Here, we found that this gene was upregulated 8.4-fold in BMMs and 4.2-fold in liver upon muting NMD, suggesting an important regulatory function for AS-NMD in the synthesis of cholesterol esters.
We found 26 genes with a common PTC upon inclusion event (ΔPSI > 20%) in both BMM and liver KO (Table S7 in Additional file 8). An example of a gene with a PTC upon inclusion isoform is Nktr, which is exclusively expressed in and required for natural killer (NK) cells. We found that inhibiting NMD caused a high upregulation of a PTC+ isoform and a concomitant 2.4- to 3.9-fold upregulation at the gene level. It has been shown that aberrant isoforms of the Nktr gene are present in cells not expressing Nktr [42]. Here, we show that NMD mutes Nktr in BMMs and liver (and also in NMD ablated transformed MEFs [32]), suggesting that AS-NMD serves an important regulatory function in regulating the functional expression of this NK cell-specific protein. Interestingly, Nktr is involved in NK cell activity but contains RS repeats and a cyclophilin-domain found in several splicing factors, and this could confer on the protein properties sufficient to facilitate regulation of its own expression through AS-NMD. Thus, it could be that several other proteins not directly involved in splicing have gained RNA-binding properties that would allow them to autoregulate their own synthesis by AS-NMD.
Apart from the core NMD targets, our analysis also yielded the first insights into potential tissue-specific effects of NMD. Through GO analysis of parent genes of PTC+ junctions that were exclusively upregulated in either the BMM or liver datasets, we could show a tissue-specific role for NMD in the regulation of G-protein-coupled receptors and mitochondria function, respectively. Interestingly, these GO classes mirror the main biological functions of monocytes/macrophages and liver tissue, that is, in immunological reactions and energy metabolism, respectively, and we predict that NMD-dependent transcriptome analysis in other organs would uncover tissue-specific NMD targets in pathways of particular importance for the organ in question.
NMD controls the expression of a network of splicing factors
In contrast to the well-characterized role of AS-NMD in the autoregulation of individual splicing factors, the global consequences of its disruption on broad splicing patterns have not been studied in detail. We were therefore intrigued by finding that approximately 50% of the upregulated AS events in the NMD-deficient tissues were devoid of PTCs, suggesting that they were indirect targets of NMD and that the importance of NMD in AS extends well beyond AS-NMD.
Using conventional microarray-based gene expression analysis, we have previously shown that several splice factors were upregulated in both UPF2-deficient BMMs and liver [30,43]. However, these studies could not discriminate between canonical mRNA isoforms and
stabilized PTC+ isoforms, which is important as many of the latter would be predicted to encode truncated and thereby (most likely) non-functional proteins. Using our RNA-seq data, and correcting for the presence of the PTC+ isoforms, we can now show deregulated expression of the canonical mRNA isoform (encoding the full-length protein) for a prominent number of splice factors, a finding that was also corroborated by changes in protein levels for two of these. These findings suggest that the prominent deregulation of PTC- events, at least in part, can be explained by changes in the protein levels of splicing factors. This is further supported by the observation that the intronic regions surrounding deregulated PTC- exons are highly conserved, suggesting that they are subjected to splice factor-mediated regulation.
Our data therefore extend the importance of NMD to the regulation of PTC-deficient mRNA isoforms through the deregulation of splice factor levels. The mechanism(s) by which this occurs will be the subject of future studies. Our data also provide a clue as to why the liver is more affected by the loss of NMD than BMMs, as the former organ displays much higher levels of AS and is therefore predicted to be more sensitive to alterations in splicing factor levels.
Finally, and most importantly, these findings expand the functional impact of NMD on transcriptome behavior to also include substantial indirect effects on PTC- splicing events and highlight the crucial importance of this pathway as a gatekeeper of transcriptome integrity.
A role for the stop codon and its flanking regions in regulating NMD
The importance of PTC identity and its flanking regions in the regulation of NMD is a relatively unexplored area and has not previously been addressed on a global scale. Here we were able to show that regions flanking regulated PTCs display a divergent and increased level of conservation when compared to both canonical stop codons and exonic regions, suggesting that they have been under selective pressure. Similarly, the distribution of regulated PTCs was markedly different from that of normal termination events with a strong preference for the NMD machinery to use UGA. This is interesting in light of recent findings showing that sequences (including stop codons) having positive impact on stop codon read-through lead to a reduction of NMD through the removal of UPF1 from the 3’ UTR through a translation-dependent mechanism [44]. These findings may suggest that the PTC and its surrounding sequences have been under selective pressure to optimize NMD efficiency in a process involving translational read-through, which in turn may be used by the cell for regulatory purposes.
Conclusions
By developing and applying a robust bioinformatic pipeline mediating a high-resolution study of AS and its dynamics, this whole-transcriptome analysis of two NMD-deficient primary mouse tissues provides a comprehensive quantification of the impact of NMD on untransformed mammalian transcriptomes, providing crucial novel insights into its role in both core and tissue-specific regulation of gene expression, significantly extending the importance of NMD from an mRNA quality pathway to a regulator of several layers of gene expression. Thus, in addition to removing low levels of splicing ‘errors’, and destabilizing targets of AS-NMD, our analysis reveals the potentially crucial importance of NMD in maintaining splicing homeostasis. In particular, the observed deregulation of full-length splicing regulators suggests that the NMD pathway controls their expression in a manner distinct from direct AS-NMD, and that their deregulation is an important contributor to the global deregulation of AS that we observe in NMD-deficient tissues.
Finally, we have provided a reference catalogue of NMD-regulated AS events as well as an open source tool, RAINMAN, facilitating the process of splice isoform inference and PTC prediction from RNA-seq reads. Future transcriptome studies in other organs and organisms will further reveal the impact of NMD on splicing patterns and how this pathway modulates biological read-out in a tissue- and pathway-specific manner.
Materials and methods
Mice
For the selective deletion of UPF2 in liver and BMM we used our conditional floxed Upf2 line and the LysMCre and Mx1Cre driver lines as described previously [7]. Upf2 is recombined during macrophage differentiation, and the mature BMMs are devoid of any functional UPF2 protein as well as NMD activity. To rescue the Upf2fl/fl; Mx1Cre from the previously reported hematopoietic lethality, Upf2fl/fl; Mx1Cre and Upf2fl/fl were transplanted with wild-type bone marrow cells prior to poly-I:C injection. As described previously, liver was harvested 3 weeks after Upf2 recombination to avoid any indirect effects from the poly-I:C treatment [30]. All mouse work was performed according to national and international guidelines and approved by the Danish Animal Ethical Committee. This study was approved by the review board at the Faculty of Health, University of Copenhagen (LT-P0658).
cDNA synthesis for RNA-seq
Total RNA was harvested in Trizol (Invitrogen; Carlsbad, CA, USA) from Upf2fl/fl; LysMCre and Upf2fl/fl-derived male BMMs, grown as previously described [7]. Briefly, bone marrow cells were grown in vitro in the presence of macrophage colony-stimulating factor-conditioned medium for 7 days, giving essentially a 100% pure macrophage population. Total RNA from liver was harvested in Trizol 21 days after injection of poly I:C from Upf2fl/fl; Mx1Cre and Upf2fl/fl; Mx1Cre. We pooled 50 μg RNA from each of three age-matched males to get a total of 150 μg RNA, which was subjected to two rounds of mRNA purification by hybridizing to oligo(dT) beads (Dynabeads mRNA purification kit, Invitrogen). The resulting mRNA (1 μg) was then used as template to prepare cDNA. Double-stranded cDNA was essentially prepared as described by the manufacturer (Superscript Double-Stranded cDNA Synthesis kit, Invitrogen), using 10 μM random hexamers to prime first strand synthesis. Finally, the double-stranded cDNA was purified using QiaQuick PCR columns (Qiagen, Hilden, Germany) followed by phenol-chloroform extraction. The quality of the cDNA was verified using a Bioanalyzer.
Library preparation for cDNA sequencing was performed essentially as for the whole genome DNA library construction (DNA sample prep kit, Illumina; see Supplementary Methods in Additional file 1). Sequencing was performed on an Illumina Genome Analyzer II flowcell, generating 75 bp single-end reads. RNA-seq data have been submitted to the NCBI Short Read Archive database with accession number GSE26561.
RT-PCR and western blotting
Total RNA was purified using Trizol and 0.5 μg RNA was used to synthesize single-stranded cDNA with oligo d(T) primers, using a ProtoScript M-MULV First-Strand Synthesis kit (New England Biolabs, Ipswich, MA, USA) as described by the manufacturer. The cDNA was used in standard PCR reactions using Taq Polymerase (Invitrogen), with primers specific for the exons flanking the alternative exon (see Table S9 in Additional file 12 for a list of primers). For protein analysis, snap-frozen tissue was dissected and lysed in ice-cold RIPA buffer supplemented with proteinase inhibitors and separated by SDS-PAGE and subjected to western blotting. Antibodies used: affinity-purified rabbit α-UPF2 (kind gift from Dr Jens Lykke-Andersen); rabbit α-mAb104 serum (pan-SR protein antibody; kind gift from Dr Javier Caceres); mouse α-β actin (ab6276, Abcam, Cambridge, UK). Development of α-mAb104 immunoblotted liver protein lysate required different exposure times depending on the molecular weight band, and was thus treated individually for the different protein bands.
RAINMAN overview
Figure S1B in Additional file 2 outlines the sequential bioinformatic processing steps making up the RAINMAN pipeline utilized in this study. Our PTC and splice event detection and quantification pipeline is available as a series of Python and R scripts, readily scalable and customizable for use with RNA-seq short reads of any length. First, an index is generated, consisting of the relevant genome assembly combined with a combinatorial exon-exon junction database, as described in detail in the Supplementary Methods in Additional file 1. Reads are mapped to this index (using the Bowtie/TopHat [31,39] mappers as default) and unmapped reads are truncated to rescue reads spanning more than two exons. Gene expression is calculated by the RPKM metric (analogous to fragments per kb of exon per million fragments mapped (FPKM)) using the Cufflinks package [45] and upper quartile normalization, and junction expression levels are calculated by quantifying reads spanning exon-exon boundaries, likewise using the RPKM metric, and are (optionally) normalized using “Trimmed mean of M component” (TMM) normalization [46]. Finally, junctions are processed by PTC and splice event detection algorithms, and all data are output, allowing the researcher to visualize results in UCSC Genome Browser and spreadsheet software. See Supplementary Methods in Additional file 1 for a detailed description of RAINMAN and Table S6 in Additional file 6 for splice detection validation. RAINMAN scripts, documentation and complete junction and gene expression lists are available online [47].
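As an illustration of the RPKM metric referred to above, the following is a minimal Python sketch of the standard definition (reads per kilobase of feature per million mapped reads). It is not taken from the RAINMAN scripts; the function name `rpkm`, its parameters and the example numbers are hypothetical, and the upper quartile and TMM normalization steps are not included.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    Illustrative helper (hypothetical name, not part of RAINMAN).
    The same formula can be applied to a gene model or to a junction.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene model
# in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Scaling by both feature length and library size is what makes values comparable between features of different length and between libraries of different depth.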
Statistical methods
The statistical package R was used to calculate Chi-square and Kolmogorov-Smirnov tests.
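For readers unfamiliar with the Chi-square test, the sketch below shows a goodness-of-fit calculation of the kind that could be used to compare stop codon usage against a reference distribution. The study itself used R; this Python version is a stand-in written for illustration only, and the counts, the expected fractions and the function names are hypothetical rather than data or code from the study. With three categories there are two degrees of freedom, for which the p-value has the closed form exp(-x/2).

```python
import math


def chi_square_gof(observed, expected_fractions):
    """Chi-square goodness-of-fit statistic for categorical counts."""
    total = sum(observed.values())
    stat = 0.0
    for category, count in observed.items():
        expected = total * expected_fractions[category]
        stat += (count - expected) ** 2 / expected
    return stat


def p_value_two_df(stat):
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Hypothetical counts, NOT data from the study.
observed = {"UAA": 20, "UAG": 20, "UGA": 60}
expected = {"UAA": 0.25, "UAG": 0.25, "UGA": 0.50}

stat = chi_square_gof(observed, expected)
print(stat)  # 4.0
print(p_value_two_df(stat))
```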
Database accession
RNA-seq data are available at the NCBI Short Read Archive database (accession number GSE26561). RAINMAN scripts, documentation and complete junction and gene expression lists are available online [47].
Additional material
Additional file 1: Supplementary Information and Supplementary Tables S1, S2 and S5. Supplementary Materials, Methods and References. Supplementary Table S1: number of mapped junctions contributed uniquely by the combinatorial database (Comb DB only) or TopHat (TopHat only) and the number of mapped junctions detected by both the combinatorial database and TopHat (Both). Supplementary Table S2: the contribution of TopHat to the number of junctions predicted to generate a PTC versus all junctions (minimum of three reads per junction). Supplementary Table S5: deregulation of core splice factors. Gene FC indicates the change in mRNA levels for all the isoforms for the particular gene between KO and WT.
Additional file 2: Supplementary Figure S1. UCSC Genome browser output of Pion gene and schematic of the RAINMAN pipeline with steps for mapping and processing of reads.
Additional file 3: Supplementary Figure S2. Histogram showing distance from normal stop codon to the 3’ end of RefSeq genes with stops in final exon, and distances to nearest downstream exon-exon junction for genes with stop codons in the second to last exon.
Additional file 4: Supplementary Table S3. Reads per unique junction statistics for all samples, split into junctions discovered by mapping to the combinatorial database versus junctions discovered by TopHat.
Additional file 5: Supplementary Table S4. Table with Gene Ontology terms associated with genes containing upregulated PTC+ junctions that are unique for Upf2 KO liver or BMM.
Additional file 6: Supplementary Table S6. Results from validation by manual inspection of output from isoform class inference.
Additional file 7: Supplementary Figure S3. Validation of expression change inference and isoform inference.
Additional file 8: Supplementary Table S7. PTC upon inclusion isoforms (SES) upregulated in both Upf2 KO liver and BMM (ΔPSI > 20%).
Additional file 9: Supplementary Table S8. PTC upon exclusion isoforms (SES) upregulated in both Upf2 KO liver and BMM (ΔPSI < -20%).
Additional file 10: Supplementary Figure S4. Mean per position phastCon conservation score around single exon skipping events for BMMs. Numbers on x-axis indicate nucleotide intervals - 25 and 75 nucleotides for exons and flanking introns, respectively.
Additional file 11: Supplementary Figure S5. Conservation around upregulated PTCs, with mean per-position phastCon scores centered on the PTC for upregulated junctions in liver and BMMs.
Additional file 12: Supplementary Table S9. List of primers used in RT-PCR validation of splicing events.
Abbreviations
A3SS: alternative 3’ splice site; A5SS: alternative 5’ splice site; AFE: alternative first exon; ALE: alternative last exon; AS: alternative splicing; AS-NMD: alternative splicing coupled to nonsense-mediated mRNA decay; BMM: bone marrow-derived macrophage; bp: base pair; ES: exon skipping; EST: expressed sequence tag; GO: Gene Ontology; hnRNP: heterogeneous nuclear ribonucleoprotein; KO: knock-out; MEF: mouse embryonic fibroblast; MES: multiple exon skipping; MXE: mutually exclusive exon; NK: natural killer; NMD: nonsense-mediated mRNA decay; PCR: polymerase chain reaction; PSI: percent spliced in; PTC: premature termination codon; PTC-: PTC absence; PTC+: PTC containing; RAINMAN: RnAseq-based Isoform detection and NMd ANalysis pipeline; RNA-seq: next generation RNA sequencing; RPKM: reads per kb of gene model per million mapped reads; SES: single exon skipping; SR: serine/arginine-rich; UCE: ultraconserved element; UTR: untranslated region; WT: wild type.
Acknowledgements
This work was supported by grants from the Danish Natural Science Research Council, the Novo Nordisk foundation, the Lundbeck Foundation and the European Research Council (financial support to J Waage; EU FP7 framework programme/ERC grant agreement 204135).
Author details
1 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen, DK2200 Copenhagen, Denmark. 2 Biotech Research and Innovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen, Denmark. 3 Section for Gene Therapy Research, Rigshospitalet, University of Copenhagen, DK-2100 Copenhagen, Denmark. 4 The Bioinformatics Centre, University of Copenhagen, DK-2200, Copenhagen, Denmark. 5 BGI-Shenzhen, Shenzhen 518083, China. 6 Department of Biology, University of Copenhagen, DK-2200 Copenhagen, Denmark.
Authors’ contributions
JoWe carried out experiments, drafted the manuscript and conceived the study. JoWa carried out the bioinformatic analysis and helped to draft the manuscript. GT, JZ, JuWa, and KK facilitated sequencing. ID and JSJ assisted with experimental work. AK participated in study design. BP helped to draft the manuscript and conceived the study. All authors have read and approved the manuscript for publication.
Competing interests
The authors declare that they have no competing or conflicting interests.
Received: 28 November 2011; Revised: 6 May 2012; Accepted: 24 May 2012; Published: 24 May 2012
References
1. Black DL: Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem 2003, 72:291-336.
2. Pan Q, Shai O, Lee L, Frey B, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008, 40:1413-1415.
3. Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S, Schroth G, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456:470-476.
4. Lewis BP, Green RE, Brenner SE: Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc Natl Acad Sci USA 2003, 100:189-192.
5. Rehwinkel J, Letunic I, Raes J, Bork P, Izaurralde E: Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA targets. RNA 2005, 11:1530-1544.
6. Wittmann J, Hol EM, Jäck H-M: hUPF2 silencing identifies physiologic substrates of mammalian nonsense-mediated mRNA decay. Mol Cell Biol 2006, 26:1272-1287.
7. Weischenfeldt J, Damgaard I, Bryder D, Theilgaard-Mönch K, Thoren LA, Nielsen FC, Jacobsen SEW, Nerlov C, Porse BT: NMD is essential for hematopoietic stem and progenitor cells and for eliminating by-products of programmed DNA rearrangements. Genes Dev 2008, 22:1381-1396.
8. Gardner LB: Nonsense-mediated RNA decay regulation by cellular stress: implications for tumorigenesis. Mol Cancer Res 2010, 8:295-308.
9. Perlick HA, Medghalchi SM, Spencer FA, Kendzior RJ Jr, Dietz HC: Mammalian orthologues of a yeast regulator of nonsense transcript stability. Proc Natl Acad Sci USA 1996, 93:10928-10932.
10. Lykke-Andersen J, Shu MD, Steitz JA: Human Upf proteins target an mRNA for nonsense-mediated decay when bound downstream of a termination codon. Cell 2000, 103:1121-1131.
11. Le Hir H, Gatfield D, Izaurralde E, Moore MJ: The exon-exon junction complex provides a binding platform for factors involved in mRNA export and nonsense-mediated mRNA decay. EMBO J 2001, 20:4987-4997.
12. Nagy E, Maquat LE: A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem Sci 1998, 23:198-199.
13. Behm-Ansmant I, Gatfield D, Rehwinkel J, Hilgers V, Izaurralde E: A conserved role for cytoplasmic poly(A)-binding protein 1 (PABPC1) in nonsense-mediated mRNA decay. EMBO J 2007, 26:1591-1601.
14. Eberle AB, Stalder L, Mathys H, Orozco RZ, Mühlemann O: Posttranscriptional gene regulation by spatial rearrangement of the 3’ untranslated region. PLoS Biol 2008, 6:e92.
15. Wang J, Gudikote JP, Olivas OR, Wilkinson MF: Boundary-independent polar nonsense-mediated decay. EMBO Rep 2002, 3:274-279.
16. Bühler M, Steiner S, Mohn F, Paillusson A, Mühlemann O: EJC-independent degradation of nonsense immunoglobulin-mu mRNA depends on 3’ UTR length. Nat Struct Mol Biol 2006, 13:462-464.
17. Morrison M, Harris KS, Roth MB: smg mutants affect the expression of alternatively spliced SR protein mRNAs in Caenorhabditis elegans. Proc Natl Acad Sci USA 1997, 94:9782-9785.
18. Ni JZ, Grate L, Donohue JP, Preston C, Nobida N, O’Brien G, Shiue L, Clark TA, Blume JE, Ares M: Ultraconserved elements are associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated decay. Genes Dev 2007, 21:708-718.
19. Saltzman AL, Kim YK, Pan Q, Fagnani MM, Maquat LE, Blencowe BJ: Regulation of multiple core spliceosomal proteins by alternative splicing-coupled nonsense-mediated mRNA decay. Mol Cell Biol 2008, 28:4320-4330.
20. Lareau LF, Inada M, Green RE, Wengrod JC, Brenner SE: Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature 2007, 446:926-929.
21. Saltzman AL, Pan Q, Blencowe BJ: Regulation of alternative splicing by the core spliceosomal machinery. Genes Dev 2011, 25:373-384.
22. Pinol-Roma S, Choi YD, Matunis MJ, Dreyfuss G: Immunopurification of heterogeneous nuclear ribonucleoprotein particles reveals an assortment of RNA-binding proteins. Genes Dev 1988, 2:215-227.
23. Han SP, Tang YH, Smith R: Functional diversity of the hnRNPs: past, present and perspectives. Biochem J 2010, 430:379-392.
24. Boutz PL, Stoilov P, Li Q, Lin CH, Chawla G, Ostrow K, Shiue L, Ares M Jr, Black DL: A post-transcriptional regulatory switch in polypyrimidine tract-binding proteins reprograms alternative splicing in developing neurons. Genes Dev 2007, 21:1636-1652.
25. Hanamura A, Caceres JF, Mayeda A, Franza BR, Krainer AR: Regulated tissue-specific expression of antagonistic pre-mRNA splicing factors. RNA 1998, 4:430-444.
26. Bai Y, Lee D, Yu T, Chasin LA: Control of 3’ splice site choice in vivo by ASF/SF2 and hnRNP A1. Nucleic Acids Res 1999, 27:1126-1134.
27. Chen M, Manley JL: Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 2009, 10:741-754.
28. Pan Q, Saltzman AL, Kim YK, Misquitta C, Shai O, Maquat LE, Frey BJ, Blencowe BJ: Quantitative microarray profiling provides evidence against widespread coupling of alternative splicing with nonsense-mediated mRNA decay to control gene expression. Genes Dev 2006, 20:153-158.
29. Ramani AK, Nelson AC, Kapranov P, Bell I, Gingeras TR, Fraser AG: High resolution transcriptome maps for wild-type and nonsense-mediated decay-defective Caenorhabditis elegans. Genome Biol 2009, 10:R101.
30. Thoren LA, Nørgaard GA, Weischenfeldt J, Waage J, Jakobsen JS, Damgaard I, Bergström FC, Blom AM, Borup R, Bisgaard HC, Porse BT: UPF2 Is a Critical Regulator of Liver Development, Function and Regeneration. PLoS ONE 2010, 5:e11650.
31. Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25:1105-1111.
32. McIlwain DR, Pan Q, Reilly PT, Elia AJ, McCracken S, Wakeham AC, Itie-Youten A, Blencowe BJ, Mak TW: Smg1 is required for embryogenesis and regulates diverse genes via alternative splicing coupled to nonsense-mediated mRNA decay. Proc Natl Acad Sci USA 2010, 107:12186-12191.
33. Christofk HR, Vander Heiden MG, Harris MH, Ramanathan A, Gerszten RE, Wei R, Fleming MD, Schreiber SL, Cantley LC: The M2 splice isoform of pyruvate kinase is important for cancer metabolism and tumour growth. Nature 2008, 452:230-233.
34. Yeo G, Holste D, Kreiman G, Burge CB: Variation in alternative splicing across human tissues. Genome Biol 2004, 5:R74.
35. Kim E, Goren A, Ast G: Alternative splicing: current perspectives. Bioessays 2008, 30:38-47.
36. Mitrovich QM, Anderson P: mRNA surveillance of expressed pseudogenes in C. elegans. Curr Biol 2005, 15:963-967.
37. Mendell JT, Sharifi NA, Meyers JL, Martinez-Murillo F, Dietz HC: Nonsense surveillance regulates expression of diverse classes of mammalian transcripts and mutes genomic noise. Nat Genet 2004, 36:1073-1078.
38. Sayani S, Janis M, Lee CY, Toesca I, Chanfreau GF: Widespread impact of nonsense-mediated mRNA decay on the yeast intronome. Mol Cell 2008, 31:360-370.
39. Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10:R25.
40. Hillman RT, Green RE, Brenner SE: An unappreciated role for RNA surveillance. Genome Biol 2004, 5:R8.
41. Cases S: ACAT-2, A second mammalian Acyl-CoA:cholesterol acyltransferase. Its cloning, expression, and characterization. J Biol Chem 1998, 273:26755-26764.
42. Simons-Evelyn M, Young HA, Anderson SK: Characterization of the mouse Nktr gene and promoter. Genomics 1997, 40:94-100.
43. Weischenfeldt J, Damgaard I, Bryder D, Theilgaard-Mönch K, Thoren LA, Nielsen FC, Jacobsen SE, Nerlov C, Porse BT: NMD is essential for hematopoietic stem and progenitor cells and for eliminating by-products of programmed DNA rearrangements. Genes Dev 2008, 22:1381-1396.
44. Hogg JR, Goff SP: Upf1 Senses 3’UTR Length to Potentiate mRNA Decay. Cell 2010, 143:379-389.
45. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 2012, 7:562-578.
46. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010, 11:R25.
47. RAINMAN (RnAseq-based Isoform detection and NMd ANalysis pipeline). [http://people.binf.ku.dk/jwaage/RAINMAN/].
doi:10.1186/gb-2012-13-5-r35
Cite this article as: Weischenfeldt et al.: Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biology 2012 13:R35.
Article II
spliceR: An R package for classification of alternative splicing and prediction of
coding potential from RNA-seq data
© Oxford University Press 2013
Gene expression
spliceR: An R package for classification of alternative splicing and prediction of coding potential from RNA-seq data
Kristoffer Knudsen1,2,3,4, Bo Torben Porse1,2,3, Albin Sandelin2,4,* and Johannes Waage1,2,3,4,*
1 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen, DK2200 Copenhagen, Denmark
2 Biotech Research and Innovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen, Denmark
3 The Danish Stem Cell Centre (DanStem), Faculty of Health Sciences, University of Copenhagen, DK2200 Copenhagen, Denmark
4 The Bioinformatics Centre, University of Copenhagen, DK2200, Copenhagen, Denmark
* To whom correspondence should be addressed.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Summary: With the advent of increasing depth and decreasing costs in digital gene expression technologies exemplified by RNA-sequencing, researchers are now able to profile the transcriptome with unprecedented detail. These advances not only allow for precise approximation of gene expression levels, but also for characterization of alternative isoform usage/switching between samples. Recent software improvements in full transcript deconvolution prompted us to develop spliceR, an R package for classification of alternative splicing. spliceR labels isoforms based on fully assembled transcripts, detecting single- and multiple exon skipping, alternative donor or acceptor sites, intron retention, alternative first or last exon usage, and mutually exclusive exon events. Alongside, event spliced-in/out values are calculated for effective post-filtering, and genomic coordinates of differentially spliced elements are annotated for downstream sequence analysis. Furthermore, spliceR has the option to predict the coding potential and thereby the nonsense-mediated decay (NMD) sensitivity of transcripts based on stop codon position.
Availability and Implementation: spliceR is implemented as an R package, is freely available from https://github.com/splicer-tool/splicer, and has been submitted to the Bioconductor repository.
Contact: [email protected]
Alternative splicing is an important part of the multi-layered process of RNA processing, elevating the potential number of unique products by orders of magnitude. More than 95% of all human genes undergo alternative splicing, and this is thought to be a key element in driving the phenotypical complexity of mammals (Pan et al., 2008). Recent advances in sequencing technology of RNA (RNA-seq), combined with modern RNA-seq transcript assemblers, such as Cufflinks (Roberts et al., 2011), now allow for describing the diverse RNA landscape with high resolution.
Here we present an easy-to-use tool, spliceR, which builds upon common RNA-seq assembly workflows by annotating multiple transcripts from the same gene with alternative splicing classes. For each gene entity, each alternative transcript is classified against either the gene’s hypothetical pre-mRNA (based on all isoforms) or against the gene’s most expressed transcript. Additionally, spliceR allows for the characterization of the protein coding potential of each transcript by translating the full exonic sequence of each transcript with supported annotated open reading frames (ORFs). spliceR is fully based on object types found in the Bioconductor (Gentleman et al., 2004) project, such as GRanges, allowing for full flexibility and modularity, and has been submitted to the Bioconductor repository.
Classification of alternative splicing: spliceR takes full-length transcript information either from Cufflinks, or from data generated by any RNA-seq assembler that outputs fully deconvoluted transcripts. spliceR supports a number of filters, letting the user choose to classify alternative splicing only from those transcripts that have passed a set of qualifiers, either given by Cufflinks (transcript and/or gene model confidence), or based on NMD-sensitivity or expression thresholds. In some scenarios, especially when looking at changes between samples where splicing patterns are expected to change, users may opt to configure spliceR to use the most expressed transcript as the reference transcript instead of the theoretical pre-mRNA. Based on this comparison spliceR outputs a complete classification of splicing events as seen in figure 1a. Furthermore, percent spliced in (PSI) values are calculated for each of the transcripts, representing the percentage of the total parent gene expression originating from this transcript. Finally, delta-PSI values (dPSI) allow researchers to assess changes in splicing (i.e. isoform switching) between samples.
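The PSI and dPSI definitions given above can be sketched in a few lines. spliceR itself is an R package; the Python below is only an illustration of the arithmetic, and the function names, the transcript identifiers, the expression values and the choice to report 0.0 for a gene with no expression are all assumptions made for this example.

```python
def psi(transcript_expression):
    """Percent spliced in: each transcript's share of total parent gene expression."""
    total = sum(transcript_expression.values())
    if total == 0:
        # Assumption for this sketch: an unexpressed gene yields 0.0 for every transcript.
        return {tid: 0.0 for tid in transcript_expression}
    return {tid: 100.0 * value / total for tid, value in transcript_expression.items()}


def delta_psi(expression_a, expression_b):
    """dPSI per transcript: PSI in sample B minus PSI in sample A."""
    psi_a, psi_b = psi(expression_a), psi(expression_b)
    transcripts = sorted(set(psi_a) | set(psi_b))
    return {tid: psi_b.get(tid, 0.0) - psi_a.get(tid, 0.0) for tid in transcripts}


# Hypothetical expression values for two transcripts of one gene.
wild_type = {"tx1": 30.0, "tx2": 10.0}
knock_out = {"tx1": 10.0, "tx2": 30.0}

print(psi(wild_type))                   # {'tx1': 75.0, 'tx2': 25.0}
print(delta_psi(wild_type, knock_out))  # {'tx1': -50.0, 'tx2': 50.0}
```

In this toy case the gene-level expression is unchanged between samples, yet the dPSI values show a complete switch in the dominant isoform, which is the kind of event the metric is meant to expose.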
The output of spliceR also facilitates various downstream analyses, including filtering on transcripts that have major changes between samples, filtering for specific splicing classes, or sequence analysis on elements that are spliced in or out between samples. The latter could include detection of enriched motifs in or surrounding such elements, or identification of protein domains that are spliced in/out. spliceR facilitates this type of analysis by outputting the genomic coordinates of each alternatively spliced element.
Visualization: To analyze global trends in splicing, spliceR can produce a range of Venn diagrams, showing the overlap of splicing events between samples for either a specific type of alternative splicing or for all events. An example is shown in figure 1b.
Analysis of coding potential: For assessment of coding potential, spliceR initially retrieves the genomic exon sequence using BSGenome objects, easily downloadable from Bioconductor directly in R. Next, ORF annotation is retrieved from the UCSC Genome Browser repository from either RefSeq, Known Gene or Ensembl. Alternatively, one can generate a custom ORF-table. Finally, the RNA sequences of input transcripts are assembled, and if a compatible ORF exists, translated, and positional data about the stop codon, including distance to final exon-exon junction, is recorded and returned to the user. Based on the generally accepted 50 nt rule in the NMD literature (Weischenfeldt et al., 2012), transcripts are marked NMD-sensitive if the stop codon falls more than 50 nt upstream of the final exon-exon junction, although this setting is user-configurable.
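The 50 nt rule described above can be illustrated with a short sketch. This is not spliceR code (spliceR is written in R); the function name, the transcript-coordinate convention and the example exon lengths are hypothetical. In particular, measuring the distance from the end of the stop codon to the junction is an assumption of this sketch, since the text does not state the exact reference point.

```python
def nmd_sensitive(exon_lengths, stop_codon_start, min_distance=50):
    """Apply the 50 nt rule to a spliced transcript.

    exon_lengths: exon lengths in 5' to 3' order.
    stop_codon_start: 0-based transcript coordinate of the first
        nucleotide of the stop codon.
    min_distance: threshold in nt (configurable, default 50).
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return False
    final_junction = sum(exon_lengths[:-1])  # transcript position of the last junction
    stop_codon_end = stop_codon_start + 3
    # Assumption: distance is counted from the end of the stop codon.
    distance_to_junction = final_junction - stop_codon_end
    return distance_to_junction > min_distance


# Hypothetical transcript with exons of 200, 150 and 300 nt (final junction at 350).
exons = [200, 150, 300]
print(nmd_sensitive(exons, 100))  # True: stop codon ends 247 nt upstream of the junction
print(nmd_sensitive(exons, 320))  # False: only 27 nt upstream
print(nmd_sensitive(exons, 400))  # False: stop codon lies in the final exon
```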
Conclusion: We present a Bioconductor package, spliceR, which harnesses the power of current RNA-seq and assembly technologies, and provides a full overview of alternative splicing events and protein coding potential of transcripts. spliceR is flexible and easily integrated in existing workflows, supporting input and output of standard Bioconductor data types. To our knowledge, spliceR is the only software that facilitates comprehensive splice class detection based on full-length transcripts.
ACKNOWLEDGEMENTS
Funding: This work was supported in part through a center grant from the Novo Nordisk Foundation (The Novo Nordisk Foundation Section for Stem Cell Biology in Human Disease).
Conflict of Interest: The authors have nothing to declare.
REFERENCES
Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5, R80.
Pan,Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40, 1413–5.
Roberts,A. et al. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27, 2325–9.
Visconte,V. et al. (2012) SF3B1 haploinsufficiency leads to formation of ring sideroblasts in myelodysplastic syndromes. Blood, 120, 3173–86.
Weischenfeldt,J. et al. (2012) Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome biology, 13, R35.
Article III
Genome-wide mapping of transcription start sites reveals deregulation of full-length
transcripts in acute promyelocytic leukemia
Genome-wide mapping of transcription start sites reveals
deregulation of full-length transcripts in acute promyelocytic
leukemia.
Johannes Waage1,2,3,4, Mette Boyd1,2, Kim Theilgaard-Mönch2,3,5, Helena
Mora-Jensen6, Piero Carninci7, Nicolas Rapin1,2,3,4, Berit Lilje1,2, Peter Hokland8,
Niels Borregaard6, Bo Torben Porse2,3,4, Albin Sandelin1,2,*
1 The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen,
Denmark; 2 Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen,
Denmark; 3 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen,
Copenhagen, Denmark; 4 The Danish Stem Cell Centre (DanStem), Faculty of Health Sciences, University of
Copenhagen, Copenhagen, Denmark; 5 The Department of Hematology, Skanes University Hospital, Lund
University, Sweden; 6 The Granulocyte Research Laboratory, Department of Hematology, Rigshospitalet,
University of Copenhagen, Copenhagen, Denmark; 7 RIKEN Center for Life Science Technologies,
Yokohama, Japan; 8 Department of Hematology, Århus University Hospital, Århus, Denmark
* Corresponding author
Abstract
Transcription start sites are the focal points of transcriptional regulation, where
information from regulatory elements is integrated to stabilize initiation of
transcription. In humans, most genes have more than one transcription start site, and
these often exhibit different tissue specificity, serving as distinct regulatory frameworks
for the same gene. Usage of such promoters can also result in differential gene function
manifested on the protein level. Alternative promoter usage is increased in several
disease states, including cancer, but its impact in most disease states is
uncharacterized.
Most methods for genome-wide detection of transcription start sites cannot be
applied to rare cell samples or highly purified cells because they need high amounts of
RNA. In this study, we have applied the nanoCAGE method to create a genome-wide
map of transcription start site usage in highly purified cancer cells from patients
with acute promyelocytic leukemia, and corresponding normal bone marrow cells (i.e.
promyelocytes) purified from healthy subjects, using as little as 50 ng total RNA.
We show that the nanoCAGE method gives results similar to those of microarray
experiments in terms of gene-level expression, but in addition allows for
the identification of alternative promoter usage in cancer cells. We identify 2,162
putative promoters that are significantly differentially regulated between APL and
controls. Interestingly, promoters whose usage is upregulated in APL have an
increased propensity to be located within genes, downstream of annotated promoters.
Conversely, promoters supporting annotated longest transcript variants are commonly
downregulated in cancer. We show several examples of genes with upregulated
alternative promoters located downstream, many of which confer protein domain loss
that could contribute to leukemic transformation and maintenance. Moreover, we
show that these downstream promoters likely have different regulatory cues than
cancer-specific promoters corresponding to the longer RNA isoforms.
Introduction
Messenger RNA (mRNA) heterogeneity is a hallmark of higher eukaryotes,
contributing to the plethora of tissue- and cell-specific patterns of protein expression
and function. The underlying mechanisms contributing to the final selection of mRNA
isoforms include alternative promoter usage, alternative splicing, and alternative
termination (reviewed in refs [1–3]), potentially generating several isoforms from a
single gene.
There are several examples where alternative isoform usage can affect the
protein product of a gene, typically by excluding exons that code for important protein
residues, or domains (for a review, see [4]). Alternative promoters are particularly
interesting generators of isoform diversity, since they confer additional regulatory
inputs to the same gene. Indeed, there are many examples of alternative promoters that
confer different tissue specificity within the same gene. An example is the Dlgap1 gene
in mouse which has at least four promoters that are specific for different brain tissues
[5]. There are indications that alternative promoter usage is increased in disease states,
and particularly in cancer [1].
Thus, mapping the promoter usage genome-wide in specific cell states,
including disease, is important. There are several methods for high-throughput
discovery of promoter usage, ranging from chromatin immunoprecipitation (ChIP) of
key components of the preinitiation complex, to sequencing RNA 5’ ends [6]. RNA-
based methods offer base-pair resolution, and are typically based on selection of cap
structures at mature 5’ ends of RNAs followed by sequencing of the first 20-30 nt of the
transcript from the 5’ end. The two most commonly used methods are Cap Analysis of
Gene Expression (CAGE) [7] and TSS Seq [8]. While these methods have high
sensitivity and specificity, they also require RNA from a substantial number of cells
(typically >1 µg total RNA of high quality). This makes it hard to analyze rare cell
types or highly purified cells from clinical tissue samples. The nanoCAGE method,
based on template switching instead of cap trapping, is an alternative which can
measure promoter usage with as little as 10 ng total RNA, at the cost of lower
sensitivity [9,10]. Previous studies have used the nanoCAGE method to investigate the
promoter landscape of cultured hepatocellular carcinoma cells [10,11], but it has not yet
been applied to clinical samples. As a proof of concept, we applied the nanoCAGE
method to comprehensively identify promoters specifically used in highly purified cells
from patients with acute promyelocytic leukemia (APL), compared to corresponding
purified controls, namely promyelocytes.
Acute myeloid leukemia (AML) encompasses a variety of clonal disorders,
whose malignant populations do not respond normally to regulatory cues and have lost
the ability to differentiate into fully mature blood cells [12]. APL represents a
genetically defined subclass according to the current WHO classification of AML, and
comprises 5-10% of all AMLs. APL is characterized by a block in terminal neutrophil
differentiation and the accumulation of “leukemic” promyelocytes (APL blast cells) in
the bone marrow and blood [13,14]. Over 98% of APLs harbor a t(15;17) translocation,
which juxtaposes the promyelocytic leukemia gene (PML) and the retinoic acid
receptor alpha gene (RARA) resulting in the expression of the PML-RARA fusion
protein [14]. In normal cells, RARA forms heterodimers with the retinoid X receptor
(RXR) that bind the retinoic acid responsive elements (RAREs) of their target gene
promoters. In the absence of retinoic acid (RA), RARA interacts with histone
deacetylase (HDAC)-containing co-repressor complexes and represses transcription,
whereas binding of RA triggers a conformational switch resulting in recruitment of
coactivator complexes and subsequently, activation of transcription [15]. In APL cells,
PML-RARA forms homodimers, multimeric complexes, and heterodimers with RXR
that all bind to RARE and repress transcription at physiological levels of RA through
recruitment of corepressor complexes containing HDACs, DNA methyltransferases
and polycomb complexes [16–18]. At pharmacological levels, however, RA reverses
PML-RARA mediated repression of RARA target genes resulting in terminal
neutrophil differentiation of APL cells.
Despite the tremendous progress in our understanding of how PML-RARA
regulates transcription at the molecular level, little is currently known about the extent to which
PML-RARA changes the TSS landscape in APL. Alternative promoter usage has
been associated with cancer; hence, elucidating the promoter usage landscape in APL
is important for understanding the regulatory state of the disease.
In the present study we used APL as a cancer model to investigate changes of
TSS usage associated with malignant transformation. For this, we compared the TSS
landscapes of highly purified leukemic promyelocytes (APL blasts) from APL patients
with those of normal early promyelocytes (EPMs) purified from healthy subjects, using
as little as 50 ng total RNA.
We find that the cancer cells tend to use downstream alternative promoters
much more often than normal cells, thereby producing shorter transcripts. These promoters lie at
a median distance of ~6400 nt from the TSS generating the longest pre-mRNA. We
demonstrate that usage of these downstream alternative promoters often leads to
transcripts lacking the potential to code for important protein domains that could
contribute to the cancer phenotype. Conversely, the usage of canonical TSSs
(corresponding to full-length genes) for genes in repair pathways is downregulated in
cancer cells. Furthermore, using transcription factor ChIP-seq data from the
ENCODE project, we demonstrate that the two categories of putative TSSs have
distinct regulatory profiles. We also report 90 cases of non-coding RNA promoters, in
particular snoRNAs, that are upregulated in the APL cells.
Materials and Methods
Sample preparation
Bone marrow was aspirated from the posterior iliac crest of healthy donors and
patients with newly diagnosed APL after informed consent had been given according
to guidelines established by the Danish National Committee on Health Research
Ethics, and cells were sorted by fluorescence-activated cell sorting (FACS). Detailed
FACS protocols are described in [19], and the FACS strategy is depicted in Figure 2.
RNA was isolated from sorted cells using the RNeasy Micro Kit (Qiagen,
Valencia, CA, USA) according to manufacturer’s instructions. Briefly, BM
populations were sorted once and then resorted directly into 350 µl RLT lysis buffer/β-mercaptoethanol
(Sigma-Aldrich). Tubes containing sorted cells were vortexed
immediately to lyse all cells, then snap-frozen on dry ice and stored at -80 °C before
mRNA was purified according to the manufacturer’s instructions (RNeasy Micro Kit,
Qiagen, Valencia, CA, USA).
NanoCAGE library preparation
NanoCAGE was performed as described in [9], with 50 ng total RNA for all
samples. RNA was sequenced on the Illumina Genome Analyzer IIX in biological
duplicates for both EPM and APL samples, with an input of 2 nM in 10 µL per
sample, following standard protocol.
NanoCAGE tag mapping
Reads were mapped to the NCBI GRCh37/hg19 human reference genome
using Bowtie v. 0.12.8 [20] with standard parameters. See suppl. Figure S1 for
mapping results.
Tag clustering
Initial consensus clusters were created, consisting of overlapping and
neighboring tags for all samples merged, followed by tag quantification for each
sample in the consensus clusters. Clusters were end-trimmed for low tag content,
reducing each tag cluster to the minimum width required for at least 80% of tags to
remain in the cluster, and clusters of 1 nt width were removed to eliminate PCR
artifacts. Tag counts were quantified as TPM (tags per million mapped reads), and
clusters with less than 2 TPM in the highest sample were removed. Clusters with high
inter-replicate variance were removed using a coefficient-of-variation threshold of 1,
resulting in a final set of 7,458 clusters.
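The clustering and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the merge distance `max_gap`, all function names, and the choice to evaluate the coefficient of variation within each condition are hypothetical, since the text does not specify them.

```python
import math
from statistics import mean, stdev

def merge_tags(positions, max_gap=20):
    """Group tag 5' positions (one strand) into clusters of neighboring tags.
    max_gap is an assumed value; the merge distance is not given in the text."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def trim_cluster(tags, keep_fraction=0.8):
    """Return the narrowest (start, end) window holding >= keep_fraction of tags."""
    tags = sorted(tags)
    need = math.ceil(keep_fraction * len(tags))
    best = (tags[0], tags[-1])
    for i in range(len(tags) - need + 1):
        start, end = tags[i], tags[i + need - 1]
        if end - start < best[1] - best[0]:
            best = (start, end)
    return best

def tpm(count, mapped_reads):
    """Tags per million mapped reads."""
    return count * 1e6 / mapped_reads

def coefficient_of_variation(values):
    m = mean(values)
    return stdev(values) / m if m > 0 else float("inf")

def keep_cluster(start, end, tpm_by_condition, min_tpm=2.0, max_cv=1.0):
    """tpm_by_condition maps a condition name to its replicate TPM values."""
    if end - start + 1 <= 1:
        return False  # 1-nt-wide clusters are treated as PCR artifacts
    if max(v for reps in tpm_by_condition.values() for v in reps) < min_tpm:
        return False  # below 2 TPM even in the highest sample
    # Assumption: CV is checked per condition; conditions without signal are skipped.
    for reps in tpm_by_condition.values():
        if sum(reps) > 0 and coefficient_of_variation(reps) > max_cv:
            return False
    return True
```

With biological duplicates, a cluster observed in only one of the two replicates has a coefficient of variation of about 1.41 and is therefore rejected by a threshold of 1, which matches the stated aim of removing replicate-specific PCR expansions.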
Exon noise filtering
To further solidify our distal promoter discovery, the per-nt coverage of a given
tag cluster (TC) was divided by the average per-nt coverage across exons of the respective gene
(excluding exons containing any tag clusters), and a ratio of at least 1.5 was required.
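As an illustration of the ratio described above, a minimal sketch follows. The function name and the input layout are hypothetical, and the background is computed here as pooled tags over pooled exon length, which is one of several possible readings of "average per nt coverage".

```python
def passes_exon_noise_filter(tc_tags, tc_width, exons, min_ratio=1.5):
    """Compare tag-cluster coverage with background exon coverage of the same gene.

    exons: list of (tag_count, length, contains_tc) tuples for the gene.
    Exons containing any tag cluster are excluded from the background.
    """
    background = [(tags, length) for tags, length, has_tc in exons if not has_tc]
    background_length = sum(length for _, length in background)
    if background_length == 0:
        return False  # assumption: without background exons the cluster is not kept
    background_cov = sum(tags for tags, _ in background) / background_length
    tc_cov = tc_tags / tc_width
    if background_cov == 0:
        return tc_cov > 0  # any signal exceeds an empty background
    return tc_cov / background_cov >= min_ratio
```

For example, a cluster with 30 tags over 10 nt (3 tags/nt) in a gene whose remaining exons carry 1 tag/nt gives a ratio of 3 and passes, whereas 12 tags over the same width gives a ratio of 1.2 and is discarded.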
Statistical testing for differential expression
To test for differential promoter usage between APL and EPM, we used
edgeR v. 3.13 [21] with standard settings and an FDR threshold of 0.05, resulting in 1,437
clusters downregulated, and 725 clusters upregulated in APL.
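edgeR itself is an R package, and its test statistics are not reproduced here. The sketch below only illustrates the multiple-testing step, assuming that the FDR was controlled with the Benjamini-Hochberg procedure (the default adjustment in edgeR's `topTags`); the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-cluster p-values
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```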
Promoter classification
Promoters (tag clusters defined above) were assigned to the closest UCSC
knownGene transcriptional unit, and were partitioned into four categories: Full-length
canonical (within 300 nt from the most upstream annotated TSS of the closest gene),
proximal (from 300 nt to within 1/10 of the total gene length), distal (from 1/10 to the 3’-
end of the gene), and intergenic. Ambiguous promoters (due to e.g. overlapping gene
units) were discarded.
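The four categories can be expressed as a small decision function. This is a sketch under simplifying assumptions: one gene per call, a single-nucleotide representative position for each tag cluster, and the gene's outer coordinates standing in for the most upstream annotated TSS and the 3' end. The function and parameter names are hypothetical.

```python
def classify_promoter(tc_pos, gene_start, gene_end, strand, canonical_window=300):
    """Classify a tag cluster relative to its closest gene.

    gene_start < gene_end are genomic coordinates; the TSS is gene_start on
    the '+' strand and gene_end on the '-' strand.
    """
    gene_length = gene_end - gene_start
    # Offset is measured in the direction of transcription, from the TSS.
    offset = tc_pos - gene_start if strand == "+" else gene_end - tc_pos
    if abs(offset) <= canonical_window:
        return "canonical"
    if offset < 0 or offset > gene_length:
        return "intergenic"
    if offset <= gene_length / 10:
        return "proximal"
    return "distal"
```

The window is left as a parameter because the canonical class is given as 300 nt here and as -150/150 bp in the Results.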
Protein domain loss analysis
Transcript regions lost due to usage of alternative promoters located within
genes and upregulated in APL were intersected with protein domain mappings from
the SUPERFAMILY repository [22], reporting both full and partial overlaps.
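The intersection step amounts to an interval overlap test that distinguishes fully from partially lost domains. In the sketch below, the coordinate system (half-open intervals on the transcript) and the domain names are assumptions made for illustration; they are not taken from the SUPERFAMILY mappings.

```python
def domain_loss(lost_region, domains):
    """Report domains overlapping the transcript region lost upstream of an
    alternative promoter.

    lost_region: (start, end), half-open.
    domains: list of (name, start, end), half-open, same coordinate system.
    Returns a list of (name, "full" | "partial").
    """
    lost_start, lost_end = lost_region
    report = []
    for name, d_start, d_end in domains:
        overlap = min(lost_end, d_end) - max(lost_start, d_start)
        if overlap <= 0:
            continue  # domain lies entirely in the retained region
        kind = "full" if overlap == d_end - d_start else "partial"
        report.append((name, kind))
    return report

# Hypothetical example: the first 300 nt of the coding region are lost.
example = domain_loss((0, 300), [("domA", 50, 150), ("domB", 250, 400), ("domC", 500, 600)])
```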
Regulatory neighborhood analysis
The full ENCODE transcription factor ChIP-seq binding site set [23] was
downloaded from the UCSC Genome Browser, and the distance from each TC to the
nearest binding site was determined for each TF. For a select subset of TFs, the
distance from each TC to the full TF binding site set was measured.
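The nearest-site distance can be computed with a binary search over sorted binding-site positions. The sketch assumes that each TC and each ChIP-seq peak is reduced to a single position on one chromosome; the function names and the toy coordinates are hypothetical.

```python
from bisect import bisect_left

def nearest_distance(position, sorted_sites):
    """Distance from position to the closest entry of sorted_sites (None if empty)."""
    i = bisect_left(sorted_sites, position)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - position)      # nearest site at or after
    if i > 0:
        candidates.append(position - sorted_sites[i - 1])  # nearest site before
    return min(candidates) if candidates else None

def nearest_by_tf(tc_position, sites_by_tf):
    """Map each TF name to the distance between the TC and its nearest binding site."""
    return {tf: nearest_distance(tc_position, sorted(sites))
            for tf, sites in sites_by_tf.items()}

# Toy coordinates, for illustration only
distances = nearest_by_tf(450, {"JUN": [900, 100, 500], "FOS": [420], "E2F4": []})
```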
Data availability
Raw sequence data as well as the full TSS set are available at the NCBI Gene
Expression Omnibus and Short Read Archive under accession no. GSE46561.
Results
Figure 1 gives a full overview of the analysis flow, starting from FACS-sorting of
cells, followed by microarray and nanoCAGE experiments and concluded by
computational analysis and comparisons with existing datasets.
Analysis of cancer phenotypes is often hampered by the lack of appropriate
normal controls, i.e. normal cells at a similar stage of differentiation. We have
previously established that the closest normal counterparts of APL blast cells in terms
of expression are early promyelocytes (EPMs) [24]. Therefore, a FACS cell sorting
strategy was developed to purify these two cell types from bone marrow samples, as
described in Figure 2: reanalysis of double-sorted populations demonstrated a purity
between 90% and 100% for all populations. Importantly, by comparing APL blasts to
normal EPMs, we are able to determine cancer-specific transcriptional
changes as opposed to those arising from differences in degree of differentiation.
Following cell sorting, for each patient, RNA was extracted and used to
measure gene expression and promoter usage by nanoCAGE. Sequenced
nanoCAGE reads were mapped to the human genome as described in Methods
(suppl. Figure S1A). Using pooled data, neighboring tags were aggregated into tag
clusters (TCs), whose expression from each library was quantified by the normalized
number of tags within the cluster on the same strand (expressed as tags per million
mapped reads, TPM). For simplicity, we will refer to these clusters interchangeably as
TSSs or promoters in the text, as in [5]. Following initial filtering, the resulting TCs
were found to have a distance distribution showing an increase of clusters close to
previously annotated TSSs, and a decrease of clusters at distances to TSSs similar to those of
random genomic locations (suppl. Figure S1B).
Visual inspection of raw mapped data showed a tendency towards low intensity
tag clusters mapping uniformly over exons. These might represent cases where the
reverse transcriptase failed to reach the real 5’ end, or the capture of partially degraded
(and possibly recapped) mRNAs, as discussed in [10]. We also noted cases of PCR
clonal expansion, typically only present in one of the replicates; these are likely
attributable to the low amount of starting RNA and the high number of PCR cycles
employed in the nanoCAGE method. To take these issues into account, we employed
additional filtering on clusters, requiring low inter-replicate variance, as well as
constraints on cluster widths and expression levels. This produced a final set of 7,458
TCs. Of these, 1,398 overlapped with annotated RefSeq TSSs, 1,795 overlapped with
GenCode V14 TSSs, and 5,663 were un-annotated.
Next, we identified differentially expressed promoters using edgeR [21] and
found 725 promoters to be upregulated and 1,437 promoters to be downregulated in
APL vs. control (Figure 3A) (P<0.05, FDR-adjusted).
To validate our findings in terms of expression, we compared our set of
differentially expressed promoters with previously published microarray-based gene
expression data [25]. Genes assigned as significantly downregulated by microarrays
(n=2,066, P < 0.05, FDR-adjusted) overlap the corresponding promoter sets from
CAGE significantly (n=563, P=1.34e-22, hypergeometric test) (Figure 3B, top).
Similarly, for all genes found upregulated in APL, compared to normal counterpart
cells (EPMs), by microarray (n=855, P<0.05, FDR-adjusted), we found a less marked,
but still significant overlap (n=165, P=2.14e-25, hypergeometric test) with promoters
upregulated in APL by nanoCAGE (Figure 3B, bottom). The combined microarray
data was based on samples from 37 APL patients. Owing to the large difference in
sample numbers between our APL nanoCAGE libraries (n=2) and the array studies, some
caution is necessary when comparing expression values between platforms, and
smaller overlap fractions were expected.
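The overlap significance reported above corresponds to the upper tail of a hypergeometric distribution. A minimal sketch follows; the size of the gene universe used in this study is not stated, so the numbers in the example are toy values and will not reproduce the reported P values.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: size of the gene universe (assumption: must be chosen by the analyst)
    K: genes in the first set (e.g. regulated by microarray)
    n: genes in the second set (e.g. with a regulated nanoCAGE promoter)
    k: observed overlap between the two sets
    """
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Toy example: universe of 10 genes, sets of 4 and 3, overlap of at least 2
p = hypergeom_upper_tail(2, 4, 3, 10)
```

Exact integer arithmetic is used throughout, so very small tail probabilities are not distorted by rounding before the final division.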
Next, we investigated the location of differentially expressed promoters in
relation to annotated genes. We partitioned all promoters across both samples into
bins of location relative to nearest gene body, grouping 16% as full-length canonical
(here defined to be within -150/+150 bp of the longest UCSC transcript TSS on the same
strand), 7% as intragenic proximal (from 150 bp to 1/10th of the total gene length), 43%
as intragenic distal (placed somewhere along the remaining gene body), and 33% as
intergenic (Figure 3, grey bars).
Interestingly, promoters upregulated in APL were overrepresented in the
category of TCs located downstream, distal of annotated TSSs (P<0.01,
hypergeometric test); conversely, downregulated promoters were overrepresented in
the bin of TCs overlapping canonical TSSs (P < 0.01, hypergeometric test) (Figure 3).
The distal upregulated promoters, by definition located between 1/10 of the gene
length and the transcription termination site, were not placed uniformly along the
gene, peaking around a median distance from the TSS of ~6400 nt (suppl. Figure S2).
In summary, TSSs preferentially used in APL are more often downstream alternative
promoters compared to promoters preferentially used in the EPM control samples.
Comparing to gene set signatures collected from the Molecular Signature
Database (http://www.broadinstitute.org/gsea/msigdb/index.jsp) [26], we observed a
significant overlap between our upregulated distal promoters, and genes
overexpressed in APLs (P = 2.3e-12, hypergeometric test) [27], chronic myelogenous
leukemia (CML) (P = 2.44e-15, hypergeometric test) [28], as well as a number of other
cancers, indicating that promoters of oncogenic and cancer maintaining genes,
expressed in APL, produce shorter transcripts (Figure 4, overlay). Similarly, we
observed a significant overlap between down-regulated canonical promoters, and gene
sets for DNA repair pathways, including tumor suppressors CHEK2 (P<0.01,
hypergeometric test) and ATM (P < 0.01, hypergeometric test), both involved in activation
of the DNA damage checkpoint, leading to repair, cell cycle arrest, and/or apoptosis
(Figure 4, overlay) [29], further validating the observation.
Since degraded RNA has potential to be captured by the nanoCAGE protocol,
heightened RNA degradation activity in the cancer cells could potentially cause some
of these observations. We reasoned that if this were the case, then we would expect the
nanoCAGE tags to be uniformly distributed over the exons, and not congregate to
distinct and reproducible clusters over replicates. Even though our initial inter-replicate
filtering strategy would catch most of these exon tags, we further refined our
filter, requiring that the downstream, cancer-specific tag clusters of interest should
have higher tag coverage than expected over the exons of the respective gene (see
Methods). This filtering resulted in a list of 40 high confidence downstream promoters
upregulated in APLs (the full list is given in suppl. table T1), suitable for further
downstream functional analysis. Two such examples are given in Figure 5. Both of
these are close to annotated alternative promoters (further validating the approach),
but here we give evidence that the alternative promoters are used preferentially in the
APL cells compared to controls. Figure 5A shows an APL-specific alternative
promoter, which generates a shorter transcript version of ADAMTSL2. This is placed
downstream of several protein-coding exons, and usage of the alternative promoter will
produce a transcript which cannot encode the signal peptide present in the longer
variant of the RNA, as well as several regions in which mutations have been shown to
result in increased amounts of available TGF-beta [30]. TGF-beta, known to inhibit
proliferation of hematopoietic precursors, is normally found highly expressed in APLs
[31], and the selective downstream promoter usage of ADAMTSL2 could contribute
to this effect. Figure 5B shows a downstream alternative promoter in the GPR56 gene,
a GPCR family receptor recently shown to play a role in both hematopoietic stem cell
and leukemic stem cell maintenance [32]. In this case, the alternative promoter is
not predicted to give a different protein product, as it is located in an extended 5’ UTR.
An overview of 24 selected genes with downstream promoters upregulated in APL is
presented in Figure 6, indicating novel promoters (15/24) and previously annotated
alternative promoters (9/24). To further investigate the functional effects of
downstream promoter usage, we extrapolated the lost mRNA exons to protein
domains. Several genes exhibited domain loss that could be implicated in an
oncogenic phenotype. USP13, involved in deubiquitination and implicated in
the stabilization of p53 [33], loses its proteinase domains. ADAMTSL2, in addition to
the regulation of TGF-beta described above, loses its TSP-1 type 1 repeat domains
(see Figure 5A), previously shown to be implicated in induction of apoptosis [34] and
inhibition of angiogenesis [35]. CTNNA3, a key player in cellular adhesion, loses its
cadherin-associating alpha-catenin domains. A full list of domains lost is given in suppl.
table T2.
To further assess the regulatory differences between the promoter classes, we
investigated the binding of transcription factors in the genomic neighborhood of each
TSS class. To facilitate this, we measured the distance from each TC to the nearest
transcription factor binding site, based on ChIP-seq peaks from ENCODE (all cells)
(suppl. Figure S3). We observed that transcriptional inhibitor members of the E2F
family were more closely associated with APL-downregulated full-length promoters, while
early response transcription factors JUN and FOS (together forming the AP1
complex, and both expressed in APL cells as measured by nanoCAGE (data not shown))
were more closely associated with upregulated downstream promoters (Figure 7). This
indicates that some regulatory cues differ between the downstream and canonical
promoters, even within AML cells.
A strength of nanoCAGE is that it is not limited to a pre-defined gene set, but
can potentially capture any RNA species in the cell, including ncRNAs. To categorize
putative non-coding TSSs, we extracted all non-coding RNA identifiers from RefSeq:
of these, 90 and 84 were up- and downregulated (Figure 8A), respectively, with a
significantly larger fraction of non-coding RNAs being upregulated compared to
non-regulated (Figure 8B). Suppl. table T3 lists the most upregulated ncRNAs.
SNORD114-1 has previously been found to be upregulated in APL in a PML-RAR-alpha
context by microarray and ChIP-seq analyses, conferring cell growth [36]. Similarly,
miR-21, classically described as an “oncomir” [37] and with many tumor-suppressor
targets, is upregulated. Curiously, MEG3, normally thought to be a tumor suppressor
not expressed in tumors [38] and previously shown to be hypermethylated in AML and
myelodysplastic syndrome (MDS), is high on the list of upregulated ncRNA TSSs.
Microarray data confirm this upregulation (7 times upregulated in
APL). The high expression of MEG3 in APL may point to a new and as yet
undescribed role in cancer pathology.
Discussion
In this paper we present the first investigation of promoter usage in APL cancer
cells, compared to their corresponding normal counterpart, both FACS-purified from
human bone marrow. The nanoCAGE method allows for analyses of promoters in
these rare cells, in our case with as little as 50 ng total RNA. By applying stringent
filtering and statistical analyses, we identify 2,162 TSS clusters that are significantly
changed between the two states. On the gene level, the findings correspond well with
microarray studies, but the method can pinpoint novel promoters in the set, which
array approaches cannot. Surprisingly, we see a shift of promoter usage: APL cells are
prone to use promoters that are downstream of canonical TSSs, while downregulating
usage of full-length canonical transcripts. To further refine these findings, we
employed a stringent filter on the TC-to-exon coverage ratio, which resulted in a list of 40
high-confidence APL-specific promoters. By extrapolating the lost mRNA regions to
protein domains, we demonstrated several examples of domain loss with clear
implications for protein function in the cancer cells, several of which can be linked
mechanistically with a cancer phenotype. Furthermore, we noted that the regulatory
landscape around these downstream promoters differs from that of full-length
transcripts, even if the full-length transcripts are preferentially expressed in APL. In
particular, AP1 transcription factors are preferentially bound closer to the downstream promoters. A caveat with
these results is that the ENCODE ChIP-seq data is based on multiple cell lines. On
the other hand, we could in many cases see corresponding support of expression of the
TF in question by nanoCAGE, which argues that the observed regulatory event
should also be relevant for the APL cells.
In summary, our findings indicate that alternative promoter usage is a potentially
important yet often overlooked biological feature of APL cells. While these results are
indicative of functional impact, it is unclear what the causality is: we do not at present
know if the observed shorter isoforms are a result of disease progression or play a
causal role in malignant transformation. To address such questions, it would be
necessary to either knock down or overexpress the shorter variants and observe the
resulting phenotypes.
Regardless of cause and effect, these types of cancer-specific alternative
promoters could potentially be used as biomarkers for APL, since they have additional
power compared to microarray expression profiles, which cannot detect the placement
of alternative promoters. However, for a comprehensive delineation of APL-specific
biomarkers it would then be necessary to expand the study to cover multiple APL and
normal cancer states, as well as biological variation over many patients.
Acknowledgments
The authors thank the National High-throughput DNA Sequencing Centre of
Copenhagen, Denmark for collaborative assistance with sequencing.
This study was supported by grants to AS from the Danish Cancer Society,
The Lundbeck Foundation, the Novo Nordisk Foundation and the European
Research Council (FP7/2007–2013/ERC grant agreement 204135). Work in the Porse
lab was supported by grants from the Danish Cancer Society, The Danish Research
Council for Strategic Research and through a center grant from the Novo Nordisk
Foundation (The Novo Nordisk Foundation Section for Stem Cell Biology in
Human Disease).
Author contributions
JW and AS carried out the computational analysis, MB and PC carried out
laboratory work. HMJ and KT collected cell samples and carried out cell sorting. NR
provided microarray data and analysis. BL contributed with domain loss analysis. PH
and NB contributed with clinical sample collection. BTP and AS supervised the
project. JW, BTP and AS wrote the paper.
References
1. Davuluri R V, Suzuki Y, Sugano S, Plass C, Huang TH-M (2008) The functional
consequences of alternative promoter use in mammalian genomes. Trends in genetics : TIG 24: 167–
177. Available: http://www.ncbi.nlm.nih.gov/pubmed/18329129. Accessed 1 March 2013.
2. Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, et al. (2013) Alternative
splicing: a pivotal step between eukaryotic transcription and translation. Nature reviews Molecular
cell biology 14: 153–165. Available: http://europepmc.org/abstract/MED/23385723. Accessed 27
February 2013.
3. Sun Y, Fu Y, Li Y, Xu A (2012) Genome-wide alternative polyadenylation in animals:
insights from high-throughput technologies. Journal of molecular cell biology 4: 352–361. Available:
http://www.ncbi.nlm.nih.gov/pubmed/23099521. Accessed 13 March 2013.
4. Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, et al. (2007) The
implications of alternative splicing in the ENCODE protein complement. Proceedings of the
National Academy of Sciences of the United States of America 104: 5495–5500. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1838448&tool=pmcentrez&rendertype
=abstract.
5. Valen E, Pascarella G, Chalk A, Maeda N, Kojima M, et al. (2009) Genome-wide
detection and analysis of hippocampus core promoters using DeepCAGE. Genome research 19: 255–
265. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2652207&tool=pmcentrez&rendertype=
abstract. Accessed 5 March 2013.
6. Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, et al. (2007)
Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature reviews
Genetics 8: 424–436. Available: http://www.ncbi.nlm.nih.gov/pubmed/17486122. Accessed 5 March
2013.
7. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, et al. (2003) Cap analysis
gene expression for high-throughput analysis of transcriptional starting point and identification of
promoter usage. Proceedings of the National Academy of Sciences of the United States of America
100: 15776–15781. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=307644&tool=pmcentrez&rendertype=
abstract.
8. Tsuchihara K, Suzuki Y, Wakaguri H, Irie T, Tanimoto K, et al. (2009) Massive
transcriptional start site analysis of human genes in hypoxia cells. Nucleic acids research 37: 2249–2263.
Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2673422&tool=pmcentrez&rendertype=
abstract. Accessed 20 March 2013.
9. Salimullah M, Mizuho S, Plessy C, Carninci P (2011) NanoCAGE: A High-
Resolution Technique to Discover and Interrogate Cell Transcriptomes. Cold Spring Harbor
Protocols 2011: pdb.prot5559–pdb.prot5559. Available:
http://www.cshprotocols.org/cgi/doi/10.1101/pdb.prot5559. Accessed 4 November 2012.
10. Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, et al. (2010) Linking
promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nature
methods 7: 528–534. Available: http://dx.doi.org/10.1038/nmeth.1470. Accessed 10 November 2012.
11. Plessy C, Pascarella G, Bertin N, Akalin A, Carrieri C, et al. (2012) Promoter
architecture of mouse olfactory receptor genes. Genome research 22: 486–497. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3290784&tool=pmcentrez&rendertype
=abstract. Accessed 5 March 2013.
12. Döhner H, Estey EH, Amadori S, Appelbaum FR, Büchner T, et al. (2010)
Diagnosis and management of acute myeloid leukemia in adults: recommendations from an
international expert panel, on behalf of the European LeukemiaNet. Blood 115: 453–474. Available:
http://www.ncbi.nlm.nih.gov/pubmed/19880497. Accessed 2 March 2013.
13. Wang Z-Y, Chen Z (2008) Acute promyelocytic leukemia: from highly fatal to highly
curable. Blood 111: 2505–2515. Available:
http://bloodjournal.hematologylibrary.org/content/111/5/2505.short. Accessed 5 November 2012.
14. Döhner H, Estey EH, Amadori S, Appelbaum FR, Büchner T, et al. (2010)
Diagnosis and management of acute myeloid leukemia in adults: recommendations from an
international expert panel, on behalf of the European LeukemiaNet. Blood 115: 453–474.
doi:10.1182/blood-2009-07-235358.
15. Minucci S, Pelicci PG (2006) Histone deacetylase inhibitors and the promise of
epigenetic (and more) treatments for cancer. Nature reviews Cancer 6: 38–51. Available:
http://www.ncbi.nlm.nih.gov/pubmed/16397526. Accessed 4 March 2013.
16. Lin RJ, Evans RM (2000) Acquisition of oncogenic potential by RAR chimeras in
acute promyelocytic leukemia through formation of homodimers. Molecular cell 5: 821–830. Available:
http://www.ncbi.nlm.nih.gov/pubmed/10882118.
17. Licht JD (2006) Reconstructing a disease: What essential features of the retinoic acid
receptor fusion oncoproteins generate acute promyelocytic leukemia? Cancer cell 9: 73–74. Available:
http://www.ncbi.nlm.nih.gov/pubmed/16473273. Accessed 23 April 2013.
18. Villa R, Pasini D, Gutierrez A, Morey L, Occhionorelli M, et al. (2007) Role of the
polycomb repressive complex 2 in acute promyelocytic leukemia. Cancer cell 11: 513–525. Available:
http://www.ncbi.nlm.nih.gov/pubmed/17560333. Accessed 12 April 2013.
19. Mora-Jensen H, Jendholm J, Fossum A, Porse B, Borregaard N, et al. (2011)
Technical advance: immunophenotypical characterization of human neutrophil differentiation. Journal
of leukocyte biology 90: 629–634. Available: http://www.ncbi.nlm.nih.gov/pubmed/21653237. Accessed
22 May 2013.
20. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome biology 10: R25. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2690996&tool=pmcentrez&rendertype
=abstract. Accessed 10 June 2011.
21. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26:
139–140. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2796818&tool=pmcentrez&rendertype=
abstract. Accessed 2 November 2012.
22. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to
genome sequences using a library of hidden Markov models that represent all proteins of known
structure. Journal of molecular biology 313: 903–919. Available:
http://www.ncbi.nlm.nih.gov/pubmed/11697912. Accessed 28 February 2013.
23. The ENCODE Project Consortium (2011) A user’s guide to the encyclopedia of DNA elements
(ENCODE). PLoS biology 9: e1001046. doi:10.1371/journal.pbio.1001046.
24. Marstrand TT, Borup R, Willer A, Borregaard N, Sandelin A, et al. (2010) A
conceptual framework for the identification of candidate drugs and drug targets in acute
promyelocytic leukemia. Leukemia 24: 1265–1275. Available:
http://www.ncbi.nlm.nih.gov/pubmed/20508621. Accessed 23 April 2013.
25. Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie G Te, et al. (2010)
Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of
leukemia: report from the International Microarray Innovations in Leukemia Study Group. Journal of
clinical oncology : official journal of the American Society of Clinical Oncology 28: 2529–2537.
Available: http://www.ncbi.nlm.nih.gov/pubmed/20406941. Accessed 5 March 2013.
26. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL (2005) Gene set
enrichment analysis : A knowledge-based approach for interpreting genome-wide.
27. Ross ME, Mahfouz R, Onciu M, Liu H-C, Zhou X, et al. (2004) Gene expression
profiling of pediatric acute myelogenous leukemia. Blood 104: 3679–3687. Available:
http://www.ncbi.nlm.nih.gov/pubmed/15226186. Accessed 3 April 2013.
28. Diaz-Blanco E, Bruns I, Neumann F, Fischer JC, Graef T, et al. (2007) Molecular
signature of CD34(+) hematopoietic stem and progenitor cells of patients with CML in chronic
phase. Leukemia 21: 494–504. Available: http://www.ncbi.nlm.nih.gov/pubmed/17252012. Accessed 3
April 2013.
29. Pujana MA, Han J-DJ, Starita LM, Stevens KN, Tewari M, et al. (2007) Network
modeling links breast cancer susceptibility and centrosome dysfunction. Nature genetics 39: 1338–1349.
Available: http://www.ncbi.nlm.nih.gov/pubmed/17922014. Accessed 27 February 2013.
30. Le Goff C, Morice-Picard F, Dagoneau N, Wang LW, Perrot C, et al. (2008)
ADAMTSL2 mutations in geleophysic dysplasia demonstrate a role for ADAMTS-like proteins in
TGF-beta bioavailability regulation. Nature genetics 40: 1119–1123. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2675613&tool=pmcentrez&rendertype=
abstract. Accessed 30 March 2013.
31. Masterson M, A Raza, N Yousuf, A Abbas, A Umerani, A Mehdi, SA Bokhari, Y
Sheikh, K Qadir JF and (2013) High expression of transforming growth factor-beta long cell cycle
times and a unique clustering of S-phase cells in patients with acute promyelocytic leukemia [see
comments]: 1037–1048.
32. Saito Y, Kaneda K, Suekane A, Ichihara E, Nakahata S, et al. (2013) Maintenance of
the hematopoietic stem cell pool in bone marrow niches by EVI1-regulated GPR56. Leukemia.
Available: http://www.ncbi.nlm.nih.gov/pubmed/23478665. Accessed 4 April 2013.
33. Liu J, Xia H, Kim M, Xu L, Li Y, et al. (2011) Beclin1 controls the levels of p53 by
regulating the deubiquitination activity of USP10 and USP13. Cell 147: 223–234. Available:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3441147&tool=pmcentrez&rendertype=
abstract. Accessed 25 March 2013.
34. Guo N, Krutzsch HC, Inman JK, Roberts DD (1997) Thrombospondin 1 and Type I
Repeat Peptides of Thrombospondin 1 Specifically Induce Apoptosis of Endothelial Cells:
1735–1742.
35. Iruela-Arispe ML, Lombardo M, Krutzsch HC, Lawler J, Roberts DD (1999)
Inhibition of Angiogenesis by Thrombospondin-1 Is Mediated by 2 Independent Regions Within the
Type 1 Repeats. Circulation 100: 1423–1431. Available:
http://circ.ahajournals.org/cgi/doi/10.1161/01.CIR.100.13.1423. Accessed 23 April 2013.
36. Valleron W, Laprevotte E, Gautier E-F, Quelen C, Demur C, et al. (2012) Specific
small nucleolar RNA expression profiles in acute leukemia. Leukemia : official journal of the Leukemia
Society of America, Leukemia Research Fund, UK 26: 2052–2060. Available:
http://www.ncbi.nlm.nih.gov/pubmed/22522792. Accessed 4 March 2013.
37. Kent OA, Mendell JT (2006) A small piece in the cancer puzzle: microRNAs as
tumor suppressors and oncogenes. Oncogene 25: 6188–6196. Available:
http://www.ncbi.nlm.nih.gov/pubmed/17028598. Accessed 28 February 2013.
38. Zhang X (2003) A Pituitary-Derived MEG3 Isoform Functions as a Growth
Suppressor in Tumor Cells. Journal of Clinical Endocrinology & Metabolism 88: 5119–5126. Available:
http://jcem.endojournals.org/cgi/doi/10.1210/jc.2003-030222. Accessed 2 April 2013.
Figure 1: Pipeline overview
Flowchart of the full data analysis process presented in this work. Tag cluster
data sets are depicted in green, experimental and computational steps to produce data
in blue, and analysis modules in red.
Figure 2: FACS sorting strategy for the purification of APL blasts and their
corresponding normal bone marrow population.
Mononuclear cells (MNCs) were purified from bone marrow (BM) aspirates of healthy
subjects and APL patients. MNCs were stained with a cocktail of fluorochrome-conjugated
monoclonal antibodies (MoAbs) and the DNA dye 7AAD, allowing sorting of
immunophenotypically identical normal early promyelocytes (EPM) and APL blasts as
described previously [19]. Purified EPM and APL blasts were defined as
Lin-CD34lo/negCD15+ cells. Wright-Giemsa stains of sorted EPM and APL blasts
demonstrate a typical promyelocyte morphology.
Figure 3: Characteristics of gene expression in APL by nanoCAGE
(A) MA-plot for normalized nanoCAGE-seq data, showing the combined
sample expression on the x-axis and the sample fold change on the y-axis. Blue data
points represent candidate TSSs that differ significantly between APL cells
and control (FDR <= 0.05), giving 1,437 downregulated and 725 upregulated
genes, respectively. (B) Venn diagrams comparing the overlap of
TSSs of genes down- or upregulated in APL using microarrays and
nanoCAGE.
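The two axes of an MA-plot can be illustrated with a minimal sketch. It assumes the conventional definitions, A as the mean of the two log2 expression values and M as the log2 ratio of APL to EPM; the function name, the pseudocount, and the example counts are hypothetical and are not taken from the analysis itself.

```python
import math

def ma_values(apl, epm, pseudocount=1.0):
    """Return (A, M) for one candidate TSS from normalized tag counts.

    A is the mean log2 expression over both samples (x-axis),
    M is the log2 fold change APL/EPM (y-axis).
    The pseudocount is an assumption that avoids log2(0).
    """
    log_apl = math.log2(apl + pseudocount)
    log_epm = math.log2(epm + pseudocount)
    a = (log_apl + log_epm) / 2.0
    m = log_apl - log_epm
    return a, m

# Hypothetical normalized tag counts for a single TSS.
a, m = ma_values(7.0, 1.0)
print(a, m)
```

A positive M places the TSS above the horizontal axis (higher in APL), a negative M below it.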
[Figure panels (genome browser views): normalized tag counts for APL and EPM at GPR56 (chr16:57,653,181-57,677,232) and ADAMTSL2 (chr9:136,395,254-136,442,102); scale bar 10 kb; panels A and B.]
[Figure panel: gene models (UCSC longest transcript) with supported and novel APL promoters for KIAA0226L, ARGLU1, CTNNA3, RUVBL1, HSPA8, SNX20, ESYT2, GLG1, KIF24, ELF1, ERG, ZFR, PZP, ST5, USP53, USP13, TPM4, IFT88, AKAP9, GPR56, GCNT1, ROBO3, VPS13B, and ADAMTSL2.]
Figure 4: (Previous page) Classification of candidate transcription start sites
Barplot of genomic location classifications of candidate TSSs for pooled
samples (grey bars), or the subsets which are not changing (green), upregulated in
APL (blue), or downregulated in APL (red). For these subcategories, we investigated
their locations: overlapping known TSSs (canonical), novel TSSs within genes either
proximal or distal to the TSS, or intragenic (see Methods for details).
TSSs downregulated (FDR<0.05, N=827) in APL are overrepresented in
canonical promoters, while TSSs upregulated in APL (FDR<0.05, N=364) are
overrepresented in intragenic promoters, supporting shorter transcripts of known
genes. Overlay boxes show gene sets from other studies enriched for either
downregulated canonical promoters (pink), or upregulated distal promoters (teal).
Figure 5: Examples of AML-specific alternative promoters
Two examples of AML-specific alternative promoters identified in the study, at
the ADAMTSL2 (A) and GPR56 (B) loci. Top panel shows UCSC genes from the
UCSC Genome Browser; grey panels show normalized mean TPM counts over
replicates for AML and NanoCAGE tags on the Y axis and the genomic location on
the X axis. For ADAMTSL2, the mRNA stretches corresponding to protein domains
lost by alternative promoter usage are indicated in colored boxes.
[Figure panels: density of ChIP peak occurrence (y-axis, 0 to 6*10-4) against distance to the respective TSS [nt] (x-axis, -10000 to 10000) for E2F4, E2F6, c-Fos, and p300. Legend (location and expression of TSS): downregulated in APL, canonical TSS; upregulated in APL, canonical TSS; upregulated in APL, distal alternative TSS within gene.]
Figure 6: Visual representation of 24 genes with downstream promoters
upregulated in APL.
Gene models visualized are based on longest canonical UCSC gene models,
and are normalized to have the same lengths. Thick boxes indicate protein-coding
exons. Orange arrows indicate the TSSs inferred from the gene models. Red arrows
indicate AML-specific promoters not previously annotated by either UCSC Known
Gene, RefSeq or Ensembl. Green arrows indicate AML-specific promoters that are
supported by aforementioned databases. Note that the large majority of downstream
promoters are within or between coding exons.
Figure 7: Distance distributions from transcription factor binding sites to
candidate TSSs
Distribution of selected ENCODE transcription factor ChIP peaks (pooled
over all samples) in relation to TSS categories. For each TSS in all three categories,
the distance to the nearest respective TF was logged, and the total distance
distribution is shown. *** indicates p < 0.001, * indicates p < 0.05, NS indicates
non-significance.
[Figure panel: log2(distance to nearest TF binding site) (y-axis) per TF (x-axis: E2F4, E2F6, E2F6_(H-50), c-Fos, c-Jun, p300_(N-15)). Legend: canonical, downregulated in APL; distal, upregulated in APL.]
Figure 8: Characteristics of expression of non-coding RNAs
A) MA-plot of non-coding RNAs. Blue dots indicate ncRNAs upregulated in
APL (p< 0.05, n = 90), red dots indicate ncRNAs downregulated in APL (p < 0.05,
n=84). B) Compared to the total number of genes deregulated in APL, ncRNAs are
overrepresented in upregulated genes.
[Figure panels: (A) MA-plot, x-axis (log2[APL]+log2[EPM])/2, y-axis log2[APL/EPM]; (B) frequency of downregulated, unregulated, and upregulated genes among coding and non-coding RNAs, p<0.01.]
Figure S1: (A) Vigorous filtering ensures a reduced but high confidence set of
candidate promoters.
“Reads mapped” indicates reads mapped to mm9, after removing PhiX-specific
sequences. Common tag clusters are defined as overlapping reads, based on all
samples pooled. The filtering steps done to reach 5,748 candidate TSSs are described
in the text.
(B) Violin plots of distance to nearest RefSeq TSS of all (pink) and filtered
(blue) TSSs. The blue line indicates the mean of the distance of random genomic
locations to nearest RefSeq TSS.
Figure S2: Density plot of distance in nucleotides from distal upregulated
promoters to the TSS of the longest RefSeq gene model.
The green line indicates the median.
[Density plot: x-axis, distance from TSS [nt]; y-axis, density.]
[Figure too large for inclusion. Downloadable at
https://www.dropbox.com/s/2e50t51xnntpx5r/Figure_S3.pdf]
Figure S3: Distance boxplots from each tag cluster of a given category to the
closest ENCODE TF binding site.
ARTICLE IV | 97
Article IV
Temporal mapping of CEBPA and CEBPB binding during liver regeneration
reveals dynamic occupancy and specific regulatory codes for homeostatic and cell cycle
gene batteries
Research
Temporal mapping of CEBPA and CEBPB binding during liver regeneration reveals dynamic occupancy and specific regulatory codes for homeostatic and cell cycle gene batteries
Janus Schou Jakobsen,1,2,3,7 Johannes Waage,1,2,3,4 Nicolas Rapin,1,2,3,4
Hanne Cathrine Bisgaard,5 Fin Stolze Larsen,6 and Bo Torben Porse1,2,3,7
1The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, 2Biotech Research and Innovation Centre (BRIC), 3The Danish Stem
Cell Centre (DanStem), Faculty of Health Sciences, 4The Bioinformatics Centre, 5Department of Cellular and Molecular Medicine,
Faculty of Health Sciences, University of Copenhagen, DK-2100 Copenhagen, Denmark; 6Department of Hepatology, Rigshospitalet,
DK2200 Copenhagen, Denmark
Dynamic shifts in transcription factor binding are central to the regulation of biological processes by allowing rapid changes in gene transcription. However, very few genome-wide studies have examined how transcription factor occupancy is coordinated temporally in vivo in higher animals. Here, we quantified the genome-wide binding patterns of two key hepatocyte transcription factors, CEBPA and CEBPB (also known as C/EBPalpha and C/EBPbeta), at multiple time points during the highly dynamic process of liver regeneration elicited by partial hepatectomy in mouse. Combining these profiles with RNA polymerase II binding data, we find three temporal classes of transcription factor binding to be associated with distinct sets of regulated genes involved in the acute phase response, metabolic/homeostatic functions, or cell cycle progression. Moreover, we demonstrate a previously unrecognized early phase of homeostatic gene expression prior to S-phase entry. By analyzing the three classes of CEBP bound regions, we uncovered mutually exclusive sets of sequence motifs, suggesting temporal codes of CEBP recruitment by differential cobinding with other factors. These findings were validated by sequential ChIP experiments involving a panel of central transcription factors and/or by comparison to external ChIP-seq data. Our quantitative investigation not only provides in vivo evidence for the involvement of many new factors in liver regeneration but also points to similarities in the circuitries regulating self-renewal of differentiated cells. Taken together, our work emphasizes the power of global temporal analyses of transcription factor occupancy to elucidate mechanisms regulating dynamic biological processes in complex higher organisms.
[Supplemental material is available for this article.]
Most biological processes, such as embryonic development, differentiation, or cellular responses to external stimuli, are in essence dynamic, which requires the underlying transcriptional networks to be tightly temporally coordinated. Several large-scale temporal studies have focused on the comprehensive mapping of dynamic changes of gene expression but have not revealed how coordination is achieved on the transcriptional level (e.g., van Wageningen et al. 2010). To address this question globally, the binding of regulatory proteins has been examined using methods such as chromatin immunoprecipitation with microarrays (ChIP-chip) or sequencing (ChIP-seq) (e.g., Sandmann et al. 2006a; Johnson et al. 2007; Vogel et al. 2007). Still, many such studies have operated with just one or two conditions. The exceptions include temporal examinations in the model systems Drosophila and yeast (Sandmann et al. 2006b; Jakobsen et al. 2007; Ni et al. 2009; Zinzen et al. 2009). So far, very few in vivo, multi-time point investigations of transcription factor (TF) binding have been undertaken in higher vertebrates.
Mammalian liver regeneration is a well-studied process, in which the large majority of mature hepatocytes rapidly and in a highly synchronized manner re-enter the cell cycle upon injury (Fausto et al. 2006; Michalopoulos 2007; Malato et al. 2011). Serial transplantation of liver tissue has demonstrated a very high “repopulating” capacity of hepatocytes (Overturf et al. 1997), which lends hope to using these cells in regenerative medicine. The ability of mature liver cells to proliferate is reminiscent of specific, differentiated cells of the immune system (naive T cells, B cells) that are kept in quiescence until exposed to specific stimuli (Glynne et al. 2000; Yusuf and Fruman 2003; Feng et al. 2008). However, it remains open whether similar programs control the proliferation of hepatocytes and immune cells. In the liver, a number of studies have mapped temporal changes in mRNA levels during regeneration (e.g., White et al. 2005). This, coupled with functional studies, has led to the identification of several TFs involved in the regenerative response (for review, see Kurinna and Barton 2011). Still, knowledge about how these factors are coordinated temporally throughout the regenerative process is limited.
CEBPA (C/EBPalpha) and CEBPB (C/EBPbeta) are two key hepatocyte TFs known to have divergent roles in liver function and regeneration. The two factors belong to the same basic region leucine zipper family (bZIP), and several studies have shown that they bind the same core DNA sequence, acting as either homo- or heterodimers (Diehl and Yang 1994; Rana et al. 1995; Osada et al. 1996). While CEBPA is highly expressed in the quiescent condition (before injury) and regulates many metabolic liver genes, CEBPB is up-regulated during liver regrowth and is required for a full regenerative response (Greenbaum et al. 1995; Wang et al. 1995). In many tissues, CEBPA is observed to be an anti-proliferative factor facilitating differentiation, while CEBPB has been found to be either pro- or anti-proliferative in different settings (Porse et al. 2001; Nerlov 2007). In the skin, the two factors appear to act redundantly to limit epidermal stem cell activity (Lopez et al. 2009).
In the current study, we have examined dynamic TF binding during liver regeneration in the mouse by performing a time course of ChIP-seq experiments to map and quantify binding of CEBPA and CEBPB on a genome-wide scale at a high level of temporal resolution. We find that CEBPA and CEBPB generally occupy the same positions in the genome of hepatocytes in vivo, but they do so with quantitatively divergent temporal patterns during regeneration.
To further dissect the dynamic transcriptional network behind liver regeneration, we interrogated the temporally defined groups of CEBP-bound elements for regulatory properties, both with respect to sequence composition and differential expression of associated genes.
7Corresponding authors
E-mail [email protected], [email protected]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.146399.112.
Genome Research 23:592–603 © 2013, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/13; www.genome.org
Downloaded from genome.cshlp.org on July 11, 2013 - Published by Cold Spring Harbor Laboratory Press
Results
A time course of CEBPA and CEBPB in vivo ChIP experiments reveals three distinct patterns of binding
Liver regeneration in rodents has been studied in detail using partial hepatectomy, in which three of five lobes of the liver are resected. The hepatocytes in the remaining liver lobes undergo up to two cell cycles during the first week of liver regeneration, thereby reestablishing presurgical liver mass (for reviews, see Fausto et al. 2006; Michalopoulos 2007). To examine the binding dynamics of CEBPA and CEBPB during liver regeneration, we harvested regenerating liver tissue at eight time points (0, 3, 8, 16, 24, 36, 48, and 168 h; Methods). These time points cover the quiescent state (G0), several stages of the first cell cycle growth phase (G1), the G1–S-phase transition at 36 h, a later time point at 48 h, as well as the terminal phase point of 168 h (Fig. 1A; Matsuo et al. 2003; Fausto et al. 2006). Livers from five mice for each time point were subjected to ChIP using specific CEBPA or CEBPB antibodies (Methods) (Fig. 1B). We confirmed antibody specificity by doing ChIP in mice deficient for Cebpa or Cebpb or by including epitope blocking peptides (Supplemental Fig. S1). To minimize experimental variation, we pooled DNA precipitated from each of the five mice in equal amounts and sequenced the combined material. We mapped the sequences to the mouse genome (mm9) and determined base pair coverage after internal normalization to the total mapped read count for each antibody (Supplemental Methods; Supplemental Fig. S2; Supplemental Table S1). Genomic regions bound by either CEBPA or CEBPB were identified with the Useq peak-finder algorithm using a mock immunoprecipitation (IP) control (Supplemental Methods, Supplemental Table S2). We validated IP consistency of two series of three independent ChIPs at four different CEBP bound genomic locations (Supplemental Fig. S3). For further validation, we performed de novo motif searches on several data sets and a conservation score analysis centering on CEBP motifs, all of which confirmed the quality of our ChIP-seq data (Supplemental Fig. S4).
Figure 1. A ChIP-seq time course of mouse liver regeneration. (A) Overview of the experimental time course. Eight time points covering the full week of the regenerative response; phases of the first cell cycle indicated. Previous studies show a prolonged decrease in the CEBPA to CEBPB ratio. (B) Experimental outline for ChIP experiments. Partial hepatectomy was carried out on five mice for each time point. The regenerating liver tissue was harvested after variable time intervals, fixed, and subjected to ChIP-seq. (C) Normalized coverage maps of four gene proximal regions with ChIP-seq peaks, marked by dashed vertical lines. Each row represents a time point for either CEBPA (top eight rows, dark purple) or CEBPB (bottom eight rows, dark pink). Gene loci are shown in black below. (D) CEBPA and CEBPB temporal binding profiles based on genomic coverage, generated from the peaks in C. (Alb-enh) Albumin proximal region; (Mdc1) mediator of DNA damage checkpoint 1; (Orm1) orosomucoid 1; (Cps1) carbamoyl-phosphate synthetase 1.
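The internal normalization of base pair coverage to the total mapped read count can be sketched as follows. The exact procedure is described in the Supplemental Methods and is not reproduced here; this sketch assumes a simple scaling to a common library size of one million reads, and the function name and numbers are hypothetical.

```python
def normalize_coverage(raw_coverage, mapped_reads, scale=1_000_000):
    """Scale per-base coverage to a common library size.

    raw_coverage: per-base read coverage for one region and sample.
    mapped_reads: total mapped reads for that sample.
    scale: target library size (assumed here to be one million reads).
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    factor = scale / mapped_reads
    return [c * factor for c in raw_coverage]

# Hypothetical sample with two million mapped reads: coverage is halved.
print(normalize_coverage([10, 20], 2_000_000))
```

Scaling each sample separately makes coverage values comparable across time points that were sequenced to different depths.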
Next, CEBPA or CEBPB peaks for each time point were used to construct a sum list of all enriched regions. Four examples of enriched regions located just upstream of gene loci are shown in Figure 1, depicting genomic coverage for the CEBPA and CEBPB ChIP time series (Fig. 1C).
After applying a stringent filtering regimen, we identified a total of 11,314 high-confidence regions bound by CEBPA or CEBPB, many of which show highly dynamic occupancy during the course of regeneration (for full peak calling numbers and region coverage, see Supplemental Tables S2, S3; for filtering, see Supplemental Methods). We assembled temporal binding profiles for all bound regions based on maximal genomic coverage. As illustrated in Figure 1 for the four regions mentioned above, distinct profile patterns can be observed (Fig. 1D).
The high temporal resolution of the obtained binding profiles allowed us to query for groups of putative cis-regulatory regions with similar occupancy dynamics. By hierarchical clustering, we identified three prevalent clusters (Fig. 2A), containing roughly equal numbers of bound regions (3549, 2818, and 3034 for the A, B, and C clusters, respectively). Two clusters (A and B) were defined by strong CEBPB binding, with maxima either at the 3- and 36-h (A) or at the 3-h (B) time points, which is also evident in a summed profile of temporal coverage (Fig. 2B). As opposed to the A and B clusters, the C cluster was characterized by a high degree of temporal correspondence between the CEBPA and CEBPB binding levels, with maxima at the quiescent state 0-h time point and at 24 h (Fig. 2A,B). The detected robust binding of CEBPA at the 16- and 24-h time points was unexpected, as previous studies have reported decreased protein levels for this TF throughout the regenerative process (Greenbaum et al. 1995, 1998). To clarify this, we examined the protein levels for CEBPA and CEBPB until the first round of replication (time points, 0–36 h) by Western blotting (Fig. 2C). This highlighted a clear correlation between the CEBPA to CEBPB protein level ratios and the observed binding patterns. Specifically, we detected high levels of CEBPB at the 3- and 36-h time points, high CEBPA levels at the quiescent state (0 h), and a return of CEBPA predominance at 16- and 24-h time points (Fig. 2C,D).
Figure 2. Three distinct temporal binding patterns reflect the CEBPA and CEBPB protein level ratios. (A) Heatmap of hierarchical clustering on CEBPA and CEBPB binding profiles, based on genomic coverage in each region (row) and time point (column). White signifies low binding level (coverage 0); dark blue, high (cut-off at coverage 120). The most prominent clusters are marked A, B, and C. (B) Summarized coverage profiles of the three clusters. For each profile, summed time point levels are shown as normalized to the highest point (set to 1). (C) Western blot showing CEBPA and CEBPB levels through the first 36 h of liver regeneration. Isoforms indicated (CEBPA: p42 and p30; CEBPB: LAP*, LAP, and LIP). Loading control is histone H3. Experiment is representative of three independent repeats. (D) Representative quantification of protein levels based on signal intensities (using ImageJ). All isoforms included for protein level determination, normalized to a maximum value of 1.
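The summed coverage profile of a cluster, normalized to its highest time point as in Fig. 2B, can be sketched in a few lines. This is an illustration only: the function name and the two hypothetical three-time-point regions are assumptions, and the clustering step itself is not shown.

```python
def summed_profile(region_profiles):
    """Sum coverage over all regions of a cluster per time point,
    then scale so that the highest time point equals 1."""
    totals = [sum(column) for column in zip(*region_profiles)]
    peak = max(totals)
    if peak == 0:
        # No binding signal at any time point: return a flat profile.
        return [0.0 for _ in totals]
    return [t / peak for t in totals]

# Two hypothetical regions measured at three time points.
cluster = [
    [10, 40, 20],
    [30, 40, 0],
]
print(summed_profile(cluster))
```

Because each cluster is scaled to its own maximum, the profiles show the timing of binding rather than absolute differences in coverage between clusters.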
The three binding pattern clusters are associated with specific functional classes of genes
The temporally distinct CEBP binding patterns of putative cis-regulatory regions belonging to the three clusters suggest divergent regulatory roles through the regenerative process. To address this possibility, we investigated whether each cluster was associated with specific sets of differentially expressed genes. To this end, we performed ChIP-seq experiments with liver tissue from the eight time points outlined above with an antibody specific to RNA polymerase II (POL2). In contrast to examining steady-state mRNA levels, this allowed us to directly measure the transcriptional activity at any given time point as POL2 binding to each gene body (Supplemental Methods; Sandoval et al. 2004). To pinpoint genes regulated by CEBPA and/or CEBPB, a single putative target gene was assigned to each bound region, based on proximity to the transcription start site (TSS) of neighboring genes (Supplemental Methods). Genes associated with regions belonging to one of the three binding profile clusters (A, B, or C cluster genes) were inspected for differential expression (for a full target gene list, see Supplemental Table S4).
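Assigning one putative target gene per bound region by TSS proximity can be sketched as below. The actual rules are given in the Supplemental Methods; this sketch assumes a single chromosome, uses the region midpoint, ignores strand, and all names and coordinates are hypothetical.

```python
def assign_target_gene(region_start, region_end, tss_by_gene):
    """Return the gene whose TSS lies closest to the region midpoint.

    tss_by_gene maps gene name -> TSS coordinate on the same chromosome.
    Ties are resolved in favor of the gene listed first in the mapping.
    """
    midpoint = (region_start + region_end) // 2
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - midpoint))

# Hypothetical TSS positions for two neighboring genes.
tss = {"GeneA": 1000, "GeneB": 5000}
print(assign_target_gene(1800, 2200, tss))
print(assign_target_gene(4000, 4400, tss))
```

A nearest-TSS rule gives every region exactly one gene, which keeps the later gene counts per cluster simple, at the cost of missing long-range targets.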
For an initial overview, we counted differentially expressed genes from each cluster, defined as genes with a change in POL2 gene body coverage above twofold, comparing each time point to 0 h (Fig. 3A). This revealed that all clusters are associated with prominent gene expression changes at the 3-h time point. The C cluster was associated with a large group of down-regulated genes at 3 h and, furthermore, displayed up-regulation of many genes at 24 h. In contrast, the A and B clusters mostly exhibit down-regulation at the 24-h time point and up-regulation early in the time course (3 and 8 h), as well as a resurgence at 36 h (mainly the A cluster).
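The counting of up- and down-regulated genes relative to 0 h can be sketched as follows, assuming a strict twofold threshold on the ratio of POL2 gene body coverage. The function name, the handling of genes without signal, and the example values are assumptions for illustration.

```python
def count_regulated(coverage_by_gene, time_index, threshold=2.0, baseline_index=0):
    """Count genes whose POL2 gene body coverage changes more than
    `threshold`-fold at `time_index` relative to the baseline time point.

    coverage_by_gene maps gene name -> list of coverage values per time point.
    Returns (number up, number down).
    """
    up = down = 0
    for profile in coverage_by_gene.values():
        base, value = profile[baseline_index], profile[time_index]
        if base <= 0 or value <= 0:
            # Assumption: genes without signal at either point are skipped.
            continue
        ratio = value / base
        if ratio > threshold:
            up += 1
        elif ratio < 1.0 / threshold:
            down += 1
    return up, down

# Hypothetical coverage at 0 h and 3 h for four genes.
coverage = {"g1": [10, 25], "g2": [10, 4], "g3": [10, 12], "g4": [0, 5]}
print(count_regulated(coverage, 1))
```

Running the same function for every time point and cluster yields the up and down bins of a heatmap such as Fig. 3A.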
Next, we focused the analysis on the 0-, 3-, 24-, and 36-h time points, as these displayed the most pronounced TF binding differences between the three clusters and therefore are likely to represent the most distinct regulatory states. Groups of genes, up- or down-regulated from one time point to the next (i.e., from 0–3 h, 3–24 h, and 24–36 h), were subjected to gene ontology (GO) analysis with the online DAVID tool (Supplemental Methods) (Huang da et al. 2009). As summarized in Table 1, we observed extensive differences in functional classes (GO terms, Biological Process) of differentially expressed genes associated with the three binding patterns (for full GO analysis lists, see Supplemental Table S5).
At the 0- to 3-h transition, the A cluster displays almost equal numbers of genes up- and down-regulated (554 vs. 682) (Table 1). In contrast, the B and C clusters showed mainly decreased target gene activity, with 610 and 709 down-regulated genes (63% and 70% of total regulated genes, respectively). The down-regulated genes associated with both B and C regions are annotated with lipid metabolism or oxidation-reduction terms, while the activated genes are associated with acute phase or inflammatory response. This shift of gene activity is expected, as the liver shifts from a quiescent, homeostatic state to an acute stage as a response to injury. By inspection of shared B and C target gene loci, we found several examples of B and C peaks in close proximity to each other. These often bind CEBPs at mutually exclusive time points, as exemplified in Figure 3B (lower panels). This observation suggests that CEBP complexes, possibly via interaction with specific co-activators or repressors at B or C regions, could have opposing actions on transcription, leading to timed fine-tuning of gene activity.
Prominent gene expression changes from 3 to 24 h include a significant down-regulation of A cluster genes with the term “transcriptional regulatory activity,” a shift that is paralleled by a decrease of A cluster CEBPA and CEBPB occupancy. This may suggest that the expression of these regulatory factors is dependent on CEBP binding. Another finding was the up-regulation of genes targeted by C cluster regulatory regions (889 of 1209 differentially
Figure 3. The A, B, and C cluster regions target distinct sets of genes. (A) Heatmap of gene expression changes through liver regeneration. “Up” and “down” bins count genes associated with a CEBP cluster (A, B, or C), exceeding a 2.0-fold change threshold (POL2 binding level change relative to 0 h). Blue color intensity indicates number of genes going up; gray intensity, genes going down. (B) Representative examples of transcriptional activity (polymerase II [POL2] binding) of putative target genes. CEBPB binding illustrates gene association with either the C (Cps1) or A (Mdc1) cluster exclusively (upper left and right panel, respectively) or association with peaks belonging to both B and C clusters (Adh7 and Orm1; bottom panels). Gene loci are shown in black below; binding levels for CEBPB and POL2, in dark pink and red, respectively. Black numbers indicate POL2 gene body coverage for time point comparison. (Adh7) Alcohol dehydrogenase 7; for additional gene symbols, see Figure 1.
Table 1. Target gene GO categories define distinct regulatory roles for the A, B, and C binding clusters

Cluster | Transition | Direction | Genes | Top GO term 1 (P-value) | Top GO term 2 (P-value)
A | 0–3 h | UP | 554 | Hexose metabolic process (0.255) | Programmed cell death (0.267)
A | 0–3 h | DOWN | 682 | Amine biosynthetic process (0.012) | Cellular amino acid biosynthetic process (0.014)
A | 3–24 h | UP | 562 | Purine nucleotide binding (0.082) | Adenyl nucleotide binding (0.086)
A | 3–24 h | DOWN | 749 | Transcription regulator activity (2.6 × 10^-6) | DNA binding (2.2 × 10^-4)
A | 24–36 h | UP | 545 | Cell cycle (0.001) | Regulation of cell cycle (0.038)
A | 24–36 h | DOWN | 73 | Negative regulation of transcription, DNA-dependent (1.000) | Wnt receptor signaling pathway (1.000)
B | 0–3 h | UP | 360 | Regulation of inflammatory response (0.016) | Cofactor metabolic process (0.063)
B | 0–3 h | DOWN | 610 | Organic ether metabolic process (0.007) | Triglyceride metabolic process (0.009)
B | 3–24 h | UP | 518 | Solute:cation symporter activity (0.277) | Symporter activity (0.285)
B | 3–24 h | DOWN | 460 | Transcription regulator activity (0.036) | Lipase activity (0.521)
B | 24–36 h | UP | 272 | Cofactor binding (0.152) | Oxidoreductase activity, acting on sulfur group of donors (0.277)
B | 24–36 h | DOWN | 77 | NA | NA
C | 0–3 h | UP | 309 | Acute inflammatory response (0.008) | Response to wounding (0.031)
C | 0–3 h | DOWN | 709 | Oxidation reduction (5.8 × 10^-8) | Lipid biosynthetic process (8.9 × 10^-5)
C | 3–24 h | UP | 889 | Oxidation reduction (2.4 × 10^-4) | Steroid metabolic process (1.0 × 10^-3)
C | 3–24 h | DOWN | 320 | Positive regulation of macromolecule biosynthetic process (0.064) | Positive regulation of nitrogen compound metabolic process (0.064)
C | 24–36 h | UP | 177 | Biological adhesion (0.012) | Cell adhesion (0.024)
C | 24–36 h | DOWN | 196 | Pattern binding (0.059) | Polysaccharide binding (0.059)

The two top hits of GO biological process categories enriched are shown. Genes assigned to a CEBP binding cluster were included in the UP and DOWN bins when exceeding a fold change of transcriptional activity above 1.55. Changes are measured as POL2 binding levels relative to a preceding time point as indicated. P-values are Benjamini corrected for multiple testing. For a complete GO term list, see Supplemental Table S5.
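The multiple-testing correction mentioned in the table note can be illustrated with a short sketch. It assumes that "Benjamini corrected" refers to the Benjamini-Hochberg step-up procedure; the text does not state the variant, and the function name and P-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Each raw P-value is multiplied by m / rank (m = number of tests,
    rank = position in ascending order), and monotonicity is enforced
    by taking a running minimum from the largest P-value downward.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for four GO terms.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Adjusted values are never smaller than the raw ones, which is why several terms in a table of this kind can share the same corrected P-value.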
expressed genes). Enriched C cluster GO terms were consistentwith a resurge of homeostatic and metabolic gene activity beforeS-phase entry (Table 1). One notable example is Cps1, carbamoyl-phosphate synthetase 1, which encodes a rate-limiting enzyme inthe urea-cycle (Jones 1965) and displays a pronounced up-regulationat the 24-h time point (Fig. 3B, upper left panel).
At the 36-h time point, the A cluster associated genes dis-played a strong bias toward increased transcriptional activity, with545 up-regulated against only 73 down-regulated, in contrast tothe C group target genes with 177 up and 196 down. In the large setof up-regulated A cluster genes, we found a strong overrep-resentation of cell cycle GO terms (Table 1). An example of a cellcycle regulator gene (Mdc1) targeted by an A cluster enhancer isshown in Figure 3 (Fig. 3B, upper right panel). The B cluster showsassociation with relatively few regulated genes at this time point(272 up and 77 down).
These data collectively show the three clusters to be associated with defined sets of genes with distinct biological functions and timing of expression.
Motif analysis identifies specific and mutually exclusive sets of regulatory code
The targeting of distinct sets of genes and the distinct binding patterns suggest the presence of divergent regulatory features in the regions constituting the three temporal clusters. Combinatorial binding of multiple TFs has been proposed to explain dynamic changes in occupancy and gene regulation during development (e.g., Zinzen et al. 2009).
To examine this possibility in our system, we carried out an analysis of the collective set of sequences of each binding cluster (A, B, or C) versus a background set. We counted instances of known TF cognate sequences to generate probability scores using the ASAP software (Marstrand et al. 2008). In order to attain comprehensive coverage with minimal redundancy, the analysis was performed using a condensate of three publicly available databases of TF binding sequences or position weight matrices (PWMs) (Supplemental Methods; Supplemental Table S6). Probability Z-scores for all 246 condensed PWMs with representation above a threshold were used for hierarchical clustering to examine the general differences among the three clusters (Fig. 4A). Key examples are shown with full Z-score information (Fig. 4B).
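To illustrate the kind of overrepresentation score discussed above, the sketch below computes a Z-score for the number of sequences in a cluster that carry a motif hit, relative to the hit frequency in a background set, using a normal approximation to the binomial. This is an assumption made for illustration and not necessarily the statistic implemented in ASAP; the function name, the counts, and the set sizes are all hypothetical.

```python
import math


def overrepresentation_z(fg_hits, fg_total, bg_hits, bg_total):
    """Z-score of the foreground hit count under a binomial model whose
    success probability is the background hit frequency (assumed model)."""
    p = bg_hits / bg_total
    expected = fg_total * p
    sd = math.sqrt(fg_total * p * (1.0 - p))
    return (fg_hits - expected) / sd


# Hypothetical numbers of sequences with at least one motif hit, per cluster.
hit_counts = {
    "E2F": {"A": 300, "B": 250, "C": 60},
    "FOXA2": {"A": 40, "B": 50, "C": 280},
}
cluster_sizes = {"A": 1000, "B": 1000, "C": 1000}   # hypothetical set sizes
background = {"E2F": (1000, 10000), "FOXA2": (1200, 10000)}  # (hits, total)

z_table = {
    motif: {
        cluster: overrepresentation_z(hits, cluster_sizes[cluster], *background[motif])
        for cluster, hits in per_cluster.items()
    }
    for motif, per_cluster in hit_counts.items()
}

for motif, row in z_table.items():
    print(motif, {cluster: round(z, 1) for cluster, z in row.items()})
```

A positive score indicates overrepresentation relative to background and a negative score underrepresentation; a table of such scores (motifs by clusters) is the kind of input a hierarchical clustering would operate on.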
As expected, we found CEBP motifs to be among the most enriched for all of the binding profile clusters (Fig. 4A,B; Supplemental Table S6). From the hierarchical clustering, it was evident that the majority of motifs found in the A set were shared with the B set, while a small group of motifs was found to have robust overrepresentation in all three clusters (Fig. 4A). Prominent among these were motifs recognized by CREB and HNF4A (Fig. 4A,B), two TFs known to have a wide set of target genes in hepatocytes and central roles in liver metabolism and development (Costa et al. 2003; Montminy et al. 2004; Bolotin et al. 2010).
Notably, many motifs displayed a clear mutually exclusive pattern of overrepresentation (Fig. 4A). Specifically, many motifs found in the A and B clusters of cis-regulatory elements were underrepresented in the C cluster and vice versa. Binding sequences of the A and B regions belong to TFs related to stress response, proliferation, or cell cycle regulation, such as E2F, EGR1 and MYC (c-myc), the hypoxia response factor HIF1A, the metal response factor MTF1, and Kruppel-like factors (KLFs) (Fig. 4B; Supplemental Table S6; Lichtlen and Schaffner 2001; Blais and Dynlacht 2004; Liao et al. 2004; Eilers and Eisenman 2008; Majmundar et al. 2010; McConnell and Yang 2010; Zwang et al. 2011). The C regions, on the other hand, contain matches to PWMs of liver-specific TFs, such as FOXA2, ONECUT1 (previously known as HNF6), and HNF1A (Costa et al. 2003; Guillaumond et al. 2010), or of factors found to be associated with a quiescent state (G0), e.g., MAFB and FOXO3A (Greer and Brunet 2005; Sarrazin et al. 2009). Additionally, several different SOX motifs were identified as strongly overrepresented within this cluster (Fig. 4B; Supplemental Table S6).
Overall, these observations suggest significant differences in the regulatory mechanics of the A/B clusters versus the C cluster of putative cis-regulatory elements. Moreover, the data point to sets of specific, dynamic TF binding partners of CEBPs, defining a temporal regulatory code.
Multi-level support of temporal cis-regulatory code predictions
To assess if the TFs predicted to interact with A or C cluster regions are present in the liver, we examined their expression levels (POL2 gene body reads) against all genes (Fig. 4C; Supplemental Table S7). This revealed that the majority are highly expressed, being represented by either direct matches or top protein family members (e.g., Klf10, -13, -15 or Sox13, -15, -18). Strikingly, Cebpa, Cebpb, and Mafb are at the top of the list (rank positions of 107, 21, and 61, respectively).
We further examined the indicated differences of the three clusters by interrogating their genomic position relative to the nearest TSS (Fig. 4D). The three clusters clearly diverge in this respect, as the C cluster regions are distinct from A (or B) regions by showing no proximity to TSSs. Several studies have shown differences in preferred genomic positioning of TFs (e.g., Gerstein et al. 2012). This suggests, based solely on the difference in genomic position, that C cluster regions recruit cobinding factors other than those recruited by the A (or B) regions. Specifically, E2Fs and MYC have been shown to bind at positions similar to the A regions, supporting the relevance of the enrichment for E2F and MYC cognate sequences in the A cluster (e.g., Eilers and Eisenman 2008).
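The distance measurement described above can be sketched in a few lines of Python. The example assumes a single chromosome with TSS coordinates held in a sorted list and uses a binary search to find the closest TSS to each peak summit; the coordinates and names are hypothetical and do not come from the study.

```python
import bisect


def distance_to_nearest_tss(summit, sorted_tss):
    """Absolute distance from a peak summit to the closest TSS.

    `sorted_tss` must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as the summit.
    """
    i = bisect.bisect_left(sorted_tss, summit)
    candidates = []
    if i < len(sorted_tss):
        candidates.append(sorted_tss[i] - summit)      # nearest TSS at or after the summit
    if i > 0:
        candidates.append(summit - sorted_tss[i - 1])  # nearest TSS before the summit
    return min(candidates)


# Hypothetical coordinates on one chromosome.
tss_positions = sorted([1_000, 25_000, 90_000])
summits = {"promoter_proximal_peak": 1_150, "distal_peak": 55_000}

distances = {
    name: distance_to_nearest_tss(position, tss_positions)
    for name, position in summits.items()
}
print(distances)
```

In a genome-wide setting the same lookup would be repeated per chromosome, and the resulting distance distributions compared between clusters and against random positions.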
Next, we took advantage of published mouse ChIP-seq data sets (Chen et al. 2008; Hoffman et al. 2010; Laudadio et al. 2012) to test if a panel of factors bound to A versus C regions as predicted by our computational analysis. We find our analysis to be supported, as the E2F1, KLF4, and MYC factor bound regions overlap significantly more with A regions, while FOXA2 and ONECUT1 preferentially bind C regions (Fig. 4E; Supplemental Methods).
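A one-tailed hypergeometric test of overlap, as named in the legend of Figure 4E, can be sketched as follows. The function returns the probability of observing at least k overlapping regions when n regions are drawn from a universe of N regions of which K are bound by the external factor. How the universe of regions was defined in the study is not stated here, so the numbers below are purely hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: size of the region universe, K: regions bound by the external factor,
    n: regions in the cluster of interest, k: observed overlap.
    """
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 1000 regions in total, 100 bound by an external factor,
# 200 regions in the cluster, 40 of which overlap (20 expected by chance).
p_value = hypergeom_upper_tail(k=40, N=1000, K=100, n=200)
print(f"one-tailed P = {p_value:.3g}")
```

This upper-tail probability is the same quantity as the P-value of a one-sided Fisher's exact test on the corresponding 2 × 2 table.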
Finally, we tested co-occupancy of CEBPs and several putative cofactors by performing sequential ChIP (Methods). This shows that the factor E2F3 binds to the two A regions Slbp and Cbx5 simultaneously with CEBP factors, but to none of the examined C regions, in accordance with our predictions. In contrast, the C cluster-associated factors ONECUT1, HNF1A, and MAFB interact with several C regions also occupied by CEBPs (Fig. 4F). Notably, a number of target regions are shared among the three C region factors.
Two modes of transcriptional regulation by EGR1
EGR1 has been shown to be essential for a timely regenerative response in the mouse liver (Liao et al. 2004) and represents a classical “immediate early TF,” which is up-regulated upon a growth stimulus (for review, see Thiel and Cibelli 2002). Recently, it was found to be part of a growth-signal discriminatory circuit together with TP53 (Zwang et al. 2011). Canonical EGR1 cognate sequences are among the most highly enriched in the “cell cycle” or “proliferation” set of CEBP bound regulatory regions (the A cluster), together with E2F and MYC motifs (Supplemental Table S6).
To investigate the role of EGR1 in liver regeneration in vivo, we performed ChIP-seq experiments with EGR1-specific antibodies at the 24- and 36-h time points. Unexpectedly, we observed a preference of EGR1 for the C set over the A set regions (Fig. 5A). Accordingly, A and EGR1 peak summits were generally positioned further apart than C and EGR1 peaks (Fig. 5B). Moreover, a GO analysis of putative target genes suggested that EGR1 may be involved in all aspects of liver regeneration, as acute phase genes, metabolic genes, and cell cycle genes are found proximal to EGR1 bound regions (Supplemental Table S9; data not shown).
Figure 4. The CEBP temporal binding patterns display two sets of cis-regulatory code. (A) Representation of motif frequencies in each temporal cluster. Hierarchical clustering based on Z-scores of 210 position weight matrices (PWMs). Each row indicates a specific motif (PWM), while columns represent each binding cluster (A, B, or C). Color scale indicates representation relative to background set (overrepresented, dark blue; underrepresented, dark gray; no difference, white). Z-score scale is cut off at ±40 for clarity. Example PWMs are shown in black text. (B) Selected transcription factors with clear binding sequence overrepresentation in A and B clusters (upper six), all clusters (middle three), or only the C cluster (lower seven). Clusters are denoted by color. (C) Expression level frequency distribution (POL2 gene body read coverage) of all genes (reads per kilobase, sum of eight time points, normalized, above 0.5) with candidate transcription factors indicated. (D) Distances from A, B, and C cluster peak summits or random positions to the most proximal RefSeq transcription start site (TSS). (E) External ChIP-seq data peak regions (yellow bars) showing overlaps with CEBP A and C cluster regions (white numbers indicate proportions) and P-value of hypergeometric test (Fisher’s one-tailed) of overlap similarity. (F) Sequential ChIP (reChIP) assessing co-occupancy at A and C cluster regions. Anti-CEBP was used as first-round antibody (recognizing both CEBPA and CEBPB), with second-round IgG or antibody against indicated TFs. Enrichments are normalized to IgG levels. N = 2–5. Error bars, SEM. (*) P < 0.05; (**) P < 0.01, t-test versus IgG enrichments. For gene names, see Supplemental Table S10.
By manual inspection of tightly overlapping EGR1 and C regions bound by CEBPs, we found many peaks positioned precisely at a CEBP sequence but lacking EGR1 cognate sequence (Fig. 5C, bottom panels). Conversely, EGR1 peaks that possessed EGR1 cognate sequences did not overlap exactly with the CEBP peaks, and these peaks were consistently associated with A cluster regions (Fig. 5C, top panels). Sequences of target regions with positions of CEBP and EGR1 PWM hits and logos of used PWMs can be found in Supplemental Figure S5.
To test if EGR1 depends on CEBPs for interacting with DNA at C regions (lacking EGR1 sequences), we performed EGR1-specific ChIP with liver tissue from wild-type and Cebpb knockout mice. Our results show that EGR1 binding depends significantly more on CEBPB at C regions than at A regions, suggesting that EGR1 could bind DNA in two modes, indirectly (via CEBPs) at C regions and directly at A regions (Fig. 5D).
Differential sets of transcriptional regulators targeted by the three clusters of cis-regulatory regions
To further explore the transcriptional network of liver regeneration, we focused on the 120 most highly expressed CEBP target genes annotated with the “transcriptional regulator, DNA-dependent” GO term (Supplemental Table S8). These we represent as targets of either A, B, or C cluster regions or of any combination hereof (Fig. 6A).
We find that the large majority of regulators (110 of 120) are targeted by either the A or B cluster regions, while the C cluster targets far fewer (38), suggesting a massive shift at the transcriptional network level upon transition from a quiescent to a regenerative state. Moreover, the injury response clusters (A/B) also impinge on many of the regulators targeted at the quiescent stage (29 of 38 bound by the C cluster), hinting at tight integration of the genetic programs at successive stages of the complex regenerative process.
A number of genes are associated with a high number of CEBP bound regions belonging to two or three clusters (Fig. 6A), suggesting a complex cis-regulatory structure. This may suggest that they are important components, or “hubs,” of the network. Prominent examples are the genes for the bHLH DNA-binding dominant repressor, ID2, as well as the cell cycle regulators JUN and EGR1 (Fig. 6A,B). Also Cebpa and Cebpb themselves are targets of multiple bound regions, as are Mafb, Hnf4a, and several Klf genes (Fig. 6A).
Several genes of regulatory factors are targeted by regions of which they may be coregulators based on the binding motif analysis (Figs. 4B, 6A). This can be interpreted as network feed-forward or auto-regulatory loops (Fig. 6C). Examples include genes encoding E2Fs and KLFs for the A and B cluster regions and FOX factors for the C regions. Hnf4a is a shared target of regions from all three clusters, and the HNF4A binding sequence is also enriched in all clusters (Fig. 6A; Supplemental Tables S6, S7). These regulatory loops may be involved in adding robustness to the network as suggested for other transcriptional hierarchies (Lee et al. 2002).
Figure 5. Two modes of EGR1 cobinding at A and C cluster regions. (A) Bar diagram displaying overlap between peaks defined by EGR1 binding (24 h or 36 h, yellow bars) and CEBP binding. A or C cluster CEBP peak overlaps are shown (white numbers indicate proportions) with P-values of hypergeometric tests (Fisher’s one-tailed) of similarity. (B) Diagram showing the distribution of distances between CEBP peak summits and the closest EGR1 (36-h) peak. A (gray) and C (blue) cluster peaks and a randomly distributed set (dashed line) are included. (C) CEBP and EGR1 ChIP read coverage and PWM hit tracks from representative A (top panels) and C (bottom panels) cluster regions. Vertical red bars indicate CEBP and EGR1 PWM hits; height of bars indicates identity score; and transparent black bars show position of peak summits. (D) EGR1-ChIP using livers of mice deficient for Cebpb and wild-type livers. (Y-axis) Ratio of enrichment for Cebpb-null versus wild-type experiments. EGR1 bound regions overlapping with CEBP A cluster (gray) and C cluster (blue) peaks are shown. N = 3–4. Error bars, SEM. (**) P < 0.01, Mann-Whitney test, two-tailed. For gene names, see Supplemental Table S10.
We detect the regulatory genes of cholesterol metabolism, Nr1h3 (also known as Lxr), and of bile acid metabolism, Nr1h4 (also known as Fxr), to be A and C cluster targets, respectively (Fig. 6A; Supplemental Table S8; Kalaany and Mangelsdorf 2006). We also find the gene encoding RXRA, the obligate binding partner of both TFs, to be targeted by multiple CEBP bound elements. This suggests that these factors may be interconnected with the CEBPs in regulation of cholesterol and bile metabolism during liver regeneration.
Finally, we note that almost all core members of the circadian clock system (Zhang and Kay 2010) turn out to be putative CEBP targets (Fig. 6A; Supplemental Table S8), e.g., the core TF genes Clock and Bmal1 (also known as Arntl), as well as the downstream effectors encoded by Per1/2/3 and Cry1/2, and secondary component genes Rora, Rorc, Nr1d2 (also known as Rev-erb beta), Dbp, and Nfil3 (also known as E4bp4). This points to a high level of cross-talk between circadian clock and regenerative response regulation.
Figure 6. The A, B, and C cluster regions target distinct sets of transcriptional regulator encoding genes. (A) Regulatory network showing transcription factors targeted by CEBP-bound regions belonging to the A, B, or C clusters. The total number of connections (peaks associated with a gene) is indicated by color scale (white is one, and dark blue is 11, the maximum), and the relative, total expression (POL2 binding throughout the eight time points) is indicated by circle size. (B) CEBPB 3-h ChIP genomic coverage in the vicinity of three putative targets. Jun, Id2, and Egr1 loci display several CEBP peaks, i.e., “connections” in A. (C) Examples of putative feed-forward loops in the network. Arrows point from the transcription factor to the regulated gene. For clarity, the gene product level of CEBP targets has been left out. “A” and “C” are cis-regulatory elements belonging to the A or C cluster.
Discussion
The multi-time point global quantification of TF occupancy enabled us to study the temporal dynamics of regulation in vivo. We find that the two central transcriptional regulators CEBPA and CEBPB interact with a large group of cis-regulatory elements in a highly dynamic manner through the progressive phases of liver regeneration. This group of genomic regions can be subdivided by CEBP binding dynamics into three clusters with distinctly enriched sets of sequence motifs, pointing to molecular mechanisms governing differential CEBP binding. Guided by this observation, we validate the predicted binding preference of several key cobinding TFs with sequential ChIP and published ChIP-seq data.
An important aspect of our study is the integration of observed TF binding dynamics and expression of putative target genes. We find that each of the three binding clusters is associated with distinct sets of regulated target genes, i.e., acute phase genes, metabolic/homeostatic genes, and cell cycle-related genes (Table 1). This demonstrates the possibility of linking dynamic TF occupancy patterns with the shifting phases of the regenerative program (Fig. 7A). Moreover, the concomitant resurgence of CEBP binding to the C cluster regions and of metabolic gene activity at 24 h provides evidence for a previously uncharacterized, early “homeostatic” phase of liver regeneration, taking place before the first round of replication.
The observed shifting phases of CEBP binding are closely coupled in time to the dynamic ratio of CEBPA to CEBPB (Fig. 2C,D). One possible interpretation is that this ratio is involved in dictating enhancer occupancy by CEBP complexes. As such, a high ratio of CEBPA to CEBPB would lead to binding of C cluster cis-regulatory elements, enhancing metabolic and repressing acute phase response genes. Conversely, a low ratio would direct binding of B cluster elements that repress metabolic or activate acute phase genes, or of A cluster elements that activate cell cycle genes. This differential recruitment of CEBPs could be explained by interaction with distinct sets of TFs, depending on CEBP complex composition. This model is summarized in Figure 7B. Our observations and model contradict the conventional view that CEBPA levels are low while CEBPB levels are high throughout the regenerative process (Greenbaum et al. 1995, 1998), and emphasize the requirement for a high degree of temporal resolution in studies of dynamic biological processes.
Many cognate sequences associated with TFs not previously known to be involved in liver regeneration were found to be overrepresented in the CEBP bound regions (Supplemental Table S6). In the C cluster regions, motifs of several SOX factors were abundant (Fig. 4B). Recent findings implicate SOX factors in hepatocyte differentiation and stem cell biology (Duan et al. 2010; Furuyama et al. 2011). Our data show Sox18, -13, and -15 to be highly expressed (Fig. 4C; Supplemental Table S7), but exactly which SOX factors cooperate with CEBPs in the liver remains to be determined. In the A cluster, we found overrepresented KLF factor motifs, and several highly expressed Klfs (e.g., Klf10, -13, and -15) are targets of CEBP A cluster enhancers (Figs. 4B,C, 6A; Supplemental Table S8). Liver functions were recently identified for KLF15 and KLF10 (Guillaumond et al. 2010; Takashima et al. 2010). Moreover, KLF factors are central for self-renewal programs in embryonic stem cells (Takahashi and Yamanaka 2006; Jiang et al. 2008), which suggests that they may play similar roles in the self-renewal circuitry of fully differentiated hepatocytes, very likely as both binding partners and targets of CEBPs.
An unexpected observation of this study was that EGR1 binding preferentially colocalizes with the C cluster regions (Fig. 5A), which are almost depleted of EGR1 cognate sequence (Fig. 4B; Supplemental Table S6). Our data support a model of indirect or “assisted” EGR1 binding to explain this observation (Fig. 5C,D). Only one example of indirect EGR1 binding via CEBPB has been published (Zhang et al. 2003), but our genome-wide observations indicate that EGR1 targeting of metabolic gene promoters (C cluster regions) via CEBPs could be a general phenomenon in the liver. EGR1 has also been found to be important for interpretation of mitogenic signals (Zwang et al. 2011) and targets A cluster regions in this study (Fig. 5A,D; Supplemental Table S9). Hence, the observed CEBPA-to-CEBPB shift in the CEBP pool could regulate EGR1 recruitment to A or C cluster cis-regulatory elements and thus ensure a tight temporal control of EGR1 action in line with the progressive phases of liver regeneration.
In essence, liver regeneration upon partial hepatectomy is compensatory growth, as the cells proliferating are fully differentiated hepatocytes. Previous studies have shown that liver regeneration does not involve loss of hepatocyte differentiation state (e.g., Malato et al. 2011; for reviews, see Fausto et al. 2006; Michalopoulos 2007). This is consistent with our finding that many hepatocyte-specific genes, e.g., those involved in metabolic functions, are up-regulated rather than down-regulated (Fig. 3B; Table 1). Only few examples are known of differentiated cells capable of hepatocyte-like “self-renewal” in the adult mammalian organism. It has recently been shown that the quiescence of naïve T cells is actively enforced by a balance between FOXP1 and FOXO1, allowing swift expansion upon external cues (Feng et al. 2011). Similarly, deficiency of just two TFs lends long-term, non-tumorigenic expansion capacity to mature macrophages (Aziz et al. 2009). These two factors are MAF (c-maf) and MAFB, both of which we find to be putative CEBP targets in the liver (Fig. 6A; Supplemental Table S8). MAFB is among the most highly expressed TFs in the quiescent condition and is strongly down-regulated from the 8-h time point (Supplemental Tables S4, S8). Furthermore, we find MAFB and MAFK binding sequences enriched in the quiescence/metabolism-associated C cluster of enhancers, predominantly bound by CEBPs at the 0- and 24-h time points. In the proliferation-associated A cluster, we find KLF and MYC sequences (Fig. 4B; Supplemental Table S6); KLF4 and MYC were demonstrated to be required for expansion of the differentiated macrophages (Aziz et al. 2009). Moreover, CEBPA is a key determinant of macrophage as well as liver cell differentiation (e.g., Feng et al. 2008). These observations may suggest that similar transcriptional networks centered on a CEBP/MAF axis are enforcing quiescence in both fully differentiated hepatocytes and macrophages. Understanding this circuitry has the potential to be of use in regenerative medicine as a step toward manipulation of hepatocytes or other fully differentiated cells for therapeutic purposes.
Figure 7. Protein level ratios of CEBPA versus CEBPB may define a dynamic transcriptional switch. (A) Schematic diagram showing the divergent binding patterns and biological roles of the A, B, and C cluster cis-regulatory elements through the first cell cycle of liver regeneration. A previously uncharacterized early resurgence of metabolic/homeostatic genes, associated with C cluster regions with binding peaking at 24 h, was observed. (B) Model of a transcriptional switch centered on the relative ratio of CEBPA and CEBPB determining the composition of the CEBP complex pool (homo- or heterodimers). Gene activating or repressive actions are indicated, as well as sets of transcription factors found to be associated with A or C putative enhancers. Metabolic genes are induced and acute phase genes repressed by binding in a CEBPA-high setting (C regions), while the opposite is true for the CEBPB-high setting (B regions). A-type regions are only targeted by CEBPs when the CEBPB form is abundant.
In conclusion, the present work shows how time-resolved analysis of two core TFs of liver regeneration can reveal specific aspects of the regulation operating at distinct phases during the process. In a broader sense, our approach of analyzing dynamic TF occupancy globally in vivo should be applicable to other experimental systems and holds promise to aid in the elucidation of complex transcriptional networks in higher vertebrates.
Methods
Mouse work
The partial hepatectomy was performed on wild-type mice (C57BL6/J, male, 7 wk old) by removing three of five liver lobes according to the method described earlier (Thoren et al. 2010). Mice were culled after varying amounts of time covering the regrowth phase of liver regeneration (0, 3, 8, 16, 24, 36, 48, and 168 h) (Fig. 1A,B). Livers from five individual mice for each time point were harvested, generating 40 samples, which were directly snap-frozen in liquid nitrogen and stored at −80°C. Animal experiments conformed to institutional as well as Danish national guidelines.
Chromatin immunoprecipitation
After thawing, tissue samples were homogenized by douncing (loose pestle, Wheaton 15-mL douncer) in cold PBS, and were cross-linked for 10 min in 1% formaldehyde using a rotator. Chromatin was fragmented by sonication (Bioruptor, Diagenode). The 40 samples were used in ChIP-seq experiments with antibodies against CEBPA, CEBPB, EGR1 (Santa Cruz: sc-61, sc-150, sc-110x), RNA-POL2 subunit B1 antibody (AC-055-100, Diagenode), and Mock IgG (Sigma I8140). ChIP was performed according to the method described earlier (Sandmann et al. 2006a). Sequential ChIP was done according to the method described previously (Truax and Greer 2012), with the modification of cross-linking antibody-Protein A beads (www.neb.com). Santa Cruz antibodies as above were used for sequential CEBPA-CEBPB ChIP (Supplemental Fig. S6). An in-house CEBP antibody (JSJ#1052) recognizing both CEBPA and CEBPB was used in the first round in sequential ChIP with antibodies against ONECUT1 (HNF6), HNF1A, E2F3, MAFB, or IgG for the second round (Santa Cruz: sc-13050x, sc-8986x, sc-878x; Novus Bio: nb600-266). All sequential ChIP enrichments were normalized to IgG ratios. Enrichment was validated by qPCR (ABI Prism 7000 or Roche Lightcycler 480), using ratios of a positive detector primer set versus a negative (Mamstr, Chr_12_desert1 or Sfi2). All primer sets are listed in Supplemental Table S10.
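The ratio-based qPCR readout described above can be illustrated with a small calculation. The sketch assumes that threshold cycle (Ct) values are converted to relative amounts with perfect amplification efficiency (a doubling per cycle), that enrichment is the ratio of a positive to a negative primer set, and that this ratio is divided by the corresponding ratio for mock IgG. These are assumptions for illustration only; the exact calculation used in the study is not given here, and all Ct values and function names are hypothetical.

```python
def relative_quantity(ct, efficiency=2.0):
    """Relative template amount for a qPCR threshold cycle (assumed efficiency)."""
    return efficiency ** (-ct)


def detector_ratio(ct_positive, ct_negative, efficiency=2.0):
    """Signal at a positive (bound) region relative to a negative control region."""
    return relative_quantity(ct_positive, efficiency) / relative_quantity(ct_negative, efficiency)


def igg_normalized_enrichment(tf_cts, igg_cts, efficiency=2.0):
    """Positive/negative ratio of a specific ChIP divided by the same ratio for mock IgG."""
    return detector_ratio(*tf_cts, efficiency) / detector_ratio(*igg_cts, efficiency)


# Hypothetical Ct values: (positive primer set, negative primer set).
tf_chip = (25.0, 30.0)
igg_chip = (29.0, 30.0)

enrichment = igg_normalized_enrichment(tf_chip, igg_chip)
print(f"IgG-normalized enrichment: {enrichment:.1f}-fold")
```

A lower Ct means more template, so a positive region with a Ct five cycles below the negative region corresponds to a 32-fold ratio under the assumed efficiency.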
Clustering of CEBPA and CEBPB peaks
Consensus regions for CEBPA and CEBPB peaks for all time points were defined by merging overlapping regions between sets, requiring a called peak in at least one sample, producing a total of 87,049 consensus regions. Subsequently, coverage, defined as the maximal coverage level for each region, was determined using the normalized counts (Supplemental Material). Filtering was applied, requiring normalized coverage of a minimum of 50 reads in at least one sample for any given region, reducing the set to 11,314. Hierarchical clustering was performed in MeV (http://www.tm4.org/mev/) (Saeed et al. 2006), and three predominant clusters were identified: A (3449 regions), B (2818), and C (3034). The remaining 2013 regions were excluded from further analysis. Coverage reads were summed for each time point/cluster, normalized, and used for displaying sum coverage tracks (Fig. 2B). Consensus region coverage data can be found in Supplemental Table S3.
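The merging and filtering steps described above can be sketched as follows. The example merges overlapping peak intervals into consensus regions and keeps regions whose normalized coverage reaches 50 reads in at least one sample. The interval representation, the treatment of touching intervals as overlapping, and all function names are assumptions made for illustration; only the threshold of 50 and the region counts come from the text.

```python
def merge_regions(regions):
    """Merge overlapping (chrom, start, end) intervals into consensus regions."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def filter_by_coverage(coverage, min_reads=50):
    """Keep regions with normalized coverage >= min_reads in at least one sample."""
    return {region: values for region, values in coverage.items()
            if max(values) >= min_reads}


# Hypothetical called peaks pooled over samples.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 120, 180)]
consensus = merge_regions(peaks)

# Hypothetical normalized maximal coverage per consensus region and sample.
coverage = {consensus[0]: [12, 75, 30], consensus[1]: [5, 8, 49], consensus[2]: [50, 3, 1]}
kept = filter_by_coverage(coverage)

# Region counts quoted in the text: clusters A, B, C plus excluded regions.
cluster_sizes = {"A": 3449, "B": 2818, "C": 3034, "excluded": 2013}
total_filtered = sum(cluster_sizes.values())
print(consensus, sorted(kept), total_filtered)
```

The cluster sizes quoted in the text add up to the 11,314 regions that passed the coverage filter.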
Data access
The time series ChIP-seq data generated for this work have been deposited in the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) and are accessible through GEO Series accession number GSE42321.
Acknowledgments
We thank Claus Nerlov and Agnes Zay for Cebpb knockout mouse livers, Mie Poulsen and Bjørg Krog for expert help in performing the partial hepatectomy, and Thomas Sandmann, Federico De Masi, and members of the Porse laboratory for critical reading of the manuscript. This study was supported by grants from The Novo Foundation and the Danish Cancer Society.
References
Aziz A, Soucie E, Sarrazin S, Sieweke MH. 2009. MafB/c-Maf deficiency enables self-renewal of differentiated functional macrophages. Science 326: 867–871.
Blais A, Dynlacht BD. 2004. Hitting their targets: An emerging picture of E2F and cell cycle control. Curr Opin Genet Dev 14: 527–532.
Bolotin E, Liao H, Ta TC, Yang C, Hwang-Verslues W, Evans JR, Jiang T, Sladek FM. 2010. Integrated approach for the identification of human hepatocyte nuclear factor 4α target genes using protein binding microarrays. Hepatology 51: 642–653.
Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, et al. 2008. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133: 1106–1117.
Costa RH, Kalinichenko VV, Holterman AX, Wang X. 2003. Transcription factors in liver development, differentiation, and regeneration. Hepatology 38: 1331–1347.
Diehl AM, Yang SQ. 1994. Regenerative changes in C/EBPα and C/EBPβ expression modulate binding to the C/EBP site in the c-fos promoter. Hepatology 19: 447–456.
Duan Y, Ma X, Zou W, Wang C, Bahbahan IS, Ahuja TP, Tolstikov V, Zern MA. 2010. Differentiation and characterization of metabolically functioning hepatocytes from human embryonic stem cells. Stem Cells 28: 674–686.
Eilers M, Eisenman RN. 2008. Myc’s broad reach. Genes Dev 22: 2755–2766.
Fausto N, Campbell JS, Riehle KJ. 2006. Liver regeneration. Hepatology 43: S45–S53.
Feng R, Desbordes SC, Xie H, Tillo ES, Pixley F, Stanley ER, Graf T. 2008. PU.1 and C/EBPα/β convert fibroblasts into macrophage-like cells. Proc Natl Acad Sci 105: 6057–6062.
Feng X, Wang H, Takata H, Day TJ, Willen J, Hu H. 2011. Transcription factorFoxp1 exerts essential cell-intrinsic regulation of the quiescence of naiveT cells. Nat Immunol 12: 544–550.
Furuyama K, Kawaguchi Y, Akiyama H, Horiguchi M, Kodama S, Kuhara T,Hosokawa S, Elbahrawy A, Soeda T, Koizumi M, et al. 2011. Continuouscell supply from a Sox9-expressing progenitor zone in adult liver,exocrine pancreas and intestine. Nat Genet 43: 34–41.
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ,Khurana E, Rozowsky J, Alexander R, et al. 2012. Architecture of the humanregulatory network derived from ENCODE data. Nature 489: 91–100.
Glynne R, Ghandour G, Rayner J, Mack DH, Goodnow CC. 2000. B-lymphocyte quiescence, tolerance and activation as viewed by globalgene expression profiling on microarrays. Immunol Rev 176: 216–246.
Greenbaum LE, Cressman DE, Haber BA, Taub R. 1995. Coexistence of C/EBPa, b, growth-induced proteins and DNA synthesis in hepatocytesduring liver regeneration. Implications for maintenance of thedifferentiated state during liver growth. J Clin Invest 96: 1351–1365.
Greenbaum LE, Li W, Cressman DE, Peng Y, Ciliberto G, Poli V, Taub R. 1998.CCAAT enhancer-binding protein b is required for normal hepatocyteproliferation in mice after partial hepatectomy. J Clin Invest 102: 996–1007.
Greer EL, Brunet A. 2005. FOXO transcription factors at the interfacebetween longevity and tumor suppression. Oncogene 24: 7410–7425.
Guillaumond F, Grechez-Cassiau A, Subramaniam M, Brangolo S, Peteri-Brunback B, Staels B, Fievet C, Spelsberg TC, Delaunay F, Teboul M. 2010.Kruppel-like factor KLF10 is a link between the circadian clock andmetabolism in liver. Mol Cell Biol 30: 3059–3070.
Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S,Soukhatcheva G, Li L, Wederell ED, Thiessen N, et al. 2010. Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate thefunctionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets andliver. Genome Res 20: 1037–1051.
Huang da W, Sherman BT, Lempicki RA. 2009. Systematic and integrativeanalysis of large gene lists using DAVID bioinformatics resources. NatProtoc 4: 44–57.
Jakobsen JS, Braun M, Astorga J, Gustafson EH, Sandmann T, Karzynski M,Carlsson P, Furlong EE. 2007. Temporal ChIP-on-chip reveals Biniou asa universal regulator of the visceral muscle transcriptional network.Genes Dev 21: 2448–2460.
Jiang J, Chan YS, Loh YH, Cai J, Tong GQ, Lim CA, Robson P, Zhong S, NgHH. 2008. A core Klf circuitry regulates self-renewal of embryonic stemcells. Nat Cell Biol 10: 353–360.
Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-wide mappingof in vivo protein-DNA interactions. Science 316: 1497–1502.
Jones ME. 1965. Amino acid metabolism. Annu Rev Biochem 34: 381–418.
Kalaany NY, Mangelsdorf DJ. 2006. LXRS and FXR: The yin and yang of cholesterol and fat metabolism. Annu Rev Physiol 68: 159–191.
Kurinna S, Barton MC. 2011. Cascades of transcription regulation during liver regeneration. Int J Biochem Cell Biol 43: 189–197.
Laudadio I, Manfroid I, Achouri Y, Schmidt D, Wilson MD, Cordi S, Thorrez L, Knoops L, Jacquemin P, Schuit F, et al. 2012. A feedback loop between the liver-enriched transcription factor network and miR-122 controls hepatocyte differentiation. Gastroenterology 142: 119–129.
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799–804.
Liao Y, Shikapwashya ON, Shteyer E, Dieckgraefe BK, Hruz PW, Rudnick DA. 2004. Delayed hepatocellular mitotic progression and impaired liver regeneration in early growth response-1-deficient mice. J Biol Chem 279: 43107–43116.
Lichtlen P, Schaffner W. 2001. Putting its fingers on stressful situations: The heavy metal-regulatory transcription factor MTF-1. Bioessays 23: 1010–1017.
Lopez RG, Garcia-Silva S, Moore SJ, Bereshchenko O, Martinez-Cruz AB, Ermakova O, Kurz E, Paramio JM, Nerlov C. 2009. C/EBPα and β couple interfollicular keratinocyte proliferation arrest to commitment and terminal differentiation. Nat Cell Biol 11: 1181–1190.
Majmundar AJ, Wong WJ, Simon MC. 2010. Hypoxia-inducible factors and the response to hypoxic stress. Mol Cell 40: 294–309.
Malato Y, Naqvi S, Schurmann N, Ng R, Wang B, Zape J, Kay MA, Grimm D, Willenbring H. 2011. Fate tracing of mature hepatocytes in mouse liver homeostasis and regeneration. J Clin Invest 121: 4850–4860.
Marstrand TT, Frellsen J, Moltke I, Thiim M, Valen E, Retelska D, Krogh A. 2008. Asap: A framework for over-representation statistics for transcription factor binding sites. PLoS One 3: e1623.
Matsuo T, Yamaguchi S, Mitsui S, Emi A, Shimoda F, Okamura H. 2003. Control mechanism of the circadian clock for timing of cell division in vivo. Science 302: 255–259.
McConnell BB, Yang VW. 2010. Mammalian Kruppel-like factors in health and diseases. Physiol Rev 90: 1337–1381.
Michalopoulos GK. 2007. Liver regeneration. J Cell Physiol 213: 286–300.
Montminy M, Koo SH, Zhang X. 2004. The CREB family: Key regulators of hepatic metabolism. Ann Endocrinol 65: 73–75.
Nerlov C. 2007. The C/EBP family of transcription factors: A paradigm for interaction between gene expression and proliferation control. Trends Cell Biol 17: 318–324.
Ni L, Bruce C, Hart C, Leigh-Bell J, Gelperin D, Umansky L, Gerstein MB, Snyder M. 2009. Dynamic and complex transcription factor binding during an inducible response in yeast. Genes Dev 23: 1351–1363.
Osada S, Yamamoto H, Nishihara T, Imagawa M. 1996. DNA binding specificity of the CCAAT/enhancer-binding protein transcription factor family. J Biol Chem 271: 3891–3896.
Overturf K, al-Dhalimy M, Ou CN, Finegold M, Grompe M. 1997. Serial transplantation reveals the stem-cell-like regenerative potential of adult mouse hepatocytes. Am J Pathol 151: 1273–1280.
Porse BT, Pedersen TA, Xu X, Lindberg B, Wewer UM, Friis-Hansen L, Nerlov C. 2001. E2F repression by C/EBPα is required for adipogenesis and granulopoiesis in vivo. Cell 107: 247–258.
Rana B, Xie Y, Mischoulon D, Bucher NL, Farmer SR. 1995. The DNA binding activity of C/EBP transcription factor is regulated in the G1 phase of the hepatocyte cell cycle. J Biol Chem 270: 18123–18132.
Saeed AI, Bhagabati NK, Braisted JC, Liang W, Sharov V, Howe EA, Li J, Thiagarajan M, White JA, Quackenbush J. 2006. TM4 microarray software suite. Methods Enzymol 411: 134–193.
Sandmann T, Jakobsen JS, Furlong EE. 2006a. ChIP-on-chip protocol for genome-wide analysis of transcription factor binding in Drosophila melanogaster embryos. Nat Protoc 1: 2839–2855.
Sandmann T, Jensen LJ, Jakobsen JS, Karzynski MM, Eichenlaub MP, Bork P, Furlong EE. 2006b. A temporal map of transcription factor activity: mef2 directly regulates target genes at all stages of muscle development. Dev Cell 10: 797–807.
Sandoval J, Rodriguez JL, Tur G, Serviddio G, Pereda J, Boukaba A, Sastre J, Torres L, Franco L, Lopez-Rodas G. 2004. RNAPol-ChIP: A novel application of chromatin immunoprecipitation to the analysis of real-time gene transcription. Nucleic Acids Res 32: e88.
Sarrazin S, Mossadegh-Keller N, Fukao T, Aziz A, Mourcin F, Vanhille L, Kelly Modis L, Kastner P, Chan S, Duprez E, et al. 2009. MafB restricts M-CSF-dependent myeloid commitment divisions of hematopoietic stem cells. Cell 138: 300–313.
Takahashi K, Yamanaka S. 2006. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126: 663–676.
Takashima M, Ogawa W, Hayashi K, Inoue H, Kinoshita S, Okamoto Y, Sakaue H, Wataoka Y, Emi A, Senga Y, et al. 2010. Role of KLF15 in regulation of hepatic gluconeogenesis and metformin action. Diabetes 59: 1608–1615.
Thiel G, Cibelli G. 2002. Regulation of life and death by the zinc finger transcription factor Egr-1. J Cell Physiol 193: 287–292.
Thoren LA, Norgaard GA, Weischenfeldt J, Waage J, Jakobsen JS, Damgaard I, Bergstrom FC, Blom AM, Borup R, Bisgaard HC, et al. 2010. UPF2 is a critical regulator of liver development, function and regeneration. PLoS ONE 5: e11650.
Truax AD, Greer SF. 2012. ChIP and Re-ChIP assays: Investigating interactions between regulatory proteins, histone modifications, and the DNA sequences to which they bind. Methods Mol Biol 809: 175–188.
van Wageningen S, Kemmeren P, Lijnzaad P, Margaritis T, Benschop JJ, de Castro IJ, van Leenen D, Groot Koerkamp MJ, Ko CW, Miles AJ, et al. 2010. Functional overlap and regulatory links shape genetic interactions between signaling pathways. Cell 143: 991–1004.
Vogel MJ, Peric-Hupkes D, van Steensel B. 2007. Detection of in vivo protein-DNA interactions using DamID in mammalian cells. Nat Protoc 2: 1467–1478.
Wang ND, Finegold MJ, Bradley A, Ou CN, Abdelsayed SV, Wilde MD, Taylor LR, Wilson DR, Darlington GJ. 1995. Impaired energy homeostasis in C/EBPα knockout mice. Science 269: 1108–1112.
White P, Brestelli JE, Kaestner KH, Greenbaum LE. 2005. Identification of transcriptional networks during liver regeneration. J Biol Chem 280: 3715–3722.
Yusuf I, Fruman DA. 2003. Regulation of quiescence in lymphocytes. Trends Immunol 24: 380–386.
Zhang EE, Kay SA. 2010. Clocks not winding down: Unravelling circadian networks. Nat Rev Mol Cell Biol 11: 764–776.
Zhang F, Lin M, Abidi P, Thiel G, Liu J. 2003. Specific interaction of Egr1 and c/EBPβ leads to the transcriptional activation of the human low density lipoprotein receptor gene. J Biol Chem 278: 44246–44254.
Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EE. 2009. Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature 462: 65–70.
Zwang Y, Sas-Chen A, Drier Y, Shay T, Avraham R, Lauriola M, Shema E, Lidor-Nili E, Jacob-Hirsch J, Amariglio N, et al. 2011. Two phases of mitogenic signaling unveil roles for p53 and EGR1 in elimination of inconsistent growth signals. Mol Cell 42: 524–535.
Received July 20, 2012; accepted in revised form February 5, 2013.
References
1. Jou W, Haegeman G, Ysebaert M, Fiers W. Nucleotide Sequence of the Gene Coding for the Bacteriophage MS2 Coat Protein. Nature [Internet]. 1972 [cited 2013 Jul 15]; Available from: http://adsabs.harvard.edu/abs/1972Natur.237...82J
2. Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of molecular biology [Internet]. 1975 May 25;94(3):441–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1100841
3. Sanger F, Coulson AR, Friedmann T, Air GM, Barrell BG, Brown NL, et al. The nucleotide sequence of bacteriophage phiX174. Journal of molecular biology [Internet]. 1978 Oct 25;125(2):225–46. Available from: http://www.ncbi.nlm.nih.gov/pubmed/714153
4. Cherwa JE, Organtini LJ, Ashley RE, Hafenstein SL, Fane BA. In vitro assembly of the øX174 procapsid from external scaffolding protein oligomers and early pentameric assembly intermediates. Journal of molecular biology [Internet]. Elsevier Ltd; 2011 Sep 23 [cited 2013 Jun 20];412(3):387–96. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21840317
5. Goffeau A, Barrell B, Bussey H, Davis R. Life with 6000 genes. Science [Internet]. 1996 [cited 2013 Jul 15]; Available from: http://www.sciencemag.org/content/274/5287/546.short
6. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(3):403–10.
7. Kent WJ, Haussler D. Assembly of the Working Draft of the Human Genome with GigAssembler. Genome Research. 2001;11(9):1541–8.
8. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science (New York, N.Y.) [Internet]. 2001 Feb 16 [cited 2013 May 21];291(5507):1304–51. Available from: http://www.ncbi.nlm.nih.gov/pubmed/11181995
9. Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of Next-Generation Sequencing Systems. Journal of Biomedicine and Biotechnology. 2012;2012:251364.
10. Peterson TW. Next Gen Sequencing Survey. JP Morgan. 2013;(May 2010).
11. Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome biology [Internet]. BioMed Central Ltd; 2012 Jan [cited 2013 Mar 13];13(5):R35. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3446288&tool=pmcentrez&rendertype=abstract
12. spliceR [Internet]. Available from: http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html
13. Hannon G. FASTX-toolkit [Internet]. Available from: http://hannonlab.cshl.edu/fastx_toolkit/
14. Andrews S. FastQC [Internet]. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
15. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008;18(11):1851–8.
16. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology [Internet]. 2009 Jan [cited 2011 Jun 10];10(3):R25. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2690996&tool=pmcentrez&rendertype=abstract
17. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research [Internet]. 2008 Sep [cited 2013 Aug 6];18(9):1509–17. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2527709&tool=pmcentrez&rendertype=abstract
18. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5(7):621–8.
19. Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature methods [Internet]. Nature Publishing Group; 2010 Sep [cited 2013 May 22];7(9):709–15. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3005310&tool=pmcentrez&rendertype=abstract
20. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature [Internet]. 2008 Nov 27 [cited 2013 May 21];456(7221):470–6. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2593745&tool=pmcentrez&rendertype=abstract
21. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, Maguire J, et al. Integrative analysis of the melanoma transcriptome. Genome research [Internet]. 2010 Apr [cited 2013 Aug 15];20(4):413–27. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2847744&tool=pmcentrez&rendertype=abstract
22. Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nature protocols [Internet]. Nature Publishing Group; 2010 Mar [cited 2013 Aug 9];5(3):516–35. Available from: http://www.ncbi.nlm.nih.gov/pubmed/20203668
23. Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, et al. A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS genetics [Internet]. 2009 Jul [cited 2013 Aug 6];5(7):e1000569. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2704369&tool=pmcentrez&rendertype=abstract
24. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature [Internet]. 2012 Sep 6 [cited 2013 May 21];489(7414):101–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22955620
25. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biology direct [Internet]. 2009 Jan [cited 2013 Aug 8];4:14. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&tool=pmcentrez&rendertype=abstract
26. Gao L, Fang Z, Zhang K, Zhi D, Cui X. Length bias correction for RNA-seq data in gene set analyses. Bioinformatics (Oxford, England) [Internet]. 2011 Mar 1 [cited 2011 Jul 15];27(5):662–9. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3042188&tool=pmcentrez&rendertype=abstract
27. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) [Internet]. 2010 Jan 1 [cited 2012 Nov 2];26(1):139–40. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2796818&tool=pmcentrez&rendertype=abstract
28. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology [Internet]. 2010 Jan;11(3):R25. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2864565&tool=pmcentrez&rendertype=abstract
29. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics [Internet]. 2012 Sep 17 [cited 2013 Aug 6]; Available from: http://www.ncbi.nlm.nih.gov/pubmed/22988256
30. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biology [Internet]. BioMed Central Ltd; 2010 Jan [cited 2013 Aug 6];11(10):R106. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3218662&tool=pmcentrez&rendertype=abstract
31. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology [Internet]. Nature Publishing Group; 2010 May [cited 2013 May 21];28(5):511–5. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3146043&tool=pmcentrez&rendertype=abstract
32. Martin JA, Wang Z. Next-generation transcriptome assembly. Nature reviews. Genetics [Internet]. Nature Publishing Group; 2011 Oct [cited 2013 Aug 6];12(10):671–82. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21897427
33. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics (Oxford, England) [Internet]. 2011 Jul 1 [cited 2013 Aug 8];27(13):i275–82. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3117341&tool=pmcentrez&rendertype=abstract
34. Kong L, Zhang Y, Ye Z-Q, Liu X-Q, Zhao S-Q, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic acids research [Internet]. 2007 Jul [cited 2013 Aug 12];35(Web Server issue):W345–9. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1933232&tool=pmcentrez&rendertype=abstract
35. Wang L, Park HJ, Dasari S, Wang S, Kocher J-P, Li W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic acids research [Internet]. 2013 Apr 1 [cited 2013 Aug 21];41(6):e74. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3616698&tool=pmcentrez&rendertype=abstract
36. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences of the United States of America [Internet]. 2003 Dec 23;100(26):15776–81. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=307644&tool=pmcentrez&rendertype=abstract
37. Salimullah M, Mizuho S, Plessy C, Carninci P. NanoCAGE: A High-Resolution Technique to Discover and Interrogate Cell Transcriptomes. Cold Spring Harbor Protocols [Internet]. 2011 Jan 4 [cited 2012 Nov 4];2011(1):pdb.prot5559–pdb.prot5559. Available from: http://www.cshprotocols.org/cgi/doi/10.1101/pdb.prot5559
38. Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, Lassmann T, et al. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nature methods [Internet]. Nature Publishing Group; 2010 Jul [cited 2012 Nov 10];7(7):528–34. Available from: http://dx.doi.org/10.1038/nmeth.1470
39. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics [Internet]. 2004 Mar [cited 2013 Aug 15];83(3):349–60. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0888754303003628
40. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods. 2007;4(8):651–7.
41. Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell [Internet]. 2007 May 18 [cited 2011 Jul 17];129(4):823–37. Available from: http://www.ncbi.nlm.nih.gov/pubmed/17512414
42. The ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS biology [Internet]. 2011 Apr [cited 2013 Feb 27];9(4):e1001046. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3079585&tool=pmcentrez&rendertype=abstract
43. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature [Internet]. Nature Publishing Group; 2012 Sep 6 [cited 2013 Aug 7];489(7414):83–90. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22955618
44. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature [Internet]. Nature Publishing Group; 2012 Sep 6 [cited 2013 Aug 6];489(7414):75–82. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3721348&tool=pmcentrez&rendertype=abstract
45. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome biology [Internet]. 2008 Jan [cited 2011 Jun 10];9(9):R137. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2592715&tool=pmcentrez&rendertype=abstract
46. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research [Internet]. Oxford University Press; 2010;38(Database issue):D105–D110. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19906716
47. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Research [Internet]. Oxford University Press; 2001;29(1):281–3. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=29801&tool=pmcentrez&rendertype=abstract
48. Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Research [Internet]. Oxford University Press; 2011;39(Database issue):D124–D128. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013812&tool=pmcentrez&rendertype=abstract
49. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research [Internet]. 2012 Sep [cited 2013 Aug 8];22(9):1813–31. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3431496&tool=pmcentrez&rendertype=abstract
Supplemental Figures
Supplemental figure 1: A graphical overview of splice classes detected by RAINMAN and spliceR. In addition, spliceR detects intron retention events, not shown here.
Panels: Alternative 5’ splice site (A5SS); Alternative 3’ splice site (A3SS); Mutually exclusive exon (MXE); Single exon skipping (SES); Multiple exon skipping (MES); Alternative first exon (AFE); Alternative last exon (ALE).
[Figure, panels A to E. Labels: Reads; Combinatorial junction database; Left exon; Right exon; Junction x; Gene y; Refseq E1 to E5; Putative isoform; ORF; STOP; Translation; Putative polypeptide; With annotated CDS; Without annotated CDS; Scan for ORFs. Note in the figure: Select best ORF based on length of polypeptide vs. transcript length. Optimize this ratio, minimizing false positives, using genes with known CDS.]
Supplemental figure 2: Strategy for predicting coding potential of transcripts based on exon-exon junction evidence. In A, short sequenced reads are mapped to an artificial genome consisting of all possible exon-exon combinations for each gene. In B and C, each adjoining exon from the longest supported RefSeq (or another repository) isoform is concatenated to the exons supported by the junction. In D, if an annotated CDS exists, the putative isoform is translated and the position of the STOP codon is stored. In E, if no CDS exists, the putative isoform is translated in all frames, choosing a combination of longest supported polypeptide and most upstream ORF.
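The strategy in Supplemental figure 2 can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the implementation used in RAINMAN or spliceR: the function names, the `flank` parameter, the exact-substring stand-in for read mapping, and the tie-breaking rule (longest ORF first, then most upstream start) are all hypothetical choices made for illustration. Only forward-strand ORFs that begin with ATG and end at an in-frame stop codon are reported.

```python
from itertools import combinations

STOP_CODONS = {"TAA", "TAG", "TGA"}


def junction_database(exons, flank=4):
    """Panel A: build every exon-exon combination (i < j) for one gene.

    Each junction joins the last `flank` bases of exon i to the first
    `flank` bases of exon j. `flank` is an illustrative parameter.
    """
    return {
        (i, j): exons[i][-flank:] + exons[j][:flank]
        for i, j in combinations(range(len(exons)), 2)
    }


def supported_junctions(read, database):
    """Toy stand-in for read mapping: exact substring match only."""
    return [pair for pair, seq in database.items() if seq in read]


def putative_isoform(exons, left, right):
    """Panels B and C: join the exons up to `left` with those from `right` on."""
    return "".join(exons[: left + 1] + exons[right:])


def find_orfs(seq):
    """Panel E: scan the three forward frames for ATG...STOP intervals."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, pos + 3))  # end index is exclusive
                start = None
    return orfs


def best_orf(seq):
    """Assumed rule: longest ORF, ties broken by the most upstream start."""
    orfs = find_orfs(seq)
    if not orfs:
        return None
    return max(orfs, key=lambda o: (o[1] - o[0], -o[0]))


# Hypothetical four-exon gene; the read supports a junction skipping exon 2.
exons = ["CCATGGCA", "GGTTCA", "AAGTGA", "TTTAAACC"]
database = junction_database(exons)
for left, right in supported_junctions("TGGCAAAGTGA", database):
    isoform = putative_isoform(exons, left, right)
    print((left, right), isoform, best_orf(isoform))
```

A real pipeline would replace `supported_junctions` with a short-read aligner and would tune the ORF selection against genes with a known CDS, as the note in the figure suggests.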
“Formula for breakthroughs in research: Take young researchers, put them
together in virtual seclusion, give them an unprecedented degree of freedom and
turn up the pressure by fostering competition”
- JAMES WATSON, 1989