Master’s Thesis
Characterization of ribosomal footprints with use of graph kernel based approaches
Soraya Nikousokhan
October 2016
Albert-Ludwigs Universität Freiburg
Department of Computer Science
Chair of Bioinformatics
Candidate
Soraya Nikousokhan
Matr. number
3555120
Working period
14. 04. 2016 – 28. 10. 2016
Examiners
Prof. Dr. Rolf Backofen
Prof. Dr. Wolfgang Hess
Supervisors
Dr. Fabrizio Costa
Pavankumar Videm
Acknowledgment
I would like to express my thanks to my parents for all the love and support that they give me. Most of all I want to thank them for teaching me the value of gaining knowledge and learning, and for encouraging me to proceed on this path. I would like to thank Prof. Dr. Rolf Backofen for giving me the opportunity to write my thesis at the Freiburg bioinformatics chair. I would like to thank Dr. Fabrizio Costa for introducing this interesting topic to me and for his help and efficient supervision during the work, as well as Pavankumar Videm for his help and patience and the time he put in to answer my questions. I would also like to thank the members of the Freiburg bioinformatics chair, with special thanks to Teresa Müller, Patrick Wright, Milad Miladi and Torsten Houwaart for their help and advice. I would like to thank my friends Hanna Poelker, Maryam Samani and Kristin Gekeler for their company and support during both the good and the difficult phases of this work.
Abstract
Ribosome profiling is an emerging technique that uses deep sequencing to give new insight into the translation of proteins, from single-codon to genome scale. In contrast to earlier methods such as microarrays and RNA-seq, Ribo-seq considers only the mRNAs that are actively being translated in a cell and therefore carry information for protein synthesis. This characteristic of Ribo-seq provides new data focused on the translation level, and the resulting patterns of ribosomal footprints may reveal new aspects of translation. The aim of this work is to classify Ribo-seq profiles according to different conditions and to find clusters of Ribo-seq profiles. This is done with a tool named BlockClust, which is based on a fast graph kernel method called the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). BlockClust encodes expression profile data as graphs and employs NSPDK to achieve high performance. Although BlockClust was previously applied to clustering non-coding RNAs from their RNA-seq expression profiles, it can also be adapted for clustering and classification tasks on other types of data, e.g. ribosome profiling. We have therefore adapted BlockClust by defining new attributes for finding patterns in Ribo-seq data and adding them to the existing set of attributes. Moreover, we performed an optimization over different parameter sets. We showed that it is possible to employ BlockClust on ribosome profiles, and we achieved a good performance in the classification of these profiles.
Kurzfassung
Ribosome profiling is a technique that uses deep sequencing to provide new insights into the translation of proteins, both at the level of single codons and at genomic scale. In contrast to earlier methods such as microarrays and RNA-seq, Ribo-seq considers only active mRNAs and therefore provides information about protein synthesis. The Ribo-seq method thus yields new data on the translation phase, the ribosomal footprints, which reveal new aspects in the field of translation. The aim of this work is to find meaningful clusters using Ribo-seq expression profiles. For this purpose the tool BlockClust is used, which is based on a graph kernel method named NSPDK. BlockClust encodes expression profiles as graphs and applies the NSPDK graph kernel method to achieve high performance. BlockClust is known primarily as a clustering method that groups non-coding RNAs into clusters based on their expression profiles. It can, however, also be applied to other kinds of expression data, for example ribosome profiling data. Applying BlockClust to ribosome profiles becomes possible by adding further attributes. The optimized parameters are added to the earlier attribute set. Furthermore, we carried out an optimization over different parameter sets. In this work we show that the classification of ribosome profiles can be achieved with good performance.
Contents
Abstract
Kurzfassung
List of Tables
1 Introduction
1.1 Biology
1.1.1 DNA
1.1.2 RNA
1.1.3 Protein
1.1.4 Gene
1.1.5 Transcription
1.1.6 Translation
1.1.7 Gene expression
1.1.8 Coding and Non-coding Regions
1.1.9 Next generation sequencing (NGS)
1.1.10 Ribosome profiling
1.1.11 Cross linking immunoprecipitation (CLIP)
1.1.12 Wild type vs mutant
1.2 Machine learning
1.2.1 Machine learning and bioinformatics
1.2.2 Supervised learning
1.2.3 Unsupervised learning
1.2.4 Classification
1.2.5 Clustering
1.3 Contribution
2 Data
2.1 Extracting peaks from Ribo-seq data
2.2 Counting amount of reads per peak for each gene
2.3 Peak frequencies in wild type vs mutant
2.4 Abundance of reads in peaks vs. length of chromosomes
3 Methods
3.1 BlockClust: efficient clustering and classification of non-coding RNAs
3.2 Extracting New Attributes with respect to Ribo-seq data
3.2.1 Entropy of Block Distances
3.2.2 Entropy of Blocks End and Start Positions
3.2.3 Density of Reads
3.2.4 GC-ratio
3.2.5 Number of Reads for GC positions
3.3 Neighborhood subgraph pairwise distance kernel (NSPDK)
3.4 BlockClust adaptation with Ribo-seq Data
4 Results and Discussion
4.1 Blockbuster benchmarking
4.2 BlockClust similarity score assessment
4.2.1 Confusion Matrix
4.2.2 ROC
4.2.3 Combinatorial feature similarity score assessment
4.3 BlockClust optimization
4.4 BlockClust experiment
5 Conclusion
5.1 Future work
A Appendix (Attributes)
List of Figures
1.1 Transcription and translation
1.2 mRNA coding and non-coding regions
1.3 Ribosome profiling: analysis of ribosome occupancy data
2.1 Highest Peak per Chromosome in Ribo-seq libraries
2.2 Number of Reads Per Peak for Each Gene
2.3 Ribosome Peak Freq. Mutant vs. Wild type
2.4 No. Peaks reads vs. Chromosome Length
3.1 Read profile encoding
3.2 Entropy of block distances
3.3 Entropy of blocks start positions
3.4 Entropy of blocks end positions
3.5 Reads density per block group
3.6 GC-ratio per block group
3.7 Number of reads for GC positions
3.8 BlockClust pipeline
4.1 Blockbuster benchmarking results
4.2 Confusion Matrix
4.3 Combinatorial feature similarity ROC measures for unbound cases
4.4 Combinatorial features similarity ROC measures for bound cases
4.5 Clusters of Ribosome profiles based on four different conditions
List of Tables
4.1 Line search optimization results
4.2 Classification performance
4.3 Clustering performance
Chapter 1
Introduction
Recent advances in next generation sequencing methods have given us inexpensive and fast genome-wide sequencing of reads. This has opened new doors for implementing novel methods and useful tools for retrieving new information from the sequencing results. The aim of this work is to classify and cluster ribosome profiles with a tool called BlockClust.
The first chapter introduces the biological and machine learning background concepts and methods. Chapter 2 explains the data analysis carried out to check the relevance of the data for the clustering task. Chapter 3 discusses the method and the changes made to adapt BlockClust to our data. Finally, we reach the conclusion and discuss future work.
The following sections explain the biological and machine learning concepts that appear in the later chapters.
1.1 Biology
This section focuses on the biological terms and methods that are mentioned in later chapters.
1.1.1 DNA
Deoxyribonucleic acid (DNA) is a double-stranded molecule in the form of a double helix. Each strand is a linear molecule whose monomer building blocks are nucleotides (supplied as nucleoside triphosphates during synthesis). The two strands that form DNA are joined by hydrogen bonds between complementary bases. The complementary bases adenine and thymine form two hydrogen bonds, and guanine and cytosine form three hydrogen bonds [3].
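The pairing rules above can be written down as a small sketch. The function names and the example sequence are illustrative assumptions and do not come from the thesis; the sketch also ignores strand direction.

```python
# Watson-Crick pairing and the number of hydrogen bonds per base pair.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complement_strand(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def hydrogen_bond_count(strand):
    """Count the hydrogen bonds that hold the strand to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(complement_strand("ATGC"))    # TACG
print(hydrogen_bond_count("ATGC"))  # 10
```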
1.1.2 RNA
Ribonucleic acid (RNA) is usually a single-stranded chain that exists in all living cells and many viruses. One difference between DNA and RNA is the sugar molecule (ribose instead of deoxyribose). The second difference is that RNA molecules use the base uracil instead of thymine. RNA was formerly regarded mainly as a messenger that carries instructions from DNA for controlling the synthesis of proteins. Today it is known that RNAs also perform important regulatory tasks inside the cell. These RNAs are called non-coding RNAs (ncRNAs) [6].
1.1.3 Protein
Proteins are large biological molecules (macromolecules) with a very diverse range of tasks in living organisms, such as catalyzing metabolic reactions, DNA replication, responding to stimuli and transporting molecules from one location to another. A protein is made of one or more long chains of amino acids that have been translated from the nucleotide sequences of their genes. Different amino acid sequences fold differently and thus form different tertiary structures, which leads to different functions [17].
1.1.4 Gene
A gene is a region or segment of DNA that is the basic cause of different hereditary characteristics.
1.1.5 Transcription
Transcription is the process of reading the information in DNA and transferring it to RNA. The RNA is transported from the nucleus into the cytoplasm and, in the case of mRNA, is further translated into protein. To initiate transcription, several proteins need to attach to a special site on the DNA sequence. This site is called a promoter or enhancer, and the components are an enzyme called RNA polymerase and transcription factors (subsidiary proteins). Together they are called the transcription initiation complex. After these attachments, transcription continues and RNA polymerase begins mRNA synthesis by elongating the mRNA sequence with bases complementary to the original DNA strand. The process ends as soon as the whole strand has been synthesized [1].
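The complementary elongation step can be illustrated with a minimal sketch. It assumes the template strand is given in the direction in which it is read, so that the mRNA comes out in its usual orientation; the function name and the example sequence are hypothetical.

```python
# Pairing used when the mRNA is built against the DNA template strand:
# RNA contains uracil (U) where DNA would contain thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand):
    """Return the mRNA that is complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_strand.upper())


print(transcribe("TACGGT"))  # AUGCCA
```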
1.1.6 Translation
Translation is the process of making proteins by decoding the mRNA information into amino acids, which are the building blocks of proteins. For this purpose the mRNA is read in three-letter combinations of nucleotides (codons), and the ribosome translates each codon into an amino acid. The ribosome is a complex of ribosomal RNA and proteins and consists of two parts, one small and one large subunit. Translation comprises three phases: initiation, elongation and termination. For initiation, the small subunit of the ribosome attaches to recognition elements of the mRNA sequence, and an initiator transfer RNA then pairs with the start codon AUG, which codes for the amino acid methionine. The large subunit then binds to the complex, and elongation can begin. The elongation phase continues until a stop codon is reached. The stop codons are UAA, UAG and UGA. Finally, the translation complex disassembles [7].
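The three phases can be mimicked in a few lines of code. This is only a sketch under simplifying assumptions: it starts at the first AUG it finds, which ignores how the ribosome actually selects a start site, and the codon table is deliberately partial (only the codons used in the example; unknown codons are reported as "Xaa").

```python
# Partial codon table: just enough entries for the example below.
CODON_TABLE = {"AUG": "Met", "GCU": "Ala", "UUU": "Phe", "AAA": "Lys"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")  # initiation at the start codon
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):  # elongation, codon by codon
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:  # termination
            break
        peptide.append(CODON_TABLE.get(codon, "Xaa"))
    return peptide


print(translate("GGAUGGCUUUUAAAUAGGGA"))  # ['Met', 'Ala', 'Phe', 'Lys']
```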
Figure 1.1: Transcription and translation are different biological processes in protein production. Transcription transfers DNA information to mRNAs. After the mRNAs have left the cell nucleus, translation starts; it comprises three phases: initiation, elongation and termination. The final products of these processes are proteins. Figure retrieved from [1].
1.1.7 Gene expression
Gene expression is the process by which the information of a gene is turned into a gene product, often a protein, through several steps: transcription, splicing, translation and post-translational modification [5].
1.1.8 Coding and Non-coding Regions
The CDS (coding DNA sequence) is the region that codes for a protein. In mRNA, the coding part is flanked by the five prime untranslated region (5’ UTR) and the three prime untranslated region (3’ UTR). The coding part consists of codons that are translated into protein, and the non-coding parts help translation to be initiated and completed.
Figure 1.2: The coding and non-coding parts (5’ and 3’ UTRs) of an mRNA. The coding part consists of codons that are translated into protein, and the non-coding parts help translation to be initiated and completed. Figure taken from [4].
1.1.9 Next generation sequencing (NGS)
NGS approaches opened a new era in which a whole genome can be sequenced more easily and cheaply, yet accurately, with commercially available methods. The earlier method, Sanger sequencing, is costly and not as fast as next generation sequencing. Whole-genome sequencing opens up large research possibilities in large-scale comparative and evolutionary studies and gives insight into how changes in the genetic code influence diseases. NGS technologies comprise several steps: template preparation, sequencing and imaging, and data analysis. Different NGS technologies use different protocols for each step. Current NGS technologies are Roche/454, Illumina/Solexa, Life/APG and Helicos BioSciences [24].
1.1.10 Ribosome profiling
Ribosome profiling is a novel sequencing technique that is becoming increasingly popular for extracting new information about RNA translation. In contrast to methods such as microarrays [32] and RNA-seq [33], which measure mRNA abundance, ribosome profiling considers only the mRNAs that actively take part in translation. Ribo-seq therefore provides information on proteins, which are one of the final products of gene expression. Ribosome profiling thus gives us the opportunity to gain new insights into the identity and amount of proteins produced from the transcriptome. It counts the accumulation of ribosome footprints at each position in a transcript. As figure 1.3 shows, the abundance of a protein is directly related to the ribosome density in the protein-coding region of its transcript. The total number of ribosomes engaged in the synthesis of a protein can be obtained from the ribosome footprints. Both the amount of an mRNA and the number of ribosomes on each copy directly influence the footprint counts; a change in either will change the number of footprints. The results therefore change with the mRNA abundance or with the number of ribosomes [20].
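Counting footprints per position can be sketched as follows. The representation of a footprint as a half-open (start, end) interval on the transcript and the toy coordinates are assumptions made for illustration; real footprints are longer and come from aligned reads.

```python
def footprint_coverage(transcript_length, footprints):
    """Count how many footprints cover each position of a transcript.

    footprints: list of (start, end) half-open intervals, 0-based.
    """
    coverage = [0] * transcript_length
    for start, end in footprints:
        # Clip the footprint to the transcript before counting.
        for pos in range(max(start, 0), min(end, transcript_length)):
            coverage[pos] += 1
    return coverage


print(footprint_coverage(10, [(0, 4), (2, 6), (3, 5)]))
# [1, 1, 2, 3, 2, 1, 0, 0, 0, 0]
```

In such a profile, more mRNA copies or more ribosomes per copy both raise the counts, which is the ambiguity described above.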
Figure 1.3: (a) The amount of ribosome footprints with respect to ribosome positions in vivo. In ribosome profiling, nuclease-protected footprints are used to show the positions of ribosomes attached to the many mRNAs in a cell. Ribosome profiling data can therefore display the abundance of ribosomes at each position of a transcript. The thick line shows the coding regions. (b) Ribosome profiling data represent the amount of protein that is synthesized. This is explained by the relation between the abundance of synthesized protein and the density of ribosomes in the translated region of a transcript. Ribosome profiling captures all ribosomes that are involved in translating a protein. Hence a change in either the abundance of mRNAs in polysomes or the number of ribosomes can change the ribosome profiling signal. Figure retrieved from [20].
1.1.11 Cross linking immunoprecipitation (CLIP)
CLIP is a UV cross-linking method in molecular biology. It is widely used for studying protein interactions with RNAs. CLIP-based methods can be used to gain a better understanding of translation, for example by mapping the RNA binding sites of a specific protein genome-wide [19].
1.1.12 Wild type vs mutant
The wild type of a species is its most common phenotype as seen in nature. A phenotype is the set of observable characteristics of an organism. Genes are the major cause of the functional behavior and development of an organism, and any change in a DNA, RNA or protein sequence causes a deviation from the natural form. An organism with a mutation in at least one gene relative to the wild type is called a mutant. These changes can be fundamental: if a mutation happens at the DNA level, it can alter all copies of the translated protein, which can cause a decrease in the expression of the protein. However, if a change happens at the level of RNA or protein synthesis, it is considered less important, since it affects only one of the many available copies of an RNA or a protein [30].
1.2 Machine learning
This section contains information about machine learning methods and concepts that will be used in the following chapters.
1.2.1 Machine learning and bioinformatics
Bioinformatics is an emerging interdisciplinary field of study. It draws on areas such as computer science, mathematics, statistics and engineering to analyze biological data. Machine learning, in turn, aims to find methods that can learn from data and make valid predictions based on data. In general, the goal in machine learning is to derive a model from an available set of samples in order to make decisions based on data. Recent developments in bioinformatics methods have produced a huge amount of data, and machine learning approaches are useful and applicable for processing this enormous amount of information and discovering new knowledge from it. They try to find computational models that retrieve novel information from data. Modeling methods in machine learning include supervised classification, clustering, probabilistic graphical models, optimization and heuristics. Machine learning methods are applied in various fields of bioinformatics, among them genomics, proteomics, microarrays, systems biology, evolution and text mining. In genomics, machine learning methods focus on finding the number, location and structure of genes. In proteomics, most of the effort goes into protein structure prediction, which can be a very complicated combinatorial task due to the intrinsic complexity of protein molecules. Systems biology models the life processes in the cell, and an application of machine learning in the field of evolution is the reconstruction of phylogenetic trees. With all these different methods and applications, a large body of publications is available. Text-mining approaches can be used for a practical search that returns related results for different topics in bioinformatics research [22].
1.2.2 Supervised learning
Supervised learning is the task of inferring a model from available labeled examples, which are known as training examples. Each training example is a pair consisting of an input object and a desired output value. A supervised learning method tries to fit a function to the training data in such a way that this function can predict the output for new instances [25].
1.2.3 Unsupervised learning
Unsupervised learning methods aim to extract hidden structure from unlabeled data. In contrast to supervised learning and reinforcement learning, the given data have no labels, so there is no error or reward signal for evaluating the results. Unsupervised learning is closely related to density estimation in statistics. It comprises methods for summarizing the data and extracting its key features [11].
1.2.4 Classification
Classification is a supervised learning task. A set of labeled data, called the training set, contains pairs of instances and their desired values, which can be nominal, categorical or numerical. By considering and comparing instances according to their similarity or dissimilarity, a classifier tries to determine the desired values or classes for new observations [21].
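One of the simplest similarity-based classifiers is the nearest-neighbor rule, sketched below. It is shown only to make the idea concrete and is not the classifier used in this thesis; the two-dimensional feature vectors and the labels are invented for the example.

```python
import math


def classify_nearest(training, query):
    """Return the label of the training instance closest to the query.

    training: list of (feature_vector, label) pairs.
    """
    _, nearest_label = min(
        training, key=lambda example: math.dist(example[0], query)
    )
    return nearest_label


# Hypothetical training set: two numeric attributes per profile.
training = [
    ((0.2, 1.0), "WT"),
    ((0.3, 1.2), "WT"),
    ((0.9, 3.1), "Mut"),
    ((1.0, 2.8), "Mut"),
]
print(classify_nearest(training, (0.8, 3.0)))  # Mut
```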
1.2.5 Clustering
Clustering is an unsupervised learning task that aims to build groups from a set of unlabeled data. It forms groups, or clusters, based on the similarity of their members: objects that are more similar to each other are placed in the same cluster and are less similar to the objects of other clusters. Clustering is an iterative process of adjusting the data preprocessing and model parameters until the desired properties are detected, and it involves trial and error in discovering the underlying knowledge in the data [11].
1.3 Contribution
In this work we set out to find a solution for the clustering and classification of Ribo-seq data. Similar work has been done for RNA-seq expression profiles, where the BlockClust [8] tool was used to find clusters of non-coding RNAs from their RNA-seq expression profiles. We therefore define a new but similar task for Ribo-seq data: we classify gene annotations based on the corresponding ribosomal footprint signal, and we assign clusters to transcripts under the different conditions of the ribosome profiles. To reach these goals, we first analyze the data to assess whether there are significant dissimilarities and similarities in the profiling data under the different conditions. Second, having gained an overview of the data, we define a new set of attributes for Ribo-seq data to increase the discriminative power of the BlockClust algorithm for this particular type of data. Because mRNA profiles are longer than those of non-coding RNAs, the running time of BlockClust increases drastically. The previous version of the tool used grid search to optimize its parameters, but because of the running time on Ribo-seq data this is no longer feasible. Instead of grid search, we therefore employ a line search algorithm to optimize the set of parameters, e.g. radius and number of bins, and thereby decrease the computing time. Finally, we apply BlockClust in an optimized manner to Ribo-seq expression profiles and build clusters based on the different conditions of the data, e.g. wild type or mutant.
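Why a line search is cheaper than a grid search can be seen in a small sketch. The exact procedure used in this work is not described at this point, so the code below is a generic coordinate-wise line search under that assumption: it tunes one parameter at a time while the others stay fixed. The score function `toy_score` and the candidate values are hypothetical stand-ins for an actual BlockClust evaluation.

```python
def line_search(score, grids, start):
    """Optimize one parameter at a time, keeping the others fixed.

    score: function mapping a parameter dict to a number (higher is better).
    grids: dict of parameter name -> list of candidate values.
    start: dict with the initial value of every parameter.
    """
    best = dict(start)
    best_score = score(best)
    evaluations = 1
    for name, values in grids.items():
        current = best[name]
        for value in values:
            if value == current:
                continue  # this configuration was already evaluated
            candidate = {**best, name: value}
            candidate_score = score(candidate)
            evaluations += 1
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score, evaluations


def toy_score(params):
    # Hypothetical objective with its maximum at radius=2, bins=5.
    return -(params["radius"] - 2) ** 2 - (params["bins"] - 5) ** 2


grids = {"radius": [1, 2, 3, 4], "bins": [3, 5, 7, 9]}
best, best_score, evaluations = line_search(
    toy_score, grids, {"radius": 1, "bins": 3}
)
print(best, best_score, evaluations)
```

For these two parameters with four candidate values each, the sketch evaluates 7 configurations, whereas a full grid would need 4 × 4 = 16. The price is that a line search can miss optima that depend on interactions between parameters.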
In the next chapter we describe our case study on ribosome profiles and introduce the data we use.
Chapter 2
Data
Before applying the BlockClust [8] tool to Ribo-seq data, we first survey the data to get a better understanding of its behavior under different circumstances and to expose its various aspects. Do the data vary under different biological conditions? Do the samples we are working with show the expected biological behavior? What are the prerequisites for clustering these data? These are the questions we aim to answer in this chapter. To check the characteristics of Ribo-seq data, we work on a case study that contains three biological replicates under various conditions. The available data are Ribo-seq profiles for the different conditions. The mutant library is obtained after impairing the Ded1 protein, which is a critical factor for translation initiation in Saccharomyces cerevisiae [16]; however, the function of this enzyme is not clearly known. Ded1p is sensitive to temperature. We have cases with impaired Ded1 protein before and after a slow elevation of temperature. We have four Ribo-seq libraries in four conditions:
• Wild type without temperature shift (WT-t0)
• Wild type with temperature shift (WT-t5)
• Mutant (Ded1 is impaired) without temperature shift (Mut-t0)
• Mutant with temperature shift (Mut-t5)
In addition to these libraries, we have the positions of Ded1 binding sites on mRNAs, which were extracted from iCLIP data. Looking at the ribosome profiling data in the specific regions of the 5’ and 3’ UTRs (binding sites of Ded1p) and comparing wild type with mutant Ribo-seq shows an accumulation of ribosomes in these particular regions. Hence, after Ded1 dysfunction the abundance of ribosomes increases at positions that are critical for the initiation of translation (5’ UTR, 3’ UTR). From this observation we can assume that improper functioning of Ded1p is a reason for the accumulation of ribosome footprints. In our task we want to examine the Ribo-seq data and cluster the gene annotations based on their Ribo-seq signals. For clustering, it is important to find significant differences in our data: the different ribosome profiling libraries should be distinguishable according to their conditions.
We carried out several analyses on the data before applying BlockClust [8] to these four libraries. First, we get an overview of our data by calling peaks [18] for each chromosome in each library. Next, we check the ribosome stacking at binding sites by counting the number of reads per peak for each gene. We then compare the number of peaks per gene for the wild type and mutant libraries. Finally, we compare the amount of reads in the peaks of each chromosome with the chromosome length.
2.1 Extracting peaks from Ribo-seq data
Here we take the abundance of reads of the highest peak in each chromosome and display the values sorted in increasing order. This indicates how the abundance of reads changes across the chromosomes of the whole genome and reveals the basic per-chromosome differences in our data. The method used to obtain the peaks is the one in [18], which we call the peak caller. It finds peaks from the blocks of an expression profile, which are extracted with the blockbuster [15] tool. It takes the block with the highest abundance of reads and places the center of a Gaussian function on this block. It then extends the domain of the peak by checking for blocks that overlap more than 50 percent of the highest block and extends the boundaries of the Gaussian function to the borders of these overlapping blocks. It continues with the remaining blocks by choosing the next block with the highest amount of reads.
The original data we start with are BAM files. The peak caller expects input files in SAM format, so we convert BAM to SAM with the samtools [23] view command. Applying the peak caller to the SAM files yields output in GFF format, which contains the positions of the peaks and the amount of reads for each peak.
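The greedy part of this procedure can be sketched as follows. This is not the implementation from [18]: the Gaussian fit is left out, a peak is represented only by its boundaries, and summing the reads of the merged blocks is an assumption made for the example. The block coordinates are invented.

```python
def call_peaks(blocks, min_overlap=0.5):
    """Greedily merge blocks into peaks, starting from the highest block.

    blocks: list of (start, end, reads) with half-open coordinates.
    A block joins a peak if it overlaps more than `min_overlap` of the
    length of the peak's seed block (the block with the most reads).
    """
    remaining = sorted(blocks, key=lambda block: block[2], reverse=True)
    peaks = []
    while remaining:
        seed = remaining.pop(0)  # block with the highest read count
        start, end, reads = seed
        seed_length = seed[1] - seed[0]
        kept = []
        for block in remaining:
            overlap = min(seed[1], block[1]) - max(seed[0], block[0])
            if overlap > min_overlap * seed_length:
                # Extend the peak to the borders of the overlapping block.
                start, end = min(start, block[0]), max(end, block[1])
                reads += block[2]
            else:
                kept.append(block)
        remaining = kept
        peaks.append((start, end, reads))
    return peaks


blocks = [(100, 130, 50), (110, 140, 20), (128, 160, 5), (300, 330, 30)]
print(call_peaks(blocks))
# [(100, 140, 70), (300, 330, 30), (128, 160, 5)]
```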
The bar chart in figure 2.1 illustrates the highest abundance of reads per peak across the chromosomes of yeast for three replicates under two conditions. In replicate one (R1) and replicate three (R3) we generally see a higher amount of reads for the mutant than for the wild type before temperature elevation in all chromosomes; this behavior reverses after the temperature increase for these two replicates. Furthermore, replicate two (R2) clearly does not show the expected biological behavior.
2.2 Counting amount of reads per peak for each gene
After using the peak caller to obtain the peak coordinates for each chromosome, we split the gene positions into two subsets: one comprises genes that contain binding sites of Ded1, and the other comprises genes in non-binding regions of Ded1. We call them the bound and unbound sets, respectively. The binding positions have already been extracted from iCLIP data. To obtain the two subsets of genes, we intersect the gene positions once with the binding sites and once with the non-binding sites. For this purpose we used intersectBed [29]. To obtain the final result, the number of reads per peak for each gene, we then intersect the bound and unbound gene positions with the peak coordinates extracted by the peak caller.
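The logic of these intersections can be illustrated in plain Python. The sketch does not call intersectBed; it only mimics the overlap test on half-open intervals, ignores strand, and treats every gene without an overlapping binding site as unbound, which is a simplification. Gene names, coordinates and read counts are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_bound_unbound(genes, binding_sites):
    """Split genes (chrom, start, end, name) by overlap with binding sites."""
    bound, unbound = [], []
    for gene in genes:
        if any(overlaps(gene[:3], site) for site in binding_sites):
            bound.append(gene)
        else:
            unbound.append(gene)
    return bound, unbound


def reads_per_peak(genes, peaks):
    """Map each gene name to the read counts of the peaks overlapping it.

    peaks: list of (chrom, start, end, reads).
    """
    return {
        gene[3]: [peak[3] for peak in peaks if overlaps(gene[:3], peak[:3])]
        for gene in genes
    }


genes = [("chrI", 100, 500, "geneA"), ("chrI", 800, 1200, "geneB")]
binding_sites = [("chrI", 450, 470)]
peaks = [("chrI", 440, 480, 70), ("chrI", 900, 930, 12), ("chrII", 100, 200, 5)]

bound, unbound = split_bound_unbound(genes, binding_sites)
print(reads_per_peak(bound, peaks))    # {'geneA': [70]}
print(reads_per_peak(unbound, peaks))  # {'geneB': [12]}
```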
To get a better understanding of our data and a general view of our libraries, we applied the peak caller to them. We considered the genes that contain binding sites of Ded1 and the genes in non-binding regions, and calculated the abundance of reads for each peak in these two sets.
Figure 2.2 depicts that for replicate 1 without temperature shift (t0-R1),
if we compare two cases of bound and unbound, we observe a significant
increment in the amount of reads at binding positions for mutant vs wild
14
-
2.2. Counting amount of reads per peak for each gene
Figure 2.1: illustrates a general insight from the ribo-seq libraries. It dis-
plays the highest abundance of reads for peaks among different chromosomes
of yeast for three replicates over two various conditions. It has sorted based
on increment number of reads for the wildtype case. In R1 and R3 we see
higher amount of reads for mutant in compare to wild type before temper-
ature elevation in all the chromosomes, However this behavior reverse after
increasing the temperature for these two replicates which may be explained
through the fact that Ded1 is not functioning after temperature elevation.
15
-
2.2. Counting amount of reads per peak for each gene
type, which can be explained by the fact that Ded1 has a significant
influence on translation initiation in yeast. This behavior changes after
the temperature shift (t5-R1): in this case the number of reads for wild
type shows a higher increase at binding sites than at non-binding sites.
The number of reads in wild type in the latter condition is generally
higher than the number of reads in mutant. This may occur because the Ded1
protein is temperature sensitive and an increase in temperature causes
protein dysfunction.
Looking at figure 2.2, we see that replicates 1 and 3 show similar
behavior, whereas replicate 2 behaves completely differently. We therefore
assumed that there was some problem in preparing the replicate 2
libraries.
[Figure 2.2 plot: amount of reads per peak (y-axis, 0 to 2500) for the libraries t0_R1, t0_R2, t0_R3, t5_R1, t5_R2 and t5_R3 (x-axis), in two panels (bound, unbound), with conditions Mut and WT.]
Figure 2.2: A clear increase in the number of reads at binding positions
for mutant vs. wild type, which can be explained by the fact that Ded1 has
a significant influence on translation initiation in yeast. This behavior
changes after the temperature shift (t5-R1); in this case the number of
reads for wild type shows a higher increase at binding sites than at
non-binding sites. The figure also reveals that R2 behaves differently
from the two other replicates.
2.3 Peak frequencies in wild type vs mutant
Calling peaks on our Ribo-seq libraries with the peak caller gives us the
opportunity to obtain the most significant peaks in these libraries. For
further investigation of our data, and to check the consistency of our
results with those obtained previously, in this part we count the consensus
peaks for each gene position in our Ribo-seq libraries (WT-t0, WT-t5,
Mut-t0 and Mut-t5). We thus compute the number of peaks for mutant and
wild type before and after increasing the temperature for each gene
position under two conditions, namely genes at binding sites of Ded1 and
genes at non-binding sites of Ded1.
First we build two BED files: one for gene positions located at binding
sites and one for genes at non-binding sites. Next, by intersecting the
gene annotations with the peak positions using intersectBed [29], we count
the peaks with respect to the type of gene they overlap. This yields two
groups of peaks: peaks associated with genes at binding positions and
peaks associated with genes at non-binding positions. The final result is
shown in figure 2.3. It depicts the linear correlation of peak abundance in
wild type and mutant libraries for each gene position and indicates
consistency across our libraries and conditions with respect to the
previous plots. Moreover, there is an obvious tendency toward the mutant
axis at unbound positions, which can be explained by stacking of ribosomal
footprints and thus a higher number of reads in the mutant condition.
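As a rough illustration of the comparison described above, the following sketch computes a Pearson correlation between per-gene peak counts in a wild type and a mutant library. The gene names and counts are made-up toy values, and the helper function is hypothetical; it is not the procedure used to produce figure 2.3.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy peak counts per gene (assumed values, for illustration only).
wt_peaks = {"geneA": 2, "geneB": 4, "geneC": 6}
mut_peaks = {"geneA": 3, "geneB": 5, "geneC": 10}

# Only genes present in both libraries can be compared.
shared = sorted(set(wt_peaks) & set(mut_peaks))
r = pearson([wt_peaks[g] for g in shared], [mut_peaks[g] for g in shared])

# Fraction of genes with more peaks in mutant than in wild type.
mutant_leaning = sum(mut_peaks[g] > wt_peaks[g] for g in shared) / len(shared)
print(round(r, 3), mutant_leaning)
```

A value of r close to 1 indicates a linear relation between the two libraries, while the second number gives a simple summary of a tendency toward the mutant axis.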
Figure 2.3: Linear correlation of peak abundance in wild type and mutant
libraries per gene position. There is an obvious tendency toward the
mutant axis before the temperature shift, which changes after the
temperature shift and can be explained by stacking of ribosomal
footprints, since initiation is halted when Ded1 is impaired. The binding
and non-binding positions show almost the same behavior.
2.4 Abundance of reads in peaks vs. length of chromosomes
This part presents the relation between the chromosome lengths in yeast
and the abundance of reads in the consensus peaks of each chromosome,
calculated with the peak caller. Figure 2.4 illustrates the density of
reads in the different chromosomes of yeast in our Ribo-seq profiles. It
shows that a higher number of reads can be expected with increasing
chromosome length, although some shorter chromosomes have a higher read
abundance than longer ones. Furthermore, the results in replicates 1 and 3
are clearly correlated, whereas replicate 2 does not behave analogously to
the other replicates.
[Figure 2.4 plot: reads (y-axis, 0 to 2000) against chromosome length (x-axis, 400000 to 1200000) in four panels (Mut_t0, Mut_t5, WT_t0, WT_t5), with replicates R1, R2 and R3 distinguished in the legend.]
Figure 2.4: Density of reads in the different chromosomes of yeast in our
Ribo-seq profiles. With increasing chromosome length, a higher number of
reads can be expected, although some shorter chromosomes have a higher
read abundance than longer ones.
Overall, in this chapter we processed our Ribo-seq libraries under
different conditions. To illuminate similarities and dissimilarities, we
chose four approaches that capture different aspects of the Ribo-seq
libraries at hand. These are extracting peaks from the Ribo-seq data,
counting the number of peak reads for each gene, comparing peak
frequencies in wild type and mutant, and comparing read abundance with
chromosome length. In general, these analyses revealed a coherence in the
read amounts of the expression profiles of the mutant libraries and the
wild type with temperature shift. Because the protein has been knocked out
there, translation is
halted and an elevation in ribosome footprints is expected at the binding
positions of the protein in mutant and in wild type with temperature
shift. We do not observe this behavior in wild type without temperature
increase. Although replicates one and three show similar, biologically
relevant behavior, we cannot observe it in replicate two; we therefore
continue our procedure without replicate two. There are significant
differences between ribosome expression profiles under different
conditions, so one can try to cluster such profiles by condition, such as
wild type versus mutant or with versus without temperature elevation. In
the next chapter we explain how we employ BlockClust to cluster such
profiles under different conditions in this case study. We adapt
BlockClust, which was previously applied to RNA-seq expression profiles,
to Ribo-seq data.
Chapter 3 Methods
Novel developments in whole-genome sequencing and deep sequencing
techniques such as Ribo-seq provide large amounts of new data, from which
we can extract useful information with machine learning methods. This work
aims to find classes and clusters according to similar processing patterns
of ribosome accumulation in Ribo-seq data using a fast graph kernel
technique (NSPDK) [14]. Such work has already been done for RNA-seq data
in the BlockClust tool, which is able to distinguish different ncRNA
groups. To achieve high performance, BlockClust uses two main steps:
encoding the expression profiles as a graph and building combinatorial
features based on that graph. We can apply this tool to Ribo-seq data, but
this requires an adaptation that adds new features and optimizes the
pipeline for Ribo-seq data.
3.1 BlockClust: efficient clustering and classification of non-coding RNAs
Studies on genome-wide sequencing data revealed that most DNA regions
encode non-coding RNAs (ncRNAs) [26]. They have an important role in
cellular regulation, although the functional annotation of a large part of
them is not yet clear. Several methods address this issue, e.g. clustering
ncRNAs based on their sequence or secondary structure [34], [31]. It is
also possible to assign classes based on patterns in the expression
profiles of ncRNAs. These patterns depend on the
functional molecule and 3D structure. BlockClust therefore takes the
latter approach to group different classes of ncRNAs. It tries to assign
clusters to non-coding RNA classes by applying machine learning methods to
transcript processing patterns in RNA-seq data, and it is robust to
changes of cell line, organism and sequencing machine. Two main parts of
BlockClust enable the clustering:
1. Expression profile encoding
2. Combinatorial feature generation
The information on the mapped reads is given in SAM (sequence alignment
map) or BAM (binary alignment map) format. Mapping gives information about
where the reads are aligned on the reference genome. For simplicity and to
increase computational speed, BlockClust divides the expression profiles
into sets of block groups and blocks using the blockbuster tool [15],
which fits Gaussian functions to the profile data. blockbuster assigns a
Gaussian function to each read, takes the consensus Gaussian, and then
assigns the reads to a block by finding the highest peak and considering
the standard deviation. Each block consists of several reads, and each
block group has a set of corresponding blocks.
To find patterns in expression profiles, BlockClust extracts attributes
for blocks (e.g. number of multi-mapped reads, entropy of read
expressions, minimum read length), block groups (e.g. entropy of read
starts, entropy of read ends, entropy of read lengths) and block edges
(e.g. contiguity and difference in median read expressions). It then
discretizes the attribute values into nominal values using an equal
frequency algorithm. Next, BlockClust encodes all of this information as a
graph. The numbers of blocks and block groups are not identical for each
instance, so representing them as vectors is not possible; a graph
representation of the data solves this. The graph represents the attribute
values for the expression data of one gene and consists of two components:
one for the block group attributes and the other for the block and block
edge attributes. In the first component, the position of a node indicates
the attribute type; in the second component, the order of the nodes
represents the order of the block positions obtained from the blockbuster
tool. This sequence of nodes is called the backbone. The attribute values
of each single block are attached to the backbone at the node of that
block. The sequence of backbone nodes is analogous to the sequence of
blocks constructed with blockbuster. Once the graphs have been retrieved
from the expression profiles, BlockClust is able to
produce combinatorial features using the Neighborhood Subgraph Pairwise
Distance Kernel (NSPDK). These features have been employed to cluster
ncRNAs efficiently. Moreover, BlockClust implements the concept of a
viewpoint in the feature generation process. This is extra information
that restricts subgraph extraction so that at least one of the subgraph
roots lies on the backbone. It helps to build features from an increasing
number of attributes while considering a much smaller subset of attribute
combinations. Finally, BlockClust uses the similarity notion of NSPDK and
the Markov Cluster Process [13] to build ncRNA clusters.
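The equal frequency discretization step mentioned above can be illustrated with a minimal sketch. It is written under stated assumptions and is not taken from the BlockClust source: the function name and the number of bins are hypothetical, and tied values are simply separated by their rank, which a real implementation may handle differently.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin label 0..n_bins-1 so that bins hold (almost) equally many values.

    Values are ranked from smallest to largest; the rank, not the value itself,
    decides the bin. Tied values may therefore end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values)
    return labels


# Toy attribute values (assumed, for illustration only).
attribute_values = [0.1, 5.0, 0.3, 2.2, 9.9, 1.0]
labels = equal_frequency_bins(attribute_values, n_bins=3)
print(labels)
```

Unlike equal width binning, this puts the same number of instances into each nominal value, so a few extreme attribute values do not leave most bins empty.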
Figure 3.1: After the blockbuster tool has built blocks and block groups
from the expression profiles, BlockClust builds graphs from the
discretized values of the attributes of the block groups and blocks.
Finally, the similarity of these graphs is compared by NSPDK using
combinatorial features. Figure retrieved from [8].
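To make the backbone idea more concrete, the following sketch builds a small labeled graph in which one backbone node per block is chained in block order and the discretized attribute values hang off their block. It is a simplified illustration under our own assumptions: the node naming, the attribute names and the plain dictionary representation are hypothetical, and block edge attributes and the block group component are left out.

```python
def build_block_graph(block_attributes):
    """Build a backbone graph from a list of per-block attribute dictionaries.

    block_attributes: list in block order; each item maps an attribute name to
    its discretized (nominal) value. Returns (nodes, edges), where nodes maps a
    node id to its label and edges is a list of (node id, node id) pairs.
    """
    nodes, edges = {}, []
    previous = None
    for i, attributes in enumerate(block_attributes):
        backbone = f"block{i}"
        nodes[backbone] = "BLOCK"
        if previous is not None:
            edges.append((previous, backbone))  # backbone follows the block order
        for name, value in sorted(attributes.items()):
            attribute_node = f"{backbone}:{name}"
            nodes[attribute_node] = f"{name}={value}"
            edges.append((backbone, attribute_node))  # attribute attached to its block
        previous = backbone
    return nodes, edges


# Two blocks with toy discretized attributes (assumed values).
nodes, edges = build_block_graph([
    {"min_read_length": 1, "read_entropy": 0},
    {"min_read_length": 2, "read_entropy": 2},
])
print(len(nodes), len(edges))
```

Because the number of blocks differs between instances, such graphs have different sizes, which is why a graph kernel rather than a fixed-length vector is used for comparison.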
3.2 Extracting New Attributes with respect to Ribo-seq data
In the following, new attributes that capture additional characteristics
of ribosome profiles are defined and added to the BlockClust pipeline. We
aim to extract new attributes because the expression profiles of Ribo-seq
data are not similar to those of small ncRNAs. There is a significant
difference in the length and the end expression levels of the block
groups. The attributes
used in BlockClust are optimized for small ncRNAs. Hence we want to
extract a few more that might make sense for Ribo-seq data. To increase
the accuracy of our prediction, we first check how well the defined
attributes correspond to the Ribo-seq data. Five attributes are
implemented here. To specify attribute values, the block and block group
positions of our ribosome profiles must be determined. We therefore ran
the blockbuster tool on our ribosome profiles and continued our work with
the coordinates of the block groups and blocks. After extracting the block
group positions with blockbuster, we try to find a unique mapping between
block groups and gene positions. Using this mapping, we assign the
conditions of our genes to the block groups and group the block groups
under their mapped gene.
3.2.1 Entropy of Block Distances
Entropy
In information theory, entropy measures the amount of uncertainty in
random data. For instance, for a stochastic binary variable X with values
1 and 0, if 1 and 0 each occur with probability 50 percent, the
uncertainty about the value of X is at its maximum and thus the entropy is
1 [10]. The entropy formula is:
\[ \mathrm{entropy} = -\sum q \log_2 q \tag{3.1} \]
q is the probability distribution of signals. We calculate the distance
between two consecutive blocks as the distance between the end position of
the first block and the start position of the following block. We then map
this value to bins that are defined with respect to the minimum and
maximum distances in the block group. The fraction needed for the entropy
is obtained as the sum of all mapped block distance values divided by the
total number of bins in that particular block group. Finally, we
substitute each fraction into the entropy formula. We have plotted the
entropy for block groups under four different conditions with respect to
the conditions of their mapped genes. As figure 3.2 illustrates, the bound
and unbound positions show almost analogous behavior; however, the median
value for the bound case after the temperature shift shows a slight
increase.
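The calculation can be sketched as follows. This is a minimal illustration and not the code used in this work: it assumes equal width bins between the minimum and maximum distance, uses a hypothetical default of four bins, and takes the fraction of distances falling into each bin as the probability q in formula (3.1).

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Entropy in bits: -sum(q * log2(q)) over all non-zero probabilities."""
    return -sum(q * math.log2(q) for q in probabilities if q > 0)


def block_distances(blocks):
    """Distance from the end of each block to the start of the following block."""
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


def entropy_of_block_distances(blocks, n_bins=4):
    """Bin the distances of one block group between their min and max, then take the entropy."""
    distances = block_distances(blocks)
    if not distances:
        return 0.0
    low, high = min(distances), max(distances)
    if low == high:
        return 0.0  # all distances identical: no uncertainty
    width = (high - low) / n_bins
    # The maximum distance would fall just outside the last bin, so clamp it.
    bins = Counter(min(int((d - low) / width), n_bins - 1) for d in distances)
    return shannon_entropy(count / len(distances) for count in bins.values())


# Toy block group: (start, end) of each block, sorted by start (assumed values).
blocks = [(0, 10), (15, 25), (30, 40), (90, 100)]
binary_entropy = shannon_entropy([0.5, 0.5])
group_entropy = entropy_of_block_distances(blocks)
print(binary_entropy, round(group_entropy, 4))
```

The first value reproduces the binary example from the text, where two equally likely outcomes give an entropy of 1. In the toy block group, two of the three distances fall into the same bin, so the entropy is below the maximum possible for three distances.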
[Figure 3.2 plot: entropy of block distances (y-axis, 0 to 3) for the libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), with conditions Mut and WT.]
Figure 3.2: Entropy of block distances calculated per block group. The
bound and unbound positions show almost analogous behavior.
3.2.2 Entropy of Blocks End and Start Positions
To assess the entropy of the start and end positions of blocks in block
groups, we first count the number of blocks that share an identical start
position in each block group. Next, we divide this count by the number of
blocks in the corresponding block group to obtain the fraction for the
entropy. Substituting the obtained value into formula (3.1) yields the
block start/end entropy for each block group. Finally, we plotted them
under four different cases based on the status of their genes. Figure 3.3
shows that, in general, the entropy of block starts is higher in bound
cases than in unbound cases. Moreover, the entropy of block start
positions decreases after the temperature shift for the mutant case, in
contrast to the wild type case, which shows an increase. The same behavior
occurs for the entropy of end positions.
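A minimal sketch of this attribute, under the assumption that the fraction of blocks sharing a position is used directly as q in formula (3.1), could look as follows. The function name and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def position_entropy(positions):
    """Entropy in bits of the distribution of positions within one block group."""
    counts = Counter(positions)
    total = len(positions)
    # Fraction of blocks sharing each position, substituted into formula (3.1).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy block group: (start, end) of each block (assumed values).
blocks = [(100, 130), (100, 128), (140, 170), (200, 230)]
start_entropy = position_entropy([start for start, _ in blocks])
end_entropy = position_entropy([end for _, end in blocks])
print(start_entropy, end_entropy)
```

In this toy group two blocks share a start position, so the start entropy is lower than the end entropy, where all four positions differ.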
[Figure 3.3 plot: entropy of starting positions (y-axis, 0 to 9) for the libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), with conditions Mut and WT.]
Figure 3.3: Entropy of block start positions for each block group. In
general, the entropy of block starts is higher in bound cases than in
unbound cases.
[Figure 3.4 plot: entropy of ending positions (y-axis, 0 to 9) for the libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), with conditions Mut and WT.]
Figure 3.4: Entropy of block end positions for each block group. The
entropy of block ends is higher in bound cases than in unbound cases.
3.2.3 Density of Reads
To measure the read density, we used the blockbuster output file to count
the reads per block and obtained the fraction by dividing this value by
the total number of reads in each block group. Figure 3.5 shows no
significant dissimilarity between the unbound cases; in bound positions,
however, it shows a higher density of reads for the mutant cases. This
behavior changes after the temperature shift, where the higher read
density belongs to the wild type at bound positions.
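The fraction described above amounts to a simple normalization, sketched here with hypothetical read counts. Parsing of the blockbuster output is deliberately left out, since the sketch only assumes that the number of reads per block is already available.

```python
def read_density(block_read_counts):
    """Fraction of the block group's reads that fall into each block."""
    total = sum(block_read_counts)
    if total == 0:
        return [0.0 for _ in block_read_counts]  # empty block group: no density
    return [count / total for count in block_read_counts]


# Reads per block in one toy block group (assumed values).
densities = read_density([30, 10, 60])
print(densities)
```

The densities of one block group sum to 1, which makes block groups with very different total read counts comparable.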