
Master’s Thesis

Characterization of ribosomal footprints with use of graph kernel based approaches

Soraya Nikousokhan

October 2016

Albert-Ludwigs Universität Freiburg

Department of Computer Science

Chair of Bioinformatics

Candidate

Soraya Nikousokhan

Matr. number

3555120

Working period

14. 04. 2016 – 28. 10. 2016

Examiners

Prof. Dr. Rolf Backofen

Prof. Dr. Wolfgang Hess

Supervisors

Dr. Fabrizio Costa

Pavankumar Videm


Acknowledgment

I would like to express my thanks to my parents for all the love and support they give me. Most of all, I want to thank them for teaching me the value of gaining knowledge and learning, and for encouraging me to proceed on this path. I would like to thank Prof. Dr. Rolf Backofen for giving me the opportunity to write my thesis at the Freiburg bioinformatics chair. I would like to thank Dr. Fabrizio Costa for introducing this interesting topic to me and for his help and efficient supervision during the work, as well as Pavankumar Videm for his help and patience and the time he put in to answer my questions. I would also like to thank the members of the Freiburg bioinformatics chair, with special thanks to Teresa Müller, Patrick Wright, Milad Miladi and Torsten Houwaart for their help and advice. I would like to thank my friends Hanna Poelker, Maryam Samani and Kristin Gekeler for their company and support during both the good and the difficult phases of this work.


Abstract

Ribosome profiling is an emerging technique that uses deep sequencing to give new insight into the translation of proteins, from single-codon to genome scale. In contrast to earlier methods such as microarrays and RNA-seq, Ribo-seq considers only the mRNAs that are actively being translated in a cell, which carry the information for protein synthesis. This characteristic of Ribo-seq provides new data focused on the translation level. The patterns of ribosomal footprints obtained in this way may reveal new aspects of translation. The aim of this work is to classify Ribo-seq profiles according to different conditions and to find clusters of Ribo-seq profiles. This is done with a tool named BlockClust, which is based on a fast graph kernel method called the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). BlockClust encodes expression profile data as graphs and employs NSPDK to achieve high performance. Although BlockClust was previously applied to cluster non-coding RNAs from their RNA-seq expression profiles, it can also be adapted for clustering and classification tasks on other types of data, e.g. ribosome profiling data. We have therefore adapted BlockClust by defining new attributes for finding patterns in Ribo-seq data and adding them to the existing set of attributes. Moreover, we optimized the tool using different parameter sets. We show that it is possible to employ BlockClust on ribosome profiles, and we achieved good performance in the classification of these profiles.


Kurzfassung

Ribosome profiling is a technique that uses deep sequencing to provide new insights into the translation of proteins, both at the level of single codons and at genomic scale. In contrast to earlier methods such as microarrays and RNA-seq, Ribo-seq considers only active mRNAs and therefore provides information on protein synthesis. The Ribo-seq method thus yields new data on the translation phase, the ribosomal footprints, which reveal new aspects in the field of translation. The aim of this work is to find meaningful clusters using Ribo-seq expression profiles. For this purpose the tool BlockClust is used, which is based on a graph kernel method called NSPDK. BlockClust encodes expression profiles as graphs and applies this fast neighborhood-based graph kernel to achieve high performance. BlockClust is known primarily as a clustering method that groups non-coding RNAs into clusters based on their expression profiles. It can, however, also be applied to other kinds of expression data, for example ribosome profiling data. Adding further attributes makes it possible to apply BlockClust to ribosome profiles. The optimized parameters are added to the earlier attribute set. Furthermore, we carried out an optimization over different parameter sets. In this work we show that ribosome profiles can be classified with good performance.


Contents

Abstract
Kurzfassung
List of Tables
1 Introduction
1.1 Biology
1.1.1 DNA
1.1.2 RNA
1.1.3 Protein
1.1.4 Gene
1.1.5 Transcription
1.1.6 Translation
1.1.7 Gene expression
1.1.8 Coding and Non-coding Regions
1.1.9 Next generation sequencing (NGS)
1.1.10 Ribosome profiling
1.1.11 Cross linking immunoprecipitation (CLIP)
1.1.12 Wild type vs mutant
1.2 Machine learning
1.2.1 Machine learning and bioinformatics
1.2.2 Supervised learning
1.2.3 Unsupervised learning
1.2.4 Classification
1.2.5 Clustering
1.3 Contribution
2 Data
2.1 Extracting peaks from Ribo-seq data
2.2 Counting amount of reads per peak for each gene
2.3 Peak frequencies in wild type vs mutant
2.4 Abundance of reads in peaks vs. length of chromosomes
3 Methods
3.1 BlockClust: efficient clustering and classification of non-coding RNAs
3.2 Extracting New Attributes with respect to Ribo-seq data
3.2.1 Entropy of Block Distances
3.2.2 Entropy of Blocks End and Start Positions
3.2.3 Density of Reads
3.2.4 GC-ratio
3.2.5 Number of Reads for GC positions
3.3 Neighborhood subgraph pairwise distance kernel (NSPDK)
3.4 BlockClust adaptation with Ribo-seq Data
4 Results and Discussion
4.1 Blockbuster benchmarking
4.2 BlockClust similarity score assessment
4.2.1 Confusion Matrix
4.2.2 ROC
4.2.3 Combinatorial feature similarity score assessment
4.3 BlockClust optimization
4.4 BlockClust experiment
5 Conclusion
5.1 Future work
A Appendix (Attributes)

List of Figures

1.1 Transcription and translation
1.2 mRNA coding and non-coding regions
1.3 Ribosome profiling: analysis of ribosome occupancy data
2.1 Highest Peak per Chromosome in Ribo-seq libraries
2.2 Number of Reads Per Peak for Each Gene
2.3 Ribosome Peak Freq. Mutant vs. Wild type
2.4 No. Peaks reads vs. Chromosome Length
3.1 Read profile encoding
3.2 Entropy of block distances
3.3 Entropy of blocks start positions
3.4 Entropy of blocks end positions
3.5 Reads density per block group
3.6 GC-ratio per block group
3.7 Number of reads for GC positions
3.8 BlockClust pipeline
4.1 Blockbuster benchmarking results
4.2 Confusion Matrix
4.3 Combinatorial feature similarity ROC measures for unbound cases
4.4 Combinatorial feature similarity ROC measures for bound cases
4.5 Clusters of Ribosome profiles based on four different conditions

List of Tables

4.1 Line search optimization results
4.2 Classification performance
4.3 Clustering performance

Chapter 1

Introduction

Recent developments in next generation sequencing provide inexpensive and fast genome-wide sequencing of reads. This has opened the door to novel methods and tools for retrieving new information from the resulting data. The aim of this work is to classify and cluster ribosome profiles with a tool called BlockClust. Chapter 1 introduces the biological and machine learning background. Chapter 2 describes the data analysis carried out to check whether the data are suitable for the clustering task. Chapter 3 discusses the method and the changes made to adapt BlockClust to our data. Chapter 4 presents and discusses the results, and chapter 5 draws conclusions and outlines future work.

The following sections explain the biological and machine learning concepts that are used in the later chapters.

1.1 Biology

This section focuses on the biological terms and methods that are mentioned in later chapters.

1.1.1 DNA

Deoxyribonucleic acid (DNA) is a double-stranded molecule in the form of a double helix. Each strand is a linear polymer whose monomer building blocks are nucleotides. The two strands that form DNA are held together by hydrogen bonds between complementary bases. The complementary bases adenine and thymine form two hydrogen bonds, and guanine and cytosine form three hydrogen bonds [3].

1.1.2 RNA

Ribonucleic acid (RNA) is usually a single-stranded chain and is found in all living cells and many viruses. One difference between DNA and RNA is the sugar molecule (ribose instead of deoxyribose). The second difference is that RNA uses the base uracil instead of thymine. RNA was originally regarded as a messenger that carries instructions from DNA for controlling the synthesis of proteins. It is now known that RNAs also perform important regulatory tasks inside the cell. These RNAs are called non-coding RNAs (ncRNAs) [6].

1.1.3 Protein

Proteins are large biological molecules (macromolecules) with a very diverse range of tasks in living organisms, such as catalyzing metabolic reactions, DNA replication, responding to stimuli and transporting molecules from one location to another. A protein is made of one or more long chains of amino acids that have been translated from the nucleotide sequences of the corresponding genes. Different amino acid sequences fold differently and thus form different tertiary structures, which leads to different functions [17].

1.1.4 Gene

A gene is a region or segment of DNA that acts as the basic unit of heredity and underlies different hereditary characteristics.

1.1.5 Transcription

Transcription is the process of reading the information in DNA and transferring it to RNA. The RNA is transported from the nucleus into the cytoplasm and, in the case of mRNA, further translated into protein. To initiate transcription, several proteins need to attach to a special site in the DNA sequence. This site is called an enhancer or promoter, and the components that bind there are an enzyme called RNA polymerase and transcription factors (auxiliary proteins). Together they are called the transcription initiation complex. Once the complex has assembled, transcription proceeds: RNA polymerase synthesizes mRNA by adding bases complementary to the template DNA strand to the growing mRNA sequence. The process ends as soon as the whole strand has been synthesized [1].
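The complementary copying step described above can be illustrated with a minimal sketch. It assumes the template strand is given as a plain string read in 3’ to 5’ direction; the names `DNA_TO_RNA` and `transcribe` are hypothetical and chosen only for this illustration, and the sketch ignores initiation, splicing and termination signals.

```python
# Base pairing between a DNA template base and the RNA base added opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


if __name__ == "__main__":
    print(transcribe("TACGGA"))  # AUGCCU
```

Note that thymine in the template pairs with adenine in the mRNA, while adenine in the template pairs with uracil, which reflects the RNA-specific base described in section 1.1.2.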

1.1.6 Translation

Translation is the process of making proteins by decoding the mRNA information into amino acids, the building blocks of proteins. The mRNA is read in three-letter combinations of nucleotides (codons), and the ribosome translates each codon into one amino acid. The ribosome is a complex of ribosomal RNA and proteins and consists of two subunits, one small and one large. Translation comprises three phases: initiation, elongation and termination. For initiation, the small subunit of the ribosome attaches to recognition elements of the mRNA sequence, and a transfer RNA is then bound to the start codon AUG, which codes for the amino acid methionine. The large subunit then binds to the whole complex and translation starts. The elongation phase continues until a stop codon is reached. The stop codons are UAA, UAG and UGA. Finally, the translation complex disassembles [7].
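The reading of codons from AUG to the first stop codon can be sketched as follows. This is an illustration under simplifying assumptions: it uses the standard genetic code, starts at the first AUG found in the string, and ignores recognition elements and tRNA mechanics. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed for codons in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to, but not including, the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA terminates translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCUUUUAAGG"))  # MAF
```

The example string contains the codons AUG, GCC, UUU and UAA after a two-base leader, so the sketch yields methionine, alanine and phenylalanine before stopping.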


Figure 1.1: Transcription and translation are distinct biological processes in protein production. Transcription transfers DNA information to mRNAs. After the mRNAs have left the cell nucleus, translation starts; it comprises three phases: initiation, elongation and termination. The final products of these processes are proteins. Figure retrieved from [1].

1.1.7 Gene expression

Gene expression is the process by which the information in a gene is converted into a gene product, often a protein, through several steps: transcription, splicing, translation and post-translational modification [5].

1.1.8 Coding and Non-coding Regions

The CDS (coding DNA sequence) is the region that codes for a protein. In mRNA the coding part is flanked by the five prime untranslated region (5’ UTR) and the three prime untranslated region (3’ UTR). The coding part consists of codons that are translated into protein, and the non-coding parts help translation to be initiated and completed.


Figure 1.2: Coding and non-coding parts (5’ and 3’ UTRs) of an mRNA. The coding part consists of codons that are translated into protein, and the non-coding parts help translation to be initiated and completed. Figure taken from [4].

1.1.9 Next generation sequencing (NGS)

NGS approaches opened a new era in which a whole genome can be sequenced more easily and cheaply, yet accurately, with commercially available methods. The earlier method, Sanger sequencing, is costly and not as fast as next generation sequencing. Whole-genome sequencing opens up broad research possibilities in large-scale comparative and evolutionary studies, and gives insight into how changes in the genetic code influence diseases. NGS technologies comprise several steps: template preparation, sequencing and imaging, and data analysis. Different NGS technologies use different protocols for each step. Current NGS technologies are Roche/454, Illumina/Solexa, Life/APG and Helicos BioSciences [24].

1.1.10 Ribosome profiling

Ribosome profiling is a novel sequencing technique that is becoming increasingly popular for extracting new information about the translation of RNA. In contrast to methods such as microarrays [32] and RNA-seq [33], which measure mRNA abundance, ribosome profiling considers only mRNAs that are actively taking part in translation. Ribo-seq therefore provides information on proteins, which are among the final products of gene expression. The technique thus gives us the opportunity to gain new insight into the identity and amount of the proteins produced from the transcriptome. It counts the accumulation of ribosome footprints at each position of a transcript. As figure 1.3 shows, the abundance of a protein is directly related to the ribosome density in the protein-coding region of its transcript. The total number of ribosomes engaged in the synthesis of a protein can be obtained from the ribosome footprints. Both the amount of an mRNA and the number of ribosomes on each copy directly influence the footprint counts, so a change in either quantity changes the number of observed footprints [20].
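Counting footprints per transcript position, as described above, can be sketched in a few lines. The sketch assumes that footprints are already mapped to transcript coordinates as 0-based, end-exclusive `(start, end)` pairs; this coordinate convention and the function names `footprint_profile` and `cds_density` are assumptions made for illustration, not part of any Ribo-seq tool.

```python
def footprint_profile(transcript_length, footprints):
    """Count how many footprints cover each position of a transcript.

    `footprints` holds (start, end) pairs, 0-based and end-exclusive (an assumption).
    """
    profile = [0] * transcript_length
    for start, end in footprints:
        for pos in range(max(start, 0), min(end, transcript_length)):
            profile[pos] += 1
    return profile


def cds_density(profile, cds_start, cds_end):
    """Mean footprint coverage per position inside the coding region."""
    return sum(profile[cds_start:cds_end]) / (cds_end - cds_start)


if __name__ == "__main__":
    profile = footprint_profile(10, [(0, 3), (2, 5)])
    print(profile)                     # [1, 1, 2, 1, 1, 0, 0, 0, 0, 0]
    print(cds_density(profile, 0, 5))  # 1.2
```

Because the profile only counts footprints, doubling the number of mRNA copies or doubling the ribosomes per copy would raise the counts in the same way, which is the ambiguity noted in the text.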


Figure 1.3: (a) Ribosome footprints reflect ribosome positions in vivo. In ribosome profiling, the nuclease-protected footprints are used to show the positions of the ribosomes attached to many mRNAs in the cell. Ribosome profiling data can therefore display the abundance of ribosomes at each position of a transcript. The thick line marks the coding region. (b) Ribosome profiling data reflect the amount of protein synthesized. This follows from the relation between the abundance of a synthesized protein and the ribosome density in the translated region of its transcript. Ribosome profiling captures all ribosomes involved in translating a protein. Hence a change in either the abundance of the mRNA in polysomes or the number of ribosomes on it alters the ribosome profiling signal. Figure retrieved from [20].


1.1.11 Cross linking immunoprecipitation (CLIP)

CLIP is a UV cross-linking method in molecular biology. It is widely used to study the interaction of proteins with RNAs. CLIP-based methods can be used to gain a better understanding of translation, for example by mapping the RNA binding sites of a specific protein on a genome-wide scale [19].

1.1.12 Wild type vs mutant

The wild type of a species is its most common phenotype found in nature. A phenotype is the set of observable characteristics of an organism. Genes are the major determinant of the function and development of an organism, and changes in the DNA, RNA or protein sequence alter the natural form. An organism that carries a mutation in at least one gene relative to the wild type is called a mutant. Such changes can be fundamental: a mutation at the DNA level can alter all copies of the translated protein, which can decrease the expression of the protein. An error at the RNA or protein synthesis level, by contrast, is considered less important, since it affects only one of the many available copies of an RNA or protein [30].

1.2 Machine learning

This section contains information about machine learning methods and concepts that will be used in the following chapters.

1.2.1 Machine learning and bioinformatics

Bioinformatics is an emerging interdisciplinary field of study. It draws on areas such as computer science, mathematics, statistics and engineering to analyze biological data. Machine learning, in turn, aims to find methods that learn from data and make sound predictions based on it. In general, machine learning tries to fit a model to an available set of samples in order to make decisions based on data. Recent developments in bioinformatics methods have produced huge amounts of data, and machine learning approaches are useful for processing this data and discovering new knowledge in it. They try to find computational models that retrieve novel information from data. Modeling methods in machine learning include supervised classification, clustering, probabilistic graphical models, optimization and heuristics. Machine learning has been applied in various fields of bioinformatics, among them genomics, proteomics, microarrays, systems biology, evolution and text mining. In genomics, machine learning methods focus on finding the number, location and structure of genes. In proteomics, most of the effort goes into predicting protein structure, which can be a very complicated combinatorial task because of the intrinsic complexity of protein molecules. Systems biology models the life processes in the cell, and an application of machine learning in evolution is the reconstruction of phylogenetic trees. All these methods and applications have produced a large body of publications. Text-mining approaches make it feasible to search this literature and return relevant results for different topics in bioinformatics research [22].

1.2.2 Supervised learning

Supervised learning is the task of inferring a model from labeled examples, known as training examples. Each training example is a pair consisting of an input object and a desired output value. Supervised learning fits a function to the training data such that the function can predict the output for new instances [25].

1.2.3 Unsupervised learning

Unsupervised learning methods aim to uncover hidden structure in unlabeled data. Unlike in supervised learning and reinforcement learning, the given data has no labels, so there is no error or reward signal for evaluating the results. Unsupervised learning is closely related to density estimation in statistics. It also comprises methods for summarizing data and extracting its key features [11].


1.2.4 Classification

Classification is a supervised learning task. It starts from a set of labeled data, called the training set, which contains pairs of instances and their desired values; the values can be nominal, categorical or numerical. By comparing instances according to their similarity or dissimilarity, a classifier tries to determine the desired value or class of new observations [21].
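As a minimal illustration of classifying a new observation by similarity to labeled training instances, the sketch below uses a one-nearest-neighbor rule with Euclidean distance. This rule is chosen here only because it is short; it is not the classifier used by BlockClust, and the feature vectors and labels in the example are invented for illustration.

```python
import math


def nearest_neighbor_predict(training, query):
    """Return the label of the training instance closest to `query`.

    `training` is a list of (feature_vector, label) pairs.
    """
    _, label = min(training, key=lambda pair: math.dist(pair[0], query))
    return label


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors with made-up labels.
    training = [((0.0, 0.0), "wild type"), ((5.0, 5.0), "mutant")]
    print(nearest_neighbor_predict(training, (1.0, 0.0)))  # wild type
    print(nearest_neighbor_predict(training, (4.0, 6.0)))  # mutant
```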

1.2.5 Clustering

Clustering is an unsupervised learning task that aims to build groups from a set of unlabeled data. It forms groups, or clusters, based on the similarity of their members: objects that are similar to each other are placed in the same cluster and are less similar to objects in other clusters. Clustering is an iterative process of adjusting the data preprocessing and the model parameters until the desired properties are detected; it involves trial and error in discovering the underlying knowledge in the data [11].
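Grouping unlabeled objects by similarity can be illustrated with a small k-means sketch. K-means is used here purely as a familiar example of a clustering algorithm and is not the method applied in this thesis; the initial centers are passed in explicitly so that the example is deterministic, and the data points are invented.

```python
import math


def kmeans(points, centers, iterations=10):
    """Assign points to the nearest center, then move each center to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 10), (10, 11)]
    centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
    print(centers)   # [(0.0, 0.5), (10.0, 10.5)]
    print(clusters)  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```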

1.3 Contribution

In this work we seek a solution for clustering and classifying Ribo-seq data. Similar work has been done for RNA-seq expression profiles with the BlockClust [8] tool, which finds clusters of non-coding RNAs from their RNA-seq expression profiles. We define a new but similar task for Ribo-seq data: we classify gene annotations based on their corresponding ribosomal footprint signals, and we assign transcripts to clusters under the different conditions of the ribosome profiles. To reach these goals, we first analyze the data to see whether the profiles show significant similarities and dissimilarities under the different conditions. Second, having gained an overview of the data, we define a new set of attributes specific to Ribo-seq data to increase the discriminative power of the BlockClust algorithm for this kind of data. Because mRNA transcripts are longer than non-coding RNAs, the running time of BlockClust increases drastically. The previous version of the tool used grid search to optimize its parameters, but given the running time on Ribo-seq data this is no longer feasible. Instead of grid search, we therefore employ a line search algorithm to optimize the parameters, e.g. the radius and the number of bins, and so reduce the computing time. Finally, we apply the optimized BlockClust to Ribo-seq expression profiles and build clusters based on the different conditions of the data, e.g. wild type or mutant.
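The saving from replacing grid search by line search can be made concrete with a small sketch. It assumes a line search that optimizes one parameter at a time while holding the others fixed, which is one common reading of the term; the exact procedure used with BlockClust is described later in the thesis and may differ. The scoring function, the candidate values and the names `line_search`, `radius` and `bins` below are invented for illustration.

```python
def line_search(score, grid, start):
    """Optimize one parameter at a time while keeping the other parameters fixed.

    `score` maps a parameter dict to a number (higher is better), `grid` maps each
    parameter name to its candidate values, and `start` is the initial setting.
    """
    best = dict(start)
    best_score = score(best)
    evaluations = 1
    for name, candidates in grid.items():
        for value in candidates:
            trial = {**best, name: value}
            trial_score = score(trial)
            evaluations += 1
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score, evaluations


if __name__ == "__main__":
    # Toy objective with its optimum at radius=2, bins=10 (purely illustrative).
    def toy_score(params):
        return -((params["radius"] - 2) ** 2) - ((params["bins"] - 10) ** 2)

    grid = {"radius": [1, 2, 3, 4], "bins": [5, 10, 15, 20]}
    best, best_score, evaluations = line_search(toy_score, grid, {"radius": 1, "bins": 5})
    print(best, best_score, evaluations)  # {'radius': 2, 'bins': 10} 0 9
```

With four candidates per parameter, this sketch evaluates the objective 9 times, whereas an exhaustive grid over the same candidates would need 16 evaluations; the gap grows quickly with the number of parameters. The price is that a one-at-a-time search may miss optima that depend on interactions between parameters.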

In the next chapter we describe the case study for ribosome profiles and introduce the data we use.


Chapter 2

Data

Before applying the BlockClust [8] tool to Ribo-seq data, we first survey the data to get a better picture of how it behaves under different circumstances and to expose its various aspects. Does the data vary under different biological conditions? Do the samples we are working on show the expected biological behavior? What are the prerequisites for clustering this data? These are the questions we aim to answer in this chapter. To check the characteristics of Ribo-seq data, we work on a case study that contains three biological replicates under various conditions. The available data are Ribo-seq profiles for the different conditions. The mutant library was obtained by impairing the Ded1 protein, which is a critical factor for translation initiation in Saccharomyces cerevisiae [16]; however, the function of this enzyme is not clearly understood. Ded1p is sensitive to temperature. We have cases in which the Ded1 protein is impaired before and after a slow elevation of temperature. We have four Ribo-seq libraries, one for each of the following conditions:

• Wild type without temperature shift (WT-t0)

• Wild type with temperature shift (WT-t5)

• Mutant (Ded1 is impaired) without temperature shift (Mut-t0)

• Mutant with temperature shift (Mut-t5)

In addition to these libraries we have the positions of the Ded1 binding sites on mRNAs, which were extracted from iCLIP data. Comparing the wild type and mutant Ribo-seq data in the specific regions of the 5’ and 3’ UTRs (the binding sites of Ded1p) reveals an accumulation of ribosomes in these regions. Hence, after Ded1 dysfunction the abundance of ribosomes increases at positions that are critical for the initiation of translation (5’ UTR and 3’ UTR). From this observation we can assume that improper functioning of Ded1p is a reason for the accumulation of ribosome footprints. Our task is to examine the Ribo-seq data and to cluster the gene annotations based on their Ribo-seq signals. For clustering it is important to find significant differences in the data: the different ribosome profiling libraries should be distinguishable according to their conditions.

We carried out several analyses on the data before applying BlockClust [8] to these four libraries. First, we get an overview of the data by calling peaks [18] for each chromosome in each library. Next, we check the ribosome stacking at binding sites by counting the number of reads per peak for each gene. We then compare the number of peaks per gene for the wild type and mutant libraries. Finally, we compare the number of reads in the peaks of each chromosome with the chromosome length.

2.1 Extracting peaks from Ribo-seq data

Here we take the read abundance of the highest peak in each chromosome and display the values sorted in increasing order. This indicates how the abundance of reads changes across the chromosomes of the whole genome and reveals the basic chromosome-level differences in our data. The peaks are obtained with the method from [18], which we call the peak caller. It finds peaks from the blocks of an expression profile, which are extracted by the blockbuster [15] tool. It takes the block with the highest abundance of reads and places the center of a Gaussian function on this block. It then extends the domain of the peak by checking for overlapping blocks that cover more than 50 percent of the highest block, and extends the boundaries of the Gaussian function to the borders of these blocks. It continues with the remaining blocks by choosing the next block with the highest number of reads. The original data are BAM files. The peak caller expects SAM files as input, so we convert the BAM files to SAM with the samtools [23] view command. Applying the peak caller to the SAM files yields output in GFF format, which contains the positions of the peaks and the number of reads for each peak.
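The block-merging idea described above can be sketched as a greedy loop. This is only an approximation of the peak caller in [18] under several assumptions: blocks are given as `(start, end, reads)` tuples on one chromosome, the Gaussian fit is left out and only the peak boundaries are tracked, and the reads of merged blocks are summed. The function name `call_peaks` and the parameter `min_overlap` are hypothetical.

```python
def call_peaks(blocks, min_overlap=0.5):
    """Greedily merge blocks into peaks, starting from the block with the most reads.

    `blocks` is a list of (start, end, reads) tuples on one chromosome. A block joins
    the current peak if it covers more than `min_overlap` of the top block's length.
    Summing the reads of merged blocks is an assumption made for this sketch.
    """
    remaining = sorted(blocks, key=lambda block: block[2], reverse=True)
    peaks = []
    while remaining:
        top = remaining.pop(0)
        start, end, reads = top
        top_length = top[1] - top[0]
        not_merged = []
        for block in remaining:
            overlap = min(top[1], block[1]) - max(top[0], block[0])
            if overlap > min_overlap * top_length:
                # Extend the peak boundaries to the borders of the overlapping block.
                start, end = min(start, block[0]), max(end, block[1])
                reads += block[2]
            else:
                not_merged.append(block)
        remaining = not_merged
        peaks.append((start, end, reads))
    return peaks


if __name__ == "__main__":
    blocks = [(100, 130, 50), (110, 140, 20), (300, 330, 10)]
    print(call_peaks(blocks))  # [(100, 140, 70), (300, 330, 10)]
```

In the example, the second block overlaps 20 of the 30 positions of the top block, which exceeds 50 percent, so it is merged into the first peak; the third block does not overlap and forms its own peak.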

The bar chart in figure 2.1 shows the highest read abundance of a peak in each chromosome of yeast for three replicates under two conditions. In replicate one (R1) and replicate three (R3), the mutant generally shows a higher number of reads than the wild type in all chromosomes before the temperature elevation, but this behavior reverses after the temperature is increased. Replicate two (R2), in contrast, does not show the expected biological behavior.

2.2 Counting amount of reads per peak for each gene

Using the peak coordinates obtained from the peak caller for each chromosome, we split the genes into two subsets: genes that contain binding sites of Ded1, and genes that lie in non-binding regions of Ded1. We call them the bound and unbound sets, respectively. The binding positions had already been extracted from iCLIP data. To obtain the two subsets, we intersect the gene positions once with the binding sites and once with the non-binding sites, using intersectBed [29]. To obtain the final result, the number of reads per peak for each gene, we then intersect the bound and unbound gene positions with the peak coordinates extracted by the peak caller.

To get a better understanding of the data and a general view of our libraries, we applied the peak caller to them, considered the genes that contain Ded1 binding sites and the genes without binding sites, and calculated the abundance of reads for each peak in these two sets.
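The two intersection steps can be sketched in plain Python. This sketch does not call intersectBed; it only mirrors the interval logic under the assumption that genes and binding sites are `(chromosome, start, end)` intervals with half-open coordinates and that peaks carry a read count as a fourth field. All names and example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_genes(genes, binding_sites):
    """Split genes into those overlapping a binding site (bound) and the rest (unbound)."""
    bound, unbound = {}, {}
    for name, interval in genes.items():
        is_bound = any(overlaps(interval, site) for site in binding_sites)
        (bound if is_bound else unbound)[name] = interval
    return bound, unbound


def reads_per_peak(genes, peaks):
    """Map each gene to the read counts of the peaks that overlap it.

    `peaks` is a list of (chrom, start, end, reads) tuples.
    """
    return {
        name: [peak[3] for peak in peaks if overlaps(interval, peak[:3])]
        for name, interval in genes.items()
    }


if __name__ == "__main__":
    genes = {"geneA": ("chrI", 100, 500), "geneB": ("chrI", 800, 1200)}
    binding_sites = [("chrI", 450, 460)]
    peaks = [("chrI", 120, 160, 70), ("chrI", 900, 950, 10), ("chrII", 100, 200, 5)]
    bound, unbound = split_genes(genes, binding_sites)
    print(reads_per_peak(bound, peaks))    # {'geneA': [70]}
    print(reads_per_peak(unbound, peaks))  # {'geneB': [10]}
```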

Figure 2.1: A general overview of the Ribo-seq libraries. It displays the highest abundance of reads per peak across the chromosomes of yeast for three replicates under two conditions, sorted by increasing number of reads in the wild type. In R1 and R3, the mutant shows a higher amount of reads than the wild type on all chromosomes before temperature elevation. This behavior reverses after the temperature increase for these two replicates, which may be explained by the fact that Ded1 is not functioning after temperature elevation.

Figure 2.2 shows that for replicate 1 without temperature shift (t0-R1), comparing the bound and unbound cases, there is a significant increase in the amount of reads at binding positions for mutant versus wild type. This can be explained by the fact that Ded1 has a significant influence on translation initiation in yeast. The behavior changes after the temperature shift (t5-R1): here the number of reads for the wild type shows a higher growth at binding sites than at non-binding sites, and the number of reads in the wild type is generally higher than in the mutant. This may occur because the Ded1 protein is sensitive to temperature, so that an increase in temperature causes protein dysfunction.

Looking at Figure 2.2, we see that replicates 1 and 3 behave similarly, whereas replicate 2 behaves completely differently. We therefore assumed that there was a problem in the preparation of the replicate 2 libraries.


[Figure 2.2 (plot): amount of reads per peak (y-axis, 0 to 2500) for the libraries t0_R1, t0_R2, t0_R3, t5_R1, t5_R2 and t5_R3 (x-axis), in two panels (bound, unbound); conditions: Mut, WT.]

Figure 2.2: Amount of reads per peak at bound and unbound positions. There is a significant increase in the amount of reads at binding positions for mutant vs. wild type, which can be explained by the fact that Ded1 has a significant influence on translation initiation in yeast. This behavior changes after the temperature shift (t5-R1): the number of reads for the wild type shows a higher growth at binding sites than at non-binding sites. The figure also reveals that R2 behaves differently from the other two replicates.

2.3 Peak frequencies in wild type vs mutant

Calling peaks on our Ribo-seq libraries with the peak caller gives us the most significant peaks in each library. To investigate our data further and to check the consistency of our results with those obtained earlier, in this part we count the consensus peaks for each gene position in our Ribo-seq libraries (WT-t0, WT-t5, Mut-t0 and Mut-t5). That is, we compute the number of peaks for mutant and wild type, before and after increasing the temperature, for each gene position under two conditions: genes at binding sites of Ded1 and genes at non-binding sites of Ded1.

First, we build two BED files: one for gene positions located at binding sites and one for genes at non-binding sites. Next, by intersecting the gene annotations with the peak positions using intersectBed [29], we count the peaks that overlap each type of gene. This yields two groups of peaks: those associated with genes at binding positions and those associated with genes at non-binding positions. The final result is shown in Figure 2.3. It depicts the linear correlation of peak abundance between wild type and mutant libraries for each gene position and indicates consistency across our libraries and conditions with respect to the previous plots. Moreover, there is an obvious tendency toward the mutant axis at unbound positions, which can be explained by stacking of ribosomal footprints and thus a higher amount of reads in the mutant condition.
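To make the comparison of per-gene peak counts concrete, the following minimal sketch computes the Pearson correlation between wild type and mutant counts and tallies how many genes lean toward the mutant axis. The function `pearson`, the dictionaries and the counts are invented for illustration; the thesis does not state which correlation measure or software was used for Figure 2.3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical number of peaks per gene in one wild type and one mutant library.
wt_peaks = {"geneA": 2, "geneB": 4, "geneC": 6}
mut_peaks = {"geneA": 3, "geneB": 6, "geneC": 9}

# Only genes present in both libraries can be compared.
shared_genes = sorted(set(wt_peaks) & set(mut_peaks))
r = pearson(
    [wt_peaks[g] for g in shared_genes],
    [mut_peaks[g] for g in shared_genes],
)

# Genes lying on the mutant side of the diagonal in a WT-vs-mutant scatter plot.
mutant_heavier = sum(mut_peaks[g] > wt_peaks[g] for g in shared_genes)
```

A correlation close to 1 together with most genes on the mutant side would correspond to the tendency toward the mutant axis described above.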

Figure 2.3: Linear correlation of peak abundance in wild type and mutant libraries per gene position. There is an obvious tendency toward the mutant axis before the temperature shift, which changes after the temperature shift. It can be explained by stacking of ribosomal footprints, since initiation is halted when Ded1 is impaired. The binding and non-binding positions behave almost identically.


2.4 Abundance of reads in peaks vs. length of chromosomes

This part describes the relation between the lengths of the yeast chromosomes and the abundance of reads in the consensus peaks on each chromosome, as calculated with the peak caller. Figure 2.4 illustrates the density of reads on the different chromosomes of yeast in our Ribo-seq profiles. It shows that a higher amount of reads can be expected with increasing chromosome length, although some shorter chromosomes have a higher read abundance than longer ones. Furthermore, the results for replicates 1 and 3 are clearly correlated, whereas replicate 2 does not behave like the other two replicates.

[Figure 2.4 (plot): reads (y-axis, 0 to 2000) against chromosome length (x-axis, 400000 to 1200000) in four panels (Mut_t0, Mut_t5, WT_t0, WT_t5); points grouped by replicate (R1, R2, R3).]

Figure 2.4: Density of reads on the different chromosomes of yeast in our Ribo-seq profiles. A higher amount of reads can be expected with increasing chromosome length, although some shorter chromosomes have a higher read abundance than longer ones.

Overall, in this chapter we processed our Ribo-seq libraries under different conditions. To highlight similarities and dissimilarities, we chose four approaches that reveal different aspects of the Ribo-seq libraries at hand: extracting peaks from the Ribo-seq data, counting the number of peak reads for each gene, comparing peak frequencies between wild type and mutant, and comparing read abundance with chromosome length. In general, these analyses revealed a coherence in read amounts between the expression profiles of the mutant libraries and of the wild type with temperature shift. Because the protein is knocked out in these cases, translation is halted and an elevation in ribosome footprints is expected at the binding positions of the protein in the mutant and in the wild type with temperature shift. We do not observe this behavior in the wild type without temperature increase. Although replicates one and three show similar, biologically relevant behavior, we cannot observe it in replicate two; we therefore continue our procedure without replicate two. There are significant differences between ribosome expression profiles under different conditions, so one can try to cluster such profiles by condition, such as wild type versus mutant, or with versus without temperature elevation. In the next chapter we explain how BlockClust is employed to cluster such profiles under different conditions in this case study. We adapt BlockClust, which was previously applied to RNA-seq expression profiles, to Ribo-seq data.


Chapter 3 Methods

With novel developments in whole-genome sequencing and the application of deep sequencing techniques such as Ribo-seq, a large amount of new data has become available, from which useful information can be extracted with machine learning methods. This work aims to find classes and clusters according to similar processing patterns of ribosome accumulation in Ribo-seq data, using a fast graph kernel technique (NSPDK) [14]. Similar work has already been done for RNA-seq data in the BlockClust tool, which is able to distinguish different ncRNA groups. To achieve high performance, BlockClust uses two main steps: encoding the expression profiles as graphs and building combinatorial features from these graphs. The tool can also be applied to Ribo-seq data. However, it must first be adapted by adding new features and by optimizing its pipeline for Ribo-seq data.

3.1 BlockClust: efficient clustering and classification of non-coding RNAs

Studies on genome-wide sequencing data revealed that most DNA regions encode information for non-coding RNAs (ncRNAs) [26]. They play an important role in cellular regulation, although the functional annotation of a large part of them is still unclear. Several methods address this issue, e.g. clustering ncRNAs based on their sequence or secondary structure [34], [31]. It is also possible to assign classes based on patterns in the expression profiles of ncRNAs. These patterns depend on the functional molecule and its 3D structure. BlockClust follows the latter approach to group different classes of ncRNAs. It assigns clusters to non-coding RNA classes by applying machine learning methods to transcript processing patterns in RNA-seq data, and it is robust to changes of cell line, organism and sequencing machine. The two main parts of BlockClust that enable clustering are:

1. Expression profile encoding
2. Combinatorial feature generation

The information on mapped reads is provided in SAM (sequence alignment map) or BAM (binary alignment map) format. Mapping gives information about where the reads are aligned on the reference genome. For simplicity and to increase computational speed, BlockClust divides the expression profiles into sets of block groups and blocks using the blockbuster tool [15], which fits Gaussian functions to the profile data. For each read, blockbuster assigns a Gaussian function, takes the consensus Gaussian, and then assigns the reads to a block by finding the highest peak and considering the standard deviation. Each block consists of several reads, and each block group has a set of corresponding blocks. To find patterns in expression profiles, BlockClust extracts attributes for blocks (e.g. number of multi-mapped reads, entropy of read expressions, minimum read length), block groups (e.g. entropy of read starts, entropy of read ends, entropy of read lengths) and block edges (e.g. contiguity and difference in median read expressions).

It then discretizes the attribute values into nominal values using the equal-frequency algorithm. Next, BlockClust encodes all of this information as a graph. The numbers of blocks and block groups are not identical across instances, so representing them as fixed-length vectors is not possible.
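Equal-frequency discretization can be illustrated with a short rank-based sketch: values are sorted, and each value receives a bin label such that every bin holds roughly the same number of values. This is a generic illustration and not BlockClust's own code; the function name `equal_frequency_bins`, the number of bins and the example values are assumptions made for the example.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin label in 0..n_bins-1 with roughly equal bin sizes.

    Rank-based variant: tied values may end up in neighbouring bins.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values)
    return labels


# Hypothetical attribute values (e.g. minimum read length per block).
min_read_lengths = [18, 35, 21, 30, 19, 28]
labels = equal_frequency_bins(min_read_lengths, 3)
```

With six values and three bins, the two smallest values receive label 0, the two middle values label 1 and the two largest label 2, so each nominal value is used equally often.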

A solution is a graph representation of the data. This graph represents the attribute values for the expression data of one gene and consists of two components: one for the block group attributes and one for the block and block edge attributes. In the first component, the position of a node indicates the attribute type. In the second component, the order of the nodes represents the order of the block positions obtained from blockbuster. This sequence of nodes is called the backbone; it mirrors the sequence of blocks constructed by blockbuster. The attribute values of each block are attached to the corresponding backbone node. Once the graphs have been built from the expression profiles, BlockClust produces combinatorial features using the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). These features have been employed to cluster ncRNAs efficiently. BlockClust also implements the concept of a viewpoint in the feature generation process: extra information that restricts subgraph extraction so that at least one of the subgraph roots lies on the backbone. This makes it possible to build features from a growing number of attributes while considering a much smaller subset of attribute combinations. Finally, BlockClust uses the NSPDK similarity notion and the Markov Cluster Process [13] to build ncRNA clusters.
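The two-component graph described above can be sketched with plain Python dictionaries. The exact node layout used by BlockClust is not given in this text, so the layout below is an assumption: block group attributes form one chain, the blocks form the backbone, block edge attributes sit between consecutive backbone nodes, and the attributes of each block hang off its backbone node as a chain. The function `encode_block_group` and all labels are hypothetical.

```python
def encode_block_group(group_attrs, block_attrs, edge_attrs):
    """Encode one block group as (nodes, edges).

    group_attrs: discretized block group attribute labels, in a fixed order.
    block_attrs: one list of discretized attribute labels per block, in block order.
    edge_attrs:  one label per pair of consecutive blocks.
    """
    nodes = {}  # node id -> label
    edges = []  # undirected edges as (node id, node id)

    # Component 1: chain of block group attributes; position encodes the type.
    for i, label in enumerate(group_attrs):
        nodes[f"g{i}"] = label
        if i > 0:
            edges.append((f"g{i - 1}", f"g{i}"))

    # Component 2: backbone of blocks in the order reported for the block group.
    for b, attrs in enumerate(block_attrs):
        backbone = f"b{b}"
        nodes[backbone] = "block"
        if b > 0:
            # Block edge attribute between consecutive backbone nodes.
            edge_node = f"e{b - 1}"
            nodes[edge_node] = edge_attrs[b - 1]
            edges.append((f"b{b - 1}", edge_node))
            edges.append((edge_node, backbone))
        # Attribute values of this block, attached to its backbone node.
        previous = backbone
        for a, label in enumerate(attrs):
            attr_node = f"b{b}a{a}"
            nodes[attr_node] = label
            edges.append((previous, attr_node))
            previous = attr_node
    return nodes, edges


nodes, edges = encode_block_group(
    group_attrs=["high", "low"],
    block_attrs=[["few", "short"], ["many", "long"]],
    edge_attrs=["contiguous"],
)
backbone = [node for node, label in nodes.items() if label == "block"]
```

Because the number of nodes grows with the number of blocks, instances with different numbers of blocks simply yield graphs of different sizes, which is what a fixed-length vector cannot express.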


Figure 3.1: After the blockbuster tool has built blocks and block groups from the expression profiles, BlockClust builds graphs from the discretized values of the attributes of block groups and blocks. Finally, the similarity of these graphs is compared by NSPDK using combinatorial features. Figure taken from [8].

3.2 Extracting New Attributes with respect to Ribo-seq data

In the following, new attributes that capture characteristics of ribosome profiles are defined and added to the BlockClust pipeline. We extract new attributes because the expression profiles of Ribo-seq data are not similar to those of small ncRNAs: there is a significant difference in the length and expression levels of the block groups. The attributes used in BlockClust are optimized for small ncRNAs, so we want to extract a few more that are meaningful for Ribo-seq data. To increase the accuracy of our prediction, we first check how the defined attributes behave on Ribo-seq data. Five attributes are implemented here. To compute the attribute values, the positions of the blocks and block groups of our ribosome profiles must be determined. We therefore ran the blockbuster tool on our ribosome profiles and continued our work on the resulting block and block group coordinates. After extracting the block group positions with blockbuster, we look for a unique mapping between block groups and gene positions. Using this mapping, we transfer the condition of each gene to its block groups and group them under their mapped gene.

3.2.1 Entropy of Block Distances

Entropy

In information theory, entropy measures the amount of uncertainty in random data. For instance, for a binary random variable X with values 1 and 0, if 1 and 0 each occur with probability 50 percent, the uncertainty about the value of X is at its maximum and the entropy is 1 [10]. The entropy formula is:

$$\mathrm{entropy} = -\sum q \log_2 q \tag{3.1}$$

Here q is the probability distribution of the signals. We calculate the distance between two consecutive blocks as the distance between the end position of the first block and the start position of the following block. We then map this value to bins, which are defined with respect to the minimum and maximum distances in the block group. The fraction needed for the entropy is obtained from the sum of all mapped block distance values divided by the total number of bins in that block group. Finally, we substitute each fraction into the entropy formula. We computed the entropy for block groups under four different conditions, according to the conditions of their mapped genes. As Figure 3.2 illustrates, the bound and unbound positions behave almost analogously; however, the median value for the bound case after the temperature shift shows a slight increase.
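A minimal sketch of this attribute, under stated assumptions: distances between consecutive blocks are binned into equally wide bins between the minimum and maximum distance, and q is taken as the fraction of distances falling into each bin. This is the usual histogram estimate and may differ from the exact normalisation used in this work; the function names, the number of bins and the block coordinates are hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits, following formula (3.1); zero probabilities are skipped."""
    return -sum(q * math.log2(q) for q in probabilities if q > 0)


def block_distances(blocks):
    """Distance from the end of each block to the start of the next block.

    blocks is a list of (start, end) pairs sorted by start position.
    """
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


def entropy_of_block_distances(blocks, n_bins=4):
    """Entropy of the binned distances between consecutive blocks of one group."""
    distances = block_distances(blocks)
    if not distances:
        return 0.0
    lo, hi = min(distances), max(distances)
    # Fall back to width 1 when all distances are equal (single occupied bin).
    width = (hi - lo) / n_bins or 1
    counts = [0] * n_bins
    for d in distances:
        index = min(int((d - lo) / width), n_bins - 1)
        counts[index] += 1
    return shannon_entropy(c / len(distances) for c in counts)


# Hypothetical block group: distances alternate between 5 and 20 nucleotides.
blocks = [(0, 10), (15, 25), (45, 55), (60, 70), (90, 100)]
h = entropy_of_block_distances(blocks)
```

In this toy block group the distances fall into two bins with equal frequency, so the entropy is 1 bit, matching the binary example given above.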

[Figure 3.2 (plot): entropy of block distances (y-axis, 0 to 3) for the libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound); conditions: Mut, WT.]

Figure 3.2: Entropy of block distances, calculated per block group. The bound and unbound positions show almost analogous behavior.

3.2.2 Entropy of Blocks End and Start Positions

To assess the entropy of the start and end positions of blocks in block groups, we first count the number of blocks that share an identical start position in each block group. We then divide this count by the number of blocks in the corresponding block group to obtain the fraction for the entropy. Substituting this value into formula (3.1) gives the block start/end entropy for each block group. Finally, we plotted the values for four different cases based on the status of their genes. Figure 3.3 shows that, in general, the entropy of block starts is higher in bound cases than in unbound cases. Moreover, after the temperature shift the entropy of block start positions decreases for the mutant, whereas it increases for the wild type. The same behavior occurs for the entropy of end positions.
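The computation for start and end positions can be sketched as follows. The sketch assumes that the fraction for each distinct position is the number of blocks sharing that position divided by the number of blocks in the block group; `position_entropy` and the block coordinates are hypothetical and serve only as an illustration of formula (3.1).

```python
import math
from collections import Counter


def position_entropy(positions):
    """Entropy in bits of the distribution of block positions in one group."""
    counts = Counter(positions)
    total = len(positions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical block group given as (start, end) pairs.
blocks = [(100, 130), (100, 128), (140, 170), (140, 171)]

start_entropy = position_entropy([start for start, _ in blocks])
end_entropy = position_entropy([end for _, end in blocks])
```

Here two start positions are each shared by two blocks, giving a start entropy of 1 bit, while the four distinct end positions give an end entropy of 2 bits: the more the positions vary within a block group, the higher the entropy.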

[Figure 3.3 (plot): entropy of starting positions (y-axis, 0 to 9) in two panels (bound, unbound); conditions: Mut, WT.]

Figure 3.3: Entropy of block start positions for each block group. In general, the entropy of block starts is higher in bound cases than in unbound cases.

[Figure 3.4 (plot): entropy of ending positions (y-axis, 0 to 9) in two panels (bound, unbound); conditions: Mut, WT.]

Figure 3.4: Entropy of block end positions for each block group. In bound cases, the entropy of block ends is higher than in unbound cases.

3.2.3 Density of Reads

To measure the read density, we used the blockbuster output file to count the reads per block and obtained the fraction by dividing this value by the total number of reads in the corresponding block group. Figure 3.5 shows no significant difference between the unbound cases; at bound positions, however, it shows a higher read density for the mutant. This behavior changes after the temperature shift, when the higher read density belongs to the wild type at bound positions.
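As a small illustration, the read density of each block is its share of the reads in the block group. The sketch takes the per-block read counts as given, since parsing the blockbuster output is not shown here; `read_density` and the counts are hypothetical.

```python
def read_density(reads_per_block):
    """Fraction of the block group's reads that falls into each block."""
    total = sum(reads_per_block)
    if total == 0:
        # An empty block group has no defined density; report zeros.
        return [0.0 for _ in reads_per_block]
    return [count / total for count in reads_per_block]


# Hypothetical read counts for the three blocks of one block group.
densities = read_density([30, 50, 20])
```

The densities of one block group always sum to 1, which makes block groups with very different read totals comparable.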
