Master’s Thesis Characterization of ribosomal footprints with use of graph kernel based approaches Soraya Nikousokhan October 2016 Albert-Ludwigs Universität Freiburg Department of Computer Science Chair of Bioinformatics


  • Master’s Thesis

    Characterization of ribosomal footprints with use of graph kernel based approaches

    Soraya Nikousokhan

    October 2016

    Albert-Ludwigs Universität Freiburg

    Department of Computer Science

    Chair of Bioinformatics

  • Candidate

    Soraya Nikousokhan

    Matr. number

    3555120

    Working period

    14. 04. 2016 – 28. 10. 2016

    Examiners

    Prof. Dr. Rolf Backofen

    Prof. Dr. Wolfgang Hess

    Supervisors

    Dr. Fabrizio Costa

    Pavankumar Videm

  • Acknowledgment

    I would like to express my thanks to my parents for all the love and support that they give me. Above all I want to thank them for teaching me the value of gaining knowledge and learning, and for encouraging me to proceed on this path. I would like to thank Prof. Dr. Rolf Backofen for giving me the opportunity to write my thesis at the Freiburg bioinformatics chair. I would like to thank Dr. Fabrizio Costa for introducing this interesting topic to me and for his help and efficient supervision during the work, as well as Pavankumar Videm for his help and patience and the time he put in to answer my questions. I would also like to thank the members of the Freiburg bioinformatics chair, with special thanks to Teresa Müller, Patrick Wright, Milad Miladi and Torsten Houwaart for their help and advice. I would like to thank my friends Hanna Poelker, Maryam Samani and Kristin Gekeler for their company and support during both the good and the difficult phases of this work.

  • Abstract

    Ribosome profiling (Ribo-seq) is an emerging technique that uses deep sequencing to give new insight into the translation of proteins, from single-codon to genome scale. In contrast to earlier methods such as microarrays and RNA-seq, Ribo-seq considers only the mRNAs that are actively being translated in a cell, that is, those currently providing the information for protein synthesis. This characteristic yields new data focused on the translation level, and the resulting patterns of ribosomal footprints may reveal new aspects of translation. The aim of this work is to classify Ribo-seq profiles according to different conditions and to find clusters among them. This is done with a tool named BlockClust, which is based on a graph kernel method called the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). BlockClust encodes expression profiles as graphs and employs NSPDK to achieve high performance. Although BlockClust was previously applied to clustering non-coding RNAs from their RNA-seq expression profiles, it can also be adapted for clustering and classification tasks on other types of data, e.g. ribosome profiles. We therefore adapted BlockClust by defining new attributes for finding patterns in Ribo-seq data and adding them to the existing set of attributes. Moreover, we optimized it using different parameter sets. We showed that it is possible to employ BlockClust on ribosome profiles and achieved good performance in classifying them.

  • Kurzfassung

    Ribosome profiling is a technique that uses deep sequencing to provide new insights into the translation of proteins, both at the level of single codons and at genomic scale. Compared with earlier methods such as microarrays and RNA-seq, Ribo-seq considers only active mRNAs and therefore provides information about protein synthesis. The Ribo-seq method thus yields new data on the translation phase, the ribosomal footprints, which reveal new aspects in the field of translation. The aim of this work is to find meaningful clusters based on Ribo-seq expression profiles. For this purpose the tool BlockClust is used, which is based on a graph kernel method called NSPDK. BlockClust encodes expression profiles as graphs and applies the NSPDK method to achieve high performance. BlockClust is mainly known as a clustering method that groups non-coding RNAs into clusters based on their expression profiles. It can, however, also be applied to other kinds of expression data, for example ribosome profiling data. Adding further attributes makes it possible to apply BlockClust to ribosome profiles. The optimized parameters are added to the earlier attribute set. Furthermore, we carried out an optimization over different parameter sets. In this work we show that ribosome profiles can be classified with good performance.

  • Contents

    Abstract
    Kurzfassung
    List of Tables

    1 Introduction
    1.1 Biology
    1.1.1 DNA
    1.1.2 RNA
    1.1.3 Protein
    1.1.4 Gene
    1.1.5 Transcription
    1.1.6 Translation
    1.1.7 Gene expression
    1.1.8 Coding and Non-coding Regions
    1.1.9 Next generation sequencing (NGS)
    1.1.10 Ribosome profiling
    1.1.11 Cross linking immunoprecipitation (CLIP)
    1.1.12 Wild type vs mutant
    1.2 Machine learning
    1.2.1 Machine learning and bioinformatics
    1.2.2 Supervised learning
    1.2.3 Unsupervised learning
    1.2.4 Classification
    1.2.5 Clustering
    1.3 Contribution

    2 Data
    2.1 Extracting peaks from Ribo-seq data
    2.2 Counting amount of reads per peak for each gene
    2.3 Peak frequencies in wild type vs mutant
    2.4 Abundance of reads in peaks vs. length of chromosomes

    3 Methods
    3.1 BlockClust: efficient clustering and classification of non-coding RNAs
    3.2 Extracting New Attributes with respect to Ribo-seq data
    3.2.1 Entropy of Block Distances
    3.2.2 Entropy of Blocks End and Start Positions
    3.2.3 Density of Reads
    3.2.4 GC-ratio
    3.2.5 Number of Reads for GC positions
    3.3 Neighborhood subgraph pairwise distance kernel (NSPDK)
    3.4 BlockClust adaptation with Ribo-seq Data

    4 Results and Discussion
    4.1 Blockbuster benchmarking
    4.2 BlockClust similarity score assessment
    4.2.1 Confusion Matrix
    4.2.2 ROC
    4.2.3 Combinatorial feature similarity score assessment
    4.3 BlockClust optimization
    4.4 BlockClust experiment

    5 Conclusion
    5.1 Future work

    A Appendix (Attributes)

  • List of Figures

    1.1 Transcription and translation
    1.2 mRNA coding and non-coding regions
    1.3 Ribosome profiling-analysis of ribosome occupancy data
    2.1 Highest Peak per Chromosome in Ribo-seq libraries
    2.2 Number of Reads Per Peak for Each Gene
    2.3 Ribosome Peak Freq. Mutant vs. Wild type
    2.4 No. Peaks reads vs. Chromosome Length
    3.1 Read profile encoding
    3.2 Entropy of block distances
    3.3 Entropy of blocks start positions
    3.4 Entropy of blocks end positions
    3.5 Reads density per block group
    3.6 GC-ratio per block group
    3.7 Number of reads for GC positions
    3.8 BlockClust pipeline
    4.1 Blockbuster benchmarking results
    4.2 Confusion Matrix
    4.3 Combinatorial feature similarity ROC measures for unbound cases
    4.4 Combinatorial feature similarity ROC measures for bound cases
    4.5 Clusters of Ribosome profiles based on four different conditions

  • List of Tables

    4.1 Line search optimization results
    4.2 Classification performance
    4.3 Clustering performance

  • Chapter 1 Introduction

    Recent advances in next generation sequencing methods provide inexpensive and fast genome-wide sequencing of reads. This has opened the door to novel methods and tools for retrieving new information from the sequencing results. The aim of this work is to classify and cluster ribosome profiles with a tool called BlockClust.

    Chapter 1 introduces the biological and machine learning background concepts and methods. Chapter 2 describes the data analysis used to check whether the data are suitable for the clustering task. Chapter 3 discusses the method and the changes made to adapt BlockClust to our data. Chapter 4 presents and discusses the results, and Chapter 5 concludes and outlines future work.

    The following sections explain the biological and machine learning concepts that are used in the later chapters.

    1.1 Biology

    This section introduces the biological terms and methods that are mentioned in later chapters.

    1.1.1 DNA

    Deoxyribonucleic acid (DNA) is a double-stranded molecule in the form of a double helix. Each strand is a linear molecule whose monomer building blocks are nucleotides, which are incorporated from nucleoside triphosphates. The two strands that form DNA are joined by hydrogen bonds between complementary bases: adenine and thymine form two hydrogen bonds, and guanine and cytosine form three [3].

    1.1.2 RNA

    Ribonucleic acid (RNA) is usually a single-stranded chain found in all living cells and many viruses. One difference between DNA and RNA is the sugar molecule (ribose instead of deoxyribose). The second difference is that RNA molecules use the base uracil instead of thymine. Formerly, RNA was seen mainly as a messenger that carries instructions from DNA for controlling the synthesis of proteins. It is now known that RNAs also perform important regulatory tasks inside the cell. RNAs of this kind are called non-coding RNAs (ncRNAs) [6].

    1.1.3 Protein

    Proteins are large biological macromolecules with a very diverse range of tasks in living organisms, such as catalyzing metabolic reactions, replicating DNA, responding to stimuli and transporting molecules from one location to another. A protein is made of one or more long chains of amino acids that have been translated from the nucleotide sequences of the corresponding genes. Different amino acid sequences fold differently, which gives rise to different tertiary structures and hence different functions [17].

    1.1.4 Gene

    A gene is a region or segment of DNA that is the basic cause of different hereditary characteristics.

    1.1.5 Transcription

    Transcription is the process of reading the information in DNA and transferring it to RNA. The RNA is transported from the nucleus into the cytoplasm and, in the case of mRNA, further translated into protein. To initiate transcription, several proteins need to attach at a specific site of the DNA sequence. This site is called a promoter or enhancer, and the components that bind are an enzyme called RNA polymerase and transcription factors (auxiliary proteins). Together they are called the transcription initiation complex. Once the complex has assembled, RNA polymerase begins mRNA synthesis by elongating the mRNA with bases complementary to the original DNA strand. The process ends as soon as the whole strand has been synthesized [1].
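
    The base-pairing rule behind mRNA synthesis can be illustrated with a minimal sketch. It assumes the DNA template strand is given in 3’ to 5’ direction, so the returned mRNA reads 5’ to 3’; the function name transcribe is hypothetical and chosen only for this illustration.

```python
# Pairing of DNA template bases with the RNA bases added to the growing mRNA.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') complementary to a DNA template strand given 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


print(transcribe("TACAAAATT"))  # AUGUUUUAA
```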

    1.1.6 Translation

    Translation is the process of making proteins by decoding the mRNA information into amino acids, which are the building blocks of proteins. For this purpose the mRNA is read in three-letter combinations of nucleotides (codons), each of which is translated by the ribosome into an amino acid. The ribosome is a complex of ribosomal RNA and proteins and consists of one small and one large subunit. Translation comprises three phases: initiation, elongation and termination. During initiation the small subunit of the ribosome attaches to recognition elements of the mRNA sequence, after which a transfer RNA is joined to the start codon AUG, which codes for the amino acid methionine. The large subunit then binds to this complex and elongation can begin. The elongation phase continues until a stop codon is reached. The stop codons are UAA, UAG and UGA. Finally the translation complex disassembles [7].
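
    The reading of codons from the start codon to a stop codon can be sketched as follows. This is an illustration only: the codon table is deliberately partial (the standard genetic code has 64 codons), and the function name translate and the placeholder letter X are assumptions made for this example.

```python
# Partial codon table, enough for the example below (one-letter amino acid codes).
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GGU": "G", "GGC": "G",
    "GCU": "A", "GCC": "A", "AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # X marks codons that are missing from the partial table above.
        peptide.append(CODON_TABLE.get(codon, "X"))
    return "".join(peptide)


print(translate("GGAUGUUUGGCAAAUAAGG"))  # MFGK
```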


    Figure 1.1: Transcription and translation are the two biological processes of protein production. Transcription transfers DNA information to mRNAs. After the mRNAs have left the cell nucleus, translation starts; it comprises three phases: initiation, elongation and termination. The final products are proteins. Figure retrieved from [1].

    1.1.7 Gene expression

    Gene expression is the process by which the information of a gene is turned into a gene product, often a protein, through several steps: transcription, splicing, translation and post-translational modification [5].

    1.1.8 Coding and Non-coding Regions

    The CDS (coding DNA sequence) is the region that codes for a protein. In an mRNA the coding part is flanked by the five prime untranslated region (5’ UTR) and the three prime untranslated region (3’ UTR). The coding part consists of codons that are translated into protein, while the non-coding parts help translation to be initiated and completed.

    Figure 1.2: The coding and non-coding parts (5’ and 3’ UTRs) of an mRNA. The coding part consists of codons that are translated into protein, while the non-coding parts help translation to be initiated and completed. Figure taken from [4].

    1.1.9 Next generation sequencing (NGS)

    NGS approaches opened a new era in which a whole genome can be sequenced more easily and cheaply, yet accurately, with commercially available methods. The earlier method, Sanger sequencing, is costly and not as fast as next generation sequencing. Whole-genome sequencing enables large-scale comparative and evolutionary studies and gives insight into how changes in the genetic code influence diseases. NGS technologies comprise several steps: template preparation, sequencing and imaging, and data analysis. Different NGS technologies use different protocols for each step. Current NGS technologies are Roche/454, Illumina/Solexa, Life/APG and Helicos BioSciences [24].

    1.1.10 Ribosome profiling

    Ribosome profiling is a novel sequencing technique that is becoming increasingly popular for extracting new information about mRNA translation. In contrast to methods such as microarrays [32] and RNA-seq [33], which measure mRNA abundance, ribosome profiling considers only the mRNAs that actively take part in translation. Ribo-seq therefore provides information on proteins, which are among the final products of gene expression, and offers new insight into the identity and amount of the proteins synthesized from the transcriptome. The method counts the ribosome footprints that accumulate at each position of a transcript. As figure 1.3 shows, the abundance of a protein is directly related to the ribosome density in the protein-coding region of its transcript. The total number of ribosomes engaged in synthesizing a protein can be derived from the ribosome footprints. Both the amount of an mRNA and the number of ribosomes on it directly influence the footprint counts, so a change in either one changes the number of footprints observed [20].
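
    Counting footprints per transcript position can be sketched as a simple coverage computation. The sketch assumes footprints are given as 0-based (start, end) intervals with exclusive end on a single transcript; the function name footprint_coverage and the example coordinates are hypothetical.

```python
def footprint_coverage(transcript_length, footprints):
    """Count, for every transcript position, how many footprints cover it.

    footprints: iterable of (start, end) pairs, 0-based, end exclusive.
    """
    coverage = [0] * transcript_length
    for start, end in footprints:
        # Clip each footprint to the transcript boundaries before counting.
        for pos in range(max(start, 0), min(end, transcript_length)):
            coverage[pos] += 1
    return coverage


print(footprint_coverage(6, [(0, 3), (2, 5)]))  # [1, 1, 2, 1, 1, 0]
```

    More mRNA copies or more ribosomes per mRNA both add intervals to the input, which is why either change raises the counts.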


    Figure 1.3: (a) The amount of ribosome footprints with respect to ribosome positions in vivo. In ribosome profiling, nuclease-protected footprints are used to show the positions of the ribosomes attached to the many mRNAs in a cell. Ribosome profiling data can therefore display the abundance of ribosomes at each position of a transcript. The thick line shows the coding regions. (b) Ribosome profiling data represent the amount of protein synthesized, owing to the relation between the abundance of a synthesized protein and the ribosome density in the translated region of its transcript. Ribosome profiling captures all ribosomes involved in translating a protein. Hence a change in either the mRNA abundance in polysomes or the number of ribosomes can change the ribosome profiling signal. Figure retrieved from [20].


    1.1.11 Cross linking immunoprecipitation (CLIP)

    CLIP is a UV cross-linking method in molecular biology that is widely used to study the interaction of proteins with RNAs. CLIP-based methods can give a better understanding of translation, for example by mapping the RNA binding sites of a specific protein genome-wide [19].

    1.1.12 Wild type vs mutant

    The wild type of a species is its most common phenotype found in nature. A phenotype is the set of observable characteristics of an organism. Genes are the major determinants of the functional behavior and development of an organism, and changes in DNA, RNA or protein sequences cause deviations from the natural form. An organism in which at least one gene of the wild type is mutated is called a mutant. Such changes can be fundamental: a mutation at the DNA level can alter all copies of the translated protein, which can decrease the expression of the protein. A change at the level of RNA or protein synthesis is considered less important, since it affects only one of the many copies of an RNA or protein [30].

    1.2 Machine learning

    This section contains information about machine learning methods and concepts that will be used in the following chapters.

    1.2.1 Machine learning and bioinformatics

    Bioinformatics is an emerging interdisciplinary field of study. It draws on computer science, mathematics, statistics and engineering to analyze biological data. Machine learning, in turn, aims to find methods that learn from data and make sound predictions from it. In general, the goal is to fit a model to an available set of samples and to make decisions based on it. Recent developments in bioinformatics methods have produced huge amounts of data, and machine learning approaches are useful for processing this data and discovering new knowledge in it. They try to find computational models that retrieve novel information from data. Modeling methods in machine learning include supervised classification, clustering, probabilistic graphical models, optimization and heuristics. Fields of bioinformatics that apply machine learning include genomics, proteomics, microarrays, systems biology, evolution and text mining. In genomics, machine learning methods focus on finding the number, location and structure of genes. In proteomics, most of the effort goes into protein structure prediction, which can be a very complicated combinatorial task owing to the intrinsic complexity of protein molecules. Systems biology models the life processes in the cell, and in evolution machine learning can be applied to reconstruct phylogenetic trees. All these methods and applications have produced a large body of publications, and text-mining approaches make it feasible to search it for results related to different topics in bioinformatics research [22].

    1.2.2 Supervised learning

    Supervised learning is the task of inferring a model from available labeled examples, known as training examples. Each training example is a pair consisting of an input object and a desired output value. Supervised learning fits a function to the training data such that the function can predict the output for new instances [25].

    1.2.3 Unsupervised learning

    Unsupervised learning methods aim to extract hidden structure from unlabeled data. Unlike in supervised learning and reinforcement learning, the given data have no labels, so there is no error or reward signal for evaluating the results. Unsupervised learning is closely related to density estimation in statistics. It comprises methods for summarizing the data and extracting its key features [11].


    1.2.4 Classification

    Classification is a supervised learning task. It starts from a set of labeled data, called the training set, which contains pairs of instances and their desired values; the values can be nominal, categorical or numerical. By comparing instances according to their similarity or dissimilarity, a classifier tries to determine the desired values, or classes, of new observations [21].

    1.2.5 Clustering

    Clustering is an unsupervised learning task that aims to build groups from a set of unlabeled data. It forms groups, or clusters, based on the similarity of their members: objects that are similar to each other are placed in the same cluster and are less similar to objects in other clusters. Clustering is an iterative process of adjusting data preprocessing and model parameters until the desired properties are found, a matter of trial and error for discovering the underlying knowledge in the data [11].

    1.3 Contribution

    In this work we seek a solution for clustering and classifying Ribo-seq data. Similar work has been done for RNA-seq expression profiles with the BlockClust [8] tool, which finds clusters of non-coding RNAs via their RNA-seq expression profiles. We define a new but similar task for Ribo-seq data: we classify gene annotations based on their corresponding ribosomal footprint signals, and we assign clusters to transcripts under different conditions of the ribosome profiles. To reach these goals, we first analyze the data to see whether significant similarities and dissimilarities can be expected in the profiling data under the various conditions. Second, having gained an overview of the data, we define a new set of attributes tailored to Ribo-seq data to increase the discriminative power of the BlockClust algorithm for this kind of data. Because the mRNA read profiles are longer than those of non-coding RNAs, the running time of BlockClust increases drastically. The previous version of the tool used grid search to optimize its parameters, which is no longer feasible given the running time on Ribo-seq data. Instead of grid search, we therefore employ a line search algorithm to optimize the parameters, e.g. radius and number of bins, and so reduce the computing time. Finally, we apply the optimized BlockClust to Ribo-seq expression profiles and build clusters based on the different conditions of the data, e.g. wild type or mutant.
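
    The saving of a line search over a grid search can be illustrated with a minimal sketch. The text above does not spell out the exact procedure, so the sketch assumes a coordinate-wise search that tunes one parameter at a time while the others stay fixed. The names line_search and toy_score, the candidate values and the scoring function are all hypothetical; in practice the score would come from running and evaluating BlockClust.

```python
def line_search(score, grid, start):
    """Optimize one parameter at a time, holding the others fixed.

    score: callable taking a dict of parameter values, higher is better.
    grid:  dict mapping parameter name -> list of candidate values.
    start: dict with an initial value for every parameter.
    """
    best = dict(start)
    best_score = score(best)
    evaluations = 1
    for name, candidates in grid.items():
        current = best[name]  # this combination has already been scored
        for value in candidates:
            if value == current:
                continue
            trial = dict(best, **{name: value})
            trial_score = score(trial)
            evaluations += 1
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score, evaluations


def toy_score(params):
    # Hypothetical stand-in for a clustering quality measure; peaks at radius=2, bins=15.
    return -((params["radius"] - 2) ** 2) - ((params["bins"] - 15) ** 2)


grid = {"radius": [1, 2, 3], "bins": [5, 10, 15, 20]}
best, best_score, evaluations = line_search(toy_score, grid, {"radius": 1, "bins": 5})
# A full grid search would score 3 * 4 = 12 combinations; this run scores 6.
print(best, evaluations)
```

    The number of evaluations grows with the sum of the candidate list lengths rather than their product. The price is that interactions between parameters can be missed, so the result is not guaranteed to match the grid-search optimum.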

    The next chapter describes the case study for ribosome profiles and introduces the data we use.

  • Chapter 2 Data

    Before applying the BlockClust [8] tool to Ribo-seq data, we survey the data to understand how it behaves under different circumstances and to expose its various aspects. Do the data vary under different biological conditions? Do the samples show the expected biological behavior? What are the prerequisites for clustering these data? These are the questions we aim to answer in this chapter. To check the characteristics of Ribo-seq data, we work on a case study that contains three biological replicates under various conditions. The available data are Ribo-seq profiles for the different conditions. The mutant library is obtained by impairing the Ded1 protein, a critical factor for translation initiation in Saccharomyces cerevisiae [16]; the exact function of this enzyme, however, is not yet clear. Ded1p is sensitive to temperature. We have cases in which Ded1 is impaired before and after a slow elevation of temperature. We have four Ribo-seq libraries, one for each of the following conditions:

    • Wild type without temperature shift (WT-t0)

    • Wild type with temperature shift (WT-t5)

    • Mutant (Ded1 is impaired) without temperature shift (Mut-t0)

    • Mutant with temperature shift (Mut-t5)

    In addition to these libraries, we have the positions of the Ded1 binding sites on mRNAs, which were extracted from iCLIP data. Comparing the wild-type and mutant ribosome profiling data in the specific regions of the 5’ and 3’ UTRs (the binding sites of Ded1p) indicates an accumulation of ribosomes in these regions. Hence, after Ded1 dysfunction the abundance of ribosomes increases at positions that are critical for translation initiation (5’ UTR, 3’ UTR). From this observation we can assume that improper functioning of Ded1p is a reason for the accumulation of ribosome footprints. In our task we want to examine the Ribo-seq data and cluster the gene annotations based on their Ribo-seq signals. For clustering, it is important to find significant differences in the data: the different ribosome profiling libraries should be distinguishable according to their conditions.

    We carried out several analyses on the data before applying BlockClust [8] to the four libraries. First, we obtain an overview of the data by calling peaks [18] for each chromosome in each library. Next, we check the ribosome stacking at binding sites by counting the number of reads per peak for each gene. We then compare the number of peaks per gene for the wild-type and mutant libraries. Finally, we compare the amount of reads in the peaks of each chromosome with the chromosome length.

    2.1 Extracting peaks from Ribo-seq data

    Here we take the abundance of reads for the highest peak in each chromosome and display the values in increasing order. This shows how the abundance of reads changes across the chromosomes of the whole genome and reveals the basic chromosome-level variation in our data. The peaks are obtained with the method of [18], which we call the peak caller. It finds peaks from the blocks of an expression profile, which are extracted with the blockbuster [15] tool. It takes the block with the highest abundance of reads and centers a Gaussian function on it. It then extends the domain of the peak by checking for overlapping blocks that cover more than 50 percent of the highest block, and extends the boundaries of the Gaussian function to the borders of these blocks. The procedure continues with the remaining blocks by choosing the next block with the highest amount of reads. The original data are BAM files. The peak caller expects its input in SAM format, so we convert BAM to SAM with the samtools [23] view command. Applying the peak caller to the SAM files yields output in GFF format, which contains the positions of the peaks and the amount of reads for each peak.
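
    The greedy block-merging step described above can be sketched as follows. This is not the implementation of [18]: the Gaussian fitting is left out, summing the reads of merged blocks is an assumption made for this illustration, and the function name call_peaks and the example blocks are hypothetical.

```python
def call_peaks(blocks, min_overlap=0.5):
    """Greedy peak calling on (start, end, reads) blocks; end is exclusive.

    Repeatedly takes the remaining block with the most reads as a seed and
    merges every remaining block that overlaps more than min_overlap of the seed.
    """
    remaining = sorted(blocks, key=lambda b: b[2], reverse=True)
    peaks = []
    while remaining:
        seed = remaining.pop(0)  # block with the highest read count left
        seed_len = seed[1] - seed[0]
        start, end, reads = seed
        kept = []
        for block in remaining:
            overlap = min(seed[1], block[1]) - max(seed[0], block[0])
            if overlap > min_overlap * seed_len:
                # Extend the peak boundaries to the borders of the merged block.
                start, end = min(start, block[0]), max(end, block[1])
                reads += block[2]  # assumption: peak reads = sum over merged blocks
            else:
                kept.append(block)
        remaining = kept
        peaks.append((start, end, reads))
    return peaks


blocks = [(100, 130, 50), (110, 140, 20), (128, 160, 5), (300, 330, 10)]
print(call_peaks(blocks))
# [(100, 140, 70), (300, 330, 10), (128, 160, 5)]
```

    In the example, the block at 110 to 140 covers 20 of the 30 seed positions and is merged, while the block at 128 to 160 covers only 2 positions and later forms its own peak.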

    The bar chart in figure 2.1 illustrates the highest abundance of reads per peak across the chromosomes of yeast for three replicates under two conditions. In replicate one (R1) and replicate three (R3) the mutant generally shows a higher amount of reads than the wild type before the temperature elevation in all chromosomes; this behavior reverses after the temperature is increased. Replicate two (R2), in contrast, does not show the expected biological behavior.

    2.2 Counting amount of reads per peak for each gene

    After obtaining the peak coordinates for each chromosome with peak caller,
    we split the gene positions into two subsets: genes that overlap binding
    sites of Ded1, and genes that lie in non-binding regions of Ded1. We call
    them the bound and unbound sets, respectively. The binding positions had
    already been extracted from iCLIP data. To obtain the two gene subsets, we
    intersect the gene positions once with the binding sites and once with the
    non-binding sites, using intersectBed [29]. To obtain the final result, the
    number of reads per peak for each gene, we then intersect the bound and
    unbound gene positions with the peak coordinates extracted by peak caller.
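    The two intersection steps above can be illustrated with a small pure
    Python sketch. It does not call intersectBed; it only mimics the interval
    overlap test on in-memory tuples. The function names, the gene names, the
    coordinates and the half-open interval convention are all assumptions, and
    the unbound set is simply taken to be every gene without an overlapping
    binding site.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def split_genes(genes, binding_sites):
    """Split genes into bound and unbound sets.

    genes: list of (chrom, start, end, name)
    binding_sites: list of (chrom, start, end)
    """
    sites_by_chrom = defaultdict(list)
    for chrom, start, end in binding_sites:
        sites_by_chrom[chrom].append((start, end))
    bound, unbound = [], []
    for gene in genes:
        chrom, start, end, _name = gene
        hit = any(overlaps(start, end, s, e) for s, e in sites_by_chrom[chrom])
        (bound if hit else unbound).append(gene)
    return bound, unbound


def reads_per_peak_by_gene(genes, peaks):
    """Collect the read counts of all peaks overlapping each gene.

    peaks: list of (chrom, start, end, reads)
    """
    result = {}
    for chrom, start, end, name in genes:
        result[name] = [reads for p_chrom, p_start, p_end, reads in peaks
                        if p_chrom == chrom
                        and overlaps(start, end, p_start, p_end)]
    return result


# Hypothetical genes, binding sites and peaks, for illustration only.
genes = [("chrI", 100, 500, "geneA"), ("chrI", 800, 1200, "geneB"),
         ("chrII", 100, 400, "geneC")]
sites = [("chrI", 450, 470), ("chrII", 500, 520)]
peaks = [("chrI", 120, 150, 70), ("chrI", 480, 520, 30),
         ("chrI", 900, 930, 12), ("chrII", 50, 90, 5)]

bound, unbound = split_genes(genes, sites)
print(reads_per_peak_by_gene(bound, peaks))
print(reads_per_peak_by_gene(unbound, peaks))
```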

    To better understand our data and to get a general view of our libraries,
    we applied peak caller to them. We considered the genes that contain
    binding sites of Ded1 and the genes at non-binding sites, and calculated
    the read abundance of each peak in these two sets.

    Figure 2.1: General overview of the Ribo-seq libraries. It displays the
    highest read abundance per peak for the different chromosomes of yeast, for
    three replicates under two conditions. The chromosomes are sorted by
    increasing number of reads in the wild type. In R1 and R3 we see more reads
    for the mutant than for the wild type in all chromosomes before the
    temperature elevation. This behavior reverses after the temperature is
    increased, which may be explained by the fact that Ded1 is not functioning
    after the temperature elevation.

    Figure 2.2 shows that for replicate 1 without temperature shift (t0-R1),
    comparing the bound and unbound cases, there is a significant increase in
    the number of reads at binding positions for the mutant versus the wild
    type. This can be explained by the fact that Ded1 has a significant
    influence on translation initiation in yeast. This behavior changes after
    the temperature shift (t5-R1): here the number of reads for the wild type
    grows more at binding sites than at non-binding sites, and the number of
    reads in the wild type is generally higher than in the mutant. This may
    occur because the Ded1 protein is sensitive to temperature, so that an
    increase in temperature causes protein dysfunction.

    Figure 2.2 also shows that replicates 1 and 3 behave similarly, whereas
    replicate 2 behaves completely differently. Therefore, we assumed that
    there was some problem in preparing the replicate 2 libraries.


  • 2.3. Peak frequencies in wild type vs mutant

    [Figure 2.2 plot: amount of reads per peak (y-axis, 0 to 2500) for the
    libraries t0_R1, t0_R2, t0_R3, t5_R1, t5_R2 and t5_R3 (x-axis), in two
    panels (bound, unbound), for the conditions Mut and WT.]

    Figure 2.2: For t0-R1 there is a significant increase in the number of
    reads at binding positions for mutant vs. wild type, which can be explained
    by the fact that Ded1 has a significant influence on translation initiation
    in yeast. This behavior changes after the temperature shift (t5-R1): in
    this case the number of reads for the wild type grows more at binding sites
    than at non-binding sites. The figure also reveals that R2 behaves
    differently from the two other replicates.

    2.3 Peak frequencies in wild type vs mutant

    Calling peaks on our Ribo-seq libraries with peak caller gives us the most
    significant peaks in these libraries. To investigate our data further and
    to check the consistency of our results with the previous ones, in this
    part we count the consensus peaks for each gene position in our Ribo-seq
    libraries (WT-t0, WT-t5, Mut-t0 and Mut-t5). We thus compute the number of
    peaks for mutant and wild type, before and after increasing the
    temperature, for each gene position in two groups: genes at binding sites
    of Ded1 and genes at non-binding sites of Ded1.

    First we build two BED files: one for the gene positions located at binding
    sites and one for the genes at non-binding sites. Next, by intersecting the
    gene annotations with the peak positions using intersectBed [29], we count
    the peaks that overlap each type of gene. This yields two groups of peaks:
    those associated with genes at binding positions and those associated with
    genes at non-binding positions. The final result is shown in figure 2.3. It
    depicts the linear correlation of peak abundance between the wild type and
    mutant libraries for each gene position, and it is consistent with the
    former plots across our libraries and conditions. Moreover, there is an
    obvious tendency toward the mutant axis for the unbound genes, which can be
    explained by stacking of ribosomal footprints and thus a higher number of
    reads in the mutant condition.
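    As a rough illustration of the counting and comparison step above, the
    following Python sketch counts peaks per gene from a list of gene/peak
    overlaps and computes a Pearson correlation between two libraries. The
    choice of the Pearson coefficient, the function names and the toy input
    are assumptions; the text only states that a linear correlation is
    depicted.

```python
import math
from collections import Counter


def count_peaks_per_gene(overlap_pairs):
    """overlap_pairs: iterable of (gene_name, peak_id), one per overlap."""
    return Counter(gene for gene, _peak in overlap_pairs)


def paired_counts(counts_a, counts_b):
    """Align two per-gene counts; genes missing in one library count as 0."""
    genes = sorted(set(counts_a) | set(counts_b))
    return ([counts_a.get(g, 0) for g in genes],
            [counts_b.get(g, 0) for g in genes])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene peak counts for a wild type and a mutant library.
wild_type = {"geneA": 2, "geneB": 4, "geneC": 6}
mutant = {"geneA": 3, "geneB": 6, "geneC": 9}
xs, ys = paired_counts(wild_type, mutant)
print(pearson(xs, ys))
```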

    Figure 2.3: Linear correlation of peak abundance in wild type and mutant
    libraries per gene position. There is an obvious tendency toward the mutant
    axis before the temperature shift, which changes after the temperature
    shift and can be explained by stacking of ribosomal footprints, since
    initiation is halted when Ded1 is impaired. The binding and non-binding
    positions behave almost the same.


  • 2.4. Abundance of reads in peaks vs. length of chromosomes

    2.4 Abundance of reads in peaks vs. length of chromosomes

    This part examines the relation between the chromosome lengths in yeast and
    the read abundance of the consensus peaks in each chromosome, calculated
    with peak caller. It gives insight into the density of reads in the
    different chromosomes of yeast in our Ribo-seq profiles, which is
    illustrated in figure 2.4. The figure shows that more reads can be expected
    as the chromosome length increases, although some shorter chromosomes have
    a higher read abundance than longer ones. Furthermore, the results of
    replicates 1 and 3 are correlated, whereas replicate 2 does not behave like
    the other two replicates.
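    The observation that some shorter chromosomes carry more reads than longer
    ones can be checked with a few lines of Python. The sketch below lists all
    chromosome pairs in which the shorter chromosome has the higher read
    abundance. The function name, the chromosome names and the numbers are
    made up for illustration and are not taken from the libraries.

```python
from itertools import combinations


def shorter_but_more_reads(chromosomes):
    """Return (shorter, longer) name pairs where the shorter chromosome
    has more reads than the longer one.

    chromosomes: list of (name, length, reads)
    """
    pairs = []
    for a, b in combinations(chromosomes, 2):
        shorter, longer = sorted((a, b), key=lambda c: c[1])
        if shorter[1] < longer[1] and shorter[2] > longer[2]:
            pairs.append((shorter[0], longer[0]))
    return pairs


# Hypothetical (name, length, reads) triples.
toy = [("chrA", 200000, 300), ("chrB", 500000, 900), ("chrC", 800000, 700)]
print(shorter_but_more_reads(toy))
```

    An empty result would mean that the read abundance increases strictly with
    chromosome length; every listed pair is an exception to that trend.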

    [Figure 2.4 plot: reads (y-axis, 0 to 2000) against chromosome length
    (x-axis, 400000 to 1200000), in four panels (Mut_t0, Mut_t5, WT_t0, WT_t5),
    with points for the replicates R1, R2 and R3.]

    Figure 2.4: Density of reads in the different chromosomes of yeast in our
    Ribo-seq profiles. More reads can be expected as the chromosome length
    increases, although some shorter chromosomes have a higher read abundance
    than longer ones.

    Overall, in this chapter we processed our Ribo-seq libraries under
    different conditions. To illuminate similarities and dissimilarities, we
    chose four approaches that reveal different aspects of the Ribo-seq
    libraries at hand: extracting peaks from the Ribo-seq data, counting the
    reads per peak for each gene, comparing peak frequencies between wild type
    and mutant, and comparing read abundance with chromosome length. In
    general, these analyses revealed that the read amounts in the expression
    profiles of the mutant libraries and of the wild type with temperature
    shift are coherent. Because the protein is knocked out there, translation
    is halted, and an elevation in ribosome footprints is expected at the
    binding positions of the protein in the mutant and in the wild type with
    temperature shift. We do not observe this behavior in the wild type without
    temperature increase. Although replicates one and three show similar,
    biologically relevant behavior, we cannot observe it in replicate two; we
    therefore continue our procedure without replicate two. There are
    significant differences between the ribosome expression profiles under
    different conditions, so one can try to cluster such profiles by condition,
    such as wild type versus mutant, or with versus without temperature
    elevation. In the next chapter we explain how BlockClust is employed to
    cluster such profiles under the different conditions of this case study. We
    adapt BlockClust, which was previously applied to RNA-seq expression
    profiles, to Ribo-seq data.


  • Chapter 3 Methods

    With recent developments in whole-genome sequencing and the application of
    deep sequencing techniques such as Ribo-seq, large amounts of new data have
    become available, from which useful information can be extracted with
    machine learning methods. This work aims to find classes and clusters
    according to similar processing patterns of ribosome accumulation in
    Ribo-seq data, using a fast graph kernel technique (NSPDK) [14]. Similar
    work has already been done for RNA-seq data with the BlockClust tool, which
    is able to distinguish different ncRNA groups. To achieve high performance,
    BlockClust uses two main steps: encoding the expression profiles as a graph
    and building combinatorial features based on that graph. To apply this tool
    to Ribo-seq data, it needs to be adapted by adding new features and
    optimizing its pipeline for Ribo-seq data.

    3.1 BlockClust: efficient clustering and classification of non-coding RNAs

    Studies on genome-wide sequencing data revealed that most DNA regions
    encode information for non-coding RNAs (ncRNAs) [26]. They play an
    important role in cellular regulation, although the functional annotation
    of a large part of them is not yet clear. Several methods address this
    issue, e.g. clustering ncRNAs based on their sequence or secondary
    structure [34],[31]. It is also possible to assign classes based on
    patterns in the expression profiles of ncRNAs. These patterns depend on the
    functional molecule and the 3D structure.

  • 3.1. BlockClust: efficient clustering and classification of non-coding RNAs

    BlockClust therefore follows the latter approach for grouping the different
    classes of ncRNAs. It tries to assign clusters to non-coding RNA classes by
    applying machine learning methods to transcript processing patterns in
    RNA-seq data, and it is robust to changes of cell line, organism and
    sequencing machine. The two main parts of BlockClust that allow clustering
    are:

    1. Expression profile encoding
    2. Combinatorial feature generation

    The information on the mapped reads is given in SAM (sequence alignment
    map) or BAM (binary alignment map) format. The mapping tells where the
    reads are aligned on the reference genome. For simplicity and to increase
    computational speed, BlockClust divides the expression profiles into sets
    of block groups and blocks using the blockbuster tool [15], which fits
    Gaussian functions to the profile data. Blockbuster assigns a Gaussian
    function to each read, takes the consensus Gaussian, and then assigns the
    reads to a block by finding the highest peak and considering the standard
    deviation. Each block consists of several reads, and each block group has a
    set of corresponding blocks. To find patterns in the expression profiles,
    BlockClust extracts attributes for blocks (e.g. number of multi-mapped
    reads, entropy of read expressions, minimum read length), block groups
    (e.g. entropy of read starts, entropy of read ends, entropy of read
    lengths) and block edges (e.g. contiguity and difference in median read
    expressions).

    It then discretizes the attribute values into nominal values using an
    equal-frequency algorithm. Next, BlockClust encodes all the information
    into a graph. The numbers of blocks and block groups are not identical
    across instances, so representing them as vectors is not possible; a graph
    representation of the data solves this. The graph represents the attribute
    values for the expression data of one gene and consists of two components:
    one for the block group attributes and the other for the block and block
    edge attributes. In the first component the position of a node indicates
    the attribute type, and in the second component the order of the nodes
    represents the order of the block positions obtained from the blockbuster
    tool. This sequence of nodes is called the backbone. The attribute values
    of each block are attached to the backbone node of that block, so the
    sequence of backbone nodes mirrors the sequence of blocks constructed by
    blockbuster. As soon as the graphs have been derived from the expression
    profiles, BlockClust can produce combinatorial features using the
    Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). These features have
    been employed to efficiently cluster ncRNAs. Moreover, BlockClust
    implements the concept of a viewpoint in the feature generation process.
    This is extra information used to extract subgraphs in such a way that at
    least one of the subgraph roots lies on the backbone. This helps to build
    features from a growing number of attributes while considering a much
    smaller subset of attribute combinations. Finally, BlockClust uses the
    similarity notion of NSPDK and the Markov Cluster Process [13] to build
    ncRNA clusters.
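    The discretization and graph encoding steps described above can be
    sketched in Python as follows. This is an illustration under stated
    assumptions and not the BlockClust implementation: the function names, the
    attribute names, the node identifiers and the chain layout of the block
    group component are hypothetical, block edge attributes are left out, and
    no NSPDK features are computed.

```python
def equal_frequency_edges(values, n_bins):
    """Bin edges chosen so that each bin holds roughly the same number
    of the given values (equal-frequency binning)."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def discretize(value, edges):
    """Index of the bin that `value` falls into."""
    return sum(value >= edge for edge in edges)


def encode_block_group(group_attrs, blocks):
    """Encode one block group as a labeled graph (nodes, edges).

    group_attrs: dict of attribute name -> nominal value
    blocks: list of dicts (attribute name -> nominal value), in the
            order of the block positions
    """
    nodes, edges = {}, []
    # Component 1: block group attributes in a fixed order, so that the
    # position of a node indicates the attribute type.
    previous = None
    for name in sorted(group_attrs):
        node = ("group", name)
        nodes[node] = group_attrs[name]
        if previous is not None:
            edges.append((previous, node))
        previous = node
    # Component 2: backbone of block nodes in block order, with the
    # attribute values of each block attached to its backbone node.
    for i, attrs in enumerate(blocks):
        backbone = ("block", i)
        nodes[backbone] = "B"
        if i > 0:
            edges.append((("block", i - 1), backbone))
        for name, value in attrs.items():
            attr_node = ("block", i, name)
            nodes[attr_node] = value
            edges.append((backbone, attr_node))
    return nodes, edges


# Hypothetical raw attribute values observed across many blocks.
min_read_lengths = [18, 19, 20, 21, 22, 25, 27, 28, 30]
edges_for_length = equal_frequency_edges(min_read_lengths, n_bins=3)
blocks = [{"min_read_length": discretize(v, edges_for_length)}
          for v in (19, 26)]
nodes, edges = encode_block_group({"entropy_read_starts": 1}, blocks)
print(nodes)
print(edges)
```

    Because every block contributes its own backbone node, block groups with
    different numbers of blocks yield graphs of different sizes, which is the
    reason given above for preferring a graph over a fixed-length vector.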


  • 3.2. Extracting New Attributes with respect to Ribo-seq data

    Figure 3.1: After the blockbuster tool has built blocks and block groups
    from the expression profiles, BlockClust builds graphs from the discretized
    values of the attributes of the block groups and blocks. Finally, the
    similarity of these graphs is compared by NSPDK using combinatorial
    features. Figure retrieved from [8].

    3.2 Extracting New Attributes with respect to Ribo-seq data

    In the following, new attributes that capture characteristics of ribosome
    profiles are defined and added to the BlockClust pipeline. We extract new
    attributes because the expression profiles of Ribo-seq data are not similar
    to those of small ncRNAs: there is a significant difference in the length
    and end expression levels of the block groups. The attributes used in
    BlockClust are optimized for small ncRNAs, hence we want to extract a few
    more that might make sense for Ribo-seq data. To increase the accuracy of
    our prediction, we first check how well the defined attributes correspond
    to the Ribo-seq data. Five attributes are implemented here. To determine
    the attribute values, the positions of the blocks and block groups in our
    ribosome profiles must be known. We therefore ran the blockbuster tool on
    our ribosome profiles and continued our work on the coordinates of the
    block groups and blocks. After extracting the block group positions with
    blockbuster, we look for a unique mapping between block groups and gene
    positions. With this mapping we assign the conditions of our genes to the
    block groups and group the block groups under their mapped gene.

    3.2.1 Entropy of Block Distances

    Entropy

    In information theory, entropy measures the amount of uncertainty in random
    data. For instance, for a binary random variable X with values 1 and 0, if
    1 and 0 each occur with a probability of 50 percent, the uncertainty about
    the value of X is at its maximum and the entropy is 1 [10]. The entropy
    formula is:

    $$\mathrm{entropy} = -\sum q \log_2 q \qquad (3.1)$$

    Here q is the probability distribution of the signals. We calculate the
    distance between two consecutive blocks as the distance between the end
    position of the first block and the start position of the following block.
    We then map this value to bins that are defined with respect to the minimum
    and maximum distances in the block group. The fraction required for the
    entropy is obtained by dividing the sum of all mapped block distance values
    by the total number of bins in that particular block group. Finally, we
    substitute each fraction into the entropy formula. We show the entropy of
    the block groups under four different conditions, according to the
    conditions of their mapped genes. As figure 3.2 illustrates, the bound and
    unbound positions behave almost analogously; however, the median value for
    the bound case shows a slight increase after the temperature shift.
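    A possible reading of this computation is sketched below in Python. The
    function names, the number of bins and the toy coordinates are
    hypothetical. The sketch also makes one explicit assumption that may
    differ from the normalization described above: the count of each bin is
    divided by the number of distances, so that the fractions sum to one
    before they enter formula (3.1).

```python
import math


def shannon_entropy(fractions):
    """Formula (3.1): minus the sum of q * log2(q) over all fractions q."""
    return -sum(q * math.log2(q) for q in fractions if q > 0)


def block_distances(blocks):
    """Distance from the end of each block to the start of the next.

    blocks: list of (start, end) tuples sorted by start position
    """
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


def entropy_of_block_distances(blocks, n_bins=4):
    """Entropy of the binned distances within one block group."""
    distances = block_distances(blocks)
    if not distances:
        return 0.0
    lowest, highest = min(distances), max(distances)
    if lowest == highest:
        return 0.0                      # all distances fall into one bin
    width = (highest - lowest) / n_bins  # equal-width bins from min to max
    counts = [0] * n_bins
    for distance in distances:
        index = min(int((distance - lowest) / width), n_bins - 1)
        counts[index] += 1
    return shannon_entropy(count / len(distances) for count in counts)


# The binary example from the text: two outcomes with 50 percent each.
print(shannon_entropy([0.5, 0.5]))

# Hypothetical block group with distances 10, 10, 40 and 40.
toy_blocks = [(0, 10), (20, 30), (40, 50), (90, 100), (140, 150)]
print(entropy_of_block_distances(toy_blocks))
```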

    [Figure 3.2 plot: entropy of block distances (y-axis, 0 to 3) for the
    libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), for the
    conditions Mut and WT.]

    Figure 3.2: Entropy of block distances calculated per block group. The
    bound and unbound positions show almost analogous behavior.

    3.2.2 Entropy of Blocks End and Start Positions

    To assess the entropy of the start and end positions of the blocks in a
    block group, we first count the number of blocks that share an identical
    start position in each block group. We then divide this count by the number
    of blocks in the corresponding block group to obtain the fraction for the
    entropy. Substituting this value into formula (3.1) gives the block start
    entropy for each block group; the end entropy is computed in the same way
    from the end positions. Finally, we plot them for four different cases
    based on the status of their genes. Figure 3.3 shows that the entropy of
    block starts is generally higher in the bound cases than in the unbound
    cases. Moreover, after the temperature shift the entropy of block start
    positions decreases for the mutant, whereas it increases for the wild type.
    The same behavior occurs for the entropy of end positions.
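    The start and end entropy described above can be sketched in a few lines
    of Python. The function names and the toy block coordinates are
    assumptions made for illustration; the fractions are the counts of blocks
    sharing a position divided by the number of blocks in the block group, as
    stated in the text.

```python
import math
from collections import Counter


def position_entropy(positions):
    """Entropy (formula 3.1) of a list of start or end positions."""
    counts = Counter(positions)          # blocks sharing the same position
    total = len(positions)               # blocks in the block group
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def start_end_entropy(blocks):
    """Return (start entropy, end entropy) for one block group.

    blocks: list of (start, end) tuples
    """
    starts = [start for start, _end in blocks]
    ends = [end for _start, end in blocks]
    return position_entropy(starts), position_entropy(ends)


# Hypothetical block group: two distinct starts, each shared by two blocks.
toy_blocks = [(100, 130), (100, 128), (105, 130), (105, 130)]
print(start_end_entropy(toy_blocks))
```

    A block group in which all blocks start at the same position has a start
    entropy of zero, while many distinct start positions give a high value.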

    [Figure 3.3 plot: entropy of starting positions (y-axis, 0 to 9) for the
    libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), for the
    conditions Mut and WT.]

    Figure 3.3: Entropy of block start positions for each block group. In the
    bound cases the entropy of block starts is generally higher than in the
    unbound cases.

    [Figure 3.4 plot: entropy of ending positions (y-axis, 0 to 9) for the
    libraries t0_R1 and t5_R1 (x-axis), in two panels (bound, unbound), for the
    conditions Mut and WT.]

    Figure 3.4: Entropy of block end positions for each block group. In the
    bound cases the entropy of block ends is higher than in the unbound cases.

    3.2.3 Density of Reads

    To measure the read density, we used the blockbuster output file to count
    the reads per block and divided this value by the total number of reads of
    the corresponding block group. Figure 3.5 shows no significant difference
    between the unbound cases, whereas at bound positions it shows a higher
    read density for the mutant. This behavior changes after the temperature
    shift, when the higher read density belongs to the wild type at bound
    positions.
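    The read density computation can be illustrated with the following Python
    sketch. It does not parse the blockbuster output file; it assumes that the
    reads per block have already been extracted into tuples. The function
    name, the identifiers and the counts are hypothetical.

```python
from collections import defaultdict


def read_densities(block_records):
    """Fraction of a block group's reads that falls into each block.

    block_records: list of (block_group_id, block_id, reads)
    returns: dict mapping (block_group_id, block_id) -> density
    """
    totals = defaultdict(int)
    for group_id, _block_id, reads in block_records:
        totals[group_id] += reads        # total reads per block group
    densities = {}
    for group_id, block_id, reads in block_records:
        total = totals[group_id]
        densities[(group_id, block_id)] = reads / total if total else 0.0
    return densities


# Hypothetical reads per block for two block groups.
records = [("group1", 1, 50), ("group1", 2, 30), ("group1", 3, 20),
           ("group2", 1, 8)]
for key, density in read_densities(records).items():
    print(key, density)
```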
