
The Pennsylvania State University

The Graduate School

PHYSICAL BIOINFORMATICS METHODS TO

UNDERSTAND THE CAUSES AND CONSEQUENCES OF

VARIABLE CODON TRANSLATION RATES

A Dissertation in

Bioinformatics and Genomics

by

Nabeel Ahmed

© 2019 Nabeel Ahmed

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

August 2019

The dissertation of Nabeel Ahmed was reviewed and approved* by the following:

Edward P. O’Brien

Associate Professor of Chemistry

Dissertation Adviser

Chair of Committee

István Albert

Associate Professor of Bioinformatics

Sarah M. Assmann

Waller Professor of Biology

Naomi S. Altman

Professor of Statistics and Bioinformatics

Cooduvalli S. Shashikant

Associate Professor of Molecular and Developmental Biology

Chair, Intercollege Graduate Degree Program in Bioinformatics and Genomics

*Signatures are on file in the Graduate School.

ABSTRACT

The process of translating the genetic information encoded in an mRNA molecule to a

protein is crucial to cellular life and plays an important role in regulating gene expression.

The steady state in vivo protein concentrations are determined in part at the level of

translation. Therefore, uncovering the mechanisms of translational control can help us

understand a crucial component of cellular dynamics. The rates at which individual codons

are translated play an important role in deciding the fate of nascent proteins and affect the

downstream cellular processes they take part in. Hence, measurement of the translation

rates at all codon positions within a transcript would help us understand their role in

regulating co-translational processes such as protein folding and chaperone binding. With

the development of high-throughput Next Generation Sequencing technology in the last

decade, a method called Ribo-Seq can capture a transcriptome-wide snapshot of

translation at nucleotide resolution. However, no gold-standard method for extracting

translation rates from Ribo-Seq data exists and there have been contradictory biological

inferences drawn from different analysis methods. In this dissertation, I present novel

methods based on mathematical optimization and chemical kinetic modeling to correctly

identify the A-site within Ribo-Seq reads and quantify absolute codon translation rates.

This dissertation also highlights two novel biological insights and discoveries, namely i)

that the primary structure of a protein encodes translation rate information through pairs

of evolutionarily selected amino acids and ii) that translation kinetics and co-translational

chaperone binding are coordinated.

In Chapter 1, I describe the current state of research in translation and how

translation rates have been estimated previously. I also discuss current methods for

analyzing Ribo-Seq data and their limitations.

In Chapter 2, I report a method that solves the essential first step of determining

where the A-site of the ribosome was on ribosome-protected mRNA fragments generated

by Ribo-Seq. It is well known that during translation elongation, the A-site of a ribosome

can occupy only the coding region between the second and stop codons of a transcript. Turning

this fundamental fact into a mathematical optimization problem, I identify an offset for the

A-site from the 5′ end of the fragment that maximizes the number of reads between the

second and stop codons of a transcript. A-site offset tables are generated for a wide range

of fragment sizes obtained from Ribo-Seq data for S. cerevisiae and mouse embryonic

stem cells. I present results showing that our method outperforms 11 other contemporary

methods for estimating the A-site position using known A-site stalling signals in polyproline

motifs.
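
The optimization described for Chapter 2 can be illustrated with a minimal sketch. It is not the dissertation's Integer Programming implementation: it is a brute-force search over a short list of candidate offsets for one fragment size and frame on a single transcript. The function name `best_offset`, the coordinate convention (nucleotide 0 is the first base of the start codon), the inclusive boundaries, and the candidate offsets are all assumptions made for illustration.

```python
def best_offset(five_prime_ends, cds_length, candidate_offsets):
    """Return (offset, count) for the offset that places the most A-sites
    between the second codon and the stop codon of one transcript.

    five_prime_ends   : 5' end position of each read, in nucleotides relative
                        to the first base of the start codon (assumed layout)
    cds_length        : coding-sequence length in nucleotides, stop codon included
    candidate_offsets : offsets (nt from the 5' end) to evaluate
    """
    first_allowed = 3              # first nucleotide of the second codon
    last_allowed = cds_length - 1  # last nucleotide of the stop codon
    best = None
    for offset in candidate_offsets:
        # Count reads whose inferred A-site falls inside the allowed region.
        count = sum(
            first_allowed <= pos + offset <= last_allowed
            for pos in five_prime_ends
        )
        # Strict comparison: on a tie the earlier candidate is kept.
        if best is None or count > best[1]:
            best = (offset, count)
    return best


# Toy transcript: 30-nt CDS and seven reads of one fragment size and frame.
example = best_offset([-12, -12, -9, 0, 3, 12, 14], 30, [12, 15, 18])
```

In this toy case an offset of 15 nt places all seven reads between the second and stop codons, whereas offsets of 12 and 18 nt each place only five, so `example` is `(15, 7)`. The sketch simply keeps the first candidate on a tie and does not address combinations of fragment size and frame for which the offset is ambiguous.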

In Chapter 3, I present a method for estimating absolute codon translation rates

based on chemical kinetic modeling of translation. Applying this method to high-coverage

transcripts, I show that codon translation rates vary by up to 26-fold in S.

cerevisiae and that even the same codon type, at different positions on a single transcript, can

have very different translation rates. Molecular factors such as cognate tRNA

concentration, downstream mRNA secondary structure, and the presence of proline in the P-site

are identified that influence the translation rate of the codon in the A-site. Hence, codon

translation rates are determined mostly by the context of the region flanking the codon

within a transcript.
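
One way to picture how read counts could be converted into absolute times is sketched below. It is only an illustration under a strong, assumed simplification and is not the kinetic model or the equations developed in Chapter 3: it assumes that, at steady state, the share of a transcript's reads whose A-site maps to a codon equals the share of the total elongation time spent decoding that codon. The function names and the input `total_elongation_time` are hypothetical.

```python
def codon_times_from_profile(reads_per_codon, total_elongation_time):
    """Split a transcript's total elongation time across codons by read share.

    Assumption (not taken from the dissertation): time at codon j =
    total_elongation_time * reads_j / sum(reads).
    """
    total_reads = sum(reads_per_codon)
    if total_reads <= 0:
        raise ValueError("ribosome profile contains no reads")
    return [
        total_elongation_time * reads / total_reads
        for reads in reads_per_codon
    ]


def codon_rates(codon_times):
    """Rate is the reciprocal of the dwell time; codons with zero time get None."""
    return [1.0 / t if t > 0 else None for t in codon_times]


# Toy profile: three codons, 100 reads, 2.0 s to elongate through all of them.
times = codon_times_from_profile([10, 30, 60], 2.0)
rates = codon_rates(times)
```

Under this assumption the codon with six times as many reads as the first is translated six times more slowly, which is the sense in which a ribosome profile encodes relative codon translation rates; an absolute scale requires an independent estimate of the total elongation time.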

In Chapter 4, I describe the novel discovery that the chemical identity of pairs of

amino acids, when located in the P-site and A-site of the ribosome, can causally and

predictably influence codon translation rates. Analysis of Ribo-Seq data from S. cerevisiae

revealed correlations indicating that particular amino acids, when present

in the P-site and A-site, can slow down or speed up the translation of the codon in the

A-site. To test for causation, twelve amino acid mutations were introduced into the primary

structure of non-essential S. cerevisiae proteins that the bioinformatic analysis predicts

will either speed up, slow down, or cause no change in translation rate when the mutated

residue is in the P-site. In all cases, the resulting change in ribosome density at the A-site

matches the prediction. Enrichment/depletion analyses of these amino acid pairs across

the proteome suggest evolutionary pressures are selecting against slow-translating pairs

of amino acids, but retaining them in regions where they might aid the efficiency of co-

translational processes.
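
The kind of comparison behind these correlations can be sketched as follows, based on the quantity named in the caption of Figure C.1: the percent change in median normalized ribosome density for a given P-site/A-site pair relative to any other amino acid occupying the P-site. The exact form of Eq. C.2 is not reproduced here, so the formula, the function name `percent_change`, and the input layout are assumptions.

```python
from statistics import median


def percent_change(instances, p_aa, a_aa):
    """Percent change in median normalized ribosome density at the A-site.

    instances : iterable of (p_site_aa, a_site_aa, normalized_density) tuples
                (hypothetical layout, one tuple per codon position)
    p_aa      : amino acid of interest in the P-site
    a_aa      : amino acid in the A-site
    """
    # Densities when the pair of interest occupies the P- and A-sites.
    with_pair = [rho for p, a, rho in instances if a == a_aa and p == p_aa]
    # Baseline: same A-site amino acid, any other amino acid in the P-site.
    baseline = [rho for p, a, rho in instances if a == a_aa and p != p_aa]
    if not with_pair or not baseline:
        raise ValueError("need instances both with and without the pair")
    reference = median(baseline)
    return 100.0 * (median(with_pair) - reference) / reference


# Toy data: proline in the P-site raises the density on an A-site glycine.
toy = [
    ("P", "G", 2.0), ("P", "G", 3.0), ("P", "G", 4.0),
    ("A", "G", 1.0), ("K", "G", 2.0), ("L", "G", 3.0),
    ("P", "A", 9.0),
]
change = percent_change(toy, "P", "G")
```

A positive value indicates higher ribosome density, and hence slower translation of the A-site codon, when the given amino acid is in the P-site; a negative value indicates faster translation.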

Chapter 5 of this dissertation presents the first evidence of

coordination between translation kinetics and co-translational binding of chaperones.

Using an in vivo selective ribosome profiling approach, the binding profile of the Hsp70

chaperone Ssb was characterized and correlated with codon translation rates obtained

from ribosome profiling. It was found that periods of Ssb binding to the nascent

polypeptide chain outside the ribosome exit tunnel were correlated with faster translation

of mRNA segments within the ribosome. This translational speedup is maintained in a

strain with Ssb deleted, indicating that the speedup is caused by features encoded within

the mRNA. I demonstrate that the distribution of the molecular factors highlighted in Chapters

3 and 4 across these mRNA fragments causes a speedup of translation in these fragments

that coincides with the binding of Ssb.

In Chapter 6, I summarize my findings and their implications for characterizing the

principles of translation kinetics and their influence on co-translational processes. The

methods presented in this dissertation will hopefully provide an easy-to-implement

standardized protocol for processing Ribo-Seq data by correctly mapping the reads using

the provided offset table and quantifying absolute rates. Identification of a novel factor like

amino acid pairs should motivate researchers to investigate the importance of pairs and

the potential role of loss of this pairing at sensitive sites in causing disorders. Finally,

coordination of translation kinetics with co-translational folding should open up avenues to

investigate the loss of chaperone binding due to altered translation kinetics caused by

synonymous mutations. More broadly, the methods and studies described in this dissertation

demonstrate the integration of useful information from next-generation sequencing datasets

with chemical kinetic models. The projects in this dissertation also showcase the power of

biophysical modelling in explaining the dynamics of cellular processes and offer a

multidisciplinary perspective on biology from the physical sciences.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 INTRODUCTION

1.1 Overview

1.2 Translation and its importance

1.3 Previous estimates of translation rates

1.4 Ribo-Seq measures the location and number of actively translating ribosomes

1.4.1 Approaches for identifying the A-site

1.5 Approaches for estimating translation rates using Ribo-Seq

1.6 Molecular factors influencing translation elongation

1.7 Influence of translation kinetics on co-translational processes

1.8 Objectives of dissertation

Chapter 2 IDENTIFYING A- AND P-SITE LOCATIONS ON RIBOSOME-PROTECTED MRNA FRAGMENTS USING INTEGER PROGRAMMING

2.1 Abstract

2.2 Introduction

2.3 Results

2.3.1 Integer Programming Algorithm

2.3.2 Illustrating the Integer Programming optimization procedure

2.3.3 A-site locations in S. cerevisiae Ribo-Seq data are fragment size and frame dependent

2.3.4 Higher coverage leads to more unique offsets

2.3.5 Consistency across different datasets

2.3.6 Robustness of the offset table to threshold variation

2.3.7 Testing the Integer Programming algorithm against artificial Ribo-Seq data

2.3.8 A-site offsets in mouse embryonic stem cells

2.3.9 Integer Programming does not yield unique offsets for E. coli

2.3.10 Reproducing known PPX and XPP motifs that lead to translational slowdown

2.3.11 Greater A-site location accuracy than other methods

2.4 Discussion

2.5 Methods

2.5.1 Ribo-Seq datasets

2.5.2 Gene selection, analyses and statistical tests

2.6 Acknowledgements

2.7 Data Availability

Chapter 3 A CHEMICAL KINETIC BASIS FOR MEASURING TRANSLATION ELONGATION RATES FROM RIBOSOME PROFILING DATA

3.1 Abstract

3.2 Author Summary

3.3 Introduction

3.4 Results

3.4.1 Theory

3.4.2 Application

3.5 Discussion

3.6 Methods

3.6.1 Simulated steady state ribosome profiling data

3.6.2 In silico measurement of average protein synthesis and codon translation times

3.6.3 Analysis of ribosome profiling and RNA-Seq data

3.6.4 Assignment of mRNA secondary structure

Chapter 4 EVOLUTIONARILY SELECTED AMINO ACID PAIRS ENCODE TRANSLATION-ELONGATION RATE INFORMATION

4.1 Abstract

4.2 Main Text

Chapter 5 EVOLUTIONARILY-ENCODED TRANSLATION KINETICS COORDINATE CO-TRANSLATIONAL SSB CHAPERONE BINDING IN YEAST

5.1 Abstract

5.2 Introduction

5.3 Results

5.3.1 Selective Profiling of Ssb-Bound Ribosomes

5.3.2 Coordination of Ssb Binding with Translation Elongation Rates

5.4 Discussion

5.5 Methods

5.5.1 Translation kinetics analysis

5.5.2 Speed-up of translation

5.5.3 Contribution of mRNA versus Ssb binding

5.5.4 Enrichment/Depletion of Fast/Slow codons

5.5.5 Upstream charged residues

5.5.6 Downstream mRNA secondary structure

Chapter 6 CONCLUSIONS AND FUTURE DIRECTIONS

6.1 Conclusions

6.2 Future Directions

6.2.1 Synonymous mutations and diseases

6.2.2 Test phenotypic effect of loss of amino acid pairing due to mutations

6.2.3 Causally test the effect of altered translation kinetics on Ssb chaperone binding

Appendix A CHAPTER 2 SUPPORTING INFORMATION

A.1 Supporting Figures

A.2 Supplementary Tables

Appendix B CHAPTER 3 SUPPORTING INFORMATION

B.1 Supplementary Methods

B.1.1 Derivation of Eq. (3.3) from Eq. (3.1) and Eq. (3.2)

B.1.2 Estimation of 𝝉 < 𝒊 >

B.2 Supplementary Figures

B.3 Supplementary Tables

Appendix C CHAPTER 4 SUPPORTING INFORMATION

C.1 Methods

C.1.1 Details of Experiments

C.1.2 Computational analyses of Ribo-Seq data

C.2 Supplementary Figures

C.3 Supplementary Tables

Appendix D CHAPTER 5 SUPPORTING INFORMATION

D.1 Derivations Demonstrating that the Fold Enrichment Is Directly Proportional to the Ssb-Binding Probability

D.1.1 Proof 1: Demonstration that the FE is directly proportional to the probability of Ssb binding

D.1.2 Proof 2: Demonstration that SeRP reads are a function of the elongation rate, and that the Fold Enrichment metric controls for this effect

REFERENCES

LIST OF FIGURES

Figure 1.1. Type of rates involved in translation.

Figure 1.2. Overview of Ribo-Seq.

Figure 2.1. The A-site location can be defined as an offset from the 5′ end of ribosome-protected fragments.

Figure 2.2. mRNA fragment size distribution for S. cerevisiae Ribo-Seq dataset from Pop and co-workers (A) and the Pooled dataset (B).

Figure 2.3. Distribution of offset values from the Integer Programming algorithm applied to transcripts from S. cerevisiae.

Figure 2.4. Increasing coverage identifies A-site locations for 𝑆 and 𝐹 combinations that were initially ambiguous.

Figure 2.5. Several PPX and XPP motifs lead to ribosomal stalling in S. cerevisiae.

Figure 2.6. The Integer Programming algorithm correctly assigns greater ribosome density than other methods to the Glycine in PPG motifs in S. cerevisiae and to Glutamic acid in PPE motifs in mESCs.

Figure 3.1. Eq. (3.5) accurately determines codon translation times from simulated ribosome profiles.

Figure 3.2. Wide variability in individual codon translation rates in vivo.

Figure 3.3. Molecular factors shaping the variability of individual codon translation rates.

Figure 4.1. Computational analyses of Ribosome profiling data demonstrate that identity of amino acids in the P- and A-sites can influence the translation speed of the A-site codon.

Figure 4.2. Ribosome profiling experiments in which mutations are made to the P-site residue measure changes in translation speed that are consistent with the predictions from Figure 4.1b.

Figure 4.3. Depending on the amino acid pair, translation speed is influenced by either the identity of the tRNA pairs, the amino acid pairs, or both.

Figure 4.4. Evolution selects for fast-translating pairs across the proteome but enriches slow-translating pairs across interdomain linker regions.

Figure 5.1. Schematic representing the ribosome footprint x obtained from selective Ribo-Seq when Ssb is bound to the region of nascent chain n amino acids upstream of x.

Figure 5.2. Altered Translation Kinetics of Ssb-Bound Ribosomes.

Figure 5.3. Identifying Ssb-Bound mRNA Segments and the Molecular Origins of Translation Acceleration.

Figure 6.1. Illustration of the hypothesis that a change in translation-elongation rates will lead to disruption of Ssb binding.

Figure A.1. Fragment size distribution in (A) Pooled Ribo-Seq data in mouse embryonic stem cells (mESCs) and (B) Pooled Ribo-Seq data in Escherichia coli.

Figure A.2. Pairwise comparison of fragment-size and frame distributions between genes in S. cerevisiae.

Figure A.3. Integer Programming algorithm correctly reproduces the true A-site offsets from Artificial Ribo-Seq data.

Figure A.4. Meta-gene analysis in Pooled Ribo-Seq data reveals excess ribosome density in E. coli genes beyond CDS regions.

Figure A.5. Stalling at PPE and PPD motifs is reproduced in mESCs.

Figure A.6. Sequence-independent translational pause observed post-initiation in S. cerevisiae and mESCs.

Figure A.7. The Integer Programming algorithm correctly assigns greater ribosome density to the Glycine residue in PPG motifs than other methods in S. cerevisiae.

Figure B.1. Comparison of the properties of the 117- and 364-transcript data sets from studies of Nissley et al.9 and Williams et al.114, respectively, to the entire S. cerevisiae transcriptome.

Figure B.2. Translation time distributions for the 64 codon types.

Figure B.3. Codon translation rates are highly correlated across datasets and with rates from the method of Dao Duc and Song.

Figure B.4. Molecular factors shaping the variability of individual codon translation rates in the dataset from Williams et al.114.

Figure C.1. The percent change in median normalized ribosome density 𝜌 for a given pair of amino acids in the P-site and A-site, relative to any other amino acid being in the P-site (Eq. C.2).

Figure C.2. The sign of the percent change in ribosome density (Eq. C.2) for the fast and slow translating amino acid pairs remains the same after controlling for different molecular factors known to influence translation speed.

Figure C.3. The ribosome profiling data for all the mutant strains demonstrate consistent fragment size distribution, strong 3 nt periodicity, robust frame distribution and high pairwise correlation of individual transcript's ribosome profiles.

Figure C.4. Ribosome profiles of mutant and wild-type strains are highly correlated.

Figure C.5. Optimal and non-optimal codons are equally distributed between the domain and linker regions of proteins for both fast- and slow-translating amino acid pairs.

Figure C.6. Fast-translating amino acid pairs are enriched in those transcript segments that are being translated when the chaperone Ssb is bound to the nascent chain.

Figure C.7. Translation speed differences are not explained by wobble decoding in the P- and A-sites.

Figure C.8. Samples prepared in the same phase (single batch on same day) exhibit higher correlations than samples prepared in different phases.

LIST OF TABLES

Table 2.1. A-site locations (nucleotide offsets from 5′ end) determined by applying the Integer Programming algorithm to the Pooled dataset in S. cerevisiae are shown as a function of fragment size and frame.

Table A.1. Number of genes for the various fragment size and frame combinations that meet the criteria of at least 1 read per codon on average in the Pop and Pooled datasets of S. cerevisiae.

Table A.2. Initial offset tables after application of Integer Programming algorithm to Pop and Pooled datasets in S. cerevisiae.

Table A.3. For unique offsets described in Table 2.1, the robustness to variation in parameters and consistency across different Ribo-Seq datasets are described with additional sub columns.

Table A.4. Input A-site offset tables used in the creation of artificial Ribo-Seq data (table below, see Methods). Offset A-site tables (next page) output by the Integer Programming method when applied to artificial Ribo-Seq data constructed using the input tables (Top) and P(𝑆, 𝐹) distribution with mode (28, 0) and variance 𝜆 = 48 (Distribution 5 in Figure A.3).

Table A.5. Initial offset table after application of Integer Programming algorithm to a Pooled dataset in mESCs consisting of all genes. Offset table after application of Integer Programming algorithm to a Pooled dataset of E. coli.

Table A.6. A-site locations (nucleotide offsets from 5′ end) determined by applying the Integer Programming algorithm to the Pooled dataset in mESCs are shown as a function of fragment size and frame.

Table A.7. Number of genes in the combination of fragment size and frame meeting the criteria of at least 1 read per codon on average in mESCs and E. coli Pooled datasets.

Table A.8. Median normalized ribosome densities for 61 codon types were correlated with tRNA abundance for the Integer Programming method and 11 other contemporary methods (see Methods for details).

Table A.9. Publicly available datasets used in the study.

Table A.10. A-site offsets determined using the publicly available R packages – Plastid38, RiboProfiling92 and riboWaltz37.

Table B.1. Statistics for the translation time distributions of 64 codon types obtained from the Nissley dataset.

Table B.2. Statistics for the translation time distributions of 64 codon types obtained from the Williams dataset.

Table C.1. Ribo-Seq was obtained from five different published studies.

Table C.2. Details on the 12 single amino acid mutations that were made across 5 different genes.

Table C.3. Statistics of read mapping for ribosome profiling experiments for the mutant strains carried out in this study.

Table C.4. Three mutations to gene YOL109W to test the contribution of amino acid and tRNA identity.


ACKNOWLEDGEMENTS

First and foremost, I would like to thank God Almighty for always keeping me motivated

for the long and challenging journey of a PhD. I am grateful for the intellect that God has

bestowed upon me to contribute towards pushing our understanding of nature and life

even if it is only bit by bit. Learning about nature and getting to know the interplay of

complex network of molecular machines that together create functioning biological

systems have always amazed me and brought me closer to God Almighty.

I would like to thank the National Science Foundation, National Institutes of

Health, and Human Frontier Science Program for funding the work described in this

dissertation in part. Any opinions, findings, and conclusions or recommendations

expressed in this dissertation are mine and those of my collaborators and do not

necessarily reflect the views of these funding agencies.

I would like to thank my dissertation advisor, Ed O’Brien, without whose constant

support and encouragement, this PhD would not have been possible. Ed has been a

wonderful advisor who always made sure to bring the best work out of me and taught me

to think about my research from different perspectives. I have learned a great deal about

how to propose and execute a research project from Ed and this will go a long way for me

to have a successful career as a scientist. I would also like to thank my committee

members, Professors Istvan Albert, Sarah Assmann and Naomi Altman, for their

thoughtful questions and criticisms that helped me improve upon my research projects. I

am grateful to Shashi who played an important role in bringing me to Penn State and

constantly provided support and encouragement during our meetings.

I would also like to extend my gratitude to my wonderful collaborators at the University

of Heidelberg whose contributions made it possible to experimentally validate most of my

computational research findings. I would like to thank Bernd and Günter for providing

resources, insightful ideas and feedback that made it possible to ask the pertinent

research questions and extract exciting findings from our analyses. I would like to thank

Ulrike for having patience and running long and challenging experiments for our projects.

I am grateful to Kristina for working with me and Ed on the Ssb project and providing all data

and useful insights needed to execute our part of the project. I would also like to thank

Pietro at the University of Cambridge for working together on the development of computational

methods and his diplomatic statements that helped us swiftly respond to harsh reviewers'

comments. Coming back to people who have been in closer physical proximity, I would

like to thank other members of the O’Brien lab with whom I had a great time working


over the past 5 years. Thank you Ajeet, Dan, Sarah, Joe, Dave, Fabio, Ben, Ian, Yang

and Yiyun for always being supportive whenever I have reached out to you for help.

Finally, I need to acknowledge my gratitude and thanks to the most important

people in my life. I am indebted to my father who inspired me to undertake a career in

science. His steadfast support and constant encouragement have kept me focused on my

research and convinced me to never give up. His wonderful achievements as a

hydrogeologist have always inspired me and I hope I can achieve even half of what he

had achieved in his scientific career. I would like to thank my Mom for always believing in

me and always encouraging me to never stop trying. Thanks to my brother, Adeel for

always being there for me and my grandmother for her love, prayers and wishes. I would

also like to honor the memory of two individuals who are no more but would have been

very proud to see me attain a PhD. To Baji, my paternal grandmother, I wish you could be

here to see me finish my PhD. It was her hard labor that uplifted our family out of poverty,

made sure my father received his education and subsequently led us to achieve highest

academic honors. Also, to my maternal grandfather who always made sure to make me

understand the value of education and knowledge during my childhood. I know that you

would be proud of my achievement.

Lastly, I need to thank my better half, my wife Anam. The last 2 years of my life

have been the most wonderful ever since I met you. I am always amazed by the positive

attitude you bring to all discussions we have. I am grateful to you for having the patience

to bear with me – with my rants, complaints and long hours away at work. I am grateful to

you for always making things easy for me. It would not have been possible to complete

my dissertation without your unwavering love and support.


Chapter 1

INTRODUCTION

1.1 Overview

This chapter introduces the background and motivation for all the studies presented in

this dissertation. First, I describe the recent evidence demonstrating the importance of

translation in determining the protein abundance in vivo. Next, I discuss earlier single

gene methodologies and sequence-based measures used as estimates of translation

rates. Then, I introduce Ribosome Profiling, also known as Ribo-Seq, whose data forms

the basis for many of the projects in this dissertation, the current methods to model Ribo-

Seq data and their limitations. Next, I detail the evidence that translation kinetics has

downstream effects on co-translational processes. Lastly, I outline how the research

projects discussed in Chapters 2, 3, 4 & 5 in this dissertation overcome the limitations of

current analysis methods and how the developed methods offer novel biological insights.

1.2 Translation and its importance

Proteins play an integral role in the functioning of a cell. Their cellular concentrations are

determined dynamically through the processes of transcription, translation and

degradation. Through advances in mass spectrometry, it has been possible to directly

characterize proteins from cells but the estimates of their concentrations are qualitative

at best and it has been difficult to detect lowly expressed proteins1. mRNA copy numbers

are easy to measure through inexpensive microarray studies and recently by high-

throughput RNA sequencing. Consequently, gene expression has been mostly quantified

by mRNA levels that act as a proxy for the final protein levels. Schwanhäusser et al.2 used

pulse labeling with labeled variants of amino acids and nucleosides in a population of

unperturbed mouse fibroblast cells to determine the turnover and half-lives of proteins

and their corresponding mRNA transcripts in a single experiment. The mRNA and protein

levels quantified in the same experiment demonstrated that only 40% of the variability in

protein levels is explained by mRNA levels. According to their model, the translation rate

constants are better predictors of protein levels than mRNA levels are. Therefore,

uncovering the mechanisms of translational control of gene expression can help us

understand a crucial, understudied component of cellular dynamics.


Translation is the process by which the genetic information encoded in

messenger RNA (mRNA) is converted into a newly synthesized (“nascent”) protein3.

Translation occurs through the action of the ribosome, a polymerase that initiates

translation by binding at the start codon on an mRNA molecule. Next, the ribosome

elongates (i.e., synthesizes) the nascent protein by uni-directionally sliding along the

transcript, one codon at a time, catalyzing peptide bond formation. Translation terminates

once the ribosome reaches the stop codon. A codon is a triplet of nucleotides, and the

61 sense codons encode the 20 naturally occurring amino acids – the building blocks of

proteins. The ribosome reads off this codon information and catalyzes peptide bond

formation between amino acid groups that are bound to transfer RNA (tRNA). The

ribosome contains three sites in which tRNA molecules can reside – the acceptor site (A-

site), the peptidyl site (P-site), and the exit site (E-site). The A-site contains the codon

that is being translated and binds the cognate amino-acylated-tRNA molecule, the P-site

contains the tRNA to which the nascent protein is covalently attached, and the E-site

contains the deacylated-tRNA that is ejected from the ribosome before the next codon is

translated.

The rates associated with translation (Figure 1.1) determine the time scales of

protein synthesis4,5, influence protein expression levels6, and have recently been shown

to influence the structure and function of the protein produced7–11. These rates include

the initiation rate (how fast the ribosome binds to the start codon), individual codon

translation rates at the A-site (how fast the ribosome moves from one codon position to

the next), and the average elongation rate (how fast the ribosome moves from one codon

position to the next, averaged over all the codon positions in a transcript). During the

elongation step of translation, the ribosome synthesizes a protein by sliding along an

[Figure 1.1 schematic: initiation (rate 𝛼), elongation (rates 𝑘A,2 … 𝑘A,j, 𝑘A,j+1, 𝑘A,j+2 at codon positions 1 … 𝑗 − 1, 𝑗, 𝑗 + 1 …), and termination (rate 𝛽) at the stop codon 𝑁C.]

Figure 1.1. Types of rates involved in translation. Translation is initiated by the binding

of ribosome subunits to the mRNA transcript at rate 𝛼. The ribosome then elongates at

rate 𝑘A to each successive codon until it reaches the stop codon, 𝑁C, where translation is

terminated with rate 𝛽 and the full-length protein (blue string) is released.


mRNA molecule and translates different codons into amino acids at different rates12. The

rate at which individual codons are translated by the ribosome can determine whether a

nascent protein will fold and function, misfold and malfunction, aggregate or efficiently

translocate to a different cellular compartment13,14,15. Hence, measurement of the

translation rates at all codon positions within a transcript would be crucial to uncover the

mechanism of translational control. Translation elongation rate is synonymous with

codon translation rate. The mean translation time of a codon is the inverse of the codon’s

translation rate; because each quantity determines the others, these three terms are used

interchangeably throughout this dissertation.

1.3 Previous estimates of translation rates

Direct measurement of codon translation rates in vivo is nontrivial and translation

efficiency (rate of translation initiation or protein synthesis) has typically been estimated

by measures of codon usage bias and tRNA abundance that have been found to be

correlated with protein abundance16. Despite the degeneracy in the genetic code, the

frequency of usage of synonymous codons is highly biased17. This phenomenon is

referred to as codon usage bias. Frequent codons are generally correlated with high tRNA

abundance18 and the bias is more strongly observed in highly expressed genes across

diverse organisms19. Due to the evolutionarily conserved nature of codon usage bias,

translational efficiency was often approximated by indexes of codon usage20 and tRNA

abundance21. The intuitive hypothesis has been that frequent codons are translated

faster than rare codons and hence the biased codon usage and tRNA abundance have

co-evolved for the efficient use of translational machinery17. Though studies have shown

that substituting frequent codons with rare codons decreases overall protein

abundance22, there is no direct biochemical evidence that a change in translation

elongation rate causes a decrease in protein synthesis.

In the 1980s and 1990s, enzymology23 and cell biology24 assays were developed to

measure average translation-elongation rates one gene at a time or averaged over a

cell’s translatome. The enzymology techniques involved controlling the time at which

initiation of a transcript occurred, and then monitoring the subsequent appearance of

enzymatic activity. The time point at which the enzyme’s specific activity saturated,

divided by the enzyme’s length in residues, provided a measure of the transcript’s

average codon translation speed. Alternatively, the cell biology assays would

simultaneously measure the total mass of newly synthesized mRNAs and proteins


produced in cells over some time period via pulse-

chase experiments, and then fit those data to a model

that reported the average elongation rate, among

other quantities. A drawback of the enzymology

approach is that it is not high throughput – the

measurements can only be done one gene at a time.

Additionally, to be accurate, this approach requires

that any acquisition of enzymatic activity occur on a

faster characteristic time scale than that of protein

synthesis. The cell biology approach is prone to large

errors because gross measurements of total protein

mass were used and the results depend on the details

of the model used to extract the rates.

1.4 Ribo-Seq measures the location and number of

actively translating ribosomes

Ribo-Seq25 is a Next-Generation Sequencing

technique in which translation is rapidly halted in cells

through the use of antibiotics or flash freezing.

Subsequent cell lysis and mRNA digestion of the

lysate using an RNase enzyme26 (Figure 1.2a) results

in a pool of ribosome-protected mRNA fragments that

is amplified and sequenced. The number and length of

mRNA fragments that map to the coding sequences

(CDSs) of transcripts are a function of the location and

number of ribosomes that were sitting at a particular

location on different copies of the same transcript

when translation was halted. When a ribosome dwells

for a longer time at a particular codon position, more

reads map to it relative to a codon position that is

translated faster on the same transcript (Figure 1.2b).

Hence, the read distribution across a CDS is, in part,

a function of the individual translation elongation rates

of each codon. The advent of Ribo-Seq provided a

[Figure 1.2 schematic: (a) nuclease digestion, RNA isolation, and sequencing and alignment, giving the number of reads versus nucleotide position across the coding sequence region; (b) number of reads at codon positions 𝑗 and 𝑗 + 1.]

Figure 1.2. Overview of Ribo-Seq. (a) Steps of Ribo-Seq experiment: Unprotected mRNA

fragments (purple regions not covered by green ribosomes) are digested by nuclease enzyme

such that only ribosome-protected mRNA fragments are isolated and subsequently sequenced

and aligned to the transcriptome to generate the ribosome profile. (b) Slow translation leads

to a higher number of reads at codon position 𝑗 compared to fast translation at codon 𝑗 + 1.


codon level resolution of translation that can be used to estimate translation elongation

rates.
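As a rough illustration of how a read profile relates to relative dwell times, the sketch below normalizes per-codon read counts by the transcript's mean count. It assumes that reads have already been assigned to A-site codons and that read density is proportional to the time the ribosome spends at a codon. The function name and the example counts are hypothetical and are not taken from any dataset in this dissertation.

```python
def normalized_density(reads_per_codon):
    """Divide each codon's read count by the transcript's mean read count.

    Values above 1 suggest slower-than-average translation at that codon,
    values below 1 suggest faster-than-average translation (assuming read
    density is proportional to dwell time).
    """
    total = sum(reads_per_codon)
    if total == 0:
        raise ValueError("transcript has no mapped reads")
    mean_reads = total / len(reads_per_codon)
    return [count / mean_reads for count in reads_per_codon]


# Hypothetical three-codon profile, for illustration only.
example_profile = normalized_density([10, 30, 20])
```

Such a normalized profile gives only relative information within one transcript; converting it to absolute rates requires additional modeling.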

Over the past decade since the introduction of Ribo-Seq, several biases in the

experimental protocol have been identified and improvements have been proposed to

avoid such biases27. The most prominent bias that has been quantified is the effect of

cycloheximide (CHX) drug treatment that has been used to halt translation in earlier Ribo-

Seq studies. It was shown that the translation arrest induced by CHX was not perfect and

it led to continued elongation and distortion of ribosome density across downstream

codons28,29. Improved protocols now use flash-freezing for halting translation and

different ribonucleases for mRNA digestion are available for different organisms. One of

the fundamental computational challenges in the analysis of Ribo-Seq data is to map

reads to the correct codon position within the resulting ribosome-protected mRNA

fragments. To quantify individual codon translation rates, we must be able to accurately

identify which codon was being translated at the ribosome’s A-site. Otherwise, ribosome

density will be assigned to the wrong codon and the measured rates will be erroneous.

In Ribo-Seq experiments26, however, the location of the ribosome’s A-site on a ribosome-

protected mRNA fragment is not known a priori; additional information and assumptions

must be introduced to estimate its location.

1.4.1 Approaches for identifying the A-site

Recent methods25,29–38 to estimate the A-site location are based on heuristic and

statistical learning approaches. For a canonical ribosome-protected fragment of 28 nt,

the A-site has been identified to be 15 nt from 5΄ end (see Figure 2.1A)25. This information

is used by many methods as a heuristic to qualitatively guess the location of the A-site

for non-canonical fragment lengths. Most of these methods can only be applied to a

narrow range of fragment lengths25,29,35,39 and hence do not utilize all of the reads

generated in a Ribo-Seq experiment. Others use simple heuristics, such as pausing at

codons of certain amino acids in response to specific growth media and drug treatments.

For example, drug treatment with 3-amino-1,2,4-triazole depletes the cellular

concentration of tRNAHis, and is expected to lead to a higher ribosome density when

histidine codons are in the A-site31. With such methods, the A-site location in S.

cerevisiae Ribo-Seq datasets has been estimated to be 15 nt from the 5΄ end of

ribosome-protected fragments of size 28 nt25,40, 16 nt for fragment size 29 nt40, and 15

nt from the 5΄ end of fragments that are 30 nt in length35. Additionally, frame-specific


offsets of 14 to 17 nt from the 5΄ end for fragments between 28 and 30 nt in length are

used29,41. Alternatively, the Center-weighted Method smooths the ribosome density

across several codons34, and thus translation properties of individual codons cannot be

accurately ascertained with this approach. Therefore, to accurately identify the A-site

location, an approach is needed that is firmly rooted in biological principles that can also

be applied to the wide range of fragment lengths generated by a Ribo-Seq experiment.
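The heuristic, fixed-offset style of A-site assignment described above can be sketched as a simple lookup. This is a minimal illustration, not the implementation of any cited method: the offset table uses only the 28, 29 and 30 nt values quoted in the text, it ignores frame-specific offsets, and the function name, coordinate convention (0-based transcript positions) and example read are assumptions.

```python
# Offsets (nt from the 5' end) quoted in the text for S. cerevisiae;
# treating them as one fixed table is a simplifying assumption.
ASSUMED_A_SITE_OFFSETS = {28: 15, 29: 16, 30: 15}


def a_site_codon_index(five_prime_pos, fragment_length, cds_start,
                       offsets=ASSUMED_A_SITE_OFFSETS):
    """Return the 0-based codon index in the A-site, or None if unassignable.

    five_prime_pos: transcript coordinate of the read's 5' end.
    cds_start: transcript coordinate of the first nucleotide of the start codon.
    """
    offset = offsets.get(fragment_length)
    if offset is None:
        # Fragment length outside the table: the read is discarded,
        # which is the limitation noted in the text.
        return None
    a_site_nt = five_prime_pos + offset
    if a_site_nt < cds_start:
        return None  # A-site would fall upstream of the coding sequence
    return (a_site_nt - cds_start) // 3


# Hypothetical read: 5' end at position 100, 28 nt long, CDS starting at 85.
example_index = a_site_codon_index(five_prime_pos=100, fragment_length=28, cds_start=85)
```

Any read whose length is missing from the table is dropped, which shows concretely why fixed-offset heuristics do not use all of the reads from an experiment.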

1.5 Approaches for estimating translation rates using Ribo-Seq

Ribo-Seq overcomes the drawbacks of both enzymology and cell biology assay-based

methods to measure translation rates: it is high-throughput; it directly measures ribosome

positions on individual transcripts; and it measures a signal that is proportional to time

spent by the ribosome on a codon. Therefore, several analytical methods4,35,39,42–44 have

been developed that often estimate qualitative, relative differences in translation speed.

For example, in one method35 the “Ribosome Residence Time” of each codon type is

estimated as the proportion of Ribo-Seq reads for that codon type relative to the average

number of reads present in a local 20-codon window centered at the codon of interest.

However, this provides only a rough, relative measure of translation rates between

different codon types and imposes the assumption that every instance of a given codon type

translates at the same rate. A simple thought experiment reveals the large errors that can arise from this

“local window” approach. Consider a 100-codon transcript in which the first half is

uniformly translated twice as slowly as the second half, resulting in twice as much

ribosome density in the first half compared to the second. Further, assume that the codon

in the 75th position (in the fast-translating region) is the only codon that translates slowly,

with 50% more reads than in its local window. With these conditions, codon 75 is being

translated at the transcript’s average codon translation speed. And yet, applying this local

window approach we would incorrectly conclude that codon 75 is being translated 1.5

times slower than the average codon translation speed. Dana and Tuller44 defined a

translation efficiency index for the mRNA transcripts called Mean Typical Decoding Rates

(MTDR) which is the geometric mean of translation rates of all codons within the

transcript. Pop et al.39 model a ribosome flow process while softly constraining the

translation rate of a codon type to be same throughout the cell. Gardin et al.35 do the

same but use relative ribosome densities in 20 codon windows, which has the effect of

reducing the variability in translation speeds. Thus, many of these methods miss the variability

in the codon translation rate of the same codon type in different parts of the same

transcript, and while all measure relative

rates between codons, none measure absolute translation rates of individual codons.
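The 100-codon thought experiment from this section can be worked through numerically. The sketch below is illustrative only: the density values (2.0 in the slow half, 1.0 in the fast half, 1.5 at codon 75) are arbitrary units chosen to match the scenario, the local window is taken as the 10 codons on either side of the codon of interest, and the function names are hypothetical.

```python
def local_window_ratio(density, idx, half_width=10):
    """Density at idx relative to the mean of its neighbours in a local window."""
    lo = max(0, idx - half_width)
    hi = min(len(density), idx + half_width + 1)
    neighbours = [d for i, d in enumerate(density[lo:hi], start=lo) if i != idx]
    return density[idx] / (sum(neighbours) / len(neighbours))


def transcript_ratio(density, idx):
    """Density at idx relative to the mean over the whole transcript."""
    return density[idx] / (sum(density) / len(density))


# First half translated twice as slowly, so it carries twice the density.
density = [2.0] * 50 + [1.0] * 50
density[74] = 1.5  # codon 75 (0-based index 74): 50% more reads than its surroundings

local = local_window_ratio(density, 74)   # 1.5: looks markedly slow locally
overall = transcript_ratio(density, 74)   # close to 1: near the transcript average
```

The local ratio is 1.5, while the ratio to the transcript-wide mean is close to 1, so the two normalizations lead to different conclusions about the same codon.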

1.6 Molecular factors influencing translation elongation

As described previously, codon translation rates were estimated using measures of

codon usage that correlated with cognate tRNA abundance16. The 61 sense codon types are

decoded by 42 tRNA families in S. cerevisiae and each tRNA family is encoded by a variable

number of gene copies across the genome45. Since there are only 42 tRNA types for

decoding 61 codon types, multiple codons are decoded by the wobble decoding mechanism

in which the third nucleotide of the codon and its anticodon partner do not exhibit Watson-Crick

complementarity46. A codon optimality measure has been used in the literature taking

into account the cognate tRNA interactions and wobble base pairing47,48. Optimal codons

were defined as codons used commonly across the genome and decoded by tRNAs with

higher gene copy number. Non-optimal codons were mostly rare codons decoded by

less abundant tRNAs or through the wobble decoding mechanism.

With the development of Ribo-Seq, the codon translation rates obtained were

correlated with codon optimality. Some studies35,44 showed that biased codon usage

strongly correlates with codon translation rate while others39,42,43 demonstrated that

synonymous codons do not differ in their codon translation rates. However, the

discrepancies were attributed to technical biases in Ribo-Seq, specifically the use of

cycloheximide29. An improved Ribo-Seq study found that codon translation rates are

correlated with codon optimality, which could explain only 27% of the variation in the rates41. Wobble decoding

has been shown to slow translation in metazoans using Ribo-Seq49 but no definitive

evidence exists for any systematic slowdown caused by the wobble decoding mechanism in

other organisms.

Advances in structural methods, single-molecule methods and Ribo-Seq have

identified several other factors that can potentially influence translation50. This includes

features of both mRNA and nascent chain. mRNA secondary structure can be a barrier to

translocation of the ribosome along the transcript and it can result in a slowdown of

translation while the structure is unwound by helicase activity of the ribosome51,52.

Analysis of initial Ribo-Seq data has also found a correlation between ribosome density and

the folding energy of mRNA secondary structures53,54. Other features that can influence

translation kinetics are tRNA modifications that can alter decoding efficiency55 and stress


conditions that can change the dynamic pool of tRNA thus affecting the decoding of

different codon types56,57.

The nascent chain features having an influence on translation rates include the

presence of proline residues. Proline is a well-established poor peptidyl donor and

acceptor when present in the P- and A-sites respectively58,59. Ribo-Seq studies confirmed

that the presence of proline leads to a slowdown of translation60. The slowdown of

translation is extensive for polyproline motifs, which require external translation factors

to rescue translation61–66. This phenomenon has been determined through

enzymology61,62 and toe printing67 studies and has been extensively characterized to be

rescued by factors like EF-P and eIF5A in E. coli and S. cerevisiae respectively. Positively

charged residues are an additional nascent chain feature that can influence the codon

translation rate by interacting with the negatively charged tunnel resulting in a slowdown

of translation at the A-site42,68,69.

As the methods advance to study the translation process dynamically in real

time50, more factors may be discovered influencing the rate of translation elongation.

1.7 Influence of translation kinetics on co-translational processes

Translation is a resource intensive process and efficient production of proteins is

required to maintain protein homeostasis. Protein maturation is a multi-step process

requiring several factors to act in a timely fashion. A misstep can disrupt the protein

homeostasis potentially driving pathogenesis of diseases. Without changing the protein

abundance, this disruption can cause the protein to misfold and lead to aggregation

causing cytotoxicity. The ribosome, as a catalytic macromolecular complex, maintains

a balance between efficient protein production and ensuring that the proteins are

functionally active. This is achieved by the non-uniform pattern of translation kinetics

where variability of translation rates creates periods of fast and slow translation to

efficiently and accurately generate a functional proteome70.

Multi-domain proteins tend to fold in a domain-wise fashion such that they can

avoid large-scale non-native interactions71. Translation is a sequential process and it can

allow the separation of time scales for different domains of a multi-domain protein to fold.

However, there is still a danger that the nascent polypeptide, which can access a large

conformational space upon emerging from the ribosome exit tunnel, will misfold72,73. The nascent

polypeptide needs to be supervised during the elongation phase to avoid any non-native

interactions. A network of molecular chaperones assists with the processing of nascent


polypeptides by helping avoid misfolded nascent chain conformations while the rest of

the polypeptide is being synthesized inside the ribosome72–74. A network of factors also

exists to facilitate the co-translational protein maturation steps of assembly of large protein

complexes75 and membrane targeting76. Alterations of translation kinetics have been

demonstrated to affect these co-translational processes but their mechanism of

coordination is not well understood77,78.

It was hypothesized that optimizing the mRNA sequence by replacing non-

optimal codons with optimal codons should result in an increase in efficiency of protein

production79. However, multiple lines of evidence show that optimizing the

mRNA sequence increases the efficiency of protein production but can often lead to loss

of functionality77,80,81, in some cases leading to widespread aggregation of proteins82.

Optimizing the FRQ protein in Neurospora, for example, led to the loss of circadian

rhythm83. Evolutionary selection pressures have shaped codon usage such that optimal

and non-optimal codons are distributed in clusters to create periods of faster and slower

translation48. It has been found that optimal codons are essential for maintaining the

fidelity of translation at structurally sensitive sites where a slowdown can result in

mistranslation leading to misfolding84. It was also expected that non-optimal codons would

be enriched in interdomain linker regions to slow down translation and facilitate co-

translational domain folding. It has been seen from single protein studies, for example,

that mutating optimal codons to non-optimal codons downstream of an N-terminal domain

makes the designed protein YKB fold with increased efficiency85. A study identified a

rare codon cluster downstream of a domain of the SufI protein whose folding efficiency was

perturbed when the rare codons were mutated to common codons81. Clusters of non-optimal

codons were found to be present between secondary structural motifs within structural

domains86. Ribo-Seq data has demonstrated that there is a slowdown of translation in

interdomain linkers41 but no systematic enrichment of non-optimal codons was observed

across interdomain linkers in a large-scale analysis of 121, 120 and 51 multi-domain

proteins in E. coli, H. sapiens and S. cerevisiae, respectively87. This indicates that there are molecular

factors which need to be identified that are influencing translation and causing a

slowdown in interdomain linkers.

1.8 Objectives of dissertation

This introduction highlights the importance of translation in regulating gene expression,

its influence on downstream co-translational processes and current challenges


concerning the analysis of Ribo-Seq data. This dissertation aims to address some of

these challenges so that Ribo-Seq data can be efficiently modeled to extract absolute

codon translation rates. This dissertation also aims to find novel insights that can be

gained from analysis of Ribo-Seq to understand the molecular origin of variability in

translation rates as well as any coordination with co-translational processes.

In Chapter 2, I describe a method to accurately identify the A-site within ribosome-

protected fragments. This method implements a probabilistic approach and utilizes the

fundamental feature of translation that the A-site of a ribosome can occupy only the region

between the second and stop codons of a transcript. It overcomes the limitations of the

heuristic approaches used by other methods and can be applied to a wider range of

fragment sizes. The utility of this approach is demonstrated by the greater accuracy of the

method in comparison to contemporary methods.
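The biological constraint named above, that the A-site can only lie between the second codon and the stop codon, can be illustrated with a deliberately simplified brute-force search. This sketch is not the Integer Programming method developed in Chapter 2: it merely counts, for each candidate offset, how many reads would place the A-site inside the allowed region and keeps the best-scoring offset. The function names, the 0-based coordinate convention, and the synthetic reads are all assumptions made for illustration.

```python
def count_valid_reads(five_prime_ends, offset, cds_start, stop_start):
    """Count reads whose A-site nucleotide lies between the second codon and the stop codon.

    cds_start: coordinate of the first nucleotide of the start codon.
    stop_start: coordinate of the first nucleotide of the stop codon.
    """
    first_allowed = cds_start + 3   # first nucleotide of the second codon
    last_allowed = stop_start + 2   # last nucleotide of the stop codon
    return sum(first_allowed <= pos + offset <= last_allowed
               for pos in five_prime_ends)


def best_offset(five_prime_ends, candidate_offsets, cds_start, stop_start):
    """Pick the candidate offset that places the most reads in the allowed region."""
    return max(candidate_offsets,
               key=lambda off: count_valid_reads(five_prime_ends, off,
                                                 cds_start, stop_start))


# Synthetic reads: one per allowed A-site codon, generated with a true offset of 15.
cds_start, stop_start = 50, 140
reads = [a_site - 15 for a_site in range(cds_start + 3, stop_start + 1, 3)]
chosen = best_offset(reads, [12, 15, 18], cds_start, stop_start)
```

With these synthetic reads, offsets that are too small push one read onto the start codon and offsets that are too large push one read past the stop codon, so only the true offset keeps every read inside the allowed region.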

In Chapter 3, I present a method that uses a chemical kinetic model to derive an

equation for calculating codon translation rates of individual codons within an mRNA

transcript from Ribo-Seq data. This is fundamentally different from other analysis

methods of Ribo-Seq data35,39,42,43,88 that build their models of translation assuming a

constant elongation rate for a particular codon type. A chemical kinetic model of

translation will accurately capture the codon translation rates at each codon position with

minimal assumptions and hence is more likely to accurately quantify the role of translation

kinetics in influencing co-translational processes.

In Chapter 4, I describe an analysis of Ribo-Seq data that proposes a novel

molecular factor influencing the translation rate. This analysis demonstrates that the

chemical identity of the amino acid pairs in the P- and A-sites of the ribosome can

influence the codon translation rate and predicts that mutating the P-site amino acid will

lead to either a speedup or a slowdown of translation. This prediction is tested

experimentally for 12 amino acid pairs and all 12 mutations result in a change in speed in

the expected direction. I also demonstrate that evolution selects for fast-translating pairs

relative to slow-translating pairs potentially to increase the efficiency of protein

production. However local enrichment of slow-translating pairs is observed in interdomain

linkers which can potentially explain the slowdown observed downstream of domain

regions but could not be attributed to enrichment of non-optimal codons. Identification of

amino acid pairs adds another feature of nascent chain mediated regulation of translation

further explaining the origin of variability in translation rates.


In Chapter 5, I demonstrate that the co-translational binding of the Hsp70

chaperone Ssb is coordinated with faster translation by the ribosome. The binding of Ssb

to nascent polypeptides is profiled using a method called Selective Ribosome

Profiling89, a variant of Ribo-Seq in which chaperone-bound ribosome-nascent chain

complexes are selected for ribosome profiling. I also describe how faster translation is

encoded within the mRNA, with molecular factors affecting translation rate enriched in a

fashion that accelerates translation by the ribosome during periods of Ssb binding.

Finally, in Chapter 6, I summarize the findings from the studies presented in this

dissertation and their implication for studying the effect of synonymous mutations on

functional protein production and their role in diseases.


Chapter 2

IDENTIFYING A- AND P-SITE LOCATIONS ON RIBOSOME-PROTECTED MRNA

FRAGMENTS USING INTEGER PROGRAMMING

The research presented in this chapter has been published as a research article in

Scientific Reports titled “Identifying A- and P-site locations on ribosome-protected mRNA

fragments using Integer Programming” by Nabeel Ahmed*, Pietro Sormanni*, Prajwal

Ciryam, Michele Vendruscolo, Christopher M. Dobson and Edward P O’Brien (* denotes

co-first authors). The author contributions are stated below: “P.S., P.C. and E.P.O.

conceived the study. N.A., P.S. and E.P.O. designed the computational analyses. P.C.,

M.V., C.M.D. contributed to design of the computational analyses. N.A. and P.S. analyzed

the data. N.A. and E.P.O. wrote the manuscript. All authors reviewed and commented on

the manuscript.” This chapter is being reproduced from the above publication under Open

Access Creative Commons Attribution 4.0 International License (CC BY).

2.1 Abstract

Identifying the A- and P-site locations on ribosome-protected mRNA fragments from Ribo-

Seq experiments is a fundamental step in the quantitative analysis of transcriptome-wide

translation properties at the codon level. Many analyses of Ribo-Seq data have utilized

heuristic approaches applied to a narrow range of fragment sizes to identify the A-site. In

this study, we use Integer Programming to identify the A-site by maximizing an objective

function that reflects the fact that the ribosome’s A-site on ribosome-protected fragments

must reside between the second and stop codons of an mRNA. This identifies the A-site

location as a function of the fragment’s size and its 5′ end reading frame in Ribo-Seq data

generated from S. cerevisiae and mouse embryonic stem cells. The correctness of the

identified A-site locations is demonstrated by showing that this method, as compared to

others, yields the largest ribosome density at established stalling sites. By providing

greater accuracy and utilization of a wider range of fragment sizes, our approach

increases the signal-to-noise ratio of underlying biological signals associated with

translation elongation at the codon length scale.

2.2 Introduction

Translation is a fundamental cellular process and an important step of gene expression

resulting in the production of proteins in cells90. In the past decade the advent of Ribo-Seq

(also known as Ribosome profiling), a high-throughput Next-Generation Sequencing


method25,91, has enabled the transcriptome-wide study of translation. Ribo-Seq involves

rapidly halting translation in cells through the use of antibiotics or flash freezing followed

by cell lysis and then digestion of the lysate using an RNase enzyme26. The resulting pool

of ribosome-protected mRNA fragments is then amplified and sequenced. The number

and length of mRNA fragments that map to the coding sequences (CDSs) of transcripts is

a function of the location and number of ribosomes that were sitting at a particular position

on different copies of the same transcript. Where the ribosome’s A- and P-sites were

located on a fragment during the digestion step is not known a priori, so additional information

and assumptions must be introduced to estimate their locations. Since translation occurs

at the A- and P-sites, the identification of these sites is critical to address translation-

related questions. If the A- and P-sites are not accurately identified, then systematic or

random error can diminish the statistical power of any underlying biological signal that

might exist. The identification of the A- and P-sites within ribosome footprints is therefore

fundamental to quantitatively understanding translation at the codon length scale.

Because of the importance of this assignment problem, a number of methods for

identifying the A- and P-sites have been created25,29–34,37,38,92. Many of these approaches

utilize the biological fact that only the P-site is permitted to occupy the start codon during

translation initiation and only the A-site is permitted to occupy the stop codon during

termination. Using such approaches, the A-site location in S. cerevisiae Ribo-Seq

datasets, for example, has been estimated to be 15 nt from the 5′ end of ribosome-

protected mRNA fragments of size 28 nt25,40; 16 nt for fragment size 29 nt40; 15 nts from

the 5′ end of fragments that are 30 nt in length35 and frame-specific offsets of 14 to 17 nts

from the 5′ end for fragments between 28 and 30 nt in length29,41. The P-site location offset

is 3 nt prior to the A-site. Similarly, in mouse embryonic stem cells (mESCs), such

approaches have yielded specific offsets for different fragment lengths33.

Here, we utilize the fundamental biological fact that the A-site on ribosome-protected

fragments must reside within the CDS of a gene under normal growth conditions. We use

this fact to create an objective function that, when maximized, identifies where the

ribosome’s A- and P-sites are most likely to be located on a ribosome-protected mRNA

fragment. We apply our method to S. cerevisiae and mESCs Ribo-Seq datasets and show

that, compared to other methods, our approach has greater accuracy and statistical power

in identifying A- and P-site locations and assigning read density.


2.3 Results

2.3.1 Integer Programming Algorithm

In the analysis of Ribo-Seq data, mRNA fragments are initially aligned onto the reference

transcriptome and their location is reported with respect to their 5′ end. This means that

one fragment will contribute one read that is reported on the genome coordinate to which

the 5′ end nucleotide of the fragment is aligned (Figure 2.1A). In Ribo-Seq data, fragments

of different lengths are observed that can arise from incomplete digestion of RNA and from

the stochastic nature of mRNA cleavage by the RNase used in the experiment (Figures

2.2 and A.1). A central challenge in quantitatively analyzing Ribo-Seq data is to identify

from these Ribo-Seq reads where the A- and P-sites were located at the time of digestion.

It is non-trivial to do this since incomplete digestion and stochastic cleavage can occur at

both ends of the fragment. For example, mRNA digestion resulting in a fragment of size

29 nt can occur in different ways, two of which are illustrated in Figure 2.1B. The quantity

that we need to accurately estimate is the number of nucleotides that separate the codon

in the A-site from the 5′ end of the fragment, which we refer to as the offset and denote ∆.

Knowing ∆ determines the position of the A-site as well as the P-site since the P-site will

always be at ∆ minus 3 nt.
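
The relationship between the 5′ end, the offset and the two sites can be written out explicitly. The following is a minimal sketch rather than the implementation used in this work; the function name `site_positions` and the use of plain integer transcript coordinates are assumptions made for illustration.

```python
def site_positions(five_prime_pos, delta):
    """Return (a_site, p_site) nucleotide positions for one fragment.

    five_prime_pos: transcript coordinate of the fragment's 5' end nucleotide.
    delta: offset in nucleotides from the 5' end to the A-site codon.
    """
    a_site = five_prime_pos + delta
    p_site = a_site - 3  # the P-site codon sits one codon (3 nt) upstream
    return a_site, p_site


# A fragment whose 5' end maps to nucleotide 100, shifted by an offset of 15 nt
print(site_positions(100, 15))
```

Only ∆ needs to be estimated from the data; the P-site position follows from it by construction.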

Our solution to this problem relies on the biological fact that for canonical transcripts

with no upstream translation the A-site of actively translating ribosomes must be located

between the second codon and the stop codon of the CDS3. Therefore, the optimal offset

value ∆ for fragments of a particular size (𝑆) and reading frame (𝐹) is the one that

maximizes the total number of reads 𝑇(∆|𝑖, 𝑆, 𝐹) between these codons for each gene 𝑖 onto

which the fragments map. The size of an mRNA fragment 𝑆 is measured in

nucleotides, and the frame 𝐹 has values of 0, 1 or 2 as defined by the gene start codon

ATG and corresponds to the frame in which the 5′ end nucleotide of the fragment is located

(Figure 2.1A). The 5′ end frame 𝐹 is a result of RNase digestion and it is distinct from the

reading frame of the ribosome that is typically translating in-frame (frame 0 of A-site). In

other words, for each combination of (𝑆, 𝐹) we shift the 5′ aligned read profile by 3

nucleotides at a time (to preserve the reading frame 𝐹) until we identify the value ∆ that

maximizes the reads between second and stop codon (Figure 2.1C, see next sub-section).

This procedure is carried out systematically for each fragment size 𝑆 and reading frame 𝐹

separately, as each may have (and we find some have) a different optimal ∆.


This concept can be expressed in terms of Integer Programming93, a mathematical

optimization procedure in which an objective function is maximized subject to integer and

linear restraints. With ∆ as the integer variable to optimize, the objective function in this

case is $T(\Delta \mid i, S, F) = \sum_{j=4}^{N_{C,i}} R(j, \Delta \mid i, S, F)$, where $N_{C,i}$ is the number of nucleotides in the

CDS of gene 𝑖 and 𝑅(𝑗, ∆|𝑖, 𝑆, 𝐹) is the number of reads from fragments of size 𝑆 and frame

𝐹 mapped onto gene 𝑖 whose 5′ end is at nucleotide position 𝑗 on the CDS after being

shifted along the transcript by ∆ nucleotides. The optimal ∆, denoted ∆′, for a given (𝑆, 𝐹)

for gene 𝑖 is determined as max{𝑇(∆|𝑖, 𝑆, 𝐹)} subject to the constraints (i) that 0 ≤ ∆ ≤ 𝑆,

and (ii) that ∆ mod 3 = 0. Constraint (i) enforces the requirement that the A-site

is located between the first and last nucleotide of the fragment of size 𝑆 nts. Constraint (ii)

maintains the frame of the 5′-most nucleotide of the fragment as the Ribo-Seq reads are

shifted by an amount ∆. We enforce Constraint (ii) because we are interested in the

assignment of reads to the A-site at the resolution of a codon, not an individual nucleotide.

If we did not enforce constraint (ii), our algorithm would simply yield equal 𝑇(∆|𝑖, 𝑆, 𝐹)

scores for the two other values of ∆ that would still map the reads onto the A-site codon,

but in the two frames in which the 5′ end does not lie. Therefore, to simplify the determination

of offsets we implemented constraint (ii). Thus, by maximizing 𝑇(∆|𝑖, 𝑆, 𝐹) for the CDS of

each gene in a data set of 𝑁𝑔 genes, we will obtain a set of 𝑁𝑔 values of ∆′. From this

distribution of ∆′ values, the A-site location corresponds to the most probable ∆′ value.
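
The optimization described above can be sketched for a single gene and a single (𝑆, 𝐹) combination as follows. This is an illustrative brute-force scan, not the published implementation: the names `score` and `best_offset`, the dictionary `reads_5p` (mapping a 5′ end position to its read count) and the convention that position 1 is the A of the start codon are all assumptions. Because ∆ can take only a handful of integer values, enumerating them is enough to maximize the objective.

```python
def score(reads_5p, delta, cds_len):
    """T(delta): reads landing between nt 4 and the last CDS nt after the shift."""
    return sum(n for pos, n in reads_5p.items() if 4 <= pos + delta <= cds_len)


def best_offset(reads_5p, frag_size, cds_len):
    """Scan delta = 0, 3, 6, ... <= frag_size (constraints i and ii).

    Returns (best_delta, scores). Ties go to the smallest delta, which is an
    arbitrary choice made for this sketch.
    """
    scores = {d: score(reads_5p, d, cds_len) for d in range(0, frag_size + 1, 3)}
    best = max(scores, key=scores.get)
    return best, scores


# Toy gene: 60 nt CDS, 28 nt fragments whose 5' ends all share one frame
toy = {-11: 9, -8: 4, 1: 5, 10: 6, 40: 3, 43: 2}
best, scores = best_offset(toy, frag_size=28, cds_len=60)
print(best, scores[best])
```

In this toy profile a shift of 15 nt is the only one that places every read between the second codon and the stop codon, so it attains the maximum score.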

While identifying the ∆′ value for each gene in our data set, we also minimize the

occurrence of false positives by ensuring that the highest score, 𝑇(∆′|𝑖, 𝑆, 𝐹), is significantly

higher than the next highest score, 𝑇(∆′′|𝑖, 𝑆, 𝐹), which occurs at a different offset ∆′′. If

the difference between the top two scores is less than the average number of reads per

codon, we apply the following additional selection criteria. To choose between ∆′ and ∆′′,

we select the one that yields a number of reads at the start codon that is less than one-fifth

of the average number of reads at the second, third and fourth codons. We further

require that the second codon has a greater number of reads than the third codon. The

biological basis for these additional criteria is that the true offset (i.e., the actual location

of the A-site) cannot be located at the start codon, and that the number of reads at the

second codon should be higher on average than the third codon due to contributions from

the initiation step of translation, during which the ribosome is assembling on the mRNA

with the start codon in the P-site. Below, we demonstrate that the results from our method

are largely robust to changes in these thresholds.
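
The secondary selection step can be sketched as below. This is a hedged illustration under several assumptions that are not taken from the text: the function names, the 1-based position convention (position 1 is the A of the start codon), the definition of the average reads per codon as total reads divided by the number of CDS codons, and the decision to report a gene as ambiguous (`None`) when both or neither of the top two offsets pass. The start-codon criterion uses the one-fifth threshold as stated in the caption of Figure 2.1.

```python
def score(reads_5p, delta, cds_len):
    """T(delta): reads landing between nt 4 and the last CDS nt after the shift."""
    return sum(n for pos, n in reads_5p.items() if 4 <= pos + delta <= cds_len)


def codon_reads(reads_5p, delta, codon):
    """Reads assigned to a 1-based codon index after shifting 5' ends by delta."""
    first_nt = 3 * (codon - 1) + 1
    return sum(n for pos, n in reads_5p.items()
               if first_nt <= pos + delta < first_nt + 3)


def passes_secondary(reads_5p, delta, fraction=0.2):
    """Start codon nearly empty, and codon 2 busier than codon 3."""
    start = codon_reads(reads_5p, delta, 1)
    c2, c3, c4 = (codon_reads(reads_5p, delta, c) for c in (2, 3, 4))
    return start < fraction * (c2 + c3 + c4) / 3 and c2 > c3


def resolve(reads_5p, frag_size, cds_len, fraction=0.2):
    """Return the chosen offset for one gene, or None if it stays ambiguous."""
    scores = {d: score(reads_5p, d, cds_len) for d in range(0, frag_size + 1, 3)}
    top, second = sorted(scores, key=scores.get, reverse=True)[:2]
    avg_per_codon = sum(reads_5p.values()) / (cds_len // 3)
    if scores[top] - scores[second] >= avg_per_codon:
        return top  # clear winner, no secondary check needed
    passing = [d for d in (top, second) if passes_secondary(reads_5p, d, fraction)]
    return passing[0] if len(passing) == 1 else None


# Toy gene in which offsets 15 and 18 score within one read of each other
toy = {-11: 10, -8: 4, -5: 6, 10: 8, 40: 2, 43: 1}
print(resolve(toy, frag_size=28, cds_len=60))
```

In the toy profile, an offset of 18 nt leaves the second codon empty while the third codon carries ten reads, so only the 15 nt offset satisfies both criteria.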


Figure 2.1. The A-site location can be defined as an offset from the 5′ end of ribosome-

protected fragments. (A) A schematic representation of a translating ribosome (top drawing)

and of the offset ∆ between the Ribo-Seq reads mapped with respect to the 5′ end of the footprints

and centered on the A-site (blue bars). The ribosome is shown protecting a 28 nt fragment with

its 5′ end in reading frame 0, as defined from the ATG start codon of the gene. The E-, P- and A-

sites within the ribosome are indicated. The reads are then shifted from the 5′ end to the A-site

by the offset value ∆. (B) Stochastic nuclease digestion can result in different fragments. The two

most probable variants of a 29 nt footprint with the 5′ end in frame 1 are shown with their

boundaries mapped by dotted lines aligning to the genome which can result in offsets of 15 nt

(top) and 18 nt (bottom), respectively. (C) To illustrate the application of the Integer Programming

algorithm, consider a hypothetical transcript that is 60 nt in length. The first panel shows the

ribosome profile originating from reads assigned to the 5′ end of fragments of size 33 in frame 0.

The start and the stop codon are indicated while the rest of the CDS region is colored light peach.

The algorithm shifts this ribosome profile by 3 nt and calculates the objective function 𝑇(𝛥|𝑖, 𝑆, 𝐹).

The extent of the shift is the offset Δ. Values of 𝑇(𝛥|𝑖, 𝑆, 𝐹) for Δ = 12, 15, 18, 21 nts are indicated.

In this example, the average number of reads per codon is 7.85. The difference between the top

two offsets, 18 (𝑇  = 222) and 15 (𝑇  = 215), is less than the average. Hence, we check the

secondary criteria (Results). Offset 18 meets the criteria that the number of reads in the start

codon is less than one-fifth of the average number of reads in the second, third and fourth codons and also

that the number of reads in the second codon is greater than that in the third codon. Hence, Δ = 18 nt

is the optimal offset for this transcript.


Figure 2.2. mRNA fragment size distribution for S. cerevisiae Ribo-Seq dataset from Pop

and co-workers (A) and the Pooled dataset (B).

2.3.2 Illustrating the Integer Programming optimization procedure

To illustrate this Integer Programming algorithm in action we provide an example using

the hypothetical mRNA shown in Figure 2.1C. The algorithm is as follows: First, for gene 𝑖,

consider 𝑅(𝑗, ∆= 0|𝑖, 𝑆, 𝐹) composed of those fragments of size 𝑆 (= [20,21, … ,35] nt) and

whose 5′ end has been aligned to reading frame 𝐹 (= 0, 1 or 2). Second, for this ribosome

profile, determine the ∆ that maximizes 𝑇(∆|𝑖, 𝑆, 𝐹). Do this by starting from the 5′-end-

aligned ribosome profile (∆=0) and shift it three nucleotides at a time (i.e., obey Constraint

(ii) described previously) towards the 3′ end of the transcript such that ∆ = 0, 3, 6, 9, … , ≤ 𝑆.

At each value of ∆, calculate 𝑇(∆|𝑖, 𝑆, 𝐹) and record its value. Third, after all ∆ values have

been tested, the ∆ that maximizes 𝑇(∆|𝑖, 𝑆, 𝐹) is denoted ∆′, which is the putative location

of the A-site relative to 5′ end of fragments of size 𝑆 and frame 𝐹 for gene 𝑖. Check if the

secondary-selection criteria are required and apply them when the scores for the top two

offsets differ by less than the average number of reads per codon in the mRNA. Finally,

repeat these steps for every fragment size between 20-35 nts in length and every reading

frame. Thus, for one gene, this procedure yields 48 (=16x3) independent values for ∆′,

one for each fragment size and frame combination.

The fragment-size and frame distributions of ribosome-protected fragments (Figure

2.2) in S. cerevisiae are not gene dependent (Figure A.2), and therefore, neither should

the offset values be gene dependent. Thus, the location of the A-site, relative to the 5′

end of a fragment of size 𝑆 and frame 𝐹, corresponds to the most probable value of the

offset across all the genes in the dataset.


2.3.3 A-site locations in S. cerevisiae Ribo-Seq data are fragment size and frame

dependent

We first applied the Integer Programming method to Ribo-Seq data from S. cerevisiae

published by Pop and co-workers39. For each combination of 𝑆 and 𝐹 we first identified

those genes that have at least 1 read per codon on average in their corresponding

ribosome profile. The number of genes meeting this criterion is reported in Table A.1. We

then applied the Integer Programming method to this subset of genes. The resulting

distributions of ∆ values are shown in Figure 2.3A for different combinations of fragment

length and frame. We only show results for fragment sizes between 27 and 33 nt because

greater than 90% of reads map to this range (Figure 2.2A). The most probable offset value

for all fragment sizes between 20 to 35 nt is reported as an offset table (Table A.2).
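
The gene selection step described above (at least 1 read per codon on average for a given 𝑆 and 𝐹) amounts to a simple filter. The sketch below assumes a hypothetical input layout, a dictionary from gene name to a list of per-codon read counts for one (𝑆, 𝐹) combination; the function name and the gene identifiers are placeholders.

```python
def well_covered(gene_profiles, min_avg=1.0):
    """Keep genes whose profile averages at least min_avg reads per codon."""
    return {
        gene: profile
        for gene, profile in gene_profiles.items()
        if profile and sum(profile) / len(profile) >= min_avg
    }


# Hypothetical per-codon read counts for one (S, F) combination
profiles = {"geneA": [0, 3, 2, 1], "geneB": [0, 0, 1, 0], "geneC": []}
print(sorted(well_covered(profiles)))
```

Raising `min_avg` reproduces the stricter coverage cut-offs that are examined later in this chapter.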

We see that the optimal ∆ value - that is, the A-site location - changes for different

combinations of 𝑆 and 𝐹, with the most probable values either at 15 or 18 nt. Thus, the

location of the A-site depends on 𝑆 and 𝐹. In most cases, there is one dominant peak for

a given pair of 𝑆 and 𝐹 values. For example, for fragments of size 27 through 30 nt in

frame 0, greater than 70% of their per-gene optimized ∆ values are 15 nt from the 5′ end

of these fragments. Similar results are found for other combinations such as sizes 30, 31

and 32 nt in frame 1 and 28 through 32 nt in frame 2, where optimized ∆ values are 18 nt.

Thus, across the transcriptome, the A-site codon position on these fragments is uniquely

identified.

There are, however, 𝑆 and 𝐹 combinations that have ambiguous A-site locations

based on these distributions. For example, for fragments of size 27 nt in frame 1, 47% of

the gene-optimized ∆ values are at 15 nt while 30% are at 18 nt. Similar results are

observed for fragments 28 and 29 nt in frame 1, and 31 and 32 nt in frame 0. Thus, for

these 𝑆 and 𝐹 combinations there is a similar probability of the A-site being located at one

codon or another, and therefore we cannot uniquely identify the A-site’s location.
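
Turning the per-gene distribution of ∆′ values into a unique or ambiguous call can be sketched as follows. The 70% cut-off is the threshold used in Figure 2.3; the function name, the tuple return format and the made-up split of the remaining genes in the example are assumptions for illustration only.

```python
from collections import Counter


def call_offset(per_gene_offsets, threshold=0.70):
    """Return (best,) when the offset is unique, else (best, runner_up)."""
    if not per_gene_offsets:
        raise ValueError("no genes passed the coverage filter")
    ranked = Counter(per_gene_offsets).most_common(2)
    best, n_best = ranked[0]
    if n_best / len(per_gene_offsets) > threshold or len(ranked) == 1:
        return (best,)
    return (best, ranked[1][0])


# Mimics fragments of size 27 in frame 1: 47% of genes at 15 nt, 30% at 18 nt
# (the remaining 23% are placed at 12 nt purely for the sake of the example)
print(call_offset([15] * 47 + [18] * 30 + [12] * 23))
# Mimics the same combination in the Pooled dataset: 72% of genes at 15 nt
print(call_offset([15] * 72 + [18] * 28))
```

Returning both of the top two offsets for ambiguous cases mirrors the way such entries are listed in an offset table (for example, 15/18).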

2.3.4 Higher coverage leads to more unique offsets

We hypothesized that ambiguity in identifying the A-site for particular 𝑆 and 𝐹

combinations may be due to low coverage (i.e., poor sampling statistics). To test this

hypothesis we pooled the reads from different published Ribo-Seq datasets into a single

dataset with consequently higher coverage and more genes that meet our selection


criteria (Table A.1). Application of our method to this Pooled dataset gives unique offsets

for more 𝑆 and 𝐹 combinations compared to the original Pop dataset (Figure 2.3B and

Table A.2), consistent with our hypothesis. For example, for fragments of size 27 and

frame 1, now we have the unique offset of 15 nt with 72% of gene-optimized ∆ values at

15 nt (Figure 2.3B). However, we still see the ambiguity present for certain (𝑆, 𝐹)

combinations.

We employed an additional strategy to increase coverage by restricting our

analysis to genes with greater average reads per codon. If the hypothesis is correct, then

we should see a statistically significant trend of an increase in the most probable ∆ value

with increasing read depth. We applied this analysis to the Pooled dataset and find that

some initially ambiguous 𝑆 and 𝐹 combinations become unambiguous as coverage

increases. For example, at an average of 1 read per codon, (𝑆, 𝐹) combinations of (25, 0),

(27, 2) and (30, 1) are ambiguous as they fall below our 70% threshold. However, we see

a statistically significant trend (𝑠𝑙𝑜𝑝𝑒 = 0.5, 𝑝 = 3.94 × 10⁻⁶) for fragments of (25, 0) that

the 15 nt offset becomes more probable upon increasing the coverage, eventually

crossing the 70% threshold (Figure 2.4A). Similarly, for (27, 2) (𝑠𝑙𝑜𝑝𝑒 = 0.58, 𝑝 =

5.77 × 10⁻⁵) and (30, 1) (𝑠𝑙𝑜𝑝𝑒 = 0.25, 𝑝 = 0.009) there is a trend towards an offset of 18

nt, with more than 70% of genes having this offset at the highest coverage (Figures 2.4B,

C). Hence, for these fragments, increasing coverage uniquely identifies ∆′ and hence the

A-site location. For a few combinations of (𝑆, 𝐹), like (32, 0), the ambiguity is not resolved

even upon very high coverage (Figure 2.4D), which we speculate may be due to inherent

features of nuclease digestion being equally likely for more than one offset.

Figure 2.3. Distribution of offset values from the Integer Programming algorithm applied to

transcripts from S. cerevisiae. The data plotted in (A) are from the Pop dataset, and (B) the Pooled

dataset. The distributions are plotted as a function of the offset value for fragment sizes

of 27 to 33 nt and are shown, from left to right, for frames 0, 1 and 2. For a given fragment size and

frame, the A-site location is at the most probable Δ value in the distribution, provided the offset

occurs for more than 70% of the genes (dashed lines in panels). Error bars represent 95%

Confidence intervals calculated using Bootstrapping. Sample sizes are reported in Table A.1.

Figure 2.4. Increasing coverage identifies A-site locations for 𝑺 and 𝑭 combinations that

were initially ambiguous. Plotted is the percentage of transcripts with a particular Δ value for

different S and F combinations from the Pooled dataset of S. cerevisiae. In each panel, multiple

distributions are plotted corresponding to transcripts with increasing coverage, indicated by the

legend at the bottom. For example, the distributions in blue and red arise from transcripts with,

respectively, at least 1 or 2 reads per codon on average. We observe the A-site location tends

towards 15 nt for S = 25, F = 0 (A) and towards 18 nt for S = 27, F = 2 (B), and S = 30, F = 1 (C).

For S = 32, F = 0 (D), there is no trend even at higher coverage. Note that for S = 27, F = 2 (panel

B), there are fewer than 10 genes with an average greater than 50 reads per codon and hence

we do not include data points beyond an average of 45 reads per codon (see

Methods). Error bars represent 95% Confidence intervals calculated using Bootstrapping.

Thus, high enough coverage yields the optimal offset table represented in Table

2.1, where the offset is the most probable location of the A-site relative to the 5′ end of the

mRNA fragments generated in S. cerevisiae.
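
The coverage analysis above can be sketched as two small steps: compute, for each coverage cut-off, the percentage of genes whose ∆′ equals a given offset, and then fit a least-squares slope to those percentages. The sketch is illustrative only. The function names and the input layout (a list of `(average reads per codon, ∆′)` pairs) are assumptions, the 10-gene minimum follows the note in the caption of Figure 2.4, and the significance test that yields the reported p-values is not reproduced here.

```python
def percent_with_offset(genes, offset, min_cov, min_genes=10):
    """Percentage of genes (avg coverage >= min_cov) whose offset equals `offset`.

    Returns None when fewer than min_genes genes pass the coverage cut-off.
    """
    chosen = [d for cov, d in genes if cov >= min_cov]
    if len(chosen) < min_genes:
        return None
    return 100.0 * sum(d == offset for d in chosen) / len(chosen)


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


# Hypothetical genes: low-coverage genes are noisy, high-coverage genes agree
genes = [(1, 18)] * 5 + [(5, 15)] * 10
cutoffs = [1, 5]
percents = [percent_with_offset(genes, 15, c) for c in cutoffs]
print(percents, slope(cutoffs, percents))
```

A positive slope indicates that the offset becomes more probable as coverage increases, which is the qualitative pattern reported for (25, 0), (27, 2) and (30, 1).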

Table 2.1. A-site locations (nucleotide offsets from 5′ end) determined by applying the

Integer Programming algorithm to the Pooled dataset in S. cerevisiae are shown as a

function of fragment size and frame. The top two offset values are listed for those 𝑆 and 𝐹

combinations in which the A-site location could not be uniquely determined. For unique offsets, the

most-probable offset value is listed.

Fragment size (nt)    Frame 0    Frame 1    Frame 2
24                    15         15/12      18/12
25                    15         12/15      18
26                    15/12      18/15      18/15
27                    15         15         18
28                    15         15         18
29                    15         15/18      18
30                    15         18         18
31                    15         18         18
32                    18/15      18         18
33                    18         18         18
34                    18         18         18/21


2.3.5 Consistency across different datasets

Ribo-Seq data is sensitive to experimental protocols that can introduce biases in the

digestion and ligation of ribosome-protected fragments. Pooling datasets together offers

the advantage of higher coverage but it may mask the biases specific to an individual

dataset. To determine whether our unique offsets (Table 2.1) are consistent with results

from individual data sets we applied the Integer Programming algorithm to each individual

dataset. Most of these datasets have low coverage resulting in fewer genes meeting our

filtering criteria. For each unique offset in Table 2.1, we classify it as consistent with an

individual data set provided that the most probable offset from the individual dataset (even

if it does not reach the 70% threshold due to limitations in the depth of coverage) is the

same as in Table 2.1. We find that the vast majority of unique offsets (18 out of 20) in

Table 2.1 are consistent across 75% or more of the individual datasets (statistics reported

in Table A.3). Just two (𝑆, 𝐹) combinations, (27, 1) and (27, 2), show frequent

inconsistencies: they are inconsistent in 33% or more of the individual datasets

(Table A.3). This suggests that researchers who wish to minimize false positives should

discard these (𝑆, 𝐹) combinations when creating A-site ribosome profiles.
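
As an illustration of how an offset table is used to build an A-site ribosome profile, the sketch below applies a subset of the Table 2.1 offsets (fragment sizes 27 to 31 nt) to a list of aligned fragments. The data layout, the function name and the 0-based coordinate convention (position 0 is the A of the start codon, so the frame is the position modulo 3) are assumptions made for this example; the `strict` flag optionally drops the two (𝑆, 𝐹) combinations flagged above as inconsistent.

```python
# Subset of Table 2.1 (S. cerevisiae, Pooled dataset), indexed by frame 0, 1, 2.
# None marks a combination whose offset could not be uniquely determined.
OFFSETS = {
    27: (15, 15, 18),
    28: (15, 15, 18),
    29: (15, None, 18),
    30: (15, 18, 18),
    31: (15, 18, 18),
}
INCONSISTENT = {(27, 1), (27, 2)}  # inconsistent across individual datasets


def a_site_profile(fragments, n_codons, strict=True):
    """Count reads per A-site codon (codon 0 is the start codon).

    fragments: iterable of (five_prime_pos, size) pairs, with five_prime_pos
    0-based relative to the start codon (negative values lie in the 5' UTR).
    """
    profile = [0] * n_codons
    for pos, size in fragments:
        frame = pos % 3
        offset = OFFSETS.get(size, (None, None, None))[frame]
        if offset is None or (strict and (size, frame) in INCONSISTENT):
            continue  # ambiguous, unlisted or discarded combination
        codon = (pos + offset) // 3
        if 0 <= codon < n_codons:
            profile[codon] += 1
    return profile


reads = [(-12, 28), (-12, 28), (-11, 29), (-9, 27), (-11, 27), (-13, 30)]
print(a_site_profile(reads, n_codons=5))
print(a_site_profile(reads, n_codons=5, strict=False))
```

Fragments with an ambiguous offset are skipped rather than assigned to either candidate codon, which trades some read depth for fewer misassigned reads.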

2.3.6 Robustness of the offset table to threshold variation

The Integer Programming algorithm utilizes two thresholds to identify unique offsets. One

is that 70% of genes exhibit the most probable offset, the other, designed to minimize false

positives arising due to sampling noise in the Ribo-Seq data, is that the reads in the first

codon be less than one-fifth of the average reads in the second, third and fourth codons.

While there are good reasons to introduce these threshold criteria, the exact values of

these thresholds are arbitrary. Therefore, we tested whether varying these thresholds

changes the results reported in Table 2.1. We varied the first threshold to 60% and 80%,

and recomputed the offset table. We report whether the unique offset changed by listing

an ‘R’ or ‘S’ (for robust and sensitive, respectively) alongside the reported offset in Table

A.3. We find that two-thirds of the unique (𝑆, 𝐹) combinations do not change (Table A.3).

(𝑆, 𝐹) combinations (25, 0), (25, 2), (27,0), (27, 1), (28, 1), (31, 0), (33, 0) and

(33, 2) become ambiguous when we increased the threshold to 80%.

We varied the second, aforementioned threshold from one-fifth up to one and down

to one-tenth, and we find that all unique (𝑆, 𝐹) combinations except

(25, 2), (33,0), (33, 2) and (34, 1) remain unchanged (reported as ‘R’ in Table A.3). Thus,


in summary, in the vast majority of cases, the unique offsets reported in Table 2.1 depend

very little on specific values of these thresholds.

2.3.7 Testing the Integer Programming algorithm against artificial Ribo-Seq data

To test the correctness and robustness of our approach we generated a dataset of

simulated ribosome occupancies across 4,487 S. cerevisiae transcripts and asked

whether our method could accurately determine the A-site locations. Artificial Ribo-Seq

reads were generated from these occupancies assuming a Poissonian distribution in their

(𝑆, 𝐹) values using random footprint lengths similar to that found in experiments (see

Methods and Figures A.3A, B). We investigated the ability of our method to correctly

determine the true A-site locations for four different sets of pre-defined offset values (see

Methods). The Integer Programming algorithm was then applied to the resulting artificial

Ribo-Seq data. We find the offset table generated from the algorithm reproduces the input

offsets used (Figure A.3C and Table A.4). This procedure was repeated for different read

length distributions as well as with different input offsets and we find that the offset tables

generated by our algorithm reproduce the input offset tables in greater than 93% of all

(𝑆, 𝐹) combinations (Figures A.3B, C). The method identifies a small number of ambiguous

offsets due to the low read coverage at the tails of the distributions, a finding that

further emphasizes the importance of read coverage as a critical factor in accurately

identifying the A-site.
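
The idea behind this test, building reads from known A-site positions with known offsets and checking that the offsets can be recovered, can be illustrated with a much-simplified sketch. It is not the simulation procedure described in the Methods: the function names, the per-codon Poisson draw for every (𝑆, 𝐹) combination and the 0-based coordinates (position 0 is the A of the start codon) are assumptions chosen to keep the example short.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson variate with Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_reads(occupancy, input_offsets, rng):
    """Turn per-codon occupancies into 5'-end read counts.

    occupancy: expected reads per codon (index 0 is the start codon).
    input_offsets: {(S, F): delta}, the "true" offsets built into the data.
    Returns {(S, F, five_prime_pos): count}.
    """
    reads = {}
    for codon, lam in enumerate(occupancy):
        for (size, frame), delta in input_offsets.items():
            n = poisson(rng, lam)
            if n:
                # A-site nucleotide in the requested frame, stepped back by delta
                reads[(size, frame, 3 * codon + frame - delta)] = n
    return reads


rng = random.Random(0)
true_offsets = {(28, 0): 15, (30, 1): 18}
reads = simulate_reads([0.0, 5.0, 2.0, 0.0], true_offsets, rng)
# Shifting each 5' end by its input offset returns the read to its A-site codon
print(sorted({(pos + true_offsets[(s, f)]) // 3 for (s, f, pos) in reads}))
```

Because every simulated read carries its true offset by construction, an offset table recovered from such data can be compared directly with `true_offsets`.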

2.3.8 A-site offsets in mouse embryonic stem cells

The biological fact that the A-site of a ribosome resides only between the second and stop

codon is not limited to S. cerevisiae and hence the Integer Programming algorithm should

be applicable to Ribo-Seq data from any organism. Therefore, we applied our method to

a Pooled Ribo-Seq dataset of mouse embryonic stem cells (mESCs). The resulting A-site

offset table exhibited ambiguous offsets at all but three (𝑆, 𝐹) combinations (Table A.5). In

mESCs there is widespread translation elongation that occurs beyond the boundaries of

annotated CDS regions in upstream open reading frames (uORFs)94. Enrichment of

ribosome-protected fragments from these translating uORFs can make it difficult for our

algorithm to find unique offsets because they can contribute reads around the start codon

of canonical annotated CDSs. Therefore, we hypothesized that if we apply our algorithm

to only those transcripts devoid of uORFs and possessing a single initiation site then our

algorithm should identify more unique offsets. Ingolia and co-workers33 have

experimentally identified, for well-translated mESC transcripts, the number of initiation sites

and whether uORFs are present. Therefore, we selected those genes that have only one

translation initiation site near the annotated start codon and further restricted our analysis

to transcripts with a single isoform, as multiple isoforms can have different termination

sites.

Application of the Integer Programming algorithm to this set of genes increases the

number of unique offsets from 3 to 13 (𝑆, 𝐹) combinations (Table A.6). Applying the same

robustness and consistency tests as we did in S. cerevisiae reveals that 77% of the unique

offsets are robust to threshold variation, and a similar percentage is consistent across both

individual datasets used to create the Pooled data (Table A.6). Thus, the unique offsets

we report for mESCs are robust and consistent in the vast majority of datasets. This result

also indicates that successful identification of A-site locations requires analyzing only

those transcripts that do not contain uORFs.

2.3.9 Integer Programming does not yield unique offsets for E. coli

As a further test of how widely we can apply our algorithm, we applied it to a Pooled Ribo-

Seq dataset from the prokaryotic organism E. coli. The number of genes meeting our filtering

criteria is reported in Table A.7. MNase, the nuclease used in the E. coli Ribo-Seq protocol,

digests mRNA in a biased manner - favoring digestion from the 5′ end over the 3′ end66,95.

Therefore, as done in other studies66,95,96, we applied our algorithm such that we identified

the A-site location as the offset from the 3′ end instead of the 5′ end. Polycistronic mRNAs

(i.e., transcripts containing multiple CDSs) can cause problems for our algorithm due to

closely spaced reads at boundaries of contiguous CDS being scored for different offsets

in both the CDSs. To avoid inaccurate results, we restrict our analysis to the 1,915

monocistronic transcripts that do not have any other transcript within 40 nt upstream or

downstream of the CDS. Based on our experience in the analysis of mESCs dataset, we

filter out transcripts with multiple translation initiation sites as well as transcripts whose

annotated initiation sites have been disputed. Nakahigashi and co-workers97 have used

tetracycline as a translation inhibitor to identify 92 transcripts in E. coli with different initiation

sites from the reference annotation and we exclude these transcripts from our analysis as

well. However, for this high coverage pooled dataset, we find ambiguous offsets for all

(𝑆, 𝐹) combinations (Table A.5). A meta-gene analysis of normalized ribosome density in

the CDS and 30 nt region upstream and downstream reveals signatures of translation

beyond the boundaries of the CDS (Figure A.4), especially a higher than average

enrichment of reads a few nucleotides before the start codon. We speculate that the base-

pairing of the Shine-Dalgarno (SD) sequence with the complementary anti-SD sequence

in 16S rRNA98 protects these few nucleotides before the start codon from ribonuclease

digestion and hence results in an enrichment of Ribo-Seq reads. Since these “pseudo”

ribosome-protected fragments cannot be differentiated from actual ribosome-protected

fragments containing a codon with the ribosome’s A-site on it, our algorithm is limited in

its application for this data.

2.3.10 Reproducing known PPX and XPP motifs that lead to translational slowdown

In S. cerevisiae65 and E. coli66,67 certain PPX and XPP polypeptide motifs (in which X

corresponds to any one of the 20 amino acids) can stall ribosomes when the third residue is

in the A-site. Elongation factors eIF5A (in S. cerevisiae) and EF-P (in E. coli) help relieve

the stalling induced by some motifs but not others65. Even in mESCs, Ingolia and co-

workers33 detected PPD and PPE as strong pausing motifs. Therefore, we examined

whether our approach can reproduce the known stalling motifs. We did this by calculating

the normalized read density at the different occurrences of a PPX and XPP motif.
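
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the published code: the function names, the toy read counts, and the choice to normalize each codon's A-site count by the transcript's mean count per codon are assumptions made for the example.

```python
def motif_third_positions(protein, motif):
    """Return 0-based codon indices of the third residue of each (possibly
    overlapping) occurrence of a three-residue motif such as 'PPG'."""
    return [i + 2 for i in range(len(protein) - 2) if protein[i:i + 3] == motif]


def normalized_density(codon_counts):
    """Divide each codon's A-site read count by the mean count per codon.
    Assumed normalization; the transcript must contain at least one read."""
    mean = sum(codon_counts) / len(codon_counts)
    return [count / mean for count in codon_counts]


def motif_densities(protein, codon_counts, motif):
    """Normalized read density at the third residue of every motif instance."""
    norm = normalized_density(codon_counts)
    return [norm[i] for i in motif_third_positions(protein, motif)]


# Toy transcript (hypothetical counts): two PPG instances in an 8-codon CDS.
example = motif_densities("MPPGAPPG", [1, 1, 1, 4, 1, 1, 1, 6], "PPG")
```

Collecting these values over all transcripts gives, for each motif, the distribution whose median is examined in the analysis.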

In S. cerevisiae, we observed large ribosome densities at PPG, PPD, PPE and PPN

(Figure 2.5A), all of which were classified as strong stallers in S. cerevisiae65 and also in

E. coli67. In contrast, there is no stalling, on average, at PPP, consistent with other

studies65. This is most likely due to the action of eIF5A. For the XPP motifs, the strongest

stalling was observed for GPP and DPP motifs, which is consistent with the results in S.

cerevisiae and in E. coli (Figure 2.5B). In mESCs, we see the strongest stalling at PPE

and PPD, reproducing the results of Ingolia and co-workers33 (Figure A.5A). For XPP

motifs, we observed very weak stalling only for DPP (Figure A.5B). Thus, our approach to

map the A-site on ribosome footprints enables the accurate detection of established

translation pausing at particular PPX and XPP nascent polypeptide motifs.

A study of Ribo-Seq data of mammalian cells99 observed a sequence-independent

translation pause when the 5th codon of the transcript is in the P-site. This post-initiation

pausing was also observed in an in vitro study of poly-phenylalanine synthesis where

stalling was observed when the 4th codon was in the P-site100. With the A-site profiles

obtained using our offset tables for S. cerevisiae and mESCs, we also observe these

pausing events when both the 4th and 5th codons are at the P-site (Figure A.6).

2.3.11 Greater A-site location accuracy than other methods

There is no independent experimental method to verify the accuracy of identified A-site

locations using our method or any other method26,27,29–32,37,38,42,89,92,101–103. We argue that

the well-established ribosome pausing at particular PPX sequence motifs is the best

available means to differentiate the accuracy of existing methods. The reason for this is

that these stalling motifs have been identified in E. coli61,62 and S. cerevisiae104 through

orthogonal experimental methods (including enzymology studies and toe printing), and the

exact location of the A-site during such a slowdown is known to be at the codon encoding

the third residue of the motif62. Thus, the most accurate A-site identification method will be

the one that most frequently assigns greater ribosome density to X at each occurrence of

the PPX motif.

Figure 2.5. Several PPX and XPP motifs lead to ribosomal stalling in S. cerevisiae. The

median normalized ribosome density is obtained for all instances of (A) PPX and (B) XPP motifs

in which X corresponds to any one of the 20 naturally occurring amino acids. Using a permutation

test, we determine if the median ribosome density is statistically significant or occurs by random

chance. Statistically significant motifs are highlighted in dark red. This analysis was carried out on

the Pop dataset for transcripts in which at least 50% of codon positions have reads mapped to

them. Error bars are 95% Confidence Intervals for the median obtained using Bootstrapping.

We applied this test to the strongest stalling PPX motifs, i.e., PPG in S. cerevisiae and

PPE in mESCs. In S. cerevisiae, the Integer Programming method yields the greatest

ribosome density at the glycine codon of PPG motif when applied to both the Pooled

(Figure 2.6A) and Pop datasets (Figure A.7A). Examining each occurrence of PPG in our

gene dataset, we find that in a majority of instances our method assigns more ribosome

density to glycine than every other method when applied to both the Pooled (Figure 2.6B,

Wilcoxon signed-rank test (𝑛 = 224), 𝑃 < 0.0005 for all methods except Hussmann (𝑃 =

0.164)) and Pop datasets (Figure A.7B, Wilcoxon signed-rank test (𝑛 = 35), 𝑃 < 10−5 for

all methods except Hussmann (𝑃 = 0.026) and Ribodeblur (𝑃 = 0.01)). The same

analysis applied to mESCs at PPE motifs shows that our method outperforms the other

nine methods (Figures 2.6C-D) with our method assigning greater ribosome density at

glutamic acid for at least 85% of the PPE motifs in our dataset as compared to all other

methods (Figure 2.6D, Wilcoxon signed-rank test (𝑛 = 104), 𝑃 < 10−15 for all methods).

Thus, for S. cerevisiae and mESCs our Integer Programming approach is more accurate

than other methods in identifying the A-site on ribosome-protected fragments.

A large number of molecular factors influence codon translation rates and ribosome

density along transcripts11. One factor is the cognate tRNA concentration, as codons

decoded by cognate tRNA with higher concentrations should have on average lower

ribosome densities35,41,105. Therefore, as an additional qualitative test, we expect that the

most accurate A-site method will yield the largest anti-correlation between the ribosome

density at a codon and its cognate tRNA concentration. This test is only qualitative as the

correlation between codon ribosome-density and cognate tRNA concentration may be

affected by other factors, including codon usage and reuse of recharged tRNAs in the

vicinity of the ribosome106,107. Using tRNA abundances previously estimated from RNA-

Seq experiments on S. cerevisiae41, we find that our Integer Programming method yields

the largest anti-correlation compared to the eleven other methods considered (Table A.8),

further supporting the accuracy of our method. We were unable to run this test in mESCs

as measurements of tRNA concentration have not been reported in the literature.
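
The qualitative tRNA test above amounts to computing, for each A-site method, a correlation coefficient between per-codon ribosome density and cognate tRNA abundance, and then ranking the methods. The sketch below assumes a Pearson coefficient, since the text here does not state which coefficient was used, and all names and dictionaries are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def most_anticorrelated(method_densities, trna_abundance):
    """method_densities: {method: {codon: mean ribosome density}} (hypothetical layout).
    trna_abundance: {codon: cognate tRNA abundance}.
    Returns the method with the most negative correlation and all scores."""
    codons = sorted(trna_abundance)
    abundance = [trna_abundance[c] for c in codons]
    scores = {
        method: pearson_r([density[c] for c in codons], abundance)
        for method, density in method_densities.items()
    }
    return min(scores, key=scores.get), scores
```

A rank-based coefficient could be substituted for `pearson_r` without changing the rest of the comparison.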

Figure 2.6. The Integer Programming algorithm correctly assigns greater ribosome

density than other methods to the Glycine in PPG motifs in S. cerevisiae and to Glutamic

acid in PPE motifs in mESCs. (A) Normalized ribosome density obtained using the various

methods used to identify the A-site is shown for an instance of PPG motif in gene YLR375W

with G at codon position 303 in the Pooled dataset of S. cerevisiae (The legend indicates the

method and full details for each method can be found in the Methods section). (B) The fraction

of PPG instances (n = 224) at which the Integer Programming method yields greater ribosome

density at glycine compared to every other method. The color-coding is the same as shown in

the legend in panel (A). Our method does better if it assigns greater ribosome density in more

than half the instances (horizontal line in panel B). The Integer Programming method does better

than all other methods (P < 0.0005) except for Hussmann, which is not statistically different

(P = 0.164). (C) Normalized ribosome density is shown for an instance of PPE motif in gene

uc007zma.1 with E at codon position 127 in the Pooled dataset of mouse ESCs (see Legend

and main text for details about methods). (D) The fraction of PPE instances at which the Integer

Programming method yields greater ribosome density at glutamic acid compared to every

other method. The color-coding is the same as shown in the legend of panel (C). The Integer

Programming method does better than all other methods (P < 10−15) in accurately assigning

ribosome density to Glutamic Acid in PPE motifs (n = 104). For the analyses presented in (B)

and (D), two-sided p-values were calculated using the Wilcoxon signed rank test. Error bars

represent the 95% Confidence Interval about the median calculated using Bootstrapping.
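
The bootstrapped confidence interval mentioned in the caption can be illustrated with a percentile bootstrap. This is a sketch under assumptions: the number of resamples, the percentile construction, and the function name are choices made for the example and are not stated in the text.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median.
    Resamples `values` with replacement n_boot times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled medians."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Fixing the seed makes the interval reproducible between runs.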

2.4 Discussion

We have introduced a method to determine the A- and P-site locations on ribosome-

protected mRNA fragments, and shown that it is more accurate than other methods in

correctly assigning ribosome density to the glycine residue in PPG motifs and glutamic

acid residue in PPE motifs, which are strong translation-stalling sites in S. cerevisiae and

mESCs, respectively. Our method is unique amongst existing methods because it (i) uses

a probabilistic approach to identify the A-site location through Integer Programming

optimization and (ii) has an objective function rooted in the biology of translation – meaning

that its optimization enforces the fact that the A-site location of most reads must have been

between the second and stop codons of the CDSs. To be sure, several methods use

biological features to assign the A-site (such as having more reads around the start and

stop codons than in the UTR25,33). However, ours is the only method that also utilizes

feature (i), which is beneficial because the stochastic nature of mRNA cleavage during the

digestion-step of Ribo-Seq necessitates a probabilistic perspective. Our method is not

entirely probabilistic since we have to set thresholds and apply a secondary criterion to

arrive at a final offset value. These measures are unavoidable due to the variability in

coverage between different genes. However, we find that the results are robust to variation

in thresholds and mostly consistent across different Ribo-Seq datasets. Hence, the

respective A-site offset tables provided for S. cerevisiae (Table 2.1) and mouse embryonic

stem cells (Table A.6) can be applied to any dataset from these organisms.

Noteworthy about our test for accuracy is that it is based on results from orthogonal

experimental techniques. The stalling of translation at glycine in PPG motifs is well-

documented61,62,65,66,104 and in S. cerevisiae the Integer Programming method assigns

more Ribo-Seq reads to the glycine codon at most instances of PPG compared to other

A-site methods. In mESCs PPE is the strongest stalling motif33. The Integer Programming

method outperforms other methods by assigning, on average, 1.76 times more reads at

the glutamic acid codon compared to other methods. These results indicate that the

Integer Programming method presented in this study is more accurate than existing

methods. One reason for this increase in accuracy, among many possible reasons, may

be that most methods only use reads from around the start codon, while our method uses

reads from around both the start and stop codons.

A potential point of confusion may arise from the distributions shown in Figure 2.3 in

which there are two highly probable offset values, raising the question of whether or not

there are multiple A-site locations for a given fragment size and frame. In almost all

fragment length and frame combinations, there is one unique most probable A-site

location, but this ambiguity can arise from poor read coverage on a gene or stochastic

fluctuations in the extent of digestion on one side of an mRNA fragment compared to the

other. Consider fragment size 28 in frame 1. In the

Pop data set (top, middle panel of Figure 2.3A), approximately half of the genes have

∆= 15 nt, while the others have ∆= 18 nt, meaning the A-site could be at either location.

When we increase the read coverage of the genes, however, we see that the vast majority

of the offsets shift to 15 nt (bottom, middle panel in Figure 2.3B). Thus, the original A-site

ambiguity was not due to multiple, equally possible A-site locations, but rather the true A-

site location was hard to detect without better coverage. Consider another example. For

𝑆 = 27 and 𝐹 = 1 we observe in Figure 2.3A that 8% of genes have an optimal ∆= 0,

seemingly suggesting that the A-site is located at the 5′-end on a subset of fragments.

Spot-checking the ribosome profiles of these genes, we find that these genes contain no

reads in the 27 nt region upstream of the second codon and 27 nt upstream of the stop

codon (data not shown). Thus, the values of 𝑇(∆|𝑖, 𝑆, 𝐹) for all ∆ were equal and the

optimal ∆ was arbitrarily assigned a value of 0. In the higher coverage Pooled dataset,

however, there are only 2% of genes with optimal ∆= 0 for 𝑆 = 27 and 𝐹 = 1. Hence, as

we increase coverage, the proportion of genes with spurious offsets decreases. Thus,

offsets away from the most probable offset arise from sampling issues, not from multiple

A-site locations. This result is also seen in the analysis of the artificial Ribo-Seq data where

our algorithm correctly predicts the true offsets for a majority of (𝑆, 𝐹) combinations while

ambiguous offsets occur only for those (𝑆, 𝐹) combinations with the lowest read coverage.

We note that we set a threshold of 70% to determine a most-probable offset for each

fragment size and reading frame and demonstrated that the results are robust to variation

with this threshold (Table A.3). Therefore, the A-site assignments reported in Table 2.1

represent the most likely location of the A-site relative to the 5′ end of mRNA fragments

produced from Ribo-Seq experiments on S. cerevisiae.

Some (𝑆, 𝐹) combinations (such as 𝑆 = 32 and 𝐹 = 0, in Table 2.1) appear to be

inherently ambiguous, that is, increasing their coverage does not lead to a unique A-site

assignment (Figure 2.4D). We do not know the reason for this result, but we speculate

that these are situations where there are truly multiple equally probable A-site locations.

Another possibility is that the ribosome adopts different conformations in these situations

that result in different read lengths and offsets, leading to ambiguity40. A third possibility is

that the nucleotide context around the start and stop codons for a subset of genes may

influence the offset assignment. While for a majority of (𝑆, 𝐹) combinations, higher

coverage leads to convergence towards a single offset, a minority of genes still have an

offset different from the most probable offset, possibly due to sequence bias. It’s possible

this effect has a bigger impact for (𝑆, 𝐹) combinations with ambiguous offsets. The

important point is that the A-site cannot be accurately assigned in these situations. We

therefore recommend that researchers discard reads from these (𝑆, 𝐹) combinations to

minimize chances of erroneous A-site assignments. We believe it will have negligible

effect on the A-site profiles since these combinations contribute only 2.9% of total reads

in the Pooled dataset.
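
To make the recommendation above concrete, the following sketch shows one way an (𝑆, 𝐹) offset table could be applied to 5′-aligned reads while discarding reads from ambiguous combinations. The offset values shown are placeholders for illustration only and are not the values of Table 2.1; the coordinate convention (0-based positions relative to the first nucleotide of the start codon) and the function names are likewise assumptions.

```python
# Illustrative offsets only (NOT the values of Table 2.1). None marks an
# ambiguous (S, F) combination whose reads are discarded.
EXAMPLE_OFFSETS = {(28, 0): 15, (28, 1): 15, (28, 2): 18, (32, 0): None}


def a_site_codon(five_prime_pos, length, offsets):
    """Map one read to a 0-based A-site codon index (codon 0 = start codon).
    five_prime_pos is the 0-based nucleotide position of the read's 5' end
    relative to the first nucleotide of the start codon (may be negative).
    Returns None if the (S, F) combination is ambiguous or not in the table."""
    frame = five_prime_pos % 3  # Python's % is non-negative for negative positions
    offset = offsets.get((length, frame))
    if offset is None:
        return None
    return (five_prime_pos + offset) // 3


def a_site_profile(reads, offsets, n_codons):
    """Build per-codon A-site counts from (five_prime_pos, length) pairs."""
    profile = [0] * n_codons
    for pos, length in reads:
        codon = a_site_codon(pos, length, offsets)
        if codon is not None and 0 <= codon < n_codons:
            profile[codon] += 1
    return profile
```

Reads whose combination is missing from the table are treated the same way as ambiguous ones, which keeps the profile restricted to fragments with a confidently assigned A-site.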

We have found that the Integer Programming algorithm is sensitive to reads arising

from outside the boundaries of annotated CDS regions from non-canonical sources like

upstream ORFs (uORFs) or Internal Ribosome Entry Sites (IRES). Specifically, applying

our method to Ribo-Seq data from mESCs yielded few unique offsets. It was only after

removing genes that had multiple translation initiation sites, some arising from uORFs,

that the number of unique offsets increased more than four-fold. The reason for this

improvement was that by removing the uORFs, our method’s assumption was met that

the reads within 40 nt of the start codon only arise from the annotated CDS. Our method

was not able to identify any unique offsets in E. coli Ribo-Seq data even after we controlled

for multiple translation initiation sites. We observed in E. coli a high enrichment of reads

before the start codon after applying the conventional 12 nt offset from 3′ end66 (Figure

A.4) which we speculate may be due to protection of mRNA segments involved in binding

of the Shine-Dalgarno sequence to the ribosome108 and could limit the accuracy of our

method.

The next best method to the Integer Programming method is the approach used in the

study of Hussmann and co-workers29. Hussmann’s method uses a near-neighbor heuristic

to determine frame-specific offsets of +15, 14 and 16 for lengths 28 and 29 for frames 0,

1 and 2 respectively, and offsets of +15, 17 and 16 for length 30. The reason Hussmann’s

method yields comparable results is that its offset table is highly similar to Table 2.1. If the

reading frame is maintained after applying the offset from the 5′ end, then 8 out of 9 of

Hussmann’s offsets are the same as in Table 2.1 with the 9th offset of (29, 1) being

ambiguous in our method. We believe the Integer Programming method is superior

because it more frequently assigns greater ribosome density to glycine in PPG motifs,

exhibits a stronger anti-correlation with cognate tRNA abundances, provides greater statistical

power and is based on biological features of translation rather than heuristic assumptions.

Specifically, Hussmann’s method only uses reads that are 28, 29 and 30 nt in length,

whereas our method uses reads between 24 and 34 nt in length.

Our method preserves the 3 nt periodicity found in the original 5′-end aligned

mRNA fragments. Therefore, it is not designed for detecting frame-shifting, translation of

upstream ORFs, or novel short peptides. Nevertheless, correct assignment of reads to the

A-site codon is essential in a variety of other analyses, such as determining translation

kinetics, and our method provides the most accurate assignment of ribosome density

compared to other methods (Figure 2.6 and Table A.8).

In summary, we have created a method for A-site identification that is more accurate

than existing methods in S. cerevisiae and mouse embryonic stem cells, utilizes a

fundamental feature of translation to identify the A-site, and have revealed how the A-site

location changes based on the size of the mRNA fragment and its frame. By increasing

the accuracy and range of fragment sizes for which the A-site can be identified, our

approach can help future studies to measure translation elongation properties at the length

scale of individual codons.

2.5 Methods

2.5.1 Ribo-Seq datasets

2.5.1.1 S. cerevisiae. Published Ribo-Seq data from S. cerevisiae were obtained from

GSM1557447 used in the study of Pop and co-workers39. Raw fastq files were

downloaded and pre-processed using Fastx-toolkit (v0.013)

(http://hannonlab.cshl.edu/fastx_toolkit/index.html) as stated in the methods of the original

study. The adapter sequence CTGTAGGCACCATCAAT was stripped using FastQ clipper

and low-quality reads were filtered by FastQ quality filter. The processed reads were

aligned first to the ribosomal RNA sequences using Bowtie 2 (v2.2.3)109. The reads which

did not align to the ribosomal sequences were then aligned to the Saccharomyces

cerevisiae assembly R64-2-1 (UCSC: sacCer3) using Tophat (v2.0.13)110 with up to two

mismatches allowed. Gene annotations were obtained from Saccharomyces Genome

Database (http://www.yeastgenome.org/) on May 4, 2016 for 6,572 protein-coding genes.

Reads were assigned to the nucleotide positions according to the 5′ end.

The pooled Ribo-Seq dataset was formed by combining reads from all replicates of

S. cerevisiae Ribo-Seq data published in studies in which cycloheximide (CHX) was not

used to induce translation arrest9,28,35,39–41,55,111–114. It has been demonstrated that CHX

pre-treatment leads to distortion of ribosome profiles due to ribosome slippage even after

CHX treatment28,29. The distorted ribosome profiles can spill across the CDS boundaries

thus limiting the application of the Integer Programming algorithm. Hence, our analysis only

used those datasets without CHX pre-treatment. The list of all the utilized datasets is

reported in Table A.9. The raw reads from each study were processed according to the

reported method in the original study. If the method is not reported in the original study,

we used cutadapt (v1.14)115 to pre-process the raw reads. The alignment and assignment

of reads to gene transcripts was done as above for the Pop dataset39.

2.5.1.2 Mouse embryonic stem cells. The “no drug” sample for mouse embryonic stem

cells (mESCs) measured by Ingolia and co-workers33 was utilized in this study. Since CHX

treatment has been shown to artificially alter ribosome profiles in S. cerevisiae, we

considered it prudent not to use mESC samples pre-treated with CHX. To increase the

coverage we pooled reads from another untreated Ribo-Seq sample of mESCs published

in the study of Hurt and co-workers116. The linker sequence

CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTGAA for Ingolia’s dataset and

the poly-A adapter sequence for Hurt’s dataset were trimmed using cutadapt (v1.14)115.

The trimmed reads were first aligned to ribosomal RNA sequences using Bowtie2

(v2.2.3)109 and the filtered reads were subsequently aligned to mm10 reference

transcriptome consisting of 21,185 genes obtained from UCSC knownGene database

using Tophat (v2.0.13)110 with up to two mismatches allowed. For a gene with multiple

isoforms, only the isoform with the longest CDS was included in the reference

transcriptome. For transcripts with no information on the 5′ UTR region, we included 40 nt

of genomic sequence upstream from the start codon for successful alignment of reads

around start codon and effective application of Integer Programming algorithm.

Translation initiation site data were obtained from Table S3 of the study of Ingolia and

co-workers33. We selected genes that have only one translation initiation site coding for only

a canonical CDS product. From these genes, only genes containing a single isoform were

selected, resulting in 430 genes in our final dataset.

2.5.1.3 Escherichia coli. Wild-type Ribo-Seq data for E. coli were obtained from the studies

of Li and co-workers (2012)117, Li and co-workers (2014)118 and Woolstenhulme and co-

workers66. The accession numbers of the samples used are provided in Table A.9. The

respective linker sequences in each sample were trimmed using cutadapt (v1.14)115.

Reads were initially aligned to ribosomal RNA sequences using Bowtie2 (v2.2.3)109 and

the remaining reads were aligned to the E. coli reference genome build NC_000913.3 using Tophat

(v2.0.13)110 with up to two mismatches allowed. Gene annotations were obtained for 4,314

genes from the RefSeq database corresponding to NC_000913.3.

2.5.2 Gene selection, analyses and statistical tests

2.5.2.1 Selection of genes. To obtain good sampling statistics, we selected for analysis

only those genes that have on average greater than 1 read per codon per fragment length

per reading frame. This means that different sets of genes can be used in the Integer

Programming algorithm depending on the fragment length and frame under scrutiny. The

average number of reads per codon was calculated on the CDS region of the gene and

an additional upstream region corresponding to the size of the fragment length being

considered. Genes in which more than 1% of the total number of mapped reads, for a

given 𝑆 and 𝐹, mapped to multiple locations across the genome were discarded from

further analysis.
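
A minimal sketch of this gene filter, for one (𝑆, 𝐹) class at a time, is given below. The input layout (a list of 5′-end counts per transcript nucleotide plus a count of multi-mapped reads) and the function names are assumptions made for the example; only the two thresholds come from the text.

```python
def mean_reads_per_codon(counts, cds_start, cds_end, fragment_length):
    """Average reads per codon over the CDS plus an upstream region as long as
    the fragment. counts[i] is the number of reads of one (S, F) class whose
    5' end maps to transcript nucleotide i; the CDS spans [cds_start, cds_end)."""
    region_start = max(0, cds_start - fragment_length)
    n_codons = (cds_end - region_start) / 3
    return sum(counts[region_start:cds_end]) / n_codons


def keep_gene(counts, n_multimapped, cds_start, cds_end, fragment_length,
              min_mean=1.0, max_multimap_fraction=0.01):
    """Apply both criteria: more than min_mean reads per codon on average, and
    no more than max_multimap_fraction of the mapped reads multi-mapped."""
    total = sum(counts)
    if total == 0:
        return False
    if n_multimapped / total > max_multimap_fraction:
        return False
    mean = mean_reads_per_codon(counts, cds_start, cds_end, fragment_length)
    return mean > min_mean
```

Because the filter is evaluated separately for each fragment length and frame, a gene can pass for one (𝑆, 𝐹) combination and fail for another.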

2.5.2.2 Identifying unique offsets. We defined the most probable offset ∆′ to have a

unique, unambiguously identified A-site if at least 70% of genes in the dataset had an

offset equal to ∆′, and further required that there be at least 10 genes in the dataset.

Otherwise, the A-site location is defined as ambiguous for the fragment size and frame

under scrutiny. In the Results section, we show the A-site location is largely robust to

moderate variation in this 70% threshold.
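
The decision rule above is simple enough to state as code. In this sketch, `gene_offsets` is a hypothetical list holding the optimal offset found for each gene at one fragment size and frame; the 70% threshold and the 10-gene minimum are the values given in the text.

```python
from collections import Counter


def call_unique_offset(gene_offsets, threshold=0.70, min_genes=10):
    """Return the most probable offset if it is unique, otherwise None.
    An offset is called unique when at least `threshold` of the genes share it
    and the dataset holds at least `min_genes` genes; None means ambiguous."""
    if len(gene_offsets) < min_genes:
        return None
    offset, count = Counter(gene_offsets).most_common(1)[0]
    if count / len(gene_offsets) >= threshold:
        return offset
    return None
```

Varying `threshold` in this function corresponds to the robustness check referred to above.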

2.5.2.3 High coverage test. To test for the effect of depth of coverage on the A-site

location we increased the average number of reads per codon required for a gene to be

included in the analyzed dataset from 1 to values up to 50. Three requirements have to

be met for an ambiguous offset to be identified as unique as coverage is increased. As

before, 70% of the genes had to have the most probable offset with at least 10 genes in

the dataset. In addition, there must be a statistically significant increasing trend in the

most probable offset with increasing coverage. This requirement prevents fluctuations

above 70% due to statistical error from being counted as a unique offset. This trend is

calculated using Linear Regression Analysis.

2.5.2.4 Test using Artificial Ribo-Seq data. To construct artificial ribosome occupancies,

we used Gillespie’s algorithm119 to simulate translation across S. cerevisiae mRNA

transcripts. During the simulations, we saved snapshots every 100 steps recording the A-

site codon location and creating a histogram of ribosome occupancies across the

transcript. To be consistent with the sampling statistics of the experimental Pooled S.

cerevisiae data, we carried out our analysis on the same 4,487 transcripts that met our

filtering criteria for (𝑆, 𝐹) = (28, 0), and normalized our simulated ribosome occupancies

such that they sum up to the total number of reads mapped to that transcript in the

experimental data. We then created different fragment size and 5′ end reading frame

distributions (Figure A.3A, B). Specifically, since the reads are counts, we use Poisson

statistics by treating each (𝑆, 𝐹) as an event in the order:

(20, 0), (20, 1), (20, 2), … , (35, 0), (35, 1) and (35, 2). Six shifted Poisson distributions of

different variances (𝜆 = 4, 8, 16, 24, 48, 80) were generated. The distributions were shifted

such that the mode of the distribution was at (𝑆, 𝐹) = (28, 0), which is typically found in

experiments, with probabilities summing up to 1 between (20, 0) and (35, 2). Two

additional read length distributions were also considered with modes at (𝑆, 𝐹) = (24, 0)

and (𝑆, 𝐹) = (32, 0) with 𝜆 = 8. Four different sets of offset tables were used as an input

to generate the artificial Ribo-Seq reads from the simulated ribosome occupancies for

each of these distributions. These four offset sets are (i) a constant offset of 15 nucleotides

for all (𝑆, 𝐹)s, (ii) a constant offset of 18 for all (𝑆, 𝐹)s, (iii) a constant offset of 12 for

𝑆 = 20, 21, … , 26, 27 and a constant offset of 18 for 𝑆 = 28, 29, … , 34, 35, and (iv) the “top offset” values

for (𝑆, 𝐹) combinations identified using our algorithm in the experimental Pooled S.

cerevisiae data (i.e., the offset values of Table 2.1). These input offset tables were

compared to the ‘output’ offset table generated by applying the IP algorithm on the artificial

Ribo-Seq data to test the correctness of our method.
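
One way to construct the shifted Poisson distributions over (𝑆, 𝐹) events described above is sketched below. The details are assumptions for illustration: the events are indexed 0 to 47 in the stated order, the Poisson distribution is translated so that its mode ⌊𝜆⌋ lands on the chosen event, and the probabilities are truncated to the 48 events and renormalized. For integer 𝜆 a Poisson distribution has two tied modes (𝜆 and 𝜆 − 1), so the event just before the chosen one receives the same probability.

```python
import math

# The 48 (S, F) events in the order (20, 0), (20, 1), (20, 2), ..., (35, 2).
EVENTS = [(size, frame) for size in range(20, 36) for frame in range(3)]


def poisson_pmf(k, lam):
    """Poisson probability of count k (zero for negative k), computed in log space."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def shifted_poisson_weights(lam, mode_event=(28, 0)):
    """Probability of each (S, F) event under a Poisson(lam) distribution that is
    shifted so its mode falls on mode_event, truncated to EVENTS, and
    renormalized so the probabilities sum to 1."""
    shift = EVENTS.index(mode_event) - math.floor(lam)
    raw = [poisson_pmf(i - shift, lam) for i in range(len(EVENTS))]
    total = sum(raw)
    return {event: p / total for event, p in zip(EVENTS, raw)}
```

Multiplying these weights by the simulated A-site occupancy at a codon would apportion that occupancy among fragment sizes and frames before the input offsets are applied.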

2.5.2.5 Statistical significance of PPX and XPP motifs. To test if the normalized read

density distribution of a PPX or XPP motif is due to random chance, we calculated the P-

value using a permutation test120. For the total number of instances of a PPX/XPP motif,

we randomly selected an equal number of instances of any other three-residue motif and

determined the median normalized read density at the third codon position of the motif,

thereby creating a random distribution. We repeated this procedure 10,000 times and

calculated the fraction of iterations that had a median density equal to or greater than the

one observed for that PPX/XPP motif. This fraction is equal to the P-value. The instances

of PPX and XPP motifs are identified from those transcripts that have at least 50% of

codon positions with 1 read or more.
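
The permutation test described above can be sketched as follows. The inputs are hypothetical lists of normalized read densities at the third codon position: one for the instances of the motif under test and one for instances of all other three-residue motifs. Sampling without replacement and the fixed random seed are choices made for this example.

```python
import random
import statistics


def permutation_p_value(motif_densities, background_densities,
                        n_iter=10000, seed=0):
    """Fraction of random draws whose median density is at least as large as the
    observed median for the motif. Each draw takes as many background instances
    as there are motif instances (requires enough background instances)."""
    rng = random.Random(seed)
    observed = statistics.median(motif_densities)
    n_instances = len(motif_densities)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(background_densities, n_instances)
        if statistics.median(sample) >= observed:
            hits += 1
    return hits / n_iter
```

With 10,000 iterations the smallest non-zero P-value this procedure can report is 0.0001.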

2.5.2.6 Comparison with other A-site mapping methods. We compared the

performance of Integer Programming algorithm with other methods by calculating the

difference in normalized read density between the Integer Programming A-site value and

the compared method’s A-site value at the third codon of PPG and PPE motifs, which are

associated with ribosome pausing in S. cerevisiae and mESCs respectively.

In S. cerevisiae, A-site ribosome profiles were obtained for Integer Programming

method by applying the offsets listed in Table 2.1 for fragment sizes 24 to 34 nt. For

methods used by Martens and co-workers31 and Hussmann and co-workers29 specifically

in S. cerevisiae, A-site profiles were obtained by applying the offsets for specific fragment

sizes as stated in the Methods sections of those studies. We included a constant heuristic

offset of 15 nt which has been used in several studies of S. cerevisiae Ribo-Seq

data25,43,60,101. The constant offset of 15 nt has been applied to a wide range of fragment

lengths across studies including 22-32 nt25, 27-30 nt60, 28 nt43, 27-34 nt101. To be

conservative, we apply a constant offset of 15 nt to fragments between 27 and 30 nt only.

Similarly, we also include a method where a constant offset of 18 nt is applied to fragments

between 27 and 30 nt to compare to the performance of the Integer Programming method.

For mESCs, Ingolia and co-workers33 implemented length-specific offsets of 15, 16 and

17 nt from the 5′ end, respectively, for fragments of size 29-30 nt, 31-33 nt and 34-35 nt.

Several studies have also implemented a constant offset of 15 nt for the fragment size range

25-35 nt94,121. As in S. cerevisiae, we also implement a constant offset of 18 nt for the

fragment size range of 25-35 nt.

Few general methods have been proposed to determine A-site locations in any

organism. We implemented the methods Plastid38, RiboProfiling92 and riboWaltz37, which

are publicly available as Python (Plastid) or R (RiboProfiling and riboWaltz) packages. The A-site offset tables generated using these

methods for our analyzed datasets in S. cerevisiae and mESCs are presented in Table

A.10. To determine the A-site profiles using the ‘ribodeblur’ method created by Wang and

co-workers32, we ran the source code available on GitHub

(https://github.com/Kingsford-Group/ribodeblur-analysis/releases/tag/v0.1) on our datasets and added a custom Python

script to generate the ‘deblurred’ A-site profiles. For Rpbp102, the publicly available

software was downloaded and run locally to obtain the A-site offsets. We also applied the

center-weighted method as described by Becker and co-workers89; for reads greater than

23 nt, we trim 11 nt from both ends of the fragment and distribute the read equally among

the remaining nucleotides. For the scikit-ribo method30, the source code was downloaded and

was successfully run for S. cerevisiae datasets to obtain the A-site profiles. Scikit-ribo

could not be run on mouse ESC data as the current available version of the source code

contains bugs resulting in inaccurate annotation assignments for higher eukaryotic

genomes.
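The center-weighted rule described above (trim 11 nt from each end of reads longer than 23 nt and spread the read evenly over what remains) can be sketched as follows. This is an illustrative reading of that description, not the code of Becker and co-workers; the function name and the read representation are assumptions.

```python
def center_weighted_profile(reads, transcript_length, trim=11, min_length=24):
    """Distribute each read over its central nucleotides.

    reads : iterable of (start_nt, fragment_length), 0-based start coordinate.
    Reads shorter than min_length (i.e. 23 nt or less) are skipped.
    """
    profile = [0.0] * transcript_length
    for start, length in reads:
        if length < min_length:
            continue
        inner = range(start + trim, start + length - trim)
        weight = 1.0 / len(inner)   # each read contributes a total weight of 1
        for pos in inner:
            if 0 <= pos < transcript_length:
                profile[pos] += weight
    return profile


profile = center_weighted_profile([(0, 28)], transcript_length=40)
print([round(x, 3) for x in profile[10:18]])
```

A 28 nt read therefore contributes one sixth of a count to each of its six central nucleotides.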

Instances of PPG motifs (in S. cerevisiae) and PPE motifs (in mESCs) used for analysis

are selected from genes in which at least 90% of codon positions have at least 1 read in

their 5′ aligned ribosome profiles in the CDS region and an upstream region of 18 nt. An

instance of a motif is included for analysis only if its ribosome density at the third codon

position is greater than 1.5 times the average ribosome density in the A-site profile of at least

one of the compared methods. We use the Wilcoxon signed-rank test to determine if there is a

statistically significant difference between the normalized read density at the third codon

of motif instances obtained by Integer Programming and other methods.
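The two filters described above can be sketched in a few lines. The sketch assumes that "average ribosome density" means the mean reads per codon of the gene and that profiles are plain lists of per-position counts. Both are assumptions, and the function names are hypothetical.

```python
def passes_coverage(position_reads, min_fraction=0.9):
    """True if at least min_fraction of positions carry one or more reads."""
    covered = sum(1 for c in position_reads if c >= 1)
    return covered >= min_fraction * len(position_reads)


def normalized_density(codon_reads, position):
    """Reads at one codon divided by the mean reads per codon of the gene."""
    mean = sum(codon_reads) / len(codon_reads)
    return codon_reads[position] / mean


def keep_motif_instance(profiles_by_method, third_codon_pos, threshold=1.5):
    """Keep the instance if any method's A-site profile exceeds the threshold."""
    return any(
        normalized_density(profile, third_codon_pos) > threshold
        for profile in profiles_by_method.values()
    )


profiles = {"method_a": [1, 1, 4, 2], "method_b": [2, 2, 2, 2]}
print(passes_coverage([1] * 19 + [0]), keep_motif_instance(profiles, 2))
```

The paired significance test itself is not reproduced here; a statistics library would normally be used for the Wilcoxon signed-rank p-value.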

2.6 Acknowledgements

We thank Ajeet Sharma for providing us with computer code to simulate translation using

Gillespie’s algorithm and the members of the O’Brien Lab for critical feedback on the

manuscript. P.S. is supported by a Borysiewicz Biomedical Fellowship from the University

of Cambridge. This work was supported by the research grant from the National Science

Foundation ABI grant 1759860 to E.P.O.

2.7 Data Availability

All source code is available in the GitHub repository

https://github.com/nabeel-bioinfo/Asite_IP_method

Chapter 3

A CHEMICAL KINETIC BASIS FOR MEASURING TRANSLATION ELONGATION

RATES FROM RIBOSOME PROFILING DATA

The research presented in this chapter is part of a study published in PLOS Computational

Biology titled “A Chemical Kinetic Basis for Measuring Translation Initiation and Elongation

Rates from Ribosome Profiling data” by Ajeet K. Sharma*, Pietro Sormanni*, Nabeel

Ahmed*, Prajwal Ciryam, Ulrike A. Friedrich, Günter Kramer and Edward P. O’Brien (*

denotes co-first authors). This publication describes three methods for measuring, respectively,

translation initiation rates, the average translation elongation rate, and individual codon

translation rates. Nabeel Ahmed's contribution to this study was the

development of the method for measuring individual codon translation rates and estimating

the molecular factors influencing the rate of translation elongation.

Portions of the text of this chapter are reproduced from the above publication with

permission from PLOS under the Open Access Creative Commons Attribution License (CC

BY). Only the research contributed by Nabeel Ahmed related to codon translation rates is

presented in this chapter. The in silico validation analysis for codon translation rates

presented in this chapter was carried out by Ajeet K. Sharma for data provided by Nabeel

Ahmed.

3.1 Abstract

Analysis methods based on simulations and optimization have been previously developed

to estimate relative translation rates from next-generation sequencing data. Translation

involves molecules and chemical reactions; hence, bioinformatics methods consistent

with the laws of chemistry and physics are more likely to produce accurate results. Here,

we derive simple equations based on chemical kinetic principles to measure the individual

codon translation rates from ribosome profiling experiments. Our methods reproduce the

known rates from ribosome profiles generated from detailed simulations of translation. By

applying our methods to data from S. cerevisiae, we find that the extracted rates reproduce

expected correlations with various molecular properties in agreement with previous

reports that used other approaches. Our analysis further reveals that a codon can exhibit

up to 26-fold variability in its translation rate depending upon its context within a transcript.

This broad distribution means that the average translation rate of a codon is not

representative of the rate at which most instances of that codon are translated, and it

suggests that translational regulation might be used by cells to a greater degree than

previously thought.

3.2 Author Summary

The process of translating the genetic information encoded in an mRNA molecule to a

protein is crucial to cellular life and plays a major role in regulating gene expression. The

translation initiation rate of a transcript is a direct measure of the rate of protein synthesis

and is the key kinetic parameter defining translational control of the gene’s expression.

Translation rates of individual codons play a considerable role in coordinating co-

translational processes like protein folding and protein secretion and thus contribute to the

proper functioning of the encoded protein. Direct measurement of these rates in vivo is

nontrivial, and recent next-generation sequencing methods like ribosome profiling offer an

opportunity to quantify these rates for the entire translatome. In this study, we develop

chemical kinetic models to measure absolute rates and quantify the influence of different

molecular factors in shaping the variability of these rates at codon resolution. These new

analysis methods are significant to the field because they allow scientists to measure

absolute rates of translation from Next-Generation Sequencing data, provide analysis

tools rooted in the physical sciences rather than heuristic or ad hoc approaches, and allow

for the quantitative, rather than qualitative study of translation kinetics.

3.3 Introduction

Translation-associated rates influence in vivo protein abundance, structure and function.

It is therefore crucial to be able to accurately measure these rates. The ribosome

synthesizes a protein in three steps, namely initiation, elongation, and termination122–124.

Translation is initiated at the start codon of the mRNA transcript by the ribosomal subunits

that form a stable translation-initiation complex125,126. During the elongation step, the

ribosome moves along the mRNA transcript decoding individual codons and adding

residues to the growing nascent chain. The elongation

phase is terminated when the ribosome’s A-site reaches the stop codon, resulting in

release of the fully synthesized protein. The initiation and elongation phases of translation

contribute to protein levels inside a cell; indeed, alteration of their rates can cause protein

abundance to vary by up to five orders of magnitude5,127,128, and alter protein structure and

function11. Termination does not influence the cellular concentration of proteins as it is not

a rate limiting step129. Therefore, knowledge of translation initiation and codon translation

rates is important for understanding the regulation of gene expression.

Significant efforts have been made to extract these rates from data generated from

ribosome profiling experiments39,130–132, a technique that measures the relative ribosome

density across transcripts25. In a number of methods, the actual rates are not measured

but instead a ratio of rates, or other relevant quantities have been reported35,41–43,60.

Estimates of translation-initiation and codon translation rates have helped identify the

molecular determinants of these rates. For example, estimated initiation rates correlate

with the stability of mRNA structure near the start codon and in the 5' untranslated

region4,41,129,132 indicating mRNA structure can influence initiation. Similarly, codon

translation rates have been found to positively correlate with their cognate tRNA

abundance41,105, and anti-correlate with the presence of downstream mRNA secondary

structure53,54 and positively charged nascent-chain residues inside the ribosome exit

tunnel42,133. Some of these findings are controversial as different analysis methods and

data have led to contradictory results concerning the role of tRNA concentration27,33,41,43,

positively charged residues42,60 and coding sequence (CDS) length41,129,131,132. Moreover,

the accuracy of these methods is unknown because orthogonal, high-throughput

experimental measurements of translation rates do not exist.

In the absence of data that could differentiate the accuracy of different methods,

we argue that the methods most likely to be accurate will be those that are constrained by

and account for the chemistry and physics of the translating system. Here, we present

such a method, derived from chemical kinetic principles that permit the extraction of

individual codon translation rates from Next-Generation Sequencing (NGS) data. These

methods are verified against artificial ribosome profiling data generated from detailed

simulations of the translation process where the translation rates are known a priori. We

then apply these methods to in vivo ribosome profiling data and extract the transcriptome-

wide translation-initiation and codon translation rates in S. cerevisiae and transcriptome-

wide average elongation rate in mouse stem cells. We show that the translation rate

parameters correlate with factors known to modulate these rates, and assign absolute

numbers to these rates.

3.4 Results

3.4.1 Theory

3.4.1.1 Measuring individual codon translation rates

To derive a mathematical expression for extracting codon translation rates from ribosome

profiling data we assumed steady state conditions in which the flux of ribosomes at each

codon position is equal to the rate of protein synthesis

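\[ \frac{N_{2,i}^{\mathrm{ribo}}}{\tau(2,i)} = \frac{N_{3,i}^{\mathrm{ribo}}}{\tau(3,i)} = \cdots = \frac{N_{j,i}^{\mathrm{ribo}}}{\tau(j,i)} = \cdots = \frac{N_{N_c(i),i}^{\mathrm{ribo}}}{\tau(N_c(i),i)}. \qquad \text{(Eq. 3.1)} \]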

In Eq. (3.1), $N_{j,i}^{\mathrm{ribo}}$ and 𝜏(𝑗, 𝑖) are, respectively, the steady-state number of ribosomes and

the average translation time of the 𝑗th codon position within copies of transcript 𝑖 in a given

experimental sample. The mean total time of synthesis ⟨𝑇(𝑖)⟩ of transcript 𝑖 is, by

definition, equal to

\[ \langle T(i) \rangle = \tau(2,i) + \tau(3,i) + \cdots + \tau(N_c(i),i). \qquad \text{(Eq. 3.2)} \]

Solving Eqs. (3.1) and (3.2) for 𝜏(𝑗, 𝑖) (see derivation in Appendix B) yields

\[ \tau(j,i) = \frac{N_{j,i}^{\mathrm{ribo}}}{\sum_{l=2}^{N_c(i)} N_{l,i}^{\mathrm{ribo}}} \langle T(i) \rangle. \qquad \text{(Eq. 3.3)} \]
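The step from Eqs. (3.1) and (3.2) to Eq. (3.3) can be written out briefly. The sketch below introduces an auxiliary symbol, $J(i)$, for the common steady-state flux in Eq. (3.1). This symbol is an assumption made here for illustration and may differ from the notation used in Appendix B.

```latex
% J(i) is an auxiliary symbol for the common flux in Eq. (3.1);
% it is introduced here for illustration only.
\begin{align*}
\frac{N_{j,i}^{\mathrm{ribo}}}{\tau(j,i)} &= J(i)
  \;\Rightarrow\; \tau(j,i) = \frac{N_{j,i}^{\mathrm{ribo}}}{J(i)}, \\
\langle T(i) \rangle &= \sum_{l=2}^{N_c(i)} \tau(l,i)
  = \frac{1}{J(i)} \sum_{l=2}^{N_c(i)} N_{l,i}^{\mathrm{ribo}}
  \;\Rightarrow\; J(i) = \frac{\sum_{l=2}^{N_c(i)} N_{l,i}^{\mathrm{ribo}}}{\langle T(i) \rangle}, \\
\tau(j,i) &= \frac{N_{j,i}^{\mathrm{ribo}}}{\sum_{l=2}^{N_c(i)} N_{l,i}^{\mathrm{ribo}}}\,
  \langle T(i) \rangle .
\end{align*}
```

The first line restates Eq. (3.1) for a single codon, the second substitutes it into Eq. (3.2) to eliminate the flux, and the third combines the two.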

As is the convention in the field26, we assume that ribosome profiling reads at the 𝑗th codon

position of transcript 𝑖, 𝑐(𝑗, 𝑖), are directly proportional to $N_{j,i}^{\mathrm{ribo}}$. This relationship can be

expressed as

\[ N_{j,i}^{\mathrm{ribo}} = a_{j,i}\, c(j,i), \qquad \text{(Eq. 3.4)} \]

where $a_{j,i}$ is a constant of proportionality. The $a_{j,i}$ values have not been experimentally

measured, but they are commonly assumed to be constant for all codon positions in a

single experiment26. That is, $a_{j,i} = a_i$ for all 𝑖 and 𝑗. Using Eq. (3.4) with $a_{j,i} = a_i$ in Eq.

(3.3) yields

\[ \tau(j,i) = \frac{c(j,i)}{\sum_{l=2}^{N_c(i)} c(l,i)} \langle T(i) \rangle. \qquad \text{(Eq. 3.5)} \]

Eq. (3.5) indicates that we can determine the individual codon translation rates from

ribosome profiling reads provided we know the average total synthesis time of the

transcript. Eq. (3.5) can be connected to the expression for normalized ribosome density,

derived in the SI of Weinberg et al.41, where 𝜏(𝑗,𝑖)/𝜏(𝑖)𝑁𝑐(𝑖) is the normalized ribosome density

and is expressed as a function of the 𝑐(𝑗, 𝑖) values. Eq. (3.5) is also related to a metric used in the

simulation study of Dao Duc and Song132 to estimate the codon translation rates. It is

important to note that 𝜏(𝑗, 𝑖) is the actual codon translation time, which includes the time

delay caused by ribosome-ribosome interactions and therefore differs from the inverse of the

intrinsic translation rate of a codon, 𝜔(𝑗, 𝑖). 𝜏(𝑗, 𝑖) is equal to the inverse of 𝜔(𝑗, 𝑖)𝑓(𝑗, 𝑗 + ℓ, 𝑖)134,

where 𝑓(𝑗, 𝑗 + ℓ, 𝑖) is the conditional probability that, given that a ribosome is at the 𝑗th

codon position, there is no ribosome at the (𝑗 + ℓ)th codon position.
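As a minimal sketch of how Eq. (3.5) turns a read profile into translation times, the function below takes the A-site read counts for codon positions 2 to 𝑁𝑐(𝑖) and the mean synthesis time of the transcript. The function name and the list-based input are illustrative assumptions.

```python
def codon_translation_times(reads, mean_synthesis_time):
    """Apply Eq. (3.5): tau(j, i) = c(j, i) / sum_l c(l, i) * <T(i)>.

    reads[k] is the read count c(j, i) at codon position j = k + 2, so the
    list covers positions 2..N_c(i). The returned times carry the same unit
    as mean_synthesis_time.
    """
    total_reads = sum(reads)
    if total_reads == 0:
        raise ValueError("profile has no reads; Eq. (3.5) is undefined")
    return [c / total_reads * mean_synthesis_time for c in reads]


times = codon_translation_times([10, 30, 60], mean_synthesis_time=1.0)
print(times)
```

By construction the returned times sum to the mean synthesis time, which is the constraint expressed by Eq. (3.2).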

3.4.2 Application

3.4.2.1 In silico validation of our methods

As a first step to test the accuracy of the translation rates measured with Eq. (3.5), we

applied the equation to S. cerevisiae ribosome profiles generated by Gillespie simulations135 in

which all of the underlying rates are known (see Methods). If our analysis method is

accurate, then a necessary condition is that it reproduces these rates from the simulated

profiles. We applied Eq. (3.5) to the steady-state ribosome profiles and found that the

individual codon translation times are accurately measured by our method (Figure 3.1,

median 𝑅² = 0.96 and median slope = 1.00). Thus, the analysis method we have created

can in principle accurately capture the translation rate parameters.
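The agreement statistics quoted above (𝑅² and the slope of the best-fit line) can be computed with an ordinary least-squares fit. The sketch below assumes the estimated times are regressed on the true times. That choice, and the function name, are assumptions rather than a description of the original analysis scripts.

```python
def slope_and_r2(true_times, estimated_times):
    """Least-squares slope of estimated vs. true times and the squared
    Pearson correlation between them."""
    n = len(true_times)
    mean_x = sum(true_times) / n
    mean_y = sum(estimated_times) / n
    sxx = sum((x - mean_x) ** 2 for x in true_times)
    syy = sum((y - mean_y) ** 2 for y in estimated_times)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(true_times, estimated_times))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


slope, r2 = slope_and_r2([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(slope, r2)
```

A slope near 1 together with a high 𝑅² indicates that the estimates match the true values in absolute terms and not only in rank order.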

There are several points worth noting concerning these tests. First, the rates used

in the simulation model are realistic, having been taken from literature values129,136.

Second, the depth of coverage in the simulated ribosome profiles is in the same range as

experiments, e.g., having 26 million reads arising from coding sequences41. Third, Eq.

(3.5) requires knowledge of the average synthesis time of a protein, which is experimentally

difficult to measure. Therefore, in the above analyses we used the approximation that the

average synthesis time of a protein is proportional to the number of codons in its transcript,

multiplied by the transcriptome-wide average codon translation time (Eq. (B.5))33,137.

However, when we increase our read coverage from 7.1 million to 35.5 billion reads and

use the exact synthesis time of a protein, 𝑅² between the estimated and true codon

translation rates rises above 0.99 for all 85 transcripts in the simulated data (see Methods).

Thus, our model is reasonably accurate when approximate protein synthesis times are

used (Eq. (B.5)) and the coverage is similar to typical experiments, and highly accurate

when the exact synthesis time is used and coverage is high.
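The approximation described above can be sketched as follows. The sketch takes the proportionality constant to be 1, so that the synthesis time is the number of codon positions times a transcriptome-wide average codon time. This is one reading of the text and may not match Eq. (B.5) exactly, and the 0.2 s average used in the example is a placeholder, not a value from this study.

```python
def approximate_synthesis_time(n_codons, mean_codon_time):
    """Synthesis time taken as (number of codons) x (average codon time)."""
    return n_codons * mean_codon_time


def times_from_reads(reads, mean_codon_time):
    """Eq. (3.5) with the approximate synthesis time substituted for <T(i)>.

    reads covers the codon positions included in the sum of Eq. (3.5);
    len(reads) is used as the codon count, which is an assumption.
    """
    total_time = approximate_synthesis_time(len(reads), mean_codon_time)
    total_reads = sum(reads)
    return [c / total_reads * total_time for c in reads]


print(times_from_reads([5, 10, 15, 10], mean_codon_time=0.2))
```

Under this approximation each translation time equals the codon's read count divided by the mean read count of the transcript, multiplied by the average codon time.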

Figure 3.1. Eq. (3.5) accurately determines codon translation times from simulated ribosome

profiles. (A) Average translation time of a codon in YER009W S. cerevisiae transcript is plotted as

a function of its position within the transcript. The true codon translation times in the simulations are

plotted as green boxes, blue and black data points are the translation times measured using Eq.

(3.5). Blue data points were calculated using the average protein synthesis time measured from the

simulations and relative ribosome density calculated using a large number of in silico ribosome

profiling reads. Black data points were calculated using the average protein synthesis time

estimated from the scaling relationship (Eq. (B.5)) and the relative ribosome density calculated from

the in silico reads which were equal to the reads aligned to the same transcript in the experiment41.

(B) Measured codon translation times, plotted with black and blue data points in (A), are plotted

against true codon translation times in the simulations in the top and bottom panel, respectively. (C)

Probability distribution of the 𝑅² correlation between the true and calculated codon translation times

for the 85 S. cerevisiae transcripts. (D) Probability distribution of the slope of the best-fit lines

between the estimated and true codon translation times for the 85 S. cerevisiae transcripts. The

high 𝑅² in (C) and median slope of 1.00 in (D) indicate that Eq. (3.5) can, in principle, accurately

measure absolute rates.

3.4.2.2 Measurement of individual codon translation rates

To extract individual codon translation times along a coding sequence we applied Eq. (3.5)

to 117 and 364 high-coverage transcripts from ribosome profiling data reported,

respectively, in the studies of Nissley et al.9 and Williams et al.114 (see Methods). The number

of transcripts in each of these datasets is small compared to the size of the S. cerevisiae

transcriptome. Therefore, to determine whether these subsets of transcripts are

representative of the whole transcriptome we compared the distributions of different

physicochemical properties in these two sets to the total transcriptome. We find that the

subset of transcripts from Nissley et al.9 have 6.6% higher mean GC content but a very

similar mode of length distribution and codon usage relative to the total transcriptome

(Figure B.1A-C). For Williams et al.’s ribosome profiling dataset114 we again find that the

mode of the length distribution and codon usage is similar to the S. cerevisiae

transcriptome, with 5.3% higher mean GC content (Figure B.1D-F). This indicates that the

sets of transcripts we analyze are largely representative of the properties of the

transcriptome, but have slightly higher GC content.

Upon extracting individual codon translation times from these ribosome profiling

data, we first characterized the distribution of translation times for the 61 sense codons

(Table B.1). We find an approximately three-fold difference between the median translation times of

the fastest and slowest codons in the Nissley dataset9. The fastest and slowest codons

are AUU and CCG, which are translated in 127±2 and 344±37 ms (median ± standard

error), respectively. The variability in translation times for a given codon type is even

larger, as illustrated by wide distributions of their translation times in the Nissley dataset

(Figures 3.2A, B.2A). Figure 3.2B shows an example where the AAG codon is translated

with translation times ranging from 59 ms at codon position 413 to 363 ms at codon

position 196 in the S. cerevisiae transcript YAL038W. We find a 16-fold variability in codon

translation times across the transcriptome even if we ignore the extremities of the

distributions by only considering the translation times between the 5th and 95th percentiles

of all codon types. Similar ranges are found in the Williams dataset, where there is a

26-fold variability in translation times and a 3.9-fold difference in median translation times of the

fastest (AUC) and slowest (CGA) codons, which are translated with median times of 128±2

and 496±61 ms, respectively (Table B.2, Figure B.2B). The medians and standard

deviations of translation time distributions are well correlated between the above two

datasets (Figures B.3A, B). The study of Dao Duc and Song132 also infers the individual

codon translation rates and a very high correlation is observed between the rates obtained

using our method and the rates found in that study (Figure B.3C).
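The two summary statistics used in this section, the fold difference between the slowest and fastest median codon times and the fold variability between the 5th and 95th percentiles, can be sketched as below. The percentile uses linear interpolation, which is an assumption about how the original analysis was done. The toy input reuses the two medians quoted above purely for illustration.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    s = sorted(values)
    rank = (len(s) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def fold_variability(times, low=5, high=95):
    """Ratio of the high to the low percentile of a set of translation times."""
    return percentile(times, high) / percentile(times, low)


def median_fold_difference(times_by_codon):
    """Fastest codon, slowest codon, and the ratio of their median times."""
    medians = {codon: statistics.median(t) for codon, t in times_by_codon.items()}
    fastest = min(medians, key=medians.get)
    slowest = max(medians, key=medians.get)
    return fastest, slowest, medians[slowest] / medians[fastest]


toy = {"AUU": [127, 127, 127], "CCG": [344, 344, 344]}   # illustrative only
print(median_fold_difference(toy))
print(fold_variability(list(range(1, 102))))
```

With the quoted medians, 344 ms / 127 ms is about 2.7, consistent with the "approximately three-fold" difference stated in the text.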

3.4.2.3 Molecular factors flanking the A-site shape the variability of individual codon

translation rates

A number of molecular factors have been shown or suggested to influence the translation

rate of a codon in the A-site, including tRNA concentration, mRNA structure, wobble-base

pairing, and proline residues at or near the ribosome P-site29,35,41–43,53,54,105. Here, we test

whether the presence or absence of these factors correlates with changes in translation

speed that we measure. We first examined whether the cognate tRNA concentration

correlates with our translation times. We find that the median codon translation times

negatively correlate with the abundance of cognate tRNA (Figures 3.3A and B, 𝜌 =

−0.51 (p-value = 0.0006) and 𝜌 = −0.50 (p-value= 0.0009), respectively), indicating that

codons with lower cognate tRNA concentrations typically are translated more slowly.
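The rank correlation reported above is a Spearman coefficient, which is the Pearson correlation of the ranks. A self-contained sketch is shown below. The function names are illustrative, and no p-value is computed.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho between, e.g., tRNA abundance and median codon time."""
    return pearson(ranks(x), ranks(y))


print(spearman([1, 2, 3, 4], [10, 8, 5, 1]))
```

A negative coefficient means that codons with more abundant cognate tRNA tend to have shorter median translation times.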

Figure 3.2. Wide variability in individual codon translation rates in vivo. (A) Probability density

functions for translation times of AUU, GAC and UGG codons in the Nissley dataset. Median translation

times for the AUU, GAC and UGG codons are 127, 208 and 331 ms, respectively. (B) The translation

time profile of S. cerevisiae transcript YAL038W from the Nissley dataset is shown between codon

positions 150 and 450. The AAG codon (colored red) is translated in 362.8 ms at the 196th codon

position and in 58.6 ms at the 413th codon position.

The presence of a proline amino acid at the ribosome’s P-site can slow down

translation due to its stereochemistry58. We tested whether such an effect was present in

our data set by calculating the percentage difference in median translation time at the

A-site when proline is present at the P-site versus when it is not. We

find a 19% increase in median translation time when proline is present (Figure 3.3C,

Mann-Whitney U test, p-value = 2.2 × 10⁻³²), indicating that proline does systematically

slow down translation in vivo.
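The percentage difference in median translation time can be sketched as follows. The sketch assumes that the P-site codon is the codon immediately upstream of the A-site codon and that the first entry of the list has no defined P-site partner. Function names are hypothetical and the significance test is omitted.

```python
import statistics

PROLINE_CODONS = {"CCU", "CCC", "CCA", "CCG"}


def split_by_p_site_proline(codons, times):
    """Split A-site times by whether the upstream (P-site) codon encodes proline.

    codons[k] and times[k] describe codon position k; the P-site codon for an
    A-site at k is taken to be codons[k - 1].
    """
    with_pro, without_pro = [], []
    for k in range(1, len(codons)):
        target = with_pro if codons[k - 1] in PROLINE_CODONS else without_pro
        target.append(times[k])
    return with_pro, without_pro


def percent_difference_in_median(times_with, times_without):
    """100 x (median with factor - median without) / median without."""
    m_with = statistics.median(times_with)
    m_without = statistics.median(times_without)
    return 100.0 * (m_with - m_without) / m_without


codons = ["AUG", "CCG", "AAG", "GCU", "CCA", "UGG"]
times = [0, 100, 119, 100, 100, 119]          # toy values in ms
with_pro, without_pro = split_by_p_site_proline(codons, times)
print(percent_difference_in_median(with_pro, without_pro))
```

The same percentage-difference function applies to any binary factor, such as the presence or absence of downstream mRNA structure.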

It has been found that the presence of downstream mRNA secondary structure

can slow down the translation at the A-site51–54. To test for this effect, we plotted the

difference in the median translation time at the A-site when mRNA secondary structure is

present versus when it is not present at a given number of codon positions downstream

of the ribosome A-site. Structured versus unstructured nucleotides were identified using

DMS data138. We find that when secondary structure is present 4 codons downstream of

the A-site, placing that structure directly at the leading edge of the ribosome, there is on

average a 6.7% increase in codon translation time at the A-site (Figure 3.3D,

Mann-Whitney U test, p-value = 2.7 × 10⁻¹⁴). A slowdown is also found when we

cross-reference our codon translation times with mRNA structure data from PARS139, which

measures the presence of mRNA structure in vitro (Figure 3.3E, Mann-Whitney U test,

p-value = 5.6 × 10⁻⁹).
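A bootstrap confidence interval for the percentage difference in medians, of the kind shown as error bars in Figure 3.3, can be sketched as below. The resampling scheme (independent resampling of the two groups with replacement, percentile interval) is an assumption, as the exact procedure is not spelled out in this section.

```python
import random
import statistics


def bootstrap_ci(times_structured, times_unstructured,
                 n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the % difference in medians."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(times_structured, k=len(times_structured))
        b = rng.choices(times_unstructured, k=len(times_unstructured))
        median_b = statistics.median(b)
        diffs.append(100.0 * (statistics.median(a) - median_b) / median_b)
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


structured = [100 + 3 * k for k in range(40)]      # toy translation times, ms
unstructured = [95 + 3 * k for k in range(40)]
print(bootstrap_ci(structured, unstructured, n_boot=500))
```

Repeating this for each downstream codon offset gives one interval per position.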

Wobble base pairing between the codon and the anticodon stem-loop of the tRNA has been

found to slow translation compared to Watson-Crick base pairing in

bacteria140 and metazoans49. For each pair of codon types that are decoded by the same

tRNA molecule, by Watson-Crick base pairing in one instance and wobble base pairing in

the other, we tested whether the two codon types are translated at different rates. We find

that there is no systematic difference in median translation times between codons that are

decoded by either mechanism (Figure 3.3F, Wilcoxon signed-rank test, p-value = 0.46),

indicating that, at least in S. cerevisiae, wobble base pairing does not slow down in vivo

translation elongation.
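The paired comparison above rests on the Wilcoxon signed-rank statistic. The sketch below computes only the two rank sums for paired median times (Watson-Crick versus wobble). Converting them to a p-value is left to a statistics library, for example `scipy.stats.wilcoxon` if SciPy is available. The function name is hypothetical.

```python
def signed_rank_sums(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped and tied absolute differences share their
    average rank, as in the standard signed-rank procedure.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    rank = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            rank[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(rank, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(rank, diffs) if d < 0)
    return w_plus, w_minus


watson_crick = [10, 12, 15, 20]    # toy median times for four codon pairs
wobble = [11, 10, 12, 16]
print(signed_rank_sums(watson_crick, wobble))
```

Rank sums of similar size indicate no systematic shift between the two decoding mechanisms.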

These results were reproduced using another dataset114, which also shows that

codon translation times anti-correlate with tRNA concentration (Figures B.4A-B, 𝜌 =

−0.58 (p-value = 7.8 × 10⁻⁵) and 𝜌 = −0.56 (p-value = 0.0002), respectively) and exhibit a

significant slowdown in codon translation time when a proline is present at the P-site

(Figure B.4C, Mann-Whitney U test, p-value = 3.0 × 10⁻²⁷) and when mRNA structure is present

downstream of the A-site (Figures B.4D-E, Mann-Whitney U test, p-values =

3.6 × 10⁻⁵ and 8.7 × 10⁻⁴, respectively). Similarly, we found no difference between the

translation rates of codons that are translated with Watson-Crick and wobble base pairing

(Figure B.4F, Wilcoxon signed-rank test, p-value = 0.88).

Figure 3.3. Molecular factors shaping the variability of individual codon translation rates.

(A-B) Median translation times of codon types are negatively correlated with cognate tRNA

abundance estimated by (A) gene copy number and (B) RNA-Seq gene expression. (C) Probability

distribution of the translation time of codons that are followed by the proline encoding codon and

the rest of the other codons are plotted in green and blue, respectively. (D-E) Percentage difference

in median translation times when mRNA structure is present relative to when it is not present is

plotted as a function of codon position after the A-site. Grey bars indicate results that are not

statistically significant. Error bars are the 95% C.I. calculated using 10⁴ bootstrap cycles;

significance is assessed using the Mann-Whitney U test corrected with the Benjamini-Hochberg

FDR method for multiple-hypothesis correction. mRNA structure information used in (D) and (E)

are provided by in vivo DMS and in vitro PARS data, respectively. (F) Scatter plot of the median

translation times of pairs of codon types that are decoded by the same tRNA molecule. The red

line is the identity line. The list of tRNA molecule names and decoded codon types were taken from

Cannarozzi et al.176. Error bars are standard error about the median translation time calculated

with 10⁴ bootstrap cycles.

3.5 Discussion

We have presented a method for measuring elongation rates from ribosome profiling data.

What distinguishes our approach from many others is that it uses simple equations derived

from chemical kinetic principles, it does not require simulations or a large number of

parameters, and it yields absolute rather than relative rates. We demonstrated that our

approach provides accurate results when applied to test data sets (Figure 3.1), and

reproduced previously reported correlations between translation speed and various

molecular factors (Figures 3.3 and B.4), suggesting the rates obtained by this method are

reasonable.

A novel finding concerning elongation rates is that in S. cerevisiae the translation

time of a codon depends dramatically on its context within a transcript. In S. cerevisiae,

the range of individual codon translation times spans up to 26-fold, from 45 to 1,194 ms,

even after discarding the top and bottom 5% of this distribution as possible outliers. The

codon AAG in gene YAL038W, for example, occurs 36 times along this gene’s transcript.

At the 196th codon position AAG is translated in 363 ms, and at the 413th position AAG is

translated in 59 ms. Thus, the same codon in different contexts can be translated at very

different speeds. Characterizing the distribution of mean times of translation of different

occurrences of the same codon reveals a broad distribution (Figure 3.2A), whose

coefficient of variation is often close to 0.5 (Tables B.1 and B.2). This means that the

standard deviation is half of the average translation time of a codon. This leads to the

important finding that the average translation rate of a codon type is not representative of

the rate at which most instances of that codon type are translated. These results are

consistent with the findings that a large number of molecular factors determine codon

translation rates in vivo141, thus giving rise to a broad distribution of rates (Figures 3.2A,

B.2) and these factors have been shown to cause a bias towards slower translation in the

first 200 codons of the transcript54.
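The coefficient of variation referred to above is the standard deviation divided by the mean. The sketch below uses the population standard deviation. Whether the reported values use the population or the sample form is not stated here, so that choice is an assumption.

```python
import statistics


def coefficient_of_variation(times):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(times) / statistics.mean(times)


# Toy example: two occurrences of one codon type, translated in 100 and 300 ms.
print(coefficient_of_variation([100, 300]))
```

A value near 0.5 means that a typical instance of a codon differs from the codon's average time by about half of that average.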

A molecular factor that has not been quantified in this study is ribosome queuing.

Currently, the conventional ribosome profiling protocol isolates only monosomes and the

monosome-protected fragments are extracted and sequenced. However, should

ribosomes queue along a transcript, disomes and trisomes are likely to be produced that

are not accounted for in current datasets. Recent studies132,142 have attempted to quantify

the extent of ribosomal queuing but several challenges remain. One of the central

challenges is to correctly identify the location of A-sites of ribosomes translating disome-

and trisome-protected mRNA fragments. Current ribosome profiling datasets that include

disomes have very sparse coverage, which limits the application of our method but more

importantly suggests that the occurrence of disomes, and hence of queuing, may be rather

rare under normal growth conditions. However, under stress conditions, ribosome queuing

has the potential to become frequent for some genes and potentially decrease the

accuracy of our method unless the disome and trisome fragments are included. As

advances in ribosome profiling experiments are made to generate high coverage data and

improve the A-site identification on disomes and trisomes, our method will be able to more

accurately quantify the rates of translation elongation under non-standard growth

conditions.

A number of approaches have been developed to measure codon translation times

including simulation based approaches131,132 that extract rates by comparing the local

distribution of ribosome profiling reads with simulated ribosome densities, others that

optimize an objective function39 or fit a normalized-footprint-count distribution of a codon

to an empirical function130, and yet others that measure relative codon translation times

by quantifying the enrichment of ribosome read density using a variety of procedures35,43.

In contrast, Eq. (3.5) allows individual, absolute codon translation rates to be calculated

directly from the ribosome profile along the transcript. Another distinction is that a number

of these methods39,43,130,131 assume that all occurrences of a codon across the

transcriptome must be translated at the same rate. This assumption cannot be correct as

it is known that non-local aspects of translation (such as mRNA structure) can influence

the translation speed of individual codons. Eq. (3.5) does not make this assumption, and

therefore its extracted rates will better reflect the naturally occurring variation of codon

translation times across a transcript.

The codon usage in a transcript, and associated translation rates, can affect

various co- and post-translational processes involving nascent proteins11. Therefore, the

accurate knowledge of codon translation times measured using Eq. (3.5) will help provide

a better quantitative understanding of how translation speed can impact the efficiency of

co-translational processes, such as protein folding, chaperone binding, and numerous

other processes involving the nascent protein. Coupled with molecular biology techniques

that can knock out various genes and their functions in cells, Eq. (3.5) provides the

opportunity to quantitatively examine whether co-translational processes can cause

translation speed changes.

Ribosome profiles have ill-quantified sequencing biases27 that can potentially

produce reads that are not proportional to the underlying number of ribosomes at a

particular codon position. This could lead to errors in the extraction of translation rate

parameters using our methods143. It has been demonstrated that using translational

inhibitors like cycloheximide leads to distortion of ribosome profiles due to inefficient arrest

of translation28,29. This was one of the primary reasons why initial studies using

cycloheximide did not observe a correlation of codon translation rates and cognate tRNA

concentration. While there is often a strong correlation between the total number of

mapped reads per transcript between datasets from different studies, the correlation is

often poor at the individual nucleotide level101. This “noise” at this resolution has been

attributed to sparse read coverage101, choice of ribonuclease for digestion144, and the

methods used to halt elongation in the ribosome profiling protocol28,29. Restricting our

analyses to transcripts with high coverage contributes to more reproducible results, as can

be seen by the high correlation between the two datasets used in this study (Figure B.3).

Experimental improvements that minimize bias have been developed26,41,144,145, such as

using flash-freezing for arresting translation and utilizing short microRNA library

generation techniques146, but sequence-dependent biases can still exist, for example due

to varying efficiencies of linker ligation147. As experimental techniques are improved to

minimize bias, the accuracy of the rates extracted using our methods will also increase.

The absence of accurate translation rate parameters is an impediment to

quantitatively modeling the process of translation. By measuring translation rate

parameters using a chemical-kinetic framework, our method can contribute to ongoing

efforts2,148 to understand how the sequence features of an mRNA molecule can regulate

gene expression. More broadly, the approach we have taken in this study is to utilize ideas

from chemistry and physics to analyze next-generation sequencing data, a branch of

bioinformatics we refer to as physical bioinformatics. We expect that this physical-science-

based approach will prove useful in understanding other large biological data sets

concerning translation and complement the conventional computer science approaches to

bioinformatics.

3.6 Methods

3.6.1 Simulated steady state ribosome profiling data. We carried out protein synthesis

simulations using the inhomogeneous ℓ-TASEP model142,148–151. In this model, with ℓ = 10

and the A-site of the ribosome located at the 6th codon within the ribosome-protected

mRNA fragment, a new translation-initiation event stochastically occurs on transcript 𝑖 with

rate 𝛼(𝑖) when the first six codons of the transcript are not occupied by another ribosome25.

The ribosome then stochastically moves along the transcript from codon position 𝑗 to 𝑗 + 1

with rate 𝜔(𝑗, 𝑖) if no ribosome is present at the (𝑗 + ℓ)th codon position. A ribosome

stochastically terminates the translation process with rate 𝛽 when its A-site encounters

the stop codon. Note that our simulation model does not account for other processes such

as ribosome recycling152 and drop-off153.

85 S. cerevisiae mRNA transcripts were selected to test our codon translation rate

measurement method. They were chosen based on the filtering criteria that each codon

has at least 3 reads in the ribosome profiling data reported in the study of Weinberg et al.41.

We used the translation-initiation rates reported in Ciandrini et al.129 in our simulations for

these transcripts. We used codon translation rates from Fluitt and Viljoen’s model for all

61 sense codons136 and set the translation-termination rate to 35 s⁻¹ 129. We set ℓ = 10

codons in our simulations because it is the canonical mRNA fragment length that is

protected by ribosomes against nucleotide digestion in ribosome profiling experiments25.

We simulated the translation of these 85 S. cerevisiae mRNA sequences using

Gillespie’s algorithm135 to generate the in silico ribosome profiling data. During the

simulations, we recorded ribosome locations across the transcript every 100 steps, which

we found minimized the time-correlation between successive saved snapshots. The codon

positions of the ribosome’s A-site in each of these snapshots, summed over all snapshots,

constituted the in silico generated ribosome profile for the transcript. We ran the

simulations until the total number of in silico ribosome profiling reads was equal to the

total number of reads aligned to the same transcript measured from experimental

ribosome profiling data reported in the study of Weinberg et al.41. This allowed us to create a

realistic level of statistical sampling in our in silico ribosome profiles. Each of the

uncorrelated snapshots can be thought of as a separate copy of the mRNA transcript. Thus,

the total number of these snapshots was equal to the mRNA copy number in our in silico

experiment, which we used to calculate 𝜌(𝑗, 𝑖)s.
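
The simulation protocol described in this section can be illustrated with a minimal sketch of a Gillespie simulation of the inhomogeneous ℓ-TASEP on a single transcript. This is not the code used in this study. The function name `simulate_tasep` and the parameters `a_site_offset`, `n_snapshots` and `snapshot_dt` are hypothetical. The sketch assumes that a ribosome with its A-site at codon a covers codons a − 5 to a + 4, that a newly initiated ribosome is placed with its A-site on the first codon, and that snapshots are saved at fixed intervals of simulated time (the text above saves them every 100 steps). The run length is set by a fixed snapshot count rather than by matching experimental read totals.

```python
import random

def simulate_tasep(omega, alpha, beta, ell=10, a_site_offset=5,
                   n_snapshots=5000, snapshot_dt=1.0, seed=0):
    """Return A-site counts per codon summed over snapshots (a toy ribosome profile).

    omega[j] is the rate of moving from codon j to j + 1, alpha the initiation
    rate, and beta the termination rate once the A-site reaches the stop codon.
    """
    rng = random.Random(seed)
    stop = len(omega)             # index of the stop codon; sense codons are 0..stop-1
    ribosomes = []                # A-site positions, leading ribosome first
    profile = [0] * (stop + 1)
    t, next_snap, taken = 0.0, snapshot_dt, 0
    while taken < n_snapshots:
        events = []               # (rate, ribosome index); index None means initiation
        # Initiation needs the first a_site_offset + 1 codons free of any footprint.
        if not ribosomes or ribosomes[-1] - a_site_offset > a_site_offset:
            events.append((alpha, None))
        for k, pos in enumerate(ribosomes):
            if pos == stop:
                events.append((beta, k))                 # termination
            elif k == 0 or ribosomes[k - 1] != pos + ell:
                events.append((omega[pos], k))           # hop j -> j + 1
        total = sum(rate for rate, _ in events)
        t_next = t + rng.expovariate(total)
        # The current configuration persists until t_next, so snapshot it first.
        while next_snap <= t_next and taken < n_snapshots:
            for pos in ribosomes:
                profile[pos] += 1
            next_snap += snapshot_dt
            taken += 1
        t = t_next
        # Pick one event with probability proportional to its rate.
        r, acc, chosen = rng.random() * total, 0.0, events[-1][1]
        for rate, k in events:
            acc += rate
            if r < acc:
                chosen = k
                break
        if chosen is None:
            ribosomes.append(0)
        elif ribosomes[chosen] == stop:
            ribosomes.pop(chosen)
        else:
            ribosomes[chosen] += 1
    return profile

# Toy transcript: 40 sense codons at 10 per second with one slow codon at 1 per second.
rates = [10.0] * 40
rates[20] = 1.0
toy_profile = simulate_tasep(rates, alpha=0.05, beta=35.0)
```

In this toy example the slow codon accumulates many more A-site counts than its neighbors, which is the inverse relationship between density and translation rate that the analysis relies on.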

3.6.2 In silico measurement of average protein synthesis and codon translation

times

To calculate the codon translation times (Eq. (3.5)) from in silico ribosome profiles we

need the average time a ribosome takes to synthesize a protein from a given transcript.

We measured this quantity from our simulations by recording the time it takes a ribosome

to traverse from the start codon to the stop codon in the transcript. The average synthesis

time of a protein was then calculated from 10,000 individual ribosomes.

We also calculated the average synthesis time of a protein using a scaling

relationship that uses the transcriptome-wide average codon translation time (Appendix

B, Eq. (B.5)). To calculate this quantity, we first computed the average codon translation

time for each transcript by dividing the average protein synthesis time of a transcript by its

CDS length. We then calculated the transcriptome-wide average codon translation time

using the average codon translation time of each transcript.
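
The two-step averaging described above can be written compactly. The sketch below is illustrative only: the function name and the dictionary inputs are hypothetical, and it assumes the transcriptome-wide value is the unweighted mean of the per-transcript averages, which the text does not state explicitly.

```python
def transcriptome_average_codon_time(synthesis_time, cds_length):
    """synthesis_time: gene -> average protein synthesis time (s);
    cds_length: gene -> CDS length in codons.
    Returns (transcriptome-wide average, per-transcript averages)."""
    # Step 1: average codon translation time of each transcript.
    per_transcript = {g: synthesis_time[g] / cds_length[g] for g in synthesis_time}
    # Step 2: unweighted mean over transcripts (an assumption of this sketch).
    overall = sum(per_transcript.values()) / len(per_transcript)
    return overall, per_transcript

# Toy values, not taken from the simulations in this study.
overall, per_gene = transcriptome_average_codon_time(
    {"geneA": 60.0, "geneB": 20.0}, {"geneA": 300, "geneB": 200})
```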

Testing the accuracy of Eq. (3.5) requires the real codon translation times, which

we measured by setting a separate clock at each codon position of a transcript in our

simulations. These clocks measured the time difference between successive arrival and

departure of a ribosome at each codon position. To calculate the average codon

translation time at each codon position at least 10,000 such measurements were made.

3.6.3 Analysis of ribosome profiling and RNA-Seq data

3.6.3.1 Datasets: To calculate the codon translation rates, we apply our method to high-

coverage ribosome profiling datasets of wild type S. cerevisiae reported in Nissley et al.9

and Williams et al.114 with NCBI accession numbers GSM1949551 and GSM1495503,

respectively. In our analysis, reads were preprocessed and mapped to the sacCer3 reference

genome as described in Nissley et al.9. To maintain the accuracy of read assignment,

transcripts in which multiply mapped reads constitute more than 0.1% of total reads

mapping to the CDS region were not considered in the analysis. A-site positions in

ribosome profiling reads were assigned according to the offset table generated using an

Integer Programming algorithm which maximizes the reads between the second and stop

codon of transcripts154. The offset table for S. cerevisiae is taken from Table 2.1 of Chapter

2 (also Ahmed et al.154).

3.6.3.2 Selection of genes for codon translation rates: To extract individual codon

translation rates, we restrict our analysis to genes that have at least 3 reads at every codon

position of the transcript. We find that 117 and 364 genes meet this criterion in the data

set of Nissley et al.9 and Williams et al.114, respectively. This stringent requirement is

necessary since Eq. (3.5) would predict codons with zero reads to be translated in zero

time. Reads at the start codon and the second codon have contributions from the

translation initiation process; therefore, we ignored these codon positions in our

calculations of translation time distributions and correlation with molecular factors. As

stated before, transcripts in which multiply mapped reads exceeded 0.1% of the total

reads mapped to the transcript were discarded. Genes with overlap of coding sequence

regions as well as those containing introns (which constitute less than 6% of the S. cerevisiae genome)

were not considered in the analysis to avoid overlap of ribosome profiles.
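
The read-count and multi-mapping criteria above amount to a simple per-gene filter. The following sketch is a hypothetical illustration, not the pipeline used in this study; the function name, argument names, and the use of the summed per-codon counts as the CDS read total are assumptions.

```python
def passes_coverage_filter(codon_reads, multi_mapped_reads,
                           min_reads=3, max_multi_fraction=0.001):
    """codon_reads: A-site read counts at every codon position of the CDS.
    multi_mapped_reads: number of reads on this CDS that map to multiple loci."""
    total = sum(codon_reads)
    if total == 0:
        return False
    # Discard transcripts with more than 0.1% multiply mapped reads.
    if multi_mapped_reads / total > max_multi_fraction:
        return False
    # Require at least 3 reads at every codon position.
    return all(count >= min_reads for count in codon_reads)
```

The overlap and intron exclusions would be applied separately from gene annotations and are not shown here.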

3.6.3.3 Miscellaneous: (a) Since experimental measurements of ⟨𝑇(𝑖)⟩ are not available

for S. cerevisiae, we use Eq. (B.5) to estimate ⟨𝑇(𝑖)⟩ with ⟨𝜏𝐴⟩ = 200 ms, as reported in the

literature155,156. (b) The measures for tRNA abundance based on gene copy number and

RNA-Seq measurements were obtained from Table S2 of Weinberg et al.41.

3.6.4 Assignment of mRNA secondary structure

Both DMS and PARS data provide information about base-paired nucleotides within an

mRNA molecule. We considered a codon to be structured if at least two of its three

nucleotides were base-paired or one nucleotide was base-paired and the structure

information for the other two nucleotides was not available.

DMS data for S. cerevisiae were downloaded from the GEO database with accession

number GSE45803 138. The reads from all in vivo replicates were pooled together and then

aligned to the ribosomal RNA sequences using Bowtie 2 (v2.2.3)109. The reads which did

not align to the ribosomal RNA sequences were then aligned to the Saccharomyces

cerevisiae assembly R64-2-1 (UCSC: sacCer3) using Tophat (v2.0.13)110. In our analysis,

A and C nucleotides were considered base-paired when the DMS signal was below the

threshold of 0.2 and considered unstructured if the DMS signal was greater than 0.5. A

and C nucleotides with DMS signal between 0.2 and 0.5 are considered ambiguous and

classified together with U and G nucleotides, which do not react with DMS. Codons

involving such nucleotides were not considered in our analysis.
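
The DMS thresholds and the codon-level rule stated at the start of this section can be combined as in the sketch below. It is a hypothetical illustration: the function names are invented, signals exactly equal to 0.2 or 0.5 are treated as ambiguous (an assumption), and only the "structured" call is implemented because the text does not spell out the complementary rule for unstructured codons.

```python
def classify_nucleotide(base, dms_signal):
    """Return True (base-paired), False (unpaired) or None (no usable information)."""
    if base not in ("A", "C") or dms_signal is None:
        return None               # U and G do not react with DMS
    if dms_signal < 0.2:
        return True
    if dms_signal > 0.5:
        return False
    return None                   # ambiguous signal between the two thresholds

def codon_is_structured(codon, signals):
    """codon: three-letter string; signals: DMS signal for each nucleotide."""
    calls = [classify_nucleotide(b, s) for b, s in zip(codon, signals)]
    paired = calls.count(True)
    unknown = calls.count(None)
    # Structured if two or more nucleotides are paired, or if one is paired
    # and no information is available for the other two.
    return paired >= 2 or (paired == 1 and unknown == 2)
```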

PARS data were downloaded from genie.weizmann.ac.il/pubs/PARS10 with

PARS scores available for all transcripts, except YDR461W and YNL145W, which were

excluded from our analysis. Nucleotides with a PARS score greater than 0 were

considered base-paired139.

Chapter 4 EVOLUTIONARILY SELECTED AMINO ACID PAIRS ENCODE

TRANSLATION-ELONGATION RATE INFORMATION

This chapter is formatted as a 2,500-word manuscript that was recently submitted to a

peer-reviewed journal. The authors are Nabeel Ahmed, Ulrike A. Friedrich, Pietro

Sormanni, Prajwal Ciryam, Bernd Bukau, Günter Kramer and Edward P. O’Brien. The

contributions of each author to this study are: N.A. and E.P.O. conceived the study. U.A.F.,

G.K. and B.B. carried out experiments to generate S. cerevisiae mutant strains and ran

Ribo-Seq on these strains. P.S. and P.C. contributed to analysis methods of published

Ribo-Seq data and annotation of domain boundaries in S. cerevisiae. N.A. analyzed the

data. N.A. and E.P.O. wrote the manuscript.

4.1 Abstract

The speed of translation is generally considered to be encoded within messenger RNA

molecules and influenced by intracellular conditions. Here, using a combination of

mutational experiments, bioinformatic and evolutionary analyses, we show that particular

pairs of amino acids and their associated tRNA molecules predictably and causally encode

translation rate information within the primary structures of proteins when these pairs are

present in the A- and P-sites of the ribosome. For some pairs, it is solely the amino acid

identity or tRNA identity that determines the variation in translation speed, while for others,

the speed is determined by a combination of these two factors. The fast-translating pairs

of amino acids are enriched seven-fold relative to the slow-translating pairs across the

Saccharomyces cerevisiae proteome, while the slow-translating pairs are enriched

downstream of domain boundaries. Thus, translation rate information is causally encoded

in the primary structures of proteins via pairs of amino acids, and signatures of

evolutionary selection pressure indicate their use in coordinating co-translational

processes.

4.2 Main Text

Variation in translation-elongation kinetics along a transcript’s coding sequence plays an

important role in the maintenance of cellular protein homeostasis by regulating co-

translational protein folding, localization, and maturation11,70,81. Codon translation rates are

influenced by a range of molecular factors, including the presence of particular tripeptide

sequence motifs composed of one or more prolines58,60–62,65 and positively charged

nascent-chain residues within the negatively charged ribosome exit tunnel42,133,

suggesting that the primary structures of proteins encode translation-elongation rate

information, in some cases through their influence on ribosome catalysis. Since the

ribosome catalyzes peptide bond formation between 400 unique amino acid pairs when

they reside in the P- and A-sites of the ribosome, we postulated that the chemical identity

of some of these pairs might predictably and causally alter codon translation rates.

To test our hypothesis, we used ribosome profiling applied to Saccharomyces

cerevisiae. Ribosome profiling is a high-throughput technique that measures ribosome

densities that are directly proportional to the location and number of ribosomes translating

different codon positions across a transcriptome25. The measured ribosome density ρ at

a codon is inversely proportional to the speed at which ribosomes translate that

codon26,143. We analyzed the translational profiles of 364 high-coverage transcripts

measured in six independent, published data sets41,111,113,114,157 (Table C.1). For each of

the 400 unique pairs of amino acids that can reside in the P- and A-sites — which for a

given pair we denote as (X,Z), where X is the amino acid in the P-site and Z is the amino

acid in the A-site — we compared the normalized ribosome density distribution, [𝜌(𝑋, 𝑍)],

arising from all instances of the pair (X,Z) in the data set versus the distribution [𝜌(~𝑋, 𝑍)]

arising from all instances of Z being in the A-site but X not being present in the P-site. For

example, for the pair denoted (N, R), N is in the P-site and R is in the A-site, while (~N, R)

corresponds to the 19 other naturally occurring amino acids that can be in the P-site when

R is in the A-site (Figure 4.1a). The percent change in the median of [𝜌(𝑋, 𝑍)] relative to

the median of [𝜌(~𝑋, 𝑍)] quantifies the influence of the identity of the P-site amino acid on

the rate of translation relative to the rate when any other amino acid is in the P-site (Eq.

C.2). This approach controls for cognate tRNA concentration effects because the A-site

amino acid is fixed. Applying this analysis to each of the six published datasets, we

obtained six matrices reporting the percent change in ribosome density when a particular

amino acid pair is present in the P- and A-sites (Figure C.1). We focus only on highly

reproducible results by taking the intersection of those pairs that exhibit a consistent sign

change in all datasets and a percent change that is statistically significant in at least four

of the datasets. We found 84 pairs in which the presence of a particular P-site amino acid

is correlated with faster translation (green-shifted colors in Figure 4.1b) and 73 pairs in

which the identity of the P-site amino acid is correlated with slower translation (red-shifted

colors in Figure 4.1b). The results for the remaining pairs are not significant or consistent

Figure 4.1. Bioinformatic analyses of ribosome profiling data indicate that the identity of amino

acids in the P- and A-sites can predictably alter the translation speed of the A-site codon. (a) A

ribosome with the amino acids N and R in the P- and A-sites, respectively. From ribosome profiling data,

we calculated the distribution of ribosome densities in the A-site from all instances of (N, R) in our dataset

and compared the result to the distributions of all other instances of R in the A-site when N is not present

in the P-site, denoted [(~N, R)] (top panel). (b) Each box in the matrix indicates, for a pair of amino acids

in the P- and A-sites, the percent change in median normalized ribosome density ρ when that particular

amino acid is in P-site compared to any other amino acid in the P-site, keeping the A-site amino acid

unchanged (Eq. C.2). The sign of the percent change must be consistent in all 6 analyzed ribosome

profiling datasets and statistically significant in at least 4 out of the 6 datasets, otherwise the box is colored

gray. * corresponds to any of the stop codons being present in the A-site. (c) Comparison of distributions

of amino acid pairs where R is kept constant at the A-site while the P-site is mutated from N to S. The

distributions of normalized ribosome densities for P- and A-site pairs (N, R) and (S, R), which differ

significantly from each other, are shown (Mann-Whitney U test, 𝑝 = 4.45 × 10⁻¹⁷). The median normalized

ribosome densities of the two distributions differ by 53.4%, and the odds of a change in translation speed

when (N, R) is mutated to (S, R) or vice versa is 2.98 (Eq. C.4). (d) The estimated percent difference

values for all 7,980 mutations of amino acid pairs with a constant A-site are plotted with respect to the

statistical significance of the difference between the distributions (see Methods in Appendix C). We

estimate that mutating the P-site will lead to significant changes in translation speed in 4,254 (53%) of

these mutations. (e) For the significant combinations of amino acids pairs, the distribution of the odds of

mutating any instance of the pair resulting in a change in speed is plotted.

across the datasets (gray boxes in Figure 4.1b). These results suggest that the identity of

the P-site amino acid in 157 pairs of amino acids can predictably accelerate or retard

translation.
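
The comparison of [𝜌(𝑋, 𝑍)] with [𝜌(~𝑋, 𝑍)] described above can be sketched as follows. Eq. C.2 is given in Appendix C and is not reproduced here, so this sketch assumes it has the form 100 × (median[𝜌(𝑋, 𝑍)] − median[𝜌(~𝑋, 𝑍)]) / median[𝜌(~𝑋, 𝑍)]. The function name and the input format are hypothetical, and the significance testing and cross-dataset consistency filter are omitted.

```python
from collections import defaultdict
from statistics import median

def percent_change_in_median(observations):
    """observations: iterable of (p_site_aa, a_site_aa, normalized_density).
    Returns {(X, Z): percent change in median density relative to (~X, Z)}."""
    by_pair = defaultdict(list)
    by_a_site = defaultdict(list)
    for x, z, rho in observations:
        by_pair[(x, z)].append(rho)
        by_a_site[z].append((x, rho))
    result = {}
    for (x, z), with_x in by_pair.items():
        # All instances of Z in the A-site where X is NOT in the P-site.
        without_x = [rho for p, rho in by_a_site[z] if p != x]
        if not without_x:
            continue
        reference = median(without_x)
        if reference == 0:
            continue
        result[(x, z)] = 100.0 * (median(with_x) - reference) / reference
    return result

# Toy densities, not taken from any dataset analyzed in this chapter.
toy = [("N", "R", 2.0), ("N", "R", 4.0), ("S", "R", 1.0),
       ("A", "R", 2.0), ("G", "R", 3.0)]
changes = percent_change_in_median(toy)
```

Because the A-site amino acid is the same in both distributions, a positive value indicates higher ribosome density (slower translation) when X occupies the P-site, and a negative value indicates faster translation.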

To test whether the potentially confounding factors of tripeptide motifs65, positively

charged upstream residues42,133, downstream mRNA structure52,53, or cognate tRNA

concentration41,105 explain the speed changes in Figure 4.1b, we controlled for each of these

factors separately and found that even in their absence, the sign of the speed change is

preserved in 156 of the 157 pairs (Figure C.2). Thus, while these molecular factors can

contribute to codon translation rates, they do not explain the sign of the speed change we

observed.

Figure 4.1b predicts that by keeping the A-site amino acid fixed and mutating the

P-site amino acid, it is possible to accelerate or retard translation. For example, when

comparing the amino acid pairs (N,R) to (S,R), where R is the amino acid in the A-site, we

find a median ribosome density difference of 53% (Figure 4.1c). Hence, we predict that

the codon encoding R in the (S,R) pair will be translated faster than the codon encoding

R in the (N,R) pair. We predict that there will be a translation rate change for 53% of the

7,980 possible P-site mutations in amino acid pairs where the A-site is fixed (Figure 4.1d).

Because we are dealing with overlapping distributions (Figure 4.1c), we can calculate the

odds (Eq. C.4) that a given mutation will change translation speed in a particular

direction. For example, a mutation from (N,R) to (S,R) will accelerate translation with

3-to-1 odds. We calculated these odds for each of the possible 7,980 mutations and

found a broad distribution (Figure 4.1e). With odds of 5.7-to-1, mutating (W,G) to (P,G)

will retard translation, while mutations with odds of 1-to-1, such as (V,W) to (H,W), are

equally likely to accelerate translation as they are to retard translation when Val is mutated

to His for different instances of (V,W) across the proteome.
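
One reading that reproduces the count of 7,980 is to enumerate ordered P-site substitutions (20 × 19 = 380) for each of 21 A-site states, namely the 20 amino acids plus the stop codon shown as * in Figure 4.1b. This enumeration is an assumption made for illustration; the exact definition is given in the Methods in Appendix C.

```python
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
A_SITE_STATES = list(AMINO_ACIDS) + ["*"]   # "*" stands for any stop codon

# Each mutation keeps the A-site state Z fixed and changes the P-site from X to Y.
mutations = [((x, z), (y, z))
             for z in A_SITE_STATES
             for x, y in permutations(AMINO_ACIDS, 2)]

n_mutations = len(mutations)                          # 21 * 20 * 19
percent_significant = 100.0 * 4254 / n_mutations      # 4,254 significant mutations (Figure 4.1d)
```

Under this reading, the 4,254 significant mutations reported in Figure 4.1d correspond to the stated 53% of all possible mutations.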

To experimentally test these predictions, we introduced 12 non-synonymous

mutations into various positions of five non-essential S. cerevisiae genes that are not

involved in translation, and no mutations were made at functional sites of the encoded

proteins158 (Table C.2). Five of the mutations we predicted will accelerate translation, five

we predicted will retard translation, and two we predicted to have minimal effects on

translation speed when the mutated residue is present in the P-site. To ensure precise

measurements at codon resolution we performed ribosome profiling experiments at

unconventionally high read depths, with an average of 86 million mapped exome reads

Figure 4.2. Experiments demonstrate that the identity of amino acids in the P- and A-sites can

predictably alter the translation speed of the A-site codon, consistent with the predictions from Fig

4.1b. Normalized ribosome density (Eq. C.1) upon mutation at five pairs of residues that are predicted to

retard translation (a), five other pairs that are predicted to accelerate translation (b), and two negative control

mutations that are predicted to have little to no effect on translation speed (c) were measured in S.

cerevisiae. The gene name and pair of amino acids before and after mutation are listed above the panels

in (a) through (c). Full details concerning the mutations are provided in Table C.2. In each panel, the

normalized ribosome density measured at the A-site residue is reported for the wild-type sequence

transcript (blue data points) and mutated sequence (orange data points). Each data point corresponds to

one biological replicate; the horizontal bar indicates the mean value. The difference between the medians

in each panel is statistically significant (Fisher-Pitman permutation test, 𝑝 = 0.036 for all subpanels in (a)

and for mutations in YOL*, YKL* and YLR* in (b). 𝑝 = 0.002 for two mutations in YHR* in (b). 𝑝 = 0.002 and

𝑝 = 0.004 for the two subpanels in (c), respectively). The distribution of percent differences in ribosome

density between the mutant and wild-type sequences for the data in panels (a) and (b) is shown as a blue

box plot in panel (d), and for the negative control sequences in panel (c), the distribution is shown as an

orange box plot in panel (d). The mutations in the negative controls do show a statistically significant

difference in normalized densities compared to wild-type (Fisher-Pitman permutation test, 𝑝 = 0.002 and

𝑝 = 0.004). However, these mutations exhibit a 2.5-fold reduction in effect size (d), consistent with the

predictions from Fig. 4.1b.

per sample, and totaling 1.7 billion mapped exome reads across samples (Table C.3). The

resulting ribosome profiles exhibit strong 3-nt periodicity, with 87% of mapped reads in

frame zero at a fragment size of 28 nt, and a very strong correlation between profiles for

the same gene across samples (Pearson 𝑟 = 0.96, Figures C.3 and C.4), indicating that

technical biases are minimal and that any such biases that exist will cancel out when we carry

out a relative comparison between wild type and mutant results. Comparing the

normalized ribosome densities between the wild-type and mutant strains (Figure 4.2) we

find that the direction of change in ribosome density at the A-site is consistent with the

predictions from Figure 4.1b. Two of these ten mutations include a proline in the P-site,

for which we observe a speedup when the pair (P,G) is mutated to (E,G) in the gene

YMR122W-A (Figure 4.2a), while mutating (Q,D) to (P,D) in YOL109W leads to a

slowdown of translation (Figure 4.2b). These two mutations serve as positive controls

because the presence of proline is well established to retard translation58,60. Two additional

mutations were incorporated as negative controls and are predicted to cause little change

in the rate of translation (i.e., mutations that switch between gray boxes in Figure 4.1b).

We found that while the normalized ribosome densities of these mutants are statistically

different from that of the wild type (Figure 4.2c), the median effect size on translation speed

was 2.5-fold lower than what we observed for the other 10 mutations (Figure 4.2d). These

results are consistent with the hypothesis that the P-site amino acid can predictably and

significantly alter the translation rate of the A-site codon.
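
The wild-type versus mutant comparisons in Figure 4.2 use a Fisher-Pitman permutation test on a small number of biological replicates. A minimal sketch of an exact two-sided version of such a test is given below. It is not the implementation used in this study: the function name is hypothetical, and the test statistic is assumed to be the absolute difference in group means.

```python
from itertools import combinations

def permutation_test_p_value(group_a, group_b):
    """Exact two-sided permutation test on the absolute difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_total = len(group_a), len(pooled)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = n_partitions = 0
    # Enumerate every way of relabeling n_a of the pooled values as group A.
    for idx in combinations(range(n_total), n_a):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / n_a - (total - s) / (n_total - n_a))
        n_partitions += 1
        if diff >= observed - 1e-12:     # tolerance for floating-point rounding
            extreme += 1
    return extreme / n_partitions

# Toy normalized densities for three and five replicates (illustrative values only).
p_value = permutation_test_p_value([1.0, 1.1, 0.9], [2.0, 2.1, 1.9, 2.2, 2.05])
```

With so few replicates the smallest attainable p-value is set by the number of distinct relabelings (56 in this toy case), which is why exact enumeration is preferable to random resampling here.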

These amino acid mutations also change the identity of the tRNA molecule.

Therefore, the change in tRNA identity could also be the cause of the altered translation

speeds. To bioinformatically estimate the relative contributions of amino acid versus tRNA

identity to changes in speed, we projected the normalized ribosome density to the 42

unique tRNA molecules that can reside in the P- and A-sites. Based on a comparison of

these distributions, we can calculate the contributions of tRNA and amino acids to the

change in translation speed (Eq. C.5, and see Methods in Appendix C). We estimate that

the contribution of these two factors is a continuum for different pairs of amino acids

(Figure 4.3a). For some pairs, such as when (E,N) is mutated to (H,N), we estimate that

the change in speed is driven entirely by the amino acid identity, while other pairs, such

as (S,G) mutated to (E,G), are driven entirely by tRNA identity; many pairs are driven by

Figure 4.3. Depending on the amino acid pair, translation speed is influenced by either

the identity of the tRNA pair, the amino acid pair, or both. (a) A bioinformatic analysis that

models all 7,980 possible mutations to the P-site residue for a given starting amino acid pair

(Eq. C.5 and Methods in Appendix C) estimates that speed changes are caused by a

combination of the change in tRNA identity and amino acid identity. Plotted is the probability

density of the estimated contribution of the change in amino acid identity to the speed change.

For some pairs, upon mutation, the speed change is entirely due to the amino acid change (i.e.,

data near 100%), for some, the speed change is entirely due to the tRNA molecule change (i.e.,

data near 0%), and for others, the speed change is due to a combination of the two changes.

(b) For the same amino acid mutation, two mutants are created using synonymous codons

decoded by different tRNAs. (c-e) To experimentally measure the contribution of amino acid

identity, a given non-synonymous mutation in the P-site was encoded using two different

synonymous codons, each resulting in the same amino acid mutation but decoded by different

tRNA molecules (see (b)). For the mutations YOL109W (G,G)→(S,G) (c) and YOL109W

(Q,D)→(P,D) (d), the change in ribosome density from wild-type was similar for both

synonymous mutants (Mutant 1 vs Mutant 2, Fisher-Pitman permutation test, 𝑝 = 0.1857 and

𝑝 = 0.7714, respectively), and hence, the amino acid is the predominant cause for the change

in translation speed. For the mutation YOL109W (N,R)→(S,R) (e), the speedup was seen for

only one mutant while the other mutant exhibits a normalized ribosome density indistinguishable

from that of the wild-type (Wild type vs. Mutant 2, Fisher-Pitman permutation test, 𝑝 = 0.1857),

indicating in this case that the tRNA identity is the predominant cause for the change in speed

upon mutation.

a mixture of these two factors. To experimentally estimate the relative contribution of

amino acid identity, we took the three mutations that we previously incorporated into the

gene YOL109W (Table C.2) and created a new gene construct with the same three amino

acid mutations but used synonymous codons that are decoded by different tRNA

molecules (Table C.4). For the mutation (N,R) to (S,R), for example, we previously used

the codon UCC to mutate N to S. In the new strain, we used the synonymous codon UCG,

which is decoded by a different tRNA molecule 45 (Figure 4.3b). For the mutants (G,G) to

(S,G) and (Q,D) to (P,D), there was a change in normalized ribosome density that was in

the same direction and similar in magnitude regardless of the tRNA molecule used

(Figures 4.3c, d). This indicates that for these mutations, the change in amino acid identity

in the P-site is the primary cause of the change in translation rate in the A-site (Figures

4.3c, d). In contrast, for the mutation (N,R) to (S,R), we observed a change in ribosome

density when one tRNA molecule was used but no change in ribosome density when

another tRNA molecule was used (Figure 4.3e), indicating that the tRNA identity was the

primary cause. Our bioinformatic analysis predicted that for the mutations (G,G) to (S,G),

and (Q,D) to (P,D), the amino acid change is the major contributor to the observed change

in speed. For the mutation (N,R) to (S,R), involving different tRNAs, we were unable to

make a bioinformatic prediction because the sample size was too small (n < 10) to apply

any statistical test. Thus, the bioinformatic predictions and experimental results are

qualitatively consistent, indicating that in some cases it is the amino acid identity that

causes the change in speed and in others it is the tRNA identity.

If evolutionary selection pressures have acted to encode translation rate

information in the primary structures of proteins, then there should be a non-random

distribution of fast and slow-translating pairs of amino acids across the proteome. To test

this hypothesis, we calculated the enrichment and depletion of all 400 pairs of amino acids

across the S. cerevisiae proteome relative to the occurrence expected from a random

pairing. We selected the top 20% of the amino acid pairs that were enriched across the

proteome and the bottom 20% that were depleted and determined how many of the 84

fast-translating and 73 slow-translating amino acid pairs were present in either of these

quintiles. The odds ratio of fast-translating pairs being enriched across the proteome and

slow-translating pairs being depleted was 7.5 (Eq. C.7, 𝑝 = 0.0011, Fisher’s exact test),

indicating that selection pressures have indeed selected for the presence of fast-

translating pairs and selected against slow-translating pairs (Figure 4.4a) across the

proteome.
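
The odds ratio and Fisher's exact test reported above are computed from a 2 × 2 table of fast or slow pairs against enriched or depleted quintiles. The sketch below shows the standard calculation. Eq. C.7 is assumed to be the usual cross-product odds ratio, the function names are hypothetical, and the counts in the usage example are a toy table, not the counts from this study.

```python
from math import comb

def odds_ratio(fast_enriched, fast_depleted, slow_enriched, slow_depleted):
    """Cross-product odds ratio of a 2 x 2 table."""
    return (fast_enriched * slow_depleted) / (fast_depleted * slow_enriched)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Toy table for illustration only.
toy_or = odds_ratio(3, 1, 1, 3)
toy_p = fisher_exact_two_sided(3, 1, 1, 3)
```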

[Figure 4.4 graphic, panels (a) to (c); see the Figure 4.4 caption. (a) x-axis: fold enrichment of amino acid pair relative to random chance (depletion to enrichment); y-axis: % change in median ρ; annotations: slowdown of translation, speedup of translation; series: insignificant pairs, fast pairs, slow pairs. (b) x-axis: window size of linker region; y-axis: percent change in P in L versus D regions; series: slow pairs, fast pairs. (c) Schematic showing domain, exit tunnel, and linker, with window sizes of 10, 20, and 30.]

Despite the broad preference for fast-translating pairs, we found that the slow-translating

amino acid pairs were locally enriched by 15% (95% CI: [6%, 23%], p<0.0001, n=170 domains,

random permutation test) in linker regions relative to domain regions of S. cerevisiae cytosolic

proteins (Figures 4.4b,c). In these linker regions, which start 30 residues downstream of domain

boundaries, 21% of the amino acid pairs, on average, are slow-translating pairs, and in the

extreme cases of genes YDR432W and YGL203C, there are 18 slow pairs in a 30 residue stretch.

Codon usage does not explain this enrichment of slow-translating pairs, as we found no difference

in the frequency of non-optimal codon usage between linker and domain regions (Figure C.5).

These results indicate that a number of slow-translating pairs exist in linker regions that can

cumulatively lead to a slowdown of translation as domains emerge from the ribosome exit tunnel,

which may aid in co-translational folding.
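
The linker-versus-domain comparison can be sketched for a single domain as follows, using the percent change defined in the Figure 4.4 caption. The sketch is hypothetical: the function names and inputs are invented for illustration, adjacent residues (i, i + 1) are taken as the (P-site, A-site) pair, and the analysis in this chapter pools many domains and adds a permutation test and bootstrapped confidence intervals that are not shown.

```python
def pair_fraction(sequence, pair_set):
    """Fraction of adjacent residue pairs (P-site, A-site) that belong to pair_set."""
    pairs = [(sequence[i], sequence[i + 1]) for i in range(len(sequence) - 1)]
    if not pairs:
        return 0.0
    return sum(pair in pair_set for pair in pairs) / len(pairs)

def linker_percent_change(domain_seq, downstream_seq, pair_set,
                          window=30, tunnel_length=30):
    """Percent change in the pair fraction in the linker (L) relative to the domain (D).

    downstream_seq starts at the residue after the domain boundary; the first
    tunnel_length residues are discarded because they are assumed to occupy the
    ribosome exit tunnel while the domain emerges."""
    linker = downstream_seq[tunnel_length:tunnel_length + window]
    f_linker = pair_fraction(linker, pair_set)
    f_domain = pair_fraction(domain_seq, pair_set)
    if f_domain == 0:
        raise ValueError("no matching pairs in the domain region")
    return 100.0 * (f_linker - f_domain) / f_domain

# Toy sequences; "X" is a placeholder residue and (P, G) a stand-in slow pair.
toy_change = linker_percent_change("APGAA", "X" * 30 + "PGPGA", {("P", "G")})
```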

When the Hsp70 chaperone Ssb is bound to ribosome-nascent chain complexes,

translation is faster than when Ssb is not bound, possibly because chaperone binding prevents

nascent chain folding and hence allows translation to become uncoupled from folding and to

proceed faster159. We examined if the fast-translating amino acid pairs we identified contributed

to this speedup. We found that the fast-translating amino acid pairs were enriched by at least 4%

(95% CI: [2.3%, 6.1%], p=0.0001, n=425, random permutation test) in regions translated while

Ssb is bound, suggesting that these pairs do make a contribution (Figure C.6). Taken together,

these results indicate that across the primary structures of proteins, evolutionary pressures have

Figure 4.4. Evolution selects for fast-translating pairs across the proteome but enriches slow-

translating pairs across interdomain linker regions. (a) The enrichment and depletion of amino acid

pairs across the S. cerevisiae proteome is plotted against the percent change in median normalized

ribosome densities (ρ) of amino acid pairs taken from Figure 4.1b. Among the top 20% enriched and top

20% depleted set, the odds ratio of fast-translating pairs being enriched and slow-translating pairs being

depleted is 7.5 (Eq. C.7, Fisher’s exact test, 𝑝 = 0.0011). (b) The enrichment of fast- and slow-translating

pairs in linker (L) regions relative to domain (D) regions. The percent change is calculated as

$\frac{f(X,L) - f(X,D)}{f(X,D)} \times 100\%$, where 𝑓(𝑋, 𝐿) is the fraction of either slow or fast pairs in the linker region, and 𝑓(𝑋, 𝐷) is the fraction

of either slow or fast pairs in the domain regions. A positive percent change would indicate an enrichment

in the linker region, while a negative value would indicate a depletion. As a test of robustness, 𝑓(𝑋, 𝐿) was

computed over different window sizes in the linker region, discarding the first 30 residues after the domain

to account for those residues being in the ribosome exit tunnel, as illustrated in panel (c). 𝑛 = 170 for a

window size of 30 residues. For all window sizes, the percent change was significant (𝑝 < 0.05) for slow-

translating pairs and insignificant (𝑝 > 0.05) for fast-translating pairs. p-values were computed using the

random permutation test, and error bars in (b) represent 95% CI calculated from bootstrapping.

selected for amino acid pairs that exhibit faster translation, and along a transcript, fast- and slow-

translating pairs are enriched locally in regions that are associated with co-translational folding

and chaperone binding.

We have demonstrated that the chemical identity of pairs of amino acids and tRNA

molecules present in the P- and A-sites can predictably and causally accelerate or retard

translation in the A-site of S. cerevisiae ribosomes. Two essential and unique features of our

analyses of ribosome profiling data are Eqs. C.2 and C.5. Eq. C.2 holds a fixed amino acid identity

in the A-site while varying the amino acid in the P-site. This feature keeps the cognate tRNA

concentration and the accommodation time into the A-site constant, thereby allowing us to isolate

the effect of the P-site amino acid on translation rates in the A-site. On the other hand, Eq. C.5

allows us to distinguish between the relative contributions of the change in amino acid and tRNA

identity when a mutation is made in the P-site. Confounding molecular factors, such as tripeptide

sequence motifs and many others, that can also affect translation speed do not explain our results

(Figure C.2). For example, a recent study160 observed that 17 pairs of very rare codons, when

present in the P- and A-sites, inhibit translation due to the interaction of their predominantly

inosine-modified, wobble-decoding tRNAs. To test whether this mechanism explains our slow-

translating pairs of amino acids, we removed all instances of codon pairs decoded by wobble

base pairing and found that 72 out of 73 of our slow-translating pairs remained slow (Figure C.7).

Thus, wobble base pairing does not explain our observations. In total, we identified 157 amino

acid pairs that could change the translation speed and verified these predictions experimentally

for 10 of these pairs. A surprising result is the large number of amino acids, including glycine and

aspartic acid, that can retard translation when present in the P-site. The molecular mechanism by

which these amino acids retard translation is not known, but we hypothesize this may be due to

a much slower step of peptide bond formation, as has been observed biochemically with proline.

Pairs whose effect arises primarily due to tRNA identity could influence a large number of different

steps during the translation-elongation cycle, including hybrid state formation and translocation.

Conversely, there are few reports in the literature on molecular factors whose presence

accelerates translation; however, we have identified 84 putative pairs that have this effect and

experimentally verified five of these pairs (Figures 4.1b and 4.2b). Determination of the molecular

cause of this speedup would be a fruitful area of future research. Evolutionary selection pressures

act only on phenotypic traits, not on genotype. Therefore, the enrichment of fast-translating

amino acid pairs across the S. cerevisiae transcriptome and the clusters of slow- and fast-

translating pairs along transcripts that are correlated with co-translational processes suggest that

the elongation kinetics encoded by these pairs influence organismal phenotype and fitness. These

results suggest the surprising possibility that there exist disease-causing amino acid mutations

that do not alter the final folded structures of proteins but instead alter the co-translational behavior

and processing of the nascent proteins via altered elongation kinetics. In summary, elongation

kinetics are causally and predictably encoded in protein primary structures through pairs of amino

acids, with broad implications for protein sequence evolution, translational control of gene

expression, and disease.

Chapter 5

EVOLUTIONARILY-ENCODED TRANSLATION KINETICS COORDINATE CO-

TRANSLATIONAL SSB CHAPERONE BINDING IN YEAST

The research presented in this chapter was first published as part of a large-scale

study in Cell titled “Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-

Assisted Folding” by Kristina Döring, Nabeel Ahmed, Trine Riemer, Harsha Garadi

Suresh, Yevhen Vainshtein, Markus Habich, Jan Riemer, Matthias P. Mayer, Edward P.

O’Brien, Günter Kramer and Bernd Bukau.

Only the research contributed by Nabeel Ahmed is discussed in this chapter. It

was also presented at the 62nd Annual Meeting of the Biophysical Society, and the abstract

was published in Biophysical Journal under the title “Evolutionarily-Encoded Translation Kinetics

Coordinate Co-Translational SSB Chaperone Binding in Yeast” by Nabeel Ahmed, Kristina

Döring, Günter Kramer, Bernd Bukau and Edward P. O’Brien.

Portions of text of this chapter are being reproduced from the above publications

with permission from CellPress under the Journal publishing agreement that allows

authors to use the publication for inclusion in a thesis or dissertation.

5.1 Abstract

Chaperones can bind to the ribosomes and nascent polypeptides co-translationally to

assist protein folding. It was not known previously whether there is any interplay between

chaperone binding and translation kinetics. To study this effect, we utilize the high-

throughput transcriptome-wide quantitative data for translation kinetics and chaperone

binding from ribosome profiling and selective ribosome profiling methods respectively. In

vivo selective ribosome profiling has shown that yeast Hsp70 chaperone Ssb associates

broadly with a major fraction of nascent proteins. Using the ribosome footprint densities

as a measure of local translation rate, we find that mRNA segments within the ribosome

are translated faster during periods of Ssb binding to the nascent polypeptides compared

to when Ssb is not bound. The acceleration of translation is maintained even when Ssb is

knocked out thus implying an inherent encoding of faster translation within the mRNA

sequence. Testing for mRNA features that can slow translation, we find that mRNA

segments translated by Ssb-engaged ribosomes are enriched for fast-translated codons

(having higher cognate tRNA concentrations), are depleted for slowly translated codons,

and contain fewer proline codons. In addition, mRNA segments located 1-15 nucleotides

downstream of Ssb-bound ribosomes have reduced mRNA secondary structure. Finally,

nascent chain segments located in the ribosome tunnel of Ssb-bound ribosomes have

average numbers of positively charged residues but are enriched in negatively charged

residues. Taken together, these evolutionarily-encoded mRNA and nascent chain features

cause faster translation during Ssb binding. This finding is significant as any alteration of

translation kinetics due to synonymous mutations can potentially disrupt efficient binding

of Ssb leading to possible misfolding of the protein without any change in its sequence.

5.2 Introduction

Proteins attain their correctly folded conformation and localization through a complex set of

processes that are prone to errors, sensitive to perturbations, and require

coordination between mechanisms161,162. Networks of chaperones have evolved in

eukaryotic cells to engage the nascent polypeptide chain appearing outside the ribosome

exit tunnel163. This engagement is important as the nascent chain must be assisted to

avoid any non-native interactions that can result in the protein being misfolded73.

Chaperones also assist the nascent chain to attain its functional form72,164 or to coordinate

with targeting factors for efficient translocation to other organelles165. Chaperone

engagement occurs as the mRNA transcript is being translated sequentially within the

ribosome and the synthesized polypeptide is exiting the ribosome. Since these two

processes occur on similar time scales, it is likely that there is some

coordination between them to achieve successful binding of chaperones along with

efficient translation elongation. Evidence of coordination between a co-translational

process and translation kinetics is seen for co-translational folding, where a

slowdown of translation in interdomain linker regions can facilitate co-translational

folding of the domain that has exited the ribosome81,166.

Ssb is a Hsp70 chaperone that has been found to bind ribosomes to facilitate early

folding167. Ssb acts in coordination with a network of chaperones including the nascent

polypeptide-associated complex (NAC) and a pair of proteins, Ssz1 and the J-protein Zuo1, that

constitute the ribosome-associated complex (RAC)168. Recent studies have characterized

the binding patterns of Ssb to show that it prefers to engage with nascent chains with

domains enriched in aggregation-prone, hydrophobic and intrinsically disordered

regions169. In S. cerevisiae, two isoforms of Ssb are present: Ssb1 and Ssb2 that differ by

only four amino acids. Deletion of both Ssb isoforms results in defects in ribosome

biogenesis, cold sensitivity, slow growth, and reduced efficiency of protein

folding164. Hence Ssb is an important chaperone required for generation of a functional

proteome.

With the advent of next-generation sequencing technology and the development of

methods like Ribo-Seq, it has become possible to capture translation at the transcriptome level,

which is referred to as the ‘translatome’. Characterizing the translatome allows us to

determine properties of translation kinetics and understand the molecular origin of variable

rates of translation elongation. Combining the power of Ribo-Seq with biochemical

approaches of crosslinking and immunoprecipitation, a subset of ribosomes can be

isolated that associates with a specific factor of interest. Selective Ribo-Seq is a variant of

Ribo-Seq that implements this approach and the data obtained can be modeled to extract

the profile of binding of a factor to the ribosome-nascent chain complex. In this study, we

use selective Ribo-Seq for Ssb-bound ribosomes to test whether there is any coordination

of binding of Ssb with translation elongation kinetics measured by ribosome profiling.

5.3 Results

5.3.1 Selective Profiling of Ssb-Bound Ribosomes

Selective Ribo-Seq is a variant of Ribo-Seq where a subset of ribosomes is isolated that

are associated with a specific factor/chaperone at the time of translation arrest34,89. This is

achieved by cross-linking the specific factor of interest with the ribosome-nascent chain

(RNC) complex. The RNCs cross-linked with the factor are isolated from other

monosomes by immunoprecipitation. The ribosome footprints obtained from this subset of

ribosomes are those undergoing translation when the factor of interest is bound to the

RNC. At the transcript level, the number of reads mapping to the transcript can lead to

identification of protein substrates that are bound by the factor during their synthesis. The

reads mapped to individual codon positions can be modeled to extract a profile of binding

state of the factor across different regions of the mRNA transcript. However, the number

of reads obtained is a function of the translation rate as well as of the binding of the

chaperone. In a profile of raw reads, a fast-translating chaperone-bound region may go

undetected next to a slow-translating chaperone-bound region, which has a higher

ribosome density owing to its translation kinetics. To accurately determine the chaperone

binding profile, the number of reads at each codon in the Sel-Ribo-Seq dataset is

normalized by the number of reads at that codon position from Ribo-Seq dataset. This

ratio is termed fold enrichment (FE) and indicates whether the codon is being translated

during periods of chaperone binding outside the ribosome’s exit tunnel. The theoretical

basis of the FE measure is demonstrated by a derivation of equations based on biophysical

principles (Appendix D).
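
The normalization described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `fold_enrichment` and the example read counts are hypothetical, both profiles are assumed to be already scaled for sequencing depth, and positions with no Ribo-Seq reads are simply marked as undefined.

```python
from typing import List, Optional


def fold_enrichment(sel_reads: List[float],
                    total_reads: List[float]) -> List[Optional[float]]:
    """Per-codon FE = selective Ribo-Seq reads / Ribo-Seq reads.

    Assumption: both profiles cover the same codon positions and are
    already normalized for sequencing depth. Positions without any
    Ribo-Seq reads are returned as None because the ratio is undefined.
    """
    if len(sel_reads) != len(total_reads):
        raise ValueError("profiles must cover the same codon positions")
    fe: List[Optional[float]] = []
    for sel, tot in zip(sel_reads, total_reads):
        fe.append(sel / tot if tot > 0 else None)
    return fe


# Hypothetical read counts at four codon positions of one transcript.
sel = [4, 12, 30, 6]
tot = [8, 6, 10, 0]
print(fold_enrichment(sel, tot))  # [0.5, 2.0, 3.0, None]
```

Dividing by the Ribo-Seq signal removes the contribution of local translation speed, so a slowly translated but unbound codon does not appear enriched merely because ribosomes dwell there longer.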

Selective Ribo-Seq is applied to S. cerevisiae with Ssb as the chaperone of

interest. The set of ribosome-protected mRNA footprints obtained from Ribo-Seq is termed

the translatome, while those from selective Ribo-Seq are termed the Ssb-bound translatome.

It should be noted that the Ssb-bound translatome represents the mRNA fragments x that are

being translated at the ribosome’s peptidyl transferase center (PTC) while Ssb is

physically interacting with the part of nascent chain that is n amino acids upstream of x

and has already been translated and has emerged from the exit tunnel (Figure 5.1).

With the availability of a translatome-wide Ssb-binding profile, molecular principles

of Ssb binding were ascertained by co-authors of the publication159 that includes the study

discussed in this chapter. A few important inferences are: i) ~72% of detected proteins are

identified as Ssb substrates, including proteins from all major compartments, ii) Ssb

preferentially binds positively charged sequences closer to the ribosomal surface and iii)

activity of Ssb was found to be dependent on RAC but not NAC.

5.3.2 Coordination of Ssb Binding with Translation Elongation Rates

Translatome and Ssb-bound translatome represent the translation kinetics profile and the

Ssb binding profile respectively. The analysis for correlating translation rates with Ssb

binding profile is restricted to high coverage transcripts with at least one read per codon

Figure 5.1. Schematic representing the ribosome footprint x obtained from selective Ribo-

Seq when Ssb is bound to the region of nascent chain n amino acids upstream of x.

for statistical robustness (Figure 5.3A). In a meta-analysis in which all Ssb-bound translated

regions are normalized to a constant-length window and aligned, the ribosome occupancy

dips as the aligned profile shifts from the Ssb-unbound to the Ssb-bound region and

increases again once the Ssb-bound region has been passed (Figure 5.2A). The dip in ribosome

occupancy represents the speedup of translation happening during the period of Ssb

binding. For a more comprehensive analysis, thresholds are used to define Ssb-bound

and unbound segments (See Methods for details). These thresholds are based on the

percentiles from the Cumulative Distribution Function of the FE measure (Figures 5.3B, C

and D). The Ssb-bound and unbound mRNA segments are compared in independent

Ribo-Seq datasets analyzing translation in wild-type cells. mRNA segments translated by

Ssb-bound ribosomes are generally translated faster than those translated by Ssb-unbound ribosomes

(Figure 5.2B). The difference in observed local translation speed in wild-type (WT) cells

varies from 10% to 38% (Figure 5.2B) depending on the stringency of the selection criteria used

to define bound and unbound segments (Figure 5.3D).

The acceleration of translation during periods of Ssb binding can be a result of two

non-exclusive mechanisms. The first mechanism is that binding of Ssb triggers a speed-up

of translation. The second is that intrinsic features of the mRNA or the nascent chain

accelerate translation. The hypothesis of the first mechanism is tested by analyzing

relative elongation speed in translatomes of ssb1Δssb2Δ cells. It is observed that

accelerated translation is maintained even in the absence of Ssb (Figure 5.2B) although

to a slightly reduced extent (Figure 5.3E). This effect is uniformly observed for all

thresholds used to identify mRNA segments translated by Ssb-bound or unbound

ribosomes and the Ssb contribution to the translation speed-up is limited (up to 15%;

Figure 5.2C). In Chapter 3, I have demonstrated that translation rate of a codon can

depend on its cognate tRNA concentration, presence of proline residues in the P-site as

well as presence of downstream secondary structure. Since the accelerated translation is

encoded by intrinsic features of mRNA and nascent chain, it can be hypothesized that

these factors may play a role in facilitating faster translation. Indeed, it is found that Ssb-

bound translated segments are enriched in fast-translated codons and depleted in slow-

translated codons (Figure 5.2D). Features that slow translation like mRNA secondary

structure are found to be depleted 1-15 nt downstream of Ssb-bound translated segments

(Figure 5.2E). Similarly, proline residues are depleted thus avoiding slowdown of

translation (Figure 5.3F). Nascent chain features like positively charged residues can

interact with negatively charged tunnel and pause translation. It is observed that Ssb-

Figure 5.2. Altered Translation Kinetics of Ssb-Bound Ribosomes (A) Average ribosome

densities in translatomes and Ssb-bound translatomes related to Ssb binding. 95% CI is

shaded. (B) Change in translation speed for Ssb-bound and unbound ribosomes in WT and

ssb1Δssb2Δ translatomes (Wilcoxon rank-sum test, p < 0.0001 for all thresholds). Error bars

show 95% CI. (C) Contribution of Ssb binding and mRNA features to faster translation. Error

bars show 95% CI. (D) Enrichment of fast codons and depletion of slow codons in bound

versus unbound segments for indicated thresholds (Wilcoxon signed-rank test, p < 0.05 for

[P95, P5], p < 0.0001 for other thresholds). Error bars show 95% CI. (E) Change in DMS

reactivity of bound versus unbound segments reflecting the probability of secondary structure

formation with the indicated offsets from each nucleotide. Differences are significant up to an

offset of 15 nt (paired t test, p < 0.0001).

Figure 5.3. Identifying Ssb-Bound mRNA Segments and the Molecular Origins of Translation

Acceleration (A) Overlap of high coverage genes with Ssb bound mRNA segments in Ssb1 WT

translatome in two biological replicates. (B) Probability distribution of Fold Enrichment (FE) values

across the high coverage gene set. (C) Cumulative Distribution Function (CDF) of FE values.

Percentiles were used to define the stringency thresholds to identify Ssb bound and unbound

periods. P5, P50 and P95 are shown in red. (D) An example of a Ssb binding profile for the gene

CYS3 using different thresholds for the bound and unbound segment definition. The first 120 nt

and last 60 nt are excluded from the analysis as well as nucleotide positions exhibiting fold

enrichment values between the thresholds (white). Ssb bound (green) and unbound (red) segments

are either defined using the P50/P50 threshold (upper panel) or the P95/P5 threshold resulting in fewer

nucleotide positions (i.e., more white space) that are used for the statistical analysis (lower panel).

(E) The differences in speed-up between WT and ssb1Δssb2Δ translatomes. (Wilcoxon signed-

rank test, p < 0.0001 for all thresholds). Error bars show 95% CI. (F) Depletion of proline residues

in Ssb bound compared to unbound segments for the indicated thresholds. (Wilcoxon signed-rank

test, p < 0.0001 for all thresholds). Error bars show 95% CI. (G) Percent change of probability of

finding positively and negatively charged residues in upstream regions of Ssb bound compared to

upstream regions of unbound segments for the indicated thresholds. (Wilcoxon signed-rank test, p

< 0.0001 for negatively charged residues and p > 0.05 for positively charged residues for all

thresholds). Error bars show 95% CI. (H) Change in translation speed for Ssb bound and unbound

ribosomes in publicly available RP datasets. GEO: GSE63789 (Pop), GSE69414 (Young),

GSE61011 (Williams), GSE61012 (Jan), GSE52968 (Guydosh), GSE75322 (Nissley), GSE67387

(Nedialkova), GSE51164 (Gardin), GSE53268 (Weinberg). Error bars show 95% CI. The samples

from these datasets were chosen based on the criteria that they do not use CHX for pretreatment

and hence capture in vivo translation dynamics reliably.

bound translated segments contain average numbers of positively charged residues but

are enriched in negatively charged residues (Figure 5.3G). The distribution of these

features indicates that evolution encodes faster translation within the mRNA transcript

and nascent chain such that it is coordinated with binding of Ssb.

5.4 Discussion

The results presented in this study demonstrate for the first time that translation kinetics

are evolutionarily encoded to coincide with the binding of Hsp70 chaperone Ssb. The

analysis for testing this coordination has been possible due to development of methods

like Ribo-Seq and selective Ribo-Seq that can capture the translatome as well as the

factor-bound translatome. In this study, selective Ribo-Seq was used to characterize the

binding profile of Hsp70 chaperone Ssb and correlate the binding profile with translation

elongation kinetics to demonstrate that Ssb binding coincides with faster translation of

mRNA within the ribosome. As a test of robustness, this analysis was carried out in nine

other published Ribo-Seq datasets from S. cerevisiae that did not use cycloheximide

(CHX) as a pretreatment (Figure 5.3H). It was found that seven out of nine of the datasets

exhibit this speedup in Ssb-bound regions (the origin of the inconsistency with two of

the datasets is unknown). The evolutionary encoding of kinetics occurs through the distribution

of molecular factors that influence translation rates across the transcript such that they

create periods of faster translation during periods of Ssb binding. These factors were

correlated with translation kinetics and discussed extensively in Chapters 3 and 4. The

results of this study indicate that these molecular factors are playing an important role in

a co-translational process relevant for creating a functional proteome.

During stress conditions, it has been shown that the interaction of Hsp70

chaperones with RNCs is altered and it subsequently results in pausing of translation

elongation170,171. Deletion of Hsp70 chaperones also resulted in elongation pause even in

absence of heat stress indicating that inhibition of Hsp70 activity is a mechanism of stress

response to induce a global pause of translation170. However, the results demonstrated in

this study are distinct from these findings since the faster translation is encoded within the

features of mRNA and nascent chain rather than an effect of chaperone binding. This is

demonstrated by conservation of faster translation upon deletion of Ssb isoforms and no

induced ribosome stalling. However, what is the biological reason for evolving faster

translation kinetics upon chaperone binding? Engagement of chaperones like Ssb can

help the nascent chain avoid misfolded intermediates. Hence, these are periods of protein

synthesis that can proceed at a faster pace without regard for the co-translational folding

of the nascent chain. We speculate that the acceleration of translation serves to increase the

efficiency of protein production during periods of Ssb binding, when the chaperone Ssb

prevents the nascent chain from acquiring non-canonical conformations. Evolution has

selected for the faster translation to coincide with Ssb binding and potentially optimize the

efficiency of protein production in the cell.

5.5 Methods

5.5.1 Translation kinetics analysis

High coverage genes were obtained from the list of substrates that have greater than zero

reads at every codon position along the transcript. The first 40 codons as well as the last 20

codons were excluded from the subsequent analyses of these transcripts since these

regions can be influenced by initiation and termination, respectively. Ssb bound and

unbound segments were initially defined using the peak detection algorithm in which a

region is defined as Ssb bound if its Fold Enrichment (FE) value is greater than 1.5 over

a stretch of at least 15 nt. To study the effect of translation rate at the extremities of Ssb-

binding probabilities, varying stringency thresholds were set to define the Ssb bound and

unbound segments. These thresholds are defined by the percentiles from the Cumulative

Distribution Function of FE values (Figures 5.3B and C). Setting an initial threshold of P50,

every nucleotide position with an FE value higher than P50 was classified as Ssb bound

and every nucleotide position with an FE value lower than the P50 threshold was classified

as Ssb unbound (Figure 5.3D). For all other pairs of thresholds, e.g., (P95, P5), all positions

with FE values higher than the upper threshold (e.g., P95) were classified as Ssb bound

while all values below the lower threshold (e.g., P5) were classified as Ssb unbound. The

other positions with FE values between the thresholds were excluded from the analysis.

Ssb bound and unbound segments were defined in the Ssb1-GFP strain background.

These regions were then used to perform the relative translation speed analysis in

independent translatomes (WT and ssb1Δssb2Δ).
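
The two segment definitions in this section can be illustrated with a short Python sketch. It is written under stated assumptions and is not the code used in the study: the function names are hypothetical, the percentile is computed by linear interpolation (the interpolation rule is not specified in the text), and the FE profile is assumed to be given per nucleotide with no undefined positions.

```python
from typing import List, Optional, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile q (0-100) by linear interpolation between sorted values."""
    data = sorted(values)
    if not data:
        raise ValueError("no values")
    rank = (len(data) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)


def classify_positions(fe: Sequence[float], all_fe: Sequence[float],
                       upper_q: float, lower_q: float) -> List[Optional[str]]:
    """Label each position 'bound', 'unbound', or None (excluded).

    Thresholds come from the distribution of FE values over the whole
    gene set (all_fe); e.g. upper_q=95, lower_q=5 for the (P95, P5) pair.
    """
    upper = percentile(all_fe, upper_q)
    lower = percentile(all_fe, lower_q)
    labels: List[Optional[str]] = []
    for value in fe:
        if value > upper:
            labels.append("bound")
        elif value < lower:
            labels.append("unbound")
        else:
            labels.append(None)  # between the thresholds: not analyzed
    return labels


def bound_peaks(fe: Sequence[float], threshold: float = 1.5,
                min_len: int = 15) -> List[Tuple[int, int]]:
    """Half-open (start, end) runs with FE > threshold for >= min_len nt."""
    peaks: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, value in enumerate(fe):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                peaks.append((start, i))
            start = None
    if start is not None and len(fe) - start >= min_len:
        peaks.append((start, len(fe)))
    return peaks
```

With the (P50, P50) pair every position except those exactly at the median is used, whereas (P95, P5) keeps only the extremes of the FE distribution and therefore far fewer positions.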

5.5.2 Speed-up of translation

The translation rate was calculated as the inverse of the average number of ribosome

reads per nucleotide, and the translation rate was computed for the Ssb bound and unbound segments.

To control for expression level differences across the genes, the percent

change in translation rate was calculated for each gene separately using the equation

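$\%\,\mathrm{change} = \frac{\langle R_{B} \rangle^{-1} - \langle R_{UB} \rangle^{-1}}{\langle R_{UB} \rangle^{-1}} \times 100\%$, where $\langle R_{B} \rangle$ and $\langle R_{UB} \rangle$ are the average number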

of reads per codon in the Ssb bound (B) and unbound (UB) segments. The statistical

significance of the speed-up across the gene dataset was calculated using the Wilcoxon

rank-sum test. Error bars in the associated plots are 95% CI about the median calculated

using the Bootstrapping method (Figure 5.2B).
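
As a worked illustration of the percent change formula and of a bootstrap confidence interval about the median, consider the following Python sketch. The function names, the number of bootstrap resamples and the fixed random seed are assumptions made for the example; the text does not specify these details.

```python
import random
import statistics
from typing import Sequence, Tuple


def percent_change_in_rate(bound_reads: Sequence[float],
                           unbound_reads: Sequence[float]) -> float:
    """Percent change in translation rate, bound relative to unbound.

    The rate is taken as the inverse of the mean read density, so fewer
    reads in the bound segments give a positive (faster) percent change.
    High coverage genes have reads at every position, so means are > 0.
    """
    rate_b = 1.0 / statistics.fmean(bound_reads)
    rate_ub = 1.0 / statistics.fmean(unbound_reads)
    return (rate_b - rate_ub) / rate_ub * 100.0


def bootstrap_median_ci(values: Sequence[float], n_boot: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI about the median of per-gene values."""
    rng = random.Random(seed)
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical gene: half the read density in bound segments means the
# bound segments are translated twice as fast, i.e. a 100% change.
print(percent_change_in_rate([2, 2, 2], [4, 4, 4]))  # 100.0
```

Computing the percent change within each gene before pooling means that highly expressed genes, which have more reads everywhere, do not dominate the comparison.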

5.5.3 Contribution of mRNA versus Ssb binding

For every gene in our dataset, we use the percent change calculation described above

to estimate the contribution of mRNA and Ssb binding to the translation speed-up using

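the equations $\%\,\mathrm{contribution\ of\ mRNA} = \frac{\%\,\mathrm{change}_{\Delta Ssb}}{\%\,\mathrm{change}_{WT}} \times 100\%$ and

$\%\,\mathrm{contribution\ of\ Ssb\ binding} = 100 - \%\,\mathrm{contribution\ of\ mRNA}$. $\%\,\mathrm{change}_{\Delta Ssb}$ and

$\%\,\mathrm{change}_{WT}$ correspond to the $\%\,\mathrm{change}$ in the ssb1Δssb2Δ and WT cells, respectively.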

The error bars in the associated plots are 95% CI about the median calculated using the

Bootstrapping method (Figure 5.2C).
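
The decomposition above amounts to simple arithmetic on the two per-gene percent changes. A minimal Python sketch follows; the function name and the example values are hypothetical, and a gene with zero percent change in WT cells is rejected here because the ratio would be undefined.

```python
from typing import Tuple


def contributions(pct_change_wt: float,
                  pct_change_delta_ssb: float) -> Tuple[float, float]:
    """Return (% contribution of mRNA, % contribution of Ssb binding).

    The speed-up that persists in ssb1Δssb2Δ cells is attributed to
    features of the mRNA and nascent chain; the remainder is attributed
    to Ssb binding itself.
    """
    if pct_change_wt == 0:
        raise ValueError("percent change in WT must be non-zero")
    mrna = pct_change_delta_ssb / pct_change_wt * 100.0
    ssb = 100.0 - mrna
    return mrna, ssb


# Hypothetical gene: 40% speed-up in WT, 30% speed-up without Ssb.
print(contributions(40.0, 30.0))  # (75.0, 25.0)
```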

5.5.4 Enrichment/Depletion of Fast/Slow codons

The 61 sense codons were classified as being either Fast or Slow translating based on

the local tAI values reported in Tuller et al.172. The 31 codons with the highest tAI values

were classified as ‘Fast’ and the remaining 30 codons as ‘Slow’. The probability of finding

Fast and Slow codons in the B and UB segments was then calculated, and the percent

change in these values between these segments was computed. The statistical significance of

this difference was computed using the paired Permutation test120. 95% CI for the percent

change in probability were calculated using Bootstrapping. The enrichment/depletion of

proline residues was determined in the same manner.
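
The enrichment calculation and its significance test can be sketched as follows. This is an illustration under assumptions rather than the analysis code: the paired permutation test is implemented here as a sign-flip test on the per-gene differences, which is one common form of a paired permutation test, and the codon set used in the example is a placeholder, not the tAI-based classification.

```python
import random
from typing import Sequence, Set


def fraction_in_set(codons: Sequence[str], codon_set: Set[str]) -> float:
    """Probability of finding a codon from codon_set among the given codons."""
    return sum(c in codon_set for c in codons) / len(codons)


def percent_change(p_bound: float, p_unbound: float) -> float:
    """Percent change of a probability in bound relative to unbound segments."""
    return (p_bound - p_unbound) / p_unbound * 100.0


def paired_permutation_pvalue(diffs: Sequence[float], n_perm: int = 10000,
                              seed: int = 0) -> float:
    """Two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the bound/unbound labels are exchangeable
    within a gene, so each per-gene difference is equally likely to have
    either sign.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Placeholder 'Fast' set for illustration only.
fast = {"GCT", "GGT"}
p_b = fraction_in_set(["GCT", "GGT", "CGA", "GCT"], fast)   # 0.75
p_ub = fraction_in_set(["GCT", "CGA", "CGA", "GGT"], fast)  # 0.5
print(percent_change(p_b, p_ub))  # 50.0
```

The same functions apply to the proline analysis by passing the proline codons as `codon_set`.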

5.5.5 Upstream charged residues

To test for enrichment/depletion of charged residues in the exit tunnel, we defined a 30-

residue window upstream of the Ssb bound and unbound segments along with the region

itself. The probabilities of finding a positively charged residue (K, R, H) and a negatively

charged residue (D, E) were compared between the defined upstream regions of Ssb

bound and unbound segments. We find the results do not change even if overlapping

upstream positions of the Ssb bound and unbound segments are excluded from this

analysis.

5.5.6 Downstream mRNA secondary structure

In vivo mRNA secondary structure information for all yeast genes was taken from Rouskin

et al.138. ‘A’ and ‘C’ bases react with DMS if they are not base-paired into the mRNA’s

secondary structure. Hence, DMS reactivity is inversely proportional to the probability of

the nucleotide position forming secondary structure. DMS reactivities of ‘A’ and ‘C’

nucleotides within the Ssb bound and the unbound segments were compared as a function

of nucleotide offset downstream of each nucleotide position. The significance of the

change in DMS reactivity was assessed using the paired t test.
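
One way to read the offset-based comparison is sketched below in Python. The interpretation is an assumption: for every nucleotide position in a segment, the reactivity is looked up at a fixed offset downstream, only 'A' and 'C' bases are kept, and the values are averaged. The function name and the example data are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def mean_reactivity_at_offset(seq: str, reactivity: Sequence[float],
                              positions: Sequence[int],
                              offset: int) -> Optional[float]:
    """Mean DMS reactivity `offset` nt downstream of the given positions.

    Only 'A' and 'C' bases are informative for DMS, so other bases and
    positions past the end of the transcript are skipped. Returns None
    when no informative base is found.
    """
    values = []
    for pos in positions:
        j = pos + offset
        if 0 <= j < len(seq) and seq[j] in "AC":
            values.append(reactivity[j])
    return statistics.fmean(values) if values else None


# Hypothetical transcript fragment with per-nucleotide DMS reactivities.
seq = "ACGUAC"
dms = [0.1, 0.9, 0.0, 0.0, 0.5, 0.3]
print(mean_reactivity_at_offset(seq, dms, positions=[0, 1], offset=3))  # 0.5
```

Evaluating this quantity for the bound and unbound positions of each gene at each offset gives the paired values that enter the t test; a higher mean reactivity downstream of bound positions corresponds to less secondary structure ahead of the ribosome.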

Chapter 6

CONCLUSIONS AND FUTURE DIRECTIONS

6.1 Conclusions

The two goals of the research studies described in this dissertation are i) to demonstrate

that methods based on chemical kinetics and mathematical optimization can increase the

accuracy of modeling Ribo-Seq data to study the properties of translation elongation and

ii) to gain novel biological insights by uncovering previously unknown molecular factors

and determining how translation kinetics coordinates with co-translational processes. I

achieved both goals in this dissertation.

Ribo-Seq generates a snapshot of translation by isolating and sequencing

ribosome-protected mRNA fragments. However, the nuclease digestion step of Ribo-Seq

is imperfect, giving rise to a wide distribution of fragment sizes, and it is non-trivial to identify

which codon within a variable-sized fragment was being translated. The method

presented in Chapter 2 improves the identification of the A-site within ribosome-protected

fragments, thus overcoming an important challenge for an accurate analysis of Ribo-Seq

data. This method is based on an Integer Programming optimization of reads

between the second and stop codons of a transcript: the A-site offset from the 5′ end that

maximizes the reads in this region is identified. The Integer Programming method

outperforms 11 other methods by assigning more ribosome density signal at the A-site

stalling site of polyproline motifs that has been determined through orthogonal biochemical

experiments61,62,104. The offset tables generated for S. cerevisiae and mouse embryonic

stem cells are easy to apply for any Ribo-Seq dataset in these organisms and the method

itself has been made readily available online for researchers to generate A-site offset

tables in their organism of study.
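
The core idea of choosing the offset that maximizes reads between the second and stop codons can be conveyed with a deliberately simplified Python sketch. It is not the Integer Programming formulation of Chapter 2: it handles a single transcript and a single fragment-size class, searches a small list of candidate offsets by brute force, treats the second-codon-to-stop-codon range as inclusive, and breaks ties by taking the first candidate. All names and numbers are hypothetical.

```python
from typing import Sequence


def best_a_site_offset(five_prime_ends: Sequence[int], cds_start: int,
                       stop_start: int,
                       candidate_offsets: Sequence[int]) -> int:
    """Offset (nt from the 5' end) that places most A-sites inside the CDS.

    cds_start is the 0-based position of the first nucleotide of the
    start codon and stop_start that of the stop codon. An A-site is
    counted if it falls between the first nucleotide of the second codon
    and the last nucleotide of the stop codon.
    """
    def reads_in_region(offset: int) -> int:
        return sum(cds_start + 3 <= end + offset <= stop_start + 2
                   for end in five_prime_ends)

    return max(candidate_offsets, key=reads_in_region)


# Hypothetical 5' end positions of four fragments on one transcript.
ends = [88, 90, 200, 385]
print(best_a_site_offset(ends, cds_start=100, stop_start=400,
                         candidate_offsets=[12, 15, 18]))  # 15
```

In this toy example an offset of 12 nt leaves two fragments upstream of the second codon and an offset of 18 nt pushes one fragment past the stop codon, so 15 nt is selected.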

The method I presented in Chapter 3 to measure codon translation rates has key

advantages: (1) it is based on chemical kinetic theory, (2) it simultaneously utilizes all the

reads along the CDS, (3) it does not assume a sense codon is translated at the same

speed in all of its different sequence contexts, meaning the true variability of translation

rates is captured, and (4) it measures an absolute translation rate as compared to a

relative difference in rates. Applying this method to high-coverage S. cerevisiae Ribo-Seq

data reveals a 26-fold variability in translation rates. To explain this

variability, I correlated the rates with molecular factors, which showed that cognate tRNA concentration,

the presence of proline in the P-site and mRNA secondary structure 4 codons downstream

of the A-site can influence the translation rate of the A-site codons. These factors have been

known previously to affect translation elongation but our method enabled measurement

of a stronger correlation and at codon resolution. For example, downstream mRNA

secondary structure has been estimated roughly through folding energy calculated based

on thermodynamic principles. However, in this analysis, I used DMS and PARS data that

provide high-throughput in vivo and in vitro profiles of mRNA secondary structure. The

correlation of translation slowdown with mRNA secondary structure is highest at 4 codons

downstream of the A-site, which places it right at the mouth of the ribosome-mRNA channel

where the ribosome is likely to unwind the mRNA structure. A novel insight from the

application of the method presented in Chapter 3 is that the wobble-decoding

mechanism does not systematically slow down translation in S. cerevisiae. This effect

had previously been shown only in metazoans49 and, in S. cerevisiae, only for the rare codon CGA,

where strong inhibition of translation was attributed to the wobble decoding mechanism46.

In Chapter 4, I demonstrated a novel molecular factor that predictably and

causally influences translation rates. The chemical identity of the amino acid pairs in the

P- and A-sites can influence the rate at which the A-site codon will be translated.

Bioinformatic analyses predict that mutating the P-site amino acid can result in a

significant alteration of translation rate for ~54% of amino acid pairs and 10 of these

predictions are experimentally validated. I also demonstrated that evolution has selected for

this encoding of translation rate information within the primary structure of the protein.

Fast-translating pairs are enriched 8 times more than slow-translating pairs across the

proteome. The slowdown of translation in interdomain linkers41 can be attributed to the

enrichment of slow-translating pairs rather than to non-optimal codons, an explanation that has been

hypothesized but not established conclusively47. This nascent-chain-encoded

feature and the above analyses provide evidence that the protein sequence is optimized

such that slow-translating pairs are locally enriched to assist co-translational folding, while

in the absence of any functional need, fast-translating pairs are enriched to potentially

increase the efficiency of protein production.

Chapter 5 provides the first direct evidence of coordination between translation

kinetics and the binding of a chaperone, which in this case was the Hsp70 chaperone

Ssb. Analysis of Selective Ribo-Seq and Ribo-Seq data provided transcriptome-wide Ssb

binding profiles and translation rate profiles, respectively. Correlating these two

measures demonstrates that faster translation within the ribosome’s PTC is coordinated with

the binding of Ssb to the nascent chain outside the ribosome, and that this effect is proportional to

the probability of Ssb binding. I also show that the faster translation is encoded within the

mRNA transcript: molecular factors that can influence translation are enriched or

depleted in Ssb-bound translated mRNA segments. This demonstrates yet again that

evolution has optimized the mRNA codon choice and nascent chain features encoding

translation rates to coordinate with co-translational processes and generate a functional

proteome.

6.2 Future Directions

6.2.1 Synonymous mutations and diseases

In Chapter 1, I highlighted the evidence demonstrating that altered translation kinetics

can have an influence on co-translational processes like protein folding, targeting and

assembly. The experimental approach in multiple studies to demonstrate this effect was

to introduce synonymous mutations that do not change the nascent chain’s primary

structure but alter translation rates and determine the relative loss/gain of protein activity.

The future direction of research in this area should be to identify synonymous mutations

that are enriched in disease conditions and establish their association with disease by determining

whether the synonymous mutation alters translation kinetics and how it

disrupts a co-translational process, resulting in the loss/gain of protein activity. The

methods developed in Chapters 2 and 3 offer a simple and quantitative approach to

determine absolute codon translation rates. Ribo-Seq experiments can be run on

samples from patients and healthy individuals to determine differential translation

elongation rates at sites of synonymous codon substitutions.

Synonymous mutations have already been shown in some cases to lead

to protein aggregation diseases and cancer173. Therefore, it is important to study the

variability in codon translation rates that can arise due to synonymous mutations. This

approach can potentially explain the molecular mechanism of disease states and enable

therapeutic manipulation through development of rationally-designed mRNA sequences.

6.2.2 Test phenotypic effect of loss of amino acid pairing due to mutations

In Chapter 4, I demonstrated the evolutionary selection of amino acid pairs that encode

translation rate information. If selective pressures have acted on certain pairs of residues

because they encode translation rate information, then there must be some phenotypic

effect that is diminished upon mutating these pairs of residues. This leads to the

hypothesis that mutating these pairs is likely to alter the structure and function of the

encoded protein. To test this hypothesis, one can bioinformatically identify five yeast


proteins that are non-essential, cytoplasmic, monomeric enzymes (since these are easier

to assay than multimeric enzymes) and that are also predicted to have conserved,

translation-rate-encoding pairs of residues. Single amino acid mutations in these five

proteins can be chosen, as done in Chapter 4, that are likely to cause the largest change

in translation speed, but now under the assumption that these will be the most likely to

influence function. These mutations can be implemented in vivo by an experimental

collaborator with expertise in functional protein assays. The hypothesis will be supported

if the mutations both change the translation speed and decrease the protein’s relative

specific activity.

This future direction can dramatically expand the scope of this idea by

demonstrating that protein primary structure also encodes translation-rate information in

some pairs of residues so as to kinetically guide the co-translational acquisition of

structure and function in nascent proteins. Future research questions include how these

pairs of amino acids are modulating translation speed. I have demonstrated in Chapter 4

that both the identity of the amino acids in the P- and A-sites as well as the two tRNAs

aligned in these two positions have varied contributions to the change in translation speed.

One hypothesis is that if the amino acids are playing a major role, the catalysis of peptide

bond formation is rate limiting for these pairs, and hence mutating the residue at the P-

site can potentially switch peptide bond formation to the non-rate limiting regime and vice

versa. However, the role of tRNA interactions in influencing the translation rate is

unknown. One can speculate that, depending on the identity of the two tRNAs, the P-site tRNA

can facilitate or restrict the accommodation of aminoacyl-tRNA into the A-site, influencing

the rate of translation. Further research is required to uncover evidence for this

mechanism and its molecular principles.


6.2.3 Causally test the effect of altered translation kinetics on Ssb chaperone

binding

In Chapter 5, I demonstrated that evolution has encoded translation rate information in S.

cerevisiae transcripts that correlate with the co-translational binding of the chaperone Ssb.

However, the correlation between faster translation and Ssb binding does not establish

causation. As a future direction, causation can be tested using the following procedure

(visually represented in Figure 6.1): i) Ssb-bound translating regions are determined using

the wild-type Ssb binding profiles. ii) Synonymous mutations are introduced in the

identified Ssb-bound translated regions such that they alter the translation kinetics from


fast to slow without changing the encoded protein sequence. iii) Selective Ribo-Seq is

carried out in the mutated strain to estimate the profile of Ssb binding. Loss of Ssb binding

will support our hypothesis that altering evolutionarily encoded translation kinetics will

disrupt the binding of this chaperone to nascent polypeptides. This can also be a potential

mechanism by which a disease-associated synonymous mutation reduces the efficiency of

chaperone binding and causes a downstream phenotypic effect. The evidence from the

analyses presented in Chapter 5 and the inferences drawn from them will hopefully motivate

experimental researchers and structural biologists to study the role of chaperone binding

to the nascent chain in the final maturation of the protein.
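Step ii) of the procedure above, choosing synonymous substitutions that slow translation without changing the protein, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the `codon_time` dictionary of per-codon translation times are hypothetical, and a real design would use the measured, context-dependent rates from Chapters 3 and 4 rather than a single value per codon. Only the standard genetic code table is taken as fact.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in T, C, A, G order per position.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence (DNA alphabet) into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[k:k + 3]] for k in range(0, len(cds), 3))


def slow_synonymous_variant(cds, codon_time, region):
    """Replace each codon in `region` with its slowest synonymous codon.

    cds:        coding sequence whose length is a multiple of 3.
    codon_time: hypothetical dict mapping codon -> mean translation time.
    region:     (first, last) codon indices, 0-based and inclusive.
    """
    codons = [cds[k:k + 3] for k in range(0, len(cds), 3)]
    first, last = region
    for idx in range(first, last + 1):
        amino_acid = CODON_TABLE[codons[idx]]
        # Synonymous codons for which a translation time is available.
        synonyms = [c for c, aa in CODON_TABLE.items()
                    if aa == amino_acid and c in codon_time]
        if synonyms:
            codons[idx] = max(synonyms, key=lambda c: codon_time[c])
    return "".join(codons)


if __name__ == "__main__":
    times = {"GCT": 1.0, "GCG": 3.0, "AAA": 1.0, "AAG": 2.0, "GGT": 1.0}
    mutant = slow_synonymous_variant("ATGGCTAAAGGT", times, region=(1, 2))
    print(mutant, translate(mutant))
```

Checking that `translate` returns the same protein for the wild-type and mutant sequences confirms that every substitution is synonymous.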


[Figure 6.1 image: panels a and b, each with a Wild-type (left) and a Mutant (right) column; the plots show Fold Enrichment (0 to 25) against Codon Position (0 to 500).]

Figure 6.1. Illustration of the hypothesis that a change in translation-elongation

rates will lead to disruption of Ssb binding. (a) Analysis in Chapter 5 demonstrated

that faster translation rates are encoded in those regions of an mRNA where Ssb tends

to bind the nascent chain (left panel). Strongly correlated regions of translation speed

and Ssb binding in genes can be identified and synonymous mutations will be

introduced that result in a slowdown in translation without changing the protein

sequence (right panel). If the proposed hypothesis is correct, these synonymous

mutations will disrupt Ssb binding. (b) The signal peaks from Sel-Ribo-Seq signify that

Ssb is bound to the nascent chain. Ssb binding to regions 1 and 2 (yellow regions on

nascent chain) will result in a binding peak in the Sel-Ribo-Seq profile (left panel). If

Ssb binding is disrupted due to a slowdown of translation downstream of region 2,

there will be loss of signal when the mutated region is translated (right panel).


Appendix A

CHAPTER 2 SUPPORTING INFORMATION

A.1 Supporting Figures

Figure A.1. Fragment size distribution in (A) Pooled Ribo-Seq data in mouse embryonic stem cells

(mESCs) and (B) Pooled Ribo-Seq data in Escherichia coli.


Figure A.2. Pairwise comparison of fragment-size and frame distributions between genes in

S. cerevisiae. (A) The heat map reports the pairwise Hellinger distance177 between the probability

densities of the fragment-size and frame distributions of individual genes. Only genes in the Pooled

data set that have at least 1 read per codon for fragment sizes between 24 and 34 nt were analyzed,

resulting in 210 genes in this analysis. (B) The probability density distribution of Hellinger distances

reported in (A). The Hellinger distance is bounded between 0 and 1: 0 indicates identical

distributions, while 1 indicates that the distributions are maximally divergent. All pairwise Hellinger distances are

less than 0.45 and only 11% of pairwise distances are greater than 0.1. Hence, the distributions of

reads across fragment sizes and frames are highly similar between genes and therefore exhibit very little

dependence on gene identity.
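For readers unfamiliar with the metric, the Hellinger distance between two discrete distributions can be computed as in the short Python sketch below. The function names are illustrative, and flattening each gene's fragment-size and frame distribution into a single list of counts is an assumption about the input layout, not a description of the analysis code.

```python
import math


def normalize(counts):
    """Convert non-negative counts into a probability distribution."""
    total = float(sum(counts))
    if total == 0:
        raise ValueError("cannot normalize an all-zero count vector")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns 0 for identical distributions and 1 when the distributions
    have no outcome in common.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


if __name__ == "__main__":
    gene_a = normalize([10, 30, 60])
    gene_b = normalize([12, 28, 60])
    print(hellinger(gene_a, gene_b))
```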


[Figure A.3 image. Panel A: P(S, F) (axis range 0.00 to 0.07) plotted against Fragment Size (20 to 34 nt) for Frame 0, Frame 1 and Frame 2. Panel B: read length distributions. Panel C: table reproduced below.]

Read length     Input offset table
distribution    Constant offset of 15   Constant offset of 18   Mixed offsets of 12 and 18   Top offsets from experimental data
1               100%                    93%                     100%                         93%
2               95%                     95%                     100%                         95%
3               96.5%                   100%                    96.5%                        100%
4               100%                    100%                    100%                         97%
5               100%                    96%                     100%                         98%
6               100%                    100%                    100%                         100%
7               100%                    100%                    100%                         100%
8               95%                     95%                     95%                          100%

Figure A.3. Integer Programming algorithm correctly reproduces the true A-site offsets

from Artificial Ribo-Seq data. (A) An example of a Shifted Poisson distribution with mode at

(𝑆, 𝐹) = (28, 0) and a variance 𝜆 = 48 that was used to generate artificial ribosome-protected

fragments (see Methods for details). The reads generated from this distribution are subjected to

the Integer Programming algorithm for four different input offset tables. These input offset tables

are shown in Table A.4. (B) Six read length distributions with their mode at (28, 0)

were generated with Poisson variances 𝜆 = 4, 8, 16, 24, 48, 80 and labeled 1 through 6,

respectively. The distribution in (A) is the distribution labeled 5 in (B). Two more distributions

were generated with variance 𝜆 = 8 but with modes at (24, 0) and (32, 0) and labelled as 7 and

8, respectively. (C) The percentage of offsets that the Integer Programming correctly identifies

in the artificial Ribo-Seq data created based on the eight read length distributions shown in (B)

and for each of the four different input offset tables (see Methods for details) used to generate

the artificial Ribo-Seq reads. The four input offset tables and the corresponding output offset

tables generated by the Integer Programming algorithm for distribution 5 are shown in Table A.4.


Figure A.4. Meta-gene analysis of Pooled Ribo-Seq data reveals excess ribosome density in

E. coli genes beyond CDS regions. A-site profiles are obtained for S. cerevisiae using unique offsets

from Table 2.1 obtained after application of the Integer Programming algorithm. For E. coli, we use a

constant offset of 12 nt from the 3΄ end as used by Woolstenhulme and co-workers66. For mESCs, we

use the unique offsets from Table A.6. We plot the meta-gene profiles for the dataset of all genes as well

as for the subset of filtered genes containing only single isoforms with one translation start site.


Figure A.5. Stalling at PPE and PPD motifs is reproduced in mESCs. The median normalized

ribosome density is obtained for all instances of (A) PPX and (B) XPP motifs in which X corresponds

to any one of the 20 naturally occurring amino acids (or a stop codon, for instance PP*). Using a

permutation test, we determine if the median ribosome density is statistically different from the

average ribosome density. Statistically significant motifs are highlighted in dark red. This analysis

was carried out on the Pooled dataset for transcripts in which at least 50% of codon positions have

reads mapped to them. Error bars are 95% Confidence Intervals for the median obtained using

Bootstrapping120. The dashed line at Normalized Ribo Density = 1 indicates whether that motif

results in a slowdown or not (a value > 1 would indicate it is slower than average). The dashed line

is unrelated to statistical significance of motifs determined by permutation test.
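The bootstrap confidence interval for a median mentioned in this caption can be sketched in Python as follows. This is a generic percentile bootstrap using only the standard library; the function name, the number of resamples and the fixed seed are illustrative assumptions and do not describe the settings used to produce the figure.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Median of each resample drawn with replacement from the observed values.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    densities = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 2.4]
    print(bootstrap_median_ci(densities))
```

An interval lying entirely above a normalized density of 1 would correspond to a motif that is translated more slowly than average.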


Figure A.6. Sequence-independent translational pause observed post-initiation in S.

cerevisiae and mESCs. Meta-gene analysis at the codon level with reads mapped to the P-site is

shown. There is a mild but distinct pausing of translation when the 4th and 5th codons are in the P-

site. This effect is seen in both Pooled and Pop datasets of S. cerevisiae as well as the filtered genes

dataset of mouse embryonic stem cells (mESCs).


Figure A.7. The Integer Programming algorithm correctly assigns greater ribosome density to the

Glycine residue in PPG motifs than other methods in S. cerevisiae. (A) Normalized ribosome density at

the predicted A-site using different methods to determine A-site is shown for an instance of the PPG motif in

gene YDR226W, in which the glycine of the PPG motif is at codon position 16, for the Pop dataset in S. cerevisiae. (B)

The fraction of PPG instances (𝑛 = 35) at which the Integer Programming method yields greater ribosome density

at glycine than the compared method. The color-coding is the same as shown in the legend of panel (A). Our

method does better if it assigns greater ribosome density in more than half the instances (horizontal line in panel

B). Integer Programming yields significantly higher ribosome density at G in the PPG motifs than all other

methods (for Hussmann 𝑃 = 0.026, for ribodeblur 𝑃 = 0.01 and for others 𝑃 < 10⁻⁵). Two-sided P-values

were calculated using the Wilcoxon signed rank test. Error bars are 95% Confidence Interval about the median

calculated using Bootstrapping120.


A.2 Supplementary Tables

Table A.1. Number of genes for the various fragment size and frame combinations that meet the

criteria of at least 1 read per codon on average in the Pop and Pooled datasets of S. cerevisiae.

                Pop dataset                  Pooled dataset
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 34 7 57 156 50 98

21 45 60 41 224 129 139

22 99 55 48 234 115 199

23 73 47 95 162 107 290

24 58 91 72 161 352 193

25 155 69 55 647 251 194

26 105 64 175 481 241 916

27 159 255 161 1096 2213 878

28 1081 248 333 4487 1861 1468

29 850 330 1437 3919 1504 4474

30 528 876 1139 3184 3041 3835

31 276 610 643 2089 2251 3164

32 70 279 181 799 1897 1789

33 39 82 9 474 1076 322

34 1 2 0 237 194 71

35 0 0 0 58 40 33


Table A.2. Initial offset tables after application of Integer Programming algorithm to Pop and

Pooled datasets in S. cerevisiae.

                Pop dataset                  Pooled dataset
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 0/6 ND* 9/0 6/15 6/0 9/0

21 15/6 9/15 9/0 15/6 9/0 9/18

22 9/0 9/18 18/9 9/15 9/0 18/9

23 9/15 12/15 18/12 9/15 9/18 18/12

24 15/12 12/18 18/12 15/9 12/15 18/12

25 12/15 12/18 18/12 15/12 12/15 18/12

26 15/12 15/18 18/15 15/12 12/15 18/15

27 15 15/18 18/15 15 15 18/15

28 15 18/15 18 15 15 18/15

29 15 18/15 18 15 15/18 18

30 15 18 18 15 18/15 18

31 18/15 18 18 15 18 18

32 18/15 18 18 18/15 18 18

33 18/0 18/0 ND 18/15 18 18/15

34 ND ND ND 18/15 18/15 18/21

35 ND ND ND 18/15 15/18 21/18

*ND = Not Defined. The number of genes for certain 𝑆 and 𝐹 combinations is less than 10 and hence

the Integer Programming algorithm is not applied to these combinations.


Table A.3. For unique offsets described in Table 2.1, the robustness to variation in

parameters and consistency across different Ribo-Seq datasets are described with

additional sub columns. The two sub columns in the top row indicate whether the unique offset is

sensitive (S) or robust (R) to parameter variation, namely the change in threshold from 60% to

80% to classify the most probable offset as unique (left sub column) and variation in the threshold of

the secondary selection criterion 𝑅(1) < 𝑎 ∗ 𝑀𝑒𝑎𝑛(𝑅(2), 𝑅(3), 𝑅(4)), where 𝑎 ranges from 1 to 1/10

(right sub column). The bottom row specifies the consistency of the unique offset across individual

Ribo-Seq datasets. For example, for fragment size 27 in frame 0, 15 is the unique offset, which is

sensitive (S in left sub column) to a change in threshold from 60% to 80% and robust (R in right

sub column) to a change in the secondary selection parameter 𝑎 from 1 to 1/10. It is also consistent in 9 out

of 12 datasets for which we have more than 10 genes meeting our filtering criteria.

Fragment Size   Frame 0    Frame 1    Frame 2
24              15 R R     15/12      18/12
                4 of 4     -          -
25              15 S R     12/15      18 S S
                5 of 7     -          4 of 4
26              15/12      18/15      18/15
27              15 S R     15 S R     18 R R
                9 of 12    7 of 12    6 of 9
28              15 R R     15 S R     18 R R
                14 of 17   10 of 13   10 of 12
29              15 R R     15/18      18 R R
                14 of 15   -          15 of 16
30              15 R R     18 R R     18 R R
                15 of 15   12 of 16   16 of 16
31              15 S R     18 R R     18 R R
                12 of 13   13 of 15   15 of 16
32              18/15      18 R R     18 R R
                -          11 of 11   9 of 10
33              18 S S     18 R R     18 S S
                6 of 6     7 of 7     4 of 4
34              18 R R     18 R S     18/21
                2 of 2     2 of 2     -


Table A.4. Input A-site offset tables used in the creation of artificial Ribo-Seq data (first table,

see Methods). A-site offset tables (second table) output by the Integer Programming method when

applied to artificial Ribo-Seq data constructed using the input tables and the P(𝑆, 𝐹) distribution

with mode (28, 0) and variance 𝜆 = 48 (Distribution 5 in Figure A.3).

Input Offset tables

                Constant offset of 15       Constant offset of 18       Mixed offsets of 12 and 18  Top offsets from exp. data
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 15 15 15 18 18 18 12 12 12 6 6 9

21 15 15 15 18 18 18 12 12 12 15 9 9

22 15 15 15 18 18 18 12 12 12 15 6 18

23 15 15 15 18 18 18 12 12 12 15 18 18

24 15 15 15 18 18 18 12 12 12 15 15 12

25 15 15 15 18 18 18 12 12 12 15 12 18

26 15 15 15 18 18 18 12 12 12 15 15 18

27 15 15 15 18 18 18 12 12 12 15 15 18

28 15 15 15 18 18 18 18 18 18 15 15 18

29 15 15 15 18 18 18 18 18 18 15 15 18

30 15 15 15 18 18 18 18 18 18 15 18 18

31 15 15 15 18 18 18 18 18 18 15 18 18

32 15 15 15 18 18 18 18 18 18 15 18 18

33 15 15 15 18 18 18 18 18 18 18 18 18

34 15 15 15 18 18 18 18 18 18 18 18 18

35 15 15 15 18 18 18 18 18 18 18 18 18


Output Offset tables

                Constant offset of 15       Constant offset of 18       Mixed offsets of 12 and 18  Top offsets from exp. data
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 15 15 15 18 18 18 12 12 12 6 6 9

21 15 15 15 18 18 18 12 12 12 15 9 9

22 15 15 15 18 18 18 12 12 12 15 6 18

23 15 15 15 18 18 18 12 12 12 15 18 18

24 15 15 15 18 18 18 12 12 12 15 15 12

25 15 15 15 18 18 18 12 12 12 15 12 18

26 15 15 15 18 18 18 12 12 12 15 15 18

27 15 15 15 18 18 18 12 12 12 15 15 18

28 15 15 15 18 18 18 18 18 18 15 15 18

29 15 15 15 18 18 18 18 18 18 15 15 18

30 15 15 15 18 18 18 18 18 18 15 18 18

31 15 15 15 18 18 18 18 18 18 15 18 18

32 15 15 15 18 18 18 18 18 18 15 18 18

33 15 15 15 18 18 18 18 18 18 18 18 18

34 15 15 15 18 18 18 18 18 18 18 18 18

35 15 15 15 18 18/15 18/15 18 18 18 18 18 18/15


Table A.5. Initial offset table after application of Integer Programming algorithm to a Pooled dataset

in mESCs consisting of all genes. Offset table after application of Integer Programming algorithm

to a Pooled dataset of E. coli.

                mESCs Pooled dataset         E. coli Pooled dataset
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 6/3 6/9 0/15 12/9 15/0 9/0

21 6/0 6/0 0/18 12/9 3/12 9/0

22 0/9 0/3 9/15 12/0 12/0 9/0

23 6/15 9/0 18/9 12/3 0/12 9/0

24 9/15 3/0 18/0 12/3 12/0 0/9

25 9/6 9/6 12/15 12/3 12/3 9/6

26 12/6 9/12 15/0 12/3 3/0 9/3

27 12/9 12/21 15/9 3/12 3/6 9/3

28 12/15 12/21 15/12 3/12 3/6 3/0

29 15/12 15/12 18/15 3/12 3/6 3/9

30 15 15/18 18/15 12/9 9/3 9/6

31 15/12 15/18 18/15 12/3 9/6 9/3

32 12/15 18/15 18 12/9 3/9 9/3

33 15/12 18/15 18 12/9 9/3 9/3

34 15/18 15/18 18/12 12/9 9/12 9/6

35 ND* ND ND 12/9 12/9 9/6

36 ND ND ND 12/15 9/12 9/12

37 ND ND ND 12/6 12/9 9/12

38 ND ND ND 12/15 12/15 9/12

39 ND ND ND 12/15 12/3 9/3

40 ND ND ND 12/15 15/3 9/3


Table A.6. A-site locations (nucleotide offsets from 5΄ end) determined by applying the

Integer Programming algorithm to the Pooled dataset in mESCs are shown as a function of

fragment size and frame. The dataset consists of only genes that have a single isoform and only

one translation start site. The top two offset values are listed for those 𝑆 and 𝐹 combinations in

which the A-site location could not be uniquely determined. The description of the sub columns is

the same as Table A.3.

Fragment Size   Frame 0    Frame 1    Frame 2
28              15/12      15/12      15 S R
                -          -          1 of 2
29              15 R R     15/18      15/18
                2 of 2     -          -
30              15 R R     15/18      18/15
                2 of 2     -          -
31              15 R R     15 S S     18 R R
                2 of 2     1 of 2     2 of 2
32              15 R R     18/15      18 R R
                2 of 2     -          2 of 2
33              15 R R     18 R R     18 R R
                2 of 2     2 of 2     2 of 2
34              15/12      18 S S     18 R R
                -          2 of 2     2 of 2


Table A.7. Number of genes in the combination of fragment size and frame meeting the criteria of

at least 1 read per codon on average in mESCs and E. coli Pooled datasets. The mESCs Pooled

dataset consist of genes that are single isoform and have only one defined translation initiation site.

                mESCs Pooled dataset         E. coli Pooled dataset
Fragment size   Frame 0  Frame 1  Frame 2   Frame 0  Frame 1  Frame 2

20 8 7 10 313 243 330

21 10 0 1 440 270 377

22 10 1 10 532 416 431

23 15 15 18 610 471 645

24 41 20 19 806 471 655

25 61 19 52 816 610 603

26 52 43 75 765 625 742

27 73 38 45 952 681 825

28 126 55 119 1001 849 868

29 187 125 208 988 840 981

30 230 191 257 1072 791 956

31 197 192 280 1042 898 916

32 103 187 237 994 823 993

33 47 125 108 1060 761 943

34 17 55 45 1008 842 856

35 0 0 0 891 740 924

36 0 0 0 943 598 799

37 0 0 0 827 618 625

38 0 0 0 640 494 596

39 0 0 0 588 323 461

40 0 0 0 440 288 278


Table A.8. Median normalized ribosome densities for 61 codon types were correlated with tRNA

abundance for the Integer Programming method and 11 other contemporary methods (see

Methods for details). The tRNA abundance values were obtained from Table S2 of study of

Weinberg and co-workers41.

Method                Spearman's rho   p-value
Integer Programming   -0.583           6.39 × 10⁻⁵
Heuristic +18         -0.581           6.76 × 10⁻⁵
Plastid               -0.580           6.98 × 10⁻⁵
RiboProfiling         -0.575           8.55 × 10⁻⁵
riboWaltz             -0.574           8.65 × 10⁻⁵
Hussmann              -0.571           9.53 × 10⁻⁵
Martens               -0.571           9.82 × 10⁻⁵
Heuristic +15         -0.570           9.94 × 10⁻⁵
ribodeblur            -0.570           9.94 × 10⁻⁵
Scikit-ribo           -0.567           1.09 × 10⁻⁴
Rpbp                  -0.566           1.12 × 10⁻⁴
Center-weighted       -0.517           5.31 × 10⁻⁴


Table A.9. Publicly available datasets used in the study.

Dataset (first author name)   Year of publication   Number of replicates*   GEO Study   Accession numbers of samples used

Saccharomyces cerevisiae
Pop             2014   1   GSE63789   GSM1557447
Guydosh         2014   1   GSE52968   GSM1279568
Jan             2014   1   GSE61012   GSM1495525
Williams        2014   1   GSE61011   GSM1495503
Gerashchenko    2014   1   GSE59573   GSM1439584
Gardin          2014   2   GSE51164   GSM1239959, GSM1239960
Lareau          2014   3   GSE58321   GSM1406453, GSM1406454, GSM1406455
Nedialkova      2015   3   GSE67387   GSM1646015, GSM1646016, GSM1646017
Young           2015   1   GSE69414   GSM1700885
Weinberg        2016   1   GSE53268   GSM1289257
Nissley         2016   2   GSE75322   GSM1949550, GSM1949551

Mouse embryonic stem cells (mESCs)
Ingolia         2011   1   GSE30839   GSM765298
Hurt            2013   1   GSE41785   GSM1024298

Escherichia coli
Li              2012   2   GSE35641   GSM872393, GSM872394
Li              2014   1   GSE53767   GSM1300279
Woolstenhulme   2015   2   GSE64488   GSM1572273, GSM1572275

* For datasets with more than one replicate, all replicates were used to create the Pooled dataset.


Table A.10. A-site offsets determined using the publicly available software packages Plastid38,

RiboProfiling92 and riboWaltz37. These methods generate a P-site offset as output for each

fragment length. The A-site offsets below are obtained after adding 3 nt to the P-site offsets.

S. cerevisiae Pop data S. cerevisiae Pooled data mESCs Pooled data

Fragment size Plastid RiboProfiling riboWaltz Plastid RiboProfiling riboWaltz Plastid RiboProfiling riboWaltz

20 16 7 16 16 7 16 NA NA NA

21 16 7 16 16 7 13 NA NA NA

22 16 7 15 16 7 15 NA NA NA

23 16 10 15 16 10 15 NA NA NA

24 16 10 17 16 10 15 NA NA NA

25 16 11 15 16 11 15 16 13 15

26 16 11 16 16 11 16 16 14 16

27 16 14 14 16 14 14 16 15 15

28 16 15 15 16 15 15 13 16 15

29 16 16 16 16 16 16 13 6 15

30 16 16 16 16 16 16 15 15 15

31 16 16 16 16 16 16 13 13 16

32 16 17 17 16 17 17 16 13 16

33 16 17 17 16 17 17 16 14 16

34 16 13 15 16 13 15 17 13 14

35 16 13 16 16 13 15 NA NA NA


Appendix B

CHAPTER 3 SUPPORTING INFORMATION

B.1 Supplementary Methods

B.1.1 Derivation of Eq. (3.3) from Eq. (3.1) and Eq. (3.2)

Eq. (3.1), restated below, defines the steady state condition of translation. Eq. (3.2) is the

mean synthesis time of transcript 𝑖, which is the sum of the translation times of the

elongating codons of transcript 𝑖.

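\[
\frac{N^{\mathrm{ribo}}_{2,i}}{\tau(2,i)} = \frac{N^{\mathrm{ribo}}_{3,i}}{\tau(3,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{j,i}}{\tau(j,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{N_c(i),i}}{\tau(N_c(i),i)} \tag{3.1}
\]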

⟨𝑇(𝑖)⟩ = 𝜏(2, 𝑖) + 𝜏(3, 𝑖) + ⋯ + 𝜏(𝑁𝑐(𝑖), 𝑖) (Eq. 3.2)

The translation time of a codon position 𝑙 in transcript 𝑖 can be expressed (through a simple

algebraic rearrangement of Eq. (3.1)) in terms of the translation time of any other codon

position 𝑗 as

\[
\tau(l,i) = \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{l,i}}{N^{\mathrm{ribo}}_{j,i}}. \tag{B.1}
\]

For each codon position, 𝑙 = 2, 3, 4, … . , 𝑁𝑐(𝑖), we substitute Eq. (B.1) into Eq. (3.2), which

yields

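\[
\langle T(i) \rangle = \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{2,i}}{N^{\mathrm{ribo}}_{j,i}} + \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{3,i}}{N^{\mathrm{ribo}}_{j,i}} + \cdots + \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{N_c(i),i}}{N^{\mathrm{ribo}}_{j,i}}. \tag{B.2}
\]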

We then pull out the common terms, yielding

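\[
\langle T(i) \rangle = \frac{\tau(j,i)}{N^{\mathrm{ribo}}_{j,i}} \left[ N^{\mathrm{ribo}}_{2,i} + N^{\mathrm{ribo}}_{3,i} + \cdots + N^{\mathrm{ribo}}_{N_c(i),i} \right], \tag{B.3}
\]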

where the term in square brackets on the right-hand-side of Eq. (B.3) can be expressed

as a summation, yielding

\[
\langle T(i) \rangle = \frac{\tau(j,i)}{N^{\mathrm{ribo}}_{j,i}} \sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}. \tag{B.4}
\]

Rearranging Eq. (B.4) yields Eq. (3.3) in the main text:

\[
\tau(j,i) = \frac{N^{\mathrm{ribo}}_{j,i}}{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}} \, \langle T(i) \rangle. \tag{3.3}
\]
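Eq. (3.3) translates directly into a few lines of Python. The sketch below is an illustration only: the function name and the input layout (a list of A-site read counts for codon positions 2 through 𝑁𝑐(𝑖) of one transcript) are assumptions, and the mean synthesis time is taken as a given input.

```python
def codon_translation_times(reads, mean_synthesis_time):
    """Apply Eq. (3.3) to one transcript.

    reads:               A-site read counts for codon positions 2..Nc(i).
    mean_synthesis_time: mean synthesis time <T(i)> of the transcript.
    Returns the translation time of each codon position, in the same
    units as mean_synthesis_time.
    """
    total_reads = sum(reads)
    if total_reads == 0:
        raise ValueError("transcript has no reads at its elongating codons")
    # Each codon receives the fraction of <T(i)> given by its share of reads.
    return [n / total_reads * mean_synthesis_time for n in reads]


if __name__ == "__main__":
    print(codon_translation_times([10, 30, 60], mean_synthesis_time=2.0))
```

By construction the returned times sum to the mean synthesis time, which is the content of Eq. (3.2).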


B.1.2. Estimation of ⟨𝝉(𝒊)⟩

We estimated the synthesis time of a protein by using the finding that it scales linearly with

the number of elongating codons in a transcript137:

\[
\langle T(i) \rangle = (N_c(i) - 1)\,\langle \tau_A \rangle. \tag{B.5}
\]

In Eq. (B.5), ⟨𝜏A⟩ is the transcriptome-wide average codon translation time. This

approximation is supported both by experimental results33 and a theoretical analysis that

indicates this estimate is typically within 5% of the true synthesis time137.
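As a worked illustration of how Eq. (B.5) feeds into Eq. (3.3), consider a hypothetical transcript. The numerical values below (𝑁𝑐(𝑖) = 301 codons, ⟨𝜏A⟩ = 0.2 s, and a codon position carrying 2% of the transcript's reads) are assumptions chosen only to make the arithmetic easy to follow; they are not taken from the data.

```latex
\begin{align*}
\langle T(i) \rangle &= (N_c(i) - 1)\,\langle \tau_A \rangle
                      = (301 - 1) \times 0.2\ \mathrm{s} = 60\ \mathrm{s}, \\
\tau(j,i) &= \frac{N^{\mathrm{ribo}}_{j,i}}{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}}\,\langle T(i) \rangle
           = 0.02 \times 60\ \mathrm{s} = 1.2\ \mathrm{s}.
\end{align*}
```

In this example the codon is translated six times more slowly than the assumed average of 0.2 s.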


B.2 Supplementary Figures

Figure B.1. Comparison of the properties of the 117- and 364-transcript data sets from

studies of Nissley et al.9 and Williams et al.114, respectively, to the entire S. cerevisiae

transcriptome. Probability distributions of CDS length and percent GC content from the data set

of 117-transcripts from Nissley et al.9 (green) and from the entire transcriptome (blue) are plotted in

(A) and (B), respectively. (C) Scatter plot of the codon usage in the whole genome versus the 117-transcript

data set from Nissley et al.9. (D), (E) and (F) are the same as (A), (B) and (C), respectively,

except that the 364-transcript data set from Williams et al.114 is used.

Figure B.2. Translation time distributions for the 64 codon types. (A) The translation time distribution for each codon type is shown for the dataset of Nissley et al.9. Each distribution is shown ignoring the extreme 5th percentiles at both ends. The codons are sorted by the medians of their translation time distributions. There are only three instances of CGG and one instance of CGA in our gene subset and hence their boxplots are not noticeable. (B) Same as (A) but for the dataset of Williams et al.114. The sorting is the same as in (A).

Figure B.3. Codon translation rates are highly correlated across datasets and with rates from the method of Dao Duc and Song. (A) The medians of the translation time distributions of the 64 codon types are highly correlated between the datasets of Nissley et al.9 and Williams et al.114. (B) The standard deviations of these translation time distributions are also highly correlated for the two datasets, indicating that the variability of translation times is reproducible across datasets. (C) The codon translation rates obtained using Eq. (3.5) for the dataset from Weinberg et al.41 are correlated with the codon translation rates inferred in the study of Dao Duc and Song132 on the same dataset.

Figure B.4. Molecular factors shaping the variability of individual codon translation rates in the dataset from Williams et al.114. (A-B) Median translation times of codon types are negatively correlated with cognate tRNA abundance estimated by (A) gene copy number and (B) RNA-Seq gene expression. (C) Probability distributions of the translation times of codons that are followed by a proline-encoding codon (green) and of all other codons (blue). (D-E) Percentage difference in median translation times when mRNA structure is present relative to when it is not present, as a function of codon position after the A-site. Grey bars indicate results that are not statistically significant. Error bars are the 95% C.I. calculated using 10^4 bootstrap cycles; significance is assessed using the Mann-Whitney U test corrected with the Benjamini-Hochberg FDR method for multiple-hypothesis correction. The mRNA structure information used in (D) and (E) was taken from in vivo DMS and in vitro PARS data, respectively. (F) Scatter plot of the median translation times of pairs of codon types that are decoded by the same tRNA molecule. The red line is the identity line. The list of tRNA molecules and the codons they decode was taken from Cannarozzi et al.176. Error bars are the standard error about the median calculated with 10^4 bootstrap cycles.
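The statistical procedure named in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the figure: the function names, the percentile-based interval, the random seed and the example data are all hypothetical, and the Mann-Whitney U test itself is not reimplemented here (only the Benjamini-Hochberg adjustment of a list of p-values is shown).

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_pct_diff_ci(with_struct: Sequence[float],
                          without_struct: Sequence[float],
                          n_boot: int = 10_000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap C.I. of the percentage difference in medians."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        a = rng.choices(with_struct, k=len(with_struct))
        b = rng.choices(without_struct, k=len(without_struct))
        med_b = statistics.median(b)
        diffs.append(100.0 * (statistics.median(a) - med_b) / med_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical translation times (ms) with and without mRNA structure.
structured = [220.0, 230.0, 240.0, 250.0, 260.0]
unstructured = [180.0, 190.0, 200.0, 210.0, 215.0]
ci = bootstrap_pct_diff_ci(structured, unstructured, n_boot=2000)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(ci, q_values)
```

A position would then be reported as significant when its adjusted p-value falls below the chosen FDR threshold.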

B.3 Supplementary Tables

Table B.1. Statistics for the translation time distributions of 64 codon types obtained from the Nissley data set

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
AUU | I | 908 | 146 +/- 3 | 127 +/- 2 | 97 | 9482 | YHL034C/176 | YER091C/305 | 49.936 | 290.031 | 17.590 | 1491.588 | 0.664
AUC | I | 836 | 143 +/- 2 | 131 +/- 3 | 70 | 4891 | YNL064C/178 | YGL245W/66 | 51.529 | 274.485 | 9.063 | 470.162 | 0.490
GUU | V | 1052 | 148 +/- 2 | 133 +/- 2 | 80 | 6411 | YOR198C/160 | YOR369C/10 | 57.382 | 279.613 | 23.180 | 1060.520 | 0.541
ACC | T | 621 | 156 +/- 4 | 138 +/- 3 | 99 | 9854 | YHR208W/107 | YHL034C/91 | 57.537 | 299.618 | 21.037 | 1172.043 | 0.635
AAU | N | 401 | 161 +/- 6 | 139 +/- 4 | 113 | 12753 | YBR162W-A/12 | YOR027W/70 | 61.528 | 315.663 | 21.400 | 1333.213 | 0.702
AAC | N | 994 | 153 +/- 2 | 142 +/- 3 | 73 | 5307 | YDR454C/145 | YGR155W/232 | 59.441 | 285.121 | 13.994 | 523.329 | 0.477
AAG | K | 1692 | 174 +/- 3 | 145 +/- 2 | 143 | 20352 | YPL061W/449 | YLR150W/115 | 53.913 | 375.849 | 17.671 | 2608.537 | 0.822
ACU | T | 748 | 162 +/- 4 | 146 +/- 4 | 104 | 10849 | YLR197W/298 | YPL106C/269 | 59.976 | 310.077 | 6.065 | 1464.146 | 0.642
CAA | Q | 931 | 169 +/- 3 | 152 +/- 3 | 90 | 8050 | YGR209C/98 | YOL040C/3 | 65.181 | 321.005 | 19.496 | 840.433 | 0.533
AAA | K | 820 | 178 +/- 4 | 153 +/- 3 | 116 | 13484 | YBR249C/335 | YDR071C/116 | 57.498 | 358.580 | 10.543 | 1411.564 | 0.652
UUA | L | 541 | 172 +/- 4 | 153 +/- 4 | 88 | 7786 | YBR025C/340 | YNL244C/39 | 66.852 | 340.522 | 20.953 | 659.738 | 0.512
GUC | V | 750 | 167 +/- 3 | 155 +/- 3 | 83 | 6903 | YDL067C/27 | YAL038W/256 | 68.925 | 298.326 | 16.830 | 789.176 | 0.497
GCU | A | 1508 | 167 +/- 2 | 156 +/- 2 | 79 | 6168 | YPR069C/176 | YPL028W/258 | 66.836 | 310.789 | 15.783 | 757.590 | 0.473
UUG | L | 1331 | 175 +/- 2 | 160 +/- 3 | 88 | 7668 | YOR210W/24 | YDL055C/156 | 72.614 | 320.205 | 24.592 | 1251.341 | 0.503
GCC | A | 734 | 179 +/- 3 | 164 +/- 4 | 93 | 8608 | YPR069C/27 | YBL030C/124 | 66.836 | 338.188 | 15.438 | 1166.562 | 0.520
UUU | F | 423 | 182 +/- 6 | 165 +/- 4 | 113 | 12687 | YCR053W/287 | YPR069C/72 | 67.126 | 339.323 | 18.307 | 1610.375 | 0.621
UUC | F | 792 | 179 +/- 3 | 169 +/- 3 | 79 | 6223 | YDL055C/211 | YPR035W/108 | 73.490 | 312.219 | 19.220 | 630.372 | 0.441

Table B.1 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
UAC | Y | 634 | 233 +/- 4 | 216 +/- 4 | 101 | 10180 | YGR175C/319; YGR175C/457 | YCR012W/123 | 104.767 | 414.640 | 33.228 | 817.728 | 0.433
CCA | P | 943 | 245 +/- 6 | 218 +/- 4 | 171 | 29352 | YGL253W/425 | YGL253W/74 | 88.627 | 459.412 | 2.573 | 3898.717 | 0.698
CGA | R | 1 | 218 +/- 0 | 218 +/- 0 | 0 | 0 | YER072W/20 | YER072W/20 | 218.277 | 218.277 | 218.277 | 218.277 | 0.000
ACA | T | 188 | 259 +/- 9 | 228 +/- 10 | 124 | 15368 | YPL106C/491 | YBR249C/44 | 110.100 | 511.094 | 57.571 | 807.896 | 0.479
CAC | H | 309 | 248 +/- 6 | 236 +/- 6 | 109 | 11783 | YLL018C/487 | YIL053W/51 | 94.938 | 465.458 | 25.852 | 761.373 | 0.440
CCU | P | 272 | 254 +/- 7 | 237 +/- 8 | 122 | 14834 | YLL018C/520 | YDR226W/208 | 107.596 | 469.214 | 62.331 | 731.432 | 0.480
GCA | A | 251 | 253 +/- 7 | 238 +/- 8 | 111 | 12238 | YDR454C/156 | YLR197W/16 | 95.105 | 493.138 | 45.491 | 741.956 | 0.439
CAG | Q | 124 | 253 +/- 10 | 243 +/- 15 | 111 | 12254 | YHR019C/84 | YOR027W/554 | 111.913 | 436.667 | 58.392 | 694.598 | 0.439
AGC | S | 121 | 253 +/- 9 | 247 +/- 13 | 97 | 9456 | YKL192C/70 | YGR175C/308 | 122.951 | 419.066 | 61.418 | 651.262 | 0.383
CUC | L | 43 | 288 +/- 22 | 250 +/- 20 | 143 | 20370 | YER009W/3 | YCR053W/56 | 137.984 | 634.643 | 42.336 | 695.236 | 0.497
GGC | G | 224 | 263 +/- 8 | 250 +/- 7 | 126 | 15922 | YDR023W/391 | YLR109W/12 | 90.985 | 451.525 | 20.534 | 944.994 | 0.479
GUG | V | 145 | 280 +/- 18 | 253 +/- 13 | 216 | 46686 | YJR104C/7 | YHR064C/118 | 91.320 | 475.038 | 56.899 | 2382.025 | 0.771
GAG | E | 319 | 268 +/- 7 | 254 +/- 7 | 117 | 13606 | YOR198C/415 | YOR285W/51 | 122.962 | 475.407 | 43.049 | 1094.666 | 0.437
GCG | A | 54 | 272 +/- 19 | 258 +/- 22 | 136 | 18474 | YGR282C/13 | YER055C/53 | 83.664 | 499.339 | 35.673 | 655.717 | 0.500
UGC | C | 48 | 292 +/- 19 | 263 +/- 11 | 132 | 17535 | YNL244C/89 | YFL045C/46 | 126.119 | 575.068 | 59.140 | 673.774 | 0.452
AUA | I | 79 | 279 +/- 14 | 266 +/- 17 | 122 | 14942 | YER120W/121; YER120W/211 | YBR162W-A/40 | 107.475 | 474.645 | 85.151 | 665.911 | 0.437
CUG | L | 86 | 301 +/- 13 | 279 +/- 15 | 122 | 14892 | YPR062W/88 | YGR155W/307 | 126.517 | 522.721 | 111.913 | 720.722 | 0.405

Table B.1 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
UAU | Y | 237 | 187 +/- 7 | 171 +/- 8 | 107 | 11425 | YDL084W/285 | YDL208W/20 | 67.325 | 349.452 | 26.305 | 1069.203 | 0.572
AUG | M | 479 | 192 +/- 4 | 177 +/- 5 | 91 | 8234 | YNL244C/92 | YDR071C/78 | 75.671 | 347.375 | 35.341 | 642.128 | 0.474
GAU | D | 900 | 203 +/- 4 | 179 +/- 3 | 130 | 16971 | YOR298C-A/5 | YDL084W/229 | 72.771 | 414.306 | 21.400 | 2161.427 | 0.640
GGU | G | 1695 | 210 +/- 3 | 183 +/- 3 | 131 | 17063 | YHL034C/151 | YGL253W/80 | 74.904 | 441.325 | 5.785 | 1289.947 | 0.624
UCU | S | 740 | 201 +/- 3 | 187 +/- 4 | 94 | 8888 | YPR069C/220 | YLR197W/108 | 77.119 | 373.185 | 19.869 | 715.833 | 0.468
UGU | C | 234 | 197 +/- 6 | 189 +/- 10 | 95 | 9059 | YML022W/98 | YDR461W/33 | 74.744 | 353.326 | 39.265 | 730.381 | 0.482
GAA | E | 1886 | 208 +/- 2 | 190 +/- 2 | 100 | 9964 | YER120W/147 | YOR198C/120 | 88.509 | 393.477 | 10.838 | 1042.420 | 0.481
AGA | R | 912 | 209 +/- 4 | 192 +/- 3 | 109 | 11800 | YHL034C/116 | YMR260C/66 | 81.146 | 404.265 | 8.397 | 1388.206 | 0.522
CAU | H | 211 | 203 +/- 6 | 192 +/- 8 | 93 | 8574 | YJR104C/81 | YPL106C/244 | 78.724 | 395.461 | 33.632 | 469.642 | 0.458
CGU | R | 231 | 212 +/- 7 | 193 +/- 6 | 102 | 10431 | YPR035W/333 | YAL012W/227 | 77.041 | 417.247 | 44.304 | 614.631 | 0.481
GUA | V | 122 | 231 +/- 13 | 200 +/- 9 | 143 | 20506 | YPL061W/295 | YBR109C/122 | 97.307 | 460.739 | 67.898 | 1314.736 | 0.619
AGU | S | 149 | 214 +/- 7 | 205 +/- 10 | 82 | 6774 | YKL216W/4 | YHR019C/60 | 88.192 | 365.191 | 33.320 | 488.187 | 0.383
GAC | D | 850 | 231 +/- 4 | 208 +/- 4 | 127 | 16255 | YGR155W/256 | YLR197W/68 | 91.080 | 426.497 | 25.190 | 2165.045 | 0.550
CUU | L | 132 | 221 +/- 9 | 209 +/- 9 | 101 | 10294 | YDL084W/177 | YHR179W/28 | 88.040 | 383.426 | 18.307 | 601.273 | 0.457
CUA | L | 245 | 234 +/- 7 | 210 +/- 8 | 106 | 11131 | YOR063W/43 | YCR053W/80 | 97.119 | 414.959 | 58.488 | 744.231 | 0.453
UCA | S | 173 | 232 +/- 9 | 211 +/- 10 | 113 | 12790 | YGL245W/374 | YDL208W/106 | 83.326 | 417.823 | 62.427 | 895.306 | 0.487
UCC | S | 527 | 235 +/- 5 | 213 +/- 5 | 114 | 12929 | YDR023W/403 | YER091C/60 | 95.534 | 451.996 | 27.950 | 693.370 | 0.485

Table B.1 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
GGA | G | 106 | 325 +/- 17 | 292 +/- 15 | 171 | 29181 | YGL123W/26 | YER087C-B/49 | 147.589 | 620.494 | 99.216 | 1436.145 | 0.526
GGG | G | 56 | 327 +/- 22 | 297 +/- 23 | 168 | 28104 | YDR276C/25 | YHR183W/431 | 127.021 | 660.716 | 97.684 | 851.402 | 0.514
AGG | R | 73 | 321 +/- 14 | 305 +/- 12 | 118 | 14042 | YLL018C/495 | YLL018C/112 | 164.559 | 531.652 | 132.341 | 708.869 | 0.368
CGC | R | 21 | 402 +/- 52 | 321 +/- 34 | 238 | 56430 | YOR332W/166 | YER055C/38 | 165.248 | 777.543 | 143.443 | 1216.508 | 0.592
ACG | T | 40 | 343 +/- 18 | 327 +/- 30 | 116 | 13428 | YHR072W-A/27 | YER087C-B/48 | 170.112 | 534.314 | 134.468 | 563.298 | 0.338
UCG | S | 40 | 334 +/- 21 | 329 +/- 24 | 133 | 17635 | YHR005C-A/60 | YGL245W/246 | 124.474 | 593.084 | 25.317 | 602.517 | 0.398
CCC | P | 52 | 347 +/- 21 | 331 +/- 30 | 154 | 23588 | YDR071C/16 | YGR037C/20 | 145.673 | 551.434 | 44.304 | 790.127 | 0.444
UGG | W | 259 | 370 +/- 12 | 331 +/- 11 | 193 | 37364 | YHR208W/346 | YDR454C/71 | 158.226 | 719.231 | 31.630 | 1467.249 | 0.522
CGG | R | 3 | 443 +/- 86 | 337 +/- 139 | 150 | 22411 | YOR027W/586 | YOR332W/114 | 336.707 | 654.383 | 336.707 | 654.383 | 0.339
CCG | P | 18 | 390 +/- 38 | 344 +/- 37 | 161 | 26061 | YIL051C/23 | YLR109W/94 | 248.418 | 595.042 | 163.092 | 875.221 | 0.413
UAA | STOP | 78 | 719 +/- 31 | 669 +/- 33 | 275 | 75830 | YLR325C/79 | YGL245W/709 | 400.859 | 1303.804 | 311.219 | 1700.931 | 0.382
UAG | STOP | 19 | 821 +/- 79 | 829 +/- 119 | 346 | 119416 | YPR069C/294 | YDR002W/202 | 380.453 | 1304.308 | 293.879 | 1539.106 | 0.421
UGA | STOP | 20 | 989 +/- 96 | 976 +/- 169 | 432 | 187022 | YDL055C/362 | YLL018C/558 | 419.940 | 1620.272 | 289.486 | 1670.280 | 0.437

Table B.2. Statistics for the translation time distributions of 64 codon types obtained from the Williams data set

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
AUC | I | 2637 | 146 +/- 2 | 128 +/- 2 | 93 | 8570 | YOR230W/291 | YGR175C/134 | 50.127 | 297.278 | 6.983 | 1803.182 | 0.637
GUU | V | 3339 | 145 +/- 1 | 129 +/- 1 | 83 | 6844 | YCR005C/418 | YMR217W/260 | 50.113 | 284.106 | 8.200 | 1154.668 | 0.572
AUU | I | 3462 | 151 +/- 2 | 133 +/- 1 | 102 | 10438 | YGR175C/456; YGR175C/476 | YOR374W/202 | 45.417 | 313.639 | 7.693 | 1489.136 | 0.675
ACC | T | 1929 | 162 +/- 3 | 135 +/- 2 | 119 | 14179 | YLR075W/42 | YKL182W/945 | 50.741 | 354.598 | 13.187 | 1567.686 | 0.735
GUC | V | 2203 | 152 +/- 2 | 137 +/- 2 | 85 | 7146 | YCR005C/453 | YNL244C/57 | 52.499 | 293.333 | 6.731 | 896.451 | 0.559
CAA | Q | 3143 | 159 +/- 2 | 138 +/- 2 | 100 | 9910 | YLR438C-A/50 | YBR025C/201 | 51.701 | 325.435 | 12.109 | 1513.847 | 0.629
ACU | T | 2660 | 173 +/- 4 | 141 +/- 1 | 184 | 33816 | YER091C/93 | YLR438C-A/74 | 52.642 | 347.137 | 10.595 | 5487.420 | 1.064
UUA | L | 2283 | 158 +/- 2 | 142 +/- 2 | 93 | 8668 | YEL002C/129 | YNL138W/17 | 51.554 | 333.305 | 16.139 | 923.412 | 0.589
AAC | N | 2972 | 166 +/- 2 | 148 +/- 2 | 89 | 7896 | YIL053W/76 | YGR175C/408 | 60.964 | 322.051 | 8.205 | 1036.469 | 0.536
GCC | A | 2412 | 166 +/- 2 | 148 +/- 2 | 94 | 8808 | YHR183W/285 | YER178W/149 | 57.752 | 325.726 | 9.723 | 1161.843 | 0.566
AAU | N | 1980 | 168 +/- 2 | 149 +/- 2 | 102 | 10345 | YBR149W/288 | YOR136W/328 | 53.514 | 348.880 | 14.678 | 1252.347 | 0.607
GCU | A | 4226 | 164 +/- 1 | 149 +/- 1 | 85 | 7237 | YOR362C/277 | YOR153W/598 | 58.229 | 315.195 | 8.476 | 952.493 | 0.518
UCU | S | 2868 | 167 +/- 2 | 150 +/- 2 | 99 | 9839 | YGL187C/7 | YLR027C/131 | 49.386 | 335.070 | 9.201 | 1401.965 | 0.593
UUC | F | 2654 | 168 +/- 2 | 152 +/- 2 | 91 | 8261 | YKR039W/362 | YGR240C/318 | 58.562 | 327.965 | 6.150 | 1203.317 | 0.542
AAG | K | 4640 | 194 +/- 2 | 155 +/- 2 | 170 | 28836 | YBR106W/186 | YFL014W/88 | 55.203 | 444.931 | 12.704 | 3719.243 | 0.876
UUU | F | 2031 | 172 +/- 2 | 157 +/- 2 | 100 | 9998 | YHR183W/209 | YKR013W/3 | 57.752 | 338.884 | 14.397 | 2170.589 | 0.581
UCC | S | 1923 | 177 +/- 2 | 160 +/- 3 | 100 | 9962 | YMR202W/103 | YOL040C/29 | 55.896 | 359.800 | 11.840 | 953.487 | 0.565

Table B.2 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
UGU | C | 849 | 178 +/- 3 | 161 +/- 3 | 90 | 8157 | YNL036W/134; YNL036W/171 | YOL030W/100 | 65.589 | 357.105 | 18.686 | 646.112 | 0.506
UUG | L | 4170 | 179 +/- 2 | 161 +/- 2 | 98 | 9566 | YGL253W/395 | YFL022C/162 | 61.461 | 351.227 | 15.646 | 1097.382 | 0.547
CAU | H | 1081 | 185 +/- 3 | 164 +/- 3 | 111 | 12292 | YPL231W/11 | YHR183W/40 | 54.535 | 397.662 | 10.099 | 1156.453 | 0.600
AUG | M | 1855 | 182 +/- 2 | 167 +/- 3 | 94 | 8778 | YKR093W/499; YKR093W/504; YKR093W/592 | YLR355C/309 | 60.368 | 355.879 | 14.570 | 732.746 | 0.516
CGU | R | 827 | 192 +/- 4 | 167 +/- 5 | 112 | 12580 | YOR153W/270 | YPL061W/352 | 63.891 | 391.435 | 14.882 | 1084.517 | 0.583
GAU | D | 3453 | 200 +/- 2 | 175 +/- 2 | 132 | 17319 | YDR322C-A/64 | YBR126C/440 | 65.557 | 415.094 | 9.584 | 3165.478 | 0.660
UAU | Y | 1370 | 191 +/- 3 | 176 +/- 4 | 98 | 9692 | YKR093W/601 | YFL045C/113 | 68.417 | 375.264 | 5.457 | 700.141 | 0.513
CUA | L | 1151 | 194 +/- 3 | 178 +/- 3 | 99 | 9864 | YLR058C/276 | YIL002W-A/55 | 73.640 | 368.670 | 15.538 | 1105.856 | 0.510
AAA | K | 3419 | 216 +/- 3 | 179 +/- 2 | 158 | 25021 | YDR023W/396 | YOR230W/48 | 67.419 | 476.203 | 11.626 | 2478.969 | 0.731
UCA | S | 1079 | 200 +/- 3 | 180 +/- 5 | 113 | 12661 | YJL174W/239 | YMR251W-A/28 | 67.573 | 397.397 | 13.867 | 900.991 | 0.565
GAC | D | 2746 | 202 +/- 2 | 181 +/- 2 | 114 | 12982 | YIL033C/216 | YGL256W/60 | 68.915 | 398.995 | 10.869 | 1979.487 | 0.564
GGU | G | 5179 | 223 +/- 3 | 182 +/- 2 | 182 | 33239 | YKL085W/24 | YPR062W/14 | 68.338 | 488.777 | 3.205 | 3821.545 | 0.816
AGU | S | 851 | 201 +/- 4 | 184 +/- 4 | 119 | 14210 | YLR179C/3 | YLL018C/280 | 65.516 | 402.156 | 12.536 | 1931.933 | 0.592
CUU | L | 744 | 204 +/- 4 | 184 +/- 4 | 107 | 11518 | YLR300W/4 | YBR283C/339 | 72.092 | 408.580 | 15.083 | 928.204 | 0.525
GAA | E | 5887 | 206 +/- 1 | 184 +/- 2 | 113 | 12806 | YLR257W/293 | YKL060C/194 | 73.388 | 409.345 | 4.872 | 1827.645 | 0.549
GUA | V | 698 | 204 +/- 4 | 185 +/- 5 | 111 | 12431 | YKL216W/212 | YBR162W-A/3 | 69.354 | 405.219 | 15.563 | 936.084 | 0.544
CAC | H | 967 | 208 +/- 4 | 186 +/- 4 | 113 | 12699 | YJR070C/99; YJR070C/139 | YBR196C/182 | 78.094 | 413.248 | 15.636 | 1100.286 | 0.543

Table B.2 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
AGA | R | 2735 | 218 +/- 3 | 191 +/- 2 | 135 | 18286 | YLL018C/544 | YGR285C/247 | 74.884 | 436.541 | 14.215 | 2057.610 | 0.619
UAC | Y | 2184 | 216 +/- 2 | 196 +/- 2 | 104 | 10745 | YGR065C/507 | YGR060W/81 | 80.622 | 414.201 | 14.565 | 897.806 | 0.481
AGC | S | 593 | 217 +/- 4 | 205 +/- 5 | 100 | 9952 | YKR013W/277 | YGR189C/36 | 84.721 | 398.825 | 20.199 | 691.115 | 0.461
CCA | P | 2706 | 238 +/- 3 | 206 +/- 3 | 157 | 24516 | YDR353W/273 | YLR355C/67 | 83.462 | 498.575 | 13.049 | 3102.975 | 0.660
CAG | Q | 749 | 237 +/- 5 | 212 +/- 5 | 128 | 16333 | YIL043C/280 | YMR002W/113 | 76.823 | 476.071 | 19.626 | 938.037 | 0.540
GCA | A | 1429 | 239 +/- 3 | 220 +/- 4 | 112 | 12647 | YPL265W/408 | YBL099W/273 | 92.280 | 450.671 | 20.289 | 888.451 | 0.469
GGC | G | 1089 | 277 +/- 6 | 233 +/- 4 | 210 | 43985 | YFL005W/198 | YEL060C/101 | 83.235 | 574.735 | 22.471 | 3323.463 | 0.758
GAG | E | 1685 | 255 +/- 3 | 234 +/- 4 | 136 | 18480 | YMR203W/268 | YOL038W/87 | 90.501 | 503.180 | 15.029 | 1531.599 | 0.533
CCU | P | 1348 | 271 +/- 4 | 235 +/- 5 | 156 | 24443 | YLR179C/55 | YPL265W/4 | 88.640 | 559.105 | 32.797 | 1605.779 | 0.576
AUA | I | 584 | 273 +/- 6 | 236 +/- 8 | 156 | 24467 | YOR142W/240 | YKL216W/198 | 88.930 | 575.939 | 44.454 | 1299.116 | 0.571
ACA | T | 1168 | 260 +/- 4 | 237 +/- 4 | 146 | 21360 | YDL135C/83 | YGR282C/6 | 94.614 | 508.126 | 18.585 | 1737.988 | 0.562
GUG | V | 1010 | 264 +/- 5 | 238 +/- 5 | 148 | 21771 | YDL126C/629 | YDL022W/39 | 90.448 | 547.216 | 34.919 | 1954.111 | 0.561
UGC | C | 281 | 273 +/- 9 | 244 +/- 7 | 153 | 23403 | YDR098C/176 | YIL078W/261 | 87.530 | 538.704 | 34.766 | 1268.004 | 0.560
UCG | S | 454 | 282 +/- 7 | 253 +/- 7 | 147 | 21493 | YKR093W/130 | YGR286C/110 | 92.564 | 563.532 | 23.710 | 1200.743 | 0.521
CUC | L | 233 | 283 +/- 9 | 256 +/- 8 | 139 | 19262 | YER012W/75 | YLR056W/94 | 105.993 | 514.689 | 37.126 | 949.260 | 0.491
CGC | R | 117 | 286 +/- 14 | 260 +/- 20 | 150 | 22393 | YNL111C/83 | YLR257W/95 | 117.606 | 567.535 | 12.228 | 812.040 | 0.524
GCG | A | 413 | 307 +/- 8 | 271 +/- 11 | 173 | 29868 | YGR020C/74 | YLR027C/150 | 111.118 | 620.641 | 29.918 | 1346.682 | 0.564

Table B.2 (continued).

Codon type | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in (gene name/codon position) | 95% percentile observed in (gene name/codon position) | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
GGA | G | 651 | 320 +/- 10 | 271 +/- 6 | 245 | 59912 | YKL216W/223 | YOR276W/92 | 107.046 | 688.650 | 19.426 | 3936.725 | 0.766
AGG | R | 435 | 318 +/- 9 | 273 +/- 9 | 183 | 33659 | YCR005C/420 | YOL038W/49; YOL038W/119 | 107.385 | 635.596 | 35.710 | 1658.500 | 0.575
UGG | W | 1079 | 325 +/- 6 | 275 +/- 5 | 211 | 44365 | YLR056W/332 | YOL058W/144 | 85.781 | 691.126 | 21.297 | 2176.320 | 0.649
CUG | L | 631 | 305 +/- 7 | 281 +/- 6 | 165 | 27089 | YLR027C/5 | YOR187W/21 | 106.613 | 606.904 | 36.642 | 1338.078 | 0.541
GGG | G | 501 | 326 +/- 9 | 287 +/- 8 | 190 | 36116 | YBR286W/110 | YKL016C/24 | 102.816 | 723.000 | 14.643 | 1358.255 | 0.583
ACG | T | 441 | 323 +/- 8 | 296 +/- 7 | 163 | 26726 | YPL078C/66 | YFL037W/72 | 117.919 | 637.251 | 40.996 | 1537.909 | 0.505
CCC | P | 426 | 356 +/- 10 | 315 +/- 10 | 210 | 44062 | YGL202W/182; YGL202W/440 | YOR153W/830 | 115.242 | 677.244 | 45.795 | 1696.036 | 0.590
CGG | R | 26 | 427 +/- 54 | 360 +/- 70 | 273 | 74374 | YOR052C/71 | YFL038C/108 | 83.936 | 940.193 | 68.312 | 1052.479 | 0.639
CCG | P | 178 | 504 +/- 23 | 455 +/- 21 | 311 | 96474 | YPL265W/117 | YDL084W/440 | 170.989 | 1194.465 | 39.140 | 1827.127 | 0.617
CGA | R | 31 | 482 +/- 45 | 496 +/- 61 | 250 | 62599 | YML001W/104 | YDL181W/70 | 94.581 | 848.701 | 42.430 | 1212.105 | 0.519
UAA | STOP | 203 | 185 +/- 13 | 144 +/- 8 | 190 | 36114 | YJL166W/95 | YCR004C/248 | 45.035 | 424.392 | 21.395 | 2208.019 | 1.027
UAG | STOP | 77 | 311 +/- 20 | 255 +/- 24 | 176 | 31145 | YFL045C/255 | YDL022W/392 | 98.080 | 630.206 | 62.133 | 852.048 | 0.566
UGA | STOP | 84 | 376 +/- 32 | 317 +/- 22 | 296 | 87558 | YDL067C/60 | YPL059W/151 | 92.845 | 880.435 | 26.669 | 2174.416 | 0.787

Appendix C

CHAPTER 4 SUPPORTING INFORMATION

C.1 Methods

C.1.1 Details of Experiments

C.1.1.1 Design of mutant strains

There are 7,980 possible amino acid mutations across all combinations of amino acid pairs in which the P-site amino acid is mutated while the A-site amino acid is kept unchanged. Bioinformatic analyses of published Ribo-Seq data predicted that 4,134 of these 7,980 possible mutations will result in a significant change in translation speed (Figure 4.1d). To experimentally validate our bioinformatic predictions, we chose 5 mutations predicted to speed up translation and 5 mutations predicted to slow it down. Two more mutations were created for which our bioinformatic analysis predicted no significant change in speed. These 12 mutations were chosen such that they represent as many different combinations of amino acids as possible and such that they can be introduced into a small number of genes. The mutations (P, G) → (E, G) and (Q, D) → (P, D) were chosen to act as positive controls, since proline is known to slow down translation when present in the P-site. (G, G) → (S, G) and (S, G) → (G, G) were implemented to test the complementarity of the mutations, i.e., if mutating the P-site from G → S causes a significant change in translation speed, does S → G have the same effect in the opposite direction? To experimentally verify whether the effects on speed of G → S and S → G also occur with more than one A-site amino acid, we carried out similar mutations with T in the A-site, i.e., the mutations (G, T) → (S, T) and (S, T) → (G, T). The rest of the mutations were chosen to represent amino acids not represented in the above mutations. The locations of these mutations were chosen such that the normalized ribosome density in the published datasets at the chosen instance of the amino acid pair is close to the median, avoiding instances at the extreme tails of the distribution. The mutations were placed on 5 non-essential, highly expressed genes across which they could be distributed among the instances of these amino acid pairs. The chosen genes were not involved in ribosome biogenesis or translation. The mutated positions on the selected genes were chosen such that they were not at functional sites or at sites subject to post-translational modifications as defined in the Saccharomyces Genome Database158. The gene names and locations of the mutations are listed in Table C.2.
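The text does not spell out how the figure of 7,980 is obtained. One counting that reproduces it, offered here purely as an assumption, is 20 original P-site amino acids × 19 alternative P-site amino acids × 21 A-site states (the 20 amino acids plus a stop codon). The Python sketch below enumerates the mutations under that assumption; the variable names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
# Assumption: the A-site can hold any amino acid or a stop codon ("*").
A_SITE_STATES = list(AMINO_ACIDS) + ["*"]

# Each mutation changes the P-site amino acid and keeps the A-site fixed,
# written as ((P_old, A), (P_new, A)) in the (P-site, A-site) notation.
mutations = [
    ((p_old, a), (p_new, a))
    for p_old, p_new, a in product(AMINO_ACIDS, AMINO_ACIDS, A_SITE_STATES)
    if p_old != p_new
]

print(len(mutations))  # 20 * 19 * 21 = 7980

# Fraction of mutations predicted to change translation speed significantly.
fraction_significant = 4134 / len(mutations)
print(round(fraction_significant, 3))
```

Under this counting, the positive control (P, G) → (E, G) appears in the list as the entry (("P", "G"), ("E", "G")).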

We denote the five mutant strains as YKL*, YMR*, YLR*, YOL* and YHR*; each strain contained mutations in a single gene: YKL096W-A, YMR122W-A, YLR109W, YOL109W and YHR179W, respectively. To assess the effect of tRNA versus amino acid identity on translation, an additional mutant strain, denoted YOL**, was created that contained the same amino acid mutations as YOL* but used a synonymous set of codons. Details of these two sets of synonymous mutations are provided in Table C.4.

Ribosome profiling was carried out in two phases. In the first phase, two replicates of each of the mutant strains YKL*, YMR*, YLR* and YOL* were subjected to ribosome profiling. For the single mutation in YKL096W-A, the mutant ribosome densities were from the two replicates of YKL*, while the wild-type ribosome densities were from YMR*, YLR* and YOL*, which contained the endogenous transcript for YKL096W-A. A similar procedure was followed for the four other mutations predicted to speed up translation and the three slowdown mutations present on these four genes.

In the second phase, Ribosome profiling was run for four replicates of YHR*, YOL*

and YOL**. YOL* and YOL** contained the same amino acid mutations but differed in

terms of the set of synonymous codons used. YHR* contained four mutations. Two

mutations were the negative control mutations. The other two were mutations predicted to

slow down translation, bringing the total number of slowdown mutations to 5. The 8 samples (4

replicates each of YOL* and YOL**) were used as wild type for mutations in YHR179W

while the 4 replicates of YHR* served as wild-type samples containing the endogenous

YOL109W gene against the YOL109W mutations in YOL* and YOL**. The number of

replicates was increased to 4 in the second phase to provide a sufficient sample size for a

valid statistical test. The normalized ribosome densities were not compared across the

two phases as the Ribo-Seq samples prepared on different days show poor correlation of

ribosome densities at the codon level (Figure C.8).

C.1.1.2 Strain Construction of mutants

A two-step procedure omitting selection markers in the final construct was used for mutant

strain construction. First, the gene of interest was replaced in the strain BY4741 by a

K. lactis URA3 cassette according to Janke et al.174. Second, the desired mutant gene

enclosing overhangs (45nt/60nt) was constructed by PCR and used to replace the

introduced URA3 cassette by homologous recombination. Candidates were selected on

5-Fluoroorotic Acid (5-fluorouracil-6-carboxylic acid monohydrate; 5-FOA) containing

plates and insertion of the correct mutations was verified by colony PCR and DNA

sequencing.

C.1.1.3 Ribosome profiling and library preparation

200 mL of cells were grown in YPD to an OD600 nm of 0.5, rapidly filtered (All-Glass Filter

90mm, Millipore), flash frozen in liquid nitrogen, mixed with 600 µL frozen lysis buffer (20

mM Tris-HCl pH 8.0, 140 mM KCl, 6 mM MgCl2, 0.1% NP-40, 0.1 mg/ml CHX, 1 mM

PMSF, 2x Complete EDTA-free protease inhibitors (5056489001, Roche), 0.02 U/ml

DNase I (4716728001, Roche), 20 mg/mL leupeptin, 20 mg/mL aprotinin, 10 mg/mL E-64,

40 mg/mL bestatin) and pulverized by mixer milling (2 min, 30 Hz, MM400, Retsch).

Thawed cell lysates were cleared by centrifugation (20,000xg, 5 min, 4°C) and digested

by RNase I (AM2295, Ambion; 125 U/1mg nucleic acid) for 1 hr (25°C, 650 rpm) to obtain

ribosome footprints. The reaction was stopped by adding 10 µl SUPERase-In (AM2696,

Ambion). The digested lysate was loaded onto 10-50% (w/v) sucrose gradients and

centrifuged at 35,000 rpm, 4°C for 2.5 hrs. Gradients were fractionated and monosome

fractions were collected, pooled and used for RNA purification by hot acid-phenol

extraction. 5 µg of purified RNA was depleted for rRNA using the Ribo-Zero Gold for Yeast

kit (MRZY1306, Illumina). Deep sequencing libraries were prepared following the protocol

described in Döring et al.159 and sequenced on a HiSeq 2000 (Illumina).

C.1.2 Computational analyses of Ribo-Seq data

C.1.2.1 Analysis of Ribo-Seq datasets

Wild-type Ribo-Seq datasets were obtained from five different published

studies9,41,111,113,114 whose accession numbers are provided in Table C.1. The raw reads

for each of these published datasets were preprocessed according to the steps specified

in the Methods of the respective study. The sequenced reads for all Ribo-Seq datasets of

mutant strains prepared for this study were subject to a uniform preprocessing step. The

raw reads were first trimmed of their 3′ adapter sequence

CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTG using cutadapt v1.14115. The

reads were also subject to a quality filter of at least 20 during the cutadapt run.

For all downstream analyses, a uniform protocol was used as specified below.

Preprocessed reads were first mapped to a set of ribosomal RNA sequences using

Bowtie2 109 and then subsequently the unmapped reads were mapped to the rest of S.

cerevisiae reference genome sacCer3 using Tophat2 110. Custom python scripts were

implemented for all downstream analyses. Mapped reads were first quantified by their 5′

ends on individual gene transcripts and A-site positions were assigned according to Table

2.1 that is also published in Ahmed et al.154. To maintain the accuracy of read assignment,

transcripts in which multiple mapped reads constitute more than 0.1% of the reads

mapped to the CDS region were not considered in the analysis. To minimize noise and

increase the confidence in the results, we restrict our dataset to only transcripts that have

at least 3 reads mapped at every codon position. Applying this filter, we obtain 364 genes

for Williams’ dataset114, which has the highest coverage among the published datasets.

Hence, we use this dataset for all downstream analyses.
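As an illustration, the two transcript-level filters described above could be sketched in Python as follows. This is a minimal sketch, not the script used in this study: the function name and argument layout are hypothetical, and we assume that the multi-mapped read count is expressed as a fraction of the reads mapped to the CDS.

```python
def passes_coverage_filters(codon_reads, multimapped_reads,
                            min_reads_per_codon=3,
                            max_multimapped_fraction=0.001):
    """Return True if a transcript meets both filters described in the text.

    codon_reads: list of read counts, one per codon position of the CDS.
    multimapped_reads: number of multi-mapped reads on the CDS (assumed input).
    """
    total = sum(codon_reads)
    if total == 0:
        return False
    # Discard transcripts where multi-mapped reads exceed 0.1% of CDS reads.
    if multimapped_reads / total > max_multimapped_fraction:
        return False
    # Require at least 3 reads at every codon position.
    return all(r >= min_reads_per_codon for r in codon_reads)


kept = passes_coverage_filters([3, 5, 4], multimapped_reads=0)
low_coverage = passes_coverage_filters([3, 2, 4], multimapped_reads=0)
ambiguous = passes_coverage_filters([1000, 1000, 1000], multimapped_reads=10)
```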

C.1.2.2 Estimation of translation speed change for amino acid pairs

The normalized ribosome density 𝜌 for every codon position 𝑗 in transcript 𝑖 is calculated

by dividing the number of mapped Ribo-Seq reads 𝑅𝑗,𝑖 by the average number of reads mapped per codon to

the transcript 𝑖 consisting of 𝑁𝐶,𝑖 codons.

$$\rho_{j,i} = \frac{R_{j,i}}{\sum_{k} R_{k,i} / N_{C,i}} \qquad \text{[Eq. C.1]}$$
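A minimal Python sketch of Eq. C.1 for a single transcript is given below. The function name is hypothetical, and the read counts in the example are invented purely for illustration.

```python
def normalized_density(codon_reads):
    """Eq. C.1: divide each codon's read count by the transcript's mean count."""
    n_codons = len(codon_reads)
    total = sum(codon_reads)
    if n_codons == 0 or total == 0:
        raise ValueError("transcript has no codons or no mapped reads")
    mean_reads = total / n_codons  # sum_k R_k / N_C
    return [r / mean_reads for r in codon_reads]


# Invented read counts for a four-codon transcript (mean = 6 reads per codon).
rho = normalized_density([3, 6, 9, 6])
```

By construction, the normalized densities of a transcript average to 1, so values above 1 indicate positions with higher than average ribosome occupancy.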

𝜌 values are binned into an individual distribution for every amino acid pair (X, Z) where X

is in the P-site and Z is in the A-site. The distribution [𝜌(𝑋, 𝑍)] is populated by 𝜌(𝑗, 𝑖) for

each codon position 𝑗 in transcript 𝑖 such that the amino acids encoded at codon positions 𝑗 − 1 and 𝑗 are (𝑋, 𝑍). The terms 𝑗 and 𝑖

are dropped from [𝜌(𝑋, 𝑍)] since this is an aggregated distribution of all instances of 𝜌𝑗,𝑖

for the amino acid pair (X, Z). The speedup or slowdown of translation caused by an amino

acid pair is estimated using the percent change in median of the distribution [𝜌(𝑋, 𝑍)] as

compared to median of the distribution [𝜌(~𝑋, 𝑍)]

$$\text{Percent change} = \frac{\text{Median}[\rho(X,Z)] - \text{Median}[\rho(\sim X,Z)]}{\text{Median}[\rho(\sim X,Z)]} \times 100\,\% \qquad \text{[Eq. C.2]}$$

(~X, Z) represents the set of all pairs of amino acids where Z is in the A-site and X is not

in the P-site. A positive percent change (red shades in Figure 4.1b) will indicate that for

amino acid pair (X, Z), presence of X is leading to slower translation (higher values of 𝜌)

of Z as compared to when X is not present in the P-site. A negative percent change (green

shades in Figure 4.1b) will indicate Z is translated faster when X is present in the P-site

as compared to when X is not present in the P-site.
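The binning of 𝜌 by amino acid pair and the percent change of Eq. C.2 can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the sequence and densities are invented, and a single transcript is used where the actual analysis pools all transcripts.

```python
from collections import defaultdict
from statistics import median


def bin_by_pair(protein, rho):
    """Group rho_j by (P-site, A-site) amino acid pair; j indexes the A-site."""
    dist = defaultdict(list)
    for j in range(1, len(protein)):
        dist[(protein[j - 1], protein[j])].append(rho[j])
    return dist


def percent_change(dist, x, z):
    """Eq. C.2: median of [rho(X,Z)] relative to the median of [rho(~X,Z)]."""
    pair = dist[(x, z)]
    # (~X, Z): every pair with Z in the A-site and anything but X in the P-site.
    others = [v for (p, a), vals in dist.items() if a == z and p != x
              for v in vals]
    m_other = median(others)
    return (median(pair) - m_other) / m_other * 100.0


# Invented example: one short sequence with one density value per residue.
dist = bin_by_pair("NRSRGR", [1.0, 2.0, 1.0, 1.0, 1.0, 0.5])
change_nr = percent_change(dist, "N", "R")
```

In this toy example the pair (N, R) has a density of 2.0 against a median of 0.75 for the other pairs with R in the A-site, giving a positive percent change, i.e., a slowdown.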

The distribution [𝜌(𝑋, 𝑍)] is plotted in Figure 4.1d for two pairs (N, R) and (S, R). 𝜌 is plotted

along the X-axis and the probability density 𝑃(𝜌), plotted on the Y-axis, is calculated

as

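$$P(\rho(X,Z)) = \frac{\sum_{i}\sum_{j=2} \rho_{(j-1,j)=(X,Z)} \left(\Theta(\rho_{j,i}-\delta\rho) - \Theta(\rho_{j,i}+\delta\rho)\right)}{\sum_{i}\sum_{k=2} \rho_{(k-1,k)=(X,Z)}} \qquad \text{[Eq. C.3]}$$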

where 𝛿𝜌 is the bin width of the histogram. Θ(𝜌𝑗,𝑖 − 𝛿𝜌) and Θ(𝜌𝑗,𝑖 + 𝛿𝜌) are

Heaviside step functions that determine whether the term 𝜌𝑗,𝑖 is included in

a particular bin of width 𝛿𝜌.

An odds measure is calculated for whether a P-site mutation from (X,Z) → (B,Z)

will result in a change of translation rate. First, the differences in normalized ribosome density

between every instance of one distribution and every instance of the other are calculated. The odds is the

ratio of the number of positive differences to the number of negative differences if we predict a

slowdown, and the inverse ratio if we predict a speedup of translation.

$$\text{Odds}_{\text{Translation rate change}} = \frac{N_{\{\rho(X,Z)-\rho(B,Z)\}>0}}{N_{\{\rho(X,Z)-\rho(B,Z)\}<0}} \qquad \text{[Eq. C.4]}$$
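The odds measure of Eq. C.4 can be sketched in Python as below. The function name and the `predict_slowdown` flag are hypothetical; the differences are taken as 𝜌(X,Z) − 𝜌(B,Z), following the equation as written, and the density values are invented.

```python
from itertools import product


def odds_of_rate_change(rho_xz, rho_bz, predict_slowdown=True):
    """Eq. C.4: ratio of positive to negative pairwise differences.

    The ratio is inverted when a speedup is predicted, as stated in the text.
    """
    diffs = [a - b for a, b in product(rho_xz, rho_bz)]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    num, den = (n_pos, n_neg) if predict_slowdown else (n_neg, n_pos)
    if den == 0:
        return float("inf")  # every difference has the predicted sign
    return num / den


# Invented densities: 3 x 2 = 6 pairwise differences, 5 positive and 1 negative.
odds_slow = odds_of_rate_change([2.0, 3.0, 4.0], [1.0, 2.5])
odds_fast = odds_of_rate_change([2.0, 3.0, 4.0], [1.0, 2.5],
                                predict_slowdown=False)
```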

C.1.2.3 Reproducibility of the trends across different datasets

Normalized ribosome density profiles were calculated from the six Ribo-Seq data sets for

the 364 high coverage genes identified in the Williams dataset114. [𝜌(𝑋, 𝑍)] and [𝜌(~𝑋, 𝑍)]

were then calculated for all pairs of amino acids in the P- and A-sites. Instances of zero

A-site reads were not included in the distributions. The percent change in median

normalized ribosome density 𝜌 is considered to be reproducible if it is positive (slowdown)

or negative (speedup) in all 6 datasets and statistically significant in at least 4 out of the 6

datasets. It is these reproducible results that are reported in Figure 4.1b.
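The reproducibility criterion can be expressed compactly as in the sketch below. The function name is hypothetical, and the significance threshold of 0.05 is an assumption because the threshold is not stated in this section.

```python
def is_reproducible(percent_changes, p_values, alpha=0.05, min_significant=4):
    """Same sign in every dataset and significant in at least 4 of them.

    alpha = 0.05 is an assumed threshold, not one taken from the text.
    """
    same_sign = (all(c > 0 for c in percent_changes)
                 or all(c < 0 for c in percent_changes))
    n_significant = sum(p < alpha for p in p_values)
    return same_sign and n_significant >= min_significant


# Invented values for one amino acid pair across six datasets.
consistent = is_reproducible([12, 8, 15, 9, 20, 5],
                             [0.01, 0.02, 0.001, 0.2, 0.03, 0.6])
sign_flip = is_reproducible([12, -8, 15, 9, 20, 5],
                            [0.01, 0.02, 0.001, 0.01, 0.03, 0.01])
```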

C.1.2.4 Percent contribution of amino acid vs tRNA

In S. cerevisiae, there are 41 unique tRNA molecules (excluding the start codon tRNAMet)

encoding the 20 amino acids of the nascent polypeptides. In P- and A-sites of the

ribosome, the two neighboring amino acids interact to form a peptide bond and the two

tRNAs may also interact, possibly influencing the translation speed at the A-site. To

distinguish these two effects, we derive an equation that can quantify the percent

contribution of amino acid and tRNA to the translation speed. For a comparison of amino

acid pairs X-Z and B-Z, the percent contribution of tRNA is given by

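$$\%\ \text{contribution of tRNA} = \frac{\left[\left(\mathrm{Max}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right) - \left(\mathrm{Min}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Max}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right)\right] \times 100}{\mathrm{Max}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}} \qquad \text{[Eq. C.5]}$$

subject to the condition

$$\mathrm{Max}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} \geq \mathrm{Min}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} \geq \mathrm{Max}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\} \geq \mathrm{Min}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}$$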

% 𝑐𝑜𝑛𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛 𝑜𝑓 𝐴𝑚𝑖𝑛𝑜 𝑎𝑐𝑖𝑑 = 100 − % 𝑐𝑜𝑛𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛 𝑜𝑓 𝑡𝑅𝑁𝐴. [Eq. C.6]

The sets {𝐵𝑡1, 𝐵𝑡2 … . , 𝐵𝑡𝑚} and {𝑋𝑡1, 𝑋𝑡2 … . , 𝑋𝑡𝑛} are composed of the 𝑚 and 𝑛 different

tRNAs covalently attached to amino acids B and X, respectively. The percent contribution

can only be computed when the median normalized ribosome densities of individual

tRNAs of amino acids X do not overlap with those of individual tRNAs of B. In cases where

there is an overlap, it is not possible to apply Eq. C.5; such cases are therefore excluded from

this analysis.

As an example, consider the case in which the change in tRNA identity contributes

100% to the corresponding change in translation speed. In such a case there will be a tRNA

for X whose normalized ribosome density $\rho_{X_{tx}}$ is equal to the normalized ribosome

density of a tRNA of B, $\rho_{B_{tb}}$. Since there is no overlap of the normalized ribosome

densities, $\mathrm{Min}_\rho\{B\} = \rho_{B_{tb}}$ and $\mathrm{Max}_\rho\{X\} = \rho_{X_{tx}}$. This results in $\mathrm{Min}_\rho\{B\} = \mathrm{Max}_\rho\{X\}$, and

applying this to Eq. C.5 gives

$$\%\ \text{contribution of tRNA} = \frac{\left[\left(\mathrm{Max}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right) - (0)\right] \times 100}{\mathrm{Max}_\rho\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_\rho\{X_{t1}, X_{t2}, \ldots, X_{tn}\}} = 100\,\%.$$

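Now consider the other end of the spectrum, in which the tRNA identity contributes nothing

to the translation speed change and only the amino acid identity matters. In such a case, the

median normalized ribosome densities will be equal for all tRNAs of X and similarly for all

tRNAs of B. Hence,

$$\mathrm{Max}_\rho\{B\} = \mathrm{Min}_\rho\{B\} \quad \text{and} \quad \mathrm{Max}_\rho\{X\} = \mathrm{Min}_\rho\{X\}$$

Rearranging the terms in Eq. C.5

$$\%\ \text{contribution of tRNA} = \frac{\left[\left(\mathrm{Max}_\rho\{B\} - \mathrm{Min}_\rho\{B\}\right) - \left(\mathrm{Min}_\rho\{X\} - \mathrm{Max}_\rho\{X\}\right)\right] \times 100}{\mathrm{Max}_\rho\{B\} - \mathrm{Min}_\rho\{X\}} = 0\,\%.$$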
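The two limiting cases above can be checked numerically with a short Python sketch of Eq. C.5. The function name is hypothetical and the median densities are invented; the function returns `None` when the non-overlap condition is violated, mirroring the exclusion of such cases from the analysis.

```python
def percent_trna_contribution(rho_b, rho_x):
    """Eq. C.5 for per-tRNA median densities of amino acids B and X.

    Returns None when the tRNA densities of B and X overlap, in which case
    the equation is not applicable.
    """
    max_b, min_b = max(rho_b), min(rho_b)
    max_x, min_x = max(rho_x), min(rho_x)
    if min_b < max_x:
        return None  # condition Max{B} >= Min{B} >= Max{X} >= Min{X} violated
    span = max_b - min_x
    if span == 0:
        return None  # no difference between B and X to apportion
    return ((max_b - min_x) - (min_b - max_x)) * 100.0 / span


# Limiting case 1: the slowest tRNA of B equals the fastest tRNA of X.
trna_only = percent_trna_contribution([1.2, 1.5], [0.9, 1.2])
# Limiting case 2: all tRNAs of an amino acid share the same density.
amino_acid_only = percent_trna_contribution([1.5, 1.5], [0.9, 0.9])
# Overlapping densities: Eq. C.5 cannot be applied.
overlapping = percent_trna_contribution([1.0, 1.5], [0.9, 1.2])
```

The percent contribution of the amino acid then follows from Eq. C.6 as 100 minus the returned value.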

C.1.2.5 Enrichment/Depletion of amino acid pairs

Enrichment/Depletion of an amino acid pair across the proteome of S. cerevisiae is

calculated by dividing the observed probability of finding the amino acid pair by the

probability expected of forming these pairs by random chance, that is the product of the

probabilities of the individual amino acids across the proteome. This ratio is a measure of

enrichment/depletion of the amino acid pair which we call the enrichment score. To test

for evolutionary selection of amino acid pairs that significantly influence translation speed,

the top 20% (80 out of 400 amino acid pairs) of amino acid pairs are taken that have the

highest enrichment scores (highly enriched across the transcriptome) and the bottom 20%

of amino acid pairs with the lowest enrichment scores (highly depleted across the

transcriptome). Fisher’s exact test is used to test the hypothesis that the fast pairs are

more likely to be enriched while slow pairs are more likely to be depleted across the S.

cerevisiae transcriptome. The odds ratio of fast pairs being enriched and slow pairs being

depleted is calculated as shown below:

$$\text{Odds ratio} = \frac{\text{Enriched}(\text{Fast amino acid pairs}) \times \text{Depleted}(\text{Slow amino acid pairs})}{\text{Enriched}(\text{Slow amino acid pairs}) \times \text{Depleted}(\text{Fast amino acid pairs})} \qquad \text{[Eq. C.7]}$$
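The enrichment score and the odds ratio of Eq. C.7 can be sketched as follows. The function names are hypothetical, the two toy sequences stand in for the proteome, and the counts passed to `odds_ratio` are invented for illustration rather than taken from the results.

```python
from collections import Counter


def enrichment_scores(proteins):
    """Observed pair probability divided by the product of single-residue
    probabilities, for every adjacent amino acid pair in the sequences."""
    aa_counts = Counter()
    pair_counts = Counter()
    for seq in proteins:
        aa_counts.update(seq)
        pair_counts.update(zip(seq, seq[1:]))
    n_aa = sum(aa_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (x, z), n in pair_counts.items():
        observed = n / n_pairs
        expected = (aa_counts[x] / n_aa) * (aa_counts[z] / n_aa)
        scores[(x, z)] = observed / expected
    return scores


def odds_ratio(enriched_fast, depleted_slow, enriched_slow, depleted_fast):
    """Eq. C.7 computed from the four counts of the 2 x 2 contingency table."""
    return (enriched_fast * depleted_slow) / (enriched_slow * depleted_fast)


scores = enrichment_scores(["GSGS", "SGSG"])  # toy stand-in for the proteome
example_or = odds_ratio(30, 25, 10, 15)       # invented counts
```

A score above 1 indicates that a pair occurs more often than expected by chance; in the toy input, G and S always alternate, so (G, S) is twofold enriched.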

C.1.2.6 Classification of downstream linker and domain regions

A database of 864 S. cerevisiae proteins with annotated domain boundaries was created.

The domain boundaries were identified based on the criteria from Ciryam et al.15 that were

used to identify domain boundaries in E. coli.

The hypothesis of co-translational folding influenced by translation kinetics states

that once the entire domain has exited the ribosome exit tunnel, translation occurs

slowly to allow the domain to fold co-translationally80. An additional study14 also states that

while a partial domain region of the nascent polypeptide is exiting the ribosome exit

tunnel, translation is likely to occur faster to avoid misfolded intermediates. To test

whether the enrichment of our fast and slow amino acid pairs contributes to this

faster/slower translation, we first identify pairs of domain and linker regions which meet

the criteria below. In our analysis, we define the domain region as the region being

translated when the nascent chain segment constituting the domain is outside the exit

tunnel. The growing nascent chain moves through the ribosome’s exit tunnel that usually

contains thirty residues. Hence, the first residue of the domain will appear outside the exit

tunnel when the 31st residue is being translated in the A-site. To account for these 30

residues in the exit tunnel, we consider our translated domain region to begin from 31st

residue of the defined domain start and end 30 residues downstream of the domain which

is when the most C-terminal residue of the domain appears outside the ribosome exit

tunnel. The second region which we call linker region with a fixed window size begins

when the entire domain has emerged from the ribosome exit tunnel which is the 31st

residue downstream of the end of the domain and not overlapping with the next domain.

Based on the window size, we determine the linker region to be from the 31st residue. If

the linker is shorter than the sum of 30 residues in the tunnel and window size, then this

domain-linker pair is not considered in the analysis for this window size. The

enrichment/depletion of fast and slow amino acid pairs is calculated in the linker region

relative to the domain region and the statistical significance is estimated using a random

permutation test. The amino acid pairs are kept intact while they are randomly permuted

across a pooled set of domain and linker regions. For each iteration of the permutation

test, the probability of fast and slow amino acid pairs is calculated across both the

downstream regions. Error bars are calculated using Bootstrapping 120.
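One possible reading of the region definitions above is sketched below in Python. This is an interpretation rather than the implementation used in this study: the function name is hypothetical, residues are numbered from 1, and the linker length is assumed to be the number of residues between the end of one domain and the start of the next.

```python
TUNNEL_LENGTH = 30  # residues assumed to be held inside the exit tunnel


def translated_regions(domain_start, domain_end, next_domain_start, window):
    """Return ((domain_first, domain_last), (linker_first, linker_last)) as
    A-site residue numbers, or None if the linker is too short."""
    linker_length = next_domain_start - domain_end - 1
    if linker_length < TUNNEL_LENGTH + window:
        return None  # domain-linker pair is skipped for this window size
    # The domain emerges while residues start+30 .. end+30 are being decoded.
    domain_region = (domain_start + TUNNEL_LENGTH, domain_end + TUNNEL_LENGTH)
    # The linker window starts at the 31st residue after the domain end.
    linker_region = (domain_end + TUNNEL_LENGTH + 1,
                     domain_end + TUNNEL_LENGTH + window)
    return domain_region, linker_region


# Invented boundaries: domain at residues 10-100, next domain starts at 160.
regions = translated_regions(10, 100, 160, window=20)
too_short = translated_regions(10, 100, 140, window=20)
```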

C.1.2.7 Enrichment/depletion of amino acid pairs in Ssb-bound translated regions

In a previous study159, regions of mRNA were identified that are translated by the ribosome

when Hsp70 chaperone Ssb is bound to the nascent polypeptide chain and when it is not

bound. The Fold Enrichment (FE) measure is the ratio of selective Ribo-Seq reads to

Ribo-Seq reads; the positions along an mRNA transcript where FE is above a threshold

define the regions that are translated when Ssb is bound. We define 5 thresholds based

on the percentile values of FEs from the Cumulative Distribution Function of FE values

(see Figure S6 from study of Döring and co-workers159). For example, for thresholds (P80,

P20), every nucleotide position with a FE value greater than P80 was classified as Ssb-

bound translated region (B region) while every position with a FE value lower than P20

was defined as Ssb-unbound translated region (UB region). The rest of the nucleotides

with FE values between P20 and P80 were ignored. Similarly, 5 thresholds are defined to

represent the increasing differential of Ssb binding strength as we move from left to right on the

X-axis in Figure C.6.
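The percentile-based classification for a single threshold pair, here (P80, P20), can be sketched as follows. The function name and labels are hypothetical, the FE values are invented, and the use of the "inclusive" percentile definition from Python's `statistics` module is an assumption, since the percentile convention is not specified here.

```python
from statistics import quantiles


def classify_positions(fe_values, upper_pct=80, lower_pct=20):
    """Label each position 'B' (Ssb-bound), 'UB' (Ssb-unbound) or None."""
    cuts = quantiles(fe_values, n=100, method="inclusive")  # P1 .. P99
    upper = cuts[upper_pct - 1]
    lower = cuts[lower_pct - 1]
    labels = []
    for fe in fe_values:
        if fe > upper:
            labels.append("B")
        elif fe < lower:
            labels.append("UB")
        else:
            labels.append(None)  # positions between the thresholds are ignored
    return labels


# Invented FE values at eleven nucleotide positions.
labels = classify_positions([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
```

With these values the thresholds are P80 = 9 and P20 = 3, so the two highest positions are labeled as Ssb-bound and the two lowest as Ssb-unbound.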

In the study of Döring and co-workers159, it was shown translation was faster, on

average, in the Ssb-bound translated regions relative to Ssb-unbound regions. The

presence or absence of several molecular factors, such as downstream mRNA secondary

structure, optimal codons, and proline content were shown to correlate with the increased

translation speed in the Ssb-bound regions. To test whether amino acid pairs are also

contributing as a molecular factor, we test the hypothesis that fast amino acid pairs are

enriched and slow amino acid pairs are depleted across Ssb-bound translated regions

relative to Ssb-unbound translated regions. A permutation test is used to measure the

statistical significance of the percent change of probabilities of fast and slow amino acid

pairs in Ssb-bound translated region (B region) relative to Ssb-unbound translated regions

(UB region), which is plotted on the Y-axis in Figure C.6. Error bars are calculated using

Bootstrapping 120.

C.1.2.8 Miscellaneous details

(1) The Mann-Whitney U test is used to assess statistical significance between the normalized

ribosome density distributions of amino acid pairs. The p-values are corrected with the

Benjamini-Hochberg FDR method for multiple-hypothesis correction.

(2) The confounding molecular factors being controlled in Figure C.2 are defined below.

The instances of amino acid pairs containing the concerned molecular factor are not

considered in the analysis in which the molecular factor is controlled. The five factors we

consider are [a] mRNA secondary structure 4 to 6 codons downstream of A-site, [b]

Positively charged residues 2 to 5 residues upstream of A-site, [c] Presence of sequence

motifs like PPX, XPP and motifs identified by Schuller et al. in wild-type S. cerevisiae65,

[d] Instances containing non-optimal codons of A-site amino acid as defined by Pechmann

and Frydman48 and [e] A-site and P-site codon instances which are decoded through

Wobble base-pairing. The Mann-Whitney U test is used to compare probabilities of each of

these five factors to determine whether there is any bias towards one distribution such that the

statistically significant difference can be attributed to the confounding factor.

(3) The sample size of mutant strains is either 2 or 4 and hence application of Mann-

Whitney U test is not feasible. An exact Fisher-Pitman permutation test is used which

overcomes the problem of low sample size. This test is applied to determine the statistical

significance of the difference between the normalized ribosome densities of mutant and

wild type strains. For non-overlapping distributions with 𝑛 = 2 and 𝑛 = 4 in one

distribution, the p-value equals 0.036 and 0.002 respectively. Hence, we get the same p-

values for all 8 comparisons with 𝑛 = 2 and 4 comparisons with 𝑛 = 4 of normalized

ribosome densities in mutant versus wild-type in Figure 4.2.
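Two of the procedures in items (1) and (3) are simple enough to sketch in pure Python: the Benjamini-Hochberg adjustment and an exact permutation test on two small samples. Both functions are illustrative sketches with hypothetical names and invented input values. The permutation test is written as one-sided, and the wild-type sample sizes of 6 and 8 are assumptions based on the experimental design described above.

```python
from itertools import combinations


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def exact_permutation_p(mutant, wild_type):
    """One-sided exact permutation p-value for mutant > wild type.

    With fixed group sizes, ranking relabelings by the mutant-group sum is
    equivalent to ranking them by the difference in group means.
    """
    pooled = list(mutant) + list(wild_type)
    observed = sum(mutant)
    n_extreme = 0
    n_total = 0
    for idx in combinations(range(len(pooled)), len(mutant)):
        n_total += 1
        if sum(pooled[i] for i in idx) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])

# Non-overlapping invented densities: 2 mutant vs 6 wild-type samples.
p_n2 = exact_permutation_p([1.8, 2.0], [0.9, 1.0, 1.1, 1.0, 0.95, 1.05])
# Non-overlapping invented densities: 4 mutant vs 8 wild-type samples.
p_n4 = exact_permutation_p([1.8, 2.0, 1.9, 2.1],
                           [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.02, 0.98])
```

For fully separated samples only one of the 28 (or 495) possible relabelings is as extreme as the observed one, which gives 1/28 ≈ 0.036 and 1/495 ≈ 0.002 under these assumptions.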

C.2 Supplementary Figures

Figure C.1. The percent change in median normalized ribosome density 𝝆 for a given pair

of amino acids in the P-site and A-site, relative to any other amino acid being in the P-site

(Eq. C.2). The published datasets are listed based on the name of the first author of the study,

namely, Williams114 (a), Jan113 (b), two biological replicates from study of Nissley9 (c, d),

Weinberg41 (e) and Young111 (f). The accession numbers of these samples are listed in Table

C.1. The legend is same as in Figure 4.1b.

Figure C.2. The sign of the percent change in ribosome density (Eq. C.2) for the fast and slow

translating amino acid pairs remains the same after controlling for different molecular factors

known to influence translation speed. (a) 84 fast translating pairs (dark blue) and 73 slow translating

pairs (dark orange) shown in Figure 4.1b not controlling for any molecular factors. (b) The 157

significant amino acid pairs after controlling for downstream mRNA secondary structure. The direction

of the median speed change for all 157 pairs remains the same but 72 (46%) pairs lose statistical

significance (light orange and light blue colors for slow and fast, respectively) after controlling for the

factor. (c) Same analysis as (b) but controlling for positively charged residues present upstream of the

P-site. The direction of the speed change for all 157 pairs remains the same, 42 (27%) pairs lose

statistical significance. (d) Same analysis as (b) but controlling for non-optimal codons in the A-site.

The direction of the median speed change for 156 out of 157 pairs remains the same but 53 (34%) pairs

lose statistical significance. (e) Similar analysis as (b) but controlling for stalling motifs. The direction

of the speed change for all 157 pairs remains the same, 4 pairs (2.5%) lose statistical significance. The

loss of statistical significance is primarily due to a decrease in the sample size after filtering out of

instances of the given molecular factor from the [𝜌(𝑋, 𝑍)] and [𝜌(~𝑋, 𝑍)] distributions that are compared.

Figure C.3. The ribosome profiling data for all the mutant strains demonstrate consistent

fragment size distribution, strong 3 nt periodicity, robust frame distribution and high pairwise

correlation of individual transcript’s ribosome profiles. (a) Fragment size distribution of reads

mapped to the CDS regions by 5′ end including 50 nt region upstream of start codon for the mutant

strains created in this study. YOL* strain was subjected to ribosome profiling twice on different days

denoted as YOL*-I and YOL*-II for phases I and II respectively (see Methods for details). (b) The

distribution of reads of fragment size 28 whose 5′ end have aligned to reading frame 0, 1 or 2 for all

mutant strains. (c) The pairwise correlation of ribosome profiles for 108 high coverage genes that have

at least 3 reads at every codon position. The pairwise correlation is carried out only between samples

prepared during the same phase (See Methods for details). The median Pearson r is 0.96 indicating

very high correlation between ribosome profiles of genes across different samples. (d) Meta-gene

profile of normalized read counts for fragments of size 28 mapped by the 5′ end and plotted in a 100

nt region starting from -18 nucleotide position with respect to first nucleotide of start codon up to

nucleotide position 82. For analyses in (a), (b) and (d), for all mutant strains, the normalized read counts

were averaged across all replicates.

Figure C.4. Ribosome profiles of mutant and wild-type strains are highly correlated. (a-e) The

normalized ribosome density 𝜌 in the mutated gene, averaged over all replicates for mutant and wild-type

reference samples are correlated and plotted on the X and Y-axis respectively. The normalized ribosome

density at the codon position which is in A-site when the mutated position is in P-site is shown in red. In

all cases, the median Pearson R between the individual replicates is greater than 0.92.

Figure C.5. Optimal and non-optimal codons are equally distributed between the domain

and linker regions of proteins for both fast- and slow-translating amino acid pairs. Probability

of observing optimal or non-optimal codons in either domain or linker regions, and whether the

codons are part of fast- or slow-translating amino acids identified in Figure 4.1b. Comparison of

domain versus linker regions for both optimal and non-optimal codons and also for optimal vs non-

optimal codons within both domain and linker regions in slow pairs shows that they are equally

distributed (p-values > 0.05, Wilcoxon signed-rank test, corrected for multiple testing by the

Bonferroni method).

Figure C.6. Fast-translating amino acid pairs are enriched in those

transcript segments that are being translated when the chaperone Ssb is

bound to the nascent chain. As done in a previous publication159, 5 different

thresholds listed on the x-axis were defined based on the percentile values of the

Cumulative Distribution Function of Fold Enrichment (FE) metric (see Methods

for details). For each of these thresholds, Ssb-bound translating regions (B

segments) were defined by the nucleotide positions that have FE values greater

than the upper threshold (e.g., P75 for threshold (P75, P25)) while Ssb-unbound

translating regions (UB segments) are defined by the nucleotide positions with FE

less than the lower threshold (e.g., P25 for threshold (P75, P25)). For fast-translating

amino acid pairs (green) and slow-translating amino acid pairs (red), the percent

change in probability of finding these pairs in the B segments relative to UB

segments is reported on the y-axis. A positive percent change indicates

enrichment and negative percent change indicates depletion of the pairs.

Significance of enrichment/depletion is calculated using the random permutation

test (***: p < 0.0001, **: p < 0.01, *: p < 0.05). Error bars represent 95% CI and

were estimated using the Bootstrapping method. These results indicate that the

fast translating pairs of amino acids are enriched in those segments being

translated when Ssb is bound to the ribosome nascent chain complex.

Figure C.7. Translation speed differences are not explained by wobble decoding in the P- and

A-sites. (a) After controlling for all wobble base pairing tRNAs in both P- and A-sites, the sign of the

percent change in median normalized ribosome density 𝜌 (Eq. C.2) remains the same in 156 out of

157 of the fast and slow translating pairs identified in Figure 4.1b. The coloring scheme is the same as in

Figure C.2. (b) The percent difference in medians of the distribution of normalized ribosome density 𝜌

of all possible 7,980 amino acid pair mutations to the P-site is plotted on the X-axis with its statistical

significance being represented by the negative log of p-value plotted on the Y-axis. These distributions

are compared after controlling for instances of amino acid pairs where either the P-site or the A-site

codon is decoded through a Wobble base pairing mechanism. 2,758 amino acid pair mutations are

statistically significant. (c) After filtering for Wobble base pairing, the probability density distribution of

the odds of mutating any instance of P-site amino acid resulting in a significant change of translation

rate is plotted. The median odds are 1.72, similar to the 1.68 value observed in the original dataset

(Figure 4.1e).

Figure C.8. Samples prepared in the same phase (single batch on same day) exhibit higher

correlations than samples prepared in different phases. For 118 genes having at least 3 reads

per codon in the highest coverage replicate of YOL* mutant strain prepared in Phase I, pairwise

correlations are run between the normalized ribosome densities across the CDS in this YOL*

replicate with the highest coverage replicate of all other mutant strains. The boxplot of R² values is

plotted here for correlating the normalized ribosome density profiles of 118 genes of sample YOL*

from Phase I with mutant strains from Phase I and II. YOL* mutant strain prepared in Phase I has

highly correlated ribosome densities at individual codon positions with mutant strains YMR*, YKL*

and YLR* which were also prepared in Phase I. The correlation of codon level ribosome density for

118 YOL* genes was lower when correlated with ribosome profiles of strains YHR*, YOL* and YOL**

prepared in Phase II. Hence, wild-type replicates were chosen for our mutations (Figure 4.2) such

that they have been prepared in the same phase.

C.3 Supplementary Tables

Table C.1. Ribo-Seq data were obtained from five different published studies. The study and the sample

accession numbers are listed below.

| Dataset (first author name) | Year of publication | Number of replicates | GEO Study | Accession numbers of samples used |
|---|---|---|---|---|
| Jan | 2014 | 1 | GSE61012 | GSM1495525 |
| Williams | 2014 | 1 | GSE61011 | GSM1495503 |
| Young | 2015 | 1 | GSE69414 | GSM1700885 |
| Weinberg | 2016 | 1 | GSE53268 | GSM1289257 |
| Nissley | 2016 | 2 | GSE75322 | GSM1949550, GSM1949551 |

Table C.2. Details on the 12 single amino acid mutations that were made across 5 different

genes. For a given amino acid pair in the A- and P-sites, the columns, from left to right, report the

amino acid in the P-site (‘P-site’), the amino acid that the P-site is mutated to (‘Mutated P-site’), the

amino acid in the A-site (‘A-site’), the Gene name (‘Gene’), the A-site codon number (‘A-site codon

No.’), the wild-type and mutated codon in the P-site (P-site codon and Mutated P-site codon,

respectively), and the codon in the A-site (‘A-site codon’).

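| P-site | Mutated P-site | A-site | Gene | A-site Codon No. | P-site Codon | Mutated P-site codon | A-site Codon |
|---|---|---|---|---|---|---|---|
| Slow → Fast Translation mutations | | | | | | | |
| P | E | G | YMR122W-A | 56 | GGC | GAA | CCA |
| N | S | R | YOL109W | 106 | AAC | UCC | CGU |
| D | F | G | YKL096W-A | 32 | GGU | UUC | GAC |
| G | S | T | YLR109W | 162 | ACC | UCU | GGU |
| G | S | G | YOL109W | 99 | GGC | UCU | GGU |
| Fast → Slow Translation mutations | | | | | | | |
| Q | P | D | YOL109W | 14 | CAA | CCA | GAU |
| S | G | G | YLR109W | 140 | GGU | GGU | AGU |
| S | G | T | YKL096W-A | 62 | ACC | GGU | AGC |
| V | H | K | YHR179W | 339 | GUG | CAC | AAG |
| E | K | E | YHR179W | 150 | GAA | AAA | GAA |
| Negative Control Mutations | | | | | | | |
| V | Y | F | YHR179W | 251 | GUC | UAC | UUC |
| L | N | A | YHR179W | 129 | CUU | AAC | GCU |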

Table C.3: Statistics of read mapping for ribosome profiling experiments for the

mutant strains carried out in this study. 1.7 billion reads were mapped to the exome in

total for all samples with an average of 86 million reads per sample.

| Sample | Total reads (millions) | Mapped to rRNA: reads (millions) | Mapped to rRNA: % reads | Mapped to exome: reads (millions) | Mapped to exome: % reads |
|---|---|---|---|---|---|
| Phase I samples | | | | | |
| YMR* rep1 | 37.10 | 9.04 | 24.37% | 24.76 | 66.75% |
| YMR* rep2 | 56.06 | 11.90 | 21.23% | 25.02 | 44.63% |
| YKL* rep1 | 59.79 | 15.65 | 26.18% | 41.46 | 69.35% |
| YKL* rep2 | 65.39 | 17.04 | 26.05% | 44.88 | 68.64% |
| YOL* rep1 | 123.58 | 28.99 | 23.46% | 83.56 | 67.62% |
| YOL* rep2 | 59.10 | 14.76 | 24.97% | 41.51 | 70.25% |
| YLR* rep1 | 55.39 | 8.54 | 15.43% | 44.03 | 79.49% |
| YLR* rep2 | 57.00 | 5.34 | 9.36% | 44.67 | 78.36% |
| Phase II samples | | | | | |
| YHR* rep1 | 195.94 | 27.96 | 14.27% | 158.14 | 80.71% |
| YHR* rep2 | 163.22 | 30.91 | 18.94% | 124.51 | 76.28% |
| YHR* rep3 | 151.04 | 24.84 | 16.45% | 116.16 | 76.91% |
| YHR* rep4 | 159.08 | 19.18 | 12.06% | 131.19 | 82.46% |
| YOL* rep1 | 158.03 | 29.91 | 18.93% | 118.97 | 75.28% |
| YOL* rep2 | 148.15 | 16.85 | 11.38% | 99.34 | 67.05% |
| YOL* rep3 | 149.12 | 24.39 | 16.36% | 114.79 | 76.98% |
| YOL* rep4 | 141.42 | 23.34 | 16.50% | 104.26 | 73.72% |
| YOL** rep1 | 146.85 | 18.53 | 12.62% | 74.68 | 50.85% |
| YOL** rep2 | 157.73 | 22.76 | 14.43% | 125.87 | 79.80% |
| YOL** rep3 | 141.19 | 20.74 | 14.69% | 112.96 | 80.00% |
| YOL** rep4 | 141.42 | 23.34 | 16.50% | 104.26 | 73.72% |
| Average | | | | 86.75 | |
| Total reads mapped to exome | | | | 1735.10 | |

Table C.4. Three mutations in gene YOL109W to test the contribution of amino acid and tRNA identity. Columns are the same as in Table C.2, except that the two synonymous mutations, which encode the same mutated residue at the P-site, are labeled ‘Mutant 1 P-site codon’ and ‘Mutant 2 P-site codon’.

P-site   Mutated   A-site   Gene      A-site     P-site   Mutant 1       Mutant 2       A-site
         P-site                       Codon no   Codon    P-site codon   P-site codon   Codon

G        S         G        YOL109W   99         GGC      UCU            AGC            GGU
Q        P         D        YOL109W   14         CAA      CCA            CCU            GAU
N        S         R        YOL109W   106        AAC      UCC            UCG            CGU
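As a consistency check on Table C.4, the sketch below translates each listed codon with the standard genetic code and confirms that the two mutant P-site codons in every row encode the same residue. It is an illustrative check only: the compact residue string follows the conventional U, C, A, G codon ordering, and all names (`GENETIC_CODE`, `TABLE_C4`, `translate_row`) are hypothetical helpers, not code from the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order (third base fastest).
BASES = "UCAG"
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), RESIDUES)
}

# (A-site codon number, P-site codon, mutant 1, mutant 2, A-site codon) from Table C.4.
TABLE_C4 = [
    (99, "GGC", "UCU", "AGC", "GGU"),
    (14, "CAA", "CCA", "CCU", "GAU"),
    (106, "AAC", "UCC", "UCG", "CGU"),
]

def translate_row(row):
    """Return the residues encoded by the four codons of one table row."""
    _, p_site, mutant_1, mutant_2, a_site = row
    return tuple(GENETIC_CODE[c] for c in (p_site, mutant_1, mutant_2, a_site))

translated = [translate_row(row) for row in TABLE_C4]
# True for a row when both mutant codons encode the same amino acid.
synonymous = [m1 == m2 for _, m1, m2, _ in translated]

for row, residues, ok in zip(TABLE_C4, translated, synonymous):
    print(row[0], residues, "synonymous mutants" if ok else "NOT synonymous")
```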

Appendix D

CHAPTER 5 SUPPORTING INFORMATION

This appendix contains proofs for the Fold Enrichment (FE) measure used in Chapter 5 to determine the Ssb-bound translated segments. It was published as supplementary file Data_S1 of the study published in Cell titled “Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-Assisted Folding” by Kristina Döring, Nabeel Ahmed, Trine Riemer, Harsha Garadi Suresh, Yevhen Vainshtein, Markus Habich, Jan Riemer, Matthias P. Mayer, Edward P. O’Brien, Günter Kramer and Bernd Bukau. The text below is reproduced with permission from Cell Press under the journal publishing agreement, which allows authors to include the publication in a thesis or dissertation.

D.1 Derivations Demonstrating that the Fold Enrichment Is Directly Proportional to

the Ssb-Binding Probability and that, in the Fold Enrichment, the Contribution of

the Elongation Rate Is Eliminated

Two proofs are provided below. The first demonstrates that the Fold Enrichment (FE) is directly proportional to the probability of Ssb binding. The second demonstrates that the number of selective ribosome profiling (SeRP) reads is a function of the elongation rate, and that by using the Fold Enrichment, the contribution of the elongation rate is eliminated and the probability of Ssb binding is isolated.

D.1.1 Proof 1: Demonstration that the FE is directly proportional to the probability

of Ssb binding

The probability of finding Ssb bound to a nascent chain when codon $i$ of transcripts from gene $j$ is being translated at a given instant in time is defined as

$$P_{\mathrm{bound}}(i,j) = \frac{N_{\mathrm{bound}}(i,j)}{N_{\mathrm{Total}}(i,j)}, \tag{1}$$

where $N_{\mathrm{bound}}(i,j)$ is the number of ribosomes that have Ssb bound when codon $i$ is being translated, and $N_{\mathrm{Total}}(i,j)$ is the total number of ribosomes actively translating codon $i$. The FE at codon $i$ of gene $j$ is defined as $FE(i,j) = \frac{S(i,j)}{R(i,j)}$, where $S(i,j)$ is the number of reads arising from selective ribosome profiling (SeRP) that map to codon $i$ of gene $j$, and $R(i,j)$ is the number of reads arising from ribosome profiling (RP) that map to the same location.

Experiments have shown that the number of RNA-Seq reads is directly proportional to the number of mRNA molecules (Figure 2C in Mortazavi et al.¹⁷⁵). We assume this holds for RP and SeRP reads as well, since they are also next-generation sequencing methods. Therefore, $R(i,j)$ is directly proportional to the total number of ribosomes at codon $i$ of transcript $j$; that is, $R(i,j) \propto N_{\mathrm{Total}}(i,j)$, and $R(i,j) = a\,N_{\mathrm{Total}}(i,j)$, where $a$ is a constant of proportionality. This equation can be algebraically rearranged to find

$$N_{\mathrm{Total}}(i,j) = \frac{R(i,j)}{a}. \tag{2}$$

Likewise, $S(i,j)$ is most likely directly proportional to the number of ribosomes that have Ssb bound at length $i$ on transcript $j$, and hence $S(i,j) = b\,N_{\mathrm{bound}}(i,j)$ and

$$N_{\mathrm{bound}}(i,j) = \frac{S(i,j)}{b}, \tag{3}$$

where $b$ is a constant of proportionality.

Substituting Eqs. 2 and 3 into Eq. 1 results in $P_{\mathrm{bound}}(i,j) = \frac{a\,S(i,j)}{b\,R(i,j)}$. Substituting our definition of $FE(i,j)$ into this equation yields $P_{\mathrm{bound}}(i,j) = \frac{a}{b}\,FE(i,j)$. Therefore,

$$P_{\mathrm{bound}}(i,j) \propto FE(i,j). \tag{4}$$

This demonstrates that the experimentally measured fold enrichment is directly proportional to the probability of an Ssb molecule being bound to the nascent chain.
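The definition of the fold enrichment as a per-codon ratio of SeRP to RP reads can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: the function name, the made-up read counts, and the choice to return NaN where no RP reads map are all assumptions introduced here.

```python
import math

def fold_enrichment(serp_reads, rp_reads):
    """Per-codon FE(i, j) = S(i, j) / R(i, j) for one transcript.

    Codons with zero RP reads get NaN (an assumption of this sketch),
    because the ratio is undefined there.
    """
    if len(serp_reads) != len(rp_reads):
        raise ValueError("SeRP and RP profiles must cover the same codons")
    return [s / r if r > 0 else math.nan for s, r in zip(serp_reads, rp_reads)]

# Made-up read counts for four codons of a hypothetical transcript.
serp = [12, 40, 0, 9]
rp = [6, 10, 5, 0]
fe = fold_enrichment(serp, rp)
print(fe)  # [2.0, 4.0, 0.0, nan]
```

Because FE is proportional to the binding probability only up to the unknown constant a/b, values such as these are meaningful as relative enrichments along a transcript rather than as absolute probabilities.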

D.1.2 Proof 2: Demonstration that SeRP reads are a function of the elongation rate,

and that the Fold Enrichment metric controls for this effect.

The total number of ribosomes on transcript $j$, $N_{\mathrm{Total}}(j) = \sum_{k=1}^{N_C} N_{\mathrm{Total}}(k,j)$, where $N_C$ is the number of codons in the transcript, is equal to

$$N_{\mathrm{Total}}(j) = k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_{S,j}, \tag{5}$$

where $k_{\mathrm{int},j}$, $N_{\mathrm{mRNA},j}$ and $\tau_{S,j}$ are, respectively, the initiation rate, mRNA copy number and average synthesis time of transcript $j$. The synthesis time is the sum of the codon translation times; therefore $\tau_{S,j} = \sum_{k=1}^{N_C} \tau_A(k,j)$, where $\tau_A(k,j)$ is the average translation time of codon $k$ in transcript $j$. Substituting this expression into Eq. 5 and matching the two sums term by term, it can be seen that

$$N_{\mathrm{Total}}(i,j) = k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j). \tag{6}$$

Inserting Eq. 3 into Eq. 1 we get $S(i,j) = b\,N_{\mathrm{Total}}(i,j)\,P_{\mathrm{bound}}(i,j)$. Substituting Eq. 6 into this equation we get a key result

$$S(i,j) = b\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)\,P_{\mathrm{bound}}(i,j). \tag{7}$$

Thus, we have demonstrated that the number of SeRP reads, $S(i,j)$, is not just a function of the probability of Ssb binding, but also a function of the codon translation time, among other factors. Thus, the probability of binding cannot be defined by the SeRP reads alone. We need additional information to define the bound regions.

Likewise, it can be shown using Eqs. 2 and 6 that $R(i,j) = a\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)$. Substituting this expression and Eq. 7 into our definition of FE, we get

$$FE(i,j) = \frac{S(i,j)}{R(i,j)} = \frac{b\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)\,P_{\mathrm{bound}}(i,j)}{a\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)} = \frac{b}{a}\,P_{\mathrm{bound}}(i,j). \tag{8}$$

These additional factors (including elongation speed) all cancel out, and we have thereby isolated the effect of Ssb binding:

$$P_{\mathrm{bound}}(i,j) \propto FE(i,j). \tag{9}$$

In Eqs. 8 and 9 we have demonstrated that by dividing the SeRP reads by the RP reads we eliminate the effects of elongation times, initiation rates, and mRNA copy number, and measure the effect of the probability of Ssb binding.
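The cancellation in Eq. 8 can be illustrated numerically. The sketch below builds expected SeRP and RP read counts from Eqs. 2, 6 and 7 for two hypothetical transcripts that share the same binding probabilities but differ in their codon translation times, then compares the resulting fold enrichments. All parameter values and function names are invented for illustration; they are not measured quantities or code from the study.

```python
import math

def expected_reads(k_int, n_mrna, tau_a, p_bound, a, b):
    """Return (SeRP, RP) expected per-codon read counts under Eqs. 2, 6 and 7."""
    n_total = [k_int * n_mrna * t for t in tau_a]          # Eq. 6
    rp = [a * n for n in n_total]                           # R = a * N_Total (Eq. 2)
    serp = [b * n * p for n, p in zip(n_total, p_bound)]    # Eq. 7
    return serp, rp

def fold_enrichment(serp, rp):
    """Per-codon ratio S / R (Eq. 8)."""
    return [s / r for s, r in zip(serp, rp)]

A, B = 2.0, 1.5                # hypothetical proportionality constants a and b
P_BOUND = [0.1, 0.5, 0.9]      # hypothetical Ssb-binding probabilities per codon

# Same binding probabilities, different codon translation times (arbitrary units).
uniform = expected_reads(0.2, 30.0, [0.3, 0.3, 0.3], P_BOUND, A, B)
variable = expected_reads(0.2, 30.0, [0.1, 0.3, 1.2], P_BOUND, A, B)

fe_uniform = fold_enrichment(*uniform)
fe_variable = fold_enrichment(*variable)

print("SeRP reads (uniform): ", uniform[0])
print("SeRP reads (variable):", variable[0])
print("FE (uniform): ", fe_uniform)
print("FE (variable):", fe_variable)
```

In this toy setting the raw SeRP counts change with the translation times, while the fold enrichment stays at (b/a) times the binding probability for both transcripts, which is the behavior Eq. 8 predicts.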

REFERENCES

1. Vogel, C. Translation’s coming of age. Mol. Syst. Biol. 7, 498 (2011).

2. Schwanhäusser, B., Busse, D., Li, N., Dittmar, G., Schuchhardt, J., Wolf, J., Chen, W. & Selbach, M. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).

3. Cooper, G. Translation of mRNA. The Cell: A Molecular Approach. (Sinauer Associates, 2000). at <https://www.ncbi.nlm.nih.gov/books/NBK9839/>

4. Shah, P., Ding, Y., Niemczyk, M., Kudla, G. & Plotkin, J. B. Rate-limiting steps in yeast protein translation. Cell 153, 1589–601 (2013).

5. Espah Borujeni, A. & Salis, H. M. Translation Initiation is Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism. J. Am. Chem. Soc. 138, 7016–7023 (2016).

6. Nissley, D. A. & O’Brien, E. P. Timing Is Everything: Unifying Codon Translation Rates and Nascent Proteome Behavior. J. Am. Chem. Soc. 136, 17892–17898 (2014).

7. Trovato, F. & O’Brien, E. P. Fast Protein Translation Can Promote Co- and Posttranslational Folding of Misfolding-Prone Proteins. Biophys. J. 112, 1807–1819 (2017).

8. Sharma, A. K., Bukau, B. & O’Brien, E. P. Physical Origins of Codon Positions That Strongly Influence Cotranslational Folding: A Framework for Controlling Nascent-Protein Folding. J. Am. Chem. Soc. 138, 1180–1195 (2016).

9. Nissley, D. A., Sharma, A. K., Ahmed, N., Friedrich, U. A., Kramer, G., Bukau, B. & O’Brien, E. P. Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding. Nat. Commun. 7, 10341 (2016).

10. Sharma, A. K. & O’Brien, E. P. Increasing Protein Production Rates Can Decrease the Rate at Which Functional Protein is Produced and Their Steady-State Levels. J. Phys. Chem. B 121, 6775–6784 (2017).

11. Sharma, A. K. & O’Brien, E. P. Non-equilibrium coupling of protein structure and function to translation–elongation kinetics. Curr. Opin. Struct. Biol. 49, 94–103 (2018).

12. Curran, J. F. & Yarus, M. Rates of aminoacyl-tRNA selection at 29 sense codons in vivo. J. Mol. Biol. 209, 65–77 (1989).

13. Komar, A. A. A pause for thought along the co-translational folding pathway. Trends Biochem. Sci. 34, 16–24 (2009).

14. O’Brien, E. P., Vendruscolo, M. & Dobson, C. M. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nat. Commun. 5, 2988 (2014).

15. Ciryam, P., Morimoto, R. I., Vendruscolo, M., Dobson, C. M. & O’Brien, E. P. In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. Proc. Natl. Acad. Sci. U. S. A. 110, E132-40 (2013).

16. Gingold, H. & Pilpel, Y. Determinants of translation efficiency and accuracy. Mol. Syst. Biol. 7, 481 (2011).

17. Plotkin, J. B. & Kudla, G. Synonymous but not the same: the causes and consequences of codon bias. Nat. Rev. Genet. 12, 32–42 (2011).

18. Dong, H., Nilsson, L. & Kurland, C. G. Co-variation of tRNA Abundance and Codon Usage in Escherichia coli at Different Growth Rates. J. Mol. Biol. 260, 649–663 (1996).

19. Trotta, E. Selection on codon bias in yeast: A transcriptional hypothesis. Nucleic Acids Res. 41, 9382–9395 (2013).

20. Sharp, P. M. & Li, W.-H. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).

21. dos Reis, M., Savva, R. & Wernisch, L. Solving the riddle of codon usage preferences: A test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).

22. Carlini, D. B. & Stephan, W. In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics 163, 239–243 (2003).

23. Nicola, A. V., Chen, W. & Helenius, A. Co-translational folding of an alphavirus capsid protein in the cytosol of living cells. Nat. Cell. Biol. 1, 341–345 (1999).

24. Dennis, P. P. & Bremer, H. Modulation of Chemical Composition and Other Parameters of the Cell at Different Exponential Growth Rates. EcoSal Plus 3, (2008).

25. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).

26. Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).

27. Dana, A. & Tuller, T. Determinants of Translation Elongation Speed and Ribosomal Profiling Biases in Mouse Embryonic Stem Cells. PLoS Comput. Biol. 8, (2012).

28. Gerashchenko, M. V. & Gladyshev, V. N. Translation inhibitors cause abnormalities in ribosome profiling experiments. Nucleic Acids Res. 42, (2014).

29. Hussmann, J. A., Patchett, S., Johnson, A., Sawyer, S. & Press, W. H. Understanding Biases in Ribosome Profiling Experiments Reveals Signatures of Translation Dynamics in Yeast. PLoS Genet. 11, e1005732 (2015).

30. Fang, H., Huang, Y.-F., Radhakrishnan, A., Siepel, A., Lyon, G. J. & Schatz, M. C. Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst. 6, 180-191.e4 (2018).

31. Martens, A. T., Taylor, J. & Hilser, V. J. Ribosome A and P sites revealed by length analysis of ribosome profiling data. Nucleic Acids Res. 43, 3680 (2015).

32. Wang, H., McManus, J. & Kingsford, C. Accurate Recovery of Ribosome Positions Reveals Slow Translation of Wobble-Pairing Codons in Yeast. J. Comput. Biol. 24, 486–500 (2017).

33. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).

34. Oh, E., Becker, A. H., Sandikci, A., Huber, D., Chaba, R., Gloge, F., Nichols, R. J., Typas, A., Gross, C. A., Kramer, G., Weissman, J. S. & Bukau, B. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell 147, 1295–1308 (2011).

35. Gardin, J., Yeasmin, R., Yurovsky, A., Cai, Y., Skiena, S. & Futcher, B. Measurement of average decoding rates of the 61 sense codons in vivo. Elife 3, e03735 (2014).

36. Popa, A., Lebrigand, K., Paquet, A., Nottet, N., Robbe-Sermesant, K., Waldmann, R. & Barbry, P. RiboProfiling: a Bioconductor package for standard Ribo-seq pipeline processing. F1000Research 5, 1309 (2016).

37. Lauria, F., Tebaldi, T., Bernabò, P., Groen, E. J. N., Gillingwater, T. H. & Viero, G. riboWaltz: Optimization of ribosome P-site positioning in ribosome profiling data. PLoS Comput. Biol. 14, 1–20 (2018).

38. Dunn, J. G. & Weissman, J. S. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics 17, 958 (2016).

39. Pop, C., Rouskin, S., Ingolia, N. T., Han, L., Phizicky, E. M., Weissman, J. S. & Koller, D. Causal signals between codon bias, mRNA structure, and the efficiency of translation and elongation. Mol. Syst. Biol. 10, 770 (2014).

40. Lareau, L. F., Hite, D. H., Hogan, G. J. & Brown, P. O. Distinct stages of the translation elongation cycle revealed by sequencing ribosome-protected mRNA fragments. Elife 2014, 1–16 (2014).

41. Weinberg, D. E., Shah, P., Eichhorn, S. W., Hussmann, J. A., Plotkin, J. B. & Bartel, D. P. Improved Ribosome-Footprint and mRNA Measurements Provide Insights into Dynamics and Regulation of Yeast Translation. Cell Rep. 14, 1787–1799 (2016).

42. Charneski, C. A. & Hurst, L. D. Positively Charged Residues Are the Major Determinants of Ribosomal Velocity. PLoS Biol. 11, e1001508 (2013).

43. Qian, W., Yang, J. R., Pearson, N. M., Maclean, C. & Zhang, J. Balanced codon usage optimizes eukaryotic translational efficiency. PLoS Genet. 8, e1002603 (2012).

44. Dana, A. & Tuller, T. Mean of the Typical Decoding Rates: A New Translation Efficiency Index Based on the Analysis of Ribosome Profiling Data. Genes|Genomes|Genetics 5, 73–80 (2014).

45. Hani, J. tRNA genes and retroelements in the yeast genome. Nucleic Acids Res. 26, 689–696 (2002).

46. Letzring, D. P., Dean, K. M. & Grayhack, E. J. Control of translation efficiency in yeast by codon-anticodon interactions. RNA 16, 2516–2528 (2010).

47. Chaney, J. L. & Clark, P. L. Roles for Synonymous Codon Usage in Protein Biogenesis. Annu. Rev. Biophys. 44, 143–166 (2015).

48. Pechmann, S. & Frydman, J. Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nat. Struct. Mol. Biol. 20, 237–43 (2013).

49. Stadler, M. & Fire, A. Wobble base-pairing slows in vivo translation elongation in metazoans. RNA 17, 2063–2073 (2011).

50. Choi, J., Grosely, R., Prabhakar, A., Lapointe, C. P., Wang, J. & Puglisi, J. D. How Messenger RNA and Nascent Chain Sequences Regulate Translation Elongation. Annu. Rev. Biochem. 87, 421–449 (2018).

51. Qu, X., Wen, J.-D., Lancaster, L., Noller, H. F., Bustamante, C. & Tinoco, I. The ribosome uses two active mechanisms to unwind messenger RNA during translation. Nature 475, 118–121 (2011).

52. Wen, J.-D., Lancaster, L., Hodges, C., Zeri, A. C., Yoshimura, S. H., Noller, H. F., Bustamante, C. & Tinoco, I. Following translation by single ribosomes one codon at a time. Nature 452, 598–603 (2008).

53. Tuller, T., Waldman, Y. Y., Kupiec, M. & Ruppin, E. Translation efficiency is determined by both codon bias and folding energy. Proc. Natl. Acad. Sci. 107, 3645–3650 (2010).

54. Tuller, T., Veksler-Lublinsky, I., Gazit, N., Kupiec, M., Ruppin, E. & Ziv-Ukelson, M. Composite effects of gene determinants on the translation speed and density of ribosomes. Genome Biol. 12, (2011).

55. Nedialkova, D. D. & Leidel, S. A. Optimization of Codon Translation Rates via tRNA Modifications Maintains Proteome Integrity. Cell 161, 1606–1618 (2015).

56. Goodarzi, H., Nguyen, H. C. B., Zhang, S., Dill, B. D., Molina, H. & Tavazoie, S. F. Modulated expression of specific tRNAs drives gene expression and cancer progression. Cell 165, 1416–1427 (2016).

57. Yona, A. H., Bloom-Ackermann, Z., Frumkin, I., Hanson-Smith, V., Charpak-Amikam, Y., Feng, Q., Boeke, J. D., Dahan, O. & Pilpel, Y. tRNA genes rapidly change in evolution to meet novel translational demands. Elife 2, 1–17 (2013).

58. Pavlov, M. Y., Watts, R. E., Tan, Z., Cornish, V. W., Ehrenberg, M. & Forster, A. C. Slow peptide bond formation by proline and other N-alkylamino acids in translation. Proc. Natl. Acad. Sci. U. S. A. 106, 50–54 (2009).

59. Johansson, M., Ieong, K.-W., Trobro, S., Strazewski, P., Aqvist, J., Pavlov, M. Y. & Ehrenberg, M. pH-sensitivity of the ribosomal peptidyl transfer reaction dependent on the identity of the A-site aminoacyl-tRNA. Proc. Natl. Acad. Sci. 108, 79–84 (2010).

60. Artieri, C. G. & Fraser, H. B. Accounting for biases in riboprofiling data indicates a major role for proline in stalling translation. Genome Res. 24, 2011–2021 (2014).

61. Ude, S., Lassak, J., Starosta, A. L., Kraxenberger, T., Wilson, D. N. & Jung, K. Translation elongation factor EF-P alleviates ribosome stalling at Polyproline Stretches. Science 339, 82–86 (2013).

62. Doerfel, L. K., Wohlgemuth, I., Kothe, C., Peske, F., Urlaub, H. & Rodnina, M. V. EF-P Is Essential for Rapid Synthesis of Proteins Containing Consecutive Proline Residues. Science 339, 85–88 (2013).

63. Starosta, A. L., Lassak, J., Peil, L., Atkinson, G. C., Virumäe, K., Tenson, T., Remme, J., Jung, K. & Wilson, D. N. Translational stalling at polyproline stretches is modulated by the sequence context upstream of the stall site. Nucleic Acids Res. 42, 10711–10719 (2014).

64. Gutierrez, E., Shin, B. S., Woolstenhulme, C. J., Kim, J. R., Saini, P., Buskirk, A. R. & Dever, T. E. eIF5A promotes translation of polyproline motifs. Mol. Cell 51, 35–45 (2013).

65. Schuller, A. P., Wu, C. C. C., Dever, T. E., Buskirk, A. R. & Green, R. eIF5A Functions Globally in Translation Elongation and Termination. Mol. Cell 66, 194-205.e5 (2017).

66. Woolstenhulme, C. J., Guydosh, N. R., Green, R. & Buskirk, A. R. High-Precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP. Cell Rep. 11, 13–21 (2015).

67. Peil, L., Starosta, A. L., Lassak, J., Atkinson, G. C., Virumae, K., Spitzer, M., Tenson, T., Jung, K., Remme, J. & Wilson, D. N. Distinct XPPX sequence motifs induce ribosome stalling, which is rescued by the translation elongation factor EF-P. Proc. Natl. Acad. Sci. 110, 15265–15270 (2013).

68. Sabi, R. & Tuller, T. A comparative genomics study on the effect of individual amino acids on ribosome stalling. BMC Genomics 16, S5 (2015).

69. Lu, J. & Deutsch, C. Electrostatics in the Ribosomal Tunnel Modulate Chain Elongation Rates. J. Mol. Biol. 384, 73–86 (2008).

70. Stein, K. C. & Frydman, J. The stop-and-go traffic regulating protein biogenesis: How translation kinetics controls proteostasis. J. Biol. Chem. 294, 2076–2084 (2018).

71. Frydman, J., Nimmesgern, E., Ohtsuka, K. & Ulrich, B. F. Folding of nascent polypeptide chains in high molecular chaperones. Nature 370, 111–117 (1994).

72. Kramer, G., Boehringer, D., Ban, N. & Bukau, B. The ribosome as a platform for co-translational processing, folding and targeting of newly synthesized proteins. Nat. Struct. Mol. Biol. 16, 589–597 (2009).

73. Pechmann, S., Willmund, F. & Frydman, J. The Ribosome as a Hub for Protein Quality Control. Mol. Cell 49, 411–421 (2013).

74. Preissler, S. & Deuerling, E. Ribosome-associated chaperones as key players in proteostasis. Trends Biochem. Sci. 37, 274–283 (2012).

75. Shiber, A., Döring, K., Friedrich, U., Klann, K., Merker, D., Zedan, M., Tippmann, F., Kramer, G. & Bukau, B. Cotranslational assembly of protein complexes in eukaryotes revealed by ribosome profiling. Nature 561, 268–272 (2018).

76. Chartron, J. W., Hunt, K. C. L. & Frydman, J. Cotranslational signal-independent SRP preloading during membrane targeting. Nature 536, 224–228 (2016).

77. Thommen, M., Holtkamp, W. & Rodnina, M. V. Co-translational protein folding: progress and methods. Curr. Opin. Struct. Biol. 42, 83–89 (2017).

78. Yu, C. H., Dang, Y., Zhou, Z., Wu, C., Zhao, F., Sachs, M. S. & Liu, Y. Codon Usage Influences the Local Rate of Translation Elongation to Regulate Co-translational Protein Folding. Mol. Cell 59, 744–754 (2015).

79. Elena, C., Ravasi, P., Castelli, M. E., Peirú, S. & Menzella, H. G. Expression of codon optimized genes in microbial systems: Current industrial applications and perspectives. Front. Microbiol. 5, 1–8 (2014).

80. Buhr, F., Jha, S., Thommen, M., Mittelstaet, J., Kutz, F., Schwalbe, H., Rodnina, M. V. & Komar, A. A. Synonymous Codons Direct Cotranslational Folding toward Different Protein Conformations. Mol. Cell 61, 341–351 (2016).

81. Zhang, G., Hubalewska, M. & Ignatova, Z. Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nat. Struct. Mol. Biol. 16, 274–280 (2009).

82. Fedyunin, I., Lehnhardt, L., Böhmer, N., Kaufmann, P., Zhang, G. & Ignatova, Z. tRNA concentration fine tunes protein solubility. FEBS Lett. 586, 3336–3340 (2012).

83. Zhou, M., Guo, J., Cha, J., Chae, M., Chen, S., Barral, J. M., Sachs, M. S. & Liu, Y. Non-optimal codon usage affects expression, structure and function of clock protein FRQ. Nature 495, 111–115 (2013).

84. Zhou, T., Weems, M. & Wilke, C. O. Translationally optimal codons associate with structurally sensitive sites in proteins. Mol. Biol. Evol. 26, 1571–1580 (2009).

85. Sander, I. M., Chaney, J. L. & Clark, P. L. Expanding Anfinsen’s Principle: Contributions of Synonymous Codon Selection to Rational Protein Design. J. Am. Chem. Soc. ASAP, (2014).

86. Chaney, J. L., Steele, A., Carmichael, R., Rodriguez, A., Specht, A. T., Ngo, K., Li, J., Emrich, S. & Clark, P. L. Widespread position-specific conservation of synonymous rare codons within coding sequences. PLoS Comput. Biol. 13, 1–19 (2017).

87. Saunders, R. & Deane, C. M. Synonymous codon usage influences the local protein structure observed. Nucleic Acids Res. 38, 6719–6728 (2010).

88. Dana, A. & Tuller, T. Efficient Manipulations of Synonymous Mutations for Controlling Translation Rate: An Analytical Approach. J. Comput. Biol. 19, 200–231 (2012).

89. Becker, A. H., Oh, E., Weissman, J. S., Kramer, G. & Bukau, B. Selective ribosome profiling as a tool for studying the interaction of chaperones and targeting factors with nascent polypeptide chains and ribosomes. Nat. Protoc. 8, 2212–39 (2013).

90. Calkhoven, C. F., Müller, C. & Leutz, A. Translational control of gene expression and disease. Trends Mol. Med. 8, 577–583 (2002).

91. Ingolia, N. T. Ribosome Footprint Profiling of Translation throughout the Genome. Cell 165, 22–33 (2016).

92. Popa, A., Lebrigand, K., Paquet, A., Nottet, N., Robbe-Sermesant, K., Waldmann, R. & Barbry, P. RiboProfiling: a Bioconductor package for standard Ribo-seq pipeline processing. F1000Research 5, 1309 (2016).

93. Sierksma, G. Linear and Integer Programming Theory and Practice. (CRC Press, 2001). at <http://openlibrary.org/books/OL8124799M/Linear_Integer_Programming>

94. Ingolia, N. T., Brar, G. A., Stern-Ginossar, N., Harris, M. S., Talhouarne, G. J. S., Jackson, S. E., Wills, M. R. & Weissman, J. S. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes. Cell Rep. 8, 1365–1379 (2014).

95. O’Connor, P. B. F., Li, G. W., Weissman, J. S., Atkins, J. F. & Baranov, P. V. rRNA:mRNA pairing alters the length and the symmetry of mRNA-protected fragments in ribosome profiling experiments. Bioinformatics 29, 1488–1491 (2013).

96. Mohammad, F., Woolstenhulme, C. J., Green, R. & Buskirk, A. R. Clarifying the Translational Pausing Landscape in Bacteria by Ribosome Profiling. Cell Rep. 14, 686–694 (2016).

97. Nakahigashi, K., Takai, Y., Kimura, M., Abe, N., Nakayashiki, T., Shiwa, Y., Yoshikawa, H., Wanner, B. L., Ishihama, Y. & Mori, H. Comprehensive identification of translation start sites by tetracycline-inhibited ribosome profiling. DNA Res. 23, 193–201 (2016).

98. Malys, N. Shine-Dalgarno sequence of bacteriophage T4: GAGG prevails in early genes. Mol. Biol. Rep. 39, 33–39 (2012).

99. Han, Y., Gao, X., Liu, B., Wan, J., Zhang, X. & Qian, S. B. Ribosome profiling reveals sequence-independent post-initiation pausing as a signature of translation. Cell Res. 24, 842–851 (2014).

100. Haase, N., Holtkamp, W., Lipowsky, R., Rodnina, M. & Rudorf, S. Decomposition of time-dependent fluorescence signals reveals codon-specific kinetics of protein synthesis. Nucleic Acids Res. 46, (2018).

101. Diament, A. & Tuller, T. Estimation of ribosome profiling performance and reproducibility at various levels of resolution. Biol. Direct 11, 24 (2016).

102. Malone, B., Atanassov, I., Aeschimann, F., Li, X., Großhans, H. & Dieterich, C. Bayesian prediction of RNA translation from ribosome profiling. Nucleic Acids Res. 45, 2960–2972 (2016).

103. Sabi, R. & Tuller, T. A comparative genomics study on the effect of individual amino acids on ribosome stalling. BMC Genomics 16, S5 (2015).

104. Gutierrez, E., Shin, B. S., Woolstenhulme, C. J., Kim, J. R., Saini, P., Buskirk, A. R. & Dever, T. E. eIF5A promotes translation of polyproline motifs. Mol. Cell 51, 35–45 (2013).

105. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).

106. Brackley, C. A., Romano, M. C. & Thiel, M. The dynamics of supply and demand in mRNA translation. PLoS Comput. Biol. 7, (2011).

107. Rudorf, S. & Lipowsky, R. Protein synthesis in E. coli: Dependence of codon-specific elongation on tRNA concentration and codon usage. PLoS One 10, 1–22 (2015).

108. Sonenberg, N. & Hinnebusch, A. G. Regulation of Translation Initiation in Eukaryotes: Mechanisms and Biological Targets. Cell 136, 731–745 (2009).

109. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

110. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R. & Salzberg, S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

111. Young, D. J., Guydosh, N. R., Zhang, F., Hinnebusch, A. G. & Green, R. Rli1/ABCE1 Recycles Terminating Ribosomes and Controls Translation Reinitiation in 3′UTRs In Vivo. Cell 162, 872–884 (2015).

112. Guydosh, N. R. & Green, R. Dom34 rescues ribosomes in 3′ untranslated regions. Cell 156, 950–962 (2014).

113. Jan, C. H., Williams, C. C. & Weissman, J. S. Principles of ER cotranslational translocation revealed by proximity-specific ribosome profiling. Science 346, 748–751 (2014).

114. Williams, C. C., Jan, C. H. & Weissman, J. S. Targeting and plasticity of mitochondrial proteins revealed by proximity-specific ribosome profiling. Science 346, 748–751 (2014).

115. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).

116. Hurt, J. A., Robertson, A. D. & Burge, C. B. Global analyses of UPF1 binding and function reveals expanded scope of nonsense-mediated mRNA decay. Genome Res. 23, 1636–1650 (2013).

117. Li, G.-W., Oh, E. & Weissman, J. S. The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538–541 (2012).

118. Li, G. W., Burkhardt, D., Gross, C. & Weissman, J. S. Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell 157, 624–635 (2014).

119. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977).

120. Good, P. Permutation, Parametric, and Bootstrap Tests of Hypothesis. (Springer Series in Statistics, 2005). doi:10.1007/978-0-387-98135-2

121. Reid, D. W. & Nicchitta, C. V. Primary role for endoplasmic reticulum-bound ribosomes in cellular translation identified by ribosome profiling. J. Biol. Chem. 287, 5518–5527 (2012).

122. Chowdhury, D. Stochastic mechano-chemical kinetics of molecular motors: A multidisciplinary enterprise from a physicist’s perspective. Phys. Rep. 529, 1–197 (2013).

123. Marshall, R. A., Aitken, C. E., Dorywalska, M. & Puglisi, J. D. Translation at the single-molecule level. Annu. Rev. Biochem. 77, 177–203 (2008).

124. Sharma, A. K. & Chowdhury, D. Template-directed biopolymerization: Tape-copying Turing machines. Biophys. Rev. Lett. 7, 135–175 (2012).

125. Jackson, R. J., Hellen, C. U. T. & Pestova, T. V. The mechanism of eukaryotic translation initiation and principles of its regulation. Nat. Rev. Mol. Cell Biol. 11, 113–127 (2010).

126. Hinnebusch, A. G. The Scanning Mechanism of Eukaryotic Translation Initiation. Annu. Rev. Biochem. 83, 779–812 (2014).

127. Spriggs, K. A., Bushell, M. & Willis, A. E. Translational Regulation of Gene Expression during Conditions of Cell Stress. Mol. Cell 40, 228–237 (2010).

128. Kervestin, S. & Amrani, N. Translational regulation of gene expression. Genome Biol. 5, 359 (2004).

129. Ciandrini, L., Stansfield, I. & Romano, M. C. Ribosome Traffic on mRNAs Maps to Gene Ontology: Genome-wide Quantification of Translation Initiation Rates and Polysome Size Regulation. PLoS Comput. Biol. 9, e1002866 (2013).

130. Dana, A. & Tuller, T. Mean of the Typical Decoding Rates: A New Translation Efficiency Index Based on the Analysis of Ribosome Profiling Data. G3-Genes Genomes Genet. 5, 73–80 (2015).

131. Gritsenko, A. A., Hulsman, M., Reinders, M. J. T. & de Ridder, D. Unbiased Quantitative Models of Protein Translation Derived from Ribosome Profiling Data. PLOS Comput. Biol. 11, e1004336 (2015).

132. Dao Duc, K. & Song, Y. S. The impact of ribosomal interference, codon usage, and exit tunnel interactions on translation elongation rate variation. PLoS Genet. 14, e1001508 (2018).

133. Requião, R. D., de Souza, H. J. A., Rossetto, S., Domitrovic, T. & Palhano, F. L. Increased ribosome density associated to positively charged residues is evident in ribosome profiling experiments performed in the absence of translation inhibitors. RNA Biol. 13, 561–568 (2016).

134. Dao Duc, K., Saleem, Z. H. & Song, Y. S. Theoretical analysis of the distribution of isolated particles in totally asymmetric exclusion processes: Application to mRNA translation rate estimation. Phys. Rev. E 97, 12106 (2018).

135. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977).

136. Fluitt, A., Pienaar, E. & Viljoen, H. Ribosome kinetics and aa-tRNA competition determine rate and fidelity of peptide synthesis. Comput. Biol. Chem. 31, 335–346 (2007).

137. Sharma, A. K., Ahmed, N. & O’Brien, E. P. Determinants of translation speed are randomly distributed across transcripts resulting in a universal scaling of protein synthesis times. Phys. Rev. E 97, 22409 (2018).

138. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. S. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705 (2014).

139. Kertesz, M., Wan, Y., Mazor, E., Rinn, J. L., Nutter, R. C., Chang, H. Y. & Segal, E. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 (2010).

140. Sorensen, M. A. & Pedersen, S. Absolute in vivo translation rates of individual codons in Escherichia coli. The two glutamic acid codons GAA and GAG are translated with a threefold difference in rate. J. Mol. Biol. 222, 265–280 (1991).

141. Brar, G. A. Beyond the Triplet Code: Context Cues Transform Translation. Cell 167, 1681–1692 (2016).

142. Diament, A., Feldman, A., Schochet, E., Kupiec, M., Arava, Y. & Tuller, T. The extent of ribosome queuing in budding yeast. PLoS Comput. Biol. 14, e1005951 (2018).

143. Ingolia, N. T. Ribosome Footprint Profiling of Translation throughout the Genome. Cell 165, 22–33 (2016).

144. Gerashchenko, M. V. & Gladyshev, V. N. Ribonuclease selection for ribosome profiling. Nucleic Acids Res. 45, e6 (2017).

145. Lecanda, A., Nilges, B. S., Sharma, P., Nedialkova, D. D., Schwarz, J., Vaquerizas, J. M. & Leidel, S. A. Dual randomization of oligonucleotides to reduce the bias in ribosome-profiling libraries. Methods 107, 89–97 (2016).

146. Pfeffer, S., Lagos-Quintana, M. & Tuschl, T. Cloning of Small RNA Molecules. Curr. Protoc. Mol. Biol. 72, 26.4.1–26.4.18 (2005).

147. Levin, J. Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D. A., Friedman, N., Gnirke, A. & Regev, A. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Methods 7, 709–715 (2010).

148. Zur, H. & Tuller, T. Predictive biophysical modeling and understanding of the dynamics of mRNA translation and its evolution. Nucleic Acids Res. 44, 9031–9049 (2016).

149. Shaw, L. B., Zia, R. K. P. & Lee, K. H. Totally asymmetric exclusion process with extended objects: A model for protein synthesis. Phys. Rev. E 68, 021910 (2003).

150. Chou, T., Mallick, K. & Zia, R. K. P. Non-equilibrium statistical mechanics: from a paradigmatic model to biological transport. Reports Prog. Phys. 74, 116601 (2011).

151. Lakatos, G. & Chou, T. Totally asymmetric exclusion processes with particles of arbitrary size. J. Phys. A: Math. Gen. 36, 2027–2041 (2003).

152. Fernandes, L. D., de Moura, A. & Ciandrini, L. Gene length as a regulator for ribosome recruitment and protein synthesis: theoretical insights. Sci. Rep. 7, 17409 (2017).

153. Bonnin, P., Kern, N., Young, N. T., Stansfield, I. & Romano, M. C. Novel mRNA-specific effects of ribosome drop-off on translation rate and polysome profile. PLoS Comput. Biol. (2017). doi:10.1371/journal.pcbi.1005555

154. Ahmed, N., Sormanni, P., Ciryam, P., Vendruscolo, M., Dobson, C. M. & O’Brien, E. P. Identifying A- and P-site locations on ribosome-protected mRNA fragments using Integer Programming. Sci. Rep. 9, 6256 (2019).

155. Bonven, B. & Gulløv, K. Peptide chain elongation rate and ribosomal activity in Saccharomyces cerevisiae as a function of the growth rate. Mol. Gen. Genet. 170, 225–30 (1979).

156. Karpinets, T. V., Greenwood, D. J., Sams, C. E. & Ammons, J. T. RNA:protein ratio of the unicellular organism as a characteristic of phosphorous and nitrogen stoichiometry and of the cellular requirement of ribosomes for protein synthesis. BMC Biol. 4, 30 (2006).

157. Nissley, D. A., Sharma, A. K., Ahmed, N., Friedrich, U. A., Kramer, G., Bukau, B. & O’Brien, E. P. Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding. Nat. Commun. 7, 10341 (2016).

158. Cherry, J. M., Hong, E. L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E. T., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hitz, B. C., Karra, K., Krieger, C. J., Miyasato, S. R., Nash, R. S., Park, J., Skrzypek, M. S., Simison, M., Weng, S. & Wong, E. D. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).

159. Döring, K., Ahmed, N., Riemer, T., Suresh, H. G., Vainshtein, Y., Habich, M., Riemer, J., Mayer, M. P., O’Brien, E. P., Kramer, G. & Bukau, B. Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-Assisted Folding. Cell 170, 298-311.e20 (2017).

160. Fields, S., Gamble, C. E., Grayhack, E. J., Brule, C. E. & Dean, K. M. Adjacent Codons Act in Concert to Modulate Translation Efficiency in Yeast. Cell 166, 679–690 (2016).

161. Bukau, B., Deuerling, E., Pfund, C. & Craig, E. A. Getting newly synthesized proteins into shape. Cell 101, 119–122 (2000).

162. Frydman, J. Folding of Newly Translated Proteins in vivo: The Role of Molecular Chaperones. Annu. Rev. Biochem. (2001).

163. Albanèse, V., Yam, A. Y. W., Baughman, J., Parnot, C. & Frydman, J. Systems analyses reveal two chaperone networks with distinct functions in eukaryotic cells. Cell 124, 75–88 (2006).

164. Koplin, A., Preissler, S., Ilina, Y., Koch, M., Scior, A., Erhardt, M. & Deuerling, E. A dual function for chaperones SSB-RAC and the NAC nascent polypeptide-associated complex on ribosomes. J. Cell Biol. 189, 57–68 (2010).

165. Ariosa, A., Lee, J. H., Wang, S., Saraogi, I. & Shan, S. Regulation by a chaperone improves substrate selectivity during cotranslational protein targeting. Proc. Natl. Acad. Sci. 112, E3169–E3178 (2015).

166. Jacobson, G. N. & Clark, P. L. Quality over quantity: Optimizing co-translational protein folding with non-’optimal’ synonymous codons. Curr. Opin. Struct. Biol. 38, 102–110 (2016).

167. Nelson, R. J., Ziegelhoffer, T., Nicolet, C., Werner-Washburne, M. & Craig, E. A. The translation machinery and 70 kd heat shock protein cooperate in protein synthesis. Cell 71, 97–105 (1992).

168. Gautschi, M., Mun, A., Ross, S. & Rospert, S. A functional chaperone triad on the yeast ribosome. Proc. Natl. Acad. Sci. 99, 4209–4214 (2002).

169. Willmund, F., Del Alamo, M., Pechmann, S., Chen, T., Albanèse, V., Dammer, E. B., Peng, J. & Frydman, J. The cotranslational function of ribosome-associated Hsp70 in eukaryotic protein homeostasis. Cell 152, 196–209 (2013).

170. Shalgi, R., Hurt, J. A., Krykbaeva, I., Taipale, M., Lindquist, S. & Burge, C. B. Widespread Regulation of Translation by Elongation Pausing in Heat Shock. Mol. Cell 49, 439–452 (2013).

171. Liu, B., Han, Y. & Qian, S. B. Cotranslational Response to Proteotoxic Stress by Elongation Pausing of Ribosomes. Mol. Cell 49, 453–463 (2013).

172. Tuller, T., Carmi, A., Vetsigian, K., Navon, S., Dorfan, Y., Zaborske, J., Pan, T., Dahan, O., Furman, I. & Pilpel, Y. An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010).

173. Sauna, Z. E. & Kimchi-Sarfaty, C. Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12, 683–91 (2011).

174. Janke, C., Magiera, M. M., Rathfelder, N., Taxis, C., Reber, S., Maekawa, H., Moreno-Borchart, A., Doenges, G., Schwob, E., Schiebel, E. & Knop, M. A versatile toolbox for PCR-based tagging of yeast genes: new fluorescent proteins, more markers and promoter substitution cassettes. Yeast 21, 947–62 (2004).

175. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).

176. Cannarozzi, G., Schraudolph, N. N., Faty, M., von Rohr, P., Friberg, M. T., Roth, A. C., Gonnet, P., Gonnet, G. & Barral, Y. A role for codon order in translation dynamics. Cell 141, 355–367 (2010).

177. Hazewinkel, M. Encyclopedia of Mathematics. (Springer Science+Business Media B.V. / Kluwer Academic Publishers).


VITA

Nabeel Ahmed

Education

Ph.D., Bioinformatics and Genomics, The Pennsylvania State University Aug 2019

M.S. (Honors), Biological Sciences, Birla Institute of Technology and Science Jul 2013

M.S. (Technology), Information Systems, Birla Institute of Technology and Science Jul 2013

Publications

1) N. Ahmed, U. A. Friedrich, …, G. Kramer, E. P. O’Brien. “Evolutionarily selected amino acid pairs encode translation-elongation rate information.” (Submitted)

2) A. K. Sharma*, P. Sormanni*, N. Ahmed*, P. Ciryam, U. A. Friedrich, G. Kramer and E. P. O’Brien. “A chemical kinetic basis for measuring initiation and elongation rates from ribosome profiling data.” PLOS Computational Biology, 15(5): e1007070 (2019).

3) N. Ahmed*, P. Sormanni*, P. Ciryam, M. Vendruscolo, C. M. Dobson and E. P. O’Brien. “Identifying A- and P-site locations on ribosome-protected mRNA fragments using Integer Programming.” Scientific Reports, 9:6256 (2019).

4) A. K. Sharma, N. Ahmed and E. P. O’Brien. “Determinants of translation speed are randomly distributed across transcripts resulting in a universal scaling of protein synthesis times.” Phys. Rev. E, 97, 022409 (2018).

5) K. Döring, N. Ahmed, …, E. P. O’Brien, G. Kramer and B. Bukau. “Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-Assisted Folding.” Cell, 170, 298–311.e20 (2017).

6) D. A. Nissley, A. K. Sharma, N. Ahmed, U. A. Friedrich, G. Kramer, B. Bukau and E. P. O’Brien. “Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding.” Nature Communications, 7, 10341 (2016).

*Co-first authors

Invited Oral Presentations

Biophysical Society 62nd Annual Meeting Feb 2018

Penn State Genomics Seminar series Mar 2016 and Sep 2017

Selected Poster Presentations

Triennial Ribosome 2019 meeting Jan 2019

EMBL Symposium ‘The Complex Life of RNA’ Oct 2018

Cold Spring Harbor Laboratory meeting on ‘Translational Control’ Sep 2018

From Computational Biophysics to Systems Biology 2017 May 2017

Selected Awards and Honors

Penn State Huck Institutes of Life Sciences Graduate Travel Award 2018

Penn State Eberly College of Science Braddock Scholarship 2013