a common class of transcripts with 5′-intron …...2016/09/06 · 9 transcripts with 5’...
TRANSCRIPT
CanCenik
1
ACommonClassofTranscriptswith5’-IntronDepletion,DistinctEarlyCoding1
SequenceFeatures,andN1-MethyladenosineModification2
CanCenik1,2,HonNianChua3,4,GuramritSingh2,5,7,8,AbdallaAkef6,MichaelPSnyder1,3
AlexanderF.Palazzo6,MelissaJMoore2,7,8#,andFrederickPRoth3,9,10#4
5
Affiliations:6
1DepartmentofGenetics,StanfordUniversitySchoolofMedicine,Stanford,CA94305,USA72DepartmentofBiochemistryandMolecularPharmacology,UniversityofMassachusettsMedical8School,Worcester,MA01605,USA93DonnellyCentreandDepartmentsofMolecularGeneticsandComputerScience,UniversityofToronto10andLunenfeld-TanenbaumResearchInstitute,MtSinaiHospital,TorontoM5G1X5,Ontario,Canada114DataRobot,Inc.,61ChathamSt.BostonMA02109,USA125DepartmentofMolecularGenetics,CenterforRNABiology,TheOhioStateUniversity,Columbus,OH1343210,USA146DepartmentofBiochemistry,UniversityofToronto,1King’sCollegeCircle,MSBRoom5336,Toronto,15ONM5S1A8,Canada167HowardHughesMedicalInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,17USA188RNATherapeuticsInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,USA199CenterforCancerSystemsBiology(CCSB),Dana-FarberCancerInstitute,Boston02215,MA,USA2010TheCanadianInstituteforAdvancedResearch,TorontoM5G1Z8,ON,Canada21
22
#Correspondenceto:[email protected],[email protected]
24
ShortTitle:Firstintronpositiondefinesatranscriptclass25
Keywords:5’UTRintrons,randomforest,N1methyladenosine,ExonJunctionComplex26
27
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
2
Abstract:1
Intronsarefoundin5’untranslatedregions(5’UTRs)for35%ofallhumantranscripts.2
These5’UTRintronsarenotrandomlydistributed:genesthatencodesecreted,membrane-3
boundandmitochondrialproteinsarelesslikelytohavethem.Curiously,transcriptslacking4
5’UTRintronstendtoharborspecificRNAsequenceelementsintheirearlycodingregions.To5
modelandunderstandtheconnectionbetweencoding-regionsequenceand5’UTRintron6
status,wedevelopedaclassifierthatcanpredict5’UTRintronstatuswith>80%accuracy7
usingonlysequencefeaturesintheearlycodingregion.Thus,theclassifieridentifies8
transcriptswith5’proximal-intron-minus-like-codingregions(“5IM”transcripts).9
Unexpectedly,wefoundthattheearlycodingsequencefeaturesdefining5IMtranscriptsare10
widespread,appearingin21%ofallhumanRefSeqtranscripts.The5IMclassoftranscriptsis11
enrichedfornon-AUGstartcodons,moreextensivesecondarystructurebothprecedingthe12
startcodonandnearthe5’cap,greaterdependenceoneIF4Efortranslation,andassociation13
withER-proximalribosomes.5IMtranscriptsareboundbytheExonJunctionComplex(EJC)at14
non-canonical5’proximalpositions.Finally,N1-methyladenosinesarespecificallyenrichedin15
theearlycodingregionsof5IMtranscripts.Takentogether,ouranalysespointtotheexistence16
ofadistinct5IMclasscomprising~20%ofhumantranscripts.Thisclassisdefinedby17
depletionof5’proximalintrons,presenceofspecificRNAsequencefeaturesassociatedwith18
lowtranslationefficiency,N1-methyladenosinesintheearlycodingregion,andenrichmentfor19
non-canonicalbindingbytheExonJunctionComplex.20
21
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
3
Introduction:1
2
Approximately35%ofallhumantranscriptsharborintronsintheir5’untranslated3
regions(5’UTRs)(Ceniketal.2010;Hongetal.2006).Amonggeneswith5’UTRintrons(5UIs),4
thoseannotatedas“regulatory”aresignificantlyoverrepresented,whilethereisanunder-5
representationofgenesencodingproteinsthataretargetedtoeithertheendoplasmic6
reticulum(ER)ormitochondria(Ceniketal.2011).FortranscriptsthatencodeER-and7
mitochondria-targetedproteins,5UIdepletionisassociatedwithpresenceofspecificRNA8
sequences(Ceniketal.2011;Palazzoetal.2007,2013).Specifically,nuclearexportofan9
otherwiseinefficientlyexportedmicroinjectedmRNAorcDNAtranscriptcanbepromotedby10
anER-targetingsignalsequence-containingregion(SSCRs)ormitochondrialsignalsequence11
codingregion(MSCRs)fromagenelacking5’UTRintrons(Ceniketal.2011;Leeetal.2015).12
However,morerecentstudiessuggestthatmanySSCRshavelittleimpactonnuclearexportfor13
RNAstranscribedinvivo(Leeetal.2015),butratherenhancetranslationinaRanBP2-14
dependentmanner(Mahadevanetal.2013).15
AmongSSCR-andMSCR-containingtranscripts(referredtohereafterasSSCRand16
MSCRtranscripts),~75%lack5’UTRintrons(“5UI–”transcripts)and~25%havethem(“5UI+”17
transcripts).Thesetwogroupshavemarkedlydifferentsequencecompositionsatthe5’ends18
oftheircodingsequences.5UI–transcriptstendtohaveloweradeninecontent(Palazzoetal.19
2007)andusecodonswithfeweruracilsandadeninesthan5UI+transcripts(Ceniketal.20
2011).Theirsignalsequencesalsocontainleucineandargininemoreoftenthanthe21
biochemicallysimilaraminoacidsisoleucineandlysine,respectively.Leucineandarginine22
codonscontainfeweradenineandthyminenucleotides,consistentwithadenineandthymine23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
4
depletion.ThisdepletionisalsoassociatedwiththepresenceofaspecificGC-richRNAmotif1
intheearlycodingregionof5UI–transcripts(Ceniketal.2011).2
Despitesomeknowledgeastotheirearlycodingregionfeatures,keyquestionsabout3
thisclassof5UI–transcriptshaveremainedunanswered:Dotheabovesequencefeatures4
extendbeyondSSCR-andMSCR-containingtranscriptstoother5UI–genes?Do5UI–5
transcriptshavingthesefeaturessharecommonfunctionalorregulatoryfeatures?What6
bindingfactor(s)recognizetheseRNAelements?Amorecompletemodeloftherelationshipof7
earlycodingfeaturesand5UI–statuswouldbegintoaddressthesequestions.8
Here,tobetterunderstandtherelationshipbetweenearlycodingregionfeaturesand9
5UIstatus,weundertookanintegrativemachinelearningapproach.Wereasonedthata10
machinelearningclassifierwhichcouldidentify5UI–transcriptssolelyfromearlycoding11
sequencewouldpotentiallyprovidetwotypesofinsight:First,itcouldsystematicallyidentify12
predictivefeatures.Second,thesubsetof5UI–transcriptswhichcouldbeidentifiedbythe13
classifiermightthenrepresentafunctionallydistincttranscriptclass.Havingdevelopedsuch14
aclassifier,wefoundthatitidentified~21%ofallhumantranscriptsasharboringcoding15
regionscharacteristicof5UI–transcripts.WhilemanyofthesetranscriptsencodeER-and16
mitochondrial-targetedproteins,manyothersencodenuclearandcytoplasmicproteins.This17
classoftranscriptssharescharacteristicssuchasatendencytolack5’proximalintrons,to18
containnon-canonicalExonJunctionComplexbindingsites,tohavemultiplefeatures19
associatedwithlowerintrinsictranslationefficiency,andtohaveanincreasedincidenceofN1-20
methyladenosinemodification.21
22
Results:23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
5
Aclassifierthatpredicts5UIstatususingonlyearlycodingsequenceinformation.1
Tobetterunderstandthepreviously-reportedenigmaticrelationshipbetweencertainearly2
codingregionsequencesandtheabsenceofa5UI,wesoughttomodelthisrelationship.3
Specifically,weusedarandomforestclassifier(Breiman2001)tolearntherelationship4
between5UIabsenceandacollectionof36differentnucleotide-levelfeaturesextractedfrom5
thefirst99nucleotidesofallhumancodingregions(CDS)(Figs1A-C;TableS1;Methods).6
WethenusedalltranscriptsknowntocontainanSSCR(atotalof3743transcriptsclusters;7
Methods),regardlessof5UIstatus,asourtrainingset.Thistrainingconstraintensuredthatall8
inputnucleotidesequencesweresubjecttosimilarfunctionalconstraintsattheproteinlevel.9
Thus,wesoughttoidentifysequencefeaturesthatdifferbetween5UI–and5UI+transcriptsat10
theRNAlevel.11
Ourclassifierassignstoeachtranscripta“5’UTR-intron-minus-predictor”(5IMP)score12
between0and10,wherehigherscorescorrespondtoahigherlikelihoodofbeing5UI–(Fig13
1C).Interestingly,preliminaryrankingofthe5UI–transcriptsby5IMPscorerevealeda14
relationshipbetweenthepositionofthefirstintroninthecodingregionandthe5IMPscore.15
5UI–transcriptsforwhichthefirstintronwasmorethan85ntsdownstreamofthestartcodon16
hadthehighest5IMPscores.Furthermore,thecloserthefirstintronwastothestartcodon,17
thelowerthe5IMPscore(Fig1D).Weexploredthisrelationshipfurtherbytrainingclassifiers18
thatincreasinglyexcludedfromthetrainingset5UI–transcriptsaccordingtothedistanceof19
thefirstintronfromthe5’endofthecodingregion.Thisrevealedthatclassifierperformance,20
asmeasuredbytheareaundertheprecisionrecallcurve(AUPRC),increasedasafunctionof21
thedistancefromstartcodontofirstintrondistance(Methods,Fig1E).Thus,theRNA22
sequencefeaturesweidentifiedasbeingpredictiveof5UI–transcriptsaremoreaccurately23
describedasbeingpredictorsoftranscriptswithout5’-proximalintrons.24
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
6
Tominimizetheimpactoftranscriptsthatmay‘behave’asthoughtheywere5UI+due1
toanintronearlyinthecodingregion,weeliminated5UI–SSCRtranscriptswithafirstintron2
<90ntsdownstreamofthestartcodon(Methods)andgeneratedanewclassifier.3
Discriminativemotiffeatureswerelearnedindependently(Methods),andperformanceofthis4
newclassifierwasgaugedusing10-foldcrossvalidation.Weassessedcross-validation5
performanceintwoways:1)intermsoftheareaunderthereceiveroperatingcurve(AUC)—6
whichcanbethoughtofasameasureofaveragerecallacrossarangeoffalsepositiverates;2)7
intermsofareaundertheprecisionvsrecallcurve(AUPRC),whichcanbethoughtofasthe8
averageprecision(fractionofpredictionswhicharecorrect)acrossarangeofrecallvalues.9
Specifically,theclassifiershowedanAUCof74%andAUPRCof88%(Fig2A,yellowcurves;10
exceedingAUC50%andAUPRC71%,theperformancevalueexpectedofanaïvepredictor).11
Weusedthisoptimizedclassifierforallsubsequentanalyses.12
SSCRtranscriptsexhibitedmarkedlydifferent5IMPscoredistributionsforthe5UI+and13
5UI–subsets(Fig2B).The5UI+scoredistributionwasunimodalwithapeakat~2.4.In14
contrast,the5UI–scoredistributionwasbimodalwithonepeakat~3.6andanotherat~9,15
suggestingtheexistenceofatleasttwounderlying5UI–transcriptclasses.Thepeakatscore16
3.6resembledthe5UI+peak.Alsocontributingtothepeakat3.6isthesetof5UI–transcripts17
harboringanintroninthefirst90ntsoftheCDS(55%ofall5UI–transcripts).Theother18
distincthigh-scoring5UI–class(peakatscore9)iscomposedoftranscriptsthathavespecific19
5UI--predictiveRNAsequenceelementswithintheearlycodingregion.20
Wenextwishedtoevaluatewhetherourclassifierwasdiscriminating5UI+and5UI–21
SSCRtranscriptsusingsignalsthatappearspecificallyintheearlycodingregionasopposedto22
signalsthatappearbroadlyacrossthecodingregion.Todoso,foreverytranscriptwe23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
7
randomlychose99ntsfromtheregiondownstreamofthe3rdexon.The5IMPscore1
distributionsofthese‘3’proximalexon’setswereidenticalfor5UI+and5UI–transcripts(Figs2
2A,and2C),confirmingthatthesequencefeaturesthatdistinguish5UI+and5UI–transcripts3
arespecifictotheearlycodingregion.4
5
RNAelementsassociatedwith5UI–transcriptsarepervasiveinthehumangenome6
HavingtrainedtheclassifieronSSCRtranscripts,wewonderedhowwellitwouldpredictthe7
5UIstatusofothertranscripts.DespitehavingbeentrainedexclusivelyonSSCRtranscripts,8
theclassifierperformedremarkablywellonMSCRtranscripts,achievinganAUCof86%and9
AUPRCof95%(Fig2A,purpleline;Fig2D;ascomparedwith50%and77%,respectively,10
expectedbychance).ThisresultsuggeststhatRNAelementswithinearlycodingregionsof11
5UI–MSCR-transcriptsaresimilartothosein5UI–SSCR-transcriptsdespitedistinctfunctional12
constraintsattheproteinlevel.13
Wenextwonderedwhethertheclassof5UI–transcriptsthatcanbepredictedonthe14
basisofearlycodingregionfeaturesisrestrictedtotranscriptsencodingproteinstraffickedto15
theERormitochondria,orisinsteadamoregeneralclassoftranscripts.Wethereforeasked16
whethertheclassifiercouldpredict5UI–statusintranscriptsthatcontainneitheranSSCRnor17
anMSCR(“S–/M–”transcripts).BecauseunannotatedSSCRscouldconfoundthisanalysis,we18
firstusedSignalP3.0toidentifyS–/M–transcriptsmostlikelytocontainanunannotatedSSCR19
(Bendtsenetal.2004).These‘SignalP+‘transcriptshada5IMPscoredistributioncomparable20
tothoseofknownSSCRandMSCRtranscripts(Fig2E),andtheclassifierworkedwellto21
identifythe5UI–subsetofthesetranscripts(AUC82%andAUPRC95%,Fig2A,lightblue22
line).While5UI+SignalP+transcriptshadpredominantlylow5IMPscores,5UI–SignalP+5IMP23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
8
scoreswerestronglyskewedtowardshigh5IMPscores(peakat~9;Fig2E).Theseresults1
wereconsistentwiththeideathatSignalP+transcriptsdocontainmanyunannotatedSSCRs.2
HavingconsideredSignalP+transcriptsaswellasSSCR-andMSCR-containing3
transcripts,weusedtheclassifiertocalculate5IMPscoresforallremaining“S–/M–/SignalP–”4
transcripts.Althoughtheperformancewasweakeronthisgeneset,itwasstillbetterthan5
expectedofanaivepredictor(Fig2A,greenline).5UI+S–/M–/SignalP–transcriptswere6
stronglyskewedtowardlow5IMPscores(Fig2F).Surprisingly,however,asignificant7
fractionof5UI–S–/M–/SignalP–transcriptshadhigh5IMPscores(~18%).Thus,ourresults8
suggestabroadclassoftranscriptswithearlycodingregionscarryingsequencesignalsthat9
predicttheabsenceofa5’proximalintron,orinotherwords,aclassoftranscriptswith10
5’proximal-intron-minus-likecodingregions.Hereafterwerefertotranscriptsinthisclassas11
“5IM”transcripts.12
Wesoughttoidentifywhatfractionoftranscriptshave5IMPscoresthatexceedwhat13
wouldbeexpectedintheabsenceof5UI--predictiveearlycodingregionsignals.Toestablish14
thisexpectation,weusedtheabove-describednegativecontrolsetofequal-lengthcoding15
sequencesfromoutsideoftheearlycodingregion.Byquantifyingtheexcessofhigh-scoring16
sequencesintherealdistributionrelativetothiscontroldistribution,weestimatethat21%of17
allhumantranscriptsare5IMtranscripts(a5IMPscoreof7.41correspondstoa5%False18
DiscoveryRate;Fig2G).Thesetof5IMtranscriptsdefinedbyourclassifier(TableS2)19
includesmanythatdonotencodeER-targetedormitochondrialproteins.Thedistributionof20
variousclassesofmRNAsamongthe5IMtranscriptswas:38%ER-targeted(SSCRorSignalP+),21
9%mitochondrial(MSCR)and53%otherclasses(S–/M–/SignalP–)(Figure2G).Theseresults22
suggestthatRNA-levelfeaturesprevalentintheearlycodingregionsof5UI–SSCRandMSCR23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
9
transcriptsarealsofoundinothertranscripttypes(Figs2F-G),andthat5IMtranscripts1
representabroadclass.2
3
Functionalcharacterizationof5IMtranscripts4
5IMtranscriptsaredefinedbymRNAsequencefeatures.Hence,wehypothesizedthat5IM5
transcriptsmaybefunctionallyrelatedthroughsharedregulatorymechanismsmediatedby6
thepresenceofthesecommonfeatures.Tothisend,wecollectedlarge-scaledatasets7
representingdiverseattributescoveringsixbroadcategories(seeTableS3foracompletelist):8
(1)Curatedfunctionalannotations--e.g.,GeneOntologyterms,annotationasa‘housekeeping’9
gene,genessubjecttoRNAediting;(2)RNAlocalization--e.g.,todendrites,tomitochondria;10
(3)ProteinandmRNAhalf-life,ribosomeoccupancyandfeaturesthatdecreasestabilityofone11
ormoremRNAisoforms--e.g.,AU-richelements(4)Sequencefeaturesassociatedwith12
regulatedtranslation--e.g.,codonoptimality,secondarystructurenearthestartcodon;(5)13
KnowninteractionswithRNA-bindingproteinsorcomplexessuchasStaufen-1,TDP-43,orthe14
ExonJunctionComplex(EJC)(6)RNAmodifications–i.e.,N1-methyladenosine(m1A).15
Weadjustedformultiplehypothesestestingattwolevels.First,wetookaconservative16
approach(Bonferronicorrection)tocorrectforthenumberoftestedfunctional17
characteristics.Second,someofthefunctionalcategorieswereanalyzedinmoredepthand18
multiplesub-hypothesesweretestedwithinthegivencategory.Inthisin-depthanalysisafalse19
discovery-basedcorrectionwasadopted.Below,allreportedp-valuesremainsignificant(p-20
adjusted<0.05)aftermultiplehypothesistestcorrection.21
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
10
Noassociationsbetween5IMtranscriptsandfeaturesincategories(1),(2)and(3)were1
found,otherthanthealready-knownenrichmentsforER-andmitochondrial-targetedmRNAs.2
However,analysesfortheremainingcategoriesyieldedthesignificantresultsdescribedbelow.3
4
5IMtranscriptshavefeaturessuggestinglowertranslationefficiency5
Translationregulationisamajordeterminantofproteinlevels(VogelandMarcotte2012).To6
investigatepotentialconnectionsbetween5IMtranscriptsandtranslationalregulation,we7
examinedfeaturesassociatedwithtranslation.Featuresfoundtobesignificantwere:8
(I)Secondarystructuresnearthestartcodoncanaffectinitiationratebymodulating9
startcodonrecognition(Parsyanetal.2011).Weobservedapositivecorrelationbetween10
5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe35nucleotidesimmediatelypreceding11
thestartcodon(Fig3A;Spearmanrho=0.39;p<2.2e-16).Thissuggeststhat5IMtranscripts12
haveagreatertendencyforsecondarystructurenearthestartcodon,presumablymakingthe13
startcodonlessaccessible.14
(II)Similarly,secondarystructuresnearthe5’capcanmodulatetranslationby15
hinderingbindingbythe43S-preinitiationcomplextothemRNA(Babendureetal.2006).We16
observedapositivecorrelationbetween5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe17
5’most35nucleotides(Fig3B;Spearmanrho=0.18;p=7.9e-130).Thissuggeststhat5IM18
transcriptshaveagreatertendencyforsecondarystructurenearthe5’cap,presumably19
hinderingbindingbythe43S-preinitiationcomplex.20
(III)eIF4Eoverexpression.TheheterotrimerictranslationinitiationcomplexeIF4F21
(madeupofeIF4A,eIF4EandeIF4G)isresponsibleforfacilitatingthetranslationof22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
11
transcriptswithstrong5’UTRsecondarystructures(Parsyanetal.2011).TheeIF4Esubunit1
bindstothe7mGpppG‘methyl-G’capandtheATP-dependenthelicaseeIF4A(scaffoldedby2
eIF4G)destabilizes5’UTRsecondarystructure(Marintchevetal.2009).Apreviousstudy3
identifiedtranscriptsthatweremoreactivelytranslatedunderconditionsthatpromotecap-4
dependenttranslation(overexpressionofeIF4E)(Larssonetal.2007).Inagreementwiththe5
observationthat5IMtranscriptshavemoresecondarystructureupstreamofthestartcodon6
andnearthe5’cap,transcriptswithhigh5IMPscoresweremorelikelytobetranslationally,7
butnottranscriptionally,upregulateduponeIF4Eoverexpression(Fig3C;WilcoxonRankSum8
Testp=2.05e-22,andp=0.28,respectively).9
(IV)Non-AUGstartcodons.Transcriptswithnon-AUGstartcodonsalsohave10
intrinsicallylowtranslationinitiationefficiencies(HinnebuschandLorsch2012).These11
mRNAsweregreatlyenrichedamongtranscriptswithhigh5IMPscores(Fisher’sExactTestp12
=0.0003;oddsratio=3.9)andhaveamedian5IMPscorethatis3.57higherthanthosewith13
anAUGstart(Fig3D).14
(V)Codonoptimality.Theefficiencyoftranslationelongationisaffectedbycodon15
optimality(HershbergandPetrov2008).Althoughsomeaspectsofthisremaincontroversial16
(CharneskiandHurst2013;Shahetal.2013;ZinshteynandGilbert2013;Gerashchenkoand17
Gladyshev2015),itisclearthatdecodingofcodonsbytRNAswithdifferentabundancescan18
affectthetranslationrateunderconditionsofcellularstress(reviewedinGingoldandPilpel19
2011).WethereforeexaminedthetRNAadaptationindex(tAI),whichcorrelateswithcopy20
numbersoftRNAgenesmatchingagivencodon(dosReisetal.2004).Specifically,we21
calculatedthemediantAIofthefirst99codingnucleotidesofeachtranscript,andfoundthat22
5IMPscorewasnegativelycorrelatedwithtAI(FigS1A;SpearmanCorrelationrho=-0.23;p<23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
12
2e-16mediantAIand5IMPscore).Thiseffectwasrestrictedtotheearlycodingregionsasthe1
negativecontrolsetofrandomlychosensequencesdownstreamofthe3rdexonfromeach2
transcriptdidnotexhibitarelationshipbetween5IMPscoreandcodonoptimality(FigS1B-C).3
Thus,5IMtranscriptsshowreducedcodonoptimalityinearlycodingregions,suggestingthat4
5IMtranscriptshavedecreasedtranslationelongationefficiency.5
Tomorepreciselydeterminewherethecodonoptimalityphenomenonoccurswithin6
theentireearlycodingregion,wegroupedtranscriptsby5IMPscore.Foreachgroup,we7
calculatedthemeantAIatcodons2-33(i.e.,nts4-99).Acrossthisentireregion,5IM8
transcripts(5IMP>7.41;5%FDR)hadsignificantlylowertAIvaluesateverycodonexcept9
codons24and32(Fig3E;WilcoxonRankSumtestHolm-adjustedp<0.05forall10
comparisons).Toeliminatepotentialconfoundingvariables,includingnucleotidecomposition,11
weperformedseveraladditionalcontrolanalyses(Methods);noneofthesealteredthebasic12
conclusionthat5IMtranscriptshavelowercodonoptimalitythannon-5IMtranscriptsacross13
theentireearlycodingregion.14
(VI)RibosomespermRNA.Finally,weexaminedtherelationshipbetween5IMPscore15
andtranslationefficiency,asmeasuredbythesteady-statenumberofribosomespermRNA16
molecule.Tothisend,weusedalargedatasetofribosomeprofilingandRNA-Seqexperiments17
fromhumanlymphoblastoidcelllines(Ceniketal.2015).Fromthis,wecalculatedtheaverage18
numberofribosomesoneachtranscriptandidentifiedtranscriptswithhighorlowribosome19
occupancy(respectivelydefinedbyoccupancyatleastonestandarddeviationaboveorbelow20
themean;seeMethods).5IMtranscriptswereslightlybutsignificantlydepletedinthehigh21
ribosome-occupancycategory(Fig3F;Fisher’sExactTestp=0.0006,oddsratio=1.3).22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
13
Moreover,5IMPscoresexhibitedaweakbutsignificantnegativecorrelationwiththenumber1
ofribosomespermRNAmolecule(Spearmanrho=-0.11;p=5.98e-23).2
Takentogether,alloftheaboveresultsrevealthat5IMtranscriptshavesequence3
featuresassociatedwithlowertranslationefficiency,atthestagesofbothtranslationinitiation4
andelongation.5
6
Non-ERtrafficked5IMtranscriptsareenrichedinER-proximalribosomeoccupancy7
Wenextinvestigatedtherelationshipbetween5IMPscoreandthelocalizationoftranslation8
withincells.Exploringthesubcellularlocalizationoftranslationatatranscriptome-scale9
remainsasignificantchallenge.Yet,arecentstudydescribedproximity-specificribosome10
profilingtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells11
(Janetal.2014).Inthismethod,ribosomesarebiotinylatedbasedontheirproximitytoa12
markerproteinsuchasSec61,whichlocalizestotheERmembrane(Janetal.2014).Foreach13
transcript,theenrichmentforbiotinylatedribosomeoccupancyyieldsameasureofER-14
proximityoftranslatedmRNAs.15
Wereanalyzedthisdatasettoexploretherelationshipbetween5IMPscoresandER-16
proximalribosomeoccupancyinHEK-293cells.Asexpected,transcriptsthatexhibitthe17
highestenrichmentforER-proximalribosomeswereSSCR-containingtranscriptsand18
transcriptswithotherER-targetingsignals.Yet,wenoticedasurprisingpositivecorrelation19
betweenER-proximalribosomeoccupancyand5IMtranscriptswithnoER-targetingevidence20
(Fig4).Thisrelationshipwastrueforbothmitochondrialgenes(Fig4;Spearmanrho=0.43;p21
<2.2e-16),andgeneswithnoevidenceforeitherER-ormitochondrial-targeting(Fig4;22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
14
Spearmanrho=-0.36;p<2.2e-16).Theseresultssuggestthat5IMtranscriptsaremorelikely1
thannon-5IMtranscriptstoengagewithER-proximalribosomes.2
3
5IMtranscriptsarestronglyenrichedinnon-canonicalEJCoccupancysites4
Sharedsequencefeaturesandfunctionaltraitsamong5IMtranscriptscausesonetowonder5
whatcommonmechanismsmightlink5IMsequencefeaturesto5IMtraits.Forexample,5IM6
transcriptsmightshareregulationbyoneormoreRNA-bindingproteins(RBPs).To7
investigatethisideafurther,wetestedforenrichmentof5IMtranscriptsamongthe8
experimentallyidentifiedtargetsof23differentRBPs(includingCLIP-Seq,andvariants;see9
Methods).Onlyonedatasetwassubstantiallyenrichedforhigh5IMPscoresamongtargets:a10
transcriptome-widemapofbindingsitesoftheExonJunctionComplex(EJC)inhumancells,11
obtainedviatandem-immunoprecipitationfollowedbydeepsequencing(RIPiT)(Singhetal.12
2014,2012).TheEJCisamulti-proteincomplexthatisstablydepositedupstreamofexon-13
exonjunctionsasaconsequenceofpre-mRNAsplicing(LeHiretal.2000).RIPiTdata14
confirmedthatcanonicalEJCsites(cEJCsites;sitesboundbyEJCcorefactorsandappearing15
~24ntsupstreamofexon-exonjunctions)occupy~80%ofallpossibleexon-exonjunction16
sitesandarenotassociatedwithanysequencemotif.Unexpectedly,manyEJC-associated17
footprintsoutsideofthecanonical-24regionswereobserved(Fig5A)(Singhetal.2012).18
These‘non-canonical’EJCoccupancysites(ncEJCsites)wereassociatedwithmultiple19
sequencemotifs,threeofwhichweresimilartoknownrecognitionmotifsforSRproteinsthat20
co-purifiedwiththeEJCcoresubunits(Singhetal.2012).Interestingly,anothermotif(Fig5B;21
top)thatwasspecificallyfoundinfirstexonsisnotknowntobeboundbyanyknownRNA-22
bindingprotein(Singhetal.2012).ThismotifwasCG-rich,asequencefeaturethatalso23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
15
defines5IMtranscripts.Thissimilaritypresagesthepossibilityofenrichmentoffirstexon1
ncEJCsitesamong5IMtranscripts.2
PositionanalysisofcalledEJCpeaksrevealedthatwhileonly9%ofcEJCsresideinfirst3
exons,19%ofallncEJCsarefoundthere.Whenweinvestigatedtherelationshipbetween5IMP4
scoresandncEJCsinearlycodingregions,wefoundastrikingcorrespondence—themedian5
5IMPscorewashighestfortranscriptswiththegreatestnumberofncEJCs(Fig5C;Wilcoxon6
RankSumTest;p<0.0001).Whenwerepeatedthisanalysisbyconditioningon5UIstatus,we7
similarlyfoundthatncEJCswereenrichedamongtranscriptswithhigh5IMPscoresregardless8
of5UIstatus(Fisher’sExactTest,p<3.16e-14,oddsratio>2.3;Fig5D).Theseresultssuggest9
thatthestrikingenrichmentofncEJCpeaksinearlycodingregionswasgenerallyapplicableto10
alltranscriptswithhigh5IMPscoresregardlessof5UIpresence.11
TranscriptsharboringN1-methyladenosine(m1A)havehigh5IMPscores12
ItisincreasinglyclearthatribonucleotidebasemodificationsinmRNAsarehighlyprevalent13
andcanbeamechanismforpost-transcriptionalregulation(Fryeetal.2016).OneRNA14
modificationpresenttowardsthe5’endsofmRNAtranscriptsisN1-methyladenosine(m1A)15
(Lietal.2016;Dominissinietal.2016),initiallyidentifiedintotalRNAandrRNAs(Dunn1961;16
Hall1963;KlootwijkandPlanta1973).Intriguingly,thepositionofm1Amodificationshas17
beenshowntobemorecorrelatedwiththepositionofthefirstintronthanwith18
transcriptionalortranslationalstartsites(Figure2gfromDominissinietal.2016).Whenthe19
distanceofm1AstoeachsplicesiteinagivenmRNAwascalculated,thefirstsplicesitewas20
foundtobethenearestfor85%ofm1As(Dominissinietal.2016).When5’UTRintronswere21
present,m1Awasfoundtobenearthefirstsplicesiteregardlessofthepositionofthestart22
codon(Dominissinietal.2016).Giventhat5IMtranscriptsarealsocharacterizedbythe23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
16
positionofthefirstintron,weinvestigatedtherelationshipbetween5IMPscoreandm1ARNA1
modificationmarks.2
Weanalyzedtheunionofpreviouslyidentifiedm1Amodifications(Dominissinietal.3
2016)acrossallcelltypesandconditions.Althoughthereissomeevidencethatthesemarks4
dependoncelltypeandgrowthcondition,itisdifficulttobeconfidentofthecelltypeand5
condition-dependenceofanyparticularmarkgivenexperimentalvariation(seeMethods).6
Nevertheless,wefoundthatmRNAswithm1Amodificationearlyinthecodingregion(first997
nucleotides)hadsubstantiallyhigher5IMPscoresthanmRNAslackingthesemarks(Fig6A;8
WilcoxonRankSumTestp=3.4e-265),andweregreatlyenrichedamong5IMtranscripts(Fig9
6A;Fisher’sExactTestp=1.6e-177;oddsratio=3.8).Inotherwords,thesequencefeatures10
withintheearlycodingregionthatdefine5IMPtranscriptsalsoassociatewithm1A11
modificationintheearlycodingregion.12
Wenextwonderedwhether5IMPscorewasrelatedtom1Amodificationgenerally,or13
onlyassociatedwithm1Amodificationintheearlycodingregion.Indeed,manyofthe14
previouslyidentifiedm1Apeakswerewithinthe5’UTRsofmRNAs(Lietal.2016).15
Interestingly,5IMPscoreswereonlyassociatedwithm1Amodificationintheearlycoding16
region,andnotwithm1Amodificationinthe5’UTR(Fig6B).Thisofferstheintriguing17
possibilitythatthesequencefeaturesthatdefine5IMPtranscriptsareco-localizedwithm1A18
modification.19
20
Discussion:21
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
17
Coordinatingtheexpressionoffunctionallyrelatedtranscriptscanbeachievedbypost-1
transcriptionalprocessessuchassplicing,RNAexport,RNAlocalizationortranslation(Moore2
andProudfoot2009).SetsofmRNAssubjecttoacommonregulatorytranscriptionalprocess3
canexhibitcommonsequencefeaturesthatdefinethemtobeaclass.Forexample,transcripts4
subjecttoregulationbyparticularmiRNAstendtosharecertainsequencesintheir3’UTRsthat5
arecomplementarytotheregulatorymiRNA(AmeresandZamore2013).Similarly,transcripts6
thatsharea5’terminaloligopyrimidinetractarecoordinatelyregulatedbymTORand7
ribosomalproteinS6kinase(Meyuhas2000).Herewequantitativelydefine‘5IM’transcripts8
asaclassthatsharescommonsequenceelementsandfunctionalproperties.Weestimatethe9
5IMclasstocomprise21%ofallhumantranscripts.10
Whereas35%ofhumantranscriptshaveoneormore5’UTRintrons,themajorityof11
5IMtranscriptshaveneithera5’UTRintronnoranintroninthefirst90ntsoftheORF.Other12
sharedfeaturesof5IMtranscriptsincludesequencefeaturesassociatedwithlowtranslation13
initiationrates.Theseare:(1)atendencyforRNAsecondarystructureintheregion14
immediatelyprecedingthestartcodon(Fig3A),andnearthe5’cap(Fig3B);(2)translational15
upregulationuponoverexpressionofeIF4E(Fig3C);and(3)morefrequentuseofnon-AUG16
startcodons(Fig3D).Alsoconsistentwithlowintrinsictranslationefficiencies,5IM17
transcriptsadditionallytendtodependonlessabundanttRNAstodecodethebeginningofthe18
openreadingframe(Fig3E).19
WehadpreviouslyreportedthattranscriptsencodingproteinswithER-and20
mitochondrial-targetingsignalsequences(SSCRsandMSCRs,respectively)areover-21
representedamongthe65%oftranscriptslacking5’UTRintrons(Ceniketal.2011).22
Transcriptsinthissetareenrichedforthesequencefeaturesdetectedbyour5IMclassifier.By23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
18
examiningtheseenrichedsequencefeatures,weshowedthatthe5IMclassextendsbeyond1
mRNAsencodingmembraneproteins.Janetal.recentlydevelopedatranscriptome-scale2
methodtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells3
(Janetal.2014).Asexpected,transcriptsknowntoencodeER-traffickedproteinswerehighly4
enrichedforER-proximalribosomeoccupancy.However,theirdataalsoshowedmany5
transcriptsencodingnon-ERtraffickedproteinstoalsobeengagedwithER-proximal6
ribosomes(ReidandNicchitta2015b).Similarly,severalotherstudieshavesuggestedacritical7
roleofER-proximalribosomesintranslatingseveralcytoplasmicproteins(ReidandNicchitta8
2015a).Here,wefoundthat5IMtranscriptsincludingthosethatarenotER-traffickedor9
mitochondrialweresignificantlymorelikelytoexhibitbindingtoER-proximalribosomes10
(Figs4A-B).11
InadditiontoribosomesdirectlyresidentontheER,aninterestingpossibilityisthe12
presenceofapoolofperi-ERribosomes(Janetal.2015;ReidandNicchitta2015a).13
Associationof5IMtranscriptswithsuchaperi-ERribosomepoolcouldpotentiallyexplainthe14
observedcorrelationof5IMstatuswithbindingtoER-proximalribosomes.TheERis15
physicallyproximaltomitochondria(RowlandandVoeltz2012),soperi-ERribosomesmay16
includethosetranslatingmRNAsonmitochondria(i.e.,mRNAswithMSCRs)(Sylvestreetal.17
2003).However,evenwhentranscriptscorrespondingtoER-traffickedandmitochondrial18
proteinswereexcludedfromconsideration,ER-proximalribosomeenrichmentand5IMP19
scoreswerehighlycorrelated(Fig4A).Thusanothersharedfeatureof5IMtranscriptsistheir20
translationonorneartheERregardlessoftheultimatedestinationoftheencodedprotein.21
Inanattempttoidentifyacommonfactorbinding5IMtranscripts,weaskedwhether22
5IMtranscriptswereenrichedamongtheexperimentallyidentifiedtargetsof23different23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
19
RBPs.OnlyoneRBPemerged--theexonjunctioncomplex(EJC).Specifically,weobserveda1
dramaticenrichmentofnon-canonicalEJC(ncEJC)bindingsiteswithintheearlycodingregion2
of5IMtranscripts.Further,theCG-richmotifidentifiedforncEJCsinfirstexonsisstrikingly3
similartotheCG-richmotifenrichedinthefirstexonsof5IMtranscripts(Fig5B).Previous4
workimplicatedRanBP2,aproteinassociatedwiththecytoplasmicfaceofthenuclearpore,as5
abindingfactorforsomeSSCRs(Mahadevanetal.2013).Thisfindingsuggeststhatnuclear6
poreproteinsmayinfluenceEJCoccupancyonthesetranscripts.7
EJCdepositionduringtheprocessofpre-mRNAsplicingenablesthenuclearhistoryof8
anmRNAtoinfluencepost-transcriptionalprocessesincludingmRNAlocalization,translation9
efficiency,andnonsensemediateddecay(Changetal.2007;KervestinandJacobson2012;10
Choeetal.2014).WhilecanonicalEJCbindingoccursatafixeddistanceupstreamofexon-exon11
junctionsandinvolvesdirectcontactbetweenthesugar-phosphatebackboneandtheEJCcore12
anchoringproteineIF4AIII,ncEJCbindingsiteslikelyreflectstableengagementbetweenthe13
EJCcoreandothermRNPproteins(e.g.,SRproteins)recognizingnearbysequencemotifs.14
AlthoughsomeRBPswereidentifiedforncEJCmotifsfoundininternalexons(Singhetal.15
2014,2012),todatenocandidateRBPhasbeenidentifiedfortheCG-richncEJCmotiffoundin16
thefirstexon.IfthismotifdoesresultfromanRBPinteraction,itislikelytobeoneormoreof17
the~70proteinsthatstablyandspecificallybindtotheEJCcore(Singhetal.2012).18
Finally,weobservedadramaticenrichmentform1Amodificationsamong5IM19
transcripts,withspecificenrichmentform1Amodificationsintheearlycodingregion.Given20
thisstrikingenrichmentperhapsitisperhapsnotsurprisingthatm1AcontainingmRNAswere21
alsoshowntohavemorestructured5’UTRsthatareGC-richcomparedtom1AlackingmRNAs22
(Dominissinietal.2016).Similarto5IMtranscripts,m1AcontainingmRNAswerefoundto23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
20
decoratestartcodonsthatappearinahighlystructuredcontext.WhileALKBH3hasbeen1
identifiedasaproteinthatcandemethylatem1A,itiscurrentlyunknownwhetherthereare2
anyproteinsthatcanspecificallyactas“readers”ofm1A.Recentstudieshavebeguntoidentify3
suchreadersforothermRNAmodificationssuchasYTHDF1,YTHDF2,WTAPandHNRNPA2B14
(Pingetal.2014;Liuetal.2014;Wangetal.2014,2015;Alarcónetal.2015).Ourstudy5
highlightsapossiblelinkbetweennon-canonicalEJCbindingandm1A.Hence,ourresultsyield6
theintriguinghypothesisthatoneormoreofthe~70proteinsthatstablyandspecificallybind7
totheEJCcorecanfunctionasanm1Areader.Futureworkinvolvingdirectedexperiments8
wouldbeneededtotestthishypothesis.9
Giventhat5IMtranscriptsareenrichedforER-targetedandmitochondrialproteins,it10
isplausiblethattheobservedfunctionalcharacteristicsof5IMtranscriptsaredrivensolelyby11
SSCRandMSCR-containingtranscripts.Hence,werepeatedallanalysesforthesubclassesof12
5IMtranscripts(MSCR-containing,SSCR-containing,S–/M–/SignalP+,orS–/M–/SignalP–).We13
foundtheobservedassociationsremainedstatisticallysignificantandhadthesamedirection14
ofeffect,evenaftereliminatingSSCR-andMSCR-containingtranscripts,despitethefactthatall15
trainingofthe5IMclassifierwasperformedonlyusingSSCRtranscripts.Wealsofoundthat16
5IMPscorewasequallyormorestronglyassociatedwitheachofthefunctionalcharacteristics17
comparedtothe5UIstatus.Inconclusion,themolecularassociationswereportapplyto5IM18
transcriptsasawhole,andarenotdrivensolelybythesubsetof5IMtranscriptsencodingER-19
ormitochondria-targetingsignalpeptides,andseemtoindicatesharedfeaturesbeyondsimple20
lackofa5’UTRintron.21
Anintriguingpossibilityisthat5IMtranscriptfeaturesassociatedwithlowerintrinsic22
translationefficiencymaytogetherenablegreater‘tunability’of5IMtranscriptsatthe23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
21
translationstage.Regulatedenhancementorrepressionoftranslation,for5IMtranscripts,1
couldallowforrapidchangesinproteinlevels.Thereareanalogiestothisscenarioin2
transcriptionalregulation,whereinhighlyregulatedgenesoftenhavepromoterswithlow3
baselinelevelsthatcanberapidlymodulatedthroughtheactionofregulatorytranscription4
factors.Asmoreribosomeprofilingstudiesarepublishedexaminingtranslationalresponses5
transcriptome-wideundermultipleperturbations,conditionsunderwhich5IMtranscriptsare6
translationallyregulatedmayberevealed.Directedexperimentswillbeneededtotest7
translationalfeaturesof5IMtranscriptshypothesizedviathiscomputationalanalysis.8
Takentogether,ouranalysesrevealtheexistenceofadistinct‘5IM’classcomprising9
21%ofhumantranscripts.Thisclassisdefinedbydepletionof5’proximalintrons,presence10
ofspecificRNAsequencefeaturesassociatedwithlowtranslationefficiency,non-canonical11
bindingbytheExonJunctionComplexandanenrichmentforN1-methyladenosine12
modification.13
14
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
22
MaterialsandMethods:1
DatasetsandAnnotations2
HumantranscriptsequencesweredownloadedfromtheNCBIhumanReferenceGene3
Collection(RefSeq)viatheUCSCtablebrowser(hg19)onJun252010(Kentetal.2002;Pruitt4
etal.2005).Transcriptswithfewerthanthreecodingexons,andtranscriptswherethefirst995
codingnucleotidesstraddlemorethantwoexonswereremovedfromfurtherconsideration.6
Thecriteriaforexclusionofgeneswithfewerthanthreecodingexonswastoensurethatthe7
analysisofdownstreamregionswaspossibleforallgenesthatwereusedinouranalysisof8
earlycodingregions.Specifically,thedownstreamregionswereselectedrandomlyfrom9
downstreamofthe3rdexons.Hence,geneswithfewerexonswouldnotbeabletocontributea10
downstreamregionpotentiallycreatingaskewinrepresentation.Intotaltherewere~300011
genesthatwereremovedfromconsiderationduetothisfilter.Therefore,ourclassifieris12
limitedinitsabilitytoassesstranscriptsfromthesegenes.However,theperformance13
measuresreportedinourmanuscriptarerobusttoexclusionofthesegenes,inthesensethat14
thesameclassoftranscriptswasusedinbothtrainingandtestdatasets.15
Transcriptswereclusteredbasedonsequencesimilarityinthefirst99coding16
nucleotides.Specifically,eachtranscriptpairwasalignedusingBLASTwiththeDUSTfilter17
disabled(Altschuletal.1990).TranscriptpairswithBLASTE-values<1e-25weregrouped18
intotranscriptclusters.Intotal,therewere15576transcriptclustersthatwereconsidered19
further.Theseclustersthatweresubsequentlyassignedtooneoffourcategories:MSCR-20
containing,SSCR-containing,S–/MSCR–SignalP+,orS–/MSCR–SignalP–asfollows:21
MSCR-containingtranscriptswereannotatedusingMitoCartaandothersourcesas22
describedin(Ceniketal.2011).SSCR-containingtranscriptswerethesetoftranscripts23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
23
annotatedtocontainsignalpeptidesintheEnsemblGenev.58annotations,whichwere1
downloadedthroughBiomartonJun252010.FortranscriptswithoutanannotatedMSCRor2
SSCR,thefirst70aminoacidswereanalyzedusingSignalP3.0(Bendtsenetal.2004).Using3
theeukaryoticpredictionmode,transcriptswereassignedtotheS–/MSCR–SignalP+category4
ifeithertheHiddenMarkovModelortheArtificialNeuralNetworkclassifiedthesequenceas5
signalpeptidecontaining.AllremainingtranscriptclusterswereassignedtotheS–/MSCR–6
SignalP–category.Thenumberoftranscriptclustersineachofthefourcategorieswas:37437
SSCR,737MSCR,696S–/MSCR–SignalP+,10400S–/MSCR–SignalP–.8
Foreachtranscriptcluster,wealsoconstructedmatchedcontrolsequences.Control9
sequenceswerederivedfromasinglerandomlychosenin-frame‘window’downstreamofthe10
3rdexonfromtheevaluatedtranscripts.Ifanevaluatedtranscripthadfewerthan9911
nucleotidesdownstreamofthe3rdexon,nocontrolsequencewasextracted.5UIlabelsand12
transcriptclusteringforthecontrolsequenceswereinheritedfromtheevaluatedtranscript.13
Therationaleforthisdecisionisthatouranalysisdependsonthepositionofthefirstintron;14
hencegeneswithfewerthantwoexonsneedtobeexcluded,asthesewillnothaveintrons.We15
furtherrequiredthematchedcontrolsequencestofalldownstreamoftheearlycodingregion.16
Inthevastmajorityofcasesthethirdexonfelloutsidethefirst99nucleotidesofthecoding17
region,makingthisaconvenientcriterionbywhichtochoosecontrolregions.18
SequenceFeaturesandMotifDiscovery19
36sequencefeatureswereextractedfromeachtranscript(TableS1).Thesequence20
featuresincludedtheratioofargininestolysines,theratioofleucinestoisoleucines,adenine21
content,lengthofthelongeststretchwithoutadenines,preferenceagainstcodonsthatcontain22
adeninesorthymines.ThesefeatureswerepreviouslyfoundtobeenrichedinSSCR-23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
24
containingandcertain5UI–transcripts(Ceniketal.2011;Palazzoetal.2007).Inaddition,we1
extractedratiosbetweenseveralotheraminoacidpairsbasedonhaving2
biochemical/evolutionarysimilarity,i.e.havingpositivescores,accordingtotheBLOSUM623
matrix(HenikoffandHenikoff1992).Toavoidextremeratiosgiventherelativelyshort4
sequencelength,pseudo-countswereaddedtoaminoacidratiosusingtheirrespective5
genome-wideprevalence.6
Inaddition,weusedthreepublishedmotiffindingalgorithms(AlignACE,DEME,and7
MoAN(Rothetal.1998;RedheadandBailey2007;Valenetal.2009))todiscoverRNA8
sequencemotifsenrichedamong5UI–transcripts.AlignACEimplementsaGibbssampling9
approachandisoneofthepioneeringeffortsinmotifdiscovery(Rothetal.1998).We10
modifiedtheAlignACEsourcecodetorestrictmotifsearchestoonlytheforwardstrandofthe11
inputsequencestoenableRNAmotifdiscovery.DEMEandMoANadoptdiscriminative12
approachestomotiffindingbysearchingformotifsthataredifferentiallyenrichedbetween13
twosetsofsequences(RedheadandBailey2007;Valenetal.2009).MoANhastheadditional14
advantageofdiscoveringvariablelengthmotifs,andcanidentifyco-occurringmotifswiththe15
highestdiscriminativepower.16
Intotal,sixmotifswerediscoveredusingthethreemotiffindingalgorithms(TableS1).17
Positionspecificscoringmatricesforallmotifswereusedtoscorethefirst99-lpositionsin18
eachsequence,wherelisthelengthofthemotif.Weassessedthesignificanceofeachmotif19
instancebycalculatingthep-valueofenrichment(Fisher’sExactTest)among5UI–transcripts20
consideringalltranscriptswithamotifinstanceachievingaPSSMscoregreaterthanequalto21
theinstanceunderconsideration.Thesignificancescoreandpositionofthetwobestmotif22
instanceswereusedasfeaturesfortheclassifier(TableS1).23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
25
5IMClassifierTrainingandPerformanceEvaluation1
WemodifiedanimplementationoftheRandomForestclassifier(Breiman2001)to2
modeltherelationshipbetweensequencefeaturesintheearlycodingregionandtheabsence3
of5’UTRintrons(5UIs).Thisclassifierdiscriminatestranscriptswith5’proximal-intron-4
minus-like-codingregionsandhenceisnamedthe‘5IM’classifier.Thetrainingsetforthe5
classifierwascomposedofSSCRtranscriptsexclusively.Thereweretworeasonstorestrict6
modelconstructiontoSSCRtranscripts:1)weexpectedthepresenceofspecificRNAelements7
asafunctionof5UIpresencebasedonourpreviouswork(Ceniketal.2011);and2)we8
wantedtorestrictmodelbuildingtosequencesthathavesimilarfunctionalconstraintsatthe9
proteinlevel.10
Weobservedthat5UI–transcriptswithintronsproximalto5’endofthecodingregion11
havesequencecharacteristicssimilarto5UI+transcripts(Fig1D).Tosystematically12
characterizethisrelationship,webuiltdifferentclassifiersusingtrainingsetsthatexcluded13
5UI–transcriptswithacodingregionintronpositionedatincreasingdistancesfromthestart14
codon.Weevaluatedtheperformanceofeachclassifierusing10-foldcrossvalidation.15
Giventhatalargenumberofmotifdiscoveryiterationswereneeded,wesoughtto16
reducethecomputationalburden.Weisolatedasubsetofthetrainingexamplestobeused17
exclusivelyformotiffinding.Motifdiscoverywasperformedonceusingthissetofsequences,18
andthesamemotifswereusedineachfoldofthecrossvalidationforalltheclassifiers.19
Imbalancesbetweenthesizesofpositiveandnegativetrainingexamplescanleadto20
detrimentalclassificationperformance(WangandYao2012).Hence,webalancedthetraining21
setsizeof5UI–and5UI+transcriptsbyrandomlysamplingfromthelargerclass.We22
constructed10sub-classifierstoreducesamplingbias,andforeachtestexample,the23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
26
predictionscorefromeachsubclassifierwassummedtoproduceacombinedscorebetween01
and10.Fortherestoftheanalyses,weusedtheclassifiertrainedusing5UI–transcriptswhere2
firstcodingintronfallsoutsidethefirst90codingnucleotides(Fig1E).3
Weevaluatedclassifierperformanceusinga10-foldcrossvalidationstrategyforSSCR-4
containingtranscripts(i.e.thetrainingset).Ineachfoldofthecross-validation,themodelwas5
trainedwithoutanyinformationfromtheheld-outexamples,includingmotifdiscovery.Forall6
theothertranscriptsandthecontrolsets(seeabove),the5IMPscoreswerecalculatedusing7
theclassifiertrainedusingSSCRtranscriptsasdescribedabove.5IMPscoredistributionfor8
thecontrolsetwasusedtocalculatetheempiricalcumulativenulldistribution.Usingthis9
function,wedeterminedthep-valuecorrespondingtothe5IMPscoreforalltranscripts.We10
correctedformultiplehypothesestestingusingtheqvalueRpackage(Storey2003).Basedon11
thisanalysis,weestimatethata5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate12
andsuggestthat21%ofallhumantranscriptscanbeconsideredas5IMtranscripts.13
Whilethetheoreticalrangeof5IMPscoresis0-10,thehighestobserved5IMPis9.855.14
Wenotethatforallfiguresthatdepict5IMPscoredistributions,wedisplayedtheentire15
theoreticalrangeof5IMPscores(0-10).16
FunctionalCharacterizationof5IMTranscripts:17
Wecollectedgenome-scaledatasetfrompublicallyavailabledatabasesandfrom18
supplementaryinformationprovidedinselectedarticles.Forallanalyzeddatasets,wefirst19
convertedallgene/transcriptidentifiers(IDs)intoRefSeqtranscriptIDsusingtheSynergizer20
webserver(BerrizandRoth2008).Ifadatasetwasgeneratedusinganon-humanspecies(ex.21
TargetsidentifiedbyTDP-43RNAimmunoprecipitationinratneuronalcells),weused22
Homologenerelease64(downloadedonSep282009)toidentifythecorrespondingortholog23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
27
inhumans.Ifatleastonememberofatranscriptclusterwasassociatedwithafunctional1
phenotype,weassignedtheclustertothepositivesetwithrespecttothefunctionalphenotype.2
Ifmorethanonememberofaclusterhadthefunctionalphenotype,weonlyretainedonecopy3
oftheclusterunlesstheydifferedinaquantitativemeasurement.Forexample,considertwo4
hypotheticaltranscripts:NM_1andNM_2thatwereclusteredtogetherandhavea5IMPscore5
of8.5.IfNM_1hadanmRNAhalf-lifeof2hrwhileNM_2’shalf-lifewas1hrthanwesplitthe6
clusterwhilepreservingthe5IMPscoreforbothNM_1andNM_2.7
Oncethetranscriptswerepartitionedbasedonthefunctionalphenotype,werantwo8
statisticaltests:1)Fisher’sExactTestforenrichmentof5IMtranscriptswithinthefunctional9
category;2)WilcoxonRankSumTesttocompare5IMPscoresbetweentranscriptspartitioned10
bythefunctionalphenotype.Additionally,fordatasetswhereaquantitativemeasurementwas11
available(ex.mRNAhalf-life),wecalculatedtheSpearmanrankcorrelationbetween5IMP12
scoresandthequantitativevariable.Intheseanalyses,weassumedthatthetestspacewasthe13
entiresetofRefSeqtranscripts.Forallphenotypeswhereweobservedapreliminary14
statisticallysignificantresult,wefollowedupwithmoredetailedanalysesdescribedbelow.15
AnalysisofFeaturesAssociatedwithTranslation:16
Foreachtranscript,wepredictedthepropensityforsecondarystructureprecedingthe17
translationstartsiteandthe5’cap.Specifically,weextracted35nucleotidesprecedingthe18
translationstartsiteorthefirst35nucleotidesofthe5’UTR.Ifa5’UTRisshorterthan3519
nucleotides,thetranscriptwasremovedfromtheanalysis.hybrid-ss-minutility(UNAFold20
packageversion3.8)withdefaultparameterswasusedtocalculatetheminimumfolding21
energy(MarkhamandZuker2008).22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
28
CodonoptimalitywasmeasuredusingthetRNAAdaptationIndex(tAI),whichisbased1
onthegenomiccopynumberofeachtRNA(dosReisetal.2004).tAIforallhumancodons2
weredownloadedfromTulleretal.2010TableS1.tAIprofilesforthefirst30aminoacids3
werecalculatedforalltranscripts.Codonoptimalityprofilesweresummarizedforthefirst304
aminoacidsforeachtranscriptorbyaveragingtAIateachcodon.5
Wecarriedouttwocontrolexperimentstotestwhethertheassociationbetween5IMP6
scoreandtAIcouldbeexplainedbyconfoundingvariables.First,wepermutedthenucleotides7
inthefirst90ntsandobservednorelationshipbetween5IMPscoreandmeantAIwhenthese8
permutedsequenceswereused(FigS1).Second,weselectedrandomin-frame99nucleotides9
from3rdexontotheendofthecodingregionandfoundnosignificantdifferencesintAI(Fig10
S1).TheseresultssuggestthattherelationshipbetweentAIand5IMPscoreisconfinedtothe11
first30aminoacidsandisnotexplainedbysimpledifferencesinnucleotidecomposition.12
RibosomeprofilingandRNAexpressiondataforhumanlymphoblastoidcells(LCLs)13
weredownloadedfromNCBIGEOdatabaseaccessionnumber:GSE65912.Translation14
efficiencywascalculatedaspreviouslydescribed(Ceniketal.2015).Mediantranslation15
efficiencyacrossthedifferentcell-typeswasusedforeachtranscript. 16
AnalysisofProximitySpecificRibosomeProfilingData:17
WedownloadedproximityspecificribosomeprofilingdataforHEK293cellsfromJan18
etal.2014;TableS6.WeconvertedUCSCgeneidentifierstoHGNCsymbolsusingg:Profiler19
(Reimandetal.2011).WeretainedallgeneswithanRPKM>5ineitherinputorpulldownand20
requiredthatatleast30readsweremappedineitherofthetwolibraries.WeusedER-21
targetingevidencecategories“secretome”,“phobius”,“TMHMM”,“SignalP”,“signalSequence”,22
“signalAnchor”from(Janetal.2014)toannotategenesashavingER-targetingevidence.The23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
29
genesthatdidnothaveanyER-targetingevidenceor“mitoCarta”/“mito.GO”annotationswere1
deemedasthesetofgeneswithnoER-targetingormitochondrialevidence.Wecalculatedthe2
log2oftheratiobetweenER-proximalribosomepulldownRPKMandcontrolRPKMasthe3
measureofenrichmentforER-proximalribosomeoccupancy(asinJanetal.2014).Amoving4
averageofthisratiowascalculatedforgenesgroupedbytheir5IMPscore.Forthiscalculation,5
weusedbinsof30mitochondrialgenesor100geneswithnoevidenceofER-ormitochondrial6
targeting.7
AnalysisofGenome-wideBindingSitesofExonJunctionComplex:8
Dr.GeneYeoandGabrielPrattgenerouslyshareduniformlyprocessedpeakcallsfor9
experimentsidentifyinghumanRNAbindingproteintargets.Thesedatasetsincludevarious10
CLIP-SeqdatasetsanditsvariantssuchasiCLIP.Atotal49datasetsfrom22factorswere11
analyzed.Thesefactorswere:hnRNPA1,hnRNPF,hnRNPM,hnRNPU,Ago2,hnRNPU,HuR,12
IGF2BP1,IGF2BP2,IGF2BP3,FMR1,eIF4AIII,PTB,IGF2BP1,Ago3,Ago4,MOV10,Fip1,CF13
Im68,CFIm59,CFIm25,andhnRNPA2B1.Weextractedthe5IMPscoresforalltargetsofeach14
RBP.WecalculatedtheWilcoxonRankSumteststatisticcomparingthe5IMPscore15
distributionofthetargetsofeachRBPtoallothertranscriptswith5IMPscores.Noneofthe16
testedRBPtargetsetshadanadjustedp-value<0.05andamediandifferencein5IMPscore>17
1whencomparedtonon-targettranscripts.18
Inaddition,weusedRNA:proteinimmunoprecipitationintandem(RIPiT)datato19
determineExonJunctionComplex(EJC)bindingsites(Singhetal.2014,2012).Weanalyzed20
thecommonpeaksfromtheY14-Magohimmunoprecipitationconfiguration(Singhetal.2012;21
Kucukuraletal.2013).CanonicalEJCbindingsitesweredefinedaspeakswhoseweighted22
centerwere15to32nucleotidesupstreamofanexon-intronboundary.Allremainingpeaks23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
30
weredeemedas“non-canonical”EJCbindingsites.Weextractedallnon-canonicalpeaksthat1
overlappedthefirst99nucleotidesofthecodingregionandrestrictedouranalysisto2
transcriptsthathadanRPKMgreaterthanoneinthematchedRNA-Seqdata.3
Analysisofm1Amodifiedtranscripts4
WedownloadedthelistofRNAsobservedtocontainm1AfromLietal.(2016)and5
Dominissinietal.(2016).RefSeqtranscriptidentifierswereconvertedtoHGNCsymbolsusing6
g:Profiler(Reimandetal.2011).Theoverlapbetweenthetwodatasetswasdeterminedusing7
HGNCsymbols.m1Amodificationsthatoverlapthefirst99nucleotidesofthecodingregion8
weredeterminedusingbedtools(QuinlanandHall2010).9
Lietal.(2016)identified600transcriptswithm1AmodificationinnormalHEK29310
cells.Ofthese,368transcriptswerenotfoundtocontainthem1AmodificationinHEK293cells11
byDominissinietal.(2016).Yet,81%ofthesewerefoundtobem1Amodifiedinothercell12
types.Lietal.(2016)alsoanalyzedm1AuponH2O2treatmentandserumstarvationinHEK29313
cellsandidentifiedmanym1Amodificationsthatareonlyfoundthesestressconditions.14
However,20%of371transcriptsharboringstress-inducedm1Amodificationswerefoundin15
normalHEK293cellsbyDominissinietal.(2016).Takentogether,theseanalysessuggestthat16
transcriptome-widem1Amapsremainincomplete.Hence,weanalyzedthe5IMPscoresofall17
mRNAswithm1Aacrosscelltypesandconditions.WereportedresultsusingtheDominissini18
etal.datasetbutthesameconclusionsweresupportedbym1AmodificationsfromLietal.19
20
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
31
Acknowledgements:1
WewouldliketothankGeneYeo,andGabrielPrattforsharingpeakcallsforRNA-2
bindingproteintargetidentificationexperiments(CLIP,PAR-CLIP,iCLIP).WethankElif3
SarinayCenik,BaşarCenik,andAlperKüçükuralforhelpfuldiscussions.F.P.R.andH.N.C.were4
supportedbytheCanadaExcellenceResearchChairsProgram.Thisworkwassupportedin5
partbyNIHgrantsGM50037(M.J.M.),HG001715andHG004233(F.P.R.),theKrembil6
Foundation,anOntarioResearchFundaward,andtheAvonFoundation(F.P.R.).A.F.P.was7
supportedbyagrantfromtheCanadianInstitutesofHealthResearchtoA.F.P.(FRN102725).8
Authorcontributions:CC,MJM,andFPRdesignedthestudy.CCandHNCimplementedthe9
randomforestclassifierandcarriedoutallanalyses.GSandAAperformedinitialexperiments.10
MPSandAFPprovidedcriticalfeedback.MJMandFPRjointlysupervisedtheproject.CC,FPR11
andMJMwrotethemanuscriptwithinputfromallauthors.12
13
14
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
32
References:1
AlarcónCR,GoodarziH,LeeH,LiuX,TavazoieS,TavazoieSF.2015.HNRNPA2B1IsaMediatorof2m(6)A-DependentNuclearRNAProcessingEvents.Cell162:1299–1308.3
AltschulSF,GishW,MillerW,MyersEW,LipmanDJ.1990.Basiclocalalignmentsearchtool.JMolBiol4215:403–410.5
AmeresSL,ZamorePD.2013.DiversifyingmicroRNAsequenceandfunction.NatRevMolCellBiol14:6475–488.7
BabendureJR,BabendureJL,DingJ-H,TsienRY.2006.ControlofmammaliantranslationbymRNA8structurenearcaps.RNA12:851–861.9
BendtsenJD,NielsenH,vonHeijneG,BrunakS.2004.Improvedpredictionofsignalpeptides:SignalP103.0.JMolBiol340:783–795.11
BerrizGF,RothFP.2008.TheSynergizerservicefortranslatinggene,proteinandotherbiological12identifiers.Bioinformatics24:2272–2273.13
BreimanL.2001.RandomForests.MachLearn45:5–32.14
CenikC,ChuaHN,ZhangH,TarnawskySP,AkefA,DertiA,TasanM,MooreMJ,PalazzoAF,RothFP.152011.Genomeanalysisrevealsinterplaybetween5’UTRintronsandnuclearmRNAexportfor16secretoryandmitochondrialgenes.PLoSGenet7:e1001366.17
CenikC,DertiA,MellorJC,BerrizGF,RothFP.2010.Genome-widefunctionalanalysisofhuman5’18untranslatedregionintrons.GenomeBiol11:R29.19
CenikC,SarinayCenikE,ByeonGW,GrubertF,CandilleSI,SpacekD,AlsallakhB,TilgnerH,ArayaCL,20TangH,etal.2015.IntegrativeanalysisofRNA,translationandproteinlevelsrevealsdistinct21regulatoryvariationacrosshumans.GenomeRes.http://dx.doi.org/10.1101/gr.193342.115.22
ChangY-F,ImamJS,WilkinsonMF.2007.Thenonsense-mediateddecayRNAsurveillancepathway.23AnnuRevBiochem76:51–74.24
CharneskiCA,HurstLD.2013.Positivelychargedresiduesarethemajordeterminantsofribosomal25velocity.PLoSBiol11:e1001508.26
ChoeJ,RyuI,ParkOH,ParkJ,ChoH,YooJS,ChiSW,KimMK,SongHK,KimYK.2014.eIF4AIIIenhances27translationofnuclearcap-bindingcomplex-boundmRNAsbypromotingdisruptionofsecondary28structuresin5’UTR.ProcNatlAcadSciUSA111:E4577–86.29
DominissiniD,NachtergaeleS,Moshitch-MoshkovitzS,PeerE,KolN,Ben-HaimMS,DaiQ,DiSegniA,30Salmon-DivonM,ClarkWC,etal.2016.ThedynamicN(1)-methyladenosinemethylomein31eukaryoticmessengerRNA.Nature530:441–446.32
dosReisM,SavvaR,WernischL.2004.Solvingtheriddleofcodonusagepreferences:atestfor33translationalselection.NucleicAcidsRes32:5036–5044.34
DunnDB.1961.Theoccurrenceof1-methyladenineinribonucleicacid.BiochimBiophysActa46:198–35200.36
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
33
FryeM,JaffreySR,PanT,RechaviG,SuzukiT.2016.RNAmodifications:whathavewelearnedand1whereareweheaded?NatRevGenet17:365–372.2
GerashchenkoMV,GladyshevVN.2015.Translationinhibitorscauseabnormalitiesinribosome3profilingexperiments.NucleicAcidsRes42:e134.4
GingoldH,PilpelY.2011.Determinantsoftranslationefficiencyandaccuracy.MolSystBiol7:481.5
HallRH.1963.Methodforisolationof2′-O-methylribonucleosidesandN1-methyladenosinefrom6ribonucleicacid.BiochimicaetBiophysicaActa(BBA)-SpecializedSectiononNucleicAcidsand7RelatedSubjects68:278–283.8
HenikoffS,HenikoffJG.1992.Aminoacidsubstitutionmatricesfromproteinblocks.ProcNatlAcadSci9USA89:10915–10919.10
HershbergR,PetrovDA.2008.Selectiononcodonbias.AnnuRevGenet42:287–299.11
HinnebuschAG,LorschJR.2012.Themechanismofeukaryotictranslationinitiation:newinsightsand12challenges.ColdSpringHarbPerspectBiol4.http://dx.doi.org/10.1101/cshperspect.a011544.13
HongX,ScofieldDG,LynchM.2006.Intronsize,abundance,anddistributionwithinuntranslated14regionsofgenes.MolBiolEvol23:2392–2404.15
JanCH,WilliamsCC,WeissmanJS.2015.LOCALTRANSLATION.ResponsetoCommenton“Principlesof16ERcotranslationaltranslocationrevealedbyproximity-specificribosomeprofiling.”Science348:171217.18
JanCH,WilliamsCC,WeissmanJS.2014.PrinciplesofERcotranslationaltranslocationrevealedby19proximity-specificribosomeprofiling.Science346:1257521.20
KentWJ,SugnetCW,FureyTS,RoskinKM,PringleTH,ZahlerAM,HausslerD.2002.Thehuman21genomebrowseratUCSC.GenomeRes12:996–1006.22
KervestinS,JacobsonA.2012.NMD:amultifacetedresponsetoprematuretranslationaltermination.23NatRevMolCellBiol13:700–712.24
KlootwijkJ,PlantaRJ.1973.AnalysisofthemethylationsitesinyeastribosomalRNA.EurJBiochem39:25325–333.26
KucukuralA,ÖzadamH,SinghG,MooreMJ,CenikC.2013.ASPeak:anabundancesensitivepeak27detectionalgorithmforRIP-Seq.Bioinformatics29:2485–2486.28
LarssonO,LiS,IssaenkoOA,AvdulovS,PetersonM,SmithK,BittermanPB,PolunovskyVA.2007.29EukaryoticTranslationInitiationFactor4E–InducedProgressionofPrimaryHumanMammary30EpithelialCellsalongtheCancerPathwayIsAssociatedwithTargetedTranslationalDeregulation31ofOncogenicDriversandInhibitors.CancerRes67:6814–6824.32
LeeES,AkefA,MahadevanK,PalazzoAF.2015.Theconsensus5’splicesitemotifinhibitsmRNA33nuclearexport.PLoSOne10:e0122743.34
LeHirH,IzaurraldeE,MaquatLE,MooreMJ.2000.Thespliceosomedepositsmultipleproteins20-2435nucleotidesupstreamofmRNAexon-exonjunctions.EMBOJ19:6860–6869.36
LiuJ,YueY,HanD,WangX,FuY,ZhangL,JiaG,YuM,LuZ,DengX,etal.2014.AMETTL3-METTL1437complexmediatesmammaliannuclearRNAN6-adenosinemethylation.NatChemBiol10:93–95.38
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
34
LiX,XiongX,WangK,WangL,ShuX,MaS,YiC.2016.Transcriptome-widemappingrevealsreversible1anddynamicN1-methyladenosinemethylome.NatChemBiol.2http://dx.doi.org/10.1038/nchembio.2040(AccessedMarch29,2016).3
MahadevanK,ZhangH,AkefA,CuiXA,GueroussovS,CenikC,RothFP,PalazzoAF.2013.4RanBP2/Nup358potentiatesthetranslationofasubsetofmRNAsencodingsecretoryproteins.5PLoSBiol11:e1001545.6
MarintchevA,EdmondsKA,MarintchevaB,HendricksonE,ObererM,SuzukiC,HerdyB,SonenbergN,7WagnerG.2009.TopologyandregulationofthehumaneIF4A/4G/4Hhelicasecomplexin8translationinitiation.Cell136:447–460.9
MarkhamNR,ZukerM.2008.UNAFold:softwarefornucleicacidfoldingandhybridization.Methods10MolBiol453:3–31.11
MeyuhasO.2000.Synthesisofthetranslationalapparatusisregulatedatthetranslationallevel.EurJ12Biochem267:6321–6330.13
MooreMJ,ProudfootNJ.2009.Pre-mRNAprocessingreachesbacktotranscriptionandaheadto14translation.Cell136:688–700.15
PalazzoAF,MahadevanK,TarnawskySP.2013.ALREX-elementsandintrons:twoidentityelements16thatpromotemRNAnuclearexport.WileyInterdiscipRevRNA4:523–533.17
PalazzoAF,SpringerM,ShibataY,LeeC-S,DiasAP,RapoportTA.2007.Thesignalsequencecoding18regionpromotesnuclearexportofmRNA.PLoSBiol5:e322.19
ParsyanA,SvitkinY,ShahbazianD,GkogkasC,LaskoP,MerrickWC,SonenbergN.2011.mRNA20helicases:thetacticiansoftranslationalcontrol.NatRevMolCellBiol12:235–245.21
PingX-L,SunB-F,WangL,XiaoW,YangX,WangW-J,AdhikariS,ShiY,LvY,ChenY-S,etal.2014.22MammalianWTAPisaregulatorysubunitoftheRNAN6-methyladenosinemethyltransferase.Cell23Res24:177–189.24
PruittKD,TatusovaT,MaglottDR.2005.NCBIReferenceSequence(RefSeq):acuratednon-redundant25sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes33:D501–4.26
QuinlanAR,HallIM.2010.BEDTools:aflexiblesuiteofutilitiesforcomparinggenomicfeatures.27Bioinformatics26:841–842.28
RedheadE,BaileyTL.2007.DiscriminativemotifdiscoveryinDNAandproteinsequencesusingthe29DEMEalgorithm.BMCBioinformatics8:385.30
ReidDW,NicchittaCV.2015a.DiversityandselectivityinmRNAtranslationontheendoplasmic31reticulum.NatRevMolCellBiol16:221–231.32
ReidDW,NicchittaCV.2015b.LOCALTRANSLATION.Commenton“PrinciplesofERcotranslational33translocationrevealedbyproximity-specificribosomeprofiling.”Science348:1217.34
ReimandJ,ArakT,ViloJ.2011.g:Profiler—awebserverforfunctionalinterpretationofgenelists35(2011update).NucleicAcidsResgkr378.36
RothFP,HughesJD,EstepPW,ChurchGM.1998.FindingDNAregulatorymotifswithinunaligned37noncodingsequencesclusteredbywhole-genomemRNAquantitation.NatBiotechnol16:939–38
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
35
945.1
RowlandAA,VoeltzGK.2012.Endoplasmicreticulum–mitochondriacontacts:functionofthejunction.2NatRevMolCellBiol13:607–625.3
ShahP,DingY,NiemczykM,KudlaG,PlotkinJB.2013.Rate-limitingstepsinyeastproteintranslation.4Cell153:1589–1601.5
SinghG,KucukuralA,CenikC,LeszykJD,ShafferSA,WengZ,MooreMJ.2012.ThecellularEJC6interactomerevealshigher-ordermRNPstructureandanEJC-SRproteinnexus.Cell151:750–764.7
SinghG,RicciEP,MooreMJ.2014.RIPiT-Seq:ahigh-throughputapproachforfootprintingRNA:protein8complexes.Methods65:320–332.9
StoreyJD.2003.Thepositivefalsediscoveryrate:aBayesianinterpretationandtheq-value.AnnStat1031:2013–2035.11
SylvestreJ,VialetteS,CorralDebrinskiM,JacqC.2003.LongmRNAscodingforyeastmitochondrial12proteinsofprokaryoticoriginpreferentiallylocalizetothevicinityofmitochondria.GenomeBiol4:13R44.14
TullerT,CarmiA,VestsigianK,NavonS,DorfanY,ZaborskeJ,PanT,DahanO,FurmanI,PilpelY.2010.15Anevolutionarilyconservedmechanismforcontrollingtheefficiencyofproteintranslation.Cell16141:344–354.17
ValenE,SandelinA,WintherO,KroghA.2009.Discoveryofregulatoryelementsisimprovedbya18discriminatoryapproach.PLoSComputBiol5:e1000562.19
VogelC,MarcotteEM.2012.Insightsintotheregulationofproteinabundancefromproteomicand20transcriptomicanalyses.NatRevGenet13:227–232.21
WangS,YaoX.2012.MulticlassImbalanceProblems:AnalysisandPotentialSolutions.IEEETransSyst22ManCybernBCybern.http://dx.doi.org/10.1109/TSMCB.2012.2187280.23
WangX,LuZ,GomezA,HonGC,YueY,HanD,FuY,ParisienM,DaiQ,JiaG,etal.2014.N6-24methyladenosine-dependentregulationofmessengerRNAstability.Nature505:117–120.25
WangX,ZhaoBS,RoundtreeIA,LuZ,HanD,MaH,WengX,ChenK,ShiH,HeC.2015.N(6)-26methyladenosineModulatesMessengerRNATranslationEfficiency.Cell161:1388–1399.27
ZinshteynB,GilbertWV.2013.LossofaconservedtRNAanticodonmodificationperturbscellular28signaling.PLoSGenet9:e1003675.29
30 31
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
36
FigureLegends:1
Fig1|Modelingtherelationshipbetweensequencefeaturesintheearlycoding2
regionandtheabsenceof5’UTRintrons(5UIs).(a)Forallhumantranscripts,3
informationabout36nucleotide-levelfeaturesoftheearlycodingregion(first994
nucleotides)and5UIpresencewasextracted.(b)Transcriptscontainingasignalsequence5
codingregion(SSCR)wereusedtotrainaRandomForestclassifierthatmodeledthe6
relationshipbetween5UIabsenceand36sequencefeatures.(c)Withthisclassifier,all7
humantranscriptswereassignedascorethatquantifiesthelikelihoodof5UIabsence8
basedonspecificRNAsequencefeaturesintheearlycodingregion.Transcriptswithhigh9
scoresarethusconsideredtohave5’-proximalintronminus-likecodingregions(5IMs).10
(d)“5’UTR-intron-minus-predictor”(5IMP)scoredistributionsforSSCR-containing11
transcriptsshifttohigherscoreswithlater-appearingfirstintrons,suggestingthat5IM12
codingregionfeaturesnotonlypredictlackofa5UI,butalsolackofearlycodingregion13
introns.(e)Classifierperformancewasoptimizedbyexcluding5UI–transcriptswith14
intronsappearingearlyinthecodingregion.Cross-validationperformance(areaunderthe15
precisionrecallcurve,AUPRC)wasexaminedforaseriesofalternative5IMclassifiers16
usingdifferentfirst-intron-positioncriterionforexcluding5UI–transcriptsfromthe17
trainingset(Methods).18
19
Fig2|Predicting5UIstatusaccuratelyusingonlyearlycodingsequences.(a)As20
judgedbyareaunderthereceiveroperatingcharacteristiccurve(AUROC)andAUPRC,The21
5IMclassifierperformedwellforseveraldifferenttranscriptclasses.(b)Thedistribution22
of5IMPscoresrevealsclearseparationof5UI+and5UI–transcriptsforSSCR-containing23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
37
transcripts,whereeachSSCR-containingtranscriptwasscoredusingaclassifierthatdid1
notusethattranscriptintraining(Methods).(c)Codingsequencefeaturesthatare2
predictiveof5’proximalintronpresencearerestrictedtotheearlycodingregion.Thiswas3
supportedbyidentical5IMclassifierscoredistributionswithrespectto5UIpresencefor4
negativecontrolsequences,eachderivedfromasinglerandomlychosen‘window’5
downstreamofthe3rdexonfromoneoftheevaluatedtranscripts.(d)MSCRtranscripts6
exhibitedamajordifferencein5IMPscoresbasedontheir5UIstatuseventhoughnoMSCR7
transcriptswereusedintrainingtheclassifier.(e)Transcriptspredictedtocontainsignal8
peptides(SignalP+)hada5IMPscoredistributionsimilartothatofSSCR-containing9
transcripts.(f)AftereliminatingSSCR,MSCR,andSignalP+transcripts,theremainingS–10
/MSCR–SignalP–transcriptswerestillsignificantlyenrichedforhigh5IMclassifierscores11
among5UI–transcripts.(g)Thecontrolsetofrandomlychosensequencesdownstreamof12
the3rdexonfromeachtranscriptwasusedtocalculateanempiricalcumulativenull13
distributionof5IMPscores.Usingthisfunction,wedeterminedthep-valuecorresponding14
tothe5IMPscoreforalltranscripts.Thereddashedlineindicatesthep-value15
correspondingto5%FalseDiscoveryRate.Theinsetdepictsthedistributionofvarious16
classesofmRNAsamongtheinputsetand5IMtranscripts.17
18
Fig3|5IMtranscriptsaremorelikelytobedifferentiallyexpressedandhave19
sequencefeaturesassociatedwithlowertranslationefficiency(a)The5IMclassifier20
scorewaspositivelycorrelatedwiththepropensityformRNAstructureprecedingthestart21
codon(-ΔG)(Spearmanrho=0.39;p<2.2e-16).Foreachtranscript,35nucleotides22
immediatelyupstreamoftheAUGwereusedtocalculate-ΔG(Methods).(b)The5IM23
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
38
classifierscorewaspositivelycorrelatedwiththepropensityformRNAstructurenearthe1
5’cap(-ΔG)(Spearmanrho=0.18;p=7.9e-130;Methods).(c)Transcriptsthatare2
translationallyupregulatedinresponsetoeIF4Eoverexpression(Larssonetal.2007)3
(blue)wereenrichedforhigher5IMPscores.Lightgreenshadingindicates5IMPscores4
correspondingto5%FDR.(d)Transcriptswithnon-AUGstartcodons(blue)exhibited5
significantlyhigher5IMPscoresthantranscriptswithacanonicalATGstartcodon(yellow).6
(e)Higher5IMPscoreswereassociatedwithlessoptimalcodons(asmeasuredbythe7
tRNAadaptationindex,tAI)forthefirst33codons.Foralltranscriptswithineach5IMP8
scorecategory(blue-high,orange-low),themeantAIwascalculatedateachcodonposition.9
Startcodonwasnotshown.(f)Transcriptswithlowertranslationefficiencywereenriched10
forhigher5IMPscores.Transcriptswithtranslationefficiencyonestandarddeviation11
belowthemean(“LOW”translation,yellow)andonestandarddeviationhigherthanthe12
mean(“HIGH”translation,blue)wereidentifiedusingribosomeprofilingandRNA-Seqdata13
fromhumanlymphoblastoidcelllines(Methods).14
15
Fig4|5IMtranscriptswithnoevidenceofER-targetingaremorelikelytoexhibitER-16
proximalribosomeoccupancy.AmovingaverageofER-proximalribosomeoccupancy17
wascalculatedbygroupinggenesby5IMPscore(seeMethods).Weplottedthemoving18
averageof5IMPscoresfortranscriptswithnoevidenceofER-ormitochondrialtargeting19
(green)orfortranscriptspredictedtobemitochondrial(purple).Weplottedarandom20
subsampleoftranscriptsontopofthemovingaverage(circles).21
22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
39
Fig5|5IMtranscriptsharbornon-canonicalExonJunctionComplex(EJC)binding1
sites(a)ObservedEJCbindingsites(Singhetal.2012)areshownforanexample5IM2
transcript(LAMC1).CanonicalEJCbindingsites(purple)are~24ntupstreamofanexon-3
intronboundary.Theremainingbindingsitesareconsideredtobenon-canonical(green).4
(b)ACG-richsequencemotifpreviouslyidentifiedtobeenrichedamongncEJCbinding5
sitesinfirstexons(Singhetal.2012)isshown(c)5IMPscorefortranscriptswithzero,6
one,twoormorenon-canonicalEJCbindingsitesinthefirst99codingnucleotidesreveals7
thattranscriptswithhigh5IMPscoresfrequentlyharbornon-canonicalEJCbindingsites.8
(d)Transcriptswithhigh5IMPscoresareenrichedfornon-canonicalEJCsregardlessof9
5UIpresenceorabsence.10
11
Fig6|5IMtranscriptsareenrichedformRNAswithearlycodingregionm1A12
modifications(a)Transcriptswithm1Amodifications(blue)inthefirst99codingnucleotides13
exhibitsignificantenrichmentfor5IMtranscriptsandhavehigher5IMPSscoresthan14
transcriptswithoutm1Amodificationsinthefirst99codingnucleotides(yellow).(b)15
Transcriptswithm1Amodifications(blue)inthe5’UTRdonotdisplayasimilarenrichment.16
17
SupportingInformation18
FigureS1|Associationbetween5IMPscoresandcodonoptimalityisrestrictedtothe19
first30aminoacidsandisnotexplainedbynucleotidecontent.(a)5IMtranscriptstend20
tohavelessoptimalcodonsintheirfirst30aminoacidsasmeasuredbytRNAadaptationindex21
(tAI).ThemediantAIforeachtranscriptwascalculatedandtranscriptsweregroupedbytheir22
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
CanCenik
40
5IMPscores.ThedistributionofmediantAIwasplottedasaboxplot.(b)Wepermutedthe1
nucleotidesofthefirst99nucleotidesandfoundthattherelationshipbetween5IMPandtAI2
waslost.Thisresultsuggestedthatnucleotidecompositionalonedoesn’texplainthe3
relationshipbetween5IMPandtAI.(c)Weusedthepreviouslydescribednegativecontrol4
sequences,eachderivedfromasinglerandomlychosenin-frame‘window’downstreamofthe5
3rdexonfromoneoftheevaluatedtranscripts.Wefoundnorelationshipbetween5IMPscore6
andtAIintheseregionssuggestingthattheobservedassociationisrestrictedtotheearly7
codingregion.8
TableS1-|Listoffeaturesusedbythe5IMclassifier.9
TableS2-|5IMPscoresofallhumantranscripts.10
TableS3-|Listoffunctionalfeaturestestedforassociationwith5IMPscores.11
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
5'UTR Intron (5UI) First 99 nucleotidesATGGCGCA
ATGACAGA
ATGTATAAT
Sequence 1
Sequence 2
Sequence N
5’UTRIntron Arg:Lys
+
–
–
1.3
3.5
4.0
Adenine Codon Bias
0.9
2.1
1.5
Extract CDS featuresA B
Model the relationship between early coding sequence and 5UI status
C
Grow Decision Tree Repeat SamplingGenerate Multiple Trees
Sequence 5Sequence 5Sequence 9Sequence 11
BootstrapSample
Assign ‘5IMP’ score reflecting 5UI– propensity5IMP Score
9.2/10Sequence X ATGAGCGCGT...?
Likely 5UI–
2.7/10Sequence Y ATGACTATAA...?
Likely 5UI+
5UTR
Intron 0-4
243
-5960
-7071
-84 85+
2
4
6
8
First Coding Intron Position
5IM
P Sc
ore
D
2000 50 100 1500.5
0.6
0.7
0.8
0.9
1.0
AUPR
C (5
UI– p
redi
ctio
n)
First Coding Intron Position
E90 nt cutoff
Figure 1.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
E
0 2 4 6 8 105IMP Score
0.00
0.10
0.20
0.30
Dens
ity
5UI+5UI–
SSCR–/ MSCR–/ SignalP+
0.00
0.10
0.20
0.30
C
Dens
ity
0 2 4 6 8 105IMP Score
3’ proximal exons
5UI+5UI–
A
SSCRMSCR
3’ proximal exons
S– /M– SignalP+
Prec
isio
n
Recall
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0False Positive Rate
Reca
ll
0.0 0.2 0.4 0.6 0.8 1.00.0
0.2
0.4
0.6
0.8
1.0
D
0 2 4 6 8 105IMP Score
5UI+5UI–
MSCR
Dens
ity
0.00
0.10
0.20
0.30
B
Dens
ity
0 2 4 6 8 10
SSCR
0.00
0.10
0.20
0.30
5IMP Score
5UI+5UI–
0 2 4 6 8 10
SSCR–/ MSCR–/ SignalP–
5IMP Score
5UI+5UI–
0.00
0.10
0.20
0.30
Dens
ity
F
p−value
Num
ber o
f Tra
nscr
ipts
0.0 0.2 0.4 0.6 0.8 1.00
1k
2k
G
5IMP Classifier Performance
S– /M– SignalP–
Figure 2
All
5IM
0% 40% 80%
Sign
alP–
Sign
alP+
SSC
R
MSC
R
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
A
0.00
0.05
0.10
0.15
0.20
0.25
Dens
ity
NON-AUG
AUG
Start Codon
5IMP Score0 2 4 6 8 10
D
C
5IMP Score0 2 4 6 8 10
0.00
0.05
0.10
0.15
0.20
Dens
ity
TRANSLATIONDOWN TRANSLATION
UP
NO CHANGEEIF4E Overexpression
FRibosomes per mRNA
LOW
HIGH
0 2 4 6 8 105IMP Score
0.00
0.05
0.10
0.15
0.20
Dens
ity
B
5IMP Score
Seco
dary
Stru
ctur
e-∆
G
2 4 6 8
0
5
10
15
20
25
E
0.36
0.38
0.40
0.42
Codo
n O
ptim
ality
5 10 15 20 25 30Codon Position
5IM
Non-5IM
Mean tAI
Figure 3
2 4 6 8
0
5
10
15
20
Seco
dary
Stru
ctur
e-∆
G
5IMP Score
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
A
Mitochondrial Evidence
2 4 6 8
5IMP score
ER−p
roxim
al Ri
boso
me
Occu
panc
y
No ER-targeting or Mitochondrial Evidence
<−1.12
−0.8
−0.4
0
>0.3
Figure 4.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
AScale 1 kb
LAMC1
200 basesScale
EJC Peak Calls
Non-Canonical EJC(ncEJC) Peaks
Canonical EJC(cEJC) Peaks
2 4 6 8 10
0
0.050.1
0.15
0.20
0.25
0.30
0.35
5IMP Score
Dens
ity
All TranscriptsFirst 99 Coding Nucleotides
No Peaks 1 Peak
2+ Peaks
C
ncEJC Peaks
ACG
CG
GCC
GGCGCGCG
C
B
First 99 Coding NucleotidesD
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35 5UI+ Transcripts
No ncEJC peak
1+ ncEJC peak
5UI– Transcripts
No ncEJC peak
1+ ncEJC peak
0 2 4 6 8 10 2 4 6 8 10
5IMP Score5IMP Score
Dens
ity
Figure 5.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
0.00
0.05
0.10
0.15
0.20
Dens
ity
0.00
0.05
0.10
0.15
0.20
Dens
ity
5IMP Score
0 2 4 6 8 10
A
5IMP Score
0 2 4 6 8 10
5’UTRFirst 99 Coding Nucleotides
B
m1A-modified m1A-modified
No m1A No m1A
Figure 6
.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint
0.25
0.30
0.35
0.40
0.45
5IMP Score
Med
ian
tAI
0-7.41 7.41-10
A
Codon Position
5 10 15 20 25 30
0.30
0.32
0.34
0.36
0.38
0.40
0.42
Codo
n O
ptim
ality
mea
n tA
I
First 99 nucleotides permuted
5IMP ≤ 7.41 7.41 < 5IMP
B
Codon Position
5 10 15 20 25 30
0.30
0.32
0.34
0.36
0.38
0.40
0.42
Random 99 nucleotidesDownstream of 3rd Exon
Codo
n O
ptim
ality
mea
n tA
I
7.41 < 5IMP5IMP ≤ 7.41
C
Figure S1.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint