a common class of transcripts with 5′-intron …...2016/09/06  · 9 transcripts with 5’...

47
Can Cenik 1 A Common Class of Transcripts with 5’-Intron Depletion, Distinct Early Coding 1 Sequence Features, and N 1 -Methyladenosine Modification 2 Can Cenik 1,2 , Hon Nian Chua 3,4 , Guramrit Singh 2,5,7,8 , Abdalla Akef 6 , Michael P Snyder 1 , 3 Alexander F. Palazzo 6 , Melissa J Moore 2,7,8# , and Frederick P Roth 3,9,10# 4 5 Affiliations: 6 1 Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA 7 2 Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical 8 School, Worcester, MA 01605, USA 9 3 Donnelly Centre and Departments of Molecular Genetics and Computer Science, University of Toronto 10 and Lunenfeld-Tanenbaum Research Institute, Mt Sinai Hospital, Toronto M5G 1X5, Ontario, Canada 11 4 DataRobot, Inc., 61 Chatham St. Boston MA 02109, USA 12 5 Department of Molecular Genetics, Center for RNA Biology, The Ohio State University, Columbus, OH 13 43210, USA 14 6 Department of Biochemistry, University of Toronto, 1 King’s College Circle, MSB Room 5336, Toronto, 15 ON M5S 1A8, Canada 16 7 Howard Hughes Medical Institute, University of Massachusetts Medical School, Worcester, MA 01605, 17 USA 18 8 RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA 01605, USA 19 9 Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston 02215, MA, USA 20 10 The Canadian Institute for Advanced Research, Toronto M5G 1Z8, ON, Canada 21 22 #Correspondence to: [email protected] , [email protected] 23 24 Short Title: First intron position defines a transcript class 25 Keywords: 5’ UTR introns, random forest, N 1 methyladenosine, Exon Junction Complex 26 27 . CC-BY-ND 4.0 International license not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was this version posted September 6, 2016. . https://doi.org/10.1101/057455 doi: bioRxiv preprint

Upload: others

Post on 29-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

1

ACommonClassofTranscriptswith5’-IntronDepletion,DistinctEarlyCoding1

SequenceFeatures,andN1-MethyladenosineModification2

CanCenik1,2,HonNianChua3,4,GuramritSingh2,5,7,8,AbdallaAkef6,MichaelPSnyder1,3

AlexanderF.Palazzo6,MelissaJMoore2,7,8#,andFrederickPRoth3,9,10#4

5

Affiliations:6

1DepartmentofGenetics,StanfordUniversitySchoolofMedicine,Stanford,CA94305,USA72DepartmentofBiochemistryandMolecularPharmacology,UniversityofMassachusettsMedical8School,Worcester,MA01605,USA93DonnellyCentreandDepartmentsofMolecularGeneticsandComputerScience,UniversityofToronto10andLunenfeld-TanenbaumResearchInstitute,MtSinaiHospital,TorontoM5G1X5,Ontario,Canada114DataRobot,Inc.,61ChathamSt.BostonMA02109,USA125DepartmentofMolecularGenetics,CenterforRNABiology,TheOhioStateUniversity,Columbus,OH1343210,USA146DepartmentofBiochemistry,UniversityofToronto,1King’sCollegeCircle,MSBRoom5336,Toronto,15ONM5S1A8,Canada167HowardHughesMedicalInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,17USA188RNATherapeuticsInstitute,UniversityofMassachusettsMedicalSchool,Worcester,MA01605,USA199CenterforCancerSystemsBiology(CCSB),Dana-FarberCancerInstitute,Boston02215,MA,USA2010TheCanadianInstituteforAdvancedResearch,TorontoM5G1Z8,ON,Canada21

22

#Correspondenceto:[email protected],[email protected]

24

ShortTitle:Firstintronpositiondefinesatranscriptclass25

Keywords:5’UTRintrons,randomforest,N1methyladenosine,ExonJunctionComplex26

27

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 2: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

2

Abstract:1

Intronsarefoundin5’untranslatedregions(5’UTRs)for35%ofallhumantranscripts.2

These5’UTRintronsarenotrandomlydistributed:genesthatencodesecreted,membrane-3

boundandmitochondrialproteinsarelesslikelytohavethem.Curiously,transcriptslacking4

5’UTRintronstendtoharborspecificRNAsequenceelementsintheirearlycodingregions.To5

modelandunderstandtheconnectionbetweencoding-regionsequenceand5’UTRintron6

status,wedevelopedaclassifierthatcanpredict5’UTRintronstatuswith>80%accuracy7

usingonlysequencefeaturesintheearlycodingregion.Thus,theclassifieridentifies8

transcriptswith5’proximal-intron-minus-like-codingregions(“5IM”transcripts).9

Unexpectedly,wefoundthattheearlycodingsequencefeaturesdefining5IMtranscriptsare10

widespread,appearingin21%ofallhumanRefSeqtranscripts.The5IMclassoftranscriptsis11

enrichedfornon-AUGstartcodons,moreextensivesecondarystructurebothprecedingthe12

startcodonandnearthe5’cap,greaterdependenceoneIF4Efortranslation,andassociation13

withER-proximalribosomes.5IMtranscriptsareboundbytheExonJunctionComplex(EJC)at14

non-canonical5’proximalpositions.Finally,N1-methyladenosinesarespecificallyenrichedin15

theearlycodingregionsof5IMtranscripts.Takentogether,ouranalysespointtotheexistence16

ofadistinct5IMclasscomprising~20%ofhumantranscripts.Thisclassisdefinedby17

depletionof5’proximalintrons,presenceofspecificRNAsequencefeaturesassociatedwith18

lowtranslationefficiency,N1-methyladenosinesintheearlycodingregion,andenrichmentfor19

non-canonicalbindingbytheExonJunctionComplex.20

21

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 3: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

3

Introduction:1

2

Approximately35%ofallhumantranscriptsharborintronsintheir5’untranslated3

regions(5’UTRs)(Ceniketal.2010;Hongetal.2006).Amonggeneswith5’UTRintrons(5UIs),4

thoseannotatedas“regulatory”aresignificantlyoverrepresented,whilethereisanunder-5

representationofgenesencodingproteinsthataretargetedtoeithertheendoplasmic6

reticulum(ER)ormitochondria(Ceniketal.2011).FortranscriptsthatencodeER-and7

mitochondria-targetedproteins,5UIdepletionisassociatedwithpresenceofspecificRNA8

sequences(Ceniketal.2011;Palazzoetal.2007,2013).Specifically,nuclearexportofan9

otherwiseinefficientlyexportedmicroinjectedmRNAorcDNAtranscriptcanbepromotedby10

anER-targetingsignalsequence-containingregion(SSCRs)ormitochondrialsignalsequence11

codingregion(MSCRs)fromagenelacking5’UTRintrons(Ceniketal.2011;Leeetal.2015).12

However,morerecentstudiessuggestthatmanySSCRshavelittleimpactonnuclearexportfor13

RNAstranscribedinvivo(Leeetal.2015),butratherenhancetranslationinaRanBP2-14

dependentmanner(Mahadevanetal.2013).15

AmongSSCR-andMSCR-containingtranscripts(referredtohereafterasSSCRand16

MSCRtranscripts),~75%lack5’UTRintrons(“5UI–”transcripts)and~25%havethem(“5UI+”17

transcripts).Thesetwogroupshavemarkedlydifferentsequencecompositionsatthe5’ends18

oftheircodingsequences.5UI–transcriptstendtohaveloweradeninecontent(Palazzoetal.19

2007)andusecodonswithfeweruracilsandadeninesthan5UI+transcripts(Ceniketal.20

2011).Theirsignalsequencesalsocontainleucineandargininemoreoftenthanthe21

biochemicallysimilaraminoacidsisoleucineandlysine,respectively.Leucineandarginine22

codonscontainfeweradenineandthyminenucleotides,consistentwithadenineandthymine23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 4: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

4

depletion.ThisdepletionisalsoassociatedwiththepresenceofaspecificGC-richRNAmotif1

intheearlycodingregionof5UI–transcripts(Ceniketal.2011).2

Despitesomeknowledgeastotheirearlycodingregionfeatures,keyquestionsabout3

thisclassof5UI–transcriptshaveremainedunanswered:Dotheabovesequencefeatures4

extendbeyondSSCR-andMSCR-containingtranscriptstoother5UI–genes?Do5UI–5

transcriptshavingthesefeaturessharecommonfunctionalorregulatoryfeatures?What6

bindingfactor(s)recognizetheseRNAelements?Amorecompletemodeloftherelationshipof7

earlycodingfeaturesand5UI–statuswouldbegintoaddressthesequestions.8

Here,tobetterunderstandtherelationshipbetweenearlycodingregionfeaturesand9

5UIstatus,weundertookanintegrativemachinelearningapproach.Wereasonedthata10

machinelearningclassifierwhichcouldidentify5UI–transcriptssolelyfromearlycoding11

sequencewouldpotentiallyprovidetwotypesofinsight:First,itcouldsystematicallyidentify12

predictivefeatures.Second,thesubsetof5UI–transcriptswhichcouldbeidentifiedbythe13

classifiermightthenrepresentafunctionallydistincttranscriptclass.Havingdevelopedsuch14

aclassifier,wefoundthatitidentified~21%ofallhumantranscriptsasharboringcoding15

regionscharacteristicof5UI–transcripts.WhilemanyofthesetranscriptsencodeER-and16

mitochondrial-targetedproteins,manyothersencodenuclearandcytoplasmicproteins.This17

classoftranscriptssharescharacteristicssuchasatendencytolack5’proximalintrons,to18

containnon-canonicalExonJunctionComplexbindingsites,tohavemultiplefeatures19

associatedwithlowerintrinsictranslationefficiency,andtohaveanincreasedincidenceofN1-20

methyladenosinemodification.21

22

Results:23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 5: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

5

Aclassifierthatpredicts5UIstatususingonlyearlycodingsequenceinformation.1

Tobetterunderstandthepreviously-reportedenigmaticrelationshipbetweencertainearly2

codingregionsequencesandtheabsenceofa5UI,wesoughttomodelthisrelationship.3

Specifically,weusedarandomforestclassifier(Breiman2001)tolearntherelationship4

between5UIabsenceandacollectionof36differentnucleotide-levelfeaturesextractedfrom5

thefirst99nucleotidesofallhumancodingregions(CDS)(Figs1A-C;TableS1;Methods).6

WethenusedalltranscriptsknowntocontainanSSCR(atotalof3743transcriptsclusters;7

Methods),regardlessof5UIstatus,asourtrainingset.Thistrainingconstraintensuredthatall8

inputnucleotidesequencesweresubjecttosimilarfunctionalconstraintsattheproteinlevel.9

Thus,wesoughttoidentifysequencefeaturesthatdifferbetween5UI–and5UI+transcriptsat10

theRNAlevel.11

Ourclassifierassignstoeachtranscripta“5’UTR-intron-minus-predictor”(5IMP)score12

between0and10,wherehigherscorescorrespondtoahigherlikelihoodofbeing5UI–(Fig13

1C).Interestingly,preliminaryrankingofthe5UI–transcriptsby5IMPscorerevealeda14

relationshipbetweenthepositionofthefirstintroninthecodingregionandthe5IMPscore.15

5UI–transcriptsforwhichthefirstintronwasmorethan85ntsdownstreamofthestartcodon16

hadthehighest5IMPscores.Furthermore,thecloserthefirstintronwastothestartcodon,17

thelowerthe5IMPscore(Fig1D).Weexploredthisrelationshipfurtherbytrainingclassifiers18

thatincreasinglyexcludedfromthetrainingset5UI–transcriptsaccordingtothedistanceof19

thefirstintronfromthe5’endofthecodingregion.Thisrevealedthatclassifierperformance,20

asmeasuredbytheareaundertheprecisionrecallcurve(AUPRC),increasedasafunctionof21

thedistancefromstartcodontofirstintrondistance(Methods,Fig1E).Thus,theRNA22

sequencefeaturesweidentifiedasbeingpredictiveof5UI–transcriptsaremoreaccurately23

describedasbeingpredictorsoftranscriptswithout5’-proximalintrons.24

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 6: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

6

Tominimizetheimpactoftranscriptsthatmay‘behave’asthoughtheywere5UI+due1

toanintronearlyinthecodingregion,weeliminated5UI–SSCRtranscriptswithafirstintron2

<90ntsdownstreamofthestartcodon(Methods)andgeneratedanewclassifier.3

Discriminativemotiffeatureswerelearnedindependently(Methods),andperformanceofthis4

newclassifierwasgaugedusing10-foldcrossvalidation.Weassessedcross-validation5

performanceintwoways:1)intermsoftheareaunderthereceiveroperatingcurve(AUC)—6

whichcanbethoughtofasameasureofaveragerecallacrossarangeoffalsepositiverates;2)7

intermsofareaundertheprecisionvsrecallcurve(AUPRC),whichcanbethoughtofasthe8

averageprecision(fractionofpredictionswhicharecorrect)acrossarangeofrecallvalues.9

Specifically,theclassifiershowedanAUCof74%andAUPRCof88%(Fig2A,yellowcurves;10

exceedingAUC50%andAUPRC71%,theperformancevalueexpectedofanaïvepredictor).11

Weusedthisoptimizedclassifierforallsubsequentanalyses.12

SSCRtranscriptsexhibitedmarkedlydifferent5IMPscoredistributionsforthe5UI+and13

5UI–subsets(Fig2B).The5UI+scoredistributionwasunimodalwithapeakat~2.4.In14

contrast,the5UI–scoredistributionwasbimodalwithonepeakat~3.6andanotherat~9,15

suggestingtheexistenceofatleasttwounderlying5UI–transcriptclasses.Thepeakatscore16

3.6resembledthe5UI+peak.Alsocontributingtothepeakat3.6isthesetof5UI–transcripts17

harboringanintroninthefirst90ntsoftheCDS(55%ofall5UI–transcripts).Theother18

distincthigh-scoring5UI–class(peakatscore9)iscomposedoftranscriptsthathavespecific19

5UI--predictiveRNAsequenceelementswithintheearlycodingregion.20

Wenextwishedtoevaluatewhetherourclassifierwasdiscriminating5UI+and5UI–21

SSCRtranscriptsusingsignalsthatappearspecificallyintheearlycodingregionasopposedto22

signalsthatappearbroadlyacrossthecodingregion.Todoso,foreverytranscriptwe23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 7: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

7

randomlychose99ntsfromtheregiondownstreamofthe3rdexon.The5IMPscore1

distributionsofthese‘3’proximalexon’setswereidenticalfor5UI+and5UI–transcripts(Figs2

2A,and2C),confirmingthatthesequencefeaturesthatdistinguish5UI+and5UI–transcripts3

arespecifictotheearlycodingregion.4

5

RNAelementsassociatedwith5UI–transcriptsarepervasiveinthehumangenome6

HavingtrainedtheclassifieronSSCRtranscripts,wewonderedhowwellitwouldpredictthe7

5UIstatusofothertranscripts.DespitehavingbeentrainedexclusivelyonSSCRtranscripts,8

theclassifierperformedremarkablywellonMSCRtranscripts,achievinganAUCof86%and9

AUPRCof95%(Fig2A,purpleline;Fig2D;ascomparedwith50%and77%,respectively,10

expectedbychance).ThisresultsuggeststhatRNAelementswithinearlycodingregionsof11

5UI–MSCR-transcriptsaresimilartothosein5UI–SSCR-transcriptsdespitedistinctfunctional12

constraintsattheproteinlevel.13

Wenextwonderedwhethertheclassof5UI–transcriptsthatcanbepredictedonthe14

basisofearlycodingregionfeaturesisrestrictedtotranscriptsencodingproteinstraffickedto15

theERormitochondria,orisinsteadamoregeneralclassoftranscripts.Wethereforeasked16

whethertheclassifiercouldpredict5UI–statusintranscriptsthatcontainneitheranSSCRnor17

anMSCR(“S–/M–”transcripts).BecauseunannotatedSSCRscouldconfoundthisanalysis,we18

firstusedSignalP3.0toidentifyS–/M–transcriptsmostlikelytocontainanunannotatedSSCR19

(Bendtsenetal.2004).These‘SignalP+‘transcriptshada5IMPscoredistributioncomparable20

tothoseofknownSSCRandMSCRtranscripts(Fig2E),andtheclassifierworkedwellto21

identifythe5UI–subsetofthesetranscripts(AUC82%andAUPRC95%,Fig2A,lightblue22

line).While5UI+SignalP+transcriptshadpredominantlylow5IMPscores,5UI–SignalP+5IMP23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 8: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

8

scoreswerestronglyskewedtowardshigh5IMPscores(peakat~9;Fig2E).Theseresults1

wereconsistentwiththeideathatSignalP+transcriptsdocontainmanyunannotatedSSCRs.2

HavingconsideredSignalP+transcriptsaswellasSSCR-andMSCR-containing3

transcripts,weusedtheclassifiertocalculate5IMPscoresforallremaining“S–/M–/SignalP–”4

transcripts.Althoughtheperformancewasweakeronthisgeneset,itwasstillbetterthan5

expectedofanaivepredictor(Fig2A,greenline).5UI+S–/M–/SignalP–transcriptswere6

stronglyskewedtowardlow5IMPscores(Fig2F).Surprisingly,however,asignificant7

fractionof5UI–S–/M–/SignalP–transcriptshadhigh5IMPscores(~18%).Thus,ourresults8

suggestabroadclassoftranscriptswithearlycodingregionscarryingsequencesignalsthat9

predicttheabsenceofa5’proximalintron,orinotherwords,aclassoftranscriptswith10

5’proximal-intron-minus-likecodingregions.Hereafterwerefertotranscriptsinthisclassas11

“5IM”transcripts.12

Wesoughttoidentifywhatfractionoftranscriptshave5IMPscoresthatexceedwhat13

wouldbeexpectedintheabsenceof5UI--predictiveearlycodingregionsignals.Toestablish14

thisexpectation,weusedtheabove-describednegativecontrolsetofequal-lengthcoding15

sequencesfromoutsideoftheearlycodingregion.Byquantifyingtheexcessofhigh-scoring16

sequencesintherealdistributionrelativetothiscontroldistribution,weestimatethat21%of17

allhumantranscriptsare5IMtranscripts(a5IMPscoreof7.41correspondstoa5%False18

DiscoveryRate;Fig2G).Thesetof5IMtranscriptsdefinedbyourclassifier(TableS2)19

includesmanythatdonotencodeER-targetedormitochondrialproteins.Thedistributionof20

variousclassesofmRNAsamongthe5IMtranscriptswas:38%ER-targeted(SSCRorSignalP+),21

9%mitochondrial(MSCR)and53%otherclasses(S–/M–/SignalP–)(Figure2G).Theseresults22

suggestthatRNA-levelfeaturesprevalentintheearlycodingregionsof5UI–SSCRandMSCR23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 9: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

9

transcriptsarealsofoundinothertranscripttypes(Figs2F-G),andthat5IMtranscripts1

representabroadclass.2

3

Functionalcharacterizationof5IMtranscripts4

5IMtranscriptsaredefinedbymRNAsequencefeatures.Hence,wehypothesizedthat5IM5

transcriptsmaybefunctionallyrelatedthroughsharedregulatorymechanismsmediatedby6

thepresenceofthesecommonfeatures.Tothisend,wecollectedlarge-scaledatasets7

representingdiverseattributescoveringsixbroadcategories(seeTableS3foracompletelist):8

(1)Curatedfunctionalannotations--e.g.,GeneOntologyterms,annotationasa‘housekeeping’9

gene,genessubjecttoRNAediting;(2)RNAlocalization--e.g.,todendrites,tomitochondria;10

(3)ProteinandmRNAhalf-life,ribosomeoccupancyandfeaturesthatdecreasestabilityofone11

ormoremRNAisoforms--e.g.,AU-richelements(4)Sequencefeaturesassociatedwith12

regulatedtranslation--e.g.,codonoptimality,secondarystructurenearthestartcodon;(5)13

KnowninteractionswithRNA-bindingproteinsorcomplexessuchasStaufen-1,TDP-43,orthe14

ExonJunctionComplex(EJC)(6)RNAmodifications–i.e.,N1-methyladenosine(m1A).15

Weadjustedformultiplehypothesestestingattwolevels.First,wetookaconservative16

approach(Bonferronicorrection)tocorrectforthenumberoftestedfunctional17

characteristics.Second,someofthefunctionalcategorieswereanalyzedinmoredepthand18

multiplesub-hypothesesweretestedwithinthegivencategory.Inthisin-depthanalysisafalse19

discovery-basedcorrectionwasadopted.Below,allreportedp-valuesremainsignificant(p-20

adjusted<0.05)aftermultiplehypothesistestcorrection.21

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 10: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

10

Noassociationsbetween5IMtranscriptsandfeaturesincategories(1),(2)and(3)were1

found,otherthanthealready-knownenrichmentsforER-andmitochondrial-targetedmRNAs.2

However,analysesfortheremainingcategoriesyieldedthesignificantresultsdescribedbelow.3

4

5IMtranscriptshavefeaturessuggestinglowertranslationefficiency5

Translationregulationisamajordeterminantofproteinlevels(VogelandMarcotte2012).To6

investigatepotentialconnectionsbetween5IMtranscriptsandtranslationalregulation,we7

examinedfeaturesassociatedwithtranslation.Featuresfoundtobesignificantwere:8

(I)Secondarystructuresnearthestartcodoncanaffectinitiationratebymodulating9

startcodonrecognition(Parsyanetal.2011).Weobservedapositivecorrelationbetween10

5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe35nucleotidesimmediatelypreceding11

thestartcodon(Fig3A;Spearmanrho=0.39;p<2.2e-16).Thissuggeststhat5IMtranscripts12

haveagreatertendencyforsecondarystructurenearthestartcodon,presumablymakingthe13

startcodonlessaccessible.14

(II)Similarly,secondarystructuresnearthe5’capcanmodulatetranslationby15

hinderingbindingbythe43S-preinitiationcomplextothemRNA(Babendureetal.2006).We16

observedapositivecorrelationbetween5IMPscoreandthefreeenergyoffolding(-ΔG)ofthe17

5’most35nucleotides(Fig3B;Spearmanrho=0.18;p=7.9e-130).Thissuggeststhat5IM18

transcriptshaveagreatertendencyforsecondarystructurenearthe5’cap,presumably19

hinderingbindingbythe43S-preinitiationcomplex.20

(III)eIF4Eoverexpression.TheheterotrimerictranslationinitiationcomplexeIF4F21

(madeupofeIF4A,eIF4EandeIF4G)isresponsibleforfacilitatingthetranslationof22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 11: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

11

transcriptswithstrong5’UTRsecondarystructures(Parsyanetal.2011).TheeIF4Esubunit1

bindstothe7mGpppG‘methyl-G’capandtheATP-dependenthelicaseeIF4A(scaffoldedby2

eIF4G)destabilizes5’UTRsecondarystructure(Marintchevetal.2009).Apreviousstudy3

identifiedtranscriptsthatweremoreactivelytranslatedunderconditionsthatpromotecap-4

dependenttranslation(overexpressionofeIF4E)(Larssonetal.2007).Inagreementwiththe5

observationthat5IMtranscriptshavemoresecondarystructureupstreamofthestartcodon6

andnearthe5’cap,transcriptswithhigh5IMPscoresweremorelikelytobetranslationally,7

butnottranscriptionally,upregulateduponeIF4Eoverexpression(Fig3C;WilcoxonRankSum8

Testp=2.05e-22,andp=0.28,respectively).9

(IV)Non-AUGstartcodons.Transcriptswithnon-AUGstartcodonsalsohave10

intrinsicallylowtranslationinitiationefficiencies(HinnebuschandLorsch2012).These11

mRNAsweregreatlyenrichedamongtranscriptswithhigh5IMPscores(Fisher’sExactTestp12

=0.0003;oddsratio=3.9)andhaveamedian5IMPscorethatis3.57higherthanthosewith13

anAUGstart(Fig3D).14

(V)Codonoptimality.Theefficiencyoftranslationelongationisaffectedbycodon15

optimality(HershbergandPetrov2008).Althoughsomeaspectsofthisremaincontroversial16

(CharneskiandHurst2013;Shahetal.2013;ZinshteynandGilbert2013;Gerashchenkoand17

Gladyshev2015),itisclearthatdecodingofcodonsbytRNAswithdifferentabundancescan18

affectthetranslationrateunderconditionsofcellularstress(reviewedinGingoldandPilpel19

2011).WethereforeexaminedthetRNAadaptationindex(tAI),whichcorrelateswithcopy20

numbersoftRNAgenesmatchingagivencodon(dosReisetal.2004).Specifically,we21

calculatedthemediantAIofthefirst99codingnucleotidesofeachtranscript,andfoundthat22

5IMPscorewasnegativelycorrelatedwithtAI(FigS1A;SpearmanCorrelationrho=-0.23;p<23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 12: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

12

2e-16mediantAIand5IMPscore).Thiseffectwasrestrictedtotheearlycodingregionsasthe1

negativecontrolsetofrandomlychosensequencesdownstreamofthe3rdexonfromeach2

transcriptdidnotexhibitarelationshipbetween5IMPscoreandcodonoptimality(FigS1B-C).3

Thus,5IMtranscriptsshowreducedcodonoptimalityinearlycodingregions,suggestingthat4

5IMtranscriptshavedecreasedtranslationelongationefficiency.5

Tomorepreciselydeterminewherethecodonoptimalityphenomenonoccurswithin6

theentireearlycodingregion,wegroupedtranscriptsby5IMPscore.Foreachgroup,we7

calculatedthemeantAIatcodons2-33(i.e.,nts4-99).Acrossthisentireregion,5IM8

transcripts(5IMP>7.41;5%FDR)hadsignificantlylowertAIvaluesateverycodonexcept9

codons24and32(Fig3E;WilcoxonRankSumtestHolm-adjustedp<0.05forall10

comparisons).Toeliminatepotentialconfoundingvariables,includingnucleotidecomposition,11

weperformedseveraladditionalcontrolanalyses(Methods);noneofthesealteredthebasic12

conclusionthat5IMtranscriptshavelowercodonoptimalitythannon-5IMtranscriptsacross13

theentireearlycodingregion.14

(VI)RibosomespermRNA.Finally,weexaminedtherelationshipbetween5IMPscore15

andtranslationefficiency,asmeasuredbythesteady-statenumberofribosomespermRNA16

molecule.Tothisend,weusedalargedatasetofribosomeprofilingandRNA-Seqexperiments17

fromhumanlymphoblastoidcelllines(Ceniketal.2015).Fromthis,wecalculatedtheaverage18

numberofribosomesoneachtranscriptandidentifiedtranscriptswithhighorlowribosome19

occupancy(respectivelydefinedbyoccupancyatleastonestandarddeviationaboveorbelow20

themean;seeMethods).5IMtranscriptswereslightlybutsignificantlydepletedinthehigh21

ribosome-occupancycategory(Fig3F;Fisher’sExactTestp=0.0006,oddsratio=1.3).22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 13: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

13

Moreover,5IMPscoresexhibitedaweakbutsignificantnegativecorrelationwiththenumber1

ofribosomespermRNAmolecule(Spearmanrho=-0.11;p=5.98e-23).2

Takentogether,alloftheaboveresultsrevealthat5IMtranscriptshavesequence3

featuresassociatedwithlowertranslationefficiency,atthestagesofbothtranslationinitiation4

andelongation.5

6

Non-ERtrafficked5IMtranscriptsareenrichedinER-proximalribosomeoccupancy7

Wenextinvestigatedtherelationshipbetween5IMPscoreandthelocalizationoftranslation8

withincells.Exploringthesubcellularlocalizationoftranslationatatranscriptome-scale9

remainsasignificantchallenge.Yet,arecentstudydescribedproximity-specificribosome10

profilingtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells11

(Janetal.2014).Inthismethod,ribosomesarebiotinylatedbasedontheirproximitytoa12

markerproteinsuchasSec61,whichlocalizestotheERmembrane(Janetal.2014).Foreach13

transcript,theenrichmentforbiotinylatedribosomeoccupancyyieldsameasureofER-14

proximityoftranslatedmRNAs.15

Wereanalyzedthisdatasettoexploretherelationshipbetween5IMPscoresandER-16

proximalribosomeoccupancyinHEK-293cells.Asexpected,transcriptsthatexhibitthe17

highestenrichmentforER-proximalribosomeswereSSCR-containingtranscriptsand18

transcriptswithotherER-targetingsignals.Yet,wenoticedasurprisingpositivecorrelation19

betweenER-proximalribosomeoccupancyand5IMtranscriptswithnoER-targetingevidence20

(Fig4).Thisrelationshipwastrueforbothmitochondrialgenes(Fig4;Spearmanrho=0.43;p21

<2.2e-16),andgeneswithnoevidenceforeitherER-ormitochondrial-targeting(Fig4;22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 14: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

14

Spearmanrho=-0.36;p<2.2e-16).Theseresultssuggestthat5IMtranscriptsaremorelikely1

thannon-5IMtranscriptstoengagewithER-proximalribosomes.2

3

5IMtranscriptsarestronglyenrichedinnon-canonicalEJCoccupancysites4

Sharedsequencefeaturesandfunctionaltraitsamong5IMtranscriptscausesonetowonder5

whatcommonmechanismsmightlink5IMsequencefeaturesto5IMtraits.Forexample,5IM6

transcriptsmightshareregulationbyoneormoreRNA-bindingproteins(RBPs).To7

investigatethisideafurther,wetestedforenrichmentof5IMtranscriptsamongthe8

experimentallyidentifiedtargetsof23differentRBPs(includingCLIP-Seq,andvariants;see9

Methods).Onlyonedatasetwassubstantiallyenrichedforhigh5IMPscoresamongtargets:a10

transcriptome-widemapofbindingsitesoftheExonJunctionComplex(EJC)inhumancells,11

obtainedviatandem-immunoprecipitationfollowedbydeepsequencing(RIPiT)(Singhetal.12

2014,2012).TheEJCisamulti-proteincomplexthatisstablydepositedupstreamofexon-13

exonjunctionsasaconsequenceofpre-mRNAsplicing(LeHiretal.2000).RIPiTdata14

confirmedthatcanonicalEJCsites(cEJCsites;sitesboundbyEJCcorefactorsandappearing15

~24ntsupstreamofexon-exonjunctions)occupy~80%ofallpossibleexon-exonjunction16

sitesandarenotassociatedwithanysequencemotif.Unexpectedly,manyEJC-associated17

footprintsoutsideofthecanonical-24regionswereobserved(Fig5A)(Singhetal.2012).18

These‘non-canonical’EJCoccupancysites(ncEJCsites)wereassociatedwithmultiple19

sequencemotifs,threeofwhichweresimilartoknownrecognitionmotifsforSRproteinsthat20

co-purifiedwiththeEJCcoresubunits(Singhetal.2012).Interestingly,anothermotif(Fig5B;21

top)thatwasspecificallyfoundinfirstexonsisnotknowntobeboundbyanyknownRNA-22

bindingprotein(Singhetal.2012).ThismotifwasCG-rich,asequencefeaturethatalso23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 15: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

15

defines5IMtranscripts.Thissimilaritypresagesthepossibilityofenrichmentoffirstexon1

ncEJCsitesamong5IMtranscripts.2

PositionanalysisofcalledEJCpeaksrevealedthatwhileonly9%ofcEJCsresideinfirst3

exons,19%ofallncEJCsarefoundthere.Whenweinvestigatedtherelationshipbetween5IMP4

scoresandncEJCsinearlycodingregions,wefoundastrikingcorrespondence—themedian5

5IMPscorewashighestfortranscriptswiththegreatestnumberofncEJCs(Fig5C;Wilcoxon6

RankSumTest;p<0.0001).Whenwerepeatedthisanalysisbyconditioningon5UIstatus,we7

similarlyfoundthatncEJCswereenrichedamongtranscriptswithhigh5IMPscoresregardless8

of5UIstatus(Fisher’sExactTest,p<3.16e-14,oddsratio>2.3;Fig5D).Theseresultssuggest9

thatthestrikingenrichmentofncEJCpeaksinearlycodingregionswasgenerallyapplicableto10

alltranscriptswithhigh5IMPscoresregardlessof5UIpresence.11

TranscriptsharboringN1-methyladenosine(m1A)havehigh5IMPscores12

ItisincreasinglyclearthatribonucleotidebasemodificationsinmRNAsarehighlyprevalent13

andcanbeamechanismforpost-transcriptionalregulation(Fryeetal.2016).OneRNA14

modificationpresenttowardsthe5’endsofmRNAtranscriptsisN1-methyladenosine(m1A)15

(Lietal.2016;Dominissinietal.2016),initiallyidentifiedintotalRNAandrRNAs(Dunn1961;16

Hall1963;KlootwijkandPlanta1973).Intriguingly,thepositionofm1Amodificationshas17

beenshowntobemorecorrelatedwiththepositionofthefirstintronthanwith18

transcriptionalortranslationalstartsites(Figure2gfromDominissinietal.2016).Whenthe19

distanceofm1AstoeachsplicesiteinagivenmRNAwascalculated,thefirstsplicesitewas20

foundtobethenearestfor85%ofm1As(Dominissinietal.2016).When5’UTRintronswere21

present,m1Awasfoundtobenearthefirstsplicesiteregardlessofthepositionofthestart22

codon(Dominissinietal.2016).Giventhat5IMtranscriptsarealsocharacterizedbythe23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 16: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

16

positionofthefirstintron,weinvestigatedtherelationshipbetween5IMPscoreandm1ARNA1

modificationmarks.2

Weanalyzedtheunionofpreviouslyidentifiedm1Amodifications(Dominissinietal.3

2016)acrossallcelltypesandconditions.Althoughthereissomeevidencethatthesemarks4

dependoncelltypeandgrowthcondition,itisdifficulttobeconfidentofthecelltypeand5

condition-dependenceofanyparticularmarkgivenexperimentalvariation(seeMethods).6

Nevertheless,wefoundthatmRNAswithm1Amodificationearlyinthecodingregion(first997

nucleotides)hadsubstantiallyhigher5IMPscoresthanmRNAslackingthesemarks(Fig6A;8

WilcoxonRankSumTestp=3.4e-265),andweregreatlyenrichedamong5IMtranscripts(Fig9

6A;Fisher’sExactTestp=1.6e-177;oddsratio=3.8).Inotherwords,thesequencefeatures10

withintheearlycodingregionthatdefine5IMPtranscriptsalsoassociatewithm1A11

modificationintheearlycodingregion.12

Wenextwonderedwhether5IMPscorewasrelatedtom1Amodificationgenerally,or13

onlyassociatedwithm1Amodificationintheearlycodingregion.Indeed,manyofthe14

previouslyidentifiedm1Apeakswerewithinthe5’UTRsofmRNAs(Lietal.2016).15

Interestingly,5IMPscoreswereonlyassociatedwithm1Amodificationintheearlycoding16

region,andnotwithm1Amodificationinthe5’UTR(Fig6B).Thisofferstheintriguing17

possibilitythatthesequencefeaturesthatdefine5IMPtranscriptsareco-localizedwithm1A18

modification.19

20

Discussion:21

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 17: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

17

Coordinatingtheexpressionoffunctionallyrelatedtranscriptscanbeachievedbypost-1

transcriptionalprocessessuchassplicing,RNAexport,RNAlocalizationortranslation(Moore2

andProudfoot2009).SetsofmRNAssubjecttoacommonregulatorytranscriptionalprocess3

canexhibitcommonsequencefeaturesthatdefinethemtobeaclass.Forexample,transcripts4

subjecttoregulationbyparticularmiRNAstendtosharecertainsequencesintheir3’UTRsthat5

arecomplementarytotheregulatorymiRNA(AmeresandZamore2013).Similarly,transcripts6

thatsharea5’terminaloligopyrimidinetractarecoordinatelyregulatedbymTORand7

ribosomalproteinS6kinase(Meyuhas2000).Herewequantitativelydefine‘5IM’transcripts8

asaclassthatsharescommonsequenceelementsandfunctionalproperties.Weestimatethe9

5IMclasstocomprise21%ofallhumantranscripts.10

Whereas35%ofhumantranscriptshaveoneormore5’UTRintrons,themajorityof11

5IMtranscriptshaveneithera5’UTRintronnoranintroninthefirst90ntsoftheORF.Other12

sharedfeaturesof5IMtranscriptsincludesequencefeaturesassociatedwithlowtranslation13

initiationrates.Theseare:(1)atendencyforRNAsecondarystructureintheregion14

immediatelyprecedingthestartcodon(Fig3A),andnearthe5’cap(Fig3B);(2)translational15

upregulationuponoverexpressionofeIF4E(Fig3C);and(3)morefrequentuseofnon-AUG16

startcodons(Fig3D).Alsoconsistentwithlowintrinsictranslationefficiencies,5IM17

transcriptsadditionallytendtodependonlessabundanttRNAstodecodethebeginningofthe18

openreadingframe(Fig3E).19

WehadpreviouslyreportedthattranscriptsencodingproteinswithER-and20

mitochondrial-targetingsignalsequences(SSCRsandMSCRs,respectively)areover-21

representedamongthe65%oftranscriptslacking5’UTRintrons(Ceniketal.2011).22

Transcriptsinthissetareenrichedforthesequencefeaturesdetectedbyour5IMclassifier.By23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 18: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

18

examiningtheseenrichedsequencefeatures,weshowedthatthe5IMclassextendsbeyond1

mRNAsencodingmembraneproteins.Janetal.recentlydevelopedatranscriptome-scale2

methodtoidentifymRNAsoccupiedbyER-proximalribosomesinbothyeastandhumancells3

(Janetal.2014).Asexpected,transcriptsknowntoencodeER-traffickedproteinswerehighly4

enrichedforER-proximalribosomeoccupancy.However,theirdataalsoshowedmany5

transcriptsencodingnon-ERtraffickedproteinstoalsobeengagedwithER-proximal6

ribosomes(ReidandNicchitta2015b).Similarly,severalotherstudieshavesuggestedacritical7

roleofER-proximalribosomesintranslatingseveralcytoplasmicproteins(ReidandNicchitta8

2015a).Here,wefoundthat5IMtranscriptsincludingthosethatarenotER-traffickedor9

mitochondrialweresignificantlymorelikelytoexhibitbindingtoER-proximalribosomes10

(Figs4A-B).11

InadditiontoribosomesdirectlyresidentontheER,aninterestingpossibilityisthe12

presenceofapoolofperi-ERribosomes(Janetal.2015;ReidandNicchitta2015a).13

Associationof5IMtranscriptswithsuchaperi-ERribosomepoolcouldpotentiallyexplainthe14

observedcorrelationof5IMstatuswithbindingtoER-proximalribosomes.TheERis15

physicallyproximaltomitochondria(RowlandandVoeltz2012),soperi-ERribosomesmay16

includethosetranslatingmRNAsonmitochondria(i.e.,mRNAswithMSCRs)(Sylvestreetal.17

2003).However,evenwhentranscriptscorrespondingtoER-traffickedandmitochondrial18

proteinswereexcludedfromconsideration,ER-proximalribosomeenrichmentand5IMP19

scoreswerehighlycorrelated(Fig4A).Thusanothersharedfeatureof5IMtranscriptsistheir20

translationonorneartheERregardlessoftheultimatedestinationoftheencodedprotein.21

Inanattempttoidentifyacommonfactorbinding5IMtranscripts,weaskedwhether22

5IMtranscriptswereenrichedamongtheexperimentallyidentifiedtargetsof23different23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 19: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

19

RBPs.OnlyoneRBPemerged--theexonjunctioncomplex(EJC).Specifically,weobserveda1

dramaticenrichmentofnon-canonicalEJC(ncEJC)bindingsiteswithintheearlycodingregion2

of5IMtranscripts.Further,theCG-richmotifidentifiedforncEJCsinfirstexonsisstrikingly3

similartotheCG-richmotifenrichedinthefirstexonsof5IMtranscripts(Fig5B).Previous4

workimplicatedRanBP2,aproteinassociatedwiththecytoplasmicfaceofthenuclearpore,as5

abindingfactorforsomeSSCRs(Mahadevanetal.2013).Thisfindingsuggeststhatnuclear6

poreproteinsmayinfluenceEJCoccupancyonthesetranscripts.7

EJCdepositionduringtheprocessofpre-mRNAsplicingenablesthenuclearhistoryof8

anmRNAtoinfluencepost-transcriptionalprocessesincludingmRNAlocalization,translation9

efficiency,andnonsensemediateddecay(Changetal.2007;KervestinandJacobson2012;10

Choeetal.2014).WhilecanonicalEJCbindingoccursatafixeddistanceupstreamofexon-exon11

junctionsandinvolvesdirectcontactbetweenthesugar-phosphatebackboneandtheEJCcore12

anchoringproteineIF4AIII,ncEJCbindingsiteslikelyreflectstableengagementbetweenthe13

EJCcoreandothermRNPproteins(e.g.,SRproteins)recognizingnearbysequencemotifs.14

AlthoughsomeRBPswereidentifiedforncEJCmotifsfoundininternalexons(Singhetal.15

2014,2012),todatenocandidateRBPhasbeenidentifiedfortheCG-richncEJCmotiffoundin16

thefirstexon.IfthismotifdoesresultfromanRBPinteraction,itislikelytobeoneormoreof17

the~70proteinsthatstablyandspecificallybindtotheEJCcore(Singhetal.2012).18

Finally,weobservedadramaticenrichmentform1Amodificationsamong5IM19

transcripts,withspecificenrichmentform1Amodificationsintheearlycodingregion.Given20

thisstrikingenrichmentperhapsitisperhapsnotsurprisingthatm1AcontainingmRNAswere21

alsoshowntohavemorestructured5’UTRsthatareGC-richcomparedtom1AlackingmRNAs22

(Dominissinietal.2016).Similarto5IMtranscripts,m1AcontainingmRNAswerefoundto23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 20: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

20

decoratestartcodonsthatappearinahighlystructuredcontext.WhileALKBH3hasbeen1

identifiedasaproteinthatcandemethylatem1A,itiscurrentlyunknownwhetherthereare2

anyproteinsthatcanspecificallyactas“readers”ofm1A.Recentstudieshavebeguntoidentify3

suchreadersforothermRNAmodificationssuchasYTHDF1,YTHDF2,WTAPandHNRNPA2B14

(Pingetal.2014;Liuetal.2014;Wangetal.2014,2015;Alarcónetal.2015).Ourstudy5

highlightsapossiblelinkbetweennon-canonicalEJCbindingandm1A.Hence,ourresultsyield6

theintriguinghypothesisthatoneormoreofthe~70proteinsthatstablyandspecificallybind7

totheEJCcorecanfunctionasanm1Areader.Futureworkinvolvingdirectedexperiments8

wouldbeneededtotestthishypothesis.9

Giventhat5IMtranscriptsareenrichedforER-targetedandmitochondrialproteins,it10

isplausiblethattheobservedfunctionalcharacteristicsof5IMtranscriptsaredrivensolelyby11

SSCRandMSCR-containingtranscripts.Hence,werepeatedallanalysesforthesubclassesof12

5IMtranscripts(MSCR-containing,SSCR-containing,S–/M–/SignalP+,orS–/M–/SignalP–).We13

foundtheobservedassociationsremainedstatisticallysignificantandhadthesamedirection14

ofeffect,evenaftereliminatingSSCR-andMSCR-containingtranscripts,despitethefactthatall15

trainingofthe5IMclassifierwasperformedonlyusingSSCRtranscripts.Wealsofoundthat16

5IMPscorewasequallyormorestronglyassociatedwitheachofthefunctionalcharacteristics17

comparedtothe5UIstatus.Inconclusion,themolecularassociationswereportapplyto5IM18

transcriptsasawhole,andarenotdrivensolelybythesubsetof5IMtranscriptsencodingER-19

ormitochondria-targetingsignalpeptides,andseemtoindicatesharedfeaturesbeyondsimple20

lackofa5’UTRintron.21

Anintriguingpossibilityisthat5IMtranscriptfeaturesassociatedwithlowerintrinsic22

translationefficiencymaytogetherenablegreater‘tunability’of5IMtranscriptsatthe23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 21: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

21

translationstage.Regulatedenhancementorrepressionoftranslation,for5IMtranscripts,1

couldallowforrapidchangesinproteinlevels.Thereareanalogiestothisscenarioin2

transcriptionalregulation,whereinhighlyregulatedgenesoftenhavepromoterswithlow3

baselinelevelsthatcanberapidlymodulatedthroughtheactionofregulatorytranscription4

factors.Asmoreribosomeprofilingstudiesarepublishedexaminingtranslationalresponses5

transcriptome-wideundermultipleperturbations,conditionsunderwhich5IMtranscriptsare6

translationallyregulatedmayberevealed.Directedexperimentswillbeneededtotest7

translationalfeaturesof5IMtranscriptshypothesizedviathiscomputationalanalysis.8

Takentogether,ouranalysesrevealtheexistenceofadistinct‘5IM’classcomprising9

21%ofhumantranscripts.Thisclassisdefinedbydepletionof5’proximalintrons,presence10

ofspecificRNAsequencefeaturesassociatedwithlowtranslationefficiency,non-canonical11

bindingbytheExonJunctionComplexandanenrichmentforN1-methyladenosine12

modification.13

14

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 22: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

22

MaterialsandMethods:1

DatasetsandAnnotations2

HumantranscriptsequencesweredownloadedfromtheNCBIhumanReferenceGene3

Collection(RefSeq)viatheUCSCtablebrowser(hg19)onJun252010(Kentetal.2002;Pruitt4

etal.2005).Transcriptswithfewerthanthreecodingexons,andtranscriptswherethefirst995

codingnucleotidesstraddlemorethantwoexonswereremovedfromfurtherconsideration.6

Thecriteriaforexclusionofgeneswithfewerthanthreecodingexonswastoensurethatthe7

analysisofdownstreamregionswaspossibleforallgenesthatwereusedinouranalysisof8

earlycodingregions.Specifically,thedownstreamregionswereselectedrandomlyfrom9

downstreamofthe3rdexons.Hence,geneswithfewerexonswouldnotbeabletocontributea10

downstreamregionpotentiallycreatingaskewinrepresentation.Intotaltherewere~300011

genesthatwereremovedfromconsiderationduetothisfilter.Therefore,ourclassifieris12

limitedinitsabilitytoassesstranscriptsfromthesegenes.However,theperformance13

measuresreportedinourmanuscriptarerobusttoexclusionofthesegenes,inthesensethat14

thesameclassoftranscriptswasusedinbothtrainingandtestdatasets.15

Transcriptswereclusteredbasedonsequencesimilarityinthefirst99coding16

nucleotides.Specifically,eachtranscriptpairwasalignedusingBLASTwiththeDUSTfilter17

disabled(Altschuletal.1990).TranscriptpairswithBLASTE-values<1e-25weregrouped18

intotranscriptclusters.Intotal,therewere15576transcriptclustersthatwereconsidered19

further.Theseclustersthatweresubsequentlyassignedtooneoffourcategories:MSCR-20

containing,SSCR-containing,S–/MSCR–SignalP+,orS–/MSCR–SignalP–asfollows:21

MSCR-containingtranscriptswereannotatedusingMitoCartaandothersourcesas22

describedin(Ceniketal.2011).SSCR-containingtranscriptswerethesetoftranscripts23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 23: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

23

annotatedtocontainsignalpeptidesintheEnsemblGenev.58annotations,whichwere1

downloadedthroughBiomartonJun252010.FortranscriptswithoutanannotatedMSCRor2

SSCR,thefirst70aminoacidswereanalyzedusingSignalP3.0(Bendtsenetal.2004).Using3

theeukaryoticpredictionmode,transcriptswereassignedtotheS–/MSCR–SignalP+category4

ifeithertheHiddenMarkovModelortheArtificialNeuralNetworkclassifiedthesequenceas5

signalpeptidecontaining.AllremainingtranscriptclusterswereassignedtotheS–/MSCR–6

SignalP–category.Thenumberoftranscriptclustersineachofthefourcategorieswas:37437

SSCR,737MSCR,696S–/MSCR–SignalP+,10400S–/MSCR–SignalP–.8

Foreachtranscriptcluster,wealsoconstructedmatchedcontrolsequences.Control9

sequenceswerederivedfromasinglerandomlychosenin-frame‘window’downstreamofthe10

3rdexonfromtheevaluatedtranscripts.Ifanevaluatedtranscripthadfewerthan9911

nucleotidesdownstreamofthe3rdexon,nocontrolsequencewasextracted.5UIlabelsand12

transcriptclusteringforthecontrolsequenceswereinheritedfromtheevaluatedtranscript.13

Therationaleforthisdecisionisthatouranalysisdependsonthepositionofthefirstintron;14

hencegeneswithfewerthantwoexonsneedtobeexcluded,asthesewillnothaveintrons.We15

furtherrequiredthematchedcontrolsequencestofalldownstreamoftheearlycodingregion.16

Inthevastmajorityofcasesthethirdexonfelloutsidethefirst99nucleotidesofthecoding17

region,makingthisaconvenientcriterionbywhichtochoosecontrolregions.18

SequenceFeaturesandMotifDiscovery19

36sequencefeatureswereextractedfromeachtranscript(TableS1).Thesequence20

featuresincludedtheratioofargininestolysines,theratioofleucinestoisoleucines,adenine21

content,lengthofthelongeststretchwithoutadenines,preferenceagainstcodonsthatcontain22

adeninesorthymines.ThesefeatureswerepreviouslyfoundtobeenrichedinSSCR-23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 24: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

24

containingandcertain5UI–transcripts(Ceniketal.2011;Palazzoetal.2007).Inaddition,we1

extractedratiosbetweenseveralotheraminoacidpairsbasedonhaving2

biochemical/evolutionarysimilarity,i.e.havingpositivescores,accordingtotheBLOSUM623

matrix(HenikoffandHenikoff1992).Toavoidextremeratiosgiventherelativelyshort4

sequencelength,pseudo-countswereaddedtoaminoacidratiosusingtheirrespective5

genome-wideprevalence.6

Inaddition,weusedthreepublishedmotiffindingalgorithms(AlignACE,DEME,and7

MoAN(Rothetal.1998;RedheadandBailey2007;Valenetal.2009))todiscoverRNA8

sequencemotifsenrichedamong5UI–transcripts.AlignACEimplementsaGibbssampling9

approachandisoneofthepioneeringeffortsinmotifdiscovery(Rothetal.1998).We10

modifiedtheAlignACEsourcecodetorestrictmotifsearchestoonlytheforwardstrandofthe11

inputsequencestoenableRNAmotifdiscovery.DEMEandMoANadoptdiscriminative12

approachestomotiffindingbysearchingformotifsthataredifferentiallyenrichedbetween13

twosetsofsequences(RedheadandBailey2007;Valenetal.2009).MoANhastheadditional14

advantageofdiscoveringvariablelengthmotifs,andcanidentifyco-occurringmotifswiththe15

highestdiscriminativepower.16

Intotal,sixmotifswerediscoveredusingthethreemotiffindingalgorithms(TableS1).17

Positionspecificscoringmatricesforallmotifswereusedtoscorethefirst99-lpositionsin18

eachsequence,wherelisthelengthofthemotif.Weassessedthesignificanceofeachmotif19

instancebycalculatingthep-valueofenrichment(Fisher’sExactTest)among5UI–transcripts20

consideringalltranscriptswithamotifinstanceachievingaPSSMscoregreaterthanequalto21

theinstanceunderconsideration.Thesignificancescoreandpositionofthetwobestmotif22

instanceswereusedasfeaturesfortheclassifier(TableS1).23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 25: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

25

5IMClassifierTrainingandPerformanceEvaluation1

WemodifiedanimplementationoftheRandomForestclassifier(Breiman2001)to2

modeltherelationshipbetweensequencefeaturesintheearlycodingregionandtheabsence3

of5’UTRintrons(5UIs).Thisclassifierdiscriminatestranscriptswith5’proximal-intron-4

minus-like-codingregionsandhenceisnamedthe‘5IM’classifier.Thetrainingsetforthe5

classifierwascomposedofSSCRtranscriptsexclusively.Thereweretworeasonstorestrict6

modelconstructiontoSSCRtranscripts:1)weexpectedthepresenceofspecificRNAelements7

asafunctionof5UIpresencebasedonourpreviouswork(Ceniketal.2011);and2)we8

wantedtorestrictmodelbuildingtosequencesthathavesimilarfunctionalconstraintsatthe9

proteinlevel.10

Weobservedthat5UI–transcriptswithintronsproximalto5’endofthecodingregion11

havesequencecharacteristicssimilarto5UI+transcripts(Fig1D).Tosystematically12

characterizethisrelationship,webuiltdifferentclassifiersusingtrainingsetsthatexcluded13

5UI–transcriptswithacodingregionintronpositionedatincreasingdistancesfromthestart14

codon.Weevaluatedtheperformanceofeachclassifierusing10-foldcrossvalidation.15

Giventhatalargenumberofmotifdiscoveryiterationswereneeded,wesoughtto16

reducethecomputationalburden.Weisolatedasubsetofthetrainingexamplestobeused17

exclusivelyformotiffinding.Motifdiscoverywasperformedonceusingthissetofsequences,18

andthesamemotifswereusedineachfoldofthecrossvalidationforalltheclassifiers.19

Imbalancesbetweenthesizesofpositiveandnegativetrainingexamplescanleadto20

detrimentalclassificationperformance(WangandYao2012).Hence,webalancedthetraining21

setsizeof5UI–and5UI+transcriptsbyrandomlysamplingfromthelargerclass.We22

constructed10sub-classifierstoreducesamplingbias,andforeachtestexample,the23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 26: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

26

predictionscorefromeachsubclassifierwassummedtoproduceacombinedscorebetween01

and10.Fortherestoftheanalyses,weusedtheclassifiertrainedusing5UI–transcriptswhere2

firstcodingintronfallsoutsidethefirst90codingnucleotides(Fig1E).3

Weevaluatedclassifierperformanceusinga10-foldcrossvalidationstrategyforSSCR-4

containingtranscripts(i.e.thetrainingset).Ineachfoldofthecross-validation,themodelwas5

trainedwithoutanyinformationfromtheheld-outexamples,includingmotifdiscovery.Forall6

theothertranscriptsandthecontrolsets(seeabove),the5IMPscoreswerecalculatedusing7

theclassifiertrainedusingSSCRtranscriptsasdescribedabove.5IMPscoredistributionfor8

thecontrolsetwasusedtocalculatetheempiricalcumulativenulldistribution.Usingthis9

function,wedeterminedthep-valuecorrespondingtothe5IMPscoreforalltranscripts.We10

correctedformultiplehypothesestestingusingtheqvalueRpackage(Storey2003).Basedon11

thisanalysis,weestimatethata5IMPscoreof7.41correspondstoa5%FalseDiscoveryRate12

andsuggestthat21%ofallhumantranscriptscanbeconsideredas5IMtranscripts.13

Whilethetheoreticalrangeof5IMPscoresis0-10,thehighestobserved5IMPis9.855.14

Wenotethatforallfiguresthatdepict5IMPscoredistributions,wedisplayedtheentire15

theoreticalrangeof5IMPscores(0-10).16

FunctionalCharacterizationof5IMTranscripts:17

Wecollectedgenome-scaledatasetfrompublicallyavailabledatabasesandfrom18

supplementaryinformationprovidedinselectedarticles.Forallanalyzeddatasets,wefirst19

convertedallgene/transcriptidentifiers(IDs)intoRefSeqtranscriptIDsusingtheSynergizer20

webserver(BerrizandRoth2008).Ifadatasetwasgeneratedusinganon-humanspecies(ex.21

TargetsidentifiedbyTDP-43RNAimmunoprecipitationinratneuronalcells),weused22

Homologenerelease64(downloadedonSep282009)toidentifythecorrespondingortholog23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 27: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

27

inhumans.Ifatleastonememberofatranscriptclusterwasassociatedwithafunctional1

phenotype,weassignedtheclustertothepositivesetwithrespecttothefunctionalphenotype.2

Ifmorethanonememberofaclusterhadthefunctionalphenotype,weonlyretainedonecopy3

oftheclusterunlesstheydifferedinaquantitativemeasurement.Forexample,considertwo4

hypotheticaltranscripts:NM_1andNM_2thatwereclusteredtogetherandhavea5IMPscore5

of8.5.IfNM_1hadanmRNAhalf-lifeof2hrwhileNM_2’shalf-lifewas1hrthanwesplitthe6

clusterwhilepreservingthe5IMPscoreforbothNM_1andNM_2.7

Oncethetranscriptswerepartitionedbasedonthefunctionalphenotype,werantwo8

statisticaltests:1)Fisher’sExactTestforenrichmentof5IMtranscriptswithinthefunctional9

category;2)WilcoxonRankSumTesttocompare5IMPscoresbetweentranscriptspartitioned10

bythefunctionalphenotype.Additionally,fordatasetswhereaquantitativemeasurementwas11

available(ex.mRNAhalf-life),wecalculatedtheSpearmanrankcorrelationbetween5IMP12

scoresandthequantitativevariable.Intheseanalyses,weassumedthatthetestspacewasthe13

entiresetofRefSeqtranscripts.Forallphenotypeswhereweobservedapreliminary14

statisticallysignificantresult,wefollowedupwithmoredetailedanalysesdescribedbelow.15

AnalysisofFeaturesAssociatedwithTranslation:16

Foreachtranscript,wepredictedthepropensityforsecondarystructureprecedingthe17

translationstartsiteandthe5’cap.Specifically,weextracted35nucleotidesprecedingthe18

translationstartsiteorthefirst35nucleotidesofthe5’UTR.Ifa5’UTRisshorterthan3519

nucleotides,thetranscriptwasremovedfromtheanalysis.hybrid-ss-minutility(UNAFold20

packageversion3.8)withdefaultparameterswasusedtocalculatetheminimumfolding21

energy(MarkhamandZuker2008).22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 28: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

28

CodonoptimalitywasmeasuredusingthetRNAAdaptationIndex(tAI),whichisbased1

onthegenomiccopynumberofeachtRNA(dosReisetal.2004).tAIforallhumancodons2

weredownloadedfromTulleretal.2010TableS1.tAIprofilesforthefirst30aminoacids3

werecalculatedforalltranscripts.Codonoptimalityprofilesweresummarizedforthefirst304

aminoacidsforeachtranscriptorbyaveragingtAIateachcodon.5

Wecarriedouttwocontrolexperimentstotestwhethertheassociationbetween5IMP6

scoreandtAIcouldbeexplainedbyconfoundingvariables.First,wepermutedthenucleotides7

inthefirst90ntsandobservednorelationshipbetween5IMPscoreandmeantAIwhenthese8

permutedsequenceswereused(FigS1).Second,weselectedrandomin-frame99nucleotides9

from3rdexontotheendofthecodingregionandfoundnosignificantdifferencesintAI(Fig10

S1).TheseresultssuggestthattherelationshipbetweentAIand5IMPscoreisconfinedtothe11

first30aminoacidsandisnotexplainedbysimpledifferencesinnucleotidecomposition.12

RibosomeprofilingandRNAexpressiondataforhumanlymphoblastoidcells(LCLs)13

weredownloadedfromNCBIGEOdatabaseaccessionnumber:GSE65912.Translation14

efficiencywascalculatedaspreviouslydescribed(Ceniketal.2015).Mediantranslation15

efficiencyacrossthedifferentcell-typeswasusedforeachtranscript. 16

AnalysisofProximitySpecificRibosomeProfilingData:17

WedownloadedproximityspecificribosomeprofilingdataforHEK293cellsfromJan18

etal.2014;TableS6.WeconvertedUCSCgeneidentifierstoHGNCsymbolsusingg:Profiler19

(Reimandetal.2011).WeretainedallgeneswithanRPKM>5ineitherinputorpulldownand20

requiredthatatleast30readsweremappedineitherofthetwolibraries.WeusedER-21

targetingevidencecategories“secretome”,“phobius”,“TMHMM”,“SignalP”,“signalSequence”,22

“signalAnchor”from(Janetal.2014)toannotategenesashavingER-targetingevidence.The23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 29: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

29

genesthatdidnothaveanyER-targetingevidenceor“mitoCarta”/“mito.GO”annotationswere1

deemedasthesetofgeneswithnoER-targetingormitochondrialevidence.Wecalculatedthe2

log2oftheratiobetweenER-proximalribosomepulldownRPKMandcontrolRPKMasthe3

measureofenrichmentforER-proximalribosomeoccupancy(asinJanetal.2014).Amoving4

averageofthisratiowascalculatedforgenesgroupedbytheir5IMPscore.Forthiscalculation,5

weusedbinsof30mitochondrialgenesor100geneswithnoevidenceofER-ormitochondrial6

targeting.7

AnalysisofGenome-wideBindingSitesofExonJunctionComplex:8

Dr.GeneYeoandGabrielPrattgenerouslyshareduniformlyprocessedpeakcallsfor9

experimentsidentifyinghumanRNAbindingproteintargets.Thesedatasetsincludevarious10

CLIP-SeqdatasetsanditsvariantssuchasiCLIP.Atotal49datasetsfrom22factorswere11

analyzed.Thesefactorswere:hnRNPA1,hnRNPF,hnRNPM,hnRNPU,Ago2,hnRNPU,HuR,12

IGF2BP1,IGF2BP2,IGF2BP3,FMR1,eIF4AIII,PTB,IGF2BP1,Ago3,Ago4,MOV10,Fip1,CF13

Im68,CFIm59,CFIm25,andhnRNPA2B1.Weextractedthe5IMPscoresforalltargetsofeach14

RBP.WecalculatedtheWilcoxonRankSumteststatisticcomparingthe5IMPscore15

distributionofthetargetsofeachRBPtoallothertranscriptswith5IMPscores.Noneofthe16

testedRBPtargetsetshadanadjustedp-value<0.05andamediandifferencein5IMPscore>17

1whencomparedtonon-targettranscripts.18

Inaddition,weusedRNA:proteinimmunoprecipitationintandem(RIPiT)datato19

determineExonJunctionComplex(EJC)bindingsites(Singhetal.2014,2012).Weanalyzed20

thecommonpeaksfromtheY14-Magohimmunoprecipitationconfiguration(Singhetal.2012;21

Kucukuraletal.2013).CanonicalEJCbindingsitesweredefinedaspeakswhoseweighted22

centerwere15to32nucleotidesupstreamofanexon-intronboundary.Allremainingpeaks23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 30: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

30

weredeemedas“non-canonical”EJCbindingsites.Weextractedallnon-canonicalpeaksthat1

overlappedthefirst99nucleotidesofthecodingregionandrestrictedouranalysisto2

transcriptsthathadanRPKMgreaterthanoneinthematchedRNA-Seqdata.3

Analysisofm1Amodifiedtranscripts4

WedownloadedthelistofRNAsobservedtocontainm1AfromLietal.(2016)and5

Dominissinietal.(2016).RefSeqtranscriptidentifierswereconvertedtoHGNCsymbolsusing6

g:Profiler(Reimandetal.2011).Theoverlapbetweenthetwodatasetswasdeterminedusing7

HGNCsymbols.m1Amodificationsthatoverlapthefirst99nucleotidesofthecodingregion8

weredeterminedusingbedtools(QuinlanandHall2010).9

Lietal.(2016)identified600transcriptswithm1AmodificationinnormalHEK29310

cells.Ofthese,368transcriptswerenotfoundtocontainthem1AmodificationinHEK293cells11

byDominissinietal.(2016).Yet,81%ofthesewerefoundtobem1Amodifiedinothercell12

types.Lietal.(2016)alsoanalyzedm1AuponH2O2treatmentandserumstarvationinHEK29313

cellsandidentifiedmanym1Amodificationsthatareonlyfoundthesestressconditions.14

However,20%of371transcriptsharboringstress-inducedm1Amodificationswerefoundin15

normalHEK293cellsbyDominissinietal.(2016).Takentogether,theseanalysessuggestthat16

transcriptome-widem1Amapsremainincomplete.Hence,weanalyzedthe5IMPscoresofall17

mRNAswithm1Aacrosscelltypesandconditions.WereportedresultsusingtheDominissini18

etal.datasetbutthesameconclusionsweresupportedbym1AmodificationsfromLietal.19

20

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 31: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

31

Acknowledgements:1

WewouldliketothankGeneYeo,andGabrielPrattforsharingpeakcallsforRNA-2

bindingproteintargetidentificationexperiments(CLIP,PAR-CLIP,iCLIP).WethankElif3

SarinayCenik,BaşarCenik,andAlperKüçükuralforhelpfuldiscussions.F.P.R.andH.N.C.were4

supportedbytheCanadaExcellenceResearchChairsProgram.Thisworkwassupportedin5

partbyNIHgrantsGM50037(M.J.M.),HG001715andHG004233(F.P.R.),theKrembil6

Foundation,anOntarioResearchFundaward,andtheAvonFoundation(F.P.R.).A.F.P.was7

supportedbyagrantfromtheCanadianInstitutesofHealthResearchtoA.F.P.(FRN102725).8

Authorcontributions:CC,MJM,andFPRdesignedthestudy.CCandHNCimplementedthe9

randomforestclassifierandcarriedoutallanalyses.GSandAAperformedinitialexperiments.10

MPSandAFPprovidedcriticalfeedback.MJMandFPRjointlysupervisedtheproject.CC,FPR11

andMJMwrotethemanuscriptwithinputfromallauthors.12

13

14

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 32: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

32

References:1

AlarcónCR,GoodarziH,LeeH,LiuX,TavazoieS,TavazoieSF.2015.HNRNPA2B1IsaMediatorof2m(6)A-DependentNuclearRNAProcessingEvents.Cell162:1299–1308.3

AltschulSF,GishW,MillerW,MyersEW,LipmanDJ.1990.Basiclocalalignmentsearchtool.JMolBiol4215:403–410.5

AmeresSL,ZamorePD.2013.DiversifyingmicroRNAsequenceandfunction.NatRevMolCellBiol14:6475–488.7

BabendureJR,BabendureJL,DingJ-H,TsienRY.2006.ControlofmammaliantranslationbymRNA8structurenearcaps.RNA12:851–861.9

BendtsenJD,NielsenH,vonHeijneG,BrunakS.2004.Improvedpredictionofsignalpeptides:SignalP103.0.JMolBiol340:783–795.11

BerrizGF,RothFP.2008.TheSynergizerservicefortranslatinggene,proteinandotherbiological12identifiers.Bioinformatics24:2272–2273.13

BreimanL.2001.RandomForests.MachLearn45:5–32.14

CenikC,ChuaHN,ZhangH,TarnawskySP,AkefA,DertiA,TasanM,MooreMJ,PalazzoAF,RothFP.152011.Genomeanalysisrevealsinterplaybetween5’UTRintronsandnuclearmRNAexportfor16secretoryandmitochondrialgenes.PLoSGenet7:e1001366.17

CenikC,DertiA,MellorJC,BerrizGF,RothFP.2010.Genome-widefunctionalanalysisofhuman5’18untranslatedregionintrons.GenomeBiol11:R29.19

CenikC,SarinayCenikE,ByeonGW,GrubertF,CandilleSI,SpacekD,AlsallakhB,TilgnerH,ArayaCL,20TangH,etal.2015.IntegrativeanalysisofRNA,translationandproteinlevelsrevealsdistinct21regulatoryvariationacrosshumans.GenomeRes.http://dx.doi.org/10.1101/gr.193342.115.22

ChangY-F,ImamJS,WilkinsonMF.2007.Thenonsense-mediateddecayRNAsurveillancepathway.23AnnuRevBiochem76:51–74.24

CharneskiCA,HurstLD.2013.Positivelychargedresiduesarethemajordeterminantsofribosomal25velocity.PLoSBiol11:e1001508.26

ChoeJ,RyuI,ParkOH,ParkJ,ChoH,YooJS,ChiSW,KimMK,SongHK,KimYK.2014.eIF4AIIIenhances27translationofnuclearcap-bindingcomplex-boundmRNAsbypromotingdisruptionofsecondary28structuresin5’UTR.ProcNatlAcadSciUSA111:E4577–86.29

DominissiniD,NachtergaeleS,Moshitch-MoshkovitzS,PeerE,KolN,Ben-HaimMS,DaiQ,DiSegniA,30Salmon-DivonM,ClarkWC,etal.2016.ThedynamicN(1)-methyladenosinemethylomein31eukaryoticmessengerRNA.Nature530:441–446.32

dosReisM,SavvaR,WernischL.2004.Solvingtheriddleofcodonusagepreferences:atestfor33translationalselection.NucleicAcidsRes32:5036–5044.34

DunnDB.1961.Theoccurrenceof1-methyladenineinribonucleicacid.BiochimBiophysActa46:198–35200.36

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 33: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

33

FryeM,JaffreySR,PanT,RechaviG,SuzukiT.2016.RNAmodifications:whathavewelearnedand1whereareweheaded?NatRevGenet17:365–372.2

GerashchenkoMV,GladyshevVN.2015.Translationinhibitorscauseabnormalitiesinribosome3profilingexperiments.NucleicAcidsRes42:e134.4

GingoldH,PilpelY.2011.Determinantsoftranslationefficiencyandaccuracy.MolSystBiol7:481.5

HallRH.1963.Methodforisolationof2′-O-methylribonucleosidesandN1-methyladenosinefrom6ribonucleicacid.BiochimicaetBiophysicaActa(BBA)-SpecializedSectiononNucleicAcidsand7RelatedSubjects68:278–283.8

HenikoffS,HenikoffJG.1992.Aminoacidsubstitutionmatricesfromproteinblocks.ProcNatlAcadSci9USA89:10915–10919.10

HershbergR,PetrovDA.2008.Selectiononcodonbias.AnnuRevGenet42:287–299.11

HinnebuschAG,LorschJR.2012.Themechanismofeukaryotictranslationinitiation:newinsightsand12challenges.ColdSpringHarbPerspectBiol4.http://dx.doi.org/10.1101/cshperspect.a011544.13

HongX,ScofieldDG,LynchM.2006.Intronsize,abundance,anddistributionwithinuntranslated14regionsofgenes.MolBiolEvol23:2392–2404.15

JanCH,WilliamsCC,WeissmanJS.2015.LOCALTRANSLATION.ResponsetoCommenton“Principlesof16ERcotranslationaltranslocationrevealedbyproximity-specificribosomeprofiling.”Science348:171217.18

JanCH,WilliamsCC,WeissmanJS.2014.PrinciplesofERcotranslationaltranslocationrevealedby19proximity-specificribosomeprofiling.Science346:1257521.20

KentWJ,SugnetCW,FureyTS,RoskinKM,PringleTH,ZahlerAM,HausslerD.2002.Thehuman21genomebrowseratUCSC.GenomeRes12:996–1006.22

KervestinS,JacobsonA.2012.NMD:amultifacetedresponsetoprematuretranslationaltermination.23NatRevMolCellBiol13:700–712.24

KlootwijkJ,PlantaRJ.1973.AnalysisofthemethylationsitesinyeastribosomalRNA.EurJBiochem39:25325–333.26

KucukuralA,ÖzadamH,SinghG,MooreMJ,CenikC.2013.ASPeak:anabundancesensitivepeak27detectionalgorithmforRIP-Seq.Bioinformatics29:2485–2486.28

LarssonO,LiS,IssaenkoOA,AvdulovS,PetersonM,SmithK,BittermanPB,PolunovskyVA.2007.29EukaryoticTranslationInitiationFactor4E–InducedProgressionofPrimaryHumanMammary30EpithelialCellsalongtheCancerPathwayIsAssociatedwithTargetedTranslationalDeregulation31ofOncogenicDriversandInhibitors.CancerRes67:6814–6824.32

LeeES,AkefA,MahadevanK,PalazzoAF.2015.Theconsensus5’splicesitemotifinhibitsmRNA33nuclearexport.PLoSOne10:e0122743.34

LeHirH,IzaurraldeE,MaquatLE,MooreMJ.2000.Thespliceosomedepositsmultipleproteins20-2435nucleotidesupstreamofmRNAexon-exonjunctions.EMBOJ19:6860–6869.36

LiuJ,YueY,HanD,WangX,FuY,ZhangL,JiaG,YuM,LuZ,DengX,etal.2014.AMETTL3-METTL1437complexmediatesmammaliannuclearRNAN6-adenosinemethylation.NatChemBiol10:93–95.38

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 34: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

34

LiX,XiongX,WangK,WangL,ShuX,MaS,YiC.2016.Transcriptome-widemappingrevealsreversible1anddynamicN1-methyladenosinemethylome.NatChemBiol.2http://dx.doi.org/10.1038/nchembio.2040(AccessedMarch29,2016).3

MahadevanK,ZhangH,AkefA,CuiXA,GueroussovS,CenikC,RothFP,PalazzoAF.2013.4RanBP2/Nup358potentiatesthetranslationofasubsetofmRNAsencodingsecretoryproteins.5PLoSBiol11:e1001545.6

MarintchevA,EdmondsKA,MarintchevaB,HendricksonE,ObererM,SuzukiC,HerdyB,SonenbergN,7WagnerG.2009.TopologyandregulationofthehumaneIF4A/4G/4Hhelicasecomplexin8translationinitiation.Cell136:447–460.9

MarkhamNR,ZukerM.2008.UNAFold:softwarefornucleicacidfoldingandhybridization.Methods10MolBiol453:3–31.11

MeyuhasO.2000.Synthesisofthetranslationalapparatusisregulatedatthetranslationallevel.EurJ12Biochem267:6321–6330.13

MooreMJ,ProudfootNJ.2009.Pre-mRNAprocessingreachesbacktotranscriptionandaheadto14translation.Cell136:688–700.15

PalazzoAF,MahadevanK,TarnawskySP.2013.ALREX-elementsandintrons:twoidentityelements16thatpromotemRNAnuclearexport.WileyInterdiscipRevRNA4:523–533.17

PalazzoAF,SpringerM,ShibataY,LeeC-S,DiasAP,RapoportTA.2007.Thesignalsequencecoding18regionpromotesnuclearexportofmRNA.PLoSBiol5:e322.19

ParsyanA,SvitkinY,ShahbazianD,GkogkasC,LaskoP,MerrickWC,SonenbergN.2011.mRNA20helicases:thetacticiansoftranslationalcontrol.NatRevMolCellBiol12:235–245.21

PingX-L,SunB-F,WangL,XiaoW,YangX,WangW-J,AdhikariS,ShiY,LvY,ChenY-S,etal.2014.22MammalianWTAPisaregulatorysubunitoftheRNAN6-methyladenosinemethyltransferase.Cell23Res24:177–189.24

PruittKD,TatusovaT,MaglottDR.2005.NCBIReferenceSequence(RefSeq):acuratednon-redundant25sequencedatabaseofgenomes,transcriptsandproteins.NucleicAcidsRes33:D501–4.26

QuinlanAR,HallIM.2010.BEDTools:aflexiblesuiteofutilitiesforcomparinggenomicfeatures.27Bioinformatics26:841–842.28

RedheadE,BaileyTL.2007.DiscriminativemotifdiscoveryinDNAandproteinsequencesusingthe29DEMEalgorithm.BMCBioinformatics8:385.30

ReidDW,NicchittaCV.2015a.DiversityandselectivityinmRNAtranslationontheendoplasmic31reticulum.NatRevMolCellBiol16:221–231.32

ReidDW,NicchittaCV.2015b.LOCALTRANSLATION.Commenton“PrinciplesofERcotranslational33translocationrevealedbyproximity-specificribosomeprofiling.”Science348:1217.34

ReimandJ,ArakT,ViloJ.2011.g:Profiler—awebserverforfunctionalinterpretationofgenelists35(2011update).NucleicAcidsResgkr378.36

RothFP,HughesJD,EstepPW,ChurchGM.1998.FindingDNAregulatorymotifswithinunaligned37noncodingsequencesclusteredbywhole-genomemRNAquantitation.NatBiotechnol16:939–38

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 35: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

35

945.1

RowlandAA,VoeltzGK.2012.Endoplasmicreticulum–mitochondriacontacts:functionofthejunction.2NatRevMolCellBiol13:607–625.3

ShahP,DingY,NiemczykM,KudlaG,PlotkinJB.2013.Rate-limitingstepsinyeastproteintranslation.4Cell153:1589–1601.5

SinghG,KucukuralA,CenikC,LeszykJD,ShafferSA,WengZ,MooreMJ.2012.ThecellularEJC6interactomerevealshigher-ordermRNPstructureandanEJC-SRproteinnexus.Cell151:750–764.7

SinghG,RicciEP,MooreMJ.2014.RIPiT-Seq:ahigh-throughputapproachforfootprintingRNA:protein8complexes.Methods65:320–332.9

StoreyJD.2003.Thepositivefalsediscoveryrate:aBayesianinterpretationandtheq-value.AnnStat1031:2013–2035.11

SylvestreJ,VialetteS,CorralDebrinskiM,JacqC.2003.LongmRNAscodingforyeastmitochondrial12proteinsofprokaryoticoriginpreferentiallylocalizetothevicinityofmitochondria.GenomeBiol4:13R44.14

TullerT,CarmiA,VestsigianK,NavonS,DorfanY,ZaborskeJ,PanT,DahanO,FurmanI,PilpelY.2010.15Anevolutionarilyconservedmechanismforcontrollingtheefficiencyofproteintranslation.Cell16141:344–354.17

ValenE,SandelinA,WintherO,KroghA.2009.Discoveryofregulatoryelementsisimprovedbya18discriminatoryapproach.PLoSComputBiol5:e1000562.19

VogelC,MarcotteEM.2012.Insightsintotheregulationofproteinabundancefromproteomicand20transcriptomicanalyses.NatRevGenet13:227–232.21

WangS,YaoX.2012.MulticlassImbalanceProblems:AnalysisandPotentialSolutions.IEEETransSyst22ManCybernBCybern.http://dx.doi.org/10.1109/TSMCB.2012.2187280.23

WangX,LuZ,GomezA,HonGC,YueY,HanD,FuY,ParisienM,DaiQ,JiaG,etal.2014.N6-24methyladenosine-dependentregulationofmessengerRNAstability.Nature505:117–120.25

WangX,ZhaoBS,RoundtreeIA,LuZ,HanD,MaH,WengX,ChenK,ShiH,HeC.2015.N(6)-26methyladenosineModulatesMessengerRNATranslationEfficiency.Cell161:1388–1399.27

ZinshteynB,GilbertWV.2013.LossofaconservedtRNAanticodonmodificationperturbscellular28signaling.PLoSGenet9:e1003675.29

30 31

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 36: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

36

FigureLegends:1

Fig1|Modelingtherelationshipbetweensequencefeaturesintheearlycoding2

regionandtheabsenceof5’UTRintrons(5UIs).(a)Forallhumantranscripts,3

informationabout36nucleotide-levelfeaturesoftheearlycodingregion(first994

nucleotides)and5UIpresencewasextracted.(b)Transcriptscontainingasignalsequence5

codingregion(SSCR)wereusedtotrainaRandomForestclassifierthatmodeledthe6

relationshipbetween5UIabsenceand36sequencefeatures.(c)Withthisclassifier,all7

humantranscriptswereassignedascorethatquantifiesthelikelihoodof5UIabsence8

basedonspecificRNAsequencefeaturesintheearlycodingregion.Transcriptswithhigh9

scoresarethusconsideredtohave5’-proximalintronminus-likecodingregions(5IMs).10

(d)“5’UTR-intron-minus-predictor”(5IMP)scoredistributionsforSSCR-containing11

transcriptsshifttohigherscoreswithlater-appearingfirstintrons,suggestingthat5IM12

codingregionfeaturesnotonlypredictlackofa5UI,butalsolackofearlycodingregion13

introns.(e)Classifierperformancewasoptimizedbyexcluding5UI–transcriptswith14

intronsappearingearlyinthecodingregion.Cross-validationperformance(areaunderthe15

precisionrecallcurve,AUPRC)wasexaminedforaseriesofalternative5IMclassifiers16

usingdifferentfirst-intron-positioncriterionforexcluding5UI–transcriptsfromthe17

trainingset(Methods).18

19

Fig2|Predicting5UIstatusaccuratelyusingonlyearlycodingsequences.(a)As20

judgedbyareaunderthereceiveroperatingcharacteristiccurve(AUROC)andAUPRC,The21

5IMclassifierperformedwellforseveraldifferenttranscriptclasses.(b)Thedistribution22

of5IMPscoresrevealsclearseparationof5UI+and5UI–transcriptsforSSCR-containing23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 37: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

37

transcripts,whereeachSSCR-containingtranscriptwasscoredusingaclassifierthatdid1

notusethattranscriptintraining(Methods).(c)Codingsequencefeaturesthatare2

predictiveof5’proximalintronpresencearerestrictedtotheearlycodingregion.Thiswas3

supportedbyidentical5IMclassifierscoredistributionswithrespectto5UIpresencefor4

negativecontrolsequences,eachderivedfromasinglerandomlychosen‘window’5

downstreamofthe3rdexonfromoneoftheevaluatedtranscripts.(d)MSCRtranscripts6

exhibitedamajordifferencein5IMPscoresbasedontheir5UIstatuseventhoughnoMSCR7

transcriptswereusedintrainingtheclassifier.(e)Transcriptspredictedtocontainsignal8

peptides(SignalP+)hada5IMPscoredistributionsimilartothatofSSCR-containing9

transcripts.(f)AftereliminatingSSCR,MSCR,andSignalP+transcripts,theremainingS–10

/MSCR–SignalP–transcriptswerestillsignificantlyenrichedforhigh5IMclassifierscores11

among5UI–transcripts.(g)Thecontrolsetofrandomlychosensequencesdownstreamof12

the3rdexonfromeachtranscriptwasusedtocalculateanempiricalcumulativenull13

distributionof5IMPscores.Usingthisfunction,wedeterminedthep-valuecorresponding14

tothe5IMPscoreforalltranscripts.Thereddashedlineindicatesthep-value15

correspondingto5%FalseDiscoveryRate.Theinsetdepictsthedistributionofvarious16

classesofmRNAsamongtheinputsetand5IMtranscripts.17

18

Fig3|5IMtranscriptsaremorelikelytobedifferentiallyexpressedandhave19

sequencefeaturesassociatedwithlowertranslationefficiency(a)The5IMclassifier20

scorewaspositivelycorrelatedwiththepropensityformRNAstructureprecedingthestart21

codon(-ΔG)(Spearmanrho=0.39;p<2.2e-16).Foreachtranscript,35nucleotides22

immediatelyupstreamoftheAUGwereusedtocalculate-ΔG(Methods).(b)The5IM23

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 38: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

38

classifierscorewaspositivelycorrelatedwiththepropensityformRNAstructurenearthe1

5’cap(-ΔG)(Spearmanrho=0.18;p=7.9e-130;Methods).(c)Transcriptsthatare2

translationallyupregulatedinresponsetoeIF4Eoverexpression(Larssonetal.2007)3

(blue)wereenrichedforhigher5IMPscores.Lightgreenshadingindicates5IMPscores4

correspondingto5%FDR.(d)Transcriptswithnon-AUGstartcodons(blue)exhibited5

significantlyhigher5IMPscoresthantranscriptswithacanonicalATGstartcodon(yellow).6

(e)Higher5IMPscoreswereassociatedwithlessoptimalcodons(asmeasuredbythe7

tRNAadaptationindex,tAI)forthefirst33codons.Foralltranscriptswithineach5IMP8

scorecategory(blue-high,orange-low),themeantAIwascalculatedateachcodonposition.9

Startcodonwasnotshown.(f)Transcriptswithlowertranslationefficiencywereenriched10

forhigher5IMPscores.Transcriptswithtranslationefficiencyonestandarddeviation11

belowthemean(“LOW”translation,yellow)andonestandarddeviationhigherthanthe12

mean(“HIGH”translation,blue)wereidentifiedusingribosomeprofilingandRNA-Seqdata13

fromhumanlymphoblastoidcelllines(Methods).14

15

Fig4|5IMtranscriptswithnoevidenceofER-targetingaremorelikelytoexhibitER-16

proximalribosomeoccupancy.AmovingaverageofER-proximalribosomeoccupancy17

wascalculatedbygroupinggenesby5IMPscore(seeMethods).Weplottedthemoving18

averageof5IMPscoresfortranscriptswithnoevidenceofER-ormitochondrialtargeting19

(green)orfortranscriptspredictedtobemitochondrial(purple).Weplottedarandom20

subsampleoftranscriptsontopofthemovingaverage(circles).21

22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 39: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

39

Fig5|5IMtranscriptsharbornon-canonicalExonJunctionComplex(EJC)binding1

sites(a)ObservedEJCbindingsites(Singhetal.2012)areshownforanexample5IM2

transcript(LAMC1).CanonicalEJCbindingsites(purple)are~24ntupstreamofanexon-3

intronboundary.Theremainingbindingsitesareconsideredtobenon-canonical(green).4

(b)ACG-richsequencemotifpreviouslyidentifiedtobeenrichedamongncEJCbinding5

sitesinfirstexons(Singhetal.2012)isshown(c)5IMPscorefortranscriptswithzero,6

one,twoormorenon-canonicalEJCbindingsitesinthefirst99codingnucleotidesreveals7

thattranscriptswithhigh5IMPscoresfrequentlyharbornon-canonicalEJCbindingsites.8

(d)Transcriptswithhigh5IMPscoresareenrichedfornon-canonicalEJCsregardlessof9

5UIpresenceorabsence.10

11

Fig6|5IMtranscriptsareenrichedformRNAswithearlycodingregionm1A12

modifications(a)Transcriptswithm1Amodifications(blue)inthefirst99codingnucleotides13

exhibitsignificantenrichmentfor5IMtranscriptsandhavehigher5IMPSscoresthan14

transcriptswithoutm1Amodificationsinthefirst99codingnucleotides(yellow).(b)15

Transcriptswithm1Amodifications(blue)inthe5’UTRdonotdisplayasimilarenrichment.16

17

SupportingInformation18

FigureS1|Associationbetween5IMPscoresandcodonoptimalityisrestrictedtothe19

first30aminoacidsandisnotexplainedbynucleotidecontent.(a)5IMtranscriptstend20

tohavelessoptimalcodonsintheirfirst30aminoacidsasmeasuredbytRNAadaptationindex21

(tAI).ThemediantAIforeachtranscriptwascalculatedandtranscriptsweregroupedbytheir22

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 40: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

CanCenik

40

5IMPscores.ThedistributionofmediantAIwasplottedasaboxplot.(b)Wepermutedthe1

nucleotidesofthefirst99nucleotidesandfoundthattherelationshipbetween5IMPandtAI2

waslost.Thisresultsuggestedthatnucleotidecompositionalonedoesn’texplainthe3

relationshipbetween5IMPandtAI.(c)Weusedthepreviouslydescribednegativecontrol4

sequences,eachderivedfromasinglerandomlychosenin-frame‘window’downstreamofthe5

3rdexonfromoneoftheevaluatedtranscripts.Wefoundnorelationshipbetween5IMPscore6

andtAIintheseregionssuggestingthattheobservedassociationisrestrictedtotheearly7

codingregion.8

TableS1-|Listoffeaturesusedbythe5IMclassifier.9

TableS2-|5IMPscoresofallhumantranscripts.10

TableS3-|Listoffunctionalfeaturestestedforassociationwith5IMPscores.11

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 41: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

5'UTR Intron (5UI) First 99 nucleotidesATGGCGCA

ATGACAGA

ATGTATAAT

Sequence 1

Sequence 2

Sequence N

5’UTRIntron Arg:Lys

+

1.3

3.5

4.0

Adenine Codon Bias

0.9

2.1

1.5

Extract CDS featuresA B

Model the relationship between early coding sequence and 5UI status

C

Grow Decision Tree Repeat SamplingGenerate Multiple Trees

Sequence 5Sequence 5Sequence 9Sequence 11

BootstrapSample

Assign ‘5IMP’ score reflecting 5UI– propensity5IMP Score

9.2/10Sequence X ATGAGCGCGT...?

Likely 5UI–

2.7/10Sequence Y ATGACTATAA...?

Likely 5UI+

5UTR

Intron 0-4

243

-5960

-7071

-84 85+

2

4

6

8

First Coding Intron Position

5IM

P Sc

ore

D

2000 50 100 1500.5

0.6

0.7

0.8

0.9

1.0

AUPR

C (5

UI– p

redi

ctio

n)

First Coding Intron Position

E90 nt cutoff

Figure 1.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 42: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

E

0 2 4 6 8 105IMP Score

0.00

0.10

0.20

0.30

Dens

ity

5UI+5UI–

SSCR–/ MSCR–/ SignalP+

0.00

0.10

0.20

0.30

C

Dens

ity

0 2 4 6 8 105IMP Score

3’ proximal exons

5UI+5UI–

A

SSCRMSCR

3’ proximal exons

S– /M– SignalP+

Prec

isio

n

Recall

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0False Positive Rate

Reca

ll

0.0 0.2 0.4 0.6 0.8 1.00.0

0.2

0.4

0.6

0.8

1.0

D

0 2 4 6 8 105IMP Score

5UI+5UI–

MSCR

Dens

ity

0.00

0.10

0.20

0.30

B

Dens

ity

0 2 4 6 8 10

SSCR

0.00

0.10

0.20

0.30

5IMP Score

5UI+5UI–

0 2 4 6 8 10

SSCR–/ MSCR–/ SignalP–

5IMP Score

5UI+5UI–

0.00

0.10

0.20

0.30

Dens

ity

F

p−value

Num

ber o

f Tra

nscr

ipts

0.0 0.2 0.4 0.6 0.8 1.00

1k

2k

G

5IMP Classifier Performance

S– /M– SignalP–

Figure 2

All

5IM

0% 40% 80%

Sign

alP–

Sign

alP+

SSC

R

MSC

R

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 43: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

A

0.00

0.05

0.10

0.15

0.20

0.25

Dens

ity

NON-AUG

AUG

Start Codon

5IMP Score0 2 4 6 8 10

D

C

5IMP Score0 2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

Dens

ity

TRANSLATIONDOWN TRANSLATION

UP

NO CHANGEEIF4E Overexpression

FRibosomes per mRNA

LOW

HIGH

0 2 4 6 8 105IMP Score

0.00

0.05

0.10

0.15

0.20

Dens

ity

B

5IMP Score

Seco

dary

Stru

ctur

e-∆

G

2 4 6 8

0

5

10

15

20

25

E

0.36

0.38

0.40

0.42

Codo

n O

ptim

ality

5 10 15 20 25 30Codon Position

5IM

Non-5IM

Mean tAI

Figure 3

2 4 6 8

0

5

10

15

20

Seco

dary

Stru

ctur

e-∆

G

5IMP Score

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 44: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

A

Mitochondrial Evidence

2 4 6 8

5IMP score

ER−p

roxim

al Ri

boso

me

Occu

panc

y

No ER-targeting or Mitochondrial Evidence

<−1.12

−0.8

−0.4

0

>0.3

Figure 4.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 45: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

AScale 1 kb

LAMC1

200 basesScale

EJC Peak Calls

Non-Canonical EJC(ncEJC) Peaks

Canonical EJC(cEJC) Peaks

2 4 6 8 10

0

0.050.1

0.15

0.20

0.25

0.30

0.35

5IMP Score

Dens

ity

All TranscriptsFirst 99 Coding Nucleotides

No Peaks 1 Peak

2+ Peaks

C

ncEJC Peaks

ACG

CG

GCC

GGCGCGCG

C

B

First 99 Coding NucleotidesD

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35 5UI+ Transcripts

No ncEJC peak

1+ ncEJC peak

5UI– Transcripts

No ncEJC peak

1+ ncEJC peak

0 2 4 6 8 10 2 4 6 8 10

5IMP Score5IMP Score

Dens

ity

Figure 5.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 46: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

0.00

0.05

0.10

0.15

0.20

Dens

ity

0.00

0.05

0.10

0.15

0.20

Dens

ity

5IMP Score

0 2 4 6 8 10

A

5IMP Score

0 2 4 6 8 10

5’UTRFirst 99 Coding Nucleotides

B

m1A-modified m1A-modified

No m1A No m1A

Figure 6

.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint

Page 47: A Common Class of Transcripts with 5′-Intron …...2016/09/06  · 9 transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). 10 Unexpectedly, we

0.25

0.30

0.35

0.40

0.45

5IMP Score

Med

ian

tAI

0-7.41 7.41-10

A

Codon Position

5 10 15 20 25 30

0.30

0.32

0.34

0.36

0.38

0.40

0.42

Codo

n O

ptim

ality

mea

n tA

I

First 99 nucleotides permuted

5IMP ≤ 7.41 7.41 < 5IMP

B

Codon Position

5 10 15 20 25 30

0.30

0.32

0.34

0.36

0.38

0.40

0.42

Random 99 nucleotidesDownstream of 3rd Exon

Codo

n O

ptim

ality

mea

n tA

I

7.41 < 5IMP5IMP ≤ 7.41

C

Figure S1.CC-BY-ND 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted September 6, 2016. . https://doi.org/10.1101/057455doi: bioRxiv preprint