26/10/16
SciLifeLab RNA-seq course
Allele-specific expression, ASE
Olof Emanuelsson, KTH Royal Institute of Technology
[email protected], 11:00-12:00, Navet (E10), BMC, Uppsala
ASE: allele-specific expression
Outline
1. Definition of ASE
2. Detecting ASE (introductory case)
3. Applications and prevalence of ASE
4. Important ASE considerations
   (a) Variant calling
   (b) Mapping bias
   (c) Many variants in a gene
5. ASE tools
6. GeneiASE – a tool to detect genes with ASE from RNA-seq data
[1] Definition of allele-specific expression (ASE)
Adding another layer to transcriptome complexity...
One gene can produce many different transcripts...
Adopted from Unneberg, 2010
Adding another layer to transcriptome complexity...
♀ ♂ ... and each gene is present on two chromosomes. => it has two alleles
Allele, definition
An allele is the variant form of a given gene (or locus). Sometimes, different alleles can result in different observable phenotypic traits, such as different pigmentation. /…/ If both alleles at a gene (or locus) on the homologous chromosomes are the same, they and the organism are homozygous with respect to that gene (or locus). If the alleles are different, they and the organism are heterozygous with respect to that gene (or locus).
https://en.wikipedia.org/wiki/Allele
Allele-specific expression, definition
An imbalance in transcription between the maternal and paternal alleles at a locus.
• I.e., a deviation from the expected 50/50 ratio of transcription from the two alleles of a diploid organism.
• Can be assessed within a single individual
(Present also when ploidy > 2, e.g., plants)
Other events may also be “allele-specific”, e.g.
• transcription factor binding
• DNA backbone methylation
• X-chromosome inactivation in female mammals
Allele-specific expression, definition
genomic DNA -> transcript (e.g. mRNA)
• SNV = single nucleotide variant
• The genomic SNV is reflected in the transcribed RNA (T is U in RNA).
(Figure: allele-specific gene expression (mRNAs) from a diploid genome carrying an SNV.)
[2] Detecting ASE
Detecting allele-specific expression
Wet lab technologies:
• microarrays (if designed properly)
• qRT-PCR + TaqMan
• pyrosequencing
• RNA-seq
N.B.: as these are sequence-based they will not provide any information in the case of a homozygous allele, although it may still be expressed predominantly from only one of the chromosomes.
eQTL – expression quantitative trait loci
Another approach! Requires many subjects
Detecting allele-specific expression using RNA-seq data
• RNA-seq reads provide the sequence of a transcript
• ... which enables the determination of the allelic origin of the reads overlapping with the SNV
(Figure: RNA-seq reads carrying T or C at the SNV, shown with the allele-specific mRNAs of a diploid genome.)
Detecting allele-specific expression using RNA-seq data
General outline:
1. Map the RNA-seq reads
2. Count the reads that map to either allele
3. Calculate effect size and p-value
Detecting allele-specific expression using RNA-seq data
1. Map the RNA-seq reads
Paternal allele (a): …AGTCTTCCAATTAGC…
Maternal allele (A): …AGTCTTCTAATTAGC…
Reads – 10x coverage of the locus:
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCCAATTAGC…
Mapped reads, paternal allele (a):
…AGTCTTCCAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCCAATTAGC…
Mapped reads, maternal allele (A):
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
Detecting allele-specific expression using RNA-seq data
2. Count the reads
Paternal allele (a): 3x …AGTCTTCCAATTAGC…
Maternal allele (A): 7x …AGTCTTCTAATTAGC…
3 reads mapped to paternal allele
7 reads mapped to maternal allele
In total 10 reads mapped to the locus
Detecting allele-specific expression using RNA-seq data
3. Calculate effect size and p-value
Effect size: (other definitions possible)
ASE effect = c_alt / (c_alt + c_ref) - 0.5
i.e., the fraction of counts mapped to the alternative allele minus 0.5 =>
• if no ASE then ASE effect = 0
• range of ASE effect is [-0.5, 0.5]
P-value: Use binomial with p = 0.5 (assuming 50/50 transcription)
Our example from previous slide:
Effect size = ASE effect = c_alt / (c_alt + c_ref) - 0.5 = 3 / (3 + 7) - 0.5 = -0.2
P-value: binomial test for deviation from 50/50 distribution between alleles (in R):
> pbinom(3, size=10, prob=0.5)
[1] 0.171875
⇒ Not significant in this particular example
⇒ If coverage was 30x (9+21 reads) instead of 10x (3+7), then p-value < 0.03
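The effect size and p-value from this example can be reproduced in a few lines. The sketch below uses Python with only the standard library rather than the R call on the slide; the function names `ase_effect` and `binom_cdf` are made up for this illustration, and the lower-tail probability is meant to mirror what `pbinom` returns.

```python
from math import comb

def ase_effect(c_alt: int, c_ref: int) -> float:
    """Fraction of reads on the alternative allele, minus 0.5."""
    return c_alt / (c_alt + c_ref) - 0.5

def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p), the quantity R's pbinom(k, n, p) returns."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Slide example: 3 reads on the alternative allele, 7 on the reference allele.
print(ase_effect(3, 7))   # close to -0.2
print(binom_cdf(3, 10))   # 0.171875, as in the R output
print(binom_cdf(9, 30))   # below 0.03 at 30x coverage (9 + 21 reads)
```

Note that this is a one-sided probability; a two-sided test would also count equally extreme outcomes in the other direction.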
eQTL vs. ASE
eQTL
• Inter-individual differences in expression
• Modest effects
• Large number of SNP-gene combinations
• Many samples needed
• May use microarrays for gene expression
• Genotyping required
ASE
• Sufficient power with a single individual
• Identical cellular environment for the two chromosomes
• No association to regulatory region
• Must use RNA-seq for gene expression
(Figure: 10 individuals genotyped.)
[3] Applications and prevalence of ASE
Application of ASE
Find protein variants
To infer a changed protein, the SNV must be
• in coding region
• non-synonymous
(Figure: allele-specific gene expression (mRNAs) from a diploid genome; the SNV gives rise to different proteins.)
Application of ASE
Find cis-regulatory variant
Possible to detect if you also have information about non-transcribed variants (e.g., from whole-genome DNA sequencing or SNP-array).
(Figure: allele-specific gene expression (mRNAs) with a cis-regulatory variant and an SNV.)
Application of ASE
Normal vs. tumor expression
Possible to detect if you have expression measured from both normal and tumor tissue (in the same individual).
Prevalence of ASE
(Figure: genes with significant ASE (% of genes with heterozygous variant).)
[4] Important ASE considerations
Important ASE considerations
(a) Variant detection
(b) Mapping bias
(c) Many variants in a gene
[4] Important ASE considerations: (a) Variant detection
Variant detection
Variant = a position in the genome that is different from another genome.
• Homozygous variant: the two alleles are identical to each other
• Heterozygous variant: the two alleles are different
• “Ref.” = the allele is the same as for the reference genome
• “Alt.” = alternate = the allele is different from the reference genome
• SNV is one type of variant, others include insertion, deletion, ...
Variant detection = detecting what variants are present in a sample:
1. Variant calling – any position with evidence of an alternative base
2. Variant prioritization – define reliable variants with high confidence
Typically performed based on genomic DNA data, from
• Microarrays (e.g. Illumina Omni 2.5M)
• Sequencing (e.g. whole-genome re-sequencing or exome sequencing)
Variant detection from sequencing data
Start by mapping the reads.
Paternal allele (a): …AGTCTTCCAATTAGC…
Maternal allele (A): …AGTCTTCTAATTAGC…
Reads – 10x coverage of the locus:
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCCAATTAGC…
Variant detection from sequencing data
OK, piece of cake?
Paternal allele (a): …AGTCTTCCAATTAGC…
Maternal allele (A): …AGTCTTCTAATTAGC…
Mapped reads: 3 reads (…AGTCTTCCAATTAGC…) on the paternal allele and 7 reads (…AGTCTTCTAATTAGC…) on the maternal allele.
Variant detection from sequencing data
This is what we actually have:
Reference sequence: …AGTCTTCTAATTAGC…
Mapped reads:
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCTAATTAGC…
…AGTCTTCCAATTAGC…
…AGTCTTCCAATTAGC…
(The paternal allele (a), …AGTCTTCCAATTAGC…, and the maternal allele (A), …AGTCTTCTAATTAGC…, are not known in advance.)
=> need to detect the variant positions in the reference sequence
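To make the idea of detecting variant positions against the reference concrete, here is a deliberately naive sketch in Python. It assumes the reads are already aligned to the start of the reference and have the same length, and it uses a hypothetical `min_alt_reads` cutoff; real callers such as GATK or Samtools also weigh base and mapping qualities, which this toy version ignores.

```python
from collections import Counter

REFERENCE = "AGTCTTCTAATTAGC"

# The 10 mapped reads from the slide: 7 carry T and 3 carry C at the variant position.
reads = ["AGTCTTCTAATTAGC"] * 7 + ["AGTCTTCCAATTAGC"] * 3

def call_variants(reference, aligned_reads, min_alt_reads=2):
    """Return {position: base counts} where enough reads disagree with the reference."""
    variants = {}
    for pos, ref_base in enumerate(reference):
        counts = Counter(read[pos] for read in aligned_reads)
        alt_support = sum(n for base, n in counts.items() if base != ref_base)
        if alt_support >= min_alt_reads:
            variants[pos] = counts
    return variants

print(call_variants(REFERENCE, reads))  # {7: Counter({'T': 7, 'C': 3})}
```

The single reported position (index 7, counting from 0) is the T/C difference between the two alleles; the per-base counts at that position are also the allele counts needed later for ASE.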
Variant detection from sequencing data
Standard: GATK (DePristo et al., 2011) or Samtools – works on any mapped sequencing data.
GATK scores the SNVs by taking into account a number of characteristics, including:
• Sequencing depth (coverage)
• Mapping quality
• Position bias (base quality)
Specific RNA-seq based tools:
• Colib’read – Le Bras et al., 2016
• RVboost – Wang et al., 2014
• ACCUSA2 – Piechotta et al., 2013
GATK is the most widely used, even for RNA-seq.
Variant detection – VCF, Variant Call Format
VCF is a text file format (“flat text”). Example VCF output from GATK:
##fileformat=VCFv4.1
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1_NA12878 [SAMPLE1_BLALBA] ...
1 873762 . T C 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 ...
1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 ...
1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26 ...
1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:61:61,0,255 ...
...
GT: the genotype of this sample at this site (0/0, 0/1, 1/1, 1/2, ...). 0 = ref., 1 = alt.
AD: allele depths, i.e., the number of reads that support each of the reported alleles
GQ: quality of assigned genotype (max = 99)
Full specification of VCF file format: http://samtools.github.io/hts-specs/
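As an illustration of how the GT and AD fields can be read out of one sample column, here is a small Python sketch. The helper name `parse_sample` and the returned dictionary keys are made up for this example, and it only handles the simple biallelic case shown in the records above.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair up FORMAT keys with one sample's values and pull out genotype and allele depths."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    alleles = values["GT"].replace("|", "/").split("/")
    # AD lists the reference depth first, then the alternate allele depth(s).
    ref_depth, alt_depth = (int(x) for x in values["AD"].split(",")[:2])
    total = ref_depth + alt_depth
    return {
        "genotype": values["GT"],
        "heterozygous": len(set(alleles)) > 1,
        "ref_depth": ref_depth,
        "alt_depth": alt_depth,
        "alt_fraction": alt_depth / total if total else None,
    }

# First and second records of the example VCF.
first = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
second = parse_sample("GT:AD:DP:GQ:PL", "1/1:0,105:94:99:255,255,0")
print(first["heterozygous"], round(first["alt_fraction"], 3))  # True 0.449
print(second["heterozygous"])                                  # False
```

Only heterozygous sites such as the first record are informative for ASE, since both alleles must be distinguishable in the reads.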
Variant detection – which variants to use (prioritization)?
Variants from RNA-seq
• What sequencing depth? Influences the power (see figure)
• Heterozygous
• Other criteria
Filtering of known variants
• Keep only variants in dbSNP?
(Figure: labelled with sequencing depth and allelic ratio.)
[4] Important ASE considerations: (b) Mapping bias
Mapping bias
Reference genome variants (“ref.”) have an advantage in the mapping.
Maternal allele: …ATCGAATGAAGCTCATTGGATCAGAT… (ref.)
Paternal allele: …ATCGAATGAAGCTTATTGGATCAGAT… (alt.)
Reference: …ATCGAATGAAGCTCATTGGATCAGAT…
Mapping of reads
Read from maternal allele:           AGCTCATT
                                     ::::::::
Reference:                  ATCGAATGAAGCTCATTGGATCAGAT
                                     :::: :::
Read from paternal allele:           AGCTTATT
The paternal allele read will map with a lower mapping quality. In case of a sequencing error or poor base quality at another position, this might push the mapping quality of the paternal allele read below the threshold, and the read will be discarded.
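The effect can be mimicked with a toy filter in Python. This is only a sketch: a simple mismatch count with a hypothetical `max_mismatches` cutoff stands in for the mapping quality a real aligner would compute, and the read offset is given rather than searched for.

```python
REFERENCE = "ATCGAATGAAGCTCATTGGATCAGAT"
OFFSET = 9  # where the 8-base reads from the slide align on the reference

def mismatches(read: str, reference: str, offset: int) -> int:
    """Number of positions where the read differs from the reference."""
    window = reference[offset:offset + len(read)]
    return sum(r != g for r, g in zip(read, window))

def is_kept(read: str, max_mismatches: int = 1) -> bool:
    """Toy stand-in for a mapping-quality threshold."""
    return mismatches(read, REFERENCE, OFFSET) <= max_mismatches

maternal, paternal = "AGCTCATT", "AGCTTATT"
# Same sequencing error (last base read as G) on a read from each allele.
maternal_err, paternal_err = "AGCTCATG", "AGCTTATG"

print(is_kept(maternal), is_kept(paternal))          # True True
print(is_kept(maternal_err), is_kept(paternal_err))  # True False
```

With an identical sequencing error, only the read from the alternative (paternal) allele is lost, which is what skews the observed allele counts toward the reference allele.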
Mapping bias – example in real data
Heterozygous variants (alt/ref) mapped to reference genome.
X-axis: alternate allele fraction [0,1]
Y-axis: density
(Data from 16 RNA-seq experiments; variants detected with RNA-seq data or SNP array.)
Mapping bias – ways to get around it in ASE detection
• Masking variants ({A,C,G,T} => N)
  You lose information.
• Construct all possible versions of the genome from existing variants
  Can soon generate a prohibitive number of genome versions.
• Map reads to diploid genome (or transcriptome)
  Requires that you either have or construct the diploid genome (or transcriptome) of the individual. E.g., Turro et al. (2011) (transcriptome).
• Modify the binomial probability p to reflect the mapping bias.
  Requires simulation to properly modify p. E.g., Montgomery et al. (2010).
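The last option, modifying the binomial probability, can be sketched as follows in Python. The null probability of 0.45 for the alternative allele is a purely hypothetical value chosen for illustration; in practice it would have to come from simulation, as noted above, and the function names are made up for this sketch.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def lower_tail_p(c_alt: int, c_ref: int, p_alt_null: float = 0.5) -> float:
    """Probability of seeing at most c_alt alternative-allele reads under the null."""
    return binom_cdf(c_alt, c_alt + c_ref, p_alt_null)

unadjusted = lower_tail_p(9, 21)                 # null: 50/50
adjusted = lower_tail_p(9, 21, p_alt_null=0.45)  # null shifted by assumed mapping bias
print(unadjusted < adjusted)  # True: the same counts are less surprising once bias is expected
```

Lowering the expected alternative-allele fraction makes a shortfall of alternative reads weaker evidence for ASE, which is the intended correction.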
[4] Important ASE considerations: (c) Many variants in a gene
Many variants in a gene
More than one variant within a gene is common:
(Figure: allele-specific gene expression (mRNAs) from a diploid genome with SNV_1 and SNV_2; 2 different “haploisoforms”.)
Many variants in a gene
RNA-seq contains information that there are two heterozygous SNVs.
Variants detected from RNA-seq reads:
• A/G at one locus
• T/C at one locus
(Figure: RNA-seq reads covering SNV_1 (A or G) and SNV_2 (T or C), shown with the allele-specific mRNAs of a diploid genome.)
Many variants in a gene
But: RNA-seq does not necessarily capture the relation between the two SNVs.
Possible combinations (haplotypes) from RNA-seq reads:
• A+T and G+C
• A+C and G+T
(Figure: the same RNA-seq reads, each covering only one of SNV_1 and SNV_2, shown with the allele-specific mRNAs of a diploid genome.)
Many variants in a gene – phasing
Phasing = deciding which alleles are on the same chromosomal homologue
Possible combinations (haplotypes) from RNA-seq reads:
• A+T and G+C: AT and GC
• A+C and G+T: AC and GT
(2+2 different “haploisoforms”)
But can our RNA-seq reads provide the phase?
(Figure: diploid genome with SNV_1 and SNV_2 and the RNA-seq reads covering them.)
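Whether the reads can provide the phase depends on whether any single read covers both SNVs. The Python sketch below illustrates that idea under simplifying assumptions: each read is reduced to a pair (base at SNV_1, base at SNV_2), with `None` where the read does not reach that SNV, and the function name `infer_phase` and the example reads are invented for this illustration.

```python
from collections import Counter

def infer_phase(read_pairs, snv1=("A", "G"), snv2=("T", "C")):
    """Return the supported haplotype pair, or None if the reads cannot decide."""
    # Only reads covering both SNVs carry phase information.
    spanning = Counter(pair for pair in read_pairs if None not in pair)
    a1, b1 = snv1
    a2, b2 = snv2
    option_1 = spanning[(a1, a2)] + spanning[(b1, b2)]  # A+T and G+C
    option_2 = spanning[(a1, b2)] + spanning[(b1, a2)]  # A+C and G+T
    if option_1 > option_2:
        return (a1 + a2, b1 + b2)
    if option_2 > option_1:
        return (a1 + b2, b1 + a2)
    return None

# Reads that each cover only one SNV: both SNVs are detected, but the phase is not.
single_snv_reads = [("A", None), ("G", None), (None, "T"), (None, "C")]
print(infer_phase(single_snv_reads))  # None

# Adding a few reads that span both SNVs resolves it.
print(infer_phase(single_snv_reads + [("A", "T"), ("G", "C"), ("G", "C")]))  # ('AT', 'GC')
```

With short reads and SNVs far apart in the transcript, spanning reads are rare, which is why only partial phasing is typically possible from RNA-seq alone.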
Phasing is useful but not necessary to detect ASE
Phasing information is typically obtained by sequencing the genomes of the parents of the subject. Direct haplotype sequencing is also possible.
If you don’t know the phase (and for most RNA-seq data sets, you don’t):
• Try to infer it from
  (a) your RNA-seq data – possible, but typically only partial phasing
  (b) existing population data (LD) – not applicable to new variants
• Disregard it and calculate ASE anyway
Phasing
• reduces mapping bias
• enables the detection of haploisoform expression (isoforms representing the two homologous chromosomes)
• but is not necessary to detect ASE in genes with >1 SNV
[5] ASE tools
ASE tools
A list of tools that can detect ASE, given specified input data:
• cisASE – paired genomic + transcriptomic data, Liu et al., 2016
• MutRSeq – nonsynonymous SNVs from RNA-seq data, Fu et al., 2016
• GeneiASE – unphased RNA-seq data, Edsgärd et al., 2016
• ASE-TIGAR – parental data required, Bayesian, Nariai et al., 2016
• ASEQ – paired genomic + transcriptomic data, Romanel et al., 2015
• MBASED – phased or unphased RNA-seq data, Mayba et al., 2015
• Allim – parental data required, Pandey et al., 2013
• MMSEQ – attempts haploisoform identification, Turro et al., 2011
• (Skelly) – requires phased data, Skelly et al., 2011
• AlleleSeq – requires genomic sequence, Rozowsky et al., 2011
• (AlleleDB – database for ASE etc. of 1000 genomes, Chen et al., 2016)
ASE tools – where only RNA-seq data from a single individual is required.
• cisASE – paired genomic + transcriptomic data, Liu et al., 2016
• MutRSeq – nonsynonymous SNVs from RNA-seq data, Fu et al., 2016
• GeneiASE – unphased RNA-seq data, Edsgärd et al., 2016
• ASE-TIGAR – parental data required, Bayesian, Nariai et al., 2016
• ASEQ – paired genomic + transcriptomic data, Romanel et al., 2015
• MBASED – phased or unphased RNA-seq data, Mayba et al., 2015
• Allim – parental data required, Pandey et al., 2013
• MMSEQ – attempts haploisoform identification, Turro et al., 2011
• (Skelly) – requires phased data, Skelly et al., 2011
• AlleleSeq – requires genomic sequence, Rozowsky et al., 2011
• (AlleleDB – database for ASE etc. of 1000 genomes, Chen et al., 2016)
[6] GeneiASE
GeneiASE
GeneiASE detects genes with significant ASE, in single individuals and based only on RNA-seq data. Haplotype information (phasing) is not needed.
Data required:
• RNA-seq data
Pre-processing required:
• Mapping and quality control of reads
• Variant detection (e.g., GATK)
• Filter variants if desired
• Allele counts for variants extracted into custom input text file
Availability:
• Edsgärd et al., Scientific Reports 6:21134, 2016
• https://github.com/edsgard/geneiase (GNU GPL3 license)
GeneiASE
The situation:
• unphased data
• non-uniform effect within gene
• technical variability
The GeneiASE solution:
1. For each gene, loop over all its SNVs and their 2x1 matrix of read counts
2. Calculate a test statistic (s_ij) for each SNV, based on read counts
3. Combine the test statistics for the SNVs within a gene => test statistic for entire gene (g_i)
4. Resample from parametric null SNV model (estimated from DNA data) 10^5 times, calculate the resulting distribution of gene test statistic (g_0i).
5. Compare g_i to g_0i and calculate a p-value for gene i.
Gene i   SNV_1   SNV_2   SNV_3   SNV...
Ref      60      20      70      ...
Alt      40      80      30      ...
GeneiASE – calculate SNP-based gene-wise test statistic
1. Count reads for each SNV in a gene; add pseudocounts if required
2. Calculate SNV test statistic s_ij based on absolute value of effect size, eff.
   Absolute value of effect size => Undirected effect
3. Calculate gene test statistic g_i using Stouffer-Liptak method; k is number of SNVs in gene i
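The formulas from this slide are not preserved in the transcript, so the following Python sketch only illustrates the general shape of the calculation and should not be read as GeneiASE's actual definitions. As an assumption, the per-SNV statistic is an absolute z-score under a 50/50 null (normal approximation to the binomial), and the gene statistic is the plain unweighted Stouffer combination, sum(s) / sqrt(k). The counts are those of the example gene table shown earlier.

```python
from math import sqrt

def snv_statistic(ref: int, alt: int) -> float:
    """Assumed stand-in for s_ij: |alt - ref| / sqrt(ref + alt), an undirected z-score."""
    return abs(alt - ref) / sqrt(ref + alt)

def stouffer(stats: list) -> float:
    """Unweighted Stouffer combination of k per-SNV statistics."""
    return sum(stats) / sqrt(len(stats))

# Example gene: (ref, alt) counts for SNV_1, SNV_2, SNV_3.
gene_counts = [(60, 40), (20, 80), (70, 30)]
s = [snv_statistic(ref, alt) for ref, alt in gene_counts]
g = stouffer(s)
print(s)            # [2.0, 6.0, 4.0]
print(round(g, 2))  # 6.93
```

Because the absolute value is taken, SNV_2 (alt-biased) and SNV_3 (ref-biased) reinforce each other instead of cancelling, which is what makes the statistic usable on unphased data. The price is that g no longer follows a standard normal distribution under the null, so its significance has to be assessed by resampling.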
GeneiASE – null model, and gene-wise p-value calculation
0. Estimate SNV null model parameters (DNA based estimate of the technical variability)
For each gene (gene i):
1. Sample allele counts from null SNV model (Random effect model)
2. Calculate k SNV test statistics (k = number of SNVs in gene i)
3. Calculate gene test statistic (Stouffer-Liptak)
4. Reiterate 1-3 N times (default: 10^5)
5. Calculate p-value for gene i
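The resampling loop can be sketched in Python as below. Several things here are assumptions made for illustration and are not GeneiASE's implementation: the null model is a plain binomial with p = 0.5 instead of the DNA-estimated model, the per-SNV statistic is an absolute z-score, the empirical p-value uses the common (hits + 1) / (N + 1) form, and N is kept small so the example runs quickly.

```python
import random
from math import sqrt

def gene_statistic(counts):
    """Unweighted Stouffer combination of absolute per-SNV z-scores (assumed form)."""
    s = [abs(alt - ref) / sqrt(ref + alt) for ref, alt in counts]
    return sum(s) / sqrt(len(s))

def resampled_p_value(counts, n_iter=2000, seed=1):
    """Empirical p-value of the gene statistic under a simple binomial null."""
    rng = random.Random(seed)
    observed = gene_statistic(counts)
    hits = 0
    for _ in range(n_iter):
        null_counts = []
        for ref, alt in counts:
            depth = ref + alt
            # Step 1: sample allele counts at the observed depth from the null model.
            null_alt = sum(rng.random() < 0.5 for _ in range(depth))
            null_counts.append((depth - null_alt, null_alt))
        # Steps 2-3: recompute the gene statistic on the null counts.
        if gene_statistic(null_counts) >= observed:
            hits += 1
    # Step 5: fraction of null statistics at least as large as the observed one.
    return (hits + 1) / (n_iter + 1)

p_imbalanced = resampled_p_value([(60, 40), (20, 80), (70, 30)])
p_balanced = resampled_p_value([(50, 50), (52, 48), (49, 51)])
print(p_imbalanced < 0.01, p_balanced > 0.5)  # True True
```

With N resamplings the smallest attainable p-value is 1 / (N + 1), which is one reason a large default such as 10^5 is useful before multiple-testing correction.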
Running GeneiASE – input
Static ASE:
• geneiase-cvtl-icvtl.test.input.tab
Condition-dependent ASE:
• geneiase-icd-iicd.test.input.tab
Running GeneiASE – output
One line per gene. Output columns:
• feat: Feature ID as specified in the input file (typically a gene identifier)
• n.vars: Number of variants within the gene
• mean.s: Mean of s across the variants within the gene
• median.s: Median of s across the variants within the gene
• sd.s: Standard deviation of s across the variants within the gene
• cv.s: Coefficient of variation of s across the variants within the gene
• liptak.s: Stouffer-Liptak combination of s (called g on previous slides)
• p.nom: Nominal p-value
• fdr: Benjamini-Hochberg corrected p-value
(Reminder: s is the effect size-based test statistic for each SNV in a gene).
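For readers unfamiliar with the fdr column, the standard Benjamini-Hochberg adjustment can be written in a few lines of Python. This is a generic textbook sketch, not code taken from GeneiASE, and the nominal p-values in the example are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented nominal p-values (p.nom) for four genes.
p_nom = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(p_nom)
print([round(q, 3) for q in fdr])  # [0.02, 0.04, 0.04, 0.02]
```

Genes would then be called significant where the adjusted value falls below the chosen cutoff, for example fdr < 0.05.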
Running GeneiASE – results
The number of genes with significant (fdr < 0.05) ASE as detected by GeneiASE from 16 RNA-seq samples (primary white blood cells).
(Figure: number of genes (with ≥2 SNVs) on the y-axis against million mapped high-quality reads on the x-axis.)