Scaling BAli-Phy to Large Datasets
June 16, 2016
Michael Nute

BAli-Phy: Brief Summary
• What is BAli-Phy? (Redelings & Suchard, 2005)
– Software from 2005 that takes unaligned sequences as input and co-estimates the alignment and the phylogeny in a way that accounts for indels.
– Output can be a multiple sequence alignment, a phylogeny, or both, and can give an estimate of uncertainty in each one.
• Why is it interesting?
– The statistical model is unique and detailed, so given enough time it might find a better optimum than other methods.
– Experimental evidence has shown that it gives more accurate multiple sequence alignments than more common methods (Liu, et al., 2012).
1 Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401–418.
2 Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561–4.

BAli-Phy: Brief Summary
• What is the disadvantage?
– Software cannot handle more than ≈200 taxa due to suspected numerical instability.
– Computation is very slow: most publications using it have run for several weeks.
  • (Gaya, et al., 2011) 68 sequences ran in 3 weeks
  • Largest dataset we have found is 117 sequences (McKenzie, et al., 2014)

BAli-Phy: Quick Look at Results (1 of 2)
[Figure: Alignment Error*. Two bar-chart panels, False Negative % (1 – SP-Score) and False Positive % (1 – Modeler Score), y-axis 0%–40%. Methods: MAFFT, PASTA, BAli-Phy. #Taxa: 100 and 200 for each simulator: Indelible (DNA), RNAsim (RNA).]
*Averages over 10 replicates

BAli-Phy: Quick Look at Results (2 of 2)
[Figure: Alignment Accuracy*. Bar chart of Total-Column Score, y-axis 0%–40%. Methods: MAFFT, PASTA, BAli-Phy. #Taxa: 100 and 200 for each simulator: Indelible (DNA), RNAsim (RNA).]
*Averages over 10 replicates
Total-Column Score: Percentage of columns from the reference alignment that are fully reproduced by the estimated alignment.

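The three quantities on the two results slides (False Negative % = 1 – SP-Score, False Positive % = 1 – Modeler Score, and Total-Column Score) can be computed from a reference and an estimated alignment. The following is a minimal sketch, not the scoring tool used for the slides. It assumes alignments are lists of equal-length strings with "-" as the gap character, and it assumes that only reference columns with at least two residues are counted for the Total-Column Score. The function names are hypothetical.

```python
from itertools import combinations


def columns(alignment):
    """Return each column as a frozenset of (row, residue_index) pairs, skipping gaps."""
    counters = [0] * len(alignment)  # next residue index for each row
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] != "-":
                col.append((i, counters[i]))
                counters[i] += 1
        cols.append(frozenset(col))
    return cols


def homologies(cols):
    """Return the set of residue pairs that share a column."""
    pairs = set()
    for col in cols:
        pairs.update(combinations(sorted(col), 2))
    return pairs


def alignment_scores(reference, estimated):
    """Return FN rate (1 - SP-Score), FP rate (1 - Modeler Score) and TC score."""
    ref_cols, est_cols = columns(reference), columns(estimated)
    ref_pairs, est_pairs = homologies(ref_cols), homologies(est_cols)
    shared = len(ref_pairs & est_pairs)
    sp_score = shared / len(ref_pairs)       # fraction of true pairs recovered
    modeler_score = shared / len(est_pairs)  # fraction of predicted pairs that are true
    # Assumption: only reference columns with two or more residues are scored.
    scored = [c for c in ref_cols if len(c) >= 2]
    est_set = set(est_cols)
    tc_score = sum(c in est_set for c in scored) / len(scored)
    return {"FN": 1 - sp_score, "FP": 1 - modeler_score, "TC": tc_score}


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    estimated = ["ACG-", "ACTG"]
    print(alignment_scores(reference, estimated))
```

In the toy example the estimated alignment misplaces one residue, so two of the three reference pairs and two of the three scored reference columns are recovered.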
SATé and PASTA Algorithms
1) Obtain initial alignment and estimated ML tree
2) Use tree to compute new alignment
3) Estimate ML tree on new alignment
Repeat steps 2 and 3 until termination condition, and return the alignment/tree pair with the best ML score

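The iterate-and-keep-the-best loop shared by SATé and PASTA can be sketched as below. This is an illustration under stated assumptions, not the actual SATé or PASTA code: `initial_align`, `realign`, and `estimate_tree` are hypothetical callables standing in for external tools, `estimate_tree` is assumed to return a (tree, ML score) pair, and a fixed iteration count stands in for the termination condition.

```python
def iterate_alignment_and_tree(sequences, initial_align, realign, estimate_tree,
                               max_iterations=3):
    """Alternate between re-alignment and tree estimation, keeping the best pair.

    All three callables are hypothetical placeholders:
      initial_align(sequences)  -> alignment
      realign(sequences, tree)  -> alignment
      estimate_tree(alignment)  -> (tree, ml_score), higher score is better
    """
    alignment = initial_align(sequences)
    tree, score = estimate_tree(alignment)
    best = (score, alignment, tree)
    # Assumption: the termination condition is a fixed number of iterations.
    for _ in range(max_iterations):
        alignment = realign(sequences, tree)
        tree, score = estimate_tree(alignment)
        if score > best[0]:
            best = (score, alignment, tree)
    return best
```

The returned tuple is (ML score, alignment, tree) for the best-scoring iteration, which need not be the last one.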
PASTA Algorithm
Input: unaligned sequences
1) Get initial alignment
2) Estimate tree on current alignment
3) Break into subsets according to tree
4) Use external aligner to align subsets
5) Use external profile aligner to merge subset alignments
6) Use transitivity to merge subset pairs into a full alignment, scrap the old tree
(repeat from step 2)

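Step 3 ("break into subsets according to tree") can be illustrated with a small tree decomposition. The slide does not state the splitting rule, so this sketch assumes a centroid-edge decomposition: repeatedly cut the edge that splits the taxa most evenly until every subset is no larger than the maximum subset size. The tree representation (a list of undirected edges between node names) and the function names are hypothetical, and the quadratic-time search is chosen for clarity rather than speed.

```python
from collections import defaultdict


def side_nodes(adj, start, blocked):
    """Nodes reachable from `start` once the edge start-blocked is cut."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt in seen or (node == start and nxt == blocked):
                continue
            seen.add(nxt)
            stack.append(nxt)
    return seen


def decompose(edges, taxa, max_size):
    """Split the taxa of a tree into subsets of at most `max_size` (>= 1) taxa."""
    taxa = set(taxa)
    if len(taxa) <= max_size:
        return [sorted(taxa)]
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = None
    for u, v in edges:
        u_side = side_nodes(adj, u, v)
        # Imbalance is |taxa on u side - taxa on v side|; the centroid edge minimizes it.
        imbalance = abs(len(taxa) - 2 * len(taxa & u_side))
        if best is None or imbalance < best[0]:
            best = (imbalance, u_side)
    u_side = best[1]
    edges_u = [(a, b) for a, b in edges if a in u_side and b in u_side]
    edges_v = [(a, b) for a, b in edges if a not in u_side and b not in u_side]
    return (decompose(edges_u, taxa & u_side, max_size)
            + decompose(edges_v, taxa - u_side, max_size))


if __name__ == "__main__":
    # Eight taxa A..H arranged as four cherries joined by internal nodes.
    tree = [("A", "x1"), ("B", "x1"), ("C", "x2"), ("D", "x2"),
            ("E", "x3"), ("F", "x3"), ("G", "x4"), ("H", "x4"),
            ("x1", "y1"), ("x2", "y1"), ("x3", "y2"), ("x4", "y2"),
            ("y1", "y2")]
    print(decompose(tree, "ABCDEFGH", max_size=4))
```

Because subsets are cut out of the tree, each one contains taxa that are close together in the current tree, which is what lets a slower, more accurate aligner work on small and closely related groups.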
Divide and Conquer with BAli-Phy
QUESTION: Since we saw that BAli-Phy gives better alignments on small numbers of taxa, could we get better alignment on large datasets if we used BAli-Phy on subsets?
ANSWER: Yes, but it takes a lot of computing resources.
To get the most out of PASTA+BAli-Phy, start with the tree from the LAST iteration of default PASTA.

Methods to Compare
PASTA:
– Iterations 1-3: MAFFT (Subset Size 200)
– (all default settings)
PASTA+BAli-Phy
– Iterations 1-3: MAFFT (Subset Size 200)
– Iteration 4: BAli-Phy (Subset Size 100)
• Takes advantage of faster MAFFT on early iterations where subsets are more diverse.
PASTA+MAFFT
– Iterations 1-3: MAFFT (Subset Size 200)
– Iteration 4: MAFFT (Subset Size 100)
• Helpful to identify whether any gain is from the extra iteration or because BAli-Phy was used.
MAFFT L-INS-i
– Default MAFFT (v7.273) using the mafft-linsi command
• Useful comparison to benchmark difficulty of alignment. MAFFT is popular and L-INS-i is the most accurate version.

Error Reduction (1000 Sequences)
[Figure: two bar-chart panels, False Negative % (i.e. 1 – SP-Score) and False Positive % (i.e. 1 – Modeler Score), y-axis 0%–30%, by dataset: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1. Methods: PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), MAFFT L-INS-i. Smaller is better.]

Accuracy Gain (1000 Sequences)
[Figure: bar chart of Total Column Score, y-axis 0%–35%, by dataset: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1. Methods: PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), MAFFT L-INS-i.]

Tree Error Relative to ML (Reference Alignment)
[Figure: bar charts of Delta-RF (RAxML) in two panels, y-axis ranges 0%–6% and 0%–30%, by dataset: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1. Methods: PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration), MAFFT L-INS-i.]

Accuracy Gain (1000 Sequences, Detail)
[Figure: scatter plots comparing PASTA with PASTA+BAli-Phy, points colored by dataset (Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1), with regions labeled "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Tree Error: Delta RF (RAxML); Tree Error: Delta RF (FastTree-2); False Negative %; False Positive %; Total Column Score.]

Scaling to 10,000 Sequences
• We can use UPP (Nguyen, et al., 2015) to extend an alignment to larger numbers of sequences:
– Take a random "backbone" subset (i.e. 1,000 sequences from previous slides)
– Align the backbone
– Align all remaining sequences to the backbone via HMMs

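The three UPP-style steps above can be sketched as a small driver. This is only an outline of the data flow, not UPP's implementation: `align_backbone` and `add_queries` are hypothetical callables standing in for the backbone aligner (for example PASTA+BAli-Phy) and the HMM-based step, and the fixed random seed is an assumption made for reproducibility.

```python
import random


def split_backbone(sequence_ids, backbone_size, seed=0):
    """Randomly choose backbone sequences; everything else becomes a query."""
    rng = random.Random(seed)
    backbone = set(rng.sample(sorted(sequence_ids), backbone_size))
    queries = [s for s in sequence_ids if s not in backbone]
    return sorted(backbone), queries


def extend_alignment(sequences, backbone_size, align_backbone, add_queries, seed=0):
    """Align a random backbone, then add the remaining sequences to it.

    `sequences` maps sequence id -> unaligned sequence. Both callables are
    hypothetical placeholders:
      align_backbone(backbone_seqs)            -> backbone alignment
      add_queries(backbone_alignment, queries) -> full alignment
    """
    backbone_ids, query_ids = split_backbone(list(sequences), backbone_size, seed)
    backbone_alignment = align_backbone({i: sequences[i] for i in backbone_ids})
    return add_queries(backbone_alignment, {i: sequences[i] for i in query_ids})
```

With 10,000 input sequences and a backbone size of 1,000, the expensive aligner only ever sees 1,000 sequences; the other 9,000 are added afterwards.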
Scaling to 10,000 Sequences
• Accuracy of full alignment tends to track the accuracy of the backbone:

Data          Backbone          FP%     FN%     TC     Δ-RF
Indelible M2  PASTA             3.8%    6.4%    2.6%   0.77%
Indelible M2  PASTA+BAli-Phy    2.2%    4.4%    4.3%   0.54%
Indelible M2  PASTA+MAFFT       2.7%    5.0%    3.2%   0.62%
RNAsim        PASTA             9.2%    9.5%    0.5%   0.77%
RNAsim        PASTA+BAli-Phy    8.6%    9.0%    0.6%   0.67%
RNAsim        PASTA+MAFFT      10.6%   10.9%    0.5%   0.67%

Scaling to 10,000 Sequences
[Figure: scatter plots comparing PASTA with PASTA+BAli-Phy, points colored by dataset (Indelible M2, RNAsim), with regions labeled "PASTA+BAli-Phy Better" and "PASTA Better". Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF.]

Hypothetical Running Time Comparison
Data: 1000 Taxa
Goal: Run PASTA (1 iteration, maximum subset size 100)
Resource: 1 Server, 32 Cores
Question: How long does each subset-alignment method take?
Answer:
• With 1,000 taxa and subsets no larger than 100, we'll have approximately 15 subsets (between 10 and 16).
• MAFFT:
– Takes 1 core approx. 10-20 minutes to do 1 subset. Can do all subsets in parallel.
– Total: 10-20 minutes
• BAli-Phy:
– Takes 24 hours for all 32 cores to do 1 subset. Can't run subsets in parallel since we only have 32 cores.
– Total: 15 Days
These calculations are hypothetical but representative.
If we have multiple servers, we can run BAli-Phy on the subsets in parallel in less time…
Still, if we want to run BAli-Phy it makes the most sense to run several iterations with MAFFT first.

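The arithmetic behind these totals can be written out as a short calculation. It is a sketch of the slide's hypothetical numbers only: it assumes every subset takes the same time and that subsets are processed in rounds, with as many running at once as the available cores allow. The function name and parameters are made up for illustration.

```python
import math


def wall_clock_hours(n_subsets, hours_per_subset, cores_per_subset, total_cores):
    """Wall-clock time when subsets run in rounds limited by the core count."""
    concurrent = max(1, total_cores // cores_per_subset)  # subsets per round
    rounds = math.ceil(n_subsets / concurrent)
    return rounds * hours_per_subset


if __name__ == "__main__":
    # MAFFT: 1 core per subset, up to 20 minutes each, 32 cores available.
    mafft = wall_clock_hours(15, 20 / 60, 1, 32)
    # BAli-Phy: all 32 cores for 24 hours per subset, one server.
    baliphy_one_server = wall_clock_hours(15, 24, 32, 32)
    # BAli-Phy with 15 such servers (480 cores), one subset per server.
    baliphy_15_servers = wall_clock_hours(15, 24, 32, 32 * 15)
    print(f"MAFFT: {mafft * 60:.0f} minutes")
    print(f"BAli-Phy, 1 server: {baliphy_one_server / 24:.0f} days")
    print(f"BAli-Phy, 15 servers: {baliphy_15_servers / 24:.0f} day")
```

On one server the 15 BAli-Phy subsets run one after another (15 × 24 hours = 360 hours = 15 days), while with one server per subset the total drops to the 24 hours needed for a single subset.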
Summary
• BAli-Phy provides more accurate alignments than MAFFT on small data, which can translate to more accurate alignments on up to 1,000 taxa by boosting with PASTA.
• Although the running time is longer, PASTA's divide-and-conquer approach allows BAli-Phy to be scaled in a parallel way, so with enough servers we can align 1,000 sequences in the time it takes to align 100.
• Alignment accuracy translates to improved tree accuracy on this data.
• Alignments can be further extended to 10,000 sequences using UPP.
Code at: http://github.com/mgnute/pasta

Acknowledgements
Special Thanks
• This is a joint work with my advisor, Prof. Tandy Warnow. Special thanks to Nam Nguyen for early collaboration on this project and for teaching me the code base for PASTA and UPP, and to Erin Molloy for several helpful discussions and suggestions over the course of this research.
NSF
• This work was funded by NSF grant III:AF:1513629
Blue Waters
• This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

References
[1] Redelings, B. D., & Suchard, M. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401–418.
[2] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561–4.
[3] Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066.
[4] Mirarab, S., Nguyen, N., Guo, S., Wang, L.-S., Kim, J., & Warnow, T. (2015). PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. Journal of Computational Biology, 22(5), 377–386.
[5] McKenzie, S. K., Oxley, P. R., & Kronauer, D. J. C. (2014). Comparative genomics and transcriptomics in ants provide new insights into the evolution and function of odorant binding and chemosensory proteins. BMC Genomics, 15(1), 718.
[6] Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J., Stamatakis, A. P., & Linder, C. R. (2012). SATe-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Systematic Biology, 61(1), 90–106.
[7] Nguyen, N. D., Mirarab, S., Kumar, K., & Warnow, T. (2015). Ultra-large alignments using phylogeny-aware profiles. Genome Biology, 16(1), 124.