Scaling BAli-Phy to Large Datasets
Michael Nute
June 16, 2016
Source: tandy.cs.illinois.edu/nute-symposium.pdf




BAli-Phy: Brief Summary

• What is BAli-Phy? (Redelings & Suchard, 2005)
  – Software from 2005 that takes unaligned sequences as input and co-estimates the alignment and the phylogeny in a way that accounts for indels.
  – Output can be a multiple sequence alignment, a phylogeny, or both, and can give an estimate of uncertainty in each one.
• Why is it interesting?
  – The statistical model is unique and detailed, so given enough time it might find a better optimum than other methods.
  – Experimental evidence has shown that it gives more accurate multiple sequence alignments than more common methods (Liu, et al., 2012).

1 Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401–418.
2 Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561–4.


BAli-Phy: Brief Summary

• What is the disadvantage?
  – Software cannot handle more than ≈200 taxa due to suspected numerical instability.
  – Computation is very slow: most publications using it have run for several weeks.
    • (Gaya, et al., 2011): 68 sequences ran in 3 weeks
    • Largest dataset we have found is 117 sequences (McKenzie, et al., 2014)


BAli-Phy: Quick Look at Results (1 of 2)

[Figure: Alignment Error for MAFFT, PASTA and BAli-Phy on datasets with 100 and 200 taxa from two simulators, Indelible (DNA) and RNAsim (RNA). Panels: False Negative % (1 – SP-Score) and False Positive % (1 – Modeler Score). Values are averages over 10 replicates.]
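The two error measures named on this slide can be sketched in Python. This is a minimal sketch under assumptions that are not in the slides: alignments are dicts mapping a sequence name to an equal-length gapped string with "-" as the gap character, SP-Score is taken as the fraction of reference residue pairs recovered by the estimated alignment, and Modeler Score as the fraction of estimated residue pairs that also occur in the reference. The function names are hypothetical.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    A residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    counters = dict.fromkeys(names, 0)
    pairs = set()
    for col in range(len(alignment[names[0]])):
        residues = []
        for name in names:
            if alignment[name][col] != gap:
                residues.append((name, counters[name]))
                counters[name] += 1
        # Names are visited in sorted order, so each pair has a canonical orientation.
        pairs.update(combinations(residues, 2))
    return pairs


def alignment_error(reference, estimated):
    """Return (false_negative_rate, false_positive_rate) as fractions in [0, 1]."""
    ref_pairs = homologous_pairs(reference)
    est_pairs = homologous_pairs(estimated)
    shared = len(ref_pairs & est_pairs)
    sp_score = shared / len(ref_pairs) if ref_pairs else 1.0
    modeler_score = shared / len(est_pairs) if est_pairs else 1.0
    return 1.0 - sp_score, 1.0 - modeler_score
```

Both rates are fractions; multiply by 100 to compare with the percentages on the slide.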


BAli-Phy: Quick Look at Results (2 of 2)

[Figure: Alignment Accuracy (Total-Column Score) for MAFFT, PASTA and BAli-Phy on datasets with 100 and 200 taxa from two simulators, Indelible (DNA) and RNAsim (RNA). Values are averages over 10 replicates.]

Total-Column Score: Percentage of columns from the reference alignment that are fully reproduced by the estimated alignment.
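The Total-Column Score definition can be turned into a short Python sketch. Assumptions not stated on the slide: alignments are dicts of equal-length gapped strings with "-" as the gap character, a reference column counts as reproduced when some estimated column contains exactly the same residues, and every non-empty reference column is scored (some tools skip columns with a single residue). The function names are hypothetical.

```python
def column_signatures(alignment, gap="-"):
    """Describe each non-empty column as a frozenset of (name, residue index)."""
    names = sorted(alignment)
    counters = dict.fromkeys(names, 0)
    columns = []
    for col in range(len(alignment[names[0]])):
        members = []
        for name in names:
            if alignment[name][col] != gap:
                members.append((name, counters[name]))
                counters[name] += 1
        if members:
            columns.append(frozenset(members))
    return columns


def total_column_score(reference, estimated):
    """Fraction of reference columns found intact in the estimated alignment."""
    ref_columns = column_signatures(reference)
    est_columns = set(column_signatures(estimated))
    if not ref_columns:
        return 0.0
    reproduced = sum(1 for column in ref_columns if column in est_columns)
    return reproduced / len(ref_columns)
```

The score is returned as a fraction; the slides report it as a percentage.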


SATé and PASTA Algorithms

[Diagram: a cycle between a Tree and an Alignment]

• Obtain initial alignment and estimated ML tree.
• Use tree to compute new alignment.
• Estimate ML tree on new alignment.
• Repeat until termination condition, and return the alignment/tree pair with the best ML score.
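The iteration in this diagram can be sketched as a small driver loop. This is only a sketch: the three callables (`initial_align`, `estimate_tree`, `realign`) are hypothetical stand-ins for external tools, and a fixed iteration count is assumed as the termination condition because the slide does not specify one.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_tree,
                               realign, max_iterations=3):
    """Alternate between re-alignment and tree estimation.

    Hypothetical callables supplied by the caller:
      initial_align(sequences)  -> alignment
      estimate_tree(alignment)  -> (tree, ml_score)
      realign(sequences, tree)  -> alignment
    Returns the (alignment, tree, ml_score) triple with the best ML score.
    """
    alignment = initial_align(sequences)
    tree, score = estimate_tree(alignment)
    best = (score, alignment, tree)
    for _ in range(max_iterations):
        # Use the current tree to compute a new alignment ...
        alignment = realign(sequences, tree)
        # ... then estimate a new ML tree on that alignment.
        tree, score = estimate_tree(alignment)
        if score > best[0]:
            best = (score, alignment, tree)
    return best[1], best[2], best[0]
```

Keeping the best-scoring pair, rather than the last one, matches the "return the alignment/tree pair with the best ML score" step.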


PASTA Algorithm

Input: unaligned sequences
1) Get initial alignment
2) Estimate tree on current alignment
3) Break into subsets according to tree
4) Use external aligner to align subsets
5) Use external profile aligner to merge subset alignments
6) Use transitivity to merge subset pairs into a full alignment, scrap the old tree
(repeat from step 2)
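Step 3 can be illustrated with a centroid-edge decomposition, which repeatedly removes the edge that splits the leaves most evenly until every piece is small enough. The slide only says "according to tree", so the splitting rule is an assumption here, as are the adjacency-dict tree representation and all function names.

```python
def adjacency_from_edges(edges):
    """Build an undirected adjacency dict from (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaf_counts_below(adj, leaves, root):
    """Root the tree at `root`; return (parent map, leaves-below-node map)."""
    parent = {root: None}
    order = [root]
    for node in order:  # the list grows while iterating: breadth-first order
        for nb in adj[node]:
            if nb != parent[node]:
                parent[nb] = node
                order.append(nb)
    below = {}
    for node in reversed(order):  # children are handled before their parents
        below[node] = (1 if node in leaves else 0) + sum(
            below[nb] for nb in adj[node] if nb != parent[node])
    return parent, below


def decompose(adj, leaves, max_size):
    """Split the leaf set into subsets of at most `max_size` leaves."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    leaves = set(leaves)
    if len(leaves) <= max_size:
        return [leaves]
    root = next(iter(adj))
    parent, below = leaf_counts_below(adj, leaves, root)
    total = below[root]
    # The edge (parent[v], v) separates below[v] leaves from the rest.
    cut = min((v for v in adj if parent[v] is not None),
              key=lambda v: abs(total - 2 * below[v]))
    side = {cut}
    stack = [cut]
    while stack:  # collect every node on the `cut` side of the removed edge
        node = stack.pop()
        for nb in adj[node]:
            if nb != parent[node] and nb not in side:
                side.add(nb)
                stack.append(nb)
    pieces = []
    for nodes in (side, set(adj) - side):
        sub_adj = {n: [nb for nb in adj[n] if nb in nodes]
                   for n in adj if n in nodes}
        pieces.extend(decompose(sub_adj, leaves & nodes, max_size))
    return pieces
```

Because each cut follows a tree edge, the sequences in one subset are close relatives in the current tree, which is what makes the subsets easier for the external aligner in step 4.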


Divide and Conquer with BAli-Phy

QUESTION: Since we saw that BAli-Phy gives better alignments on small numbers of taxa, could we get better alignment on large datasets if we used BAli-Phy on subsets?
ANSWER: Yes, but it takes a lot of computing resources.

To get the most out of PASTA+BAli-Phy, start with the tree from the LAST iteration of default PASTA.


Methods to Compare

PASTA:
  – Iterations 1-3: MAFFT (Subset Size 200)
  – (all default settings)
  • All default settings…

PASTA+BAli-Phy
  – Iterations 1-3: MAFFT (Subset Size 200)
  – Iteration 4: BAli-Phy (Subset Size 100)
  • Takes advantage of faster MAFFT on early iterations where subsets are more diverse.

PASTA+MAFFT
  – Iterations 1-3: MAFFT (Subset Size 200)
  – Iteration 4: MAFFT (Subset Size 100)
  • Helpful to identify whether any gain is from the extra iteration or because BAli-Phy was used.

MAFFT L-INS-i
  – Default MAFFT (v7.273) using the mafft-linsi command
  • Useful comparison to benchmark difficulty of alignment. MAFFT is popular and L-INS-i is the most accurate version.


Error Reduction (1000 Sequences)

[Figure: False Negative % (i.e. 1 – SP-Score) and False Positive % (i.e. 1 – Modeler Score) on the Indelible M2, RNAsim, Rose L1, Rose M1 and Rose S1 datasets for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration) and MAFFT L-INS-i. Smaller is better.]


Accuracy Gain (1000 Sequences)

[Figure: Total Column Score on the Indelible M2, RNAsim, Rose L1, Rose M1 and Rose S1 datasets for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration) and MAFFT L-INS-i.]


Tree Error Relative to ML (Reference Alignment)

[Figure: Delta-RF (RAxML) on the Indelible M2, RNAsim, Rose L1, Rose M1 and Rose S1 datasets for PASTA, PASTA+BAli-Phy, PASTA (Extra Iteration) and MAFFT L-INS-i.]


Accuracy Gain (1000 Sequences, Detail)

[Figure: plots of PASTA against PASTA+BAli-Phy, with points grouped by dataset (Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1). Panels: Tree Error: Delta RF (RAxML); Tree Error: Delta RF (FastTree-2); False Negative %; False Positive %; Total Column Score. Each panel marks a "PASTA+BAli-Phy Better" region and a "PASTA Better" region.]


Scaling to 10,000 Sequences

• We can use UPP (Nguyen, et al., 2015) to extend an alignment to larger numbers of sequences:
  – Take a random "backbone" subset (i.e. 1,000 sequences from previous slides)
  – Align the backbone
  – Align all remaining sequences to the backbone via HMMs
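The three steps above can be outlined as a small driver. This is a sketch of the workflow only, not UPP's implementation: `align_backbone` and `add_to_backbone` are hypothetical callables standing in for the backbone aligner and the HMM-based insertion step, and the seeded random sampling is an assumption made for reproducibility.

```python
import random


def choose_backbone(names, backbone_size, seed=0):
    """Randomly split sequence names into a backbone and the remainder."""
    names = list(names)
    rng = random.Random(seed)
    size = min(backbone_size, len(names))
    backbone = set(rng.sample(sorted(names), size))
    remaining = [name for name in names if name not in backbone]
    return sorted(backbone), remaining


def extend_alignment(sequences, backbone_size, align_backbone,
                     add_to_backbone, seed=0):
    """Align a random backbone, then add every other sequence to it.

    Hypothetical callables supplied by the caller:
      align_backbone(dict of sequences)             -> backbone alignment
      add_to_backbone(alignment, dict of sequences) -> full alignment
    """
    backbone, remaining = choose_backbone(sequences, backbone_size, seed)
    backbone_alignment = align_backbone({n: sequences[n] for n in backbone})
    return add_to_backbone(backbone_alignment,
                           {n: sequences[n] for n in remaining})
```

In the setting of these slides, the backbone aligner would be one of the 1,000-sequence methods compared earlier.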


Scaling to 10,000 Sequences

• Accuracy of full alignment tends to track the accuracy of the backbone:

Data          Backbone          FP%     FN%     TC     Δ-RF
Indelible M2  PASTA             3.8%    6.4%    2.6%   0.77%
              PASTA+BAli-Phy    2.2%    4.4%    4.3%   0.54%
              PASTA+MAFFT       2.7%    5.0%    3.2%   0.62%
RNAsim        PASTA             9.2%    9.5%    0.5%   0.77%
              PASTA+BAli-Phy    8.6%    9.0%    0.6%   0.67%
              PASTA+MAFFT      10.6%   10.9%    0.5%   0.67%


Scaling to 10,000 Sequences

[Figure: plots of PASTA against PASTA+BAli-Phy for the Indelible M2 and RNAsim datasets. Panels: Total Column Score; Recall (SP-Score); Tree Error: Delta RF. Each panel marks a "PASTA+BAli-Phy Better" region and a "PASTA Better" region.]


Hypothetical Running Time Comparison

Data: 1000 Taxa
Goal: Run PASTA (1 iteration, maximum subset size 100)
Resource: 1 Server, 32 Cores
Question: How long does each subset-alignment method take?
Answer:
• With 1,000 taxa and subsets no larger than 100, we'll have approximately 15 subsets (between 10 and 16).
• MAFFT:
  – Takes 1 core approx. 10-20 minutes to do 1 subset. Can do all subsets in parallel.
  – Total: 10-20 minutes
• BAli-Phy:
  – Takes 24 hours for all 32 cores to do 1 subset. Can't run in parallel since we only have 32 cores.
  – Total: 15 Days

These calculations are hypothetical but representative.

If we have multiple servers, we can run BAli-Phy in parallel in less time…

Still, if we want to run BAli-Phy it makes the most sense to run several iterations with MAFFT first.
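The arithmetic behind these totals can be written out as a short calculation. The sketch assumes subsets are processed in equal-length "waves", that a job cannot span servers, and that all servers have 32 cores; the function name and parameters are hypothetical, and the inputs are the hypothetical figures from the slide.

```python
import math


def wall_clock_hours(n_subsets, hours_per_subset, cores_per_subset,
                     cores_per_server=32, n_servers=1):
    """Wall-clock hours to align all subsets, running as many at once as fit."""
    slots_per_server = cores_per_server // cores_per_subset
    if slots_per_server < 1:
        raise ValueError("one subset job does not fit on a single server")
    concurrent_jobs = slots_per_server * n_servers
    waves = math.ceil(n_subsets / concurrent_jobs)
    return waves * hours_per_subset


# MAFFT: 15 subsets, up to 20 minutes each on 1 core, all at once on 32 cores.
mafft_hours = wall_clock_hours(15, 20 / 60, cores_per_subset=1)

# BAli-Phy: 15 subsets, 24 hours each using all 32 cores, one after another.
baliphy_days = wall_clock_hours(15, 24, cores_per_subset=32) / 24

# BAli-Phy with 15 servers: every subset gets its own server.
baliphy_days_parallel = wall_clock_hours(15, 24, cores_per_subset=32,
                                         n_servers=15) / 24
```

Under these assumptions the single-server BAli-Phy total is 15 days, and it drops to 1 day when each of the 15 subsets has its own server.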


Summary

• BAli-Phy provides more accurate alignments than MAFFT on small data, which can translate to more accurate alignments on up to 1,000 taxa by boosting with PASTA.
• Although the running time is longer, this allows BAli-Phy to be scaled in a parallel way, so we can align 1,000 sequences in the time it takes to align 100.
• Alignment accuracy translates to improved tree accuracy on this data.
• Alignments can be further extended to 10,000 sequences using UPP.

Code at: http://github.com/mgnute/pasta


Acknowledgements

Special Thanks
• This is a joint work with my advisor, Prof. Tandy Warnow. Special thanks to Nam Nguyen for early collaboration on this project and for teaching me the codebase for PASTA and UPP, and to Erin Molloy for several helpful discussions and suggestions over the course of this research.

NSF
• This work was funded by NSF grant III:AF:1513629.

Blue Waters
• This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.


References

[1] Redelings, B. D., & Suchard, M. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401–418.
[2] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561–4.
[3] Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066.
[4] Mirarab, S., Nguyen, N., Guo, S., Wang, L.-S., Kim, J., & Warnow, T. (2015). PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. Journal of Computational Biology, 22(5), 377–386.
[5] McKenzie, S. K., Oxley, P. R., & Kronauer, D. J. C. (2014). Comparative genomics and transcriptomics in ants provide new insights into the evolution and function of odorant binding and chemosensory proteins. BMC Genomics, 15(1), 718.
[6] Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J., Stamatakis, A. P., & Linder, C. R. (2012). SATé-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Systematic Biology, 61(1), 90–106.
[7] Nguyen, N. D., Mirarab, S., Kumar, K., & Warnow, T. (2015). Ultra-large alignments using phylogeny-aware profiles. Genome Biology, 16(1), 124.
