using ensembles of hidden markov models for grand challenges...

50
Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

Upload: others

Post on 19-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

UsingEnsemblesofHiddenMarkovModelsforGrandChallengesinBioinformatics

TandyWarnowFounderProfessorofEngineering

TheUniversityofIllinoisatUrbana-Champaignhttp://tandy.cs.illinois.edu

Page 2: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

From the Tree of the Life Website, University of Arizona

Phylogenomics

Page 3: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD

Challenge: Alignments and trees on > 100,000 sequences

Plus many many other people…

Page 4: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1000-taxonmodels,orderedbydifficulty(Liuetal.,Science324(5934):1561-1564,2009)

Page 5: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Re-aligning on a tree A

B D

C

Merge sub-alignments

Estimate ML tree on merged

alignment

Decompose dataset

A B

C D

Align subsets

A B

C D

ABCD

Page 6: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

SATéandPASTAAlgorithms

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Repeatuntilterminationcondition,and

returnthealignment/treepairwiththebestMLscore

Page 7: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1000taxonmodels,orderedbydifficulty,Liuetal.,Science324(5934):1561-1564,2009

24hourSATé-Ianalysis,ondesktopmachines

(Similarimprovementsforbiologicaldatasets)

Page 8: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1000-taxonmodelsrankedbydifficulty

SATé-2betterthanSATé-1

SATé-1(Liuetal.,Science2009):cananalyzeupto8KsequencesSATé-2(Liuetal.,SystematicBiology2012):cananalyzeupto~50Ksequences

Page 9: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

RNASim

0.00

0.05

0.10

0.15

0.20

10000 50000 100000 200000

Tree

Erro

r (FN

Rat

e) Clustal−OmegaMuscleMafftStarting TreeSATe2PASTAReference Alignment

PASTA:Mirarab,Nguyen,andWarnow,JComp.Biol.2015–  SimulatedRNASimdatasetsfrom10Kto200Ktaxa–  Limitedto24hoursusing12CPUs–  Notallmethodscouldrun(missingbarscouldnotfinish)

PASTA:evenbetterthanSATé-2

Page 10: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Decompositionto100-sequencesubsets,oneiterationofPASTA+BAli-Phy

ComparingdefaultPASTAtoPASTA+BAli-Phyonsimulateddatasetswith1000sequences

PASTA+BAli−Phy Better

PAS

TA B

etter

0.0

0.1

0.2

0.3

0.4

0.0 0.1 0.2 0.3 0.4PASTA

PAS

TA+B

Ali−

Phy

Total Column Score

dataIndelible M2

RNAsim

Rose L1

Rose M1

Rose S1

PASTA+BAli−Phy Better

PAS

TA B

etter

0.6

0.7

0.8

0.9

1.0

0.6 0.7 0.8 0.9 1.0PASTA

PAS

TA+B

Ali−

Phy

Recall (SP−Score)

PASTA+BAli−Phy Better

PAS

TA B

etter

0.00

0.05

0.10

0.15

0.00 0.05 0.10 0.15PASTA+BAli−Phy

PAS

TA

Tree Error: Delta RF (RAxML)

Page 11: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD

Challenge: Alignments and trees on > 100,000 sequences

Plus many many other people…

Page 12: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Length

Counts

0

2000

4000

6000

8000

10000

12000 Mean:317Median:266

0 500 1000 1500 2000

1KPdataset:morethan100,000p450amino-acidsequences,manyfragmentary

Page 13: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Length

Counts

0

2000

4000

6000

8000

10000

12000 Mean:317Median:266

0 500 1000 1500 2000

1KPdataset:morethan100,000p450amino-acidsequences,manyfragmentary

Allstandardmultiplesequencealignmentmethodswetestedperformedpoorlyondatasetswithfragments.

Page 14: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

ProfileHiddenMarkovModels•  Probabilisticmodeltorepresentafamilyofsequences,

representedbyamultiplesequencealignment

•  IntroducedforsequenceanalysisinKroghetal.1994.PopularizedinEddy1996andtextbookDurbinetal.1998

•  FundamentalpartofHMMERandotherproteindatabases

•  Usedfor:homologydetection,proteinfamilyassignment,multiplesequencealignment,phylogeneticplacement,proteinstructureprediction,alignmentsegmentation,etc.

Page 15: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

ProfileHMMsl  GenerativemodelforrepresentingaMSA

l  Consistsof:

l  Setofstates(Match,insertion,anddeletion)

l  Transitionprobabilities

l  Emissionprobabilities

Page 16: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

ProfileHiddenMarkovModelforDNAsequencealignment

Page 17: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

HMMsforMSAl  Givenseedalignment(e.g.,inPFAM)andacollectionof

sequencesfortheproteinfamily:

l  RepresentseedalignmentusingHMM

l  AligneachadditionalsequencetotheHMM

l  UsetransitivitytoobtainMSA

l  Canwedosomethinglikethiswithoutaseedalignment?

Page 18: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

HMMsforMSAl  Givenseedalignment(e.g.,inPFAM)andacollectionof

sequencesfortheproteinfamily:

l  RepresentseedalignmentusingHMM

l  AligneachadditionalsequencetotheHMM

l  UsetransitivitytoobtainMSA

l  Canwedosomethinglikethiswithoutaseedalignment?

Page 19: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

UPPUPP=“Ultra-largemultiplesequencealignmentusingPhylogeny-awareProfiles”Nguyen,Mirarab,andWarnow.GenomeBiology,2014.Purpose:highlyaccuratelarge-scalemultiplesequencealignments,eveninthepresenceoffragmentarysequences.

Page 20: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

UPPUPP=“Ultra-largemultiplesequencealignmentusingPhylogeny-awareProfiles”Nguyen,Mirarab,andWarnow.GenomeBiology,2014.Purpose:highlyaccuratelarge-scalemultiplesequencealignments,eveninthepresenceoffragmentarysequences.

UsesanensembleofHMMs

Page 21: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Simpleidea(notUPP)

•  Selectrandomsubsetofsequences,andbuild“backbonealignment”

•  ConstructaHiddenMarkovModel(HMM)onthebackbonealignment

•  AddallremainingsequencestothebackbonealignmentusingtheHMM

Page 22: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

RNASim

0.00

0.05

0.10

0.15

0.20

10000 50000 100000 200000

Tree

Erro

r (FN

Rat

e) Clustal−OmegaMuscleMafftStarting TreeSATe2PASTAReference Alignment

PASTA:Mirarab,Nguyen,andWarnow,JComp.Biol.2015–  SimulatedRNASimdatasetsfrom10Kto200Ktaxa–  Limitedto24hoursusing12CPUs–  Notallmethodscouldrun(missingbarscouldnotfinish)

PASTA:evenbetterthanSATé-2StartingtreeisbasedonUPP(simple):oneprofileHMMforBackbonealignment

Page 23: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

•  Selectrandomsubsetofsequences,andbuild“backbonealignment”

•  ConstructaHiddenMarkovModel(HMM)onthebackbonealignment

•  AddallremainingsequencestothebackbonealignmentusingtheHMM

Thisapproachworkswellifthedatasetissmallandhaslowevolutionaryrates,butisnotveryaccurateotherwise.

Page 24: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

One Hidden Markov Model for the entire alignment?

Page 25: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

OneHiddenMarkovModelforthebackbonealignment?

HMM1

Page 26: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Or2HMMs?

HMM1

HMM2

Page 27: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

HMM1

HMM3 HMM4

HMM2

Or4HMMs?

Page 28: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

m

HMM2

HMM3

HMM1

HMM4

HMM5 HMM6

HMM7

Orall7HMMs?

Page 29: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

UPPAlgorithmicApproach

1.  Selectrandomsubsetoffull-lengthsequences,andbuild“backbonealignment”

2.  Constructan“EnsembleofHiddenMarkovModels”onthebackbonealignment

3.  AddallremainingsequencestothebackbonealignmentusingtheEnsembleofHMMs

Page 30: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Evaluation•  Simulateddatasets(somehavefragmentarysequences):–  10Kto1,000,000sequencesinRNASim–complexRNAsequenceevolutionsimulation

–  1000-sequencenucleotidedatasetsfromSATépapers–  5000-sequenceAAdatasets(fromFastTreepaper)–  10,000-sequenceIndeliblenucleotidesimulation

•  Biologicaldatasets:–  Proteins:largestBaliBASEandHomFam–  RNA:3CRWdatasetsupto28,000sequences

Page 31: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

RNASimMillionSequences:treeerror

Using 12 processors: •  UPP(Fast,NoDecomp)

took 2.2 days,

•  UPP(Fast) took 11.9 days, and

•  PASTA took 10.3 days

Page 32: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

0.0

0.2

0.4

0.6

0 12.5 25 50% Fragmentary

Mea

n al

ignm

ent e

rror

PASTA UPP(Default)

(a) Average alignment error

0.0

0.2

0.4

0 12.5 25 50% Fragmentary

Del

ta F

N tr

ee e

rror

PASTA UPP(Default)

(b) Average tree error

Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2datasets.

80

Performanceonfragmentarydatasetsofthe1000M2modelcondition

UPPisveryrobusttofragmentarysequences

Underhighratesofevolution,PASTAisbadlyimpactedbyfragmentarysequences(thesameistrueforothermethods).UPPcontinuestohavegoodaccuracyevenondatasetswithmanyfragmentsunderallratesofevolution.

Page 33: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

0

5

10

15

50000 100000 150000 200000Number of sequences

Wal

l clo

ck a

lign

time

(hr)

● UPP(Fast)

UPPRunningTime

Wall-clocktimeused(inhours)given12processors

Page 34: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

OtherApplicationsoftheEnsembleofHMMs

SEPP(phylogeneticplacement,Mirarab,Nguyen,andWarnowPSB2014)

TIPP(metagenomictaxonidentification,Nguyen,Mirarab,Liu,Pop,andWarnow,Bioinformatics2014)

HIPPI(proteinclassificationandremotehomologydetection),RECOMB-CG2016andBMCGenomics2016(Nguyen,Nute,Mirarab,andWarnow)

Page 35: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent
Page 36: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Objective:Distributionofthespecies(orgenera,orfamilies,etc.)withinthesample.

Forexample:Thedistributionofthesampleatthespecies-levelis:

50% speciesA

20% speciesB

15% speciesC

14% speciesD

1% speciesE

AbundanceProfiling

Page 37: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

TIPPpipeline

Input:setofreadsfromashotgunsequencingexperimentofametagenomicsample

1.  AssignreadstomarkergenesusingBLAST2.  Forreadsassignedtomarkergenes,perform

taxonomicanalysis3.  CombineanalysesfromStep2

Page 38: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Highindeldatasetscontainingknowngenomes

Note:NBC,MetaPhlAn,andMetaPhylercannotclassifyanysequencesfromatleastoneofthehighindellongsequencedatasets,andmOTUterminateswithanerrormessageonallthehighindeldatasets.

Page 39: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

“Novel”genomedatasets

Note:mOTUterminateswithanerrormessageonthelongfragmentdatasetsandhighindeldatasets.

Page 40: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

ProteinFamilyAssignment

•  Input:newAAsequence(mightbefragmentary)anddatabaseofproteinfamilies(e.g.,PFAM)

•  Output:assignment(ifjustified)ofthesequencetoanexistingfamilyinthedatabase

Page 41: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

HIPPI

•  HIerarchicalProfileHMMsforProteinfamilyIdentification

•  Nguyen,Nute,Mirarab,andWarnow,RECOMB-CG2016andBMC-Genomics2016

•  UsesanensembleofHMMstoclassifyproteinsequences

•  TestedonHMMER

Page 42: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Pre

cis

ion

.97

.98

.98

.99

.75 .80 .85 .90 .95 1.00

Seq Length: Full

.96

.97

.98

.99

.50 .60 .70 .80 .90

Recall

Seq Length: 50%

.92

.95

.98

.20 .40 .60 .80

Seq Length: 25%

Method HIPPI HMMER BLAST HHsearch: 1 Iteration HHsearch: 2 Iterations

Page 43: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

FourProblems

•  PhylogeneticPlacement(SEPP,PSB2012)

•  Multiplesequencealignment(UPP,RECOMB2014andGenomeBiology2014)

•  Metagenomictaxonidentification(TIPP,Bioinformatics2014)

•  Genefamilyassignmentandhomologydetection(HIPPI,RECOMB-CG2016andBMCGenomics2016)

Aunifyingtechniqueisthe“EnsembleofHiddenMarkovModels”(introducedbyMirarabetal.,2012)

Page 44: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Summary•  UsinganensembleofHMMstendstoimproveaccuracy,foracostofrunning

time.Applicationssofartotaxonomicplacement(SEPP),multiplesequencealignment(UPP),proteinfamilyclassification(HIPPI).Improvementsaremostlynoticeableforlargediversedatasets.

•  Phylogenetically-basedconstructionoftheensemblehelpsaccuracy(note:thedecompositionsweproducearenotclade-based),butthedesignanduseoftheseensemblesisstillinitsinfancy.(Manyrelativelysimilarapproacheshavebeenusedbyothers,includingSci-PhyandFlowerPowerbySjolander.)

•  Thebasicideacanbeusedwithanykindofprobabilisticmodel,doesn’thavetoberestrictedtoprofileHMMs.

Page 45: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Basicquestion:whydoesithelp?

•  UsinganensembleofHMMstendstoimproveaccuracy,foracostofrunningtime.Applicationssofartotaxonomicplacement(SEPP),multiplesequencealignment(UPP),proteinfamilyclassification(HIPPI).Improvementsaremostlynoticeableforlargediversedatasets.

•  Phylogenetically-basedconstructionoftheensemblehelpsaccuracy(note:thedecompositionsweproducearenotclade-based),butthedesignanduseoftheseensemblesisstillinitsinfancy.(Manyrelativelysimilarapproacheshavebeenusedbyothers,includingSci-PhyandFlowerPowerbySjolander.)

•  Thebasicideacanbeusedwithanykindofprobabilisticmodel,doesn’thavetoberestrictedtoprofileHMMs.

Page 46: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Scientific challenges: •  Ultra-large multiple-sequence alignment •  Alignment-free phylogeny estimation •  Supertree estimation •  Estimating species trees from many gene trees •  Genome rearrangement phylogeny •  Reticulate evolution •  Visualization of large trees and alignments •  Data mining techniques to explore multiple optima •  Theoretical guarantees under Markov models of evolution

Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics

The Tree of Life: Multiple Challenges

Page 47: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Acknowledgments

PASTAandUPP:NamNguyen(nowpostdocatUIUC)andSiavashMirarab(nowfacultyatUCSD),undergrad:KeerthanaKumar(atUT-Austin)PASTA+BAli-Phy:MikeNute(PhDstudentatUIUC)CurrentNSFgrants:ABI-1458652(multiplesequencealignment)GraingerFoundation(atUIUC),andUIUCTACC,UTCS,BlueWaters,andUIUCcampusclusterPASTA,UPP,SEPP,andTIPPareavailableongithubathttps://github.com/smirarab/;seealsoPASTA+BAli-Phyathttp://github.com/MGNute/pasta

Page 48: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

Nguyen et al. Genome Biology (2015) 16:124 Page 6 of 15

Table 2 Average alignment SP-error, tree error, and TC score across most full-length datasets

Method ROSE RNASim Indelible ROSE CRW 10 AA HomFam HomFam

NT 10K 10K AA (17) (2)

Average alignment SP-error

UPP 7.8 (1) 9.5 (1) 1.7 (2) 2.9 (1) 12.5 (1) 24.2 (1) 23.3 (1) 20.8 (2)

PASTA 7.8 (1) 15.0 (2) 0.4 (1) 3.1 (1) 12.8 (1) 24.0 (1) 22.5 (1) 17.3 (1)

MAFFT 20.6 (2) 25.5 (3) 41.4 (3) 4.9 (2) 28.3 (2) 23.5 (1) 25.3 (2) 20.7 (2)

Muscle 20.6 (2) 64.7 (5) 62.4 (4) 5.5 (3) 30.7 (3) 30.2 (2) 48.1 (4) X

Clustal 49.2 (3) 35.3 (4) X 6.5 (4) 43.3 (4) 24.3 (1) 27.7 (3) 29.4 (3)

Average !FN error

UPP 1.3 (1) 0.8 (1) 0.3 (1) 1.8 (1) 7.8 (2) 3.4 (2) NA NA

PASTA 1.3 (1) 0.4 (1) <0.1 (1) 1.3 (1) 5.1 (1) 3.3 (1) NA NA

MAFFT 5.8 (2) 3.5 (2) 24.8 (3) 4.5 (3) 10.1 (3) 2.3 (1) NA NA

Muscle 8.4 (3) 7.3 (3) 32.5 (4) 3.1 (2) 5.5 (1) 12.6 (3) NA NA

Clustal 24.3 (4) 10.4 (4) X 4.2 (3) 34.1 (4) 3.5 (2) NA NA

Average TC score

UPP 37.8 (1) 0.5 (2) 11.0 (3) 2.6 (2) 1.4 (1) 11.4 (1) 47.3 (1) 40.3 (3)

PASTA 37.8 (1) 2.3 (1) 48.0 (1) 5.4 (1) 2.3 (1) 12.1 (1) 46.1 (2) 50.0 (1)

MAFFT 31.4 (2) 0.4 (2) 7.8 (4) 0.6 (3) 0.7 (2) 12.1 (1) 45.5 (2) 46.9 (2)

Muscle 9.8 (3) <0.0 (2) 18.3 (2) 2.7 (2) 0.7 (2) 10.5 (2) 27.7 (4) X

Clustal 5.7 (4) 0.2 (2) X 3.1 (2) 0.1 (2) 11.8 (1) 38.6 (3) 31.0 (4)

We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average !FN error (middle), and average TC score (bottom), for the collection offull-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate within the time limit(24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and thetwo HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the !FN error is indicated by “NA”). The tier ranking for each methodis shown parenthetically

memory error message were marked as failures. Forexperiments on the million-sequence RNASim dataset,we ran the methods on a dedicated machine with 256GBof main memory and 12 cores until an alignment wasgenerated or the method failed. We also performed a lim-ited number of experiments on TACC with UPP’s internal

checkpointing mechanism, to explore performance whentime is not limited. All methods other than Muscle hadparallel implementations and were able to take advantageof the 12 available cores.On full-length datasets (Table 2) where nearly all meth-

ods were able to complete, PASTA was nearly always in

Table 3 Average alignment SP-error and tree error across fragmentary datasets

Method ROSE NT RNASim 10K Indelible 10K CRW

(16S.3 and 16S.T)

Average alignment SP-error

UPP 8.3 (1) 11.8 (1) 2.7 (1) 16.1 (1)

PASTA 25.2 (2) 47.7 (4) 8.8 (2) 23.3 (2)

MAFFT 32.5 (3) 25.5 (2) 51.3 (3) 24.5 (3)

Muscle 35.3 (4) 82.2 (5) 77.6 (4) 70.6 (5)

Clustal 62.0 (5) 35.0 (3) X 46.7 (4)

Average !FN error

UPP 1.9 (1) 3.1 (1) 2.5 (1) 7.4 (2)

PASTA 25.2 (3) 21.9 (3) 9.0 (2) 8.2 (2)

MAFFT 18.0 (2) 6.2 (2) 35.6 (3) 2.5 (1)

Muscle 27.5 (4) 43.6 (5) 45.2 (4) 30.1 (3)

Clustal 47.8 (5) 26.3 (4) X 37.4 (4)

We report the average alignment error (top) and average !FN error (bottom) on the collection of fragmentary datasets. Clustal-Omega failed to align any of the Indelible10000M2 fragmentary datasets and thus we mark the results with an X. The tier ranking for each method is shown in parentheses

Page 49: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0

.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0.75

.80

.85

.90

.95

1.0

.25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.80

.90

1.0

.00 .25 .50 .75 1.0

.70

.80

.90

1.0

.00 .20 .40 .60 .80

.70

.80

.90

1.0

.00 .20 .40 .60 .80

.70

.80

.90

1.0

.00 .20 .40 .60 .80

.70

.80

.90

1.0

.00 .20 .40 .60 .80

.70

.80

.90

1.0

.00 .20 .40 .60 .80

.70

.80

.90

1.0

.00 .20 .40 .60 .80

Pre

cisi

on

Avg. Pairwise Sequence Identiy

> 30% 20−30% < 20%

Size

: 0−

10

0S

ize: >

10

0S

ize: 0

−1

00

Size

: > 1

00

Size

: 0−

10

0S

ize: >

10

0

Sequence

Length

: Full

Sequence

Length

: 50%

Sequence

Length

: 25%

Recall

Method HIPPI HMMER BLAST HHsearch: 1 Iteration HHsearch: 2 Iterations

Page 50: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent

1. Pre-processing seed alignments

.

.

.

.

Family A

Family B

Family N

.

.

i) Compute an ML tree on each seed alignment

ii) Build ensemble of HMMs for each seed alignment, using its ML tree

HMM A1

HMM B1

HMM N1

HMM A2

HMM A3

HMM N2

HMM N5

HMM N4

HMM N3

iii) Collect HMMs into database

HMM A1 HMM A2HMM A3

HMM B1

......HMM N1HMM N2

HMM N5

HMM N4HMM N3

2. Classification of query sequences

HMM A1 HMM A2HMM A3

HMM B1

......HMM N1HMM N2

HMM N5

HMM N4HMM N3

HMM A1 = 8.9 HMM A2 = 7.3HMM A3 = 9.4HMM B1 = 5.6......HMM N1 = 4.4HMM N2 = 5.6

HMM N5 = 6.6HMM N4 = 4.7HMM N3 = 4.3

1) Family A = 9.4

8) Family B = 5.6 ......

2) Family N = 6.6 ......

Query sequence

HMM database

i) Score query sequence against all HMMs in database, keeping only scores above inclusion threshold

ii) Rank families by best scoring HMM within family

iii) Assign query sequence to top ranking family

Family A

Query

HMM database