alexandros stamatakis and alexey kozlov · 2021. 1. 13. · raxml-ng: a fast, scalable and...

83
RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies Alexandros Stamatakis and Alexey Kozlov Institute for Theoretical Informatics, Karlsruhe Institute of Technology ISCBacademy webinar September 30, 2020

Upload: others

Post on 24-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

RAxML-NG: a fast, scalable and user-friendly tool for

maximum likelihood phylogenetic inference

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies

Alexandros Stamatakis and Alexey Kozlov

Institute for Theoretical Informatics, Karlsruhe Institute of Technology

ISCBacademy webinar – September 30, 2020

Page 2: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Outline● Introduction to Phylogenetic Inference -

Alexandros● The RAxML Search Algorithm - Alexandros● Improvements in RAxML Next Generation -

Alexey● Tutorial - Alexey

Page 3: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

A phylogeny

Taxon 5Taxon 3

Taxon 4

Taxon 1

Taxon 2

Phylogenies describe evolutionary relationships among species

Hypothetical common ancestors

Page 4: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

A phylogeny

Taxon 5Taxon 3

Taxon 4

Taxon 1

Taxon 2 Extant species

Page 5: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

A phylogeny

Taxon 5Taxon 3

Taxon 4

Taxon 1

Taxon 2

Phylogenetic trees are unrooted binary trees!!!

Page 6: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Tree Inference Pipeline

Taxon 1:ACGTTTTaxon 2:ACGTTTaxon 3:ACCCTTaxon 4:AGGGTTT

MSAProgram

Tree inferenceprogram

Taxon 1:ACGTTT-Taxon 2:ACGTT--Taxon 3:ACCCT--Taxon 4:AGGGTTT

Taxon 1

Taxon 2

Taxon 3

Taxon 4

Page 7: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Tree Inference Pipeline

Taxon 1:ACGTTTTaxon 2:ACGTTTaxon 3:ACCCTTaxon 4:AGGGTTT

MSAProgram

RAxML-NGTaxon 1:ACGTTT-Taxon 2:ACGTT--Taxon 3:ACCCT--Taxon 4:AGGGTTT

Taxon 1

Taxon 2

Taxon 3

Taxon 4Our focus in this talk

Page 8: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How many unrooted 4-taxon trees exist?

Page 9: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How many unrooted 4-taxon trees exist?

A

D

B

C

A

C

B

D

A

B

C

D

Page 10: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How do we select among them ?

A

D

B

C

A

C

B

D

A

B

C

D

Page 11: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How do we select among them???

A

D

B

C

A

C

B

D

A

B

C

D

We need scoring criteria!!!

Score: 1.0

Score: 2.0Score: 3.0

Page 12: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How do we select among them???

A

D

B

C

A

C

B

D

A

B

C

D

We need scoring criteria!!!The currently most widely used criterion is maximum likelihood:How likely is it that the tree, given a model of evolution, generatedthe observed data?

Score: 1.0

Score: 2.0Score: 3.0

Page 13: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

How do we select among them???

A

D

B

C

A

C

B

D

A

B

C

D

We need scoring criteria!!!The currently most widely used criterion is maximum likelihood:How likely is it that the tree, given a model of evolution, generatedthe observed data?

Score: 1.0

Score: 2.0Score: 3.0A

Maximum Likelihood tree

Page 14: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

14

The number of trees

3 taxa → 1 tree

Page 15: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

15

The number of trees

4 taxa → 3 trees

Page 16: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

16

The number of trees

5 taxa → 15 trees

Page 17: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

17

The number of trees

6 taxa → 105 trees

Page 18: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

18

The number of trees explodes!

BANG !

Page 19: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

19

# possible trees with 2000 taxa

Page 20: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

20

Problem Complexity

good scores

bad scores

search spaceheuristic tree search strategy

Global maximum

Page 21: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

21

Problem Complexity

good scores

bad scores

search spaceheuristic tree search strategy

Finding the best tree under Maximum Likelihood is NP-hard!

Global maximum

Page 22: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

22

Problem Complexity

search spaceheuristic tree search strategy

Maximum Likelihood tree searches can end up in local optima

Global maximum

Page 23: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

23

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Page 24: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

24

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Page 25: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

25

Nucleotide Substitution Models

A C

G T

We model evolution as time-reversible Markov Process!

Page 26: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

26

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Commonly denoted as Q matrix: transition probs for time dt, for time t: P(t)=eQt

Page 27: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

27

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Page 28: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

28

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Page 29: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

29

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

virtual root: vrvirtual root: vr

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Page 30: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

30

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Page 31: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

31

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Page 32: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

32

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

mm

vrvr

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

Floating-point & memory intensive

Page 33: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

33

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

mm

vrvr

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

Sites evolve independently → we can compute per-site likelihoods in

parallel :-)

Page 34: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Outline● Introduction to Phylogenetic Inference -

Alexandros● The RAxML Search Algorithm - Alexandros● Improvements in RAxML Next Generation -

Alexey● Tutorial - Alexey

Page 35: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

35

How does it work?

Compute randomized stepwise addition orderMaximum Parsimony tree

Page 36: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

36

How does it work?

Compute randomized stepwise addition orderMaximum Parsimony tree

Advantage of RAxML: search starts from distinct point in search space

every time

Page 37: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

37

How does it work?

Compute randomized stepwise addition orderMaximum Parsimony tree

Advantage of RAxML: search starts from distinct point in search space

every time

Alternatively, we can start froma completely random tree

Page 38: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

38

Starting Trees

good scores

bad scores

search space

Global maximum

Good starting tree

Page 39: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

39

Starting Trees

good scores

bad scores

search space

Global maximum

Another good starting tree

Page 40: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

40

How does it work?

Apply lazy subtree rearrangements

Compute randomized stepwise addition orderMaximum Parsimony tree

Page 41: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

41

How does it work?

Apply lazy subtree rearrangements

Compute randomized stepwise addition orderMaximum Parsimony tree

Iterate while tree improves

Page 42: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

42

Subtree Pruning & Re-Grafting

ST5

ST2

ST6ST4

ST3

ST1

Page 43: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

43

Subtree Pruning & Re-Grafting

ST5

ST2

ST6

ST4

ST3

ST1

prune

Page 44: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

44

ST5

ST2

ST6

ST4

ST3

ST1+1

Subtree Pruning & Re-Grafting

re-graft

Page 45: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

45

ST5

ST2

ST6

ST4

ST3

ST1+1

Subtree Pruning & Re-Grafting

Page 46: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

46

ST5

ST2ST6

ST4

ST3

ST1+1

Subtree Pruning & Re-Grafting

Page 47: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

47

ST5

ST2ST6

ST4

ST3

ST1+1

Subtree Pruning & Re-Grafting

Page 48: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

48

ST5

ST2

ST6 ST4

ST3

ST1+2

Subtree Pruning & Re-Grafting

Page 49: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

49

ST5

ST2

ST6 ST4

ST3

ST1+2

Subtree Pruning & Re-Grafting

Page 50: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

50

ST5

ST2

ST6 ST4

ST3

ST1+2

This search parameter can be explicitly set in RAxML-NG via

--spr-radius

Subtree Pruning & Re-Grafting

Page 51: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

51

ST5

ST2

ST6 ST4

ST3

ST1+2

If you don't set it RAxML-NG will auto-tune it

Subtree Pruning & Re-Grafting

Page 52: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

52

ST5

ST2

ST6 ST4

ST3

ST1

Optimize all branches!

Subtree Pruning & Re-Grafting

Page 53: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

53

ST5

ST2

ST6 ST4

ST3

ST1

Do we need to optimize all branches?

Subtree Pruning & Re-Grafting

Page 54: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

54

ST5

ST2

ST6

ST4

ST3

ST1

Lazy Subtree Pruning & Re-Grafting

Page 55: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

55

ST5

ST2

ST6

ST4

ST3

ST1

Lazy Subtree Pruning & Re-Grafting

“FAST” SPR round: no branches are optimized

Page 56: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

56

ST5

ST2

ST6

ST4

ST3

ST1

Lazy Subtree Pruning & Re-Grafting

“SLOW” SPR round: three adjacent branches are optimized

Page 57: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Outline● Introduction to Phylogenetic Inference -

Alexandros● The RAxML Search Algorithm - Alexandros● Improvements in RAxML Next Generation -

Alexey● Tutorial - Alexey

Page 58: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

58

Evolution of RAxML(-NG)

2006 2013 2017

RAxML-NG

ExaML

RAxML

RAxML-Light

2019 202020152003

III VI v8.0

v0.1 v0.8

v1.0 v3.0

v8.2.12

v3.0.21

2021

v0.9 v1.0

2006 2013 2017

RAxML-NG

ExaML

RAxML

RAxML-Light

2019 202020152003

III VI v8.0

v0.1 v0.8

v1.0 v3.0

v8.2.12

v3.0.21

2021

v0.9 v1.0

Page 59: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

59

RAxML-NG design goals● Full rewrite of RAxML

– Search heuristic largely unchanged (as of v1.0)

● Improve maintainability & enable code reuse● Eliminate known bugs & bottlenecks● Improve user experience

– by default, “do the right thing”

Page 60: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

60

RAxML-NG & family

libpll / pll-modules

RAxML-NG

EPA-NG

ModelTest-NG

Vectorized PLF kernels(SSE3, AVX, AVX2)

Parsimony

MSA validation& statistics

Discrete tree operations

RAxMLFile I/O

(MSA & trees)

Numerical optimization(Stamatakis ’06, ‘14)

(Kozlov ’19)

(Darriba ‘20)

(Barbera ’18)

(Flouri in prep.)

GeneRax(Morel ‘20)

Page 61: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

61

Improvements & new features● Flexible evolutionary models

– All “classical” + custom DNA models (eg. DNA010010 = HKY)– Per-partition rate heterogeneity (incl. FreeRate)– Proportional branch lengths

● Phylogenetic terrace detection (Biczok ‘17)

● Transfer bootstrap support metric (Lemoine ‘18)

● Energy monitoring

Page 62: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

62

Performance & scalability● Checkpointing● Advanced load balancing● Binary alignment format● “Site repeats” optimization (Kobert ‘17)

– 10-60% speedup + memory savings

● Flexible and user-friendly parallelization

from ExaML

Page 63: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

63

Parallelization: hardware

CPU Vectorization (SSE, AVX ...)

Desktop Multi-threading

Cluster MPI

raxml-ngraxmlHPC

raxmlHPC-AVX

raxmlHPC-MPI

raxmlHPC-HYBRID

raxmlHPC-MPI-AVX2

raxmlHPC-SSE

raxmlHPC-PTHREADS

raxmlHPC-PTHREADS-SSE

raxml-ng-mpi

Page 64: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

64

Parallelization: software

Fine-grained Coarse-grained Mixed/hybrid

New in v1.0: Full native support and automatic configuration!

sites

T1

T2

T3

T4

T1

T2

T3

T4

T1

T3 T4

T2

T1 T2

T3 T4

T1 T2 T3 T4

T1 T2 T3 T4

T1 T2 T3 T4

T1 T2 T3 T4

Search 1

Search 2

Search 3

Search 4

Search 1

Alignment 4 threads 4 searches

T1, T2, T3, T4 e.g. from 4 starting trees

Page 65: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Outline● Introduction to Phylogenetic Inference -

Alexandros● The RAxML Search Algorithm - Alexandros● Improvements in RAxML Next Generation -

Alexey● Tutorial - Alexey

Page 66: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

66

Quick start: ML tree search● Default command: --search

– 20 starting trees (10 random + 10 parsimony)– Pick the best-scoring one

$ raxml-ng --msa prim.phy --model GTR+G

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Page 67: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

67

ML tree search: OutputAnalysis options: run mode: ML tree search start tree(s): random (10) + parsimony (10)...Starting ML tree search with 20 distinct starting trees

[00:00:00 -7871.515760] Initial branch length optimization...[00:00:00 -5736.644605] FAST spr round 1 (radius: 5)...[00:00:00 -5709.394601] SLOW spr round 1 (radius: 5)...[00:00:00] ML tree search #1, logLikelihood: -5708.979717...[00:00:07] ML tree search #20, logLikelihood: -5709.014076

Final LogLikelihood: -5708.923977

Best ML tree saved to: /home/alexey/test/prim.phy.raxml.bestTreeOptimized model saved to: /home/alexey/test/prim.phy.raxml.bestModel

--log info to hide search progress

Page 68: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

68

Tree with support values● All-in-one mode: --all

– ML tree search (as before)– Bootstrapping with autoMRE convergence test– Compute support values + map on ML tree

$ raxml-ng --all --msa prim.phy --model GTR+G --prefix A1 --seed 1

Output files: A1.raxml.* Fixed RNG seedfor better reproducibility

Page 69: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

69

Tree with support values: OutputAnalysis options: run mode: ML tree search + bootstrapping (Felsenstein Bootstrap) start tree(s): random (10) + parsimony (10) bootstrap replicates: max: 1000 + bootstopping (autoMRE, cutoff: 0.030000)

Starting ML tree search with 20 distinct starting trees...[00:00:02] ML tree search completed, best tree logLH: -5708.926130

[00:00:02] Starting bootstrapping analysis with 1000 replicates....[00:00:14] Bootstrapping converged after 100 replicates.

Best ML tree with Felsenstein bootstrap (FBP) support values saved to: /home/alexey/test/A1.raxml.support...Bootstrap trees saved to: /home/alexey/test/A1.raxml.bootstraps

Page 70: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

70

Customize analysis● Starting trees: --tree

● Bootstrap replicates: --bs-trees

● Branch length linkage mode: --brlen

● Branch length limits: --blmin / --blmax

--tree rand{1} --tree pars{50},rand{50} --tree user.nw

--bs-trees 100 --bs-trees autoMRE{500} --bs-trees bs.bw

--brlen linked --brlen scaled --brlen unlinked--brlen linked

--blmin 1e-9 --blmax 10

Page 71: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

71

Tree likelihood evaluation● Optimize free model parameters and branch

lengths on a fixed topology: --evaluate$ raxml-ng --evaluate --msa prim.phy --tree A1.raxml.bestTree --model GTR+G --prefix E1

Evaluating 1 trees

[00:00:00] Tree #1, initial LogLikelihood: -6419.996676

[00:00:00 -6419.996676] Initial branch length optimization[00:00:00 -6276.983451] Model parameter optimization (eps = 0.100000)

[00:00:00] Tree #1, final logLikelihood: -5709.005148...Best ML tree saved to: /home/alexey/test/E1.raxml.bestTreeOptimized model saved to: /home/alexey/test/E1.raxml.bestModel

Page 72: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

72

Tree likelihood evaluation (2)● Compute and print tree log-likelihood: --loglh

– No branch length optimization – No model optimization– No output files created

$ raxml-ng --loglh --msa prim.phy --tree A1.raxml.bestTree --model GTR{1/2/3/4/5/6}+G{0.5}

Final LogLikelihood: -6334.023267

Page 73: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

73

Check & parse● Check alignment for format errors: --check

● Compress alignment into binary file: --parse

● ...which can be then used in parallel jobs

$ raxml-ng --check --msa prim.phy –model GTR+G

$ raxml-ng --parse --msa prim.phy --model GTR+G --prefix prim

$ raxml-ng --search --msa prim.raxml.rba --prefix S1

Page 74: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

74

Parallelization tuning● Fully automatic (default) → heuristic-based$ raxml-ng --msa prim.rba

System: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 16 cores, 62 GB RAM...Analysis options: run mode: ML tree search start tree(s): random (10) + parsimony (10)... parallelization: coarse-grained (auto), PTHREADS (auto)...[00:00:00] Alignment comprises 12 taxa, 1 partitions and 413 patterns...Parallelization scheme autoconfig: 16 worker(s) x 1 thread(s)...[00:00:00] Data distribution: max. partitions/sites/weight per thread: 1 / 413 / 6608[00:00:00] Data distribution: max. searches per worker: 2

Page 75: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

75

Parallelization tuning● Automatic with upper limits

● Manual

● Also works with MPI$ mpirun -n 4 raxml-ng-mpi --msa prim.rba --threads 16 -–workers 8

$ raxml-ng --msa prim.rba --threads auto{16} -–workers auto{2}

$ raxml-ng --msa prim.rba --threads 16 -–workers 2

4 ranks * 16 threads = 64 = 8 workers * 8 threads

Page 76: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

76

Energy monitoring● New in RAxML-NG v1.0: energy usage report

– Measured with Intel RAPL → CPU+DRAM only– Supported on Linux systems only – To disable, add: --extra energy-off

Page 77: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

77

Energy monitoring● New in RAxML-NG v1.0: energy usage report

– Measured with Intel RAPL → CPU+DRAM only– Supported on Linux systems only – To disable, add: --extra energy-off

Elapsed time: 42846.287 seconds

Consumed energy: 162370.469 Wh (= 812 km in an electric car, or 4059 km with an e-scooter!)

Single tree search (96 nodes x 12h): >160 kWhMy apartment per month: ~100 kWh

Page 78: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

78

You can't improve what you can't measure!

Page 79: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

79

You can't improve what you can't measure!

v1.0+ v0.9

-30%

default:

Page 80: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

80

RAxML-NG availability● Web services

– Vital-IT: https://raxml-ng.vital-it.ch/#/ → free, for small datasets– CIPRES: http://www.phylo.org/ → registration required

● Source+binaries for Linux & macOS– GitHub: https://github.com/amkozlov/raxml-ng

● Conda: https://anaconda.org/bioconda/raxml-ng● GUI: https://github.com/AntonelliLab/raxmlGUI

Page 81: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

81

Where to get help?● Documentation

https://github.com/amkozlov/raxml-ng/wiki

● Tutorialhttps://github.com/amkozlov/raxml-ng/wiki/Tutorial

● User support grouphttps://groups.google.com/forum/#!forum/raxml

Page 82: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

Q & A

Page 83: Alexandros Stamatakis and Alexey Kozlov · 2021. 1. 13. · RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Computational Molecular

83

References● Barbera et al. (2018) EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Systematic Biology,

syy054, https://doi.org/10.1093/sysbio/syy054● Biczok et al. (2017) Two C++ libraries for counting trees on a phylogenetic terrace. Bioinformatics.

https://doi.org/10.1093/bioinformatics/bty384● Kobert et al. (2017) Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations. Syst. Biol

https://doi.org/10.1093/sysbio/syw075● Kozlov et al. (2019) RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic

inference. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz305● Kozlov, Aberer, Stamatakis (2015) ExaML version 3: a tool for phylogenomic analyses on supercomputers.

Bioinformatics. https://doi.org/10.1093/bioinformatics/btv184● Lemoine et al. (2018) Renewing Felsensteinen phylogenetic bootstrap in the era of big data. Nature.

https://doi.org/10.1038/s41586-018-0043-0 ● Morel, Kozlov, Stamatakis (2019) ParGenes: a tool for massively parallel model selection and phylogenetic tree

inference on thousands of genes. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty839● Morel et al (2020). GeneRax: A tool for species tree-aware maximum likelihood based gene tree inference under gene

duplication, transfer, and loss. MBE. https://doi.org/10.1093/molbev/msaa141 ● Stamatakis (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and

mixed models. Bioinformatics, https://doi.org/10.1093/bioinformatics/btl446● Stamatakis (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Bioinformatics, https://doi.org/10.1093/bioinformatics/btu033● Stamatakis and Aberer (2013) Novel parallelization schemes for large-scale likelihood-based phylogenetic inference.

In Parallel Distributed Processing (IPDPS) https://doi.org/10.1109/IPDPS.2013.70