Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology August 18, 2006, Stanford University, CA AUTOMATED ASSEMBLY OF GENE FAMILIES AND DETECTION OF HORIZONTALLY TRANSFERRED GENES Superfamily of ATP synthases for 317 taxa of bacteria and archaea

TRANSCRIPT

Page 1: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Maria Poptsova

University of Connecticut, Dept. of Molecular and Cell Biology

August 18, 2006, Stanford University, CA

AUTOMATED ASSEMBLY OF GENE FAMILIES AND DETECTION OF HORIZONTALLY TRANSFERRED GENES

Superfamily of ATP synthases for 317 taxa of bacteria and archaea

Page 2: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Outline:

Tree of Life

• 16S rRNA tree
• Rooting the Tree of Life
• How tree-like is organismal evolution?
• What is horizontal gene transfer?
• What is an organismal lineage in light of HGT?

Automated Methods of Assembling Orthologous Gene Families

• Reciprocal BLAST hit method: problems with paralogs
• BranchClust: phylogenetic algorithm for assembling gene families

Methods of HGT Detection

• Overview of methods for HGT detection: AU test, symmetric difference of Robinson and Foulds, bipartition analysis
• HGT in silico experiments

Page 3: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Cenancestor as placed by ancient duplicated genes (ATPases, signal recognition particles, EF)

To Root

• strictly bifurcating
• no reticulation
• only extant lineages
• based on a single molecular phylogeny
• branch length is not proportional to time

The Tree of Life according to SSU ribosomal RNA (+)

Page 4: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

[Figure: SSU-rRNA Tree of Life covering the domains ARCHAEA, BACTERIA and EUCARYA, with the ORIGIN marked. Scale bar: 0.1 changes per nt. Legend entries: CPS, V/A-ATPase, Prolyl RS, Lysyl RS, Mitochondria, Plastids.]

Fig. modified from Norman Pace

Page 5: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

What is HGT?

Genes can be passed vertically, from an ancestor to its descendants.

Genes can also be passed horizontally, by exchange of genes between different species.

HGT stands for Horizontal Gene Transfer.

Page 6: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Science 280, p. 672ff (1998)

Horizontal Gene Transfer Mosaic Genomes

How Tree-like is Organismal Evolution?

Page 7: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

• Escherichia coli, strain CFT073, uropathogenic
• Escherichia coli, strain EDL933, enterohemorrhagic
• Escherichia coli K12, strain MG1655, laboratory strain

Welch RA, et al. Proc Natl Acad Sci U S A. 2002; 99:17020-4

“… only 39.2% of their combined (nonredundant) set of proteins actually are common to all three strains.”

How many common genes?
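The percentage quoted on this slide is the size of the intersection of the three proteomes divided by the size of their combined nonredundant set. A minimal sketch of that arithmetic is given here; the protein identifiers are toy values invented for illustration and are not the data of Welch et al.

```python
def shared_fraction(proteomes):
    """Fraction of the combined nonredundant set that is present in every proteome."""
    sets = [set(p) for p in proteomes]
    combined = set.union(*sets)        # nonredundant union of all proteins
    common = set.intersection(*sets)   # proteins found in every strain
    return len(common) / len(combined)


# Toy identifiers (hypothetical), one set per strain.
cft073 = {"f1", "f2", "f3", "f4"}
edl933 = {"f1", "f2", "f5"}
mg1655 = {"f1", "f2", "f3", "f6"}

print(f"{100 * shared_fraction([cft073, edl933, mg1655]):.1f}% common to all strains")
```

In practice the hard part is deciding which proteins count as "the same" across strains, which is the gene family problem the following slides address.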

Page 8: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

What is an “organismal lineage” in light of horizontal gene transfer?

Over very short time intervals an organismal lineage can be defined as the majority consensus of genes.

Organismal Lineage

Page 9: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Rope as a metaphor to describe an organismal lineage (Gary Olsen)

Individual fibers = genes that travel for some time in a lineage.

While no individual fiber (gene) present at the beginning might be present at the end, the rope (or the organismal lineage) nevertheless has continuity.

Page 10: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

However, the genome as a whole will acquire the character of the incoming genes (the rope turns solidly red over time).

Page 11: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

From:

Bill Martin (1999) BioEssays 21, 99-104

Page 12: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Selection of Orthologous Gene Families

All automated methods for assembling sets of orthologous genes are based on sequence similarities.

BLAST hits

• Triangular circular BLAST significant hits (COG, or Clusters of Orthologous Groups)
• Sequence identity of 30% and greater (SCOP database)
• Similarity complemented by HMM-profile analysis (Pfam database)
• Reciprocal BLAST hit method

Page 13: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Reciprocal BLAST Hit Method

often fails in the presence of paralogs

[Diagram: genes 1, 2, 3, 4: 1 gene family. Genes 1, 2, 3, 4 plus paralog 2’: 0 gene family.]
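The failure mode on this slide can be reproduced with a small sketch. It assumes a strict form of the reciprocal best hit rule in which every pair of genes in a family must be each other's best hit; the genome names, gene names and similarity scores are toy values, and the `score` lookup stands in for a real BLAST comparison.

```python
from itertools import combinations


def best_hit(query, genome, genes, score):
    """Best-scoring gene of `genome` for `query` (toy stand-in for a top BLAST hit)."""
    return max(genes[genome], key=lambda g: score(query, g))


def rbh_family(seed_genome, seed, genes, score):
    """Return the family seeded by `seed`, or None if any pair is not reciprocal."""
    family = {seed_genome: seed}
    for genome in genes:
        if genome != seed_genome:
            family[genome] = best_hit(seed, genome, genes, score)
    for (ga, a), (gb, b) in combinations(family.items(), 2):
        if best_hit(a, gb, genes, score) != b or best_hit(b, ga, genes, score) != a:
            return None
    return family


def make_score(table):
    """Symmetric similarity lookup; unknown pairs score 0."""
    return lambda a, b: table.get(frozenset((a, b)), 0)


# Hypothetical similarity scores between genes of four genomes.
score = make_score({
    frozenset(("1", "2")): 80, frozenset(("1", "2'")): 90,
    frozenset(("3", "2")): 90, frozenset(("3", "2'")): 80,
    frozenset(("4", "2")): 88, frozenset(("4", "2'")): 70,
})
genes_simple = {"G1": ["1"], "G2": ["2"], "G3": ["3"], "G4": ["4"]}
genes_paralog = {"G1": ["1"], "G2": ["2", "2'"], "G3": ["3"], "G4": ["4"]}

print(rbh_family("G1", "1", genes_simple, score))   # one family of four genes
print(rbh_family("G1", "1", genes_paralog, score))  # None: the paralog breaks reciprocity
```

With the paralog present, gene 1 prefers 2’ while gene 3 prefers 2, so no set of four genes is mutually reciprocal and the family is lost.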

Page 14: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Families of ATP-synthases

Case of 2 bacteria and 2 archaea species

ATP-A (catalytic subunit), ATP-B (non-catalytic subunit), ATP-F

[Diagram, two panels: Escherichia coli, Bacillus subtilis, Methanosarcina mazei and Sulfolobus solfataricus, each with ATP-A and ATP-B; ATP-F is also shown.]

Neither ATP-A nor ATP-B is selected by the RBH method

Page 15: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Families of ATP-synthases

Phylogenetic Tree

[Figure: superfamily tree with three labeled clusters: Family of ATP-A, Family of ATP-B and Family of ATP-F. Leaves are sequences from Escherichia coli, Bacillus subtilis, Methanosarcina mazei and Sulfolobus solfataricus.]

Page 16: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

www.bioinformatics.org/branchclust

[Diagram: genome i is compared by BLAST against a dataset of N genomes (genome 1, genome 2, genome 3, ..., genome N); the BLAST hits lead to a superfamily tree.]

Page 17: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm


Page 18: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Root positions

[Figure, two panels: Superfamily of penicillin-binding protein (13 gamma proteobacteria); Superfamily of DNA-binding protein (13 gamma proteobacteria).]

Page 19: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Comparison of the best BLAST hit method and BranchClust algorithm

Number of selected families:

Number of taxa              Reciprocal best BLAST hit    BranchClust
(A: Archaea, B: Bacteria)
2A 2B                       80                           414 (all complete)
13B                         236                          409 (263 complete, 409 with n ≥ 8)
16B 14A                     12                           126 (60 complete, 126 with n ≥ 24)

Page 20: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

ATP-synthases: Examples of Clustering

[Figure, three panels: 13 gamma proteobacteria; 30 taxa (16 bacteria and 14 archaea); 317 bacteria and archaea.]

Page 21: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Typical Superfamily for 30 taxa (16 bacteria and 14 archaea)

[Figure: superfamily tree with labels 59:30, 33:19, 53:26, 55:21, 37:19, 36:21.]

Page 22: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Gene Annotation

------------ CLUSTER 1 -----------
------------ FAMILY ------------
>gi|27904705| peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)]
>gi|26246017| Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073]
>gi|16273058| penicillin-binding protein 3 [Haemophilus influenzae Rd KW20]
>gi|15602001| FtsI [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599614| penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1]
>gi|16763512| division specific transpeptidase [Salmonella typhimurium LT2]
>gi|15642404| penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490961| hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21230194| penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241544| penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837394| penicillin binding protein 3 [Xylella fastidiosa 9a5c]
>gi|16120877| penicillin-binding protein 3 [Yersinia pestis CO92]
>gi|22127506| peptidoglycan synthetase [Yersinia pestis KIM]
COMPLETE: 13
>>>>> IN-PARALOGS -----------
>gi|16765177| putative penicillin-binding protein 3 [Salmonella typhimurium LT2]
>gi|15597468| penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1]

------------ CLUSTER 2 -----------
------------ FAMILY ------------
>gi|26246616| Penicillin-binding protein 2 [Escherichia coli CFT073]
>gi|16272007| penicillin-binding protein 2 [Haemophilus influenzae Rd KW20]
>gi|15603789| Pbp2 [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599198| penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1]
>gi|16764017| cell elongation-specific transpeptidase [Salmonella typhimurium LT2]
>gi|15640966| penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490921| hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21232896| penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241430| penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837913| penicillin binding protein 2 [Xylella fastidiosa 9a5c]
>gi|16122817| penicillin-binding protein 2 [Yersinia pestis CO92]
>gi|22125081| peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM]
INCOMPLETE: 12
>>>>> IN-PARALOGS -----------
>gi|16765252| putative penicillin-binding protein [Salmonella typhimurium LT2]

Superfamily of penicillin-binding protein for 13 gamma proteobacteria

Page 23: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Implementation and Usage

The BranchClust algorithm is implemented in Perl with the use of the BioPerl module for parsing trees and is freely available at http://bioinformatics.org/branchclust

Required:

1. BioPerl module for parsing trees, Bio::TreeIO
2. Taxa recognition file gi_numbers.out must be present in the current directory. To create this file, read the "Taxa recognition file" section on the web site.

Usage:

At the command line type:

# perl branch_clust.pl <tree-file> <MANY>

Output: families.list, clusters.out, clusters.log

How to do batch processing:

An example of a wrapper can be found on the web site.
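The wrapper from the web site is not reproduced in this transcript. As a rough illustration only, a batch driver could look like the Python sketch below. It uses the command line shown above; the function names, the `*.tre` file pattern and the directory layout are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def branchclust_commands(tree_dir, many, pattern="*.tre"):
    """Build one `perl branch_clust.pl <tree-file> <MANY>` command per tree file.

    The "*.tre" pattern is an assumption; adjust it to the actual tree file names.
    """
    return [["perl", "branch_clust.pl", str(tree), str(many)]
            for tree in sorted(Path(tree_dir).glob(pattern))]


def run_batch(tree_dir, many, pattern="*.tre"):
    """Run BranchClust on every tree in turn; stops at the first failing command."""
    for cmd in branchclust_commands(tree_dir, many, pattern):
        subprocess.run(cmd, check=True)


# Example call (requires Perl, BioPerl, branch_clust.pl and gi_numbers.out
# in the current directory):
# run_batch("trees", many=8)
```

Each run writes families.list, clusters.out and clusters.log, so a real wrapper would presumably have to rename or move these files between runs; that step is left out here.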

Page 24: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BranchClust Algorithm

Data Flow

1. Download n complete genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) in fasta format (*.faa)
2. Put all n genomes in one database
3. Take one starting genome and do BLAST of this genome against the database, consisting of n genomes
4. Parse BLAST output with the requirement that all n taxa should be present. Result: superfamilies
5. Align with ClustalW
6. Reconstruct superfamily tree: ClustalW (quick distance method); Phyml (Maximum Likelihood)
7. Parse with BranchClust. Result: gene families
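The BLAST-parsing step of this data flow can be sketched as follows. The slides do not say which BLAST output format was parsed, so this example assumes the standard 12-column tabular format (query in column 1, subject in column 2, e-value in column 11). The `taxon_of` mapping, the e-value cutoff and the helper `row` are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def superfamilies(blast_lines, taxon_of, n_taxa, max_evalue=1e-4):
    """Group significant hits per query; keep groups in which all n_taxa genomes appear."""
    hits = defaultdict(set)
    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        if evalue <= max_evalue:
            hits[query].add(subject)
    return {q: sorted(s) for q, s in hits.items()
            if len({taxon_of[x] for x in s}) == n_taxa}


def row(query, subject, evalue):
    """Build one toy tabular BLAST line; only columns 1, 2 and 11 matter here."""
    return "\t".join([query, subject, "50.0", "100", "0", "0",
                      "1", "100", "1", "100", str(evalue), "200"])


# Toy data: three genomes A, B, C; b2 is a second hit in genome B.
taxon_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
lines = [
    row("a1", "a1", 0.0), row("a1", "b1", 1e-50),
    row("a1", "b2", 1e-20), row("a1", "c1", 1e-30),
    row("a2", "a2", 0.0), row("a2", "b2", 1e-10), row("a2", "c1", 0.5),
]
print(superfamilies(lines, taxon_of, n_taxa=3))
```

Query a1 yields a superfamily because hits from all three genomes pass the cutoff; a2 is dropped because its only hit in genome C is not significant.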

Page 25: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Why do we need gene families?

How many genes are common between different species?

Do all the common genes share the common history?

How do we reconstruct the tree of life?

How can we detect genes that were horizontally transferred?

Page 26: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Methods of HGT Detection

• Parametric methods

  • GC-content analysis
  • analysis of single nucleotide composition (SNC) and dinucleotide composition (DNC)
  • codon usage bias
  • other measures based on sequence composition

• Phylogenetic methods

  • AU test
  • SPR metric (NP-hard problem)
  • Symmetric difference of Robinson and Foulds
  • Bipartition spectrum analysis
  • Quartet spectrum analysis
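As an illustration of the simplest parametric method listed above, the sketch below computes GC content per gene and flags genes that deviate from the genome-wide value. The sequences and the 0.15 deviation threshold are invented for the example; real analyses use much longer sequences and statistically motivated cutoffs.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_outliers(genes, threshold=0.15):
    """Names of genes whose GC fraction differs from the genome-wide GC by more than threshold."""
    genome_gc = gc_content("".join(genes.values()))
    return sorted(name for name, seq in genes.items()
                  if abs(gc_content(seq) - genome_gc) > threshold)


# Toy genes (hypothetical sequences): g4 is much more GC-rich than the rest.
genes = {
    "g1": "ATGCATGCAT",
    "g2": "ATGGCTAACT",
    "g3": "ATGCCTGAAT",
    "g4": "GCGCGGCCAT",
}
print(gc_outliers(genes))
```

A gene with atypical composition is only a candidate for recent transfer; composition-based signals fade as transferred genes adapt to the host genome.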

Page 27: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

AU test

The AU test, or approximately unbiased test of phylogenetic tree selection, was proposed for assessing the confidence of tree selection (Hidetoshi Shimodaira, 2002).

The AU test produces for each tree a number ranging from zero to one (P1 and P2). This number is a P-value expressing the confidence that the tree is the true tree: the greater the P-value, the greater the confidence.

If P1 is the value for the genome tree and P2 the value for the gene tree, then P1 expresses the confidence that the genome tree is the true tree for the gene family.

One would expect that a tree with a different topology would have a small P-value.

Accepted requirement for HGT detection: P-value below a threshold in the range 1E-2 to 1E-4

Page 28: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Some Metrics to Compare Tree Topologies

• SPR – distance

• Symmetric Difference of Robinson and Foulds (bipartition distance)

• Quartet distance
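Of these metrics, the symmetric difference of Robinson and Foulds is the easiest to compute: it counts the bipartitions that occur in one tree but not in the other. The sketch below is one minimal way to do this, assuming trees are written as nested tuples of leaf names; this representation and the function names are choices made for the example, not part of the talk.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of a tree given as nested tuples of leaf names."""
    all_leaves = frozenset(leaves(tree))
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = frozenset(leaves(node))
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset((side, other)))  # unordered pair of sides
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "E"), "C"), ("D", "B"))
t3 = ((("A", "B"), "D"), ("C", "E"))
print(robinson_foulds(t1, t2), robinson_foulds(t1, t3))
```

Trees t1 and t2 share no bipartition, while t1 and t3 share the split AB | CDE, so their distance is smaller.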

Page 29: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

SPR metric (Subtree Pruning and Regrafting)

is very hard computationally

There are no tools available to calculate the difference in tree topology by the number of SPR operations required to transform one tree into another.

There is Robert Beiko’s program. Also, the SPR distance alone would not consider support levels.

That is why bipartitions come into the scene

Page 30: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Spectral Analyses of Phylogenetic Data

Phylogenetic information present in genomes

Break information into small quanta of information (bipartitions or embedded quartets)

Analyze spectra to detect transferred genes and plurality consensus.

Page 31: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

BIPARTITION OF A PHYLOGENETIC TREE

Bipartition (or split): a division of a phylogenetic tree into two parts that are connected by a single branch. It divides a dataset into two groups, but it does not consider the relationships within each of the two groups.

The number of non-trivial bipartitions for N genomes is equal to $2^{N-1} - N - 1$.

**...***..

*...**.*.*

Bipartitions can be divided into conflicting and non-conflicting:

non-conflicting (can coexist in one tree)

**...***..

conflicting (cannot coexist in one tree)

**...*...*

A B C D E A E C D B
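Both statements on this slide can be checked with a few lines of Python. The sketch assumes bipartitions are written in the star/dot notation used above, one character per genome; the function names and the 5-taxon patterns are illustrative. Two bipartitions conflict exactly when all four combinations of sides share at least one genome.

```python
def count_nontrivial(n):
    """Number of non-trivial bipartitions for n genomes: 2^(n-1) - n - 1."""
    return 2 ** (n - 1) - n - 1


def conflicting(a, b):
    """True if two star/dot patterns over the same genomes cannot coexist in one tree."""
    sa = {i for i, c in enumerate(a) if c == "*"}
    sb = {i for i, c in enumerate(b) if c == "*"}
    rest = set(range(len(a))) - sa - sb
    # Compatible splits leave at least one of these four intersections empty.
    return all((sa & sb, sa - sb, sb - sa, rest))


print(count_nontrivial(13))           # 4082, the number quoted for 13 genomes
print(conflicting("**...", "***.."))  # nested groups can coexist
print(conflicting("**...", ".**.."))  # overlapping groups cannot
```

For 13 genomes the formula gives 4082, which matches the number of possible bipartitions quoted on the Lento plot slide.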

Page 32: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

The Tree Drawing Process

Try to infer phylogeny

“Likely” Trees

Choose / make consensus

“Best” Tree

Page 33: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Obtaining Bootstrap Support for Branches

Resampling, e.g. bootstrapping

Resampling simulates examining extra sequence from the original data

[Figure: tree with bootstrap values 65, 100, 75, 100, 75 on its branches.]

Now bipartitions have weights:

ABCDEFGH
.**.....  65
***.....  75
...**...  75
.....***  100
......**  100

*.*.....  75
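The weights shown above are the percentage of bootstrap trees in which each bipartition occurs. A minimal sketch of that tally follows; it assumes each bootstrap tree has already been reduced to a list of star/dot patterns written with the same taxon order and the same side marked, and the four replicate trees are toy data.

```python
from collections import Counter


def support(bootstrap_splits):
    """Percentage of bootstrap trees that contain each bipartition pattern."""
    counts = Counter(s for tree in bootstrap_splits for s in set(tree))
    n = len(bootstrap_splits)
    return {s: 100 * c / n for s, c in counts.items()}


# Four toy bootstrap replicates over five taxa (hypothetical).
replicates = [
    [".**..", "...**"],
    [".**..", "...**"],
    ["**...", "...**"],
    [".**..", "...**"],
]
for pattern, value in sorted(support(replicates).items()):
    print(pattern, value)
```

With 100 replicates, as in the data flow of the next slide, the values read directly as bootstrap percentages.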

Page 34: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Data Flow

1. For every gene family, align sequences (ClustalW)
2. For every gene family, reconstruct Maximum Likelihood (ML) tree and generate 100 bootstrap samples (phyml)
3. For every gene family, extract bipartition information from each bootstrapped tree, and compose a bipartition matrix
4. Do “Lento” plot analysis

Bipartition matrix is generated:

        **.....  *.*....  *...*....  *....*....  etc.
fam1    71       34       0          99
fam2    56       2        0          76
fam3    1        99       1          99
…
famN    34       99       0          99

Results:

• select gene families
• majority consensus bipartitions
• detected conflicts = HGT events
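The last two steps can be illustrated with a simplified sketch. It is not the Lento plot analysis itself: it assumes a bipartition belongs to the consensus when more than half of the families support it at or above a bootstrap threshold, and it flags a family when it strongly supports a bipartition that conflicts with the consensus. The threshold of 70, the function names and the toy matrix are assumptions for this example.

```python
def star_set(pattern):
    """Indices of the genomes marked with '*' in a star/dot pattern."""
    return {i for i, c in enumerate(pattern) if c == "*"}


def conflict(a, b):
    """True if two bipartitions cannot coexist in one tree."""
    sa, sb = star_set(a), star_set(b)
    rest = set(range(len(a))) - sa - sb
    return all((sa & sb, sa - sb, sb - sa, rest))


def detect_conflicts(matrix, patterns, threshold=70):
    """Return (consensus bipartitions, {family: strongly supported conflicting bipartitions})."""
    n = len(matrix)
    consensus = [p for j, p in enumerate(patterns)
                 if sum(row[j] >= threshold for row in matrix.values()) > n / 2]
    flagged = {}
    for family, row in matrix.items():
        bad = [p for j, p in enumerate(patterns)
               if row[j] >= threshold and any(conflict(p, c) for c in consensus)]
        if bad:
            flagged[family] = bad
    return consensus, flagged


# Toy bipartition matrix: bootstrap support per family and bipartition.
patterns = ["**...", ".**..", "...**"]
matrix = {
    "fam1": [95, 0, 99],
    "fam2": [88, 3, 76],
    "fam3": [1, 99, 99],
    "fam4": [91, 2, 80],
}
consensus, flagged = detect_conflicts(matrix, patterns)
print(consensus)
print(flagged)
```

Here fam3 strongly supports a grouping that contradicts the majority, which makes it a candidate for a horizontally transferred gene rather than proof of one.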

Page 35: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

13 gamma proteobacteria

1. Buchnera aphidicola str. Bp (Baizongia pistaciae)
2. Escherichia coli CFT073
3. Haemophilus influenzae Rd KW20
4. Pasteurella multocida subsp. multocida str. Pm70
5. Pseudomonas aeruginosa PAO1
6. Salmonella typhimurium LT2
7. Vibrio cholerae O1 biovar eltor str. N16961
8. Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
9. Xanthomonas campestris pv. campestris str. ATCC 33913
10. Xanthomonas axonopodis pv. citri str. 306
11. Xylella fastidiosa 9a5c
12. Yersinia pestis KIM
13. Yersinia pestis CO92

1.37E+10 possible unrooted tree topologies

Page 36: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

“Lento”-plot of 34 supported bipartitions (out of 4082 possible)

13 gamma-proteobacterial genomes (258 putative orthologs):

• E. coli
• Buchnera
• Haemophilus
• Pasteurella
• Salmonella
• Yersinia pestis (2 strains)
• Vibrio
• Xanthomonas (2 sp.)
• Pseudomonas
• Wigglesworthia

There are 13,749,310,575 possible unrooted tree topologies for 13 genomes
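The number above follows from the standard count of unrooted bifurcating trees on n labelled leaves, $(2n-5)!! = 3 \cdot 5 \cdots (2n-5)$. A short check in Python (the function name is chosen for this example):

```python
def unrooted_topologies(n):
    """Number of unrooted bifurcating trees on n >= 3 labelled leaves: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


print(f"{unrooted_topologies(13):,}")
```

For 13 genomes this reproduces 13,749,310,575, about 1.37E+10, which is why exhaustive comparison of whole topologies is impractical and the analysis works with bipartitions instead.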

Page 37: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Consensus Tree and Horizontally Transferred Gene

Phylogeny of putatively transferred gene (virulence factor homologs (mviN))

only 258 genes analyzed

Consensus clusters of eight significantly supported bipartitions

Page 38: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events most of which go undetected by current methods?

Here we explore how well these methods perform using in silico transfers between the leaves of a gamma proteobacterial phylogeny.

What are the Actual HGT Rates?

Page 39: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: Testing Methods of Detection

•AU test

•Symmetric Difference of Robinson – Foulds

•Bipartition Analysis

Page 40: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: AU test

236 families

13 gamma proteobacteria

[Figure A: AU Test: Number of Detected Conflicts. x-axis: Logarithm of Significance Level (-4 to 0); y-axis: Number of Conflicts (0 to 250).]

Only two families out of 236 showed a conflict at the significance level of $5 \times 10^{-4}$; 5 conflicts were found at the significance level of 0.01 and 26 conflicts at the significance level of 0.05.

Page 41: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: AU test

236 families

Escherichia coli, Xylella fastidiosa, Pseudomonas aeruginosa, Vibrio cholerae

13 gamma proteobacteria

AU value = $10^{-4}$, log(AU value) = -4

[Figure: two histograms of log(AU value), x-axis from -40 to 0. In one, only 10% of AU values are less than $10^{-4}$; in the other, 86% of AU values are less than $10^{-4}$.]

Page 42: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: AU test

Power of Detection

[Figure, two panels: Significance level < 1e-4; Significance level < 1e-2.]

Page 43: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: Robinson Foulds Metric

[Figure: Distribution of the number of different bipartitions in the original dataset. x-axis: number of different bipartitions (1 to 13); y-axis: 0 to 100.]

Power of Detection

Significance level < 1e-2

Page 44: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

HGT in silico: Bipartition Analysis

Power of Detection

[Figure, two panels: Bootstrap support >70%; Bootstrap support >90%.]

Page 45: Maria Poptsova University of Connecticut Dept. of Molecular and Cell Biology

Acknowledgements

Prof. Peter Gogarten, University of Connecticut

Gogarten Lab:

Pascal Lapierre, Gregory Fournier, Alireza Ghodsi Senejani, Holly E. Gardner, Tim Harlow, Kristen Swithers, Kaiyuan Shi

NSF Microbial Genetics

NASA Exobiology & AISR Programs