Supplementary Figure 1 Fully resolved phylogenetic tree of the diatoms. The tree represents a collapsed version of the Bayesian tree shown in Fig. 2 of ref. 1. Line drawings illustrate the basic morphology of the bolidomonads (the sister group of the diatoms), the radial centrics (Coscinodiscophyceae), the bi/multipolar centrics (Mediophyceae), and the pennates (Bacillariophyceae), of which there are two groups, the araphid (top) and the raphid (bottom) pennates. Line drawings by Sara Beszteri.

SUPPLEMENTARY INFORMATION

doi: 10.1038/nature07410

www.nature.com/nature 1

[Supplementary Figure 1 image. Tree labels: Bolidophyceae; Coscinodiscophyceae; Mediophyceae; Bacillariophyceae; Araphid pennates; Raphid pennates; Thalassiosira pseudonana; Phaeodactylum tricornutum.]

Supplementary Figure 2 Comparison of the P. tricornutum and T. pseudonana genomes in terms of G+C content and distribution of transposable elements. This figure represents compositional mapping of the two genomes (based on the P. tricornutum scaffolds and the T. pseudonana chromosomes) and shows colour-coded plots of G+C levels using a 5 kb window (refs 2-4). Colours indicate the 5% G+C intervals to which these G+C levels belong. The line plots were filled with five colours representing G+C levels from deep blue (< 37% G+C) to red (> 52% G+C). Unidentified nucleotides, which represent gaps, are shown in grey. G+C% was calculated as 100% × (G+C)/(A+C+G+T). The plot was made using the windowGC.pl programme, which is freely distributed as source code under the General Public License (GPL) and can be downloaded from http://genomat.img.cas.cz. The distribution of LTR-retrotransposons on each scaffold/chromosome is shown in blue to the right of each G+C profile. The figure shows that clustered G+C-rich DNA segments are scarce in T. pseudonana, whereas the P. tricornutum genome displays a contrasting mosaic of G+C-rich and G+C-poor patches. The figure also shows that transposons are more frequent in P. tricornutum than in T. pseudonana and that they generally tend to be located in G+C-poor areas, as expected from the higher density of A+T-rich insertion target sites. The presence of the telomeric repeat sequence CCCTAA is indicated by black balls at the ends of scaffolds/chromosomes. No evidence for centromeric sequences could be found based on G+C heterogeneity, the distribution of transposable elements, or the existence of gene-poor regions (see also Supplementary Fig. 5). A scale bar is shown (1 Mb).
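The G+C calculation described in this caption can be sketched as follows. This is a minimal illustration, not the windowGC.pl implementation: the function names are hypothetical, the windows are assumed to be consecutive and non-overlapping, and the four class boundaries (37, 42, 47 and 52%) are inferred from the stated five colours and 5% intervals, with boundary values assigned to the higher class by assumption.

```python
def gc_percent(seq):
    """Return 100 * (G+C)/(A+C+G+T), ignoring unidentified bases.

    Returns None when the window holds no identified nucleotides (a gap).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else None


def window_gc(seq, window=5000):
    """G+C% for consecutive, non-overlapping windows (an assumption)."""
    return [gc_percent(seq[i:i + window]) for i in range(0, len(seq), window)]


def gc_class(pct, edges=(37.0, 42.0, 47.0, 52.0)):
    """Map a G+C% value to one of five classes, 0 (lowest) to 4 (highest).

    None (a gap window) is passed through so it can be drawn separately.
    """
    if pct is None:
        return None
    return sum(pct >= edge for edge in edges)


if __name__ == "__main__":
    toy = "GGCC" + "AATT" + "NNNN"  # invented sequence, three 4-base windows
    print(window_gc(toy, window=4))
```

With the default `window=5000`, each value in the returned list corresponds to one 5 kb step along a scaffold or chromosome, and `gc_class` gives the index of the colour class for that step.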


Supplementary Figure 3 Synteny between the P. tricornutum and T. pseudonana genomes. A. Oxford plot of P. tricornutum scaffolds (y axis) plotted against T. pseudonana chromosomes (x axis). In such a presentation, conserved segmental homologies are visualized by diagonally oriented clusters, or at least by co-clustering of genes on genomic scaffolds. The lack of such clusters indicates the lack of any major synteny between the two diatom genomes, although several microsyntenic regions can be visualized, e.g., between P. tricornutum scaffold 6 and T. pseudonana chromosome 12, and between P. tricornutum scaffold 14 and T. pseudonana chromosome 13. B. A conserved gene neighbourhood rich in genes related to membrane biology in both diatom genomes (chromosome 14 in T. pseudonana and scaffold 11 in P. tricornutum). 1: G protein α subunit; 2: γ subunit of clathrin adaptor protein complex AP1; 3: vacuolar ATPase subunit; 4: Sec23 component of coat protein complex COPII; 5: Sec1-like syntaxin-interacting protein; 6: oxidoreductase; 7: unknown expressed sequence (no similarities in public databases but EST evidence from P. tricornutum); 8: phosphate transporter. An approximate scale bar (in kbp) is indicated.
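The idea behind an Oxford plot can be illustrated by tallying ortholog pairs per (scaffold, chromosome) cell. The sketch below is only an illustration of that tally: the caption does not state how the plot was computed, the function names and the `min_pairs` threshold are hypothetical, and the example pairs are invented.

```python
from collections import Counter


def oxford_grid(ortholog_pairs):
    """Count ortholog pairs per (P. tricornutum scaffold, T. pseudonana chromosome) cell."""
    return Counter((pt_scaffold, tp_chrom) for pt_scaffold, tp_chrom in ortholog_pairs)


def enriched_cells(grid, min_pairs=3):
    """Return the cells holding at least min_pairs ortholog pairs, sorted."""
    return sorted(cell for cell, n in grid.items() if n >= min_pairs)


if __name__ == "__main__":
    # Invented example data: (Pt scaffold, Tp chromosome) for each ortholog pair.
    pairs = [(6, 12), (6, 12), (6, 12), (14, 13), (14, 13), (1, 2)]
    grid = oxford_grid(pairs)
    print(enriched_cells(grid, min_pairs=2))
```

Cells with many pairs are candidates for the co-clustering the caption describes; detecting diagonally oriented clusters would additionally need the gene coordinates within each cell.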

[Supplementary Figure 3 image. Labels recoverable from the gene-neighbourhood panel: Thalassiosira pseudonana, chromosome 14 (positions 295,000 to 335,000); Phaeodactylum tricornutum, scaffold 11 (positions 200,000 to 230,000); genes 1. Gα, 2. APγ, 3. vATPase, 4. Sec23, 5. SynIP, 6. Oxidoreductase, 7. Unknown expressed, 8. P transporter; scale bar 10 kbp.]

Supplementary Figure 4 Diatom-specific copia retrotransposons. Phylogenetic tree constructed from a CLUSTALW multiple sequence alignment of reverse transcriptase (RT) domains from representative elements belonging to the major clades of retroelements (Ty1/Copia, Ty3/Gypsy, DIRS and BEL), as well as from diatom elements belonging to the CoDiI, CoDiII and CoDiIII lineages. The tree was constructed with the NJ method using the MEGA4 software (ref. 5). Branches belonging to the same lineage were subsequently compressed together. The Copia clade contains sequences from plants (Tnt194, Tto1, Panzee, Ta13, RIRE1, Sto4, Endovir11, SIRE1, Opie2, Tpv26, AtRE1, AtRE2, Evelknievel, Hopscotch, Retrofit), animals (Mosqcopia, Copia, 1731), fungi (Tca5, Ty56p, Tca2, Ty1, Ty4), and diatoms (CoDiIII elements). The Gypsy clade contains sequences from animals (Cer1, Micropia, Blastopia, 297, Tom, 17.6, Idefix, Tv1, ZAM, Gypsy, Yoyo, 412, mdg1, Osvaldo, Ulysses, Sushi), slime mold (Skipper), fungi (Maggy, Skippy, Ty3), and plants (Tomato, Peabody, Rire3, Tma11, Athila41, t24g23, Cinful, Rire2). The BEL clade comprises sequences from animals (BEL, Kamikaze, Cer8, Cer9, Cer7, Cer13, Pao). The DIRS clade contains sequences from slime mold (DIRS1) and from animals (Pat). The CoDiI and CoDiII clades are diatom specific and contain RT sequences from P. tricornutum and T. pseudonana elements.

[Supplementary Figure 4 image. Compressed clades labelled Ty1/Copia, CoDi I, CoDi II, BEL, DIRS and Ty3/Gypsy; scale bar 0.2.]

Supplementary Figure 5 Clustering of bacterial genes in the P. tricornutum genome. Top: bacterial genes. Bottom: all gene models.

[Supplementary Figure 5 image. Two panels; x axis: Scaffold (0 to 34); y axis: Size (Mb), 0 to 3; legend: Shared diatom bacterial genes, Pt specific bacterial genes, Chromosome ends.]

Supplementary Figure 6 Major differences in gene family expansions between P. tricornutum (Ptr) and T. pseudonana (Tps). The position of each dot, which represents a gene family (4,671 in total), gives the number of genes identified in P. tricornutum and T. pseudonana (abscissa and ordinate, respectively). The dotted grey line shows the 1:1 ratio, the black line the best fit through all gene families, and the dashed grey lines indicate a two-fold size difference. Numbers on the plot indicate gene family identifiers (see Supplementary Table 5), whereas 1:n refers to a gene family consisting of a single gene in one species and n gene copies in the other species.
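The two-fold guide lines and the 1:n notation in this caption can be expressed as a small classification rule. The sketch below is an illustration under stated assumptions: the function names and category labels are hypothetical, families exactly on a two-fold line are treated as within the band, and the counts in the example are invented.

```python
def classify_family(n_ptr, n_tps, fold=2.0):
    """Place a gene family relative to the two-fold guide lines.

    n_ptr and n_tps are gene counts in P. tricornutum and T. pseudonana.
    """
    if n_ptr <= 0 or n_tps <= 0:
        raise ValueError("both species must contribute at least one gene")
    ratio = n_ptr / n_tps
    if ratio > fold:
        return "larger in Ptr"
    if ratio < 1.0 / fold:
        return "larger in Tps"
    return "within two-fold"


def is_one_to_n(n_ptr, n_tps):
    """True for a 1:n family: one gene in one species, several in the other."""
    return min(n_ptr, n_tps) == 1 and max(n_ptr, n_tps) > 1


if __name__ == "__main__":
    # Invented counts for three hypothetical families.
    for counts in [(10, 4), (1, 5), (3, 3)]:
        print(counts, classify_family(*counts), is_one_to_n(*counts))
```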


Supplementary Figure 7 Hierarchical clustering of diatom-specific gene family expansions. On the right side, the gene family IDs are shown, followed by the number of genes in the family (in brackets) for P. tricornutum with respect to T. pseudonana, followed by the gene description. The blue and red scale (based on z-scores) shows whether, for a given gene family and organism, the gene family size is substantially smaller or larger than the mean gene family size. Hence, red blocks reflect gene family expansions and blue blocks reflect gene family contractions.
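A z-score of the kind this caption refers to can be sketched as below. This is an assumption-laden illustration: the caption does not say which standard deviation was used (the population form is used here), the function name is hypothetical, and the organism labels and counts in the example are invented.

```python
from statistics import mean, pstdev


def family_z_scores(sizes):
    """Z-score of one gene family's size in each organism.

    sizes maps organism name -> number of genes in the family.
    Positive values mean larger than the mean size, negative values smaller.
    """
    values = list(sizes.values())
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return {org: 0.0 for org in sizes}
    return {org: (n - mu) / sd for org, n in sizes.items()}


if __name__ == "__main__":
    # Invented family sizes for three organisms.
    print(family_z_scores({"Ptr": 2, "Tps": 4, "Other": 6}))
```

In a heat map built this way, strongly positive values would be drawn towards red and strongly negative values towards blue.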


Supplementary Figure 8 Major expansion of HSFs in diatom genomes. A. Heat shock transcription factors (Tp/Pt orthologous pairs are indicated in blue/pink). Group 1 represents the most ancient group. Of the 7 diatom HSFs with a classical domain content (i.e., containing heptad repeats), three are orthologous diatom pairs: 1.2a, 1.2b and 1.2c (subgroup 1.2). The fourth conventional HSF of this subgroup, Tp_1.2d, was not found to have an ortholog in P. tricornutum. The remaining conventional P. tricornutum and T. pseudonana HSFs represent subgroups 1.1 and 1.4, respectively. B. Expression profiles of P. tricornutum HSFs, based on EST frequencies in cDNA libraries derived from cells grown in 16 different conditions. Abbreviations used for the libraries are as follows: OS) Original 12000 Standard (ref. 6), OM) Oval Morphotype in 10% seawater, TM) Triradiate Morphotype, TA) Tropical Accession at 15 °C, SM) Silicic acid Minus in artificial seawater, SP) Silicic acid Plus in artificial seawater with 350 µM metasilicate, NR) Nitrate Replete with 1.12 mM nitrate in chemostat, NS) Nitrate Starved for 3 days with 50 µM nitrate in chemostat, AA) Ammonium Adapted at 75 µM ammonium, UA) Urea Adapted at 50 µM urea, FL) Iron Limited at 5 nM iron, BL) Blue Light for 1 h on 48 h dark-adapted cells, LD) Low Decadienal treated for 6 h with 0.5 µg/ml 2E,4E-decadienal, HD) High Decadienal treated for 6 h with 5 µg/ml 2E,4E-decadienal, C1) 230 µM/kg of CO2 for 1 day in chemostat, C4) 230 µM/kg of CO2 for 4 days in chemostat. EST levels have been normalized to the total number of sequences in each library and are shown on a grey-to-blue scale from zero ESTs (light grey) to more than ten ESTs per ten thousand total ESTs (dark blue). This analysis was performed by hierarchical clustering followed by average-linkage cluster analysis (ref. 7). Clustering was performed using Java TreeView (ref. 8). HSFs 1.2c and 3.2b are the most constitutively expressed members of the P. tricornutum HSF family, whereas 3.2c is rather specifically induced in response to high CO2 and 4.7a is induced in response to nitrate starvation. Several other HSFs also display interesting expression profiles, which provide a basis for functional exploration.
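The normalization of EST levels to library size described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the linear colour ramp is an assumption (the caption gives only the two ends of the scale), and the counts in the example are invented.

```python
def ests_per_10k(est_count, library_size):
    """ESTs for one gene per ten thousand total ESTs in the library."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return 10000.0 * est_count / library_size


def colour_fraction(level, cap=10.0):
    """Position on a grey-to-blue scale: 0.0 at zero ESTs, 1.0 at cap or above.

    A linear ramp between the two ends is assumed.
    """
    return min(level, cap) / cap


if __name__ == "__main__":
    # Invented example: 3 ESTs for one HSF in a library of 12,000 sequences.
    level = ests_per_10k(3, 12000)
    print(level, colour_fraction(level))
```

Normalizing in this way makes libraries of different sizes comparable before the profiles are clustered.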

[Supplementary Figure 8 image. Panel A: phylogenetic tree of T. pseudonana (Tp) and P. tricornutum (Pt) HSFs arranged in Group 1, Group 2, Group 3 and Group 4, with a clade labelled "Other eukaryotes" (Pte, Tt, Pr and Ps HSFs); node support values are shown on the branches; scale bar 0.1. Panel B: EST expression profiles.]

Supplementary Table 1 Summary of predicted genes in P. tricornutum and T. pseudonana.

Gene model property              P. tricornutum      T. pseudonana
                                 # genes      %      # genes      %
All gene models:                  10,402   100%       11,776   100%
  single-exon gene models          5,549    53%        4,615    39%
  multi-exon gene models           4,853    47%        7,161    61%
Prediction methods:
  ab initio                        7,444    72%        7,818    66%
  homology-based                   2,958    28%        3,958    34%
Supported with:
  EST alignment                    8,944    86%        7,238    64%
  SwissProt alignment              6,992    67%        8,622    73%
  Pfam domain                      4,759    46%        5,370    46%
  EC assignment                    2,006    19%        2,538    22%
  KOG assignment                   6,715    65%        8,164    69%
  GO term assignment               4,730    45%        5,651    48%
  signal peptide                   1,479    14%        1,384    12%
  transmembrane (TM) domain        2,405    23%        1,844    16%
  signal peptide & TM domain         665     6%          453     4%


Supplementary Table 2 Phylogenomic search for proteins of red algal origin.


Thaps query (Protein ID) | Sister to stramenopile clade | Thaps BP% | Oomycete? | Phaeo query (Protein ID) | Sister to stramenopile clade | Phaeo BP% | Oomycete? | Annotation
Thaps 261320 | red algal (nuclear) | 100 | Pr, Ps | Phaeo 15495 | red algal (nuclear) | 100 | Pr, Ps | Glucokinase
Thaps 5692 | red algal (nuclear) | 100 | | Phaeo 43368 | red algal (nuclear) | 100 | | unknown
Thaps 38289 | red algal (nuclear) | 100 | | Phaeo 11843 | red algal (nuclear) | 100 | | potassium ion transporter
Thaps 34104 | red algal (nuclear) | 100 | | Phaeo 12411 | red algal (nuclear) | 100 | | Peptidylprolyl isomerase, FKBP-type
Thaps 32053 | red algal (nuclear) | 100 | | Phaeo 43097 | red algal (nuclear) | 100 | | Prolyl-tRNA synthetase
Thaps 268713 | red algal (nuclear) | 100 | | Phaeo 42543 | red algal (nuclear) | 100 | | hypothetical protein
Thaps 267979 | red algal (nuclear) | 100 | | Phaeo 43194 | red algal (nuclear) | 100 | | putative Anion exchanger family member
Thaps 263776 | red algal (nuclear) | 100 | | Phaeo 17226 | red algal (nuclear) | 100 | | similar to myo-inositol-phosphatase
Thaps 21539 | red algal (nuclear) | 100 | | Phaeo 8772 | red algal (nuclear) | 100 | | 2-epi-5-epi-valiolone synthase AcbC
Thaps 39845 | red algal (nuclear) | 100 | | Phaeo 33633 | red algal (nuclear) | 100 | | DNA Topoisomerase IV, ATP/GTP binding site
Thaps 13427 | red algal (nuclear) | 100 | | Phaeo 13150 | red algal (nuclear) | 100 | | hypothetical protein
Thaps 40193 | red algal (nuclear) | 100 | | Phaeo 8950 | red algal (nuclear) | 100 | | Rubisco expression protein cbbX
Thaps 38769 | red algal (nuclear) | 100 | | Phaeo 13895 | red algal (nuclear) | 100 | | Photosystem II stability/assembly factor HCF136
Thaps 34830 | red algal (nuclear) | 100 | | Phaeo 20331 | red algal (nuclear) | 100 | | oxygen evolving enhancer protein 1
Thaps 31930 | red algal (nuclear) | 100 | | Phaeo 17504 | red algal (nuclear) | 100 | | plastid division protein, metalloprotease ftsH
Thaps 28125 | red algal (nuclear) | 100 | | Phaeo 54134 | red algal (nuclear) | 100 | | glutamyl-tRNA reductase
Thaps 261275 | red algal (nuclear) | 100 | | Phaeo 37959 | red algal (nuclear) | 100 | | Protein THYLAKOID FORMATION1
Thaps 23329 | red algal (nuclear) | 100 | | Phaeo 41721 | red algal (nuclear) | 100 | | heat shock protein 60, chaperone
Thaps 22293 | red algal (nuclear) | 100 | | Phaeo 9538 | red algal (nuclear) | 100 | | sulfite reductases
Thaps 14655 | red algal (nuclear) | 100 | | Phaeo 46871 | red algal (nuclear) | 100 | | Rubisco small subunit small subunit N-methyltransferase
Thaps 40335 | red algal (nuclear) | 100 | | Phaeo 15042 | red algal (nuclear) | 100 | | putative topoisomerase VI subunit B
Thaps 263321 | red algal (nuclear) | 100 | | Phaeo 3527 | red algal (nuclear) | 99 | | Hypothetical protein
Thaps 37970 | red algal (nuclear) | 100 | | Phaeo 17555 | red algal (nuclear) | 99 | | Putative mitochondrial carrier protein
Thaps 32066 | red algal (nuclear) | 100 | | Phaeo 13538 | red algal (nuclear) | 99 | | Mitochondrial carrier protein
Thaps 7583 | red algal (nuclear) | 100 | | Phaeo 44212 | red algal (nuclear) | 95 | | Hypothetical protein
Thaps 40747 | red algal (nuclear) | 100 | | Phaeo 13877 | red algal (nuclear) | 69 | | CAB
Thaps 6942 | red algal (nuclear) | 100 | | Phaeo 34014 | red algal (nuclear) | 68 | | hypothetical protein
Thaps 36037 | red algal (nuclear); stramenopiles | 100 | | Phaeo 13240 | red algal (nuclear) | 61 | | DEGP1, Protease Do-like
Thaps 23812 | red algal (nuclear) | 100 | | Phaeo 44055 | red algal (nuclear) | 51 | | unknown
Thaps 6457 | red algal (nuclear) | 100 | Pr | | | | | unknown
Thaps 269141 | red algal (nuclear) | 100 | Pr, Ps | | | | | plastid RNA processing domain
Thaps 8416 | red algal (nuclear) | 100 | | | | | | putative sodium symporter
Thaps 6950 | red algal (nuclear) | 100 | | | | | | probable carboxyterminal protease
Thaps 264001 | red algal (nuclear) | 100 | | | | | | putative oxidoreductase
Thaps 20921 | red algal (nuclear) | 100 | | | | | | hypothetical protein
Thaps 1620 | red algal (nuclear) | 100 | | | | | | hypothetical protein
Thaps 264407 | red algal (nuclear) | 100 | | | | | | putative indole-3-glycerol-phosphate synthase (IGPS)
Thaps 261633 | red algal (nuclear) | 100 | | | | | | Probable cysteinyl-tRNA synthetase (cysteine-tRNA ligase)
Thaps 25544 | red algal (nuclear) | 100 | | | | | | putative phosphoglycolate phosphatase


Thaps 574 | red algal (nuclear) | 100 | | | | | | D-1-deoxyxylulose 5-phosphate synthase chloroplast precursor
Thaps 34824 | red algal (nuclear) | 100 | | | | | | plastid division putative chloroplast ftsH
Thaps 10404 | red algal (nuclear) | 100 | | Phaeo 44327 | | | | unknown
Thaps 269624 | red algal (nuclear) | 100 | | Phaeo 35057 | | | | Indole-3-glycerol phosphate synthase (IGPS) domain
Thaps 38775 | red algal (nuclear) | 99 | | Phaeo 50356 | red algal (nuclear) | 100 | | Hypothetical protein
Thaps 40788 | red algal (nuclear) | 99 | | Phaeo 23281 | red algal (nuclear) | 99 | | putative dTDP-glucose 4,6-dehydratase
Thaps 34632 | red algal (nuclear) | 99 | | Phaeo 14988 | red algal (nuclear) | 99 | | hypothetical protein
Thaps 268124 | red algal (nuclear) | 99 | | Phaeo 10718 | red algal (nuclear) | 99 | | Putative protein of unknown function containing an esterase/lipase domain
Thaps 38221 | red algal (nuclear) | 99 | | Phaeo 15371 | red algal (nuclear) | 99 | | hypothetical protein
Thaps 35103 | red algal (nuclear) | 99 | | Phaeo 21455 | red algal (nuclear) | 99 | | dynamin related protein involved in chloroplast division
Thaps 22115 | red algal (nuclear) | 99 | | Phaeo 20833 | red algal (nuclear) | 98 | | Putative ubiE; ubiquinone/menaquinone biosynthesis methyltransferase
Thaps 264063 | red algal (nuclear) | 99 | | Phaeo 13855 | red algal (nuclear) | 97 | | pseudouridylate synthase (pseudouridine synthase) (uracil hydrolase)
Thaps 262553 | cyanobacteria | 99 | | Phaeo 43785 | red algal (nuclear) | 96 | | putative chloride channel protein, voltage gated ion channel
Thaps 31415 | red algal (nuclear) | 99 | | Phaeo 17059 | red algal (nuclear) | 92 | | hypothetical protein
Thaps 23622 | red algal (nuclear) | 99 | | Phaeo 49268 | red algal (nuclear) | 74 | | hypothetical protein
Thaps 264420 | red algal (nuclear) | 99 | | Phaeo 46408 | red algal (nuclear) | 63 | | pseudouridylate synthase
Thaps 25918 | red algal (nuclear) | 99 | Pr, Ps | | | | | unknown
Thaps 2492 | red algal (nuclear) | 99 | | | | | | hypothetical protein
Thaps 1310 | red algal (nuclear) | 99 | | | | | | unknown
Thaps 37277 | red algal (nuclear) | 99 | | | | | | Trigger factor
Thaps 5135 | red algal (nuclear) | 99 | | | | | | ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit epsilon N-methyltransferase
Thaps 33941 | red algal (nuclear) | 99 | | | | | | Potential Mitochondrial carrier protein
Thaps 42307 | red algal (nuclear) | 98 | | Phaeo 1719 | red algal (nuclear) | 99 | | Small GTP-binding protein domain
Thaps 36429 | red algal (nuclear) | 98 | | Phaeo 49533 | red algal (nuclear) | 99 | | plastidic ATP/ADP transporter
Thaps 5380 | red algal (nuclear) | 98 | | Phaeo 11792 | red algal (nuclear) | 98 | | hypothetical protein
Thaps 36431 | red algal (nuclear) | 98 | | Phaeo 11022 | red algal (nuclear) | 98 | | probable thylakoid lumen rotamase
Thaps 36350 | red algal (nuclear) | 98 | | Phaeo 14387 | red algal (nuclear) | 98 | | catalytic subunit of clp protease, clpP
Thaps 33558 | red algal (nuclear) | 98 | | Phaeo 9327 | red algal (nuclear) | 98 | | secA preprotein translocase
Thaps 263902 | red algal (nuclear) | 98 | | Phaeo 2738 | red algal (nuclear) | 98 | | tocopherol polyprenyltransferase domain
Thaps 15398 | red algal (nuclear) | 98 | | Phaeo 14426 | red algal (nuclear) | 98 | | ftsZ homologue
Thaps 13134 | red algal (nuclear); stramenopiles | 98 | | Phaeo 12121 | red algal (nuclear); stramenopiles | 97 | | putative ABC transporter
Thaps 35022 | red algal (nuclear); stramenopiles | 98 | | Phaeo 18345 | red algal (nuclear); stramenopiles | 96 | | carboxyvinyl-carboxyphosphonate phosphorylmutase
Thaps 31006 | red algal (nuclear) | 98 | | Phaeo 16962 | red algal (nuclear) | 54 | | putative light-repressed protein A


Thaps 20587 | red algal (nuclear) | 98 | Pr, Ps | - | - | | | Hypothetical protein
Thaps 39941 | red algal (nuclear) | 97 | | Phaeo 10779 | red algal (nuclear) | 98 | | Possible branched-chain amino acid aminotransferase
Thaps 18613 | red algal (nuclear) | 97 | | Phaeo 45773 | red algal (nuclear) | 98 | | Protease clpP
Thaps 31003 | red algal (nuclear) | 97 | | Phaeo 8975 | red algal (nuclear) | 96 | | glycerol-3-phosphate dehydrogenase
Thaps 264184 | red algal (nuclear) | 97 | | Phaeo 13331 | red algal (nuclear) | 95 | | CTPA, Carboxyl-terminal processing protease
Thaps 33039 | red algal (nuclear) | 97 | | Phaeo 11099 | red algal (nuclear) | 94 | | similar to GTP pyrophosphokinase
Thaps 264345 | red algal (nuclear) | 97 | | Phaeo 15019 | red algal (nuclear) | 83 | | putative ribose methyltransferase, SpoU_methylase
Thaps 12662 | red algal (nuclear) | 97 | | Phaeo 29633 | red algal (nuclear) | 42 | | Cyclopropane-fatty-acyl-phospholipid synthase domain
Thaps 37534 | red algal (nuclear) | 96 | | Phaeo 1494 | red algal (nuclear) | 96 | | hypothetical protein
Thaps 33663 | red algal (nuclear) | 96 | | Phaeo 23988 | red algal (nuclear) | 95 | | Glucose-6-phosphate isomerase (putative)
Thaps 40554 | red algal (nuclear) | 96 | | Phaeo 17043 | red algal (nuclear) | 84 | | ATP-binding protein of ABC transporter
Thaps 38085 | animals; fungi; green algal / plant (nuclear); red algal (nuclear); stramenopiles | 96 | | Phaeo 54017 | red algal (nuclear) | 44 | | glucose-6-phosphate/phosphate-translocator precursor
Thaps 26191 | red algal (nuclear) | 95 | | Phaeo 42247 | red algal (nuclear) | 91 | | similar to putative asparaginyl-tRNA synthetase
Thaps 38909 | red algal (nuclear) | 94 | | Phaeo 52655 | red algal (nuclear) | 98 | | Glutamyl-tRNA synthetase
Thaps 21837 | red algal (nuclear) | 94 | | Phaeo 33525 | red algal (nuclear) | 97 | | hsp33
Thaps 10234 | red algal (nuclear) | 94 | | Phaeo 31683 | red algal (nuclear) | 94 | | carotenoid synthesis, geranylgeranyl hydrogenase
Thaps 263424 | red algal (nuclear) | 94 | | Phaeo 52316 | red algal (nuclear) | 91 | | putative aspartate kinase
Thaps 24425 | red algal (nuclear) | 94 | | | | | | possible permease
Thaps 17854 | red algal (nuclear) | 93 | | Phaeo 5851 | red algal (nuclear) | 93 | | Heme oxygenase
Thaps 5240 | red algal (nuclear) | 93 | | Phaeo 41746 | red algal (nuclear) | 93 | | Porphobilinogen synthase, aminoterminus with afap missing from gene model
Thaps 35728 | red algal (nuclear); stramenopiles | 93 | | Phaeo 14995 | red algal (nuclear); stramenopiles | 93 | | plastid division ftsZ
Thaps 269655 | red algal (nuclear); stramenopiles | 93 | | Phaeo 42361 | red algal (nuclear); stramenopiles | 93 | | Plastid division, ftsZ homologue
Thaps 261232 | red algal (nuclear) | 93 | | Phaeo 19188 | red algal (nuclear) | 93 | | Uroporphyrinogen decarboxylase
Thaps 269235 | red algal (nuclear) | 93 | Pr, Ps | Phaeo 13361 | red algal (nuclear) | 91 | Pr, Ps | rpl3
Thaps 33044 | red algal (nuclear) | 93 | | Phaeo 32202 | red algal (nuclear) | 91 | | Lysyl-tRNA synthetase, class-2
Thaps 10635 | red algal (nuclear) | 93 | | | | | | Hypothetical protein
Thaps 38046 | red algal (nuclear) | 93 | | | | | | ABC transporter, unknown function
Thaps 3959 | red algal (nuclear) | 92 | Pr, Ps | Phaeo 8765 | red algal (nuclear) | 92 | Pr, Ps | pantothenate biosynthesis, 3-methyl-2-oxobutanoate hydroxymethyltransferase
Thaps 33829 | red algal (plastid) | 92 | | | | | | Hypothetical protein
Thaps 35532 | red algal (nuclear) | 91 | | Phaeo 46785 | red algal (nuclear) | 94 | | UDP-glucose 4-epimerase
Thaps 33566 | red algal (nuclear) | 91 | | Phaeo 9241 | red algal (nuclear) | 92 | | putative thylakoidal processing peptidase
Thaps 39516 | red algal (nuclear) | 91 | | Phaeo 15625 | red algal (nuclear) | 91 | | unknown
Thaps 43128 | red algal (nuclear) | 91 | | Phaeo 40135 | red algal (nuclear) | 91 | | possible saccharopine dehydrogenase


Thaps 40156 | red algal (nuclear) | 91 | | Phaeo 20657 | red algal (nuclear) | 91 | | atpG
Thaps 30747 | red algal (nuclear) | 91 | | Phaeo 47250 | red algal (nuclear) | 74 | | Phosphatidate cytidylyltransferase
Thaps 15424 | red algal (nuclear) | 89 | | Phaeo 3351 | red algal (nuclear) | 96 | | dimethyladenosine synthase
Thaps 31394 | red algal (nuclear) | 89 | | Phaeo 22909 | red algal (nuclear) | 93 | | hypothetical aspartate aminotransferase, AAT/AspAT/AST
Thaps 13254 | red algal (nuclear) | 89 | | Phaeo 27821 | red algal (nuclear) | 86 | | similar to Nucleolar RNA helicase II (Nucleolar RNA helicase Gu) (RH II/Gu) (DEAD-box protein 21)
Thaps 37294 | red algal (nuclear) | 89 | | | | | | protein of unknown function with ABC1 domain
Thaps 6897 | red algal (nuclear) | 88 | | Phaeo 48145 | red algal (nuclear) | 99 | | Hypothetical protein
Thaps 262862 | red algal (nuclear) | 88 | | Phaeo 43626 | red algal (nuclear) | 95 | | similar to 16S rRNA processing protein rimM
Thaps 264548 | red algal (nuclear) | 88 | | Phaeo 49459 | red algal (nuclear) | 94 | | putative elongation factor G
Thaps 35968 | red algal (nuclear); stramenopiles | 88 | | Phaeo 605 | red algal (nuclear); stramenopiles | 88 | | mitochondrial helicase twinkle
Thaps 3258 | red algal (nuclear) | 87 | | Phaeo 9078 | red algal (nuclear) | 97 | | 11 kDa PSII
Thaps 55 | red algal (nuclear) | 87 | Ps | Phaeo 47075 | red algal (nuclear) | 89 | Pr, Ps | hypothetical protein
Thaps 1515 | red algal (nuclear) | 87 | | Phaeo 46593 | red algal (nuclear) | 88 | | NifU-like protein, biogenesis of ferredoxin and photosystem I
Thaps 13922 | red algal (nuclear) | 87 | Pr, Ps | Phaeo 15324 | red algal (nuclear) | 84 | Pr, Ps | Possible amino acid transporter
Thaps 14347 | red algal (nuclear); stramenopiles | 87 | | | | | | Amino acid transporter
Thaps 661 | red algal (nuclear) | 87 | | | | | | Mg chelatase, chlD/bchD
Thaps 32477 | red algal (nuclear) | 86 | | Phaeo 12642 | red algal (nuclear) | 88 | | link to urea cycle: putative ornithine decarboxylase
Thaps 711 | red algal (nuclear) | 85 | | Phaeo 3046 | red algal (nuclear) | 86 | | sodium dependent bile acid symporter
Thaps 31372 | red algal (nuclear) | 85 | | Phaeo 21970 | red algal (nuclear) | 85 | | protein with transmembrane domains
Thaps 35780 | red algal (nuclear) | 85 | | Phaeo 10617 | red algal (nuclear) | 83 | | rpl3
Thaps 34333 | red algal (nuclear); stramenopiles | 85 | | | | | | amino acid permease family protein
Thaps 42821 | red algal (nuclear) | 84 | | Phaeo 18877 | red algal (nuclear) | 84 | | tRNA-i(6)A37 modification enzyme MiaB
Thaps 29183 | red algal (nuclear) | 84 | | Phaeo 38362 | red algal (nuclear) | 82 | | similar to 2-isopropylmalate synthase, bacterial
Thaps 264918 | red algal (nuclear) | 84 | Pr, Ps | | | | | Possible amino acid transporter
Thaps 21692 | red algal (nuclear) | 83 | | Phaeo 54952 | red algal (nuclear) | 89 | | Pyridoxamine 5'-phosphate oxidase domain
Thaps 263644 | red algal (nuclear) | 83 | Pr, Ps | Phaeo 13581 | red algal (nuclear) | 74 | Pr, Ps | putative DNA topoisomerase II, DNA gyrase, topoisomerase IV subunit A
Thaps 262236 | red algal (nuclear) | 83 | Pr, Ps | | | | | Putative amino acid transporter, possibly cationic
Thaps 35911 | red algal (nuclear) | 82 | | Phaeo 44311 | red algal (nuclear) | 95 | | SAM-dependent methyltransferase-like
Thaps 866 | red algal (nuclear) | 82 | | Phaeo 40933 | red algal (nuclear) | 88 | | Protease, clpP
Thaps 1738 | red algal (nuclear) | 82 | | Phaeo 44382 | red algal (nuclear) | 84 | | clpP protease
Thaps 31785 | red algal (nuclear) | 82 | | Phaeo 32791 | red algal (nuclear) | 83 | | hypothetical protein
Thaps 8014 | red algal (nuclear) | 82 | | Phaeo 4550 | red algal (nuclear) | 82 | | Vitamin K-dependent carboxylation/gamma-carboxyglutamic region
Thaps 32319 | red algal (nuclear) | 82 | | Phaeo 1784 | red algal (nuclear) | 82 | | putative IMP-GMP specific 5' nucleotidase (purine 5' nucleotidase)


Thaps 11102 | red algal (nuclear) | 82 | Pr, Ps | | | | | conserved hypothetical protein
Thaps 24705 | red algal (nuclear) | 81 | | Phaeo 47622 | red algal (nuclear) | 99 | | unknown
Thaps 19311 | red algal (nuclear) | 81 | | Phaeo 7173 | red algal (nuclear) | 82 | | Hypothetical protein
Thaps 37497 | red algal (nuclear) | 81 | | Phaeo 13476 | red algal (nuclear) | 82 | | Serine O-acetyltransferase
Thaps 2950 | red algal (nuclear) | 81 | | Phaeo 44610 | red algal (nuclear) | 81 | | uroporphyrinogen III synthase (urogen synthase)
Thaps 27618 | red algal (nuclear) | 81 | | Phaeo 14715 | red algal (nuclear) | 69 | | rpl10
Thaps 37748 | red algal (nuclear); stramenopiles | 81 | | Phaeo 769 | red algal (nuclear) | 42 | | similar to nuclear VCP-like; nuclear valosin-containing protein
Thaps 5579 | red algal (nuclear) | 80 | | Phaeo 21912 | red algal (nuclear) | 84 | | similar to abhydrolase
Thaps 5236 | red algal (nuclear) | 80 | | Phaeo 10800 | red algal (nuclear) | 80 | | hypothetical protein
Thaps 1219 | cyanobacteria; green algal / plant (nuclear); stramenopiles | 79 | | Phaeo 32261 | red algal (nuclear) | 87 | | chaperone-like protein
Thaps 39802 | red algal (nuclear) | 79 | Pr, Ps | Phaeo 29340 | red algal (nuclear) | 79 | Pr, Ps | Peptidase C19, ubiquitin carboxyl-terminal hydrolase 2
Thaps 2848 | red algal (nuclear) | 79 | | Phaeo 26293 | red algal (nuclear) | 79 | | psbU
Thaps 25171 | red algal (nuclear) | 79 | | Phaeo 45612 | | | | hypothetical protein
Thaps 261102 | red algal (nuclear) | 78 | | Phaeo 3061 | red algal (nuclear) | 95 | | Putative carboxyl-terminal protease
Thaps 26031 | red algal (nuclear) | 78 | Pr, Ps | Phaeo 18665 | red algal (nuclear) | 78 | Pr, Ps | serine hydroxymethyltransferase, SHMT
Thaps 269966 | red algal (nuclear) | 78 | Pr, Ps | Phaeo 43016 | red algal (nuclear) | 76 | Pr, Ps | Putative valyl-tRNA synthetase
Thaps 35712 | red algal (nuclear) | 77 | | Phaeo 29157 | red algal (nuclear) | 77 | | Phosphoglycerate kinase precursor
Thaps 37359 | red algal (nuclear) | 77 | | Phaeo 10593 | red algal (nuclear) | 77 | | clpP
Thaps 33008 | red algal (nuclear) | 77 | | Phaeo 18246 | red algal (nuclear) | 77 | | shikimate pathway, 5-enolpyruvylshikimate-3-phosphate
Thaps 42577 | red algal (nuclear) | 77 | | | | | | Phosphoglycerate kinase precursor
Thaps 34348 | red algal (nuclear) | 76 | Pr, Ps | Phaeo 3362 | red algal (nuclear) | 78 | Pr, Ps | S-adenosylmethionine decarboxylase
Thaps 9352 | red algal (nuclear) | 76 | | Phaeo 45887 | red algal (nuclear) | 76 | | Hypothetical protein
Thaps 14106 | red algal (nuclear) | 76 | | Phaeo 12346 | red algal (nuclear) | 76 | | putative membrane protein
Thaps 32201 | red algal (nuclear) | 75 | | Phaeo 33017 | red algal (nuclear) | 88 | | Mg-protoporphyrin IX chelatase, subunit D
Thaps 37495 | bacteria | 75 | | Phaeo 44670 | red algal (plastid) | 60 | | 1,4-dihydroxy-2-naphthoate octaprenyltransferase
Thaps 36684 | red algal (nuclear) | 74 | | Phaeo 14412 | red algal (nuclear) | 86 | | plastid division, ftsY?
Thaps 31705 | red algal (nuclear) | 74 | | Phaeo 22819 | red algal (nuclear) | 65 | | Hypothetical protein
Thaps 30977 | red algal (nuclear) | 74 | | | | | | similarity to polypeptide deformylases
Thaps 31984 | red algal (nuclear) | 74 | | | | | | Putative serine acetyltransferase
Thaps 268062 | red algal (nuclear) | 73 | | Phaeo 13424 | red algal (nuclear) | 72 | | HSP dnaJ
Thaps 11101 | red algal (nuclear) | 73 | Pr, Ps | | | | | hypothetical protein
Thaps 26573 | red algal (nuclear) | 72 | | Phaeo 13265 | red algal (nuclear) | 75 | | Chlorophyll synthesis, Mg-protoporphyrin IX chelatase, H subunit
Thaps 6264 | red algal (nuclear) | 72 | Pr, Ps | Phaeo 43288 | red algal (nuclear) | 72 | Pr, Ps | unknown
Thaps 36917 | red algal (nuclear) | 72 | | Phaeo 24792 | red algal (nuclear) | 65 | | putative queuine tRNA ribosyltransferase
Thaps 268541 | red algal (nuclear) | 72 | Pr, Ps | Phaeo 54754 | red algal (nuclear) | 55 | Pr, Ps | ubiquitin activating enzyme 1
Thaps 33055 | red algal (nuclear) | 72 | Pr, Ps | | | | | putative arogenate dehydrogenase


Thaps 5487 red algal (nuclear) 71 Phaeo 44401 red algal (nuclear) 99 hypothetical protein
Thaps 30691 red algal (nuclear) 71 Phaeo 55177 red algal (nuclear) 72 putative spermine synthase (E.C. 2.5.1.22) or spermidine synthase (2.5.1.16)
Thaps 26548 red algal (nuclear) 71 Phaeo 3639 red algal (nuclear) 61 probable myo-inositol dehydrogenase
Thaps 27873 red algal (nuclear); stramenopiles 70 Phaeo 31718 red algal (nuclear); stramenopiles 70 inosine monophosphate dehydrogenase 1 (IMPDH)
Thaps 31113 red algal (nuclear) 70 Pr Phaeo 11016 red algal (nuclear) 68 Pr Cytochrome oxidase c, subunit VIb
Thaps 1049 red algal (nuclear) 70 probable myo-inositol dehydrogenase
Thaps 1238 red algal (nuclear) 69 Phaeo 38271 red algal (nuclear) 69 lactoylglutathione lyase-like
Thaps 268908 red algal (nuclear) 69 Phaeo 41878 red algal (nuclear) 69 Carotenoid synthesis, similar to phytoene synthase PSY1
Thaps 14772 red algal (nuclear) 69 Pyridoxal-dependent decarboxylase
Thaps 269487 red algal (nuclear) 69 Hypothetical protein
Thaps 3461 red algal (nuclear) 68 Phaeo 15595 red algal (nuclear) 76 Tryptophanyl-tRNA synthetase, class Ib
Thaps 41637 red algal (nuclear) 68 Pr, Ps Phaeo 1971 red algal (nuclear); stramenopiles 65 hypothetical protein
Thaps 263012 red algal (nuclear) 68 Pr, Ps RAS-related protein
Thaps 13806 red algal (nuclear) 67 Ps Phaeo 27385 red algal (nuclear) 67 Ps Glutaminyl-tRNA synthetase
Thaps 1283 red algal (nuclear) 67 queuine tRNA-ribosyltransferase
Thaps 33131 red algal (nuclear); stramenopiles 67 Chlorophyll A-B binding protein
Thaps 35768 red algal (nuclear) 66 Phaeo 47453 red algal (nuclear) 66 Hypothetical protein
Thaps 21069 red algal (nuclear) 66 Phaeo 49764 red algal (nuclear) 51 Mitochondrial carrier protein
Thaps 40774 red algal (nuclear) 65 Pr, Ps Phaeo 47999 red algal (nuclear) 65 Pr, Ps putative alanine-tRNA ligase, alanine-tRNA synthetase (ALATS)
Thaps 38248 red algal (nuclear) 65 Pr, Ps Phaeo 15186 red algal (nuclear) 65 Pr, Ps putative alanine-tRNA ligase, alanyl-tRNA synthetase
Thaps 11220 red algal (nuclear) 65 Phaeo 48306 red algal (nuclear) 56 unknown
Thaps 32459 red algal (nuclear) 65 Phaeo 19661 red algal (nuclear) 54 putative membrane protein
Thaps 263147 red algal (nuclear) 64 Phaeo 51519 red algal (nuclear) 68 Lipoate protein ligase B
Thaps 24248 red algal (nuclear) 64 Pr, Ps Phaeo 39528 red algal (nuclear) 64 Pr, Ps Urea cycle, Carbamyl phosphate synthetase III
Thaps 269393 red algal (nuclear) 64 Phaeo 21201 red algal (nuclear) 48 Sulfolipid (UDP-sulfoquinovose) biosynthesis protein
Thaps 268226 red algal (nuclear) 64 Putative ammonium transporter
Thaps 28544 red algal (nuclear) 63 Phaeo 4025 red algal (nuclear) 63 Dihydrodipicolinate reductase
Thaps 33088 red algal (nuclear) 63 GTP binding domain
Thaps 16675 red algal (nuclear) 63 TatD related deoxyribonuclease
Thaps 31451 red algal (nuclear) 62 Phaeo 10774 red algal (nuclear); stramenopiles 100 rpl15
Thaps 35135 red algal (nuclear) 62 Pr, Ps Phaeo 16032 red algal (nuclear) 72 Pr, Ps putative protein kinase
Thaps 21300 red algal (nuclear) 62 Phaeo 43169 red algal (nuclear) 63 mRNA binding protein
Thaps 264481 red algal (nuclear) 62 Ps putative mannosyltransferase
Thaps 262704 red algal (nuclear) 62 Hypothetical protein


Thaps 36679 red algal (nuclear) 61 Phaeo 48233 red algal (nuclear); stramenopiles 82 hypothetical protein
Thaps 263883 red algal (nuclear) 61 Phaeo 11883 red algal (nuclear) 77 Hypothetical protein
Thaps 460 red algal (nuclear) 61 Pr, Ps Phaeo 10073 red algal (nuclear) 64 Pr, Ps putative aminotransferase or cysteine beta conjugate-lyase
Thaps 37327 red algal (nuclear) 61 Phaeo 15294 red algal (nuclear) 61 putative tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase
Thaps 264759 red algal (nuclear) 61 Phaeo 44775 red algal (nuclear) 45 Sulfur metabolism, Glutathione biosynthesis, Gamma-glutamyltranspeptidase
Thaps 27003 red algal (nuclear) 61 similar to methionine aminopeptidase
Thaps 263836 red algal (nuclear) 61 Sulfur metabolism, Glutathione synthesis and degradation, putative gamma-glutamyltransferase (GGT)
Thaps 264149 red algal (nuclear) 61 Phaeo 11953 rpl24
Thaps 263554 red algal (nuclear) 60 Phaeo 6807 red algal (nuclear) 86 Putative shikimate kinase
Thaps 4142 red algal (nuclear); stramenopiles 60 Ps Phaeo 54869 red algal (nuclear) 81 GTPase activating enzyme
Thaps 264188 red algal (nuclear) 60 - - metalloprotease
Thaps 38121 red algal (nuclear); stramenopiles 59 Phaeo 26173 red algal (nuclear); stramenopiles 89 Protease, clpA
Thaps 40537 red algal (nuclear) 59 Ammonium transporter
Thaps 30701 red algal (nuclear); stramenopiles 59 Urea cycle, ArgA N-acetylglutamate synthase
Thaps 27656 red algal (nuclear) 58 Pr, Ps Phaeo 54246 red algal (nuclear) 70 Pr, Ps heat shock protein/chaperone - DnaK/hsp70
Thaps 31216 red algal (nuclear) 58 Pr, Ps Phaeo 36165 red algal (nuclear) 61 Pr, Ps putative oxononanoate synthase, 7-keto-8-aminopelargonate synthetase
Thaps 33330 red algal (nuclear) 57 Phaeo 19705 red algal (nuclear) 66 Dolichyl-phosphate beta-Mannosyltransferase
Thaps 454 red algal (nuclear); stramenopiles 57 Phaeo 38559 red algal (nuclear) 39 GTP-binding protein YchF
Thaps 23484 red algal (nuclear) 57 Pr, Ps unknown
Thaps 38578 red algal (nuclear) 57 Pr, Ps ER chaperone, DnaK/hsp70
Thaps 37640 red algal (nuclear) 57 Small multidrug resistance protein
Thaps 9499 red algal (nuclear) 56 Phaeo 9904 red algal (nuclear) 74 hydrolase domain
Thaps 40966 red algal (nuclear) 56 Phaeo 9855 red algal (nuclear) 61 RNA Polymerase sigma factor
Thaps 277 red algal (nuclear) 56 Phaeo 21046 red algal (nuclear) 56 similar to cycloartenol synthase; (S)-2,3-epoxysqualene mutase (2,3-epoxysqualene cycloartenol-cyclase)
Thaps 511 red algal (nuclear) 56 Phaeo 37067 red algal (nuclear) 54 Putative GTP binding protein
Thaps 36263 red algal (nuclear) 56 Ammonium transporter
Thaps 30924 red algal (nuclear) 56 Myo-inositol-1-phosphate synthase
Thaps 264824 red algal (nuclear) 55 Phaeo 10674 red algal (nuclear) 56 putative Translation initiation factor eIF-2B alpha subunit
Thaps 26545 red algal (nuclear) 55 Phaeo 468 red algal (nuclear) 53 CDC21, DNA replication licensing factor mis5, cdc21
Thaps 9840 red algal (nuclear) 55 unknown


Thaps 261650 red algal (nuclear) 55 permease, aminoterminus missing
Thaps 261470 red algal (nuclear) 54 Hypothetical protein
Thaps 7642 red algal (nuclear) 54 Hypothetical protein
Thaps 225 red algal (nuclear) 53 Phaeo 40048 red algal (nuclear); stramenopiles 91 Ribosome recycling factor
Thaps 31046 red algal (nuclear); stramenopiles 53 Phaeo 24069 red algal (nuclear) 72 transcription factor APFI
Thaps 4126 red algal (nuclear) 53 Phaeo 48122 red algal (nuclear) 53 unknown
Thaps 268374 red algal (nuclear) 53 Pr, Ps Phaeo 55035 red algal (nuclear) 52 Pr, Ps Pyruvate dehydrogenase E1 alpha subunit
Thaps 15259 red algal (nuclear) 53 Phaeo 10847 red algal (nuclear) 47 rps1
Thaps 34739 red algal (nuclear) 53 - - unknown
Thaps 13996 red algal (nuclear) 52 Phaeo 54981 red algal (nuclear) 56 Ammonium transporter
Thaps 36964 red algal (nuclear) 52 Phaeo 16599 red algal (nuclear) 52 Nicotinate-nucleotide pyrophosphorylase
Thaps 8537 red algal (nuclear) 52 Phaeo 33462 red algal (nuclear) 48 Hypothetical protein
Thaps 10136 red algal (nuclear) 52 Phaeo 47730 red algal (nuclear) 43 unknown
Thaps 24146 red algal (nuclear) 51 Pr, Ps Phaeo 12174 red algal (nuclear) 51 Pr, Ps similar to protein phosphatase 2
Thaps 263219 red algal (nuclear) 51 Phaeo 16421 red algal (nuclear) 51 possible transporter
Thaps 15027 red algal (nuclear) 51 Phaeo 42116 red algal (nuclear) 51 putative Na+/H+ antiporter
Thaps 6988 red algal (nuclear) 51 Phaeo 41016 red algal (nuclear) 38 putative aspartate-tRNA ligase or aspartyl tRNA synthetase
Thaps 37793 red algal (nuclear) 51 Hypothetical protein
Thaps 38157 red algal (nuclear) 50 Phaeo 43174 red algal (nuclear) 52 Putative GDP-L-fucose transporter
Thaps 9002 red algal (nuclear) 50 Hypothetical protein
Thaps 3073 red algal (nuclear) 50 unknown
Thaps 35180 red algal (nuclear) 48 Phaeo 12379 red algal (nuclear); stramenopiles 99 hypothetical protein
Thaps 15114 red algal (nuclear) 48 Phaeo 15083 red algal (nuclear) 63 Urea cycle, aminoacylase 1
Thaps 36454 red algal (nuclear) 46 Phaeo 14915 red algal (nuclear) 51 rpl1
Thaps 34306 red algal (nuclear) 45 Pr, Ps Phaeo 30502 red algal (nuclear) 51 Pr, Ps Hypothetical protein
Thaps 14643 red algal (nuclear) 44 Phaeo 14002 red algal (nuclear) 59 putative beta-1,4 mannosyltransferase
Thaps 260799 red algal (nuclear) 42 Phaeo 12106 red algal (nuclear) 57 protein containing a zinc metallopeptidase (metalloprotease) domain
Thaps 262367 red algal (nuclear) 41 Phaeo 43556 green algal / plant (nuclear); red algal (nuclear) 99 15 kDa thylakoid lumen protein
Thaps 38189 red algal (nuclear) 38 Phaeo 15806 red algal (nuclear) 52 Amine oxidase or phytoene desaturase
Thaps 3748 red algal (nuclear); stramenopiles 37 Phaeo 35949 red algal (nuclear) 54 mitochondrial uncoupling protein
Thaps 34357 red algal (nuclear) 32 Phaeo 13534 red algal (nuclear) 77 hypothetical protein
Thaps 2317 red algal (nuclear); stramenopiles 16 Phaeo 35984 red algal (nuclear) 57 mitochondrial carrier-like protein
Phaeo 974 red algal (nuclear) 100 unknown
Phaeo 51570 red algal (nuclear) 100 hypothetical protein
Phaeo 48737 red algal (nuclear) 100 hypothetical protein
Phaeo 47504 red algal (nuclear) 100 unknown


Phaeo 44273 red algal (nuclear) 100 unknown
Phaeo 37234 red algal (nuclear) 100 Adenylate cyclase domain
Phaeo 34521 red algal (nuclear) 100 hypothetical protein
Phaeo 41170 red algal (nuclear) 100 quinol-to-oxygen oxidoreductase
Phaeo 54355 red algal (nuclear) 100 putative acetolactate synthase large subunit (AHAS)
Phaeo 48359 red algal (nuclear) 100 23 kd OEC protein
Phaeo 54343 red algal (nuclear) 100 Histone acetyltransferase, putative
Thaps 5806 Phaeo 31981 red algal (nuclear) 100 Tryptophan synthesis, putative protein similar to anthranilate synthase
Thaps 25247 Phaeo 34522 red algal (nuclear) 100 hypothetical protein
Thaps 1136 Phaeo 46675 red algal (nuclear) 100 Hypothetical protein
Phaeo 45238 red algal (nuclear) 99 hypothetical protein
Phaeo 39421 red algal (nuclear) 99 Chloride channel domain
Phaeo 8891 red algal (nuclear) 98 Histone deacetylase complex
Phaeo 5142 red algal (nuclear) 98 Mitochondrial substrate carrier
Phaeo 46865 red algal (nuclear) 97 unknown
Phaeo 40261 red algal (nuclear) 95 unknown
Thaps 7330 Phaeo 36358 red algal (nuclear) 93 unknown
Phaeo 15960 red algal (nuclear) 92 Pseudouridine synthase
Phaeo 48297 red algal (nuclear) 92 unknown
Phaeo 40492 red algal (nuclear) 92 probable phosphate/phosphoenolpyruvate translocator precursor
Thaps 3863 Phaeo 39247 red algal (nuclear) 92 unknown
Phaeo 33726 red algal (nuclear); stramenopiles 91 unknown
Phaeo 40588 red algal (nuclear) 91 unknown
Phaeo 50189 red algal (nuclear); stramenopiles 90 unknown
Phaeo 14334 red algal (nuclear) 90 hypothetical protein
Phaeo 41282 red algal (nuclear); stramenopiles 89 Ribosome recycling factor
Phaeo 18087 red algal (nuclear) 88 Pr, Ps putative Adenylosuccinate lyase
Phaeo 43654 red algal (nuclear) 84 Pr, Ps hypothetical protein
Phaeo 9233 red algal (nuclear); stramenopiles 84 Amino acid transporter
Phaeo 45656 red algal (nuclear); stramenopiles 83 Plastidial HCO3- transporter
Phaeo 37858 red algal (nuclear) 79 Tocopherol UbiA prenyltransferase
Phaeo 8744 red algal (nuclear) 77 Fructose-1,6-bisphosphatase
Phaeo 30466 red algal (nuclear) 76 RNA binding protein
Phaeo 47935 red algal (nuclear); stramenopiles 74 possible tRNA pseudouridine synthase


Phaeo 13386 red algal (nuclear); stramenopiles 73 Pr, Ps putative spermine/spermidine synthase
Phaeo 9359 red algal (nuclear) 73 Fructose-1,6-bisphosphatase
Phaeo 16840 red algal (nuclear) 72 putative mitochondrial NADH:ubiquinone oxidoreductase 29 kDa
Phaeo 40200 red algal (nuclear) 70 GTP binding domain
Phaeo 50542 red algal (nuclear) 69 Mitochondrial substrate carrier
Phaeo 16322 red algal (nuclear); stramenopiles 67 cab
Phaeo 47094 red algal (nuclear) 65 DNA ligase domain
Phaeo 1862 red algal (nuclear) 65 Ammonium transporter
Phaeo 10257 red algal (nuclear) 62 Pr, Ps Kynurenine aminotransferase, glutamine transaminase K
Phaeo 44463 red algal (nuclear) 62 unknown
Phaeo 20588 red algal (nuclear) 61 putative Small nuclear ribonucleoprotein
Thaps 9506 Phaeo 43802 red algal (nuclear) 61 unknown
Phaeo 27077 red algal (nuclear) 59 hypothetical protein
Phaeo 10881 red algal (nuclear) 58 Ammonium transporter
Phaeo 11128 red algal (nuclear) 56 Ammonium transporter
Thaps 261226 Phaeo 55176 red algal (nuclear) 54 Thylakoid lumenal 17.4 kDa protein
Phaeo 31975 red algal (nuclear) 52 unknown
Phaeo 1813 red algal (nuclear) 52 Ammonium transporter


Supplementary Table 3 The 587 P. tricornutum genes of proposed bacterial origin. Presence of orthologs in T. pseudonana and Phytophthora spp. is indicated by x, and bootstrap values are grouped as >90, 75-90, 50-75, and <50. Annotation information and number of ESTs is also given. Individual trees of each of these 587 genes are shown in the supplementary file. See separate pdf file for Supplementary Table 3


Supplementary Table 4 Major gene family expansions in diatoms and comparison with other organisms. R denotes the ranking of each gene family in each species.


Phylogenetic Profiles (each species cell gives the gene count followed by the rank R of the family in that species)

FamilyID | Description | T. annulata, R-Tan | T. parva, R-Tpa | C. hominis, R-Cho | C. parvum, R-Cpa | P. falciparum, R-Pfa | P. yoelii yoelii, R-Pyo | P. tetraurelia, R-Pte | T. thermophila, R-Tth | P. ramorum, R-Pra | P. sojae, R-Pso | P. tricornutum, R-Ptr | T. pseudonana, R-Tps | A. thaliana, R-Ath | O. sativa, R-Osa | O. tauri, R-Ota | O. lucimarinus, R-Olu | C. reinhardtii, R-Cre | C. merolae, R-Cme | C. elegans, R-Cel | D. melanogaster, R-Dme | H. sapiens, R-Hsa | S. cerevisiae, R-Sce | S. pombe, R-Spo
2 | Protein kinase domain containing protein | 20 7 | 23 6 | 35 3 | 17 3 | 21 9 | 26 8 | 1806 2 | 424 5 | 162 5 | 148 6 | 64 1 | 117 2 | 218 4 | 210 9 | 42 2 | 51 2 | 142 2 | 27 1 | 247 3 | 207 4 | 226 4 | 79 2 | 68 1
1 | Kinesin-related motor protein | 28 4 | 30 4 | 77 1 | 56 1 | 92 3 | 103 4 | 2429 1 | 3284 1 | 211 3 | 220 4 | 60 2 | 132 1 | 186 5 | 142 15 | 93 1 | 104 1 | 137 3 | 27 1 | 353 1 | 401 1 | 331 3 | 71 3 | 42 2
598 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 50 3 | 0 39 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
71 | Heat shock transcription factor | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 68 27 | 3 66 | 11 54 | 12 58 | 49 4 | 54 5 | 24 57 | 25 69 | 0 25 | 1 25 | 2 43 | 3 18 | 1 74 | 1 61 | 4 62 | 3 28 | 2 20
765 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 42 5 | 0 39 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
23 | Mitochondrial carrier protein | 5 19 | 5 18 | 3 19 | 1 16 | 3 23 | 3 23 | 55 34 | 34 35 | 35 31 | 35 35 | 39 6 | 38 10 | 51 35 | 50 47 | 21 6 | 29 6 | 35 17 | 26 2 | 38 38 | 68 12 | 48 23 | 30 6 | 20 6
4 | Serine/Threonine kinase protein | 0 24 | 0 23 | 2 20 | 1 16 | 0 26 | 2 24 | 10 72 | 4 65 | 16 49 | 19 51 | 38 7 | 61 4 | 742 1 | 1342 1 | 3 22 | 9 17 | 56 9 | 0 21 | 37 39 | 54 17 | 58 18 | 0 31 | 0 22
14 | DEAD/DEAH box helicase | 27 5 | 27 5 | 25 4 | 17 3 | 29 6 | 33 6 | 56 33 | 26 43 | 36 30 | 34 36 | 37 8 | 46 7 | 55 31 | 52 45 | 35 3 | 44 3 | 39 15 | 25 3 | 66 18 | 46 18 | 45 25 | 26 9 | 23 5
20 | 26S proteasome ATPase subunit | 16 8 | 16 8 | 14 8 | 13 4 | 16 11 | 18 10 | 60 30 | 28 41 | 29 36 | 32 38 | 36 9 | 46 7 | 57 29 | 54 43 | 30 5 | 31 5 | 40 14 | 26 2 | 41 36 | 37 26 | 32 35 | 21 13 | 20 6
376 | Fucoxanthin chlorophyll protein | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 29 10 | 32 12 | 0 81 | 0 94 | 1 24 | 1 25 | 3 42 | 2 19 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
1370 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 28 11 | 0 39 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
19 | Cell wall protein (lectin / mucin) | 4 20 | 6 17 | 14 8 | 12 5 | 3 23 | 3 23 | 2 80 | 5 64 | 126 7 | 128 7 | 27 12 | 49 6 | 8 73 | 10 84 | 4 21 | 2 24 | 6 39 | 11 10 | 54 25 | 125 5 | 48 23 | 51 4 | 25 4
7 | WD-40 repeat protein | 12 12 | 15 9 | 18 5 | 8 9 | 13 14 | 16 11 | 274 6 | 82 19 | 48 23 | 44 30 | 26 13 | 31 13 | 67 24 | 62 36 | 34 4 | 36 4 | 60 8 | 22 5 | 65 19 | 74 9 | 75 15 | 27 8 | 36 3
27 | DnaJ-class molecular chaperone with C-terminal Zn finger domain | 10 14 | 12 11 | 15 7 | 9 8 | 15 12 | 13 13 | 56 33 | 26 43 | 24 41 | 18 52 | 26 13 | 41 9 | 51 35 | 53 44 | 16 10 | 16 10 | 38 16 | 17 6 | 28 47 | 41 22 | 30 37 | 13 18 | 15 8
32 | ATP binding / ATP-dependent helicase/ DNA binding / helicase/ nucleic acid binding | 9 15 | 9 14 | 12 10 | 11 6 | 9 17 | 12 14 | 32 50 | 14 55 | 24 41 | 20 50 | 25 14 | 25 16 | 40 43 | 36 59 | 18 8 | 19 8 | 22 24 | 12 9 | 26 49 | 26 36 | 30 37 | 17 15 | 20 6
1615 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 25 14 | 0 39 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
1015 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 22 15 | 13 26 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
10 | Protein kinase domain containing protein (MAP kinase / CDC2) | 14 10 | 15 9 | 17 6 | 2 15 | 12 15 | 15 12 | 246 7 | 68 21 | 44 26 | 38 32 | 21 16 | 25 16 | 80 18 | 85 25 | 17 9 | 19 8 | 62 6 | 12 9 | 85 11 | 71 10 | 62 16 | 23 11 | 23 5
38 | Catalytic/ protein phosphatase type 2C | 3 21 | 3 20 | 5 17 | 5 12 | 3 23 | 5 21 | 113 16 | 25 44 | 16 49 | 14 56 | 20 17 | 19 21 | 58 28 | 58 39 | 12 13 | 12 14 | 20 26 | 4 17 | 10 65 | 15 47 | 11 55 | 4 27 | 3 19
42 | Oxidoreductase, short chain dehydrogenase/reductase family protein | 0 24 | 0 23 | 1 21 | 0 17 | 1 25 | 1 25 | 18 64 | 15 54 | 19 46 | 22 48 | 20 17 | 18 22 | 59 27 | 63 35 | 6 19 | 4 22 | 19 27 | 7 14 | 37 39 | 30 32 | 28 38 | 6 25 | 12 10
21 | ABC transporter family protein | 22 6 | 17 7 | 11 11 | 7 10 | 6 20 | 9 17 | 63 29 | 88 17 | 74 15 | 65 17 | 19 18 | 28 15 | 47 38 | 53 44 | 8 17 | 8 18 | 26 23 | 13 8 | 45 33 | 42 21 | 25 41 | 11 20 | 9 13
104 | ABC1 family protein | 2 22 | 2 21 | 2 20 | 2 15 | 1 25 | 2 24 | 6 76 | 3 66 | 2 63 | 3 67 | 19 18 | 21 19 | 17 64 | 17 77 | 19 7 | 21 7 | 14 31 | 11 10 | 2 73 | 3 59 | 4 62 | 2 29 | 3 19
463 | No description | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 18 19 | 41 9 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
11 | Rab family GTPase | 10 14 | 9 14 | 6 16 | 2 15 | 12 15 | 12 14 | 240 8 | 91 15 | 36 30 | 35 35 | 17 20 | 25 16 | 74 21 | 56 41 | 12 13 | 12 14 | 21 25 | 11 10 | 74 16 | 90 8 | 128 7 | 24 10 | 18 7
45 | Cyclophilin type peptidyl-prolyl cis-trans isomerase | 11 13 | 12 11 | 7 15 | 3 14 | 9 17 | 10 16 | 26 56 | 14 55 | 21 44 | 16 54 | 17 20 | 24 17 | 24 57 | 21 73 | 18 8 | 19 8 | 21 25 | 3 18 | 31 44 | 18 44 | 38 30 | 7 24 | 9 13
106 | Peptidyl-prolyl cis-trans isomerase | 0 24 | 0 23 | 2 20 | 1 16 | 2 24 | 1 25 | 8 74 | 7 62 | 11 54 | 8 62 | 17 20 | 24 17 | 13 68 | 16 78 | 4 21 | 8 18 | 10 35 | 5 16 | 10 65 | 7 55 | 15 51 | 4 27 | 3 19
195 | Kinesin light chain protein | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 6 76 | 46 28 | 2 63 | 1 69 | 14 23 | 20 20 | 0 81 | 0 94 | 0 25 | 0 26 | 5 40 | 0 21 | 9 66 | 1 61 | 5 61 | 0 31 | 0 22
160 | Putative prolyl 4-hydroxylase alpha subunit homologue oxidoreductase protein | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 3 62 | 3 67 | 13 24 | 23 18 | 13 68 | 12 82 | 7 18 | 7 19 | 14 31 | 0 21 | 9 66 | 27 35 | 3 63 | 0 31 | 0 22
78 | Regulator of chromosome condensation / Ran GTPase binding / chromatin binding | 3 21 | 1 22 | 2 20 | 1 16 | 2 24 | 3 23 | 30 52 | 23 46 | 15 50 | 15 55 | 10 27 | 23 18 | 23 58 | 23 71 | 10 15 | 14 12 | 7 38 | 3 18 | 3 72 | 10 52 | 15 51 | 1 30 | 2 20
544 | Cyclin | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 10 27 | 42 8 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
344 | Serine/Threonine protein kinase and Signal Transduction Histidine Kinase | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 0 65 | 0 70 | 8 29 | 65 3 | 0 81 | 0 94 | 0 25 | 0 26 | 0 45 | 0 21 | 0 75 | 0 62 | 0 66 | 0 31 | 0 22
36 | Trypsin protease GIP-like / glucanase inhibitor protein | 0 24 | 0 23 | 0 22 | 0 17 | 0 26 | 0 26 | 0 82 | 0 69 | 31 35 | 34 36 | 7 30 | 29 14 | 0 81 | 0 94 | 0 25 | 1 25 | 1 44 | 0 21 | 3 72 | 235 3 | 107 9 | 0 31 | 0 22


Supplementary Table 5 Differences in gene family sizes in P. tricornutum (Ptr) and T. pseudonana (Tps), and comparison with other eukaryotes. The table shows the number of homologous genes (or copy number) in each of the different species for each gene family (denoted as ‘cluster’). The ratio indicates the difference in copy number between P. tricornutum (Ptr) and T. pseudonana (Tps), which are indicated in grey. Comparisons were made with ten other heterokonts/alveolates (P. sojae (Pso), P. ramorum (Pra), Cryptosporidium hominis (Cho), Cryptosporidium parvum (Cpa), Plasmodium falciparum (Pfa), Plasmodium yoelii yoelii (Pyo), Theileria annulata (Tan), Theileria parva (Tpa), Paramecium tetraurelia (Pte), and Tetrahymena thermophila (Tth)), two plants (A. thaliana (Ath) and Oryza sativa (Osa)), three green algae (O. tauri (Ota), O. lucimarinus (Olu) and C. reinhardtii (Cre)), one red alga (C. merolae (Cme)), three metazoa (H. sapiens (Hsa), Caenorhabditis elegans (Cel) and Drosophila melanogaster (Dme)), and two fungi (S. cerevisiae (Sce) and Schizosaccharomyces pombe (Spo)). See separate pdf file for Supplementary Table 5


Supplementary Table 6 Statistics of improved assemblies

Statistic | P. tricornutum | T. pseudonana
Main genome scaffold sequence total | 26.1 Mb | 31.3 Mb
Main genome scaffold total | 33 Mb | 27 Mb
Main genome scaffold N/L50 | 11/945.0 Kb | 7/2.0 Mb
Main genome contig sequence total | 25.8 Mb (1.3% gap) | 31.2 Mb (0.3% gap)
Main genome contig total | 102 | 45
Main genome contig N/L50 | 19/423.4 Kb | 8/1.3 Mb
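The N/L50 entries in Supplementary Table 6 appear to pair a sequence count with a length (for example 11 scaffolds and 945.0 Kb). A minimal Python sketch of how such a pair can be derived from a list of scaffold or contig lengths is given below. The function name `n_l50` and the toy lengths are illustrative assumptions and are not taken from the JGI pipeline; the count/length ordering follows the table as printed.

```python
def n_l50(lengths):
    """Return (count, length) for an assembly.

    count  = smallest number of sequences, taken from longest to shortest,
             whose summed length reaches half of the total assembly length
    length = length of the last sequence added to reach that half
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length


# Hypothetical toy assembly (lengths in Kb), not real diatom scaffolds.
toy_lengths = [400, 300, 200, 100, 50, 50]
count, length = n_l50(toy_lengths)
print(f"N/L50: {count}/{length} Kb")
```

Sorting from longest to shortest ensures that the reported count is the minimum number of sequences needed to cover half of the assembly.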


Supplementary Table 7 Segmental duplications in the P. tricornutum and T. pseudonana genomes

# | Block1 | Block2 | #genes | Avg. aa id, % | Avg. algn. score

P. tricornutum duplications:
1. | bd_18x34:2968-33062 | bd_25x34:2996-33064 | 13 | 100 | 740
2. | bd_8x17:497244725 | chr_27:356508-391886 | 15 | 99 | 733
3. | bd_9x21:4161-26853 | chr_8:3568-26139 | 10 | 95 | 764
4. | chr_13:585-21516 | chr_7:1002176-1023740 | 8 | 97 | 1001
5. | chr_15:785021-814891 | chr_24:380252-419182 | 15 | 100 | 743
6. | chr_16:701999-717257 | chr_16:728621-742816 | 7 | 99 | 892
7. | chr_22:25-28610 | chr_28:354974-383562 | 12 | 99 | 850
8. | chr_24:484902-510135 | chr_29:2579-27122 | 7 | 99 | 437
9. | chr_28:3636-37955 | chr_7:221799-256118 | 13 | 100 | 804
10. | chr_8:895835-921011 | chr_8:921643-933652 | 7 | 99 | 640

T. pseudonana duplications:
1. | bd_33x54:1079-41185 | chr_9:1003484-1042684 | 12 | 100 | 1075
2. | bd_6x46:57250-76268 | chr_19a:1992-17036 | 6 | 96 | 1022
3. | bd_7x47:8069-34652 | chr_8:2392-33918 | 8 | 98 | 967
4. | chr_11a:752111-805007 | chr_22:1794-54690 | 17 | 100 | 928
5. | chr_13:1015630-1051655 | chr_17:589456-625480 | 7 | 99 | 1602


Supplementary Table 8 Genes predicted to be of red algal origin in diatom genomes

Bootstrap support for red-stramenopile clade | Genes predicted to be of red algal origin: in one or both diatoms | Only in P. tricornutum | Only in T. pseudonana | Number of gene products predicted to be plastid-targeted
85% | 171 | 32 | 31 | 74
65% | 258 | 45 | 41 | 102
50% | 326 | 57 | 68 | 114


Supplementary Methods and Notes

Extraction of high molecular weight DNA for genome sequencing

A monoclonal culture of fusiform cells derived from accession Pt1 9 (known as Pt1 8.6 and deposited as CCMP2561 in the Provasoli-Guillard National Center for Culture of Marine Phytoplankton) was grown at 18 °C in a 12 hour photoperiod at an illumination of approximately 180 µmol.m-2.s-1. Around 4 liters of diatom culture in exponential phase were centrifuged at 1,800 g for 15 min at 4 °C. The cell pellet (wet weight around 20 g) was frozen in liquid nitrogen, resuspended in 40 mL of lysis buffer (50 mM Tris-HCl pH 8.0, 10 mM EDTA pH 8.0, 1% SDS, 10 mM DTT, 10 mg/mL of proteinase K) and incubated at 50 °C for 45 minutes. Three phenol/chloroform extractions were performed to remove proteins, followed by an extraction with chloroform/isoamyl alcohol (24:1) to completely eliminate phenol residues. Genomic DNA was precipitated and resuspended in TE (10 mM Tris pH 8.0, 1 mM EDTA). After RNase treatment at 37 °C for 30 minutes, genomic DNA was purified on cesium chloride gradients at 55,000 rpm, 20 °C for 18 hours using the vertical rotor VTi 65.2 (Beckman, USA). DNA concentration was determined in a spectrophotometer at 260 nm and checked on a 0.8% agarose gel. This DNA was used to construct replicate libraries containing inserts of 2-3 Kb, 6-8 Kb, and 35-40 Kb.

Genome sequencing, assembly and annotation

Draft assemblies

The initial draft assembly of the T. pseudonana genome was described in Armbrust et al. 10 and served as a starting point for improving this genome (see below). For the draft assembly of the P. tricornutum genome, approximately 556,000 reads involving 564 Mb of sequence were trimmed, filtered for short reads, and assembled using the JGI JAZZ assembler into 31.1 Mb of scaffold sequence with 3.5 Mb (11.2%) of gaps after excluding redundant and short scaffolds. Based on BLAST screening these scaffolds included 44 Kb of mitochondrion (no gaps) and 117 Kb of chloroplast (no gaps). Based on the number of alignments per read, the main genome scaffolds were at a depth of 10.4 +/- 0.07. While the fraction of gaps in the main genome scaffolds was very high, the amount of sequence in the unplaced reads was 53.9 Mb, which would be sufficient to cover the gaps to a mean depth of 15.4.

Genome improvement methods

Based on the JAZZ whole genome shotgun assemblies of the two genomes, all low quality areas and gaps were identified and converted into targets to address in the improvement process. These targets consisted of a high quality consensus sequence on either side, along with any read entering the target area and any subclone pair that captured the target region. These targets were then built using the JGI Phred/Phrap/Consed pipeline 11 and entered manual finishing. Following inspection of the assembled sequences, finishing was performed by resequencing plasmid subclones and by walking on plasmid subclones or fosmids using custom primers. All finishing reactions were performed with 4:1 BigDye to dGTP BigDye terminator chemistry (Applied Biosystems). Repeats in the sequence were resolved by transposon-hopping 8 Kb plasmid clones. Fosmid clones were shotgun sequenced and finished to fill large gaps, to resolve large repeats, and to extend into chromosome telomere regions where possible. The resulting improved assemblies consist of these finished targets integrated back into the draft whole genome assembly and are a mosaic of the two haplotypes. The P. tricornutum genome was assembled into 33 large scaffolds ranging from 2.54 Mb to 88 kilobases (Kb), twelve of which contain telomeric repeats (CCCTAA) at both ends (see Supplementary Fig. 2). The T. pseudonana genome was assembled into 24 chromosomes ranging from 3.04 Mb to 297 Kb, eight of which contain telomeres at both ends. In an attempt to remove redundancy caused by the polymorphic dataset, small scaffolds that were not linked into the larger chromosome scale scaffolds were excluded unless the scaffold contained an EST that was not already represented in the main genome. For each genome the small unlinked scaffolds that were included were binned into unmapped files that were kept separate from the main genome releases. Statistics for the improved assemblies of each genome are shown in Supplementary Table 6.

Annotation methods

The genomes of P. tricornutum and T. pseudonana were annotated using the JGI annotation pipeline, which combines several gene prediction, annotation and analysis tools. Gene predictors used for these annotations included ab initio Fgenesh 12 trained on sets of known genes and reliable homology-based gene models from P. tricornutum and T. pseudonana, and homology-based Fgenesh+ 12 and Genewise 13 seeded by BLASTX alignments against sequences in the NCBI non-redundant protein set. Genewise models were extended into complete gene models to include start and stop codons based on assemblies. ESTs available for each of the genomes were clustered and either converted into putative full-length (FL) genes and directly mapped to genomic sequence or used to extend predicted gene models into FL genes by adding 5’ and/or 3’ UTRs to the models. Because multiple gene models were generated for each locus, a single representative model was chosen based on homology and EST support and used for further analysis.

All predicted gene models were annotated for protein function using InterProScan 14 and hardware-accelerated double-affine Smith-Waterman alignments (www.timelogic.com) against SwissProt (www.expasy.org/sprot) and other specialized databases such as KEGG 15. Finally, KEGG matches were used to map EC numbers (http://www.expasy.org/enzyme/) and Interpro hits were used to map GO terms 16. In addition predicted proteins were annotated according to KOG 17 classifications.

Improved chromosomes and unmapped scaffolds were annotated separately for each genome. In total 10,402 and 11,776 gene models were predicted for P. tricornutum and T. pseudonana, respectively. Characteristics of gene models for draft and improved sequences are summarized in Supplementary Table 1. 86% of P. tricornutum genes and 64% of T. pseudonana genes are supported by ESTs (see below). 60-65% show homology to proteins in SwissProt. Manually curated genes from the earlier release of the T. pseudonana draft assembly 10 were mapped forward to the new assembly.

Generation of P. tricornutum ESTs

cDNA libraries were constructed from poly(A)+ RNA using the CloneMiner cDNA library construction kit (Invitrogen) following the supplier’s instructions. Sixteen different conditions were used to maximize the detection of genes expressed with a specific condition-enriched profile (see http://www.biologie.ens.fr/diatomics/EST3).

Sequencing was performed mostly from the 5' end of the inserts. For some of the libraries, an attempt was also made to sequence clones at both the 5' and 3' ends. When both EST reads overlapped, the two sequences were fused into a consensus sequence using PHRAP. A two-step strategy was used to align the P. tricornutum cDNA sequences on the genomic sequence, independently of whether gene models had been predicted or not for a particular sequence 18,19. Preliminary transcript models were created based on the alignments of the 5' and 3' repeat-masked EST reads derived from the cDNA clones and the P. tricornutum genome assembly. The high scoring pairs (HSPs) obtained by BLAST comparisons (W=20, X=8) 20 were combined in a coherent manner, consistent with their position on the reference genome sequence. In this way, one or several models were built for each transcript, composed of one or several tentative exons based on the alignment with the genome sequence. The model with the highest total score defined by the sum of the scores of each HSP (total score for P. tricornutum cDNA resource = 800; total score for P. tricornutum reads = 400) was selected as the preliminary transcript model that underwent further analysis. cDNA clones with discrepant alignments of their 5' and 3' sequences on the genome were considered to be putative chimeras and were excluded from the analysis.

The unmasked regions of preliminary transcript models were extended by 5 Kb of genomic sequence on each end and realigned with the cDNA clones using the est2genome algorithm (-mismatch 3 -gap_penalty 6 -align 1 -space 500) 21. These transcript models were fused into gene models by a single linkage clustering approach, in which transcript models from the same genomic region and the same strand sharing at least 100 bp were merged into a single model.
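The single linkage step can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the tuple layout, the function names, and the reading of "sharing at least 100 bp" as overlap between the genomic spans of two models (rather than between their exons) are assumptions made for the example.

```python
from collections import defaultdict


def overlap_bp(a, b):
    """Bases shared by two (start, end) intervals; end is exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def cluster_transcripts(models, min_shared=100):
    """Single linkage clustering of transcript models.

    models: list of (scaffold, strand, start, end) tuples (hypothetical layout).
    Returns a list of sets of model indices, one set per merged gene model.
    """
    parent = list(range(len(models)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            a, b = models[i], models[j]
            same_locus = a[0] == b[0] and a[1] == b[1]
            if same_locus and overlap_bp(a[2:], b[2:]) >= min_shared:
                parent[find(i)] = find(j)  # link the two clusters

    groups = defaultdict(set)
    for i in range(len(models)):
        groups[find(i)].add(i)
    return sorted(groups.values(), key=min)
```

Because linkage is single, two models that do not overlap each other directly still end up in the same gene model when a third model overlaps both by at least 100 bp.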

A total of 132,088 EST sequences were generated, corresponding to a non-redundant set of 12,370 sequences. Of these, 8,944 corresponded to gene models assigned by the JGI annotation pipeline. Conversely, 1,458 gene models were not supported by ESTs whereas 3,426 non-redundant ESTs (1,330 contigs and 2,096 singletons) did not have corresponding gene models. A total of 612 ESTs (336 in the non-redundant set) did not align with the final genome sequence.

Analysis of molecular divergence

Genome duplications

Several methods were used to search for within-genome duplications and to examine colinearity between the P. tricornutum and T. pseudonana genomes. First, segmental duplications were located using DAGchainer software 22 based on separate all-against-all BLASTP analysis for the P. tricornutum and T. pseudonana filtered gene sets. A total of ten and five duplications were found in P. tricornutum and T. pseudonana, respectively, involving 6-17 genes and between 20 and 50 Kb (Supplementary Table 7). Excluding those in the unmapped files, there are just seven in P. tricornutum and two in T. pseudonana. In P. tricornutum, one represents a tandem duplication on scaffold_8. Because percent identity between the corresponding gene pairs is high (99-100%), and because the duplications are mainly located at the ends of scaffolds, they may represent artefacts of assembly rather than genuine duplications. Second, to identify colinear regions between both diatom genomes, the i-ADHoRe software 23 was used. Homology relationships between genes, which serve as input for the i-ADHoRe algorithm, were determined by extracting reciprocal best-hit pairs from the all-versus-all BLASTP result of the predicted proteins (E-value cutoff 1e-5). The following parameters were used in the i-ADHoRe analysis: gap size of 20 genes, Q value of 0.8, and a minimum of three homologs to define a colinear region.
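The reciprocal best-hit extraction that provides the homology input can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout of the BLASTP hits, the use of bit score to rank hits, and the function names are assumptions; only the E-value cutoff is taken from the text.

```python
def best_hits(hits, max_evalue=1e-5):
    """Keep, for each query, the subject with the highest bit score.

    hits: iterable of (query, subject, evalue, bitscore) tuples
          (hypothetical layout for parsed BLASTP output).
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # fails the E-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A gene whose best hit points back to a different gene in the first genome is dropped, which is what makes the criterion conservative.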
In addition, the genome coordinates of reciprocal best-hit pairs were used to generate an Oxford plot (Supplementary Fig. 3A). The results from these two methods were highly similar.

Orthologous pairs

Orthologous gene pairs examined in Fig. 1 were identified based on reciprocal best hits. Amino acid identities were calculated from Smith-Waterman alignments. Each distribution was computed from the pairwise alignments. Genes were regarded as putative orthologs in pairwise comparisons if their products were reciprocal best hits with at least 40% similarity in sequence and if their sequences were less than 30% different in length, as in Dujon 24. Sequences were downloaded from www.yeastgenome.org for S. cerevisiae, http://cbi.labri.fr/Genolevures/index.php for C. glabrata, K. lactis, and D. hansenii, www.candidagenome.org for C. albicans, and www.jgi.doe.gov for P. sojae. C. intestinalis, T. rubripes and H. sapiens sequences were downloaded from ENSEMBL (www.ensembl.org).

Introns

Intron loss/gain analysis was based on orthologous gene models from P. tricornutum, T. pseudonana and P. sojae (the closest outgroup to the diatoms with a published sequenced genome). Orthologous gene models were selected as triplets, each pair of which formed best bidirectional BLASTP hits. Shared intron positions were detected based on pairwise Smith-Waterman alignments for each pair of genes within an orthologous triplet.
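As an illustration of how shared intron positions can be read off a pairwise protein alignment, the sketch below maps each intron to an alignment column and intersects the columns. It is not the authors' code: representing an intron by the 0-based index of the residue it precedes, ignoring intron phase, and the function names are all assumptions made for this example.

```python
def residue_columns(aligned_seq, gap="-"):
    """Alignment column of each residue of a gapped sequence, in order."""
    return [col for col, ch in enumerate(aligned_seq) if ch != gap]


def shared_intron_columns(aln_a, introns_a, aln_b, introns_b):
    """Alignment columns at which both orthologs carry an intron.

    aln_a, aln_b : the two rows of a pairwise alignment (equal length)
    introns_a/b  : 0-based residue indices marking intron positions
                   (hypothetical encoding, phase ignored)
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have the same length")
    cols_a = residue_columns(aln_a)
    cols_b = residue_columns(aln_b)
    in_a = {cols_a[i] for i in introns_a}
    in_b = {cols_b[i] for i in introns_b}
    return sorted(in_a & in_b)
```

Working in alignment columns rather than raw residue indices is what lets an intron count as conserved even when insertions or deletions shift its coordinate in one of the two proteins.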

P. tricornutum genes contain fewer than half the number of introns found in T. pseudonana genes (Table 1). The 10,402 P. tricornutum and 11,776 T. pseudonana gene models contain, respectively, 8,169 and 17,880 introns, with averages of 0.79 and 1.52 introns per gene. Intron length is similar in both diatoms (123-135 bp on average, with a median of 90-92 bp). Furthermore, approximately two thirds of intron positions are unique to each species. For example, from 2,253 pairs of intron-containing orthologs (containing a total of 3,406 introns) only 1,183 introns (35%) are in the same position in 1,018 orthologs. Just 256 orthologous pairs have all intron positions conserved, and the majority of them (228 pairs) contain only a single intron.

Examination of transposon content

We used the RepeatMasker programme 25 (http://www.repeatmasker.org) to screen the P. tricornutum genome for the presence of transposable elements. RepeatMasker identified 391 Kb of transposable elements among 815 Kb of repetitive sequences. We then manually screened the P. tricornutum and T. pseudonana genomes for the presence of LTR retrotransposons (LTR-RTs). All scaffolds were 6-frame translated and open reading frames (ORFs) larger than 1,000 amino acids were submitted to InterProScan. We then looked for the presence of long terminal repeats flanking the ORFs displaying LTR-RT specific domains. We found that three distinct lineages of Ty1/copia-like elements are present in both diatom genomes (CoDiI, CoDiII and CoDiIII for Copia from Diatoms), and that the CoDiI and CoDiII lineages are only distantly related to the Copia lineage, which contains elements from plants, animals and fungi, and to the Ty1 clade, which contains elements from fungi only (Supplementary Fig. 4). Elements from the CoDiIII lineage fall into the Copia cluster. We also identified a group of Ty3/gypsy-like elements in the T. pseudonana genome (called GyDi elements) that is absent from the P. tricornutum genome.
CoDiI, CoDiII, CoDiIII, and GyDi elements were also found in the 2x sequence of the genome of the pennate diatom Pseudo-nitzschia multiseries (M.S. Parker, personal communication). In addition, a GyDi element was found in an EST library derived from Pseudo-nitzschia multistriata (Alexander Luedeking, personal communication), and a CoDiII element was identified in an EST library from Fragilariopsis cylindrus (Andreas Krell, personal communication). These de novo identified LTR-RTs were used as probes to run the RepeatMasker programme on both diatom genomes, and we found that LTR-RTs represent 1,505 Kb (5.8%) and 337 Kb (1.1%) of the P. tricornutum and T. pseudonana genomes, respectively.

Search for genes of red algal origin

Protein sequences predicted from complete genome sequences were downloaded from COGENT 26 and from JGI (http://genome.jgi-psf.org/). They included 6 metazoans, 10 fungi, 2 plants, 3 green algae, 3 apicomplexans, 7 cyanobacteria, 5 proteobacteria, the oomycetes P. sojae and P. ramorum, and the red alga C. merolae. Candidate homologs from the T. pseudonana and P. tricornutum genomes were identified by running PSI-BLAST 27 in three iterations. A conservative selection of putative homologs was then performed from the PSI-BLAST hits for each query based on the following criteria: a reciprocal coverage of > 50% and a minimum of 30% amino acid identity in the aligned region. Multiple alignments were calculated using kalign 28, and neighbour joining bootstrap trees with Scoredist distances 29 were calculated using a modified version of QuickTree 30. The trees were rooted at a randomly selected terminal node that was neither a stramenopile (i.e., an oomycete or a diatom) nor a red algal protein. Then the root node of the smallest clade that included the query sequence and at least one non-stramenopile sequence was located. If the sister clade to the stramenopile clade (including the query) exclusively contained one or more red algal sequences, the tree was classified as indicating red algal affinity. For each query grouping with a red algal sister clade, the annotation was checked by an expert annotator, and PhyloGena 31 was used to verify that the protein grouped with proteins known to have the same function. Putative plastid-targeted proteins were identified based on the functional annotation, and the gene models were extended where necessary to reveal the N-terminal targeting sequences, which are sometimes missed by automatic gene prediction.
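The sister-clade test used to call red algal affinity can be sketched on an already rooted tree. The nested-tuple tree encoding, the group labels and the function names below are assumptions made for illustration; rooting, bootstrap support and the manual annotation check are left out.

```python
def leaves(node):
    """Leaf names under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def has_red_algal_affinity(tree, query, group):
    """True if the sister of the query's stramenopile clade is purely red algal.

    tree  : rooted tree as nested tuples with unique leaf names
    query : leaf name of the query protein (assumed to be a stramenopile)
    group : dict mapping each leaf name to 'stramenopile', 'red' or 'other'
    """
    node = tree
    while True:
        # Index of the child subtree that holds the query.
        idx = next(i for i, c in enumerate(node) if query in leaves(c))
        child = node[idx]
        # Stop once that child holds stramenopiles only: `node` is then the
        # smallest clade with the query and at least one other lineage.
        if all(group[name] == "stramenopile" for name in leaves(child)):
            break
        node = child
    sister = [name for i, c in enumerate(node) if i != idx for name in leaves(c)]
    return bool(sister) and all(group[name] == "red" for name in sister)
```

In this sketch a single non-red sequence in the sister clade is enough to reject the tree, mirroring the "exclusively contained" wording of the criterion.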

The number of diatom sequences identified as being of putative red algal origin depended on the bootstrap support for the node grouping the stramenopile clade with the red algal sister clade, as shown in Supplementary Table 8. One hundred and seventy one genes were classified as being of red algal origin based on strong (>85%) bootstrap support for the red alga plus heterokont clade, and a larger number could be identified if the level of stringency was reduced. All the gene models with their links to the JGI Eukaryotic Genomes Portal are given in Supplementary Table 2.

Detection of horizontal gene transfer events from prokaryotes

Phylogenetic trees were created using APIS (Automated Phylogenetic Inference System; Badger, unpublished). APIS is a system for automatic creation and summarizing of phylogenetic trees for each protein encoded by a genome. It is implemented as a series of Ruby scripts, and the results are viewable on an internal web server that allows the user to explore data and results in an interactive manner. The homologs used by APIS for each phylogenetic tree are obtained by using WU-BLAST 32 to compare query proteins against an extended version of ComboDB (Wu, unpublished) that contains taxonomic, genomic, protein, and coding DNA information for 46 eukaryotic, 52 archaeal, 687 bacterial, and 1928 viral complete (or nearly complete) genomes (as of June 1st, 2008). The full-length sequences of these homologs are retrieved from the database and aligned using MUSCLE 33, and bootstrapped neighbor-joining trees are produced using QuickTree 30. The inferred tree is then midpoint rooted prior to analysis, allowing automatic determination of the taxonomic classification of the organisms with proteins in the same clade as the query protein.

Scripts were written in the Ruby programming language to identify trees that contain clades in which input sequences (such as ‘heterokont’ sequences) cluster with sequences from members of a particular target group (such as ‘Bacteria’), without including sequences from any other taxonomic groups. If the ingroup and target group lay on opposite sides of the tree root, the tree was rerooted using a member of the target group as an outgroup, thus forcing the ingroup and the remaining members of the target group to be on the same side of the tree root. In either case the bootstrap value of the node connecting the ingroup to the target was noted in order to identify particularly robust groupings. The 587 P. tricornutum genes that clustered with bacteria-only clades or outgrouped with clades that contained only bacterial genes are shown in Supplementary Table 3, and each tree is available in a supplementary file. Another 200 sequences failed our alignment criteria for automated tree generation (less than 50% amino acid coverage or E-value > 1e-9) but had only bacterial genes in the BLAST output (using an E-value cutoff of < 1e-5).

We chose to explore in more detail a particularly interesting phylogenetic tree generated by APIS, which indicated a bacterial version of nitrite reductase (nirB) in the nuclear genomes of stramenopiles (Fig. 2). To do so, we created a phylogenetically representative subset of the alignment generated by APIS and created a maximum likelihood tree using PhyML 34 with the WAG matrix and gamma-distributed rates (α = 0.80). We chose this model by analyzing our alignment with ProtTest 35, which finds the best-fit evolutionary model (among a set of supplied models) for a given protein alignment.

Gene family analysis

To construct the dataset for the gene family clustering presented in Fig. 4, the predicted protein sequences of the two diatoms, ten other heterokonts/alveolates (P. sojae (JGI, v1.1), P. ramorum (JGI, v1.1), Cryptosporidium hominis (VCU), Cryptosporidium parvum (Cryptodb), Plasmodium falciparum (Plasmodb), Plasmodium yoelii yoelii (Plasmodb), Theileria annulata (Sanger), Theileria parva (TIGR), Paramecium tetraurelia (Genoscope), and Tetrahymena thermophila (TIGR)), two plants (A. thaliana (TIGR, Release 5) and Oryza sativa (TIGR, Release 3)), three green algae (O. tauri 36, O. lucimarinus 37 and C. reinhardtii (JGI, v3.1)), one red alga (C. merolae (http://merolae.biol.s.u-tokyo.ac.jp/, Release Apr 8 2004)), three metazoa (H. sapiens (Ensembl, Release 35), Caenorhabditis elegans (Ensembl, Release 31.140) and Drosophila melanogaster (Ensembl, Release 31.3e)) and two fungi (S. cerevisiae (http://www.yeastgenome.org) and Schizosaccharomyces pombe (Sanger)) were downloaded. To delineate gene families, a similarity search was performed (all-against-all BLASTP; E-value cutoff 1e-5) with all proteins from the above 23 species. Subsequently, all proteins were grouped into gene families by applying the Markov clustering (MCL) algorithm to the BLASTP results using MCLBLASTLINE (http://micans.org/mcl/) 38 (Inflation Factor: 2.0).
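To make the role of the inflation factor concrete, here is a toy dense-matrix version of the MCL iteration: expansion by matrix squaring, then inflation by element-wise powering and column renormalisation. It is only a sketch for very small graphs and is not the mcl program or MCLBLASTLINE; the function names, the added self-loops, the fixed iteration count and the way clusters are read from the rows are assumptions of this example.

```python
def normalise_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, iterations=30, eps=1e-6):
    """Toy Markov clustering on a symmetric similarity matrix.

    weights: square list of lists of non-negative edge weights
             (for example, scores derived from BLASTP hits).
    Returns clusters as sorted lists of node indices.
    """
    n = len(weights)
    # Self-loops are added before normalising (an assumption of this sketch).
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: let flow spread along two-step paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = normalise_columns([[v ** inflation for v in row] for row in m])
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j, v in enumerate(row) if v > eps) for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)
```

A larger inflation value sharpens the contrast between strong and weak flows, which generally leads to more, smaller families.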

To generate Supplementary Fig. 7, the mean gene family size and the standard deviation were calculated for all gene families (orphans and species-specific families excluded) present in at least one diatom. The matrix of these profiles was transformed into a matrix of z-scores to center and normalize the data. The families with the 5% greatest z-scores in P. tricornutum and/or T. pseudonana were extracted, yielding 101 families. Subsequently, these profiles were hierarchically clustered (complete linkage clustering) using the Pearson correlation as a distance measure. The clustering and visualization were done using Genesis 39.
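The two numerical steps in this procedure, the z-score transformation and the Pearson-based distance, can be sketched as follows. This is an illustration under stated assumptions rather than the Genesis implementation: z-scores are taken per family across species, the population standard deviation is used, distance is defined as 1 minus the correlation, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def zscores(profile):
    """Center and scale one gene family profile (copy number per species)."""
    mu = mean(profile)
    sd = pstdev(profile)  # population SD (an assumption of this sketch)
    return [0.0 if sd == 0 else (x - mu) / sd for x in profile]


def pearson_distance(x, y):
    """Distance between two profiles defined as 1 - Pearson correlation."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # correlation undefined for a flat profile
    return 1.0 - cov / (var_x * var_y) ** 0.5
```

With this definition, families whose copy numbers rise and fall together across species have a distance near 0, while families with opposite trends have a distance near 2.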

Several gene families have undergone major expansions in diatoms compared with most or all other eukaryotes examined (Supplementary Fig. 7, Supplementary Tables 4 and 5). The more complex structure of heterokont phototrophic cells compared to primary algae might explain some of these features because of the additional need for membrane transporters. In addition, metabolites that are synthesised within the plastids of primary algae need to be exported from or imported into secondary plastids (e.g., plastidic starch vs vacuolar chrysolaminarin or plastidic vs cytosolic pyrimidine biosynthesis). Furthermore, the mitochondrial carrier protein expansion appears to be specific to free-living unicellular eukaryotes, whereas the trypsin protease expansion appears to be an essentially heterokont feature.

Comparison of copy numbers in gene families from the two diatoms showed that overall gene family sizes are highly similar. In total, 156 gene families show ≥2-fold differences in copy number between the two species, 42 of which are expanded in P. tricornutum and 114 in T. pseudonana (Supplementary Fig. 6). When comparing gene copy numbers in both diatoms, the fraction of inparalogs was estimated by counting the number of BLAST hits within the query species that were more similar than the first hit from the other species. For these expanded gene families, the fraction of inparalogous genes indicates that at least 70% of the duplications were created through species-specific events. This was also confirmed through (partial) phylogenetic tree construction (data not shown).

Two component signalling systems

Bipartite response regulator proteins are typically involved in two-component signal transduction systems (TCS) in bacteria, as well as in certain eukaryotes, and function to detect and respond to environmental stimuli. These systems have been found to be important during host invasion, drug resistance, motility, phosphate uptake, osmoregulation, and nitrogen fixation, amongst others 40. TCS typically consist of a histidine protein kinase (HK) sensor that phosphorylates the receiver domain of a response regulator protein (RR). Phosphorylation induces a conformational change in the response regulator, which activates an effector domain and triggers the cellular response. The domains of two-component proteins are highly modular, but the core structures and activities are maintained. We previously identified the presence of this gene family in T. pseudonana 41, but the variability compared to P. tricornutum prompted us to examine these systems further. From the observed diversity of domain combinations in diatom TCS-encoding components (Fig. 3), it seems likely that diatoms have exploited this family as a major means of stimulus-response coupling.

P. tricornutum contains a wide range of TCS proteins organized in various domain associations (Fig. 3), some of which are not shared between the two diatoms. P. tricornutum contains 14 RR-containing proteins, 11 of which are in a hybrid conformation with an HK domain, and 3 others are associated with DNA binding or transcriptional regulator domains. The histidine phosphotransmitter protein (HPT), which typically functions as a phosphate carrier from the HK to the RR, is absent in both P. tricornutum and T. pseudonana, suggesting direct transmission from the receptor protein to the effector domain.

In other eukaryotic unicellular algae the distribution of the TCS family is not homogeneous. The marine prasinophyte algae O. tauri and O. lucimarinus contain only two HK and 7 RR domains that are sometimes associated with other binding or activity domains. The HPT protein has been clearly identified in these two genomes as well as in C. reinhardtii. This latter green microalga contains around 18 RR domain-containing proteins, some of which also contain an HK domain. It is curious that organisms containing HPTs have significant numbers of proteins with RR domains that are not associated with other domains, whereas in diatoms each RR domain is associated with a binding or activity domain. C. merolae has only two RR-containing proteins encoded in the chloroplast and one HK encoded in the nucleus (no HPT proteins have been found).

The variety of modular combinations in diatoms is very different from that in other algae, which could suggest that some of these genes have been acquired more recently during evolution through lateral gene transfer from prokaryotes. Analysis of the TCS gene family of P. tricornutum shows that all the HKs contain at least one RR domain (Fig. 3). The hybrid proteins, in addition to containing an HK domain and an RR domain, show a variety of additional N-terminal domains, most commonly PAS or GAF domains. One of these proteins has been proposed to be a phytochrome 41,42. These PAS domain family proteins also contain the more thoroughly characterized LOV domain, known to be involved in light perception in many organisms either as a component of phototropin proteins (A. thaliana) or other blue light photoreceptors (cyanobacteria) 43. This raises the hypothesis that diatoms may have significantly more photoreceptors beyond the canonical cryptochrome and phytochrome groups that have been identified thus far 41,42. This is further reinforced by the discovery of a heterokont photoreceptor denoted Aureochrome 44 that can also be found in T. pseudonana (3 copies) and in P. tricornutum (3-4 copies). Furthermore, P. tricornutum contains orthologs of LovK, a PAS domain-containing histidine kinase that was recently found to regulate light-dependent attachment to substrata in bacteria 45, and other light-dependent histidine kinases that have been reported in bacteria 43. The fact that T. pseudonana does not contain any LovK orthologs is consistent with its pelagic lifestyle. The various combinations of PAS domains with TCS domains may therefore prove to be an effective source of new photoreceptors.

One of the most unusual genes that contain HK and RR domains in P. tricornutum is a CHASE domain-containing protein (Fig. 3). A protein that has a very similar combination of domains in A. thaliana and Z. mays is the cytokinin receptor (Cre) 46. This can also be found in an Ectocarpus siliculosus virus, although it is not found in T. pseudonana. In P. tricornutum a CHASE domain was also found associated with a guanylate cyclase, although again not in T. pseudonana nor in other sequenced unicellular eukaryotic algae. Two additional proteins in P. tricornutum contain RR domains associated with a Lux-R domain, until now only described in bacteria. Most Lux-R-type regulators act as transcriptional activators in the regulation of quorum sensing, but some can be repressors or have a dual role 47. T. pseudonana contains one such example of this domain combination.

Heat shock transcription factor content

Heat shock transcription factors (HSFs) are typically involved in stress responses in eukaryotes. There is a wide variability in the numbers of HSF genes in different species. S. cerevisiae and Drosophila contain a single HSF 48,49. Among vertebrates, three HSF genes are ubiquitous, whereas an additional HSF gene has been characterized in chicken 50, Xenopus and Platypus (data not shown). Information on HSF gene number in algae is very limited. There are one, two and three HSFs in Ostreococcus spp., C. reinhardtii and C. merolae, respectively. Unlike other eukaryotes, plants harbour multiple HSF genes with typically more than 20 members 51. However, even in plants HSFs constitute only a small fraction of the total number of transcription factors (almost 1,700 in Arabidopsis 52). Compared to plants, and especially to other organisms including algae, the number of HSFs is much higher in P. tricornutum and T. pseudonana. Indeed, as many as 69 and 89 putative HSF genes could be identified in the genomes of P. tricornutum and T. pseudonana, respectively. This represents approximately half of all the genes encoding transcription factors in these genomes.

We studied the phylogenetic relationships of HSFs in P. tricornutum and T. pseudonana with other species based on the amino acid sequence comparison of their characteristic domain, the winged helix-turn-helix DNA-binding domain (DBD). Four major groups of P. tricornutum and T. pseudonana HSFs could be identified (Supplementary Fig. 8). Comparative phylogenetic analysis with representatives of Phytophthora, plants, algae, fungi and animals indicated that seven P. tricornutum and seven T. pseudonana HSFs, which constitute P. tricornutum, T. pseudonana and P. tricornutum/T. pseudonana Subgroups 1.1, 1.4 and 1.2, respectively (Supplementary Fig. 8), are phylogenetically similar to those of other species. These ‘traditional’ HSFs are best resolved with Phytophthora and alveolate (Tetrahymena and Paramecium) HSFs (Supplementary Fig. 8). The remaining diatom HSFs appear to derive from diatom-specific expansions, and in several cases clear orthologs (coloured in Supplementary Fig. 8) can be found in the two genomes.

It is known that the activation of HSFs involves their trimerization, which requires the heptad domain repeat structure (reviewed in 53). All 14 traditional diatom HSFs were found to display this structure: two contain one heptad repeat, two have 2 separated repeats (as do the A and C HSF classes in plants), and ten have 3 or 4 repeats, like the class B HSFs in plants and other non-plant HSFs. The remaining, non-traditional diatom HSFs do not have heptad repeats. Similarly, we found that 31 of 35 Phytophthora HSFs and all three C. merolae HSFs did not display heptad repeats. The lack of heptad repeats in the majority of heterokont and C. merolae HSFs may indicate that trimerization is triggered by other factors, or that it is not necessary for their DNA-binding activity.

References for Supplementary Materials

1. Sims, P. A., Mann, D. G. & Medlin, L. K. Evolution of the diatoms: Insights from fossil, biological and molecular data. Phycologia 45, 361-402 (2006).

2. Pavlícek, A. et al. Similar integration but different stability of Alus and LINEs in the human genome. Gene 276, 39-45 (2001).

3. Pavlícek, A., Paces, J., Clay, O. & Bernardi, G. A compact view of isochores in the draft human genome sequence. FEBS Lett 511, 165-169 (2002).

4. Paces, J. et al. Representing GC variation along eukaryotic chromosomes. Gene 333, 135-141 (2004).

5. Kumar, S., Tamura, K. & Nei, M. MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Briefings in Bioinformatics 5, 150-163 (2004).

6. Scala, S., Carels, N., Falciatore, A., Chiusano, M. L. & Bowler, C. Genome properties of the diatom Phaeodactylum tricornutum. Plant Physiology 129, 993-1002 (2002).

7. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-14868 (1998).

8. Saldanha, A. J. Java Treeview--extensible visualization of microarray data. Bioinformatics 20, 3246-3248 (2004).

9. De Martino, A., Meichenin, A., Shi, J., Pan, K. & Bowler, C. Genetic and phenotypic characterisation of Phaeodactylum tricornutum (Bacillariophyceae) accessions. J. Phycol. 43, 992-1009 (2007).

10. Armbrust, E. V. et al. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism. Science 306, 79-86 (2004).

11. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res 8, 195-202 (1998).

12. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516-22 (2000).

13. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res 10, 547-8 (2000).

14. Zdobnov, E. M. & Apweiler, R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001).

15. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. & Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277-80 (2004).

16. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).

17. Koonin, E. V. et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biology 5, R7 (2004).

18. Castelli, V. et al. Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res 14, 406-13 (2004).

19. Porcel, B. M. et al. Numerous novel annotations of the human genome sequence supported by a 5'-end-enriched cDNA collection. Genome Res. 14, 463-71 (2004).

20. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403-10 (1990).

21. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8, 967-74 (1998).

22. Haas, B. J., Delcher, A. L., Wortman, J. R. & Salzberg, S. L. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643-6 (2004).

23. Simillion, C., Vandepoele, K., Saeys, Y. & Van de Peer, Y. Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 14, 1095-1106 (2004).

24. Dujon, B. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet 22, 375-87 (2006).

25. Smit, A. F. A., Hubley, R. & Green, P. (1996-2004).

26. Janssen, P. J. et al. COGENT: a flexible data environment for computational genomics. Bioinformatics 19, 1451-1452 (2003).

27. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402 (1997).

28. Lassmann, T. & Sonnhammer, E. L. L. Kalign-an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6, 298 (2005).

29. Sonnhammer, E. L. L. & Hollich, V. Scoredist: A simple and robust protein sequence distance estimator. BMC Bioinformatics 6, 108 (2005).

30. Howe, K., Bateman, A. & Durbin, R. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18, 1546-1547 (2002).

31. Hanekamp, K., Bohnebeck, U., Beszteri, B. & Valentin, K. PhyloGena--a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics 23, 793-801 (2007).

32. Gish, W. (Distributed by the author, 2004).

33. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792-1797 (2004).

34. Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52, 696-704 (2003).

35. Abascal, F., Zardoya, R. & Posada, D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104-2105 (2005).

36. Derelle, E. et al. Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc Natl Acad Sci U S A 103, 11647-52 (2006).

37. Palenik, B. et al. The tiny eukaryote Ostreococcus provides genomic insights into the paradox of plankton speciation. Proc. Natl. Acad. Sci. USA 104, 7705-7710 (2007).

38. van Dongen, S. (University of Utrecht, Utrecht, 2000).

39. Sturn, A., Quackenbush, J. & Trajanoski, Z. Genesis: Cluster analysis of microarray data. Bioinformatics 18, 207-208 (2002).

40. Stock, A. M., Robinson, V. L. & Goudreau, P. N. Two-component signal transduction. Annu Rev Biochem. 69, 183-215 (2000).

41. Montsant, A. et al. Identification and comparative genomic analysis of signaling and regulatory components in the diatom Thalassiosira pseudonana. J. Phycol. 43, 585-603 (2007).

42. Falciatore, A. & Bowler, C. The evolution and function of blue and red light photoreceptors. Curr Top Dev Biol 68, 317-50 (2005).

43. Swartz, T. E. et al. Blue-light-activated histidine kinases: two-component sensors in bacteria. Science 317, 1090-1093 (2007).

44. Takahashi, F. et al. AUREOCHROME, a photoreceptor required for photomorphogenesis in stramenopiles. Proc Natl Acad Sci U S A. 104, 19625-19630 (2007).

45. Purcell, E. B., Siegal-Gaskins, D., Rawling, D. C., Fiebig, A. & Crosson, S. A photosensory two-component system regulates bacterial cell attachment. Proc Natl Acad Sci U S A. 104, 18241-18246 (2007).

46. Sheen, J. Phosphorelay and transcription control in cytokinin signal transduction. Science 296, 1650-1652 (2002).

47. Miller, M. B. & Bassler, B. L. Quorum sensing in bacteria. Annu Rev Microbiol. 55, 165-199 (2001).

48. Sorger, P. K. & Pelham, H. R. Yeast heat shock factor is an essential DNA-binding protein that exhibits temperature-dependent phosphorylation. Cell 54, 855-864 (1988).

49. Clos, J. et al. Molecular cloning and expression of a hexameric Drosophila heat shock factor subject to negative regulation. Cell 63, 1085-1097 (1990).

50. Morimoto, R. I. Regulation of the heat shock transcriptional response: cross talk between a family of heat shock factors, molecular chaperones, and negative regulators. Genes & Dev. 12, 3788-3796 (1998).

51. Baniwal, S. K. et al. Heat stress response in plants: a complex game with chaperones and more than twenty heat stress transcription factors. J Biosci. 29, 471-487 (2004).

52. Initiative, A. G. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815 (2000).

53. Wu, G. et al. Disease resistance conferred by expression of a gene encoding H2O2-generating glucose oxidase in transgenic potato plants. Plant Cell 7, 1357-1368 (1995).
