Center for Biological Sequence Analysis Department of Systems Biology Introduction to Microbial Genomics Sequences as information Dave Ussery Comparative Bacterial Genomics Workshop Centers for Disease Control Atlanta, Georgia, USA Monday, 27 August, 2012


Center  for  Biological  Sequence  Analysis

Department  of  Systems  Biology

Introduction to Microbial Genomics

Sequences as information

Dave Ussery

Comparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA

Monday, 27 August, 2012


Comparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA, 27 August, 2012. CBS, Department of Systems Biology

http://www.cbs.dtu.dk/staff/dave/CDC_2012.php

www.cbs.dtu.dk


[Figure: BASE ATLAS of V. cholerae O1 biovar El Tor str. N16961, chromosome I (2,961,149 bp); position scale 0 to 2.5 Mb. Source: Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/. Tracks and scale ranges: G content 0.18 to 0.30; A content 0.20 to 0.32; T content 0.21 to 0.32; C content 0.17 to 0.30; annotations (CDS +, CDS -, rRNA, tRNA); AT skew -0.04 to 0.04; GC skew -0.08 to 0.08; percent AT 0.46 to 0.59. Resolution: 1185.]
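The content and skew tracks named in the atlas legend can be illustrated with a small calculation. The sketch below is not the atlas software's implementation: it assumes the common definitions GC skew = (G - C) / (G + C) and AT skew = (A - T) / (A + T), computed over non-overlapping windows, and the function name `window_stats` and the toy sequence are hypothetical.

```python
from collections import Counter


def window_stats(seq, window):
    """Yield (start, at_fraction, gc_skew, at_skew) per non-overlapping window.

    Assumed definitions (not taken from the slides):
      at_fraction = (A + T) / (A + C + G + T)
      gc_skew     = (G - C) / (G + C)
      at_skew     = (A - T) / (A + T)
    Windows containing no A/C/G/T characters are skipped.
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, window):
        counts = Counter(seq[start:start + window])
        a, c, g, t = (counts[base] for base in "ACGT")
        total = a + c + g + t
        if total == 0:
            continue
        at_fraction = (a + t) / total
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        at_skew = (a - t) / (a + t) if (a + t) else 0.0
        yield start, at_fraction, gc_skew, at_skew


if __name__ == "__main__":
    # Toy sequence for illustration only; a real run would read a genome file.
    for row in window_stats("GGGCAATTGGGCAATT", window=8):
        print(row)
```

On a real chromosome the window size would be far larger than in this toy call; the value to use is a choice left to the reader.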

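[Workflow diagram labels: genomeStatistics, rnammer]

[Figure: pan- and core-genome plot. x-axis: genomes added in the numbered order given in the legend below (ticks 1 to 45); y-axis ticks: 0 to 15,000. Series: new genes, new gene families, core genome, pan genome.]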

1 : Ecoli_042 2 : Ecoli_536 3 : Ecoli_55989 4 : Ecoli_ABU_83972 5 : Ecoli_APEC_O1 6 : Ecoli_ATCC_8739 7 : Ecoli_BL21_DE3_28965 8 : Ecoli_BL21_DE3_30681 9 : Ecoli_BW2952 10 : Ecoli_B_str_REL606 11 : Ecoli_DH1 12 : Ecoli_E24377A 13 : Ecoli_ED1a 14 : Ecoli_ETEC_H10407 15 : Ecoli_HS 16 : Ecoli_IAI1 17 : Ecoli_IAI39 18 : Ecoli_IHE3034 19 : Ecoli_KO11 20 : Ecoli_O103H2_str_12009 21 : Ecoli_O111H_str_11128 22 : Ecoli_O127H6_str_E2348_69 23 : Ecoli_O157H7_str_EDL933 24 : Ecoli_O157H7_str_TW14359 25 : Ecoli_O26H11_str_11368 26 : Ecoli_O55H7_str_CB9615 27 : Ecoli_O83H1_str_NRG_857C 28 : Ecoli_S88 29 : Ecoli_SE11 30 : Ecoli_SE15 31 : Ecoli_SMS35 32 : Ecoli_UM146 33 : Ecoli_UMN026 34 : Ecoli_UTI89 35 : Ecoli_W 36 : Ecoli_str_K12_substr_DH10B 37 : Ecoli_str_K12_substr_MG1655 38 : Ecoli_str_K12_substr_W3110 39 : Vatypica_ACS_049_V_Sch6 40 : Vatypica_ACS_134_V_Col7a 41 : Vdispar_ATCC_17748 42 : Vparvula_ATCC_17745 43 : Vparvula_DSM_2008 44 : Vsp_3_1_44 45 : Vsp_6_1_27 46 : Vsp_str_F0412
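The four series in a pan- and core-genome plot can be thought of as running set operations over the genomes in the order listed. The sketch below is a minimal illustration and not the pancoreplot implementation: it assumes each genome has already been reduced to a set of gene-family identifiers, and the function name `pan_core_curve` and the toy data are hypothetical.

```python
def pan_core_curve(genomes):
    """Accumulate pan- and core-genome sizes over an ordered list of genomes.

    genomes: list of (name, set_of_family_ids), in the order they are added.
    Returns one row per genome: (name, new_families, pan_size, core_size).
      pan  = union of all families seen so far
      core = intersection of the families of all genomes seen so far
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new_families = families - pan          # families not seen in earlier genomes
        pan |= families
        core = set(families) if core is None else core & families
        rows.append((name, len(new_families), len(pan), len(core)))
    return rows


if __name__ == "__main__":
    # Hypothetical family identifiers, for illustration only.
    toy = [
        ("genome_1", {"famA", "famB", "famC"}),
        ("genome_2", {"famA", "famB", "famD"}),
        ("genome_3", {"famA", "famE"}),
    ]
    for row in pan_core_curve(toy):
        print(row)
```

In this picture the pan genome can only grow and the core genome can only shrink as genomes are added, so adding the distantly related genomes at the end of a list like the one above would be expected to shrink the core sharply.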

[Figure: BLAST matrix comparing the 21 Vibrionaceae proteomes listed below. Each cell reports a percentage and a ratio, for example "28.3 %, 1,980 / 6,989" between two proteomes and "3.3 %, 111 / 3,378" within a single proteome.]

Aliivibrio salmonicida LFI1238: 3,915 proteins, 3,378 families
Photobacterium profundum SS9: 5,480 proteins, 4,897 families
Vibrio fischeri ES114: 3,818 proteins, 3,691 families
Vibrio fischeri MJ11: 4,039 proteins, 3,894 families
Vibrio splendidus LGP32: 4,431 proteins, 4,277 families
Vibrio species MED222 1099517005441: 4,590 proteins, 4,463 families
Vibrio campbellii AND4 1103602000595: 3,935 proteins, 3,822 families
Vibrio species Ex25: 4,004 proteins, 3,886 families
Vibrio shilonii AK1 1103207002036: 5,360 proteins, 5,078 families
Vibrio vulnificus YJ016: 5,028 proteins, 4,773 families
Vibrio vulnificus CMCP6: 4,538 proteins, 4,337 families
Vibrio harveyi ATCC BAA-1116: 6,064 proteins, 5,116 families
Vibrio parahaemolyticus 16: 3,780 proteins, 3,683 families
Vibrio parahaemolyticus RIMD 2210633: 4,832 proteins, 4,662 families
Vibrio cholerae AM-19226: 3,407 proteins, 3,305 families
Vibrio cholerae 2740-80: 3,771 proteins, 3,567 families
Vibrio cholerae 1587: 3,758 proteins, 3,597 families
Vibrio cholerae MZO-2: 3,425 proteins, 3,311 families
Vibrio cholerae MO10: 3,421 proteins, 3,353 families
Vibrio cholerae 0395: 3,875 proteins, 3,665 families
Vibrio cholerae O1 biovar eltor str. N16961: 3,828 proteins, 3,665 families


BLAST matrix

Homology within proteomes: scale from 1.8 % to 5.0 %

Homology between proteomes: scale from 25.5 % to 81.6 %
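The percentages in the BLAST matrix cells are consistent with simple ratios of the two counts printed beside them. The sketch below checks that arithmetic on a few cells copied from the matrix; reading the ratio as "families sharing a hit / total families considered" is an assumption, and the function name `percent_shared` is hypothetical.

```python
def percent_shared(shared, total):
    """Return shared / total as a percentage rounded to one decimal place."""
    return round(100.0 * shared / total, 1)


# (shared, total) pairs copied from cells of the matrix above.
CELLS = [
    (1980, 6989),  # printed as 28.3 %
    (111, 3378),   # printed as 3.3 %
    (3264, 4000),  # printed as 81.6 %
    (1872, 7339),  # printed as 25.5 %
]

if __name__ == "__main__":
    for shared, total in CELLS:
        print(f"{shared:>5} / {total:>5} = {percent_shared(shared, total)} %")
```

The last two cells reproduce the extremes of the "between proteomes" scale (81.6 % and 25.5 %).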

[Workflow diagram labels. Step: copy and download GenBank and DNA files. Tools: grep, ls -1, gawk, pancoreplot, makebmdest, blastmatrix, saco_extract, saco_convert, Prodigal.]

1 Sequences as Biological Information

organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.

From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of

Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance

[Figure 1.1 labels. Domains: BACTERIA, ARCHAEA, EUCARYA. Bacterial labels: Flavobacterium, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes. Archaeal labels: Crenarchaeota, Euryarchaeota. Eukaryotic labels: Unicellular eukaryotes, Protozoans, Macro-organisms, Animals, Plants, Giardia, Saccharomyces, Trypanosoma, Slime mold, Babesia.]

[Workshop overview diagram, Monday to Thursday. Steps: raw DNA sequence; examine GenBank files; basic genome statistics; genome atlas; locate rRNA sequences; 16S rRNA phylogenetic tree; genefinding, genes/proteins; published annotated genes/proteins; number of genes/proteins; amino acid and codon usage; subset specific gene counts; pan and core genome plot. Tools: njplot, extractseqs, clustalw, genomeAtlas, sed, chmod, genewiz, mousepad, basicgenomeanalysis.]

Information table for all genomes. Add information to this table as you do the exercises.


SOURCE: NCBI; GRAPHICS BY N. SPENCER & W. FERNANDES

At the time of the announcement of the first drafts of the human genome in 2000, there were 8 billion base pairs of sequence in the three main databases for ‘finished’ sequence: GenBank, run by the US National Center for Biotechnology Information; the DNA Databank of Japan; and the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database. The databases share their data regularly as part of the International Nucleotide Sequence Database Collaboration (INSDC). In the subsequent first post-genome decade, they have added another 270 billion bases to the collection of finished sequence, doubling the size of the database roughly every 18 months. But this number is dwarfed by the amount of raw sequence that has been created and stored by researchers around the world in the Trace archive and Sequence Read Archive (SRA). See Editorial, page 649, and human genome special at www.nature.com/humangenome
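The growth figures quoted above can be turned into a doubling time with a short calculation. The sketch below assumes steady exponential growth over a ten-year span (2000 to 2010, an assumption read from "first post-genome decade") and uses only the two quoted totals; the function names are hypothetical.

```python
import math


def doubling_time_months(start_bases, end_bases, span_months):
    """Doubling time implied by exponential growth between two totals."""
    return span_months * math.log(2) / math.log(end_bases / start_bases)


def growth_factor(span_months, doubling_months):
    """Fold increase after span_months when the total doubles every doubling_months."""
    return 2 ** (span_months / doubling_months)


if __name__ == "__main__":
    start = 8e9              # base pairs in 2000, as quoted
    end = 8e9 + 270e9        # quoted starting total plus the bases added since
    implied = doubling_time_months(start, end, span_months=120)
    print(f"implied doubling time: {implied:.1f} months")
    print(f"fold growth at an 18-month doubling: {growth_factor(120, 18):.0f}x")
```

Taken at face value, the two endpoint totals imply a doubling time somewhat longer than 18 months, so "roughly every 18 months" is best read as an approximation rather than a fitted rate.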


The graphic shows all published, fully sequenced human genomes since 2000, including nine from the first quarter of 2010. Some are resequencing efforts on the same person and the list does not include unpublished completed genomes.

HOW MANY HUMAN GENOMES?

THE SEQUENCE EXPLOSION

NATURE | Vol 464 | 1 April 2010 | pp. 670-671 | NEWS FEATURE: HUMAN GENOME AT TEN

© 2010 Macmillan Publishers Limited. All rights reserved


From The New York Times, 22 August, 2012

Genome Detectives Solve a Hospital’s Deadly Outbreak

By GINA KOLATA

The ambulance sped up to the red brick federal research hospital on June 13, 2011, and paramedics rushed a gravely ill 43-year-old woman straight to intensive care. She had a rare lung disease and was gasping for breath. And, just hours before, the hospital learned she had been infected with a deadly strain of bacteria resistant to nearly all antibiotics.

The hospital employed the most stringent and severe form of isolation, but soon the bacterium, Klebsiella pneumoniae, was spreading through the hospital. Seventeen patients got it, and six of them died. Had they been infected by the woman? And, if so, how did the bacteria escape strict controls in one of the nation’s most sophisticated hospitals, the Clinical Center of the National Institutes of Health in Bethesda, Md.?

What followed was a medical detective story that involved the rare use of rapid genetic sequencing to map the entire genome of a bacterium as it spread and to use that information to detect its origins and trace its route.

“We had never done this type of research in real time,” said Julie Segre, the researcher who led the effort.

The results, published online Wednesday in the journal Science Translational Medicine, revealed a totally unexpected chain of transmission and an organism that can lurk undetected for much longer than anyone had known. The method used may eventually revolutionize how hospitals deal with hospital-acquired infections, which contribute to more than 99,000 deaths a year.

....


NATURE BIOTECHNOLOGY VOLUME 27 NUMBER 7 JULY 2009 631

primary factor limiting the understanding of our microbial planet is, in fact, the need for even larger quantities of data. In a remarkable achievement, the Sorcerer II Global Ocean Sampling Expedition [34] sequenced over six million microbial genes, almost doubling the size of GenBank at the time. However, viewed from the perspective of the actual extent of microbial diversity [35], such efforts are, and will remain, extremely small scale. The remarkable number of microbes (Table 1), already estimated to be several orders of magnitude greater than the number of stars in the universe, urgently calls for a transition from random, anecdotal and small-scale surveys toward a systematic and comprehensive exploration of our planet.

This cannot be achieved by the efforts of individual researchers but requires the establishment of effective national and international collaborations. For comparison, space and planetary exploration could never have been realized by a single researcher or even a small network. To achieve those goals, a National Aeronautics and Space Administration (NASA; Houston, TX, USA) was formed in the United States, with similar national efforts introduced in several other countries. The success of NASA can serve as a model here.

It is imperative to see the formation of national Microbial Environmental Genomics Administrations (MEGA) launched around the globe. Current ongoing international efforts include the International Census for Marine Microbes (ICoMM) (http://www.coml.org/descrip/icomm.htm) and the International Soil Metagenome Sequencing Project, or so-called ‘Terragenome’ (http://terragenome.org/). National initiatives include the Australian Genome Alliance (http://www.genomealliance.org.au/) and the MikroBioKosmos initiative in Greece (http://www.mikrobiokosmos.org/).

Clearly, efforts of this magnitude require substantial investment. To explore and seek to understand how the Earth breathes, grows, evolves, renews and sustains life (all essentially the work of the microbial world) is the great adventure now beckoning to us. Microbial genomics paves the way forward.

Note: Supplementary information is available on the Nature Biotechnology website.

Published online at http://www.nature.com/naturebiotechnology/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

ACKNOWLEDGMENTS
I would like to thank C. Woese, P. Hugenholtz and C. Ouzounis for their critical reading and helpful suggestions, and M. Youle for her excellent editorial assistance. Special thanks to the members of the Genome Biology Program at the Joint Genome Institute for keeping me constantly in a most challenging and stimulating environment.


comparative analysis of microorganisms but now redefined as dynamic communities that may be computationally represented as pangenomes. Looking back at the breakthroughs that have brought genomics to where it stands today, we find that in 1960–1990, the era of ribosomal RNA, we were building the tree of life and establishing the framework for the genomics revolution of 1990–2010, when we were growing the tree of life. The next decade (2010–2020) will be marked as the era of pangenomics, defined as finally understanding the tree of life.

New technologies, new ways forward

The greatest challenge to increasing our genomic coverage of microbial diversity lies in obtaining the DNA to sequence. More than 99% of the currently known microbial diversity resides in unculturable organisms. Of those that can be cultured, many are difficult to grow or grow only very slowly. Some present hindrances to DNA extraction. Growing the organisms for even a hundred sequencing projects consumes huge resources and requires much infrastructure. Most importantly, unlike DNA sequencing and data analysis, provisioning of DNA does not seem to be scaling up to expedite the process.

Community metagenomics cannot fill this gap, as discrete genomes cannot be assembled from the metagenomic data obtained from most environments. Therefore, our best hope for the future may lie in a new direction: single-cell genomics [31]. Already, current technology can provide ~70% coverage of a microbial genome by sequencing the DNA from an individual microbial cell [31]. It has been predicted that coverage will increase to ~95% within the next 3–5 years, owing to intense technology development. Even at the current coverage, this approach constitutes a major breakthrough that has opened a window into vast, previously inaccessible realms of unculturable microbial diversity.

Community metagenomics can be partnered with single-cell genomics, an approach that will likely become common for metagenomic projects. In parallel with sampling and sequencing the metagenome for an environment of medium complexity, single-cell techniques can be used to sequence several of the individual cell types present. Even at the current 70% coverage, this would provide representative reference genomes for that environment and lead to a more holistic understanding of the community and its individual members.

For those culturable organisms for which complete genome sequences can already be obtained, greater insights will emerge from bridging the gap between genotype and phenotype as expected from the integration of transcriptomics and proteomics with genomics. For the most part, genes in sequenced microbial genomes are computationally predicted based on the location of start and stop codons within the sequence. Thus, gene prediction is essentially protein prediction, and there is little known about the transcribed but untranslated regions (UTRs) at either end. Coordinating a genome with its companion transcriptome and proteome can provide experimental confirmation of the accuracy of those predictions and can reveal genes missed by computational approaches [32]. Transcriptomes can extend known protein-coding sequences to include the UTRs, thus identifying the locations where transcription starts and stops. Overall, the advent of new sequencing technologies is opening entire new worlds of possibilities in microbial genomics, ranging from the identification of novel small regulatory RNAs [33] to elucidation of the mechanisms underlying the generation of genetic diversity. Indeed, as sequencing technology becomes cheaper, faster and more accurate, resequencing, and by effect, studies on the origins of mutations and population variability, are finally within our reach.
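The idea that genes are predicted from the location of start and stop codons can be sketched as a naive open-reading-frame scan. This is a deliberately simplified illustration, not how any particular gene finder works: it assumes ATG as the only start codon, the three standard stop codons, and the forward strand only, and the function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for simple forward-strand ORFs.

    An ORF here runs from the first ATG seen in a reading frame to the next
    in-frame stop codon; `end` is exclusive and includes the stop codon.
    ORFs with fewer than `min_codons` codons before the stop are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next start codon
    return orfs


if __name__ == "__main__":
    # Toy sequence for illustration only.
    toy = "CCATGAAACCCGGGTAAGG"
    for start, end, frame in find_orfs(toy):
        print(start, end, frame, toy[start:end])
```

Because such a scan sees only the coding region between start and stop, it says nothing about the untranslated regions at either end, which is the limitation the paragraph above points out.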

National and international initiatives: a MEGA approach

Although one of the greatest challenges ahead lies in managing the current exponential growth in sequence data, it is ironic that the

Table 1. Estimating the magnitude of microbial diversity

Number of bacteriophages on Earth: 10^31
Number of microbes on Earth: 5 × 10^30
Number of stars in the universe: 7 × 10^21
Number of microbes in all humans: 6 × 10^23
Number of humans: 6 × 10^9
Number of microbial cells in one human gut: 10^14
Number of human cells in one human: 10^13
Number of microbial genes in one human gut: 3 × 10^6
Number of genes in the human genome: 2.5 × 10^4
Combined length of all bacteriophages on Earth: 10^8 light-years
Diameter of the Milky Way: 10^5 light-years
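The comparisons that Table 1 invites can be made explicit with a few ratios. The sketch below uses the table's estimates, read as powers of ten (for example 5 × 10^30 microbes on Earth); the dictionary keys and the function name `orders_of_magnitude` are hypothetical.

```python
import math

# Estimates from Table 1, read as powers of ten.
ESTIMATES = {
    "microbes_on_earth": 5e30,
    "stars_in_universe": 7e21,
    "microbial_cells_in_one_gut": 1e14,
    "human_cells_in_one_human": 1e13,
    "microbial_genes_in_one_gut": 3e6,
    "genes_in_human_genome": 2.5e4,
}


def orders_of_magnitude(larger, smaller):
    """Return log10(larger / smaller), the gap in orders of magnitude."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    e = ESTIMATES
    pairs = [
        ("microbes_on_earth", "stars_in_universe"),
        ("microbial_cells_in_one_gut", "human_cells_in_one_human"),
        ("microbial_genes_in_one_gut", "genes_in_human_genome"),
    ]
    for big, small in pairs:
        ratio = e[big] / e[small]
        gap = orders_of_magnitude(e[big], e[small])
        print(f"{big} / {small}: ratio {ratio:.3g}, about {gap:.1f} orders of magnitude")
```

On these estimates, microbes outnumber stars by almost nine orders of magnitude, and gut microbial cells outnumber human cells about tenfold.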

PERSPECTIVE

"An inordinate fondness of bacteria..."


[Figure labels, repeated across the figure: O-chain, Transport]


A Brief History of Biological Information

Based on three excellent books:

“THE LOGIC OF LIFE - A History of Heredity”, by Francois Jacob (Vintage Books, A Division of Random House, New York, 1973, translated by Betty E. Spillman).

“WHO WROTE THE BOOK OF LIFE? - A History of The Genetic Code”, by Lily E. Kay (Stanford University Press, Stanford, California, 2000).

“THE INSIDE STORY - DNA to RNA to Protein”, edited by Jan Witkowski (Cold Spring Harbor Press, New York, 2005).


Aristotle ~350 B.C.

plants, animals, minerals


www.sciencemag.org SCIENCE VOL 306 1 OCTOBER 2004

CREDIT: E. V. ARMBRUST ET AL.

ScienceScope

Experts Probe Flu Death, Call for Poultry Vaccination

A 26-year-old woman in Thailand who died of avian influenza earlier this month probably contracted the disease from her daughter, researchers said this week. But World Health Organization (WHO) scientists are cautiously optimistic that the development is not the start of a major outbreak. Meanwhile, several global health groups are calling for increased vaccination of Southeast Asia’s poultry flocks in a bid to corral the dangerous H5N1 virus.

Researchers say the woman, who lived in the Bangkok area, had returned to a rural village in northern Thailand to care for her sick daughter, who probably contracted the virus from local chickens. The daughter was cremated before researchers could collect tissue samples that could confirm her illness. But tissue samples from the mother proved positive for H5N1. The woman’s sister has also tested positive for the virus and is in a hospital isolation ward.

Evidence to date suggests a case of “nonsustained, dead-end transmission,” says WHO virologist Klaus Stöhr. Similar cases have been documented in the past. But until the WHO collaborating center in Atlanta, Georgia, analyzes the new samples, experts won’t know definitively whether the virus has mutated to a more dangerous form. So far, says Stöhr, Thai authorities have detected no increase in respiratory disease among villagers or health workers who cared for the patients.

To keep the virus in check, governments should be vaccinating and not just culling poultry flocks, the United Nations Food and Agriculture Organization and the World Organisation for Animal Health said in a 28 September statement. China and Indonesia already have vaccination programs. But Thailand and other nations do not, in part because poultry exporters fear importing countries will ban products from vaccinated birds, which don’t exhibit flu symptoms but can still carry the virus.

–DENNIS NORMILE

Boehlert Has Bypass

Representative Sherwood Boehlert (R–NY) is taking an unexpected break from his duties as chair of the House Science Committee. Boehlert this week underwent triple coronary bypass surgery at the National Naval Medical Center in Bethesda, Maryland, after doctors discovered several blocked arteries. He’s expected to be back to work within weeks.

–DAVID MALAKOFF

Diatoms are an enigma. Neither plant noranimal, they share biochemical features ofboth. Though simple single-celled algae,they are covered with elegant casingssculpted from silica.

Now a team of 45 biologists has taken abig step toward resolving the paradoxical na-ture of these odd microbes. They have se-quenced the genome of Thalassiosira

pseudonana, which lives in salt water and isa lab favorite among diatom experts. Thework should prove useful to ecologists, geol-ogists, and even biomedical researchers, saysEdward Theriot, a diatom systematist at theUniversity of Texas, Austin: “We’ve justjumped a generation ahead by having thiskind of understanding of this genome.”

Diatoms date back 180 million years, and remnants of their silica shells make up porous rock called diatomite that is used in industrial filters. Today diatoms occupy vast swaths of ocean and fresh water, where they play a key role in the global carbon cycle. Diatom photosynthesis yields 19 billion tons of organic carbon, about 40% of the marine carbon produced each year; thus, by processing carbon dioxide into solid matter, they represent a key defense against global warming.

Many marine organisms feast on diatoms. When conditions are ripe, the algae can multiply at astonishing rates, creating ocean “blooms” that are sometimes toxic. These blooms can suffocate nearby marine life or make a toxin that harms people who eat infected shellfish. “This is a group of organisms that has amazing importance in global ecology,” says Deborah Robertson, an algal physiologist at Clark University in Worcester, Massachusetts.

Since 2002, Daniel Rokhsar, a genomicist at the DOE Joint Genome Institute in Walnut Creek, California, and his colleagues have been unraveling the genome of T. pseudonana. They were aided by a technique called optical mapping, in which stretched-out chromosomes are nicked by enzymes and viewed through a light microscope. Those nicked pieces of DNA stay in order and enable the sequencers to assemble almost all the bases in the correct place on the right chromosomes.

The draft genome consists of 34 million bases, Rokhsar, E. Virginia Armbrust, an oceanographer at the University of Washington, Seattle, and their colleagues report on page 79 of this issue. They ultimately found about 11,500 genes along the diatom’s chromosomes and along the DNA in its chloroplast and mitochondria.

Analyses of these genes and the proteins they encode confirm that diatoms have had a complex history. Like other early microbes, they apparently acquired new genes by engulfing microbial neighbors. Perhaps the most significant acquisition was an algal cell that provided the diatom with photosynthetic machinery.

Some biologists hypothesize that diatoms branched off from an ancestral nucleated microbe from which plants and animals later arose, a theory supported by the identification of T. pseudonana genes in some plant and animal genomes. As diatoms, plants, and animals evolved, each must have shed different genes from this common ancestor. As a result, diatoms were left with what looks like a mix of plant and animal DNA, plus other genes that are remnants of the engulfed algae.

The new data support this complex scenario, says Robertson. Some 182 T. pseudonana proteins are related only to red algae proteins; another 865 proteins are found just among plants. About half the proteins encoded by the rest of the diatom’s genes are equally similar to counterparts in plants, animals, and red algae.

The newly analyzed genome has also begun to shed light on how a diatom constructs its intricately patterned glass shell. So far, Rokhsar and his colleagues have uncovered a dozen proteins involved in the deposition of the silicon and expect to find more. Such progress could be a boon to materials scientists. “Being able to understand [silica processing] should have a payoff in nanofabrication,” says Robertson.

Currently, a mere 100 or so researchers call themselves diatom specialists. With the genome in hand, interest in diatoms is going to expand, Theriot predicts: “It will help put diatoms on everyone’s radar.”

–ELIZABETH PENNISI

DNA Reveals Diatom’s Complexity

GENETICS

Aqueous snowflake. The sequence of a diatom should reveal the secrets of its decorative shell.

Published by AAAS

www.sciencemag.org SCIENCE VOL 306 1 OCTOBER 2004

CREDIT: E. V. ARMBRUST ET AL.

Page 12: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Figure: three-domain tree of life (BACTERIA, ARCHAEA, EUCARYA). Labelled bacterial groups: Flavobacterium, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes. Labelled archaeal groups: Crenarchaeota, Euryarchaeota. Labelled eukaryotes: Giardia, Saccharomyces, Trypanosoma, Slime mold, Babesia. Other labels: Unicellular eukaryotes, Protozoans, Macro-organisms (Animals, Plants).

Page 13: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Gregor Mendel 1866

genes

Page 14: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Albrecht Kossel 1881

Page 15: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

T.H. Morgan 1919

Chromosomes

Page 16: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

G. Beadle & E. Tatum 1930s

one gene, one enzyme

Page 17: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

O. Avery 1941

Page 18: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

E. Schrödinger 1943

“We believe a gene - or perhaps the whole chromosome fibre - to be an aperiodic solid.”

“...For an illustration, think of Morse code...”

"What is Life?" by Erwin Schrödinger

(Cambridge University Press, 1944)

Page 19: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Watson & Crick 1953

Page 20: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

DNA is like Coca-Cola!

Coke                 DNA
Water                Water
Sugar (sucrose)      Sugar (deoxyribose)
Phosphoric acid      Phosphate (PO4) backbone
Caffeine             Bases

Page 21: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Figure: stacked A=T base pairs illustrating the base-pair parameters Tilt, Twist, Roll, and Propeller Twist.

DNA bases will spontaneously stack on top of each other and form a helix!

Page 22: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 23: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Figure: DNA double helix with the Minor Groove and Major Groove labelled.

360° = one helical turn
10.5 bp per turn
34.3° twist angle (rotation per residue)
3.4 Å Axial Rise
Base Pair Tilt: -6°
Helix Pitch: 35.7 Å
Helix Diameter: 20 Å
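The helix numbers on this slide are linked by simple arithmetic: the twist angle is one full turn divided by the base pairs per turn, and the pitch is the base pairs per turn times the axial rise. A minimal sketch of that check follows; the function and constant names are illustrative, not from the slides.

```python
# Helix parameters as quoted on the slide.
BP_PER_TURN = 10.5          # base pairs per helical turn
AXIAL_RISE_ANGSTROM = 3.4   # rise along the helix axis per base pair


def twist_angle_degrees(bp_per_turn: float) -> float:
    """Rotation per base pair, given that one helical turn is 360 degrees."""
    return 360.0 / bp_per_turn


def helix_pitch_angstrom(bp_per_turn: float, rise: float) -> float:
    """Length of one full helical turn measured along the helix axis."""
    return bp_per_turn * rise


if __name__ == "__main__":
    twist = twist_angle_degrees(BP_PER_TURN)
    pitch = helix_pitch_angstrom(BP_PER_TURN, AXIAL_RISE_ANGSTROM)
    print(f"twist angle: {twist:.1f} degrees per base pair")
    print(f"helix pitch: {pitch:.1f} angstrom per turn")
```

Rounded to one decimal place, both values agree with the 34.3° and 35.7 Å quoted on the slide.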

Page 24: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

The “DNA code”

Page 25: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

DNA     5’...GATCTAGCGATGCCGATGAAACATGATCATG...3’
        3’...CTAGATCGCTACGGCTACTTTGTACTAGTAC...5’

transcription

mRNA    5’...GAUCUAGCGAUGCCGAUGAAACAUGAUCAUG...3’

translation

Protein N met-pro-met-lys-his-asp-his...C
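The transcription and translation steps on this slide can be reproduced in a few lines. The sketch below assumes the standard genetic code and that translation starts at the first AUG; the helper names `transcribe` and `translate` are illustrative. Read codon by codon from that AUG, the sixth codon is GAU, which encodes aspartate in the standard code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA with the same sequence as the coding (5'->3') DNA strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "GATCTAGCGATGCCGATGAAACATGATCATG"  # top strand from the slide
    mrna = transcribe(dna)
    print(mrna)
    print(translate(mrna))  # one-letter amino acid codes
```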

Page 26: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

The General Idea

1. The Sequence Hypothesis

The amino acid sequence in proteins is specified from DNA and RNA.

2. The Central Dogma

Once information flows to protein, it cannot come back!

Genome -> Transcriptome -> Proteome

Page 27: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 28: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 29: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Biological Sequences as Information

DNA sequences as information

1. The DNA sequence can code for amino acid sequences (mRNAs)

2. The DNA sequence can code for stable RNA sequences
   - rRNA
   - tRNA
   - snRNA
   - telomerase RNA

3. The DNA sequence can code for protein binding sites

4. The DNA can code for architectural information
   - nucleosome positioning

5. The DNA can code for structural / stability information
   - transcription initiation
   - origins of replication
   - intrinsic DNA curvature
   - mutational "hot spots"

Page 30: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 31: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Biological Sequences as Information

RNA sequences as information

1. The mRNAs can contain several different levels of information:

- specifies amino acid sequence for proteins

- localisation signals for WHERE the protein will be made

- stability signals to determine HOW MUCH protein is made

- splice sites

- editing sites

2. The tRNAs code for the genetic code - same in all living organisms

(n.b. diff. in mitochondria)

3. The rRNAs code for the structures of ribosomes

4. Other RNA/protein complexes

Page 32: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Biological Sequences as Information

Protein sequences as information

1. The protein sequence can code for an "active site" for enzymes

2. The protein sequence can code for structural roles:

microtubules, myosin, collagen, etc.

3. The protein sequence can code for ion channels/pumps

4. The protein sequence can code for localisation information

5. The protein sequence can code for modification sites

Page 33: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 34: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Figure: Malaria. Labelled protein categories: IMPs, 14-3-3 proteins, Other Enzymes, Ubiquitin system, Oxidoreductase, GTPase/Regulators, Kinases/Phosphatases.

Page 35: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Summary (so far!)

Sequences (DNA, RNA, Protein) -> Structure -> Function

Page 36: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Page 37: Introduction to Microbial Genomics - DTU … · Center&for&Biological&Sequence&Analysis Department&of&Systems&Biology Introduction to Microbial Genomics Sequences as information Dave

Questions:

- What information is in the following sequence?

- How can you find out?

- Is the DNA sequence REALLY like a ‘language’?

>Mystery sequence - prize for the first person who tells me what this is!
ATGGGACTACCCTGGTACCGCGTACATACAGTAGTTCTGAACGATCCAGGACGGCTGATTTCTGTACACCTAATGCACACTGCTCTTGTCGCAGGTTGGGCGGGCTCTATGGCCCTGTACGAATTGGCAGTTTTTGACCCATCAGACCCAGTTCTCAATCCCATGTGGCGTCAAGGTATGTTTGTCATGCCTTTTATGGCTCGTTTGGGTGTAACTCAATCCTGGGGTGGCTGGAGTCTAACTGGTGAAGTAGCCGATAATCCCGGAATTTGGTCTTTTGAAGGGGTAGCCGCTACCCATATCATCTTGTCAGGTCTATTATTCCTGGCAGCAGTTTGGCACTGGGTTTACTGGGATCTGGAACTGTTTACCGATCCTCGGACTGGTGAACCAGCCCTAGACCTACCCAAAATGTTCGGAATTCATTTATTCCTATCTGGTTTGCTTTGTTTTGGCTTCGGAGCCTTCCACCTCACGGGACTATTCGGACCGGGAATGTGGGTTTCTGACCCCTATGGATTGACGGGAAGTATACAACCTGTCGCTCCTTCCTGGGGGCCTGAAGGATTTAACCCCTTCAATGCTGGCGGTATTGCGGCTCACCATATTGCGGCCGGAATTGTTGGCATTATTGCCGGACTATTCCACCCGTCCGTCAGACCACCTCAGCGCCTATACAAAGCCCTGCGTATGGGAAATATCGAAACTGTACTATCTAGTAGTATCGCGGCGGTATTCTTTGCGGCTTTTGTGGTAGCTGGAACTATGTGGTATGGTTCGGCTGCAACTCCGATTGAACTGTTTGGACCTACCCGCTATCAGTGGGATCAGGGATATTTCCAACAGGAAATTCAGCGCCGGGTACAAAGCAGTATTGCTCAGGGTGACAGCCCCTCAGAAGCATGGTCTAAGATTCCTGAAAAACTGGCATTTTATGACTATGTTGGTAACAGTCCCGCTAAAGGCGGTTTGTTCCGCGTCGGTCCGATGAACAAGGGCGATGGTATTGCTCAAGGTTGGCTCGGACACCCAGTATTCACTGATGCAGAAGGTCGCGAATTAACTGTTCGTCGTCTTCCTAACTTCTTTGAAACCTTCCCCGTCATTCTGACTGATGCTGATGGCGTAATTCGCGCTGACGTTCCTTTCCGTCGCGCGGAGTCTCGCTACAGCTTTGAGCAAACTGGGGTGACTGTTTCTTTATATGGTGGTGAACTCAATGGTAAAACCTTCACCGATCCCGCCTCTGTGAAGAAATATGCCCGCTTTGCTCAACAGGGTGAACCATTTGCCTTTGACCGGGAAACTCTCGGCTCTGATGGGGTATTTCGTACCAGTACCCGTGGCTGGTTTACTTTCGGTCACGCTTGCTTTGCTCTGCTTTTCTTCTTTGGTCATATTTGGCACGGTTCCCGCACCATCTTCCGAGATGTATTTGCTGGGGTGGAAGCTGACCTAGAAGAACAAGTTGAGTGGGGTAACTTCCAGAAAGTTGGAGACCAAACAACTCGTGTTCAAAAGACCGTCTAA
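One possible first pass at "How can you find out?" is to summarise the sequence before searching a database with it: its length, GC fraction, whether it starts with ATG and ends with a stop codon, and its translation in the first reading frame. The sketch below assumes the standard genetic code; `summarize_orf` is a hypothetical helper, and only the first 30 bases of the mystery sequence are used as the demonstration input. Identifying the gene itself would normally need a similarity search (for example BLAST) against a sequence database.

```python
from itertools import product

# Standard genetic code for DNA codons, enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def summarize_orf(dna: str) -> dict:
    """Basic facts about a DNA sequence, read in frame 1 from its first base."""
    seq = "".join(dna.split()).upper()  # drop whitespace and line breaks
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return {
        "length": len(seq),
        "gc_fraction": gc,
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": seq[-3:] in STOP_CODONS,
        "protein": "".join(protein),
    }


if __name__ == "__main__":
    # First 30 bases of the mystery sequence; paste the full sequence to analyse it all.
    prefix = "ATGGGACTACCCTGGTACCGCGTACATACA"
    for key, value in summarize_orf(prefix).items():
        print(f"{key}: {value}")
```

The resulting protein string is what one would paste into a protein similarity search to find out what the sequence encodes.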