introduction to genomics and the tree of life (part 2) monday, october 25, 2010 genomics 260.605.01...

74
Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner [email protected]

Upload: delilah-bryant

Post on 25-Dec-2015

226 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Introduction to Genomics and the Tree of Life

(part 2)

Monday, October 25, 2010

Genomics260.605.01J. Pevsner

[email protected]

Page 2: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of this course

Tree of lifeVirusesBacteria and archaea (Egbert Hoiczyk)Eukaryotes

The eukaryotic chromosomeFungi; yeast functional genomics (Jef Boeke)Protozoans (David Sullivan)Nematodes (Alan Scott)Rodents: mouse and ratPrimatesThe human genome (Dave Valle)Human disease

Page 3: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Five approaches to genomics

As we survey the tree of life, consider these perspectives:

Approach I: cataloguing genomic informationGenome size; number of chromosomes; GC

content; isochores; number of genes; repetitive DNA; unique features of each genome

Approach V: Bioinformatics aspectsAlgorithms, databases, websites

Approach IV: Human disease relevance

Approach III: function; biological principles; evolutionHow genome size is regulated; polyploidization; birth and death of genes; neutral theory of

evolution; positive and negative selection; speciation

Approach II: cataloguing comparative genomic informationOrthologs and paralogs; COGs; lateral gene transfer

Page 4: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.1Page 521

Pace (2001) described a tree of life based on small subunit rRNA sequences.

This tree shows the mainthree branches describedby Woese and colleagues.

Page 5: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of what we covered in Friday’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 6: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Learning objectives for today

1. You should be able to describe these aspects of genome analysis: criteria, resequencing, metagenomics

2. You should understand how one next-generation DNA sequencing technology is performed

3. You should be able to describe the process of genome sequencing: centers, repositories

4. You should be able to describe basic features of genome annotation for prokaryotes and eukaryotes

Page 7: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Completed genome projects (current 10/10)

Eukaryotes: 37 complete; 350 assembly, 505 in progress

Viruses: 3,110 complete

Bacteria: 1,163 complete

Archaea: 91 complete

Organellar: 2,487 complete

Metagenomics projects: 306

Page 527, 537

Page 8: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Learning objectives from Friday’s lecture(first half of Chapter 13)

• Describe five perspectives on genomics (cataloguing information, comparative genomics, biological principles, disease relevance, bioinformatics aspects)

• Provide a rough chronology of the history of life on earth (this will become easier as the course progresses)

• Provide a rough chronology of genome-sequencing projects

• Use key NCBI resources to find information about genomes

Page 9: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of today’s lecture

Introduction: 5 perspectives, history of life

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 10: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Learning objectives for today’s lecture(Chapter 13, second half)

• Explain the major applications of next-generation sequencing including resequencing and metagenomics

• Describe how next-generation sequencing technologies work (we will examine NGS data later)

• Define the major repositories of DNA sequence data and search one (the Trace Archive) for particular sequences such as human beta globin

Page 11: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

• Selection of genomes for sequencing

• Sequence one individual genome, or several?

• How big are genomes?

• Genome sequencing centers

• Sequencing genomes: strategies

• When has a genome been fully sequenced?

• Repository for genome sequence data

• Genome annotation

Overview of genome analysis

Page 537

Page 12: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Applications of Genome Sequencing

Purpose Template Example

De novo sequencing

Genome sequencing Sequencing >1000 influenza genomes

Ancient DNA Extinct Neanderthal genome

Metagenomics Human gut

Resequencing Whole genomes Individual humans

Genomic regions Assessment of genomic rearrangements or disease-associated regions

Somatic mutations Sequencing mutations in cancer

Transcriptome Full-length transcripts Defining regulated messenger RNA transcriptsSerial Analysis of

Gene Expression (SAGE)

Noncoding RNAs Identifying and quantifying microRNAs in samples

Epigenetics Methylation changes Measuring methylation changes in cancer

Table 13.15 p.538

Page 13: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.8p.539

Overview of genome analysis

Page 14: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Criteria include:

• genome size (some plants are >>>human genome)• cost• relevance to human disease (or other disease)• relevance to basic biological questions• relevance to agriculture

Criteria for selecting genomes for sequencing

Page 538

Page 15: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Criteria include:

• genome size (some plants are >>>human genome)• cost• relevance to human disease (or other disease)• relevance to basic biological questions• relevance to agriculture

Recent projects:Chicken Fungi (many)Chimpanzee Honey beeCow Sea urchinDog Rhesus macaque

Page 540

Criteria for selecting genomes for sequencing

Page 16: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Selection of genomes for sequencing is basedon specific criteria.

For an overview, see a series of white papers posted on the National Human Genome Research Institute (NHGRI) website: http://www.genome.gov/10002154

For a description of NHGRI selection criteria, visit:http://www.genome.gov/10001495

Selection criteria

Page 540

Page 17: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Sequence one individual genome, or several?

Try one…

--Each genome center may study one

chromosome from an organism

--It is necessary to measure polymorphisms

(e.g. SNPs) in large populations

For viruses, thousands of isolates may be sequenced.

For the human genome, cost is the impediment.

Page 540

Criteria for selecting genomes for sequencing

Page 18: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

How big are genomes?

Viral genomes: 1 kb to 350 kb (Mimivirus: 1181 kb)

Bacterial genomes: 0.5 Mb to 13 Mb

Eukaryotic genomes: 8 Mb to 686 Gb (human: ~3 Gb)

Diversity of genome sizes

Page 540

Page 19: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

viruses

plasmids

bacteria

fungi

plants

algae

insects

mollusks

reptiles

birds

mammals

Genome sizes in nucleotide base pairs

104 108105 106 107 10111010109

The size of the humangenome is ~ 3 X 109 bp;almost all of its complexityis in single-copy DNA.

The human genome is thoughtto contain ~30,000-40,000 genes.

bony fish

amphibians

http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

Page 20: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Entrez Genomes at NCBI (September 2006): sizes of 197 sequenced eukaryotic genomes

0

1000

2000

3000

4000

0 100 200

genome size (megabases)

number

Updated 9/06

Page 21: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Genus, species Subgroup Size (Mb) #chr common name

Macropus eugenii Mammals 3800 8 tammar wallaby

Oryctolagus cuniculus Mammals 3500 22 rabbit

Cavia porcellus Mammals 3400 31 guinea pig

Pan troglodytes Mammals 3100 24 chimpanzee

Homo sapiens Mammals 3038 23 human

Bos taurus Mammals 3000 30 cow

Dasypus novemcinctus Mammals 3000 32 nine-banded armadillo

Loxodonta africana Mammals 3000 28 African savanna elephant

Sorex araneus Mammals 3000 European shrew

Rattus norvegicus Mammals 2750 21 rat

Canis familiaris Mammals 2400 39 dog

Zea mays Land Plants 2365 10 corn

Aplysia californicaOther Animals 1800 17 California sea hare

Danio rerio Fishes 1700 25 zebrafish

Gallus gallus Birds 1200 40 chicken

Triphysaria versicolor Land Plants 1200 plant parasite

16 eukaryotic genome projects > 1000 megabases

Page 22: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Entrez Genomes at NCBI (September 2006): 100 smallest sequenced eukaryotic genomes

0

10

20

30

40

50

0 50 100

These 100 genomes consist of fungi (n=76), plants (n=2), and protists (n=22).

genome size (megabases)

number

Updated 9/06

Page 23: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Smallest eukaryotic genome projects

Name Group Size (Mb) #Chr

Encephalitozoon cuniculi Fungi 2.5 11

Antonospora locustae Fungi 2.9

Leishmania major Friedlin Protists 5.44 36

Pneumocystis carinii Fungi 8

Cryptosporidium hominis Protists 8.74 8

Eremothecium gossypii Fungi 8.74 7

Cryptosporidium parvum Iowa Protists 9.1 8

Pichia angusta Fungi 9.5 6

Updated 10/06

Page 24: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Entrez Genomes at NCBI (September 2006): sizes of 629 sequenced prokaryotic genomes

number

988 prokaryotic genomes are listed in Entrez Genomes

genome size (megabases)

Updated 9/06

Page 25: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Ancient DNA projects

Special challenges:

• Ancient DNA is degraded by nucleases• The majority of DNA in samples derives from unrelated organisms such as bacteria that invaded after death• The majority of DNA in samples is contaminated by human DNA• Determination of authenticity requires special controls, and analysis of multiple independent extracts

Page 542

Page 26: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Metagenomics projects

Two broad areas:

• Environmental (ecological) e.g. hot spring, ocean, sludge, soil

• Organismal e.g. human gut, feces, lung

Page 543

Page 27: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of today’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 28: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Sanger sequencing

Introduced in 1977

A template is denatured to form single strands, and extended with a polymerase in the presence of dideoxynucleotides (ddNTPs) that cause chain termination.

Typical read lengths are up to 800 base pairs. For the sequencing of Craig Venter’s genome (2008), Sanger sequencing was employed because of its relatively long read lengths.

Page 544

Page 29: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Source: http://www.bio.davidson.edu/Courses/Bio111/seq.html

Sanger sequencing: chain termination using dideoxynucleotides

Page 30: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Pyrosequencing

Advantages:• Very fast• Low cost per base• Large throughput; up to 40 megabases/epxeriment• No need for bacterial cloning (with its associated artifacts); this is especially helpful in metagenomics• High accuracy

Disadvantages:• Short read lengths (soon to be extended to ~500 bp)• Difficulty sequencing homopolymers accurately

Page 545

Page 31: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Page 546

Page 32: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org
Page 33: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Cycle termination sequencing (Solexa)

(1) DNA is fragmented, adaptors are added(2) Single stranded DNA fragments attached to cell(3) DNA polymerase, dNTPs for “bridge amplification”(4) Double-stranded bridges formed(5) Four labeled reversible terminators added (with primer and polymerase); one single terminator is added per cycle(6) Laser excitation to record the base(7) Reversible terminators are removed. All four labeled reversible terminators and polymerase are added for the next cycle.

Page 547

Page 34: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Cycle termination sequencing (Solexa)

Disadvantage:• Short read length (~36 bases; currently in 2009 ~75 bases)

Advantages:• Very fast• Low cost per base• Large throughput; up to 1 gigabase/epxeriment• Short read length makes it appropriate for resequencing projects• No need for gel electrophoresis• High accuracy• All four bases are present at each cycle, with sequential addition of dNTPs. This allows homopolymers to be accurately read.

Page 35: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Illumina sequencing technology in 12 steps

Source: http://www.illumina.com/downloads/SS_DNAsequencing.pdf

Page 36: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification

Randomly fragment genomic DNA and ligate adapters to both ends of the fragments

adapters

DNA

Page 37: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification

adapter

Bind single-stranded fragments randomly to the inside surface of the flow cell channels

adapterDNA fragment

dense lawn of primers

Page 38: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification

Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification

Page 39: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification

Attached terminus

The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate

terminusfree

Attached terminus

Page 40: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification

Attached

Attached

Denaturation leaves single-stranded templates anchored to the substrate

Page 41: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

1. Prepare genomic DNA

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature the double- stranded molecules

6. Complete amplification Clusters

Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell

Page 42: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align dataLaser

The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase

Page 43: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align data

Page 44: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

The next cycle repeats the incorporation of four labeled reversible terminators, primers, and DNA polymerase

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align dataLaser

Page 45: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

After laser excitation the image is captured as before, and the identity of the second base is recorded.

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align data

Page 46: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align data

Page 47: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Reference sequence

The data are aligned and compared to a reference, and sequencing differences are identified.

7. Determine first base

8. Image first base

9. Determine second base

10. Image second chemistry cycle

11. Sequencing over multiple chemistry cycles

12. Align data

Known SNP called

Unknown variant identified and called

Page 48: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of today’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 49: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

20 Genome sequencing centers contributedto the public sequencing of the human genome.

Many of these are listed at the Entrez genomes site.(Or see Table 19.3, page 803.)

Overview of genome analysis

Page 548

Page 50: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Whole genome shotgun sequencing (Celera)

Hierarchical shotgun sequencing (public consortium)

Two approaches to genome sequencing

Page 51: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Whole Genome Shotgun (from the NCBI website)

An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of thesefragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method isapplied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.

Page 548

Two approaches to genome sequencing

Page 52: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Human genome project: strategies

Whole genome shotgun sequencing (Celera)

-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing-- the approach was validated using Drosophila

Page 53: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Hierarchical shotgun methodAssemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.

A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region.

Two approaches to genome sequencing

Page 548

Page 54: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Hierarchical shotgun sequencing (public consortium)

-- 29,000 BAC clones-- 4.3 billion base pairs-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome-- individual chromosomes assigned to centers

Two approaches to genome sequencing

Page 55: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Source: IHGSC (2001)

Page 56: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Source: IHGSC (2001)

Page 57: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Source: IHGSC (2001)

Page 58: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 19.8Page 804Source: IHGSC (2001)

Sequenced-clone contigs are merged to form scaffolds of known order and orientation

Page 59: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

A typical goal is to obtain five to ten-fold coverage.

Finished sequence: a clone insert is contiguouslysequenced with high quality standard of error rate0.01%. There are usually no gaps in the sequence.

Draft sequence: clone sequences may contain severalregions separated by gaps. The true order andorientation of the pieces may not be known.

When has a genome been fully sequenced?

Page 549

Page 60: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

When has a genome been fully sequenced?

Fold coverage % sequenced0.25 220.5 390.75 531 632 87.53 954 98.25 99.46 99.757 99.918 99.979 99.9910 99.995

When has a genome been fully sequenced?

Page 551

Page 61: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Raw data from many genome sequencing projectsare stored at the trace archive at NCBI or EBI (~2b traces).

http://trace.ensembl.org/

Trace repository for genome sequence data

Page 552

Page 62: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.12Page 553

Page 63: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.12Page 553

Page 64: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Blastn search of human trace archive with human RBP4 (10/05):block-like structure of output reflects exons in the gene

Page 65: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org
Page 66: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org
Page 67: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Role of comparative genomics: utility of close versus distant comparisons

Phylogenetic footprinting

Phylogenetic shadowing

Population shadowing

Page 552

Page 68: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.13Page 554

Page 69: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Outline of today’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 70: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Fig. 13.14Page 555

Page 71: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Information content in genomic DNA includes:

-- nucleotide composition (GC content)

-- repetitive DNA elements

-- protein-coding genes, other genes

These topics will be discussed later

Genome annotation

Page 555

Page 72: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

20 30 40 50 60 70 80

GC content (%)

Vertebrates

Invertebrates

Plants

Bacteria

3

5

10

Nu

mb

er o

f sp

ecie

sin

eac

h G

C c

lass

5

10

5

GC content varies across genomes

Fig. 13.15Page 556

Page 73: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Entrez Genomes at NCBI: GC content of 751 sequenced prokaryotic genomes

GCcontent

(%)

number

Updated 9/06

Page 74: Introduction to Genomics and the Tree of Life (part 2) Monday, October 25, 2010 Genomics 260.605.01 J. Pevsner pevsner@kennedykrieger.org

Wednesday we discuss viruses (Chapter 14). Topics include HIV and the flu virus.

Friday we turn to bacteria (lecture by Egbert Hoiczyk), then Monday discuss Chapter 15 (bacteria and archaea).

Next in the course