Large-scale genome projects Libraries Sequencing Release Assembly Annotation Closure Strategy • Sequencing DNA molecules in the Mb size range • All strategies employ the same underlying principles: Random Shotgun sequencing

Upload: abraham-richardson

Post on 11-Jan-2016


Page 1

Large-scale genome projects

Libraries

Sequencing

Release

Assembly

Annotation

Closure

Strategy

• Sequencing DNA molecules in the Mb size range

• All strategies employ the same underlying principle: random shotgun sequencing

Page 2

[Figure: shotgun sequencing workflow. Genomic DNA → shearing/sonication → subclone and sequence → shotgun reads → assembly → contigs → finishing (finishing reads) → complete sequence]
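The relationship between the number of shotgun reads and how completely they cover the target can be sketched with the standard Lander-Waterman estimates. This is a minimal illustration, not part of the original slides: the function name and the example project (100 kb target, 500 bp reads, 1600 reads) are assumptions, and the model assumes reads land uniformly at random and that any overlap can be detected.

```python
import math

def shotgun_stats(genome_bp: int, read_bp: int, n_reads: int) -> dict:
    """Idealised Lander-Waterman estimates for a random shotgun project."""
    coverage = n_reads * read_bp / genome_bp   # mean depth c = N * L / G
    frac_uncovered = math.exp(-coverage)       # chance a given base is in no read
    # Expected contig count if any overlap, however short, could be detected.
    expected_contigs = n_reads * math.exp(-coverage)
    return {
        "coverage": coverage,
        "fraction_uncovered": frac_uncovered,
        "expected_contigs": expected_contigs,
    }

if __name__ == "__main__":
    # Hypothetical project: 100 kb clone, 500 bp reads, 1600 reads.
    for name, value in shotgun_stats(100_000, 500, 1600).items():
        print(f"{name}: {value:.4g}")
```

Real projects leave more gaps than this model predicts, because cloning bias and repeats break the uniform-sampling assumption; that is what the finishing step addresses.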

Page 3

Nucleotide Database Growth

Page 4

EMBL breakdown by organism

Page 5

EMBL Release 65

Page 6

Progress on Large Sequencing Projects

Page 7

Strategies for sequencing

• How big can you go?

• Large-insert clones

• cosmids 30-40 kb

• BACs/PACs 50 - 100 kb

• Whole chromosomes

• Whole genomes

Page 8

Genome size and sequencing strategies

[Figure: sequencing strategy plotted against genome size (log Mb, axis from 0 to 4). Genomes shown: E. coli (4 Mb), S. cerevisiae (14 Mb), P. falciparum (30 Mb), C. elegans (100 Mb), D. melanogaster (170 Mb), H. sapiens (3000 Mb). Strategies shown: clone-by-clone, whole chromosome shotgun (WCS), whole genome shotgun (WGS), and WGS with clone ‘skims’.]

Page 9

[Figure: shotgun sequencing workflow. Genomic DNA → shearing/sonication → subclone and sequence → shotgun reads → assembly → contigs → finishing (finishing reads) → complete sequence]

Page 10

Strategies for sequencing

• Size and GC composition of genome

• Volume of data

• Ease of cloning

• Ease of sequencing

• Genome complexity

• dispersed repetitive sequence

• telomeres & centromeres

• Politics/Funding

Page 11

Strategies: Clone by Clone

• Simple (0.5-2 K reads)

• Few problems with repeats

• Relatively simple informatics

• Scalability

• Quality of physical map

• Fingerprint / STS maps

• End sequencing

Page 12

Strategies: Whole Chromosome shotgun (WCS)

• Requires chromosome isolation

• Moderate complexity (10’s K reads)

• Problems with repeats

• Complex informatics

• Inefficient in isolation

• Quality of physical map

• Skims of mapped clones

Page 13

Strategies: Whole Genome shotgun (WGS)

• Moderate to High complexity (10-100’s K reads)

• Problems with repeats

• Complex informatics

• Quality of physical map

• Fingerprint map

• STS markers

• End-sequences

• Skims of mapped clones
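The read counts quoted for the three strategies follow from simple coverage arithmetic. The sketch below is illustrative only: the 8-fold coverage, the 500 bp read length and the 1 Mb chromosome are assumptions chosen for the example, not figures from the slides (the clone and E. coli sizes are taken from the earlier pages).

```python
import math

def reads_needed(target_bp: int, coverage: float = 8.0, read_bp: int = 500) -> int:
    """Reads required to reach a given mean coverage of a target."""
    return math.ceil(coverage * target_bp / read_bp)

# Target sizes in bp; the chromosome size is a hypothetical round number.
TARGETS = {
    "cosmid (40 kb)": 40_000,
    "BAC (100 kb)": 100_000,
    "chromosome (1 Mb, assumed)": 1_000_000,
    "E. coli genome (4 Mb)": 4_000_000,
}

if __name__ == "__main__":
    for name, size in TARGETS.items():
        print(f"{name}: {reads_needed(size)} reads")
```

Under these assumptions the estimates come out at 640, 1,600, 16,000 and 64,000 reads, which sits within the ranges given for clone-by-clone, WCS and WGS projects.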

Page 14

Sequencing my genome

Annotation

Finishing

Production

Politics

TIME MONEY

Page 15

What do you get?

• Sequence

• incomplete v complete

• First-pass annotation

• Gene discovery

• Full annotation

• A starting point for research

DATA!!, DATA!!, and more DATA!!

Page 16

Genome annotation is central to functional genomics

Gene Knockout

Expression Microarray

RNAi phenotypes

ORFeome based functional genomics

Page 17

Page 18

Page 19

Sequencing

• Library construction

• Colony picking

• DNA preparation

• Sequencing reactions

• Electrophoresis

• Tracking/Base calling

Page 20

Libraries

• Essentially sub-cloning

• Generation of small insert libraries in a well characterised vector.

• Ease of propagation

• Ease of DNA purification

• e.g. pUC18, M13

Page 21

Libraries - testing

• Simple concepts

• Insert/Vector ratio

• Real data

• Insert size

• Sequence ….

• Simple analysis

Page 22

Sequence generation

• Pick colonies

• Template preparation

• Sequence reactions

• Standard terminator chemistry

• pUC libraries sequenced with forward and reverse primers

Page 23

Sequence generation

• Electrophoresis of products

• Old style - slab gels, 32 > 64 > 96 lanes

• New style - capillary gels, 96 lanes

• Transfer of gel image to UNIX

• Sequencing machines use a slave Mac/PC

• Move data to centralised storage area for processing

Page 24

Gel image processing

• Light-to-Dye estimation

• Lane tracking

• Lane editing

• Trace extraction

• Trace standardisation

• Mobility correction

• Background subtraction

Page 25

Pre-processing

• Base calling using Phred

• modifies SCF file

• Quality clipping

• Vector clipping

• Sequencing vector

• Cloning vector

• Screen for contaminants

• Feature mark up (repeats/transposons)
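Quality clipping relies on the per-base quality values produced at base calling. A Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows one generic way to clip a read with a sliding window; it is an illustration under assumed thresholds (window of 10 bases, mean quality of 20), not the clipping rule used by Phred or by any particular pipeline.

```python
def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def quality_clip(quals, min_q=20, window=10):
    """Return a half-open (start, end) span to keep.

    The span runs from the start of the first window whose mean quality
    reaches min_q to the end of the last such window. (0, 0) means the
    whole read is rejected.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    good = [sum(quals[i:i + window]) / window >= min_q
            for i in range(n - window + 1)]
    if not any(good):
        return (0, 0)
    first = good.index(True)
    last = len(good) - 1 - good[::-1].index(True)
    return (first, last + window)

if __name__ == "__main__":
    # Hypothetical read: poor ends, good middle.
    read_quals = [5] * 10 + [30] * 30 + [5] * 10
    print(quality_clip(read_quals))
```

Because the test is on the window mean, a few low-quality bases can survive at each edge of the kept span; a stricter rule would trim further until every base passes.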

Page 26

Page 27

Finishing

• Assembly: the process of turning raw single-pass reads into contiguous consensus sequence

• Closure: Process of ordering and merging consensus sequences into a single contiguous sequence

• Finished: sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb.

Page 28

Genome Assembly

• Pre-assembly

• Assembly

• Automated appraisal

• Manual review

Page 29

Pre-Assembly

• Convert to CAF format

• flatfile text format

• choice of assembler

• choice of post-assembly modules

• choice of assembly editor

www.sanger.ac.uk/Software/CAF

Page 30

Assembly

• Assemble using Phrap

• Read fasta & quality scores from CAF file

• Merge existing Phrap .ace file as necessary

• Adjust clipping
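The core idea of assembly, merging reads that overlap into longer contigs, can be shown with a toy greedy assembler. This is emphatically not how Phrap works: it ignores quality values, sequencing errors, reverse-strand reads and repeats, and the function names and the minimum overlap of 3 bases are assumptions made for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    # Hypothetical error-free reads from a short sequence.
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Greedy merging is easily misled by repeats, which is one reason the WCS and WGS slides list repeats as a problem.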

Page 31

Assembly appraisal

• auto-edit

• removes 70% of read discrepancies

• Remove cloning vector

• Mark up sequence features

• finish

• Identify low-quality regions

• Cover using ‘re-runs’ and ‘long-runs’

• Compare with current databases

• plate contamination
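Identifying low-quality regions amounts to scanning the consensus quality values for runs below a threshold; each run is then a candidate for a re-run or long-run. The following is a minimal sketch, assuming the consensus qualities are available as a list of integers and using a hypothetical cut-off of 30; it is not the logic of the `finish` program named on the slide.

```python
from itertools import groupby

def low_quality_regions(quals, min_q=30):
    """Return half-open (start, end) runs where quality is below min_q."""
    regions, pos = [], 0
    for is_low, run in groupby(quals, key=lambda q: q < min_q):
        length = len(list(run))
        if is_low:
            regions.append((pos, pos + length))
        pos += length
    return regions

if __name__ == "__main__":
    # Hypothetical consensus qualities with two weak stretches.
    consensus_quals = [40] * 5 + [10] * 3 + [40] * 4 + [20] * 2
    print(low_quality_regions(consensus_quals))
```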

Page 32

Manual Assembly appraisal

• Use a sequence editor (GAP/consed)

• Tools to identify Internal joins

• Tools to identify and import data from overlapping projects

• Tools to check failed or mis-assembled reads for inclusion in project

Page 33

Manual editing

• Sanger uses 100% edit strategy

• Where additional data is required:

• Check clipping

• Additional sequencing

• Template / Primer / Chemistry

• Assemble new data into project

• GAP4 Auto-assemble

• Repeat whole process

Page 34

Manual Quality Checks

• Force annotation tag consistency

• All unedited data is re-assembled using Phrap

• All high-quality discrepancies are reviewed

• Confirm restriction digest (clones)

• Check for inverted repeats

• Manually check:

• Areas of high-density edits

• Areas with no supporting unedited data

• Areas of low read coverage
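Areas of low read coverage can be found from the read placements alone. The sketch below assumes each read is described by a half-open (start, end) interval on the contig and flags positions covered by fewer than two reads; the interval representation and the threshold are assumptions for illustration.

```python
def depth_profile(contig_len, reads):
    """Per-base read depth from half-open (start, end) read placements."""
    diff = [0] * (contig_len + 1)
    for start, end in reads:
        diff[start] += 1   # depth rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for change in diff[:contig_len]:
        running += change
        depth.append(running)
    return depth

def low_coverage_positions(depth, min_depth=2):
    """Positions whose depth is below min_depth."""
    return [i for i, d in enumerate(depth) if d < min_depth]

if __name__ == "__main__":
    # Hypothetical 10 bp contig with three reads.
    depth = depth_profile(10, [(0, 6), (2, 8), (7, 10)])
    print(depth)
    print(low_coverage_positions(depth))
```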

Page 35

Gap closure

• Read pairs

• PCR reactions (long-range / combinatorial)

• Small-insert libraries

• Transposon-insertion libraries
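When contig order is unknown, combinatorial PCR tries primers from every contig end against every other. A short enumeration shows how quickly the number of candidate reactions grows. This sketch assumes one outward-facing primer per contig end and excludes pairs from the same contig; the contig names are hypothetical.

```python
from itertools import combinations

def candidate_primer_pairs(contigs):
    """All pairs of contig ends that could flank a gap between two contigs."""
    ends = [(name, side) for name in contigs for side in ("left", "right")]
    return [(a, b) for a, b in combinations(ends, 2) if a[0] != b[0]]

if __name__ == "__main__":
    for n in (3, 10, 50):
        names = [f"contig{i}" for i in range(n)]
        print(n, "contigs:", len(candidate_primer_pairs(names)), "pairs")
```

For n contigs this gives 2n(n-1) candidate pairs, which is why read-pair and mapping information is normally used first to reduce the number of reactions.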

Page 36

Gap closure - contig ordering

• Read pair consistency

• STS mapping

• Physical mapping

• Genetic mapping

• Optical mapping

• Large-insert clone

• skims

• end-sequencing
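Read pairs order contigs because the forward and reverse reads of one subclone must lie a known distance apart: if they fall in different contigs, those contigs are neighbours. The sketch below counts such links; the read naming, the dictionary layout and the minimum of two supporting pairs are assumptions for illustration, and it ignores read orientation and insert size.

```python
from collections import Counter

def contig_links(read_to_contig, pairs, min_links=2):
    """Count read pairs that bridge two different contigs.

    read_to_contig maps read name -> contig name; pairs lists the
    (forward, reverse) read names from each subclone.
    """
    links = Counter()
    for fwd, rev in pairs:
        a, b = read_to_contig.get(fwd), read_to_contig.get(rev)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1
    # Keep only joins supported by enough independent pairs.
    return {pair: n for pair, n in links.items() if n >= min_links}

if __name__ == "__main__":
    placement = {"r1.f": "c1", "r1.r": "c2", "r2.f": "c1", "r2.r": "c2",
                 "r3.f": "c2", "r3.r": "c3", "r4.f": "c1", "r4.r": "c1"}
    pairs = [("r1.f", "r1.r"), ("r2.f", "r2.r"),
             ("r3.f", "r3.r"), ("r4.f", "r4.r")]
    print(contig_links(placement, pairs))
```

Requiring more than one supporting pair guards against chimeric clones and mis-tracked lanes producing a false join.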

Page 37

Page 38

Annotation

• DNA features (repeats/similarities)

• Gene finding

• Peptide features

• Initial role assignment

• Others: regulatory regions

Page 39

Annotation of eukaryotic genomes

[Figure: from genome to function. Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA with poly-A tail → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches marked on the figure: ab initio gene prediction, comparative gene prediction, functional identification.]

Page 40

Genome analysis overview: C. elegans

Page 41

DNA features

• Similarity features

• mapping repeats

• simple tandem and inverted

• repeat families

• mapping DNA similarities

• EST/mRNAs in eukaryotes

• Duplications

• RNAs

• mapping peptide similarities

• protein similarities
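Simple tandem repeats are the easiest of these features to map directly from sequence. The sketch below uses a regular expression to find a unit of 1 to 6 bases repeated at least three times in a row; the unit size limit and the repeat threshold are assumptions, and it does not handle imperfect repeats or repeat families, which need similarity searches against a repeat library.

```python
import re

# A short unit (captured lazily, so the shortest unit wins) followed by
# at least two further copies of itself.
TANDEM = re.compile(r"([ACGT]{1,6}?)\1{2,}")

def tandem_repeats(seq):
    """Return (start, unit, copies) for each perfect tandem repeat."""
    hits = []
    for m in TANDEM.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

if __name__ == "__main__":
    print(tandem_repeats("GGCACACACACATT"))
```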

Page 42

Gene finding

• ORF finding (simple but messy)

• ab initio prediction

• Measures of codon bias

• Simple statistical frequencies

• Comparative prediction

• Using similarity data

• Using cross-species similarities
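ORF finding is "simple" because it only needs a scan for start and stop codons in all six reading frames, and "messy" because it reports many spurious ORFs and cannot cope with introns. A minimal sketch follows, assuming the standard stop codons, ATG as the only start codon and a hypothetical minimum length; coordinates for the reverse strand are given on the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_codons: int = 30):
    """Return (strand, start, end) for ATG...stop ORFs in six frames.

    start/end are half-open offsets on the scanned strand; min_codons
    counts the start codon but not the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # close it at the stop
    return orfs

if __name__ == "__main__":
    # Hypothetical toy sequence with one short ORF.
    print(find_orfs("CCATGGCTGCTGCTTAAG", min_codons=3))
```

Ab initio predictors improve on this by scoring codon usage and signal statistics rather than accepting every open frame.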

Page 43

Peptide features

• Peptide features

• low-complexity regions

• trans-membrane regions

• structural information (coiled-coil)

• Similarities and alignments

• Protein families (InterPro/COGS)
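Low-complexity regions can be flagged by measuring how varied the residues are within a sliding window. The sketch below uses Shannon entropy for this; the window length of 12 and the entropy cut-off are assumptions chosen for the example, and dedicated low-complexity filters use more refined statistics.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def low_complexity_windows(peptide: str, window: int = 12,
                           max_entropy: float = 1.0):
    """Start offsets of windows whose entropy is at or below the cut-off."""
    return [i for i in range(len(peptide) - window + 1)
            if shannon_entropy(peptide[i:i + window]) <= max_entropy]

if __name__ == "__main__":
    # Hypothetical peptide: a poly-Q stretch followed by mixed sequence.
    print(low_complexity_windows("QQQQQQQQQQQQMKTAYIAKQRDE"))
```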

Page 44

Initial role assignment

• Simple attempt to describe the functional identity of a peptide

• Uses data from:

• peptide similarities

• protein families

• Vital for data mining

• Large number of predicted genes remain hypothetical or unknown

Page 45

Other regulatory features

• Ribosomal binding sites

• Promoter regions
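In prokaryotes, a ribosome binding site can support a predicted start codon. The sketch below looks for a Shine-Dalgarno-like motif a short distance upstream of each ATG. The exact motif (AGGAGG), the 5 to 10 base spacing and the exact-match rule are simplifying assumptions; real sites vary between organisms and are usually scored with a weight matrix rather than matched exactly.

```python
def starts_with_rbs(seq: str, motif: str = "AGGAGG",
                    min_gap: int = 5, max_gap: int = 10):
    """Offsets of ATG codons with the motif min_gap..max_gap bases upstream.

    The gap is the number of bases between the end of the motif and the
    A of the ATG.
    """
    seq = seq.upper()
    hits = []
    pos = seq.find("ATG")
    while pos != -1:
        hi = pos - min_gap                          # motif must end by here
        lo = max(0, pos - max_gap - len(motif))     # and start no earlier
        upstream = seq[lo:hi] if hi > 0 else ""
        if motif in upstream:
            hits.append(pos)
        pos = seq.find("ATG", pos + 1)
    return hits

if __name__ == "__main__":
    # Hypothetical sequence: motif, 6-base spacer, then a start codon.
    print(starts_with_rbs("TTTTAGGAGGACAGCTATGAAATAA"))
```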

Page 46

Page 47

Data Release

• DNA release

• Unfinished

• Finished

• Nucleotide databases

• GenBank/EMBL/DDBJ

• Peptide databases

• SWISS-PROT/TrEMBL/GenPept

• Others