Post on 18-Dec-2015

Bioinformatics

• Genome sequencing projects

• Hierarchical and Shotgun approaches

• Genome assembly

• TIGR Assembler

• Ensembl

Lecture 14

Genome size

Mammalian genome ~ 3 gigabases (3 Gb) = 3 x 10^9 base pairs

How many books are needed to print the entire mammalian genome?

1,500 letters per page x 1,000 pages per book x 2,000 books = 3 x 10^9 letters

Assuming 5 cm per book, this shelf is ~100 meters long!
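The bookshelf estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the constant names are chosen here for illustration.

```python
# Back-of-the-envelope check of the bookshelf estimate.
GENOME_BP = 3 * 10**9        # ~3 Gb mammalian genome, one letter per base pair
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5

letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
books_needed = GENOME_BP // letters_per_book
shelf_length_m = books_needed * BOOK_THICKNESS_CM / 100

print(books_needed, "books,", shelf_length_m, "m of shelf")
```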

Genome sequencing: the problem

• Sequencing read lengths vary with several parameters, but 600 to 800 nucleotides is a good estimate. To sequence much larger fragments, or even a whole genome, essentially two strategies have been designed.

• a) The hierarchical approach. Depending on the cloning vector used, BAC, YAC, cosmid, or other libraries of cloned inserts (contigs) are created. The insert size ranges from tens to hundreds of thousands of base pairs. Sub-fragments obtained by restriction digestion are mapped to build a single contig, from which a minimal set of sub-fragments can be selected and sequenced, thus limiting sequence redundancy.

• b) The shotgun approach. This can be applied to a DNA sequence of any size, including the whole genome. DNA is randomly fragmented by sonication or shearing. Following fragmentation and enzymatic end repair, the DNA fragments are ligated to a plasmid vector and a bacterial host is transformed to produce a library. Clones taken at random from the library are then sequenced from both ends using two universal primers. At this stage a shotgun is characterised by its depth, i.e. the cumulative length of sequence determined divided by the length of the fragment or genome to be sequenced. For example, for a target with an estimated size of 4 Mb, a 10X shotgun corresponds to about 60,000 reads with a mean length of 650 nt (60,000 x 650 nt ≈ 39 Mb, roughly 10 x 4 Mb). The resulting sequences are assembled into a single contig representing the whole fragment by sequence comparison using appropriate bioinformatics programs. The final stage, or “polishing stage”, corresponds to the elimination of gaps and other possible problems.
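The depth definition above (cumulative read length divided by target length) can be written as a short calculation. The function names below are illustrative only, and the sketch assumes every read contributes its full mean length.

```python
import math

def shotgun_depth(n_reads, mean_read_len, target_len):
    """Depth = cumulative length of sequenced reads / length of the target."""
    return n_reads * mean_read_len / target_len

def reads_needed(depth, mean_read_len, target_len):
    """Smallest number of reads whose cumulative length reaches the given depth."""
    return math.ceil(depth * target_len / mean_read_len)

# The slide's example: a 4 Mb target, 650 nt mean read length.
print(shotgun_depth(60_000, 650, 4_000_000))   # depth reached by 60,000 reads
print(reads_needed(10, 650, 4_000_000))        # reads required for exactly 10X
```

With these numbers, 60,000 reads give a depth just under 10X, which is why the slide says "about" 60,000.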

Shotgun approach

Genome assembly

Assembly of contiguous DNA sequences

• Sequencing projects have rapidly moved to using the two approaches sequentially.

• For example, the construction of a BAC map covering an entire genome or chromosome is followed by a shotgun strategy to sequence a minimal set of BACs.

• The change introduced by J. Craig Venter was the size of the DNA fragment or genome that was directly shotgunned. Increasing the size of shotgun projects depended on the development of robots adapted to high-throughput projects and of bioinformatics programs that solve two major problems.

• The first is a quantitative problem: the capacity to store, compare, and retrieve millions of reads corresponding to billions of nucleotides. This is the database (DB) problem.

• The second is related to the presence of numerous repeat sequences that are often longer than the mean read length, which complicates correct assembly. This is the assembly problem.

Fragment assembly problem

• The Shortest Superstring Problem, while challenging in itself, is a simplified abstraction: a real assembler must also take three other difficulties into consideration.

• 1. Sequence data are not perfect, and reads may contain errors.

• 2. Presence of numerous repeats. There are about a million copies of the ~300 bp Alu repeat, and many other repeats. Fortunately, copies of a repeat may differ slightly because of mutations.

• 3. As DNA is double-stranded, the orientation of each read is unknown: it is not known which strand should be used in the reconstruction.
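The third difficulty is usually handled by considering each read together with its reverse complement. The following is a minimal sketch, assuming reads contain only A, C, G, T; the helper names are made up for this example.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def orientations(read):
    """Both ways a read may need to be placed in the reconstruction."""
    return (read, reverse_complement(read))

print(orientations("ATGC"))
```

An assembler that compares only forward strands would miss every overlap between reads taken from opposite strands, so overlap detection has to test both orientations.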

• Most fragment assembly algorithms include the following three steps:

• Overlap. The problem is to find the best match between the suffix of one sequence and the prefix of another. The difficulties above force the use of variations of the dynamic programming algorithm combined with filtration methods.

• Layout. This is the hardest step in DNA assembly, and it becomes even more computationally demanding as the number of fragments increases. The most difficult part is deciding whether two fragments with a good overlap really overlap or represent a repeat or something else.

• Consensus. This step is devoted to finding the most frequent character at each position of the layout constructed in the layout step. More sophisticated algorithms align substrings in small windows along the layout, or use a mosaic of the best (highest probabilistic score) segments from the layout.
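The three steps can be illustrated with a toy example. The sketch below makes strong simplifying assumptions that real assemblers cannot make: overlaps are exact (no sequencing errors), all reads come from the same strand, and there are no repeats. Layout is done greedily by always merging the pair with the longest overlap, which is one common heuristic rather than the only possible one. All function names are illustrative.

```python
from collections import Counter

def overlap_len(a, b, min_len=3):
    """Overlap step: longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Layout step (greedy): repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)

def consensus(layout):
    """Consensus step: most frequent base in each column of the layout.

    layout is a list of (offset, read) pairs; columns are assumed contiguous.
    """
    columns = {}
    for offset, read in layout:
        for pos, base in enumerate(read, start=offset):
            columns.setdefault(pos, Counter())[base] += 1
    return "".join(columns[p].most_common(1)[0][0] for p in sorted(columns))

print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))
# One read in this layout carries an error (C instead of G); the vote removes it.
print(consensus([(0, "ATGCGT"), (2, "GCGTAC"), (2, "GCCTAC"), (5, "TACGGA")]))
```

The pairwise overlap loop is quadratic in the number of reads, which is why practical assemblers add a filtration step before running any alignment.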

Genome assembly from smaller sequence fragments

TIGR Assembler

• TIGR Assembler is open-source software.

• The TIGR Assembler is a sequence fragment assembly program building contigs from small sequence reads.

• It is versatile, offering a wide variety of options for tuning the assembly process and analyzing sequence data. The current assembly engine uses a greedy algorithm and heuristics to build contigs, find repeat regions, and target alignment regions.

• Sequence overlaps are detected and scored using a 32-mer hash.

• Sequence alignment and merging are done using a Smith-Waterman dynamic programming algorithm.

• Gap penalties and score values corresponding to the bases and their quality values are predefined and hard-coded into the program.
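The two ideas named above, k-mer hashing to find candidate overlaps and Smith-Waterman to score an alignment, can be sketched as follows. This is not TIGR Assembler's code: the function names, the small k used in the example (the slide mentions 32-mers), and the scoring values (match 2, mismatch -1, gap -2) are all assumptions made for illustration, and base quality values are ignored.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map every k-mer to the set of read ids in which it occurs."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index

def candidate_pairs(reads, k):
    """Pairs of read ids sharing at least one k-mer, with the shared k-mer count."""
    counts = defaultdict(int)
    for rids in kmer_index(reads, k).values():
        ordered = sorted(rids)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                counts[(ordered[a], ordered[b])] += 1
    return dict(counts)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty), keeping two DP rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

reads = ["ACGTACGTGG", "CGTGGTTTAA", "TTTTTTTTTT"]
print(candidate_pairs(reads, k=4))            # only reads 0 and 1 share 4-mers
print(smith_waterman_score(reads[0], reads[1]))
```

The hash step is cheap and only nominates pairs; the more expensive dynamic programming is then run on those candidates alone.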

Genome assembly – contigs and supercontigs alignment

• It is very difficult to produce a finished continuous sequence for a genome with the level of redundancy typical of many higher eukaryotes.

• Instead, a draft sequence of about 150,000 contigs will be generated that could be combined to give a few thousand supercontigs.

• The production, in parallel, of a dense RH map will not only facilitate the assembly of the contigs into supercontigs, but will also make it possible to order the supercontigs, a necessary step for understanding genome rearrangements and synteny.

[Figure: comparative maps of CFA5: cytogenetic/FISH map with corresponding HSA regions (11q22, 11q23, 1p32, 16q24), RH map, and meiotic map. Scale labels: 99 Mb, 650.2 cR5000, 85 cM. Markers shown include THY1, SLC2A4, DIO1, CD3E, ZUBECA6, and the CPH, FH, AHT, C05, and REN series.]

Mouse Genome: sequencing and assembly

• The mouse genome is about 14% smaller than the human genome (2.5 Gb compared with 2.9 Gb), probably due to a higher rate of deletions.

• Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny.

• The sequencing strategy included four approaches: 1) construction of a BAC-based physical map by fingerprinting and sequencing the clone ends; 2) Whole-Genome Shotgun (WGS) sequencing to ~7-fold coverage and assembly to generate an initial draft; 3) hierarchical shotgun sequencing of BAC clones combined with WGS to create a hybrid WGS-BAC assembly; 4) production of finished sequence by using the BAC clones as templates for direct finishing.

• About 41 million reads were generated by the project participants, of which 33.6 million passed quality checks and 29.7 million were paired (reads from opposite ends of the same clone). Clone inserts provide ~47-fold physical coverage of the genome.

• Genome assembly was achieved using two newly developed programs, Arachne and Phusion.

• The assembly contains 224,713 contigs, connected into 7,418 supercontigs. The 200 largest supercontigs span more than 98% of the assembled sequence, of which 3% is within sequence gaps.
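A few ratios follow directly from the figures quoted above. The sketch below only restates the slide's numbers. The last value is a rough estimate: it assumes the ~7-fold coverage refers to the 33.6 million reads that passed quality checks and to a 2.5 Gb genome, which the slide does not state explicitly.

```python
human_gb, mouse_gb = 2.9, 2.5
size_reduction = (human_gb - mouse_gb) / human_gb      # fraction by which mouse is smaller

reads_total, reads_passed, reads_paired = 41e6, 33.6e6, 29.7e6
pass_rate = reads_passed / reads_total                 # reads surviving quality checks
paired_rate = reads_paired / reads_passed              # passing reads with a mate

contigs, supercontigs = 224_713, 7_418
contigs_per_supercontig = contigs / supercontigs       # average, not a distribution

# Assumption: 7x coverage of a 2.5 Gb genome from the reads that passed QC.
implied_read_len = 7 * mouse_gb * 1e9 / reads_passed   # mean usable bases per read

print(f"{size_reduction:.1%} smaller, {pass_rate:.1%} passed, {paired_rate:.1%} paired")
print(f"{contigs_per_supercontig:.1f} contigs per supercontig on average")
print(f"~{implied_read_len:.0f} nt per read implied by 7x coverage")
```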

Ensembl: An Open-Source Tool

• Ensembl consists of two main parts:

• 1) The analysis pipeline, which regularly adds new data and analyses to the core database. The DB contains DNA sequences, predicted features on the sequences, and a complete body of evidence supporting these predictions. Ensembl known genes are therefore those predicted genes that have high similarity to genes confirmed by experimental evidence.

• 2) The API (application programming interface), which gives structured access to the data. The ease of retrieving information in a meaningful form makes the API an extremely powerful tool. The initial implementation of the API is in Perl, built upon a layer of BioPerl objects. Other implementations, in languages such as Java and Python, are also in use.

• Ensembl is based around two ideas: a golden path (the pathway through the data containing nonredundant sequence) and a virtual contig (a contig determined by the user, an arbitrary region of a chromosome).

• The NCBI and UCSC websites host systems similar to Ensembl.
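The golden path and virtual contig ideas can be pictured with a small data-structure sketch. This is not the Ensembl API: the `PathSegment` class, the `virtual_contig` function, the coordinate convention (1-based, inclusive), and the use of N for gaps are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

@dataclass(frozen=True)
class PathSegment:
    """One tile of the golden path: where a piece of a contig sits on a chromosome."""
    contig_id: str
    chrom_start: int    # 1-based, inclusive
    chrom_end: int      # 1-based, inclusive
    contig_start: int   # 1-based offset into the contig sequence
    strand: int         # +1 forward, -1 reverse-complemented

def virtual_contig(golden_path, contigs, start, end, gap_char="N"):
    """Sequence of chromosome region [start, end], stitched from golden-path tiles.

    Positions not covered by any tile are filled with gap_char.
    """
    out = [gap_char] * (end - start + 1)
    for seg in golden_path:
        lo, hi = max(start, seg.chrom_start), min(end, seg.chrom_end)
        if lo > hi:
            continue  # tile lies entirely outside the requested region
        seg_len = seg.chrom_end - seg.chrom_start + 1
        first = seg.contig_start - 1
        piece = contigs[seg.contig_id][first:first + seg_len]
        if seg.strand == -1:
            piece = piece.translate(_COMPLEMENT)[::-1]
        # piece now reads along the chromosome, starting at seg.chrom_start
        sub = piece[lo - seg.chrom_start: hi - seg.chrom_start + 1]
        out[lo - start: hi - start + 1] = list(sub)
    return "".join(out)

contigs = {"c1": "AAAACCCC", "c2": "GGGGTTTT"}
path = [PathSegment("c1", 1, 8, 1, +1), PathSegment("c2", 11, 18, 1, -1)]
print(virtual_contig(path, contigs, 5, 14))
```

The point of the design is that the user asks for a chromosome region and never needs to know which underlying contigs, orientations, or gaps it spans.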