CSE182-L10 LW statistics/Assembly

Post on 21-Dec-2015


Page 1: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics

CSE182-L10

LW statistics/Assembly

Page 2:

Whole Genome Shotgun

• Break up the entire genome into pieces

• Sequence ends, and assemble using a computer

• LW statistics & Repeats argue against the success of such an approach

Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together

Page 3:

Questions

• Algorithmic: How do you put the genome back together from the pieces?

• Statistical: How many pieces do you need to sequence, etc.?

– The answer to the statistical questions had already been given, in the context of mapping, by Lander and Waterman.

Page 4:

Lander Waterman Statistics

[Figure: fragments of length L placed along a genome of length G]

• The fragments are falling randomly on the genome.

• Overlapping fragments form islands of contiguous sequence.

• Ideally, we want one island for each chromosome. How many fragments should we sequence?

Page 5:

Lander Waterman Statistics


G = Genome Length

L = Fragment Length

N = Number of Fragments

T = Required Overlap

c = Coverage = LN/G

α = N/G

θ = T/L

σ = 1-θ

Page 6:

LW statistics: questions

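• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.

• Q1: What is the expected number of islands?

• Ans: N exp(-cσ), which reduces to N exp(-c) when no overlap is required (T = 0, σ = 1).

• As more fragments are sequenced, the number of islands increases at first, and then gradually decreases.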
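The rise and fall of the island count can be checked numerically. The sketch below evaluates N exp(-cσ) with N = cG/L for a few coverage values; the function name and the example values (G = 3000 Mb, L = 500, T = 50) are illustrative assumptions, not part of the lecture.

```python
import math

def expected_islands(coverage, genome_len, frag_len, min_overlap=0):
    """Expected number of islands, N * exp(-c * sigma), with N = c * G / L."""
    num_fragments = coverage * genome_len / frag_len
    sigma = 1 - min_overlap / frag_len
    return num_fragments * math.exp(-coverage * sigma)

# Illustrative values only: G = 3000 Mb, L = 500 bp, T = 50 bp.
for c in (0.5, 1, 2, 4, 8, 10):
    print(f"c = {c:>4}: about {expected_islands(c, 3e9, 500, 50):,.0f} islands")
```

Written as a function of c, the count is (G/L) c exp(-cσ), which peaks at c = 1/σ and falls off after that.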

Page 7:

Analysis: Expected Number Islands

• Computing the expected # of islands.

• Let $X_i = 1$ if an island ends at position i, $X_i = 0$ otherwise.

• Number of islands = $\sum_i X_i$

• Expected # islands = $E(\sum_i X_i) = \sum_i E(X_i)$

Page 8:

Prob. of an island ending at i

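• $E(X_i)$ = Prob(island ends at position i)

• = Prob(clone began at position i-L+1 AND no clone began in the next L-T positions)

[Figure: a clone of length L ending at position i, with overlap T]

$$E(X_i) = \alpha (1-\alpha)^{L-T} \approx \alpha e^{-\alpha (L-T)} = \alpha e^{-c\sigma}$$

since $\alpha (L-T) = \alpha L \sigma = c\sigma$.

$$\text{Expected \# islands} = \sum_i E(X_i) = G \alpha e^{-c\sigma} = N e^{-c\sigma}$$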
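As a sanity check on the exponential approximation in this derivation, the sketch below compares the exact product Gα(1-α)^(L-T) with N exp(-cσ). The parameter values are illustrative assumptions (G = 3000 Mb, L = 500, T = 50, c = 8).

```python
import math

# Illustrative parameters, not taken from this slide.
genome_len = 3_000_000_000
frag_len = 500
min_overlap = 50
coverage = 8

num_fragments = coverage * genome_len // frag_len   # N = cG/L
alpha = num_fragments / genome_len                  # alpha = N/G
sigma = 1 - min_overlap / frag_len                  # sigma = 1 - T/L

# Exact form: G * alpha * (1 - alpha)^(L - T)
exact = genome_len * alpha * (1 - alpha) ** (frag_len - min_overlap)
# Approximate form: N * exp(-c * sigma)
approx = num_fragments * math.exp(-coverage * sigma)

print(f"exact  = {exact:,.0f}")
print(f"approx = {approx:,.0f}")
```

The two values agree to within several percent here; the approximation tightens as α gets smaller.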

Page 9:

LW statistics

• Pr[Island contains exactly j clones]?

• Consider an island that has already begun. With probability $e^{-c\sigma}$, it will never be continued. Therefore

• Pr[Island contains exactly j clones] = $(1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$

• Expected # j-clone islands = $N e^{-c\sigma} (1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$
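The number of clones in an island follows a geometric distribution with stopping probability exp(-cσ). A small sketch, with hypothetical function names and example values, confirms that the probabilities sum to 1 and tabulates the expected number of j-clone islands.

```python
import math

def prob_j_clones(j, coverage, sigma):
    """Pr[island contains exactly j clones] = (1 - p)^(j-1) * p, p = exp(-c*sigma)."""
    p_end = math.exp(-coverage * sigma)
    return (1 - p_end) ** (j - 1) * p_end

def expected_j_clone_islands(j, num_fragments, coverage, sigma):
    """Expected number of islands with exactly j clones."""
    total_islands = num_fragments * math.exp(-coverage * sigma)
    return total_islands * prob_j_clones(j, coverage, sigma)

# Example values (assumptions): c = 2, sigma = 0.9, N = 10,000 fragments.
total_prob = sum(prob_j_clones(j, 2, 0.9) for j in range(1, 500))
print(f"sum of probabilities = {total_prob:.6f}")
for j in (1, 2, 5, 10):
    print(j, round(expected_j_clone_islands(j, 10_000, 2, 0.9), 1))
```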

Page 10:

Expected # of clones in an island

• Expected # of clones in an island = $e^{c\sigma}$

Q: How? Why do we care?

Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.
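One way to read the "How?": the clone count per island is geometric with stopping probability exp(-cσ), and a geometric distribution has mean 1/p, which gives exp(cσ). The same value is N divided by the expected number of islands. The sketch below inverts that relation to estimate G from observed counts; the function name and the round-trip example are assumptions for illustration, and it takes the observed island count at face value.

```python
import math

def estimate_genome_length(num_fragments, num_islands, frag_len, min_overlap):
    """Estimate G from the observed mean number of clones per island.

    Uses mean clones per island ~ exp(c * sigma), so c ~ ln(mean) / sigma,
    and then G = L * N / c.
    """
    sigma = 1 - min_overlap / frag_len
    mean_clones = num_fragments / num_islands
    coverage = math.log(mean_clones) / sigma
    return frag_len * num_fragments / coverage

# Round trip on synthetic numbers: G = 1 Mb, L = 500, T = 50, c = 5.
true_g, frag_len, min_overlap, c = 1_000_000, 500, 50, 5
n = c * true_g / frag_len
islands = n * math.exp(-c * (1 - min_overlap / frag_len))
print(round(estimate_genome_length(n, islands, frag_len, min_overlap)))
```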

Page 11:

Expected length of an island

$$L \left[ \left( \frac{e^{c\sigma} - 1}{c} \right) + (1 - \sigma) \right]$$
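A short sketch evaluating this expression; the function name and the example values (L = 500, T = 50) are illustrative assumptions.

```python
import math

def expected_island_length(coverage, frag_len, min_overlap=0):
    """L * [ (exp(c*sigma) - 1) / c + (1 - sigma) ]"""
    sigma = 1 - min_overlap / frag_len
    return frag_len * ((math.exp(coverage * sigma) - 1) / coverage + (1 - sigma))

for c in (1, 2, 4, 8):
    print(f"c = {c}: about {expected_island_length(c, 500, 50):,.0f} bp")
```

As a sanity check, when c approaches 0 the expression approaches L: with almost no coverage, an island is a single fragment.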

Page 12:

Whole Genome Sequencing & Assembly

Page 13:

Whole Genome Shotgun

• Break up the entire genome into pieces

• Sequence ends, and assemble using a computer

• LW statistics & Repeats argue against the success of such an approach

Page 14:

Assembly Basics

• Three main components:

– Overlap

– Layout

– Consensus

Page 15:

Overlap

• Given a pair of fragments s1 and s2, do they belong together?

• Yes, if a prefix of s2 matches a suffix of s1

• How would you compute such a match?

Page 16:

Overlap

• S[i,j] = optimum score of an alignment of s1[1..i] against a suffix of s2[1..j]

• The best prefix-suffix alignment (a prefix of s1 against a suffix of s2, where n is the length of s2) is given by:

• $\max_i \{ S[i,n] \}$
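The recurrence can be sketched as a small dynamic program. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration; the lecture does not fix them.

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, i): the best alignment of a prefix s1[:i] to a suffix of s2.

    S[i][j] is the best score of aligning s1[:i] against some suffix of s2[:j].
    Only two rows of the table are kept in memory.
    """
    n = len(s2)
    prev = [0] * (n + 1)           # S[0][j] = 0: the suffix of s2 may start anywhere
    best_score, best_i = 0, 0      # i = 0 means "no overlap"
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * n  # S[i][0]: s1[:i] aligned entirely against gaps
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align s1[i] with s2[j]
                         prev[j] + gap,       # gap in s2
                         cur[j - 1] + gap)    # gap in s1
        if cur[n] > best_score:               # S[i][n]: alignment reaches the end of s2
            best_score, best_i = cur[n], i
        prev = cur
    return best_score, best_i

print(best_prefix_suffix_overlap("GATTACA", "CCCGATT"))
```

Swapping the arguments gives the other orientation (a prefix of s2 against a suffix of s1).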

Page 17:

Overlap Detection

• Compute the best prefix-suffix alignments between each pair of fragments.

• Keep the “high-scoring” ones as evidence of true overlap.

• What is the problem?

Page 18:

Overlap detection problem

• Consider the number of fragments. The LW statistics say that we need good coverage (c = 8, 10) to get most of the base-pairs.

– G = 3000 Mb, L = 500

– Coverage LN/G = 10

– N = $10 \times 3 \times 10^9 / 500 = 6 \times 10^7$

– Number of comparisons needed = $N^2 = 3.6 \times 10^{15}$

• Not good! (Only a small fraction are true overlaps)
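The arithmetic on this slide can be reproduced directly. The variable names are illustrative; the slide counts N² comparisons, and the unordered-pair count is shown alongside for comparison.

```python
# Values from the slide: G = 3000 Mb, L = 500 bp, coverage c = 10.
genome_len = 3_000_000_000
frag_len = 500
coverage = 10

num_fragments = coverage * genome_len // frag_len           # N = cG/L
all_pairs = num_fragments ** 2                               # N^2, as on the slide
unordered_pairs = num_fragments * (num_fragments - 1) // 2   # each pair counted once

print(f"N               = {num_fragments:.1e}")
print(f"N^2             = {all_pairs:.1e}")
print(f"unordered pairs = {unordered_pairs:.1e}")
```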

Page 19:

k-mer based overlap (Pigeonhole principle again)

• Consider a 25bp sequence.

– Expected number of occurrences in the genome

– $3 \times 10^9 \times 4^{-25} \approx 2.7 \times 10^{-6}$

• A 25-bp sequence is expected to be unique in the genome!

• Two overlapping sequences should share a 25-mer

• Two non-overlapping sequences should not!
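The uniqueness estimate can be checked with a couple of lines. This sketch assumes uniformly random, independent bases, which is the model implied by the 4^-k factor; real genomes with repeats violate it. The function names are hypothetical.

```python
def expected_occurrences(genome_len, k):
    """Expected number of occurrences of one fixed k-mer in a random genome."""
    return genome_len / 4 ** k

def smallest_unique_k(genome_len):
    """Smallest k for which a fixed k-mer is expected to occur less than once."""
    k = 1
    while expected_occurrences(genome_len, k) >= 1:
        k += 1
    return k

print(expected_occurrences(3_000_000_000, 25))
print(smallest_unique_k(3_000_000_000))
```

Under this model k = 25 leaves a wide safety margin over the smallest k that would already be expected to be unique.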


Page 20:

Sorting k-mers

• Build a list of k-mers that appear in the sequences and their reverse complements

• Create a record with 4 entries:

– K-mer

– Sequence number

– Position in the sequence

– Reverse complementation flag

• Sort a vector of these according to k-mer

• How many records per k-mer are expected?

• If number of records exceeds threshold, discard (why?)

[Figure: sorted table of records with columns K-mer, S.id, Pos.]
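A minimal sketch of the sort-based k-mer index described on this slide. The record layout follows the four listed entries; the function names, the `max_records` threshold, and the toy reads are assumptions for illustration.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """Build (k-mer, sequence number, position, reverse-complement flag) records, sorted by k-mer."""
    records = []
    for seq_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], seq_id, pos, is_rc))
    records.sort()
    return records

def candidate_pairs(reads, k, max_records=50):
    """Pairs of reads sharing a k-mer; over-represented k-mers are discarded."""
    pairs = set()
    for _, group in groupby(kmer_records(reads, k), key=lambda rec: rec[0]):
        group = list(group)
        if len(group) > max_records:   # too many records for one k-mer: skip it
            continue
        ids = sorted({seq_id for _, seq_id, _, _ in group})
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.add((ids[a], ids[b]))
    return pairs

reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
print(candidate_pairs(reads, 5))
```

After sorting, all records for the same k-mer are adjacent, so candidate overlaps are found in one pass instead of comparing every pair of reads.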

Page 21:

Alignment module

• Coalesce k-mer hits into longer, gap-free partial alignments.

• These extended k-mer hits are saved.

• For each pair of sequences, form a directed graph.

• For each maximal path in the graph, construct an alignment.

• Refine alignment via banded DP
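The first step, coalescing k-mer hits into gap-free segments, might look like the sketch below. It assumes hits are given as (position in sequence 1, position in sequence 2) pairs and merges hits on the same diagonal that touch or overlap; this is one plausible reading of the slide, not the assembler's actual code.

```python
from collections import defaultdict

def coalesce_hits(hits, k):
    """Merge k-mer hits on the same diagonal into gap-free segments.

    hits: iterable of (pos1, pos2) start positions of a shared k-mer.
    Returns sorted (start1, start2, length) tuples.
    """
    by_diagonal = defaultdict(list)
    for pos1, pos2 in hits:
        by_diagonal[pos1 - pos2].append(pos1)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for start in starts[1:]:
            if start <= seg_end:                 # touches or overlaps: extend
                seg_end = max(seg_end, start + k)
            else:                                # gap on this diagonal: new segment
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = start, start + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return sorted(segments)

print(coalesce_hits([(0, 3), (1, 4), (2, 5), (10, 13), (7, 2)], k=4))
```

Segments on different diagonals imply an insertion or deletion between them, which is where the graph construction and banded DP on this slide would take over.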

Page 22:

Problem 2: Size

• Islands might simply be too small in length.

• σ = (1-T/L) = (1-50/500) = 0.9, c = 8, so N = cG/L = $4.8 \times 10^7$.

• #Islands = $N e^{-c\sigma}$ ≈ 36K

• Size of an island ≈ 84K

• Not enough to make it an acceptable assembly!

• PLUS, there is the problem of Repeats, Chimerism etc.

Page 23:

Solution 2: Clones can have mate-pairs

• Recall that we sequence about 1000bp of the end of a clone

• If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.

Page 24:

Mate Pairs

• Mate-pairs allow you to merge islands (contigs) into super-contigs

Page 25:

Super-contigs are quite large

• Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small.

• Use the mate-pairs to order and orient the contigs, and make super-contigs.

Page 26:

Whole genome shotgun

• Input:

– Shotgun sequence fragments (reads)

– Mate pairs

• Output:

– A single sequence created by consensus of overlapping reads

• First generation of assemblers did not include mate-pairs (Phrap, CAP..)

• Second generation: CA, Arachne, Euler

• We will discuss Arachne, a freely available sequence assembler (2nd generation)

Page 27:

Problem 3: Repeats

Page 28:

Repeats & Chimerisms

• 40-50% of the human genome is made up of repetitive elements.

• Repeats can cause great problems in the assembly!

• Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.

Page 29:

Repeat detection

• Lander Waterman strikes again!

• The expected number of clones in a Repeat containing island is MUCH larger than in a non-repeat containing island (contig).

• Thus, every contig can be marked as Unique, or non-unique. In the first step, throw away the non-unique islands.


Page 30:

Detecting Repeat Contigs 1: Read Density

• Compute the log-odds ratio of two hypotheses:

• H1: The contig is from a unique region of the genome.

• H2: The contig is from a region that is repeated at least twice
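One common way to turn these two hypotheses into a number is to model the read count in a contig as Poisson, with mean λ = (contig length)·N/G under H1 and 2λ under a two-copy version of H2. The Poisson model and the exactly-two-copies simplification are assumptions added here; the slide does not state them. Under that model the natural-log odds simplify to λ - k ln 2 for a contig containing k reads.

```python
import math

def unique_log_odds(num_reads, contig_len, total_reads, genome_len):
    """ln( P(k reads | unique) / P(k reads | two copies) ) under a Poisson model.

    With lam = contig_len * N / G, the ratio
    e^-lam * lam^k / (e^-2lam * (2*lam)^k) has logarithm lam - k * ln 2.
    """
    lam = contig_len * total_reads / genome_len
    return lam - num_reads * math.log(2)

# Illustrative numbers: N = 6e7 reads, G = 3e9, a 10 kb contig (lam = 200).
for k in (200, 400):
    print(k, round(unique_log_odds(k, 10_000, 6e7, 3e9), 2))
```

A positive value favors the unique hypothesis; a contig with about twice the expected number of reads scores strongly negative.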

Page 31:

Detecting Chimeric reads

• Chimeric reads: Reads that contain sequence from two genomic locations.

• Good overlaps: G(a,b) if a,b overlap with a high score

• Transitive overlap: T(a,c) if G(a,b), and G(b,c)

• Find a point x across which only transitive overlaps occur. x is a point of chimerism.

Page 32:

Contig assembly

• Reads are merged into contigs up to repeat boundaries.

• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,

– shift(a,c)=shift(a,b)+shift(b,c)

• Most of the contigs are unique pieces of the genome, and end at some Repeat boundary.

• Some contigs might be entirely within repeats. These must be detected

Page 33:

Creating Super Contigs

Page 34:

Supercontig assembly

• Supercontigs are built incrementally.

• Initially, each contig is a supercontig.

• In each round, a pair of super-contigs is merged, until no more merges can be performed.

• Create a Priority Queue with a score for every pair of ‘mergeable supercontigs’.

– Score has two terms:

  • A reward for multiple mate-pair links

  • A penalty for distance between the links.

Page 35:

Supercontig merging

• Remove the top scoring pair (S1,S2) from the priority queue.

• Merge (S1,S2) to form supercontig T.

• Remove all pairs in Q containing S1 or S2.

• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.

• Detect Repeated Supercontigs and remove them.
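The merging loop can be sketched with a heap-based priority queue. Everything specific here is an assumption for illustration: the score (link count minus a distance penalty), the `min_links` cutoff, string contig ids, and the way links are combined after a merge. Ordering and orienting contigs inside a supercontig and the repeat-detection step are left out.

```python
import heapq
from itertools import count

def merge_supercontigs(contigs, links, min_links=2, dist_penalty=0.001):
    """Greedy supercontig merging.

    contigs: list of string ids.
    links: {(a, b): (num_mate_pair_links, estimated_distance)}.
    Returns a list of supercontigs, each a list of contig ids.
    """
    members = {c: [c] for c in contigs}      # live supercontig id -> its contigs
    neighbors = {c: {} for c in contigs}     # id -> {other id: (links, distance)}
    for (a, b), info in links.items():
        neighbors[a][b] = info
        neighbors[b][a] = info

    def score(num_links, distance):
        return num_links - dist_penalty * distance   # reward links, penalize distance

    tie = count()                            # tie-breaker so ids are never compared
    queue = []
    for (a, b), (n, d) in links.items():
        if n >= min_links:
            heapq.heappush(queue, (-score(n, d), next(tie), a, b))

    while queue:
        _, _, s1, s2 = heapq.heappop(queue)  # top-scoring pair
        if s1 not in members or s2 not in members:
            continue                         # stale pair: S1 or S2 already merged
        t = s1 + "+" + s2
        members[t] = members.pop(s1) + members.pop(s2)
        merged = {}
        for s in (s1, s2):                   # pool the links of S1 and S2
            for w, (n, d) in neighbors.pop(s).items():
                if w in (s1, s2):
                    continue
                prev_n, prev_d = merged.get(w, (0, d))
                merged[w] = (prev_n + n, min(prev_d, d))
        neighbors[t] = merged
        for w, (n, d) in merged.items():     # insert (T, W) for every linked W
            neighbors[w].pop(s1, None)
            neighbors[w].pop(s2, None)
            neighbors[w][t] = (n, d)
            if n >= min_links:
                heapq.heappush(queue, (-score(n, d), next(tie), t, w))
    return list(members.values())

links = {("A", "B"): (5, 1000), ("B", "C"): (3, 2000), ("C", "D"): (1, 500)}
print(sorted(merge_supercontigs(["A", "B", "C", "D"], links)))
```

Instead of deleting pairs that contain S1 or S2 from the queue, the sketch skips them when they are popped (lazy deletion), which has the same effect with a plain binary heap.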

Page 36:

Repeat Supercontigs

• If the distance between two super-contigs is not correct, they are marked as Repeated

• If transitivity is not maintained, then there is a Repeat

Page 37:

Filling gaps in Supercontigs

Page 38:

Consensus Derivation

• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
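A deliberately simplified sketch of consensus calling: it assumes the reads have already been placed on a common coordinate axis without gaps (an assumption; a real assembler builds a gapped multiple alignment from the pairwise alignments) and takes a majority vote in each column. The function name and toy layout are hypothetical.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus.

    layout: list of (offset, read) pairs, where offset is the read's start
    position on the contig. Columns covered by no read are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

layout = [(0, "ACGTAC"), (2, "GTACGG"), (4, "ACGGTT"), (3, "TTCG")]
print(consensus(layout))
```

In the toy layout the last read carries one mismatch, which is outvoted by the three reads that agree at that position.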

Page 39:

Summary

• Whole genome shotgun is now routine:

– Human, Mouse, Rat, Dog, Chimpanzee..

– Many Prokaryotes (One can be sequenced in a day)

– Plant genomes: Arabidopsis, Rice

– Model organisms: Worm, Fly, Yeast

• A lot is not known about genome structure, organization and function.

– Comparative genomics offers low hanging fruit