Computational and Evolutionary Molecular Biology
Yun S. Song
UC Berkeley, CS Grad Visit Day
March 17, 2008
Introduction Short-read resequencing Coalescent Theory DE
Biosystems & Computational Biology
CS faculty members:
Ruzena Bajcsy, Brian A. Barsky, Jerome A. Feldman, Michael I. Jordan (coordinator), Richard Karp, Jitendra Malik, Elchanan Mossel,
Christos Papadimitriou, Lior Pachter, Satish Rao, Stuart J. Russell, Yun S. Song, Bernd Sturmfels, Katherine Yelick
In what follows, I will give a brief overview of my group’s research interests.
Computational Biology
The use of computational and mathematical techniques to address questions arising from biology.
Examples: forensic DNA analysis, sequence alignment, genome-wide disease gene mapping, phylogenetics.
Mathematical Population Genetics
The study of the evolutionary forces (such as population history, natural selection, and recombination) that produce and maintain genetic variation within species.
Tightly linked to many branches of mathematics.
Tools used
Algorithms, graph theory, combinatorics, stochastic processes, dynamical systems, machine learning, signal processing, reconfigurable computing
High-throughput short-read resequencing
Genomic sequencing technology entered a revolutionary phase in 2007.
High-throughput short-read sequencing technology can deliver fast and cost-effective generation of immense amounts of sequence data.
For instance, we can now sequence a Drosophila genome in a few days and a human genome in a few weeks using the Solexa/Illumina platform.
The main and immediate challenge
The assembly and analyses of short-read resequencing data.
We want to develop efficient algorithms and computational infrastructure to meet that challenge, addressing the sheer volume and nature of short-read data.
Short-read sequencing
[Figure: pipeline in three steps: (1) randomly fragment genomic DNA; (2) sequence short-reads (tens of millions to hundreds of millions of them); (3) assemble.]
In the Solexa/Illumina platform, each short-read is between 30 and 45 base-pairs long.
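The fragment-and-sequence step can be mimicked on a toy scale. The sketch below is only an illustration, not the sequencing platform's behavior: it assumes reads are drawn uniformly at random from a genome string, with lengths uniform between 30 and 45 bases, and it ignores sequencing errors and strand. The function names `sample_short_reads` and `mean_coverage` are hypothetical.

```python
import random


def sample_short_reads(genome, num_reads, min_len=30, max_len=45, seed=0):
    """Draw reads uniformly at random from a genome string.

    Assumption: read lengths are uniform on [min_len, max_len] and reads are
    error-free. Returns a list of (start_position, read_sequence) pairs.
    """
    if len(genome) < max_len:
        raise ValueError("genome must be at least max_len bases long")
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        length = rng.randint(min_len, max_len)  # both endpoints inclusive
        start = rng.randrange(len(genome) - length + 1)
        reads.append((start, genome[start:start + length]))
    return reads


def mean_coverage(reads, genome_length):
    """Average number of reads covering each base (total read bases / genome length)."""
    return sum(len(seq) for _, seq in reads) / genome_length


# Toy usage on a random 10,000-base "genome" (made-up data).
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))
toy_reads = sample_short_reads(toy_genome, num_reads=2_000)
print(f"{len(toy_reads)} reads, mean coverage "
      f"{mean_coverage(toy_reads, len(toy_genome)):.1f}x")
```

At real scale the same bookkeeping involves tens to hundreds of millions of reads, which is why the volume of data, rather than any single read, is the computational bottleneck.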
Our approach
1 Simultaneously map hundreds of short-reads onto a reference genome using FPGA.
[Figure: block diagram of one FPGA processing element (PE). Inputs: score_diag_in, score_up_in, query_char, ref_char, en, rst. Outputs: score_diag_out, score_up_out. The PE calculates the diagonal score, calculates the minimum of a = score_diag, b = score_left, c = score_up, and outputs the minimum score. Note in figure: reset lastlast one pulse before this PE starts computing on valid data.]
2 Use graph-theory-based algorithms to assemble the short reads into a complete genome, while detecting genetic variation and correcting for sequencing errors.
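The processing-element figure in step 1 names three candidate scores (diagonal, left, up) and a minimum, which matches the cell update of an edit-distance dynamic program. The sketch below is a sequential software analogue under stated assumptions: unit costs for mismatch, insertion, and deletion (suggested by the constants of 1 in the diagram, but not confirmed by the slides), and a free start position in the reference for read mapping. The names `pe_cell`, `edit_distance`, and `map_read` are hypothetical, and the code does not model the parallel hardware.

```python
def pe_cell(score_diag, score_left, score_up, query_char, ref_char):
    """One cell update: the minimum of the three candidate scores.

    Assumed costs: mismatch = 1, gap (left or up move) = 1.
    """
    a = score_diag + (1 if query_char != ref_char else 0)
    b = score_left + 1
    c = score_up + 1
    return min(a, b, c)


def _last_row(query, ref, first_row):
    """Fill the dynamic-programming table row by row and return the last row."""
    prev = list(first_row)
    for i, q in enumerate(query, start=1):
        curr = [i]  # column 0: i query characters aligned against nothing
        for j, r in enumerate(ref, start=1):
            curr.append(pe_cell(prev[j - 1], curr[j - 1], prev[j], q, r))
        prev = curr
    return prev


def edit_distance(query, ref):
    """Global edit distance between two strings."""
    return _last_row(query, ref, range(len(ref) + 1))[-1]


def map_read(read, ref):
    """Best approximate match of `read` anywhere in `ref`.

    A zero first row lets the match start at any reference position.
    Returns (score, end), where `end` is the first reference index (exclusive)
    at which a best-scoring match ends.
    """
    last = _last_row(read, ref, [0] * (len(ref) + 1))
    end = min(range(len(last)), key=last.__getitem__)
    return last[end], end


print(map_read("ACGT", "TTACGTTT"))
```

In hardware, many such cells can be evaluated in the same clock cycle, which is what makes mapping hundreds of reads simultaneously plausible; the Python loop above evaluates them one at a time.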
1000 human genomes
The National Human Genome Research Institute (NHGRI) recently announced an international collaboration to resequence 1000 human genomes from around the world. (http://www.1000genomes.org/)
1000 Drosophila genomes
My group is closely involved in a project that proposes to resequence 1000 Drosophila genomes.
A well-studied model organism.
About 20 times shorter than the human genome.
Drosophila genome is less repetitive.
Can perform direct functional analysis.
Further Computational Challenge
How are we going to analyze 1000 genomes?
Coalescent Theory (CS294-26/STAT260)
Random rooted graph with directed, weighted, marked edges.
Related to random partitions, size-biased permutations, and other combinatorial structures.
Some questions that can be addressed using the coalescent
How many ancestors at time t back in time?
Who is related to whom?
Time to the common ancestor.
The age of a mutation.
Population history.
Targets and dynamics of natural selection.
Speciation.
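Two of the questions above, the number of ancestors at time t and the time to the common ancestor, can be explored with a small simulation. The sketch below assumes the standard neutral coalescent for a constant-size population, with time measured in coalescent units so that each pair of lineages merges at rate 1; the slides do not state a specific model, so treat this as one textbook instance. The function names are hypothetical.

```python
import random


def simulate_coalescent_times(n, rng):
    """Cumulative times at which the number of lineages drops n -> n-1 -> ... -> 1.

    Assumption: while k lineages remain, the waiting time to the next merger
    is exponential with rate k * (k - 1) / 2.
    """
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
        times.append(t)
    return times


def num_ancestors_at(times, n, t):
    """Number of ancestral lineages of the sample at time t back in the past."""
    return n - sum(1 for s in times if s <= t)


def expected_tmrca(n):
    """Expected time to the most recent common ancestor: sum of 2 / (k (k - 1))."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))


rng = random.Random(0)
times = simulate_coalescent_times(10, rng)
print("ancestors at t = 0.5:", num_ancestors_at(times, 10, 0.5))
print("simulated TMRCA:", times[-1], "expected:", expected_tmrca(10))
```

Under this model the expectation telescopes to 2(1 - 1/n), so most of the time to the common ancestor is spent waiting for the last two lineages to merge.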
The coalescent with recombination
A retrospective stochastic genealogical process with collisions and branchings of lineages.
[Figure: animated sequence of frames building a genealogy step by step for four-site binary sequences such as 0111, 1110, 1011, 1101, 1100, and 0011. Successive frames add nodes H0 through H13 and mutation labels m1 through m4; intermediate frames show partially specified sequences such as ***1, 110*, *111, and 0***.]
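The "collisions and branchings" description can be illustrated by tracking only the number of lineages backward in time. The sketch below assumes a common parameterization that the slides do not state: with k lineages, collisions (coalescences) occur at total rate k(k - 1)/2 and branchings (recombinations) at total rate k * rho / 2, where rho is a scaled recombination rate. It does not track which sites each lineage carries, and the name `simulate_lineage_counts` is hypothetical.

```python
import random


def simulate_lineage_counts(n, rho, rng):
    """Simulate the lineage count backward in time until one ancestor remains.

    Assumed rates with k lineages: coalescence k * (k - 1) / 2 (k -> k - 1),
    recombination k * rho / 2 (k -> k + 1).
    Returns (history, n_coal, n_rec), where history is a list of (time, k).
    """
    k, t = n, 0.0
    history = [(t, k)]
    n_coal = n_rec = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2
        rec_rate = k * rho / 2
        total = coal_rate + rec_rate
        t += rng.expovariate(total)  # waiting time to the next event
        if rng.random() < coal_rate / total:
            k -= 1  # collision: two lineages merge into one
            n_coal += 1
        else:
            k += 1  # branching: one lineage splits into two parents
            n_rec += 1
        history.append((t, k))
    return history, n_coal, n_rec


rng = random.Random(0)
history, n_coal, n_rec = simulate_lineage_counts(5, rho=1.0, rng=rng)
print(f"{n_coal} coalescences, {n_rec} recombinations, "
      f"single ancestor reached at time {history[-1][0]:.2f}")
```

Because the collision rate grows quadratically in k while the branching rate grows only linearly, the count is pulled back toward one lineage; every run satisfies n_coal - n_rec = n - 1.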
Computational biology is an active area of research at UC Berkeley.
Designated Emphasis in Computational and Genomic Biology
About 30 faculty members spanning 7 departments:
Bioengineering, Biostatistics, Chemistry, EECS, Integrative Biology, Mathematics, Statistics
(http://computationalbiology.berkeley.edu/)
Faculty members in CS affiliated with the DE
Michael I. Jordan, Richard M. Karp, Lior Pachter, Yun S. Song, Bernd Sturmfels