Bioinformatics for high-throughput DNA sequencing Gabor Marth Boston College Biology New grad student orientation Boston College September 8, 2009

Page 1: Bioinformatics for high-throughput DNA sequencing Gabor Marth Boston College Biology New grad student orientation Boston College September 8, 2009

Bioinformatics for high-throughput DNA sequencing

Gabor Marth

Boston College Biology

New grad student orientation

Boston College

September 8, 2009

Page 2

DNA sequence variations

The Human Genome Project has determined a reference sequence of the human genome

However, every individual is unique, and is different from others at millions of nucleotide locations

Page 3

Why do we care about variations?

underlie phenotypic differences

cause inherited diseases

allow tracking ancestral human history

Page 4

Human genetic variation

Page 5

The first “famous” genomes

Page 6

Genome sequencing

~1 Mb ~100 Mb >100 Mb ~3,000 Mb

Page 7

New sequencing technologies…

Page 8

Next-gen sequencing – a revolution

Figure: sequencing platforms plotted by read length (x-axis, 10 bp to 1,000 bp) against bases per machine run (y-axis, 1 Mb to 100 Gb)

Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)

Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)

ABI capillary sequencer

Page 9

The re-sequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) SNP and short INDEL calling

(iv) SV calling

(v) data viewing, hypothesis generation

Figure labels: REF, IND, GigaBayes

Page 10

Tools

Page 11

2. Read mapping

Read mapping is like a jigsaw puzzle…

…you get the pieces…

…and they give you the picture on the box

Big and unique pieces are easier to place than others…

Page 12

The MOSAIK read mapping program

• Reads from repeats cannot be uniquely mapped back to their true region of origin

Michael Strömberg (Wan-Ping Lee)
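The repeat problem can be seen in a toy example. The sketch below is not how MOSAIK works; it is a minimal exact-match mapper built on a k-mer index, with a made-up reference sequence and hypothetical function names (`build_index`, `map_read`), meant only to show why a read drawn from a repeat comes back with more than one candidate location.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, index):
    """Return all exact-match start positions for a read of length k."""
    return index.get(read, [])


# Toy reference (an assumption for illustration): the motif "ACGTACGT"
# appears at both ends, while "TTGCA" appears only once.
reference = "ACGTACGTTTGCAGGACGTACGT"
index = build_index(reference, k=5)

unique_hits = map_read("TTGCA", index)  # read from a unique region
repeat_hits = map_read("ACGTA", index)  # read from the repeated motif

print("unique read maps to:", unique_hits)
print("repeat read maps to:", repeat_hits)
```

The unique read has a single placement, while the repeat read has two equally good ones, so its true region of origin cannot be recovered from the sequence alone. Longer reads (bigger puzzle pieces) reduce this ambiguity because they are more likely to extend past the repeat into unique sequence.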

Page 13

SNP discovery

GigaBayes

Marth et al. Nature Genetics 1999

Quinlan et al. in prep.

(Amit Indap, Wen Fung Leong)

Page 14

Structural variation discovery

Navigation bar

Fragment lengths in selected region

Depth of coverage in selected region

Stewart et al. in prep. (Deniz Kural, Jiantao Wu)
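One common way to use fragment lengths for structural variation discovery is to look for read pairs whose mapped distance is far from what the library should produce. The sketch below illustrates that general idea only; it is not the method of the tool shown on this slide. The function name, the threshold of 5 median absolute deviations, and the fragment lengths are all assumptions made up for the example.

```python
from statistics import median


def flag_anomalous_pairs(fragment_lengths, n_mads=5.0):
    """Return indices of read pairs whose mapped fragment length lies
    more than n_mads median absolute deviations from the library median."""
    med = median(fragment_lengths)
    mad = median(abs(x - med) for x in fragment_lengths)
    mad = max(mad, 1.0)  # guard against a zero spread in tiny inputs
    return [i for i, x in enumerate(fragment_lengths)
            if abs(x - med) > n_mads * mad]


# Hypothetical mapped fragment lengths (bp) in a selected region.
lengths = [200, 205, 195, 210, 198, 202, 1200, 199, 203, 40]
outliers = flag_anomalous_pairs(lengths)

for i in outliers:
    print(f"pair {i}: fragment length {lengths[i]} bp looks anomalous")
```

Pairs that map much farther apart than expected are consistent with a deletion in the sequenced individual relative to the reference, and pairs that map much closer together are consistent with an insertion. A real caller would require several supporting pairs and would combine this signal with depth of coverage.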

Page 15

Sequence alignment viewers

Huang et al. Genome Research 2008 (Derek Barnett)

Page 16

Data mining

Page 17

Mutational profiling in deep 454 data

• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (biofuel production)

• one specific mutagenized strain had especially high conversion efficiency

• goal was to determine where the mutations were that caused this phenotype

• we analyzed 10 runs of 454 data (~3 million reads, ~20x coverage of the 15 Mb genome)

Pichia stipitis reference sequence

• found 39 mutations

• informatics analysis in < 24 hours (including manual checking of all candidates)

Image from JGI web site

Smith et al. Genome Research 2008
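The coverage figure on this slide follows from simple arithmetic: mean coverage is the number of reads times the mean read length, divided by the genome size. The slide does not state the read length, so the sketch below back-calculates the value implied by the quoted numbers. The function names are hypothetical and the result is an inference from the slide's rounded figures, not a reported measurement.

```python
def mean_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Mean depth of coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


def implied_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length (bp) that yields the given coverage."""
    return coverage * genome_size_bp / n_reads


# Rounded figures quoted on the slide.
n_reads = 3_000_000       # ~3 million 454 reads
genome = 15_000_000       # ~15 Mb Pichia stipitis genome
coverage = 20             # ~20x

read_len = implied_read_length(n_reads, coverage, genome)
print(f"implied mean read length: {read_len:.0f} bp")
print(f"check: {mean_coverage(n_reads, read_len, genome):.0f}x coverage")
```

The same two functions can be reused for project design questions, for example how many reads are needed to reach a target depth on a genome of a given size.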

Page 18

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) were collected by Washington Univ.

SNP

• we found 45,000 SNPs with a very high validation rate

Hillier et al. Nature Methods 2008
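To give a feel for what calling a SNP from short-read coverage involves, here is a deliberately simplified Bayesian sketch. It is not the model used by GigaBayes or in the cited study. It assumes a single true base per site (no heterozygotes), independent reads, a uniform per-base error rate, and a made-up prior SNP rate; the function name and all parameter values are hypothetical.

```python
import math

BASES = "ACGT"


def genotype_posteriors(pileup, ref_base, error_rate=0.01, snp_prior=0.001):
    """Posterior probability of each candidate base at one site,
    given the bases observed in the reads covering that site."""
    log_post = {}
    for g in BASES:
        # Prior: the site matches the reference unless a SNP occurred.
        prior = 1.0 - snp_prior if g == ref_base else snp_prior / 3.0
        # Likelihood: each read shows the true base unless it is an error.
        loglik = sum(
            math.log(1.0 - error_rate if b == g else error_rate / 3.0)
            for b in pileup
        )
        log_post[g] = math.log(prior) + loglik
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical pileups at a site where the reference base is A.
one_error = genotype_posteriors("AAAAAAAAAG", ref_base="A")
variant = genotype_posteriors("GGGGGGGGAG", ref_base="A")

print("one mismatching read -> P(A) =", round(one_error["A"], 4))
print("nine mismatching reads -> P(G) =", round(variant["G"], 4))
```

A single mismatching read is explained away as a sequencing error, while consistent mismatches across many reads overwhelm the prior and support a SNP. Real callers also use per-base quality scores and mapping quality rather than one fixed error rate.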

Page 19

Current focus

Page 20

1000 Genomes Project

• data quality assessment

• project design (number of samples, depth of read coverage)

• read mapping

• SNP calling

• structural variation discovery

Page 21

SV discovery in autism

deletion

amplification

Page 22

Lab

Page 23

People

Page 24

Resources

• computer cluster (72 servers)

• 128 GB RAM server

• ~200 TB disk space

• 2 R01 grants (NHGRI/NIH)

• 1 R21 grant (NIAID/NIH)

• a BC RIG grant

• 2 RC2 grants (NHGRI/NIH) starting September 2009

Page 25

Collaborations

Baylor HGSC

Wash. U. GSC

Genome Canada

UBC GSC

Cornell

UC Davis

UCSF

NCBI @ NIH

NCI @ NIH

Marshfield Clinic

UCLA

Pfizer

Page 26

Graduate student rotations

• Looking for new graduate students

• Spots are available for all three rotations

• Lots of projects

• Caveat: you need to be able to program…

• Check us out at: http://bioinformatics.bc.edu/marthlab/

• If you are interested, please talk to me