![Page 1: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/1.jpg)
Informatics tools for next-generation sequence analysis
Gabor T. Marth
Boston College Biology Department
University of Michigan
October 20, 2008
![Page 2: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/2.jpg)
Next-gen. sequencers offer vast throughput
[Figure: bases per machine run (log scale, 1 Mb to 10 Gb) versus read length (log scale, 10 bp to 1,000 bp)]
• ABI capillary sequencer
• 454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
![Page 3: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/3.jpg)
Next-gen sequencing enables new applications
Meissner et al. Nature 2008
Ruby et al. Cell, 2006
Jones-Rhoades et al. PLoS Genetics, 2007
• organismal resequencing & de novo sequencing
• transcriptome sequencing for transcript discovery and expression profiling
• epigenetic analysis (e.g. DNA methylation)
![Page 4: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/4.jpg)
Large-scale individual human resequencing
![Page 5: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/5.jpg)
Technologies
![Page 6: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/6.jpg)
Roche / 454 system
• pyrosequencing technology
• variable read-length
• the only new technology with >100 bp reads
![Page 7: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/7.jpg)
Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
![Page 8: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/8.jpg)
AB / SOLiD system
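[Figure: 2-base encoding matrix. Rows are the 1st base (A, C, G, T), columns are the 2nd base (A, C, G, T); each of the 16 dinucleotides maps to one of four colors (0, 1, 2, 3), and each color covers four dinucleotides.]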
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
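The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described color code in which a color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the slide does not spell out the matrix values, so treat that mapping, and the function names, as assumptions made for illustration.

```python
# Assumed color code: color = XOR of the 2-bit codes of adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the first base and the color string."""
    bases = [first_base]
    for color in colors:
        # Each color tells us how to step from the previous base to the next.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))
```

Because each base is decoded relative to the previous one, a single miscalled color changes every downstream base on decoding, which is one reason reads are usually compared to the reference in color space rather than after translation.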
![Page 9: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/9.jpg)
Helicos / Heliscope system
• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
![Page 10: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/10.jpg)
Data characteristics
![Page 11: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/11.jpg)
Read length
[Figure: read length in bp (axis 0 to 400) for each platform: ~200-450 (variable), 25-70 (fixed), 25-50 (fixed), 20-60 (variable)]
![Page 12: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/12.jpg)
Representational biases
• this affects genome resequencing (deeper starting read coverage is needed)
• it will have a major impact on counting applications
“dispersed” coverage distribution
![Page 13: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/13.jpg)
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated into every clonal copy
![Page 14: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/14.jpg)
Read quality
![Page 15: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/15.jpg)
Error rate (Illumina)
![Page 16: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/16.jpg)
Error rate (454)
![Page 17: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/17.jpg)
Per-read errors (Solexa)
![Page 18: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/18.jpg)
Per read errors (454)
![Page 19: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/19.jpg)
Base quality values not well calibrated
![Page 20: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/20.jpg)
Tools for genome resequencing
![Page 21: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/21.jpg)
The resequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
[Figure labels: REF, IND]
![Page 22: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/22.jpg)
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
![Page 23: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/23.jpg)
1. Base calling
base sequence
base quality (Q-value) sequence
diverse chemistry & sequencing error profiles
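As a reminder of what a base quality (Q-value) encodes, the sketch below converts between a quality value and an error probability. It assumes the standard Phred convention, Q = -10 log10(P_error); the slide itself does not state the scale, and the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Phred-scaled quality value for a given error probability."""
    return -10.0 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```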
![Page 24: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/24.jpg)
454 pyrosequencer error profile
• multiple bases in a homopolymeric run are incorporated in a single incorporation test
• the number of bases must therefore be determined from a single scalar signal
• as a result, the majority of errors are INDELs
![Page 25: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/25.jpg)
454 base quality values
• the native 454 base caller assigns base quality values that are too low
![Page 26: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/26.jpg)
PYROBAYES: determine base number
![Page 27: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/27.jpg)
PYROBAYES: Performance
• assigned quality values predict measured error rate better
• higher fraction of bases are high quality
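One way to check whether assigned quality values predict the measured error rate is to bin bases by assigned quality and compute the empirical quality of each bin. The sketch below is a generic illustration of that check, not the PYROBAYES procedure; the input format (pairs of assigned quality and an error flag obtained from alignment to a known reference) and the add-one smoothing are assumptions.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Map each assigned quality value to the measured Phred-scaled quality.

    observations: iterable of (assigned_q, is_error) pairs, one per base.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned_q -> [errors, total]
    for assigned_q, is_error in observations:
        counts[assigned_q][0] += int(is_error)
        counts[assigned_q][1] += 1

    measured = {}
    for assigned_q, (errors, total) in counts.items():
        # Add-one smoothing keeps bins with zero observed errors finite.
        rate = (errors + 1) / (total + 2)
        measured[assigned_q] = -10.0 * math.log10(rate)
    return measured


# Toy data: 100 bases assigned Q20, one of which is an error.
toy = [(20, True)] + [(20, False)] * 99
print(empirical_quality(toy))
```

Well-calibrated values fall close to the diagonal when measured quality is plotted against assigned quality.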
![Page 28: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/28.jpg)
Base quality value calibration
![Page 29: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/29.jpg)
Recalibrated base quality values (Illumina)
![Page 30: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/30.jpg)
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Unique pieces are easier to place than others…
![Page 31: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/31.jpg)
Non-uniqueness of reads confounds mapping
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
![Page 32: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/32.jpg)
Strategies to deal with non-unique mapping
• Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
[Figure: one read placed at three candidate loci with alignment probabilities 0.8, 0.19 and 0.01]
• mapping to multiple loci requires the assignment of alignment probabilities
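A minimal sketch of how alignment probabilities might be assigned when a read maps to several loci. It is not the MOSAIK method: it assumes a uniform prior over candidate loci and scores each placement only by the Phred qualities of its mismatching bases, ignoring the contribution of matching bases. The function name and input format are hypothetical.

```python
def placement_probabilities(mismatch_quals_per_locus):
    """Normalize per-locus read likelihoods into placement probabilities.

    mismatch_quals_per_locus: one list per candidate locus, holding the
    Phred qualities of the read bases that mismatch the reference there.
    """
    likelihoods = []
    for quals in mismatch_quals_per_locus:
        likelihood = 1.0
        for q in quals:
            # A mismatch is explained by a sequencing error of prob 10^(-Q/10).
            likelihood *= 10 ** (-q / 10.0)
        likelihoods.append(likelihood)
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


# Perfect match at locus 1, one Q10 mismatch at locus 2, one Q20 at locus 3.
print(placement_probabilities([[], [10], [20]]))
```

Reporting only uniquely mapped reads then corresponds to keeping reads whose best placement probability exceeds some chosen threshold.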
![Page 33: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/33.jpg)
Paired-end reads help unique read placement
PE (paired-end reads)
• fragment amplification: fragment length 100-600 bp
• fragment length limited by amplification efficiency
MP (mate-pair reads)
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• PE reads are now the standard for genome resequencing
![Page 34: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/34.jpg)
MOSAIK
![Page 35: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/35.jpg)
INDEL alleles/errors – gapped alignments
454
![Page 36: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/36.jpg)
Aligning multiple read types together
ABI/capillary
454 FLX
454 GS20
Illumina
• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics
![Page 37: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/37.jpg)
Aligner speed
![Page 38: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/38.jpg)
3. Polymorphism / mutation detection
sequencing error
polymorphism
![Page 39: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/39.jpg)
New challenges for SNP calling
• deep alignments of 100s / 1000s of individuals
• trio sequences
![Page 40: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/40.jpg)
Rare alleles in 100s / 1,000s of samples
![Page 41: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/41.jpg)
Allele discovery is a multi-step sampling process
Population Samples Reads
![Page 42: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/42.jpg)
Capturing the allele in the sample
[Figure: Prob(allele captured in sample), from 0 to 1, versus population allele frequency (log scale, 1E-04 to 0.5), with one curve per sample size: n=100, n=200, n=400, n=800, n=1600]
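The shape of these curves can be reproduced with a simple sampling model. The sketch below assumes that n counts diploid individuals and that the 2n sampled chromosomes are independent draws from the population, so an allele at frequency f is missed with probability (1 - f)^(2n). Whether the slide's n counts individuals or chromosomes is not stated, so the `ploidy` parameter is exposed as an assumption.

```python
def prob_allele_captured(allele_freq, n_samples, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    return 1.0 - (1.0 - allele_freq) ** (ploidy * n_samples)


for n in (100, 200, 400, 800, 1600):
    row = [round(prob_allele_captured(f, n), 3) for f in (1e-4, 1e-3, 1e-2)]
    print(n, row)
```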
![Page 43: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/43.jpg)
Allele calling in the reads
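$$\Pr(G_1^{k}, G_2^{k}, \ldots, G_n^{k} \mid B) = \frac{\prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{k}) \; \Pr(G_1^{k}, G_2^{k}, \ldots, G_n^{k})}{\sum_{l} \prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^{l}) \; \Pr(G_1^{l}, G_2^{l}, \ldots, G_n^{l})}$$

Inputs: base call, base quality, individual read coverage, sample size
GigaBayes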
![Page 44: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/44.jpg)
Allele calling in deep sequence data
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac

     Q30    Q40    Q50    Q60
1    0.01   0.01   0.1    0.5
2    0.82   1.0    1.0    1.0
3    1.0    1.0    1.0    1.0
![Page 45: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/45.jpg)
More samples or deeper coverage / sample?
Shallower read coverage from more individuals …
…or deeper coverage from fewer samples?
simulation analysis by Aaron Quinlan
![Page 46: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/46.jpg)
Analysis indicates a balance
![Page 47: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/47.jpg)
SNP calling in trios
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) for every combination of mother and father genotypes (11, 12, 22)]
• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
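The two rules above are enough to write down a trio prior. The sketch below is one simple way to do it and is not taken from the slide's formula: it assumes a biallelic site with alleles 1 and 2, a symmetric per-transmission mutation probability `mu` (a hypothetical parameter), and independent transmissions from the two parents.

```python
ALT_COUNT = {"11": 0, "12": 1, "22": 2}  # copies of allele 2 per genotype


def transmit_alt_prob(parent_genotype, mu):
    """Probability that the parent passes on allele 2, allowing mutation."""
    p = ALT_COUNT[parent_genotype] / 2.0
    # Either allele 2 is chosen and not mutated, or allele 1 is chosen and mutates.
    return p * (1.0 - mu) + (1.0 - p) * mu


def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for the three child genotypes."""
    m = transmit_alt_prob(mother, mu)
    f = transmit_alt_prob(father, mu)
    return {
        "11": (1.0 - m) * (1.0 - f),
        "12": m * (1.0 - f) + (1.0 - m) * f,
        "22": m * f,
    }


print(child_genotype_probs("12", "12"))
print(child_genotype_probs("11", "11", mu=1e-3))
```

With `mu = 0` this reduces to plain Mendelian inheritance; a non-zero `mu` leaves a small probability for a child allele that neither parent carries.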
![Page 48: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/48.jpg)
SNP calling in trios
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
mother, father, child
P=0.79
P=0.86
![Page 49: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/49.jpg)
Determining genotype directly from sequence
Reads from three individuals and the genotype determined for each:
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA : A/C
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA : C/C
AACGTTAGCATA AACGTTAGCATA : A/A
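A hedged sketch of genotype calling from allele counts in the reads. It is a textbook binomial model, not the GigaBayes implementation: it assumes a biallelic A/C site, a single per-base error rate, a flat genotype prior by default, and that a heterozygote yields either allele with probability 0.5. All names and default values are illustrative.

```python
from math import comb


def genotype_posteriors(n_a, n_c, error=0.01, priors=None):
    """Posterior over A/A, A/C, C/C given counts of reads showing A and C."""
    # Probability that a single read shows C under each genotype.
    p_c = {"A/A": error, "A/C": 0.5, "C/C": 1.0 - error}
    if priors is None:
        priors = {g: 1.0 / 3.0 for g in p_c}
    n = n_a + n_c
    joint = {
        g: priors[g] * comb(n, n_c) * p ** n_c * (1.0 - p) ** n_a
        for g, p in p_c.items()
    }
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}


# Read counts taken from the three read groups on this slide.
for n_a, n_c in ((2, 2), (0, 4), (2, 0)):
    post = genotype_posteriors(n_a, n_c)
    print(n_a, n_c, max(post, key=post.get))
```

With only two reads, the A/A call is much less certain than the others, since a heterozygote shows the same allele twice a quarter of the time; this is why per-individual coverage matters for direct genotyping.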
![Page 50: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/50.jpg)
4. Structural variation discovery
![Page 51: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/51.jpg)
SV events from PE read mapping patterns
[Figure: paired-end read mapping patterns against the DNA reference for each SV event type; labels LM, LF, Ldel, Ldup, Linv, LT1, LT2, Lins, LT]
• Deletion: LM ~ LF + Ldel & depth: low
• Tandem duplication: LM ~ LF - Ldup & depth: high
• Inversion: LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
• Translocation: LM ~ LF + LT1; LM ~ LF + LT2; LM ~ LF - LT1 - LT2 & depth: normal
• Insertion: un-paired read clusters & depth: normal
• Chromosomal translocation: LM ~ LF + LT & depth: normal & cross-paired read clusters
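The mapping patterns above suggest a simple per-pair classification. The sketch below reads LM as the mapped distance between the two ends and LF as the expected fragment length, which is an interpretation of the figure labels. It is not how Spanner works; the function, its arguments and the 3-standard-deviation cutoff are hypothetical, and a real caller would cluster many supporting pairs and also use depth of coverage.

```python
def classify_pair(mapped_dist, same_chrom, ends_flipped,
                  frag_mean, frag_sd, n_sd=3.0):
    """Label one read pair by its mapping pattern (a candidate, not a call)."""
    if not same_chrom:
        return "chromosomal translocation candidate"
    if ends_flipped:
        return "inversion candidate"
    if mapped_dist > frag_mean + n_sd * frag_sd:
        # Ends map further apart than the fragment: LM ~ LF + Ldel.
        return "deletion candidate"
    if mapped_dist < frag_mean - n_sd * frag_sd:
        # Ends map too close, or in reversed order: LM ~ LF - Ldup.
        return "tandem duplication candidate"
    return "concordant"


print(classify_pair(3000, True, False, frag_mean=400, frag_sd=40))
print(classify_pair(-200, True, False, frag_mean=400, frag_sd=40))
print(classify_pair(410, True, False, frag_mean=400, frag_sd=40))
```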
![Page 52: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/52.jpg)
Deletion: Aberrant positive mapping distance
![Page 53: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/53.jpg)
Copy number estimation from depth of coverage
![Page 54: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/54.jpg)
Alignability – read coverage normalization
$$A(p) = \frac{\text{uniquely mapped reads}}{\text{all possible mapped reads}}$$
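A minimal sketch of how an alignability track could be used to normalize read depth, assuming per-position lists of observed depth and A(p). The cutoff below which a position is treated as uninformative (`min_alignability`) is a hypothetical parameter, not something given in the talk.

```python
def normalized_depth(depth, alignability, min_alignability=0.1):
    """Divide observed depth by A(p); mask positions with too little alignability."""
    result = []
    for d, a in zip(depth, alignability):
        if a < min_alignability:
            result.append(None)  # too few uniquely mappable reads to judge
        else:
            result.append(d / a)
    return result


print(normalized_depth([10, 5, 0], [1.0, 0.5, 0.0]))
```

After normalization, a region with half the expected depth in an otherwise flat profile is easier to recognize as a heterozygous deletion rather than a mappability artifact.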
![Page 55: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/55.jpg)
Het deletion “revealed” by normalization
![Page 56: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/56.jpg)
Tandem duplication: negative mapping distance
![Page 57: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/57.jpg)
Spanner – a hybrid SV/CNV detection tool
Navigation bar
Fragment lengths in selected region
Depth of coverage in selected region
![Page 58: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/58.jpg)
5. Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
![Page 59: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/59.jpg)
Data visualization
![Page 60: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/60.jpg)
Our software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Beta_Release
![Page 61: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/61.jpg)
Data mining projects
![Page 62: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/62.jpg)
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
Bristol, N2 strain (3 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
• 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University
• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain
![Page 63: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/63.jpg)
Polymorphism discovery in C. elegans
• SNP calling error rate very low:
Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)
SNP
INS
• INDEL candidates validate and convert at similar rates to SNPs:
Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)
• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)
• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)
![Page 64: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/64.jpg)
Mutational profiling: deep 454/Illumina/SOLiD data
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• determine where the mutations were that caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
Pichia stipitis reference sequence
Image from JGI web site
![Page 65: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/65.jpg)
Technology comparisons
![Page 66: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/66.jpg)
Thanks
![Page 67: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/67.jpg)
Credits
Elaine Mardis
Andy Clark
Aravinda Chakravarti
Doug Smith
Michael Egholm
Scott Kahn
Francisco de la Vega
Kristen Stoops
Ed Thayer
![Page 68: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/68.jpg)
Lab
![Page 69: Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008](https://reader034.vdocuments.site/reader034/viewer/2022051401/56649d615503460f94a42f77/html5/thumbnails/69.jpg)
Recruitment