Informatics challenges and computer tools for sequencing 1000s of human genomes
Gabor T. Marth
Boston College Biology Department
Cold Spring Harbor Laboratory
Personal Genomes meeting
October 9-12, 2008
Large-scale individual human resequencing
Next-gen sequencers offer vast throughput…
[Figure: bases per machine run (1 Mb to 10 Gb) versus read length (10 bp to 1,000 bp)]
ABI capillary sequencer
454 pyrosequencer (100-400 Mb in 200-450 bp reads)
Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
The resequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
1. Base calling
base sequence
base quality (Q-value) sequence
• early manufacturer-supplied base callers were imperfect
• third party software made substantial improvements
• machine manufacturers are now focusing more on base calling
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Larger, more unique pieces are easier to place than others…
Next-gen reads are generally short
[Figure: read length ranges by platform, on a 0-400 bp axis]
~200-450 bp (variable)
25-70 bp (fixed)
25-50 bp (fixed)
20-60 bp (variable)
Base error rates are low
Illumina
454
Strategies to deal with non-unique mapping
Mapping probabilities (qualities)
0.8 0.19 0.01
read
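A minimal sketch of how placement probabilities like 0.8 / 0.19 / 0.01 could be derived and turned into a mapping quality. It assumes each candidate location has an alignment likelihood, that all locations are equally likely a priori, and that the quality is Phred-scaled and capped at 60. The function names and the cap are illustrative assumptions, not taken from any particular mapper.

```python
import math

def placement_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods into posterior probabilities.

    Assumes a uniform prior over the candidate locations.
    """
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return [x / total for x in likelihoods]

def mapping_quality(posterior, cap=60):
    """Phred-scale the probability that the chosen placement is wrong."""
    p_wrong = 1.0 - posterior
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

# Three candidate locations for one read, weighted as on the slide.
posteriors = placement_posteriors([0.8, 0.19, 0.01])
best = max(posteriors)
mapq = mapping_quality(best)
print(posteriors, mapq)
```

A read whose best placement has probability 0.8 gets a low quality, so downstream callers can down-weight it instead of discarding it.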
Error types are very different
Illumina
454
Gapped alignments
MOSAIK
• fast
• accurate
• gapped
• versatile (short + long reads)
3. SNP and short-INDEL calling
• deep alignments of 100s / 1000s of individuals
• trio sequences
Allele discovery is a multi-step sampling process
Population Samples Reads
Capturing the allele in the sample
[Figure: Prob(allele captured in sample) versus population AF (1E-04 to 0.5), one curve each for n=100, n=200, n=400, n=800, n=1600]
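The capture curves can be approximated with a simple sampling argument. The sketch below assumes that n counts diploid individuals, so 2n chromosomes are drawn independently from the population; the slide does not state this, so treat both the interpretation of n and the function name as assumptions.

```python
def prob_allele_captured(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele.

    Assumes chromosomes are sampled independently at the population
    allele frequency (no relatedness, no population structure).
    """
    chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes

# Sample sizes from the figure legend, at a population AF of 0.001.
for n in (100, 200, 400, 800, 1600):
    print(n, round(prob_allele_captured(0.001, n), 3))
```

Under this model, rare alleles are mostly missed at n=100 and mostly captured at n=1600, which is the first of the sampling steps on the way from population to reads.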
Allele calling in the reads
$$\Pr(G_1,\ldots,G_n \mid B) = \frac{\prod_{i=1}^{n} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i)\;\Pr(G_1,\ldots,G_n)}{\sum_{G_1,\ldots,G_n} \prod_{i=1}^{n} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i)\;\Pr(G_1,\ldots,G_n)}$$
Annotations: base quality; allele call in read; number of individuals
GigaBayes
How many reads needed to call an allele?
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctacaatgtagtaCgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
     Q30    Q40    Q50    Q60
1    0.01   0.01   0.1    0.5
2    0.82   1.0    1.0    1.0
3    1.0    1.0    1.0    1.0
The need for accurate data…
… and realistic base quality values
Recalibrated base quality values (Illumina)
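To make "realistic base quality values" concrete: a Phred quality Q corresponds to an error probability of 10^(-Q/10). The sketch below shows one simple way recalibration could work, assuming bases are binned by reported quality and compared against mismatches observed at positions believed to be non-variant. The binning scheme, the add-one smoothing, and the function names are assumptions for illustration, not the method behind the slide.

```python
import math
from collections import defaultdict

def phred_to_error(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def empirical_quality(mismatches, total_bases):
    """Phred quality implied by an observed mismatch rate.

    Add-one smoothing keeps the rate away from zero for small bins.
    """
    rate = (mismatches + 1) / (total_bases + 2)
    return -10 * math.log10(rate)

def recalibration_table(observations):
    """Map each reported quality to its empirical quality.

    `observations` is an iterable of (reported_q, is_mismatch) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {q: empirical_quality(m, n) for q, (m, n) in counts.items()}

# Bases reported as Q30 that mismatch about 1% of the time behave like Q20.
obs = [(30, True)] * 9 + [(30, False)] * 989
print(recalibration_table(obs))
```

An overstated quality like this matters for calling, because a single Q30 mismatch would otherwise be weighted ten times more heavily than the data justify.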
More samples or deeper coverage / sample?
Shallower read coverage from more individuals …
…or deeper coverage from fewer samples?
simulation analysis by Aaron Quinlan
Analysis indicates a balance
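A toy model, not the simulation referred to above, can show why a balance might emerge. It assumes Hardy-Weinberg genotype frequencies, Poisson-distributed read counts, and that an allele is detected once any single individual shows at least two reads carrying it. All three assumptions and the function names are illustrative.

```python
import math

def prob_at_least_two(mean_reads):
    """P(X >= 2) for X ~ Poisson(mean_reads)."""
    return 1.0 - math.exp(-mean_reads) * (1.0 + mean_reads)

def detection_probability(allele_freq, n_individuals, coverage):
    """Probability that some individual shows >= 2 reads with the allele."""
    f = allele_freq
    # Heterozygous carriers contribute half the coverage to the allele.
    het = 2 * f * (1 - f) * prob_at_least_two(coverage / 2)
    hom = f * f * prob_at_least_two(coverage)
    p_individual = het + hom
    return 1.0 - (1.0 - p_individual) ** n_individuals

# Four designs with the same total of 1600x: (individuals, coverage each).
designs = [(100, 16), (200, 8), (400, 4), (800, 2)]
results = {d: detection_probability(0.005, *d) for d in designs}
best = max(results, key=results.get)
for design, p in results.items():
    print(design, round(p, 3))
```

In this toy setting, very deep coverage wastes reads on few chromosomes while very shallow coverage misses carriers that were sampled, so an intermediate design does best for a 0.5% allele.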
SNP calling in trios
[Table: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) for each combination of mother (G_M) and father (G_F) genotypes (11, 12, 22)]
• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
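The two rules above are enough to sketch a transmission table Pr(G_C | G_M, G_F). The version below assumes a biallelic site with alleles 1 and 2 and a hypothetical per-transmission mutation probability `mu` that flips the inherited allele; the exact parametrization used on the slide may differ.

```python
from itertools import product

def transmitted_allele_probs(genotype, mu):
    """Probability that a parent passes on allele 1 or allele 2.

    `genotype` is a pair of alleles, e.g. (1, 2). Each parental chromosome
    is passed on with probability 1/2, and the transmitted allele flips
    with probability `mu` (an assumed mutation model).
    """
    probs = {1: 0.0, 2: 0.0}
    for allele in genotype:
        other = 2 if allele == 1 else 1
        probs[allele] += 0.5 * (1 - mu)
        probs[other] += 0.5 * mu
    return probs

def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for child genotypes 11, 12 and 22."""
    from_mother = transmitted_allele_probs(mother, mu)
    from_father = transmitted_allele_probs(father, mu)
    out = {"11": 0.0, "12": 0.0, "22": 0.0}
    for a, b in product((1, 2), repeat=2):
        key = "".join(str(x) for x in sorted((a, b)))
        out[key] += from_mother[a] * from_father[b]
    return out

print(child_genotype_probs((1, 2), (1, 2), mu=0.0))
print(child_genotype_probs((1, 1), (1, 1), mu=0.01))
```

With `mu` set to zero this reduces to ordinary Mendelian ratios. A small positive `mu` leaves a little probability for child genotypes that neither parent could transmit, which lets a caller consider de novo mutations.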
SNP calling in trios
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctac
mother
father
child
P=0.79
P=0.86
4. Structural variation discovery
Event (length): read pair mapping pattern against the DNA reference
Deletion (Ldel): LM ~ LF+Ldel & depth: low
Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high
Inversion (Linv): LM ~ +Linv & ends flipped; LM ~ -Linv; depth: normal
Translocation (LT1, LT2): LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2; depth: normal
Insertion (Lins): un-paired read clusters & depth: normal
Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
Read pair mapping pattern (breakpoint detection)
Copy number estimation
Depth of read coverage
Deletion: Aberrant positive mapping distance
Tandem duplication: negative mapping distance
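The mapping-distance patterns can be sketched as a simple per-pair classifier. This assumes LM is the distance at which the two ends of a pair map on the reference and LF is the library fragment length, summarized by a mean and standard deviation; the z-score threshold and the labels are illustrative choices, and a real caller would require clusters of supporting pairs.

```python
def classify_read_pair(lm, ends_flipped, lf_mean, lf_sd, z=3.0):
    """Return (label, estimated_event_length) for one mapped read pair.

    lm           -- mapped distance between the pair ends (may be negative)
    ends_flipped -- True if the relative orientation of the ends is wrong
    lf_mean/sd   -- expected fragment length distribution of the library
    """
    if ends_flipped:
        return ("inversion candidate", None)
    if lm > lf_mean + z * lf_sd:
        # LM ~ LF + Ldel, so the excess distance estimates the deletion size.
        return ("deletion candidate", lm - lf_mean)
    if lm < lf_mean - z * lf_sd:
        # LM ~ LF - Ldup, including negative mapping distances.
        return ("tandem duplication candidate", lf_mean - lm)
    return ("concordant", None)

print(classify_read_pair(3200, False, 200, 20))
print(classify_read_pair(-800, False, 200, 20))
print(classify_read_pair(210, False, 200, 20))
```

Tighter fragment length distributions make smaller events detectable, since the threshold scales with the standard deviation.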
Het deletion “revealed” by normalization
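One plausible reading of the normalization step is sketched below: per-window read depth in the sample is divided by depth in a control, and the ratios are rescaled so that the median window corresponds to two copies. The choice of a control sample, the median baseline, and the function name are assumptions rather than the method presented on the poster.

```python
from statistics import median

def copy_number_estimates(sample_depth, control_depth, ploidy=2):
    """Estimate copy number per window from normalized read depth.

    Windows with zero control depth are reported as None.
    """
    ratios = [s / c if c > 0 else None
              for s, c in zip(sample_depth, control_depth)]
    usable = [r for r in ratios if r is not None]
    baseline = median(usable)
    if baseline == 0:
        raise ValueError("median depth ratio is zero; cannot normalize")
    return [None if r is None else ploidy * r / baseline for r in ratios]

# The third window has half the expected depth: one copy, a het deletion.
print(copy_number_estimates([100, 100, 50, 100], [100, 100, 100, 100]))
```

Dividing by a control removes depth fluctuations shared by both samples, which is what lets a 50% drop stand out from background variation.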
Chip Stewart
Saturday poster session
5. Data visualization
• software development
• data validation
• hypothesis generation
Summary
• Next-generation sequencing is a boon for large-scale individual human resequencing
• Basic data mining tools are being applied and tested in the 1000 Genomes Project
• There is still a lot of fine-tuning to do
• A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes
Credits
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
Several postdoc positions are available… mail [email protected]
Software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Beta_Release
Individual genotype directly from sequence
individual 1: A/C
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 2: C/C
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 3: A/A
AACGTTAGCATA
AACGTTAGCATA
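A minimal sketch of genotype calling from per-individual allele counts such as those above (two A and two C reads, four C reads, two A reads). It assumes a binomial read-sampling model with a single per-base error rate and flat genotype priors, and it treats A as the reference allele and C as the alternate. These are illustrative assumptions, not the GigaBayes model.

```python
from math import comb

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior over ref/ref, ref/alt and alt/alt given allele counts."""
    n = ref_count + alt_count
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1 - error_rate}
    if priors is None:
        priors = {g: 1 / 3 for g in p_alt}
    joint = {
        g: priors[g] * comb(n, alt_count) * p ** alt_count * (1 - p) ** ref_count
        for g, p in p_alt.items()
    }
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

def call_genotype(ref_count, alt_count, **kwargs):
    """Return the most probable genotype and its posterior probability."""
    post = genotype_posteriors(ref_count, alt_count, **kwargs)
    best = max(post, key=post.get)
    return best, post[best]

print(call_genotype(2, 2))  # two A reads, two C reads
print(call_genotype(0, 4))  # four C reads
print(call_genotype(2, 0))  # two A reads
```

With only two reads the homozygous call is far from certain under this model, which is why the fraction of confident genotypes depends so strongly on per-sample coverage.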
Genotyping from primary sequence data
[Figure: fraction of confident genotypes (0 to 1) versus SNP position (0 to 50,000); a second panel has axes 0 to 5 x 10^4 and 0 to 1600]
100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
Most reads contain no or few errors
Paired-end reads help unique read placement
PE
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
MP
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
How many reads needed to call an allele?
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctacaatgtagtaCgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctac
aatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaAgtacctacaatgtagtaCgtacctac
P=0.82 P=0.08