John McEwan, AgResearch
PAG Jan 2010
SNP Discovery in Deer (Cervus elaphus) Using the Illumina Genome Analyser IIx
Summary
• 4.1M SNPs
• 8 lanes
• ~1c/SNP
• 9X with 7 animals
• 100bp PER
• Sufficient for SNP chip
Deer SNPs… lessons learned
• Illumina GA IIx 100bp PER ~500bp insert 3Gbp x 7 animals
• Select animals span genetic diversity
• 1 flow cell 7 lanes
– WGS … more even coverage
– 100bp reads > match to related genome
– 8X coverage …. >98% depth of 4 or greater
– Low coverage SNPs vital to track read source
– Better info on flanking sequence
– PER = better assembly (by simulation)
– Forms basis for draft sequence of a genome
– Sheep ~$2M in 2007 for 3X; ~$50K in 2009 for 9X
– started Sept 2009, seq late Oct with Illumina
[Figure: sequencing and analysis workflow. Animals: Wob 1, War 1, Red 1, Eas 1 M, Elk 1, Elk 2, Hun 1; each labelled 1x, with one labelled 2x.]
Sequencing → Repeat mask → Blast UMD3 → Assemble with Velvet → Meld against bovine scaffold → Detect SNPs
Sequence
• 8 lanes
• 100bp PER
• 284.3M reads
• 28.4Gbp
• High % full length
• Not trimmed
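The read and base totals on this slide are linked by simple arithmetic. A minimal sketch follows; the 3 Gbp genome size is an assumption made for illustration (the slides do not state the deer genome size), so the fold coverage is approximate.

```python
# Sequencing yield arithmetic for the run described above.
READS_MILLIONS = 284.3      # from the slide
READ_LENGTH_BP = 100        # 100bp paired-end reads
ASSUMED_GENOME_GBP = 3.0    # assumption: not stated on the slide


def total_gbp(reads_millions, read_length_bp):
    """Total sequence in Gbp = number of reads x read length."""
    return reads_millions * 1e6 * read_length_bp / 1e9


def fold_coverage(yield_gbp, genome_gbp):
    """Average fold coverage = total bases / genome size."""
    return yield_gbp / genome_gbp


yield_gbp = total_gbp(READS_MILLIONS, READ_LENGTH_BP)
coverage = fold_coverage(yield_gbp, ASSUMED_GENOME_GBP)
print(f"{yield_gbp:.1f} Gbp, ~{coverage:.1f}X")
```

Under that assumed genome size the result is in line with the ~9X quoted in the summary.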
Masking
• Used Repeatmasker
• Used Ruminantia db
• Supplemented with:
– >10 identical reads assembly
– Multiple blast hit assembly
– Sped up sequence matching
– Greatly reduced output size
• Optimal masking sensitivity & mapping need to be different!!!
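The ">10 identical reads" supplement can be illustrated with a simple counter. This is only a sketch of the flagging step: the function name and the threshold parameter are hypothetical, and the slides say these reads were then assembled into extra masking sequence, which is not reproduced here.

```python
from collections import Counter


def overrepresented_reads(reads, min_copies=11):
    """Return the set of read sequences seen more than 10 times.

    min_copies=11 encodes the slide's ">10 identical reads" rule.
    """
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= min_copies}


# Toy input: one sequence seen 11 times, one seen 10 times, one singleton.
reads = ["ACGT"] * 11 + ["TTGA"] * 10 + ["GGCC"]
flagged = overrepresented_reads(reads)
print(sorted(flagged))
```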
Mapping: deer reads to UMD3
• Used Megablast
• Options
-D 3 -t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -s 56 -N 2 -F "m D" -U T
• Optimised for speed with maximal specificity & % unique hits
• ~10% more hits if Blastn W9 hits (sensitive blast) are also added
• Used unique hits and where ehit1/ehit2 =1e-20
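One possible reading of the unique-hit rule is sketched below. It assumes the tab-delimited output requested by -D 3 follows the usual 12-column BLAST tabular layout (query, subject, ..., e-value in column 11, bit score in column 12), and it interprets "ehit1/ehit2 = 1e-20" as "best e-value divided by second-best e-value is at most 1e-20". Both points are assumptions, not statements from the talk, and the function name is hypothetical.

```python
from collections import defaultdict

MAX_EVALUE_RATIO = 1e-20  # assumed reading of "ehit1/ehit2 = 1e-20"


def confident_hits(tabular_lines, max_ratio=MAX_EVALUE_RATIO):
    """Map read id -> subject id for reads with one clearly best hit."""
    hits = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        read_id, subject_id = fields[0], fields[1]
        evalue = float(fields[10])  # assumed e-value column
        hits[read_id].append((evalue, subject_id))

    kept = {}
    for read_id, read_hits in hits.items():
        read_hits.sort()  # smallest (best) e-value first
        best_e, best_subject = read_hits[0]
        if len(read_hits) == 1:
            kept[read_id] = best_subject  # truly unique hit
            continue
        second_e = read_hits[1][0]
        if second_e > 0 and best_e / second_e <= max_ratio:
            kept[read_id] = best_subject  # best hit dominates the runner-up
    return kept
```

A read with hits at 1e-40 and 1e-10 would be kept under this reading, while one with hits at 1e-30 and 1e-25 would be dropped as ambiguous.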
[Figure: mapping results chart; surviving values 48, 44, 8, 52 (labels not recoverable)]
Mapping Specificity
• High specificity
• P~0.0009-0.004
• Some animal diffs?
Est distance between mate pair ends
~200bp insert sizes
0.02-0.03% of mate pairs had the wrong orientation when both ends matched the same chromosome → the blast criteria are very specific
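The distance and orientation check can be sketched as follows. It assumes inward-facing paired-end reads (left end on the + strand, right end on the - strand) and leftmost mapped coordinates; the function name and argument layout are hypothetical.

```python
def mate_pair_check(pos1, strand1, pos2, strand2, read_length=100):
    """Return (outer_distance, inward_facing) for two ends on one chromosome.

    pos1 and pos2 are leftmost mapped coordinates; strands are '+' or '-'.
    """
    left, right = sorted([(pos1, strand1), (pos2, strand2)])
    # Outer distance spans from the start of the left read to the end of the right read.
    distance = right[0] + read_length - left[0]
    # Expected orientation (assumed): left read forward, right read reverse.
    inward = left[1] == "+" and right[1] == "-"
    return distance, inward


print(mate_pair_check(1000, "+", 1400, "-"))
print(mate_pair_check(1000, "-", 1400, "+"))
```

Counting the pairs for which the second value is False, among pairs mapped to the same chromosome, gives the kind of wrong-orientation rate quoted on the slide.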
Velvet assembly criteria selection
• Varied kmer length
• N50 length
• % assembly coverage
• Non chimeric %
• cf CAP3
• Chose default kmer=31
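N50, one of the criteria listed above, is the contig length at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly. A minimal sketch:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # total 32, half 16, reached by the two 8s
```

Comparing this value across k-mer settings, together with coverage and the chimeric fraction, is the kind of selection described on the slide.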
Velvet assembly
• 1Mbp regions assembled
• Divide and conquer approach
• Many small contigs
• 58.6% length UMD3
• UMD3 59.5% unique!
• N50 & coverage affected by insert length
• Better for SNP oligo design
Results
Stage     Measure           Value
Start     N sequences (M)   284.3
Blast     N sequences (M)   147.3
Assembly  Contigs (M)       3.2
Assembly  Bases (Gbp)       1.562
Assembly  N50 (bp)          813
Meld Process
[Figure: contigs (labelled "Ovine contigs" in the original figure, "Deer contigs" on this slide) → Align (BLAST) to bovine reference scaffold → MELD sequence]
MELD: CTAGTGCATGCTGCactTGCTataTGTGCtagNNgcATATTGCTGNNTGCTAT
Figure 2. Creation of MELDed ovine sequence using bovine as a guide genome
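The ordering and orientation step of melding can be sketched as below. This is not the MELD procedure itself, whose details (including the meaning of lower-case bases and N runs in the figure) are not given in the slides; the "NN" joiner, the function names and the alignment tuple layout are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def order_and_orient(contigs, alignments):
    """Place contigs in reference order, flipping minus-strand matches.

    contigs: dict of name -> sequence
    alignments: list of (name, ref_start, strand) from the reference alignment
    """
    placed = []
    for name, ref_start, strand in sorted(alignments, key=lambda a: a[1]):
        seq = contigs[name]
        if strand == "-":
            seq = reverse_complement(seq)
        placed.append((name, ref_start, seq))
    return placed


def join_placed(placed, gap="NN"):
    """Concatenate placed contigs with a gap marker (illustrative choice)."""
    return gap.join(seq for _, _, seq in placed)


contigs = {"c1": "AAAC", "c2": "GGT"}
alignments = [("c1", 500, "-"), ("c2", 100, "+")]
print(join_placed(order_and_orient(contigs, alignments)))
```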
Meld and overall assembly stats
• Reduce contigs 34%
• Reduce length 8%
• Increase N50 26%
• 53.8% coverage of UMD3
• Optimised for SNP discovery
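As a rough check, the percentage changes above can be applied to the pre-meld Velvet figures from the Results slide. The sketch assumes the percentages are relative to those figures, which the slides do not state explicitly.

```python
# Pre-meld Velvet assembly figures, as listed on the Results slide.
PRE_CONTIGS_M = 3.2
PRE_BASES_GBP = 1.562
PRE_N50_BP = 813
SNPS_M = 4.1  # SNP total quoted in the talk summary


def apply_change(value, percent_change):
    """Apply a signed percentage change to a value."""
    return value * (1 + percent_change / 100)


post_contigs_m = apply_change(PRE_CONTIGS_M, -34)   # "Reduce contigs 34%"
post_bases_gbp = apply_change(PRE_BASES_GBP, -8)    # "Reduce length 8%"
post_n50_bp = apply_change(PRE_N50_BP, 26)          # "Increase N50 26%"
bp_per_snp = post_bases_gbp * 1e9 / (SNPS_M * 1e6)

print(f"contigs ~{post_contigs_m:.2f}M, bases ~{post_bases_gbp:.3f} Gbp")
print(f"N50 ~{post_n50_bp:.0f} bp, ~1 SNP per {bp_per_snp:.0f} bp")
```

Under that assumption the SNP spacing comes out close to the ~1/349bp quoted elsewhere in the talk.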
% Assembly Refseq Coverage
• Masked Bov refseqs
• Mapped deer assembly
• ~13% not mapped
• 80% refseqs >40% unique coverage
• Seq matched 66%
• Conservative
SNP detection
• SNP Detection Criteria
– Stacking: collapsed where reads same start base
– Depth: >3 (98% of sequence) and <17 reads deep
– MAF: at least 2 reads present
– SNP Class:
A: 2 or more animals present for both alleles
B: 2 or more animals present for at least 1 allele
C: alleles present in one animal
– SNP quality:
• discarded if 10bp flanking sequence has variants
– Previous expts get ~93% conversion rate on SNP chip
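The stacking, depth, minor-allele and class rules above can be sketched for a single site as follows. Several details are assumptions: stacks are collapsed per animal and start position, only biallelic sites are considered, and class C is read as "each allele seen in only one animal". The flanking-variant rule is not included.

```python
from collections import defaultdict

MIN_DEPTH, MAX_DEPTH = 4, 16   # depth >3 and <17
MIN_MINOR_READS = 2            # at least 2 reads for the minor allele


def collapse_stacks(reads):
    """Keep one read per (animal, start) to remove stacked duplicates."""
    seen = set()
    kept = []
    for animal, start, base in reads:
        if (animal, start) not in seen:
            seen.add((animal, start))
            kept.append((animal, start, base))
    return kept


def call_snp(reads):
    """Return 'A', 'B' or 'C' for a site, or None if it fails the filters.

    reads: list of (animal, read_start, base_at_site)
    """
    reads = collapse_stacks(reads)
    if not MIN_DEPTH <= len(reads) <= MAX_DEPTH:
        return None
    animals_by_allele = defaultdict(set)
    read_counts = defaultdict(int)
    for animal, _, base in reads:
        animals_by_allele[base].add(animal)
        read_counts[base] += 1
    if len(read_counts) != 2:
        return None  # not a biallelic site
    if min(read_counts.values()) < MIN_MINOR_READS:
        return None
    animal_counts = sorted(len(a) for a in animals_by_allele.values())
    if animal_counts[0] >= 2:
        return "A"   # both alleles seen in 2 or more animals
    if animal_counts[1] >= 2:
        return "B"   # only one allele seen in 2 or more animals
    return "C"       # each allele seen in a single animal
```

Tracking the animal alongside each read is what makes the class assignment possible, which is the point of the "track read source" lesson earlier in the talk.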
Read Depth distribution at SNP calls
• ~Poisson
• Little genome bias?
• A = both alleles seen in 2 deer
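For comparison with the observed distribution, a Poisson model of read depth can be evaluated directly. The mean depth of 9 used in the example is an assumption taken from the ~9X quoted in the summary, and the 4 to 16 window mirrors the depth filter; the output is a model value, not a result from the talk.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k reads at a site under a Poisson model."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def fraction_in_window(mean, low=4, high=16):
    """Expected fraction of sites with depth between low and high inclusive."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))


ASSUMED_MEAN_DEPTH = 9  # assumption for illustration
print(f"{fraction_in_window(ASSUMED_MEAN_DEPTH):.3f}")
```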
• SNP chip real estate
• Infinium 2 SNP: 1 probe, 50bp; no G/C or A/T SNPs
• Infinium 1 SNP: 2 probes, 50bp
• 38% removed proximity filter
• 5% removed depth filter
• leaves 4.1M SNPs ~1/349bp
• ~90% pass design (0.8 threshold)
• ~ 1.98M Class A Infinium 2 SNPs
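The proximity filter corresponds to the rule that a SNP is discarded when its 10bp flanking sequence contains another variant. A minimal sketch for variant positions on one contig; dropping both members of a close pair is an assumption, and the function name is hypothetical.

```python
def proximity_filter(positions, flank=10):
    """Keep variant positions with no other variant within `flank` bp."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        # A neighbour at distance <= flank falls inside the flanking window.
        prev_ok = i == 0 or pos - ordered[i - 1] > flank
        next_ok = i == len(ordered) - 1 or ordered[i + 1] - pos > flank
        if prev_ok and next_ok:
            kept.append(pos)
    return kept


print(proximity_filter([100, 105, 200, 211, 300]))
```

In the toy input the variants at 100 and 105 remove each other, while 200 and 211 are 11bp apart and both survive.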
Illumina Deer SNP Results
Estimated Minor allele frequency
• Bias to high MAF
• SNP chip results will be similar
• Average MAF =0.3
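The bias towards high MAF is expected from the calling rule itself: a SNP is only found if both alleles are seen in at least 2 reads. A simplified sketch of that ascertainment effect, assuming each read samples the minor allele independently with probability equal to the MAF and a depth of 9 (both assumptions; real sampling also depends on which animals carry the allele):

```python
from math import comb


def detection_probability(maf, depth, min_reads=2):
    """P(both alleles appear in at least min_reads of `depth` reads)."""
    total = 0.0
    for k in range(min_reads, depth - min_reads + 1):
        # k reads carry the minor allele, depth - k carry the major allele.
        total += comb(depth, k) * maf**k * (1 - maf) ** (depth - k)
    return total


for maf in (0.05, 0.1, 0.3, 0.5):
    print(maf, round(detection_probability(maf, depth=9), 3))
```

Detection probability rises with MAF under this model, which is consistent with discovered SNPs being skewed towards common variants.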
SNP density across genome
[Figure: SNP number/Mbp (log scale, 1 to 10000) against position on chromosome 1 (0 to 160 Mbp); series: A/C, A/G, A/T, G/C, Total]
SNP specificity
• Large % fixed differences
• Impt when selecting SNPs
• Reflects est genetic divergence
SNP class          Freq
Elk only           0.04
Europe only        0.50
Both               0.15
Fixed difference   0.30
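The four categories can be assigned from the alleles observed in each group. The sketch below assumes "Elk only" and "Europe only" mean the SNP is polymorphic within that group alone, and "fixed dif" means each group carries a single, different allele; the function name and inputs are hypothetical.

```python
def specificity_class(elk_alleles, europe_alleles):
    """Classify a SNP by where its alleles segregate."""
    elk, eur = set(elk_alleles), set(europe_alleles)
    if len(elk) > 1 and len(eur) > 1:
        return "both"
    if len(elk) > 1:
        return "elk only"
    if len(eur) > 1:
        return "Europe only"
    if elk and eur and elk != eur:
        return "fixed difference"
    return "not variable"


print(specificity_class("AG", "AG"))
print(specificity_class("A", "G"))
```

Fixed differences look like SNPs in the pooled data but carry no information within either group, which is why the category matters when selecting SNPs.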
Summary
• Sequenced 7 animals to ~1X coverage
– selected to span genetic diversity
– ≥4X depth over 99% of the genome
– 100bp PER
• Used a mixture of assisted and de novo assembly
– Optimised to provide high quality sequence for SNP discovery
– Ordered and orientated contigs via related genome
• SNP calling routine
– corrects for “stacking” artifacts and repetitive regions
– traces animal origin of reads for high quality calls
• Results
– 4.1M SNPs, 2.4M class A
– Suitable to create a high density Illumina SNP array
– Cost ~1 cent/SNP identified
Acknowledgements
• Cindy Lawley Illumina
• Kimberly Gietzen
• Nan Leng
• Rudi Brauning, AgResearch
• Paul Fisher AgResearch
• Jason Archer
• Matt Bixley
• Jamie Ward
• Geoff Nicoll Landcorp