Mouse Genomes Project summary, June 2010
A talk given at the Jackson Laboratory, ME, USA, summarising the progress of the Mouse Genomes Project as of June 2010.
Vertebrate Resequencing Informatics 18th June, 2010
The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains
Thomas Keane, Senior Scientific Manager, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK
http://www.sanger.ac.uk/mousegenomes
Introduction
SNP Calling
Variation Consequences
Structural Variation
Future Work
Sequencing Technologies over past 30 years
MR Stratton et al. Nature 458, 719-724 (2009)
WTSI - major project usage 2009
[Chart. Labels as extracted: Human variation; Cancer; Mouse; Pulldown; Medical reseq; Pathogen; Development; Malaria; WTCCC; Small faculty; Other 3rd party; Zebrafish tilling; ICR; Infrastructure]
A view from the Sanger
Mouse Genomes Project
2002: Mouse genome reference published
  C57BL/6J strain, currently mm9
2006: Perlegen, 15 strains and 8.3M SNPs
2008: Mouse Genomes Project
  Sequence the genomes of 17 key mouse strains
  Inbred lab strains used by mouse geneticists
  Include 3 recently wild-derived strains
Primary goals of the project
  Catalogue all types of sequence variation across the strains
  Produce de novo assemblies per strain
Sequencing on the Illumina platform at Sanger
  Majority of sequencing completed by end of 2009
  All strains >20x sequence coverage
  Read lengths: 54, 76, 108 bp
  Fragment sizes of 200-600 bp
Regular data releases
  Initial release June 2009
  SNPs, short indels
  Web interface to query and visualise read alignments and sequence variation
Final data freeze: December 2009
  1.53 Tbp sequenced (510x)
  One BAM file per strain (mapped and unmapped reads)
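The coverage figure quoted for the data freeze can be sanity-checked with simple arithmetic. The sketch below assumes a nominal genome size of 3 Gbp, which is not stated on the slide but reproduces the quoted 510x; the function and constant names are illustrative only.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


TOTAL_BASES = 1.53e12  # 1.53 Tbp, from the slide
GENOME_SIZE = 3.0e9    # ASSUMPTION: nominal genome size, not given on the slide
N_STRAINS = 17         # from the slide

total_x = fold_coverage(TOTAL_BASES, GENOME_SIZE)
per_strain_x = total_x / N_STRAINS  # mean only; individual strains will differ

print(f"total: {total_x:.0f}x, mean per strain: {per_strain_x:.0f}x")
```

Under that assumption the mean works out to roughly 30x per strain, which is consistent with the statement that all strains exceed 20x.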
SNP Calling
Homozygous inbred mouse strains
  Very few heterozygous SNPs
SNP calling strategy
  Use multiple SNP callers
    Samtools varfilter
    QCALL (Le Si Quang, Sanger)
    GATK (Broad Institute)
    Iterative remapping caller (Oxford)
  Merge individual callsets to produce the final set
Validation data
  Sanger finished BACs: NOD/ShiLtJ
  Mouse HapMap (Perlegen): all strains
Samtools/GATK
QCALL
Petr Danecek
Final Strategy
SNPs called with each of the individual callers
Validation data
  NOD BACs
  Several Mbp of finished sequence
  Gold-standard sequence to determine our error rates
Combined the SNP calls in various permutations
Final callset chosen was the ‘at least two callers’ rule
  Optimal balance of false positives/negatives
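The ‘at least two callers’ rule can be illustrated as a simple vote over per-caller callsets, scored against a truth set such as finished BAC sequence. This is a minimal sketch, not the project's actual merging code: the site representation, function names and toy data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, position, alternate allele)


def consensus_calls(callsets: Dict[str, Set[Site]], min_callers: int = 2) -> Set[Site]:
    """Keep sites reported by at least `min_callers` of the callers."""
    # Each caller's calls are a set, so a caller votes at most once per site.
    votes = Counter(site for calls in callsets.values() for site in calls)
    return {site for site, n in votes.items() if n >= min_callers}


def error_rates(calls: Set[Site], truth: Set[Site]) -> Tuple[float, float]:
    """Return (fraction of calls not in truth, fraction of truth not called)."""
    fp = len(calls - truth) / len(calls) if calls else 0.0
    fn = len(truth - calls) / len(truth) if truth else 0.0
    return fp, fn


# Toy data, not project results.
callsets = {
    "samtools": {("19", 100, "A"), ("19", 200, "T"), ("19", 300, "G")},
    "qcall": {("19", 100, "A"), ("19", 200, "T"), ("19", 400, "C")},
    "gatk": {("19", 100, "A"), ("19", 500, "G")},
}
truth = {("19", 100, "A"), ("19", 200, "T"), ("19", 300, "G")}

final = consensus_calls(callsets, min_callers=2)
fp_rate, fn_rate = error_rates(final, truth)
print(sorted(final), fp_rate, fn_rate)
```

Raising `min_callers` trades false positives for false negatives, which is the balance the slide refers to.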
Final SNPs
Strain    Raw         Private
C57BL     37240       13819
BALB      4532356     99785
A_J       4845615     113684
NOD       4924864     229125
AKR       5051556     202664
129S5     5087252     86948
C3H       5104106     96541
NZO       5144172     327797
129S1     5157820     61212
DBA       5167254     171522
CBA       5225691     122140
LP_J      5432909     134922
129P2     5478407     146580
WSB       6956471     1177170
PWK       19393909    5421503
CAST      20197854    6642404
SPRET     40558991    27409147

dbSNP calls: 10.1M
Perlegen calls: 8.3M
Mouse Genomes: 65.0M
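A per-strain "Private" count can be derived from per-strain callsets by counting the sites that no other strain carries. The sketch below assumes that "Private" means a SNP called in exactly one strain of the panel; the slide does not define the term, and the names and toy data here are illustrative.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, position, alternate allele)


def private_counts(calls_by_strain: Dict[str, Set[Site]]) -> Dict[str, int]:
    """Count, per strain, the sites called in that strain and in no other."""
    carriers = Counter(
        site for calls in calls_by_strain.values() for site in calls
    )
    return {
        strain: sum(1 for site in calls if carriers[site] == 1)
        for strain, calls in calls_by_strain.items()
    }


# Toy data, not project results.
toy_calls = {
    "CAST": {("1", 10, "A"), ("1", 20, "C"), ("1", 30, "G")},
    "PWK": {("1", 10, "A"), ("1", 40, "T")},
    "WSB": {("1", 10, "A")},
}
toy_private = private_counts(toy_calls)
print(toy_private)
```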
Perlegen Concordance
17 strains sequenced
  7 strains overlap with the 15 Perlegen strains
  Extremely high SNP concordance rates
SNPs have been imputed across 61 strains
  http://mouse.cs.ucla.edu/mousehapmap/beta/index.html

Values are fractions, summing to 1 per strain.

Strain         Concordant     Discordant     No info
129S1.SvImJ    0.885636549    0.006589192    0.107774259
A.J            0.903007001    0.005785193    0.091207806
AKR.J          0.890526999    0.005832134    0.103640867
BALB.cByJ      0.896493269    0.004583561    0.09892317
C3H.HeJ        0.879009694    0.005307051    0.115683255
DBA.2J         0.884198574    0.004952265    0.110849161
NOD.LtJ        0.884508602    0.006592604    0.108898794

Nick Furlong, Eleazar Eskin
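Concordance fractions of this kind can be computed by comparing, site by site, an array genotype against the genotype called from sequencing. The sketch below is an illustration under assumptions: it treats "no info" as a site with no sequencing call, and the function name and data layout are hypothetical rather than the comparison actually used for the slide.

```python
from typing import Dict, Optional, Tuple

Locus = Tuple[str, int]  # (chromosome, position)


def concordance(
    array_calls: Dict[Locus, str],
    seq_calls: Dict[Locus, Optional[str]],
) -> Dict[str, float]:
    """Return fractions of array sites that are concordant, discordant or uncalled."""
    counts = {"concordant": 0, "discordant": 0, "no_info": 0}
    for locus, array_allele in array_calls.items():
        seq_allele = seq_calls.get(locus)  # None: absent or explicitly uncalled
        if seq_allele is None:
            counts["no_info"] += 1
        elif seq_allele == array_allele:
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    total = len(array_calls)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy data: 10 array sites, 8 agree, 1 disagrees, 1 has no sequencing call.
array_calls = {("19", pos): "A" for pos in range(10)}
seq_calls: Dict[Locus, Optional[str]] = {("19", pos): "A" for pos in range(8)}
seq_calls[("19", 8)] = "G"   # discordant
seq_calls[("19", 9)] = None  # no information

result = concordance(array_calls, seq_calls)
print(result)
```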
Mutation Types
Example False Negative
Example False Positive
BACs
Structural Variation
Several types of structural variations (SVs)
  Large insertions/deletions
  Inversions
  Translocations
  Copy number variations
Read pair information used to detect these events
  Paired-end sequencing of either end of a DNA fragment
  Observe deviations from the expected fragment size
  Presence/absence of mate pairs
Read depth to detect copy number variations
Several SV callers published recently
  Run several callers and produce a large set of partially overlapping calls
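The read-pair signatures used in this talk (pairs mapped further apart than expected for deletions, mates in the same orientation for inversions, one-end-mapped reads for insertions) can be expressed as a small classifier. This is a sketch only: the `ReadPair` fields, the three-standard-deviation cutoff and the library parameters are assumptions for illustration, and real SV callers cluster many such pairs before making a call.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    insert_size: int        # mapped distance between the outer ends, in bp
    same_orientation: bool  # True if both mates map to the same strand
    mate_mapped: bool       # False if only one end of the pair aligned


def classify_pair(pair: ReadPair, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0) -> str:
    """Label a single read pair with the SV type it is evidence for, if any."""
    if not pair.mate_mapped:
        return "insertion_candidate"  # one-end-mapped read
    if pair.same_orientation:
        return "inversion_candidate"  # mates align in the same orientation
    if pair.insert_size > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"   # mapped further apart than expected
    if pair.insert_size < mean_insert - n_sd * sd_insert:
        # ASSUMPTION (not on the slide): closer than expected suggests
        # a small insertion in the sample relative to the reference.
        return "insertion_candidate"
    return "concordant"


# Hypothetical library: mean fragment 300 bp, standard deviation 30 bp.
MEAN, SD = 300.0, 30.0
examples = [
    ReadPair(insert_size=2000, same_orientation=False, mate_mapped=True),
    ReadPair(insert_size=310, same_orientation=True, mate_mapped=True),
    ReadPair(insert_size=0, same_orientation=False, mate_mapped=False),
    ReadPair(insert_size=310, same_orientation=False, mate_mapped=True),
]
labels = [classify_pair(p, MEAN, SD) for p in examples]
print(labels)
```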
SV Types
Deletion
SV visualisation
  LookSeq viewer
  Read pairs displayed
  Y axis is aligned insert size
Deletions are easily spotted
  Read pairs are mapped further apart than expected
  Coverage is zero across the deleted sequence
Deletion in NOD/ShiLtJ
Inversion
Mate pairs align in the same orientation
Coverage zero at breakpoints
Insertion
One end mapped reads
Coverage zero at breakpoint
Complex SV Events
Figure labels: Insertion, Insertion, Inversion
Computational Pipeline - SVMerger
Input: BAM
SV callers: Breakdancer, Pindel, SE Cluster, CND, RDExplorer, plus user-defined calls
  Filter applied to each set of calls
  Calls converted to BED format
Merged SV list
  SVs <100bp
  SVs >100bp
Computational validation
  Gather local reads per SV
  Run local assemblies (Velvet)
  Align contigs
  Parse alignments
Refined SV list
  SVs <100bp
  SVs >100bp
SV context
  Overlap with introns/exons
  Overlap with QTLs
Kim Wong
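The merging step in a pipeline of this shape can be illustrated by collapsing overlapping BED-style intervals from several callers while recording which callers support each merged call. This is a generic interval-merge sketch, not SVMerger's actual algorithm; the tuple layout, function name and toy calls are assumptions.

```python
from typing import List, Set, Tuple

# (chromosome, start, end, caller); BED-style half-open coordinates
Call = Tuple[str, int, int, str]
Merged = Tuple[str, int, int, Set[str]]


def merge_calls(calls: List[Call]) -> List[Merged]:
    """Merge overlapping calls on the same chromosome, keeping caller support."""
    merged: List[Merged] = []
    for chrom, start, end, caller in sorted(calls):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous merged call: extend it and add the caller.
            _, prev_start, prev_end, prev_callers = merged[-1]
            merged[-1] = (chrom, prev_start, max(prev_end, end),
                          prev_callers | {caller})
        else:
            merged.append((chrom, start, end, {caller}))
    return merged


# Toy calls, not project data.
toy = [
    ("19", 1200, 1600, "pindel"),
    ("19", 1000, 1500, "breakdancer"),
    ("19", 5000, 5050, "pindel"),
    ("X", 1000, 1500, "cnd"),
]
merged_calls = merge_calls(toy)
for chrom, start, end, callers in merged_calls:
    size_class = "<100bp" if end - start < 100 else ">=100bp"
    print(chrom, start, end, sorted(callers), size_class)
```

Recording the supporting callers per merged interval makes it straightforward to apply a later filter or to prioritise calls for local reassembly.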
SV Call Strategy
Iterative cycle of comparing computational calls vs. manual calls vs. lab-validated calls
Final callset
  Require each SV to be validated by computational local reassembly, lab validation or manual calling
Manual annotation: Chr19 on 8 strains, 2606 variants
PCR + re-seq
Computational call + confirmation
Deletions
Local assemblies of deletions (predicted size >=100bp)
[Bar chart by strain, axis range 0 to 100,000. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Categories: NoAssem, NoMatch, Inconclusive, Complex, Confirm(<75bp), Confirm(>=75bp)]
Insertions
Local assembly of insertions (all)
[Bar chart by strain, axis range 0 to 150,000. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Categories: NoAssem, NoMatch, Inconclusive, Complex, Partial insert, Whole insert]
Inversions
Local assembly of inversions
[Bar chart by strain, axis range 0 to 1,400. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Categories: NoAssem, NoMatch, Inconclusive, Complex, Confirm(<75bp), Confirm(>=75bp)]
Most SVs are Shared
[Bar chart by strain, axis range 0 to 300,000. Strains: 129P2, 129S1_SvImJ, 129S5, A_J, AKR_J, BALBc_J, C3H_HeJ, C57BL_6N, CBA_J, DBA_2J, LP_J, NOD, NZO, CAST_Ei, PWK_Ph, Spretus_Ei, WSB_Ei. Series: Total, Shared]
Retrotransposon Insertions
Endogenous retroviral elements (ERVs) are significant germline mutagens
  ~10% of spontaneous mutations
  Still active in the inbred mouse strains
Much more likely to affect the host
  Powerful transcriptional regulatory elements
  Human ERVs implicated in several autoimmune diseases
  Can activate oncogenes or growth control genes
Two important classes in mice
  Intracisternal A Particle (IAP)
  MusD/Early Transposon (ETn)
Currently incomplete picture across strains
  2005: 26 IAP mutations and 19 ETn mutations had been reported
  2007: Zhang et al. completed the first whole-genome study using available capillary sequence data (~2-3x coverage) for 4 strains
IAP retrotransposons in mice
IAP Type I, 7.3 kb (full length)
  ~500 in mouse
IAP Type I 1∆1, ~5.4 kb
  Most mobile and mutagenic (mainly C3H)
  ~200 in mouse
IAP Type II (“IAP-E”), ~4.0 kb
  Significant seq. divergence from Type I
  ~300 in mouse
Solo LTR
Individual elements may have various internal deletions; some define subclasses, some are isolated
Diagram labels: 5’ LTR (~430 nt), 3’ LTR, gag-pol genes (usually defective), gag-pol (usually del’d), gag-pol, env-like gene, ~1.9 kb deletion in 1∆1 subclass, IAPd1oligo1 spans 1∆1 deletion
Wayne Frankel, Jax
ERV Calls
IAP ETn
MLV
ERV Visualisation
C3H/HeJ Chr18:82476615 84 -
Enabling Forward Genetics
Traditional forward genetic approaches
  Genetic crosses for QTL mapping
  Identifying causal variants is the major rate-limiting step to functional genomics
  Major issue: lack of good sequence data for parental strains
Reverse genetic approaches to complex trait analysis
  Knock-out programmes
Mouse Genomes Project is generating all types of sequence variants across the 17-strain panel
  Calling consequences of the variants
  Goal is to enable the community to create a prioritised list of causal variants for the phenotype in question
  Start experiments with a prioritised list of variants and follow up with genetics experiments
An Example: Chr 17 Pilot
With Joe Nadeau Genome Biology 2009, 10:R112
SNP Variation Consequences
Stop variants in coding regions
  Potentially many of these are annotation errors

Stops (column headings are not recoverable from the source; some strains have one value, others two)

Strain           Value 1    Value 2
129P2/OlaHsd     125
129S1/SvImJ      116
129S5/SvEvBrd    114
A/J              118
AKR/J            122
BALB/cJ          104        2
C3H/HeJ          127        3
C57BL/6N         3          1
CAST/EiJ         342        101
CBA/J            147        2
DBA/2J           118        1
LP/J             119        3
NOD/ShiLtJ       126        6
NZO/HiLtJ        116        11
PWK/PhJ          357        104
SPRET/EiJ        705        415
WSB/EiJ          164        29
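Whether a coding SNP introduces a stop can be checked at the codon level. The sketch below assumes the standard genetic code and a reference codon already given on the coding strand; the function name is hypothetical, and a real consequence caller would also need transcript annotation and strand handling.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def is_stop_gain(codon: str, offset: int, alt: str) -> bool:
    """True if replacing the base at `offset` (0-2) turns a sense codon into a stop."""
    codon = codon.upper()
    alt = alt.upper()
    if len(codon) != 3 or len(alt) != 1 or not 0 <= offset <= 2:
        raise ValueError("expected a 3-base codon, an offset of 0-2 and a single alt base")
    mutated = codon[:offset] + alt + codon[offset + 1:]
    # A codon that is already a stop cannot "gain" one.
    return codon not in STOP_CODONS and mutated in STOP_CODONS


checks = [
    is_stop_gain("TGG", 2, "A"),  # TGG (Trp) to TGA
    is_stop_gain("CAG", 0, "T"),  # CAG (Gln) to TAG
    is_stop_gain("TGG", 0, "C"),  # TGG to CGG (Arg), not a stop
    is_stop_gain("TAA", 2, "G"),  # already a stop in the reference
]
print(checks)
```

Because the result depends entirely on the annotated reading frame, an incorrect gene model can produce spurious stop calls, which fits the slide's caution about annotation errors.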
Large Deletion Consequences
Av. Exon: 408
Whole Gene: 221*
*(Olfr, MUPs clusters)
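Counts like these come from intersecting deletion calls with gene annotation. The sketch below shows one way to label a deletion as removing a whole gene, hitting an exon, or falling within an intron; the category names, half-open coordinate convention and toy gene model are assumptions, and all intervals are taken to be on the same chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def deletion_consequence(deletion: Interval, gene: Interval,
                         exons: List[Interval]) -> str:
    """Classify a deletion relative to one gene and its exons."""
    if deletion[0] <= gene[0] and deletion[1] >= gene[1]:
        return "whole_gene"
    if any(overlaps(deletion, exon) for exon in exons):
        return "exon"
    if overlaps(deletion, gene):
        return "intronic"
    return "none"


# Hypothetical gene model, not real annotation.
gene = (1000, 5000)
exons = [(1000, 1200), (3000, 3300), (4800, 5000)]

consequences = [
    deletion_consequence((900, 5100), gene, exons),
    deletion_consequence((3100, 3500), gene, exons),
    deletion_consequence((1500, 2000), gene, exons),
    deletion_consequence((6000, 7000), gene, exons),
]
print(consequences)
```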
Knock-out Genes Deleted
Genes likely to contribute to phenotypic variability between strains
De novo Assembly
1. Initial read alignment
2. Cluster assemblies
3. Gap filling
4. Reference-based gap filling
5. Scaffolding
3-10 kb libraries
Tools named in the diagram: Maq, Image, Velvet
Diagram labels: Lane 1 (three times), Ref, 1 Mbp, Contigs, LocalAssembly1, LocalAssembly2
Example contig records shown in the diagram:
>Contig1 Acgagtacgagacgatgacagacgta
>Contig2 agctagcactagactagactgacgat
>Contig1 Acgagtacgagacgatgacagacgta
>Contig2 agctacgacaccccgacgggcactacg
Future Work
Genomic sequencing is completed
  Long insert sequencing ongoing
Variant calls
  SNPs: done
  Short indels: final stages
  Structural variants: final stages
Transcriptome sequencing
  Currently sequencing brain RNA samples for 17 strains
  Aim to find coding variants showing expression changes
Data submission
  Short read data -> SRA
  SNPs -> dbSNP
  SVs -> dbVar
  Display via MGI and Ensembl
Sequence assemblies in progress
  Version 1 coming soon; aim for final assemblies by end of 2010
Outstanding questions
  Annotation of novel sequence
  Visualisation and representation of new strain sequence in genome browsers
  Sequence more strains vs. put effort into improving assemblies
Acknowledgements
Vertebrate Resequencing Informatics Team: Jim Stalker, Petr Danecek, Sendu Bala, Kim Wong, Guy Slater
Sanger: David Adams, Richard Durbin, Ewan Birney
Mouse Genomes Collaborators: Jonathan Flint et al., Richard Mott et al., Chris Ponting et al.
Jackson Lab: Laura Reinholdt, Leah Rae Donahue, Wayne Frankel
Sanger Sequencing Teams
http://www.sanger.ac.uk/mousegenomes