mouse genomes project summary june 2010

Vertebrate Resequencing Informatics 18th June, 2010 The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains Thomas Keane, Senior Scientific Manager, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK http://www.sanger.ac.uk/mousegenomes

Upload: thomas-keane

Post on 10-May-2015

DESCRIPTION

A talk given at the Jackson Laboratory, ME, USA with a summary of the progress of the mouse genomes project as of June 2010

TRANSCRIPT

Page 1: Mouse Genomes Project Summary June 2010

Vertebrate Resequencing Informatics 18th June, 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

Thomas Keane, Senior Scientific Manager, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK

http://www.sanger.ac.uk/mousegenomes

Page 2: Mouse Genomes Project Summary June 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

- Introduction
- SNP Calling
- Variation Consequences
- Structural Variation
- Future Work

Page 3: Mouse Genomes Project Summary June 2010

Sequencing Technologies over past 30 years

MR Stratton et al. Nature 458, 719-724 (2009)

Page 4: Mouse Genomes Project Summary June 2010

WTSI - major project usage 2009

[Chart of sequencing usage by project. Categories as labelled: Human variation; Cancer; Mouse; Pulldown; Medical reseq; Pathogen; Development; Malaria; WTCCC; Small faculty; Other; 3rd party; Zebrafish TILLING; ICR; Infrastructure]

Page 5: Mouse Genomes Project Summary June 2010

A view from the Sanger

Page 6: Mouse Genomes Project Summary June 2010

A view from the Sanger

Page 7: Mouse Genomes Project Summary June 2010

Mouse Genomes Project

2002: Mouse genome reference published
- C57BL/6J strain, currently mm9

2006: Perlegen
- 15 strains and 8.3M SNPs

2008: Mouse Genomes Project
- Sequence the genomes of 17 key mouse strains
- Inbred lab strains used by mouse geneticists
- Includes 3 recently wild-derived strains

Primary goals of the project
- Catalogue all types of sequence variation across the strains
- Produce de novo assemblies per strain

Sequencing on the Illumina platform at Sanger
- Majority of sequencing completed by end of 2009
- All strains >20x sequence coverage
- Read lengths: 54, 76, 108 bp
- Fragment sizes of 200-600 bp

Regular data releases
- Initial release: June 2009
  - SNPs, short indels
  - Web interface to query and visualise read alignments and sequence variation
- Final data freeze: December 2009
  - 1.53 Tbp sequenced (510x)
  - One BAM file per strain (mapped and unmapped reads)
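
The 510x figure is consistent with dividing the total sequenced bases by a genome size of roughly 3 Gbp. That genome size is an assumption made here for illustration; the slide does not state it. A minimal sketch of the arithmetic:

```python
# Assumed mouse genome size used for the fold-coverage estimate.
# This value is NOT stated on the slide; it is an illustrative assumption.
ASSUMED_GENOME_SIZE_BP = 3.0e9

TOTAL_SEQUENCED_BP = 1.53e12  # 1.53 Tbp, from the slide
N_STRAINS = 17                # from the slide


def fold_coverage(total_bp, genome_size_bp):
    """Fold coverage is sequenced bases divided by genome size."""
    return total_bp / genome_size_bp


total_fold = fold_coverage(TOTAL_SEQUENCED_BP, ASSUMED_GENOME_SIZE_BP)
mean_per_strain = total_fold / N_STRAINS

print(f"{total_fold:.0f}x in total, about {mean_per_strain:.0f}x per strain on average")
```

Under that assumption the average per-strain depth comes out around 30x, which agrees with the ">20x for all strains" statement above.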

Page 8: Mouse Genomes Project Summary June 2010

Mouse Genomes Project

Page 9: Mouse Genomes Project Summary June 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

- SNP Calling
- Introduction
- Variation Consequences
- Structural Variation
- Future Work

Page 10: Mouse Genomes Project Summary June 2010

SNP Calling

Homozygous inbred mouse strains
- Very few heterozygous SNPs

SNP calling strategy
- Use multiple SNP callers
  - Samtools varfilter
  - QCALL (Le Si Quang, Sanger)
  - GATK (Broad Institute)
  - Iterative remapping caller (Oxford)
- Merge individual callsets to produce final set

Validation data
- Sanger finished BACs: NOD/ShiLtJ
- Mouse HapMap (Perlegen): all strains

Samtools/GATK

QCALL

Petr Danecek

Page 11: Mouse Genomes Project Summary June 2010

Final Strategy

SNPs called with each of the individual callers

Validation data
- NOD BACs
- Several Mbp of finished sequence
- Gold standard sequence to determine our error rates

Combined the SNP calls in various permutations

Final callset chosen was the ‘at least two callers’ rule
- Optimal balance of false positives/negatives
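
The ‘at least two callers’ rule can be sketched as a simple vote across callsets. This is an illustrative sketch only, not the project's actual merging code: the function name `consensus_calls`, the `(chrom, pos, alt)` site representation and the toy calls are all hypothetical.

```python
from collections import Counter


def consensus_calls(callsets, min_callers=2):
    """Return the sites reported by at least `min_callers` callers.

    `callsets` maps a caller name to a set of (chrom, pos, alt) tuples.
    """
    support = Counter()
    for calls in callsets.values():
        # Converting to a set means each caller votes at most once per site.
        support.update(set(calls))
    return {site for site, n in support.items() if n >= min_callers}


# Toy calls, invented for illustration only.
callsets = {
    "samtools": {("chr1", 100, "A"), ("chr1", 250, "T")},
    "qcall": {("chr1", 100, "A"), ("chr1", 400, "G")},
    "gatk": {("chr1", 250, "T"), ("chr1", 400, "G"), ("chr1", 900, "C")},
}

final = consensus_calls(callsets)
print(sorted(final))
```

Raising `min_callers` trades false positives for false negatives. Comparing each setting against a gold-standard set such as finished BAC sequence is one way to choose the balance described above.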

Page 12: Mouse Genomes Project Summary June 2010

Final SNPs

Strain   Raw        Private
C57BL    37240      13819
BALB     4532356    99785
A_J      4845615    113684
NOD      4924864    229125
AKR      5051556    202664
129S5    5087252    86948
C3H      5104106    96541
NZO      5144172    327797
129S1    5157820    61212
DBA      5167254    171522
CBA      5225691    122140
LP_J     5432909    134922
129P2    5478407    146580
WSB      6956471    1177170
PWK      19393909   5421503
CAST     20197854   6642404
SPRET    40558991   27409147

dbSNP calls: 10.1M
Perlegen calls: 8.3M
Mouse Genomes: 65.0M

Page 13: Mouse Genomes Project Summary June 2010

Perlegen Concordance

17 strains sequenced
- 7 strains overlap with the 15 Perlegen strains
- Extremely high SNP concordance rates

SNPs have been imputed across 61 strains
- http://mouse.cs.ucla.edu/mousehapmap/beta/index.html

Values are fractions of SNPs (for example, 0.886 is 88.6%).

Strain        Concordant    Discordant    No info
129S1/SvImJ   0.885636549   0.006589192   0.107774259
A/J           0.903007001   0.005785193   0.091207806
AKR/J         0.890526999   0.005832134   0.103640867
BALB/cByJ     0.896493269   0.004583561   0.09892317
C3H/HeJ       0.879009694   0.005307051   0.115683255
DBA/2J        0.884198574   0.004952265   0.110849161
NOD/LtJ       0.884508602   0.006592604   0.108898794

Nick Furlong, Eleazar Eskin

Page 14: Mouse Genomes Project Summary June 2010

Mutation Types

Page 15: Mouse Genomes Project Summary June 2010

Example False Negative

Page 16: Mouse Genomes Project Summary June 2010

Example False Positive

BACs

Page 17: Mouse Genomes Project Summary June 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

- Structural Variation
- Introduction
- Variation Consequences
- SNP Calling
- Future Work

Page 18: Mouse Genomes Project Summary June 2010

Structural Variation

Several types of structural variation (SVs)
- Large insertions/deletions
- Inversions
- Translocations
- Copy number variations

Read pair information used to detect these events
- Paired-end sequencing of both ends of each DNA fragment
- Observe deviations from the expected fragment size
- Presence/absence of mate pairs
- Read depth to detect copy number variations
- Several SV callers published recently

Run several callers and produce a large set of partially overlapping calls

Page 19: Mouse Genomes Project Summary June 2010

SV Types

Page 20: Mouse Genomes Project Summary June 2010

Deletion

SV Visualisation
- LookSeq viewer
- Read pairs displayed
- Y axis is aligned insert size

Deletions are easily spotted
- Read pairs are mapped further apart than expected
- Coverage is zero across the deleted sequence

[Figure: Deletion in NOD/ShiLtJ]

Page 21: Mouse Genomes Project Summary June 2010

Inversion

Mate pairs align in the same orientation

Coverage zero at breakpoints

Page 22: Mouse Genomes Project Summary June 2010

Insertion

One end mapped reads

Coverage zero at breakpoint
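
The read-pair signatures described on the deletion, inversion and insertion slides can be sketched as a small classifier. This is a simplified illustration, not how any of the published SV callers works: the `ReadPair` type, the `classify_pair` function and the 600 bp threshold (taken from the upper end of the stated fragment sizes) are assumptions, and real callers also use library-specific insert-size distributions and read depth.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadPair:
    """Hypothetical minimal description of one aligned read pair."""
    insert_size: Optional[int]  # aligned distance between mates; None if a mate is unmapped
    same_orientation: bool      # True if both mates align in the same orientation
    mate_mapped: bool = True    # False for one-end-mapped reads


def classify_pair(pair, expected_max=600):
    """Label a read pair with the SV signature it is consistent with."""
    if not pair.mate_mapped:
        return "insertion_candidate"  # one-end-mapped read
    if pair.same_orientation:
        return "inversion_candidate"  # mates align in the same orientation
    if pair.insert_size > expected_max:
        return "deletion_candidate"   # mates map further apart than expected
    return "concordant"


# Toy read pairs, invented for illustration only.
pairs = [
    ReadPair(350, False),
    ReadPair(5000, False),
    ReadPair(400, True),
    ReadPair(None, False, mate_mapped=False),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```

A single discordant pair is weak evidence. In practice, callers look for clusters of pairs with the same signature, together with the coverage drop at the breakpoint noted above.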

Page 23: Mouse Genomes Project Summary June 2010

Complex SV Events

[Figure labels: Insertion, Insertion, Inversion]

Page 24: Mouse Genomes Project Summary June 2010

Computational Pipeline - SVMerger

[Flowchart, reconstructed from the diagram labels]

Input: BAM

SV callers (each followed by a filter step)
- Breakdancer
- Pindel
- SE Cluster
- CND
- RDXplorer
- User-defined calls

Calls converted to BED format

Merged SV list
- SVs <100bp
- SVs >100bp

Computational validation
- Gather local reads per SV
- Run local assemblies (Velvet)
- Align contigs
- Parse alignments

Refined SV list
- SVs <100bp
- SVs >100bp

SV context
- Overlap with introns/exons
- Overlap with QTLs

Kim Wong
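
The merge step of a pipeline like this can be sketched as interval merging over BED-style calls. This is not SVMerger's implementation: the `merge_sv_calls` function, the tuple layout, the overlap criterion (any overlap between calls of the same type) and the toy coordinates are assumptions made for illustration.

```python
def merge_sv_calls(calls):
    """Merge overlapping calls of the same SV type on the same chromosome.

    `calls` is a list of (chrom, start, end, sv_type, caller) tuples with
    BED-style half-open coordinates.
    """
    merged = []
    # Sort so that calls with the same chromosome and type are adjacent,
    # ordered by start position.
    ordered = sorted(calls, key=lambda c: (c[0], c[3], c[1], c[2]))
    for chrom, start, end, sv_type, caller in ordered:
        last = merged[-1] if merged else None
        if (last and last["chrom"] == chrom and last["type"] == sv_type
                and start < last["end"]):
            # Overlaps the previous merged call: extend it and record the caller.
            last["end"] = max(last["end"], end)
            last["callers"].add(caller)
        else:
            merged.append({"chrom": chrom, "start": start, "end": end,
                           "type": sv_type, "callers": {caller}})
    return merged


# Toy calls, invented for illustration only.
calls = [
    ("chr19", 1000, 1500, "DEL", "breakdancer"),
    ("chr19", 1200, 1600, "DEL", "pindel"),
    ("chr19", 5000, 5050, "DEL", "pindel"),
    ("chr19", 1300, 1450, "INV", "breakdancer"),
]
merged = merge_sv_calls(calls)

# Split by size, mirroring the <100bp / >100bp lists in the flowchart.
small = [m for m in merged if m["end"] - m["start"] < 100]
large = [m for m in merged if m["end"] - m["start"] >= 100]
print(len(merged), len(small), len(large))
```

Keeping the set of supporting callers on each merged record makes later filtering straightforward, for example when prioritising calls for local reassembly.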

Page 25: Mouse Genomes Project Summary June 2010

SV Call Strategy

Iterative cycle of comparing computational calls vs. manual calls vs. lab-validated calls

Final callset
- Require each SV to be validated by computational local reassembly, lab validation, or manual calling

Manual annotation - Chr19 on 8 strains - 2606 variants

PCR + re-seq

Computational call + confirmation

Page 26: Mouse Genomes Project Summary June 2010

Deletions

[Chart: Local assemblies of deletions (predicted size >=100bp), per strain. Axis scale: 0 to 100000. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Legend: NoAssem, NoMatch, Inconclusive, Complex, Confirm(<75bp), Confirm(>=75bp)]

Page 27: Mouse Genomes Project Summary June 2010

Insertions

[Chart: Local assembly of insertions (all), per strain. Axis scale: 0 to 150000. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Legend: NoAssem, NoMatch, Inconclusive, Complex, Partial insert, Whole insert]

Page 28: Mouse Genomes Project Summary June 2010

Inversions

[Chart: Local assembly of inversions, per strain. Axis scale: 0 to 1400. Strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO. Legend: NoAssem, NoMatch, Inconclusive, Complex, Confirm(<75bp), Confirm(>=75bp)]

Page 29: Mouse Genomes Project Summary June 2010

Most SVs are Shared

[Chart: total vs. shared SVs per strain. Axis scale: 0 to 300000. Strains: 129P2, 129S1_SvImJ, 129S5, A_J, AKR_J, BALBc_J, C3H_HeJ, C57BL_6N, CBA_J, DBA_2J, LP_J, NOD, NZO, CAST_Ei, PWK_Ph, Spretus_Ei, WSB_Ei. Legend: Total, Shared]

Page 30: Mouse Genomes Project Summary June 2010

Retrotransposon Insertions

Endogenous retroviral elements (ERVs) are significant germline mutagens
- ~10% of spontaneous mutations
- Still active in the inbred mouse strains

Much more likely to affect the host
- Powerful transcriptional regulatory elements
- Human ERVs implicated in several autoimmune diseases
- Can activate oncogenes or growth control genes

Two important classes in mice
- Intracisternal A Particle (IAP)
- MusD/Early Transposon (ETn)

Currently incomplete picture across strains
- 2005: 26 IAP mutations and 19 ETn mutations had been reported
- 2007: Zhang et al. completed the first whole-genome study using available capillary sequence data (~2-3x coverage) for 4 strains

Page 31: Mouse Genomes Project Summary June 2010

IAP retrotransposons in mice

[Figure: structure of IAP element classes; labels recovered from the diagram]
- IAP Type I, 7.3 kb (full length): ~500 in mouse
- IAP Type I 1∆1, ~5.4 kb: most mobile and mutagenic (mainly C3H); ~200 in mouse; ~1.9 kb deletion in 1∆1 subclass
- IAP Type II ("IAP-E"), ~4.0 kb: significant seq. divergence from Type I; ~300 in mouse
- Solo LTR
- Element features labelled: 5’ LTR (~430 nt), 3’ LTR, gag-pol genes (usually defective), gag-pol (usually del’d), env-like gene
- IAPd1oligo1 spans 1∆1 deletion
- Individual elements may have various internal deletions; some define subclasses, some are isolated

Wayne Frankel, Jax

Page 32: Mouse Genomes Project Summary June 2010

ERV Calls

IAP ETn

MLV

Page 33: Mouse Genomes Project Summary June 2010

ERV Visualisation

C3H/HeJ Chr18:82476615 84 -

Page 34: Mouse Genomes Project Summary June 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

- Variation Consequences
- SNP Calling
- Introduction
- Structural Variation
- Future Work

Page 35: Mouse Genomes Project Summary June 2010

Enabling Forward Genetics

Traditional forward genetic approaches
- Genetic crosses for QTL mapping
- Identifying causal variants is the major rate-limiting step to functional genomics
- Major issue: lack of good sequence data for parental strains

Reverse genetic approaches to complex trait analysis
- Knock-out programmes

The Mouse Genomes Project is generating all types of sequence variants across the 17-strain panel
- Calling consequences of the variants
- Goal is to enable the community to create a prioritised list of candidate causal variants for the phenotype in question
- Start experiments with a prioritised list of variants and follow up with genetics experiments
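
One way such a prioritised list might be built is sketched below: keep variants inside the QTL interval whose strain distribution matches the phenotype, then sort by predicted consequence. Everything here is hypothetical and for illustration only, including the `prioritise` function, the `SEVERITY` ranking, the consequence labels and the toy variants; it is not a method described in the talk.

```python
# Hypothetical severity ranking; a lower number means higher priority.
SEVERITY = {
    "stop_gained": 0,
    "whole_gene_deletion": 0,
    "exon_deletion": 1,
    "missense": 2,
    "synonymous": 3,
    "intergenic": 4,
}


def prioritise(variants, qtl, strains_with_trait, strains_without_trait):
    """Rank variants in a QTL interval that segregate with the phenotype."""
    chrom, start, end = qtl
    candidates = []
    for v in variants:
        if v["chrom"] != chrom or not (start <= v["pos"] <= end):
            continue
        carriers = set(v["strains"])
        # Keep variants carried by every trait strain and by no non-trait strain.
        if strains_with_trait <= carriers and not (carriers & strains_without_trait):
            candidates.append(v)
    # Unknown consequence labels sort last.
    return sorted(candidates,
                  key=lambda v: (SEVERITY.get(v["consequence"], 99), v["pos"]))


# Toy variants, invented for illustration only.
variants = [
    {"chrom": "chr17", "pos": 1_000_000, "consequence": "synonymous",
     "strains": ["A_J", "NOD"]},
    {"chrom": "chr17", "pos": 2_500_000, "consequence": "stop_gained",
     "strains": ["A_J", "NOD"]},
    {"chrom": "chr17", "pos": 3_000_000, "consequence": "missense",
     "strains": ["A_J", "NOD", "DBA"]},
    {"chrom": "chr17", "pos": 9_000_000, "consequence": "stop_gained",
     "strains": ["A_J", "NOD"]},
]
ranked = prioritise(variants, ("chr17", 500_000, 5_000_000),
                    {"A_J", "NOD"}, {"DBA", "CBA"})
print([(v["pos"], v["consequence"]) for v in ranked])
```

The strict "all trait strains, no other strains" filter is deliberately simple. A real analysis would need to allow for missing genotypes and for more than one causal variant.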

Page 36: Mouse Genomes Project Summary June 2010

An Example: Chr 17 Pilot

With Joe Nadeau Genome Biology 2009, 10:R112

Page 37: Mouse Genomes Project Summary June 2010

SNP Variation Consequences

Stop variants in coding regions
- Potentially many of these are annotation errors

Strain          Stops   (second column, unlabelled in source)
129P2/OlaHsd    125
129S1/SvImJ     116
129S5/SvEvBrd   114
A/J             118
AKR/J           122
BALB/cJ         104     2
C3H/HeJ         127     3
C57BL/6N        3       1
CAST/EiJ        342     101
CBA/J           147     2
DBA/2J          118     1
LP/J            119     3
NOD/ShiLtJ      126     6
NZO/HiLtJ       116     11
PWK/PhJ         357     104
SPRET/EiJ       705     415
WSB/EiJ         164     29

Page 38: Mouse Genomes Project Summary June 2010

Large Deletion Consequences

Av. Exon: 408

Whole Gene: 221*

*(Olfr, MUPs clusters)

Page 39: Mouse Genomes Project Summary June 2010

Knock-out Genes Deleted

Genes likely to contribute to phenotypic variability between strains

Page 40: Mouse Genomes Project Summary June 2010

The Mouse Genomes Project - Next-generation sequencing and variation analysis of 17 mouse strains

- Future Work
- SNP Calling
- Introduction
- Structural Variation
- Variation Consequences

Page 41: Mouse Genomes Project Summary June 2010

De novo Assembly

[Figure: assembly pipeline diagram. Labels: Lane1 (x3), Ref, 1Mbp, LocalAssembly1, LocalAssembly2, Contigs (example FASTA records >Contig1 and >Contig2). Tools named: Maq, Image, Velvet]

1. Initial read alignment
2. Cluster assemblies
3. Gap filling
4. Reference-based gap filling
5. Scaffolding
   - 3-10 kb libraries

Page 42: Mouse Genomes Project Summary June 2010

Future Work

Genomic sequencing is completed
- Long insert sequencing ongoing

Variant calls
- SNPs: done
- Short indels: final stages
- Structural variants: final stages

Transcriptome sequencing
- Currently sequencing brain RNA samples for 17 strains
- Aim to find coding variants showing expression changes

Data submission
- Short read data -> SRA
- SNPs -> dbSNP
- SVs -> dbVar
- Display via MGI and Ensembl

Sequence assemblies in progress
- Version 1 coming soon; aim for final assemblies by end of 2010

Outstanding questions
- Annotation of novel sequence
- Visualisation and representation of new strain sequence in genome browsers
- Sequence more strains vs. put effort into improving assemblies

Page 43: Mouse Genomes Project Summary June 2010

Acknowledgements

Vertebrate Resequencing Informatics Team
- Jim Stalker
- Petr Danecek
- Sendu Bala
- Kim Wong
- Guy Slater

Sanger
- David Adams
- Richard Durbin
- Ewan Birney

Mouse Genomes Collaborators
- Jonathan Flint et al.
- Richard Mott et al.
- Chris Ponting et al.

Jackson Lab
- Laura Reinholdt
- Leah Rae Donahue
- Wayne Frankel

Sanger Sequencing Teams

http://www.sanger.ac.uk/mousegenomes