agbt2015 workshop schneider

Advancing the Human Reference Assembly Valerie Schneider NCBI 25 February 2015 The Human Reference Genome: Today, Tomorrow and Next ? http://genomereference.org

Upload: genome-reference-consortium

Post on 14-Jul-2015


TRANSCRIPT

Advancing the Human Reference Assembly

Valerie Schneider, NCBI

25 February 2015

The Human Reference Genome: Today, Tomorrow and Next ?

http://genomereference.org

Dilthey et al.; Paten et al.

Scientific Models

Outline

• The assembly model
  • Basics
  • Value added
  • Challenges

• Future relevance of the reference
  • Multiple genomes
  • Haploid genomes

• Assembly updates
  • Mechanisms
  • Requirements/Challenges

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

Current Assembly model: represent both haplotypes

GRC Assembly Model

[Diagram labels: Assembly (e.g. GRCh38); Primary Assembly Unit; Non-nuclear assembly unit (e.g. MT); PAR; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT); "many"]

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

GRC Assembly Model

[Diagram labels: ALT 1 through ALT 7]

GRC Assembly Model

Alt loci alignments are an integral part of the assembly model

alignment to chr + scaffold sequence = Alt
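The "alignment to chr + scaffold sequence = Alt" idea can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class `AltLocus`, the helper `alts_at`, the field names, and all coordinates are hypothetical and are not GRC or NCBI identifiers or real GRCh38 placements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AltLocus:
    """An alternate locus: a scaffold plus its alignment to a chromosome."""
    name: str          # hypothetical label, e.g. "ALT_1"
    scaffold_len: int  # length of the alt scaffold sequence, in bases
    chrom: str         # chromosome the scaffold is aligned to
    chrom_start: int   # 0-based start of the alignment on the chromosome
    chrom_end: int     # exclusive end of the alignment on the chromosome

    def covers(self, chrom: str, pos: int) -> bool:
        """True if a chromosome position lies inside the aligned interval."""
        return chrom == self.chrom and self.chrom_start <= pos < self.chrom_end


def alts_at(alts, chrom, pos):
    """Names of all alt loci whose placement spans the given position."""
    return [a.name for a in alts if a.covers(chrom, pos)]


# Placeholder coordinates for illustration only, not real GRCh38 placements.
alts = [
    AltLocus("ALT_1", 5_000, "chr6", 1_000, 6_000),
    AltLocus("ALT_2", 4_500, "chr6", 1_500, 6_200),
]
print(alts_at(alts, "chr6", 1_200))  # position covered by ALT_1 only
print(alts_at(alts, "chr6", 2_000))  # position covered by both alts
```

Keeping the chromosome placement next to the scaffold is what lets a tool relate a hit on an alt scaffold back to a location on the primary assembly.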

GRCh38

• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
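A quick back-of-the-envelope check of the GRCh38 figures above, as a sketch. The constants are copied from the slide; the variable names are illustrative, and because the 2% figure is rounded, the implied total is only approximate.

```python
# Figures quoted on the slide for GRCh38.
REGION_MB = 61.9         # chromosome sequence in regions with alt loci
REGION_FRACTION = 0.02   # "2% of chromosome sequence" (rounded on the slide)
N_ALT_LOCI = 261
NOVEL_MB = 3.6           # novel sequence relative to the chromosomes

# Total chromosome sequence implied by the rounded 2% figure.
implied_total_mb = REGION_MB / REGION_FRACTION

# Mean novel sequence contributed per alt locus, in kb.
novel_kb_per_alt = NOVEL_MB * 1000 / N_ALT_LOCI

print(f"implied chromosome sequence: ~{implied_total_mb:.0f} Mb")
print(f"novel sequence per alt locus: ~{novel_kb_per_alt:.1f} kb")
```

The second number is far smaller than the 400 kb average alt length, which is consistent with most of each alt scaffold aligning to the chromosome and only a small part being novel.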

GRCh38

GRC Assembly Model

The human reference assembly represents population genomic diversity in the context of linear sequences

GRCh38: Alt Loci

Alignment Legend

no alignment, mismatch, deletion

GRCh38: Alt Loci

[Figure labels: GRCh38 alt loci alignment; GRCh37 chr. 7; chromosome; alt/patch; reads; On-target alignment; Off-target alignments (n=122,922)]

GRCh38: Alt Loci

GRC Assembly Model

http://notvine.co/

http://designtaxi.com/

Challenges: Allelic Duplication

Challenges: Reporting Multiple Locations

SRPRISM

Challenges: Solutions

https://github.com/samtools/hts-specs/issues/51

Multiple Genome Era

ADMIXTURE?

http://medcitynews.com/

Multiple Genome Era

[Diagram labels: Assembly (e.g. GRCh38.p1); Primary Assembly Unit; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 7; PAR; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT); Patches; Genomic Region (ABO); Genomic Region (FOXO6); Genomic Region (FCGBP)]

Assembly Updates

Patches

Patch type | Scaffold status at next major assembly release | Treat as
FIX | -- (integrated) | Preferred
NOVEL | ALT LOCI | Allelic
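The FIX/NOVEL distinction above can be sketched as a lookup table. The pairing used here (FIX patches integrated and treated as preferred, NOVEL patches becoming alt loci and treated as allelic) is a reading of the slide's flattened table; `PatchType`, `PATCH_RULES`, `describe`, and the dictionary keys are hypothetical names, not part of any GRC or NCBI tool.

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# Mapping read off the slide's table; key names are illustrative assumptions.
PATCH_RULES = {
    PatchType.FIX: {
        "status_at_next_major_release": "integrated",
        "treat_as": "preferred",
    },
    PatchType.NOVEL: {
        "status_at_next_major_release": "alt locus",
        "treat_as": "allelic",
    },
}


def describe(patch_type: PatchType) -> str:
    """One-line summary of how a patch scaffold of this type is handled."""
    rule = PATCH_RULES[patch_type]
    return (
        f"{patch_type.value} patch: {rule['status_at_next_major_release']} "
        f"at next major release; treat as {rule['treat_as']}"
    )


for patch_type in PatchType:
    print(describe(patch_type))
```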

Assembly Updates

Assembly Updates

GRC

Criteria for Reference Assembly Component Sequences

• Finished Quality
• INSDC Accessioned
• Representative of an actual DNA molecule

Summary

• Reference Assembly: Today
  • Multi-allelic
  • Need compatible toolsuites

• Reference Assembly: Tomorrow
  • Defining sequence context
  • Providing coordinates

• Reference Assembly: Next ?
  • Patches
  • Challenges

GRCh38 Collaborators

• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB

• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

GRC Credits

Workshop sponsor:

http://genomereference.org