church_genomeaccess_2013_genome2013

Post on 10-May-2015

DESCRIPTION

Sequencing and assembly lecture for the CSHL genome access course, Nov 2013

TRANSCRIPT

Deanna M. Church Staff Scientist, NCBI

@deannachurch

Genome Sequencing and Assembly: The human reference assembly

http://genomereference.org

Valerie Schneider, NCBI

Why should you care about the Reference Assembly?

Genes, NCBI Homo sapiens Annotation Release 105

Transcript

CDS

dbSNP Build 138 using annotation release 104

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

N50: Measure of continuity. Contigs of this length or greater contain half of the bases in the assembly.
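A minimal sketch of how N50 is usually computed: sort the lengths from longest to shortest and accumulate until half of the total bases are reached. The function name and the contig lengths below are hypothetical, chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where at least half the bases are covered.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (300 bases in total).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

In this toy set only two of the seven contigs are at least as long as the N50, which shows that the statistic is weighted by bases rather than by contig count.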

What is the Reference Assembly?

Biology
Repetitive sequence (interspersed repeats, segmental duplications)
Variation (regions of high diversity, structural variation)

Kidd et al., 2008

GRCh37 (Primary)

Technology
Read length: long reads vs. short reads
Mate lengths: distribution of insert sizes
Read accuracy: error model for your technology
Read depth: coverage at each base
Genome distribution: reads covering entire genome equally

Ajay et al., 2011

An assembly is a MODEL of the genome

Collins FS et al, 1998

Throughput: 500 Mb/year
Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

February 2001

Genome Research, May, 1997

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.

Scaffold: a sequence constructed from smaller sequences, which may contain gaps.

Genome Vocabulary

Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ

Typically built from sequences in GenBank/EMBL/DDBJ
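To make the contig/scaffold distinction concrete, here is a small sketch that breaks a scaffold into its gap-free contigs. It assumes gaps are written as runs of N, which is a common FASTA convention but is not stated on the slide; the function name and the toy scaffold are hypothetical.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into gap-free contigs at runs of N (assumed gap symbol)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy scaffold: three contigs joined by two gaps of assumed size.
scaffold = "ATTTTCCCTT" + "N" * 5 + "CTGAAATGA" + "N" * 3 + "TGAAAGAGTC"
contigs = split_scaffold(scaffold)
print(len(contigs), contigs)
```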

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information ("mate-pairs")

Find sequence overlaps

Each end sequence is referred to as a read
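The "find sequence overlaps" step can be sketched as an exact suffix-prefix comparison between two reads. This is only an illustration: real assemblers tolerate sequencing errors and use indexed data structures, and the function name, the minimum overlap, and the way the example sequence is cut into two reads are all assumptions.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Two hypothetical reads cut from one short example sequence.
read_a = "ATTTTCCCTTCTGAAATG"
read_b = "CTGAAATGATGAAAGAGTC"

k = suffix_prefix_overlap(read_a, read_b, min_len=5)
merged = read_a + read_b[k:]  # join the reads through their overlap
print(k, merged)
```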

WGS contig

tails

WGS: Sanger Reads

Scaffold

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

BAC insert
BAC vector

Shotgun sequence

Assemble

GAPS

“finishers” go in to manually fill the gaps, often by PCR

Lander and Waterman (1988) Genomics

Assumptions:
Reads are randomly distributed
Overlap between reads does not vary

Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)

Poisson distribution: $P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$

y = number of events in an interval

λ = mean number of events in an interval

For sequence calculations, coverage can be viewed as λ

Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
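The percentages above follow from the Poisson model with the mean set to the coverage C: the chance that a base is hit by zero reads is P(Y = 0) = e^(-C). The sketch below reproduces that arithmetic; the function names and the read length, read count, and genome size used for the coverage example are hypothetical.

```python
import math


def coverage(read_length, num_reads, genome_length):
    """Coverage C = L * N / G."""
    return read_length * num_reads / genome_length


def fraction_not_sequenced(c):
    """Poisson probability of zero reads at a base when mean coverage is c."""
    return math.exp(-c)


# Hypothetical project: 500 bp reads, 6 million reads, 3 Gb genome.
print(coverage(500, 6_000_000, 3_000_000_000))

for c in (1, 5, 10):
    missed = fraction_not_sequenced(c)
    print(f"{c}X: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

The model only holds under the stated assumptions (randomly distributed reads); real libraries have coverage biases, so observed gaps are typically more numerous than this predicts.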

2009 Sanger cost:
shotgun sequence ~ $0.01/base
finished sequence ~ $0.03/base

This clone: Shotgun = $1500, Finish = $3000

[Figure: Sequence Gaps: Uncaptured vs. Total. Bar chart of average gaps per BAC (y-axis, 0 to 10) by species (x-axis), split into uncaptured gaps and captured gaps. Species: tetraodon, muntjak_indian, zebrafinch, zebrafish, macaque, alligator, chicken, sheep, monodelphis, orangutan, gorilla, vervet, cpbat, chimp, owl_monkey, cat, pig, dusky_titi, cow, elephant, fugu, baboon, dog, hedgehog, shrew, armadillo, opossum, squirrel_monkey, rabbit, galago, lemur, rfbat, rat, mouse, marmoset, wallaby, colobus_monkey, platypus]

Captured gap = no sequence, but a sub-clone spans the gap
Uncaptured gap = no sequence, no sub-clone spanning gap

Bob Blakesley, NISC

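[Figure: Placing sequence contigs against a non-sequence based map of markers A through O. Panel "Ideally…": contigs ABCD, FGH, KL and ON line up with the map, with one contig flipped. Panel "More like…": markers are missing, out of order, or unexpected (e.g. ZYX, W, V), leaving some placements uncertain (?)]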

Sequence vs. Non-sequence based maps: Mmu7

WI Genetic
WI/MRC RH

[Figure: Enrichment, observed vs. expected (axis ticks -5 to 5; 60 to 0 to 60), by category: Major histocompatibility complex antigen; Chemokine; Tumor necrosis factor receptor; Other cytokine receptor; Cysteine protease inhibitor; CAM family adhesion molecule; Apolipoprotein; KRAB box transcription factor; Intermediate filament; Immunoglobulin receptor family member; Other cell adhesion molecule; Zinc finger transcription factor; Defense/immunity protein; Structural protein; Cysteine protease; Cytokine receptor; Oxygenase; Cell adhesion molecule; Transcription factor; Miscellaneous function; Signaling molecule; Oxidoreductase; Unclassified; Nucleic acid binding; Select regulatory molecule; Kinase; Hydrolase; Ribosomal protein; Protein kinase; G-protein modulator; Extracellular matrix; Other transcription factor]

Human PANTHER classifications (biological process)

Evan Eichler, University of Washington

Fragmented genomes tend to have more partial models

Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012

RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7

Gaps

NCBI36 (hg18)

GRCh37 (hg19)

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis

Switch point

Consensus sequence

NCBI36

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

http://genomereference.org

Distributed data

Genome not in INSDC Database

Old Assembly Model

Human Genome Project (HGP)

5 July 2011

Issue tracking system (based on JIRA)

http://genomereference.org

Full Dovetail

Half-dovetail

Contained

Short/Blunt

AGP: A Golden Path

Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types

GRC Produces
• AGP
• FASTA
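As a rough sketch of how an AGP file acts as build instructions, the code below walks tab-delimited AGP lines and concatenates component slices and gaps into an object sequence. It follows the column layout of the public AGP specification (gap rows have component type N or U and carry a gap length; other rows name a component, its 1-based begin/end, and an orientation). The object name, component names, and sequences are hypothetical, and the sketch does no validation of coordinates or part numbers.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def build_from_agp(agp_lines, components):
    """Build object sequences from AGP lines and a dict of component sequences."""
    objects = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        obj, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):
            # Gap row: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            # Component row: the slice used from the component ends at the
            # coordinate where the path switches to the next component.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Hypothetical components and AGP rows for a toy object "chrDemo".
components = {"compA.1": "ATTTTCCCTT", "compB.1": "GACTCTTTCA"}
agp = [
    "chrDemo\t1\t8\t1\tF\tcompA.1\t1\t8\t+",
    "chrDemo\t9\t12\t2\tN\t4\tscaffold\tyes\tpaired-ends",
    "chrDemo\t13\t18\t3\tF\tcompB.1\t3\t8\t-",
]
built = build_from_agp(agp, components)
print(built["chrDemo"])
```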

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

UGT2B17 Region

7 alternate haplotypes at the MHC

Alternate loci released as:
FASTA
AGP
Alignment to chromosome

UGT2B17 MHC MAPT

GRCh37 (hg19)

Oh No! Not a new version of the human reference!

http://genomereference.org

Assembly (e.g. GRCh37.p13)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Patches

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

MHC (chr6)
Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

17q deletion

H1

H2

Zody et al, 2008

chromosome

alt/patch

reads

On-target alignment

Off-target alignments (n=122,922)

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds

http://www.ncbi.nlm.nih.gov/genome/assembly

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database

Variant Calling and the Reference Assembly

http://www.bioplanet.com/gcat

Kidd et al, 2007 APOBEC cluster

Part of chr22 assembly

Alternate locus for chr22

White: Insertion
Black: Deletion

O'Rawe et al., 2013

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N

129S6/SvEvTac Alt Locus Alignment Ren1 (allelic)

FVB/N Transcript Alignment Ren2 (paralog)

129S6/SvEvTac Ren1

FVB Ren2 Tx

Paralogous diff

SNP + Paralogous diff

Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38

Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)

Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)

Doggett et al., 2006

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

1KG Phase 1 Strict accessibility mask

SNP (all)

SNP (not 1KG)

Sudmant et al., 2010

GRCh38 is coming (September, 2013)

GRCh37 Scaff N50: 44,983,201
GRCh37B Scaff N50: 62,124,159

GRCh37 Contig N50: 38,440,852
GRCh37B Contig N50: 49,319,739

Major Features of GRCh38

Modeled centromeres
Individual base updates
Fixed tiling path/assembly errors
Addition of novel sequence

Adding Novel Sequence

Karen Miga and Jim Kent arXiv:1307.0035

Dennis et al., 2012

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

[Figure: Venn diagram comparing the 61-mer analysis set and the 1kG high-confidence set; counts shown: 9664, 1358, 4222]

Ref allele frequency = 0

Mismatches, MAF = 0: n=15,244

Insertions, MAF = 0: n=834

Deletions, MAF = 0: n=1541

Mismatch in pseudo/pr txpt, MAF < 5%: n=1413

Annotator and clinical requests: n= ~260

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

79% of these bases are heterozygous in RP11 WGS

GRCh37 Insertions Originating from RP11

GRCh37 Deletions Originating from RP11

17% heterozygous in RP11 WGS

18% heterozygous in RP11 WGS

Fixing Rare/Incorrect Bases

NOVEL GENES!

GRCh37.p13: 211 genes found only on alt loci and patches

Genovese et al., 2013

FAM23_MRC1 Region, chr10

Segmental Duplications

1KG accessibility Mask

Novel Patch 250 kb of artificial duplication

Adding Novel Sequence

GRCh37.p13
120 Fix Patches
60 Novel

Human Resolved for GRCh38

http://genomereference.org

Remap Set up slide
