GRC Workshop
ASHG, 22 Oct 2013

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org
What is the Reference Assembly?
Reference Assembly Basics
An assembly is a MODEL of the genome
Lander and Waterman (1988) Genomics

Assumptions:
• Reads are randomly distributed
• Overlap between reads does not vary

Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)

Poisson distribution: $P(Y=y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$
y = number of events in an interval
λ = mean number of events in an interval

For sequence calculations, coverage can be viewed as λ.
Reference Assembly Basics
Using this equation, you can calculate the probability that a base has been sequenced y times.
By manipulating this formula, you can estimate the number of gaps for any given level of coverage.

Coverage   Not sequenced   Sequenced
1X         37%             63%
5X         0.6%            99.4%
10X        0.005%          99.995%
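The coverage figures can be reproduced from the Poisson model with λ set to the coverage. The sketch below is illustrative only: the function names are not from the slides. The contig estimate uses the Lander and Waterman result that the expected number of contigs is N·e^(−Cσ) with σ = 1 − T/L, and the genome size, read length, read count and overlap in the last lines are assumed toy values.

```python
import math


def prob_depth(y: int, coverage: float) -> float:
    """Poisson probability that a base is sequenced exactly y times."""
    return coverage ** y * math.exp(-coverage) / math.factorial(y)


def fraction_unsequenced(coverage: float) -> float:
    """Fraction of bases expected to have zero reads (y = 0)."""
    return prob_depth(0, coverage)


def expected_contigs(G: float, L: float, N: float, T: float) -> float:
    """Lander-Waterman expected number of contigs: N * exp(-C * sigma)."""
    C = L * N / G          # coverage, as defined on the slide
    sigma = 1 - T / L      # fraction of a read usable for detecting overlap
    return N * math.exp(-C * sigma)


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    print(f"{c}X: not sequenced {missed:.3%}, sequenced {1 - missed:.3%}")

# Assumed toy values: 150 kb clone, 500 bp reads, 3000 reads, 50 bp overlap.
print(round(expected_contigs(G=150_000, L=500, N=3000, T=50), 2))
```

At 5X the unrounded value is about 0.67%, which the slide shows as 0.6%. The number of gaps in a linear sequence is one less than the number of contigs, so the contig estimate is also a rough gap estimate.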
Reference Assembly Basics
2009 Sanger cost:
• Shotgun sequence: ~$0.01/base
• Finished sequence: ~$0.03/base
This clone: Shotgun = $1500, Finish = $3000
Reference Assembly Basics
[Figure: Sequence Gaps: Uncaptured vs. Total. Average gaps per BAC (axis 0 to 10) by species, split into uncaptured gaps and captured gaps. Species shown range from tetraodon, zebrafish and chicken to chimp, mouse, rat and platypus.]
Captured gap = no sequence, but a sub-clone spans the gap
Uncaptured gap = no sequence, no sub-clone spanning the gap
Bob Blakesley, NISC
Reference Assembly Basics
Biology
• Repetitive sequence (interspersed repeats, segmental duplications)
• Variation (regions of high diversity, structural variation)
Kidd et al., 2008
Reference Assembly Basics
Eugene Yaschenko, NCBI
[Figure: Human PANTHER classifications (biological process). Enrichment, observed vs. expected, by category. Categories range from major histocompatibility complex antigen, chemokine and tumor necrosis factor receptor to protein kinase, G-protein modulator and extracellular matrix.]
Evan Eichler, University of Washington
Reference Assembly Basics
Technology
• Read length: long reads vs. short reads
• Mate lengths: distribution of insert sizes
• Read accuracy: error model for your technology
• Read depth: coverage at each base
• Genome distribution: reads covering entire genome equally
Ajay et al., 2011
Genome Research, May, 1997
Reference Assembly Basics
WGS: Sanger Reads
• Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
• End-sequence all clones and retain pairing information (“mate-pairs”)
• Each end sequence is referred to as a read
• Find sequence overlaps
[Figure labels: WGS contig, tails, Scaffold]
Reference Assembly Basics
Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.
Scaffold: a sequence constructed from smaller sequences, which may contain gaps.
Genome Vocabulary
Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ
Typically built from sequences in GenBank/EMBL/DDBJ
Reference Assembly Basics
Schatz et al, 2010
Reference Assembly Basics
A T T T T C C C T T C T G A A A T G A T G A A A G A G T C
Reference Assembly Basics
Clone based assemblies
• BAC insert, BAC vector
• Shotgun sequence
• Assemble
• Gaps: deeper sequence coverage rarely resolves all gaps
• “Finishers” go in to manually fill the gaps, often by PCR
[Figure: gaps vs. fold sequence]
Reference Assembly Basics
Ideally…
[Figure: sequence contigs ABCD, FGH, KL and ON ordered and oriented (flip) against a non-sequence based map of markers A to O]
Reference Assembly Basics
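More like…
[Figure: the same map of markers A to O compared with conflicting sequence contigs (for example A, BC, ZYX, W, HJ, M, V, N, O and AB, HIJ, CDY, LMNO), leaving some placements unresolved (?)]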
Reference Assembly Basics
Sequence vs. Non-sequence based maps: Mmu7
WI Genetic, WI/MRC RH
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
Reference Assembly Basics
N50: Measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.
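N50 is conventionally computed over bases rather than over the count of contigs: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembled bases. A minimal sketch, using toy lengths that are not real assembly data:

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the total bases
            return length


toy_contigs = [80, 70, 50, 40, 30, 20, 10]   # toy values, total 300
print(n50(toy_contigs))
```

The same function applies to scaffold lengths to obtain a scaffold N50.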
Reference Assembly Basics
• Fragmented genomes tend to have more partial models
• Fragmented genomes have fewer frameshifts
Alexander Souvorov, NCBI
Old Assembly Model
• Distributed data
• Genome not in INSDC Database
GRC Assembly Management
Human Genome Project (HGP)
Centralized Data
Issue tracking system (based on JIRA)
5 July 2011
Tiling Path File (TPF)

ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
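The TPF rows shown on the slide can be read with a few lines of Python. This is a sketch under assumptions, not a full TPF parser: it handles only whitespace-separated rows as they appear on the slide, and it treats the third field of a GAP row as a size, which the slide does not state. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Clone:
    accession: str
    name: str
    contig: str


@dataclass
class Gap:
    gap_type: str
    size: Optional[int] = None   # assumption: third GAP field is a size


def parse_tpf(text: str) -> List[Union[Clone, Gap]]:
    """Parse slide-style TPF rows into Clone and Gap records."""
    rows: List[Union[Clone, Gap]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "ACCESSION":   # skip blanks and header
            continue
        if fields[0] == "GAP":
            size = int(fields[2]) if len(fields) > 2 else None
            rows.append(Gap(fields[1], size))
        elif len(fields) >= 3:
            rows.append(Clone(fields[0], fields[1], fields[2]))
        else:
            raise ValueError(f"unrecognized TPF row: {line!r}")
    return rows


def clones_by_contig(rows) -> Dict[str, List[str]]:
    """Group clone accessions by the contig they belong to, in path order."""
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        if isinstance(row, Clone):
            grouped.setdefault(row.contig, []).append(row.accession)
    return grouped


SLIDE_TPF = """\
ACCESSION NAME CONTIG
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""

print(clones_by_contig(parse_tpf(SLIDE_TPF)))
```

Grouping by contig mirrors the next step in the dataflow, where sequence contigs are built from the contigs defined in the TPF.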
GRC Assembly Management
Full Dovetail
Half-dovetail
Contained
Short/Blunt
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
• Check for orientation consistencies
• Select switch points
• Instantiate sequence for further analysis
Switch point
Representative chromosome sequence
GRC Assembly Management
AGP: A Golden Path
Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types

GRC Produces
• AGP
• FASTA
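To make the role of the AGP concrete, the sketch below builds a sequence from AGP-style lines and a dictionary of component sequences. It assumes the tab-delimited AGP column layout (object, object start, object end, part number, component type, then either component ID, start, end and orientation, or gap length and gap details) and that lines appear in object order. The object name, component names and sequences are invented toy data, and the function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T, either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def build_from_agp(agp_text: str, components: dict) -> dict:
    """Concatenate component slices and gaps into object sequences."""
    objects = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        cols = line.split("\t")
        obj, obj_beg, obj_end = cols[0], int(cols[1]), int(cols[2])
        comp_type = cols[4]
        if comp_type in ("N", "U"):       # gap line: column 6 is gap length
            piece = "N" * int(cols[5])
        else:                             # component line
            comp_id, beg, end = cols[5], int(cols[6]), int(cols[7])
            piece = components[comp_id][beg - 1:end]   # 1-based, inclusive
            if cols[8] == "-":
                piece = reverse_complement(piece)
        if len(piece) != obj_end - obj_beg + 1:
            raise ValueError(f"length mismatch on line: {line!r}")
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Toy data for illustration only (not real accessions or sequence).
TOY_AGP = "\n".join("\t".join(row) for row in [
    ("chrT", "1", "6", "1", "F", "cloneA.1", "1", "6", "+"),
    ("chrT", "7", "16", "2", "N", "10", "contig", "no", "na"),
    ("chrT", "17", "20", "3", "F", "cloneB.1", "3", "6", "-"),
])
TOY_COMPONENTS = {"cloneA.1": "ACGTAC", "cloneB.1": "GGAACC"}

print(build_from_agp(TOY_AGP, TOY_COMPONENTS))
```

The component start and end columns are where switch points show up: they record which stretch of each clone contributes to the final sequence.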
Updated Assembly Model
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
GRC Assembly Management
Assembly (e.g. GRCh37)
• Primary Assembly
• Non-nuclear assembly unit (e.g. MT)
• PAR
• ALT 1 to ALT 9
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
GRC Assembly Management
UGT2B17 Region (Xue Y et al, 2008)

NCBI36 NC_000004.10 (chr4) Tiling Path
Components: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
Gene labels: TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path
Components: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
Gene label: TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)
Components: AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
Gene label: TMPRSS11E2

GRC Assembly Management
GRC Assembly Management
GRCh37 (hg19)
Alternate loci: UGT2B17, MHC, MAPT
7 alternate haplotypes at the MHC
Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome
Assembly (e.g. GRCh37.p13)
• Primary Assembly
• Non-nuclear assembly unit (e.g. MT)
• PAR
• ALT 1 to ALT 9
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• …
• Patches
• Genomic Region (ABO)
• Genomic Region (SMA)
• Genomic Region (PECAM1)
GRC Assembly Management
GRCh37.p13
• 178 Regions: 3.15% of chromosome sequence
• 131 FIX patches: add 6.8 Mb novel sequence
• 73 NOVEL patches: add >800 kb novel sequence
MHC (chr6)
Chr 6 representation (PGF)
Alt_Ref_Locus_2 (COX)
GRC Assembly Management
17q deletion
H1
H2
Zody et al, 2008
GRC Assembly Management
[Figure: reads aligned to chromosome and alt/patch sequence, showing on-target alignment vs. off-target alignments (n=122,922)]
GRC Assembly Management
Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly.
Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds.
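The masking idea can be illustrated with a small helper that replaces chosen intervals of a sequence with N so that an aligner cannot place reads there. This is a generic sketch only: the slides do not describe how the GRC masks are generated, the function name is hypothetical, and the intervals would in practice come from the alignments of patches and alternate loci to the chromosomes.

```python
def mask_intervals(seq: str, intervals) -> str:
    """Return seq with each 1-based, inclusive (start, end) interval set to N."""
    bases = list(seq)
    for start, end in intervals:
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"interval out of range: {(start, end)}")
        bases[start - 1:end] = ["N"] * (end - start + 1)
    return "".join(bases)


# Toy example: mask positions 3 to 5 of a 10-base sequence.
print(mask_intervals("ACGTACGTAC", [(3, 5)]))
```

Under a Mask1-style scheme the masked sequence would be the chromosome for fix patches and the scaffold for novel patches and alternate loci. Under a Mask2-style scheme only scaffold sequence would be passed through such a helper.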
GRC Assembly Management
Old Assembly Model
• Distributed data
• Genome not in INSDC Database
Updated Assembly Model
• Centralized Data
• Genome in INSDC Database
GRC Assembly Management
GRCh38 Impact
GRCh38
GRCh38 Impact
Assembly   Scaffold N50   Contig N50
GRCh37     44,983,201     38,440,852
GRCh37B    62,124,159     49,319,739
Major Features of GRCh38
• Modeled centromeres
• Individual base updates
• Fixed tiling path/assembly errors
• Addition of novel sequence
CENTROMERES
GRCh38 Impact
[Figure labels: 61-mer analysis set, 9664; 1kG high-confidence set, 13584222]
• Mismatches, MAF = 0: n=15,244
• Insertions, MAF = 0: n=834
• Deletions, MAF = 0: n=1541
• Mismatch in pseudo/pr txpt, MAF < 5%: n=1413
• Annotator and clinical requests: n= ~260
GRCh38 Impact
Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components
GRCh38 Impact
79% of these bases are heterozygous in RP11 WGS
GRCh37 Insertions Originating from RP11: 17% heterozygous in RP11 WGS
GRCh37 Deletions Originating from RP11: 18% heterozygous in RP11 WGS
GRCh38 Impact
1q32, 1q21, 1p21
1p21 patch alignment to chromosome 1
Dennis et al., 2012
GRCh38 Impact
HYDIN: chr16 (16q22.2)
HYDIN2: chr1 (1q21.1)
• Missing in NCBI35/NCBI36
• Unlocalized in GRCh37
• Finished in GRCh38
Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID
Doggett et al., 2006
GRCh38 Impact
Other Major Tiling Path Updates
• Single CHM1 haplotype paths for:
  • 1p12, 1q21, 1q32: SRGAP2
  • IGH
  • LRC/KIR
  • CCL3L1 (17q21)
• OM-guided
  • 10q11
  • Chr. 9 peri-centromeric inversion
GRCh38 Impact
NOVEL GENES!
GRCh37.p13: 211 genes found only on alt loci and patches
GRCh38 Impact
Sudmant et al., 2010
Genovese et al., 2013
1000G decoy sequence, viewed by:
• GenBank alignment
• Percent Repeat Masked
• Repeat Mask type
• Sequence Source (HTG, HuRef, ALLPATHS)
GRCh38 Impact
In a preliminary analysis, 90% of NA12878 reads that previously aligned uniquely to the decoy sequence had an alignment to the updated assembly.
GRCh38 Impact
Where is the decoy sequence in GRCh38?
• Alt loci (low repeat content)
• Model centromeres (high repeat content)
• Unlocalized/Unplaced Scaffolds
• Chromosomes
Accessing the Data
NCBI Genes, Ensembl Genes, Annotated Clone Problems, Segmental Duplications
GRCh38 in Ensembl
GRCh38 will be incorporated into the existing Ensembl interface. Features such as genes, variation, regulation will be remade or remapped onto the new genome. Nearly 500 tracks are available.
GENCODE gene set
Accessing the Data
Alternate sequences in Ensembl
Haplotypes and patches on the chromosome
A fix patch around the ABO gene
Use the Region comparison view to see the difference between the patch and primary assembly
The GRC alignment track indicates edits
View your data on the Genome
Zoomed in
Zoomed out
Follow the link from the homepage
Red bases show mismatches
Transition to GRCh38 in Ensembl
INSDC coordinates identify the assembly as well as the position
Convert coordinates between assemblies
Our blog series details our progress with GRCh38: Ensembl.info
Remap Set up slide
Accessing the Data
• 1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
• GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm
• Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Fall 2013!)
Tiling Path
Sequence Bar
Segmental Duplications, Eichler Lab
1000 Genomes strict accessibility mask
Annotated clone assembly problems
dbSNP Build 138 based on annotation run 104
Model based paralogous sequence differences, NCBI annotation run #
Paralogous/pseudo gene alignments, NCBI annotation run #
Single Unique Nucleotide (SUN) map, Sudmant 2010
ClinVar Long Variations
GRC Curation Issues
ClinVar Short Variations
http://twitter.com/[email protected]
Accessing the Data
http://genomeref.blogspot.com/