![Page 1: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/1.jpg)
Graph and assembly strategies for the MHC and ribosomal DNA regions
Alexander Dilthey
![Page 2: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/2.jpg)
The MHC is the zebrafish of the genome!
(model region)
![Page 3: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/3.jpg)
PRGs: Population Reference Graphs
• Simple: acyclic, directed (sub-class of general variation graphs)
• Usually built from an MSA; preserve gap positions (i.e. global homology between input sequences).
• Generative model: Recombination
• Ploidy well-defined (0, 1, 2)
[Figure: example PRG drawn as a directed graph whose edges carry bases or gap characters (_).]
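The bullets above can be illustrated with a minimal Python sketch. It assumes a toy three-haplotype MSA; the sequences and the names `msa_to_levels` and `haplotypes` are hypothetical and not from the talk. The PRG is reduced to one set of edge labels per MSA column, which is a deliberate simplification of a real PRG.

```python
from itertools import product

GAP = "_"

def msa_to_levels(msa):
    """Collapse an MSA into one sorted list of edge labels per column."""
    lengths = {len(row) for row in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all MSA rows must have the same length")
    n_cols = lengths.pop()
    return [sorted({row[i] for row in msa.values()}) for i in range(n_cols)]

def haplotypes(levels):
    """Enumerate every path through the levels, dropping gap characters."""
    return {"".join(path).replace(GAP, "") for path in product(*levels)}

# Hypothetical toy input: three aligned haplotypes, gaps written as "_".
toy_msa = {
    "hapA": "TACT_AG",
    "hapB": "TAC_CAG",
    "hapC": "TACTCAG",
}

levels = msa_to_levels(toy_msa)
paths = haplotypes(levels)
print(levels)
print(sorted(paths))
```

Because any path through the graph is allowed, the sketch also yields `TACAG`, a recombinant that is not among the three inputs. Exhaustive enumeration is only feasible for toy graphs; it is used here purely to make the generative model visible.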
![Page 4: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/4.jpg)
Outline
• Quick recap:
  What we know about the utility of graph genome approaches
• New results:
  Haplotyping in hypervariable regions (HLA)
  Pseudo graph alignment
• De novo assembly of ribosomal DNA
![Page 5: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/5.jpg)
In most of the MHC, single-reference approaches work just fine…
[Figure: number of kmers (millions), split into kmers recovered and kmers not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped.]
+ long-read validation with consistent results (not shown)
Dilthey et al., Nature Genetics 2015
![Page 6: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/6.jpg)
… graph genomes outperform in the most complex sub-region of the MHC …
Dilthey et al., Nature Genetics 2015
![Page 7: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/7.jpg)
… remaining problems driven by incomplete input haplotypes + algorithmics.
[Figure: aligned kmers; axes: chromotype position (kb) and read position (kb).]
Incomplete input haplotypes: large uncharacterized inversion
Algorithmics: incorrect HLA haplotyping
Dilthey et al., Nature Genetics 2015
![Page 8: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/8.jpg)
HLA haplotyping
• Hypothesis: whole-genome sequencing data contains the information necessary for accurate HLA typing
• “HLA typing” here means HLA gene exon sequences:
  • HLA class I: exons 2 and 3
  • HLA class II: exon 2
• Challenge: aligning reads to the right gene (homology hell).
• Proper read-to-graph alignment instead of k-mers.
![Page 9: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/9.jpg)
Class I exon homology
[Figure panels: Exon 2, Exon 3]
HLA-A: 3284 alleles
HLA-B: 4077 alleles
HLA-C: 2799 alleles
![Page 10: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/10.jpg)
Approach: deep PRG + mapping
Exonic MSA

    T*01:01  _ _ A C G T A C T _ _
    T*01:02  C A A C A T A C T _ _
    T*01:03  _ _ A C G C G C T _ _
    T*01:04  _ _ A T C C G C T A C
    T*01:05  _ _ A T C C C C T _ _
    T*01:06  _ _ _ C C T A C T _ _

Genomic MSA

    T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
    T*01:02  A C C A C A A C A T A C T _ _ C C T A
    T*01:04  _ T T A _ _ A T C C G C T A C C C T A

8 xMHC reference haplotypes

    PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
    MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A

1) Gene-only PRG: 46 (pseudo) genes, mostly HLA
[Figure: Gene 1, Gene 2 and Gene 3 separated by N padding (|--NNN--|); each gene spans Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding.]
[Figure: number of reference sequences along the PRG, with the region covered by 'genomic' sequences marked.]
2) Varying numbers of input sequences across PRG
3) Use hierarchical MSA approach to combine in
![Page 11: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/11.jpg)
Approach: deep PRG + mapping
[Figure: an aligned read threaded through the PRG; graph levels are numbered (Level 1 through 26), edges carry bases or gap characters (_), and one edge is labelled {G, C}.]
4) Seed-and-extend paired-end mapping to PRG
5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
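Step 5 only states the objective, so the following Python sketch fills in the rest with assumptions: a uniform per-base error rate (`err`), reads drawn with equal probability from either allele, gapless read placements, and an exhaustive search over allele pairs. The function names are hypothetical. The toy alleles are three rows of the exonic MSA from the previous slide with gaps removed; the reads are made up.

```python
import math
from itertools import combinations_with_replacement

def read_likelihood(read, start, allele, err=0.01):
    """P(read | allele) for a gapless placement at `start` (assumed error model)."""
    p = 1.0
    for offset, base in enumerate(read):
        p *= (1 - err) if allele[start + offset] == base else err / 3
    return p

def best_genotype(alleles, reads, err=0.01):
    """Return the allele pair maximizing the log-likelihood of the aligned reads."""
    best, best_ll = None, -math.inf
    for a1, a2 in combinations_with_replacement(sorted(alleles), 2):
        ll = 0.0
        for start, read in reads:
            # Assumption: each read comes from either chromosome with probability 0.5.
            p = 0.5 * read_likelihood(read, start, alleles[a1], err) \
              + 0.5 * read_likelihood(read, start, alleles[a2], err)
            ll += math.log(p)
        if ll > best_ll:
            best, best_ll = (a1, a2), ll
    return best, best_ll

# Toy alleles (exonic MSA rows, gaps removed) and made-up aligned reads.
alleles = {
    "T*01:01": "ACGTACT",
    "T*01:03": "ACGCGCT",
    "T*01:05": "ATCCCCT",
}
reads = [(0, "ACGT"), (3, "TACT"), (0, "ACGC"), (3, "CGCT")]

genotype, log_lik = best_genotype(alleles, reads)
print(genotype, round(log_lik, 3))
```

Running the search separately for each gene mirrors the "independently per locus" note on the slide. The exhaustive pair search is quadratic in the number of alleles, which is fine for a sketch but would need pruning for loci with thousands of alleles.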
![Page 12: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/12.jpg)
High-quality WGS data enables gold-standard accuracy
(of note: 2/3 original discrepancies with validation data were errors in the validation data!)
![Page 13: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/13.jpg)
… but not from exome, MiSeq data
![Page 14: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/14.jpg)
Sequencing error?
![Page 15: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/15.jpg)
Effective fragment length? [2 x read length + insert size (IS)]
![Page 16: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/16.jpg)
Conclusion (intermediate)
• If the input sequencing data is “good enough”, we manage near-perfect haplotyping in the genome’s most polymorphic region
• Effective fragment length is likely the most important factor
• Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype)
• Read mapping implementation is SLOW
![Page 17: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/17.jpg)
Pseudo graph mapping
Input sequences
Graph
Align short reads to input sequences...
... transpose onto graph
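The "transpose" step can be sketched in Python under the assumption that graph coordinates are simply MSA column indices. The helper names are hypothetical; the example row is Seq1 from the scrubbing slide.

```python
def msa_columns(msa_row, gap="_"):
    """For each ungapped position of a sequence, return its 0-based MSA column."""
    return [col for col, ch in enumerate(msa_row) if ch != gap]

def transpose_hit(msa_row, start, length, gap="_"):
    """Map a gapless hit [start, start + length) on the linear sequence to MSA columns."""
    cols = msa_columns(msa_row, gap)
    return cols[start:start + length]

row = "AACAC_TTT"  # Seq1 as it appears in the MSA; linear sequence is AACACTTT
hit = transpose_hit(row, start=3, length=4)  # read matches "ACTT" on the linear sequence
print(hit)
```

The hit skips column 5 because Seq1 carries a gap there. This is why a read aligned to a linear input sequence can land on non-contiguous graph positions.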
![Page 21: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/21.jpg)
Scrubbing, cutting, cleaning
Input MSA (coordinates 123456789)

    Seq1  AACAC_TTT
    Seq2  TTCACGTTT

Lin. alignment

    Seq1  AACAC_TTT
    Read  AACACGTTT

MSA coor. (coordinates 123456X789)

    Seq1  AACAC__TTT
    Read  AACAC_GTTT

Scrubbed (coordinates 123456789)

    Seq1  AACAC_TTT
    Read  AACACGTTT

[Figure: graph built from the input MSA, branching between a gap (-) and G before the shared TTT.]
Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
Cutting: examine alignment gap structure; cut in “bad” areas; use longest stretch
Cleaning: find the best gap-less sequence-to-graph alignment + extension with gaps
Graph alignment (coordinates 123456789)

    Graph  AACACGTTT
    Seq1   AACACGTTT
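One possible reading of the scrubbing step, sketched in Python: inserted read bases are moved into MSA gap columns that the input sequence leaves free, so that no extra column (the "X" above) is needed. The rule of dropping insertions that do not fit is an assumption, and `scrub` and `render` are hypothetical names, not the talk's implementation.

```python
GAP = "_"

def scrub(msa_row, ref_aln, read_aln, ref_start=0):
    """Project a pairwise alignment onto the columns of msa_row.

    Returns ({column: read base}, number of inserted bases that did not fit).
    """
    cols = [c for c, ch in enumerate(msa_row) if ch != GAP]
    placed, pending, dropped = {}, [], 0
    prev_col = None
    ref_i = ref_start  # index into the ungapped input sequence
    for r, q in zip(ref_aln, read_aln):
        if r == GAP:           # insertion in the read relative to the input sequence
            pending.append(q)
            continue
        col = cols[ref_i]
        ref_i += 1
        if pending:
            # Free MSA columns between the two flanking input-sequence bases.
            free = range(prev_col + 1, col) if prev_col is not None else range(0)
            for c, base in zip(free, pending):
                placed[c] = base
            dropped += max(0, len(pending) - len(free))
            pending = []
        placed[col] = q
        prev_col = col
    dropped += len(pending)    # a trailing insertion has no right flank
    return placed, dropped

def render(placed, n_cols):
    """Write the projected read as a row in MSA coordinates."""
    return "".join(placed.get(c, GAP) for c in range(n_cols))

msa_row = "AACAC_TTT"                                   # Seq1 in the input MSA
placed, dropped = scrub(msa_row, "AACAC_TTT", "AACACGTTT")
scrubbed = render(placed, len(msa_row))
print(scrubbed, dropped)
```

On the slide's example the inserted G lands in column 6 of the MSA, reproducing the "Scrubbed" panel. If the MSA row had no gap column at that point, the sketch would count the base as dropped instead of inventing a coordinate.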
![Page 22: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/22.jpg)
Accuracy slightly worse; fast!
Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?
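Highest resolution

| Cohort | Locus | N | MHC-PRG-2 inferred | MHC-PRG-2 accuracy | MHC-PRG-2 call rate | HLA*PRG inferred | HLA*PRG accuracy | HLA*PRG call rate |
|---|---|---|---|---|---|---|---|---|
| PlatinumTrio | A | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| PlatinumTrio | B | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| PlatinumTrio | C | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| PlatinumTrio | DQA1 | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| PlatinumTrio | DQB1 | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| PlatinumTrio | DRB1 | 6 | 6 | 1.00 | 1.00 | 6 | 1.00 | 1.00 |
| 1000 Genomes | A | 22 | 22 | 0.86 | 1.00 | 22 | 1.00 | 1.00 |
| 1000 Genomes | B | 22 | 22 | 1.00 | 1.00 | 22 | 1.00 | 1.00 |
| 1000 Genomes | C | 22 | 22 | 1.00 | 1.00 | 22 | 1.00 | 1.00 |
| 1000 Genomes | DQA1 | 12 | 12 | 1.00 | 1.00 | 12 | 1.00 | 1.00 |
| 1000 Genomes | DQB1 | 22 | 22 | 1.00 | 1.00 | 22 | 1.00 | 1.00 |
| 1000 Genomes | DRB1 | 22 | 22 | 0.91 | 1.00 | 22 | 0.95 | 1.00 |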
![Page 23: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/23.jpg)
Towards additional high-quality reference haplotypes…
Remaining challenges: extreme repeats, haplotypes.
Sergey Koren
![Page 24: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/24.jpg)
Ribosomal DNA
• Encodes ribosomal RNA
• Hundreds of copies (tandem repeat arrays)
• Variation poorly characterized
• Step 1: Targeted approach
• Step 2: WGS-based
• Step 3: Variation graph
![Page 25: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/25.jpg)
Read error vs variation
… from whole-genome data?
Long reads
de Bruijn graph
Technology!
6% > 50k
![Page 26: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/26.jpg)
Summary
• Variation graphs are worth the effort, at least in highly complex regions.
• Evidence: MHC “model system”
  + overall improvement of genome inference accuracy
  + complex-locus haplotyping
• Incorporate LD?
• Middle ground between full graph alignment and linear sequence alignment?
• Ribosomal DNA: let me know if you’re also interested!
![Page 27: Graph and assembly strategies for the MHC and ribosomal DNA regions](https://reader035.vdocuments.site/reader035/viewer/2022062822/587da7f61a28ab22148b8121/html5/thumbnails/27.jpg)
Acknowledgements
NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov
Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer
Histogenetics: Nezih Cereb
UCSF/Nantes: Pierre-Antoine Gourraud
GSK: Matt Nelson, Charles Cox