160628 giab for festival of genomics
![Page 1: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/1.jpg)
So you’ve sequenced my genome. How well did you do?
Justin Zook
NIST Genome-Scale Measurements Group
June 28, 2016
![Page 2: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/2.jpg)
Sequencing technologies and bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28
![Page 3: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/3.jpg)
Sequencing technologies and bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Who’s right?
Is anyone right?
![Page 4: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/4.jpg)
Genome in a Bottle Consortium
Whole Genome Variant Calling
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates
• established consortium to develop reference materials, data, methods, performance metrics
• Characterized Pilot Genome NA12878
• Ashkenazim Trio, Asian Trio from PGP in process
generic measurement process
![Page 5: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/5.jpg)
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications
Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods
![Page 6: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/6.jpg)
Bringing Principles of Metrology to the Genome
• Reference material
– DNA in a tube you can buy from NIST
– $45/ug
• NA12878 as pilot sample
• Extensive state-of-the-art characterization
– as good as we can get for small variants
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops
• Analysis of PGP trios is ongoing
– open project
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology
– for instance, 10X Genomics
![Page 7: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/7.jpg)
Paper describing data…
![Page 8: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/8.jpg)
Integration Methods to Establish Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
![Page 9: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/9.jpg)
Integration Methods to Establish Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
NEW: Reproducible integration pipeline with new calls for NA12878 and AJ Son!
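The first two steps on this slide (collect candidate variants, then separate out the concordant ones) can be illustrated with a minimal Python sketch. It is not the GIAB integration pipeline: the callset names, the `(chrom, pos, ref, alt)` tuple representation, and the rule "concordant means present in every callset" are assumptions chosen for illustration, and the bias-characterization and arbitration steps are not attempted.

```python
from collections import Counter


def integrate_callsets(callsets, min_support=None):
    """Split candidate variants into concordant and discordant sets.

    callsets: dict mapping a callset name to a set of
              (chrom, pos, ref, alt) tuples (hypothetical representation).
    min_support: number of callsets that must contain a variant for it to
                 count as concordant; defaults to all callsets (an assumption).
    """
    if min_support is None:
        min_support = len(callsets)
    support = Counter()
    for calls in callsets.values():
        support.update(set(calls))  # count each variant once per callset
    candidates = set(support)
    concordant = {v for v, n in support.items() if n >= min_support}
    discordant = candidates - concordant  # these would need arbitration
    return candidates, concordant, discordant


# Toy callsets; names and variants are made up.
callsets = {
    "platform_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "platform_b": {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")},
    "platform_c": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}
candidates, concordant, discordant = integrate_callsets(callsets)
print(len(candidates), "candidates,", len(concordant), "concordant,",
      len(discordant), "left for arbitration")
```

In this toy setup only the variant seen by all three callsets is concordant; the other two candidates are the ones where evidence of bias would have to decide.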
![Page 10: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/10.jpg)
So, how does WGS make it into Regulated Clinical Applications?
• FDA developing strategy to regulate NGS, which is a novel medical device
“...this technology allows broad and indication-blind testing and is capable of generating vast amounts of data, both of which present issues that traditional regulatory approaches are not well-suited to address.”
• FDA Workshops Feb ’15, Nov ’15
– strategy to rely on standards-based approaches, including reference materials…
“need for reference materials for validation and proficiency testing… there is no substitute for having real samples.”
FDA Whitepaper, Dec ’14
GenomeWeb, Nov ’15
![Page 11: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/11.jpg)
Clinical Genome Sequencing Process
Preanalytical
Sequencing
Sequence Bioinformatics
Functional Variant Annotation
Clinical Variant Knowledgebase
Query
Clinical Interpretation Reporting
EHR Archival
![Page 12: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/12.jpg)
What is the standards architecture to demonstrate safety and efficacy?
Preanalytical
Sequencing
Sequence Bioinformatics
Functional Variant Annotation
Clinical Variant Knowledgebase
Query
Clinical Interpretation Reporting
EHR Archival
![Page 13: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/13.jpg)
Analytical/Technical Performance Assessment
Preanalytical
Sequencing
Sequence Bioinformatics
Functional Variant Annotation
Clinical Variant Knowledgebase
Query
Clinical Interpretation Reporting
EHR Archival
![Page 14: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/14.jpg)
Global Alliance for Genomics and Health Benchmarking Task Team
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Developing sophisticated benchmarking tools
  • vcfeval – Len Trigg
  • hap.py – Peter Krusche
  • vgraph – Kevin Jacobs
• Standardized bed files with difficult genome contexts for stratification
Credit: GA4GH, Abby Beeler, Ellie Wood
Stratification of FP Rates
Higher FP rates at Tandem Repeats
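As a rough illustration of stratified performance metrics, the Python sketch below counts TP, FP, and FN inside and outside a set of BED-style regions and derives precision and recall for each stratum. It is not how vcfeval or hap.py work internally, and it does not reproduce the GA4GH definitions, which also cover genotype and representation matching. The region coordinates, the labelled calls, and all function names are hypothetical.

```python
from bisect import bisect_right


def in_regions(pos, starts, ends):
    """True if 0-based pos lies in a half-open [start, end) interval.

    Assumes intervals are sorted and non-overlapping, on one chromosome.
    """
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall (sensitivity) = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def stratified_counts(labelled_calls, regions):
    """Count TP/FP/FN inside and outside the given regions."""
    regions = sorted(regions)
    starts = [s for s, _ in regions]
    ends = [e for _, e in regions]
    counts = {name: {"TP": 0, "FP": 0, "FN": 0}
              for name in ("inside", "outside")}
    for pos, label in labelled_calls:
        stratum = "inside" if in_regions(pos, starts, ends) else "outside"
        counts[stratum][label] += 1
    return counts


# Made-up tandem-repeat intervals and already-labelled calls.
tandem_repeats = [(100, 200), (500, 600)]
labelled_calls = [(50, "TP"), (120, "TP"), (150, "FP"), (199, "FP"),
                  (200, "TP"), (550, "FN"), (700, "TP"), (800, "FP")]

counts = stratified_counts(labelled_calls, tandem_repeats)
for stratum, c in counts.items():
    p, r = precision_recall(c["TP"], c["FP"], c["FN"])
    print(f"{stratum}: {c} precision={p:.2f} recall={r:.2f}")
```

In the toy data, precision is lower inside the repeat intervals than outside, which is the kind of difference a genome-wide average would hide.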
![Page 15: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/15.jpg)
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
![Page 16: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/16.jpg)
Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
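One common way to put a confidence interval on a proportion such as sensitivity or precision is the Wilson score interval. The slide does not say which interval to use, so the Python sketch below is one reasonable choice rather than a GIAB or GA4GH recommendation. The example counts are made up.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 gives an approximately 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical example: 95 of 100 benchmark variants detected.
low, high = wilson_interval(95, 100)
print(f"sensitivity 95/100, approx. 95% CI: {low:.3f} to {high:.3f}")

# With few variants in a stratum the interval is wide even at 100% observed.
low_small, high_small = wilson_interval(10, 10)
print(f"sensitivity 10/10, approx. 95% CI: {low_small:.3f} to {high_small:.3f}")
```

The second call shows why intervals matter for small strata: a perfect score on ten variants still leaves a lot of room for the true sensitivity to be lower.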
![Page 17: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/17.jpg)
How can we extend this approach to structural variants?
Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies
Differences from small variants
• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy
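Because breakpoints and sizes are imprecise, finding calls "supported by multiple technologies" needs fuzzy rather than exact matching. The Python sketch below shows one simple way to do that for deletion-like calls. The `SV` record, the 1000 bp breakpoint tolerance, the 0.5 size-ratio threshold, and the technology labels are all assumptions for illustration; they are not the criteria used for the GIAB draft benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal record for a deletion-like structural variant."""
    chrom: str
    start: int
    end: int
    svtype: str
    tech: str


def similar(a, b, max_breakpoint_dist=1000, min_size_ratio=0.5):
    """Fuzzy match: same chromosome and type, nearby breakpoints, similar size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    size_a, size_b = a.end - a.start, b.end - b.start
    if min(size_a, size_b) <= 0:
        return False  # sizes from coordinates only make sense for spans > 0
    size_ratio = min(size_a, size_b) / max(size_a, size_b)
    return (abs(a.start - b.start) <= max_breakpoint_dist
            and abs(a.end - b.end) <= max_breakpoint_dist
            and size_ratio >= min_size_ratio)


def supported_by_multiple_techs(calls, min_techs=2, **match_args):
    """Return calls that have a similar call from at least min_techs technologies."""
    supported = []
    for a in calls:
        techs = {a.tech}
        techs.update(b.tech for b in calls
                     if b is not a and similar(a, b, **match_args))
        if len(techs) >= min_techs:
            supported.append(a)
    return supported


# Made-up calls: the first two agree closely, the last two do not.
calls = [
    SV("chr1", 10000, 12000, "DEL", "illumina"),
    SV("chr1", 10050, 11950, "DEL", "pacbio"),
    SV("chr1", 50000, 50500, "DEL", "illumina"),
    SV("chr1", 50100, 58000, "DEL", "bionano"),
]
supported = supported_by_multiple_techs(calls)
print(len(supported), "of", len(calls), "calls have multi-technology support")
```

The pairwise comparison is quadratic in the number of calls, which is fine for a sketch but would need an interval index for real callsets. Insertions and complex events would also need a different size definition.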
![Page 18: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/18.jpg)
Callsets Contributed so far
Short reads
• Illumina
– Spiral Genetics
– cortex
– Commonlaw
– MetaSV
– Parliament/assembly
– Parliament/assembly-force
• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta
Long reads and Linked reads
• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/pacbio-force
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
• Nanopore mapping
– Nabsys force calls
• Optical mapping
– BioNano with and without haplotype-aware assembly
• 10X Genomics
![Page 19: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/19.jpg)
Number of Calls Supported by 2 Technologies by Size Range

| Callset | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb |
|---|---|---|---|---|---|
| pre-filtered | 2404 | 1307 | 2288 | 481 | 600 |
| filtered | 2325 | 1188 | 1875 | 379 | 341 |
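The counts above can be turned into the fraction of calls retained after filtering in each size range. The short Python sketch below only does that arithmetic on the numbers from the slide; the variable names are illustrative, and nothing is assumed about what the filter does.

```python
size_bins = ["<50bp", "50-100bp", "100-1000bp", "1kb-3kb", ">3kb"]
pre_filtered = [2404, 1307, 2288, 481, 600]   # counts from the slide
filtered = [2325, 1188, 1875, 379, 341]       # counts from the slide

# Fraction of calls in each size bin that remain after filtering.
retained = {
    size_bin: post / pre
    for size_bin, pre, post in zip(size_bins, pre_filtered, filtered)
}

for size_bin in size_bins:
    print(f"{size_bin:>10}: {retained[size_bin]:.1%} retained")

total_pre, total_post = sum(pre_filtered), sum(filtered)
print(f"{'all sizes':>10}: {total_post}/{total_pre} "
      f"({total_post / total_pre:.1%}) retained")
```

The retained fraction falls steadily with size, so filtering removes proportionally far more of the calls above 3 kb than of those below 50 bp.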
![Page 20: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/20.jpg)
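Sensitivity to Draft Benchmark Calls

| Callset | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb |
|---|---|---|---|---|---|
| AssemblyticsFalcon | 0% | 55% | 68% | 59% | 45% |
| AssemblyticsMHAP | 0% | 51% | 66% | 56% | 52% |
| CGvcf | 86% | 20% | 4% | 0% | 0% |
| CGCNV | 0% | 0% | 0% | 0% | 29% |
| CGSV | 0% | 0% | 39% | 65% | 56% |
| CSHLassembly | 0% | 47% | 62% | 49% | 42% |
| sniffles | 7% | 28% | 58% | 59% | 64% |
| BioNano | 0% | 0% | 2% | 26% | 37% |
| Spiral | 85% | 44% | 57% | 38% | 40% |
| Cortex | 39% | 15% | 7% | 2% | 0% |
| CommonLaw | 0% | 0% | 8% | 47% | 40% |
| PBHoneySpots | 0% | 39% | 63% | 9% | 0% |
| PBHoneyTails | 0% | 0% | 0% | 31% | 57% |
| MetaSV | 0% | 0% | 75% | 74% | 71% |
| ParliamentPacBio | 0% | 0% | 74% | 75% | 48% |
| ParliamentAssembly | 0% | 0% | 65% | 44% | 2% |
| MultibreakSV | 16% | 66% | 72% | 59% | 47% |
| CNVnator | 0% | 0% | 22% | 71% | 74% |
| ParliamentPacBioForce | 1% | 45% | 72% | 31% | 18% |
| ParliamentAssemblyForce | 0% | 42% | 63% | 11% | 2% |
| BionanoHaplo | 0% | 0% | 0% | 36% | 49% |
| NabsysForce160405 | 0% | 0% | 5% | 25% | 28% |
| smrtsvdip | 0% | 66% | 77% | 65% | 55% |
| fermikit | 94% | 86% | 83% | 59% | 56% |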
![Page 21: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/21.jpg)
Size distributions
![Page 22: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/22.jpg)
Concordance between technologies
All Calls
High-confidence Calls
![Page 23: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/23.jpg)
Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack
![Page 24: 160628 giab for festival of genomics](https://reader035.vdocuments.site/reader035/viewer/2022062522/587da8271a28ab22148b81ad/html5/thumbnails/24.jpg)
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
NRC postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]