aug2014 smash performance metrics tool

SMaSH: a Benchmarking Toolkit for Variant Calling

SMaSH and GIAB: A Good Match

High overlap with features in the Performance Metrics Specifications doc.

Many of the features not currently supported are ones we'd like to integrate.

About me
Worked at the UC Berkeley AMPLab for about a year
Currently the primary SMaSH developer
Starting a CS PhD at Berkeley in Programming Languages this fall

About this talk
SMaSH as it is now
SMaSH in the future
SMaSH and GIAB

SMaSH as it is now

SMaSH
Project out of the AMP-X group at UC Berkeley

Talwalkar et al., 2014, Bioinformatics
smash.cs.berkeley.edu

Initial goal
Create a unified way of benchmarking germline variant calling pipelines.

SMaSH components
Codebase for comparing VCF callsets
Reads and ground truth datasets
Metrics for accuracy and computational performance

Codebase
For benchmarking purposes, we compare a predicted callset against a ground truth callset.
Comparing two predicted callsets works exactly the same.

Variant Classification
SNPs
Indels (less than 50 base pairs)
Structural variants
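The three classes above can be sketched as a small function. This is an illustration only: the function name, the choice to measure size as the longer allele, and the treatment of everything at or above 50 bp as a structural variant are assumptions, not SMaSH's actual implementation. Symbolic ALT alleles such as `<DEL>` are not handled.

```python
def classify_variant(ref: str, alt: str, sv_threshold: int = 50) -> str:
    """Classify a single-ALT variant as 'SNP', 'indel', or 'SV'.

    Hypothetical helper: sv_threshold mirrors the 50 bp cutoff on the slide.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    # Assumption: variant size is the length of the longer allele.
    size = max(len(ref), len(alt))
    # Anything multi-base below the threshold is lumped with indels here.
    return "indel" if size < sv_threshold else "SV"
```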

Evaluation
SNPs and indels are strictly evaluated.
Structural variants are evaluated on:
  Same type (insertion/deletion/other)
  Length same as true variant within specified tolerance
  Position same as true variant within specified tolerance
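The three structural-variant criteria can be written as a single predicate. The sketch below uses a hypothetical `StructuralVariant` record, and the default tolerance values are placeholders chosen for illustration, not values taken from SMaSH.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    sv_type: str   # e.g. "insertion", "deletion", "other"
    pos: int       # start position on the chromosome
    length: int    # size of the event in base pairs


def sv_matches(true_sv: StructuralVariant, pred_sv: StructuralVariant,
               pos_tolerance: int = 100, length_tolerance: int = 100) -> bool:
    """Return True if the predicted SV agrees with the true SV on type,
    and on position and length within the given tolerances (assumed values)."""
    return (
        true_sv.sv_type == pred_sv.sv_type
        and abs(true_sv.pos - pred_sv.pos) <= pos_tolerance
        and abs(true_sv.length - pred_sv.length) <= length_tolerance
    )
```

A predicted deletion shifted by a few dozen bases still counts as a match under these tolerances, while an insertion at the same position does not.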

Accuracy Metrics
Evaluate variants as true positive, false positive, false negative

Evaluate accuracy of genotyping
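A minimal sketch of strict evaluation, assuming each callset has already been reduced to a dictionary keyed by `(chrom, pos, ref, alt)` with the genotype string as the value. The function name and data layout are hypothetical; real VCF parsing is out of scope here.

```python
def evaluate(truth: dict, predicted: dict) -> dict:
    """Count TP/FP/FN by exact key match and check genotypes on the TPs."""
    tp_keys = truth.keys() & predicted.keys()
    fp_keys = predicted.keys() - truth.keys()
    fn_keys = truth.keys() - predicted.keys()
    # Genotyping accuracy: among matched variants, how many genotypes agree.
    genotype_correct = sum(1 for k in tp_keys if truth[k] == predicted[k])
    return {
        "TP": len(tp_keys),
        "FP": len(fp_keys),
        "FN": len(fn_keys),
        "genotype_correct": genotype_correct,
    }


# Toy callsets: one shared call with the same genotype, one shared call
# with a different genotype, one call unique to each side.
truth = {("1", 100, "A", "T"): "0/1", ("1", 200, "G", "C"): "1/1",
         ("1", 300, "C", "CA"): "0/1"}
predicted = {("1", 100, "A", "T"): "0/1", ("1", 200, "G", "C"): "0/1",
             ("1", 400, "T", "G"): "0/1"}
counts = evaluate(truth, predicted)
```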

Error bars
Calculated on confidence in ground truth calls

Choose some upper bound on the ground truth call error rate based on validation methodology.
  E.g., 2 out of every 1000 SNPs are wrong.
Use this error rate to calculate upper/lower bounds on precision and recall.
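One simple way to turn a ground-truth error rate into bounds is sketched below. It assumes that up to `truth_error_rate * (TP + FN)` calls could be miscategorized in either direction, and shifts the true-positive count by that amount. This is an illustrative simplification, not necessarily the formula SMaSH uses; the function name is hypothetical.

```python
def bounded_metrics(tp: int, fp: int, fn: int, truth_error_rate: float) -> dict:
    """Return (lower, upper) bounds on precision and recall.

    Assumption: at most truth_error_rate * (tp + fn) calls are affected
    by ground-truth errors, and each could flip a TP either way.
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision/recall undefined for empty callsets")
    max_wrong = truth_error_rate * (tp + fn)

    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    return {
        "precision": (clamp((tp - max_wrong) / (tp + fp)),
                      clamp((tp + max_wrong) / (tp + fp))),
        "recall": (clamp((tp - max_wrong) / (tp + fn)),
                   clamp((tp + max_wrong) / (tp + fn))),
    }


# With 1000 truth calls and a 2-in-1000 error rate, up to 2 calls may be wrong.
bounds = bounded_metrics(tp=900, fp=50, fn=100, truth_error_rate=0.002)
```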

The VCF format is ambiguous!

SMaSH addresses this problem with two strategies:
Normalization
Rescue

Guiding principle: metrics should never be worse after normalization/rescue than they were without them.

Normalization
A single variant may be plausibly placed in many different positions but describe the same change.

To normalize a variant:

First, we remove the longest proper suffix shared by the ref and alt alleles.

Then, we "slide" the variant by adding a base from the reference to the head and removing a base from the tail, until the last bases on both alleles are no longer the same.
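The two steps can be sketched as follows. The sketch assumes 0-based positions into an in-memory reference string (VCF itself is 1-based) and a single ALT allele; the function name and signature are hypothetical rather than SMaSH's API, and no leading-base trimming is attempted.

```python
def normalize(ref_seq: str, pos: int, ref: str, alt: str):
    """Left-shift a variant; returns (pos, ref, alt).

    pos is a 0-based index into ref_seq (an assumption for this sketch).
    """
    # Step 1: trim the shared suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Step 2: slide left while the alleles still end in the same base.
    while ref[-1] == alt[-1] and pos > 0:
        prev_base = ref_seq[pos - 1]
        ref = (prev_base + ref)[:-1]   # add to head, drop from tail
        alt = (prev_base + alt)[:-1]
        pos -= 1
    return pos, ref, alt


# Two placements of the same "CA" deletion inside a CA repeat.
reference = "GCACACAT"
first = normalize(reference, 4, "ACA", "A")
second = normalize(reference, 2, "ACA", "A")
```

Both placements collapse to the same leftmost representation, which is what makes the two callsets comparable by exact match afterwards.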

Rescue
The same underlying haplotype can be represented by different sets of variants.

[Figure: true callset vs. predicted callset]

Rescue Algorithm
For every false negative, we attempt rescue:
  Build up a window around the variant position for the true and predicted callsets.
  For all sets of non-overlapping variants, expand the underlying haplotypes for the variants within those windows.
  If the haplotypes match, mark all false negatives/false positives as true positives.
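The core of the haplotype comparison can be sketched as below. This simplified version applies every variant in the window at once and ignores genotypes; it does not enumerate all sets of non-overlapping variants as the full algorithm does. Function names and the 0-based coordinate convention are assumptions for illustration.

```python
def expand_haplotype(ref_seq: str, start: int, end: int, variants) -> str:
    """Apply non-overlapping (pos, ref, alt) variants to ref_seq[start:end].

    Positions are 0-based indices into ref_seq (an assumption of this sketch).
    """
    pieces = []
    cursor = start
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap or fall outside the window")
        pieces.append(ref_seq[cursor:pos])  # unchanged reference bases
        pieces.append(alt)                  # substituted allele
        cursor = pos + len(ref)
    pieces.append(ref_seq[cursor:end])
    return "".join(pieces)


def can_rescue(ref_seq: str, start: int, end: int, true_vars, pred_vars) -> bool:
    """True if both variant sets produce the same haplotype in the window."""
    return (expand_haplotype(ref_seq, start, end, true_vars)
            == expand_haplotype(ref_seq, start, end, pred_vars))


# One two-base substitution in the truth, two adjacent SNPs in the prediction.
window_ref = "ACGT"
true_vars = [(1, "CG", "TA")]
pred_vars = [(1, "C", "T"), (2, "G", "A")]
rescued = can_rescue(window_ref, 0, 4, true_vars, pred_vars)
```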

Rescue Example

Outputs
Statistics, including counts for all categories, in plain text, TSV and JSON formats
Calculations for precision and recall, including error bars
VCF containing variants from both callsets, annotated with the callset they came from and their categorization (TP/FP/FN/rescued)

Where is SMaSH headed?

Global Alliance for Genomics & Health

ga4gh.org
The benchmarking task force includes:
  Illumina, Amazon, Google
  UC Berkeley, UC Santa Cruz, NIST

Development continues by GA4GH

Chief maintainers will be Kelly Westbrooks and Cassie Doll (Google).

Feature Roadmap
New variant types: complex variants, compound heterozygous variants, etc.
Phasing evaluation
Better handling of known false positives

SMaSH and GIAB

Try it and let us know what you think!

git clone https://github.com/amplab/smash.git
Complete documentation available at smash.cs.berkeley.edu

Post feedback at the Google Group smash-benchmarking

Code contributions
Open source and BSD-licensed; pull requests and issues very welcome

github.com/amplab/smash

Datasets
The SMaSH paper proposed eight datasets, including synthetic, sampled human, and mouse.
Other data to use as ground truth?
  NIST pedigree calls for NA12878
  The Illumina Platinum Genome
  Others?

Interpretation of results

Tools for Downstream Analysis?
Visualizations?
Compatibility with genome browsers?
Other?

[email protected]
github.com/amplab/smash
