aug2014 smash performance metrics tool

SMaSH: a Benchmarking Toolkit for Variant Calling

Upload: genomeinabottle

Post on 24-Jun-2015


Page 1: Aug2014 smash performance metrics tool

SMaSH: a Benchmarking Toolkit for Variant Calling

Page 2: Aug2014 smash performance metrics tool

SMaSH and GIAB: A Good Match

High overlap with features in the Performance Metrics Specifications doc.

Many of the features not currently supported are ones we'd like to integrate.

Page 3: Aug2014 smash performance metrics tool

About me

Worked at the UC Berkeley AMPLab for about a year
Currently the primary SMaSH developer
Starting a CS PhD at Berkeley in Programming Languages this fall

Page 4: Aug2014 smash performance metrics tool

About this talk

SMaSH as it is now
SMaSH in the future
SMaSH and GIAB

Page 5: Aug2014 smash performance metrics tool

SMaSH as it is now

Page 6: Aug2014 smash performance metrics tool

SMaSH

Project out of the AMP-X group at UC Berkeley
Talwalkar et al., 2014, Bioinformatics
smash.cs.berkeley.edu

Page 7: Aug2014 smash performance metrics tool

Initial goal

Create a unified way of benchmarking germline variant calling pipelines.

Page 8: Aug2014 smash performance metrics tool

SMaSH components

Codebase for comparing VCF callsets
Reads and ground truth datasets
Metrics for accuracy and computational performance

Page 9: Aug2014 smash performance metrics tool

Codebase

For benchmarking purposes, we compare a predicted callset against a ground truth callset.
Comparing two predicted callsets works exactly the same way.

Page 10: Aug2014 smash performance metrics tool

Variant Classification

SNPs
Indels (less than 50 base pairs)
Structural variants
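As an illustration of the three classes, here is a minimal Python sketch. The 50 bp threshold comes from the slide; measuring size as the longer of the two alleles, and the function name `classify_variant`, are assumptions made for this example and may not match how SMaSH itself measures length. Note that this simple rule lumps multi-base substitutions in with indels.

```python
INDEL_MAX_BP = 50  # threshold from the slide: indels are shorter than 50 bp


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its explicit ref/alt allele strings.

    Assumption: "size" is the length of the longer allele. This is a
    simplification for illustration, not SMaSH's documented rule.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if max(len(ref), len(alt)) < INDEL_MAX_BP:
        return "indel"
    return "SV"


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base substitution
    print(classify_variant("A", "ACGT"))          # short insertion
    print(classify_variant("A", "A" + "C" * 60))  # long insertion
```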

Page 11: Aug2014 smash performance metrics tool

Evaluation

SNPs and indels are strictly evaluated.
Structural variants are evaluated on:
  Same type (insertion/deletion/other)
  Length same as true variant within specified tolerance
  Position same as true variant within specified tolerance
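The structural variant criteria above can be sketched as a small predicate. This is only an illustration: the `StructuralVariant` record, the function name `sv_matches`, and the choice of absolute tolerances in base pairs are all assumptions, since the slide does not say how the tolerances are expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    kind: str    # e.g. "insertion", "deletion", or "other"
    pos: int     # start position on the reference
    length: int  # size of the event in base pairs


def sv_matches(pred: StructuralVariant, true: StructuralVariant,
               pos_tol: int, len_tol: int) -> bool:
    """Return True if pred agrees with true on type, and on position and
    length within the given tolerances (assumed here to be absolute bp)."""
    return (
        pred.kind == true.kind
        and abs(pred.pos - true.pos) <= pos_tol
        and abs(pred.length - true.length) <= len_tol
    )


if __name__ == "__main__":
    truth = StructuralVariant("deletion", pos=10_000, length=500)
    call = StructuralVariant("deletion", pos=10_040, length=480)
    print(sv_matches(call, truth, pos_tol=100, len_tol=50))
```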

Page 12: Aug2014 smash performance metrics tool

Accuracy Metrics

Evaluate variants as true positive, false positive, false negative
Evaluate accuracy of genotyping
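A minimal sketch of strict comparison for SNPs and indels, under simplifying assumptions: each callset is a dictionary keyed by `(chrom, pos, ref, alt)` with a VCF-style genotype string as the value, and genotypes are compared without regard to phase. The function names and data layout are hypothetical and are not taken from the SMaSH codebase.

```python
import re


def unphased(genotype: str) -> tuple:
    """Turn a VCF-style genotype such as '0/1' or '1|0' into a sorted tuple,
    so allele order and phasing are ignored."""
    return tuple(sorted(re.split(r"[/|]", genotype)))


def compare_callsets(truth: dict, predicted: dict) -> dict:
    """Count TP/FP/FN by exact key match, plus genotype agreement among TPs."""
    tp_keys = truth.keys() & predicted.keys()
    fp_keys = predicted.keys() - truth.keys()
    fn_keys = truth.keys() - predicted.keys()
    concordant = sum(
        1 for key in tp_keys if unphased(truth[key]) == unphased(predicted[key])
    )
    return {
        "tp": len(tp_keys),
        "fp": len(fp_keys),
        "fn": len(fn_keys),
        "genotype_concordant": concordant,
    }


if __name__ == "__main__":
    truth = {("1", 100, "A", "G"): "0/1", ("1", 200, "C", "T"): "1/1",
             ("1", 300, "G", "A"): "0/1"}
    predicted = {("1", 100, "A", "G"): "1/0", ("1", 200, "C", "T"): "0/1",
                 ("1", 400, "T", "C"): "0/1"}
    print(compare_callsets(truth, predicted))
```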

Page 13: Aug2014 smash performance metrics tool

Error bars

Calculated on confidence in ground truth calls
Choose some upper bound on the ground truth call error rate based on validation methodology
  E.g., 2 out of every 1000 SNPs are wrong.
Use this error rate to calculate upper/lower bounds on precision and recall.
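One way such bounds could be computed is sketched below. The model is an assumption made for this example and is not necessarily the formula SMaSH uses: at most `ceil(rate * (TP + FN))` ground truth calls are taken to be spurious. In the worst case they all sit among the true positives (which are then really false positives); in the best case they all sit among the false negatives (which then should not count against recall). The function name and return layout are hypothetical.

```python
import math


def _ratio(numerator: int, denominator: int) -> float:
    """Division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def precision_recall_bounds(tp: int, fp: int, fn: int,
                            truth_error_rate: float) -> dict:
    """Return (lower, point, upper) for precision and recall, assuming at
    most truth_error_rate of the ground truth calls are spurious."""
    n_truth = tp + fn
    max_errors = math.ceil(truth_error_rate * n_truth)
    bad_tp = min(max_errors, tp)  # worst case: spurious truth calls were matched
    bad_fn = min(max_errors, fn)  # best case: spurious truth calls were missed

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, n_truth)
    precision_low = _ratio(tp - bad_tp, tp + fp)
    recall_low = _ratio(tp - bad_tp, n_truth - bad_tp)
    recall_high = _ratio(tp, n_truth - bad_fn)
    # Under this model spurious truth calls can only hurt precision,
    # so the upper precision bound equals the point estimate.
    return {
        "precision": (precision_low, precision, precision),
        "recall": (recall_low, recall, recall_high),
    }


if __name__ == "__main__":
    # 2 out of every 1000 ground truth SNPs assumed wrong, as on the slide.
    print(precision_recall_bounds(tp=900, fp=50, fn=100, truth_error_rate=0.002))
```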

Page 14: Aug2014 smash performance metrics tool

The VCF format is ambiguous!

SMaSH addresses this problem with two strategies:
  Normalization
  Rescue

Guiding principle: metrics should never be worse after normalization/rescue than they were without them.

Page 15: Aug2014 smash performance metrics tool

Normalization

A single variant may plausibly be placed at many different positions and still describe the same change.

Page 16: Aug2014 smash performance metrics tool

For example, we normalize this variant:

Page 17: Aug2014 smash performance metrics tool

First, we remove the longest proper suffix shared by the ref and alt alleles.

Page 18: Aug2014 smash performance metrics tool

Then, we "slide" the variant by adding a base from the reference to the head and removing a base from the tail, until the last bases on both alleles are no longer the same.
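The two normalization steps can be sketched in Python as follows. This is a minimal illustration, not SMaSH's actual code: the function name `normalize`, the 1-based `pos` convention, and passing the reference as a plain string are assumptions. The sketch does not trim shared leading bases, since the slides do not describe that step.

```python
def normalize(ref_seq: str, pos: int, ref: str, alt: str) -> tuple:
    """Left-normalize one variant.

    ref_seq is the reference sequence; pos is the 1-based position of the
    first base of `ref` within it.
    """
    # Step 1: strip the shared suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Step 2: slide left. Prepend the preceding reference base and drop the
    # last base of each allele while the last bases still agree.
    while pos > 1 and ref[-1] == alt[-1]:
        previous_base = ref_seq[pos - 2]
        ref = previous_base + ref[:-1]
        alt = previous_base + alt[:-1]
        pos -= 1
    return pos, ref, alt


if __name__ == "__main__":
    reference = "GCACACT"
    # Two placements of the same 2 bp deletion in the CA repeat.
    print(normalize(reference, 4, "CAC", "C"))
    print(normalize(reference, 4, "CACT", "CT"))
```

Both calls describe the same deletion and normalize to the same leftmost representation, which is what lets the two callsets be compared position by position.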

Page 19: Aug2014 smash performance metrics tool

Rescue

The same underlying haplotype can be represented by different sets of variants.

True callset

Predicted callset

Page 20: Aug2014 smash performance metrics tool

Rescue Algorithm

For every false negative, we attempt rescue:
  Build up a window around the variant position for the true and predicted callsets.
  For all sets of non-overlapping variants, expand the underlying haplotypes for the variants within those windows.
  If the haplotypes match, mark all false negatives/false positives as true positives.
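The core of the rescue check, expanding haplotypes and comparing them, can be sketched as below. This is a simplified illustration rather than SMaSH's implementation: it compares a single set of variants per callset instead of enumerating all non-overlapping sets, and the function names, the `(pos, ref, alt)` tuple layout, and the 1-based inclusive window are assumptions.

```python
def expand_haplotype(ref_seq: str, start: int, end: int, variants: list) -> str:
    """Apply non-overlapping (pos, ref, alt) variants to the reference window
    [start, end] (1-based, inclusive) and return the resulting sequence."""
    pieces = []
    cursor = start
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap or fall outside the window")
        if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
            raise ValueError("ref allele does not match the reference")
        pieces.append(ref_seq[cursor - 1:pos - 1])  # untouched reference bases
        pieces.append(alt)                          # substituted allele
        cursor = pos + len(ref)
    pieces.append(ref_seq[cursor - 1:end])
    return "".join(pieces)


def can_rescue(ref_seq: str, start: int, end: int,
               true_variants: list, predicted_variants: list) -> bool:
    """True if both variant sets yield the same haplotype over the window."""
    return (expand_haplotype(ref_seq, start, end, true_variants)
            == expand_haplotype(ref_seq, start, end, predicted_variants))


if __name__ == "__main__":
    reference = "ACGT"
    truth = [(2, "CG", "TA")]                  # one two-base substitution
    calls = [(2, "C", "T"), (3, "G", "A")]     # the same change as two SNPs
    print(can_rescue(reference, 1, 4, truth, calls))
```

In the example, a strict comparison would report one false negative and two false positives, even though both callsets describe the same sequence; matching haplotypes is what allows them to be recategorized.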

Page 21: Aug2014 smash performance metrics tool

Rescue Example

Page 22: Aug2014 smash performance metrics tool

Outputs

Statistics, including counts for all categories, in plain text, TSV and JSON formats
Calculations for precision and recall, including error bars
VCF containing variants from both callsets, annotated with the callset they came from and their categorization (TP/FP/FN/rescued)

Page 23: Aug2014 smash performance metrics tool

Where is SMaSH headed?

Page 24: Aug2014 smash performance metrics tool

Global Alliance for Genomics & Health

ga4gh.org
The benchmarking task force includes:
  Illumina, Amazon, Google
  UC Berkeley, UC Santa Cruz, NIST

Page 25: Aug2014 smash performance metrics tool

Development continues by GA4GH

Chief maintainers will be Kelly Westbrooks and Cassie Doll (Google).

Page 26: Aug2014 smash performance metrics tool

Feature Roadmap

New variant types: complex variants, compound heterozygous variants, etc.
Phasing evaluation
Better handling of known false positives

Page 27: Aug2014 smash performance metrics tool

SMaSH and GIAB

Page 28: Aug2014 smash performance metrics tool

Try it and let us know what you think!

git clone https://github.com/amplab/smash.git
Complete documentation available at smash.cs.berkeley.edu
Post feedback at the Google Group smash-benchmarking

Page 29: Aug2014 smash performance metrics tool

Code contributions

Open source and BSD-licensed; pull requests and issues very welcome
github.com/amplab/smash

Page 30: Aug2014 smash performance metrics tool

Datasets

The SMaSH paper proposed eight datasets, including synthetic, sampled human, and mouse.
Other data to use as ground truth?
  NIST pedigree calls for NA12878
  The Illumina Platinum Genome
  Others?

Page 31: Aug2014 smash performance metrics tool

Interpretation of results

Page 32: Aug2014 smash performance metrics tool

Tools for Downstream Analysis?

Visualizations?
Compatibility with genome browsers?
Other?

Page 33: Aug2014 smash performance metrics tool

[email protected]
github.com/amplab/smash