Post on 18-Dec-2015

Omixon Workshops

Considerations for Analyzing Targeted NGS Data - Introduction

Tim Hague, CEO

Targeted Data

Introduction

Many mapping, alignment and variant calling algorithms

Most of these have been developed for whole genome sequencing and to some extent population genetic studies

Premise

In contrast, NGS based diagnostics deals with particular genes or mutations of an individual

Different diagnostic targets present specific challenges

Goal

Present analysis issues related to differences in:

Sequencing technologies

Targeting technologies

Target specifics

Pseudogenes and segmental duplication

NGS Sequencers

Illumina

Ion Torrent

Roche 454

(SOLiD)

Mind The Gap

Moore B, Hu H, Singleton M, De La Vega, FM, Reese MG, Yandell M. Genet Med. 2011 Mar;13(3):210-7.

Sequencing Technology

Differences:

Homopolymer error rates

G/C content errors

Read length

Sequencing protocols (single vs paired reads)
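Two of the differences listed above, homopolymer stretches and G/C content, can be measured directly on a read or a target sequence. The following is a minimal Python sketch, not part of the original slides; the function names and the default run length of 4 are assumptions chosen for illustration.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the called A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def homopolymer_runs(seq: str, min_len: int = 4):
    """Return (start, base, length) for every run of one base at least min_len long.

    min_len=4 is an illustrative threshold, not a value from the slides.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if base in "ACGT" and length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    read = "ACGTTTTTTGCGCGGGGGA"  # made-up read for illustration
    print(round(gc_fraction(read), 2))
    print(homopolymer_runs(read))
```

Flagging long runs and extreme G/C fractions in a target region before analysis gives a rough idea of where platform-specific errors are most likely to matter.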

Targeting Methods

PCR primers (e.g. amplicons)

Hybridization probes (e.g. exome kits)

Targeting Technology

Differences: Exact matching regions vs regions with SNPs

Results in: Need for mapping against whole chromosomes to avoid false positives

Analysis Targets

Differences:

Rate of polymorphism

Repetitive structures

Mutation profiles

G/C content

Single genes vs multi gene complexes

BRCA1/2: 1/2000

HLA: 1/29

CFTR: 1/2000

Distributions of insertions and deletions

Distribution of repeat elements

Segmental Duplications

Sometimes called Low Copy Repeats (LCRs)

Highly homologous, >95% sequence identity

Rare in most mammals

Comprise a large portion of the human genome (and other primate genomes)

Important for understanding HLA

Many LCRs are concentrated in "hotspots"

Recombinations in these regions are responsible for a wide range of disorders, including:

– Charcot-Marie-Tooth syndrome type 1A

– Hereditary neuropathy with liability to pressure palsies

– Smith-Magenis syndrome

– Potocki-Lupski syndrome
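The ">95% sequence identity" criterion above can be illustrated with a small Python sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts gap columns in the denominator, which is only one of several conventions for percent identity. The function names are hypothetical and not from the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Assumes a pre-computed pairwise alignment with '-' for gaps.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def exceeds_identity_threshold(aligned_a: str, aligned_b: str,
                               threshold: float = 95.0) -> bool:
    """True if the pair is more than `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    # Made-up 20-column alignment with a single mismatch.
    gene = "ACGTACGTACGTACGTACGT"
    copy = "ACGTACGTACGTACGTACGA"
    print(percent_identity(gene, copy))
    print(exceeds_identity_threshold(gene, copy))
```

At this level of identity a short read can often be placed equally well on either copy, which is why such regions are difficult for mappers.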

Segmental Duplications

Data analysis shouldn’t be like this!

Data Analysis Tools

Differences:

Detection rates of complex variants (sensitivity)

False positive rates (accuracy)

Speed

Ease of use
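Detection rate and false positives can be compared across tools by checking each tool's calls against a trusted set of variants. Below is a minimal Python sketch under stated assumptions: variants are represented as (chromosome, position, ref, alt) tuples, the example data are made up, and the false positive side is reported as precision and false discovery proportion, because a strict false positive rate would also need the number of true negatives.

```python
def compare_calls(truth: set, called: set) -> dict:
    """Compare a call set against a truth set of (chrom, pos, ref, alt) tuples."""
    tp = len(truth & called)   # called and true
    fn = len(truth - called)   # true but missed
    fp = len(called - truth)   # called but not true
    sensitivity = tp / (tp + fn) if truth else 0.0
    precision = tp / (tp + fp) if called else 0.0
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "sensitivity": sensitivity,
        "precision": precision,
        "false_discovery": 1.0 - precision if called else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical variants, for illustration only.
    truth = {
        ("chr17", 100, "A", "G"),
        ("chr17", 150, "C", "T"),
        ("chr17", 200, "GA", "G"),
        ("chr17", 300, "T", "C"),
    }
    called = {
        ("chr17", 100, "A", "G"),
        ("chr17", 150, "C", "T"),
        ("chr17", 250, "G", "A"),
    }
    print(compare_calls(truth, called))
```

Exact tuple matching only works if both call sets describe indels in the same way, so in practice the variants would need a consistent representation before being compared.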

“Depending upon which tool you use, you can see pretty big differences between even the same genome called with different tools—nearly as big as the two Life Tech/Illumina genomes.”

Mark Yandell in BioIT-World.com, June 8, 2011

Examples

Missing variants

SNPs, a DNP and deletions

Identify More Valid Variants

Find Homopolymer Indels

Examples

Coverage differences

Four Times Exon Coverage (coverage track ranges shown: [0-432] and [0-96])

Higher Exome Coverage (coverage track ranges shown: [0-24] and [0-10])
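Coverage comparisons like the ones above come down to per-base read depth over a target region. The following Python sketch computes depth from read intervals and the fold difference between two alignments of the same data. The half-open (start, end) interval representation, the function names and the example reads are all assumptions made for illustration.

```python
def per_base_coverage(reads, region_start: int, region_end: int):
    """Depth at each position of [region_start, region_end).

    reads: iterable of half-open (start, end) alignment intervals.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)          # difference array: +1 at start, -1 at end
    for start, end in reads:
        s = max(start, region_start)   # clip the read to the region
        e = min(end, region_end)
        if s < e:
            diff[s - region_start] += 1
            diff[e - region_start] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def mean_coverage(depth) -> float:
    """Average depth over the region (0.0 for an empty region)."""
    return sum(depth) / len(depth) if depth else 0.0


def fold_difference(depth_a, depth_b) -> float:
    """Mean coverage of A divided by mean coverage of B."""
    mean_b = mean_coverage(depth_b)
    if mean_b == 0:
        raise ValueError("second track has zero mean coverage")
    return mean_coverage(depth_a) / mean_b


if __name__ == "__main__":
    # Hypothetical alignments of the same reads by two different mappers.
    tool_a = [(0, 5), (3, 8), (2, 9), (0, 10)]
    tool_b = [(0, 5), (3, 8)]
    cov_a = per_base_coverage(tool_a, 0, 10)
    cov_b = per_base_coverage(tool_b, 0, 10)
    print(cov_a, cov_b, fold_difference(cov_a, cov_b))
```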

First Conclusion

Read accuracy is not the limiting factor in accurate variant analysis

Example - Dense Region of SNPs

Second Conclusion

As variant density increases, the performance of most tools goes down

Variant Calling

There are a few popular variant callers: GATK, SAMtools mpileup, VarScan

The most comprehensive (GATK) has a whole pipeline, including a quality recalibration step and an indel realignment step

Running these recalibration and realignment steps before any variant calling is highly recommended

Deduplication and removing non-primary alignments may also be required
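The deduplication and non-primary filtering mentioned above can be sketched on the SAM FLAG field alone. The bit values below come from the SAM format (0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary); the function names are hypothetical, and treating supplementary records as non-primary is an assumption of this sketch. Duplicates must already have been marked by an upstream tool, because this code only reads the flag. Duplicate removal is left optional because it may be unwanted for amplicon data, where reads share start positions by design.

```python
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100      # "not primary alignment"
FLAG_DUPLICATE = 0x400      # set by an upstream duplicate-marking step
FLAG_SUPPLEMENTARY = 0x800


def keep_alignment(flag: int, drop_duplicates: bool = True) -> bool:
    """Decide whether a SAM record with this FLAG should go to variant calling."""
    if flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return False
    if drop_duplicates and flag & FLAG_DUPLICATE:
        return False
    return True


def filter_sam_lines(lines, drop_duplicates: bool = True):
    """Yield header lines unchanged and alignment lines that pass keep_alignment."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records (fewer than 11 mandatory fields)
        if keep_alignment(int(fields[1]), drop_duplicates):
            yield line


if __name__ == "__main__":
    # Hypothetical records: primary, secondary and duplicate copies of one read.
    template = "read1\t{flag}\tchr6\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII"
    records = ["@HD\tVN:1.6"] + [template.format(flag=f) for f in (99, 355, 1123)]
    for kept in filter_sam_lines(records):
        print(kept)
```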

Indel Realigner Problem

Variants That Can be Hard to Find

DNPs

TNPs

Small indels next to SNPs

30+ bp indels

Homopolymer indels

Homopolymer indel and SNP together

Indels in palindromes

Dense regions of variants
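One reason homopolymer indels are hard to find and compare is that the same event can be written at several positions within the run. The Python sketch below applies the commonly used left-align-and-trim normalization to a single variant. It is an illustration under assumptions (0-based positions, a made-up reference, a hypothetical function name), not the method of any particular tool named in these slides.

```python
def normalize_indel(ref_seq: str, pos: int, ref: str, alt: str):
    """Left-align and trim one variant.

    pos is the 0-based index in ref_seq of the first base of `ref`.
    Returns the normalized (pos, ref, alt).
    """
    ref_seq, ref, alt = ref_seq.upper(), ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt alleles are identical")
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot left-extend past the start of the reference")
            pos -= 1
            ref, alt = ref_seq[pos] + ref, ref_seq[pos] + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    reference = "GCAAAAAT"  # made-up reference with a 5 bp A homopolymer
    # The same 1 bp deletion reported at two different places in the run:
    print(normalize_indel(reference, 5, "AA", "A"))
    print(normalize_indel(reference, 3, "AA", "A"))
```

Both calls describe the same deletion and normalize to the same leftmost representation, so call sets from different tools can be compared only after a step like this.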

Contact

Tim Hague, CEO

Omixon Biocomputing Solutions

Tim.Hague@omixon.com

+36 70 318 4878

Download our Omixon Target™ Evaluation Version Today

OMIXON.COM