eccb10 talk - next-generation sequencing and structural variation

Next-generation sequencing and structural variation

Jan Aerts
Wellcome Trust Sanger Institute

jan.aerts@gmail.com

principles & pitfalls vs list of commands

What is structural variation?

• “variation that changes the structure of a chromosome”

• Mechanisms: NAHR, NHEJ, FoSTeS
• This presentation: focus on discovery (not: genotyping)

“experiment 4” from last slide Thomas

Types of structural variation

Approaches for discovery

Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly

=> Identify signatures

A. Read Pairs

RP - General principle

• Paired-end library => insert size
• Orientation/distance

RP - Signatures

Medvedev et al, 2009

RP - Real world

RP - Workflow overview

• Mapping
• Identify discordant read pairs
• Cluster on location
• Filter on nr RPs/cluster
• Filter on RD
• Filter: mappingQ x #readpairs
• Identify signatures
• Alternative reference
• Validate

RP - Mapping

• Provides raw data => crucial
• MAQ/bwa
– only report one hit (mappingQ = 0)
– MAQ might prefer mismatches to aberrant distance!
• Insert size = distribution instead of exact

RP - Discordant readpairs

• Orientation
• Distance
– Plot insert size distribution for chromosome
– Very long tail! => difficult to set cutoff: 4 MAD or 0.01%?
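A minimal sketch of the two discordance checks above, assuming each read pair has already been reduced to an (insert size, orientation) tuple. The input format, the function name `find_discordant`, the expected orientation "FR" and the default of 4 MAD are illustrative assumptions, not part of any particular tool.

```python
from statistics import median

def find_discordant(pairs, expected_orientation="FR", n_mad=4):
    """Flag read pairs that are discordant by orientation or by distance.

    pairs: list of (insert_size, orientation) tuples (hypothetical format).
    Returns a list of (index, reason) tuples.
    """
    sizes = [size for size, _ in pairs]
    centre = median(sizes)
    # Median absolute deviation: less sensitive to the long tail than stdev
    mad = median([abs(size - centre) for size in sizes])
    flagged = []
    for i, (size, orientation) in enumerate(pairs):
        if orientation != expected_orientation:
            flagged.append((i, "orientation"))
        elif abs(size - centre) > n_mad * mad:
            flagged.append((i, "distance"))
    return flagged

pairs = [(198, "FR"), (200, "FR"), (202, "FR"), (199, "FR"),
         (201, "RF"), (200, "FR"), (1500, "FR")]
print(find_discordant(pairs))  # [(4, 'orientation'), (6, 'distance')]
```

A percentile cutoff (e.g. the most extreme 0.01%) could replace the MAD rule; which of the two works better depends on the library.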

RP - Clustering

“standard clustering strategy”
– Only consider mate pairs that do not have concordant mappings
– Ignore read pairs that have more than one good mapping

Clustering: use insert size distribution (e.g. 2x4 MAD)
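The clustering step could be sketched as follows, assuming discordant, uniquely mapped pairs are given as (chromosome, left position, right position) tuples. The greedy strategy and the `window` parameter are assumptions for illustration; in practice the window would be derived from the insert size distribution.

```python
def cluster_pairs(pairs, window):
    """Greedily cluster discordant read pairs on location.

    pairs: list of (chrom, left_pos, right_pos) tuples (hypothetical format).
    A pair joins the current cluster when both of its ends lie within
    `window` bp of the previous pair's ends on the same chromosome.
    """
    clusters = []
    for pair in sorted(pairs):
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and pair[0] == last[0]
                and pair[1] - last[1] <= window
                and abs(pair[2] - last[2]) <= window):
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters

pairs = [("chr1", 1000, 5000), ("chr1", 1050, 5040), ("chr1", 1100, 5100),
         ("chr1", 9000, 20000), ("chr2", 1000, 5000)]
for cluster in cluster_pairs(pairs, window=200):
    print(len(cluster), cluster[0])
```

Because each pair is compared only with the previous one, interleaved clusters at the same location are not separated; a real implementation would need to handle that case.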

RP - Clustering: issues

• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)

• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)

• Low library quality or mix of libraries => multiple peaks in size distribution

RP - Filtering

• On nr RPs/cluster
– Normally: n=2
– For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
– If published data available: ROC for different cutoffs mQ x nrRP
– If not: very difficult
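The cluster filters above might look like this in code. The sketch assumes each cluster is represented by the mapping qualities of its read pairs and interprets "mappingQ x nrRP" as mean mapping quality times number of pairs; that interpretation, the function name and the cutoff values are all assumptions.

```python
def filter_clusters(clusters, min_pairs=2, min_score=0.0):
    """Keep clusters with enough read pairs and a high enough mQ x nrRP score.

    clusters: dict of cluster name -> list of mapping qualities, one per
    read pair (hypothetical format).
    Returns a dict of cluster name -> score for the clusters that pass.
    """
    kept = {}
    for name, mapqs in clusters.items():
        n_pairs = len(mapqs)
        if n_pairs < min_pairs:
            continue
        # Assumed score: mean mapping quality times number of read pairs
        score = (sum(mapqs) / n_pairs) * n_pairs
        if score >= min_score:
            kept[name] = score
    return kept

clusters = {"a": [60, 60, 60], "b": [10], "c": [5, 5]}
print(filter_clusters(clusters, min_pairs=2, min_score=50))  # {'a': 180.0}
```

With a published call set available, `min_score` could be swept over a range of values to draw the ROC curve mentioned above.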

RP - Issues

• Difficult => different groups = different results
“consensus set”
– RP & SP: many sets agree
– RD: totally different

• CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why??

• Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results

RP - Issues (2)

• Large insert size: low resolution for detecting breakpoints

• Small insert size: low resolution for detecting complex regions

B. Read Depth

RD - General principle

• Similar to aCGH: using reference RD file (e.g. based on 1kG)

• In theory: higher resolution, but noisier than aCGH
– Algorithms not mature yet
– More complex steps

=> Data binned
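To make the aCGH analogy concrete, here is a small sketch that bins read start positions and compares a sample against a reference as a log2 ratio per bin. The fixed-width bins, the normalisation by total count and the pseudocount are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-width bin from their start positions."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]

def log2_ratios(sample, reference, pseudocount=0.5):
    """Log2 ratio of sample vs reference depth per bin.

    Both profiles are scaled by their total count so that samples sequenced
    to different depths are comparable; the pseudocount avoids log(0).
    """
    s_total, r_total = sum(sample), sum(reference)
    return [
        math.log2(((s + pseudocount) / s_total) / ((r + pseudocount) / r_total))
        for s, r in zip(sample, reference)
    ]

print(bin_counts([0, 5, 10, 15, 25], bin_size=10, n_bins=3))  # [2, 2, 1]
ratios = log2_ratios([10, 10, 5, 10], [10, 10, 10, 10])
print([round(r, 2) for r in ratios])
```

A bin with a clearly negative ratio relative to its neighbours is a candidate deletion, and a clearly positive one a candidate duplication.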

RD - Exome

here: using exome data

RD - Example

RD - Workflow overview

• Mapping
• Read filtering
• GC correction
• Spike identification
• Validation

RD - mapping

Critical… (see RP)

RD - Filtering

• mapQ
– mapQ >= 0 (noisy; few FN, many FP)
– mapQ >= 10
– mapQ >= 30 (many FN, few FP)

• Mean depth exon (often: e.g. +/- 0.01)
– Mean depth > 1
– Mean depth > 5

RD - Filtering: what’s left

                    mapQ >= 30   mapQ >= 10   mapQ >= 0
mean DP exon > 5    152,000      153,000      160,000
mean DP exon > 1    162,000      163,000      169,000
all                 207,000      207,000      207,000

RD - correction

• Mainly: GC
– Other: repeat-rich regions, mapping Q, …

• Fit linear model GC-content exon and RD of exon => noise decreases
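The linear GC correction could be sketched like this: fit an ordinary least-squares line of exon read depth on exon GC content and subtract the GC-predicted deviation from the mean. The function name and input format are hypothetical, and a real pipeline might prefer a non-linear fit.

```python
def gc_correct(gc, depth):
    """Remove the linear GC trend from per-exon read depth.

    gc: GC fraction per exon; depth: read depth per exon (same order).
    Returns corrected depths centred on the original mean depth.
    """
    n = len(gc)
    mean_gc = sum(gc) / n
    mean_rd = sum(depth) / n
    # Ordinary least-squares slope of depth on GC content
    sxx = sum((g - mean_gc) ** 2 for g in gc)
    sxy = sum((g - mean_gc) * (d - mean_rd) for g, d in zip(gc, depth))
    slope = sxy / sxx
    # Subtract the part of each depth that GC content explains
    return [d - slope * (g - mean_gc) for g, d in zip(gc, depth)]

gc = [0.3, 0.4, 0.5, 0.6]
depth = [10, 20, 30, 40]
print([round(x, 6) for x in gc_correct(gc, depth)])  # [25.0, 25.0, 25.0, 25.0]
```

In this toy input the depth is fully explained by GC, so the corrected profile is flat; on real data the residual variation is what remains for segmentation.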

RD - segmentation

• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
– Combine outputs from different runs with different parameters
– Compare to known CNVs

RD - Combine algorithms

RD - Issues

• How to assess TP/FP/FN? => compare with known CNVs

• Breakpoints: unknown
– 1 datapoint/exon
– Can be outside of exon

• Different parameters for rare vs common CNVs => which?
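One way to compare calls with known CNVs is a reciprocal-overlap test, sketched below. The 50% threshold is a commonly used convention rather than something stated in this talk, and the interval format (half-open start/end on a single chromosome) is an assumption.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their overlap.

    a, b: (start, end) half-open intervals on the same chromosome.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def classify_calls(calls, known, min_ro=0.5):
    """Split calls into TP/FP and list known CNVs that were missed (FN)."""
    tp = [c for c in calls
          if any(reciprocal_overlap(c, k) >= min_ro for k in known)]
    fp = [c for c in calls if c not in tp]
    fn = [k for k in known
          if not any(reciprocal_overlap(c, k) >= min_ro for c in calls)]
    return tp, fp, fn

calls = [(100, 200), (500, 600)]
known = [(120, 210), (1000, 1100)]
print(classify_calls(calls, known))
# ([(100, 200)], [(500, 600)], [(1000, 1100)])
```

Since no set of known CNVs is complete, a "false positive" here may still be a real, previously unreported variant.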

C. Split Reads

SR - Principle

SR - Mapping

Short subsequences => many possible mappings

Solution: “anchored split mapping” (e.g. Pindel)
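The idea behind anchored split mapping can be illustrated with a toy example: the mapped mate fixes a search window, and the unmapped read is split at every position to see whether both parts match inside that window with a gap between them. This is a simplified sketch using exact string matching on short strings, not Pindel's implementation; all names and parameters are hypothetical.

```python
def anchored_split_map(reference, read, anchor_pos, window, min_part=5):
    """Look for a deletion by splitting `read` near an anchored mate.

    Only reference[anchor_pos:anchor_pos + window] is searched, which is
    what keeps the number of candidate mappings for short parts small.
    Returns a dict describing the first split found, or None.
    """
    region_end = min(len(reference), anchor_pos + window)
    region = reference[anchor_pos:region_end]
    for split in range(min_part, len(read) - min_part + 1):
        left, right = read[:split], read[split:]
        i = region.find(left)
        if i == -1:
            continue
        left_end = anchor_pos + i + len(left)
        j = reference.find(right, left_end, region_end)
        # j == left_end would be a contiguous (non-split) match
        if j > left_end:
            return {"left_end": left_end, "right_start": j,
                    "deleted": j - left_end}
    return None

flank_a = "GATTACAGGCTCAGT"
deleted = "CCCCCCCCCC"
flank_b = "TGGATCGAACTTGCA"
reference = flank_a + deleted + flank_b
read = flank_a[-8:] + flank_b[:8]  # read spanning the deletion
print(anchored_split_map(reference, read, anchor_pos=0, window=40))
# {'left_end': 15, 'right_start': 25, 'deleted': 10}
```

A read that matches the reference contiguously returns None, because its two parts are always adjacent.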

D. Local reassembly

Aim: to determine breakpoints

Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with only one read mapped
– (rather not: unmapped reads)

For large region: split up

Assemblers

• Velvet
• ABySS
• TIGRA
• …

Conclusions

• Available algorithms: serve more to demonstrate a technique than to provide a complete solution

• Different algorithms => different results

Chris Yoon

Genotyping

• Create alternative reference => remap reads
– All reads vs reads covering variant loci
– Whole-genome vs concatenation of variant loci
• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different signatures
• Bayesian approach: see what’s the most likely: do the reads support wild-type/het/hom nonref?
• Not exact mapping => local reassembly
– Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp
• Convention: left-most position reported (but not always)
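The Bayesian genotyping question above (wild-type, het or hom nonref?) can be illustrated with a simple binomial model over reads supporting the reference versus the alternative allele. The error rate, the flat priors and the function name are assumptions for this sketch; published genotypers use richer models.

```python
from math import comb

def genotype_posteriors(n_ref, n_alt, error=0.01, priors=(1/3, 1/3, 1/3)):
    """Posterior probability of each genotype given read counts.

    n_ref / n_alt: reads supporting the reference / alternative allele.
    Assumes each read independently supports the alternative allele with
    probability `error` (wild-type), 0.5 (het) or 1 - error (hom nonref).
    """
    n = n_ref + n_alt
    p_alt = {"wild-type": error, "het": 0.5, "hom-nonref": 1 - error}
    weighted = {}
    for (genotype, p), prior in zip(p_alt.items(), priors):
        likelihood = comb(n, n_alt) * p ** n_alt * (1 - p) ** n_ref
        weighted[genotype] = likelihood * prior
    total = sum(weighted.values())
    return {genotype: w / total for genotype, w in weighted.items()}

for counts in [(20, 0), (10, 9), (0, 20)]:
    post = genotype_posteriors(*counts)
    print(counts, max(post, key=post.get))
```

Population allele frequencies, where available, could replace the flat priors.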

References and software

• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)

Questions?
