ESHG Sequencing Workshop

Dr. Mike Evans — Chief Executive

A unique targeted sequencing service providing meaningful results, not insurmountable data

Outline of presentation

• Delivering a unique next generation sequencing service — Dr Mike Evans, CEO

• Optimised bait design for targeted sequencing — Dr Jolyon Holdstock, Senior Computational Biologist

• Adding value through analysis — Dr Volker Brenner, Head of Computational Biology

• Summary

• Q&A

• Lunch

OGT - provides advanced clinical genetics solutions - develops innovative molecular diagnostics

• Founded by Ed Southern in 1995

• 64 people

OGT Begbroke: Corporate offices and high-throughput labs

OGT Southern Centre: Biomarker discovery

IP Licensing: 40 licence relationships

Technologies for Molecular Medicine

Clinical and Genomic Solutions: Cytogenetics products and genomic services

Diagnostic Biomarkers: Genomic- and protein-based diagnostics

OGT’s key businesses

Clinical and Genomic Solutions

Addressing the challenges of high-throughput, high-resolution molecular technologies:

• High equipment and staff training costs

• Short equipment lifespan

• Complex study design and processes (e.g. platform evaluation & selection)

• Vast amounts of data

• Extensive computing infrastructure

• Data analysis expertise and resource

The solution: Genefficiency Genomic Services

High-quality data & complete reassurance

• Experimental and array design expertise

• High-throughput processing (>2000 samples / week)

• Applications: aCGH-CNV, methylation, miRNA, gene expression analysis

• Comprehensive data analysis services

• >40 QC checks on each sample to ensure high-quality data

Genefficiency™ — World’s Leading aCGH Service

Independent Accreditations

• First Agilent High-Throughput Microarray Certified Service Provider

• ISO 9001:2008 — Quality management systems

• ISO 27001:2005 — Information security

• ISO 17025:2005 — aCGH Laboratory services

FS 561156

IS 561157

4593

20,000 samples. 1,000 samples / week

“In order to characterise genetic variants, reproducible performance and reliable processing of the high resolution microarrays is essential. We were pleased with OGT’s responsive approach and attention to producing high quality data to tight deadlines”

Dr Matt Hurles, Wellcome Trust Sanger Institute.

Customer Satisfaction…

OGT Collaborators and Customers

A World-class Team

Our expert team deliver:

• Excellent project management and customer service

• >600 projects to date

• >50,000 samples

• Unparalleled expertise in study and probe design

• Advanced data analysis through a dedicated team of bioinformaticians

• Rapid turnaround times

• A wealth of experience of clinical and translational research projects

Delivering Discovery

Genefficiency Targeted Sequencing Services — designed to be different:

• Comprehensive — taking you from genomic DNA to filtered, qualified results

• Rigorously designed — project and probe design expertise maximises your likelihood of discovery

• Expert support — experienced team of biologists and bioinformaticians

• Dedication to quality — from sample to result, delivering reliable results every time

Delivering an Integrated, Comprehensive Service


1. Selection of most appropriate genomic regions for enrichment

2. Capture, sample multiplexing and sequencing

3. Data analysis and advanced filtering of variants

Delivering Expert Project Design

Step 1: Selection of most appropriate genomic regions for your project and budget

Whole exome

Pre-designed, validated whole exome capture probes

Coding regions are “most likely” candidates for many disorders

Custom genomic regions

Expert custom design of capture probes for your regions of interest

Flexibility to focus on regions of clinical significance or GWAS regions

Delivering Class-leading Technology

We have fully optimised the DNA capture and sequencing methodologies, so you don’t have to!

Step 2: Performing the capture, sample multiplexing, library preparation and sequencing

•Options for sample indexing and multiplexing to minimise sequencing cost

•Depth of sequencing coverage to suit your samples and project

•Paired-end sequencing on the industry-leading Illumina HiSeq 2000

OGT Delivers Discovery, not just Data

Step 3: Data analysis and advanced filtering of variants

•OGT’s dedicated analysis pipeline brings you beyond data, to a filtered list of variants relevant to your study

SEQUENCE FILTER DISCOVER

OGT Genefficiency Targeted Sequencing Services

The PLATFORM

• Core sequencing platform: Illumina HiSeq 2000

• Core sequence capture technology: Agilent SureSelect

The PEOPLE

• Team of highly skilled molecular biologists and bioinformaticians

• Core expertise in probe design

• Successful development of advanced analysis solutions

Agenda

• Important Definitions and Terminologies

• Introduction to Targeted Enrichment

• Custom Bait Design

Definitions and Terminologies

• Read length – The number of bases sequenced in a fragment

• Capture efficiency

• Paired end sequencing

• Read depth - How many times has a base been sequenced?

[Figure: reads aligned around a region of interest, labelled as on target or off target]

Read Depth Will Vary Across a Region of Interest

How many times has a base been sequenced?

*Sequence depth >20x: ~82% for single end, ~90% for paired end

*Agilent. 5990-4928EN

Assuming no allelic bias the theoretical read depth required to detect heterozygous variation with given accuracy can be calculated using a binomial distribution

• Minimum capacity required = Region of interest (ROI) x required depth

• Q30 variant detection for 15Kb ROI requires 210Kb sequencing capacity

Calculations based on variation being seen in at least 2 reads

• Should not be just one read as this could be ‘noise’

• Required observations could be a percentage of reads

Read Depth Required for Mutation Detection

Depth Required | Het. Call Accuracy | Probability of Error | Quality
11             | 99%                | 1:100                | Q20
14             | 99.9%              | 1:1000               | Q30
18             | 99.99%             | 1:10000              | Q40
25             | 99.999%            | 1:100000             | Q50
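The binomial reasoning behind this table can be sketched as follows. This is a minimal illustration, assuming a heterozygous allele fraction of 0.5 (no allelic bias) and the "seen in at least 2 reads" rule stated above; the function names are hypothetical and this is not OGT's own calculation.

```python
from math import comb

def miss_probability(depth, min_reads=2, allele_fraction=0.5):
    """Probability that the variant allele appears in fewer than min_reads reads."""
    return sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )

def min_depth(error_rate, min_reads=2, allele_fraction=0.5):
    """Smallest read depth whose miss probability is at or below error_rate."""
    depth = min_reads
    while miss_probability(depth, min_reads, allele_fraction) > error_rate:
        depth += 1
    return depth

def required_capacity(roi_bases, depth):
    """Minimum sequencing capacity = region of interest x required depth."""
    return roi_bases * depth

q30_depth = min_depth(1e-3)                          # depth for a 1:1000 error rate
q30_capacity = required_capacity(15_000, q30_depth)  # bases needed for a 15Kb ROI
```

Under these assumptions the sketch reproduces the depths of 11, 14 and 18 listed for Q20 to Q40 and the 210Kb capacity for a 15Kb ROI. For Q50 it returns 22 rather than 25, so the last row of the table may rest on a stricter criterion than the one assumed here.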

Why use Targeted Enrichment?

Flexibility in choice of genomic loci

• Allows capture of specific regions of interest for SNP and Indel detection

Cost Effectiveness

• Ideal for clinical applications

• Specific candidate genes are targeted

• Fine mapping post-GWAS

• Cost Benefits

• Enables multiplexing to fill capacity

Streamlined Data Analysis

• Reduced noise due to targeted specificity

Targeted Approaches Introduce Bias

There are significant imbalances in the sequence coverage achieved, particularly with targeted approaches

E.g. Agilent SureSelect*

• 3.3 Mb ROI

• 10M reads

• ~80% Targeted bases covered at ≥ 20x depth

• < 4% Targeted bases missed

*Ernani F. and LeProust E., Agilent. 5990-3532EN
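Coverage figures such as "targeted bases covered at ≥ 20x" and "targeted bases missed" can be derived from per-base read depths. The sketch below is illustrative only: the function name and the input (a plain list of depths, one per targeted base) are assumptions, and "missed" is taken to mean a depth of zero.

```python
def coverage_summary(depths, threshold=20):
    """Summarise per-base read depths over the targeted bases."""
    total = len(depths)
    if total == 0:
        raise ValueError("no target bases supplied")
    covered = sum(1 for depth in depths if depth >= threshold)
    missed = sum(1 for depth in depths if depth == 0)  # assumption: missed = zero reads
    return {
        "covered_fraction": covered / total,
        "missed_fraction": missed / total,
    }

# Toy example with five targeted bases.
summary = coverage_summary([0, 5, 20, 25, 40], threshold=20)
```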

Targeted gene sequencing can lead to some targets without the required depth of coverage

[Figure threshold: 14x (Q30)]

Example of Design Bias - Insufficient Coverage

Inadequate Coverage

*data kindly provided by C. Mattocks National Genetics Reference Lab, Salisbury, UK

Option 1:

•Increase coverage by increasing depth of sequencing

•Coverage of all targets proportionally increased

•Increased cost of sequencing

•Some bases still missed


Solution: Intelligent Design to Improve Coverage:

Option 2:

•Intelligent design of capture probes increases under-represented loci

•More even coverage of entire region, no loci missed (more likely to find mutations present)

•No need to increase sequence depth overall (more cost effective)

Problems Facing Users

• Design tools not user friendly

• Design tools only good for draft design

• Potential sources of bias

• Regions of interest too short

• Bait thermodynamic behaviour

• GC content

• Melting Temperature

• Risk of Design Errors

• OGT’s extensive experience in designing probes for microarrays allows us to minimise bias and ensure evenness of coverage giving the best chance to identify mutations

OGT’s Design Pipeline – what we need from you:

• Regions of Interest

  • Gene lists

  • Chromosomal locations

• Genome build version

• Data file format

  • Text, Excel, etc.

  • Consistent, e.g. chr1: 2247628-2248537
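A consistent region format matters because every entry has to be parsed the same way before a draft design is run. The following is a minimal sketch of such a check; `parse_region` and `region_length` are hypothetical names, and treating the coordinates as 1-based and inclusive is an assumption rather than something stated in the slides.

```python
import re

# Accepts strings such as "chr1: 2247628-2248537" (spaces and thousands commas optional).
REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)$")

def parse_region(text):
    """Return (chromosome, start, end) or raise ValueError for malformed input."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end: {text!r}")
    return chrom, start, end

def region_length(region):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = region
    return end - start + 1

region = parse_region("chr1: 2247628-2248537")
```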

Pipeline steps: 1. Data, 2. Draft Design, 3. Singleton Baits, 4. Bait Thermodynamics, 5. Report

• Assess the output:

  • Coverage

  • Bait distribution

  • Repeatmasking

Region of Interest

Run Draft Design


Repeatmasking

• This ensures that small regions are captured as well as large regions

• Advantage - Improves evenness of capture across the design

Before After

• Review the draft design and identify any regions covered by a single bait

• These regions span less than 120 bases

• Add additional singleton baits to the design

Correction for Singleton Baits
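The singleton correction described above can be sketched as follows. This is an illustration only: regions and baits are modelled as hypothetical `(chromosome, start, end)` tuples, and "adding additional singleton baits" is interpreted as adding extra copies of the single covering bait, with `extra_copies` an assumed parameter.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def find_singleton_regions(regions, baits):
    """Return (region, bait) pairs for regions covered by exactly one bait."""
    singles = []
    for region in regions:
        hits = [bait for bait in baits if overlaps(region, bait)]
        if len(hits) == 1:
            singles.append((region, hits[0]))
    return singles

def boost_singletons(regions, baits, extra_copies=1):
    """Return a new bait list with extra copies of each singleton bait."""
    boosted = list(baits)
    for _region, bait in find_singleton_regions(regions, baits):
        boosted.extend([bait] * extra_copies)
    return boosted

# Toy design: one short region (single bait) and one longer region (four baits).
regions = [("chr1", 100, 180), ("chr1", 1000, 1400)]
baits = [
    ("chr1", 80, 199),
    ("chr1", 1000, 1119),
    ("chr1", 1120, 1239),
    ("chr1", 1240, 1359),
    ("chr1", 1281, 1400),
]
boosted = boost_singletons(regions, baits)
```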


GC content

• Calculate GC content for all baits

• Identify those baits where GC content is extreme (for instance >65% or <40%)

• Add additional copies of these baits

Region of Interest

GC extreme

Correction for Bait Thermodynamics

Melting temperature (Tm)

• Calculate the Tm for all baits

• Identify those baits where Tm is extreme (e.g. > 75 °C)

• Add additional copies of these baits

Tm extreme
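The GC and Tm corrections can be illustrated with a short sketch. The thresholds (40%, 65%, 75 °C) are the example values given above. The Tm formula is a widely used empirical approximation for long oligonucleotides (81.5 + 16.6·log10[Na+] + 0.41·%GC − 675/N) chosen here for illustration; it is not necessarily the model OGT uses, and the sodium concentration, function names and `extra_copies` value are assumptions.

```python
from math import log10

def gc_percent(seq):
    """Percentage of G and C bases in a bait sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty bait sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def approx_tm(seq, sodium_molar=0.05):
    """Rough Tm in degrees C; empirical approximation, sodium_molar is assumed."""
    return (81.5 + 16.6 * log10(sodium_molar)
            + 0.41 * gc_percent(seq) - 675.0 / len(seq))

def copies_for_bait(seq, gc_low=40.0, gc_high=65.0, tm_high=75.0, extra_copies=1):
    """Number of copies to place in the library: more for extreme baits."""
    gc = gc_percent(seq)
    extreme = gc < gc_low or gc > gc_high or approx_tm(seq) > tm_high
    return 1 + extra_copies if extreme else 1

balanced_bait = "ACGT" * 30   # 120 bases, 50% GC
gc_rich_bait = "GC" * 60      # 120 bases, 100% GC
at_rich_bait = "AT" * 60      # 120 bases, 0% GC
```

With these assumed settings the balanced bait keeps a single copy, while the GC-rich and AT-rich baits are each duplicated.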


• Design Parameters

• Depth of Coverage

  • On target / Off target

  • Regions not covered, and why not

• Bait Details

  • Singletons

  • GC distribution

  • Tm distribution

• Library Design

  • Baits generated

Customer Report

• Better ‘evenness’ of coverage helps ensure no regions are missed and maximises the likelihood of variant detection

• Improvement of overall capture efficiency and on-target performance equals cost effective sequencing downstream

• Increased capture efficiency of SNPs and Indels equals an increase in the likelihood of detection

• Reduction of risk

Advantages of OGT’s Approach

Summary

• Custom design of regions for targeted sequencing offers significant flexibility for many applications

• Expert probe design will ensure:

• Evenness of coverage across the entire region

• Maximum likelihood of discovery of variants

• Efficient and cost effective use of sequencer capacity

• Overall these modifications make OGT’s capture perform better

Adding Value Through Analysis

• Introduction

• NGS data analysis

• Primary analysis: mapping and assembly, Q score re-calibration, NGS sequencing QC, NGS alignment QC

• Secondary analysis: SNP and Indel calling, annotation and evaluation pipeline, SIFT and PolyPhen

• Deliverables

• Case study

• Summary

The Analysis Challenge

[Figure: Sequencer → hard drive with ~4Gb per exome → publication]

Raw Data: FASTQ (standard text representation of short reads)

FASTQ uses four lines per sequence.

• Line 1: '@' followed by a sequence identifier

• Line 2: raw sequence letters

• Line 3: '+' (and optional sequence identifier)

• Line 4: quality values for the sequence in Line 2. Must contain the same number of symbols as letters in the sequence. (The letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126)

Example

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
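The four-line layout and the ASCII 33 quality encoding described above can be decoded with a few lines of code. This is a minimal sketch for a single record; `parse_fastq_record` is a hypothetical helper, not part of any particular toolkit.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (identifier, sequence, phred_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality symbol encodes a Phred score as its ASCII code minus 33.
    scores = [ord(symbol) - 33 for symbol in quality]
    return header[1:], sequence, scores

record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, scores = parse_fastq_record(record)
```

In this record the first base has score 0 (symbol `!`) and the last has score 20 (symbol `5`).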

Phred Quality Scores

• Phred is an accurate base-caller used for capillary traces (Ewing et al Genome Research 1998)

• Each called base is given a quality score Q

• Quality based on simple metrics (such as peak spacing) calibrated against a database of hand-edited data

• QPhred = -10 * log10(estimated probability call is wrong)

Q30 often used as a threshold for useful sequence data

Phred Quality Score | Probability of incorrect base call | Base call accuracy
10                  | 1 in 10                            | 90 %
20                  | 1 in 100                           | 99 %
30                  | 1 in 1000                          | 99.9 %
40                  | 1 in 10000                         | 99.99 %
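The relationship QPhred = -10 * log10(probability the call is wrong) can be checked against the table with a short sketch; the function names are illustrative.

```python
from math import log10

def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * log10(p)

def base_call_accuracy_percent(q):
    """Base call accuracy, in percent, for Phred score q."""
    return 100 * (1 - phred_to_error_probability(q))
```

For example, a score of 30 corresponds to an error probability of 1 in 1000, i.e. 99.9 % accuracy.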

• FASTQ is FASTA with quality scores added. Standard output format of NGS basecalling;

• SAM and BAM are equivalent formats for describing alignments of reads to a reference genome

• SAM: text file

• BAM: compressed binary, indexed, so it is possible to access reads mapping to a segment without decompressing the entire file

• BAM is used by IGV and other software

• Current standard binary format containing:

  • Meta information (read groups, algorithm details)

  • Sequence and quality scores

  • Alignment information

• VCF file: text file that lists all called variants (= differences to reference genome)

File Formats: FASTQ, SAM, BAM, VCF

• Just FASTQ files

• Data mapped and assembly (vs. genome or exome? De-duplicated? Locally re-aligned? Indexed?)

• All of the above plus VCF file

• Annotation of variants against genes, exons, transcripts...

• Links to external resources

• Sequence alignments for visual inspection of variant calls

• Filtered and prioritised data

• Multi-genome analysis

*) Kevin Rose (born Robert Kevin Rose, February 21, 1977) is an American Internet entrepreneur

NGS Data Analysis: A rose is a rose is a rose

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
A_36_B100184 65 . T C 6.2 . DP=27;AF1=0.4999;CI95=0.5,0.5;DP4=7,12,5,3;MQ=44;FQ=8.65;PV4=0.4,4.2
A_36_B100224 48 . G A 225 . DP=80;AF1=0.5;CI95=0.5,0.5;DP4=32,4,38,3;MQ=56;FQ=8.65
A_36_B100255 42 . A C 22 . DP=32;AF1=0.5;CI95=0.5,0.5;DP4=23,2,4,3;MQ=20;FQ=25;PV4=0.057,1.9e-06,1,0.004
A_36_B100333 76 . G A 225 . DP=50;AF1=0.5;CI95=0.5,0.5;DP4=10,9,18,9;MQ=57;FQ=225;PV4=0.3...
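Records like these are hard to read by eye, which is the point of the slide. As a rough illustration of what sits in one line, the sketch below splits a record into its fixed columns and its INFO key/value pairs. `parse_vcf_line` is a hypothetical helper; reading DP4 as reference-forward, reference-reverse, alternate-forward and alternate-reverse counts follows the samtools/bcftools convention, and that this file follows it is an assumption.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and an INFO dictionary."""
    fields = line.split()
    chrom, pos, _var_id, ref, alt, qual, _filter, info = fields[:8]
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "qual": float(qual), "info": info_map}

example = ("A_36_B100224 48 . G A 225 . "
           "DP=80;AF1=0.5;CI95=0.5,0.5;DP4=32,4,38,3;MQ=56;FQ=8.65")
variant = parse_vcf_line(example)
depth = int(variant["info"]["DP"])
# Assumed DP4 order: ref forward, ref reverse, alt forward, alt reverse.
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in variant["info"]["DP4"].split(","))
alt_reads = alt_fwd + alt_rev
```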


Primary Analysis - Mapping and Alignment

[Workflow figure: raw sequence files (FASTQ format) → mapping (BWA/Bowtie) → raw alignment files (SAM/BAM format) → local realignment around InDels, duplicate marking and quality score re-calibration (GATK, Picard) → analysis-ready alignment (SAM/BAM format)]

Why Mark Duplicates and Realign around Indels?

3 incorrect calls within 40bp!


NGS Variant Calling Methods

Option 1 - Hard filtering

Example: SNP can only be called if

• read depth >10

• >35% of reads carry SNP

+ Effective filtering

+ Transparent to user

– Simplistic approach

– Will miss high quality calls that don’t pass threshold
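The hard-filtering example above amounts to two fixed thresholds. A minimal sketch, with the function name and argument names chosen for illustration:

```python
def passes_hard_filter(read_depth, variant_reads, min_depth=10, min_fraction=0.35):
    """Call a SNP only if depth is above min_depth and the variant fraction is above min_fraction."""
    if read_depth <= min_depth:
        return False
    return variant_reads / read_depth > min_fraction

well_supported = passes_hard_filter(80, 41)   # deep site, about half the reads carry the SNP
too_shallow = passes_hard_filter(10, 9)       # strong signal, but depth is not above 10
low_fraction = passes_hard_filter(27, 8)      # about 30% of reads, below the 35% cut-off
```

The second and third cases show the drawback noted above: a fixed cut-off rejects calls regardless of how good the underlying base and mapping qualities are.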

Option 2 - Statistical analysis

Based on quality scores of individual basepairs, the alignment and statistical probability models

+ Robust

+ Optimum balance of sensitivity and specificity due to the use of statistical models

+ Fewer false positive and false negative SNP calls

– Requires correctly pre-processed data with reliable quality scores

Base Quality Score Re-Calibration

Source: The Broad Institute http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf

Before Recalibration After Recalibration

Primary Analysis – Raw data and assembly QC

[Workflow figure: raw sequence files (FASTQ format) → mapping (BWA/Bowtie) → raw alignment files (SAM/BAM format) → duplicate marking and quality score re-calibration (GATK, Picard) → analysis-ready alignment. QC branches: sequence QC check (FastQC) → raw data QC report; alignment QC check (Picard) → alignment QC report]

Secondary Analysis: SNP and Indel calling, annotation and filtering

[Workflow figure: analysis-ready alignment (SAM/BAM format) → GATK Unified Genotyper → SNPs and InDels (VCF format) → variant evaluation]

Variant Evaluation

• Known variant?

• Impact on gene expression?

• Splicing affected?

• Non-synonymous or frameshift mutation?

• Impact on protein function?

• How confident are we in the call?

• Zygosity?

Outputs: comprehensive interactive OGT report, alignment QC report, sequence QC report

SNP/Indel Classification (standard analysis)

We check and annotate every single detected SNP and Indel against all human Ensembl genes and transcripts and dbSNP

dbSNP annotation:

• Is the variant known?

• Obtain allele frequency

Does it affect any of the following?

• Promoter region

• UTR

• Splice sites or intronic region

• CDS

  • Synonymous mutation

  • Non synonymous mutation

  • Frameshift mutation

  • Stop codon (truncated/elongated protein sequence)

  • Overlap with protein domain

  • Consequence on protein function predicted (SIFT & PolyPhen)

SIFT predicts whether an amino acid substitution affects protein function, based on:

• sequence homology (phylogenetic conservation)

• the physical properties of amino acids

SIFT can be applied to naturally occurring non-synonymous polymorphisms and laboratory-induced mutations.

SIFT – SORTS INTOLERANT FROM TOLERANT MUTATIONS

PolyPhen: Prediction of Functional Effect of nsSNPs

PolyPhen (=Polymorphism Phenotyping) is an automatic tool for prediction of possible impact of an amino acid substitution on the structure and function of a human protein. This prediction is based on straightforward empirical rules which are applied to the sequence, phylogenetic and structural information characterizing the substitution

OGT Processing Overview

Individual Genome Analysis (Standard Level)

Multi Genome Analysis, Data Gathering and Comparison (Advanced Level)

Tailored analysis based on client’s individual requirements (Expert Level)

Perform pairwise genome analysis

Filter out variants that are present in any “baseline” exome (e.g. somatic tissue, healthy sibling) and variants that are not present in all “case” exomes
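The multi-genome filter can be expressed with set operations: keep only variants shared by every case exome and absent from every baseline exome. The sketch below is an interpretation of the rule on this slide, not OGT's implementation; variants are represented by hypothetical `(chrom, pos, ref, alt)` keys.

```python
def candidate_variants(case_exomes, baseline_exomes):
    """Variants present in all case exomes and in no baseline exome."""
    if not case_exomes:
        raise ValueError("at least one case exome is required")
    shared_by_cases = set.intersection(*(set(exome) for exome in case_exomes))
    seen_in_baseline = set().union(*(set(exome) for exome in baseline_exomes))
    return shared_by_cases - seen_in_baseline

case_1 = {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T"), ("chr3", 7, "G", "A")}
case_2 = {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")}
healthy_sibling = {("chr2", 50, "C", "T")}

candidates = candidate_variants([case_1, case_2], [healthy_sibling])
```

Here the chr3 variant is dropped because it is missing from one case, and the chr2 variant is dropped because the healthy sibling carries it.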

Study specific additional in-depth filtering and analysis

Data → Information

NGS Data Delivery

• Ship data: hard drive (or FTP)

• Copy data to shared drive or local hard drive and double click to browse

• Comprehensive HTML analysis report

• File location & share results

Analysis Report: Summary Section

Analysis Report: QC Section – Read QC

Analysis Report: QC Section – Alignment QC

Analysis Section - Overview

The Variant Table View

• Filter interface

• Data display

• Data export

The Variant Table View – External Links

The Detailed Variant View

Predicted Consequences on Protein Function

Alignment View of Selected Variant in IGV

Interactive Data Filtering

Case Study: a published exome study

Multi exome study reveals causative mutation of monogenic disorder

Standard Analysis

Advanced Analysis

Analysis Report: Supplementary Section

Summary

OGT offers fast, accurate & powerful NGS analysis

Standard Analysis

• Robust statistical data analysis

• Comprehensive variant annotation

• Interactive filtering and prioritisation of data based on

• chromosomal region

• allele frequency / novelty

• zygosity

• confidence score

• severity of mutation

Advanced Analysis

• Multi-genome comparison

Bespoke analysis

• Tailored to your specific requirements

let us help you with your workload

Please Enjoy Your Lunch!

Come and visit us at Booth #562

•Complete a survey for the chance to win a Kindle* eBook Reader

•Come to our wine reception tomorrow (Sunday) at 17:00 at our booth

*For full Terms and Conditions please visit www.ogt.co.uk/genefficiency/ESHGsurvey.html


Thank you

www.ogt.co.uk
