ESHG Sequencing Workshop
TRANSCRIPT
Dr. Mike Evans — Chief Executive
A unique targeted sequencing service providing meaningful results, not insurmountable data
Outline of presentation
• Delivering a unique next generation sequencing service — Dr Mike Evans, CEO
• Optimised bait design for targeted sequencing — Dr Jolyon Holdstock, Senior Computational Biologist
• Adding value through analysis — Dr Volker Brenner, Head of Computational Biology
• Summary
• Q&A
• Lunch
OGT provides advanced clinical genetics solutions and develops innovative molecular diagnostics
• Founded by Ed Southern in 1995
• 64 people
OGT Begbroke: Corporate offices and high-throughput labs
OGT Southern Centre: Biomarker discovery
OGT’s key businesses
• IP Licensing: 40 licence relationships
• Technologies for Molecular Medicine
• Clinical and Genomic Solutions: cytogenetics products and genomic services
• Diagnostic Biomarkers: genomic- and protein-based diagnostics
Clinical and Genomic Solutions
Addressing the challenges of high-throughput, high-resolution molecular technologies:
• High equipment and staff training costs
• Short equipment lifespan
• Complex study design and processes (e.g. platform evaluation & selection)
• Vast amounts of data
• Extensive computing infrastructure
• Data analysis expertise and resource
The solution: Genefficiency Genomic Services
High-quality data & complete reassurance
• Experimental and array design expertise
• High-throughput processing (>2000 samples / week)
• Applications: aCGH-CNV, methylation, miRNA, gene expression analysis
• Comprehensive data analysis services
• >40 QC checks on each sample to ensure high-quality data
Genefficiency™ — World’s Leading aCGH Service
Independent Accreditations
• First Agilent High-Throughput Microarray Certified Service Provider
• ISO 9001:2008 — Quality management systems
• ISO 27001:2005 — Information security
• ISO 17025:2005 — aCGH Laboratory services
Certificate numbers: FS 561156, IS 561157, 4593
20,000 samples. 1,000 samples / week
“In order to characterise genetic variants, reproducible performance and reliable processing of the high resolution microarrays is essential. We were pleased with OGT’s responsive approach and attention to producing high quality data to tight deadlines”
Dr Matt Hurles, Wellcome Trust Sanger Institute
Customer Satisfaction…
OGT Collaborators and Customers
A World-class Team
Our expert team deliver:
• Excellent project management and customer service
• >600 projects to date
• >50,000 samples
• Unparalleled expertise in study and probe design
• Advanced data analysis through a dedicated team of bioinformaticians
• Rapid turnaround times
• A wealth of experience of clinical and translational research projects
Delivering Discovery
Genefficiency Targeted Sequencing Services — designed to be different:
• Comprehensive — taking you from genomic DNA to filtered, qualified results
• Rigorously designed — project and probe design expertise maximises your likelihood of discovery
• Expert support — experienced team of biologists and bioinformaticians
• Dedication to quality — from sample to result, delivering reliable results every time
Delivering an Integrated, Comprehensive Service
1. Selection of most appropriate genomic regions for enrichment
2. Capture, sample multiplexing and sequencing
3. Data analysis and advanced filtering of variants
Delivering Expert Project Design
Step 1: Selection of most appropriate genomic regions for your project and budget
Whole exome
Pre-designed, validated whole exome capture probes
Coding regions are “most likely” candidates for many disorders
Custom genomic regions
Expert custom design of capture probes for your regions of interest
Flexibility to focus on regions of clinical significance or GWAS regions
Delivering Class-leading Technology
We have fully optimised the DNA capture and sequencing methodologies, so you don’t have to!
Step 2: Performing the capture, sample multiplexing, library preparation and sequencing
• Options for sample indexing and multiplexing to minimise sequencing cost
• Depth of sequencing coverage to suit your samples and project
• Paired-end sequencing on the industry-leading Illumina HiSeq 2000
OGT Delivers Discovery, not just Data
Step 3: Data analysis and advanced filtering of variants
• OGT’s dedicated analysis pipeline takes you beyond data, to a filtered list of variants relevant to your study
SEQUENCE FILTER DISCOVER
OGT Genefficiency Targeted Sequencing Services
The PLATFORM
• Core sequencing platform: Illumina HiSeq 2000
• Core sequence capture technology: Agilent SureSelect
The PEOPLE
• Team of highly skilled molecular biologists and bioinformaticians
• Core expertise in probe design
• Successful development of advanced analysis solutions
Optimised Bait Design for Targeted Sequencing (Dr Jolyon Holdstock, Senior Computational Biologist)
Agenda
• Important Definitions and Terminologies
• Introduction to Targeted Enrichment
• Custom Bait Design
Definitions and Terminologies
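• Read length – The number of bases sequenced in a fragment
• Capture efficiency – The proportion of reads that fall on target (within the region of interest) rather than off target
• Paired end sequencing – Sequencing a fragment from both ends
• Read depth – How many times has a base been sequenced?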
[Figure: reads falling on target and off target relative to the region of interest]
Read Depth Will Vary Across a Region of Interest
How many times has a base been sequenced?
*Sequence depth >20x: ~82% for single end, ~90% for paired end
*Agilent. 5990-4928EN
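A minimal sketch of how read depth, the fraction of bases above a depth threshold and the on-target fraction could be computed for one region of interest. The half-open coordinate convention, the function names and the toy reads are assumptions made for illustration; they are not taken from the Agilent note or from OGT's pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive (assumed convention)


def per_base_depth(roi: Interval, reads: List[Interval]) -> List[int]:
    """Count how many reads cover each base of the region of interest."""
    roi_start, roi_end = roi
    depth = [0] * (roi_end - roi_start)
    for read_start, read_end in reads:
        # Only the part of the read that overlaps the ROI contributes.
        for pos in range(max(read_start, roi_start), min(read_end, roi_end)):
            depth[pos - roi_start] += 1
    return depth


def fraction_at_depth(depth: List[int], threshold: int) -> float:
    """Fraction of ROI bases sequenced at least `threshold` times."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d >= threshold) / len(depth)


def on_target_fraction(roi: Interval, reads: List[Interval]) -> float:
    """Fraction of reads that overlap the ROI by at least one base."""
    if not reads:
        return 0.0
    roi_start, roi_end = roi
    on_target = sum(1 for s, e in reads if s < roi_end and e > roi_start)
    return on_target / len(reads)


# Toy example: a 10-base ROI and four reads, one of them off target.
roi = (100, 110)
reads = [(95, 105), (100, 110), (103, 108), (200, 210)]
depth = per_base_depth(roi, reads)
print(depth)
print(fraction_at_depth(depth, 2), on_target_fraction(roi, reads))
```

The same two summary numbers, computed at a 20x threshold over a real design, correspond to the ">20x" percentages quoted above.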
Read Depth Required for Mutation Detection
Assuming no allelic bias, the theoretical read depth required to detect heterozygous variation with a given accuracy can be calculated using a binomial distribution.
• Minimum capacity required = Region of interest (ROI) x required depth
• Q30 variant detection for a 15 kb ROI requires 210 kb of sequencing capacity (15 kb x 14 reads)
Calculations based on variation being seen in at least 2 reads
• Should not be just one read as this could be ‘noise’
• Required observations could be a percentage of reads
Depth Required   Het. Call Accuracy   Probability of Error   Quality
11               99%                  1:100                  Q20
14               99.9%                1:1000                 Q30
18               99.99%               1:10000                Q40
25               99.999%              1:100000               Q50
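A minimal sketch of the binomial calculation described above, assuming each read carries the variant allele with probability 0.5 (no allelic bias) and that a heterozygous call needs the variant in at least 2 reads. The function names are illustrative. Under these assumptions the sketch gives 11, 14 and 18 for the first three rows of the table; for the last row it gives a lower figure than the 25 quoted, so that row presumably rests on an assumption not stated on the slide.

```python
from math import comb


def miss_probability(depth: int, min_reads: int = 2,
                     allele_fraction: float = 0.5) -> float:
    """Probability that the variant allele is seen in fewer than `min_reads`
    reads, with reads modelled as Binomial(depth, allele_fraction)."""
    return sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )


def min_depth(max_error: float, min_reads: int = 2,
              allele_fraction: float = 0.5) -> int:
    """Smallest depth at which the miss probability is at most `max_error`."""
    depth = min_reads
    while miss_probability(depth, min_reads, allele_fraction) > max_error:
        depth += 1
    return depth


def sequencing_capacity(roi_bases: int, depth: int) -> int:
    """Minimum capacity required = ROI x required depth."""
    return roi_bases * depth


q30_depth = min_depth(1e-3)
print(q30_depth, sequencing_capacity(15_000, q30_depth))
```

Raising `min_reads`, or lowering `allele_fraction` to model allelic bias, pushes the required depth up, which is one way to explore the "percentage of reads" option mentioned above.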
Why use Targeted Enrichment?
Flexibility in choice of genomic loci
• Allows capture of specific regions of interest for SNP and Indel detection
Cost Effectiveness
• Ideal for clinical applications
• Specific candidate genes are targeted
• Fine mapping post-GWAS
• Cost Benefits
• Enables multiplexing to fill capacity
Streamlined Data Analysis
• Reduced noise due to targeted specificity
Targeted Approaches Introduce Bias
There are significant imbalances in the sequence coverage achieved, particularly with targeted approaches
E.g. Agilent SureSelect*
• 3.3 Mb ROI
• 10M reads
• ~80% of targeted bases covered at ≥ 20x depth
• < 4% of targeted bases missed
*Ernani F. and LeProust E, Agilent. 5990-3532EN
Targeted gene sequencing can leave some targets without the required depth of coverage (14x for Q30)
Example of Design Bias - Insufficient Coverage
Inadequate Coverage
*data kindly provided by C. Mattocks National Genetics Reference Lab, Salisbury, UK
Option 1:
• Increase coverage by increasing depth of sequencing
• Coverage of all targets proportionally increased
• Increased cost of sequencing
• Some bases still missed
Solution: Intelligent Design to Improve Coverage
Option 2:
• Intelligent design of capture probes increases coverage of under-represented loci
• More even coverage of entire region, no loci missed (more likely to find mutations present)
• No need to increase sequence depth overall (more cost effective)
Problems Facing Users
• Design tools not user friendly
• Design tools only good for draft design
• Potential sources of bias
• Regions of interest too short
• Bait thermodynamic behaviour
• GC content
• Melting Temperature
• Risk of Design Errors
• OGT’s extensive experience in designing probes for microarrays allows us to minimise bias and ensure evenness of coverage giving the best chance to identify mutations
OGT’s Design Pipeline – what we need from you:
• Regions of Interest
• Gene lists
• Chromosomal locations
• Genome build version
• Data file format
• Text, Excel, etc.
• Consistent, e.g. chr1: 2247628-2248537
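A minimal sketch of why a consistent location format matters: one small parser can then read every region in a submitted list. The accepted pattern (a `chr` prefix, a colon and a hyphen-separated range, with optional spaces and thousands separators) is an assumption based on the single example above, and `parse_region` is a hypothetical helper, not part of any OGT tool.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Assumed format: "chr1: 2247628-2248537" (spaces and commas tolerated).
_REGION_RE = re.compile(r"^\s*(chr[0-9A-Za-z]+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_region(text: str) -> Region:
    """Parse one chromosomal location string into (chrom, start, end)."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"Unrecognised region: {text!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"Start is after end in region: {text!r}")
    return Region(match.group(1), start, end)


region = parse_region("chr1: 2247628-2248537")
print(region, region.end - region.start)
```

Entries that do not match raise a `ValueError`, so inconsistent rows are caught before the draft design is run.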
Design pipeline steps: 1. Data, 2. Draft Design, 3. Singleton Baits, 4. Bait Thermodynamics, 5. Report
Run Draft Design
• Assess the output:
• Coverage
• Bait distribution
• Repeatmasking
[Figure: draft baits across the region of interest, showing repeatmasking]
Correction for Singleton Baits
• Review the draft design and identify any regions covered by a single bait
• These regions span less than 120 bases
• Add additional singleton baits to the design
• This ensures that small regions are captured as well as large regions
• Advantage - Improves evenness of capture across the design
[Figure: coverage before and after singleton correction]
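The singleton step can be sketched as a simple length check, assuming a region counts as a singleton when it spans fewer than 120 bases (the figure given on the slide). The number of extra baits to add per singleton is not stated in the talk, so `extra_copies` is a purely hypothetical parameter, as are the function names.

```python
from typing import Dict, List, Tuple

BAIT_LENGTH = 120  # bases; from "These regions span less than 120 bases"

Region = Tuple[str, int, int]  # (chrom, start, end), end exclusive (assumed)


def find_singleton_regions(regions: List[Region],
                           bait_length: int = BAIT_LENGTH) -> List[Region]:
    """Regions shorter than one bait, i.e. covered by a single bait."""
    return [r for r in regions if (r[2] - r[1]) < bait_length]


def extra_baits_per_region(regions: List[Region], extra_copies: int = 1,
                           bait_length: int = BAIT_LENGTH) -> Dict[Region, int]:
    """Number of additional baits to add for each region (0 if not a singleton)."""
    singletons = set(find_singleton_regions(regions, bait_length))
    return {r: (extra_copies if r in singletons else 0) for r in regions}


regions = [("chr1", 1000, 1080), ("chr1", 5000, 5600)]
print(find_singleton_regions(regions))
print(extra_baits_per_region(regions, extra_copies=2))
```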
GC content
• Calculate GC content for all baits
• Identify those baits where GC content is extreme (for instance >65% or <40%)
• Add additional copies of these baits
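A minimal sketch of the GC-content step above, using the example thresholds from the slide (below 40% or above 65%). The function names and the toy bait sequences are illustrative assumptions; real baits would be full-length sequences from the draft design.

```python
from typing import Dict, List


def gc_content(sequence: str) -> float:
    """Percentage of G and C bases in a bait sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("Empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def is_gc_extreme(sequence: str, low: float = 40.0, high: float = 65.0) -> bool:
    """True when GC content falls outside the [low, high] band."""
    gc = gc_content(sequence)
    return gc < low or gc > high


def baits_to_boost(baits: Dict[str, str]) -> List[str]:
    """Names of baits that would receive additional copies."""
    return [name for name, seq in baits.items() if is_gc_extreme(seq)]


# Toy sequences only, far shorter than a real bait.
baits = {
    "bait_1": "ATATATATGC",  # 20% GC
    "bait_2": "ATGCATGCAT",  # 40% GC
    "bait_3": "GGCCGGCCGA",  # 90% GC
}
print(baits_to_boost(baits))
```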
[Figure: baits across the region of interest, with GC-extreme baits highlighted]
Correction for Bait Thermodynamics
Melting temperature (Tm)
• Calculate the Tm for all baits
• Identify those baits where Tm is extreme (e.g. > 75 °C)
• Add additional copies of these baits
[Figure: baits across the region of interest, with Tm-extreme baits highlighted]
Customer Report
• Design Parameters
• Depth of Coverage
• On target / Off target
• Regions not covered, and why not
• Bait Details
• Singletons
• GC distribution
• Tm distribution
• Library Design
• Baits generated
Advantages of OGT’s Approach
• Better ‘evenness’ of coverage helps ensure no regions are missed and maximises the likelihood of variant detection
• Improved overall capture efficiency and on-target performance mean cost-effective sequencing downstream
• Increased capture efficiency of SNPs and Indels means an increased likelihood of detection
• Reduction of risk
Summary
• Custom design of regions for targeted sequencing offers significant flexibility for many applications
• Expert probe design will ensure:
• Evenness of coverage across the entire region
• Maximum likelihood of discovery of variants
• Efficient and cost effective use of sequencer capacity
• Overall these modifications make OGT’s capture perform better
Adding Value Through Analysis
• Introduction
• NGS data analysis
• Primary analysis: mapping and assembly, Q score re-calibration, NGS sequencing QC, NGS alignment QC
• Secondary analysis: SNP and Indel calling, annotation and evaluation pipeline, SIFT and PolyPhen
• Deliverables
• Case study
• Summary
The Analysis Challenge
[Figure: from sequencer, to hard drive with ~4Gb per exome, to publication]
Raw Data: FASTQ (standard text representation of short reads)
FASTQ uses four lines per sequence.
• Line 1: '@' followed by a sequence identifier
• Line 2: raw sequence letters
• Line 3: '+' (and optional sequence identifier)
• Line 4: quality values for the sequence in Line 2. Must contain the same number of symbols as letters in the sequence. (The letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126)
Example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
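A minimal sketch of reading one four-line FASTQ record and decoding its quality string, assuming the ASCII-33 offset described above. `FastqRecord`, `parse_fastq_record` and `decode_qualities` are illustrative names, not functions from any particular library.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq_record(lines: List[str]) -> FastqRecord:
    """Parse exactly four lines: '@id', sequence, '+', quality string."""
    if len(lines) != 4:
        raise ValueError("A FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("Malformed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("Sequence and quality lengths differ")
    return FastqRecord(header[1:], sequence, quality)


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality]


record = parse_fastq_record([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
scores = decode_qualities(record.quality)
print(record.identifier, scores[:5], min(scores), max(scores))
```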
Phred Quality Scores
• Phred is an accurate base-caller used for capillary traces (Ewing et al Genome Research 1998)
• Each called base is given a quality score Q
• Quality based on simple metrics (such as peak spacing) calibrated against a database of hand-edited data
• QPhred = -10 * log10(estimated probability call is wrong)
Q30 often used as a threshold for useful sequence data
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
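The relationship Q = -10 * log10(P) and its inverse can be checked with a few lines of Python. This is a small sketch with illustrative function names; it reproduces the rows of the table above.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an estimated error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.2f} %")
```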
File Formats: FASTQ, SAM, BAM, VCF
• FASTQ is FASTA with quality scores added. Standard output format of NGS basecalling
• SAM and BAM are equivalent formats for describing alignments of reads to a reference genome
• SAM: text file
• BAM: compressed binary, indexed, so it is possible to access reads mapping to a segment without decompressing the entire file
• BAM is used by IGV and other software
• Current standard binary format containing:
• Meta information (read groups, algorithm details)
• Sequence and quality scores
• Alignment information
• VCF file: text file that lists all called variants (= differences to reference genome)
• Just FASTQ files
• Data mapped and assembled (vs. genome or exome? De-duplicated? Locally re-aligned? Indexed?)
• All of the above plus VCF file
• Annotation of variants against genes, exons, transcripts...
• Links to external resources
• Sequence alignments for visual inspection of variant calls
• Filtered and prioritised data
• Multi-genome analysis
*) Kevin Rose (born Robert Kevin Rose, February 21, 1977) is an American Internet entrepreneur
NGS Data Analysis: A rose is a rose is a rose
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
A_36_B100184 65 . T C 6.2 . DP=27;AF1=0.4999;CI95=0.5,0.5;DP4=7,12,5,3;MQ=44;FQ=8.65;PV4=0.4,4.2
A_36_B100224 48 . G A 225 . DP=80;AF1=0.5;CI95=0.5,0.5;DP4=32,4,38,3;MQ=56;FQ=8.65
A_36_B100255 42 . A C 22 . DP=32;AF1=0.5;CI95=0.5,0.5;DP4=23,2,4,3;MQ=20;FQ=25;PV4=0.057,1.9e-06,1,0.004
A_36_B100333 76 . G A 225 . DP=50;AF1=0.5;CI95=0.5,0.5;DP4=10,9,18,9;MQ=57;FQ=225;PV4=0.3...
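A minimal sketch of turning one VCF data line into named fields, using the A_36_B100224 record shown above. It assumes whitespace-separated columns with no spaces inside a field (VCF files are tab-delimited in practice) and only reads the first eight columns; `parse_vcf_line` and `parse_info` are illustrative names.

```python
from typing import Dict, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split 'KEY=VALUE;KEY=VALUE;FLAG' into a dictionary."""
    fields: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # bare keys are treated as flags
    return fields


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed VCF columns of one data line."""
    chrom, pos, variant_id, ref, alt, qual, filt, info = line.split()[:8]
    return {
        "CHROM": chrom, "POS": int(pos), "ID": variant_id,
        "REF": ref, "ALT": alt, "QUAL": float(qual),
        "FILTER": filt, "INFO": parse_info(info),
    }


line = "A_36_B100224 48 . G A 225 . DP=80;AF1=0.5;CI95=0.5,0.5;DP4=32,4,38,3;MQ=56;FQ=8.65"
variant = parse_vcf_line(line)
dp4 = [int(n) for n in variant["INFO"]["DP4"].split(",")]
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], dp4)
```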
Primary Analysis - Mapping and Alignment
[Figure: pipeline from raw sequence files (FASTQ format), through mapping (BWA/Bowtie) to raw alignment files (SAM/BAM format), then local realignment around InDels (GATK), duplicate marking (Picard) and quality score re-calibration (Picard), giving an analysis-ready alignment (SAM/BAM format)]
Why Mark Duplicates and Realign around Indels?
3 incorrect calls within 40bp!
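To illustrate the idea behind duplicate marking, here is a deliberately simplified sketch: reads that start at the same position on the same strand are treated as copies of one original fragment, and only the copy with the highest summed base quality is kept unmarked. This grouping rule and the `Read` structure are assumptions made for illustration; it is not how Picard implements the step.

```python
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int
    strand: str
    base_quality_sum: int


def mark_duplicates(reads: List[Read]) -> Dict[str, bool]:
    """Return {read name: is_duplicate} under the simplified rule above."""
    best: Dict[Tuple[str, int, str], Read] = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        current = best.get(key)
        # Keep the highest-quality read seen so far for each start/strand.
        if current is None or read.base_quality_sum > current.base_quality_sum:
            best[key] = read
    kept = {read.name for read in best.values()}
    return {read.name: read.name not in kept for read in reads}


reads = [
    Read("r1", "chr1", 100, "+", 900),
    Read("r2", "chr1", 100, "+", 1100),
    Read("r3", "chr1", 100, "-", 800),
    Read("r4", "chr1", 150, "+", 700),
]
print(mark_duplicates(reads))
```

Marked reads stay in the alignment file but are ignored by the variant caller, so an error copied during amplification is counted once rather than many times.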
NGS Variant Calling Methods
Option 1 - Hard filtering
Example: SNP can only be called if
• read depth >10
• >35% of reads carry SNP
Pros:
• Effective filtering
• Transparent to user
Cons:
• Simplistic approach
• Will miss high quality calls that don’t pass threshold
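The hard-filtering example above can be written as a single function. This is a minimal sketch using the two thresholds from the slide as defaults; the function and parameter names are illustrative.

```python
def passes_hard_filter(read_depth: int, variant_reads: int,
                       min_depth: int = 10, min_fraction: float = 0.35) -> bool:
    """Call a SNP only if depth > min_depth and more than min_fraction
    of the reads at that position carry the variant."""
    if read_depth <= min_depth:
        return False
    return variant_reads / read_depth > min_fraction


print(passes_hard_filter(read_depth=80, variant_reads=41))  # well supported
print(passes_hard_filter(read_depth=27, variant_reads=8))   # under 35% of reads
print(passes_hard_filter(read_depth=10, variant_reads=10))  # depth not above 10
```

The third case shows the weakness noted above: every read supports the variant, yet the call is rejected because the depth sits exactly on the threshold.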
Option 2 - Statistical analysis
Based on quality scores of individual basepairs, the alignment and statistical probability models
Pros:
• Robust
• Optimum balance of sensitivity and specificity due to the use of statistical models
• Fewer false positive and false negative SNP calls
Cons:
• Requires correctly pre-processed data with reliable quality scores
Base Quality Score Re-Calibration
Source: The Broad Institute, http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf
Before Recalibration After Recalibration
Primary Analysis – Raw data and assembly QC
[Figure: the mapping and alignment pipeline with two QC checks added: a sequence QC check on the raw FASTQ files (FastQC), producing a raw data QC report, and an alignment QC check (Picard), producing an alignment QC report]
Secondary Analysis: SNP and Indel calling, annotation and filtering
[Figure: analysis-ready alignment (SAM/BAM format), through the GATK Unified Genotyper to SNPs and InDels (VCF format), then OGT variant evaluation, giving a comprehensive interactive OGT report alongside the sequence QC and alignment QC reports]
Variant Evaluation
• Known variant?
• Impact on gene expression?
• Splicing affected?
• Non-synonymous or frameshift mutation?
• Impact on protein function?
• How confident are we in the call?
• Zygosity?
SNP/Indel Classification (standard analysis)
We check and annotate every single detected SNP and Indel against all human Ensembl genes and transcripts and dbSNP
dbSNP annotation:
• Is the variant known?
• Obtain allele frequency
Does it affect any of the following:
• Promoter region
• UTR
• Splice sites or intronic region
• CDS
• Synonymous mutation
• Non synonymous mutation
• Frameshift mutation
• Stop codon (truncated/elongated protein sequence)
• Overlap with protein domain
• Consequence on protein function predicted (SIFT & PolyPhen)
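A minimal sketch of the CDS part of this classification, assuming the standard genetic code and a variant already placed in its reading frame. It covers only single-codon substitutions and a length-based frameshift check; the function names are illustrative and this is not OGT's annotation pipeline, which works against Ensembl transcripts.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a substitution by comparing the encoded amino acids."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"  # truncated protein
    if ref_aa == "*":
        return "stop lost"    # elongated protein
    return "non-synonymous"


def is_frameshift(ref_allele: str, alt_allele: str) -> bool:
    """An indel shifts the frame when its length change is not a multiple of 3."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


print(classify_codon_change("GAA", "GAG"))
print(classify_codon_change("GAG", "GTG"))
print(classify_codon_change("TGG", "TGA"))
print(is_frameshift("A", "AT"), is_frameshift("A", "ATGC"))
```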
SIFT – SORTS INTOLERANT FROM TOLERANT MUTATIONS
SIFT predicts whether an amino acid substitution affects protein function based on
• sequence homology (phylogenetic conservation)
• the physical properties of amino acids
SIFT can be applied to naturally occurring non-synonymous polymorphisms and laboratory-induced mutations.
PolyPhen: Prediction of Functional Effect of nsSNPs
PolyPhen (=Polymorphism Phenotyping) is an automatic tool for prediction of possible impact of an amino acid substitution on the structure and function of a human protein. This prediction is based on straightforward empirical rules which are applied to the sequence, phylogenetic and structural information characterizing the substitution
OGT Processing Overview
• Individual Genome Analysis (Standard Level)
• Multi Genome Analysis, Data Gathering and Comparison (Advanced Level): perform pairwise genome analysis; filter out variants present in any “baseline” exome (e.g. somatic tissue, healthy sibling) and variants not present in all “case” exomes
• Tailored analysis based on client’s individual requirements (Expert Level): study specific additional in-depth filtering and analysis
[Figure: progression from data to information]
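The multi-genome filter described above maps naturally onto set operations: keep variants shared by every "case" exome, then drop anything seen in a "baseline" exome. A minimal sketch, assuming each variant is identified by (chromosome, position, ref, alt); the names and the toy variants are illustrative.

```python
from typing import List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), assumed key


def candidate_variants(case_exomes: List[Set[Variant]],
                       baseline_exomes: List[Set[Variant]]) -> Set[Variant]:
    """Variants present in all case exomes and in no baseline exome."""
    if not case_exomes:
        return set()
    shared_by_cases = set.intersection(*(set(e) for e in case_exomes))
    seen_in_baseline: Set[Variant] = set().union(*baseline_exomes)
    return shared_by_cases - seen_in_baseline


a = ("chr1", 1000, "G", "A")
b = ("chr2", 2000, "T", "C")
c = ("chr3", 3000, "A", "C")

cases = [{a, b, c}, {a, b}]   # two affected individuals
baselines = [{b}]             # e.g. a healthy sibling
print(candidate_variants(cases, baselines))
```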
NGS Data Delivery
• Data shipped on hard drive (or by FTP)
• Copy data to shared drive or local hard drive and browse (double click!)
• Comprehensive HTML analysis report
• File location & share results
Analysis Report: Summary Section
Analysis Report: QC Section – Read QC
Analysis Report: QC Section – Alignment QC
Analysis Section - Overview
The Variant Table View
• Filter interface
• Data display
• Data export
The Variant Table View – External Links
The Detailed Variant View
Predicted Consequences on Protein Function
Alignment View of Selected Variant in IGV
Interactive Data Filtering
Case Study: a published exome study
Multi exome study reveals causative mutation of monogenic disorder
[Figure: workflow divided into Standard Analysis and Advanced Analysis]
Analysis Report: Supplementary Section
Summary
OGT offers fast, accurate & powerful NGS analysis
Standard Analysis
• Robust statistical data analysis
• Comprehensive variant annotation
• Interactive filtering and prioritisation of data based on
• chromosomal region
• allele frequency / novelty
• zygosity
• confidence score
• severity of mutation
Advanced Analysis
• Multi-genome comparison
Bespoke analysis
• Tailored to your specific requirements
Let us help you with your workload
Please Enjoy Your Lunch!
Come and visit us at Booth #562
• Complete a survey for the chance to win a Kindle* eBook Reader
• Come to our wine reception tomorrow (Sunday) at 17:00 at our booth
*For full Terms and Conditions please visit www.ogt.co.uk/genefficiency/ESHGsurvey.html
Thank you
www.ogt.co.uk