ASHG Sequencing Workshop
TRANSCRIPT
Dr. Mike Evans — Chief Executive
A unique targeted sequencing service providing meaningful results, not insurmountable data
Outline of presentation
• Delivering a unique next generation sequencing service —Dr Mike Evans, CEO
• Optimised bait design for targeted sequencing — Dr Volker Brenner, Head of Computational Biology
• Adding value through analysis — Dr Volker Brenner, Head of Computational Biology
• Summary
• Q&A
OGT - provides advanced clinical genetics solutions - develops innovative molecular diagnostics
• Founded by Ed Southern in 1995
• 64 people
OGT Begbroke: Corporate offices and high-throughput labs
OGT Southern Centre: Biomarker discovery
IP Licensing: 40 licence relationships
Technologies for Molecular Medicine
Clinical and Genomic Solutions: cytogenetics products and genomic services
Diagnostic Biomarkers: genomic- and protein-based diagnostics
OGT’s key businesses
Clinical and Genomic Solutions
Addressing the challenges of high-throughput, high-resolution molecular technologies:
• High equipment and staff training costs
• Short equipment lifespan
• Complex study design and processes (e.g. platform evaluation & selection)
• Vast amounts of data
• Extensive computing infrastructure
• Data analysis expertise and resource
The solution: Genefficiency Genomic Services
High-quality data & complete reassurance
• Experimental and array design expertise
• High-throughput processing (>2000 samples / week)
• Applications: aCGH-CNV, methylation, miRNA, gene expression analysis
• Comprehensive data analysis services
• >40 QC checks on each sample to ensure high-quality data
Genefficiency™ — World’s leading aCGH service
Independent accreditations
• First Agilent High-Throughput Microarray Certified Service Provider
• ISO 9001:2008 — Quality management systems
• ISO 27001:2005 — Information security
• ISO 17025:2005 — aCGH Laboratory services
Certificate numbers: FS 561156, IS 561157, 4593
20,000 samples. 1,000 samples / week
“In order to characterise genetic variants, reproducible performance and reliable processing of the high resolution microarrays is essential. We were pleased with OGT’s responsive approach and attention to producing high quality data to tight deadlines”
Dr Matt Hurles, Wellcome Trust Sanger Institute
Customer satisfaction…
OGT collaborators and customers
A world-class team
Our expert team deliver:
• Excellent project management and customer service
  • >600 projects to date
  • >50,000 samples
• Unparalleled expertise in study and probe design
• Advanced data analysis through a dedicated team of bioinformaticians
• Rapid turnaround times
• A wealth of experience of clinical and translational research projects
New Genefficiency Targeted Sequencing Services
Delivering discovery
Genefficiency Targeted Sequencing Services — designed to be different:
• Comprehensive: taking you from genomic DNA to filtered, qualified results
• Rigorously designed: project and probe design expertise maximises your likelihood of discovery
• Expert support: experienced team of biologists and bioinformaticians
• Dedication to quality: from sample to result, delivering reliable results every time
Delivering an integrated, comprehensive service
27/10/2011
1. Selection of most appropriate genomic regions for enrichment
2. Capture, sample multiplexing and sequencing
3. Data analysis and advanced filtering of variants
Delivering expert project design
Step 1: Selection of most appropriate genomic regions for your project and budget
Whole exome: pre-designed, validated whole exome capture probes
• Coding regions are “most likely” candidates for many disorders
Custom genomic regions: expert custom design of capture probes for your regions of interest
• Flexibility to focus on regions of clinical significance or GWAS regions
Delivering class-leading technology
We have fully optimised the DNA capture and sequencing methodologies, so you don’t have to!
Step 2: Performing the capture, sample multiplexing, library preparation and sequencing
• Options for sample indexing and multiplexing to minimise sequencing cost
• Depth of sequencing coverage to suit your samples and project
• Paired-end sequencing on the industry-leading Illumina HiSeq 2000
OGT delivers discovery, not just data
Step 3: Data analysis and advanced filtering of variants
• OGT’s dedicated analysis pipeline brings you beyond data, to a filtered list of variants relevant to your study
SEQUENCE FILTER DISCOVER
Genefficiency Targeted Sequencing Services
The PLATFORM
• Core sequencing platform: Illumina HiSeq 2000
• Core sequence capture technology: Agilent SureSelect
The PEOPLE
• Team of highly skilled molecular biologists and bioinformaticians
• Core expertise in probe design
• Successful development of advanced analysis solutions
Optimised bait design for targeted sequencing: Dr Volker Brenner, Head of Computational Biology
Agenda
• Important Definitions and Terminologies
• Introduction to Targeted Enrichment
• Custom Bait Design
Definitions and terminologies
• Read length — The number of bases sequenced in a fragment
• Capture efficiency
• Paired end sequencing
• Read depth — How many times has a base been sequenced?
[Figure: Fragment 1 and Fragment 2 aligned against the region of interest, with on-target and off-target reads marked]
Read depth required for mutation detection
Assuming no allelic bias, the theoretical read depth required to detect heterozygous variation with a given accuracy can be calculated using a binomial distribution.
• Minimum capacity required = region of interest (ROI) x required depth
• Q30 variant detection for a 15Kb ROI requires 210Kb of sequencing capacity (15Kb x 14)
Calculations are based on the variation being seen in at least 2 reads
• Should not be just one read, as this could be ‘noise’
• Required observations could be a percentage of reads

Depth Required   Het. Call Accuracy   Probability of Error   Quality
11               99%                  1:100                  Q20
14               99.9%                1:1000                 Q30
18               99.99%               1:10000                Q40
25               99.999%              1:100000               Q50
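The depth figures above can be approximated with a short binomial calculation. This is a minimal sketch, assuming each read carries the variant allele with probability 0.5 (no allelic bias) and that a heterozygous call needs the variant in at least 2 reads; the function names are illustrative and are not taken from OGT's pipeline. Under these assumptions the calculation gives 11, 14 and 18 reads for Q20, Q30 and Q40. For Q50 it gives 22 rather than the 25 shown in the table, so the table may include extra margin or use a slightly different model.

```python
from math import comb


def miss_probability(depth, min_reads=2, allele_fraction=0.5):
    """Probability that fewer than min_reads of `depth` reads carry the variant."""
    return sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )


def min_depth(max_error, min_reads=2, allele_fraction=0.5):
    """Smallest read depth whose miss probability is at or below max_error."""
    depth = min_reads
    while miss_probability(depth, min_reads, allele_fraction) > max_error:
        depth += 1
    return depth


def required_capacity(roi_bases, depth):
    """Minimum sequencing capacity in bases: region of interest x required depth."""
    return roi_bases * depth


if __name__ == "__main__":
    for label, max_error in [("Q20", 1e-2), ("Q30", 1e-3), ("Q40", 1e-4)]:
        print(label, min_depth(max_error))
    # 15 Kb region of interest at the Q30 depth of 14
    print(required_capacity(15_000, 14))
```

The same functions can be rerun with a different `min_reads` or `allele_fraction` to explore how the required depth changes when allelic bias is suspected.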
Why use targeted enrichment?
Flexibility in choice of genomic loci
• Allows capture of specific regions of interest for SNP and Indel detection
Cost Effectiveness
• Ideal for clinical applications
• Specific candidate genes are targeted
• Fine mapping post-GWAS
• Cost benefits
• Enables multiplexing to fill capacity
Streamlined Data Analysis
• Reduced noise due to targeted specificity
Example of design bias: insufficient coverage
Targeted gene sequencing can leave some targets without the required depth of coverage.
[Figure: coverage across targets with the 14x (Q30) threshold marked; targets with inadequate coverage highlighted]
*data kindly provided by C. Mattocks National Genetics Reference Lab, Salisbury, UK
Option 1:
• Increase coverage by increasing depth of sequencing
• Coverage of all targets proportionally increased
• Increased cost of sequencing
• Some bases still missed
Solution: intelligent design to improve coverage
Option 2:
• Intelligent design of capture probes increases coverage of under-represented loci
• More even coverage of entire region, no loci missed (more likely to find mutations present)
• No need to increase sequence depth overall (more cost effective)
Problems facing users
• Design tools not user friendly
• Design tools only good for draft design
• Potential sources of bias
  • Regions of interest too short
  • Bait thermodynamic behaviour
    • GC content
    • Melting temperature
• Risk of design errors
• OGT’s extensive experience in designing probes for microarrays allows us to minimise bias and ensure evenness of coverage giving the best chance to identify mutations
OGT’s design pipeline — what we need from you
• Regions of Interest
  • Gene lists
  • Chromosomal locations
• Genome build version
• Data file format
  • Text, Excel, etc.
  • Consistent, e.g. chr1: 2247628-2248537
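A consistent region format matters because the regions are parsed by software before any bait is designed. The following is a minimal sketch of such a parser for the `chr1: 2247628-2248537` style shown above; the function names, the regular expression and the choice to treat coordinates as 1-based and inclusive are assumptions for illustration, not a description of OGT's tooling.

```python
import re

# Accepts "chr1: 2247628-2248537", with optional spaces and thousands separators.
REGION_PATTERN = re.compile(r"^(chr[0-9A-Za-z_]+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)$")


def parse_region(text):
    """Parse a region string into (chromosome, start, end)."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognised region format: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"Start is after end in region: {text!r}")
    return chrom, start, end


def region_length(start, end):
    """Length in bases, assuming 1-based inclusive coordinates."""
    return end - start + 1


if __name__ == "__main__":
    chrom, start, end = parse_region("chr1: 2247628-2248537")
    print(chrom, start, end, region_length(start, end))
```

Rejecting malformed lines early, rather than guessing, keeps inconsistent entries from silently dropping out of a design.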
Design pipeline steps: 1. Data, 2. Draft Design, 3. Singletons, 4. Thermodynamics, 5. Report
Run draft design
• Assess the output:
  • Coverage
  • Bait distribution
  • Repeat masking
[Figure: draft bait design across a region of interest, showing repeat masking]
Custom baits improve coverage at region boundaries
OGT custom bait design gives increased read depth around the edges of target regions.
[Figure: read depth across a target region, 1KG compared with OGT]
• This ensures that small regions are captured as well as large regions
• Advantage: improves evenness of capture across the design
Correction for singleton baits
• Review the draft design and identify any regions covered by a single bait
  • These regions span less than 120 bases
• Add additional singleton baits to the design
[Figure: bait coverage before and after adding singleton baits]
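The singleton review step can be sketched as a simple length check. This is an illustration only: it assumes regions are given as (chromosome, start, end) tuples with 1-based inclusive coordinates, and it takes the 120-base cut-off from the slide as the point below which a region is covered by a single bait. The function name is hypothetical.

```python
SINGLE_BAIT_SPAN = 120  # cut-off in bases, taken from the slide text


def find_singleton_regions(regions, max_span=SINGLE_BAIT_SPAN):
    """Return the regions that span fewer than max_span bases."""
    singletons = []
    for chrom, start, end in regions:
        length = end - start + 1  # assumes 1-based inclusive coordinates
        if length < max_span:
            singletons.append((chrom, start, end))
    return singletons


if __name__ == "__main__":
    regions = [
        ("chr1", 1000, 1099),  # 100 bases
        ("chr1", 5000, 5500),  # 501 bases
        ("chr2", 200, 318),    # 119 bases
    ]
    for region in find_singleton_regions(regions):
        print("needs additional baits:", region)
```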
Custom approach ensures variant detection
Even at more than 50x coverage, whole exome sequencing does not accurately identify all SNPs.
[Figure: OGT custom bait design compared with 1000 Genomes (1KG) whole exome capture data]
Correction for bait thermodynamics
GC content
• Calculate GC content for all baits
• Identify those baits where GC content is extreme (for instance >65% or <40%)
• Add additional copies of these baits
Tm
• Calculate the Tm for all baits
• Identify those baits where Tm is extreme (e.g. >75 °C)
• Add additional copies of these baits
[Figure: region of interest with GC-extreme and Tm-extreme baits marked]
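The GC check described above can be sketched in a few lines. This is a minimal example, assuming baits are held as a name-to-sequence mapping and using the example thresholds from the slide (below 40% or above 65%); the function names and bait sequences are made up for illustration. Tm is not computed here because the slides do not say which melting-temperature model is used.

```python
def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("Empty bait sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_gc_extremes(baits, low=0.40, high=0.65):
    """Return {bait name: GC fraction} for baits outside the low..high range."""
    flagged = {}
    for name, sequence in baits.items():
        gc = gc_fraction(sequence)
        if gc < low or gc > high:
            flagged[name] = gc
    return flagged


if __name__ == "__main__":
    # Short made-up sequences; real baits would be much longer.
    baits = {
        "at_rich": "ATATATATAT",
        "balanced": "ATGCATGCAT",
        "gc_rich": "GCGCGCGCAT",
    }
    for name, gc in flag_gc_extremes(baits).items():
        print(f"{name}: GC {gc:.0%}, candidate for additional copies")
```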
OGT custom bait designs help overcome GC issues
In a region with 70% GC content, OGT custom bait design achieved a maximum read depth of 50x. The Agilent SureSelect 50Mb capture kit does not capture any reads in this region.
[Figure: read depth in the region, OGT compared with SureSelect]
Relative capture of targets within a single gene. Agilent coverage is 20x for the target with no GC content bias, and minimal for targets with a GC content of 65%. In contrast, OGT custom baits perform excellently in this region.
[Figure: relative capture of targets within a single gene, OGT compared with SureSelect]
Customer report
• Design Parameters
  • Depth of coverage
  • On target / off target
  • Regions not covered, and why not
• Bait Details
  • Singletons
  • GC distribution
  • Tm distribution
• Library Design
  • Baits generated
Summary
• Custom design of regions for targeted sequencing offers significant flexibility for many applications
• Expert probe design will ensure:
  • Better ‘evenness’ of coverage, which helps ensure no regions are missed and maximises the likelihood of variant detection
  • Improved overall capture efficiency and on-target performance, which makes downstream sequencing more cost effective
  • Increased capture efficiency of SNPs and Indels, which increases the likelihood of detection
  • Reduction of risk and better performance
Adding value through analysis
• Introduction
• NGS data analysis
  • Primary analysis
    • Mapping and assembly
    • Q score re-calibration
    • NGS sequencing QC
    • NGS alignment QC
  • Secondary analysis
    • SNP and Indel calling
    • Annotation and evaluation pipeline
    • SIFT and PolyPhen
• Deliverables
• Case study
• Summary
The analysis challenge
[Figure: from sequencer to publication. The sequencer delivers a hard drive with ~4Gb per exome; the NGS raw data then passes through mapping, annotation, filtering and reporting before publication]
Raw data: FASTQ (standard text representation of short reads)
FASTQ uses four lines per sequence.
• Line 1: '@' followed by a sequence identifier
• Line 2: raw sequence letters
• Line 3: '+' (and optional sequence identifier)
• Line 4: quality values for the sequence in Line 2. Must contain the same number of symbols as letters in the sequence. (The letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126)
Example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
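The four-line layout can be read with very little code. The sketch below parses the example record and decodes its quality string using the ASCII offset of 33 described above; the function name is illustrative, and real projects would normally use an established FASTQ parser rather than this minimal version.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (identifier, sequence, Phred scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("Line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("Line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("Quality string and sequence differ in length")
    # Phred+33 encoding: score = ASCII code of the character minus 33
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


if __name__ == "__main__":
    record = [
        "@SEQ_ID",
        "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
        "+",
        "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
    ]
    seq_id, sequence, scores = parse_fastq_record(record)
    print(seq_id, len(sequence), scores[:5])
```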
Phred quality scores
• Phred is an accurate base-caller used for capillary traces (Ewing et al Genome Research 1998)
• Each called base is given a quality score Q
• Quality based on simple metrics (such as peak spacing) calibrated against a database of hand-edited data
• QPhred = -10 * log10(estimated probability call is wrong)
Q30 often used as a threshold for useful sequence data
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
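The table follows directly from the QPhred formula. A minimal sketch of the conversion in both directions is shown below; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f} %")
```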
Primary analysis — Mapping and alignment
[Flowchart: Raw sequence files (FASTQ format) → Mapping (BWA/Bowtie) → Raw alignment files (SAM/BAM format) → Local realignment around InDels → Duplicate marking → Quality score recalibration → Analysis-ready alignment (SAM/BAM format). Tools shown for the later steps: GATK, Picard]
Why mark duplicates and realign around indels?
3 incorrect calls within 40bp!
NGS variant calling methods
Option 1 - Hard filtering
Example: SNP can only be called if
• read depth >10
• >35% of reads carry SNP
+ Effective filtering
+ Transparent to user
- Simplistic approach
- Will miss high quality calls that don’t pass threshold
Option 2 - Statistical analysis
Based on quality scores of individual base pairs, the alignment and statistical probability models
+ Robust
+ Optimum balance of sensitivity and specificity due to the use of statistical models
+ Fewer false positive and false negative SNP calls
- Requires correctly pre-processed data with reliable quality scores
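The hard-filtering rule in Option 1 can be written as a single function, which also makes its main weakness visible. This is a sketch using the example thresholds from the slide (read depth above 10, more than 35% of reads carrying the SNP); the function name and the example counts are illustrative.

```python
def passes_hard_filter(read_depth, variant_reads, min_depth=10, min_fraction=0.35):
    """Apply the example rule: depth above min_depth and variant fraction above min_fraction."""
    if read_depth <= min_depth:
        return False
    return variant_reads / read_depth > min_fraction


if __name__ == "__main__":
    # (read depth, reads carrying the SNP)
    candidates = [(20, 8), (30, 10), (10, 5)]
    for depth, variant_reads in candidates:
        print(depth, variant_reads, passes_hard_filter(depth, variant_reads))
```

The last candidate has half of its reads supporting the variant but is rejected on depth alone, which is the kind of call a fixed threshold will miss regardless of base quality.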
Base quality score re-calibration
Source: The Broad Institute, http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf
[Figure: base quality scores before recalibration and after recalibration]
Primary analysis — Raw data and assembly QC
[Flowchart: Raw sequence files (FASTQ format) → Mapping (BWA/Bowtie) → Raw alignment files (SAM/BAM format) → Local realignment around InDels → Duplicate marking → Quality score recalibration → Analysis-ready alignment (SAM/BAM format). A sequence QC check (FastQC) on the raw sequence files produces the raw data QC report; an alignment QC check (Picard) on the alignment produces the alignment QC report]
Secondary analysis: SNP and Indel calling, annotation and filtering
[Flowchart: Analysis-ready alignment (SAM/BAM format) → GATK Unified Genotyper → SNPs and InDels (VCF format) → OGT variant evaluation → comprehensive interactive OGT report, alongside the alignment QC report and sequence QC report]
Variant Evaluation
• Known variant?
• Impact on gene expression?
• Splicing affected?
• Non-synonymous or frameshift mutation?
• Impact on protein function?
• How confident are we in the call?
• Zygosity?
SNP/Indel classification (standard analysis)
We check and annotate every single detected SNP and Indel against all human Ensembl genes and transcripts and dbSNP
dbSNP annotation:
• Is the variant known?
• Obtain allele frequency
Does it affect any of the following?
• Promoter region
• UTR
• Splice sites or intronic region
• CDS
  • Synonymous mutation
  • Non synonymous mutation
  • Frameshift mutation
  • Stop codon (truncated/elongated protein sequence)
  • Overlap with protein domain
  • Consequence on protein function predicted (SIFT & PolyPhen)
OGT Processing Overview
[Flowchart, summarised as lists]
Gather all detected SNPs/Indels and group them:
• Not described in dbSNP
• Mapped to promoter regions
• Mapped to exons, splice sites or UTRs, and protein domains
• Non-synonymous coding variations
• Variations with serious consequences to the protein sequence (SIFT)
• Described in dbSNP: rare RS ID variations
For each group:
• Perform pairwise genome analysis
• Filter out variants present in “baseline” genome (e.g. somatic tissue, healthy sibling)
• Additional filtering and analysis
Analysis levels, moving from data to information:
• Individual genome analysis (Standard Level)
• Multi-genome analysis, data gathering and comparison (Advanced Level)
• Tailored analysis based on client’s individual requirements (Expert Level)
Multi-genome filtering steps:
• Perform pairwise genome analysis
• Filter out variants that are present in any “baseline” exome (e.g. somatic tissue, healthy sibling) and variants that are not present in all “case” exomes
• Study-specific additional in-depth filtering and analysis
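The multi-genome filter can be expressed with set operations. This sketch assumes one reading of the flowchart: a variant is kept only if it appears in every "case" exome and in no "baseline" exome. Variants are represented as hypothetical (chromosome, position, reference, alternate) tuples, and the function name and example data are made up for illustration.

```python
def filter_candidate_variants(case_exomes, baseline_exomes):
    """Keep variants found in all case exomes and absent from every baseline exome."""
    if not case_exomes:
        raise ValueError("At least one case exome is required")
    shared_by_cases = set.intersection(*(set(variants) for variants in case_exomes))
    seen_in_baseline = set().union(*(set(variants) for variants in baseline_exomes))
    return shared_by_cases - seen_in_baseline


if __name__ == "__main__":
    # Hypothetical variants, not real data.
    variant_a = ("chr1", 1000, "C", "T")
    variant_b = ("chr2", 2000, "G", "A")
    variant_c = ("chr3", 3000, "A", "G")
    variant_d = ("chr4", 4000, "T", "C")

    patient_1 = {variant_a, variant_b, variant_c}
    patient_2 = {variant_a, variant_b, variant_d}
    healthy_sibling = {variant_b}

    print(filter_candidate_variants([patient_1, patient_2], [healthy_sibling]))
```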
NGS data delivery
[Figure: data is shipped on a hard drive (or by FTP); a double click opens the comprehensive HTML analysis report, from which you can choose a file location and share results]
Analysis report: Summary section
Analysis report: QC section — Read QC
Analysis report: QC section — Alignment QC
Analysis section — Overview
The Variant Table View
Data display
Data export
The Variant Table View — External links
The Detailed Variant View
Predicted consequences on protein function
Alignment View of selected variant in IGV
OGT data processing ensures detection of insertions
Detection of a 31bp insertion
OGT data processing ensures detection of deletions: Example 1
Detection of an 84bp deletion
Detection of homozygous and heterozygous deletions
Heterozygous deletion
Homozygous deletion
No deletion (reference sequence)
Interactive data filtering
Customer data: Analysis of consanguineous samples
[Figure: pedigree showing generations I and II, with individuals numbered 1 and 2 in each generation]
Data courtesy of Dr. Bernd Wollnik, Institute of Human Genetics, University Hospital of Cologne
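Confirmation by Sanger sequencing
HACE1, exon 11, c.994C>T, R332X (CGA -> TGA)
[Figure: HACE1 protein domains ANK1 (69-161), ANK1 (168-258) and HECT (602-909), with R332X marked]
[Figure: Sanger traces for control, mother, father, patient 1 and patient 2 across residues H V F R I G P, with X at the mutated position]
Data courtesy of Dr. Bernd Wollnik, Institute of Human Genetics, University Hospital of Cologne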
Customer feedback...
Analysis of Consanguineous Samples
“Just wanted to let you know that we have probably identified the
causative gene and mutation in the patient sample.
The mutation is located in the middle of an 18 Mb homozygous
stretch and is a homozygous nonsense mutation!!!
Wow, its going so nicely with your data!!!”
Dr. Bernd Wollnik, Institute of Human Genetics, University Hospital of Cologne
Summary: OGT offers fast, accurate & powerful NGS analysis
Standard Analysis
• Robust statistical data analysis
• Comprehensive variant annotation
• Interactive filtering and prioritisation of data based on
  • chromosomal region
  • allele frequency / novelty
  • zygosity
  • confidence score and read depth
  • severity of mutation
Advanced Analysis
• Multi-genome comparison
Bespoke analysis
• Tailored to your specific requirements
Speak to one of our team or visit booth 713 to:
• Book a demonstration of our interactive analysis report — Hurry limited availability
• Discuss your specific project requirements
• Take part in our short survey and have your chance to win an Amazon Kindle
Thank you
www.ogt.co.uk