![Page 1: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/1.jpg)
Data analysis pipelines for NGS applications
Sergi Beltran Agulló
VHIR-CNAG Course, 11th February 2015
![Page 2: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/2.jpg)
Full Traceability of CNAG’s Workflow (ISO 9001)
BIOREPOSITORY, LABORATORY, SEQUENCING, QC, ANALYSIS, TRANSFER
LIMS
![Page 3: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/3.jpg)
Sequencing Platform (M. Gut)
Library elements (figure labels): P5, P7, Index, SP read 1, SP read 2, DNA insert
Applications: WG_Seq, WG_BS_Seq, mRNA_Seq, smallRNA_Seq, Target capture, ChIP_Seq …and many more
![Page 4: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/4.jpg)
cBot: automatic reagent dispenser
Flow cell: glass slide with a lawn of oligonucleotides
![Page 5: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/5.jpg)
Flow cell: glass slide with a lawn of oligonucleotides and sequencing library
HiSeq2000 – the sequencer
![Page 6: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/6.jpg)
Sequencing-by-synthesis (SBS)
[Figure: template strand and growing strand, with 5’ and 3’ ends labelled]
First base incorporated
Cycle 1: Add sequencing reagents
Remove unincorporated bases
Detect signal
Cycle 2-n: Add sequencing reagents and repeat
All four labelled nucleotides in one reaction
High accuracy
Base-by-base sequencing
No problems with homopolymer repeats
![Page 7: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/7.jpg)
100 microns
![Page 8: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/8.jpg)
Sequencing Output: FASTQ files
- Developed by the Wellcome Trust Sanger Institute
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
- Usually, each sequence (read) is split into 4 rows
- Sequence identifiers, description and quality encoding can be different
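The four-row layout above can be read with a few lines of Python. This is a minimal sketch, not the parser used at CNAG: it assumes exactly 4 rows per read and the Phred+33 quality encoding (the `offset` parameter is an assumption, since the slide notes that encodings can differ). The `@SEQ_ID` header comes from the slide; the shortened read and quality string are made up for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # header text after the leading '@'
    sequence: str          # base calls
    qualities: List[int]   # one Phred score per base


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text that uses exactly 4 rows per read."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty rows")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes a Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical shortened record (not the full read shown on the slide).
example = ["@SEQ_ID", "GATTA", "+", "!'*(I"]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].qualities)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly one expected error in 10,000 base calls.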
![Page 9: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/9.jpg)
Data Analysis Pipelines at CNAG
CNAG-LIMS
SEQUENCING, FASTQ, MAPPING, QUALITY CTRL, MAP/BAM
- VARIANT CALLING
- STRUCTURAL VARIANTS
- RNA-Seq QUANTIFICATION
- ASSEMBLY & METAGENOMICS
- BISULFITE
- SHAPE-BASED 3D MODELLING OF RNA SEQUENCES
- Hi-C TO 3D MODELS OF GENOMIC DOMAINS AND GENOMES
![Page 10: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/10.jpg)
Data Analysis Pipelines at CNAG
![Page 11: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/11.jpg)
Assembly: definition
Aligning and merging fragments of DNA in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, only small pieces of between 20 and 1000 bases. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts (ESTs). (adapted from Wikipedia)
Adapted from Li-Jun Ma; Natalie D. Fedorova. Mycology, pages 9 - 24
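To make the idea of "aligning and merging" reads concrete, here is a toy greedy overlap assembler. It is only an illustrative sketch under strong assumptions (error-free reads, a single strand, exact suffix/prefix overlaps) and is not the CNAG de novo assembly pipeline; the function names, the `min_overlap` parameter and the reads are all made up.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, -1, -1)  # (overlap size, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads sampled from the made-up sequence ATGGCGTACCTGA.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers have to cope with sequencing errors, repeats and both strands, which is why they rely on overlap or de Bruijn graphs rather than this quadratic pairwise search.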
![Page 12: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/12.jpg)
CNAG de novo assembly pipeline
Assembly and Annotation Team (T. Alioto)
Removed
![Page 13: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/13.jpg)
CNAG genome projects
Assembly and Annotation Team (T. Alioto)
Removed
![Page 14: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/14.jpg)
Assembly: Metagenomics
http://wiki.biomine.skelleftea.se
![Page 15: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/15.jpg)
Clinical Applications: Human Microbiome Project
![Page 16: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/16.jpg)
Data Analysis Pipelines at CNAG
![Page 17: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/17.jpg)
Mapping to reference genome
Adapted from Wikipedia
100bp read 100bp read
![Page 18: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/18.jpg)
![Page 19: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/19.jpg)
Adapted from Wikipedia
100bp read 100bp read 100bp read 100bp read
Mapping to reference genome
![Page 20: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/20.jpg)
Mapping: Exome sequence example
Alignments are stored in a BAM file, which is the binary version of the SAM (Sequence Alignment/Map) format
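Since BAM is the binary form of SAM, the text form shows what each alignment record holds. The sketch below parses one tab-separated SAM alignment line using only the standard library; it reads a subset of the 11 mandatory fields and decodes two FLAG bits (0x4 unmapped, 0x10 reverse strand). The class name, function name and the example line are made up for illustration; in practice a library such as pysam or the samtools command-line tool would normally be used instead.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int          # 1-based leftmost mapping position
    mapping_quality: int
    cigar: str
    sequence: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamAlignment(
        read_name=fields[0], flag=int(fields[1]), reference=fields[2],
        position=int(fields[3]), mapping_quality=int(fields[4]),
        cigar=fields[5], sequence=fields[9],
    )


# Made-up alignment line for illustration only.
line = "read1\t16\tchr6\t1000\t60\t5M\t*\t0\t0\tGATTA\tIIIII"
aln = parse_sam_line(line)
print(aln.reference, aln.position, aln.is_reverse)
```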
![Page 21: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/21.jpg)
Exome sequencing metrics
Removed
![Page 22: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/22.jpg)
Variant Calling
Identification of genetic differences in comparison to a reference (strict definition)
- Designs: Pedigree, trio, group, somatic
Removed
![Page 23: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/23.jpg)
CNAG’s Variant Calling Pipeline
J. Camps, S. Derdak, S. Laurie, E. Serra, R. Tonda, JR Trotta, S Beltran
Removed
![Page 24: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/24.jpg)
CNAG’s Variant Calling Pipeline: Sensitive and Precise
S. Derdak, A. Kanterakis, S. Laurie, E. Serra, R. Tonda, S Beltran
- NA12878 50x Whole Genome FASTQs from Illumina Platinum Genomes analyzed with the pipeline: http://www.illumina.com/platinumgenomes/
- Results compared independently for SNPs and INDELs against the NIST reference set: Zook et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014 Mar;32(3):246-51.
- Results (on callable region):
Removed
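The results table is not available in this transcript. As a reminder of what "sensitive" and "precise" measure when a call set is compared against a reference set, here is a minimal sketch. The variant tuples below are placeholders invented for illustration; they are NOT the pipeline's results, and real comparisons also need variant normalization and restriction to the callable region.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of reference-set variants that were called."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of called variants that are in the reference set."""
    return true_positives / (true_positives + false_positives)


# Placeholder (chrom, pos, ref, alt) tuples; NOT results from the slide.
reference_set = {
    ("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"), ("chr2", 75, "T", "C"),
}
called_set = {
    ("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"), ("chr3", 10, "A", "T"),
}

tp = len(called_set & reference_set)  # called and in the reference set
fp = len(called_set - reference_set)  # called but not in the reference set
fn = len(reference_set - called_set)  # in the reference set but missed

print(sensitivity(tp, fn), precision(tp, fp))
```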
![Page 25: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/25.jpg)
Results: PDF Report
Removed
![Page 26: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/26.jpg)
Results: Relevant Fields in gVCF and Excel
- General (6): Chrom, Pos, Ref, Alt, RS, GMAF
- Call Specific (5): Genotype, Genotype Quality, Depth, GT Probability Likelihood, Strand Bias
- Functional Annotation (12): Gene Name, Coding/Non-coding, Transcript Biotype, Variant Effect, Variant Effect Impact, Functional Class, Codon Change, Amino Acid affected, Transcript ID, Transcript Length, Exon rank in transcript
- Effect Prediction (12): Sift Prediction & Score, Polyphen2 HDIV Prediction and score, Polyphen2 HVAR Prediction and score, Mutation Taster Prediction and score, Phylop Score, Gerp++ Score, SiPhy 29 Mammal Score, CADD Score
- Control Populations (6): ESP6500 European-Americans, ESP6500 African-Americans, 1000GP-phase 1 Europeans, 1000GP-phase 1 Africans, 1000GP-phase 1 Asians, ExAC
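Several of the general and call-specific fields map directly onto columns of a standard VCF data line. The sketch below assumes a single-sample VCF with the usual fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) followed by FORMAT and one sample column; the function name, the output keys and the example record are made up, and the annotation fields produced by CNAG's pipeline are not reproduced here.

```python
from typing import Dict


def parse_vcf_line(line: str) -> Dict[str, str]:
    """Split one single-sample VCF data line into a few named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        raise ValueError("expected 8 fixed columns plus FORMAT and one sample")
    chrom, pos, rs_id, ref, alt = cols[:5]
    # FORMAT names the per-sample keys (e.g. GT, GQ, DP);
    # the sample column holds the values in the same order.
    call = dict(zip(cols[8].split(":"), cols[9].split(":")))
    return {
        "Chrom": chrom, "Pos": pos, "RS": rs_id, "Ref": ref, "Alt": alt,
        "Genotype": call.get("GT", "."),
        "Genotype Quality": call.get("GQ", "."),
        "Depth": call.get("DP", "."),
    }


# Made-up record for illustration only.
line = "chr6\t12345\trs0000\tA\tG\t50\tPASS\t.\tGT:GQ:DP\t0/1:99:32"
fields = parse_vcf_line(line)
print(fields)
```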
![Page 27: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/27.jpg)
Results: Secondary data analysis
[Figure: inter-mutational distance vs. mutation number (All chr)]
[Figure: normalized copy number vs. chromosomal position (Chr 6)]
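The inter-mutational distance plotted on this slide can be computed from a list of mutation positions. This is a minimal sketch, assuming the distance is taken between consecutive mutations on the same chromosome after sorting by position; the function name and the example positions are made up, and the actual secondary analysis behind the figure may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def inter_mutational_distances(mutations: Iterable[Tuple[str, int]]) -> List[int]:
    """Distance in bases from each mutation to the previous one on its chromosome."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    distances: List[int] = []
    for chrom in by_chrom:  # chromosomes in order of first appearance
        positions = sorted(by_chrom[chrom])
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


# Hypothetical (chromosome, position) pairs for illustration only.
mutations = [("chr6", 1500), ("chr6", 1000), ("chr6", 1010),
             ("chr7", 300), ("chr7", 900)]
distances = inter_mutational_distances(mutations)
print(distances)
```

Plotting these distances on a log scale against the mutation index makes clusters of closely spaced mutations stand out as dips.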
![Page 28: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/28.jpg)
Removed
![Page 29: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/29.jpg)
Examples in Rare Diseases
![Page 30: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/30.jpg)
RD-Connect: Integration and Sharing
- WP1: Coordination. Hanns Lochmüller (Newcastle and TREAT-NMD)
- WP2: Patient registries. Domenica Taruscio (ISS and EPIRARE)
- WP3: Biobanks. Lucia Monaco (Fondaz. Telethon & EuroBioBank)
- WP4: Bioinformatics. Ivo Gut (CNAG Barcelona)
- WP5: Unified platform. Christophe Béroud (INSERM Marseille)
- WP6: Ethical/legal/social. Mats Hansson (Uppsala)
- WP7: Impact/Innovation. Kate Bushby (Newcastle and EUCERD/ EJARD)
![Page 31: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/31.jpg)
Removed
![Page 32: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/32.jpg)
Data Analysis Pipelines at CNAG
![Page 33: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/33.jpg)
RNA-Seq Differential Expression
[Figure panels: Microarrays, RNA-seq]
Nature Methods 8, 469-477 (2011)
![Page 34: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/34.jpg)
RNA-Seq Analysis Pipeline
A. Esteve, S. Heath
![Page 35: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/35.jpg)
A. Esteve, S. Heath
![Page 36: Data analysis pipelines for NGS applications](https://reader030.vdocuments.site/reader030/viewer/2022032420/55a516e01a28abd77f8b4691/html5/thumbnails/36.jpg)
Summary
- NGS has multiple applications, usually with higher precision than microarrays.
- NGS has direct clinical applicability.
- Sequencing can greatly speed up research and diagnostics.
- Analysis is far from being standardized, but results are already very accurate.
- The CNAG offers full collaborations (from experiment design to user-friendly analysed results).