NGS, Cancer and Bioinformatics: Exome sequencing followed by variant calling
29 January 2015, NGS & Cancer training: Exome analyses
Overview of exome analysis
Galaxy Workflow
Inputs: Reads (FASTQ) ; Reference genome (FASTA) ; Target regions (BED)
1. Conversion to Galaxy format: Groomer
2. Quality control: FastQC
3. Mapping: Bowtie2
4. PCR duplicates marking: MarkDup
5. Target intersection: Intersect BAM
6. Preprocess GATK part 1: local realignment around indels
7. Preprocess GATK part 2: Base Quality Score Recalibration
Output: aligned and preprocessed reads (BAM)
- Marked PCR duplicates
- Intersected on target regions
- Realigned around indels
- Recalibrated
7 - 9 April 2014, training “NGS & Cancer: Genomic Variant Analysis”
Public dataset
• Accessible online on SRA (Sequence Read Archive): ERA148528
Exome sequencing of 2 samples: tumor (lung cancer) and blood (normal sample)
Publication: Ys et al., Genome Res. 2012 Mar;22(3):436-45
• 100 bp paired-end reads, Illumina HiSeq 2000
• Mean depth higher for the tumor sample (~100X) than for the normal sample (~30X), to detect somatic variants with a low allelic frequency
• Aligned exome size: ~15 GB tumor ; ~7 GB blood
• Complete analysis processing time: ~20 h
The analysis is restricted to a few regions (~112 kb) in order to limit the processing time
Select libraries on Galaxy
1. Open your web browser and go to « https://galaxy.gustaveroussy.fr/galaxyprod »
2. In the top menu, click on « Shared Data » then « Data libraries »
3. Click on « [FORMATION] Input Data » then « EXOME »
4. Select « tumor_R1.fastq » ; « tumor_R2.fastq » ; « exome_regions.bed » ; « known_sites_regions.vcf » then click on « Go »
Select libraries on Galaxy
1. Open your web browser and go to « http://galaxy.sb-roscoff.fr »
2. In the top menu, click on « Shared Data » then « Data libraries »
3. Click on « canceropole-tp-input »
4. Select « tumor_R1.fastq » ; « tumor_R2.fastq » ; « exome_regions.bed » ; « known_sites_regions.vcf » then click on « Go »
FASTQ format conversion
1. Rename your history to « Tumor » by clicking on « Unnamed history »
2. In the left panel, click on the « search tools » textbox, enter « FASTQ Groomer », then click on the tool to convert both your FASTQ files into FASTQ Sanger format
3. Click on « Execute » to launch the conversion
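To make the target of the conversion concrete: in FASTQ Sanger format, each base quality is stored as one ASCII character with an offset of 33 (Phred+33), whereas some older Illumina pipelines used an offset of 64. The following is a minimal Python sketch of that encoding. The function names are illustrative assumptions and are not part of FASTQ Groomer.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


def to_sanger(quality_string, source_offset=64):
    """Re-encode a quality string from another offset (e.g. Phred+64) to Phred+33."""
    return "".join(chr(ord(char) - source_offset + 33) for char in quality_string)


# Illustrative quality string (not taken from the training dataset)
print(decode_qualities("II5#"))
print(error_probability(20))
print(to_sanger("hT"))
```

Reading a Phred+64 file as if it were Phred+33 would inflate every quality by 31, which is why the conversion is done before any quality control.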
FASTQC : FASTQ Quality Control
1. In the left panel, click on the « search tools » textbox and enter « FASTQC: Read QC »
2. Select the FASTQ Groomer dataset and click on « Execute »; repeat for both reads
The result of FASTQC is an HTML page that you can view by clicking on the eye icon
GENERAL TIP : RENAME YOUR HISTORY ITEMS FREQUENTLY TO BE MORE EXPLICIT THAN « on data xxx » !
FastQC Metrics
• Look at the different metrics for both reads
• Problem: the per-base sequence quality of Read 2 is quite low towards the end
Solution: trim 25 bp from the 3’ end of the reads, for higher confidence in the sequenced information
FASTQ Trimmer
1. Use « FASTQ Trimmer » to cut off 25 bp from the 3’ end of the Groomed Reads (use the « search tools » textbox to find the tool)
2. Run « FASTQC » on the trimmed reads
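The trimming step amounts to removing the same number of characters from the end of both the sequence line and the quality line of each FASTQ record. A minimal Python sketch, assuming reads are already held as strings; the function name and the example read are hypothetical, and this is not the FASTQ Trimmer implementation.

```python
def trim_three_prime(sequence, qualities, n_bases=25):
    """Remove n_bases from the 3' end of a read and of its quality string."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality string must have the same length")
    keep = max(len(sequence) - n_bases, 0)
    return sequence[:keep], qualities[:keep]


# Hypothetical 100 bp read whose last 25 bases have low quality ('#' = Q2 in Phred+33)
read = "ACGT" * 25
quals = "I" * 75 + "#" * 25
trimmed_seq, trimmed_quals = trim_three_prime(read, quals)
print(len(trimmed_seq), set(trimmed_quals))
```

A fixed-length trim is the simplest choice; it discards good bases in reads whose 3’ end happens to be fine.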
If the distribution of bad-quality sequences/bases is more complex, use a more elaborate trimming step
12 - 14 November 2014, NGS & Cancer training: RNA-Seq analyses
Mapping with Bowtie2
1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome
Preset option: combination of parameters designed to have a good tradeoff between speed, sensitivity and accuracy
SAM/BAM aligned format
• SAM format: aligned format, human readable
@SQ SN:chr12 LN:133851895
@RG ID:Sample_ID LB:Sample_Library PL:ILLUMINA SM:Sample_Name PU:Platform_Unit
ERR166338.1 99 chr12 82670685 23 101M = 82670850 266
GCCCCTGGGGATGTTTTGCACCAAGCCACTGTCTCCAGCTGG
BBC@GIIHGCFCIEHEAIEIFFGEONDNJFINIONHNGJNNNNKNJN
RG:Z:Sample_ID XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z
Fields of an alignment line, in order: read name ; flag ; chromosome ; 5’ position ; MAPQ ; CIGAR ; chromosome of the mate (« = » when identical) ; 5’ position of the mate ; insert size ; sequence ; base qualities ; tags (including the read group affiliation, RG)
• BAM format: binary SAM format (not human readable, but compressed = smaller)
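The field layout above can be read programmatically. The sketch below splits one tab-separated SAM alignment line into its 11 mandatory fields and decodes the flag into its individual bits, as defined in the SAM specification. The example record reuses the values shown on the slide with a shortened sequence and quality string; the function and dictionary key names are illustrative assumptions.

```python
# Meaning of each bit of the SAM flag (SAM specification)
SAM_FLAGS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}

FIELD_NAMES = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one alignment line into named mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIELD_NAMES, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]
    return record


def decode_flag(flag):
    """List the properties encoded in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# Values from the slide; sequence and qualities are shortened for readability
example_fields = ["ERR166338.1", "99", "chr12", "82670685", "23", "101M", "=",
                  "82670850", "266", "GCCCCTGGGG", "BBC@GIIHGC",
                  "RG:Z:Sample_ID", "NM:i:0"]
record = parse_sam_line("\t".join(example_fields))
print(record["rname"], record["pos"], decode_flag(record["flag"]))
```

Flag 99 is the sum 1 + 2 + 32 + 64, so the read on the slide is paired, properly paired, first in its pair, and its mate maps to the reverse strand.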
Mapping statistics (directly from the mapper)
Mapping statistics
• Use « Flagstat » from « Samtools » to see some mapping statistics
• % of mapped reads
• Properly paired reads:
- 0 <= insert size <= max size
- Reads on the same chromosome
- Reads facing each other
- Both reads are mapped
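These statistics are derived from the flag column of each alignment. As a rough illustration, the sketch below counts mapped, properly paired and duplicate reads from a list of SAM flags. It is a simplification made for this example: the real « Flagstat » reports more categories, and the flag values used here are made up.

```python
def flag_summary(flags):
    """Count a few categories from a list of SAM flags (simplified)."""
    total = len(flags)
    mapped = sum(1 for flag in flags if not flag & 0x4)
    properly_paired = sum(1 for flag in flags if flag & 0x1 and flag & 0x2)
    duplicates = sum(1 for flag in flags if flag & 0x400)
    percent_mapped = 100.0 * mapped / total if total else 0.0
    return {
        "total": total,
        "mapped": mapped,
        "percent_mapped": percent_mapped,
        "properly_paired": properly_paired,
        "duplicates": duplicates,
    }


# Made-up flags: two properly paired reads, one unmapped pair, one duplicate
example_flags = [99, 147, 77, 141, 1123]
summary = flag_summary(example_flags)
print(summary)
```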
Removing duplicates (not for targeted sequencing)
• Duplicate reads: different reads having the same sequence, caused by PCR amplification during sequencing library preparation
• Whether to remove duplicates depends on the application (not suitable for sequencing of small targets)
• Galaxy: Use “Mark Duplicates reads” from “NGS:Picard” to mark duplicates
• Galaxy: Run “Flagstat” on the output BAM to see the number of PCR duplicates
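The idea behind duplicate marking can be sketched as follows: reads that start at the same position on the same strand are grouped, one representative is kept, and the others receive the duplicate bit (0x400) in their flag. This is a deliberately simplified illustration and not Picard's actual algorithm, which also takes the mate position into account. The dictionary keys and the example reads are hypothetical.

```python
from collections import defaultdict

DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR or optical duplicates


def mark_duplicates(reads):
    """Flag all but the best-quality read among reads sharing chrom, 5' position and strand."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"])
        groups[key].append(read)
    for group in groups.values():
        # Keep the read with the highest summed base quality as the representative
        best = max(group, key=lambda read: read["quality_sum"])
        for read in group:
            if read is not best:
                read["flag"] |= DUPLICATE_FLAG
    return reads


# Hypothetical reads: the first two start at the same place on the same strand
example_reads = [
    {"chrom": "chr12", "pos": 100, "reverse": False, "quality_sum": 3000, "flag": 99},
    {"chrom": "chr12", "pos": 100, "reverse": False, "quality_sum": 2500, "flag": 99},
    {"chrom": "chr12", "pos": 300, "reverse": False, "quality_sum": 2800, "flag": 99},
]
marked = mark_duplicates(example_reads)
print([read["flag"] for read in marked])
```

Marking rather than deleting keeps the reads in the BAM, so downstream tools can decide whether to ignore them.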
Target intersection
• Use « Intersect BAM alignments with intervals » from « NGS:Bedtools » to keep only the reads mapped on the targeted regions
Result: smaller BAM size
The targeted regions must be in BED format (4 columns: chr ; start ; end ; name)
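To illustrate what the intersection does, the sketch below parses 4-column BED lines and tests whether a read overlaps any target region. BED coordinates are 0-based and half-open, so a read aligned at SAM position p (1-based) and covering n reference bases spans [p - 1, p - 1 + n). The region name and coordinates are made up, and the linear scan is an assumption chosen for brevity; Bedtools uses far more efficient methods.

```python
def parse_bed(lines):
    """Parse 4-column BED lines into (chrom, start, end, name) tuples."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split()[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def overlaps_target(chrom, read_start, read_end, regions):
    """True if the 0-based half-open interval [read_start, read_end) overlaps a region."""
    return any(
        chrom == region_chrom and read_start < region_end and read_end > region_start
        for region_chrom, region_start, region_end, _ in regions
    )


# Hypothetical target region
targets = parse_bed(["chr12\t100\t200\texon_1"])
print(overlaps_target("chr12", 150, 250, targets))
print(overlaps_target("chr12", 200, 300, targets))
```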
Group association
• Use « Add or Replace Groups » from « NGS:Picard » to associate a sample ID and a sequencing technology to the reads
Mandatory for some tools (GATK) or in multi-sample analysis
Next session
Variant calling pre-processing : GATK part 1
Why realign around indels ?
• Small insertions/deletions (indels) in reads (especially near the ends) can trick the mappers into mis-aligning with mismatches
Alignment scoring: it is cheaper to introduce multiple Single Nucleotide Variants (SNVs) than an indel, which induces a lot of false positive SNVs
• These artifactual mismatches can harm base quality recalibration and variant detection
• Realignment around indels helps improve the accuracy of several of the downstream processing steps
Local realignment identifies the most parsimonious alignment along all reads at a problematic locus
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile
2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
3. If the score of the best alternate consensus is sufficiently better than the original alignments, then we accept the proposed realignment of the reads
Figure: realigning determines which is better: reads consistent with the reference (3 adjacent SNPs) or reads consistent with a 3 bp insertion
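The scoring rule in the three steps above can be sketched in a few lines of Python. Each read is placed at its best ungapped position on a candidate consensus, the qualities of its mismatching bases are summed, and the totals for the reference and for the alternate consensus are compared. This is a toy illustration under strong assumptions (ungapped placement, made-up sequences, a hypothetical `min_improvement` threshold) and not GATK's implementation.

```python
def mismatch_quality_sum(read, qualities, consensus, offset):
    """Sum the qualities of read bases that disagree with the consensus at this offset."""
    return sum(
        quality
        for base, quality, ref_base in zip(read, qualities, consensus[offset:])
        if base != ref_base
    )


def best_read_score(read, qualities, consensus):
    """Lowest mismatch-quality sum over all ungapped placements of the read."""
    n_offsets = len(consensus) - len(read) + 1
    if n_offsets < 1:
        raise ValueError("read is longer than the consensus")
    return min(mismatch_quality_sum(read, qualities, consensus, offset)
               for offset in range(n_offsets))


def consensus_score(reads, consensus):
    """Total score of a consensus: sum over reads of their best score."""
    return sum(best_read_score(read, quals, consensus) for read, quals in reads)


def accept_realignment(reads, reference, alternate, min_improvement=30):
    """Accept the alternate consensus if it is better by at least min_improvement."""
    return consensus_score(reads, reference) - consensus_score(reads, alternate) >= min_improvement


# Made-up locus: the alternate consensus carries a 3 bp insertion (CAG)
reference = "ACGTACGTTTGACCA"
alternate = reference[:8] + "CAG" + reference[8:]
reads = [
    (alternate[4:14], [30] * 10),
    (alternate[6:16], [30] * 10),
]
print(consensus_score(reads, reference), consensus_score(reads, alternate))
print(accept_realignment(reads, reference, alternate))
```

Weighting mismatches by base quality means that low-quality disagreements count for little, so a consensus is favoured only when confident bases support it.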
Three types of realignment targets
• Known sites: common polymorphisms (dbSNP, 1000 Genomes)
• Indels seen in the original alignments (in the CIGAR, indicated by I for insertion or D for deletion)
• Sites where evidence suggests a hidden indel (abundance of SNVs)
Known sites
https://www.broadinstitute.org/gatk/guide/tagged?tag=knownsites
Why are they important?
Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
2. Recommended sets of known sites per tool
Local realignment around indels
Figure: alignments before and after local realignment (panels: Before, After; annotations: SNVs, Deletion)
Indel realignment steps/tools
1. Identify which regions need to be realigned: RealignerTargetCreator (+ known sites) ; output: intervals
2. Perform the actual realignment (BAM output): IndelRealigner (+ known sites) ; input: intervals
Galaxy: Realigner Target Creator
• Use « Realigner Target Creator » from « GATK Tools » to detect intervals in need of local realignment
In the tool form: add a new binding for reference-ordered data, choose the advanced GATK options, and add a new « operate on Genomic intervals »
Galaxy: Indel Realigner
• Use « Indel Realigner » from « GATK Tools » to apply local realignment
In the tool form: add a new binding for reference-ordered data, choose the advanced GATK options, and add a new « operate on Genomic intervals »
Preprocess GATK: part 2
Why recalibrate base qualities ?
Real data is messy, so properly estimating the evidence is critical
The quality scores issued by sequencers are inaccurate and biased
• Quality scores are critical for all downstream analysis
• Systematic biases are a major contributor to bad calls
• Example of sequence context bias in the reported qualities (figure panels: before, after)
Evidence of error covariates
• Analyze covariation among several features of a base, e.g.:
- Reported quality score
- Position within the read (machine cycle)
- Preceding and current nucleotides (chemistry effect)
- Sequencing technology...
• Adjust the quality score associated with each sequenced base to be more accurate: remove systematic biases
How are the covariates analyzed?
• Keep track of the number of observations and the number of times it was an error as a function of various covariates:
- Typically stratify the data by lane, by original quality score, by machine cycle and by sequencing context
- Databases of known variants are used to discount most of the real genetic variation present in the sample
- All other differences are assumed to be errors
- Having done indel realignment first reduces noise
Phred-scaled quality score = -10 × log10( (#mismatches + 1) / (#bases + 2) )
https://www.broadinstitute.org/gatk/guide/article?id=44
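The counting described above can be sketched as follows: mismatches and observed bases are tallied per combination of covariates, positions listed as known variant sites are skipped, and each bin receives an empirical Phred-scaled quality computed from (#mismatches + 1) / (#bases + 2). The observation format, the field names and the example numbers are assumptions made for this illustration; this is not the GATK code.

```python
import math
from collections import defaultdict


def empirical_quality(n_mismatches, n_bases):
    """Phred-scaled empirical quality of a bin, with +1/+2 smoothing."""
    return -10 * math.log10((n_mismatches + 1) / (n_bases + 2))


def tally_covariates(observations, known_sites):
    """Count [mismatches, bases] per (reported quality, cycle, context) bin."""
    counts = defaultdict(lambda: [0, 0])
    for obs in observations:
        if (obs["chrom"], obs["pos"]) in known_sites:
            continue  # real variation is not counted as error
        key = (obs["reported_quality"], obs["cycle"], obs["context"])
        counts[key][1] += 1
        if obs["is_mismatch"]:
            counts[key][0] += 1
    return counts


# Hypothetical bin: bases reported as Q30, 99 mismatches among 9998 observations
print(empirical_quality(99, 9998))

# Hypothetical observations; the mismatch at a known site is ignored
observations = [
    {"chrom": "chr12", "pos": 10, "reported_quality": 30, "cycle": 75, "context": "GA", "is_mismatch": True},
    {"chrom": "chr12", "pos": 11, "reported_quality": 30, "cycle": 75, "context": "GA", "is_mismatch": False},
    {"chrom": "chr12", "pos": 12, "reported_quality": 30, "cycle": 75, "context": "GA", "is_mismatch": True},
]
counts = tally_covariates(observations, known_sites={("chr12", 12)})
print(dict(counts))
```

In the first example the bin is reported as Q30 but behaves close to Q20, which is the kind of systematic overestimation the recalibration corrects.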
BQSR steps/tools
1. Model the error modes and recalibrate qualities: Base Recalibrator (Count Covariates in earlier GATK versions) ; output: covariates
2. Perform the actual recalibration (BAM output): Print Reads (Table Recalibration in earlier GATK versions) ; input: covariates
Galaxy: Base Recalibrator
• Use « Base Recalibrator » from « GATK Tools » to recalibrate base quality scores
In the tool form: add a new binding for reference-ordered data, choose the advanced GATK options, and add a new « operate on Genomic intervals »
Galaxy: Print Reads
Final results
Aligned and preprocessed reads (BAM):
- Marked PCR duplicates
- Intersected on target regions
- Realigned around indels
- Recalibrated
Next step: somatic variant calling
• Mpileup tumor and Mpileup normal: Samtools Mpileup, run on the aligned and preprocessed reads (BAM) of each sample
• Variant calling: VarScan Somatic
• Variant selection: Select
• Variant annotation: Annovar
Using workflow in Galaxy: repeat the same steps for the blood sample
Repeat all these steps in a few clicks…
Select libraries on Galaxy
1. In the top menu, click on « Shared Data » then « Data libraries »
2. Click on « [FORMATION] Input Data » then « EXOME »
3. Select « normal_R1.fastq » ; « normal_R2.fastq » ; « exome_regions.bed » ; « known_sites_regions.vcf »
4. Select « Import to Histories » then click on « Go »
5. Write a new history name and click on « Import library datasets »
Extract workflow from history
1. In the top menu, click on « Analyze Data » to return to the main frame
2. In the « History » panel, click on the wheel at the top then on « Extract Workflow »
3. Write a name for your workflow then click on « Create Workflow »
Edit your workflow
1. In the top menu, click on « Workflow » to return to the main frame
2. Click on your new workflow then select « Edit »
3. Identify and rename the « Input dataset » boxes corresponding to R1/R2/BED/VCF
4. Each box represents a tool with a set of parameters that you can modify by clicking on it
- Click on the « Add or Replace Groups » box and change the read group ID and name (you can also choose to set them at runtime)
- Check the input of the « Intersect BAM » box: it should be the BAM output from « Mark Duplicates »
5. Save your edited workflow by clicking on the wheel then « Save »
Run your workflow on the blood sample
1. A workflow can only be run on data present in the current history
- Click on the wheel at the top of your history panel and select « Saved histories »
- Click on the « Normal » history and click on « Switch »
- Go back to the « Workflow » page (top panel), then click on your workflow, then on « Run »
2. Check that all your input files are correct (step 1: BED, step 2: VCF, step 3: R1, step 4: R2)
3. Click on « Run workflow » at the bottom of the page
Let Galaxy work for you!
Next Step
Somatic Variant Calling with VarScan2
Variant Calling
• Factors to consider when calling an SNV:
- Base call qualities of each supporting base (base quality)
- Proximity to small indels or homopolymer runs
- Mapping qualities of the reads supporting the SNV
- Sequencing depth: >= 30X for constitutional samples ; >= 100X for tumor samples
- SNV position within the reads: higher error rate at the read ends
- Strand bias (SNVs supported by only one strand are more likely to be artifactual)
- Allelic frequency: tumor cellularity will reduce the % of a heterozygous variant
• Higher stringency when calling indels (Sanger validation is often needed)
Depth of Coverage
Depth of coverage = number of reads covering a given position, e.g. 1X, 5X, 100X ... >1000X
[Figure: reads aligned to the reference genome, with reference bases, SNVs and sequencing errors marked. Example depths shown: 2X; 7X (5X on the + strand, 2X on the - strand); 17X (9X on the + strand, 8X on the - strand). A calling-confidence scale runs from --- to +++. A variant seen in 100% of the reads is a homozygous SNV; a 50%/50% split is a heterozygous SNV. A last panel illustrates SNVs and sequence context (errors).]
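The quantities shown in the figure (depth, allelic fraction, strand support) can be derived from four per-strand counts. The following is a minimal sketch; the `SiteCounts` class and its field names are hypothetical, and the split between reference and variant reads in the example is invented for illustration (only the 9X/8X strand totals come from the slide).

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Reads covering one position, split by allele and by strand."""
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        # Depth of coverage: all reads covering the position
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads supporting the variant allele
        alt = self.alt_fwd + self.alt_rev
        return alt / self.depth if self.depth else 0.0

    @property
    def alt_on_both_strands(self) -> bool:
        # Variants seen on a single strand are more likely to be artifacts
        return self.alt_fwd > 0 and self.alt_rev > 0


# 17X site: 9 reads on the + strand, 8 on the - strand (allele split is illustrative)
site = SiteCounts(ref_fwd=5, ref_rev=4, alt_fwd=4, alt_rev=4)
print(site.depth, f"{site.alt_fraction:.0%}", site.alt_on_both_strands)
```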
VarScan2
• Mutation caller written in Java (no installation required) that works on pileup files from targeted, exome and whole-genome sequencing data (DNA-seq or RNA-seq)
• Multi-platform: Illumina, SOLiD, Life/PGM, Roche/454
• Detection of different kinds of germline SNVs/indels (classical mode):
– Variants in individual samples
– Multi-sample variants, shared or private, in multi-sample datasets
• VarScan's distinctive feature is its ability to work with tumor/normal pairs (somatic mode):
– Somatic and germline mutations and LOH events in tumor-normal pairs
– Somatic copy number alterations (CNAs) in tumor-normal exome data
VarScan2
• Most published variant callers use Bayesian statistics (a probabilistic framework) to detect variants and assess confidence in them (e.g. GATK)
• VarScan uses a robust heuristic/statistical approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency and statistical significance
• Stead et al. (2013) compared 3 somatic callers: MuTect, Strelka and VarScan2
– VarScan2 performed best overall, with sequencing depths of 100X, 250X, 500X and 1000X required to accurately identify variants present at 10%, 5%, 2.5% and 1% respectively
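A quick calculation helps read the depth/frequency pairs quoted from Stead et al. (2013): multiplying each depth by the matching allele frequency gives the average number of variant-supporting reads. This is only a back-of-the-envelope sketch; the function name is hypothetical and the calculation ignores sampling noise and sequencing errors.

```python
# Depth / allele-frequency pairs quoted above from Stead et al. (2013)
PAIRS = [(100, 0.10), (250, 0.05), (500, 0.025), (1000, 0.01)]


def expected_variant_reads(depth: int, allele_frequency: float) -> float:
    """Average number of reads carrying the variant at a given depth."""
    return depth * allele_frequency


for depth, freq in PAIRS:
    reads = expected_variant_reads(depth, freq)
    print(f"{depth}X at {freq:.1%}: ~{reads:.1f} variant reads expected")
```

Each pair corresponds to roughly 10 to 13 variant reads on average, which suggests the required depth is essentially the one that keeps this number of supporting reads constant as the allele frequency drops.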
Common history
1. In the history panel, open the wheel menu and select « Copy datasets »
2. Select the preprocessed BAM files from the Normal and Tumor histories, and « exome_regions.bed »
Common history
1. Edit the attributes of each BAM by clicking on the pencil icon
2. For clarity, rename the BAM files to « Normal.bam » and « Tumor.bam », then « Save » the changes
Somatic variant calling with VarScan
Mpileup
1. Use « Mpileup » from « NGS:Samtools » to create the pileup files (repeat for the Tumor and Normal samples)
Note: anomalous read pairs are due to the restriction of the exome to a region
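For readers working outside Galaxy, the Mpileup step roughly corresponds to a `samtools mpileup` command line. The sketch below only builds and prints the commands; it rests on several assumptions that the slides do not confirm: that the BED file is used to restrict positions (`-l`), that the note about anomalous read pairs means they should be counted (`-A`), and that the reference is a file named `reference.fa` (hypothetical name).

```python
import shlex
from typing import List


def mpileup_command(bam: str, reference_fasta: str, regions_bed: str) -> List[str]:
    """Build a samtools mpileup command line (not executed here)."""
    return [
        "samtools", "mpileup",
        "-f", reference_fasta,  # reference genome in FASTA format
        "-l", regions_bed,      # assumption: restrict output to the BED regions
        "-A",                   # assumption: count anomalous read pairs instead of skipping them
        bam,
    ]


# One pileup per sample, as in the Galaxy step above
for sample in ("Tumor", "Normal"):
    cmd = mpileup_command(f"{sample}.bam", "reference.fa", "exome_regions.bed")
    print(shlex.join(cmd), ">", f"{sample}.pileup")
```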
Pileup format
• Describes the base-pair information at each position
Columns: chromosome, position, reference base, number of reads covering the site (total depth), read bases, base qualities
Read bases:
. / , = match on forward/reverse strand
ACGTN / acgtn = mismatch on forward/reverse strand
`[-+][0-9]+[ACGTNacgtn]+` indicates an indel
Example 1: chr12 112888238 A 108
read bases: .$,$.,,.,,..,,.,.,..,.,,.,.,,.,..,,..,,.,.,.,.,,.,.,.,,,.,.,.,.,,,.,.,.,.,.,,.,.,.,.,,,,.,,.,.,..,,,,,.,.,,,,^F,
base qualities: =4 =??????@??@?@@ @=@ ??@ ? @? ? ??< ? ??@??????? ? @??? ? ??@?? A???@@ ?@@???AB????= ? @ @@??@@?@ A 00
Example 2: chr12 112888239 C 108
read bases: .$t,.,,.T,tT,.,T.,.,t.tTtt.tTT,t.T,tTt.tT,T,,.tTtTttt.,.,Tt.ttt.,T,.,.tT,,T,T,.tT,,t,TttTtT,T.,tt,,T,T,tttt^F.^F,
base qualities: 936 78??6??6 45<875? ??? ?@6 @6???<?=6 66= ? ???? 6=7??=???<8@7=7=? 77?7?8 ??78?7????? 8 <8??88 9?8 ?0048
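The read-bases column can be decoded with a small parser. This is a minimal sketch following the symbols listed above, plus two markers of the pileup format that also appear in the examples: `^` followed by one mapping-quality character marks the start of a read, and `$` marks the end of a read. The function name and the labels it returns are hypothetical.

```python
import re
from collections import Counter


def count_read_bases(bases: str) -> Counter:
    """Count matches, mismatches and indels in a pileup read-bases string."""
    counts: Counter = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2                      # read start: skip '^' and the mapping-quality character
        elif c == "$":
            i += 1                      # read end
        elif c in "+-":
            m = re.match(r"[0-9]+", bases[i + 1:])
            if m is None:
                raise ValueError(f"malformed indel at offset {i}")
            start = i + 1 + m.end()     # first base of the inserted/deleted sequence
            length = int(m.group())
            counts[f"indel {c}{bases[start:start + length].upper()}"] += 1
            i = start + length
        elif c == ".":
            counts["ref +"] += 1        # match on the forward strand
            i += 1
        elif c == ",":
            counts["ref -"] += 1        # match on the reverse strand
            i += 1
        elif c in "ACGTN":
            counts[f"{c} +"] += 1       # mismatch on the forward strand
            i += 1
        elif c in "acgtn":
            counts[f"{c.upper()} -"] += 1  # mismatch on the reverse strand
            i += 1
        else:
            i += 1                      # other symbols are ignored in this sketch
    return counts


# First characters of the read bases at chr12:112888239 in the example above
print(count_read_bases(".$t,.,,.T,tT"))
```

On this fragment the parser counts 7 reference matches and 4 reads carrying a T, 2 on each strand, which is the kind of strand-balanced support a variant caller looks for.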
VarScan somatic
1. Use « VarScan somatic » from « Varscan » to detect variants
• Min-var-freq: minimum allelic frequency to call a variant (10% here)
• Min-coverage: minimum coverage to call a variant (in normal, in tumor, and combined)
• Tumor and normal purity: cellularity of your samples
2 output files: SNVs and indels, in VCF format
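Outside Galaxy, the same parameters map onto the command-line options of VarScan2's `somatic` command. The sketch below only builds and prints a command line. Apart from the 10% minimum variant frequency, every value is a placeholder rather than a setting from this course, and the jar and file names are hypothetical.

```python
import shlex
from typing import List


def varscan_somatic_command(
    normal_pileup: str,
    tumor_pileup: str,
    output_prefix: str,
    min_var_freq: float = 0.10,   # 10%, as in the step above
    min_coverage: int = 8,        # placeholder value
    tumor_purity: float = 1.0,    # placeholder: set to the estimated cellularity
    normal_purity: float = 1.0,   # placeholder
    jar: str = "VarScan.jar",     # hypothetical path to the VarScan jar
) -> List[str]:
    """Build a VarScan2 somatic command line (not executed here)."""
    return [
        "java", "-jar", jar, "somatic",
        normal_pileup, tumor_pileup, output_prefix,
        "--min-var-freq", str(min_var_freq),
        "--min-coverage", str(min_coverage),
        "--tumor-purity", str(tumor_purity),
        "--normal-purity", str(normal_purity),
        "--output-vcf", "1",      # write VCF instead of the native tabulated output
    ]


cmd = varscan_somatic_command("Normal.pileup", "Tumor.pileup", "somatic_calls")
print(shlex.join(cmd))
```

Note that the normal pileup comes first and the tumor pileup second; swapping them would invert the somatic status of every call.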
VarScan VCF format
• 2 output types:
– VCF (specific to VarScan)
– Tabulated (available only for VarScan in classical mode)
• VarScan VCF format: classic VCF header (#) but specific variant lines

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr12 250239 . A G 20 PASS DP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:52:48:0:0%:15,33,0,0 0/1:.:63:50:8:13.79%:19,31,3,5

GT = Genotype (1/1: homozygous; 0/1: heterozygous)
GQ = Genotype quality
SS = Somatic status (0 = reference; 1 = germline; 2 = somatic; 3 = LOH; 5 = unknown)
DP = Quality read depth of bases with Phred score >= BAPQ
RD = Depth of reference-supporting bases
AD = Depth of variant-supporting bases
FREQ = Variant allele frequency
DP4 = Ref/FWD, Ref/REV, Alt/FWD, Alt/REV
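The example variant line can be split into its fields with a few lines of Python. This is a minimal sketch: the function name is hypothetical, the line is assumed to be tab-separated as in a regular VCF, and the relation FREQ = AD / (RD + AD) is only checked against this one example.

```python
from typing import Dict, Tuple

# The example line above, with tab-separated columns
LINE = (
    "chr12\t250239\t.\tA\tG\t20\tPASS\t"
    "DP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3\t"
    "GT:GQ:DP:RD:AD:FREQ:DP4\t"
    "0/0:.:52:48:0:0%:15,33,0,0\t"
    "0/1:.:63:50:8:13.79%:19,31,3,5"
)


def parse_varscan_line(line: str) -> Tuple[Dict[str, object], Dict[str, str], Dict[str, str]]:
    """Return (INFO, NORMAL, TUMOR) dictionaries for one VarScan somatic VCF line."""
    fields = line.rstrip("\n").split("\t")
    info: Dict[str, object] = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True   # flags such as SOMATIC have no value
    keys = fields[8].split(":")
    normal = dict(zip(keys, fields[9].split(":")))
    tumor = dict(zip(keys, fields[10].split(":")))
    return info, normal, tumor


info, normal, tumor = parse_varscan_line(LINE)
rd, ad = int(tumor["RD"]), int(tumor["AD"])
print("somatic status:", info["SS"])
print("tumor FREQ:", tumor["FREQ"], "recomputed:", f"{ad / (rd + ad):.2%}")
print("normal FREQ:", normal["FREQ"])
```

The variant is absent from the normal sample (0%) and present at 13.79% in the tumor, which matches the somatic status SS=2.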
VarScan Tabulated Format
Chrom  Position   Ref  Cons  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  R1+  R1-  R2+  R2-  Alt
chr12  113348849  C    Y     31      30      49.18%   2         2         27     27     0.98    1         1         19   12   25   5    T
chr12  113354329  G    R     72      2       2.70%    2         2         31     26     0.98    1         1         48   24   1    1    A
chr12  113357193  G    A     2       72      97.30%   1         2         28     24     0.98    1         1         2    0    45   27   A
chr12  113357209  G    A     0       77      100%     0         2         0      29     0.98    0         1         0    0    51   26   A

Cons: consensus genotype of the variant called (IUPAC code):
M -> A or C    Y -> C or T    D -> A or G or T    W -> A or T    V -> A or C or G
R -> A or G    K -> G or T    B -> C or G or T    S -> C or G    H -> A or C or T
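Two columns of the tabulated format can be checked by hand: VarFreq matches Reads2 / (Reads1 + Reads2) in every example row, and Cons is decoded with the IUPAC table. The sketch below reproduces both; the variable names are hypothetical and only the four rows shown above are used.

```python
# IUPAC ambiguity codes listed above, as code -> possible bases
IUPAC = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}

# (Ref, Cons, Reads1, Reads2) taken from the four example rows
ROWS = [
    ("C", "Y", 31, 30),
    ("G", "R", 72, 2),
    ("G", "A", 2, 72),
    ("G", "A", 0, 77),
]

for ref, cons, reads1, reads2 in ROWS:
    var_freq = reads2 / (reads1 + reads2)
    alleles = IUPAC.get(cons, cons)   # plain A/C/G/T means a homozygous consensus
    kind = "heterozygous" if len(alleles) > 1 else "homozygous"
    print(f"ref {ref}, cons {cons} ({'/'.join(alleles)}, {kind}): VarFreq {var_freq:.2%}")
```

The second row shows why VarFreq matters: the consensus R (A or G) looks heterozygous, but it rests on only 2 variant reads out of 74 (2.70%).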
Variant Annotation
Different types of SNVs
• SNVs and short indels are the most frequent events. They can be:
– Intergenic
– Intronic
– cis-regulatory
– in splice sites
– frameshift or not
– synonymous or not
– benign or damaging
– etc.
• Example of an SNV one wants to pinpoint: non-synonymous + highly deleterious + somatically acquired
Resources dedicated to human genetic variations
• dbSNP and 1000 Genomes: population-scale DNA polymorphisms
• COSMIC: Catalogue Of Somatic Mutations In Cancer
• Non-synonymous SNV predictions: SIFT, Polyphen2 (damaging impact)... PhyloP, GERP++ (conservation)
• ANNOVAR: tool to annotate genetic variations
Annovar
Use « Annovar » to annotate SNVs and indels
Input: multi-sample VCF (contains the Tumor and Normal samples)
• RefGene: gene, function and amino acid change (HGVS format: c.A155G ; p.Lys45Arg)
• 1000g2012apr_all: minor allele frequency across all populations
• ESP6500: Exome Sequencing Project
• Ljb_all: predictions (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++)
Output: tabulated file
Variant selection
Select variants predicted as somatic
• Use the « Select » tool from « Filter and Sort » to keep only the lines matching the pattern « SOMATIC »
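The selection step amounts to a line filter on a pattern. The sketch below shows an equivalent in Python; the function name is hypothetical, and the option to keep header lines starting with `#` is an addition for convenience, not something the slides describe for the Galaxy tool. The input lines are invented for the example.

```python
import re
from typing import Iterable, Iterator


def select_lines(lines: Iterable[str], pattern: str = "SOMATIC",
                 keep_header: bool = True) -> Iterator[str]:
    """Yield the lines matching `pattern` (and, optionally, '#' header lines)."""
    regex = re.compile(pattern)
    for line in lines:
        if (keep_header and line.startswith("#")) or regex.search(line):
            yield line


# Illustrative input: one header line, one somatic call, one germline call
example = [
    "#CHROM\tPOS\tREF\tALT\tINFO",
    "chr12\t250239\tA\tG\tDP=115;SOMATIC;SS=2",
    "chr12\t250300\tC\tT\tDP=98;SS=1",
]
for line in select_lines(example):
    print(line)
```

The match is case-sensitive, so lines that only mention a lowercase "somatic" elsewhere are not retained.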
Appendix 1: Frequently mutated genes in WES
Fuentes Fajardo KV, Adams D; NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello T. Detecting false-positive signals in exome sequencing. Hum Mutat. 2012 Apr;33(4):609-13. doi: 10.1002/humu.22033. Epub 2012 Mar 5. PubMed PMID: 22294350; PubMed Central PMCID: PMC3302978.
Potential false positives = 2157 GENES !!! (e.g. MUCxx, HLA-xxx,
Appendix 2: without a « normal » sample?
Appendix 3: how to visualize variants?