raw illumina next generation sequencing data files and ... · pdf fileit is a good idea to...
TRANSCRIPT
![Page 1: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/1.jpg)
Raw Illumina Next Generation
Sequencing data files
and
Quality Control
Gilgi Friedlander
Bioinformatics Unit, Biological Services
![Page 2: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/2.jpg)
Sample preparation(Application-specific)
Load on Flow-Cell
Generate clusters
Sequence base-by-base
Run pipeline + QC
IlluminaWorkflow
![Page 3: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/3.jpg)
Next Generation Sequencing (NGS) experiments
● Plan the experiment!
Design
How to perform
Single end/paired end
read length
Replicates
Data analysis
We encourage to have a “kick-off ” meeting
Lab- High Throughput Sequencing Unit:
Dr. Shirley Horn-Saban - Head of Genomic Technologies
Dr. Daniela Amann-Zalcenstein
Muriel Chemla
Bioinformatics (NGS analysis):
Dr. Dena Leshkowitz
Dr. Ester Feldmesser
Dr. Gilgi Friedlander
![Page 4: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/4.jpg)
● How is the data organized?
●Viewing the data in a genomic browser
● Quality control
Today:
●The format of the output files
Exercise
● Illumina’s pipeline
![Page 5: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/5.jpg)
Illumina’s pipeline
CASAVA (v1.8.2)
Consensus Assessment of
Sequence And Variation
Irit Orr
![Page 6: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/6.jpg)
● How is the data organized?
●Viewing the data in a genomic browser
● Quality control
Today:
●The format of the output files
Exercise
● Illumina’s pipeline
![Page 7: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/7.jpg)
How is the data organized?
htsdata
Unaligned
Aligned
Unaligned
Sample_name1
SampleSheet.csv
fastq.gz filesdivided to multiple files
(4M reads in each file)
Export
23456_SN808_0051_BA0BC9ABXX
Aligned
Sample_name1
Run_ID...
...
Summary_
Stats
Barcode_Lane_Summary.htm
![Page 8: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/8.jpg)
How is the data organized?
htsdata
Unaligned
Aligned
Unaligned
Sample_name1
SampleSheet.csv
fastq.gz filesdivided to multiple files
(4M reads in each file)
Export
23456_SN808_0051_BA0BC9ABXX
Aligned
Sample_name1
Run_ID...
...
Summary_
Stats
Barcode_Lane_Summary.htm
![Page 9: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/9.jpg)
● How is the data organized?
●Viewing the data in a genomic browser
● Quality control
Today:
●The format of the output files
Exercise
● Illumina’s pipeline
![Page 10: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/10.jpg)
The FastQ format
(standard text representation of short reads)
A FASTQ text file normally uses four lines per sequence.
Example@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters. Line 3 begins with a '+' character (optionally followed by SEQ_ID). Line 4 encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence. The letters encode Phred Quality Scores.
Output files
![Page 11: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/11.jpg)
fastq files– cont’
Quality scores
Each base has a quality score that measures the probability that a base is called incorrectly
Illumina's base scoring is similar to Phred scores-a way of expressing estimates of sequencing
error probabilities.
The quality score is in ASCII format:ASCII character code - 33
Q phred = ASCII code-33 = -10 log10( Pe )
Pe = error probability of a particular base call
Q20 = 1 error in 100 bases
Q30 = 1 error in 1000 bases
Output files
![Page 12: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/12.jpg)
Char ASCIIQphred
ASCII-33 P(error)
! 33 0 1
" 34 10.7943
282
# 35 20.6309
573
$ 36 30.5011
872
% 37 40.3981
072
& 38 50.3162
278
' 39 60.2511
886
( 40 70.1995
262
) 41 80.1584
893
* 42 90.1258
925
+ 43 10 0.1
.
.
.
H 72 390.0001
259
I 73 40 0.0001
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))
Q phred = -10 log10( Pe )
![Page 13: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/13.jpg)
@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG
CNAGGCTGGAGTGCAATGGCACAATCTTGGCTCNTNNCANCCTTTGGCTC
+
@#1ADDDDHCF<D9EGBEE>FHAHBCGICHFBE#1##00#008?FHB>D#
Output files
fastq files– cont’
@instrument run number
flowcell IDlane tile
xpos
yposread*
* Read number: 1 can be single read or read 2 of paired-end
Is filtered
(N- passed filter)
Index
(if multuiplex)
Divided: each contains 4M read
Zipped
Contains only passed filtered reads
![Page 14: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/14.jpg)
How is the data organized?
htsdata
Unaligned
Aligned
Unaligned
Sample_name1
SampleSheet.csv
fastq.gz filesdivided to multiple files
(4M reads in each file)
Export
23456_SN808_0051_BA0BC9ABXX
Aligned
Sample_name1
Run_ID...
...
Summary_
Stats
Barcode_Lane_Summary.htm
![Page 15: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/15.jpg)
Output files
![Page 16: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/16.jpg)
How is the data organized?
htsdata
Unaligned
Aligned
Unaligned
Sample_name1
SampleSheet.csv
fastq.gz filesdivided to multiple files
(4M reads in each file)
Export
23456_SN808_0051_BA0BC9ABXX
Aligned
Sample_name1
Run_ID...
...
Summary_
Stats
Barcode_Lane_Summary.htm
![Page 17: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/17.jpg)
● How is the data organized?
●Viewing the data in a genomic browser
● Quality control
Today:
●The format of the output files
Exercise
● Illumina’s pipeline
![Page 18: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/18.jpg)
Quality Control
What is the quality of my data?
![Page 19: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/19.jpg)
Quality Control
Two levels:
1. The qualities of the bases of the reads
2. If there is an available genome:
What is the fraction of the reads that align to the
genome? What is the error rate?
![Page 20: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/20.jpg)
Quality Control
When you get your data, you get a mail with the
location of the files and with a link to some tables
and plots, that looks something like:
http://dapsas.weizmann.ac.il/ngsreports/110922
_SN808_0058_BB07HNABXX/
![Page 21: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/21.jpg)
Open the link and go to the Summary tab
% reads passed filter
(chastity filter)% aligned to
reference
using ELAND
(the read-mapper
supplied by Illumina)
Average of the four
intensities at the
first cycle
% Error rate
The PhiX reads are mapped on the PhiX reference genome, the error rate is then
estimated by the number mismatches, over the total number of bases of mapped
PhiX reads
should be below 1.5
%intensity after 20 cycles should
be 50% or more
![Page 22: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/22.jpg)
Quality Control
Explore the quality of the data by looking at
boxplots of various parameters.
Median
75th percentile
25th percentile
Min/Max or
maximum of 1.5
times the inter-
quartile range
outliers
![Page 23: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/23.jpg)
Quality Control: viewing plots
Check one lane at a time
Q phred = -10 log10( Pe ) Q34=> p(error) ~ 0.0004
![Page 24: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/24.jpg)
Quality Control: viewing plots
Qualities drop gradually (Q30 => P(error)=0.001)
For reads with 50 bases => >90% For reads with 100 bases => >75%
![Page 25: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/25.jpg)
Important:
These plots are created during the RUN in the HiSeq.
During CASAVA - the qualities are being calibrated.
It is a good idea to look at the quality scores also after the calibration (fastqc tool)
And decide whether we would like to filter our reads prior to our downstream analyses
![Page 26: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/26.jpg)
In case we have alignment, it is
important to check the % of reads that
were aligned
![Page 27: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/27.jpg)
How is the data organized?
htsdata
Unaligned
Aligned
Unaligned
Sample_name1
SampleSheet.csv
fastq.gz filesdivided to multiple files
(4M reads in each file)
Export
23456_SN808_0051_BA0BC9ABXX
Aligned
Sample_name1
Run_ID...
...
Summary_
Stats
Barcode_Lane_Summary.htm
![Page 28: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/28.jpg)
Fastqc: a great tool for assessing the
quality of the data
Quality Control: viewing plots
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Simon Andrews, Cambridge - UK
![Page 29: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/29.jpg)
Good dataset
Quality Control: viewing plots
Position in read (bp)
Qua
lity
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
![Page 30: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/30.jpg)
Poor dataset
Quality Control: viewing plots
Position in read (bp)
Qua
lity
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html
![Page 31: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/31.jpg)
After you make sure your data is of good quality:
Analysis step (beyond the scope of this workshop)
During the year the bioinformatics unit will give various
workshops for specific NGS applications:
http://bip.weizmann.ac.il/ws/
During down stream steps of the analysis we can use a genome
browser to view the data and assess specific, local quality,
depending on the application.
Examples:
In RNA seq: investigating a newly identified transcripts
In genomic DNA seq: investigating a specific called SNP
Genome Browser
![Page 32: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/32.jpg)
● How is the data organized?
●Viewing the data in a genomic browser
● Quality control
Today:
●The format of the output files
Exercise
● Illumina’s pipeline
![Page 33: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/33.jpg)
Genome Browser
There are many available genomic browsers.
Among them:
UCSC browser
IGV (Integrative Genomics Viewer)
IGV:
A desktop application for integrated visualization of
multiple data types and annotations in the context
of the genome
http://www.broadinstitute.org/software/igv
Developed by Jim Robinson, Broad Institute
![Page 34: Raw Illumina Next Generation Sequencing data files and ... · PDF fileIt is a good idea to look at the quality scores also after the calibration (fastqc tool) ... IGV: A desktop application](https://reader031.vdocuments.site/reader031/viewer/2022030418/5aa52fc27f8b9a517d8ce721/html5/thumbnails/34.jpg)
Genomic Browser: IGV
IGV provides a set of hosted genomes, but it is also possible to import other genomes