DNA-Seq analysis pipeline introduction. Amel Ghouila, Claudia...
TRANSCRIPT
C3BI Hands-on NGS course, Institut Pasteur Paris, Nov 21 – Dec 2, 2016
Emna Achouri, C3BI Hands-on NGS course – IPP – 23rd Nov 2016
DNA-Seq analysis pipeline: Introduction
Amel Ghouila, Claudia Chica, Emna Achouri, Fatma Guerfali, C3BI Hands-on NGS course – IPP – 23rd Nov 2016
Overview of Analysis Workflow
1. FASTQ files
   FastQC: quality control of reads
   Trimming / filtering: bad quality reads
2. Mapping of reads to a reference genome
   Assembly (de novo): reconstruction of a genome
3. SAM files, BAM files
4. Read depth
   Variant calling
   Structural variations
   Gene/chr CNV
5. VCF files: SNPs, indels
   Annotation, visualization
   FASTA file, GFF file
File Formats
There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:
• Text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF)
• Binary (BAM, BCF, SFF (454 sequencer data))
We can view and manipulate text-based formats without special tools, but we need specific tools to access and view binary formats.
Text-based formats are often compressed to save space, making them de facto binary, but they are still easy to read by eye using standard Unix tools.
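To illustrate the point about compressed text formats, here is a minimal sketch using only Python's standard `gzip` module. The file name `example.fastq.gz` and the one-record content are made up for the example.

```python
import gzip
import os
import tempfile

# Hypothetical one-record FASTQ content, used only for illustration.
record = "@read1\nACGT\n+\nIIII\n"

# Writing through gzip makes the file on disk binary.
path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wt") as handle:
    handle.write(record)

# The raw bytes start with the gzip magic number, not with '@'.
with open(path, "rb") as raw:
    magic = raw.read(2)

# Opening in text mode ("rt") decompresses transparently.
with gzip.open(path, "rt") as handle:
    lines = handle.read().splitlines()

print(magic == b"\x1f\x8b", lines)
```

On the command line, `gzip -dc` piped into `head` gives the same quick look at the text.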
Today's program
Reads QC and pre-processing
Assembly
Mapping
Introduction to file formats
Mapping quality check
Quality Analysis: Raw reads QC
Overview of Analysis Workflow
1. FASTQ files
FastQ files
Raw sequence data: FastQ files
What is a FastQ file?
FASTQ = FASTA + Quality
FastQ is a text-based format for storing both a biological sequence and its corresponding quality scores.
FASTQ File Format
• A FASTA file uses at least 2 lines per sequence:
  • Line 1 begins with a '>' and is followed by a sequence identifier
  • Line 2 is the sequence letters
• A FASTQ file uses four lines per sequence:
  • Line 1 begins with a '@' and is followed by a sequence identifier
  • Line 2 is the sequence letters
  • Line 3 begins with a '+' and is rarely followed by the sequence identifier
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAA
+
CCCFFFFFHHHHHJJJJHJJJIJJJJJJJJJJJJJJJJJJJJJJJGJJJJ
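The four-line layout can be checked mechanically. The sketch below is a minimal reader, assuming strictly four lines per record with no wrapped sequences. `FastqRecord` and `parse_fastq` are names invented here, and the example read is a shortened, made-up record.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = (
            line.rstrip("\n") for line in lines[i:i + 4]
        )
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: expected '@'")
        if not separator.startswith("+"):
            raise ValueError(f"line {i + 3}: expected '+'")
        # Line 4 must hold one quality symbol per base of line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"line {i + 4}: quality and sequence lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


example = ["@SEQ_ID", "GATTTGGGGT", "+", "CCCFFFFFHH"]
records = list(parse_fastq(example))
print(records[0].identifier, len(records[0].sequence))
```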
Phred Quality Score
• The quality score of a base is called the Phred score (or Q score). It is an integer value representing the estimated probability of an error. If P is the error probability, then: Q = -10 log10(P)
• Phred or Q scores are often represented as ASCII characters. Starting with Illumina 1.8, the quality scores have basically returned to the use of the Sanger format (Phred+33)
• The Phred version is independent of the sequencing technology version
https://en.wikipedia.org/wiki/FASTQ_format#Quality
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
ASCII code: 33 (!), 59 (;), 64 (@), 73 (I), 104 (h), 126 (~)
Phred score: 0.2......................26...31........41
Legend: S = Sanger, X = Solexa, I = Illumina 1.3+, J = Illumina 1.5+, L = Illumina 1.8+
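The relation between quality characters, Phred scores and error probabilities can be sketched as follows. The code assumes the Phred+33 (Sanger / Illumina 1.8+) encoding; the function names and the example string are illustrative only.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumed encoding: Sanger / Illumina 1.8+ (Phred+33)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert ASCII quality characters to integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> int:
    """Phred score for an error probability P, rounded to an integer."""
    return round(-10 * math.log10(p))


scores = decode_quality("!+5?IJ")
print(scores)
print([error_probability(q) for q in (10, 20, 30)])
```

A score of 20 therefore corresponds to an estimated 1 error in 100 base calls, and 30 to 1 in 1000.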
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Phred score: 0.........................26...31.......41
Phred Quality Score (Illumina 1.8+)
Q scores are defined as a property that is logarithmically related to the base-calling error probabilities (P)
Position quality score:
30–40: ✔ Good
20–30: depends on overall quality
<20: 🗑 Discard
http://www.illumina.com
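These rule-of-thumb bands can be written as a small helper. This is only a sketch: the slide does not say which band the boundary values 20 and 30 belong to, so the code assumes they fall in the upper band, and it treats any score of 30 or more as good.

```python
def classify_quality(q: float) -> str:
    """Map a Phred score to the rule-of-thumb bands listed above."""
    if q >= 30:   # assumption: 30 itself counts as good
        return "good"
    if q >= 20:   # assumption: 20 itself is not discarded
        return "depends on overall quality"
    return "discard"


labels = [classify_quality(q) for q in (35, 26, 12)]
print(labels)
```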
Overview of Analysis Workflow
FastQC: quality control of reads
Trimming / filtering: bad quality reads
Tool: FastQC
FastQC: Quality Control for FastQ files
GUI, command line, available on Galaxy
Graphic reports
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
http://dnacore.missouri.edu/PDF/FastQC_Manual.pdf
Tool: FastQC
• Quality control checks on raw sequence data
• Input: FastQ files (or BAM/SAM)
• Output: summary graphs allowing quick data assessment
• FastQC can be run in 2 modes:
  • Standalone interactive application ← small number of samples
  • Non-interactive mode ← large number of samples
FastQC can be integrated in ANALYSIS PIPELINES
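As a hedged sketch of the non-interactive mode inside a pipeline, the helper below assembles a FastQC command line and runs it only if a `fastqc` executable is found on the PATH. The file names and the `qc_reports` directory are placeholders; `--outdir` and `--threads` are the FastQC options assumed here, and the output directory is expected to exist before the call.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_fastqc_command(fastq_files: Sequence[str], outdir: str,
                         threads: int = 1) -> List[str]:
    """Assemble a non-interactive FastQC call for a batch of files."""
    return ["fastqc", "--outdir", outdir, "--threads", str(threads), *fastq_files]


def run_fastqc(fastq_files: Sequence[str], outdir: str, threads: int = 1) -> bool:
    """Run FastQC if it is installed; return False when it is not on PATH."""
    if shutil.which("fastqc") is None:
        return False
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)
    return True


# Placeholder file names: nothing is executed here, the command is only built.
command = build_fastqc_command(["sample_1.fastq", "sample_2.fastq"],
                               "qc_reports", threads=2)
print(" ".join(command))
```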
Tool: FastQC – QC reports overview
A summary of the modules which were run, and a quick evaluation of whether the results seem entirely normal, slightly abnormal or very unusual.
Icons: Normal, Unexpected, Reasonable
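In a scripted setting the same summary can be read from FastQC's text output. The sketch below assumes a tab-separated `summary.txt` with three columns (status flag, module name, file name) and PASS / WARN / FAIL as flags; the example content is made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_modules_by_status(summary_text: str) -> Dict[str, List[str]]:
    """Group FastQC module names by their status flag."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        grouped[status].append(module)
    return dict(grouped)


# Made-up summary content in the assumed three-column layout.
example_summary = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "FAIL\tPer base sequence quality\tsample.fastq\n"
    "WARN\tPer sequence GC content\tsample.fastq\n"
)
flags = group_modules_by_status(example_summary)
print(flags)
```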
Tool: FastQC – QC reports overview
• Basic Statistics
  • ASCII quality encoding format (Phred format)
  • Number of reads, read length, …
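The figures in this module are easy to reproduce by hand. The sketch below computes them for a list of (sequence, quality) pairs. The encoding guess is a deliberately crude heuristic chosen for this example, not FastQC's own logic: ASCII codes below 59 only occur in Phred+33 data, and anything else is reported as undetermined.

```python
from typing import Dict, List, Tuple


def basic_statistics(reads: List[Tuple[str, str]]) -> Dict[str, object]:
    """Summarise (sequence, quality) pairs: read count, lengths, encoding guess."""
    lengths = [len(sequence) for sequence, _ in reads]
    lowest_code = min(ord(char) for _, quality in reads for char in quality)
    # Heuristic: codes below 59 are only used by the Phred+33 encodings.
    encoding = "Phred+33" if lowest_code < 59 else "undetermined"
    return {
        "total_reads": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "encoding_guess": encoding,
    }


reads = [("ACGTAC", "IIIIH#"), ("ACGT", "FFFF")]
stats = basic_statistics(reads)
print(stats)
```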
Tool: FastQC – QC reports overview
• Per base sequence quality
Overview of the range of quality values (Phred scores) across all bases from all reads
Plot axes: position in read (bp) against Phred Q score
Red line: median value
Yellow box: inter-quartile range (25-75%)
Upper and lower whiskers: 10% and 90% points
Blue line: mean quality
In most platforms, the quality is lower at the beginning/end of reads -> This is common!
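The quantities drawn in this plot can be computed position by position. The sketch below assumes Phred+33 quality strings and uses a simple index-based percentile, which is an approximation chosen for brevity rather than FastQC's exact method.

```python
import statistics
from typing import Dict, List


def percentile(sorted_values: List[int], fraction: float) -> int:
    """Simple index-based percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def per_base_quality(qualities: List[str], offset: int = 33) -> List[Dict[str, float]]:
    """Summarise Phred scores position by position across all reads."""
    longest = max(len(q) for q in qualities)
    summary = []
    for position in range(longest):
        column = sorted(ord(q[position]) - offset
                        for q in qualities if len(q) > position)
        summary.append({
            "position": position + 1,
            "mean": statistics.mean(column),      # blue line
            "median": statistics.median(column),  # red line
            "q25": percentile(column, 0.25),      # yellow box, lower edge
            "q75": percentile(column, 0.75),      # yellow box, upper edge
            "p10": percentile(column, 0.10),      # lower whisker
            "p90": percentile(column, 0.90),      # upper whisker
        })
    return summary


# Four made-up reads whose quality drops towards the 3' end.
summary = per_base_quality(["II5", "I?5", "I?+", "I5+"])
for row in summary:
    print(row)
```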
Tool: FastQC – QC reports overview
• Per base sequence quality
(plot: Phred Q score against position in read (bp))
Tool: FastQC – QC reports overview
• Per Sequence Quality
  Average quality > 30
• Per Base Sequence Content
• Per Base N Content
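A rough sketch of what these three modules measure: the mean quality of each read, and the percentage of each base (including N) at each position. Phred+33 encoding and the helper names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def mean_read_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score over one read (per sequence quality)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def per_base_content(sequences: List[str]) -> List[Dict[str, float]]:
    """Percentage of A, C, G, T and N at each read position."""
    longest = max(len(s) for s in sequences)
    table = []
    for position in range(longest):
        counts = Counter(s[position] for s in sequences if len(s) > position)
        total = sum(counts.values())
        table.append({base: 100 * counts.get(base, 0) / total for base in "ACGTN"})
    return table


sequences = ["ACGT", "ACGN", "ATGT", "ACGT"]
content = per_base_content(sequences)
print(content[3])                 # last position: T and N content
print(mean_read_quality("II55"))
```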
Tool: FastQC – QC reports overview
• Sequence Length Distribution
  All reads must be of the same length
• Per Sequence GC Content
  Should look like a normal distribution
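Both checks reduce to simple counting. The following is a minimal sketch with invented helper names: one function tabulates read lengths, the other gives the GC percentage of a single read, whose distribution over all reads is what the GC plot shows.

```python
from collections import Counter
from typing import Dict, List


def length_distribution(sequences: List[str]) -> Dict[int, int]:
    """Number of reads observed for each read length."""
    return dict(Counter(len(s) for s in sequences))


def gc_content(sequence: str) -> float:
    """Percentage of G and C bases in one read."""
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100 * gc / len(sequence)


sequences = ["GGCC", "ATAT", "ACGT"]
lengths = length_distribution(sequences)
gc_values = [gc_content(s) for s in sequences]
print(lengths, gc_values)
```

A single key in `lengths` means all reads have the same length.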
Raw data pre-processing
Raw Data
• The grey lines represent raw read sequences
Raw Data
• Raw read sequences
Good Reads
Raw Data
• Raw read sequences
Good Reads
Not So Good Reads
FastQC Results (Raw Reads)
Our example sample analysed using FastQC:
• reads of 51 bp
• average quality = 26
=> Poor quality at the ends of reads can be removed using read filtering or read trimming tools
Reads pre-processing
1) Reads filtering
2) Reads trimming
3) Adaptive trimming
4) Adaptive trimming followed by read filtering
Method 1: Reads Filtering
• Reads with poor quality ends can be discarded using FILTERING tools
Good Reads
Not So Good Reads
Method 1: Reads Filtering
• Reads with poor quality ends can be discarded using FILTERING tools
Good Reads
AFTER FILTERING: sample overall quality ↑, number of reads ↓
Method 1: Reads Filtering
• FastQC graph
• Reads with poor quality ends are removed from the dataset (1)
• But some good quality positions are lost (2)
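A minimal sketch of read filtering. The criterion used here is an assumption for illustration, not the rule of any particular tool: a read is discarded when any of its last `tail` bases scores below `min_quality`, with Phred+33 encoding.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def has_poor_end(quality: str, min_quality: int = 20, tail: int = 5,
                 offset: int = 33) -> bool:
    """True when any of the last `tail` bases scores below `min_quality`."""
    return any(ord(char) - offset < min_quality for char in quality[-tail:])


def filter_reads(reads: List[Read], **criteria) -> List[Read]:
    """Keep only the reads whose 3' end passes the quality criterion."""
    return [read for read in reads if not has_poor_end(read[1], **criteria)]


reads = [
    ("ACGTACGT", "IIIIIIII"),  # good read
    ("ACGTACGT", "IIIII###"),  # good start, poor end: whole read is dropped
]
kept = filter_reads(reads)
print(len(reads), "->", len(kept))
```

The second read is lost together with its five good positions.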
Method 2: Reads Trimming
• Poor quality ends of reads can be cut using TRIMMING tools
Good Reads
Rather Good Reads
Length cut-off threshold
Method 2: Reads Trimming
• Poor quality ends of reads can be cut using TRIMMING tools
Good Reads
Rather Good Reads
Trimmed reads are of better quality than raw reads but they are shorter!
Method 2: Reads Trimming
• FastQC graph
Cut-off threshold: length = 30 bp
• Reads are now of different lengths
• Number of trimmed reads = number of raw reads
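Trimming at a fixed length cut-off can be sketched in a few lines. The 30 bp value comes from the slide; the reads are made up, and reads already shorter than the cut-off are left as they are.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def trim_to_length(reads: List[Read], length: int = 30) -> List[Read]:
    """Cut every read (and its quality string) at the same position."""
    return [(sequence[:length], quality[:length]) for sequence, quality in reads]


raw_reads = [("A" * 51, "I" * 51), ("C" * 25, "I" * 25)]
trimmed = trim_to_length(raw_reads, length=30)
print([len(sequence) for sequence, _ in trimmed])
print(len(trimmed) == len(raw_reads))
```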
Method 3: ADAPTIVE Trimming
• A more precise TRIMMING approach
Good Reads
Rather Good Reads
(figure: scissors ✁ mark the cut point on each read, labelled 25)
Method 3: ADAPTIVE Trimming
• A more precise TRIMMING approach
Good Reads
Rather Good Reads
Trimmed reads are of better quality than raw reads but they are of variable lengths.
Some reads can be extremely short!
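One simple way to trim adaptively is to remove bases from the 3' end of each read for as long as their quality is below a threshold. The sketch below assumes Phred+33 encoding, trims the 3' end only, and uses 25 as the threshold (taken from the labels in the slide figure). Published trimmers use more elaborate rules such as sliding windows.

```python
from typing import Tuple


def adaptive_trim(sequence: str, quality: str, min_quality: int = 25,
                  offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below `min_quality`."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


print(adaptive_trim("ACGTACGT", "IIIII5+#"))  # poor 3' end is cut
print(adaptive_trim("ACGTAC", "I#IIII"))      # low base inside the read is kept
print(adaptive_trim("ACGT", "####"))          # nothing survives
```

Each read gets its own cut point, which is why the output lengths vary and some reads end up extremely short or empty.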
Method 3: ADAPTIVE Trimming
• FastQC graph
• Poor quality ends are removed from the dataset
• But the good quality positions are kept (yeah!)
Method 4: ADAPTIVE Trimming + filtering
• Poor quality at the ends can be cut using TRIMMING AND FILTERING tools
Good Reads
Rather Good Reads
Length cut-off threshold
(figure: scissors ✁ mark the cut point on each read)
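Combining the two steps, a sketch under the same assumptions as before (Phred+33, 3' end trimming only, illustrative thresholds): each read is trimmed adaptively, then dropped if what remains is shorter than the length cut-off.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def trim_and_filter(reads: List[Read], min_quality: int = 25,
                    min_length: int = 30, offset: int = 33) -> List[Read]:
    """Adaptive 3' trimming followed by a minimum-length filter."""
    kept = []
    for sequence, quality in reads:
        end = len(quality)
        while end > 0 and ord(quality[end - 1]) - offset < min_quality:
            end -= 1
        if end >= min_length:  # length cut-off threshold
            kept.append((sequence[:end], quality[:end]))
    return kept


reads = [
    ("ACGTACGT", "IIIIII##"),  # trimmed to 6 bases, kept
    ("ACGTACGT", "II######"),  # trimmed to 2 bases, discarded
]
result = trim_and_filter(reads, min_quality=25, min_length=4)
print(result)
```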
Clipping
• Trimming sequencing adapters/primers (clipping)
Clipping is necessary when Read Length > Insert Size
(figure: adapter and sequencing primer, www.ecseq.com)
(figure: library size distribution, Agilent Bioanalyzer)
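When the read is longer than the insert, the sequencer reads into the adapter, so the 3' end of the read contains all or part of the adapter sequence. The sketch below clips it using exact matching only; real clipping tools tolerate mismatches. The adapter string and `min_overlap` value are arbitrary choices for this example.

```python
from typing import Tuple

ADAPTER = "AGATCGG"  # arbitrary example adapter, not tied to a specific kit


def clip_adapter(sequence: str, quality: str, adapter: str = ADAPTER,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Cut the read at the adapter, or at an adapter prefix at its 3' end."""
    position = sequence.find(adapter)
    if position == -1:
        # No full adapter: look for the longest adapter prefix ending the read.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                position = len(sequence) - overlap
                break
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


print(clip_adapter("ACGTACGTAGATCGGTT", "I" * 17))  # full adapter inside the read
print(clip_adapter("ACGTACGTAGAT", "I" * 12))       # partial adapter at the end
print(clip_adapter("ACGTACGT", "I" * 8))            # insert longer than read
```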
Reads Pre-Processing
• Trimming sequencing adapters/primers (clipping) and poor quality ends, and filtering short reads
• Dozens of published tools! AlienTrimmer, Cutadapt, ConDeTri, FastX, Sickle, SolexaQA, Trimmomatic…
  They use different methods and algorithms and offer a lot of options
• If PE reads: filter unpaired reads
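For paired-end data, a read whose mate was discarded during filtering becomes unpaired. The sketch below keeps only complete pairs. It assumes both files have been loaded into dictionaries keyed by read name, with any /1 and /2 suffixes already stripped; the names and sequences are made up.

```python
from typing import Dict, List, Tuple


def keep_complete_pairs(forward: Dict[str, str], reverse: Dict[str, str]
                        ) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Return (name, forward, reverse) for complete pairs, plus orphan names."""
    paired = [(name, forward[name], reverse[name])
              for name in forward if name in reverse]
    orphans = sorted(set(forward) ^ set(reverse))
    return paired, orphans


forward_reads = {"read1": "ACGT", "read2": "GGCC"}
reverse_reads = {"read1": "TTAA", "read3": "CCGG"}
pairs, unpaired = keep_complete_pairs(forward_reads, reverse_reads)
print(pairs, unpaired)
```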
Reads Pre-Processing
There is no golden method / tool → DEPENDS ON THE APPLICATION
• Trimming poor quality ends: a few examples from Del Fabbro et al.
  • RNA-Seq (Homo sapiens)
    “SolexaQA achieves the highest quality while keeping the highest amount of reads”
  • Assembly (Prunus persica)
    “Read trimming affects only partially genome assembly results”
    “Stringent trimming tends to heavily remove data and decrease overall assembly quality”
  • SNP identification (Prunus persica and Saccharomyces cerevisiae)
    “All trimmers drastically reduce the percentage of alternative allele nucleotides (…) bringing this false positive call indicator from 30% to 10%”