DNA-Seq analysis pipeline: Introduction
C3BI Hands-on NGS course, Institut Pasteur Paris, Nov 21 – Dec 2, 2016. Emna Achouri. C3BI Hands-on NGS course – IPP – 23rd Nov 2016. DNA-Seq analysis pipeline: Introduction


Page 1: DNA-Seq analysis pipeline: Introduction

Amel Ghouila, Claudia Chica, Emna Achouri, Fatma Guerfali

C3BI Hands-on NGS course – IPP – 23rd Nov 2016

Page 2: Overview of Analysis Workflow

1. FASTQ files
   • FastQC: quality control of reads
   • Trimming / filtering of bad quality reads
2. Mapping of reads to a reference genome, or assembly (de novo reconstruction of a genome)
3. SAM files / BAM files
4. Read depth, variant calling, structural variations, gene/chromosome CNV
5. VCF files: SNPs, indels

Annotation and visualization

FASTA file, GFF file

Page 3: File Formats

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

• Text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF)
• Binary (BAM, BCF, SFF (454 sequencer data))

We can view and manipulate text-based formats without special tools, but we will need specific tools to access and view binary formats. The text-based formats are often compressed to save space, making them de facto binary, but they are still easy to read by eye using standard Unix tools.
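To illustrate the point about compressed text formats, here is a minimal Python sketch that reads the first lines of a text-based file whether or not it is gzip-compressed. The function name `first_lines` and the example file name are illustrative assumptions, not part of the course material.

```python
import gzip


def first_lines(path, n=4):
    """Return the first n lines of a plain or gzip-compressed text file."""
    # gzip files start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    # "rt" opens the file in text mode in both cases.
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]


# Hypothetical usage: first_lines("reads.fastq.gz") returns the first FASTQ record.
```

A binary format such as BAM cannot be inspected this way and needs a dedicated tool.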

Page 4: Today's program

• Reads QC and pre-processing
• Assembly
• Mapping
• Introduction to file formats
• Mapping quality check

Page 5: Quality Analysis

Raw reads QC

Page 6: Overview of Analysis Workflow

1. FASTQ files

Page 7: FastQ files

Raw sequence data: FastQ files

What is a FastQ file?

FASTQ = FASTA + Quality

FastQ format is a text-based format for storing both a biological sequence and its corresponding quality scores.

Page 8: FASTQ File Format

• A FASTA file uses at least 2 lines per sequence:
  • Line 1 begins with a '>' and is followed by a sequence identifier
  • Line 2 is the sequence letters

• A FASTQ file uses four lines per sequence:
  • Line 1 begins with a '@' and is followed by a sequence identifier
  • Line 2 is the sequence letters
  • Line 3 begins with a '+' and is rarely followed by the sequence identifier
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence

Example record:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAA
    +
    CCCFFFFFHHHHHJJJJHJJJIJJJJJJJJJJJJJJJJJJJJJJJGJJJJ
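The four-line rule can be checked programmatically. The sketch below is a minimal parser, assuming unwrapped FASTQ (exactly four lines per record); the function name `parse_fastq` and the short test record are illustrative, not taken from the course.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@"):
            raise ValueError(f"record {record_number}: line 1 must start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {record_number}: line 3 must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {record_number}: sequence and quality lengths differ")
        # Drop the leading '@' from the identifier.
        yield header[1:], sequence, quality


records = list(parse_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

Here `records` holds one tuple per read; a record whose quality line is shorter than its sequence raises a `ValueError`.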

Page 9: Phred Quality Score

• The quality score of a base is called the Phred score (or Q score). It is an integer value representing the estimated probability of an error. If P is the error probability, then: $Q = -10 \log_{10} P$

• Phred or Q scores are often represented as ASCII characters. Starting with Illumina 1.8, the quality scores have basically returned to the use of the Sanger format (Phred+33).

• The Phred encoding version is independent of the sequencing technology version.

Encoding chart (source: https://en.wikipedia.org/wiki/FASTQ_format#Quality):

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................

..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................

...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................

.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.....................................................

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

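Legend for the rows above, from top to bottom:

• S: Sanger (Phred+33)
• X: Solexa (Solexa+64)
• I: Illumina 1.3+ (Phred+64)
• J: Illumina 1.5+ (Phred+64)
• L: Illumina 1.8+ (Phred+33)

The last row lists the ASCII characters with codes 33 ('!') to 126 ('~'); the tick marks in the original figure fall at ASCII codes 33, 59, 64, 73, 104 and 126.

Phred score scale (Illumina 1.8+): 0.2......................26...31........41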
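The relation between Q and P, and the ASCII encoding, can be sketched in a few lines of Python. The function names are illustrative assumptions; the offsets 33 and 64 are those of the Phred+33 and Phred+64 encodings named on this slide.

```python
import math


def error_probability(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Decode an ASCII quality string into a list of Phred scores.

    offset=33 corresponds to Sanger / Illumina 1.8+, offset=64 to Illumina 1.3+ and 1.5+.
    """
    return [ord(char) - offset for char in quality_string]


scores = decode_quality("!5I")  # characters with ASCII codes 33, 53 and 73
```

With the default offset, `scores` is `[0, 20, 40]`. Decoding the same string with the wrong offset would give negative scores, which is a quick way to spot an encoding mix-up.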

Page 10: Phred Quality Score

Illumina 1.8+

[Figure: ASCII quality characters '!' (code 33) to '~' (code 126); with the Illumina 1.8+ encoding (Phred+33), Phred scores 0 to 41 correspond to the characters '!' to 'J'.]

Q scores are defined as a property that is logarithmically related to the base-calling error probabilities (P).

Position quality score:

• 30–40: ✔ Good
• 20–30: depends on overall quality
• <20: 🗑 Discard

http://www.illumina.com
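As a rough illustration of these thresholds: Q = 20 corresponds to an error probability of 1 in 100 and Q = 30 to 1 in 1000. The sketch below maps a score to the three categories; the function name is hypothetical, and assigning the boundary values 20 and 30 to the higher category is an assumption, since the slide does not say which side they belong to.

```python
def classify_position_quality(q):
    """Map a Phred score to the three categories listed on the slide."""
    if q >= 30:
        return "good"
    if q >= 20:
        return "depends on overall quality"
    return "discard"


# Error probability for each threshold, from P = 10^(-Q/10).
threshold_error = {q: 10 ** (-q / 10) for q in (20, 30, 40)}
```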

Page 11: Overview of Analysis Workflow

• FastQC: quality control of reads
• Trimming / filtering of bad quality reads

Page 12: Tool: FastQC

FastQC: Quality Control for FastQ files

• GUI, command line, available on Galaxy
• Graphic reports

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://dnacore.missouri.edu/PDF/FastQC_Manual.pdf

Page 13: Tool: FastQC

• Quality control checks on raw sequence data
• Input: FastQ files (or BAM/SAM)
• Output: Summary graphs allowing quick data assessment
• FastQC can be run in 2 modes:
  • Standalone interactive application ← small number of samples
  • Non-interactive mode ← large number of samples

FastQC can be integrated in ANALYSIS PIPELINES
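One possible way to call FastQC non-interactively from a pipeline script is sketched below. It assumes the `fastqc` executable is installed and on the PATH, and uses its `--outdir` and `--threads` options; the helper names and file names are hypothetical.

```python
import os
import shutil
import subprocess


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Build the FastQC command line for a batch of input files."""
    return ["fastqc", "--outdir", str(outdir), "--threads", str(threads),
            *[str(path) for path in fastq_files]]


def run_fastqc(fastq_files, outdir, threads=1):
    """Run FastQC in non-interactive mode; raises if the tool is not installed."""
    if shutil.which("fastqc") is None:
        raise FileNotFoundError("fastqc executable not found on PATH")
    # FastQC expects the output directory to exist already.
    os.makedirs(outdir, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)


command = build_fastqc_command(["sample_R1.fastq.gz"], "qc_reports", threads=2)
```

Separating command construction from execution keeps the first part testable on machines where FastQC is not installed.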

Page 14: Tool: FastQC – QC reports overview

A summary of the modules which were run, and a quick evaluation of whether the results seem entirely normal, slightly abnormal or very unusual.

Flags shown in the report: Normal, Reasonable, Unexpected

Page 15: Tool: FastQC – QC reports overview

• Basic Statistics
  • ASCII quality encoding format (Phred format)
  • Number of reads, read length, …

Page 16: Tool: FastQC – QC reports overview

• Per base sequence quality

Overview of the range of quality values (Phred scores) across all bases from all reads, at each position in the read.

[Figure: box plot of Phred Q score (y-axis) against position in read (bp, x-axis).]

• Red line: median value
• Yellow box: inter-quartile range (25–75%)
• Upper and lower whiskers: 10% and 90% points
• Blue line: mean quality

In most platforms, the quality is lower at the beginning/end of reads -> This is common!
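The statistics drawn in this plot can be approximated with a short Python sketch. It assumes Phred+33 quality strings and uses a simple nearest-rank percentile, which is an assumption and may differ from how FastQC itself computes its box plots; all names are illustrative.

```python
import math
from statistics import mean, median


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def per_position_quality(quality_strings, offset=33):
    """Summarise Phred scores at each read position across all reads."""
    columns = {}
    for quality in quality_strings:
        for position, char in enumerate(quality):
            columns.setdefault(position, []).append(ord(char) - offset)
    summary = []
    for position in sorted(columns):
        scores = sorted(columns[position])
        summary.append({
            "position": position + 1,          # 1-based position in read
            "mean": mean(scores),              # blue line
            "median": median(scores),          # red line
            "q25": percentile(scores, 25),     # lower edge of yellow box
            "q75": percentile(scores, 75),     # upper edge of yellow box
            "p10": percentile(scores, 10),     # lower whisker
            "p90": percentile(scores, 90),     # upper whisker
        })
    return summary


summary = per_position_quality(["II5", "I5+", "II"])
```

Reads of different lengths are handled naturally: later positions are simply summarised over fewer reads.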

Page 17: Tool: FastQC – QC reports overview

• Per base sequence quality

[Figure: Phred Q score (y-axis) against position in read (bp, x-axis).]

Page 18: Tool: FastQC – QC reports overview

• Per Sequence Quality: average quality > 30
• Per Base Sequence Content
• Per Base N Content

Page 19: Tool: FastQC – QC reports overview

• Sequence Length Distribution: all reads must be of the same length
• Per Sequence GC Content: should look like a normal distribution
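Both checks are easy to reproduce on a list of read sequences. The sketch below computes the read length distribution and the per-read GC fraction; the function names and the toy reads are illustrative assumptions.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def length_distribution(sequences):
    """Count how many reads have each length."""
    return Counter(len(sequence) for sequence in sequences)


reads = ["ACGT", "ACG", "TTTT"]
lengths = length_distribution(reads)
gc_values = [gc_fraction(read) for read in reads]
```

A single key in `lengths` means all reads share one length; a histogram of `gc_values` is what the per-sequence GC content plot displays.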

Page 20: Raw data pre-processing

Page 21: Raw Data

• The grey lines represent raw read sequences

Page 22: Raw Data

• Raw read sequences

[Figure labels: Good Reads]

Page 23: Raw Data

• Raw read sequences

[Figure labels: Good Reads, Not So Good Reads]

Page 24: FastQC Results (Raw Reads)

Our example sample analysed using FastQC:

• reads of 51 bp
• average quality = 26

=> Poor quality at the ends of reads can be removed using read filtering or read trimming tools

Page 25: Reads pre-processing

1) Reads filtering
2) Reads trimming
3) Adaptive trimming
4) Adaptive trimming followed by read filtering

Page 26: Method 1: Reads Filtering

• Reads with poor quality ends can be discarded using FILTERING tools

[Figure labels: Good Reads, Not So Good Reads]

Page 27: Method 1: Reads Filtering

• Reads with poor quality ends can be discarded using FILTERING tools

[Figure labels: Good Reads]

AFTER FILTERING:

• Sample overall quality ↑
• Number of reads ↓
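A minimal sketch of read filtering, assuming Phred+33 qualities: a read is kept only if a minimum fraction of its bases reaches a minimum quality. This criterion is one common choice and an assumption here, not necessarily the rule used in the course; the names and default values are illustrative.

```python
def passes_filter(quality, min_q=20, min_fraction=0.8, offset=33):
    """Return True if at least min_fraction of the bases have quality >= min_q."""
    scores = [ord(char) - offset for char in quality]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_q)
    return good / len(scores) >= min_fraction


def filter_reads(reads, min_q=20, min_fraction=0.8, offset=33):
    """Keep whole reads that pass the filter; failing reads are discarded entirely."""
    return [(name, sequence, quality)
            for name, sequence, quality in reads
            if passes_filter(quality, min_q, min_fraction, offset)]


reads = [("r1", "ACGTA", "IIIII"), ("r2", "ACGTA", "II###")]
kept = filter_reads(reads)
```

Only `r1` survives: the overall quality of the sample goes up while the number of reads goes down, and the good positions at the start of `r2` are lost with it.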

Page 28: Method 1: Reads Filtering

• FastQC graph

[Figure: FastQC per base quality graph after filtering, with regions marked (1) and (2).]

• Reads with poor quality ends are removed from the dataset (1)
• But some good quality positions are lost (2)

Page 29: Reads pre-processing

1) Reads filtering
2) Reads trimming
3) Adaptive trimming
4) Adaptive trimming followed by read filtering

Page 30: Method 2: Reads Trimming

• Poor quality ends of reads can be cut using TRIMMING tools

[Figure labels: Good Reads, Rather Good Reads, Length cut-off threshold]

Page 31: Method 2: Reads Trimming

• Poor quality ends of reads can be cut using TRIMMING tools

[Figure labels: Good Reads, Rather Good Reads]

Trimmed reads are of better quality than raw reads but they are shorter!

Page 32: Method 2: Reads Trimming

• FastQC graph

Cut-off threshold: length = 30 bp

• Reads are now of a different length
• Number of trimmed reads = number of raw reads
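Fixed-length trimming amounts to cutting every read at the same position. The sketch below uses the 30 bp cut-off from this slide and a 51 bp toy read like the example sample; the function name and the read content are hypothetical.

```python
def trim_to_length(reads, length=30):
    """Cut every read (sequence and quality) at the same fixed length."""
    return [(name, sequence[:length], quality[:length])
            for name, sequence, quality in reads]


raw = [("r1", "A" * 51, "I" * 40 + "#" * 11)]
trimmed = trim_to_length(raw, length=30)
```

No read is discarded, so the read count is unchanged; only the read length differs.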

Page 33: Reads pre-processing

1) Reads filtering
2) Reads trimming
3) Adaptive trimming
4) Adaptive trimming followed by read filtering

Page 34: Method 3: ADAPTIVE Trimming

• A more precise TRIMMING approach

[Figure: each read is cut (✁) at its own position; the cuts are labelled 25. Labels: Good Reads, Rather Good Reads.]

Page 35: Method 3: ADAPTIVE Trimming

• A more precise TRIMMING approach

[Figure labels: Good Reads, Rather Good Reads]

Trimmed reads are of better quality than raw reads but they are of variable lengths.

Some reads can be extremely short!

Page 36: Method 3: ADAPTIVE Trimming

• FastQC graph

• Poor quality ends are removed from the dataset
• But the good quality positions are kept (yeah!)
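A deliberately simple sketch of adaptive trimming: bases are removed from the 3' end of each read for as long as their quality is below a threshold. This is not the algorithm of any specific published tool. The threshold of 25 is taken from the figure label on the earlier slide, on the assumption that it denotes a quality cut-off; Phred+33 encoding is also assumed.

```python
def trim_trailing_low_quality(sequence, quality, min_q=25, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


trimmed = trim_trailing_low_quality("ACGTAC", "IIII##")
```

Each read is cut at its own position, so good bases are kept, read lengths become variable, and a read whose bases are all of low quality ends up empty.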

Page 37: Reads pre-processing

1) Reads filtering
2) Reads trimming
3) Adaptive trimming
4) Adaptive trimming followed by read filtering

Page 38: Method 4: ADAPTIVE Trimming + filtering

• Poor quality at the ends can be cut using TRIMMING AND FILTERING tools

[Figure: each read is cut (✁) at its own position; reads shorter than the length cut-off threshold are then removed. Labels: Good Reads, Rather Good Reads, Length cut-off threshold.]
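Combining the two steps can be sketched as follows: trim each read from the 3' end, then discard reads that fall below a minimum length. The thresholds, the Phred+33 encoding and the function name are assumptions for illustration.

```python
def trim_and_filter(reads, min_q=25, min_length=30, offset=33):
    """Trim low-quality 3' ends, then drop reads shorter than min_length."""
    kept = []
    for name, sequence, quality in reads:
        end = len(quality)
        # Adaptive trimming: walk back from the 3' end past low-quality bases.
        while end > 0 and ord(quality[end - 1]) - offset < min_q:
            end -= 1
        # Length filter: keep only reads that are still long enough.
        if end >= min_length:
            kept.append((name, sequence[:end], quality[:end]))
    return kept


reads = [("r1", "ACGTAC", "IIII##"), ("r2", "ACGTAC", "I#####")]
kept = trim_and_filter(reads, min_length=3)
```

With a toy length threshold of 3, `r1` is kept as a 4-base read and `r2`, reduced to a single base, is discarded.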

Page 39: Clipping

• Trimming sequencing adapters/primers (clipping)

Clipping is necessary when Read Length > Insert Size

[Figure: read structure showing the adapter and the sequencing primer (www.ecseq.com), and a library size distribution (Agilent Bioanalyzer).]
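When the read is longer than the insert, the 3' end of the read runs into the adapter. The sketch below clips an adapter by exact matching only, either in full inside the read or as a partial prefix at the 3' end; real clipping tools also tolerate mismatches. The adapter string, the `min_overlap` parameter and the function name are illustrative assumptions.

```python
def clip_adapter(sequence, quality, adapter, min_overlap=3):
    """Remove an adapter (exact match) and everything after it from a read."""
    # Case 1: the full adapter occurs inside the read.
    position = sequence.find(adapter)
    if position != -1:
        return sequence[:position], quality[:position]
    # Case 2: the read ends with the first `overlap` bases of the adapter.
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            cut = len(sequence) - overlap
            return sequence[:cut], quality[:cut]
    # No adapter found: return the read unchanged.
    return sequence, quality


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, for illustration only
read = "ACGTACGT" + ADAPTER + "TT"
clipped = clip_adapter(read, "I" * len(read), ADAPTER)
```

A small `min_overlap` removes more adapter remnants but also clips genuine bases that match the adapter start by chance.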

Page 40: Reads Pre-Processing

• Trimming sequencing adapters/primers (clipping) and poor quality ends, and filtering short reads

• Dozens of published tools! AlienTrimmer, Cutadapt, ConDeTri, FastX, Sickle, SolexaQA, Trimmomatic…

They use different methods and algorithms and offer a lot of options

• If PE reads: filter unpaired reads
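For paired-end (PE) data, a read whose mate was discarded during pre-processing becomes unpaired. The sketch below keeps only pairs where both mates survived. It assumes mates share the same identifier and that each file's reads are held in a dictionary; both the data layout and the function name are illustrative assumptions.

```python
def keep_proper_pairs(forward, reverse):
    """Keep only reads whose mate is present in the other file.

    forward and reverse map a read identifier to a (sequence, quality) tuple.
    """
    shared = [name for name in forward if name in reverse]
    paired_forward = {name: forward[name] for name in shared}
    paired_reverse = {name: reverse[name] for name in shared}
    return paired_forward, paired_reverse


forward = {"r1": ("ACGT", "IIII"), "r2": ("TTGA", "IIII")}
reverse = {"r1": ("TGCA", "IIII"), "r3": ("GGCC", "IIII")}
paired_forward, paired_reverse = keep_proper_pairs(forward, reverse)
```

Both outputs list the surviving pairs in the same order, which keeps the two files synchronised for downstream tools that expect it.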

Page 41: Reads Pre-Processing

There is no golden method / tool

→ DEPENDS ON THE APPLICATION

• Trimming poor quality ends: a few examples from Del Fabbro et al.

• RNA-Seq (Homo sapiens): “SolexaQA achieves the highest quality while keeping the highest amount of reads”

• Assembly (Prunus persica): “Read trimming affects only partially genome assembly results”; “Stringent trimming tends to heavily remove data and decrease overall assembly quality”

• SNP identification (Prunus persica and Saccharomyces cerevisiae): “All trimmers drastically reduce the percentage of alternative allele nucleotides (…) bringing this false positive call indicator from 30% to 10%”