sequence preprocessing: a perspective - github pages · qa/qc •beyond generating ‘better’...

13
Sequence Preprocessing: A perspective Dr. Matthew L. Settles Genome Center University of California, Davis [email protected]

Upload: others

Post on 08-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

Sequence Preprocessing: A perspective

Dr. Matthew L. Settles

Genome CenterUniversity of California, Davis

[email protected]

Page 2: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

WhyPreprocessreads

• Wehavefoundthataggressively“cleaning”andprocessingreadscanmakealargedifferencetothespeed andquality ofassemblyandmappingresults.Cleaningyourreadsmeans,removingreads/basesthatare:• otherunwantedsequence(polyA tailsinRNA-seq data)• artificiallyaddedontosequenceofprimaryinterest(vectors,adapters,primers)

• joinshortoverlappingpaired-endreads• lowqualitybases• originatefromPCRduplication• notofprimaryinterest(contamination)

• Preprocessingalsoproducesanumberofstatisticsthataretechnicalinnaturethatshouldbeusedtoevaluate“experimentalconsistancy”

Page 3: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

ReadPreprocessingstrategies,manyovertime

• Identityandremovecontaminantandvectorreads• Readswhichappeartofullycomefromextraneoussequenceshouldberemoved.

• Qualitytrim/cut• “end”trimareaduntiltheaveragequality>Q(Lucy)• removeanyreadwithaveragequality<Q

• eliminatesingletons/duplicates• Ifyouhaveexcessdepthofcoverage,andparticularlyifyouhaveatleastx-foldcoveragewherexisthereadlength,theneliminatingsingletonsisanicewayofdramaticallyreducingthenumberoferror-pronereads.

• Readwhichappearthesame(particularlypaired-end)areoftenmorelikelyPCRduplicatesandthereforredundantreads.

• eliminateallreads(pairs)containingan“N”character• Ifyoucanaffordthelossofcoverage,youmightthrowawayallreadscontainingNs.

• Identityandtrimoffadapterandbarcodesifpresent• Believeitornot,thesoftwareprovidedbyIllumina,eitherdoesnotlookfor,ordoesamediocrejobof,identifyingadaptersandremovingthem.

Page 4: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

DNA/RNA,couldcontain‘contamination’Libraryprep,fragmentation,adapteraddition

PCRenrichment

FinalLibrary,sizedistribution PossibleadditionofphiX SequencingCharacteristics/Quality

Page 5: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

Preprocessing• Mapreadstocontaminants/PhiX andextractunmappedreads[bowtie2--local

• Removecontaminants(atleastPhiX),usesbowtie2thenextractsallreads(pairs)thataremarkedasunmapped.

• Super-Deduper [PEreadsonly]• RemovePCRduplicates(weusebases10-35ofeachpairedread)

• FLASH2[ PEreadsonly]• Joinandextend,overlappingpairedendreads• Ifreadscompletelyoverlaptheywillcontainadapter,removeadapters• Identifyandremoveanyadapterdimerspresent

• Scythe[SEReadsonly]• Identifyandremoveadaptersequence

• Sickle• Trimsequences(5’and3’)byqualityscore(IlikeQ20)

• cleanup• RunapolyA/Ttrimmer• Removeanyreadsthatarelessthentheminimumlengthparameter• Producepreprocessingstatistics

Page 6: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

WhyScreenforPhiX

• PhiX isacommoncontrolinIlluminaruns,facilitiesrarelytellyouif/whenPhiX hasbeenspikedin

• Doesnothaveabarcode,sointheoryshouldnotbeinyourdata

• However• WhenIknowPhiX hasbeenspikedin,Ifindsequenceeverytime• WhenIknowPhiX hasnotbeenspikedin,Idonot findsequence

• Bettersafethansorryandscreenforit.

Page 7: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

SuperDeduper

https://github.com/dstreett/Super-Deduper

Read1

Read2

Data AlignmentAlgorithm

MarkDuplicates Rmdup SuperDeduper FastUniq Fulcrum Total#ofReads

PhiX BWAMEM 1,048,278(0.25%)

1,011,145(1.05%)

1,156,700(13.7%)

4,202,526 3,092,155 4,750,299

Bowtie2Local 1,054,725(6.62%)

948,784(10.2%)

1,166,936(14.0%)

4,236,647 3,103,872 4,790,972

Bowtie2Global 799,524(0%)

800,868(0.12%)

896,487(9.92%)

3,768,641 2,704,114 4,293,787

Acroporadigitifera

BWAMEM 5,132,111(2.26%)

6,906,634(44.5%)

5,133,339(10.2%)

12,968,469 2,103,567 54,108,240

Bowtie2Local 4,688,809(4.03%)

5,931,862(38.9%)

3,971,743(9.32%)

9,893,903 4,259,619 41,728,154

Bowtie2Global 1,457,865(3.62%)

1,512,966(24.2%)

1,185,838(11.4%)

3,014,498 1,286,031 11,600,847

Page 8: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

SuperDeduper

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00False Positive Rate

True

Pos

itive

Rat

e

10

0.875

0.900

0.925

0.950

0.975

0.03 0.04 0.05 0.06 0.07False Positive Rate

StartPosition

1

5

20

40

150

Figure 1: ROC curves. Only a representative subset of the different start positions is shown. The image on the left shows the full ROC curves and the image on the left is a zoomed in view of corner of the curves. Each curve represents a start position and each point represents a length. The labeled point in the image on the right is the default start and length for

Super Deduper.

We calculated the Youden Index for every combination tested and the point that acquired the highest index value (as compared to Picard MarkDuplicates) occurred at a start position of 5bp and a length of 10bps (20bp total over both reads)

Page 9: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

Flash2– overlappingofreadsandadapterremovalinpairedendreads

TargetRegion

Read1

InsertsizeRead2

TargetRegion

Read1

InsertsizeRead2

TargetRegion

Read1

InsertsizeRead2

Insertsize>lengthofthenumberofcycles

Insertsize<lengthofthenumberofcycles(10bpmin)

Insertsize<lengthofthereadlength

Product:ReadPair

Product:Extended,Single

Product:AdapterTrimmed,Single

https://github.com/dstreett/FLASH2

Page 10: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

QualityTrimming- Sickle

Remove“poor”qualitysequencefromboththe5’and3’ends

Page 11: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

RibosomalRNA

• RibosomalRNAmakesup90%ormoreofatypicaltotalRNAsample.• LibraryprepmethodsreducetherRNA representationinasample

• oligoDt onlybindstopolyA tailstoenrichasampleformRNA• Ribo-depletionbindsrRNA sequences

Neithertechniqueis100%efficient

Canscreen(mapreadstorRNA sequences)todeterminerRNAefficiencyandpotentiallyremovethosereads.

Page 12: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

EffectsinVariantAnalysis

• Readsresultingfromcontaminantsmayaligntothegenome• PCRduplicatereadsareclearlyredundantdata• Overlappingreadsalsoproduceredundantandpossiblyerrorpronebasepairs

• Freebayes doesnotpseudooverlapreads• Samtools

• Setsoneofthebasepairqualitiesto0andtheothertothesumofbothqualities• Inconflict,setsbothqualitiesto0

• GATKUnifiedGenotype• Triestomodelthereads

• GATKHaplotypeCaller• Setsoneofthebasepairqualitiesto0andtheothertothesumofbothqualities• Inconflict,setsbothqualitiesto0

Page 13: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality

QA/QC

• Beyondgenerating‘better’datafordownstreamanalysis,cleaningstatisticsalsogiveyouanideaastothequalityofthesample,librarygeneration,andsequencingqualityusedtogeneratethedata.

• Thiscanhelpinformyouofwhatyoumightdointhefuture.• I’vefounditbesttoperformQA/QConboththerunasawhole(poorsamplescanaffectothersamples)andonthesamplesthemselvesastheycomparetoothersamples (REMEMBER,BECONSISTANT).

• ReportssuchasBasespace forIllumina,aregreatwaystoevaluatetherunsasawhole.

• PCA/MDSplotsofthepreprocessingsummaryareagreatwaytolookfortechnicalbiasacrossyourexperiment