![Page 1: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/1.jpg)
Sequence Preprocessing: A perspective
Dr. Matthew L. Settles
Genome CenterUniversity of California, Davis
![Page 2: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/2.jpg)
WhyPreprocessreads
• Wehavefoundthataggressively“cleaning”andprocessingreadscanmakealargedifferencetothespeed andquality ofassemblyandmappingresults.Cleaningyourreadsmeans,removingreads/basesthatare:• otherunwantedsequence(polyA tailsinRNA-seq data)• artificiallyaddedontosequenceofprimaryinterest(vectors,adapters,primers)
• joinshortoverlappingpaired-endreads• lowqualitybases• originatefromPCRduplication• notofprimaryinterest(contamination)
• Preprocessingalsoproducesanumberofstatisticsthataretechnicalinnaturethatshouldbeusedtoevaluate“experimentalconsistancy”
![Page 3: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/3.jpg)
ReadPreprocessingstrategies,manyovertime
• Identityandremovecontaminantandvectorreads• Readswhichappeartofullycomefromextraneoussequenceshouldberemoved.
• Qualitytrim/cut• “end”trimareaduntiltheaveragequality>Q(Lucy)• removeanyreadwithaveragequality<Q
• eliminatesingletons/duplicates• Ifyouhaveexcessdepthofcoverage,andparticularlyifyouhaveatleastx-foldcoveragewherexisthereadlength,theneliminatingsingletonsisanicewayofdramaticallyreducingthenumberoferror-pronereads.
• Readwhichappearthesame(particularlypaired-end)areoftenmorelikelyPCRduplicatesandthereforredundantreads.
• eliminateallreads(pairs)containingan“N”character• Ifyoucanaffordthelossofcoverage,youmightthrowawayallreadscontainingNs.
• Identityandtrimoffadapterandbarcodesifpresent• Believeitornot,thesoftwareprovidedbyIllumina,eitherdoesnotlookfor,ordoesamediocrejobof,identifyingadaptersandremovingthem.
![Page 4: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/4.jpg)
DNA/RNA,couldcontain‘contamination’Libraryprep,fragmentation,adapteraddition
PCRenrichment
FinalLibrary,sizedistribution PossibleadditionofphiX SequencingCharacteristics/Quality
![Page 5: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/5.jpg)
Preprocessing• Mapreadstocontaminants/PhiX andextractunmappedreads[bowtie2--local
• Removecontaminants(atleastPhiX),usesbowtie2thenextractsallreads(pairs)thataremarkedasunmapped.
• Super-Deduper [PEreadsonly]• RemovePCRduplicates(weusebases10-35ofeachpairedread)
• FLASH2[ PEreadsonly]• Joinandextend,overlappingpairedendreads• Ifreadscompletelyoverlaptheywillcontainadapter,removeadapters• Identifyandremoveanyadapterdimerspresent
• Scythe[SEReadsonly]• Identifyandremoveadaptersequence
• Sickle• Trimsequences(5’and3’)byqualityscore(IlikeQ20)
• cleanup• RunapolyA/Ttrimmer• Removeanyreadsthatarelessthentheminimumlengthparameter• Producepreprocessingstatistics
![Page 6: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/6.jpg)
WhyScreenforPhiX
• PhiX isacommoncontrolinIlluminaruns,facilitiesrarelytellyouif/whenPhiX hasbeenspikedin
• Doesnothaveabarcode,sointheoryshouldnotbeinyourdata
• However• WhenIknowPhiX hasbeenspikedin,Ifindsequenceeverytime• WhenIknowPhiX hasnotbeenspikedin,Idonot findsequence
• Bettersafethansorryandscreenforit.
![Page 7: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/7.jpg)
SuperDeduper
https://github.com/dstreett/Super-Deduper
Read1
Read2
Data AlignmentAlgorithm
MarkDuplicates Rmdup SuperDeduper FastUniq Fulcrum Total#ofReads
PhiX BWAMEM 1,048,278(0.25%)
1,011,145(1.05%)
1,156,700(13.7%)
4,202,526 3,092,155 4,750,299
Bowtie2Local 1,054,725(6.62%)
948,784(10.2%)
1,166,936(14.0%)
4,236,647 3,103,872 4,790,972
Bowtie2Global 799,524(0%)
800,868(0.12%)
896,487(9.92%)
3,768,641 2,704,114 4,293,787
Acroporadigitifera
BWAMEM 5,132,111(2.26%)
6,906,634(44.5%)
5,133,339(10.2%)
12,968,469 2,103,567 54,108,240
Bowtie2Local 4,688,809(4.03%)
5,931,862(38.9%)
3,971,743(9.32%)
9,893,903 4,259,619 41,728,154
Bowtie2Global 1,457,865(3.62%)
1,512,966(24.2%)
1,185,838(11.4%)
3,014,498 1,286,031 11,600,847
![Page 8: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/8.jpg)
SuperDeduper
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00False Positive Rate
True
Pos
itive
Rat
e
10
0.875
0.900
0.925
0.950
0.975
0.03 0.04 0.05 0.06 0.07False Positive Rate
StartPosition
1
5
20
40
150
Figure 1: ROC curves. Only a representative subset of the different start positions is shown. The image on the left shows the full ROC curves and the image on the left is a zoomed in view of corner of the curves. Each curve represents a start position and each point represents a length. The labeled point in the image on the right is the default start and length for
Super Deduper.
We calculated the Youden Index for every combination tested and the point that acquired the highest index value (as compared to Picard MarkDuplicates) occurred at a start position of 5bp and a length of 10bps (20bp total over both reads)
![Page 9: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/9.jpg)
Flash2– overlappingofreadsandadapterremovalinpairedendreads
TargetRegion
Read1
InsertsizeRead2
TargetRegion
Read1
InsertsizeRead2
TargetRegion
Read1
InsertsizeRead2
Insertsize>lengthofthenumberofcycles
Insertsize<lengthofthenumberofcycles(10bpmin)
Insertsize<lengthofthereadlength
Product:ReadPair
Product:Extended,Single
Product:AdapterTrimmed,Single
https://github.com/dstreett/FLASH2
![Page 10: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/10.jpg)
QualityTrimming- Sickle
Remove“poor”qualitysequencefromboththe5’and3’ends
![Page 11: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/11.jpg)
RibosomalRNA
• RibosomalRNAmakesup90%ormoreofatypicaltotalRNAsample.• LibraryprepmethodsreducetherRNA representationinasample
• oligoDt onlybindstopolyA tailstoenrichasampleformRNA• Ribo-depletionbindsrRNA sequences
Neithertechniqueis100%efficient
Canscreen(mapreadstorRNA sequences)todeterminerRNAefficiencyandpotentiallyremovethosereads.
![Page 12: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/12.jpg)
EffectsinVariantAnalysis
• Readsresultingfromcontaminantsmayaligntothegenome• PCRduplicatereadsareclearlyredundantdata• Overlappingreadsalsoproduceredundantandpossiblyerrorpronebasepairs
• Freebayes doesnotpseudooverlapreads• Samtools
• Setsoneofthebasepairqualitiesto0andtheothertothesumofbothqualities• Inconflict,setsbothqualitiesto0
• GATKUnifiedGenotype• Triestomodelthereads
• GATKHaplotypeCaller• Setsoneofthebasepairqualitiesto0andtheothertothesumofbothqualities• Inconflict,setsbothqualitiesto0
![Page 13: Sequence Preprocessing: A perspective - GitHub Pages · QA/QC •Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality](https://reader035.vdocuments.site/reader035/viewer/2022070805/5f03b1f27e708231d40a5117/html5/thumbnails/13.jpg)
QA/QC
• Beyondgenerating‘better’datafordownstreamanalysis,cleaningstatisticsalsogiveyouanideaastothequalityofthesample,librarygeneration,andsequencingqualityusedtogeneratethedata.
• Thiscanhelpinformyouofwhatyoumightdointhefuture.• I’vefounditbesttoperformQA/QConboththerunasawhole(poorsamplescanaffectothersamples)andonthesamplesthemselvesastheycomparetoothersamples (REMEMBER,BECONSISTANT).
• ReportssuchasBasespace forIllumina,aregreatwaystoevaluatetherunsasawhole.
• PCA/MDSplotsofthepreprocessingsummaryareagreatwaytolookfortechnicalbiasacrossyourexperiment