software: the good, the bad and the ugly...software: what’s good, bad and ugly the bad and ugly...

23
Software: the good, the bad and the ugly Festival of Genomics 2017 Russell Hamilton [email protected]

Upload: others

Post on 21-May-2020

18 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:thegood,thebadandtheugly

FestivalofGenomics2017

[email protected]

Page 2: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

BioinformaticsSoftware

SoftwareX

All softwarecontainsbugs:

• IndustryAverage:“about15- 50errorsper1000linesofdeliveredcode”SteveMcConnell(authorofCodeCompleteandSoftwareEstimation:DemystifyingtheBlackArt)

• Rangefromspellingmistakeinerrormessagetocompletelyincorrectresults

• Mostsoftwarewillprocesstheinputtoproduceanoutputwithouterrorsorwarnings

Never blindlytrustsoftwareorpipelines

• Always test andvalidate results

• Avoidblack box software(definition:producesresults,butnooneknowshow)

W Y

X

Z

Page 3: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

Differentclassesofbioinformaticssoftware

SoftwareX

W Y

X

Z

Class Examples Description

Processing TopHat2,Bowtie2 Performingcomputationally intensivetask,applyingmathematicalmodels

Evaluation FastQC,BamQC DerivingQCmetricsfromoutputfiles

Converters SamToFastq (PicardTools) Simply convertingbetweenfileformats.Generallystablenoregularupdates

Pipelines Galaxy,ClusterFlow Theglueforjoiningsoftwaretocreateanautomatedpipeline

Page 4: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

1. FindingSoftwaretodothejob2. Hasthesoftwarebeenpublished?3. SoftwareAvailability4. DocumentationAvailability5. Presenceonusergroups6. InstallationandRunning7. ErrorsandLogFiles8. Usestandardfileformats9. EvaluatingCommercialSoftware10. Bugsinscripts/pipelinestorunsoftware11. Writingyourownsoftware12. Usingandcreatingpipelines

12StepGuideforevaluatingandselectingbioinformaticssoftwaretools

Software:What’sgood,badandugly

Page 5: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Identifytherequiredtask

1.Findingsoftwaretodothejob

Alignmentofmethylation sequencingdatatoreferencegenome

Arethererelatedstudiesperformingsimilaranalysis?Publication/posters/talks

Requiredfeatures MustHave1.INPUTstandardFASTQformatfiles2.OUTPUTstandardBAMalignments3.OUTPUTCompatiblewithmethylKit

Liketohave1.Performmethylationcalls2.Mustmakeuseofmultiprocessorsforlargenumbersofsamples

Software:What’sgood,badandugly

Page 6: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

✓ PublishedinapeerreviewedjournalAsstandalonesoftwareorpartofstudy

Citedbyotherpeerreviewedpapers✓

Hasthesoftwarebeenbenchmarked(byotherpeoplethantheauthors)

Shortreadmappingis“generallysolvedproblem”Informativeforruntimes

✓ BMCBioinformatics;2016;17(Suppl 4):69DOI:10.1186/s12859-016-0910-3

2.Hasthesoftwarebeenpublished?

Software:What’sgood,badandugly

Page 7: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

SoftwareavailablefordownloadHostedonarecognisedsoftwarerepositorye.g. GitHub,BitBucket,SourceForge

Softwareregularlyupdated/bugsfixed/releasesMorethanonedeveloper(e.g.groupaccount)

Permanentarchiveofsoftwarereleasese.g.zenodo.org,figshare.com

University/Institute/CompanyWebsiteSoftwareistheresponsibilityofagroupnotjustanindividual

3.SoftwareAvailability

Software:What’sgood,badandugly

Page 8: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Bugsarereportedandfixed

Newfeaturerequestsareadded

3.SoftwareAvailability

Software:What’sgood,badandugly

Page 9: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

UserDocumentation✓

ReleaseDocumentationVersions– aidsreproduceability

4.DocumentationAvailability

Software:What’sgood,badandugly

Page 10: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

WT KO1 KO2 KO3

WT

KO1

KO2

KO3

X

X

X

XX

X

4.DocumentationAvailability

Software:What’sgood,badandugly

Example:RNA-SeqDifferentialGeneanalysisusingDESeq2DiscoverlimitationsDESeq2 Manual”The resultsfunction without any arguments will automatically perform a contrast of the last level of the last variable in the design formula over the first level.”

count.data <- DESeqDataSetFromMatrix(sampleTable=smplTbl, design= ~ genotype)

count.data <- DESeq(count.data)

binomial.result <- results(count.data)

binomial.result <- results(count.data, contrast=c(”genotype",”K01",”K02"))

Name fileName genotypeWT1 wt1.htseq_counts.txt WTWT2 wt2.htseq_counts.txt WTWT3 wt3.htseq_counts.txt WTKO1.1 ko1.1.htseq_counts.txt KO1KO1.2 ko1.2.htseq_counts.txt KO1KO1.3 ko1.3.htseq_counts.txt KO1KO2.1 ko2.1.htseq_counts.txt KO2KO2.2 ko2.2.htseq_counts.txt KO2KO2.3 ko2.3.htseq_counts.txt KO2KO3.1 ko3.1.htseq_counts.txt KO3KO3.2 ko3.2.htseq_counts.txt KO3KO3.3 ko3.3.htseq_counts.txt KO3

Page 11: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Evidenceforsupportquestionsbeinganswerede.g.FAQ,searchablepublicsupportgroup

5.Presenceonusergroups

https://www.biostars.org

http://seqanswers.com

Software:What’sgood,badandugly

Istheresomeonenearbyyoucanaskforhelp BioinformaticsCoreFacility

Researchgroupdownthecorridor

GitHubIssuesGoogleGroups

Page 12: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Willrunonstandardarchitecture✓

✓ Easytoinstall

6.InstallationandRunning

Releaseversions✓$bismark --version

Bismark - BisulfiteMapperandMethylationCaller.Bismark Version:v0.16.3_dev

Copyright2010-15FelixKrueger,Babraham Bioinformaticswww.bioinformatics.babraham.ac.uk/projects/

SourcecodeavailableBinariescansimplifyinstallation

Docker/Galaxy/BaseSpace

Software:What’sgood,badandugly

✓ DefaultparametersAsensiblesetofdefaultparametersthatarelikelytoproduceagoodfirstpassattheresults

Page 13: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Example:Traceabilityofresultsthoughthestepsintheanalysis

Intermediateresultsareexcellentcheckpoints

Software:What’sgood,badandugly

6.InstallationandRunning

Alignment BAM countsHTSeq-count DESeq2

Normalisedreadcounts

DifferentiallyExpressedGenes

✓ ✓ ✓

RNA-SeqDifferentialGeneExpressionAnalysis

Sample:SampleCorrelation

MA-PlotsSamplePCA

PerGenestd.dev

✓✓

✓ ✓

Page 14: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

7.ErrorsandLogFiles

Warnings Don’tignorewarnings,theymaybetellingyousomethingcrucialaboutyourdata

Errors Problemsevereenoughfortheprogramtostopandproduceanerror

Software:What’sgood,badandugly

Keepandreadlogfilesforsoftwarerun

Page 15: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

8.Usestandardfileformats

StandardInputFiles✓FASTA,FASTQ

Convertingbetweenformatscouldintroduceerrors

StandardOutputFiles✓BAM

Compatiblewithdownstreamtools

Software:What’sgood,badandugly

Bioinformaticiansspendanembarrassingamountoftimeconvertingbetweenfileformats

Page 16: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

9.EvaluatingCommercialSoftware

Software:What’sgood,badandugly

ShouldyouusecommercialsoftwaretodoRNA-SeqDGEanalysis?

Lotsofgoodcommercialsoftwareavailablee.g.Partek

Pros Cons

GraphicalInterface– nocommandline Runanalysiswithoutunderstandingthesteps

Singleapplicationforallsteps Hardertotracebackstepbystep

DedicatedCustomerSupport Limited usergroupactivity

Lesstransparency(methods/bugsfixed)

Expensive

Licenserequiredtoreproduceanalysis(e.g.reviewers)

Page 17: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

10.Bugsinscripts/pipelinestorunsoftware

Software:What’sgood,badandugly

1.BashScriptforrunningfastQC

for file in *_1.fq.gz;dofastqc $file

done

multiqc .

Oftenwrittenspecificallyforeachanalysisorprojectandarepronetobugs

Examplesofaccidentallymissingoutsamples

2.RNA-SeqDESeq2SampleTablegenotype <- data.frame(

‘WT’, ’wt’, ’Wt’, ‘KO1’,’kO1’,’KO1’,‘KO2’,’Ko2’,’KO2’,‘KO3’,’KO3’,’KO3’)

...results(dds, contrast=c(”genotype",”WT",”K02"))

Page 18: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

Thebadandugly

• Homemade“glue”scriptsforrunningsoftwarecanbebugprone• “darkscriptmatter”isn’treviewedorassessesandrarelyreleasedinmethodssections• Ina3000samplestudy,errorsarepropagated3000times!

Thegood

• Purposebuildpipelinetools• Premadepipelinesfore.g.RNA-Seqdifferentialgeneexpression• Jobqueuing- Loadbalancingacrosshardware(laptoptoclusterfarm)• Logfilestrackasamplesprogressthroughpipeline

11.UtilisingdedicatedPipelinetools

Page 19: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

http://clusterflow.io/

https://usegalaxy.org/

https://github.com/common-workflow-language

CommonWorkflowLanguage

11.UtilisingdedicatedPipelinetools

InteractionviaawebbrowserPublicandprivateserverinstallsManypre-builtpipelinesLargeusercommunity

CommandlineinterfaceManypre-builtpipelines

AlanguageforbuildingyourownpipelinesUtilisedbyotherpipelinetoolse.g.NextIO

Page 20: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Rule1:IdentifytheMissingPiecesRule2:CollectFeedbackfromProspectiveUsersRule3:BeReadyforDataGrowthRule4:UseStandardDataFormatsforInputandOutputRule5:ExposeOnlyMandatoryParametersRule6:ExpectUserstoMakeMistakesRule7:ProvideLoggingInformationRule8:GetUsersStartedQuicklyRule9:OfferTutorialMaterialRule10:ConsidertheFutureofYourTool

Software:What’sgood,badandugly

12.Developingyourownsoftware

Ifyouaresureagreatpieceofsoftwaredoesn’talreadyexistorcanbemodifiedforthetask

Developingyourowntoolsgivesanappreciationofhowdifficultitcanbe

Page 21: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Weightingtheevaluationcriteria

Software:What’sgood,badandugly

Criteria Importance Comments1 FindingSoftwaretodothejob +++++ Use therighttoolsforthejob

2 Hasthesoftwarebeenpublished? +++ Newsoftwarebeingreleasedsocheckforimprovedmethods.Justbecauseitspublishedandwelluseddoesn’tmeanit’sstill thebest

3 SoftwareAvailability ++

Opennessitagoodsignforfindingerror/bugs/suggestingfeatureenhancements

4 DocumentationAvailability +++++5 Presenceonusergroups +++++6 InstallationandRunning +++7 ErrorsandLogFiles +++++8 Usestandardfileformats ++++ Conversions couldaddsourcesoferror9 EvaluatingCommercialSoftware + PriceVsOpensourcesoftware

10 Bugsinscripts/pipelinestorunsoftware +++++ Pipelinesstandardise workflows

11 Utilising dedicatedpipelinetools +12 Writingyourownsoftware + Don’tre-inventingthewheel

Page 22: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

Method1Method2Method3

Compromisesforruntimevsaccuracy/sensitivity

ProjectAhas3000samplesvsProjectBwith12samples

Method1:4hourspersample98%accuracyMethod2:30minspersample97%accuracy

WhatwouldyouchooseifMethod1:4hourspersample98%accuracyMethod3:15minspersample 90%accuracy

Weightingtheevaluationcriteria

Page 23: Software: the good, the bad and the ugly...Software: What’s good, bad and ugly The bad and ugly • Home made “glue” scripts for running software can be bug prone • “dark

Software:What’sgood,badandugly

Summary

Manywaystoevaluatesoftware

• Opennessandengagementwithusersisveryimportant- bugsfixed,featuresadded,largeuserbase

• Evaluatefeatures,e.g.runtime,againstyourprojectrequirements

• Ifyouareusingpipelines,usepurposebuildpipeliningtools