biomake bosc 2004

23
BioMake BioMake Chris Mungall Chris Mungall Berkeley Drosophila Genome Berkeley Drosophila Genome Project Project [email protected] [email protected]

Upload: chris-mungall

Post on 11-May-2015

623 views

Category:

Technology


3 download

DESCRIPTION

http://open-bio.org/bosc2004/accepted_abstracts.html

TRANSCRIPT

Page 1: BioMake BOSC 2004

BioMakeBioMakeChris MungallChris Mungall

Berkeley Drosophila Genome Berkeley Drosophila Genome ProjectProject

[email protected]@fruitfly.org

Page 2: BioMake BOSC 2004

build networksbuild networks

Many bioinformaticians spend large amounts Many bioinformaticians spend large amounts of time coding and running of time coding and running build/make build/make networksnetworks

A build network is a recipe describing the A build network is a recipe describing the execution of a collection of execution of a collection of interdependent interdependent heterogeneous tasksheterogeneous tasks sequence analysis pipelinessequence analysis pipelines data ‘compilation’data ‘compilation’

importing, transforming and exporting dataimporting, transforming and exporting data LIMSLIMS

Error prone, tedious repetitive code and hard Error prone, tedious repetitive code and hard to configureto configure

Page 3: BioMake BOSC 2004

Existing approachesExisting approaches

Run tasks by hand, or with ad-hoc scriptsRun tasks by hand, or with ad-hoc scripts doesn’t scale, leads to insanitydoesn’t scale, leads to insanity

Unix/GNU makefilesUnix/GNU makefiles ++: concise, generic, high level abstraction++: concise, generic, high level abstraction --: limited expressive power, hacky--: limited expressive power, hacky

makefile replacements (cons,scons,ant,build…)makefile replacements (cons,scons,ant,build…) geared towards software developmentgeared towards software development

Bio compute pipeline software (biopipe, Bio compute pipeline software (biopipe, enspipe)enspipe) excellent for certain tasksexcellent for certain tasks not completely genericnot completely generic

Page 4: BioMake BOSC 2004

biomake: executable biomake: executable computational protocolscomputational protocols

A declarative language for specifying A declarative language for specifying build networksbuild networks concise, Turing-complete, highly configurableconcise, Turing-complete, highly configurable

Dependency managementDependency management e.ge.g mask genomic sequence prior to mask genomic sequence prior to

genefindinggenefinding

Local and remote job executionLocal and remote job execution compute farm job managementcompute farm job management

Filesystem or database orientedFilesystem or database oriented

Page 5: BioMake BOSC 2004

Example Genomic Example Genomic Sequence Analysis Sequence Analysis

PipelinePipeline Prepare and cut assembled sequence into slicesPrepare and cut assembled sequence into slices Download latest NR peptide dataset and index itDownload latest NR peptide dataset and index it Blast genomic slices against NR and other Blast genomic slices against NR and other

datasetsdatasets RepeatMask genomic slices and run genefinders RepeatMask genomic slices and run genefinders

on masked sequenceon masked sequence Synthesise gene models and do peptide Synthesise gene models and do peptide

analysis on resultsanalysis on results Store everything in a relational db, and prepare Store everything in a relational db, and prepare

files for export to publicfiles for export to public

Page 6: BioMake BOSC 2004

Target dependenciesTarget dependencies

AssembledGenomic Sequence

ChunkedSequence

RepeatMaskedSequence

GenePredictions

BlastAlignments

FastaDB(local)

FastaDB

(remote)

BlastIndexedFastaDB

RelationalDatabase

GeneModels

BlastP & HMMAlignments

XMLExpo

rt

XMLImport

FlatfileExport

Page 7: BioMake BOSC 2004

Specifying TargetsSpecifying Targets

ChunkedSequence

BlastAlignments

FastaDB(local)

BlastIndexedFastaDB

formatdb(D) run: formatdb -i D

blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out

A generic targetpattern has a nameand arguments

A target pattern has tags for specifying dependencies, actions and filesystem or database IDs

Page 8: BioMake BOSC 2004

BioMake ExecutionBioMake ExecutionTARGET:blast(blastx, my.seq, nr.fly, -)

SUBTARGET:formatdb(nr.fly)

RUN: formatdb -i nr -p TOUT: nr.fly.{psq,pin,phr} formatdb-nr.fly.OK

RUN: blastall -p blastx -i my.seq -d nr.fly

OUT:my.seq-blast/my.seq.nr.fly.blastx.outmy.seq-blast/my.seq.nr.fly.blastx.out.OK

formatdb(D) run: formatdb -i D

blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out

Page 9: BioMake BOSC 2004

Targets can be nestedTargets can be nested

bop(S,B) run: apollo -bop -s S B -o target flat: B.game.xml

store(XML) run: xml2db XML

store(bop(my.seq, genscan(repeatmask(my.seq, drosophila))))

target instantiations can be thought of as a skolem IDs

repeatmask(S,Org) run: repeatmask S -a Org flat: S.mask

genscan(S) run: genscan S flat: S-pred/bn(S).genscan.out

Page 10: BioMake BOSC 2004

IterationIteration

Pipelines frequently involve iterating Pipelines frequently involve iterating over collections of data:over collections of data: Perform a sequence analysis on every entry Perform a sequence analysis on every entry

in a multi-fasta format filein a multi-fasta format file Perform a peptide analysis on every gene Perform a peptide analysis on every gene

prediction in some genscan outputprediction in some genscan output Query a database for a list of IDs and Query a database for a list of IDs and

perform some task on eachperform some task on each

biomake has language constructs for biomake has language constructs for iterationiteration

Page 11: BioMake BOSC 2004

Iterating over datasetsIterating over datasets

splitfasta(F) run: splitfasta.pl -d F-split -md5 F flat: F-split/bn(F).pathlist comment: splitfasta.pl is part of the biomake distro

analyze_multifasta(F) iterate: analyze_seq(S) where S in splitfasta(F)

analyze_seq(S) req: genie(S) blast(blastx,S,nr,-)

MultiFasta>seq1TAGGTATTGGTTAGGTGCGTCCTC>seq2GCGGTATAGCTTTTCCTTCTCTCT>seq3CAAAGCAGAGATATATTTATTCGC

>seq1TAGGTATTGGTTAGGTGCGTCCTC

>seq2GCGGTATAGCTTTTCCTTCTCTCT

>seq3CAAAGCAGAGATATATTTATTCGC

seq1.genie.out

seq1.nr.blastx.out

seq2.genie

seq2.nr.blastx.out

seq3.nr.blastx.out

seq3.genie

Page 12: BioMake BOSC 2004

Controlling the runmodeControlling the runmode

Tasks can be run locally or on a Tasks can be run locally or on a compute farm, synchronously or compute farm, synchronously or asynchronouslyasynchronously wrapper provided for PBSwrapper provided for PBS

runmode:runmode: tag states the mode and tag states the mode and wrapper for a particular target patternwrapper for a particular target pattern can be set globally and per-patterncan be set globally and per-pattern

special status targets provide special status targets provide execution statusexecution status

Page 13: BioMake BOSC 2004

runmode examplerunmode example

blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out runmode: async(qsubwrap)

The blast job will beexecuted on the computefarm via qsubwrap (comes with biomake distro)

Upon submission, the status targetstatus_run(blast(P,S,D,A))will be generated; oncompletion, the targetstatus_ok (blast(P,S,D,A))will be generated

biomake can automaticallyhandle moving data inand out between user’sfilesystem (or db) and local cluster nodes

Page 14: BioMake BOSC 2004

DatastoresDatastores

BioMake persists targets in DatastoresBioMake persists targets in Datastores The The flat:flat: tag flattens targets to unique tag flattens targets to unique

datastore IDsdatastore IDs Datastore can be filesystem or relational Datastore can be filesystem or relational

databasedatabase default is filesystemdefault is filesystem can be set globally or per targetcan be set globally or per target

e.g. analysis result targets can be stored on filesystem, e.g. analysis result targets can be stored on filesystem, status targets stored in DBstatus targets stored in DB

NFS traffic can be avoided on compute farm by NFS traffic can be avoided on compute farm by storing targets and status targets in a databasestoring targets and status targets in a database

Page 15: BioMake BOSC 2004

Asynchronous ExecutionAsynchronous Execution

local machinerunning biomake

scheduler

node

NFS

For each target T to be built:1) biomake fetches status of T skips T if status = ok/run2) biomake stores status_run(T)3) biomake creates a runner agent script and submits it to the cluster4) continue onto next target

1) agent fetches any input data2) agent runs command (eg blast)synchronously3) agent stores result4) agent stores status of T as‘ok’ or ‘err’

AGENT

AGENT

Page 16: BioMake BOSC 2004

Specifying rulesSpecifying rules

Pipeline systems often require a rule Pipeline systems often require a rule basebase

only do nuc to nuc alignments on one species or only do nuc to nuc alignments on one species or two recently diverged speciestwo recently diverged species

use sequence ontology hierarchy to decide use sequence ontology hierarchy to decide analyses or parametersanalyses or parameters

biomake protocols can have prolog facts biomake protocols can have prolog facts and rules embedded inside themand rules embedded inside them

biomake distro comes with SO prolog db biomake distro comes with SO prolog db and rules for graph traversaland rules for graph traversal

Page 17: BioMake BOSC 2004

<data relation=“fastadb” cols=“ID,SeqAlphabet,SOType,Org” del=“ws”>protein.fly.fstprotein.fly.fst aa

aapolypeptidepolypeptide D melanogasterD melanogaster

est.fly.fstest.fly.fst nnaa

ESTEST D melanogasterD melanogaster

cdna.fly.fstcdna.fly.fst nnaa

cDNAcDNA D melanogasterD melanogaster</data><prolog> nucdb(D):- fastadb(D,na,_,_). pepdb(D):- fastadb(D,aa,_,_).</prolog>

formatdb(D) run: formatdb -i D -p ‘T’ {pepdb(D)} run: formatdb -i D -p ‘F’ {nucdb(D)}

Embedding prolog factsEmbedding prolog facts

Page 18: BioMake BOSC 2004

BioMake module systemBioMake module system

The biomake core language is generic The biomake core language is generic no bioinformatics-specific code or tweaksno bioinformatics-specific code or tweaks

biomake uses a module systembiomake uses a module system biomake distro comes withbiomake distro comes with

biosequence_analysis modulebiosequence_analysis module Sequence Ontology prolog db and rulesSequence Ontology prolog db and rules scripts for handling bioinformatics datascripts for handling bioinformatics data

Page 19: BioMake BOSC 2004

biomake extensibilitybiomake extensibility

biomake is a declarative languagebiomake is a declarative language embodies both logical and functional embodies both logical and functional

paradigmsparadigms targets are actually higher order functionstargets are actually higher order functions standard FP functions available in fp standard FP functions available in fp

modulemodule cons,map,grep,filter,fold,…cons,map,grep,filter,fold,…

goal: expressive power, concise goal: expressive power, concise specifications and simplicityspecifications and simplicity

Page 20: BioMake BOSC 2004

BioMake in useBioMake in use

currently we’re using biomake for…currently we’re using biomake for… analysis of repeat families found in analysis of repeat families found in

orthologous and paralogous intronsorthologous and paralogous introns Building the Gene Ontology databaseBuilding the Gene Ontology database

……but we are still dependent on but we are still dependent on legacy pipeline code for many legacy pipeline code for many analysesanalyses

Page 21: BioMake BOSC 2004

Running biomakeRunning biomake

Get distro from Get distro from http://skam.sourceforge.nethttp://skam.sourceforge.net Requires XSB PrologRequires XSB Prolog

http://xsb.sourceforge.nethttp://xsb.sourceforge.net Run via command lineRun via command line

similar to unix make commandsimilar to unix make command Works on both OS X and LinuxWorks on both OS X and Linux Relational datastore requires mysql (Pg Relational datastore requires mysql (Pg

soon)soon) Better docs coming soon, lots of examplesBetter docs coming soon, lots of examples

Page 22: BioMake BOSC 2004

AcknowledgementsAcknowledgements

Shengqiang ShuSima MisraErwin FriseEric SmithMark YandellGeorge HartzellChris SmithSimon Prochnik

Jon TupyJosh KaminkerKaren EilbeckNomi HarrisSuzanna LewisGerry Rubin

Page 23: BioMake BOSC 2004

Problem SpecificationProblem Specification A build network consist of multiple A build network consist of multiple TargetsTargets

e.ge.g the output from a the output from a blastxblastx alignment of alignment of my.seqmy.seq to to the protein database the protein database nr.flynr.fly

Targets have a Targets have a logical patternlogical pattern e.ge.g blastblast alignment using alignment using PP of some seq of some seq SS vs some db vs some db

DD Targets are Targets are dependentdependent on other Targets on other Targets

e.ge.g blast depends on the indexing of db blast depends on the indexing of db DD using using formatdbformatdb

Upstream changes trigger downstream actionsUpstream changes trigger downstream actions Targets are built by Targets are built by running actionsrunning actions

““formatdb -p T -i nr.flyformatdb -p T -i nr.fly”” ““blastall -p blastx -i my.seq -d nr.flyblastall -p blastx -i my.seq -d nr.fly””