PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

ASaiM Lessons learned from developing a framework for biologists Bérénice Batut — October 16th, 2015


TRANSCRIPT

Page 1: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

ASaiM: Lessons learned from developing a framework for biologists

Bérénice Batut, October 16th, 2015

Page 2: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists
Page 3: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

PhD thesis in bioinformatics and computational biology

Contribution to aevol project
Development of simple Python scripts

Page 4: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Post-doc in bioinformatics

Development of ASaiM project

Page 5: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

ASaiM project

Page 6: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists
Page 7: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Objectives

Development of a bioinformatics environment to analyze data from gut microbiota

Page 8: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Gut microbiota

Community of microorganism species that live in the digestive tracts

"Forgotten" organ

Page 9: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Metagenomics: study of microbiota

Page 10: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Complexity
Short sequences
Sequence variability
Incomplete reference databases

Need for numerous treatments to extract useful information

Page 11: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

[Figure: workflow diagram. Input sequences are sorted with SortMeRNA against rRNA populus, rRNA sclerotinia, and Silva bacteria 16S, archaea 16S and eukaryota 18S sequences. Ids and e-values are extracted from the 16S and 18S blast reports, compared, joined, and filtered on which e-value is lower. The resulting id lists are used to extract the 16S and 18S sequences to conserve.]

Example of workflow to sort sequences given their type
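
One step of this workflow, extracting the lines where the 16S e-value is lower than the 18S e-value (and the reverse), can be sketched in Python. This is only an illustration, not the ASaiM code: the dictionaries, the read ids, and the function name `split_by_best_evalue` are assumptions made for the example, as is the rule that the database with the lower e-value keeps the read.

```python
def split_by_best_evalue(evalues_16s, evalues_18s):
    """Assign each read found in both reports to the database with the lower e-value.

    Both arguments map a sequence id to an e-value (hypothetical inputs;
    a real similarity search report would have to be parsed first).
    """
    keep_16s, keep_18s = [], []
    # Only ids present in both reports are ambiguous ("similar" ids).
    for seq_id in sorted(evalues_16s.keys() & evalues_18s.keys()):
        if evalues_16s[seq_id] < evalues_18s[seq_id]:
            keep_16s.append(seq_id)
        elif evalues_18s[seq_id] < evalues_16s[seq_id]:
            keep_18s.append(seq_id)
        # Ties are left unassigned, mirroring the two strict "<" filters.
    return keep_16s, keep_18s


# Made-up e-values for four reads.
evalues_16s = {"read1": 1e-30, "read2": 1e-5, "read3": 1e-12}
evalues_18s = {"read2": 1e-20, "read3": 1e-8, "read4": 1e-40}

keep_16s, keep_18s = split_by_best_evalue(evalues_16s, evalues_18s)
print(keep_16s)
print(keep_18s)
```

Reads found in only one report (`read1`, `read4` here) are not touched by this step; in the diagram they are handled by the "Compare difference" branch.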

Page 12: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

ASaiM framework

Page 13: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Bioinformatics framework to generate workflows to analyze data from gut microbiota

Page 14: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Main requirements

Generation of workflow with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Page 15: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

First tested approach
Simple Python scripts

Page 16: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Fit with framework requirements?

Generation of workflow with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Page 17: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Second tested approach
Workflow managers such as Luigi, Airflow, ...

Airflow dependency graph (from Airbnb site)

Page 18: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Fit with framework requirements?

Generation of workflow with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Page 19: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Third tested approach
Homemade approach

Configuration file

Workflow description
Web interface for generation

Python scripts to execute workflow in configuration file
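
The idea of a configuration file executed by Python scripts can be illustrated with a minimal sketch. Everything below is hypothetical: the JSON layout, the step names, and the `run_workflow` function are invented for the example and do not reproduce the format used in the project.

```python
import json


# Hypothetical step implementations; real steps would call bioinformatics tools.
def remove_first_line(lines):
    """Drop the header line of a report."""
    return lines[1:]


def extract_ids(lines):
    """Keep the first whitespace-separated column of each line."""
    return [line.split()[0] for line in lines]


# Registry mapping the step names used in the configuration to functions.
STEPS = {"remove_first_line": remove_first_line, "extract_ids": extract_ids}

# Hypothetical configuration file content describing the workflow.
CONFIG = '{"workflow": ["remove_first_line", "extract_ids"]}'


def run_workflow(config_text, data):
    """Apply the steps listed in the configuration, in the order given."""
    config = json.loads(config_text)
    for step_name in config["workflow"]:
        data = STEPS[step_name](data)
    return data


report = ["id evalue", "read1 1e-30", "read2 1e-5"]
result = run_workflow(CONFIG, report)
print(result)
```

In this sketch the steps run strictly in the order written in the file, so any step that needs the output of several earlier steps has to be ordered by hand.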

Page 20: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Fit with framework requirements?

Generation of workflow with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Page 21: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Main issue with these approaches
Dependency between the tasks

Airflow dependency graph (from Airbnb site)
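
To make the dependency issue concrete, here is a small sketch in which each task declares the datasets it reads and the dataset it writes, and the execution order is derived from those declarations. The task names and dataset names are assumptions loosely based on the sequence-sorting workflow; the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+), not on any tool named in the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each declares the datasets it reads and the one it writes.
TASKS = {
    "extract_sequences": {
        "inputs": ["similar_ids", "input_sequences"],
        "output": "kept_sequences",
    },
    "compare_ids": {"inputs": ["16S_ids", "18S_ids"], "output": "similar_ids"},
    "extract_16s_ids": {"inputs": ["16S_report"], "output": "16S_ids"},
    "extract_18s_ids": {"inputs": ["18S_report"], "output": "18S_ids"},
}


def execution_order(tasks):
    """Order tasks so that every dataset is produced before it is consumed."""
    # Map each produced dataset to the task that writes it.
    producers = {spec["output"]: name for name, spec in tasks.items()}
    # A task depends on the producers of its inputs; inputs with no
    # producer (raw files such as "input_sequences") add no dependency.
    graph = {
        name: {producers[item] for item in spec["inputs"] if item in producers}
        for name, spec in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(TASKS)
print(order)
```

The tasks are deliberately listed out of order: the dependencies are recovered from the input and output names alone, which is the kind of link between tasks the slide refers to.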

Page 22: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

[Figure: the sequence-sorting workflow diagram from Page 11, shown in full and then reduced to a subset of its tasks]

Page 23: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Final approach
Galaxy

Open-source project based on Python
International development community
Web interface
Galaxy ToolShed

Page 24: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Fit with framework requirements?

Generation of workflow with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Page 25: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Galaxy dependency graph

[Figure: Galaxy workflow graph chaining the tools Input dataset, Extract (constrained) information, Remove beginning, Compare two Datasets, Join two Datasets, Filter, Cut, Concatenate datasets, Line/Word/Character count, and Extract]

Workflow to sort sequences given their type

Page 26: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

ASaiM framework
Configuration of a Galaxy server
Development of wrappers for tool integration
Development of scripts to use Galaxy and API

Page 27: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Used tools

Code

GitHub and submodules, GitLab

Documentation

Sphinx + Read the Docs + GitHub

Web page

Jekyll + GitHub Pages

Management

Trello, Slack

Page 28: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Learned from this project

Need to define the design correctly

No workflow manager with input/output dependency

Do not reinvent the wheel

Do not favor home-made solutions

Join an active community

Need for good tools and good habits in big projects

Page 29: PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Thank you. Questions?

bebatut.fr

github.com/bebatut

twitter.com/bebatut