transmart community meeting 5-7 nov 13 - session 2: herding cat
Post on 19-Jan-2015
Herding Cats: Managing Open Source Projects and Communities
Peter Rice, 5th November 2013
Herding Cats
• Managing an open source community is like herding cats.
• 'Cat, come with me.' 'Nenni!' said the Cat. 'I am the Cat who walks by himself, and all places are alike to me. I will not come.' But all the same, he followed. (Rudyard Kipling, Just So Stories)
– EMACS
– Linux
– GPL
– Apache
A brief history of open source software
• Source code provided
• Users free to inspect, modify and redistribute
• Restrictions may be applied
• Freedoms may be guaranteed
• Several licenses may be combined
– If they are compatible
Open source licensing
• Originally written for the EMACS editor and the GNU project
• Based on copyleft
– Copyright holder usually restricts rights
– In GPL, copyright holder requires all further distributions to ensure free access
– No further restrictions may be imposed
– “Free as in speech, not as in beer”
GNU General Public License
• The full GNU General Public License makes it difficult to combine with other licenses as the whole binary is covered by GPL.
• The Lesser (Library) GPL only preserves the interface and requires LGPL library source code to be made available.
• Applications can be under any license
• GPL code requires unlinked interfaces (APIs)
Lesser General Public License
• Apache 2.0 allows modified code to use another license (including proprietary). “Indemnity clause” can be scary but is safe
• Perl artistic license has issues with redistributed code
• BSD license imposed a “restriction” requiring citing the original authors, usually removed in several “modified BSD” versions
Other Open Source Licenses
• 1980 Staden package
– in support of Fred Sanger
• 1982 EMBL/GenBank
– Free sequence databases, later also SwissProt
• 1984 Genetics Computer Group
– Free (initially) sequence analysis package
• 1990 Sequence Retrieval System
• 1990 BLAST
• 1997 EMBOSS
A brief history of bioinformatics
• The Staden package was developed from 1987 to 2003 by Rodger Staden at the MRC-funded Laboratory of Molecular Biology
• To get a copy of the software, users mailed a cheque for £100 to the Medical Research Council
• In 2003, renewal of funding was rejected
Copyright and Ownership
• The software was still owned by the funders
• The authors had no right to apply for alternative funding
• … nor did anyone else
• Two years later it was formally re-released as open source, but developers had left.
Copyright and Ownership
• The HMMER package provides standard Hidden Markov Model applications for multiple alignments of protein sequences
• HMMER 2 had a dual licensing model
– GNU General Public License
– Commercial license
• Only one of these can include third-party contributions. The commercial license cannot.
Multiple licensing
• The Sequence Retrieval System was developed by Thure Etzold as a PhD project, then at EMBL Heidelberg and the European Bioinformatics Institute.
• LION Bioscience in Cambridge started up to maintain and develop SRS commercially
• LION merged with competitors (e.g. NetGenics)
From academia to commercial
• NetGenics software was withdrawn
– Customers had to purchase an SRS license instead
• LION merged with BioWisdom
• BioWisdom merged with Instem
• Lesson: commercial software is high quality, well supported, but can disappear at any time.
• Open source software avoids this risk
From academia to commercial
• BLAST was developed at NCBI as a successor to FASTA
• Development split into BLAST and WU-BLAST (Washington University) providing new features
• WU-BLAST in turn became commercial AB-BLAST
Branching
• BLAST and the NCBI Toolkit were an early example of open source bioinformatics
• Most software at the time was commercial
• In 1990 the commercial providers wrote to Congress asking for withdrawal of funding for NCBI software because it competed with US industry.
• They failed.
Competition
• The GCG package was developed by the Genetics Computer Group at the University of Wisconsin
• One of the most cited papers in biology
– If you change more than 25% of the code, you can remove the GCG copyright
• Changed to an annual source code license model
• Extensions (EGCG) distributed as source code by EMBL Heidelberg and then by Sanger
Competition
• Social scientists have reported in detail on GCG as an example of the development of bioinformatics.
• Intelligenetics Inc objected to GCG’s unfair competition
• Wisconsin spun off GCG Inc
• Software license fee doubled
• Usage continued
• EGCG developed to 50% of the GCG code base
Competition
• GCG Inc looked for a new owner
• Source code deemed to be their major asset
• Source code distribution was withdrawn
• Increased fee for source code
• Very restrictive terms of distribution
• EGCG was abandoned with 150 applications
• EMBOSS written from scratch to replace both
– GPL/LGPL licensing
– Created by the former EGCG community
Competition
• So, to summarise
– 1984 GCG started as open source
– 1990 Became GCG Inc
– 1997 Acquired by Oxford Molecular
– 2000 EMBOSS 1.0 released as open source
Harvey, M. and McMeekin, A. (2007) “Public or Private Economies of Knowledge? Turbulence in the Biological Sciences”
Competition
• The developers are only the beginning
– Users
– Installers
– Technical authors
– Helpdesk and support
– Communication
– Quality assurance
– Competitors
Managing an Open Source Community
• New source code, new functionality
• Maintaining source code
– Bug fixes, coding standards
• Interfaces
– APIs, third party integration
• Competitors (including open source)
– New features and functionality
– Integration and active collaboration
Contributions by developers
• Branches
– Someone needs to merge branches
• Original developers should agree to help
• Often merged by others wanting to use new features
– Ideally, merge with a single core
– Useful to merge any set of branches
– Combine with test suite(s)
Contributions (continued)
• New data types
• ETL procedures
• Standards
• Project-specific requirements
Contributions (continued)
• Documentation
– Users are good at writing/updating manuals
• Training
– Shared examples with public data and common standards
• Support
• Feature requests
Contributions (continued)
• Git: Github etc.
• Sourceforge
• Open Bio Foundation
• Locally hosted solutions:
– CVS or SubVersion
• Wiki
– Documentation: developers, users, installers
Hosting solutions
• Projects need a coordinator
– Linux: Linus Torvalds
– Emacs: Richard Stallman
– GCG: John Devereux
– EMBOSS: Peter Rice
Coordination
• Maintaining a standard code base
– github.com/transmart
• Tracking branches and modified copies elsewhere
• Selecting best solutions from available branches
• Merging conflicting changes
• Continuous testing
Coordination
• Community meetings (London, Amsterdam, Paris, …) for developers and users
• Regular technical developer meetings / TCs
• Mailing lists
– Provide a useful archive
• Trackers (JIRA, Pivotal, …)
– Defining tasks/issues and resolving them
• Wiki
– Community documentation
Communication
• Quality assurance
– More tests are always helpful
• Automated documentation
– Creating screenshots from test outputs
• Create tests for documented examples
• Automated update when results change
• Ensure documented functionality still functions
Efficiency
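One way to "create tests for documented examples" is to keep each example inside the documentation itself and run it as a test. The sketch below uses Python's standard `doctest` module for this. The function `reverse_complement` and the helper `run_documented_examples` are hypothetical names chosen for illustration; they are not taken from tranSMART or EMBOSS.

```python
import doctest


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    The example below is documentation and test at the same time:

    >>> reverse_complement("ATGC")
    'GCAT'
    """
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def run_documented_examples(func) -> doctest.TestResults:
    """Run the docstring examples of one function (hypothetical helper).

    Returns a (failed, attempted) result so that a build script can
    stop when a documented example no longer matches the code.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Pass the function in explicitly so the examples can call it by name.
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


if __name__ == "__main__":
    results = run_documented_examples(reverse_complement)
    print(f"{results.attempted} example(s) run, {results.failed} failed")
```

If the code changes so that a documented result is no longer produced, the run reports a failure, which is the signal to update either the code or the manual.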
• In a small community, sanctions work
– Financial penalties for breaking the code
– Small fines for bugs
– Fines put back into the community, e.g. funding Xmas drinks
Cat incentives
• Acknowledge contributions
• Benefit from sharing code in other branches
• Developers need to support one another
– Put out any flame wars
• Involve the user community
– Encourage non-developers to contribute
• Keep everything public
– Support the community
– Attract new cats
Cat treats