Open Research Online
The Open University's repository of research publications and other research outputs

A decadal view of biodiversity informatics: challenges and priorities
Journal Item

How to cite:
Hardisty, Alex; Roberts, Dave and The Biodiversity Informatics Community (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), article no. 16.

For guidance on citations see FAQs.

© 2013 The Authors

Version: Version of Record

Link(s) to article on publisher's website: http://dx.doi.org/doi:10.1186/1472-6785-13-16

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online's data policy on reuse of materials please consult the policies page.

oro.open.ac.uk


CORRESPONDENCE Open Access

A decadal view of biodiversity informatics: challenges and priorities
Alex Hardisty1*, Dave Roberts2* and The Biodiversity Informatics Community3

Abstract

Biodiversity informatics plays a central enabling role in the research community's efforts to address scientific conservation and sustainability issues. Great strides have been made in the past decade establishing a framework for sharing data, where taxonomy and systematics has been perceived as the most prominent discipline involved. To some extent this is inevitable, given the use of species names as the pivot around which information is organised. To address the urgent questions around conservation, land-use, environmental change, sustainability, food security and ecosystem services that are facing Governments worldwide, we need to understand how the ecosystem works. So, we need a systems approach to understanding biodiversity that moves significantly beyond taxonomy and species observations. Such an approach needs to look at the whole system to address species interactions, both with their environment and with other species. It is clear that some barriers to progress are sociological, basically persuading people to use the technological solutions that are already available. This is best addressed by developing more effective systems that deliver immediate benefit to the user, hiding the majority of the technology behind simple user interfaces. An infrastructure should be a space in which activities take place and, as such, should be effectively invisible. This community consultation paper positions the role of biodiversity informatics, for the next decade, presenting the actions needed to link the various biodiversity infrastructures invisibly and to facilitate understanding that can support both business and policy-makers. The community considers the goal in biodiversity informatics to be full integration of the biodiversity research community, including citizens' science, through a commonly-shared, sustainable e-infrastructure across all sub-disciplines that reliably serves science and society alike.

Keywords: Biodiversity, Informatics, Grand challenge, Decadal vision, Research infrastructure, e-Infrastructure, Data sharing, Systems approaches

The grand challenge
The grand challenge for biodiversity informatics is to develop an infrastructure to allow the available data to be brought into a coordinated coupled modelling environmenta able to address questions relating to our use of the natural environment that captures the 'variety, distinctiveness and complexity of all life on Earth'b. Biodiversity processes are complex and can have a large, long-term impact at the macro-scale, even if they have occurred rapidly at the sub-cellular molecular level, e.g. the phosphate cycle [1,2]. Processes taking place in seconds over scales of nanometres are crucial to processes that take years and at scales of many hectares, and ultimately to planetary processes in geological time. Capturing such inter-dependent processes, across such a breadth of scales, is beyond the capability of current information management and modelling methods. To have an impact on biodiversity conservation, sustainability or our environment, we need to consider all aspects of biodiversity, from genes to ecosystems, in a holistic approach. We need to be able to assess global biodiversity changes and make predictions about ecosystems. We need to be able to integrate different facets of past and present environmental and biodiversity observations and embed them in models with predictive power [3]. We will need to develop new models to address socially urgent questions. Such an approach will take biodiversity science far beyond a collection of taxon names, capturing data about different facets of biodiversity, both by their absolute position and their relative position, together with their observational and temporal context. Most importantly, through biodiversity informatics, biodiversity scientists will be able to understand, to measure and predict how change affects the actual functioning of the ecosystem.

* Correspondence: [email protected]; [email protected]
1School of Computer Science and Informatics, Cardiff University, Queens Buildings, 5 The Parade, Cardiff, CF24 3AA, UK
2Department of Zoology, The Natural History Museum, Cromwell Road, London SW7 5BD, UK
Full list of author information is available at the end of the article

© 2013 Hardisty et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The authors also waive their right of attribution under the terms of CC0 (http://creativecommons.org/about/cc0), so that text may be excerpted and reused in governmental or other authoritative reports or documents without the need for attribution.



Recommendations
As well as addressing practitioners with an interest in and knowledge of informatics and how it can be applied to support biodiversity science, our recommendations are intended to inform funders, project managers and institutions whose remit includes at least some aspect of biodiversity science. They are intended to establish a background against which decisions can be made when making and evaluating proposals, allocating funds or directing work to build infrastructures. For long-term success, geographically distributed infrastructure involving multiple stakeholders depends on the commitment of those stakeholders to support a vision and to adhere to standards agreed by the community. Stakeholders each have to fund their part in the endeavour to make the whole thing sustainable. Long-term sustainability will be achieved by integrating services provided by key players as part of their core mission.

The first 3 recommendations should apply to all activity in this area. They are necessary to reduce duplication and to enhance collaboration. The intended consequence is to facilitate the creation of new knowledge by synthesis activities using the data and tools thus generated.

1. Open Data [4] should be normal practice and should embody the principles of being accessible, assessable, intelligible and usable [see Context].

2. Data encoding should allow analysis across multiple scales, e.g. from nanometers to planet-wide and from fractions of a second to millions of years, and such encoding schemes need to be developed. Individual data sets will have application over a small fraction of these scales, but the encoding schema needs to facilitate the integration of various data sets in a single analytical structure [see Paragraph 19 et seq.].

3. Infrastructure projects should devote significant resources to market the service they develop, specifically to attract users from outside the project-funded community, and ideally in significant numbers. To make such an investment effective, projects should release their service early and update often, in response to user feedback [see paragraphs 10 and 31].

While several technologies have already been developed, they are not widely embraced by the community, often due to reasons related to the 'human factor'. The following 4 recommendations on technological foundations focus on enhancing the usability and better deployment of existing technologies:

4. Build a complete list of currently used taxon names with a statement of their interrelationships (e.g. this is a spelling variation; this is a synonym; etc.). This is a much simpler challenge than building a list of valid namesc, and an essential pre-requisite [see paragraph 1].

5. Attach a Persistent Identifier (PID) to every resource so that they can be linked to one another. Part of the PID should be a common syntactic structure, such as 'DOI: ...' so that any instance can be simply found in a free-text search (see the sketch after this list) [see paragraph 7].

6. Implement a system of author identifiers so that the individual contributing a resource can be identified. This, in combination with the PID (above), will allow the computation of the impact of any contribution and the provenance of any resource [see paragraph 11].

7. Make use of trusted third-party authentication measures so that users can easily work with multiple resources without having to log into each one separately [see paragraph 12].
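As a minimal sketch of recommendation 5, the Python fragment below scans free text for identifiers that share a common 'DOI:' or doi.org prefix. The regular expression and the second example identifier are illustrative assumptions, not a prescribed PID syntax.

```python
import re

# Assumed, illustrative pattern: identifiers that start with a common prefix
# ("doi:" or "https://doi.org/") followed by a DOI-like string.
PID_PATTERN = re.compile(
    r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)(10\.\d{4,9}/\S+)",
    re.IGNORECASE,
)

def find_pids(text: str) -> list[str]:
    """Return DOI-style identifiers found in a block of free text."""
    return ["doi:" + match.group(1).rstrip(".,;)") for match in PID_PATTERN.finditer(text)]

if __name__ == "__main__":
    sample = ("Specimen records are described in doi:10.1186/1472-6785-13-16 "
              "and at https://doi.org/10.5281/zenodo.12345 (hypothetical).")
    print(find_pids(sample))
    # ['doi:10.1186/1472-6785-13-16', 'doi:10.5281/zenodo.12345']
```

A shared prefix of this kind is what makes an identifier recoverable with an ordinary text search, even before any resolver is involved.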

The foundational technologies described above all exist to some degree, but need to be integrated. The next steps will require developing new structures by exploiting existing technologies in novel ways.

8. Build a repository for classifications (classification bank) that will allow, in combination with the list of taxonomic names, automatic construction of taxonomies to close gaps in coverage [see paragraph 2].

9. Develop a single portal for currently accepted names - one of the priority requirements for most users [see paragraph 3].

10. Standards and tools are needed to structure data into a linked format by using the potential of vocabularies and ontologies for all biodiversity facets, including: taxonomy, environmental factors, ecosystem functioning and services, and data streams like DNA (up to genomics); a small sketch follows this list [see paragraphs 16 and 17].

11. Mechanisms to evaluate data quality and fitness-for-purpose are required [see paragraphs 20 and 23].
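As a hedged illustration of recommendation 10, the fragment below structures a single occurrence record using term names from the TDWG Darwin Core vocabulary (namespace http://rs.tdwg.org/dwc/terms/). The helper function, the use of a plain dictionary as the interchange structure and all field values are illustrative assumptions, not part of any standard serialisation or project cited here.

```python
# Minimal sketch: one observation keyed by Darwin Core term names so that
# records from different sources share a common vocabulary. Values below
# are invented for illustration only.
def make_occurrence(scientific_name: str, latitude: float, longitude: float,
                    event_date: str, recorded_by: str) -> dict:
    """Build a dictionary keyed by Darwin Core terms for a single occurrence."""
    return {
        "scientificName": scientific_name,
        "decimalLatitude": latitude,
        "decimalLongitude": longitude,
        "eventDate": event_date,          # ISO 8601 date string
        "recordedBy": recorded_by,
        "basisOfRecord": "HumanObservation",
    }

record = make_occurrence("Quercus robur", 51.48, -3.18, "2013-06-01", "A. Observer")
print(record)
```

Using shared term names is the first, cheapest step towards the linked, ontology-backed structures the recommendation calls for.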

Looking to the future, it is clear that new techniques, such as observatories employing novel sensors, are delivering data in unprecedented volumes, especially molecular data, as the Genomic Observatories Network [5,6] has emphasised. This will require development of new technologies, or adaptation of technologies from related fields, new information systems, and platforms offering overviews of detectors and experimental setups for biodiversity research to facilitate exploitation of the opportunities presented.

12. A next-generation infrastructure is needed to manage ever-increasing amounts of observational data [see Paragraphs 13, 19, 21 and Appendix 2].

Preface

"The Hubbell paper [7] made it into BBC [8]. It is sad to see where we stand after 20 years. We have done more work, we developed an impressive array of biodiversity informatics, we have tools to capture specimens in our collections and make the data accessible, but the basic we are missing: A strategy to explore the living planet, and even less a strategy to measure the change of species based at least on a basic count of what's out there." Donat Agosti, Plazi.

"When writing my electronic monography (e-monograph) in 2007–9 I wished to link the plant species to other organisms within the ecological food chains / food web. However, I could not even find an e-monograph on birds at the time or have the software programming knowledge to create interspecies relationships between electronic monographs and / or electronic floras. Ultimately I wish to see a 'virtual life on Earth' where cross-linking of data can be explored, for example, how shifting species distribution in light of climate change will affect food webs. Consequently the results can be used to drive conservation management and placement on the IUCN Red Data List." Fiona Young, University of Reading, UK

The two quotes above illustrate the challenges and associated shortcomings facing biodiversity informatics today. Despite considerable progress, biodiversity science is still reliant on data that is not as fully available, linkable, discoverable and accessible as it should be. Services and tools to process those data are not yet 'plug and play'. Models of different parts of the overall biodiversity system, from the molecular to the planetary, are not yet linked across time, space and scales. We are still unable to understand the complex behaviour of the entire system because until now we have reduced it, only taken account of some of its parameters and analysed only parts of it, and just by summing those different parts we cannot understand how the entire system functions.

Biodiversity science is part of the broader drive towards managing our planetary environment, particularly moving to a sustainable pattern of use in the face of a growing human population. Related questions to pose over the next decade include: Will we need an organismal inventory to understand and monitor ecosystem function? Will we be able to monitor functional diversity directly? Can we measure fluxes as a metric of ecosystem health [9]? Will we be able to develop better mechanisms to represent organism interactions, for instance, the microbiome of multicellular organisms [10], viruses in plants or the composition of the rhizosphere? These are comparatively new areas of research not yet represented by a significant body of data or services, but essential for managing our planet in the long-term [3,11,12].

To scale up and understand the whole system, we need new approaches, data types and services. Access to these larger data resources is largely to be found through informatics, but the application of those resources will be made by domain specialists. Our ultimate goal is an understanding of the whole Earth system, so we must retain a broad range of biodiversity monitoring sites, but at the same time we should also focus research effort on key model ecosystems where we can achieve the intensity of outcomes the biomedical research community has with the model organism approach. Only by looking at vast databases that describe the whole of the system will we be able to understand the big picture, find correlations and patterns of activities. Knowing how such patterns and processes of biodiversity change will further help in more targeted experimentation, resulting in new key datasets. Enhancing the biodiversity informatics infrastructure we have today is therefore indispensable.

Context
The EC Commissioners Máire Geoghegan-Quinn (Research and Innovation), Neelie Kroes (Digital Agenda), and Connie Hedegaard (Climate Action) have emphasisedd the crucial nature of infrastructures for achieving their respective political agendas. In particular, Commissioner Neelie Kroes on 11th April 2012 [13] emphasised the importance of open e-Infrastructures, sharing of raw data and results, and collaboration to enable more open science. Open science is the direction that the European Commission (EC) promotes for project proposals under the Horizon 2020 funding initiative, also in accordance with the Nagoya protocol [14], adopted at the 10th meeting of the Parties to the Convention on Biological Diversity [15].

The UK's Royal Society published a report called 'Science as an open enterprise' [4] that highlights the need for a paradigm shift away from traditional practices and mindsets. To quote the report, "... although scientists do routinely exploit the massive data volumes and computing capacity of the digital age, the approach is often redolent of the paper age rather than the digital age".


Key to their vision is the concept of 'Intelligent Openness' (Table 1), a standard that Biodiversity Informatics must attain before attempting more complex linkage of services.

Within this context of a more open and transparent future where both scientific results and the data needed for the conduct of science are easily accessible, linked and properly attributed and preserved, we consider the challenges and priorities in a decadal vision for biodiversity informaticse at the European level.

Such a vision is of global interestf and should be the result of a comprehensive strategic roadmapping exercise, like the one recently undertaken in the health informatics domain [16]. It is necessary to engage with the biodiversity community and use mixed methods to elaborate likely future scenarios from which to derive the required strands of future informatics development. On the other hand, we can build on substantial reflective work that already exists.

In "The big questions for biodiversity informatics" [17], Peterson et al. assert that biodiversity informatics currently exists "without major guiding scientific goals that represent intellectual frontiers and challenges", and fear this gap leaves biodiversity informatics without a framework for effective thinking, resulting in a disjoint set of resources – both data and tools – that cannot be effectively harnessed together yet. They posit a future where biodiversity informatics enables biodiversity science to become a predictive exploration of space, time and form. In "Evolutionary Informatics: Unifying knowledge about the diversity of life" [18], Parr et al. propose the grand goal to "Link together evolutionary data across the great Tree of Life by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses". Five challenges to realising that goal are also discussed.

In this white paper we must establish a stronger focus and a direction to guide the development of biodiversity informatics in Europe over the next decade, whilst at the same time allowing for serendipity. Clearly, the rate of change in the technology environment around us is dramatic: tools like Facebook, Twitter, Foursquare, Google (Earth/Scholar/...), Mendeley and Dropbox have penetrated our working lives and techniques like MapReduce [19] greatly impact our ability to manipulate and analyse massively large datasets. Smartphones, digital cameras, GPS positioning and progress in geospatial analysis offer possibilities for 'apps' and techniques that were hardly imagined just a few years ago. Workflows as a tool for in-silico processing of data and the concept of Virtual Laboratories where scientists carry out digital experiments, hardly known a few years ago, offer today enormous opportunities in virtually reproducing our environment. Similarly in genomics, the rapidly decreasing costs of sequencing technologies combined with the emergence of increasingly sophisticated alignment and inferencing algorithms are leading to huge increases in our knowledge of life as a system.

This 'disruptive' innovation trend will continue with 'cloud', 'big data', 'linked data' and 'open access' leading to new ideas, products and services. The biodiversity informatics community is adapting to the increasing rate of change by adopting dynamic solutions freed from rigid technologies that may be obsolete tomorrow. We need to demonstrate how we are joining up. Collectively we need to see the big picture, understand the jigsaw of challenges and decode all the complexity that exists within populations and species.

To put this in context, two recent initiatives have, respectively, examined the challenges facing biodiversity science research in Europe, and espoused the architectural principles of the Biodiversity Observation Network (GEO BON) within the Global Earth Observation System of Systems (GEOSS) initiative. The first initiative led to the LERU report [20] that discusses 18 building blocks for the biodiversity research agenda, necessary to implement the EU's 2050 vision for biodiversity and ecosystem servicesg, and to reach the EU's 2020 targetsh for halting biodiversity loss. Principally, the report points to the need for a common e-Science infrastructure for biodiversity research (sub-clause 13). Several of its key recommendations involve informatics playing a substantial, enabling role. With their clause numbers from the LERU report in brackets, these recommendations are:

• Investing in a European infrastructure for biodiversity data and research (sub-clause 32)

Table 1 Intelligent openness as defined by the UK's Royal Society

Accessible: Data must be located in such a manner that it can readily be found and in a form that can be used.

Assessable: In a state in which judgments can be made as to the data or information's reliability. Data must provide an account of the results of scientific work that is intelligible to those wishing to understand or scrutinise them. Data must therefore be differentiated for different audiences.

Intelligible: Comprehensible for those who wish to scrutinise something. Audiences need to be able to make some judgment or assessment of what is communicated. They will need to judge the nature of the claims made. They should be able to judge the competence and reliability of those making the claims. Assessability also includes the disclosure of attendant factors that might influence public trust.

Useable: In a format where others can use the data or information. Data should be able to be reused, often for different purposes, and therefore will require proper background information and metadata. The usability of data will also depend on those who wish to use them.


• Investing strongly in enhancing fundamental knowledge on biodiversity drivers and threats (sub-clause 33)

• Supporting effective translation of scientific knowledge into biodiversity practice (sub-clause 34)

• Supporting multidisciplinary collaborative networks (sub-clause 41)

• Supporting the science-policy interface in biodiversity protection, and in particular supporting the needs of the Intergovernmental Platform on Biodiversity and Ecosystem Services (IPBES) (sub-clause 42)

• Delivering education and awareness (sub-clause 43).

The second initiative is GEO BON [21-23] that, building on existing networks and initiatives, proposes "an informatics network in support of the efficient and effective collection, management, sharing, and analysis of data on the status and trends of the world's biodiversity, covering variation in composition, structure and function at ecosystem, species and genetic levels and spanning terrestrial, freshwater, coastal, and open ocean marine domains".

GEO BON fits in the broader conceptual framework of GEOSS [24] to deliver a decentralised and distributed informatics infrastructure. The GEO BON system will have a Service Oriented Architecture (SOA) and will be built largely from contributing systems that have their genesis at regional, national or sub-national scales. At the European level, the planned ESFRI LifeWatch research infrastructure [25], along with the EBONE [26] and EU BON [27] projects, eventually forms the European contribution to realising GEO BON.

Key among the technical objectives of GEO BON is the need to promote the use of multidisciplinary interoperability standards, and to define and update interoperability solutions – applying the System of Systems approach promoted by GEOSS. GEO BON will also help to promote data publication principles in support of full and open availability of data and information, recognising relevant international instruments and national policies and legislation. One of the main tasks for GEO BON contributors is thus to identify the main contributing components, list the services they provide and also the standards or special interoperability solutions they use. Central to the success of GEO BON is increasing cooperation among the standards organisations with interests in the biodiversity science domain, notably: the Genomic Standards Consortium (GSC) for standards at the genomic / genetic level; TDWG for biodiversity information standards at the organism level; and the Long-Term Ecological Research network (LTER) for standards concerned with populations and ecosystems. Better cooperation leads to better coherence among standards and better interoperability.

It is clear, then, that the landscape of biodiversity informatics is already complicated, but is understood. Moving forward must take account of, and build upon, what has already been achieved.

The taxonomic impediment
The term Taxonomic Impediment [28] was coined by IUBS/Diversitas to refer to the gaps in our taxonomic knowledge, the shortage of taxonomic expertise and the impacts that these have on the progress of biodiversity science. In this context, taxonomy is the knowledge that allows the actors in a process to be identified and for inference to be drawn from the presence of a particular organism. Taxonomic services refer to the means of delivering that knowledge and present three basic problems:

• Taxonomic services rely on highly educated personnel and hence are very expensive;

• The data delivered by traditional taxonomic services have a limited application potential, largely because species identification is expensive and therefore typically carried out with a limited spatio-temporal and taxonomic scope, unsuited to address ecological questions at larger spatio-temporal scales or more complex patterns;

• Taxonomic expertise is shifting away from traditional practices of producing morphological descriptions and identification keys towards phylogenetic, especially molecular, studies.

The Taxonomic Impediment is, in part, a reflection of the need to use these expensive taxonomic services for all studies in the natural world. Alternative approaches that could address some biodiversity-related problems could help to relieve the currently perceived bottleneck and allow taxonomists to focus on those groups where their skill delivers greatest return. Necessary tools include semi-automated image-based species identification services based on techniques such as those described in [29] and citizen reporting systems such as the Swedish Species Gateway [30]. Enhancing taxonomic services with DNA-based identification tools (e.g. the DNA Barcode of Life standard [31]), for example, will not only improve the quality of identifications (objectivity, data interoperability), but will also deliver high-throughput approaches for environmental monitoring, species-intense ecosystem research (e.g. the Moorea Biocode Project [32]), and better ecosystem-based management.

Biodiversity informatics can help by liberating the taxonomic scientist from the clerical labour of locating comparative materials, both specimens and literaturei [33].

A more radical way to overcome the taxonomic impediment might be to use biodiversity informatics without traditional taxonomy.


Molecular studies can generate characteristic sequences to identify organisms or, more radically still, identify particular enzymes central to the process of interest. It could be argued that the current thinking of species name and location is paper-based and does not embrace the informatics potential. Research projects exploring such innovative approaches should be encouraged. In particular, genomic observatories are in a position routinely to sequence DNA and link this foundational layer of biodiversity to its biological, ecological, environmental and social context.

Changing the landscape - a decadal vision
The key component needed to develop biodiversity informatics further is effective integration of the available resources, to ensure that the practice of publishing biodiversity information becomes widely adopted in the scientific community and leads to scientific synthesis. Synthesis is increasingly recognised as an essential component of the scientific endeavour. Scientific synthesis refers to the integration of diverse research in order to increase the generality and applicability of the results. At its core, synthesis is about blending disparate information and knowledge in ways that yield novel insights or explanations [34]. Synthesis occurs both within and across disciplines and the implementation of an effective biodiversity informatics infrastructure would greatly enhance this type of activity. Such an enhanced integration of all related information, including raw data, processed data, algorithms (code, workflows) and publications can be achieved through the implementation of an effective biodiversity informatics infrastructure: a shared and maintained multi-purpose network of computationally-based processing services sitting on top of an open (published, registered and linked) data domain. Together, these deliver a stable, broad portfolio of biodiversity information and analytical services that can be used by user communities to investigate problems of interest.

The vision is to develop the concept of 'services' delivering either data or analysis of information using a small set of interchange standards. New services can be introduced into such an environment and be generally accessible without special effort. This vision implies a number of significant details, which are elaborated in more detail in the remainder of this white paper.

Realising the vision
Effective realisation of the decadal vision relies on achieving a balance of top-down and bottom-up approaches by making appropriate funding decisions. Top-down approaches include thinking and acting at the European level, encouraging community adoption of standards within the EU (part of a worldwide effort in which the EU is a key player), setting direction and goals through targeted funding calls, workshops and meetings. Bottom-up approaches derive from the motivation of individuals, their ideas and enthusiasms and their need to solve specific problems. Both approaches together recognise the role that individuals and groups have to play in the decadal vision by encouraging islands of infrastructure to emerge, grow organically and fuse with one another over time.

Leveraging existing projects
Numerous biodiversity informatics projects have been funded in Europe by, amongst others, the Framework Programmes. Globally, there are already more than 650 projects known [35]. Examples from Europe include the networks of excellence (ALTER-Net, LTER-Europe, EDIT/PESI, MarBEF/Mars, EuroMarine etc.) and other projects such as 4D4Life/i4Life, agINFRA, Aquamaps, iMarine, BioFresh, BioVeL, ENVRI, EU-BON, EUBrazilOpenBio, Fauna Iberica, MicroB3, OpenUp!, pro-iBiosphere and ViBRANT among othersj. Many of these projects directly address the challenges of deploying e-Infrastructure for biodiversity sciencek. They seem to share similar characteristics, such as scientific field, integration and interoperation of resources, open access, service orientation, e-Infrastructure and e-Science virtual environments. They differ substantially, however, in their architectures and technological approaches. These largely technical differences illustrate a larger problem: the lack of a common understanding about how best to deploy e-Infrastructures for biodiversity and ecosystem research. None of the projects can solve the problem alone nor hope to provide all the functionalities that will be needed in the future. Working on non-converging agendas, understandable given the imperative to push boundaries for innovation and academic advancement, does not lead to a coherent infrastructure with all necessary capabilities and capacities to support scientific research. There are overlaps, dead-ends and often a complete lack of mainstream industrial involvement. It is for such reasons that community consensus around a decadal vision, combined with effective selection of projects to be funded and their subsequent interactions and management within a coherent programme, is so important.

The decadal vision provides the means by which the complementary aspects of multiple projects can be combined in a common roadmap forward. Achieving this requires an increased awareness from all projects of the architectural approaches and construction steps to be adopted. Multiple projects contributing to that infrastructure need to get aligned because no single project can solve all problems alone. Separate projects need to achieve greater coherence and coordination to maximise the benefit from substantial investments of the past, present and future.


Within the Horizon 2020 framework it is therefore required to develop an effective and continuing coordination, dissemination, education and training capability providing and re-distributing help, technical guidance and examples of best practice. This capability will inform individuals and groups about the top-down strategies, the priorities and progress made, leading towards greater community understanding of the overall vision.

Project proposals developed bottom-up for Horizon 2020 funding should fit under the umbrella of the community's decadal vision. They should leverage completed and existing funded projects to gain the maximum benefit for the future biodiversity infrastructure. Proposals should explain how they have taken earlier and current project results into account and demonstrate that they are building on them rather than offering incompatible alternatives. Re-inventing the same (or different) solutions is not cost-effective. Letters of support from other projects should be used to demonstrate that community-wide discussion and acceptance of proposals has taken place prior to submission for funding. Project proposals should show clearly where and how they contribute towards the decadal vision. They should devote a significant portion of their resources to networking with other projects, to demonstrating compatibility and added-value as a key performance indicator at an early stage, and to marketing the services, technologies and approaches being developed to potential users.

Section 1: the fundamental backbone (getting the basics right)
Why are names important?
1. Until the recent application of molecular technologies to biodiversity studies, almost all information has been labelled with scientific names. Names have a special significance to link information elements and as such, it is important to use them knowingly and to build tools that work with names [36]. As they reflect concepts that change between individuals and over time, names may refer to many different concepts, making them equivocal identifiers. In addition, information is often available only in local databases. The challenge is to find it, harmonise the way it is accessed and make it available in computer-readable formats. Nomenclature, taxonomy, taxa and their biology together constitute a large challenge requiring novel infrastructure and change of usual practices by stakeholders. Numerous initiatives exist to deal with these aspects but progress will require a common agenda to bring about a virtual infrastructure that will reduce the apparent diversity of web resources without reducing the diversity of services required by a diverse user community. While content about taxon names must be assembled by nomenclaturalists, taxonomists and managers of biodiversity information, there is an urgent need for vision-driven architectural and engineering solutions. The GNA's (Global Names Architecture) [37] current priority is on name-strings (Global Names Index, GNI [38]) and name-usage instancesl (Global Names Usage Bank, GNUB [39]). The latter does not yet exist but will provide the essential semantic relationships (cross-links) at the nomenclatural level. This focus is entirely appropriate because universal coverage is tractable in the short to medium term. The resolution of names to concepts (see paragraph 3, below) is far more difficult and is likely to be intractable for universal coverage.

How are names organised?
2. A long unorganised list of names is not particularly helpful. Since Linnaeus, biologists have used latinised binomial names where the first part (the genus) is shared by a group of similar organisms and the second part, the epithet, differentiates between members of the group (e.g. oak trees belong to the genus Quercus, which contains around 600 species). A similar hierarchical classification is followed for genera that are grouped into families, families into orders, orders into classes and classes into phyla. As science advances, however, these relationships change with greater understanding. While it is possible to build hierarchies from instances of name-strings, it is inefficient. The solution required includes a classification bank combined with a name list (see Paragraph 1) to produce a taxonomic hierarchy automatically for groups that have not recently received taxonomic attention.
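A minimal sketch of the idea, assuming each name record arrives as a (name, rank, parent) triple: with such a flat list and the name list of Paragraph 1, a browsable hierarchy can be assembled automatically. The record structure and example names are illustrative, not the schema of any existing classification bank.

```python
from collections import defaultdict

# Flat name records as (name, rank, parent_name); an illustrative assumption,
# not an existing classification-bank schema.
records = [
    ("Plantae", "kingdom", None),
    ("Fagales", "order", "Plantae"),
    ("Fagaceae", "family", "Fagales"),
    ("Quercus", "genus", "Fagaceae"),
    ("Quercus robur", "species", "Quercus"),
    ("Quercus petraea", "species", "Quercus"),
]

children = defaultdict(list)
for name, rank, parent in records:
    children[parent].append((name, rank))

def print_tree(parent=None, depth=0):
    """Print the hierarchy by walking parent-child links depth-first."""
    for name, rank in sorted(children.get(parent, [])):
        print("  " * depth + f"{name} ({rank})")
        print_tree(name, depth + 1)

print_tree()
```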

Which is the right name?
3. The Species2000 / ITIS Catalogue of Life [40] is a global taxonomic reference system drawing on content from more than 100 sources. It provides a composite expert view on taxonomic information, providing an authoritative but mutable framework. Names within CoL represent concepts, but there is no link to the concepts themselves and therefore an identification cannot be unequivocally verified. Other classifications with names, such as the NCBI taxonomy [41] or the WoRMS systems [42], can also be used as organisational frameworks. Yet each serves its own audiences, revealing the need for multiple systems that are however interoperable. Initiatives such as the Global Names Architecture (GNA) [37] promote the development of an infrastructure capable of linking available information about biological names.


iPlant's Taxonomic Name Resolution Service, TNRS 3.0 [43], corrects and standardises plant scientific names against particular taxonomies. ZooBank [44] is a new initiative to move the process by which new names become recognised into the digital age. Tools for alignment and cross-mapping of taxonomies can only be partially automated, since the domain knowledge held by taxonomists is very difficult fully to codify. Some projects such as i4Life [45] have developed tools that exploit characteristics of biological nomenclature to detect relationships between taxonomies, providing a useful "first draft" cross-map. Nevertheless, to be authoritative, future environments must link nomenclators (like ZooBank, IPNI and Index Fungorum, Mycobank), taxonomic compendia (such as CoL and WoRMS), other classifications (a classification bank, perhaps), literature sources (describing species, their attributes, distributions, and common names), and phylogenies, covering the whole spectrum of biodiversity complexity. The taxon name is an access key, but it is essential that it can be linked to other resources, such as descriptions, traits and habitat. Ultimately names are the bridge to the accumulated information built over the past 300 years and trapped in the paper world.
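The kind of "first draft" cross-map such tools produce can be sketched as a comparison of normalised name-strings; the normalisation heuristic, similarity threshold and example checklists below are simplifying assumptions, not the i4Life algorithm.

```python
from difflib import SequenceMatcher

def normalise(name_string: str) -> str:
    """Keep only the lower-cased genus and epithet; drop authorship (crude heuristic)."""
    parts = name_string.lower().replace("(", " ").split()
    return " ".join(parts[:2])

def first_draft_crossmap(list_a, list_b, threshold=0.9):
    """Pair names from two checklists whose normalised strings are near-identical."""
    pairs = []
    for a in list_a:
        for b in list_b:
            score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

# Illustrative name-strings only; the second entry of checklist_2 is a spelling variant.
checklist_1 = ["Quercus robur L.", "Quercus petraea (Matt.) Liebl."]
checklist_2 = ["Quercus robur", "Quercus petrea"]
print(first_draft_crossmap(checklist_1, checklist_2))
```

A human taxonomist still has to confirm each proposed pair; the sketch only narrows the work, which is the role the text assigns to such tools.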

What is the name of that organism?
4. The practical identification of an organism relies on the construction of a circumscription of the taxon to be identified, which in turn requires the examination of a range of specimens agreed to belong to the taxon. Before the digital era, the only ways to identify the name of an organism were to use a paper identification key or to consult an expert. New identification techniques have emerged to get to the name of an organism, including matrix keys and 'smart keys' that use the locality and time of the year to reduce the number of identification choices. These identification techniques however are labour-intensive and depend on experts to create the necessary circumscriptions, keys and the link to the list of accepted names. Automated identification techniques, like image recognition or in situ DNA analysis, are not yet sufficiently developed to be used routinely and reliably for most organisms. Identification keys always cover a small part of biodiversity and may also be difficult to discover. Developing morphological keys to all organisms is not achievable because no global organisation can establish a central repository, or even coordinate, prioritise and fund the creation of keys. The major priority therefore is to make the necessary descriptive data with their associated range and habitat information freely available. Services can then be created, for instance as 'apps' for the mobile phone market.
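A minimal sketch of a matrix ('multi-access') key follows: each taxon is scored against a set of characters, and observed states progressively narrow the candidates. The characters, states and taxa are invented for illustration.

```python
# Each taxon is scored against a set of characters; observed states filter
# the candidate list. Characters, states and taxa are invented examples.
matrix = {
    "Species A": {"leaf shape": "lobed",  "flower colour": "yellow", "habitat": "woodland"},
    "Species B": {"leaf shape": "entire", "flower colour": "yellow", "habitat": "grassland"},
    "Species C": {"leaf shape": "lobed",  "flower colour": "white",  "habitat": "woodland"},
}

def filter_candidates(observations: dict, key_matrix: dict) -> list:
    """Return taxa whose scored states match every observed character state."""
    return [
        taxon for taxon, states in key_matrix.items()
        if all(states.get(char) == state for char, state in observations.items())
    ]

# An identification session: character states can be entered in any order.
print(filter_candidates({"leaf shape": "lobed"}, matrix))                              # Species A and C
print(filter_candidates({"leaf shape": "lobed", "flower colour": "yellow"}, matrix))   # Species A
```

Unlike a paper key, the order of questions does not matter, which is what makes matrix keys attractive for non-expert users.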

Can biodiversity studies be done without names?
5. Almost all of our accumulated knowledge about biodiversity has been gathered and organised using species names. According to a recent exhaustive review by Costello [33], taxonomists think that about 1.5 million living species have been described, but lacking a single authoritative list of names, this is only an approximation and many species may be invalid [46]. The number of species left to be discovered is substantial [47] and, given that current taxonomy is the product of more than 250 years of effort, it is unrealistic to have a complete catalogue if we adhere to currently accepted methods [33,48]. Solutions include modern molecular techniques, such as DNA barcoding and massively parallel high-throughput sequencing, effective at revealing much of the undescribed diversity. Such systems based on environmental genomics ('metagenomics') are already well established in microbial ecology where DNA sequences act as tags identifying organisms in the ecosystem. These techniques, being inherently destructive, cannot yield a traditional specimen, so cannot be used to name a new species, but they are promising in assessing ecosystem biodiversity without the requirement to name every species present. Challenges in the deployment of such techniques include:

• Ensuring that the data, information and knowledge emerging from this new paradigm become integrated with traditional taxonomy so that we continue to benefit from the efforts of taxonomists over the previous 250 years;

• Curating the species' names that have been attributed to sequences in databases of the International Nucleotide Sequence Database Collaboration (INSDC);

• Devising a framework to integrate the specimen-centric observations encoded by the Darwin Core standard with the environmental and ecological context of metagenomics;

• Putting these different layers of information together so as to identify the response of the ecosystem to environmental change;

• Being able to access covariate data, for example concurrent observable chemical, biological and other environmental variables from the target ecosystem, especially for environmental metagenomic studies.


Ecological research will largely benefit from such new approaches classifying and understanding genomic biodiversity based on functions, their evolution and distribution.
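As a toy illustration of DNA sequences acting as identification tags, the sketch below matches environmental sequence reads against a reference library by exact substring lookup; the sequences and taxon labels are fabricated, and real metagenomic pipelines rely on far more sophisticated alignment, clustering and error models.

```python
# Toy sketch: label environmental sequence reads by exact match against a
# reference library of short "tag" sequences. All sequences and taxon labels
# below are fabricated for illustration.
reference_tags = {
    "ACGTACGTGGCA": "Taxon 1 (hypothetical)",
    "TTGACCGTAACG": "Taxon 2 (hypothetical)",
}

def identify(read: str) -> str:
    """Label a read by the first reference tag it contains, else mark it unassigned."""
    for tag, taxon in reference_tags.items():
        if tag in read:
            return taxon
    return "unassigned"

environmental_reads = ["GGACGTACGTGGCATT", "CCCTTGACCGTAACGA", "AAAAAAAAAAAAAAAA"]
for read in environmental_reads:
    print(read, "->", identify(read))
```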

Biodiversity data beyond names
6. Names are an access key to biodiversity information, including information on the occurrence of species in time and space. Ultimately we need to be able to integrate biodiversity information indexed by names with information on:

• Functional diversity;

• Diversity at various levels of organisation: genes, organisms, ecosystems, landscapes;

• Relationships between facets of biodiversity and ecosystem functioning and services;

• Those variables and data that describe the physical environment;

• Fluxes through environments, such as phosphorus into and out of a system.

Crucial to this endeavour will be our ability to devise methods to link such data and to define data standards to make such linkage straightforward (see Paragraph 15 et seq. below).

To link resources we need identifiers
7. Biodiversity science needs to adopt a system of persistent and universally unique identifiers (PIDs or UUIDs) that will allow resources to be located and linked. An identifier can be attached to a resource of any kind, including data (e.g. specimens in a collection), taxonomic concepts, genetic sequences (e.g. INSDC accession numbers), publications (e.g. DOIs), or data sets and services (e.g. workflows, computational services or computer code). Identifiers must be stable and unique but they should also:

• conform to some widely-used syntactic definition;

• have a consistent initial part (e.g. http:// or DOI:), so that they can be recovered in a free-text search;

• ideally, be resolvable (resolution = where to find a particular resource);

• be archived together with the resource in a sustainable manner, ideally in multiple locations (if the GUID is not resolvable, the resource can be found by searching for the GUID).

PIDs do not protect against duplication, i.e. a single resource carrying multiple PIDs, but if they were used, then resources could be linked, so that discovery of any item in a chain of connections would permit the discovery of all the rest of the resources as well as allow for consequent credit allocation (see paragraph 11). There is no technical challenge in the use of PIDs. DOIs, for example, are now familiar in many publications and although the DataCite [49] initiative has significantly reduced both the cost and complexity of using DOIs, their direct application by the biodiversity informatics community remains rare. The community, then, seems reluctant to use PIDs. It is not clear whether this is due to reluctance to change current working practice or whether it is due to lack of suitable tools - either for linking or for following links. Note that a well-known resolver, comparable to CrossRef for DOIs, is only necessary for resolvability (point and click). PIDs, by their unique nature, can be discovered with standard search tools such as Google. More elaborate linkage mechanisms are possible, and could deliver much greater benefit, but introducing the community to simple linkage is a challenge, so Linked Data [50] is considered a "next step" (Section 2) rather than a foundational technology in this white paper.
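To make 'resolvable' concrete, the sketch below asks the doi.org resolver for machine-readable metadata about a DOI using HTTP content negotiation. The Accept header shown is a convention commonly supported by DOI registration agencies, but the endpoint behaviour and media type should be treated as assumptions to verify rather than a guaranteed interface.

```python
import json
import urllib.request

def resolve_doi(doi: str) -> dict:
    """Ask the doi.org resolver for citation metadata via HTTP content negotiation.

    The media type below is a widely used convention among DOI registration
    agencies; check current resolver documentation before relying on it.
    """
    request = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    metadata = resolve_doi("10.1186/1472-6785-13-16")  # the DOI of this article
    print(metadata.get("title"), "-", metadata.get("publisher"))
```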

Centralised or networked services?
8. Networked services refers to the use of a resource directly over the web, so that one website may call another site for information necessary to carry out its function. Centralised services, on the other hand, concentrate all the associated resources at a single site. Although networked services are desirable to maintain consistency and to focus resources for maintenance (i.e. an authoritative master copy), in practice they are often unable to deliver the speed of response necessary for usability, and creating a local copy (a snapshot of a dynamic resource) or developing an independent resource is often the only realistic solution available. Local copies are often the only practical solution for computation but there is no mechanism by which alterations to the primary resource can be effectively propagated to all copies. This inevitably leads to differences between copies. Copies should, therefore, be used over a short timeframe and, if necessary, refreshed. A feedback mechanism is required so that a data user can report an inconsistency to the data owner and, where a correction is suggested, it can be easily incorporated into the original dataset. Automated workflows crucially require Web services, but working with large datasets in networked services poses technical challenges in the ability to move large volumes of data, the provision of suitable search facilities that minimise the number of host-client interactions, and the bandwidth necessary to keep response times short. Centralised services, such as VertNet [51], assemble large collections of data in a common structure, submitted by individual data creators.


The benefit that this brings is economy of scale and the ability to tune the performance of the system. In the context of data cleaning, for example, having the data centrally makes it much easier to compare across data sets and discover inconsistencies. The gain of economy of scale is very important since once a given type of error is identified, rules can be applied for cleaning across the whole data set, therefore avoiding overloading remote services. Once established, though, it is difficult to change the structure and change the purpose, but for large-scale data generation systems, logically centralised services offer significant advantages. One drawback, however, now receiving attention in the genomics and other fields, is the issue of the time taken to move large datasets to where the resources for computational analysis and modelling are located. Strategies are presently being considered for how to move computation to the data.
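The trade-off between calling a networked service and holding a short-lived local copy can be sketched as a simple caching wrapper; the fetch function, refresh interval and example checklist below are illustrative assumptions, not a specification of any service mentioned above.

```python
import time

# Illustrative caching wrapper: call a remote (networked) service, keep a local
# snapshot for a short timeframe, and refresh it when it becomes stale.
class CachedService:
    def __init__(self, fetch_remote, max_age_seconds=3600):
        self._fetch_remote = fetch_remote   # callable that queries the remote service
        self._max_age = max_age_seconds     # how long a local copy stays trustworthy
        self._snapshot = None
        self._fetched_at = 0.0

    def get(self):
        """Return the local snapshot, refreshing it from the remote service when stale."""
        if self._snapshot is None or time.time() - self._fetched_at > self._max_age:
            self._snapshot = self._fetch_remote()
            self._fetched_at = time.time()
        return self._snapshot

def fetch_checklist():
    # Stand-in for a real web-service call (e.g. downloading a name list).
    return ["Quercus robur", "Quercus petraea"]

service = CachedService(fetch_checklist, max_age_seconds=600)
print(service.get())   # first call hits the "remote" service
print(service.get())   # second call reuses the local snapshot
```

The explicit maximum age makes the short-timeframe rule above operational, while the refresh path keeps the master copy authoritative.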

How to balance professional and non-professional contributions
9. Engagement of the biodiversity expert community is undoubtedly a key factor in advancing knowledge. Citizen science projects have been remarkably successful in advancing scientific knowledge by providing data primarily on species occurrence and distribution around the world [52]. These engage the public in the collection and analysis of data sets from multiple habitats and can span long periods of time. The big scientific issue tackled by these large data sets is how biodiversity varies through space and time, including biodiversity loss and detection of trends, such as shifting distribution boundaries. Citizen science projects represent a massive effort spent on biodiversity monitoring that could not otherwise be covered by the professional community alone without huge sustained financial investment. The primary challenge for the biodiversity informatics community is to develop a framework to address the currently multiple, cross-cutting requirements of citizen science projects, such as:

• Covering all steps in the development and implementation model of such projects, from the choice of scientific question to the evaluation of the outcomes;

• Automating validation (quality assurance, quality control and data cleansing processing) and annotation of the data produced [53] (a minimal sketch of such rule-based checks follows this list);

• Developing incentives to encourage participation in processing, analysis and use of data;

• Developing virtual research and teaching environment(s) for citizen scientists, to develop their skills to answer basic scientific questions;

• Improving systems for automated image recognition based on existing technologies (e.g. TinEye Reverse Image Search [54]) to harvest the vast repositories of amateur naturalists' photos;

• Promoting best practices by disseminating successful examples of actions on nature conservation;

• Ensuring continuous economic viability for the services through the linking of such citizen science projects with the relevant economic sector's stakeholders [55].
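Automated validation can begin with simple rule-based checks applied to each submitted record; the field names, reference checklist and rules below are illustrative assumptions rather than an agreed quality-control profile.

```python
from datetime import date

KNOWN_NAMES = {"Quercus robur", "Turdus merula"}   # stand-in for a checklist lookup

def validate_record(record: dict) -> list:
    """Run simple quality-control rules on one occurrence record and list any problems."""
    problems = []
    if record.get("scientificName") not in KNOWN_NAMES:
        problems.append("name not found in reference checklist")
    lat, lon = record.get("decimalLatitude"), record.get("decimalLongitude")
    if lat is None or not -90 <= lat <= 90:
        problems.append("latitude missing or out of range")
    if lon is None or not -180 <= lon <= 180:
        problems.append("longitude missing or out of range")
    if record.get("eventDate") and record["eventDate"] > date.today().isoformat():
        problems.append("event date is in the future")
    return problems

# Illustrative record with a deliberately bad longitude.
print(validate_record({
    "scientificName": "Turdus merula",
    "decimalLatitude": 51.5,
    "decimalLongitude": 200.0,
    "eventDate": "2013-05-20",
}))
# ['longitude missing or out of range']
```

Records that fail such checks would be flagged for annotation or expert review rather than discarded, preserving the volunteer's contribution.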

Engagement of Users
10. A great deal of high-quality software, services and resources has been created over the past decade, but much remains underused, even within the biodiversity informatics community itself. Many projects have relied on traditional routes to publicise their products, primarily through academic publications. It would be undesirable to impose standard applications or resources upon the community. Better to allow users to decide, to select which products best match their requirements. Projects should invest significant resource into marketing their products, engaging with real users and refining the product from user feedback, following the dictum of "release early and release often". Such marketing need not absorb a significant fraction of a project's budget but should be a clear strategy and an integral part of project management.

Who's who?
11. Traditionally, experts have published their observations and conclusions in peer-reviewed paper publications, a tradition that has been effectively transferred to the digital age through e-publication. The tradition has several consequences. First, it has created a system of citations by which individuals are assessed for career development. Second, the cost of print-on-paper has driven data presentation to a compact, often summarised, format. Third, the financial interests of the publisher have restricted the availability of the data for re-use. Two aspects of the citation mechanism are important:

• Provenance, meaning that a data user can easily discover who generated the data, which can attach a level of reliability to the data; and

• Impact, by providing a hyperlink that allows a user to see where a particular data set has been used, both how often and for what, which could easily be incorporated into the managerial assessment of an individual's career.

Modern digital publication could effectively remove the typographical restrictions [4], making data more easily available for re-use. Some publishers, e.g. Pensoft [56], are already introducing publication in parallel formats (paper, PDF, HTML, XML). The new paradigm is about evolving new methods to identify contributors and users consistently, where identification can be carried from one environment to another, including the popular social networking environments like Facebook and Twitter. Approaches to this are being developed in the ORCID consortium [57] and VIAF [58], designed for those who publish scientific articles (scholarly authors), but they also need to include other users, such as compilers of reports and assessments. This transition to reusable data identifiably associated with an individual or group of individuals is a common call within the Open Science movement, relevant for all scientific disciplines. Note that the US National Science Foundation now requires applicants for funding to list their research "products" rather than "publications", implicitly recognising the value of contributions beyond paper publication [59].

User identification

12. Open access data and services allow users to remain anonymous for some level of access. Some forms of interaction, however, such as posting comments and corrections, and some types of services, such as download, often require that users identify themselves. Social media tools like Facebook, Google and Twitter offer common third-party authentication mechanisms that can be used for access control. This has two main advantages: first, it makes every resource easy to access; and second, it is a stronger security check than inventing a username and password for each site visited. Nevertheless, some resources will require a stronger form of authentication, for instance where payment is required. As a general principle, access to biodiversity data should normally be unrestricted except where restriction is essential to protect, for example, location data for rare bird nesting sites.
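To make the idea concrete, the following minimal sketch shows third-party login using the OAuth 2.0 authorization-code flow, the mechanism underlying the social-media sign-ins mentioned above. It is illustrative only: the provider URLs, client identifier and scopes are placeholders rather than any real service's values, and a production portal would also need to validate state and store tokens securely.

```python
# Minimal sketch of third-party (OAuth 2.0) login for a biodiversity data portal.
# All URLs, client IDs and scopes below are placeholders, not a real provider's values.
from requests_oauthlib import OAuth2Session

CLIENT_ID = "example-portal-client-id"            # issued by the identity provider
REDIRECT_URI = "https://portal.example.org/callback"
AUTH_URL = "https://idp.example.org/oauth/authorize"
TOKEN_URL = "https://idp.example.org/oauth/token"

# 1. Redirect the user to the provider's login page.
session = OAuth2Session(CLIENT_ID, redirect_uri=REDIRECT_URI, scope=["openid", "profile"])
authorization_url, state = session.authorization_url(AUTH_URL)
print("Send the user to:", authorization_url)

# 2. After the user logs in, the provider redirects back with a code;
#    exchange it for a token that identifies the user to the portal.
# token = session.fetch_token(
#     TOKEN_URL,
#     client_secret="example-secret",
#     authorization_response="https://portal.example.org/callback?code=...&state=...")
# The portal never sees the user's password, only a verified identity.
```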

How do we ensure the right metadata are created at the point of data generation?

13. The scientific process requires the collection of observations from which hypotheses can be formed and, when necessary, more data to be collected to test them. Adding metadata represents an overhead on current practice, but it is essential if data are to be discoverable and re-usable. Metadata are the key to discoverability and provide the context for linking resources. To improve current practices there is an urgent need for i) community agreement on metadata standards for specific purposes and ii) mechanisms to collect and append the necessary metadata, automatically whenever possible, such as the design of workflows that make use of standard services to create data-recording templates. In the short term, the extra effort of metadata production will have to be borne by the data producer, especially in the context of data journals, but tools to automate the production of metadata are conceivable, essentially eliminating the burden of production. A move to Linked Open Data is expected to obviate the need for enlarged metadata by making data more easily discoverable through concept linkage (see paragraph 15 et seq. below on Linked Data).
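As an illustration of metadata being appended automatically at the point of data generation, the short sketch below fills provenance fields (who, when, under what licence) from a recorder profile and the system clock, so the observer supplies only the observation itself. The field names loosely follow Darwin Core and Dublin Core usage; the exact profile would be whatever metadata standard a community agrees for the purpose, and the example values are invented.

```python
# Minimal sketch: attach basic provenance metadata automatically when a record is created.
# Field names loosely follow Darwin Core / Dublin Core conventions; values are illustrative.
import json
from datetime import datetime, timezone

RECORDER_PROFILE = {               # configured once per recorder/instrument, not per record
    "creator": "A. N. Observer",
    "institution": "Example Field Station",
    "license": "http://creativecommons.org/licenses/by/4.0/",
}

def new_observation(scientific_name, lat, lon):
    """Create an observation with metadata filled in at the point of data generation."""
    return {
        "scientificName": scientific_name,
        "decimalLatitude": lat,
        "decimalLongitude": lon,
        "eventDate": datetime.now(timezone.utc).isoformat(),   # captured automatically
        **RECORDER_PROFILE,                                     # provenance added for free
    }

print(json.dumps(new_observation("Parus major", 51.48, -3.18), indent=2))
```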

Sustaining the physical infrastructure

14. Appropriate biodiversity informatics tools will generate greater impact than is currently possible from the physical infrastructure of natural history collections, mesocosms, other experimental facilities, long-term ecological monitoring sites and genomics facilities, through much greater digital and on-line access to the facilities than is physically practical. This will enhance the sustainability of the infrastructure, since a large user base is critical for political sustainability.

Section 2: the next steps

Data sharing

15. Two relatively large surveys were conducted to understand how data are treated by scientists across different disciplines: by the PARSE.Insight project [60], with 1202 respondents, and by Science Magazine [61], with 1700 respondents, both with multidisciplinary international responses. From what researchers say about where they store and manage data, it can be deduced that data are not often shared openly. The results show that across all disciplines only between 6-8% of researchers deposit datasets in an external archive of the discipline/research domain. The most common environment for storing, managing and re-using data is the lab and/or individual working environment, down to PCs and portable storage carriers. The category "server" is probably best understood as a file server of the research organisation, behind a firewall and with restricted access for defined groups of registered users. According to Science Magazine, most of the respondents (80.3%) said that they do not have sufficient funding available for data curation. Other reports [4,62-65] share more insight into data sharing practices by research area and highlight the importance of data sharing becoming normal practice.

Why do we need vocabularies and ontologies?

16. Common vocabularies are the foundation for both human and machine communication (e.g. in data sharing, automated workflows, data integration and analysis). By agreeing on a set of concepts and their definitions within a domain, a community of practice can share data and information unambiguously. Data integration and analysis critically require semantic consistency as well as syntactic standardisation, the former being more challenging to achieve than the latter. Initially communities will accept a small controlled vocabulary - terms supported by human-readable text definitions. As terms are rarely independent of one another, the vocabulary list evolves into a thesaurus and, as formal relationships between terms are agreed, an ontology [66]. There are lessons to be learnt by looking elsewhere, for example Google's "Knowledge Graph" [67], the Unified Medical Language System (UMLS, medical informatics) [68], AGROVOC (agriculture) [69] and the OBO (plant and animal phenotypes) stable of ontologies [70]. AGROVOC covers many of the terms relevant to biodiversity and is modular enough to be extended. There are other ontologies useful for capturing biodiversity data, such as the environment ontology, EnvO [71], and the more general DAML [72]. There is a pressing need for ontologies that span multiple communities, implying domains, and at present such over-arching technologies do not seem to exist. Individual community ontologies tend to isolate communities rather than enable more open sharing, but community ontologies are with us now and need to be integrated. Some systems, such as UMLS, are not structured to support reasoning or subsumption, so are not necessarily a good model for further development. Nevertheless, establishment of community standard terminologies and ontologies presents problems that are familiar to other communities, such as human genetics and model organism functional genomics, and some of these lessons have already been learned:

• Terminologies / ontologies need to be owned by the community, but their maintenance is an ongoing requirement which requires stable funding and a degree of community coordination and interaction;

• tools that biologists find intuitive need to be developed for both data coding and analysis, making the process efficient and effectively invisible;

• ongoing terminology and syntax development need expert construction and are not just problems of computer science;

• a significant problem exists in the communication of changes in those lists to sites that consume the data, and a central catalogue / source is required, such as currently provided by OBO or the NCBO (National Center for Biomedical Ontology);

• mapping of data coded by legacy terminologies and integration of data coded by different species-specific ontologies are problems already addressed by some communities.

There is potential in semantic interoperability for biodiversity data, but this requires quite basic research and IT development to enter new paradigms supporting open semantic approaches. The provision of a strategy for transferring "legacy" data models into semantic-aware technologies is clearly desirable because existing data models are often accurate, comprehensive and represent a great deal of effort from the scientific community. We need a pragmatic strategy for mobilising this knowledge. Such mobilisation may also assist in achieving broad user acceptance, a greater problem than are the associated technical issues. Developing and applying vocabularies is clearly hard and requires the existence of persistent identifiers (paragraph 7 above) to be effective. It will require organisation and cooperation, or to put it simply, it takes goodwill but also cash.
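As a small illustration of the progression from controlled vocabulary to formal relationships described above, the sketch below publishes two habitat terms as SKOS concepts using rdflib, attaching human-readable definitions and one broader/narrower relation. The namespace and terms are invented for the example; a community vocabulary would use its own agreed URIs, definitions and governance.

```python
# Minimal sketch: expressing a small controlled vocabulary as SKOS so that terms,
# definitions and relations are machine-readable. The namespace is illustrative only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://vocab.example.org/habitat/")
g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

wetland = EX.wetland
marsh = EX.marsh

g.add((wetland, RDF.type, SKOS.Concept))
g.add((wetland, SKOS.prefLabel, Literal("wetland", lang="en")))
g.add((wetland, SKOS.definition,
       Literal("Land saturated with water, permanently or seasonally.", lang="en")))

g.add((marsh, RDF.type, SKOS.Concept))
g.add((marsh, SKOS.prefLabel, Literal("marsh", lang="en")))
g.add((marsh, SKOS.broader, wetland))      # formal relation: a marsh is a kind of wetland

print(g.serialize(format="turtle"))
```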

How would knowledge organising systems help?

17. The term 'knowledge organisation system' (KOS) covers any system for organising information, ranging from traditional library subject headings to newer approaches like semantic networks and ontologies. Recognising the need for a standards architecture to provide basic interoperability across open systems, the TDWG Technical Roadmaps in 2006, 2007 and 2008 all identified community-supported vocabularies and ontologies, expressing shared semantics of data, as one of three required components; the other two are common exchange protocols and use of persistent identifiers for the data. The TDWG Darwin Core [73] glossary of terms is amongst the most widely deployed biodiversity vocabularies, and both its management and its relation to the TDWG Ontologies can be used as a model for other vocabularies. The GBIF LSID-GUID Task Group [74] highlighted the need for GBIF to identify sustainable support mechanisms for essential shared vocabularies and commissioned White Paper Recommendations on the use of Knowledge Organisation Systems. GBIF [75] separated the need for ontology management from the lifecycle management of flat vocabularies in such tools as BioPortal. The development, management and governance of such vocabularies remain a challenge for the biodiversity community. As concluded in paragraph 16 and discussed in Section 4, the core technologies are available and well understood, but uptake by the community is not ideal. The challenge is to develop and deploy tools within the overall biodiversity informatics infrastructure that make the implementation of knowledge organisation systems effectively invisible. GBIF's Integrated Publishing Toolkit [76] is one example of a step in this direction. Put simply, what would it take to make knowledge organisation work effectively and what would it achieve if it did?

How easy is it to integrate data?

18. Biodiversity informatics is inherently a global initiative. With a multitude of organisations from different countries publishing biodiversity data, our foremost challenge is to make the diverse and distributed participating systems interoperable in order to support discovery and access to those data. A common exchange technology, e.g. XML or JSON, may allow the syntactic exchange of data blocks, but both systems also need to understand the semantics of the data being delivered in order to process it meaningfully. If the data do not share a common reference model, then the exchange requires some brokering or other semantic processing (using tools described in paragraphs 3, 7, 12 and 16 above). For instance, the widely used standard Darwin Core is predicated on the occurrence (either a physical specimen or an observational record) as the unit of information, so is of limited value in the context of metagenomics, for example, which may contain information about environmental function without mention of a named taxonomic entity, or information about communities of taxa. It is crucial that future efforts in this area take account of major global initiatives, especially GEO BON, GBIF and the Genomic Standards Consortium, as well as novel approaches in eco-informatics, but it is likely that the data models used in these initiatives will also need to be extended [77]. Existing data must either be transformed in a semantically-aware manner to conform to such standards, or software that is aware of the semantic heterogeneity must work with multiple standards.
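To illustrate the distinction between syntactic and semantic interoperability, the sketch below encodes a single occurrence as JSON using standard Darwin Core term names, so that a receiving system can interpret the fields without a bespoke mapping. Only the term names come from the standard; the record values are invented.

```python
# Minimal sketch: one occurrence record exchanged as JSON, using standard Darwin Core
# term names so a receiving system can interpret the fields without a custom mapping.
# The values are invented for illustration.
import json

occurrence = {
    "occurrenceID": "urn:catalog:EXAMPLE:HERP:12345",   # stable identifier for the record
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Rana temporaria Linnaeus, 1758",
    "eventDate": "2012-06-14",
    "decimalLatitude": 52.2053,
    "decimalLongitude": 0.1218,
    "countryCode": "GB",
    "recordedBy": "A. N. Observer",
}
print(json.dumps(occurrence, indent=2))
```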

Beyond sharing and re-use: the problem of scale

19. It should be straightforward to assemble a dataset on biodiversity and reach conclusions by linking available information. To understand and model processes, such as the phosphate cycle, requires information at the molecular level over seconds (solubility, diffusion and uptake), the kilometre level over years (transportation and availability) and the planetary level over geological time (mineral formation, extractability). The integration of all these data resources is necessary to model the cycle, from which policy decisions can be made for the time when cheap mineral phosphate (a fertiliser) is no longer available (in the next few decades) [2]. This example illustrates the complexity of the natural world, and how 'grand' is the challenge faced by biodiversity informatics to create a coordinated coupled modelling environment to address health, sustainability and environmental questions [78].

How reliable are the data?

20. Science is, by its nature, a sceptical process. Data are received at face value, examined and tested. If the user is satisfied, then the data will be applied. This process is crucial in biodiversity since information can rarely be generated by simple measurements. Concepts (like species), observations (based on human interpretations), proxy data (often originating from sensors) or algorithms (models fit for specific cases) constitute most biodiversity data, with their inherent uncertainties and fuzziness. It is vital, then, that information about how the measurement was taken, to the minimum data standard, is included in the associated metadata. Judgement of quality involves an assessment of fitness-for-purpose and therefore cannot be an absolute measure. Data can of course contain both errors of fact, e.g. typographical errors, and errors of design, e.g. collecting data under a flawed methodology. Errors of fact can be detected by various means, e.g. duplicate entry or proof-reading, whereas errors of design are more difficult to find automatically. A more significant problem is the accuracy of the data, meaning how precise and complete they are. In measurement it is accepted that a balance might weigh to the nearest 5 g, this being a characteristic of the balance. In information terms, lacking a standard for generating the datum, it is harder. For instance, bibliographic citations can have diverse formats that humans can easily resolve to the same publication; computers, by and large, cannot unless given a PID as an information standard. The challenge for biodiversity informatics is to provide appropriate tools for data cleaning [m] and to automate procedures for checking data for consistency [79], particularly against standard lists (see paragraph 16 above). Ultimately it is a case of caveat emptor. Users will develop trust in an information supplier, and sites may wish to use a voting mechanism, e.g. similar to the supplier rating system on eBay. A system is required for data publishers to display comments from identifiable users (see paragraph 12 above), providing a feedback mechanism, essentially an open peer-review. Exposure to users is the best way to validate data.
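One concrete form of such data-cleaning support is automated checking of scientific names against a standard list. The sketch below does this via GBIF's public species-match web service; the endpoint and the response fields used here reflect the public API as generally documented and should be verified against the current documentation before use.

```python
# Minimal sketch: check scientific names against a standard checklist as part of data
# cleaning, here via GBIF's public species-match web service. Endpoint and response
# field names should be verified against the current API documentation.
import requests

def check_name(name):
    resp = requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": name}, timeout=30)
    resp.raise_for_status()
    result = resp.json()
    return result.get("matchType"), result.get("scientificName")

for raw in ["Rana temporaria", "Rana temporarria"]:   # second name contains a typo
    match_type, matched = check_name(raw)
    print(f"{raw!r} -> {match_type}: {matched}")
```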

What will the physical infrastructure look like?

21. The plummeting cost of hardware, the increasing use of virtualisation and the blurring between fixed / mobile computing and work / domestic environments for computing make the prediction of preferred computing environments of dubious value. Compiling this white paper has identified no apparent need for bespoke ICT technologies. A continued use of a wide variety of platforms and approaches is to be expected. Biodiversity informatics has many requirements in common with other informatics domains, and it is noteworthy that biodiversity research, as in other disciplines, has the potential to produce very large and rapidly growing data sets from, for example, automated digitisation, remote sensing and genetic sequencing. Although the configuration of existing and planned cross-domain infrastructure such as LifeWatch supports biodiversity informatics well, the domain will place heavy capacity demands on the computing infrastructure in the medium term. Hardware associated with sensors and data logging is addressed in Appendix 2. Like other domains, biodiversity informatics will require robustness, stability and persistence, so will likely rely on key institutions with long-term funding. Over the core hardware infrastructure lies a spatial data information infrastructure, the biodiversity component of which is largely the topic of this white paper. The leveraging of information from distinct but adjacent domains, such as digital literature resources and image, environmental and climatic information databases, will be increasingly necessary in the future. As molecular methods find ever greater uptake, one particular set of resources will become increasingly important to biodiversity informatics: the many biomolecular resources that, within Europe, lie within the purview of the ELIXIR infrastructure [80]. While many of the core resources themselves may be sustained with comparatively long-term support, the services built upon these resources must be configured to include biodiversity science use cases. A unified voice in specifying these use cases is required from the biodiversity community. Building the 'social infrastructure', however, is a major challenge: we have the technological capability but we need to increase its uptake by the community. For that we need to strengthen considerably the socially connected network of experts spanning the two communities: ICT and biodiversity science.

Section 3: new tools

How much of the legacy collections can be digitised?

22. The world's biological collections represent the hard core of biodiversity information. All other uses, from identification and naming onwards, are anchored in them. The collections contain an estimated 2–3 billion specimens, but less than 10% have been catalogued in databases and far fewer captured as digital images [81,82]. This means that more than 90% of the collections are essentially unavailable for use through the Internet. Manually digitising the collections represents an effort estimated at up to one million person-years but, with today's mass-digitisation methodologies, the task is feasible. As shown by multiple virtual herbarium projects [83,84], the process can be partly automated through imaging techniques. With gazetteer services such as GeoLocate [85], georeferencing can also be computer-assisted. Another good example is the Volunteer site of the Atlas of Living Australia [86] whereby, when a backlog of digital images is available, their transcription is distributed through crowd-sourcing to a large number of volunteers. With help from initiatives like iDigBio [87], we envisage that distributed digitisation infrastructures will become essential parts of most major natural history collections and that dedicated services will be developed for outsourcing this task. A major challenge, however, is that collections still grow faster than they are being digitised (e.g. through endowments). As private collections must also be digitised by their owners, this requires a new suite of easy and inexpensive tools that can be deployed at large scale. To deliver this research infrastructure service effectively, digitisation requires prioritisation and its own funding channels.

How to generate more targeted and reliable data?

23. Gathering information about the world around us has been a priority for biodiversity science for many years (see Appendix 3). Observatories will soon operate throughout the biosphere, capturing different kinds of data over multiple scales, from microns to planet-wide and from fractions of a second to years. It is very important to know the relative and also the absolute position of observed objects and events. This brings special challenges when observing the desired phenomenon and operating in extreme environments, such as the deep sea. The infrastructure for biodiversity data urgently needs more advanced informatics support - not only mainstream ICT development but also the ability to deal with the specifics of biodiversity features and data [n]. It requires informatics to support observations, event detection, species identification, data transfer, storage, filtering and other kinds of data processing. New data-gathering tools that will allow new observatories at all biological scales and sensor networks covering the globe need to be designed, created and tested. There should be automatic processes allowing for feedback from data interpretation back to the observation or detection at site. This combination of techniques and related biodiversity informatics tools is expected to herald a revolution in biodiversity research, resolving much of our currently fragmented data coverage and knowledge. Public-private partnerships should be encouraged to enter pre-competitive research and development in this evolution.

What role do mobile devices play?

24. Developments in mobile communications offer numerous opportunities for innovation (see Appendix 2). Smartphones and tablet PCs with on-board GPS location can easily be taken into the field, creating opportunities both for innovative data collection and for user information services. They are also particularly innovative for reference products such as identification keys [o]. Apps like these can be used to generate image-vouchered, location-tagged observations uploaded to central databases [p]. Performing science in large virtual communities, where participants have varied levels of expertise, requires new techniques for data harvesting, processing, cleansing and validation.

How do you find the data you need?

25. Most biodiversity data that now exist are semi-structured and can be searched with typical search tools (Google, GBIF, etc.). However, these are often designed for use by humans rather than for automated data retrieval tasks and may have in-built limitations or constraints. To make better use of general-purpose tools, users may need to use more specialised resources as well. GBIF, for example, supports retrieval by species name, but the user may also need to use resources such as the Catalogue of Life to provide alternative names for species-based searching. The volume of data now being searched is so large that it is often not possible to refine keyword-type searches sufficiently to recover the needle buried in the haystack, especially in the absence of widely used vocabularies. Contextualising information (establishing relationships between data elements) in a resource is possible [q], but currently difficult and slow. The implementation of PIDs (see paragraph 7) would make the construction of metadata portals much easier. A mature search mechanism that contextualises rather than simply indexes would be far more powerful. A number of newly developed techniques exist, and others are under development, that make extensive use of visualisation methods to detect patterns and issues in data collections. These could be useful for quality and fitness-for-use assessment, especially in very large datasets such as the LTER-Europe data index or the GBIF index and taxonomic nomenclators. Data publishers need to go further in helping users find the data that match their requirements, with the use of PIDs, vocabularies and KOS (see paragraph 17).
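As a simple illustration of species-based retrieval and why alternative names matter, the sketch below counts occurrence records in GBIF for an accepted name and one of its synonyms as separate queries, because a keyword-style search does not resolve the synonymy by itself. The endpoint and parameters should be checked against the current API documentation, and the synonym list here is supplied by hand.

```python
# Minimal sketch: species-based retrieval from an aggregator's public search API (here
# GBIF occurrence search), querying the accepted name and a known synonym separately
# because keyword-style search does not resolve alternative names by itself.
import requests

names = ["Betula pendula", "Betula verrucosa"]   # accepted name plus a synonym

for name in names:
    resp = requests.get("https://api.gbif.org/v1/occurrence/search",
                        params={"scientificName": name, "limit": 0}, timeout=30)
    resp.raise_for_status()
    print(f"{name}: {resp.json()['count']} occurrence records")
```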

How do you extract the data you need?

26. In publications, whether paper or PDF, information is often embedded in text blocks or tables in a way that inhibits its re-use. Semantic technologies (data mining) offer potential for liberating such data, but have not yet demonstrated the flexibility or speed needed for broad uptake in the verbose, descriptive disciplines of taxonomy and ecology. Perversely, it is also difficult to extract information from highly condensed scientific writing such as taxonomic descriptions, because this style of writing relies on implicit context in order to be understandable. New tools will be needed that use the vocabularies, ontologies and KOS described above to establish context between data elements, then extract and assemble those elements into a format suitable for the user's purposes. Copyright held by commercial publishers remains a serious obstacle to recovering the non-copyright factual data. Some older publications that are out of copyright are being digitised, but serious issues remain over the rate at which the historic legacy literature is being captured, the completeness of the digitised literature and the ease of access to it. Errors in the Optical Character Recognition (OCR) process mean that searches of these archives will return incomplete result sets [88]. This presents an additional level of difficulty over and above the problems of extracting information from the born-digital literature.
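A deliberately naive sketch of pattern-based extraction follows: a regular expression pulls candidate Latin binomials out of digitised text. Real text-mining tools combine dictionaries, context and fuzzy matching; this example is only meant to show both the promise and the failure modes (false positives and OCR errors) discussed here, and the sample text is invented.

```python
# Deliberately naive sketch: pulling candidate Latin binomials out of digitised text with
# a regular expression. Real mining tools use dictionaries, context and fuzzy matching.
import re

text = ("Specimens of Apis mellifera were collected near the river. "
        "A single Bombus terrestris was also observed. "
        "One worn specimen of Vespula vulgarls could not be confirmed.")  # OCR error

BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")
candidates = {" ".join(m) for m in BINOMIAL.findall(text)}
print(sorted(candidates))
# ['Apis mellifera', 'Bombus terrestris', 'One worn', 'Vespula vulgarls']
# The false positive 'One worn' and the misspelt 'Vespula vulgarls' show why extracted
# names still need checking against standard name lists.
```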

How do you aggregate the data you need?

27. For many analyses it is often necessary to aggregate data from several sources. Several data-aggregating initiatives have emerged in the last two decades for various areas of biodiversity informatics. Some of these initiatives were run on a project basis, while others were embedded in national structures, making them more reliable sources of information in the long term. Examples include GBIF for primary data, the Encyclopedia of Life (EOL) [89] for species descriptions, uBio [90] and the Global Names Index [38] for name usages, CoL [40] for validated species names and Europeana [91] for multimedia resources. These data-aggregation initiatives have important beneficial side effects for biodiversity informatics, such as enhancing the availability, standardisation and duplication ('backup') of data. Aggregation problems remain, similar for all these initiatives, such as hidden duplication, proper attribution, harvesting and storage. Each aggregator has tended to solve these on its own by developing its own data provider network and internal infrastructure. Increasingly, though, they are recognising the need to streamline and to avoid duplication of effort (for some examples see [18]). To facilitate integration, it will be valuable to develop a well-known catalogue of large-scale resources, with associated metadata, including a concept map.
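One pragmatic illustration of the hidden-duplication problem is sketched below: records from two sources describing the same observation are collapsed by normalising a few core fields into a key. Real aggregators use much richer matching (stable identifiers, collector names, fuzzy dates); the fields, values and rounding thresholds here are arbitrary choices for the example.

```python
# Minimal sketch: collapsing hidden duplicates when aggregating records from several
# sources, by normalising a few core fields into a key. The rounding choices are arbitrary.
records = [
    {"source": "A", "scientificName": "Parus major", "eventDate": "2010-05-01",
     "decimalLatitude": 51.4816, "decimalLongitude": -3.1791},
    {"source": "B", "scientificName": "Parus  major", "eventDate": "2010-05-01",
     "decimalLatitude": 51.4817, "decimalLongitude": -3.1790},   # same observation, re-published
]

def dedup_key(rec):
    name = " ".join(rec["scientificName"].split()).lower()      # normalise whitespace and case
    return (name, rec["eventDate"],
            round(rec["decimalLatitude"], 3), round(rec["decimalLongitude"], 3))

unique = {dedup_key(r): r for r in records}
print(f"{len(records)} records in, {len(unique)} after de-duplication")
```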

How complete are the data?

28. When combining data from different sources and domains, interoperability is clearly not the only obstacle to analysing complex patterns of biodiversity. The accessible biodiversity data today come largely from repositories and individual researchers, and are generally of high quality with respect to reliability. The quality is rather low, however, with respect to consistency. In other words, the data aggregated today have been collected for different purposes and on different spatio-temporal scales, leaving significant gaps in the assembled data sets and seriously hampering the analysis of complex patterns with data from diverse domains. Gaps can be filled by developing more comprehensive biodiversity observation networks (BONs) and associated e-tools to support the collection, aggregation and discovery of data from these observatories. There is also a fundamental need to reconsider what we already have – to 'invert the infrastructure' [92, p.20] – to re-examine and re-formulate existing data to make them more homogeneous, to remove non-biodiversity factors (e.g. compensating for differing observation techniques) and to make them suited to the kinds of analyses we foresee for the future.

How can we encourage virtual research environments?

29. Virtual research environments (VREs), or virtual laboratories, are online systems that help researchers carry out their work. They include environments to publish data (e.g. Scratchpads [93]), to execute operations on data (e.g. myExperiment [94]), or both (e.g. AquaMaps [95] and iMarine [96]). VREs also include facilities to support collaboration between individuals. The challenge is to build integrative, flexible e-Science environments using standardised building blocks and workflows, with access to data from various sources. Just as with physical laboratories, different kinds of VREs are possible, ranging from the general-purpose to the highly specialised. A general VRE for wetland studies can be customised to a specific geographical area and populated with relevant datasets. A VRE specialised for a single scientific objective, e.g. to find an optimal way of sequestering carbon in a forest, would be equipped with workflows based on highly specialised simulation tools such as Biome-BGC [97]. For successful uptake, VREs must generate immediate benefit for their users. For casual users the interface(s) must be simple and intuitive. For developers, there must be a usable pool of services and other resources that can be linked simply (e.g. BioVeL [98]). VREs must perform functions that people find useful. VREs as envisaged here also act as social networking applications and have a central role in making some of the available technology described above usable or, better, invisible to the majority of users.

What can you do with your data in the future?

30. A biodiversity e-infrastructure should go much further than the generation, transfer, storage and processing of biodiversity data. Applications demand that the infrastructure supports deploying the data in analysis, predictive modelling and decision support. The complexity of biodiversity includes systems interacting in chaotic and non-linear processes, extreme system effects and interactions between microscopic and macroscopic global levels, as well as on multiple time-scales. Understanding biodiversity is far more complex than understanding either meteorology or climate change. To address problems that we cannot now handle we need:

• User-friendly VREs with:
  – interoperable and easily configurable components;
  – access to real-time data (sensor, earth observation, weather, etc.);
  – pre- and post-processing capabilities.
• Predictive multi-scale models;
• Feedback mechanisms to prompt new data generation (remote observations and measurements);
• Methods for integrated interaction between data, parameters, models and visualised results ("fine tuning" experiments, computational steering);
• New approaches for decision support when model outcomes result in various scenarios.

Section 4: the human interface

How can we give users confidence?

31. Experience suggests that for an effective outcome in biodiversity informatics, a balance of top-down and bottom-up approaches is required. It is also important to remember that there is little benefit in asking end-users for their requirements when they may not be aware of the benefits that new technologies can bring. The FP5-funded project ENBI (European Network for Biodiversity Information, funded 2003–2005) [99] concluded that a modular infrastructure could provide both the architecture and the sustainability to overcome the partial and ad hoc solutions developed in the past twenty years, and designed the LifeWatch infrastructure for biodiversity and ecosystem research (an ESFRI project [100]). This approach, followed for a decade, has not led to the development of trusted repositories or stable funding, and therefore has not generated user confidence. Thus far, all biodiversity information projects share a common problem, viz. how to keep the service running after the funding period. If people don't have confidence that an environment will last essentially indefinitely, or at least for as long as it is relevant to enter information, then why would they invest their time and effort contributing to the common system? Paper publication is perceived, unrealistically, to last indefinitely and is the yardstick by which people judge any new approach. Publishing in PDF format is conceptually equivalent to paper, although it is easier, faster and cheaper to distribute copies, and its longevity has not been demonstrated. New paradigms are making data available in forms that can be readily re-used, e.g. Pensoft Publishers [56] and GBIF's Integrated Publishing Toolkit [76]. It may be that international organisations, such as GBIF, or large national institutions, such as natural history museums, will agree to underwrite data services on a care-and-maintenance basis, while the underlying software goes open source but with institutional oversight. Ultimately, to build the crucial user confidence, service managers need to invest far more than they have thus far in marketing to create new social norms.

Who owns what?

32. The traditional citation system provides a mechanism to measure impact and provenance, but it applies to a publication unit rather than to code from a software library or data from a repository. Systems are required that will generate comparable metrics from the new open science resources (see paragraph 15). Data contributors would benefit from knowing the number of people and/or projects who used those data (impact). Code developers will benefit from knowing how widely their code is being used (impact). Users want to be able to drill down and find who wrote the code or generated the data (provenance). The ability to credit the data creator or code author is the primary basis for trust in the quality of the data or service. The following challenges need to be resolved. First, we need a system of attribution that is robust in a distributed network, easily achieved by the use of PIDs and author identities (see paragraphs 7 and 11). Second, licensing is poorly understood in the community, by both producers and consumers. For data, the flavours of Creative Commons licences that involve "non-commercial" clauses make risk-averse consumers wary of using material, even when free use was the intention of the original contributor. For software, the terms of open-source licensing and free use are similarly subtle. In both cases, there is a widespread failure to understand the distinction between licensing and copyright. Third, copyright often creates a barrier to data use and re-use, although in academic work no instances of case law have been identified, so guidance is based on commercial publishing case law, predicated on financial loss. The wider Open Science movement is pushing hard to clarify this situation and biodiversity data should benefit from the increasingly widespread liberalisation.

What benefits come to contributors?

33. Career progression is enormously influenced by citation metrics as a proxy for impact and, more than anything else, this keeps us tied to a paper publication model. Products that users want, e.g. identification keys, are often used without citation and contribute nothing to career progression. People too often do not share their data freely, but save them for their close collaborators: they need to be given new tools that facilitate data sharing in the long run, but keep them in control while the research is still active. New metrics need to be defined that measure how often a data set is used and where conclusions based (in part) on those data appear. This through-tracking requires, at the very least, two of the fundamentals discussed above in Section 1: the use of PIDs to track the data and the development of a system to identify contributors (= authors). Ultimately, this is the single largest problem we face in persuading people to share their data.

Endnotes

a. By a "coordinated coupled modelling environment" we mean a technological framework of interoperability that allows researchers to bring together different data and algorithms without undue difficulty for analysis, modelling and prediction. Such a framework could assist us to better understand biodiversity as a comprehensive, complex, integrated system rather than as an assemblage of species (or any other biological organisation). This comprehensive systems-oriented framework would be built from diverse but interlinked data and tools for data discovery and analysis across dimensions of scale of phenomena, time, space and disciplines (biology, chemistry, climatology, economics, sociology, geography). The effect of impacts and processes can then be assessed across temporal, spatial and organisational (e.g. gene, individual, species, ecosystem) dimensions. For an alternative impression, refer to the Virtual Physiological Human (VPH), which has an analogous objective, as described by [101]:

"… a technological framework that aims to be descriptive, integrative and predictive.

Descriptive

The framework should allow observations made in laboratories, in hospitals and in the field, at a variety of locations situated anywhere in the world, to be collected, catalogued, organized, shared and combined in any possible way.

Integrative

The framework should enable experts to analyse these observations collaboratively, and develop systemic hypotheses that incorporate the knowledge of multiple scientific disciplines.

Predictive

The framework should facilitate the interconnection of predictive models defined at different scales, with different methods and with different levels of detail, producing systemic networks that breathe life into systemic hypotheses; simultaneously, the framework should enable their validity to be verified by comparison with other clinical or laboratory observations."

b. Based on the Lister definition of biodiversity [102]: 'Biodiversity is the variety, distinctiveness and complexity of all life on Earth, including its structures, functions, cultures, and information at all scales (from genetic to global) and in all its contexts (from DNA to self organization)'.

c. A valid name is the correct biological name of a taxon, determined according to the relevant rules of nomenclature.

d. At the International Conference on Research Infrastructures (ICRI2012), Copenhagen, 21–23 March 2012.

e. For a working definition of biodiversity informatics see http://en.wikipedia.org/wiki/Biodiversity_informatics

f. Related 'future' initiatives are presently being coordinated at the global level by the FP7-funded CReATIVE-B project (http://creative-b.eu/) and by GBIF (http://www.gbif.org/) through its Global Biodiversity Informatics Conference (Copenhagen, 2–4 July 2012) and the subsequent Global Biodiversity Informatics Outlook publication (in preparation).

g. The EU vision for 2050 is: "Biodiversity and ecosystem services – the world's natural capital – are preserved, valued and, insofar as possible, restored for their intrinsic value and so that they can continue to support economic prosperity and human well-being as well as avert catastrophic changes linked to biodiversity loss." [http://ec.europa.eu/environment/nature/biodiversity/policy/].

h. The EU target for 2020 is to: "halt the loss of biodiversity and ecosystem services in the EU by 2020 and restore them insofar as possible, and step up the EU's contribution to averting global biodiversity loss." [http://ec.europa.eu/environment/nature/biodiversity/policy/].

i. The BiSciCol project [http://biscicol.blogspot.co.uk/p/home.html] is one example of an attempt to do that.

j. At the non-European and global levels important projects include: DataONE, iDigBio, Atlas of Living Australia, Catalogue of Life, COOPEUS, CReATIVE-B, EOL, GBIF, GSC Biodiversity WG, TreeBase, CBOL and many more.

k. BioVeL in particular is a pilot implementation following the architecture and technical approach envisaged for the ESFRI LifeWatch Research Infrastructure for biodiversity science and ecosystems research.

l. A name usage is a statement that includes a name. The GNUB connects names with their usage in the literature, collections, etc.

m. See, for example, how the Atlas of Living Australia approaches this problem: http://www.ala.org.au/about-the-atlas/how-we-integrate-data/data-quality-assurance/.

n. The situation today can be likened to that which existed in the fields of meteorology and climatology in the 1960s and 70s, when the emergence of numerical weather prediction drove the demand for new observations and the emergence of a global infrastructure for acquiring data.

o. The EC KeyToNature project (http://www.keytonature.eu) developed a series of apps for identifying species in the field.



p. For example, Artportalen in Sweden (http://www.artportalen.se/default.asp), Ornitho in Italy (http://www.ornitho.it/) and Project Noah in the USA (http://www.projectnoah.org/).

q. For example, sig.ma (http://sig.ma/).

Appendix 1

A moderated mailing list has been established for the Biodiversity Informatics Community. To join the list, e-mail DR ([email protected]). Contributing authors to this White Paper are, in alphabetical order:

Wouter Addink, ETI Bioinformatics, NL
Bart Aelterman, Research Institute for Nature and Forest (INBO), BE
Donat Agosti, Plazi, CH
Linda Amaral-Zettler, Marine Biological Laboratory, US
Arturo H. Ariño, Universidad de Navarra, ES
Christos Arvanitidis, Hellenic Center for Marine Research, GR
Thierry Backeljau, Royal Belgian Institute for Natural Sciences, BE
Nicolas Bailly, WorldFish Center, PH
Lee Belbin, Atlas of Living Australia, AU
Walter Berendsohn, Botanischer Garten und Botanisches Museum Berlin-Dahlem, Freie Universität Berlin, DE
Nic Bertrand, Centre for Ecology and Hydrology, Lancaster, UK
Neil Caithness, Oxford University, UK
David Campbell, The Paleontological Research Institution, US
Guy Cochrane, EMBL - European Bioinformatics Institute, Hinxton, UK
Noël Conruyt, Université de la Réunion, FR
Alastair Culham, University of Reading, UK
Christian Damgaard, Aarhus University, DK
Neil Davies, UC Berkeley, US
Bruno Fady, INRA, UR629 Ecologie des Forêts Méditerranéennes (URFM) and CESAB (CEntre de Synthèse et d'Analyse sur la Biodiversité), FR
Sarah Faulwetter, Hellenic Center for Marine Research, GR
Alan Feest, Bristol University, UK
Dawn Field, Oxford University, UK
Eric Garnier, UMR5175 Centre d'Ecologie Fonctionnelle & Evolutive and CESAB (CEntre de Synthèse et d'Analyse sur la Biodiversité), FR
Guntram Geser, Salzburg Research Forschungsgesellschaft, AT
Jack Gilbert, University of Chicago, US
Bernd Grosche, Federal Office for Radiation Protection, DE
David Grosser, Université de la Réunion, FR
Alex Hardisty, Cardiff University, UK

Bénédicte Herbinet, Fondation pour la Recherche sur la Biodiversité, FR
Donald Hobern, GBIF Secretariat, DK
Andrew Jones, Cardiff University, UK
Yde de Jong, Universiteit van Amsterdam, NL
David King, The Open University, UK
Sandra Knapp, Natural History Museum, London, UK
Hanna Koivula, Finnish Museum of Natural History, FI
Wouter Los, University of Amsterdam, NL
Chris Meyer, Smithsonian Institution, US
Robert A. Morris, UMASS-Boston and Harvard University Herbaria, US
Norman Morrison, University of Manchester, UK
David Morse, The Open University, UK
Matthias Obst, University of Gothenburg, SE
Evagelos Pafilis, Hellenic Center for Marine Research, GR
Larry M. Page, Florida Museum of Natural History, US
Roderic Page, University of Glasgow, UK
Thomas Pape, Natural History Museum of Denmark, DK
Cynthia Parr, Smithsonian Institution, US
Alan Paton, Royal Botanic Gardens, Kew, UK
David Patterson, Marine Biological Laboratory, Woods Hole, US
Elisabeth Paymal, Fondation pour la Recherche sur la Biodiversité, FR
Lyubomir Penev, Pensoft Publishers, BG
Marc Pollet, Research Institute for Nature and Forest (INBO), BE
Richard Pyle, Bishop Museum, Honolulu, US
Eckhard von Raab-Straube, Botanischer Garten und Botanisches Museum Berlin-Dahlem, Freie Universität Berlin, DE
Vincent Robert, Centraalbureau voor Schimmelcultures, NL
Dave Roberts, Natural History Museum, London, UK
Tim Robertson, GBIF Secretariat, DK
Olivier Rovellotti, Natural Solutions, FR
Hannu Saarenmaa, Finnish Museum of Natural History, FI
Peter Schalk, ETI Bioinformatics, NL
Joop Schaminee, Wageningen UR and Radboud University Nijmegen, NL
Paul Schofield, University of Cambridge, UK
Andy Sier, Centre for Ecology & Hydrology, UK
Soraya Sierra, Stichting Naturalis Biodiversity Center, NL
Vince Smith, Natural History Museum, London, UK
Edwin van Spronsen, ETI Bioinformatics, NL
Simon Thornton-Wood, University of Reading, UK
Peter van Tienderen, Universiteit van Amsterdam, NL
Jan van Tol, Stichting Naturalis Biodiversity Center, NL
Éamonn Ó Tuama, GBIF Secretariat, DK
Peter Uetz, Virginia Commonwealth University, US



Lea Vaas, Centraalbureau voor Schimmelcultures, NL
Régine Vignes Lebbe, University Pierre et Marie Curie, FR
Todd Vision, University of North Carolina at Chapel Hill, US
Duong Vu, Centraalbureau voor Schimmelcultures, NL
Aaike De Wever, Royal Belgian Institute for Natural Sciences, BE
Richard White, Cardiff University, UK
Kathy Willis, University of Oxford, UK
Fiona Young, University of Reading, UK

Appendix 2

Mobilising economic benefits

At present, 87% of the world's population have mobile phone subscriptions and 1.2 billion of these are mobile Web users. In 2011, almost half a billion smartphones were shipped globally, exceeding sales of PCs [103]. In 2010, 300,000+ smartphone apps were downloaded 10.9 billion times. The prediction is that in 2014 some 77 billion apps will be downloaded, representing an estimated US$35 billion market [104]. With 4G mobile networks, ten times faster than 3G, already available in some countries as the successor to 3G, high-speed bandwidth to mobile devices will facilitate online use of bandwidth-hungry services such as video streaming.

Next-generation apps, incorporating stable content, smart algorithms and location-awareness in combination with multiple layers of online data delivered over 4G bandwidth (not yet available in Europe), offer the promise of highly innovative information products that can serve markets in both the science and social domains, provided the data are made available to serve these needs.

The EC KeyToNature project [105] developed a series of apps for identifying species in the field, demonstrating that there is a market for quality taxonomic reference works that can contribute to cost recovery. This approach, however, does not come without risk. The mobile devices field is evolving extremely fast, and apps developed for a device are out of business only one or two years later.

Appendix 3

Gathering biodiversity data

Gathering biodiversity data can be divided into three main routes:

Remote sensing

Earth observation at multiple wavelengths by aeroplane, satellite and ground-based sensors is in the early stages of development for biodiversity observation. It is largely based on surveillance technologies and requires the development of new techniques to process the type of data it produces, both in routine monitoring and in the detection of rare events. New generations of sensors designed for biodiversity observation are needed. Camera traps today, and DNA chip sensors tomorrow, could transmit data wirelessly and be linked directly to researchers' desks. Even with existing technology, it is becoming economically feasible to collect large amounts of environmental data automatically. This approach will undoubtedly present a significant new challenge in handling very large data volumes [106].

Environmental metagenomics

"Grind and find" techniques allow the study of many organisms in a sample at the same time, presenting the challenge of scaling biodiversity observation from the molecule to the planet [107]. For example, in November 2011 the Beijing Genomics Institute (BGI) launched its "Three Million Genomes Project", an ambitious effort consisting of three sub-projects: the "Million Plant and Animal Genomes Project", the "Million Human Genomes Project" and the "Million Micro-Ecosystem Project". In the latter, genomes of more than 600 microbial species, including over 3,500 strains and 1,800 metagenomes, have already been completed. Projects like these are generating in the order of 20 petabytes of data per year. With an unlimited influx of sequence data being a real possibility, archives operating under the INSDC (International Nucleotide Sequence Database Collaboration) face a new situation in which it is no longer possible to archive all components of all datasets. Indeed, a community discussion is underway around decision-making, informed by scientific and economic arguments, about how aggressively different classes of sequence data should be compressed [108]. Given the value of samples from temporal environmental genomic studies with time-point-specific elements (e.g. Ocean Sampling Day 2014, http://www.microb3.eu/work-packages/wp2) and the limited or non-existent opportunity to resample, contributions to the sequence-compression debate from the biodiversity informatics community are essential.

Human observation

Informatics should empower the human observer in the field and in the laboratory, improving observational data quality and providing for data transfer with automatic feedback mechanisms. Laboratory-based studies are increasingly supported by electronic tools that are replacing the traditional paper laboratory notebook and, increasingly, instruments are producing data feeds that can be directly integrated. It is often necessary to prepare baseline sample information that is used to interpret field information, for example using micro-CT scanning [109] to reveal details of three-dimensional structure. In field sites the infrastructure is either based on long-term monitoring of selected parameters, or consists of small experimental plots where the response of controlled biodiversity systems to parameter change can be detected. Examples of the latter are mesocosms or plant communities in laboratory conditions. Long-term monitoring is quite well developed in the LTER-Europe network (Long Term Ecological Research monitoring), the MARS network of marine stations, GLEON (Global Lake Ecological Observatory Network) [110], NEON (National Ecological Observatory Network) [111] and the Swedish Taxonomy Initiative [112]. These monitoring networks produce vast amounts of biodiversity data, yet a common data infrastructure has so far been developed only for the metadata.

Authors' contributions and acknowledgements

This article was compiled from contributions to a public GoogleDoc (http://is.gd/WhitePaperChapters) and notice of the consultation was widely circulated through mailing lists and presentations at conferences. Contributions were either additions to the text or in the form of comments, some by e-mail. Continuity editing was performed jointly by AH and DR. The fundamental structuring of the article has been determined by AH, NM, DR, HS and PvT. The unnumbered sections (Grand Challenge, Recommendations, Preface, Context, Taxonomic Impediment, Changing the landscape - A decadal vision, Realising the vision) were written by AH and DR. Many individuals have helped with the overall creation of this article, either by providing contributions and comments on the sections previously referred to or by contributing content for the numbered paragraphs 1–33, whether by direct contribution of text, as the source of material on specific points, or by providing comments and points of clarification. We thank each one of them. The editing authors would also like to extend particular thanks to Niobe Haitas for her invaluable help in proof-reading and editing some late drafts of this article.

The work reported in this article forms part of the coordination and support activities (dissemination, outreach, community building) being carried out by the BioVeL and ViBRANT projects. BioVeL is funded by the European Union 7th Framework Programme within the Research Infrastructures group, grant no. 283359. ViBRANT is funded by the European Union 7th Framework Programme within the Research Infrastructures group, grant no. 261532.

Hardisty, Roberts & the Biodiversity Informatics Community

See Appendix 1 for a complete list of the 80 contributors and their affiliations.

Author details

1. School of Computer Science and Informatics, Cardiff University, Queens Buildings, 5 The Parade, Cardiff, CF24 3AA, UK. 2. Department of Zoology, The Natural History Museum, Cromwell Road, London, SW7 5BD, UK. 3. Detailed in Appendix 1.

Received: 31 October 2012. Accepted: 4 March 2013. Published: 15 April 2013.

References

1. Hodge A: Feed the world? Arbuscular mycorrhiza and agriculture. Microbiology Today 2012, 39:90–93. http://www.sgm.ac.uk/download.cfm/docid/83F7DB84-8FB3-4801-BD5A83582EB91F70.
2. Gilbert N: Environment: the disappearing nutrient. Nature News 2009, 461:716–718.
3. Purves D, Scharlemann JPW, Harfoot M, Newbold T, Tittensor DP, Hutton J, Emmott S: Ecosystems: time to model all life on earth. Nature 2013, 493:295–297.
4. The Royal Society: Science as an open enterprise. Final report, June 2012. http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf. ISBN 978-0-85403-962-3.
5. Davies N, Field D, The Genomic Observatories Network: Sequencing data: a genomic network to monitor earth. Nature 2012, 481:145. doi:10.1038/481145a.
6. Davies N, Meyer C, Gilbert JA, Amaral-Zettler L, Deck J, Bicak M, Rocca-Serra P, Assunta-Sansone S, Willis K, Field D: A call for an international network of genomic observatories (GOs). GigaScience 2012, 1:5. doi:10.1186/2047-217X-1-5.
7. He F, Hubbell SP: Species-area relationships always overestimate extinction rate from habitat loss. Nature 2011, 473:368–371. http://dx.doi.org/10.1038/nature09985.
8. Knight R: Biodiversity loss: how accurate are the numbers? The BBC News Magazine; 2012. http://www.bbc.co.uk/news/magazine-17826898.
9. Palmer MA, Febria CM: The heartbeat of ecosystems. Science 2012, 336(6087):1393–1394. http://dx.doi.org/10.1126/science.1223250.
10. Relman DA: Learning about who we are. Nature 2012, 486:194–195. http://dx.doi.org/10.1038/486194a.
11. Kau AL, Ahern PP, Griffin NW, Goodman AL, Gordon JI: Human nutrition, the gut microbiome and the immune system. Nature 2011, 474:327–336.
12. Christensen NL, Bartuska AM, Brown JH, Carpenter S, D'Antonio C, Francis R, Franklin JF, MacMahon JA, Noss RF, et al: The report of the Ecological Society of America committee on the scientific basis for ecosystem management. Ecol Appl 1996, 6:665–691.
13. Europa press releases. http://europa.eu/rapid/pressReleasesAction.do?reference=SPEECH/12/258.
14. Nagoya protocol on access to genetic resources and the fair and equitable sharing of benefits arising from their utilization to the Convention on Biological Diversity. http://treaties.un.org/doc/source/signature/2010/CN782E.pdf.
15. Convention on Biological Diversity tenth meeting of the Conference of the Parties (COP 10). http://www.cbd.int/cop10/?section=welcome.
16. Hardisty AR, Peirce SC, Preece A, Bolton CE, Conley EC, Gray WA, Rana OF, Yousef Z, Elwyn G: Bridging two translation gaps: a new informatics research agenda for telemonitoring of chronic disease. Int J Med Inform 2011, 80:734–744. http://dx.doi.org/10.1016/j.ijmedinf.2011.07.002.
17. Peterson AT, Knapp S, Guralnick R, Soberón J, Holder MT: The big questions for biodiversity informatics. Syst Biodivers 2010, 8:159–168. http://dx.doi.org/10.1080/14772001003739369.
18. Parr CS, Guralnick R, Cellinese N, Page RDM: Evolutionary informatics: unifying knowledge about the diversity of life. Trends Evol Ecol 2012, 27:94–103. http://dx.doi.org/10.1016/j.tree.2011.11.001.
19. Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. In Proceedings of the 6th symposium on operating systems design and implementation (OSDI '04): December 6–8 2004. San Francisco, California: USENIX: The Advanced Computing Systems Association; 2004:379–394.
20. De Meester L, Van Tienderen P, Werger M, Hector A, Wörheide G, Niemelä J, Aguilar A, Smets E, Godfray C, Sutherland W, Bauhus J, Courchamp F, Gandini G, Kochn M, Le Maho Y, Manuel M, Pawlowski J, Quéinnec E, Owens I, Keustermans L: The LERU Biodiversity Working Group: challenges for biodiversity research in Europe. Procedia Social and Behavioral Sciences 2011, 13:83–100. doi:10.1016/j.sbspro.2011.03.007.
21. GEO BON: Principles of the GEO BON information architecture (version 1.0). http://www.earthobservations.org/documents/cop/bi_geobon/geobon_information_architecture_principles.pdf.
22. GEO BON: Implementation overview; 2008. http://www.earthobservations.org/documents/geo_v/07_%20GEO%20Bon%20-%20Implementation%20Overview.pdf.
23. GEO BON: Detailed implementation plan (version 1.0); 2010. http://earthobservations.org/documents/cop/bi_geobon/geobon_detailed_imp_plan.pdf.
24. The Global Earth Observation System of Systems. http://www.earthobservations.org/geoss.shtml.
25. e-Science European infrastructure for biodiversity and ecosystem research. http://www.lifewatch.eu/.
26. European Biodiversity Observation Network. http://www.ebone.wur.nl/UK/.
27. European Biodiversity Observation Network. http://www.eubon.eu.
28. Convention on Biological Diversity – taxonomic impediment. http://www.cbd.int/gti/problem.shtml.
29. MacLeod N (Ed): Automated taxon identification in systematics - theory, approaches and applications. CRC Press; 2007. ISBN 978-0-8493-8205-5.
30. Swedish Species Gateway. http://www.slu.se/en/collaborative-centres-and-projects/swedish-lifewatch/news-archive/2011/9/betaversion/.
31. Barcode of Life. http://www.barcodeoflife.org/.
