Project acronym: GLOBIS-B
Project full title: “GLOBal Infrastructures for Supporting Biodiversity research”
Grant agreement no.: 654003

D2.2 Report of Workshop 1

Due-Date: M11
Actual Delivery: M12
Lead Partner: UvA
Dissemination Level: PU
Status: Final
Version: 2.0





DOCUMENT INFO

Date and version no. | Author | Comments/Changes
25-05-2016, V1.0 | Jacco Konijn |
31-05-2016, V2.0 | Jacco Konijn, Daniel Kissling, Wouter Los | Comments and suggested changes by WL and DK


TABLE OF CONTENTS

1 Executive summary
1.1 Main objectives and outcomes
2 Contributors
3 Report of the first GLOBIS-B Workshop
3.1 Report from the side meeting of the Legal & Policy Group
3.2 Reports from working groups on potential implementation cases
3.2.1 Camera trap data (lead: Jorge Ahumada)
3.2.2 e-Bird and Atlases (lead: Steve Kelling)
3.2.3 Global Population Dynamics Database (GPDD) and Living Planet Index (LPI) (lead: Louise McRae & Dmitry Schigel)
3.2.4 LTER data for EBVs (lead: Johannes Peterseil)
3.2.5 Marine EBVs (lead: Christos Arvanitidis, Matthias Obst & Francisco Hernandez)
3.2.6 Environmental microbiome data sets (lead: Monica Santamaria)
3.2.7 Collection databases (lead: Dimitris Koureas)
3.3 Feedback from Research Infrastructures
3.3.1 LifeWatch (lead: Jesus Marco de Lucas)
3.3.2 NEON (lead: Brian Wee)
3.3.3 SANBI (lead: Jeffrey Manuel)
3.3.4 Chinese Academy of Sciences (lead: Liqiang Ji)
3.3.5 CRIA (lead: Renato de Giovanni)
3.3.6 Atlas of Living Australia (ALA)
3.4 Second GLOBIS-B workshop
3.5 GLOBIS-B publications
3.6 Wrap-up
4 Annexes
4.1 Participant list
4.2 Workshop Agenda
4.3 Pre Workshop input from participants


1 Executive summary

GLOBIS-B is organising four workshops that use the calculation of Essential Biodiversity Variables (EBVs) as a case to assess the possibilities for global collaboration among the Research Infrastructures (RIs) developing these EBVs. To that end, a mix of world-leading scientific and legal experts is invited, as well as technical staff from the connected RIs.

Workshops 1 and 2 aim to identify the options and problems associated with the general challenge of provisioning research infrastructures to deliver capabilities for supporting the generation and calculation of EBVs. The two workshops focus on the EBV class ‘Species populations’ (the EBVs ‘Species distribution’ and ‘Population abundance’), for which the data, models and understanding are presently much better developed than for other EBV classes. Workshop 1 was organised with integrated as well as parallel sessions bringing together ecologists and biodiversity scientists (with special expertise in species distributions and abundances and their monitoring), biodiversity informaticians and infrastructure operators, and legal interoperability experts. Workshop 2 will follow a few months later with the same set of experts to provide an update on ongoing initiatives and to guarantee the writing of a peer-reviewed scientific paper and the dissemination of key results. The intended scientific paper will follow up on the Pereira et al. (2013) Science publication 1 by responding to it in terms of the challenges research infrastructures face in supporting the implementation and calculation of EBVs.

Workshop 1 was organised from 29 February to 2 March 2016 in Leipzig at the German Centre for Integrative Biodiversity Research (iDiv). iDiv currently hosts the secretariat of GEO BON, which introduced the concept of EBVs and plays a large stimulating role in their development.

1.1 Main objectives and outcomes The main objectives of the workshop were:

• Bring key scientists together with global research infrastructure operators and legal interoperability experts

• Identify the required primary data, analysis tools, methodologies, and legal and technical bottlenecks

• Identify the research needs and infrastructure services needed for computing EBVs globally

• Facilitate the multi-lateral cooperation of biodiversity research infrastructures worldwide

The main outcomes were:

• The Living Planet Index will make over a quarter of a million data points freely available to be used in these measurements;

• eBird, the largest data collection of bird distributions in the world, will focus on measuring the change in patterns of bird occurrences globally;

• The Atlas of Living Australia together with the European LifeWatch capability will create a demonstrator proof-of-concept for the process of measuring and presenting an EBV;

• Consensus on using the 'Darwin Core Event' standard as a common model for mobilizing biological datasets from species sampling activities and using GBIF.org as a global aggregator for these data;

• The Wildlife Picture Index will make its publicly available data (2.6 million records) accessible through the Darwin Core Event standard.

To achieve these outcomes, participants were asked to contribute in writing to a number of questions that were sent to them some weeks before the workshop. This resulted in a large document (see Annex 4.3) that enabled the organizers to focus the discussions and suggest ways forward.

1 Pereira, H. M., S. Ferrier, M. Walters, G. N. Geller, R. H. G. Jongman, R. J. Scholes, M. W. Bruford, et al. (2013): Essential Biodiversity Variables. Science 339, 277-278. doi: 10.1126/science.1229931


Possible next steps for the second workshop were defined for each of the participating RIs and similar initiatives. A number of implementation cases were discussed; these will be developed between the two workshops and elaborated further in the second workshop. For the main outcome of the first two workshops, the journal publication, writing groups were defined that will work during the following months on a first draft, to be discussed and finalized in the second workshop.

2 Contributors

Writing of this deliverable was organized by the University of Amsterdam (Daniel Kissling, Wouter Los and Jacco Konijn). The other workshop contributors provided input to the relevant sections of the report.

3 Report of the first GLOBIS-B Workshop

The first GLOBIS-B workshop was held from 29 February to 2 March 2016 at the German Centre for Integrative Biodiversity Research (iDiv) in Leipzig, Germany. The workshop was the first in a series of four and focused on the EBV class ‘Species populations’, i.e. species distributions and abundances. Invited biodiversity scientists met with operators of research infrastructures to discuss and develop a framework for implementing EBVs. In thought experiments, different scenarios were considered for how scientists may want to test the relevance of EBVs for building indicators, and which data, workflows and computational capacity these scenarios would require. The research infrastructures considered the challenges and potential solutions of providing the required data and workflow services to achieve global interoperability.

On day 1, the workshop was opened with a welcome by Marten Winter (scientific coordinator of the synthesis centre of iDiv) and a welcome and introduction to workshop participants by Daniel Kissling (scientific coordinator of GLOBIS-B). After this, the following lightning talks assisted in bringing focus to the workshop topics.

EBVs and biodiversity indicators (Henrique Pereira)

Why are we sitting together? (Daniel Kissling)

Methods for extracting trends from distribution data (Nick Isaac)

Calculating biodiversity indicators (Louise McRae)

Data Sharing Principles and Data Management Policies in GEO and RDA / CODATA (Willi Egloff)

The infrastructure landscape: data portals and related biodiversity infrastructures (Donald Hobern)

Interoperability and workflows: state-of-the-art (Alex Hardisty)

Demonstration of EBV pilot from EU BON (Hannu Saarenmaa)

On day 2, several break-out groups discussed questions related to scientific and technical (infrastructure) issues for the construction of EBVs in the EBV class ‘Species populations’. Experts on legal and policy issues also discussed the current relevant developments. This was followed by plenary sessions where the various groups reported back from their break-out sessions to all participants. A general plenary discussion then identified topics for working groups. These working groups met in the late afternoon of day 2 to discuss potential implementation cases.

On day 3, a plenary session allowed the working groups to report back to all participants about the potential implementation cases. This was followed by a feedback round by representatives from research infrastructures. Finally, Daniel Kissling provided information about the second GLOBIS-B workshop, suggestions for potential GLOBIS-B publications resulting from the workshops, and a wrap-up. The summary reports below provide an overview of the main outcomes and will form the basis for the follow-up workshop in Seville, Spain.

3.1 Report from the side meeting of the Legal & Policy Group

Summary from the Legal & Policy PPT presentation (lead: Anne Bowser)

The goal of this group is to identify legal and policy bottlenecks arising from the varying provenance of authorship and ownership of data. Regarding the final product of opening up EBVs to diverse audiences, the legal


implications are different when EBVs are offered as a service (for example, for policy makers or environmental managers) or as processed data to meet different research interests. It is an open question whether EBVs have to be derived only from open data.

The GEO data sharing principle states: “Data, metadata and products will be shared as Open Data by default.” This means adopting a limited range of CC licensing. However, the reality is different, and this can affect the accuracy of an EBV. Especially for EBVs offered as a service, it must become clear how licensing and attribution mechanisms can be applied and ‘translated’ into metadata. A more complex workflow may be needed that allows users also to select closed data, provided they respect the licensing conditions. A special case is observation data originating from citizen scientists. Apart from privacy issues, standardized descriptions of data quality (e.g., Data Quality for Geographic Information, ISO/IEC 19157) are helpful, and also important for policy. Data quality flags could be assigned to an EBV to give some indication of its accuracy.

3.2 Reports from working groups on potential implementation cases

3.2.1 Camera trap data (lead: Jorge Ahumada)

The TEAM network (http://wpi.teamnetwork.org/) has indicated its interest in transferring its data into the Darwin Core Event standard. Eventually, TEAM will expose an API (application programming interface) so that the Darwin Core Event data can be queried from outside. Work is underway on designing guidelines to measure change at selected temporal scales. The TEAM network needs to reach out to other camera trap data communities to discuss these developments and ensure that they are comfortable with them. In addition, legal issues demand attention, for example with respect to the position of governments on declining populations. An approach with a label ‘GEO BON-EBV Approved’ might be a simple solution.

A proposed outline of steps toward transferring the TEAM network data into the Darwin Core Event is as follows.

Collect and mobilize data:
- Expand TEAM data (wpi.teamnetwork.org) with Wildlife Insights (wildlifeinsights.org), World Count (Australia), and the ARPA network in Brazil.
- Identify financial/organizational mechanisms to ensure sustainability of data collection.

Publish data with Darwin Core Event:
- The data are Darwin Core Event-ready, but they are not yet accessible in this format.

Make an authoritative dataset for an EBV:
- Reach out to other camera trap data collecting communities.
- Follow the GEO BON endorsement process – what are the steps?

Measure change at selected scale:
- With minimum design guidelines, change in annual occupancy can be detected.

Maybe calculate an indicator:
- The Wildlife Picture Index (WPI) is a first available indicator (calculated as the geometric mean of relative occupancies). Is there a minimum set of covariates needed to calculate indicators from camera trap data?

Possible steps to the next workshop:

Illustrate the progress (and bottlenecks) of making the TEAM data accessible via the Darwin Core Event
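The WPI mentioned above is described as the geometric mean of relative occupancies. As a minimal sketch of that calculation (not the TEAM network's actual implementation, and ignoring how relative occupancies are estimated from the raw camera trap data):

```python
import math

def wildlife_picture_index(relative_occupancies):
    """Geometric mean of relative occupancies: each value is one species'
    occupancy in the focal year divided by its baseline-year occupancy."""
    if not relative_occupancies:
        raise ValueError("need at least one species")
    log_sum = sum(math.log(r) for r in relative_occupancies)
    return math.exp(log_sum / len(relative_occupancies))

# One species doubling its occupancy and one halving it cancel out:
print(wildlife_picture_index([2.0, 0.5]))  # 1.0
```

Using the geometric rather than arithmetic mean makes proportional increases and decreases symmetric, which is why it is the usual choice for composite biodiversity indicators.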

3.2.2 e-Bird and Atlases (lead: Steve Kelling)

e-Bird engages the public at large to submit checklists of birds. It uses standardized protocols to record location, time, sampling method, number of observers, and recording effort (all species, a subset, etc.). To some extent, it allows absence to be inferred (as opposed to presence-only), putting e-Bird in the same category as atlas-like projects. However, e-Bird is opportunistic in where data are collected, which also makes it different from atlases. The checklists include not only the numbers of birds seen but also images; 40,000 images per week are received, with copyright owned by the lab. Data are converted to EML, then translated into Darwin Core and contributed to GBIF. Richer data access is also provided more directly.

e-Bird is 14 years old and a mature citizen science project that can serve as an exemplar for how such projects can deliver EBVs. e-Bird already produces a lot of distribution and abundance information, which can be used in various ways to give different indications of change. The data are used to make weekly predictions of the relative abundance of >400 bird species at continental scales, using remotely sensed environmental covariates (e.g. MODIS-derived information on habitat, climate, etc.) and so-called Spatio-Temporal Exploratory Models (STEM) (i.e. linear mixed-effects models with a moving-window approach). There are several different data products, including primary occurrence data contributed to GBIF; checklist event data with zero-filling; and annual access to all of the modelling data to look at trends and patterns.

Possible steps to the next workshop:

Make and show trend estimates over time (e.g. trend maps using 5-year windows, or for sites where enough temporal data are available). This probably requires keeping the spatial extent constant (as opposed to the moving-window approach). What is the spatial resolution of the trend maps that can be produced (certain sites? countries? a spatial grid?)

Illustrate several indicators/measures of change based on the e-Bird EBV data, e.g.:
  o How many days does a certain species spend in a certain area (e.g. in a country or a nature reserve), and how does this change over time/years?
  o How does range size (summer and/or winter) change over time?
  o How does species richness change over time (-> indices of species richness change)?

Outline the transferability of the e-Bird approach: how well can the models be transferred to other taxa? Which steps are needed to apply the e-Bird approach to other citizen science data? What are the legal/technical/scientific bottlenecks?

Examples from e-Bird could be described as a use case for EBVs, e.g. in the planned manuscript.
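The trend estimates suggested above could, in the simplest case, be produced per site or grid cell as an ordinary least-squares slope over a fixed 5-year window. This sketch is far simpler than e-Bird's actual STEM models and is only meant to make the windowing idea concrete:

```python
def ols_slope(years, values):
    """Least-squares slope of values against years: a minimal trend estimate."""
    n = len(years)
    my = sum(years) / n
    mv = sum(values) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, values))
    den = sum((y - my) ** 2 for y in years)
    return num / den

def windowed_trends(series, width=5):
    """Trend per sliding window of `width` consecutive years.
    `series` is a list of (year, abundance) pairs sorted by year."""
    trends = {}
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        years = [y for y, _ in window]
        vals = [v for _, v in window]
        trends[(years[0], years[-1])] = ols_slope(years, vals)
    return trends

# Invented abundance series for one site; slope is 2.0 in both windows.
series = [(2010, 10), (2011, 12), (2012, 14), (2013, 16), (2014, 18), (2015, 20)]
print(windowed_trends(series))
```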

3.2.3 Global Population Dynamics Database (GPDD) and Living Planet Index (LPI) (lead: Louise McRae & Dmitry Schigel)

Global Population Dynamics Database (GPDD):
- 5,000 time series of animal and plant population data
- Metadata available
- Not updated since 2010
- Available online

Living Planet Index (LPI):
- 17,000 time series of vertebrate population data
- Contains the GPDD data except for invertebrate, plant, catch, hunting and duplicate data (500-1,000 time series)
- Continually augmented and updated
- Available online

Both databases could be published with the Darwin Core Event, and some initial steps were discussed at the GLOBIS-B workshop to start the process in the short term. It would be preferred to have the involved organizations registered as publishers, as this would facilitate continuous updating. An issue is the attribution of data given the multiple data sources involved. Once the data are published with the Darwin Core Event, the next steps toward using them in EBVs are (a) data processing and cleaning to improve data quality and remove duplicates, and (b) allowing data selection for EBVs with different filtering layers. The LPI is already largely fit to calculate changes in abundances over time, and it contains data at different scales from site to national and regional level.

Possible steps to the next workshop:

Mobilize the data behind the LPI towards GBIF using the Darwin Core Event.

Page 8: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

Identify bottlenecks and suggest solutions of this mobilization process

Explore whether a similar mobilization process is possible for data from the GPDD.
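For illustration, an LPI-style abundance-change calculation can be sketched as a chained geometric mean of interannual population ratios. This follows the general logic of the published LPI method but deliberately omits its smoothing and taxonomic/regional weighting, so it is a simplification for illustration, not the actual LPI code:

```python
import math

def living_planet_style_index(populations):
    """Simplified LPI-style index: for each year-to-year step, average the
    log10 abundance ratios across populations, then chain the averages into
    an index starting at 1.0.  `populations` is a list of equal-length
    abundance time series (no zeros or gaps handled here)."""
    n_years = len(populations[0])
    index = [1.0]
    for t in range(1, n_years):
        dbar = sum(math.log10(p[t] / p[t - 1]) for p in populations) / len(populations)
        index.append(index[-1] * 10 ** dbar)
    return index

# Two invented populations: one stable, one halving each year -> index falls.
print(living_planet_style_index([[100, 100, 100], [100, 50, 25]]))
```

Averaging on the log scale before chaining keeps a single crashing population from dominating the index the way a raw arithmetic mean of abundances would.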

3.2.4 LTER data for EBVs (lead: Johannes Peterseil)

LTER Europe is a network of about 420 LTER sites and 30 LTSER platforms, documented at http://data.lter-europe.net/deims/. It provides central documentation of LTER facilities (sites and platforms) with links to decentralized data storage locations (e.g. individual data repositories such as the UK ECN network and TERENO (Germany)). A new eLTER project aims to provide tools for distributed management of LTER data. Example types of data (from the LTER Europe site Zöbelboden, Limestone Alps, a forested habitat):

- Vegetation-relevant data (Braun-Blanquet)
- Vascular plant species frequency measurements on vegetation plots (plots with a high density of observations, repetitions)
- Tree species monitoring (forest structure)
- Small mammals (e.g. rodents, trap sampling, recapture rate)
- Mosses and lichens
- Orchids
- Bird inventory
- Habitat monitoring

The group evaluated the options for implementing the EBV concept in the framework of LTER Europe and NEON. It should be checked how highly concentrated data from a small LTER area could support, for example, the evaluation of larger-scale approaches and data integration. A first step is to identify appropriate data flows from the different sites (possibly as services) and to use these to help validate the EBVs and give an idea about uncertainty. Issues of data sharing and data policy are still under discussion for metadata development inside and outside the network (e.g. embargo).

Possible steps to the next workshop:

Test the data harmonization and data sharing possibilities using 2-3 example datasets from LTER

Explore how LTER data could be contributed to GBIF via the Darwin Core Event, and which bottlenecks currently exist

Explore possibilities for collaboration with the ECOPOTENTIAL project.
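Several working groups above propose mobilizing sampling data to GBIF via the Darwin Core Event model, in which a sampling-event record is paired with occurrence records linked by eventID. As a rough sketch, the term names below are standard Darwin Core terms, but all values (the plot identifier, coordinates, species and counts) are invented for illustration:

```python
# One sampling event plus two linked occurrences, using standard Darwin Core
# term names (eventID, samplingProtocol, etc.).  All values are invented.
event = {
    "eventID": "zoebelboden-plot7-2015-06-01",   # hypothetical identifier
    "eventDate": "2015-06-01",
    "samplingProtocol": "Braun-Blanquet vegetation releve",
    "sampleSizeValue": "100",
    "sampleSizeUnit": "square metre",
    "decimalLatitude": "47.84",
    "decimalLongitude": "14.44",
}

occurrences = [
    {"occurrenceID": "occ-001", "eventID": event["eventID"],
     "scientificName": "Fagus sylvatica", "occurrenceStatus": "present"},
    {"occurrenceID": "occ-002", "eventID": event["eventID"],
     "scientificName": "Picea abies", "individualCount": "12",
     "occurrenceStatus": "present"},
]

# Every occurrence points back at its sampling event; recording the event
# (protocol, effort, extent) is what lets an aggregator such as GBIF
# reconstruct sampling effort and infer non-detections.
assert all(o["eventID"] == event["eventID"] for o in occurrences)
```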

3.2.5 Marine EBVs (lead: Christos Arvanitidis, Matthias Obst & Francisco Hernandez)

Known and useful datasets include OBIS (incl. FishBase), e.g. fish data, Continuous Plankton Recorder (CPR) data, etc. Most datasets are presence-only, but some grid-based abundance data exist (interpolated but not modelled). Efforts are also undertaken in the framework of the Global Ocean Observing System (GOOS) (http://www.ioc-goos.org/). From these sources, it is possible to identify specific EBV datasets and how to obtain them. The focus will be on the datasets in relation to specific methods that allow measuring species distributions (and abundance). A key question is how to compare changes (in species distributions and abundances) between different ecosystem components.

Possible steps to the next workshop:

Explore the available marine databases for potential EBV use.

Extract an example EBV dataset from OBIS, and explore the use of existing workflows to do this (e.g. from EU BON, BioVEL etc.)

Explore collaboration and overlap with the GOOS initiative because they are developing the biological Essential Ocean Variables (EOV) e.g., fish distribution and abundance.

3.2.6 Environmental microbiome data sets (lead: Monica Santamaria)

Several networks are collecting data from different environments (Earth Microbiome Project, Human Microbiome Project, etc.). This is a promising source of measurements of biodiversity aspects not otherwise captured, while sequencing costs are declining and data volumes are rising. The sequencing approach affects what is likely to be recorded from a sample: a targeted subset of (known) species vs. a broader community (including repeatable unknowns). There are several challenges:

The effective grain size (spatial and temporal) of patterns is unclear and unlikely to match those of other classes of data;

Not everything has a Linnean name. This requires expanding the taxonomic axis of integrated data to accommodate diagnosable unnamed clusters of organisms;

Need to maintain links to external sequence repositories;

For human microbiome data, privacy limits associated metadata – need to consider whether adequate benefit will arise from crosslinking to other biodiversity data.

An upcoming symposium will deal with combining taxonomic databases with genomic data sets. This is also a TDWG subject. An area to explore is metagenomic data related to key traits.

Possible steps to the next workshop:

Identify which microbiome/genomic/DNA datasets can be used to illustrate an EBV calculation. Which are relevant for the EBVs species distribution, population abundance and community composition? (The latter is not the focus of GLOBIS-B workshops 1 & 2.)

What is the current state (and related challenges/bottlenecks) for transferring microbiome/genomic/DNA datasets into GBIF via the Darwin Core Event?
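The ‘diagnosable unnamed clusters of organisms’ mentioned above are commonly produced by grouping sequences at a similarity threshold (OTU-style clustering). The following naive greedy sketch only illustrates the idea; real pipelines use far more sophisticated alignment and clustering tools, and the reads and threshold here are invented:

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_clusters(seqs, threshold=0.97):
    """Naive greedy clustering: each sequence joins the first existing
    cluster whose seed it matches at >= threshold identity, otherwise it
    seeds a new, unnamed cluster ('cluster-1', 'cluster-2', ...)."""
    seeds, members = [], {}
    for s in seqs:
        for i, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                members[f"cluster-{i + 1}"].append(s)
                break
        else:
            seeds.append(s)
            members[f"cluster-{len(seeds)}"] = [s]
    return members

# Invented 20-bp reads: the first two differ at one position (0.95 identity),
# the third is unrelated, so a 0.90 threshold yields two unnamed clusters.
reads = ["ACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGA",
         "TTTTTTTTTTTTTTTTTTTT"]
print(greedy_clusters(reads, threshold=0.90))
```

Each resulting cluster can then occupy a slot on the taxonomic axis even though it carries no Linnean name, as long as the link back to the sequence repository is maintained.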

3.2.7 Collection databases (lead: Dimitris Koureas)

Characteristics of ‘collections’ data – natural history collections are:

Often targeted surveys

Usually presence-only collections (though abundance may be observed)

Long history of collecting

Place-based

Minimum common level of data in aggregations such as GBIF

EBVs can be derived from collection data with respect to:

Alpha diversity – counts of unique ‘features’ (e.g. species within a specific taxonomic group) at a location, for a given time period and spatial scale;

Beta diversity – pairwise comparison of lists of unique ‘features’ (e.g. species within a specific taxonomic group) between locations for a given time period and given spatial scale. Interesting work on beta diversity baselines is done by CSIRO Australia;

Range maps, etc.

Authoritative sources of taxonomic reconciliation (such as the Catalogue of Life) are essential. Interchangeable metadata schemas are also essential to support interoperability.

Possible steps to the next workshop:

There is a balance between specificity of a particular metadata schema to a particular community and interchangeability with other communities. What is the minimum set of metadata that is needed to make interoperability (of species distribution data) effective?
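The alpha and beta diversity measures described above can be made concrete in a few lines. Jaccard dissimilarity is used here as one common pairwise beta measure; the source text does not prescribe a specific metric, and the sites and species lists are invented:

```python
def alpha_diversity(species_at_site):
    """Alpha diversity: number of unique species recorded at one location
    (for a fixed taxonomic group, time period and spatial scale)."""
    return len(set(species_at_site))

def jaccard_dissimilarity(site_a, site_b):
    """One common pairwise beta measure (Jaccard dissimilarity):
    1 - |shared species| / |species seen at either site|."""
    a, b = set(site_a), set(site_b)
    return 1 - len(a & b) / len(a | b)

site1 = ["Parus major", "Turdus merula", "Erithacus rubecula"]
site2 = ["Parus major", "Pica pica"]
print(alpha_diversity(site1))               # 3
print(jaccard_dissimilarity(site1, site2))  # 0.75
```

Both measures only need species lists per location and time period, which is exactly the minimum common level of data that aggregations such as GBIF already hold for collections.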

3.3 Feedback from Research Infrastructures

3.3.1 LifeWatch (lead: Jesus Marco de Lucas)

The LifeWatch research infrastructure supports scientists with specific e-Laboratories and dedicated workflow services, which can also be constructed by interested specialists. These services are supported by ICT experts and computational capacity. The establishment of a European legal entity for LifeWatch is underway and will facilitate the integration of currently distributed activities in various European countries. LifeWatch can support EBV research with facilities for data storage, computation (e.g. for processing workflows), and virtual research environments. Interested people can contact Jesus Marco to ask for access to computing and other support resources.

Possible steps to the next workshop:

Planning of any requests for infrastructure support.

3.3.2 NEON (lead: Brian Wee)

Context. The US Global Change Research Program, a multi-federal-agency research and applied program with an annual budget of US$2.7 billion, has proposed using the start of spring as one of its US climate change indicators. The spring indicator is calculated from citizen science phenology observations of cloned lilac and honeysuckle. This represents a model where citizen science data are utilized to generate credible continental-scale indicators for science and decision-support, and demonstrates the value of sustained investment in such observations and the supporting e-infrastructure (personal communications, Michael Mirtl and Brian Wee, October 2013, ILTER meeting, Seoul, South Korea). Furthermore, when integrated across continents, such as the EU and the US, it provides a model to operationalize the sustained computation of phenology-related EBVs (personal communications, Wouter Los and Brian Wee, September 2013, European Grid Infrastructure conference, Madrid, Spain).

EU-US phenology databases. NEON's plant phenology observations from its constellation of sites are based on USA National Phenology Network (NPN) protocols. NPN protocols are also utilized by citizen scientists and by selected US federal agencies. In Europe, the Pan European Phenology Database (PEP725) hosts quality-controlled plant phenology observations using Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie (BBCH) protocols. Integrating EU and US plant phenology data, as a first step towards operationalizing the sustainable generation of annually updated phenology-related EBVs, is possible with a robust ontology.

Ontology development. In January 2016, the United States Geological Survey's (USGS) Powell Center for Analysis and Synthesis funded a workshop co-led by Rob Guralnick and an ontology scientist to initiate the development of a plant phenology ontology that capitalizes on the intellectual capital already invested in a plant development ontology. That way, observed plant phenophases can be related to plant development stages, enabling new science questions to be asked. Workshop attendees included individuals from NEON and NPN. The workshop resulted in a draft ontology that was successfully implemented in a triple store that fused NPN- and BBCH-compliant demonstration datasets. The success of the limited prototype has led to a modest commitment from the USGS to further the development of the plant phenology ontology. This work will be useful for GLOBIS-B for integrating EU-US data for computing phenology-related variables.

Other global entities. If successfully implemented, citizen-science-generated data in the US and the EU may one day be utilized for the computation of EBVs. This possibility has been explored with an indigenous tribe in the US, the Tulalip Tribes, during a September 2015 workshop (Tulalip, state of Washington, USA). That workshop was organized under the aegis of a strategic Esri-NEON collaboration called the “Tribal Lands Collaboratory” (TLC). The TLC is envisioned to utilize tribal observations of bird, plant, and salmon phenology for science and decision-support purposes. If successful, tribal nations, as sovereign nations in their own right, may enjoy the distinction of contributing to a global plant phenology commons for science and decision-support on cross-continental scales.

The state of phenology e-infrastructure. The PEP725 database does not appear to offer access to its data via web services: requests have to be submitted to the repository administrators, and subsequent access is enabled through a download link to the requested data. NPN data are discoverable and accessible via the NPN server or via DataONE, through APIs or through the web user interface. NEON will ultimately offer the same capabilities when fully constructed. There is a need to stipulate which technologies must be further developed to support use-cases that capitalize on the availability of continually updated phenology data for the computation of related EBVs.
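The triple-store fusion described above can be illustrated with a minimal sketch: observations expressed against the NPN and BBCH vocabularies are mapped to a shared ontology term and then queried together. The term names and crosswalk entries below are hypothetical placeholders for illustration, not identifiers from the actual plant phenology ontology:

```python
# Minimal sketch of ontology-mediated fusion of NPN- and BBCH-style
# phenology records. Term names and code mappings are hypothetical
# placeholders, not actual plant phenology ontology identifiers.

# Crosswalks from each source vocabulary to a shared ontology term.
NPN_TO_SHARED = {
    "Open flowers": "flowering_present",
    "Breaking leaf buds": "leaf_budburst_present",
}
BBCH_TO_SHARED = {
    "60": "flowering_present",       # illustrative BBCH code mapping
    "07": "leaf_budburst_present",   # illustrative BBCH code mapping
}

def fuse(npn_records, bbch_records):
    """Normalize records from both sources to (site, day_of_year, shared_term)."""
    fused = []
    for site, doy, phase in npn_records:
        fused.append((site, doy, NPN_TO_SHARED[phase]))
    for site, doy, code in bbch_records:
        fused.append((site, doy, BBCH_TO_SHARED[code]))
    return fused

npn = [("US-1", 95, "Open flowers")]
bbch = [("EU-1", 102, "60")]
records = fuse(npn, bbch)
# One cross-continental query over the shared term:
flowering = [r for r in records if r[2] == "flowering_present"]
print(flowering)
```

In a real deployment the crosswalks would be ontology axioms in a triple store rather than Python dictionaries, but the integration principle is the same: both protocols resolve to one queryable vocabulary.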


Possible steps to the next workshop:

Provide information about metadata quality

Develop a high-level use-case for integrating NEON, NPN, and PEP725 plant phenology data.

3.3.3 SANBI (lead: Jeffrey Manuel)

SANBI is not only a research infrastructure managing biodiversity information: our primary role is actually to coordinate policy-relevant fields of biodiversity science within the country and to advise government. Hence the primary work is monitoring, assessment and policy advice. Much of this work sits in the scientific component more than in the RI, specifically ecosystem assessment and monitoring, which is a key strength. This means that SANBI needs to determine to what extent the alignment of our national monitoring framework (and therefore funding for research, etc.) can be driven by EBVs; the work as an RI would then simply follow. Until now, EBVs have not been sufficiently on SANBI's radar as a policy imperative, but Leipzig and other recent developments have changed that. SANBI should probably not be merely reactive: there is a significant opportunity to support the construction of EBVs in a way that maximizes this alignment, although at this point that is difficult, given the nature of our work. From a species EBV perspective:

- SANBI is managing the local GBIF node, and would drive any EBV data management outcomes (e.g. implementation of the event core) through this. SANBI also coordinates GBIF-Africa and is very much involved in regional capacity development, so again there are opportunities here to test EBVs at scale.

- As SANBI also coordinates the science locally, should GLOBIS-B determine specific priority datasets, we would drive the local prioritization of addressing those data needs, manage that information, and feed it through to the relevant global platform.

Possible steps to the next workshop: SANBI committed to the following:

In July, SANBI starts the process of documenting how our current monitoring and assessment framework aligns with EBVs. This will largely be framed by what comes out of Seville (e.g. how far the hypercube concept is developed), but SANBI will focus on EBVs broadly (species and ecosystems) and include an assessment of which we think are the best candidate EBVs for South Africa, based on coverage, repeatability, etc.

This process will thus also highlight the EBVs to which SANBI would not contribute.

3.3.4 Chinese Academy of Sciences (lead: Liqiang Ji)

A number of databases hosted in China are relevant:

National Chinese species checklist (yearly updated)

Digitization project towards the goal of 30 million specimens, of which 15 million have been digitized, though not all are yet accessible online

Scientific database of the Flora & Fauna of China, including historical literature

Biodiversity forest monitoring network, with 15 plots throughout China where data on various variables are collected according to protocols with standard methods. It is being expanded to include some animal groups. The forest monitoring data could be a candidate contribution to EBV work, but consultation is needed first.

Software development for data management, for analyzing the taxonomic system, and for supporting research on biodiversity change under climate change.

Possible steps to the next workshop:

A workshop will be held at CAS to discuss how to respond to the questions raised at the GLOBIS-B workshop in Leipzig, focused on EBV issues.

3.3.5 CRIA (lead: Renato de Giovanni)

CRIA provides access to ~7.7 million specimen records through the speciesLink network. These records are mainly from biological collections in Brazil and are already available in Darwin Core format, making speciesLink another possible source of data for EBV research. CRIA also provides an ecological niche modelling web service based on openModeller. If GLOBIS-B needs to generate such models, it could either use the service provided by CRIA or install its own instance, since the service is based on free and open source software. The openModeller web service has been successfully used by the BioVeL platform and by the online system Biogeography of Flora and Fungi in Brazil, which currently contains potential distribution models created with occurrence data from speciesLink for ~3.7k species of plants.

Possible steps to the next workshop:

CRIA can provide support to use or install the ecological niche modelling service, and to fetch data from speciesLink if this is needed by any of the activities that aim to explore the calculation of EBVs.
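The speciesLink-to-model path sketched above starts from Darwin Core occurrence records. As a minimal illustration (not CRIA's actual API), the snippet below turns Darwin Core-style rows into the per-species coordinate sets that niche modelling tools typically take as input; the sample rows are invented, while `scientificName`, `decimalLongitude` and `decimalLatitude` are standard Darwin Core terms:

```python
import csv
import io
from collections import defaultdict

# Invented sample data in Darwin Core-style CSV form; the third row
# deliberately lacks coordinates, as real occurrence data often do.
SAMPLE = """scientificName,decimalLongitude,decimalLatitude
Cedrela fissilis,-47.06,-22.90
Cedrela fissilis,-46.63,-23.55
Cedrela fissilis,,
Tabebuia aurea,-56.09,-15.60
"""

def occurrence_points(dwc_csv):
    """Group valid (lon, lat) pairs by species, skipping rows without coordinates."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(dwc_csv)):
        lon, lat = row["decimalLongitude"], row["decimalLatitude"]
        if lon and lat:  # coordinates may legitimately be absent
            points[row["scientificName"]].append((float(lon), float(lat)))
    return dict(points)

pts = occurrence_points(SAMPLE)
print(len(pts["Cedrela fissilis"]))  # 2 — the record without coordinates is dropped
```

A real pipeline would fetch such rows from the network and hand the point sets to the modelling service, but this filtering and grouping step is common to most occurrence-based workflows.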

3.3.6 Atlas of Living Australia (ALA)

ALA, as the Australian biodiversity data infrastructure, is exploring with BioVeL a case study exposing multiple scientific and technical aspects of EBV measurement and presentation, to stimulate discussion toward the adoption of standard workflows for their production. The study focuses on the computation and presentation of the EBV ‘species distribution’ in the EBV class ‘species populations’, and more specifically on the data necessary for defining an indicator of change in invasive plant species range.

Possible steps to the next workshop:

ALA is discussing with BioVeL (Hardisty) and the University of Amsterdam (Kissling) how a full workflow can be demonstrated, ending up with relevant EBV data.

Initial reporting about the workplan will be presented at the second workshop (Seville).

The partnership is looking for funding opportunities in order to implement the plans.

3.4 Second GLOBIS-B workshop

The second workshop will be held from 13-15 June 2016 in Seville, Spain. The practical organization of this workshop is supported by LifeWatch Spain. The local contact person is Antonio Torralba Silgado.

3.5 GLOBIS-B publications

Daniel Kissling suggested that more than a single publication could emerge from the two workshops.

Paper 1: A review of the EBV class ‘Species populations’, based on answers to the pre-workshop questions. This would cover scientific, technical and legal aspects of EBV calculation.

o Target journal: Annual Review of Ecology, Evolution, and Systematics?

Paper 2: A paper summarizing how research infrastructures will respond to EBVs (i.e. to the Pereira et al. 2013 Science publication)

o How can we bring biodiversity variables in practice on the ground, how to make interoperable workflows, how to demonstrate a multi-lateral cooperation amongst existing research infrastructures world-wide?

o Maybe as a short publication in Science Policy Forum?

o Or a longer one in PLOS Biology?

Overall, this might include:

Interoperable workflows for measuring essential biodiversity across the tree of life;

Linking from data to the workflows to the calculation to the output;

Keeping a broad picture of the range of taxa that can be covered;

Demonstrating a multi-lateral cooperation among existing RIs.

From where we are right now, we have not advanced far enough to write paper 2. This would require (i) developing/illustrating what can be done in each research infrastructure as a set of case studies (this requires more active engagement and commitment from the involved research infrastructures than currently offered); and (ii) identifying key scientific, technical, and legal bottlenecks (this can be achieved with the expertise of the workshop participants, but requires commitment and time investment beyond the workshops).


A preliminary outline for paper #1 was suggested as follows:

Possible steps to the next workshop:

Daniel Kissling will prepare a refined outline for paper 1 and identify lead persons for specific sections of this manuscript. Once the outline is ready, all workshop participants will be invited to provide active and substantial input to the manuscript with respect to writing text parts, making figures, tables, etc. and providing case studies. The aim is to get a first draft of this manuscript ready before workshop 2 so that the draft and the specific sections can be discussed and improved during workshop 2.

The Legal & Policy group (Enrique Alonso, Anne Bowser, etc.) will draft a publication on how EBV development can be based on citizen science activities.

3.6 Wrap-up

The workshop was successful not only in the enthusiasm and commitment of its participants, but also in reaching the original main objectives:

Bring key scientists together with global research infrastructure operators and legal interoperability experts

Identify the required primary data, analysis tools, methodologies, and legal and technical bottlenecks

Identify the research needs and infrastructure services needed for computing EBVs globally

Facilitate the multi-lateral cooperation of biodiversity research infrastructures worldwide

Take-home questions for the participants to consider in the next few weeks are the following:

Technical people:

How can you and your research infrastructure contribute to making interoperable workflows for EBV measurement?

Biodiversity scientists:

How can you contribute to help research infrastructures to develop workflows for EBV measurements?

Legal and policy people:

How can you help with identifying legal/policy bottlenecks for interoperability in EBV measurements?

Participants will be contacted soon about actions prior to the next workshop in Seville.


4 Annexes

4.1 Participant List

4.2 Workshop Agenda

4.3 Pre-workshop input from participants


4.1 Participant list

No. First name Last name Organisation Country

1 Jorge Ahumada Tropical Ecology Assessment and Monitoring (TEAM) USA

2 Jane Elith University of Melbourne AUS

3 Miguel Fernandez iDiv Leipzig, GEO BON DEU

4 Kristen Williams CSIRO Ecosystem Sciences AUS

5 Nick Isaac Centre for Ecology & Hydrology GBR

6 Steve Kelling Cornell Lab of Ornithology USA

7 Louise McRae Zoological Society of London GBR

8 Matthias Obst Göteborg University & LifeWatch Sweden SWE

9 Henrique Pereira iDiv Leipzig DEU

10 Dirk Schmeller Helmholtz Centre for Environmental Research - UFZ DEU

11 Nicola Segata University of Trento ITA

12 Eren Turak New South Wales Government Office of Environment AUS

13 Andrew Skidmore University of Twente NLD

14 Jean-Baptiste Mihoub ECOSCOPE & UFZ Leipzig FRA

15 Christos Arvanitidis Hellenic Centre for Marine Research GRC

16 Nestor Fernandez GEO BON observer DEU

17 Donald Hobern Global Biodiversity Information Facility (GBIF) DNK

18 Hannu Saarenmaa University Eastern Finland, EU BON FIN

19 Robert Guralnick Florida Museum of Natural History USA

20 Jeffrey Manuel South African National Biodiversity Institute (SANBI) ZAF

21 Lucy Bastin Joint Research Centre of the European Commission ITA

22 Liqiang Ji Chinese Academy of Sciences CHN

23 David Martin Atlas of Living Australia AUS

24 Jesus Marco de Lucas LifeWatch Spain ESP

25 Brian Wee National Ecological Observatory Network (NEON) USA

26 Francisco Hernandez Flanders Marine Institute BEL

27 Renato De Giovanni Brazilian Reference Center on Environmental Information (CRIA) BRA

28 Daniel Amariles Humboldt Institute COL

29 Dimitris Koureas Natural History Museum, London GBR

30 Dmitry Schigel GBIF, University of Helsinki FIN

31 Johannes Peterseil Umweltbundesamt Austria AUT

32 John Watkins Centre for Ecology & Hydrology GBR

33 Jesús Miguel Santamaría LifeWatch Spain ESP

34 Antonio Torralba Silgado LifeWatch Spain ESP

35 Juan Miguel González Aranda LifeWatch Spain & Ministerio de Economía y Competitividad ESP

36 Donat Agosti Plazi CHE

37 Willi Egloff Plazi CHE

38 Anne Bowser Wilson Center Commons Lab USA

39 Eise van Maanen Food and Agriculture Organization (FAO) ITA

40 Daniel Kissling University of Amsterdam NLD


41 Wouter Los University of Amsterdam NLD

42 Jacco Konijn University of Amsterdam NLD

43 Alex Hardisty Cardiff University GBR

44 Enrique Alonso Universidad de Alcalá ESP

45 David Manset Gnúbila FRA

46 Francesca De Leo National Research Council Italy ITA

47 Monica Santamaria National Research Council Italy ITA

48 Joerg Freyhof German Centre for Integrative Biodiversity Research DEU


4.2 Workshop Agenda

Background and program for the 1st Workshop in

Leipzig, 29 February – 2 March 2016

1. The GLOBIS-B project

2. Framing and defining EBV related terminology

3. Workshop 1 on species distribution and population abundance

4. Workshop program

Annex 1: List of workshop participants


1. The GLOBIS-B project

The GLOBIS-B project aims to facilitate the global cooperation of world-class research infrastructures with a focus on supporting frontier research on biodiversity. The project aims to contribute to developing key measurements that underpin global indicators which are required to study, report and manage biodiversity change (Pereira et al., 2013). More specifically, the project is focusing on potential infrastructure services supporting research on measuring biodiversity change, specifically Essential Biodiversity Variables (EBVs). Thereby GLOBIS-B serves a major goal of the Group on Earth Observations Biodiversity Observation Network (GEO BON). The GLOBIS-B project consists of six European project partners and twelve supporting research infrastructures (Table 1). The project will organize four workshops, each with about 40-50 participants (international experts in biodiversity, research infrastructures, and legal/policy issues). More details about the GLOBIS-B project can be found in Kissling et al. (2015) and on the project homepage (www.globis-b.eu).

Table 1: Project partners and supporting research infrastructures of the GLOBIS-B project. The listed supporting research infrastructures represent those that have agreed to contribute to the GLOBIS-B project. From Kissling et al. (2015)

A key question for global biodiversity monitoring is how the multi-lateral cooperation of data collectors, data providers, monitoring schemes, and biodiversity research infrastructures can be achieved at the global level to support the harmonized implementation of EBVs. Until now, EBVs have hardly been tested for their significance in constructing biodiversity indicators and for their applicability at different spatiotemporal scales. Frontier research in this area requires the availability and accessibility of substantial data sets with sufficient spatiotemporal coverage. GLOBIS-B aims at elucidating how the cooperating research infrastructures (Table 1) may contribute to such an objective by focusing on offering data, workflows and computational services for calculating EBVs…

…for any geographic area, small or large, fine-grained or coarse;

…at a temporal scale determined by need and/or the frequency of available observations;

…at a point in time in the past, present day or in the future;

…as appropriate, for any species, assemblage, ecosystem, biome, etc.;

…using data for that area / topic that may be held by any and across multiple research infrastructures;

…using a harmonized, widely accepted protocol (workflow) capable of being executed in any research infrastructure;

…by any (appropriate) person anywhere.
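The on-demand EBV calculation outlined above can be sketched as a tiny workflow: filter occurrence records by area, taxon and time, then aggregate them onto a spatial grid. The record format and one-degree grid below are assumptions made for this illustration, not a GLOBIS-B specification:

```python
# Hypothetical sketch of an ad-hoc EBV workflow step: grid-cell occupancy
# for a chosen species, area and year, from simple occurrence records.
# The (species, year, lon, lat) record format is assumed for illustration.
RECORDS = [
    ("Parus major", 2014, 4.9, 52.4),
    ("Parus major", 2015, 5.1, 52.3),
    ("Parus major", 2015, 13.4, 52.5),
    ("Passer domesticus", 2015, 4.9, 52.4),
]

def occupied_cells(records, species, year, bbox, cell=1.0):
    """Return the set of occupied grid cells (lon/lat, `cell`-degree resolution)
    for one species in one year within bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    cells = set()
    for sp, yr, lon, lat in records:
        if (sp == species and yr == year
                and min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            cells.add((int(lon // cell), int(lat // cell)))
    return cells

europe = (-10.0, 35.0, 30.0, 60.0)
print(len(occupied_cells(RECORDS, "Parus major", 2015, europe)))  # 2 occupied 1-degree cells
```

Running the same function for another area, species, year or cell size illustrates the "any area, any time, any species" ambition; the hard part the project addresses is making the `records` input come interoperably from many research infrastructures.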


As such, the related scientific questions and methods will assist in defining the user requirements for extracting, handling and analyzing data to measure biodiversity change. To this end, the project brings together key scientists with global research infrastructure operators, technical experts and legal interoperability experts to address research needs and infrastructure services underpinning the concept of EBVs. With this focus on research needs for calculating and testing EBVs, the attention is on ad-hoc on-demand services (and related workflows) in the cooperating research infrastructures. The obtained experiences may later lead to a more systematic, periodic production cycle where EBV data products are produced, updated and extended, for example annually, quarterly or monthly. However, the latter is not a specific deliverable of the project.

2. Framing and defining EBV related terminology

Recent documents from GEO express how data should be shared and managed2, and a recent document3 from GEO BON indicates how to aggregate biodiversity observations into EBVs and then into biodiversity indicators. In the introduction of the GEO BON document3, it is explained that biodiversity indicators are derived by integrating data from various Essential Biodiversity Variables. The EBVs applied here are based on large global datasets, state-of-the-art remote-sensing information, model-based integration of multiple data sources and types, including in situ (ground-based) observations, and online infrastructure enabling inexpensive and dynamic updates, with full transparency.

Relevant terminology can be defined as follows (see Pereira et al. (2013), available here4).

Biodiversity indicators are designed to convey messages to policy-makers and management, for example by delivering regular, timely, evidence-based information on biodiversity change. They are derived from aggregated primary data to convey information beyond the data itself. Indicators are often used to track progress towards specific targets (e.g. the Aichi targets of the Convention on Biological Diversity5 and the Biodiversity Indicators Partnership6).

Essential Biodiversity Variables are quantities, based on observations and for large parts of the Earth, which are required for the long-term monitoring of biodiversity at national to global scales and especially for the detection of change. They facilitate data integration by providing an intermediate abstraction layer between primary observations and indicators. As such they define a minimum set of essential measurements to capture the major dimensions of biodiversity change, complementary to each other. The EBV framework is based on repeated measures or representative sampling at the same locations or regions, ideally at regular intervals.

Data are the result of measurements (e.g. based on direct observations, remote sensing, etc.), and include both existing legacy data and targeted, newly generated or reprocessed data. Data can also be produced by models and proxies, extrapolated from a few real measurements.
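The layering just defined (primary data, then an EBV as an intermediate abstraction, then an indicator) can be illustrated with a toy calculation: population counts are normalized into an abundance-index EBV per population, which is then aggregated into a single indicator value. The counts are invented, and the geometric-mean aggregation is just one common choice for abundance-based indicators, not a prescription from this report:

```python
import math

# Toy illustration of the data -> EBV -> indicator layering.
# Counts are invented; the geometric mean is one common aggregation
# choice for abundance-based biodiversity indicators.
counts = {  # primary data: population counts per year
    "population A": {2010: 100, 2015: 80},
    "population B": {2010: 50, 2015: 60},
}

# EBV layer: relative abundance index per population (baseline 2010 = 1.0)
ebv = {
    pop: {yr: n / series[2010] for yr, n in series.items()}
    for pop, series in counts.items()
}

# Indicator layer: geometric mean of the indices across populations
def indicator(year):
    vals = [series[year] for series in ebv.values()]
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

print(round(indicator(2015), 3))  # 0.98 — near-stable overall despite opposing trends
```

The point of the intermediate EBV layer is visible here: the raw counts of the two populations are not directly comparable, but their normalized indices are, so the indicator can aggregate across heterogeneous primary data.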

3. Workshop 1 on species distribution and population abundance

The first GLOBIS-B workshop is one in a series of four (Figure 1). Workshop 1 is coupled with workshop 2; both are related and focused on the EBV class ‘Species populations’, i.e. species distributions and abundances. The workshop participants are listed in Annex 1.

2 http://www.earthobservations.org/geoss.php and http://www.geoportal.org/web/guest/geo_home_stp
3 http://www.geobon.org/Downloads/brochures/2015/GBCI_Version1.2_low.pdf
4 http://www.earthobservations.org/documents/cop/bi_geobon/ebvs/201301_ebv_paper_pereira_et_al.pdf
5 https://www.cbd.int/sp/targets/default.shtml
6 http://www.bipindicators.net/


Figure 1: Overview of the four GLOBIS-B workshops which are related to species distribution and abundance (1 & 2), species traits (3), and species interactions (4). The aim of the workshops is to discuss and develop a framework for implementing Essential Biodiversity Variables (EBVs) across research infrastructures worldwide. This will be achieved by discussions among different participant groups, including biodiversity scientists and ecologists; biodiversity informaticians, technologists and infrastructure operators; and legal interoperability and policy experts. The figure is derived from Kissling et al. (2015).

Interactions between scientists and research infrastructure operators

The GLOBIS-B project is unique in bringing together biodiversity scientists with operators from research infrastructures to discuss and develop a framework for implementing EBVs (Figure 1). The workshops are meant as experiments, where different scenarios are considered on how scientists may want to test the relevance of EBVs to build indicators, and which data, workflows and computational capacity each scenario will require. In turn, the cooperating research infrastructures will consider the challenges and potential solutions of providing the required data and workflow services, to achieve global interoperability. This requires discussing in detail the necessary steps and tools needed to move from data collection and transformation over modelling, testing & validation to the final EBV presentation (Figure 2, green). Important scientific discussion points will be which data are needed and how they have to be transformed, which analytical tools and models need to be implemented and tested, and how to present and visualize EBVs (Figure 2, blue). Related technical discussion points are how workflows have to be designed, which Information and Communication Technology (ICT) approaches and options are available, and how they can be made interoperable (Figure 2, red). This will make it possible to formulate key research questions for testing EBVs and help to identify which technical and legal challenges the research infrastructures are facing to support the interoperable, on-demand, ad-hoc calculation of EBVs.


Figure 2: Potential steps for the calculation of Essential Biodiversity Variables (EBVs, green) and related scientific (blue) and technical (red) questions and challenges.

Legal interoperability and policy

The legal interoperability and policy issues are mainly related to legal interoperability, to open access/data sharing and data management principles, and to the licensing/terms of use that web services for collaboration on EBV calculations would need (e.g. a web service managing data from multiple origins).

Potential legal interoperability problems might create bottlenecks in the workflows due to either data ownership or contractually agreed limitations to access data or software. This has to be diagnosed and identified (incl. mandatory protocols/standards, confidentiality, trade secrets, intellectual property rights, patent rights, database rights, compelling state interests in data secrecy such as locations of threatened species or genetic resources, embargo periods for scientific publications, attribution/provenance, aggregation of data, data licensing agreements, etc.). It also includes the analysis of normative versus legally binding practices of the different biodiversity-related scientific communities who contribute data to the calculations and/or further use of EBVs.

Policy issues will also play a role. It will be evaluated to what extent potential EBV calculations comply with GEO and RDA-CODATA open access/data sharing and data management principles. This will be diagnosed and identified during workshop 1.

There are a number of other interoperability issues. EBV calculations might need to be accessible to multiple users by using multiple providers of data sets, including citizen science. This will inevitably confront web services and collaborative efforts of multiple e-science infrastructures with interoperability issues. Such issues might offer important suggestions for developing policies (terms of use, licensing, liability waivers, etc.).

In Workshop 1, a small group of legal/policy experts will follow the debates of scientists and technical experts. This will make it possible to identify key issues of interoperability and policy. These specific issues will then be addressed in more detail in Workshop 2. An initial document on the legal/policy issues has been put together as input into workshop 1. This document is available at:

http://www.globis-b.eu/files/intranet/Deliverables/Final/D4_1%20final.pdf

Expected outcomes of workshop 1

Workshop 1 aims at identifying the options and problems associated with infrastructure support for the EBV class ‘Species populations’ (EBVs ‘Species distribution’ and ‘Population abundance’). Workshop 2 will be a follow-up a few months later with the objective of writing a peer-reviewed scientific paper and other dissemination of key results from the first workshop. Several outcomes are expected from workshop 1:

Examples of research questions for developing EBVs and the implications with respect to data and methodology (analysis, models, etc.)

Data characteristics (spatial, temporal and periodicity dimensions, following the EBV criteria as listed by the GEO BON strategy for EBV development7) and technical requirements for EBV calculation, including potential software and workflows, e.g.
- Candidate datasets which are available that meet the criteria
- Methods that are needed/useful to convert the raw data into normalized/derived data that fit the purpose of calculating EBVs
- Required workflows, support for their execution, and management of studies and results
- Statistical methods and analysis tools
- Techniques and tools for visualization

Preliminary views of the research infrastructures on how they could accommodate the requirements and enhance their performance by global cooperation, e.g. via
- Cross-mapping of requirements and current capabilities of research infrastructures, i.e. readiness (matrix of workflow steps vs. capabilities of research infrastructures, with level of readiness indicated)
- Options for infrastructure cooperation, task division and sharing of services
- Identification of existing barriers, obstacles and risks, including legal barriers to access or reuse data, with suggestions for removing them

Recommendations and guidelines for the harmonization of EBV-relevant data collection, data transformation/normalization, and for data exchange and sharing of interoperable datasets, e.g.
- Standards for data production
- Guidelines for handling existing data

Views on associated policy and legal implications, e.g.
- Preliminary observations on legal issues
- Potential implications for policy requirements and to inform policy bodies

Plans for continued remote interaction of participants and/or test implementations in order to prepare for workshop 2, e.g.
- Identifying steps to develop a demonstration of how a global cooperation among research infrastructures could lead to an EBV calculation, maybe with the aim to present it at a large event in 2018 (e.g. a GEO-related event)
- Identifying who (i.e. people and research infrastructures) could take the lead in various steps

4. Workshop program

Day 1 (afternoon)

Lunch (optional for confirmed requests)

13:30 Welcome (30 mins)
o Welcome by GLOBIS-B and introduction of participants (Daniel Kissling, 15 min)
o Welcome by German Centre for Integrative Biodiversity Research (iDiv) (Director iDiv, 10 min)

14:00 Introductory session with lightning talks (from data to science and policy) & discussion (15 minutes each, plus questions). Chair: Daniel Kissling (later Wouter Los), rapporteur: Francesca De Leo

a. Introduction
- EBVs and biodiversity indicators (Henrique Pereira)

7 http://www.geobon.org/Downloads/reports/2015/Essential_Biodiversity_Variable_Strategy_v1.pdf


- Why are we sitting together? (Daniel Kissling)

b. Views from biodiversity science
- Sampling procedures and field data collection for biodiversity monitoring (Nicolas Titeux)
- Methods for extracting trends from distribution data (Nick Isaac)
- Calculating biodiversity indicators (Louise McRae)

c. Views from legal interoperability and policy
- Data Sharing Principles and Data Management Policies in GEO and RDA / CODATA (Willi Egloff)

[Coffee break (20 min)]

d. Views from infrastructures and informatics
- The infrastructure landscape: data portals and related biodiversity infrastructures (Donald Hobern)
- Interoperability and workflows: state-of-the-art (Alex Hardisty)
- Demonstration of EBV pilot from EU BON (Hannu Saarenmaa)

e. How to proceed in the next sessions (45 mins)

18:00/18:30 Dinner

Day 2 (morning)

08:15 Science session with 5 parallel table groups, each one with a mixture of scientists and technical experts (plus a legal expert) addressing the scientific questions for the EBV class ‘Species populations’. Chairs and moderators (one for each table group): Daniel Kissling, Alex Hardisty, Wouter Los, Francesca de Leo, Monica Santamaria. Rapporteurs (one for each table group): Jacco Konijn, Miguel Fernandez, David Manset, Enrique Alonso, Joerg Freyhof

- Question: How would you calculate and visualize the change in distribution/abundance for constructing an EBV? [Your table group might want to choose a specific taxonomic group, a specific EBV measurement (e.g. species presence, abundance, range size, etc.), and/or maybe a specific region so that your discussion can be focused. Base this decision on data that are potentially available from data portals or research infrastructures.] Specific sub-questions:
o Which data types do you require (and are they available)?
o Which scientific methods do you apply to calculate your EBV measurement, making it repeatable?
o How should results preferably be presented and visualized?
o What research questions are key to test EBVs?

9:45 [Coffee break (15 min)]

10:00 Plenary reports (short summary by each group) and discussion Chair: Daniel

Kissling, rapporteur: Monica Santamaria

10:45 Technical session with 5 parallel table groups (same as above), each one with a mixture of scientists and technical experts (plus a legal expert) addressing technical questions for the EBV class ‘Species populations’. Chairs and moderators (one for each table group): Daniel Kissling, Alex Hardisty, Wouter Los, Francesca de Leo, Monica Santamaria. Rapporteurs (one for each table group): Jacco Konijn, Miguel Fernandez, David Manset, Enrique Alonso, Joerg Freyhof

- Question: How can research infrastructures address the scientific needs discussed in the previous session? Specific sub-questions:

o What are the key steps of a workflow for calculating EBVs?

o What is a suitable technical (ICT) approach to perform this workflow for calculating EBVs (any place, any time, using data anywhere, by anyone)?

o What are the options available and what is possible to achieve today / within 12 months?

o What are the top 3-5 technical challenges of supporting interoperable EBV calculations?

12:00 Plenary reports (short summary by each group) and discussion. Chair: Alex Hardisty, rapporteur: David Manset

- Topic ranking

12:45 Lunch

Day 2 (afternoon)

14:00 Specific parallel group sessions; scientists, technical experts and legal experts are separated. Chairs: Daniel Kissling, Alex Hardisty, Enrique Alonso. Rapporteurs: Joerg Freyhof, Lee Belbin, Anne Bowser.

a. Research infrastructures discuss available data sources, associated services, problems to solve, and the readiness level of each research infrastructure. Informatics experts discuss potential workflow development for interoperability and data processing. Who of the technical people would be willing to work actively after workshop 1 (until workshop 2) on the calculation of an EBV?

b. Scientists discuss details of selected EBV research: which methods and data are needed and their characteristics, which provenance information needs to be recorded, and how outcomes should be presented for developing EBV-related biodiversity indicators. Who of the scientists would be willing to work actively after workshop 1 (until workshop 2) on the calculation of an EBV?

c. Legal experts discuss policy and legal issues arising from the previous sessions.

16:30 Plenary reports (short summary by each group) and deeper plenary discussion. Chair: Henrique Pereira, rapporteurs: Daniel Kissling/Jacco Konijn

- Summary of specific parallel group sessions

- Potential additional topics for consideration

Day 3 (morning)

08:30 Plenary closing session

Chair: Daniel Kissling. Rapporteur (Wouter Los/Alex Hardisty)

- Conclusions and general summary from the three parallel group meetings, with discussion.

o What might be common conclusions and required next steps on the thought experiment?

o Recommendations and guidelines for the harmonization of EBV-relevant data collection and curation and for the sharing of interoperable datasets?

- How to move from workshop 1 to workshop 2?

o How can we frame a scientific paper with high impact?

o How to prepare for workshop 2?

o What has to be done in-between workshops?

o Who takes which tasks? Who can invest time until the next workshop to work on the implementation of EBVs?

12:30 Closure

References

Kissling, W.D., Hardisty, A., García, E.A., Santamaria, M., De Leo, F., Pesole, G., Freyhof, J., Manset, D., Wissel, S., Konijn, J. & Los, W. (2015) Towards global interoperability for supporting biodiversity research on essential biodiversity variables (EBVs). Biodiversity, 16, 99–107.

Pereira, H.M., Ferrier, S., Walters, M., Geller, G.N., Jongman, R.H.G., Scholes, R.J., Bruford, M.W., Brummitt, N., Butchart, S.H.M., Cardoso, A.C., Coops, N.C., Dulloo, E., Faith, D.P., Freyhof, J., Gregory, R.D., Heip, C., Höft, R., Hurtt, G., Jetz, W., Karp, D.S., McGeoch, M.A., Obura, D., Onoda, Y., Pettorelli, N., Reyers, B., Sayre, R., Scharlemann, J.P.W., Stuart, S.N., Turak, E., Walpole, M. & Wegmann, M. (2013) Essential Biodiversity Variables. Science, 339, 277-278.


4.3 Pre-workshop input from participants

The questions below, with your answers, will serve as input into the first GLOBIS-B workshop. Please answer these questions within the next three weeks and send them back to Daniel Kissling ([email protected]). The workshop will focus on species distributions and abundances (EBV class ‘Species populations’). The questions below relate either to scientific aspects of species distributions and abundances or to technical aspects of research infrastructures. Please try to answer all questions from your own perspective (science, technical, legal/policy). Please write a paragraph or make bullet points for each question. We would very much appreciate it if you could back up your statements and thoughts with relevant literature (e.g. key references) or other sources (e.g. links to websites). All answers will be compiled anonymously and made available at the beginning of the workshop. Please send your answers back to Daniel Kissling ([email protected]) by 15 February 2016.

Questions related to biodiversity science

Question 1: Which data would you require to quantify an EBV related to the EBV class ‘Species populations’, i.e. changes in species distributions and/or abundances over time? Which standardisations are needed for such data? What data is available and where?

Question 2: Which specific scientific/quantitative methods (e.g. statistics, models, workflows, sampling designs) would you require to calculate temporal changes of species distributions and/or abundances from the raw data of observations?

Question 3: How would you present and visualize the results of an EBV related to changes in species distributions and/or abundances over time? Which presentation and visualisation approaches would be informative to show the results?

Question 4: What are, in your opinion, the two most important research questions relevant for quantifying changes in species distributions and/or abundances over time (i.e. for measuring biodiversity change)? Why do you consider those research questions particularly relevant?

Questions related to the technical side (research infrastructures)

Question 5: What are the key steps of a workflow (or workflows) for calculating species distribution and/or abundance EBVs, starting from accessing the raw data to presenting a visual result? What are the complexities involved? What data preparation is needed?

Question 6: What is a suitable technical (ICT) approach to perform this workflow for calculating EBVs (any place, any time, using data anywhere, by anyone)? What special considerations have to be taken into account?

Question 7: What are the technical options available and what is possible to achieve today or within the next 12 months? What data, workflows, software etc. are available today? Where are they and how can they be used?

Question 8: What are the top 3-5 technical challenges of supporting interoperable EBV calculations on a global basis? How can these be addressed and in what time period? Who has to do something?

Answers

Question 1: Which data would you require to quantify an EBV related to the EBV class ‘Species populations’, i.e. changes in species distributions and/or abundances over time? Which standardisations are needed for such data? What data is available and where?

ANSWER:

Vegetation: Method 1: hyperspectral image data (airborne, Sentinel-2, EnMAP) + LiDAR; Method 2: hypertemporal data (MODIS, SPOT, etc. time series); Method 3: species distribution models supported by climate data, remote-sensing time series, ancillary data and in situ data.

Individual organisms: very high resolution imagery (< 1 m)

ANSWER:

Minimum data requirement for distribution: species occurrences (i.e. presence) associated with geographical coordinates and exact sampling date (or at least year) of the record. Ideally, records are representative of the range of the species.

Minimum data requirement for abundance: species count time series at a given location with the exact sampling date (at least the year) of each record; geographical coordinates of the sampled location are desirable but not mandatory. Count data that do not belong to a time series (e.g. opportunistic counts) could also be used but are more problematic to analyse.

Standardisations: consistency in the sampling procedure (sampling effort, protocols, etc.) over time (mostly for abundance) and space (mostly for distribution) is highly preferable. Any change in the sampling procedure should be reported so that it can be accounted for when standardizing the data (e.g. by weighting counts per unit effort). If a record did not follow any procedure (opportunistic data), this information should be available in order to distinguish it from records collected with a specific sampling procedure (coordinated citizen science scheme, research programme, etc.).

Non-detection records or absences (i.e. “0”) should ideally be reported whenever possible. Otherwise, the metadata should allow distinguishing between non-detection of a species and no sampling.

Data availability: very heterogeneous across taxonomic groups and geographical locations. Coordinated citizen science schemes such as the European breeding bird surveys (EBCC) and butterfly monitoring schemes can provide appropriate count data at the scale of Europe, and most countries have also implemented schemes at the national level. Opportunistic occurrence records are available from GBIF (museums, citizen science with no sampling procedure, etc.). Interestingly, any count data used to calculate the abundance EBV can also be used as occurrence data to inform the distribution EBV.
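The effort-weighting step described above can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: the record fields (`count`, `effort_hours`) and the values are hypothetical.

```python
# Sketch: standardizing raw counts to counts-per-unit-effort so that
# surveys with different sampling effort become comparable. Records
# without recorded effort are flagged as opportunistic, as suggested
# above, rather than silently mixed in.

def standardize_counts(records):
    """Return count per unit effort for each record; records lacking
    effort information are left unscaled and flagged as opportunistic."""
    out = []
    for r in records:
        effort = r.get("effort_hours")
        if effort:  # effort known: weight the count per unit effort
            out.append({"species": r["species"],
                        "cpue": r["count"] / effort,
                        "opportunistic": False})
        else:       # no protocol recorded: keep raw count, flag it
            out.append({"species": r["species"],
                        "cpue": r["count"],
                        "opportunistic": True})
    return out

records = [
    {"species": "Parus major", "count": 12, "effort_hours": 4},
    {"species": "Parus major", "count": 5},  # opportunistic record
]
print(standardize_counts(records))
```

The flag lets downstream analyses treat protocolled and opportunistic records differently instead of discarding the latter.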

ANSWER:

Identification of key/indicator/target species.

Key/indicator/target species distributions should be monitored using AOO and EOO (see http://jr.iucnredlist.org/documents/RedListGuidelines.pdf)

Estimating distributions is going to be easier than estimating abundances.

Regular surveys of distributions and estimates of abundance required if the EBV is to be robust.

Data: distribution data is the type most widely available, via data publishers such as the Atlas of Living Australia, GBIF, CRIA, BISON, SANBI, etc.

Abundance data is available through Australian State/Territory agencies, but there are no standards for handling these data with most data publishers. Most data publishers lack systematic survey data, and even where they have it, there are no standards for effectively managing it. See http://www.portal.aekos.org.au/ for a systematic survey data infrastructure. I am unaware of any international equivalents to AEKOS (integrated systematic survey data with an ontology infrastructure).
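As a rough illustration of the AOO and EOO measures referenced above (see the IUCN Red List guidelines for the authoritative definitions), the following sketch computes EOO as the convex-hull area of occurrence points and AOO as the summed area of occupied 2 x 2 km grid cells. It assumes coordinates already projected to kilometres (real analyses must project lon/lat first), and the points are invented.

```python
# Sketch: Area of Occupancy (AOO) and Extent of Occurrence (EOO)
# from point occurrences, in the spirit of the IUCN Red List guidelines.

def aoo(points, cell_km=2.0):
    """AOO: number of occupied cell_km x cell_km grid cells times cell area."""
    cells = {(int(x // cell_km), int(y // cell_km)) for x, y in points}
    return len(cells) * cell_km * cell_km

def eoo(points):
    """EOO: convex-hull area (Andrew's monotone chain + shoelace formula)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

pts = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (5.0, 5.0)]
print(aoo(pts), eoo(pts))
```

Tracking these two measures over repeated surveys gives a simple, repeatable distribution signal for the EBV.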


Data standardization (Darwin Core), vocabulary/ontology.

ANSWER:

Ideally:

o Repeated large-scale single-methodology surveys including species counts across a wide area (best of all, the whole species range), with good understanding of detectability characteristics for the species in question (and probably best with simultaneous surveying of a range of related or co-distributed species)

In the absence of the above:

o Access to the fullest possible suite of locality records (presence) for the species (with evidence evaluated for confidence)

o A good spread of sites/dates for which more comprehensive assessment has been made along the lines described above under "Ideally"

o Understanding of the level and evenness of recording effort and quality associated with available data (possibly derived from overall data levels for an appropriate broader taxon)

o Potentially, adequate sequence data from across the range to assess connectedness, gene flow, etc.

In all cases:

o Knowledge of taxonomic factors which may bias or affect the ability to discover and use data relating to the species concept under question (e.g. species splits accepted in a patchwork fashion at national levels may impact transnational comparability)

o Good environmental, climatic, vegetation, land-use, etc. layers for the whole range in question, encompassing factors considered likely to be limiting or influential on the distribution and population density of the species, all at a scale that is appropriate based on the mobility of the species in question

o If possible, particularly for more localised, low-vagility species, for which the environment of fine-scale micro-habitats is likely to be more important, appropriate measurements of significant environmental aspects taken simultaneously with the survey activity

o Knowledge of the range/populations of significant host taxa and major predators/parasites/parasitoids, etc. (including crops, etc.)

o (Repeated from above) Understanding of the detectability of the species in question

o Understanding of the community and metapopulation structures for the species in question

o Body mass for the species at various points in its development

o Other lifecycle and mobility traits

ANSWER:

Specimen occurrence data and event data with associated information such as sex, life stage, reproductive condition and demographic parameters. Associated metadata is of course quite important, to describe capture methods, interpolation methods and detection probabilities.

Darwin Core standards for occurrence records and event-based documentation are very suitable as the main reference for making these data interoperable.

I think the issue is not where the data are available, but how to promote making available the existing data kept in researchers' private repositories and opening them up for science.

ANSWER:

The most important EBV here would be abundance; hence, we would need either CMR (capture-mark-recapture) data, modelled abundance data or estimates from genetic data (effective population size, Ne). From a standardization point of view, CMR data and Ne can be readily used, but the data are usually scattered and will be difficult to bring together, especially on a global scale. Most of these studies are local, and many are of high scientific value but not open access.

ANSWER:


From the molecular biodiversity point of view, in order to quantify changes in species distributions and/or abundances over time, DNA barcode or metagenomic (in the case of microbial communities) sequences could be considered the required raw data. From these sequences it is possible to infer the list of taxa present in a given geographic area and, in some cases, their abundances.

A given taxonomic group could be the target, possibly a species, selected on the basis of prior knowledge about its ability to respond to environmental changes (for example, eutrophication causes excessive growth of aquatic plants following a great availability of natural or anthropogenic nutrients) or to affect the environment itself (the proliferation of microscopic algae causes greater bacterial activity and an increase in oxygen consumption, which causes fish death). Alien or invasive species could also be considered.

A standardization of the entire experimental protocol, from sampling to sequencing, is required so that the data from different samples are comparable. This standardization is being implemented in projects such as OSD, Earth Microbiome Project and TARA. For example, the sequencing technology and its throughput may have evolved between different sampling times. Consequently the same sequencing chemistry and possibly the entire experimental protocol should remain the same along the different time points.

A strict standardization is required even for the subsequent bioinformatic taxonomic analysis. This concerns both the assignment tools and the reference sequence databases. If multiple sequence datasets are produced for the same sample, as normally occurs in a metagenomic experiment, it is often necessary to normalize the different datasets to be analysed and compared. This may be done, for example, by applying tools, such as those implemented in DESeq, that minimize the dataset size effect.
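The normalization step mentioned above can be illustrated with a simpler technique than DESeq's model-based approach: rarefying, i.e. randomly subsampling every dataset down to the smallest sequencing depth so that dataset size no longer drives the comparison. The sample names and counts below are hypothetical.

```python
import random

# Sketch: normalizing metagenomic read counts by rarefying (random
# subsampling without replacement) to the smallest dataset size.
# This is a simple illustrative alternative; DESeq-style tools use
# model-based normalization instead.

def rarefy(counts, depth, seed=0):
    """Subsample a taxon->count table to exactly `depth` reads."""
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    sub = rng.sample(pool, depth)
    out = {}
    for taxon in sub:
        out[taxon] = out.get(taxon, 0) + 1
    return out

samples = {
    "t0": {"Prochlorococcus": 600, "Synechococcus": 400},   # 1000 reads
    "t1": {"Prochlorococcus": 150, "Synechococcus": 50},    # 200 reads
}
depth = min(sum(c.values()) for c in samples.values())
normalized = {name: rarefy(c, depth) for name, c in samples.items()}
print(depth, {n: sum(c.values()) for n, c in normalized.items()})
```

After rarefying, every time point contributes the same number of reads, so apparent abundance shifts are not artefacts of sequencing depth.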

DNA barcoding data produced in many experiments are available from a number of online resources such as BOLD. The Public Data Portal of BOLD supports queries based on taxonomy, geography, specimen depositories, project or dataset codes, etc.

Metagenomic projects, with sequences and their taxonomic or functional characterization, can be browsed in various online archives such as EBI Metagenomics, MeganDB, iMicrobe, etc.

ANSWER:

We would need spatial and temporal replicates of presence and/or abundance records for a series of species at the global scale (Isaac & Pocock, 2015):

o Taxonomic coverage: species from different kingdoms (e.g. plants, animals) in the terrestrial, freshwater and marine realms

o Spatial coverage and replicates: global coverage to cover the different biomes in a representative manner (see question 2 for more details)

o Temporal coverage and replicates: time series of presence and abundance records with repeated surveys at the same locations to deal with heterogeneous sampling effort and imperfect detection rates

o Standardisation of the data (see also question 4): species presence/absence versus presence-only data, information needed on the way to calculate abundance (units, sampling scale), estimation of sampling effort along the time series

ANSWER:

Temporally replicated and geo-referenced presence/absence data or count/transect data; in the case of count data, the distance of the animal to the observer is important to correct for detection probability. Presence/absence data can come from camera traps, sound recorders or human observers. Multiple observations from each point are required to estimate detection probability. For observer data, cross-validation and quantification of observer skill are desirable.
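A minimal sketch of why repeated observations matter for detection probability: with multi-visit detection histories, a naive detection rate can be estimated at sites where the species was seen at least once, and then used to correct raw occupancy for "present but undetected" sites. Real analyses fit formal occupancy models (e.g. MacKenzie-style likelihoods); the detection histories below are invented.

```python
# Sketch: naive occupancy correction from repeated-visit data.
# Each history is a list of 0/1 detections for one site.

def naive_occupancy(histories):
    detected = [h for h in histories if any(h)]
    if not detected:
        return 0.0, 0.0
    # Detection probability: fraction of visits with a detection,
    # computed only at sites where the species is known to be present.
    p = sum(sum(h) for h in detected) / sum(len(h) for h in detected)
    n_visits = len(histories[0])
    raw = len(detected) / len(histories)
    # Correct for sites occupied but never detected in n_visits tries.
    psi = raw / (1 - (1 - p) ** n_visits)
    return p, min(psi, 1.0)

histories = [  # 6 sites x 3 visits (1 = detected)
    [1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 0], [1, 0, 0],
]
p, psi = naive_occupancy(histories)
print(round(p, 3), round(psi, 3))
```

The corrected occupancy psi exceeds the raw proportion of sites with detections, which is exactly the bias that single-visit data cannot reveal.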

ANSWER:

Data

o At a high level, full life-cycle abundance estimates made across large functional groups (e.g., birds) fit well within an EBV framework. We think about spatial and temporal scales outside of geopolitical boundaries and instead incorporate the patterns of abundance across a species’ entire life history, regardless of where it occurs. Accomplishing this full life-cycle analysis requires thinking about and researching what can be done with the inherently imperfect data that are actually collected.

o The specific data to be considered is primary occurrence data: the raw data available that describes the occurrence of a species at a specific location and date. The variability in how these data are collected and made accessible is high. The proper direction to proceed is to develop tools that allow analysts to infer more information from less well-described observations (i.e., sources that either provide simple occurrences, presence only, or contain biases based on the sampling framework) by drawing on more information-rich data sources (i.e., data collected following strict collection protocols by trained observers, from which absence as well as presence can be inferred).

Standardisations

o Data are collected and analyzed using a variety of methods. Deriving global consensus indices would benefit from methods to integrate and standardize heterogeneous sources of information. This should happen at a number of different levels:

Data collection – sampling designs (where and when data are collected) and protocols (how data are collected), across different spatial and temporal extents and grains/resolutions.

Data analysis – different models, each with various strengths and weaknesses.

o The cross-scale analysis and integration of information across space, time, and ecological units necessary to derive global indices also pose challenges that will require scientists to deal with issues of heterogeneity and standardization.

Variation in observer skill can be considered as heterogeneity, and likely also bias, in the detection process. The effect of skill on the detection process varies considerably by species. The accuracy of species distribution models and inference is improved if more heterogeneity in the detection process can be accounted for; this is particularly relevant if the effect of observer skill is biased with location/season/etc.

o There are interesting examples of dealing with heterogeneity and standardization. The joint analysis of heterogeneous data is now receiving more attention in the statistics and quantitative ecology communities (e.g. Fithian and Hastie's work, Sauer and Link, etc.).

By bringing heterogeneous data together, we can attempt to standardize them. The outcome is that we learn, accumulating evidence where different information sources agree or disagree. A recent analysis by Blancher of the Breeding Bird Survey, Boreal Avian Monitoring and eBird shows evidence of a consistent spatial pattern of breeding-season abundance across different data collection protocols, in different regions and seasons, with different analyses.

By accounting for variation in observer skill we can improve the ability to make inferences about species distributions. This has important implications for citizen science data that are gathered by individuals of varying ability.

Data availability

o Varies by taxon. Birds and most mammals are well covered, while most other taxa are not.

ANSWER:

I would need cultivation-free interrogation of microbial communities. This means 16S rRNA sequencing and/or shotgun metagenomics.

The main standardization needed is the consistency of the sampling procedures, storage conditions, and sequencing protocols/machines.

Some datasets are already available. These include:

o The Earth Microbiome Project http://www.earthmicrobiome.org/

o The Human Microbiome Project http://hmpdacc.org/

o The MetaSUB project http://www.metasub.org/

o Tara Oceans science http://www.embl.de/tara-oceans/start/


ANSWER:

Occurrence data, from observations or specimens, are required. Time, place (latitude and longitude) and taxonomic identification are basic requirements for such data. It is much better if population size and/or density data are available. GBIF and long-term observation projects, like GEO BON, may provide some occurrence data, but the data need to be cleaned first.

ANSWER:

The first step is to define what sort of change you want to be able to detect; this will give an idea of the sample size required and the study design.

Species data:

o You need a measure at a given site that is at least probabilistic (e.g. the probability that the species occurs there). There is no point having a relative measure, because these can’t be compared between time 1 and time 2. This means that presence-only data aren’t sufficient for change estimates. Ref: Guillera-Arroita et al. (2015). Global Ecology and Biogeography, 24, 276-292.

o Need to deal with imperfect detection, so you’re not just measuring variation in detectability or survey effort. This requires specialised survey design.

If the data are to be modelled and predicted to new unsampled sites:

o A set of environmental covariates (including measures such as land use, disturbance etc.) that are relevant to the species distribution, at a grain fine enough to capture the main drivers of distribution.

Sidenote: I cannot quite conceive what data at a coarse resolution (e.g. predicted distributions based on 50 x 50 km grid cells) mean for monitoring change. I can understand it at a finer resolution, where there is some match between how the species is responding to conditions relevant to it. At a coarser grain everything is averaged and smoothed, and the main observable relationships are with climate (not topography, soil moisture etc.); I don’t see how such smoothed responses will be useful for monitoring change (the signal will be too diluted).

Environmental data:

o If a globally coherent dataset is needed, this group will already know of the data. The limitations in the globally available data are:

Mostly climate is long-term averaged, not extremes or variation.

It would be excellent to have some global topographic data, such as a topographic wetness index (reporting water accumulation in the landscape) or elevation variability, to try to get some topographic influence (especially for vegetation) into the models.

Are soils data site-specific enough?

Should we be thinking about climate close to (or under) the ground, for ground-dwelling species? Ref: Kearney et al. (2014) microclim. Scientific Data, 1.

ANSWER:

Global climate time series at relatively fine spatial resolution (1 km).

Global land cover time series that allow pixel-based transitions over time.

ANSWER:

It depends on the EBV and how it will be calculated. Some of the data that may be needed include: taxonomic and nomenclatural data, species occurrence data (specimen and observational), gazetteer data (which could be used for automatic georeferencing or in data quality filters), environmental data (if ecological niche modelling will be used, as suggested by Soberón & Peterson (2009), AMBIO 38(1): 29-34), land-use data (for instance, to approximate real distributions from potential distributions), and population size and structure data (or other data from which population size can be derived), among others. Clearly, most data must indicate when they were measured, collected or observed in order to track changes over time.


Besides converting measurement units when necessary, sampling methods may also need some sort of standardisation when mixing species monitoring data from different projects.

Main sources of data: Catalogue of Life, GBIF, Living Planet Database.

ANSWER:

Taxon-specific definitions of populations and metapopulations

Repeated surveys of model species (in major ecological research groups) and model territories (nature reserves, large-scale permanent plot schemes)

ANSWER:

The citizen science community typically discusses ‘species populations’ in terms of species distribution or population abundance. Within this context, different projects support different data collection models. In bioblitzes (e.g., Lundmark, 2003, BioScience, 53, 329), scientists and the public come together to count biodiversity in a targeted location for a few hours or days. Other types of citizen science projects assume that individuals share data intermittently, either from the same location (e.g., Project Budburst’s regular reports) or from different locations (e.g., traveling birders in eBird). Differences in protocols reflect different research needs, as well as the preferences of diverse citizen science communities with different motivations to participate. Thus, from the citizen science data collection perspective, EBVs should be designed to be sufficiently flexible to support customization to local needs, while still being “standardized.”

There is a wealth of citizen science data on species populations; the SciStarter (http://scistarter.com/) database of 1,100+ projects would be an excellent starting point.

ANSWER:

Data o Reliable, consistent, global data on the selected species or group of species at the points in

time between which change should be measured (this is a huge challenge for most species,

and is where gap-filling will be necessary).

o For calculating distributions: presence observations with reliable spatial and temporal

reference, and possibly sample-based data

o For monitoring changes in ABUNDANCE: as above but MUST have sample-based data,

repeated counts at at least some duplicate locations over time.

o A good, evidence-based rationale for selecting a specific species or subset of species to

focus on (e.g., they are proven to be reliable proxies / umbrella species / keystones in the

community, they are especially sensitive to habitat loss or temperature change…)

o In order to make best use of the EBV - information on locations where a species may be

considered as invasive or problematic.

Standardisations

o Need to account for varying sampling effort in space and time. Detectability needs to be

modelled and this is even more challenging for getting reliable estimates of species

abundances. Units have to match. Taxonomic definitions have to match or be mappable.

Available data

o IUCN RED List species ranges, based on expert opinion and SDM models. As range polygons,

these encompass many areas where a species is not actually present, limiting their use for

spatial overlay. They also improve and tend to become tighter over time, and so have to be

used with caution when assessing change over time. Easily accessible online and usable

under agreed licence from IUCN.

o IUCN lists of species occurrence by country. (These do not always correspond to data

derived from the above-named polygons). Accessible online through a REST API, but only

returned for one country at a time.

o IUCN definitions of threat status for individual species. Changes in status between dates

(i.e., a species becoming more or less threatened, globally) are potentially very powerful

Page 33: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

when combined with data from non-IUCN sources such as those named below.

o GBIF archives of species observations and specimens, which are more or less

‘opportunistic’ and so have varying spatial accuracy and heterogeneous density in sampling

effort. Accessible online via a wide variety of user-specified queries – i.e., many ways in

which data can be tuned and filtered to a user’s needs. One very useful feature for

calculating reproducible EBVs - each query gets a DataCite DOI which ensures its

repeatability.

o Contributed data from iNaturalist, eBirds etc. which may or may not have been ingested

into GBIF. E.g., African elephant observations: http://www.inaturalist.org/taxa/43694-

Loxodonta-africana/map#4/-12.257/20

o Archived data from individual historical studies – often stored in inaccessible formats which

require digitizing and transformation to be suitable for submission to an archive such as

GBIF. However, some research councils are moving towards supporting data archiving in

accessible and interoperable formats. In addition, new developments at GBIF mean that

more historical and current data can be submitted – in particular, sample-based data:

(http://www.gbif.org/sites/default/files/gbif_IPT-sample-data-primer_en.pdf).

o Species-specific schemes or research projects, which often generate high-level estimates of abundance, residence etc. by combining individual observations, transects, drone flights etc. Sometimes only available as viewable maps online or in published journal papers. For individual species studies, the raw data are sometimes not easily available, so the estimates of abundance must be taken on trust, although the methods by which they are calculated might vary between studies.

o Citizen science species observation projects which do not yet contribute to national biodiversity networks or GBIF nodes. The best hope for collating and aggregating these data, and moving towards common proformas and standards, seems to be the emerging communities of citizen scientists coordinated by initiatives like the European Citizen Science Association or the Atlas of Living Australia.

o Animal movement data (e.g., radio tracking data) – if published by the collector, may be available from movebank.org or TOPP.org (Pacific predators specifically).

o Camera trap data. GBIF now offers best-practice examples on how to produce and publish derived observations from camera trap data, and some GBIF partners (e.g., in Spain) are publishing photographic data linked to Darwin Core records.

ANSWER:

This depends on the scale. At small scales, the data can come from quantitative monitoring schemes where we record in the field the abundance of all organisms belonging to an organism group, such as birds or butterflies. Such trend analysis has been performed, for instance, by the EBCC for quite some time, using methods such as TRIM.

At a large scale, such as a country or continent, we can use big data from GBIF. Their data, now 643 million records integrated from 15,000 databases worldwide, are opportunistic, covering all possible field protocols or the lack thereof. They can be used to derive relative “abundance from occurrence”, but the results cannot be directly validated in the same way as detailed studies directly linked to field measurements. Analysis of big data forms a “consistent methodology” of its own; in that sense it also fulfils the requirement of the definition above. My team works only at this scale.
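The TRIM-style trend analysis mentioned above fits log-linear models with site and year effects to count data. A minimal sketch of that core idea (not the TRIM software itself, and assuming a complete site-by-year count matrix with no missing values):

```python
def trim_like_index(counts, iters=200):
    """Yearly abundance indices from a complete site x year count matrix,
    assuming a multiplicative (log-linear) model with site and year effects,
    as in TRIM's 'time effects' model, fitted by alternating updates."""
    n_sites, n_years = len(counts), len(counts[0])
    a = [1.0] * n_sites  # site effects
    g = [1.0] * n_years  # year effects
    for _ in range(iters):
        for i in range(n_sites):
            a[i] = sum(counts[i]) / sum(g)
        for j in range(n_years):
            col = sum(counts[i][j] for i in range(n_sites))
            g[j] = col / sum(a)
    return [gj / g[0] for gj in g]  # index relative to the first year

# Two sites, two years; counts grow by 20% at both sites.
print(trim_like_index([[10, 12], [20, 24]]))  # -> [1.0, 1.2]
```

The real TRIM additionally handles missing site-year combinations, overdispersion and serial correlation; this sketch only shows why a consistent protocol over time is what makes such indices meaningful.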

ANSWER:

My particular expertise is in the estimation of species distributions from unstructured observations (and how they change over time). We use presence-only observations such as those stored in GBIF. The observation models for these data are now quite advanced: we convert presence-only data into detections and non-detections by inferring something about the data collection process from the other species observed at the same place and time. In principle this means that


existing GBIF data are useable, but the heterogeneity of these data is vast, and we lack any information about how the data were gathered. There are two needs for standardization:

o Metadata about how the data were collected. For example: what organisms were subject to a search? Was there a formal protocol? What spatial and temporal extent do the raw observations refer to?

o Standardization of how uncertainty is expressed in the modelled output.

ANSWER:

We basically need to answer: what is out there?

o For global analysis of the marine world: the European Ocean Biogeographic Information System (EurOBIS), the European Marine Observation and Data Network (EMODnet).

o For local analysis: the MSFD, OSPAR and ICES data collection frameworks, and impact studies.

When analysing data over larger geographic areas and longer periods, one of the essential standardizations is the taxonomical one. Make sure to use a standard species register: the World Register of Marine Species (WoRMS), the Catalogue of Life (CoL), the pan-European Species register (PESI). Some animals have 200 different names!

The second standardization is of the sampling methodology – for existing data collection frameworks before the sampling takes place, and for older data through metadata. The older data are of course essential for analysing trends.

Relevant links:

o http://www.marinespecies.org/

o http://www.emodnet.eu/
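The taxonomic standardization step described above amounts to mapping every verbatim name onto an accepted name from a register such as WoRMS, CoL or PESI. A minimal sketch, with a hard-coded lookup table standing in for a real register query (the synonym shown is for illustration only):

```python
# Hypothetical accepted-name lookup; in practice this table would come from a
# register such as WoRMS, CoL or PESI rather than being hard-coded.
ACCEPTED = {
    "Aurelia aurita": "Aurelia aurita",
    "Medusa aurita": "Aurelia aurita",  # original Linnaean name, now a synonym
}

def harmonize(records):
    """Map each verbatim scientificName to its accepted name, flagging misses."""
    matched, unmatched = [], []
    for rec in records:
        name = rec["scientificName"].strip()
        if name in ACCEPTED:
            matched.append({**rec, "acceptedName": ACCEPTED[name]})
        else:
            unmatched.append(rec)  # needs manual or fuzzy matching
    return matched, unmatched
```

Keeping the unmatched records separate, rather than dropping them, is what makes it possible to quantify how much of a dataset the chosen register actually covers.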

ANSWER:

Physically self-consistent global datasets. This means:

o i) systematic reanalysis of existing data to put it all on the same basis globally – in terms of its meaning for EBV calculations; and

o ii) filling the gaps to create a global coverage of needed data. To what extent does the data have to be “observed” vs “synthetic”? (cf. ideas from meteorology/climatology; ref: A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Edwards, P.N., MIT Press, 2010.)

Two or more independent sources of data are needed, in case of error in a single dataset.

ANSWER:

Data types:

o population genetics data (haplotypes) in space and time

o population biological cycle (e.g. size at first reproductive event, biomass, gonad size and biomass) in space and time

o metabarcoding data in space and time

o species abundance/biomass/productivity distributions in space and time

Standardizations:

o harmonized sampling design

o standardized data collection and processing (gear, sieving, collection of individuals, etc.)

o standardized data values (e.g. occurrence, abundance/biomass density values, transformations, etc.)

Available data:

o large aggregators (e.g. GBIF, OBIS, NCBI) for occurrence/abundance/genes

o large consortia and organizations (e.g. EMBOS, ICES, LTER), primarily for occurrence/abundance/coverage

o individual long-term datasets held in Biological Marine Stations, for any type of data falling within the “Species populations” class

Page 35: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

ANSWER:

Occurrence data. The best source may be monitoring data with information on presence of a species in space/time. There are unique advantages with such data compared to all alternative data sources. These are

o Large amount of such data is already available through large repositories in aggregated/standardized form

o Missing data can be collected relatively easily, e.g. with citizens, sensors, from the literature and public archives, or with supplementary protocols added to on-going monitoring programs

o Because of the small information content, occurrence data from different sources are easy to aggregate/standardize

In addition to monitoring programs, research projects may also deliver occurrence data. These could either be extracted from the literature, or from well-documented research projects. Many experimental and observatory infrastructures are building up project databases that capture and archive raw research data (e.g. http://www.nordgis.org/sites/home/, http://www.assemblemarine.org/, http://www.embrc.eu/). One could extract occurrence data from such archives. In addition, one could establish protocols and incentives for scientists at these infrastructures to collect such data.

Absence data. The absence of a species is usually not recorded in monitoring programs (with the exception of species-specific surveys). However, such information may also be inferred if monitoring programs are sufficiently regular and frequent.
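The inference of non-detections from regular monitoring described above can be sketched as building per-site detection histories from visit records. Scoring a zero for a species missing from a visit is only defensible when the protocol reliably surveys the whole community; all names and records below are illustrative:

```python
from collections import defaultdict

def detection_histories(visits, species):
    """Build per-site detection/non-detection histories from visit records.

    Each visit is (site, iso_date, set_of_species_recorded). A species absent
    from a visit's list is scored 0, on the assumption that the full community
    was surveyed on every visit (a regular, complete protocol)."""
    hist = defaultdict(list)
    for site, _date, recorded in sorted(visits, key=lambda v: (v[0], v[1])):
        hist[site].append(1 if species in recorded else 0)
    return dict(hist)

visits = [
    ("siteA", "2020-05", {"sp1"}),
    ("siteA", "2020-06", set()),
    ("siteB", "2020-05", {"sp1", "sp2"}),
]
print(detection_histories(visits, "sp1"))  # -> {'siteA': [1, 0], 'siteB': [1]}
```

Histories in this form are exactly what the occupancy models cited later in this report (MacKenzie et al.) take as input.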

Genomic data. Metabarcoding approaches seem promising for high-throughput inventories of community species richness. Potentially they can give an objective (i.e. human-independent) and highly automated (Davies et al. 2014) diagnosis for most, if not all, multi-cellular species in an environment (Handley 2015). These methods could be used to produce presence/absence data with high spatio-temporal resolution. But there are many scientific, technical and sociological obstacles that will need many years to be solved. Here are just some:

o Genomics is still expensive

o Monitoring programs and the experts running them usually have no genomics background

o Genomic data are not comparable with earlier time series

o Genomic data don't give any estimates of abundance

Abundance data. Abundance data would be desirable especially for regional/local investigations of species distribution. However, there are some principal obstacles to using abundance data for EBVs; these include that abundance data are (i) not standardized and hence difficult to compare and (ii) not accessible in aggregated form. However, if the spatial, temporal, and taxonomic scope is limited, then abundance data may become very useful. Problem (i) may then be addressed by using abundance data only for specific EBVs, e.g. for birds in Sweden (this should be possible at http://analysisportal.se). Problem (ii) may be addressed with national data infrastructures where such data are automatically aggregated and standardized (Gärdenfors et al., 2014).

ANSWER:

Data required:

o Abundance data

o Occurrence data

o Behavioural data

o Population structure?

o Life history traits?

Standardisations – these vary and are often taxon-specific, so there are likely to be a variety of protocols depending on the data in question; it is difficult to go into these here. Generic and obvious standardisations are maintaining consistency over time in the method and effort used, the study area surveyed etc., or applying correction factors if there has been some change in methods. In the development of new technology to monitor species, standardisation is paramount. For example, remote sensing and camera trapping technology changes so rapidly that it is very easy for monitoring to become incomparable in a matter of a few years.

Page 36: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

Data availability:

o LPI database www.livingplanetindex.org (abundance)

o GBIF www.gbif.org (occurrence)

o USGS Breeding Bird Survey (abundance)

o International Waterbird Census (abundance)

o Many more exist, and GEO BON has been collating a register of monitoring schemes and so should have a more comprehensive database of these

ANSWER:

I think that counting populations is obviously the best method if feasible (e.g. the annual count of the Southern sea otter in California), but it is valuable exclusively for very small populations of highly monitored endemic species. Indicators seem unavoidable (habitat and threats, as long as the ecology and sampled abundance in them are scientifically sound and there are enough data about them). Geo/bioinformatics seems a promise yet to deliver credible results, though there are mathematical models, some of which seem to be producing results. Visual transect counting seems to remain the main reliable method, as well as camera-based passive systems. I have followed more closely, though, developments in long-distance sound sources and bird sounds in passive acoustics, whose algorithms seem to be quite effective in large ocean zones (the DECAF project and related marine mammal counting work by Len Thomas & Tiago Marques). I am eager to learn about more scientific and IT developments on population censuses since, leaving aside the recent PAM developments, these have not been seriously updated since the early 2000s. I have seen unconvincing multiple use of surrogates without a serious correlation or cause-effect scientific base.

ANSWER:

Standardised fine-scale information on:

o What

i. Taxon concept / UID

1. Global taxonomic authorities (CoL)

2. Nomenclators (IPNI, ZooBank)

o Where

i. Geo-location (WGS84)

o Who

i. ORCID

ii. Identity federation services

o When (collected/observed)

o When (identified)

o Occurrence confidence level (time/authority/redundancy)

o Distribution confidence level

ANSWER:

Available data on a national, regional or sub-regional scale are very scarce, especially when it comes to species populations. Consistent information on species abundance and distribution over time is only available for certain species groups. In particular, these data are used for the estimation of populations and population trends within the Natura 2000 context, so at least on this level there is a certain degree of harmonisation on the national scale. Missing information on trends and historical data is a big problem.

Often, to estimate the range, data from a wide period of time need to be integrated, which makes it difficult to estimate population sizes and population trends over time. As input data for the estimation of populations, e.g. red lists (especially for animal species), bird monitoring (BirdLife etc.) and information from GBIF can be used. All these data sources are used in the estimation of population status for the FFH directive.

For the estimation of population trends, the integration of remote sensing information (e.g. based on data from the Sentinel missions) calibrated with local population data would be needed in order to get reliable data. In addition, consistent monitoring of species population sizes would be needed.

In addition, species populations and trends need to be defined relative to a benchmark, which is linked to a viable population size. This needs to be defined.

ANSWER:

Current population data are not centrally indexed. Facilities may hold and distribute selected population data pertaining to them, e.g. the Rothamsted series. There is (was?) a Global Population Dynamics Database (GPDD), to which some GBIF members made contributions in the past by collecting time series of population data; however, the site (http://www3.imperial.ac.uk/cpb/databases/gpdd) seems not to be functioning now. Indirect population data could perhaps be retrieved from the Global Biodiversity Information Facility (http://www.gbif.org), through a subset of the Darwin Core standard by TDWG (now the Darwin Core terms: http://rs.tdwg.org/dwc/) that allows density to be attributed to an occurrence, or from occurrence analysis – but that is as yet uncertain.
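The Darwin Core route mentioned here would attach a quantity to an occurrence via the organismQuantity and organismQuantityType terms. A sketch of such a record (the term names come from the Darwin Core standard; the values shown are illustrative, not from any real dataset):

```python
# Illustrative Darwin Core occurrence record carrying a density value.
# organismQuantity / organismQuantityType are standard Darwin Core terms;
# the quantity type string is an example, not a controlled-vocabulary value.
record = {
    "scientificName": "Parus major",
    "decimalLatitude": 51.98,
    "decimalLongitude": 5.66,
    "eventDate": "2015-06-01",
    "organismQuantity": 12,
    "organismQuantityType": "individuals per hectare",
    "basisOfRecord": "HumanObservation",
}
print(record["organismQuantity"], record["organismQuantityType"])
```

Records of this shape could in principle be aggregated into the indirect population data the answer speculates about, provided publishers fill the quantity fields consistently.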

ANSWER:

Quantification of EBVs

o Ideally, quantification of EBVs should be done using a systematic approach which simultaneously addresses three objectives:

1) meeting jurisdiction-specific policy and management needs for measuring biodiversity;

2) comparing estimates of biodiversity change among jurisdictions; and

3) aggregating these estimates across multiple jurisdictions, including at continental and global scales, to help assess progress towards global biodiversity targets (Turak et al., in prep.).

It is preferable to base this on conceptual (qualitative) models that represent how pressures affect biodiversity condition, how changes in condition can provide benefits by protecting or enhancing values (intrinsic, use and option) and lead to management responses that in turn affect pressures (see diagram above, modified from Spark et al. 2010).

Page 38: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

A stepwise process for doing this in Australian marine waters was developed by Hayes et al. (2015, Ecological Indicators 57, 409-419). An adaptation of their approach can be applied to any large region (containing nested within it bioregions of the world: Abell et al. 2008, BioScience 58(5), 403-414; Spalding et al. 2007, BioScience 57, 573-583; Olson et al. 2001, BioScience 51(11), 933-938).

Data needed for species distribution

o Following the approach shown in the diagram above, data for tracking changes in species distribution should ideally cover a large number of species distributed among multiple higher-level taxa and representing key ecological features in each ecoregion. A species distribution EBV populated by data from just one or a small number of species may not have any value in tracking changes in species distributions across any large area.

Data needed for tracking changes in population abundance

o Data for tracking changes in population abundance should ideally come from species that have important populations in key ecological features spread across ecoregions. Data from a small number of carefully selected species from each ecoregion may be sufficient for this purpose. Ideally these should include threatened species, iconic species (especially valued by the community, e.g. the koala in Australia) and common species for which there are good data.

What freshwater biodiversity data is available and where?

o The Freshwater Animal Diversity Assessment (FADA; Balian et al. 2008a,b) provides an overview of species and genera of selected animal taxon groups and macrophytes of the Earth’s inland waters for major biogeographic regions of the world. The raw data provided by the 163 experts who undertook the initial FADA is accessible through an online database (www.fada.biodiversity.be). Despite many obvious taxonomic and geographic gaps, and hence a need to collect more data (Balian et al., 2008b), FADA provides a much more detailed overview of freshwater biodiversity than had been available previously.

o The Freshwater Information Platform provides data and information on freshwater species and ecosystems across the world. Major components of this platform include the freshwater metadata journal and metadatabase, the freshwater biodiversity data portal, and the Global Freshwater Biodiversity Atlas.

Page 39: D2.2 Report of Workshop 1 - GLOBIS-B · 2016-09-14 · D2.2 Report of Workshop 1 Due-Date: M11 Actual Delivery: M12 Lead Partner: UvA ... potential solutions of providing the required

o The IUCN Red List of Threatened Species (IUCN, 2015) is a useful source of information on species distributions. IUCN is leading an initiative to assess the distribution, population size, ecology and global conservation status (i.e. the category of threat, defined by IUCN; see IUCN 2015) of all known described species of freshwater fishes, molluscs, crabs, crayfish, shrimps, dragonflies, damselflies and selected families of aquatic plants; the data for crabs, crayfish and shrimps are complete (Cumberlidge et al., 2009, 2014; Carrizo et al., 2013). The distributions of these species are mapped to individual river or lake catchments, which have been shown to be the most effective spatial units for conservation management and planning (Carrizo et al., 2013), using HydroBASINS (http://www.hydrosheds.org/page/hydrobasins). These complement other global assessments for all known freshwater species of amphibians, turtles, crocodiles, birds, and mammals (Stuart et al., 2004; Buhlman et al., 2009; Hilton-Taylor et al., 2009; Rhodin et al., 2010).

o Much biodiversity data remains difficult to access. A large number of smaller datasets and individual occurrence observations are not integrated into public repositories, even though these data may have been used in scientific papers.

o Custodians of freshwater biodiversity data have set up a data publishing infrastructure by making use of the GBIF Integrated Publishing Toolkit (IPT, http://www.gbif.org/ipt). This can automate the process of data publishing while allowing authors to retain full control of their data. BioFresh or national GBIF nodes (see http://www.gbif.org/participation/list for a list of participants and associated nodes) are able to provide assistance in setting up such a system, and often also have a central publishing infrastructure for those who do not have easy access to a server to run the IPT (e.g. http://data.freshwaterbiodiversity.eu/ipt/ for BioFresh). For datasets under construction or that cannot (yet) be released for particular reasons, we recommend documenting their existence in the freshwater metadatabase (http://data.freshwaterbiodiversity.eu/metadb/bf_mdb_help.php).

Population abundance data

o The Living Planet Index (WWF 2014) specifically includes a freshwater component – the Freshwater LPI – which provides a useful, and frequently cited, measure of change in the status of species over time, and the Index is regularly updated. The Freshwater LPI is based on a large number of species (3,066 populations of 757 species) that are selected based on several criteria to ensure data quality (population size measured over a minimum of two years; information is available on how the data were collected; the geographic location of the population is known; data were consistently collected by the same method; and data sources are traceable; Collen et al. 2009). However, the species used are all vertebrates (318 species of birds, 257 fishes, 120 amphibians, 35 reptiles and 24 mammals). Many of these species appear to be relatively common (since these tend to be the ones with enough data to fit the selection criteria), and in the case of fishes they include species that have been introduced outside their range. Hence the Freshwater LPI is taxonomically and ecologically biased, and LPI data are also geographically biased towards temperate regions (Collen et al., 2009).

ANSWER:

For most species beyond vertebrates there is no expert map; the best source of distribution data is natural history collections data, e.g. opportunistic GBIF-type occurrence points (there are many online portals, not all of which are served through GBIF); these may range over several decades but are rarely stratified or systematically collected, and coverage is very patchy. GBIF also now takes observation and abundance data, but the coverage of this is even more patchy. Although large-scale population databases have been compiled (e.g. the LPI dataset, the PREDICTS project), geographical biases remain and the LPI only covers vertebrate species to date. The LPI uses an explicit temporal comparison (though with shifting baselines) whereas PREDICTS currently uses an assumption of space-for-time substitution. Occupancy models downscale species distribution data to local abundance estimates, but the various models and techniques that have been proposed need more rigorous evaluation and comparison.

ANSWER:

EBV candidates for EBV class: species populations

This answer is organised around three questions: Which data would you require to quantify an EBV related to the EBV class ‘Species populations’, i.e. changes in species distributions and/or abundances over time? Which standardisations are needed for such data? What data is available and where?

Species distribution

Data required:

- An agreed taxonomy for the biological group of interest

- If available, a phylogeny for the group also

- Location coordinates of species observations

- Consistent minimum context data, e.g. date of collection, name of collector, institution owning the collection, spatial accuracy (estimation method), etc.

- Covariates characterising the detection method

- Data quality assertions systematically and consistently testing the logical validity of aggregated observations (e.g. spatial accuracy)

Standardisations needed:

- Community of practice process to achieve consensus acceptance of taxon naming and changes

- Choice of indicator taxon groups to initially focus EBV development on, and criteria for choosing (e.g. well known taxonomically, with consensus in the community of practice; well represented in databases across many countries)

- Darwin Core or other agreed information model enabling interoperability

- Community of practice agreement on data quality assertions implemented consistently by country/global data aggregators (with case studies demonstrating use and value-add)

- Estimation methods enabling different species observation methods to be standardised for presence detection (or covariates relevant to the taxon group, e.g. search time, time of day, cloud cover, life stage observed, etc.)

Available data:

- Presently, for global names, the GBIF backbone taxonomy; but limitations include different country preferences for names adopted in common practice

- GBIF – as the global aggregator of species observation data, effected through agreements with country-level aggregators (e.g. ALA, ALS, etc.)

- Community of practice data quality control standards (DataONE/Chapman/GBIF/ALA, etc.)

Population abundance

Data required:

- Minimum data as above, plus:

- A measure of abundance (and method)

- Covariates associated with the method of abundance measurement

Standardisations needed:

- Estimation techniques to enable different measures or proxy indicators of abundance to be standardised (e.g. Braun-Blanquet cover/abundance score vs dominance ranking, or direct measurement, etc.)

- Information model to enable aggregation or interoperability (Darwin Core?)

Available data:

- Individual researcher or institution assets (national and provincial government research, university research, industry impact assessments)

- Ecological survey and transect data and long-term ecological monitoring (e.g. TERN, Australia)

- Satellite vegetation cover calibration data (e.g. AusCover)

- Remote sensing of population aggregations (radar)

Population structure by age/size class

Data required:

- Demography data; mostly the above applies

Standardisations needed:

- As above

- Cohort modelling

Available data:

- Mostly as above, and largely held by individual researchers
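The Braun-Blanquet standardisation mentioned above is often handled by converting ordinal cover/abundance scores to percent-cover midpoints so that surveys using different scoring conventions become comparable. A sketch using one common midpoint convention (conventions vary between surveys, so the exact values here are an assumption that should be documented per dataset):

```python
# Illustrative midpoint conversion for Braun-Blanquet cover/abundance scores.
# The midpoints below follow one commonly used convention; any real
# harmonisation should record the exact scale each survey applied.
BB_MIDPOINT = {
    "r": 0.1,   # solitary individuals
    "+": 0.5,   # few individuals, <1% cover
    "1": 2.5,   # <5% cover
    "2": 15.0,  # 5-25% cover
    "3": 37.5,  # 25-50% cover
    "4": 62.5,  # 50-75% cover
    "5": 87.5,  # 75-100% cover
}

def bb_to_cover(score):
    """Map a Braun-Blanquet score to an approximate percent-cover midpoint."""
    return BB_MIDPOINT[str(score)]

print(bb_to_cover("3"))  # -> 37.5
```

The same pattern (an explicit, documented lookup per source scale) extends to dominance rankings and other proxy measures listed in the table.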

ANSWER:

The ideal dataset would be repeated measurements over time (e.g. every year or few years) of species abundances in a complete community (e.g. all birds) in an extensive network of sites with regional to global coverage. At the regional level a dataset that comes to mind is the Breeding Bird Survey of North America. For species distributions, atlas surveys at resolutions of 50x50 km and higher, repeated every 5 to 10 years, would be ideal.


Question 2: Which specific scientific/quantitative methods (e.g. statistics, models, workflows, sampling designs etc.) would you require to calculate temporal changes of species distributions and/or abundances from the raw data of observations?

ANSWER:

Huge array of possible approaches – too many to list here. 2 main approaches viz. species distribution models & direct

ANSWER:

For sampling designs, see above for some examples of requirements. For statistical methods, approaches accounting for detection probability are highly preferable, e.g. occupancy modelling for distribution and N-mixture models for abundance. Alternatively, generalized additive models or TRIM methods could be used whenever the sampling procedure remained consistent over time, although these are mainly designed for abundance indicators rather than an EBV per se.

ANSWER:

Distributions: an extensive range of environmental data (layers), AOO (area of occupancy) and EOO (extent of occurrence), temporal SDMs and GDM.
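The AOO and EOO metrics mentioned here can be computed directly from occurrence points. A self-contained sketch, assuming coordinates are already projected to a kilometre-based system (the IUCN convention uses a 2x2 km grid for AOO; EOO here is the area of the convex hull, computed with Andrew's monotone chain and the shoelace formula):

```python
import math

def aoo_km2(points, cell_km=2.0):
    """Area of occupancy: number of occupied grid cells times cell area.
    Points are (x_km, y_km) in a projected, kilometre-based system."""
    cells = {(math.floor(x / cell_km), math.floor(y / cell_km)) for x, y in points}
    return len(cells) * cell_km ** 2

def eoo_km2(points):
    """Extent of occurrence: area of the convex hull of the points."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:                      # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    area = 0.0                         # shoelace formula
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

pts = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(eoo_km2(pts), aoo_km2(pts))  # -> 100.0 16.0
```

Real analyses would first project longitude/latitude records and filter them with the data quality tests discussed in the next paragraph; both metrics are sensitive to erroneous outlying points.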

A standardized suite of ‘data quality’ tests by all biodiversity Data Publishers. This is currently my responsibility with the GBIF/TDWG Task Group 2 on Data Quality Tools and Services (see http://community.gbif.org/pg/pages/view/47182/tdwgdqig-task-group-2-tools-services-and-workflows).

ANSWER:

Climate change statistical models, land use/cover change models, gap analysis, predictive scenarios.

ANSWER:

There is a multitude of potential approaches and they very much depend on the data available. I think Map of Life has made quite some advances in this respect.

ANSWER:

In order to calculate temporal changes of species distributions and/or abundances from the raw data of molecular observations, the following would be required:

Workflows for taxonomic assignment of barcode and metagenomic sequences based on well-curated reference databases (such as BOLD, RDP, Greengenes, SILVA (eukaryotes), ITSoneDB and PR2/HMaDB);

Statistical tools for:

o multivariate comparative analysis of different geographical areas and time points, such as the METAGENassist package (ANOVA, PCA, etc.). In particular, Principal Component Analysis measures how similar different large samples (datasets) are, on the basis of certain variables such as their taxonomic composition. DESeq2 can be used to assess taxa that distinguish two samples.

o calculation of diversity indices such as the alpha-diversity index (Shannon index) by means of R packages such as phyloseq and metagenomeSeq.

o correlation or covariance analysis that allows relating species abundance or distribution change to other types of biotic and abiotic measurements. For example, it is possible to evaluate the microbial species that experience major changes in relation to the presence of a macroscopic species that perturbs the environment.
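The Shannon alpha-diversity index mentioned above is simple enough to compute directly from taxon counts (the R packages cited do this, among much else); a minimal sketch:

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxon counts.
    Zero counts are skipped, since lim p->0 of p*ln(p) is 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Four equally abundant taxa: H' = ln(4) ~ 1.386; one taxon: H' = 0.
print(shannon([25, 25, 25, 25]), shannon([100]))
```

Comparing H' between time points, as suggested in the answer that follows, additionally requires equalising sequencing depth (e.g. by rarefaction) so that differences in H' reflect the community rather than the sample size.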

ANSWER:

The data and methods needed to calculate temporal changes in species distribution and abundance would differ because the state variable is not the same.


o Temporal changes in species distributions: we need georeferenced information on species presence (and absence) on a regular basis over time and with a relatively high amount of spatial replicates (Aizpurua et al., 2015), then we can use dynamic spatial modelling approaches to map species distributions in space and to document how distributions changes over time (Royle et al., 2005; Kéry, 2011).

o Temporal changes in species abundance: we need quantitative information on species abundance in a representative sample of locations across the globe, but information on the geographical position of the samples is not explicitly required (for instance, the PECBMS programme in Europe, http://www.ebcc.info/pecbm.html, builds on national/regional information and does not explicitly make use of the geographical positioning of the common breeding bird sampling locations in the countries/regions). In addition, the number of spatial replicates is not as important as when documenting changes in species distribution. Here, temporal replicates within and between years are particularly important to deal with imperfect detection rates when estimating species abundance with trend analyses.

In any of the two cases, there is a need to address explicitly the issue of imperfect detection in biodiversity inventories with appropriate methods based on field data that are repeatedly collected over time (MacKenzie & Kendall, 2002; MacKenzie & Royle, 2005).

I feel a need to clarify which state variables are important in the frame of the EBVs, because the data needed and methods applied would differ:

o Species distribution

o Species abundance

o Species occupancy (MacKenzie et al., 2005) is another option

o Other(s)?

ANSWER:

Dynamic occupancy models for presence/absence data if same points are used through time (otherwise single-season occupancy models). For count data, binomial mixture models can be used. In both methods covariates of various types (e.g. spatial, environmental, etc.) can be inserted in the model formulation to better understand trends.

ANSWER:

One successful approach to calculate temporal patterns in species abundances was to create a regression model that could adapt to a wide variety of nonstationary, spatiotemporal abundance patterns found among a diverse set of species with different movement dynamics. Raw data consisted of the combination of observational data from species counts with a large suite of environmental predictor variables rich enough to characterize a wide range of local habitat preferences.

Developing a good model for relative abundance from these data required addressing the following challenges: (1) large numbers of zero counts as well as the rare, but important, large counts; (2) potentially complex relationships between local patterns of species abundance and environmental predictor variables; (3) spatially and temporally varying patterns of species abundance; and (4) relationships between abundance and the environment that vary seasonally and spatially (i.e., nonstationarity of response–predictor relationships).

To meet these challenges, a three-stage modeling strategy was developed emphasizing pattern discovery and prediction. First, the study extent was partitioned into several overlapping spatiotemporal neighbourhoods. Second, within each neighbourhood, species abundance was assumed to be stationary and a zero-inflated boosted regression tree (ZI-BRT) model was constructed to deal with zero-inflated data (challenge 1) and to estimate covariate effects (challenge 2). Third, a mixture-model framework (STEM; Fink et al. 2010, 2014) was used to aggregate overlapping local ZI-BRT models to produce ensemble estimates of abundance (challenge 3). The STEM ensemble naturally accommodates spatially and temporally varying predictor–response associations (challenge 4) (Johnston et al. 2015).
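As a toy one-dimensional analogue of this three-stage strategy (not the actual STEM implementation), one can partition an axis into overlapping windows, fit a trivial local model in each (here just a mean, standing in for the ZI-BRT stage), and average the local models that cover each prediction point. Window width and stride below are arbitrary illustrative choices.

```python
def ensemble_predict(xs, ys, x_new, width=2.0, stride=1.0):
    """STEM-style ensemble sketch: average predictions of simple local
    models fitted in overlapping windows that cover x_new."""
    lo, hi = min(xs), max(xs)
    preds = []
    start = lo
    while start < hi:
        # local "model" = mean of responses inside the window
        window = [y for x, y in zip(xs, ys) if start <= x < start + width]
        if window and start <= x_new < start + width:
            preds.append(sum(window) / len(window))
        start += stride
    # ensemble estimate = average over all covering local models
    return sum(preds) / len(preds) if preds else None

# Abundance rising along a gradient; overlapping windows smooth the estimate
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 1, 2, 2, 3, 3, 4, 4]
est = ensemble_predict(xs, ys, 3.5)
```

Because the windows overlap, the local models covering a point disagree slightly, and the ensemble average changes smoothly where the data are nonstationary.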

ANSWER:

We could model the alpha-diversity of a microbial community and check with statistical models whether it tends to increase or decrease.

We could also check the disappearance of some species (see http://www.cell.com/current-biology/abstract/S0960-9822(15)00614-4 for an example related to the human gut in non-urbanized countries)

ANSWER:

We use ecological niche models, such as Maxent, and statistical methods to calculate changes in species distributions.

ANSWER:

Sampling design: for the species data we need to understand variation in detection (which might arise from the ease of detection at each site or from survey effort). Designs that allow for this include repeat visits per site (within a relatively short time frame), multiple observers per site, multiple independent detection methods, or spatial subsampling of sites. Ref: MacKenzie et al. (2006) Occupancy Estimation and Modeling.

At each time step estimate the probability of occupancy / the abundance from a model that deals with variation in survey effort / detection – e.g. occupancy-detection model. From the results estimate change.

ANSWER:

Ideally a method that takes into account the temporal component of opportunistic species observations and can estimate changes in distributions over time. Occupancy models are a step in the right direction… Are there any tools that allow one to do this in a systematic way? Something similar to Biomod, ModEco or OpenModeller?

ANSWER:

That’s not my area of expertise, but if I understood the question correctly, the LPI seemed to use two different approaches to handle this (chain and linear modelling methods), but there are certainly other ways.

ANSWER:

Some wisdom can be harvested from reserve design research and the IUCN criteria, and their criticism.

ANSWER:

Following the goal that EBVs should be accessible “by any (appropriate) person anywhere”, I am answering this from the perspective of a citizen science volunteer. Dynamic maps (e.g., eBird, http://ebird.org/content/ebird/occurrence/) and other visualizations (e.g., Citsci.org) are excellent entry points into data analysis. Accompanying educational resources targeted to elementary, middle, and high school students, and to the general public, would explain what biodiversity research questions can be asked or answered through data display.

Beyond simple data display, descriptive analysis, and visualization, there is significant work needed to help citizen science volunteers and the general public engage in deeper forms of data analysis.

ANSWER:

As described above, sample effort, detectability, reliability and confidence limits on computed abundances must all be computed, but as this is not my area of specific expertise it’s not my place to advise as to which are the best ‘state-of-the-art’ methods.

ANSWER:

The EBV for species populations was defined in the EU BON / GEO BON workshop 2014-09-30/10-02 as “the relative abundance of a taxon in a place and time, measured through time using a consistent methodology”. We can visualize this definition as a multi-dimensional hypercube, a so-called OLAP (on-line analytical processing) cube, which has dimensions such as latitude, longitude, altitude/depth, time, and taxonomy. In the cells of the cube one finds relative abundance, such as the percentage of a taxon out of all taxa. Filling this cube from GBIF data is a rather easy cross-tabulation. However, data cleaning is a challenge which we still have not fully solved.
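The cross-tabulation described here can be sketched in a few lines; the record fields and the 1-degree spatial binning are illustrative assumptions, not the actual GBIF schema.

```python
from collections import defaultdict

def fill_ebv_cube(records, res=1.0):
    """Cross-tabulate occurrence records into an OLAP-style cube keyed by
    (lat_bin, lon_bin, year); each cell maps taxon -> relative abundance."""
    counts = defaultdict(lambda: defaultdict(int))
    for lat, lon, year, taxon, n in records:
        cell = (int(lat // res), int(lon // res), year)
        counts[cell][taxon] += n
    # convert raw counts to relative abundance within each cell
    return {cell: {t: n / sum(taxa.values()) for t, n in taxa.items()}
            for cell, taxa in counts.items()}

records = [(52.3, 4.9, 2015, "Parus major", 3),
           (52.6, 4.1, 2015, "Turdus merula", 1)]
cube = fill_ebv_cube(records)
# cell (52, 4, 2015): Parus major 0.75, Turdus merula 0.25
```

The taxonomy, altitude/depth and further dimensions from the definition would simply extend the cell key.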

ANSWER:

We have a reasonably well-defined set of methods defined in the Sparta package (https://github.com/BiologicalRecordsCentre/sparta), and a workflow for converting the outputs of such models into biodiversity indicators (https://github.com/BiologicalRecordsCentre/BRCindicators). Some of this workflow is scalable but we rely heavily on cluster computing.

ANSWER:

The approaches to be followed are:

o Univariate measures, tested statistically for differences in patterns (e.g. diversity and evenness measures tested by Kruskal-Wallis, PERMANOVA, ANOVA, AMOVA (for molecular data))

o Multivariate techniques (e.g. cluster analysis, non-metric MDS, PCA, CCA), tested for differences in trends by ANOSIM, PERMANOVA, ANOVA, etc.

o Complex models, linear and non-linear SDMs, etc.

o Sampling design depends on the question of the research, each time round.

ANSWER:

Statistics on ecosystem diversity may be used to summarize the temporal changes for a large number of species. This may include statistics on species range (biogeography), richness (alpha diversity), evenness (community diversity), and turnover (beta diversity) calculated from either occurrence data or abundance data (Chao A and Shen TJ 2009; Curtis et al 2013; Gotelli and Colwell 2011). In addition, statistical correlation approaches such as species distribution modelling (SDM) may be very helpful to define biogeographic ranges and estimate abundances of species from a sample of observation or abundance data (Guisan et al, 2013).

Workflows. However, since the necessary input data for any of these statistical calculations usually come from many different sources (e.g. literature, research, monitoring, public archives), one could employ workflows to prepare the input data sets for these statistical calculations, i.e. aggregate and standardize the data (Mathew et al, 2014; Ruete 2015). Workflows can also be employed to automate reiterations in statistical calculations, e.g. over a large number of input data sets or to explore parameter space.
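A minimal, purely illustrative sketch of the univariate statistics mentioned above (richness, Shannon diversity, Pielou evenness), as they might appear inside one such workflow step:

```python
import math

def diversity_stats(abundances):
    """Richness, Shannon index H, and Pielou evenness J = H / ln(richness)
    computed from a list of per-species abundance counts."""
    counts = [n for n in abundances if n > 0]
    total = sum(counts)
    richness = len(counts)
    shannon = -sum((n / total) * math.log(n / total) for n in counts)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness

# Two equally abundant species: H = ln 2 and evenness = 1 (maximally even)
r, h, j = diversity_stats([10, 10])
```

Beta diversity (turnover) would compare such per-sample summaries between sites or time points rather than within one sample.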

ANSWER:

Bayesian statistics

Generalised additive modelling

Geometric means

ANSWER:

A hunch rather than a response: all of them. And if more are invented/discovered, the better. What may not be useful at all to count populations of some species might become the best method to count a different one. I have the impression (I would need to review literature read in the past years) that there are constant innovations applied, surprisingly, to a given species by some genius (or freak). The most well-known system, GBIF´s widely used statistical sampling across space and time (including opportunistic data), seems to be criticized because of the lack of a serious policy on controlling biases in collection patterns. What needs to be in place is open data and VRE environments for “anybody” who comes up with a better idea.

ANSWER:

GBIF uses stratified, spatially-explicit statistical sampling by date of occurrence to establish baselines, both with opportunistic (i.e. occurrence) data and time-series abundance data. However, the method is hampered by collection-pattern biases. The two requirements we’d posit are: (a) explicit abundance data at the point of occurrence in biodiversity datasets and (b) either explicit absence data from distributions or catalogues/datasets of field sampling events linked to occurrence data. These are currently being considered within the TDWG standard for eventual uptake by GBIF.

ANSWER:

Linear mixed effects models where species are treated as ‘subjects’ (random effects) and trends are fitted as cubic regression splines. I would try to filter the dataset so that the species data came from sites that have been repeatedly surveyed or sampled over time. The sites might be treated as replicates within defined geographic regions (e.g. ecoregions) so that regional patterns in trend can be estimated. Pearce-Higgins et al. (2015) provide a neat framework in which the population growth rate log(Nt+1/Nt) is modelled rather than abundance N, by treating Nt as an offset in a loglinear Poisson regression model.

For some groups of freshwater animals there are excellent references for guiding sampling, e.g. Strayer, D.L. and D.R. Smith 2003. A Guide to Sampling Freshwater Mussel Populations. American Fisheries Society Monograph 8. Bethesda, Maryland.
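The growth-rate formulation mentioned above, modelling log(Nt+1/Nt) rather than abundance N itself, amounts to analysing a series of log-ratios. A minimal sketch (illustrative only, omitting the offset/regression machinery of the cited framework):

```python
import math

def log_growth_rates(counts):
    """Per-step population growth rates r_t = log(N_{t+1} / N_t)."""
    return [math.log(b / a) for a, b in zip(counts, counts[1:])]

def mean_trend(counts):
    """Average log growth rate; > 0 means the population is increasing."""
    rates = log_growth_rates(counts)
    return sum(rates) / len(rates)

# A population doubling every year has a constant growth rate of ln 2
trend = mean_trend([100, 200, 400, 800])
```

Working on the log scale makes a doubling and a halving symmetric (+ln 2 and -ln 2), which is why growth rates rather than raw counts are often the modelled quantity.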

ANSWER:

Methods to help account for bias in the quantity and spatial distribution of existing records. These need to be explicitly focused on opportunistic data types, as there is simply not enough structured, randomly-stratified data to account for most of the world’s biodiversity. Techniques to account for spatial bias in species distribution modelling would be useful here, and a well-developed modelling workflow such as BioVeL would be needed to help fill data gaps. Confidence as well as probability statistics for modelled surfaces would be needed, and global land cover datasets could be used to clip historical records to current vegetation.

ANSWER:

As an alternative to multiple species distribution modelling, we suggest a complementary community-level approach that can be applied in space and/or time (Blois et al. 2013, Ecography 36, 460-473; Blois et al. 2013, PNAS 10, 9374–9379) using generalised dissimilarity modelling (Ferrier et al. 2007, Diversity and Distributions 13, 252-264)

Review available biological data for use in spatial and temporal modelling by profiling relevant attributes (https://publications.csiro.au/rpr/download?pid=csiro:EP132549&dsid=DS2)

Using species observation distribution and abundance data (standardised for abundance); and covariates characterising differences in detecting presences (https://publications.csiro.au/rpr/pub?list=BRO&pid=csiro:EP102983), if possible

Generate Bray-Curtis dissimilarity using presences and/or abundances

Stratified random sub-sampling of site pairs to overcome known or perceived biases in ecologically representative sampling (Rosauer et al. 2014 Ecography 37, 21-32)

Potential to use detectability covariates as a weight on the response (and/or as explanatory variables) requires careful consideration (e.g. Mazerolle et al. 2005, Ecological Applications 15, 824-834)

A conceptual model of the relationship between compositional change in the distribution and abundance of species occurrences/co-occurrences related to biotic and abiotic drivers (e.g. Guisan & Zimmermann (2000) Ecological Modelling 135, 147-186) as relevant to each biological group, including effects of land use / management and climate change

The conceptual model determines the effort put into searching for, compiling or generating environmental, habitat and anthropogenic covariates as proximal predictors (explanatory variables) of natural patterns and change processes used in a statistical correlative model (Williams et al. 2012, International Journal of Geographical Information Science 26, 2009-2047) (nested analysis?)

Expert opinion on the ecological relevance and quality of candidate explanatory variables (http://www.biometrics.org.au/conferences/Hobart2015/talks2015/Wednesday/W_1430_Wed_RamethaaPirathiban.pdf) informs the choice and/or weighting of which variables to use (initial set of candidates) in the statistical model

The resulting modelling strategy is tested against the conceptual framework using variance partitioning (Jones et al. 2016, Journal of Biogeography 43, 289-300; Gibson et al. 2015, Records of the Western Australian Museum 78, 515-545)

This requires collaboration among experts in data aggregation, spatial analysis, covariate development, ecological theory, statistical methods (Bayesian and frequentist), and distribution modelling
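The Bray-Curtis step listed above has a simple closed form; a minimal sketch for a single site pair (the abundance vectors are invented for illustration):

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors:
    1 - 2 * sum(min) / (sum(a) + sum(b)); 0 = identical, 1 = disjoint."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

# Completely disjoint communities score 1; identical communities score 0
d_disjoint = bray_curtis([5, 0, 3], [0, 4, 0])
d_same = bray_curtis([5, 0, 3], [5, 0, 3])
```

In the workflow above this would be computed over the stratified random sub-sample of site pairs, with presence-only data reducing each vector entry to 0/1.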

ANSWER:

The answer to this question is highly contingent on the data available. But some control of observer bias and survey effort, e.g. using Bayesian statistics, is often required to control for sampling bias. To detect trends over background noise some statistical analysis for trend detection (e.g. regression) is also needed.

Question 3: How would you present and visualize the results of an EBV related to changes in species distributions and/or abundances over time? Which presentation and visualisation approaches would be informative to show the results?

ANSWER:

Graphical and tabular approaches can be very compelling if properly presented. User-specified real-time output generation from cloud-based models is also powerful.

ANSWER:

For both abundance and distribution: changes can be visualized using maps, graphs showing estimates from statistical methods or an aggregated indicator as a function of time, graphs representing the slopes of the temporal trend of changes with respect to latitude/longitude/taxonomic groups/drivers, etc.

ANSWER:

Graph of changes in AOO and EOO over time

Graph of changes in abundance over time for selected areas/species or total area

ANSWER:

I'd rather focus on the information to be held rather than the specific visualisation. I consider all real EBVs (as opposed e.g. to species traits, which are not fundamentally spatio-temporal and serve more as co-parameters for model generation) to be spatio-temporal data layers at the best possible scale and granularity. For most species in most regions of the world, these layers will be coarse (far coarser than e.g. soil mapping in many countries, and vastly coarser than remote-sensed layers). A species population EBV should represent in various forms the prominence of the species within the environment (either as measure of reproductive individuals within an area, or of biomass within an area, etc.). I actually see the species population EBV as a kind of partial pressure for the species within any environment. Species population EBVs may perhaps best be rolled up into a vector representing the partial pressure (as mass or individual count) of all species in a given area at a given time. Clearly any EBV is an attempt to use available data to model this numerical representation as precisely as possible.

ANSWER:

An online map tool will be very useful, with crossable layers over the map; these layers can be vectorised heat-map layers for variables and clusters of points for occurrences. Also, a scrollbar to manipulate the query over time would be useful for showing changes through the years.

Many examples can be found on the network web pages; the technologies selected should be defined from a software development quality-attributes priority tree. Some example projects:

o IABIN Threats Assessment: https://code.google.com/archive/p/iabin-threats/wikis/UserManualNavigationTool.wiki

o SIB Colombia Explorer: http://maps.sibcolombia.net/
o Canadensys Explorer: http://data.canadensys.net/
o Map of life: https://www.mol.org/
o Biomodelos: http://biomodelos.humboldt.org.co/models/visor
o Protected Planet: http://www.protectedplanet.net/

A dashboard with numbers and statistics is also useful for having a quantitative and graphic measure of the data. It can be built automatically from the stored data, e.g. http://tools.sibcolombia.net/dashboard. Another approach is to give a general measure of biodiversity status by establishing an index like the WWF Living Planet Index, which can be very valuable for stakeholders in the CBD.

Publishing infographics annually as a report or journal article will indicate the official status of biodiversity knowledge as well as the gaps in evaluating it; this is also valuable for the CBD, e.g.

o http://reporte.humboldt.org.co/biodiversidad2014/visualizador/102
o http://reporte.humboldt.org.co/biodiversidad2014/visualizador/103
o http://reporte.humboldt.org.co/biodiversidad2014/visualizador/211

ANSWER:

It depends on what you consider an EBV to be. While an EBV itself can be visualized with 3D graphs, there are no results from an EBV per se. EBVs are a sort of container to bring similar data together, which can then be further analysed, leading to indicators, which have various ways of being presented. Changes in species distribution have been visualized using CartoDB.

ANSWER:

Changes in species distributions and/or abundances over time from molecular data (DNA barcode or metagenomic sequences) could be visualized as:

Taxonomic diversity indices such as Shannon index;

PCA plot to show the taxonomic separation of species populations between geographical areas or time points;

Venn Diagrams describing the number of species shared among geographical areas or time points;

Area charts displaying relative abundances of the taxa inferred for the investigated samples corresponding to different areas or time points.

ANSWER:

Again it depends on the state variable we are interested in:

o Species occupancy: graphs that show changes over time in the estimated probability of species presence at any location randomly selected across the globe

o Species distribution: dynamic maps depicting the changing geographical distribution of the species across the globe, for instance the average (central) positioning of species distributions (in the same vein as this daily dynamic example: https://www.allaboutbirds.org/mesmerizing-migration-watch-118-bird-species-migrate-across-a-map-of-the-western-hemisphere/)

o Species abundance: graphs that show changes over time in the estimates of species population abundance at the global scale

ANSWER:

For species abundances, time series plots per unit/site. Also spatially explicit occupancy/abundance maps through time (movies). For some examples see

o http://wpi.teamnetwork.org
o http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073707
o http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002357

ANSWER:

To support the exploration of biodiversity data and dynamic distributions, a series of visual representations (and compositions thereof) is envisioned that aims to simplify the identification of correlations and changes in the patterns of abundance. The goal of this interactive visualization tool is to help ecologists and biologists learn from the embedded signals of model results and allow them to extract the most relevant pieces of information from the model analysis. The tool must enable the combination of different pieces of information through visualizations that present them in a succinct way. One example could be the use of variably coloured maps that show variation in abundance probability. Furthermore, by utilizing a new visualization abstraction such as Tag Cloud Lenses, the underlying data that influence the patterns of abundance (i.e., habitat) can be viewed in the same image. The visualization tool must support specific requirements for species distribution analyses, including the ability to drill down into specific regions while exploring correlations among different model parameters. Since searching for patterns requires a trial-and-error process, the tool should provide different mechanisms for comparing multiple visualizations side-by-side and for interactively manipulating these visualizations.

ANSWER:

Simple plots of alpha-diversity values across time (using boxplots if multiple samples from the same region are available)

Trends of microbial abundances over time

ANSWER:

We use tables, bar charts and GIS maps to present our results. I think that using dynamic GIS maps is an ideal method to visualize the results on screen but it is not suitable on printed paper.

ANSWER:

Ideally the mean change, and confidence intervals around it.

o Should changes be scaled in relation to the original value? E.g. does a change in probability of occurrence from 0.5 to 0.3 matter the same as one from 0.9 to 0.7?

Presentation and visualisation: depends on end use and quality of predictions to unsampled cells

o Methods for visualising uncertainty? (blurring, colour depth, increasing size of pixels?)

ANSWER:

Extremely difficult because you can potentially end up with a million maps or graphics.

ANSWER:

Personally I would be happy with a simple linear graph, with time in the horizontal axis and the EBV in the vertical axis. Of course it would be nice if users could select between different parameters, such as taxonomic group and geographical region and then get the graph instantly updated.

ANSWER:

A bit depends on the time scale – taxa known as present-day endemics could have been cosmopolitan in the Pleistocene, and a species that was expanding in the 17th century but not now is not invasive in modern times. Perhaps, for individual species, some animated maps with receding and expanding ranges, in human-altered vs. natural species dynamics – what is the baseline? It is trickier for groups of species, but also more meaningful. I would certainly add habitat loss and habitat-quality change layers.

ANSWER:

Map areas where species abundance / persistence has changed. In this context, an ‘area’ could be a pixel or a larger spatial unit (e.g., choropleth maps). Mapping allows overlay with, and comparison to, important contextual information such as MODELLED (i.e., theoretical ideal) habitat suitability, infrastructure developments, land use change etc. If the EBV can only be computed at a coarse spatial granularity, maps may still be made, but only if it is made very clear that coloured polygons do not represent a homogeneous state of affairs across space. In this context, graphs such as line graphs showing trends for a region may be more useful and less easily misinterpreted.

Visualisations such as maps or charts should ideally be interactive so that users can see how an EBV changes if computed for aggregated groups such as guilds, species with common life history traits etc.

The most important thing about presentations/visualisations is that they will need to give a clear and honest idea of the uncertainty in the EBV… which, given the gaps in data and necessary interpolation/inference, may be substantial. Unless the chosen EBVs are single-valued metrics with some uncertainty built in (e.g., metrics based on exceedance / deceedance probabilities), there is likely to be some variation in the values computed, e.g., best / worst case abundance levels over time. Usually this kind of uncertainty can be quantified probabilistically, which allows display of quantiles, confidence limits, standard deviations etc… but many end users and decision makers are not used to this type of display, and it is important not to overcomplicate the presentation and to allow users some control over the presented information so that they can explore the impact of uncertainty themselves. One nice web-based tool for visualizing uncertain mapped data in a clean and interactive way is the Greenland tool from 52North (https://wiki.52north.org/bin/view/Geostatistics/GreenlandExamples).

ANSWER:

Trend diagrams of relative abundance versus time. Multiple of these over a map. Also rankings of species, how many and which ones have increased/decreased most in different geographic areas. This can be visualised the same way as the “Consumer Reports” magazine ranks the most and least reliable products.
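The ranking idea described here can be sketched by fitting an ordinary least-squares slope to each species’ time series and sorting by it; the species names and counts below are invented for illustration.

```python
def trend_slope(series):
    """OLS slope of abundance against time index 0..n-1."""
    n = len(series)
    xm = (n - 1) / 2
    ym = sum(series) / n
    num = sum((i - xm) * (y - ym) for i, y in enumerate(series))
    den = sum((i - xm) ** 2 for i in range(n))
    return num / den

def rank_species(series_by_species):
    """Species sorted from strongest increase to strongest decline."""
    return sorted(series_by_species,
                  key=lambda s: trend_slope(series_by_species[s]),
                  reverse=True)

ranking = rank_species({
    "skylark": [40, 35, 30, 25],  # steadily declining
    "buzzard": [5, 8, 12, 15],    # increasing
    "wren":    [20, 21, 19, 20],  # roughly stable
})
# ranking[0] == "buzzard", ranking[-1] == "skylark"
```

Applying this per geographic area gives exactly the kind of most-increased / most-decreased league tables described.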

ANSWER:

Principally we use graphs of changes in occupancy over time, with measures of uncertainty. We are working towards spatially-explicit models that would be amenable to mapping.

ANSWER:

Within the EMODNET biology project we calculated several data products showing changes in species distributions over time.

o http://www.emodnet-biology.eu/data-products

ANSWER:

Simple statistical graphs, charts or plots

o Simple graphs are required when the change in biodiversity should be demonstrated over a given area or time interval

Maps in space and time

o Maps are required when the biodiversity change needs to be demonstrated over a spatio-temporal grid

ANSWER:

Results could be either visualized as GIS outputs, e.g. raster maps (e.g. Fig 3 in Gärdenfors et al 2014), or as graphs/tables showing range extensions and range shifts over time or space. Leidenberger et al (2015) provide an example of workflows that automate these calculations.

ANSWER:

It would depend on the audience, how much technical expertise we can assume and what is being communicated. Usually the best methods are the simplest e.g. a line graph showing an aggregated times series. Maps can be useful to illustrate changes in both distribution and abundance using categorised information or gradients by grid cell – e.g. categories of percentage change in abundance. Adding animation into either of these visualisations can also be useful to convey changing status through time.

Graphics that show disaggregations by taxonomic class or region can be really useful.

ANSWER:

Certainly what I find most useful is GIS dynamic space-time charts, as long as the lack of duration information in occurrence data is corrected. There does not seem to be an adequate method to systematically quantify, in an agreed form, observation data with historical data series (I would strongly back GEO´s efforts in this sense, as described in www.geobon.org/Downloads/brochures/2015/GBCI_Version1.2_low.pdf).

ANSWER:

Visualisations of data need to be developed in alignment with the needs of identified target audiences.

For instance, temporal changes in distribution of species could be presented in a multi-layer visual environment, where the additional geodata layers include habitat types and protected areas. Temporal distribution and abundance of species can also be presented as an indicator of the variation of the total genetic diversity, provided that the species are robustly linked to trait and genetic data from across repositories.

ANSWER:

Depending on the audience of the reports, the results of population trends can be shown in different ways. One option is to use a simple colour schema (e.g. as for the conservation status under the FFH Directive). Defining the classes of viable population sizes is a second step. Information should be given on both:

o The current size of the population (being viable or not)
o The temporal trend (increasing, stable, decreasing)

The trend signals should be backed up by absolute numbers (size classes of the population) where possible. An example of providing absolute numbers is https://iwc.int/estimate#table on whales.
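The colour-schema idea can be sketched as a simple classification rule; the 5% threshold below is an illustrative assumption, not a standard.

```python
def classify_trend(first, last, threshold=0.05):
    """Classify a population trend as 'increasing', 'stable' or
    'decreasing' from the relative change between two abundance
    estimates. The 5% threshold is an illustrative choice."""
    change = (last - first) / first
    if change > threshold:
        return "increasing"
    if change < -threshold:
        return "decreasing"
    return "stable"

status = classify_trend(1000, 870)  # a 13% decline -> "decreasing"
```

As the answer notes, such traffic-light signals should always be reported alongside the absolute population size classes they were derived from.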

ANSWER:

Either heat maps built on time-stratified occurrence data, or anomaly maps based on baselines. But they need to be able to incorporate the “duration” or “longevity” of occurrence data, which is largely absent from datasets; in fact, it is currently lacking.

Therefore, a proper method to systematically quantify observations that combine concrete areas with historical series is missing. In that sense, we agree that one of the key references on how to proceed and/or to build upon in this regard is http://www.geobon.org/Downloads/brochures/2015/GBCI_Version1.2_low.pdf, and we would like to encourage a fully open approach along those lines.

ANSWER:

Changes can be displayed via a website using an interactive dashboard which enables people to play the timeline to see changes based on database records filtered by species. One can also enable visualisation of EBVs on a map, showing changes over time. The attached example PowerPoint slide 1 shows a visual display of the cumulative number of individuals observed in protected areas in the Jervis Bay region of South Eastern Australia over the past 25 years. On the map in slide 2, once you click on play, the cumulative numbers at different locations in the protected areas will appear, with the data in the top left corner. If you click on the pause button you can stop it at any given date. The data displayed here can be replaced with population abundance indices for specified time intervals. The live link to the species occurrence records in the database would remain, but there would be an intermediate step of calculating indices and aggregating records into time intervals, e.g. years.

ANSWER:

The EBV results are the results; visualising its change over time requires an indicator to be developed that visualises these results. Changes in species distribution or abundance need to be relative to a common baseline and spatially explicit, showing relative gains/losses in different areas. Creating new indicators of existing EBVs is less interesting than developing indicators incorporating different EBVs. There are already well-established indicators of changes in species abundance over time such as the Living Planet Index or Wild Bird Index. These may not be ideal scientifically but they have a lot of currency politically.

ANSWER:

World map showing historical change in ecological similarity over time in composition/abundance between a baseline ‘natural’ occurrence pattern and a scenario representing detectable anthropogenic or climate change, after accounting for natural dynamics

Areas of statistical extrapolation or limited sampling coverage demonstrated, lightly ‘obscuring’ the map to identify those regions for which the data and/or the statistical method/explanatory data require improvement (e.g. Jones et al. 2016, Journal of Biogeography 43, 289-300; Gibson et al. 2015, Records of the Western Australian Museum 78, 515-545)

Graphical trends in change for each global ecoregion (or country or other relevant global reporting region) showing variance around the mean/median (using a box plot or series of frequency histograms) as a ‘poster’ pointing to a map of the ecoregions indicator

Need to make it relevant to each country as well as globally, using realms/ecoregions

Presence-only and abundance shown separately; demographics may need remote sensing calibration for change detection, where detectable from space (e.g. tree age class/cover); new sensors may increasingly enable demographic assessment for some species/populations/ecosystems

ANSWER:

Ideally we would be able to plot for each species a map with a grid of cells where there has been a significant increase/decrease of the population (or range) of that species. We could also plot analysis at the community level, such as temporal turnover of communities, changes in species richness, and others. The main characteristic of these maps is that they would be spatially explicit at resolutions of a few km, and would allow for spatial aggregation to sub-national, national or regional scales to provide indicators of biodiversity change.
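As a sketch of how such per-cell significance maps could be produced, the following assumes yearly abundance values already gridded into a (years × rows × cols) array and applies an ordinary least-squares trend test per cell; the array layout, the 0.05 threshold and the function name are illustrative assumptions, not part of any agreed EBV workflow:

```python
import numpy as np
from scipy.stats import linregress

def cell_trends(counts, years, alpha=0.05):
    """Classify each grid cell as 'increase', 'decrease' or 'stable'.

    counts: array of shape (n_years, n_rows, n_cols) with an abundance
    value per cell per year (hypothetical input layout).
    """
    n_years, n_rows, n_cols = counts.shape
    trend = np.full((n_rows, n_cols), "stable", dtype=object)
    for i in range(n_rows):
        for j in range(n_cols):
            slope, _, _, p, _ = linregress(years, counts[:, i, j])
            if p < alpha:
                trend[i, j] = "increase" if slope > 0 else "decrease"
    return trend

# Example: impose a clear decline in one cell, leave the other as noise
rng = np.random.default_rng(0)
years = np.arange(2000, 2020)
counts = rng.poisson(50, size=(20, 1, 2)).astype(float)
counts[:, 0, 0] -= 2.0 * np.arange(20)   # steady decline in cell (0, 0)
print(cell_trends(counts, years))
```

The resulting categorical grid could then be aggregated to sub-national or national units for reporting, as described above.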


Question 4: What are in your opinion the two most important research questions relevant for quantifying changes in species distributions and/or abundances over time (i.e. for measuring biodiversity change)? Why do you consider especially those research questions as particularly relevant? ANSWER:

Where and when is biodiversity changing?

Why is biodiversity changing and how can it be arrested? ANSWER:

Research questions: how do species respond to environmental/anthropogenic changes? Which species are the most sensitive or the most resilient/resistant to those changes? Is there consistency in the temporal and spatial patterns of species responses (i.e. synchrony, similar or different responses of species in time or space, etc.)? Where are decline or colonization rates the highest? If applicable, when did they peak?

Those questions are particularly relevant for understanding how biodiversity is impacted by environmental and anthropogenic changes, and will allow us to diagnose the current situation and to propose alternative policy and/or management actions to deal with these responses.

ANSWER:

What species? We can’t monitor all of them, so which are the most effective to monitor?

Which environmental variables? We need to relate species to environment and ideally, a consistent suite of environmental variables would be helpful.

Financial support for long-term monitoring will be based on recognised significance of EBVs and SOE reporting to environmental sustainability. Without ‘political’ support, an effective strategy for monitoring long-term changes is doomed.

Using SDMs and GDM for monitoring change. What methods are most accurate and robust for this type of modelling?

ANSWER:

For different species (or different species groups), what is the expected natural temporal variation in species abundance in a locality over time? In other words, how many measurements, at what frequency, do we need in order to separate noise from actual information on change?

To what extent can suites of co-distributed species serve as proxy measures for the expected abundance of less-easily recorded taxa, and how could such suites be identified?
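The first point, separating natural noise from actual change given a number and frequency of measurements, can be explored with a simple simulation; the lognormal noise model, exponential trend and the rough |t| > 2 significance rule below are assumptions for illustration only:

```python
import numpy as np

def detection_power(n_years, trend, cv=0.3, n_sim=2000, seed=1):
    """Fraction of simulations in which a log-linear trend is detected.

    Simulates annual abundance indices around an exponential trend with
    lognormal observation noise (cv), fits an OLS slope on the log scale,
    and tests the slope against zero with a crude |t| > 2 rule.
    """
    rng = np.random.default_rng(seed)
    years = np.arange(n_years)
    detected = 0
    for _ in range(n_sim):
        mu = 100.0 * (1.0 + trend) ** years       # true expected abundance
        y = np.log(mu * rng.lognormal(0.0, cv, n_years))
        x = years - years.mean()
        slope = (x * (y - y.mean())).sum() / (x ** 2).sum()
        resid = y - y.mean() - slope * x
        se = np.sqrt((resid ** 2).sum() / (n_years - 2) / (x ** 2).sum())
        if abs(slope / se) > 2.0:                 # ~ alpha = 0.05
            detected += 1
    return detected / n_sim

# A 5%/year decline with 30% observation noise:
print(detection_power(10, -0.05))   # power with 10 years of data
print(detection_power(25, -0.05))   # power with 25 years of data
```

Varying `n_years`, sampling frequency and `cv` in such a simulation gives a rough answer to how many measurements are needed before real change separates from noise.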

ANSWER:

● How has globalization been affecting species distributions? What are the patterns?
● What is the relation between the affected species, the ecosystem services also affected, and human health? ANSWER:

Which drivers and pressures dictate species distribution or abundance?

How are species distributions and abundances interlinked? That is, are there tipping points where abundances fall below a certain threshold (Allee effect) such that local extinctions follow? How fast is this process?

ANSWER:

Development of standardized protocols, from sampling to quantitative species assessment, so that the molecular data (DNA barcodes or metagenomic sequences) are as comparable as possible.

Selection of an unambiguous reference taxonomy. ANSWER:


Achieving a good trade-off between spatial and temporal replicates to jointly inform on species distribution (spatial replicates are important to achieve a good geographical coverage) and abundance (temporal replicates are important to deal appropriately with the issue of imperfect detection in estimation of species abundance and occupancy)

Incorporation of different sources of data (standardised from long-term monitoring programmes – e.g. http://www.ebcc.info/pecbm.html – and opportunistic from citizen-science programmes – e.g. http://www.eurobirdportal.org/ebp/en/) in the same analytical framework to make use of all available data in a statistically coherent manner

ANSWER:

What factors explain species temporal trends and/or changes in species spatial distributions? Are changes associated with changes in land cover/use, distance to human settlements, climatic variables, human activities (hunting, selective logging), etc.? Understanding what drives changes is essential to successfully manage populations.

ANSWER:

Data from well-designed experiments provide the strongest evidence of causation in biodiversity studies. However, for many species the collection of these data is not scalable to the spatial and temporal extents required to understand patterns at the population level. Only through broad-scale surveys possible with either citizen science or the emerging remote sensing and artificial intelligence techniques can these questions be resolved. Developing the analytical tools to extract patterns and trends from these data is difficult and requires a combination of pattern recognition (machine learning) and probability (statistics).

Scale is a very important aspect when thinking about research questions, because to study patterns in species occurrences begins with an understanding of the patterns of distribution, abundance, and movements of individuals. These patterns are driven by an interacting series of climatic, geological, ecological, and anthropogenic processes operating simultaneously across a range of spatial and temporal scales. Only by comparing these patterns across a range of spatial and temporal scales can we begin to identify the interacting role of these processes. Thus, data must be collected at fine resolutions over broad spatial and temporal extents, particularly for wide ranging species like birds.

The cost of collecting biodiversity data is enormous. Using citizen science to collect these data lessens the cost but introduces huge issues of data quality and sampling bias. More automated approaches through sound collection, radar, video, etc. are promising; however, extracting species information from these sources is only now showing encouraging outcomes.

If a species–habitat association does not vary across a wide geographical area, we can gather data within a limited spatial extent and make inferences and predictions well outside the area of data collection. When species–habitat associations change across spatial or temporal scales, as they often do, then making predictions requires a broader spatio-temporal perspective.

ANSWER:

Is the human gut microbiome losing diversity (as a consequence of Westernization)? Are environmental microbiomes also losing diversity (as a consequence of climate change, pollution, industrialization)?

Are specific microbial species undergoing extinction? ANSWER:

Long-term or periodical observation, including unified data collection methods

How does a change in a species or a group of species contribute to biodiversity change locally or globally?

ANSWER:

What are we wanting to use these estimates of change for? – this is important, because without knowing the objective, it’s easy to be vague about the required accuracy of the estimates. There is no point in modelling if the results are so uncertain that important trends are obscured, or so biased that the results are misleading / open to rebuttal.

o It is possible that there are several objectives;
o The data need to be sufficient for these objectives.
o Are management / conservation actions going to be triggered if declines are identified? Lindenmayer et al. argue that monitoring without action is like counting books while the library burns (Lindenmayer et al. (2013). Frontiers in Ecology and the Environment, 11, 549-555).

More practically: I believe detection should be accounted for. However, most existing data are presence-only (biases hard to identify; detection usually not identifiable at all). Methods are just starting to emerge for combining data sources coherently in a model – how robust are these across datasets? Examples of relevant methods:

o Hutchinson, R.A., Liu, L.-P. & Dietterich, T.G. (2011) Incorporating Boosted Regression Trees into Ecological Latent Variable Models. Proceedings of the Twenty-fifth Conference on Artificial Intelligence, pp. 1343-1348, San Francisco. – this is interesting because it’s an application of a machine learning method to occupancy-detection modeling. This is not one that combines data from different sources, but it does use citizen science data and makes assumptions about how to restructure it to infer detection. It is also interesting because machine learning methods might be more broadly applicable.

o Dorazio, R.M. (2014) Accounting for imperfect detection and survey bias in statistical analysis of presence-only data. Global Ecology and Biogeography 12: 1472–1484. – Interesting but not widely tested.

o Fithian, W., Elith, J., Hastie, T. & Keith, D. (2015) Bias Correction in Species Distribution Models: Pooling Survey and Collection Data for Multiple Species. Methods in Ecology and Evolution 6, 424-438. – uses a model closely related to the one above; doesn’t explicitly deal with detection, but could be extended to do so; the multispecies angle is interesting

o (there are others emerging too; most are within the point process modeling framework). ANSWER:

How to use opportunistic data to measure changes in species distributions and changes in species abundances?

Which taxa do we need to focus our efforts on at a global scale to measure change?

The first question is critical because opportunistic data is probably the only source of information that we have for most of the world. The second question is also important because we currently have approximately 1.2 million species described and we need to be representative but also practical.

ANSWER:

Methods to approach the real distribution of a species from primary occurrence data provided by networks such as GBIF, and how to handle new records collected/observed in a climate change scenario.

Best taxonomic groups to be used as indicators, given their distribution, data availability and ecological relevance.

ANSWER:

Identify the areas where biodiversity losses are higher than expected from changes in habitat quality, and identify the drivers of the losses.

Detection of approaching species and community extinction thresholds based on habitat-loss signals from the same unit area, taking into consideration the longevity and speed of life cycles of different species.

ANSWER:


In citizen science, changes in species distribution and/or abundances are often considered valuable for a) measuring changes in biodiversity with the goal of influencing conservation and other policies, and b) serving as indicators of climate change.

ANSWER:

How to quantify the uncertainty inherent in global estimates of distribution / abundance which have been derived from data which is limited and patchy in time and space. In particular, recognizing that the sample size/density necessary to monitor CHANGE is a lot larger than that needed to map presence at a single point in time. This is relevant because uncertainty can’t be ignored (it helps to drive better sampling) but must be presented to decision makers in a way which doesn’t encourage them to dismiss the whole EBV and its implications.

Verifying that the designed EBVs are actually capturing improvements in and threats to species status soon enough to be useful. Some backcasting could be used to test whether this is the case. This is relevant because if there is a lag in picking up the effects of habitat loss or improved management in the EBVs and the derived indicators, positive or negative effects may be attributed to the wrong driving factors.

ANSWER:

How do we validate the measures of “abundance from occurrence” which we derive from big data? ANSWER:

Understanding the relationship between species change and community change. Why?
o We focus on species, but there are too many to model effectively. For ecosystem services maybe many species are redundant, and community-level properties (e.g. functional diversity) are more important. Community-level properties (alpha, beta, etc.) may also be easier to estimate. We need some general principles for a) how strong the relationship is (between community metrics and change in the constituent species), and b) under what circumstances it is likely to be stronger or weaker. Then we will know when it’s reasonable to make shortcuts and extrapolations. I co-authored a recent paper (Oliver et al. 2015) that discusses elements of this problem.

Spatial scaling: from populations to species (or vice versa). For most species, our knowledge about their status comes from just a few sentinel populations. To model the dynamics of the whole species, we need to scale up, which is nontrivial. Currently we either use a simple extrapolation or use a metapopulation type approach, but we lack a good understanding of which approximations work, and when.

ANSWER:

Questions:
o What is the impact of major vectors of change (e.g. climate change, coastal development, pollution and eutrophication, maritime traffic, etc.) on biodiversity?
o How can we disentangle the effect of each of the vectors of change?

Reasoning:
o We need to understand the mechanisms through which biodiversity change occurs in space and time
o We need to know how we can organize our response (e.g. regulations) to halt change or, in the worst case, mitigate the impact or maximize adaptations to change ANSWER:

If we want to obtain a more holistic understanding of past and future changes in biodiversity,
o We need to improve the species data quality (i.e. the taxonomic coverage may not be enough) and quantity (i.e. the spatio-temporal coverage may not be enough)
o We need to find efficient ways to aggregate and standardize data from many different sources, including the literature, research networks, monitoring programs, citizens, and public archives.


ANSWER:

What is the baseline against which we are measuring change (does this vary between regions?)

Establishing the links between drivers and responses exhibited by species populations. The responses measured can be anywhere on the continuum from a behavioural response to a population decline to extirpation

Which species should we monitor – e.g. habitat specialists, sensitive but persistent species (to monitor long term response to pressures), a representative sample of species from all major taxonomic groups

Can / how can remote sensing be used to monitor species populations? ANSWER:

I consider it essential to monitor habitat change (land-use change) regularly through time, not only anthropogenic change but also (in particular for climate change) change produced naturally by geochemical/geophysical processes. Of course, invasive species should also be tracked on a regular, constant basis. I trust, perhaps too optimistically, that geoinformatics of ecological data (see Q1) is reliable enough, as soon as it includes reprocessing capabilities for stored big data to allow technological updating. Other alternatives for measuring over long time spans seem too costly. The counting exercise can only be done through regular snapshots (which rely excessively on funding, which can never be regular).

I wonder (I don't assert) whether systematic crowdsourcing of data through citizen science tools can provide data constantly over time that is reliable enough, or whether its efficiency is still only a dream.

Regular (every X years) transect counting, replicating the original effort made for the first assessment, still seems unavoidable, but it also seems clear that it is only applicable to quite threatened populations whose recovery plans are mandated by law and respected in the budgetary appropriations (even in Europe, no exercise has been done in 25 years to double-check the accuracy of the initial Natura 2000 population assessment efforts made in 1992-1997!).

ANSWER:

One of the key challenges is not only to monitor the spatiotemporal changes in biodiversity (number of taxa), but also to identify the underlying factors that drive these changes. These correlations can be robust and effective only if we can provide evidence at large scale.

Questions related to technical side (research infrastructures) ANSWER:

Changes in species distributions or abundance can be driven by intrinsic population dynamics, ecological interactions, or environmental change. Spatial correlation (especially in niche models) might possibly rule the latter in or out, but the first two are hard to separate. Therefore, research on how to distinguish those two categories of drivers (dynamics/interactions) is paramount in order to assess EBVs.

ANSWER:

Research question 1: Is there a theoretical basis for universal criteria for selecting higher-level taxonomic units (e.g. a range of phyla, classes, orders) and representing these with a subset of lower-level taxa (family, genus and order) to estimate population EBVs for different environmental realms (terrestrial, marine, freshwater) and different climatic zones (subarctic, temperate and tropical)?

Research question 2: Does the EBV class 'Species population' represent an emergent property of biodiversity? If so, is the accurate quantification of this property reliant on a particular combination of species distribution and population abundance information?

ANSWER:

Distinguishing a) genuine changes in distribution/abundance from changes in the underlying data or data availability merely revealing an increase in knowledge, and b) genuine change in species distribution/abundance caused by anthropogenic environmental change from genuine change in species distribution/abundance caused by intrinsic population dynamics and other biological factors.

ANSWER:

What are in your opinion the two most important research questions relevant for quantifying changes in species distributions and/or abundances over time (i.e. for measuring biodiversity change)?

Why do you consider especially those research questions as particularly relevant?

- Which species/taxon groups are the most suitable interim indicators (while targeting gaps in data aggregations)?

- It isn’t possible to model every species or every biological group. Decisions will need to be made as to which taxon/functional indicators can be supported ‘now’ for consistent global assessment by each country; and which may be applicable in the future with agreement on targeted effort.

- In environmental/ecological space, for global assessment, where are the gaps in biological survey or digital data aggregations, as indicators of indicator reliability?

- We need to be explicit about the choices we make as indicators and what the gaps or weaknesses are in their representation – we can do this statistically, using confidence intervals and spatially identifying areas that are poorly represented by the model; and we can do this using p-median type approaches (Faith & Walker 1996; Faith 2003; Faith et al. 2004; Ferrier et al. 2004; Funk et al. 2005; Manion & Ridges 2009) in environmental/ecological space of survey completeness, with strategic recommendations for data-driven and systematic gap filling.

- How well are we really linking theory and conceptual understanding about the ecology and distribution of species with the modelling method?

- See consistency with theory and empirical evidence (Austin 2002, 2007) as the underpinning of indicator development, and the statistical reliability of the modelling method as secondary (but demonstrating incremental improvement is important). That is, we should ensure ‘explanation and understanding’ are paramount, and the modelling method should not simply be driven by prediction mapping, but should also contribute to testing underlying theory, ultimately toward developing a more process-based/mechanistic approach.

- Which environmental variables, as covariates for species population occurrence, abundance and demographic dynamics, do we need to spend most effort on in order to minimise missing variables in mapping distributions and change; and how can we make the most of expert knowledge about the suitability and quality of candidate variables?

- There are more and more options for environmental variables, derived in different ways from mapping, remote sensing and models at different spatial and temporal scales. How soil properties and landform characteristics interact with climate at local scales has always been a limiting attribute for modelling species populations. What we think is proximal might be a poorly specified variable and, conversely, what we think is indirect and useless may actually be well correlated with distribution patterns (Williams et al. 2012, International Journal of Geographical Information Science 26, 2009-2047). While knowledge and models depicting physical soil-climate-land processes accrue, we should be cautious about a priori selection of explanatory variables, and instead use expert elicitation to generate informative priors about utility as weights in statistical models.

ANSWER:

I think the main issues are about how to increase monitoring programs, and not so much about “hot research” questions, but a few topics are currently important, including:

o 1) how to make use of opportunistic/casuistic observations;
o 2) how to best stratify monitoring efforts;
o 3) how to automate monitoring efforts (e.g. near remote sensing with cameras);
o 4) how to use proxies such as land-use cover and other biophysical variables measurable from space to predict changes in species populations.

Addressing question 1 would allow us to use a lot of data that has not been sampled systematically, including much of the GBIF data and citizen science data. Question 2 would allow us to improve monitoring schemes and design cost-effective schemes. Questions 3 and 4 are about automating the detection of biodiversity change.


Question 5: What are the key steps of a workflow(s) for calculating species distribution and/or abundance EBVs, starting from accessing the raw data to presenting a visual result? What are the complexities involved? What data preparation is needed? ANSWER:

Many possible approaches have been described in the literature, ranging from direct detection from RS through to SDM (see above).

Main challenge is avoiding developing categorical data – stick to continuous EBVs ANSWER:

Raw data should be uploaded together with a metadata file describing the important characteristics of the data set: name of the monitoring program, type of data (occurrence or count), time span of the monitoring period (start and end, or dates of first and last records), indicative location (e.g. national or regional level), sampling protocols (number of species and taxonomic group, sampling design and frequency, observation techniques (binoculars, satellite, microscope, etc.), etc.), as well as the name and contact of the data provider (name of the institute or staff member).

The data provider may also need to commit to data sharing agreements: either by agreeing to direct access to / download of the raw data, so that the uploaded data can be (re-)used and (re-)analysed by any other user, or at least to the use of the data set and the access to / download of the EBV calculations derived from the data set by other users.

Importantly, data preparation to fit a given format would be a prerequisite. The format of the datasets should match specific requirements detailed in guidelines describing, e.g., the minimum set of field categories and the information that is supposed to be recorded (species name, site ID, latitude-longitude, quantities (filled with 0/1 for distribution and integers greater than or equal to 0 for abundance, “NA” for no sampling or gaps in the sampling design, etc.), unit of the quantities (e.g. number of individuals, densities, catch per unit effort)), as well as a nomenclature for naming each field category, so that every dataset uses the exact same name for each field category. A thesaurus of species names and the geographical reference system (e.g. WGS84) should also be provided. Immediately after the upload of the data set, an automatic quality control could be performed to check whether the names of the field categories, their content, the names of all species in the data set, the geographical coordinates, etc. conform rigorously to the format guidelines. If any recommendation of the guidelines is not respected, the data set would not be considered for further analysis and would need to be modified accordingly. If the data set respects the guidelines, it could be allowed to go further.
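A minimal sketch of the automatic quality control described above, assuming a record layout, field names and taxonomy lookup that are purely illustrative (no such standard has been agreed):

```python
import math

# Illustrative field names only; the real guidelines would fix these
REQUIRED_FIELDS = {"species_name", "site_id", "latitude", "longitude", "quantity"}

def validate_record(rec, taxonomy, ebv_type="abundance"):
    """Return a list of problems for one uploaded record (empty = passes)."""
    problems = []
    missing = REQUIRED_FIELDS - rec.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
        return problems
    if rec["species_name"] not in taxonomy:          # thesaurus lookup
        problems.append(f"unknown species: {rec['species_name']!r}")
    if not (-90 <= rec["latitude"] <= 90 and -180 <= rec["longitude"] <= 180):
        problems.append("coordinates outside WGS84 bounds")
    q = rec["quantity"]
    if q is not None and not (isinstance(q, float) and math.isnan(q)):  # NA = gap
        if ebv_type == "distribution" and q not in (0, 1):
            problems.append("distribution quantity must be 0/1")
        if ebv_type == "abundance" and (not isinstance(q, int) or q < 0):
            problems.append("abundance must be an integer >= 0")
    return problems

taxonomy = {"Parus major", "Turdus merula"}
good = {"species_name": "Parus major", "site_id": "S1",
        "latitude": 52.3, "longitude": 4.9, "quantity": 12}
bad = {"species_name": "Parus majr", "site_id": "S2",
       "latitude": 95.0, "longitude": 4.9, "quantity": -3}
print(validate_record(good, taxonomy))   # []
print(validate_record(bad, taxonomy))    # three problems
```

In practice such checks would run record by record on upload, and the dataset would be rejected with the accumulated problem list if any record fails.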

Then, the user would be offered the option to perform some analyses themselves or not, depending on whether their interest is to make use of their data or only to provide them to others. Analyses could be performed through a simplified interface by choosing among statistical methods and visualisation tools proposed to the user. Applying these methods and tools for calculating and visualizing EBVs would consist of linking the data set to pre-written scripts or software that would have been made available, selected and reviewed by experts. The user might also be offered the opportunity to access other datasets already uploaded in the database for calculating and visualising EBVs for other species or places. The user could also download files (csv, text, etc.) with the outcomes of the analysis (EBV estimates, trends, maps, etc.).

ANSWER:

Availability of a consistent suite of data

Data Quality tests are the first and most important step to eliminate non-relevant data.

A consistent methodology for surveying abundance and distributions.

I leave the details of SDM and GDM steps to Jane Elith and Kristen Williams. It is no longer my expertise.

ANSWER:


For all data, the level of supporting evidence for the observation/measurement/trait should be known or estimated and data should be incorporated based on an appropriate minimum confidence level.

For all data, precision and accuracy should be understood for key dimensions (spatial, temporal, taxonomic, environmental) and data should be incorporated based on appropriate precision and accuracy for the model in question.

ANSWER:

Establish a data documentation and a quality assurance plan for the workflow.

Determine sources of data: research projects, involving other sectors such as health, hydrocarbons, defense, agronomy, etc.; biological collections; citizen science. This implies a cultural change for most, and capacity enhancement in these matters. Also, data use licences must be clear in order to avoid legal issues.

Integration: Once the data are published, the mechanism for updating and indexing the data should be clean and fully interoperable; that is why data standardization is so important.

Processing: data can be processed with the relevant variables for research. Here, computational capacity and the development of scripts are components that can make this step the most automated part of the workflow.

Analysis: Once the data are processed, they can be analysed to obtain information. In this step informatic tools are useful, but expert judgment determines what can be concluded from the analysis, and whether anything can be concluded at all or there is not enough information.

Visualization: Everything should be shown on the web, whether it is presented as a map, a table, statistical charts or a publication. A content management system, with some development and design work, will facilitate delivering this to the stakeholders.

ANSWER:

ANSWER:

The key steps will be in the front-end of the workflow. Issues of integrating data by standardizing reference systems and spatial and temporal coverage are the most time-consuming and may be problematic in historic data where contextual information is lacking. I feel the next most important issue is traceability between the evolving concepts of EBVs, the data and algorithms used to calculate these metrics, and evidence for the validity of the approach in the science literature. I think formal references to scientific evidence, validated data and algorithms are increasingly important when displaying the final metric to decision makers.

ANSWER:

For molecular data (DNA barcode or metagenomic sequences):

Step 1: development and implementation of a query system that integrates the information from different worldwide available databases/infrastructures. The central search criterion could be the name of the taxonomic class, possibly of the species, combined with other criteria such as the sampling geographical location or time;

Step 2: development and implementation of a system to manage any data submitted by the user for the investigated species. There should be a temporary or definitive data and analysis results storage system;

Step 3: Implementation in the infrastructure of tools and reference databases for taxonomic analysis.

Step 4: Implementation in the infrastructure of tools for comparative statistical analysis of presence / abundance of species in different geographical areas and time points.
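Step 1 of the above could be sketched as a thin adapter that composes search queries against an existing occurrence service; the example below uses GBIF's public occurrence API parameter names as one concrete case, while other infrastructures (e.g. sequence archives) would need their own adapters behind the same interface:

```python
from urllib.parse import urlencode

def build_occurrence_query(base_url, species, country=None, year=None):
    """Compose a search URL for an occurrence web service (Step 1 sketch).

    Parameter names follow GBIF's occurrence search API; the function
    itself is an illustrative adapter, not an agreed component.
    """
    params = {"scientificName": species}
    if country:
        params["country"] = country      # ISO 3166-1 alpha-2 code
    if year:
        params["year"] = year            # single year or 'start,end' range
    return f"{base_url}?{urlencode(params)}"

url = build_occurrence_query("https://api.gbif.org/v1/occurrence/search",
                             "Parus major", country="NL", year="2000,2015")
print(url)
```

A query system in the sense of Step 1 would fan such requests out to several back-ends and merge the results under a common taxonomy.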

ANSWER:

Accessing the relevant data at global scale is one of the biggest challenges, and it is of the highest importance to find a good balance between bottom-up and top-down approaches when facing this challenge. Top-down approaches such as GBIF implement an infrastructure and then ask countries/regions to share data to feed the system. These approaches are interesting because of their large spatial scale, but they may prevent one from easily controlling for important issues such as uneven sampling effort, data storage/sharing and mobilization among countries (Beck et al., 2014). Bottom-up, network-of-networks approaches such as the EuroBirdPortal (http://www.eurobirdportal.org/ebp/en/), which intend to connect different systems to each other, are initiated by the countries themselves and have the potential to provide information with a lower level of spatial bias, because sampling/recording effort and/or “sharing willingness” may be estimated in a more straightforward way. Reconciling the two approaches is a big challenge ahead.

ANSWER:

It depends on the data, but I will outline a workflow for presence/absence data (camera trap images or recordings).

o Import data from camera traps/recorder from a sampling season and ensure each image/recording has the following info: Project name, site name, date, time, spatial coordinates, species name, and other metadata (person identifying the image, sampling period name, sampling period dates, etc.).

o Basic data consistency check. For example, are all dates and times within the expected time frame? Are species names consistent across the data?

o For each species in the data set create a matrix of sampling points (rows) vs. time (columns) in days or any other meaningful time interval. Fill this matrix with 1,0, or NA depending on whether the species was seen at this point on this time (1), not seen (0) or the point was not sampled on this day. This matrix is the input of basic occupancy analysis. A matrix of number of sampling events of a species can also be created (defined as the number of times the species was sampled at a point at this time) for abundance analysis. User needs to determine what constitutes a sampling event (e.g. series of images that are at least 5 min apart in time). From this event matrix, point abundance analysis can be performed using binomial mixed models.
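The matrix-construction step can be sketched as follows (a minimal illustration with hypothetical function and argument names; a real pipeline would read camera-trap metadata from files rather than hand-built dictionaries):

```python
from datetime import date, timedelta

def detection_matrix(records, points, start, end):
    """Build a point-by-day matrix of 1 (detected), 0 (sampled but not
    detected) or None (point not sampled) for one species.
    `records` maps point -> set of detection dates; `points` maps
    point -> (first_day, last_day) sampling window."""
    days = [start + timedelta(d) for d in range((end - start).days + 1)]
    matrix = {}
    for p, (first, last) in points.items():
        row = []
        for d in days:
            if not (first <= d <= last):
                row.append(None)          # camera not active: NA
            elif d in records.get(p, set()):
                row.append(1)             # species detected
            else:
                row.append(0)             # sampled, no detection
        matrix[p] = row
    return matrix
```

Each row of the result is the input for a basic occupancy analysis; an event-count matrix for abundance analysis would be built analogously by counting qualifying events per day instead of recording presence.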

o If species observations are sparse (species detected at less than 5% of the points per sampling period) and detection probability is low (< 0.05), most models will have difficulty converging. In this case, combine species together, or do naïve analyses (without correcting for detection probability) without covariates.

o Choose a series of covariates that can be: spatial (value of covariate changes with space), temporal (value of covariate changes with time), or both (value of covariate changes with space and time). These covariates can be used to model occupancy/abundance as well as detection probability.

o Temporal analysis will require fitting a dynamic occupancy or dynamic binomial mixture model (for abundance). Software already exists for this (Presence, package unmarked in R, or TEAM’s Bayesian analysis using JAGS in R).

o Ensure that the model recovers patterns adequately by checking model consistency.

ANSWER:

Need a good computational infrastructure

Need standardization of computational analysis pipelines

ANSWER:

Getting and processing the raw data is the key step. At present, raw data are collected by different agencies with different standards and variable data quality, are distributed across different places, and lack data sharing.

ANSWER:

This is general without thinking about whether it needs to be automated / rolled out globally:

First think about what you are trying to achieve with the modelling and what sort of data that requires (don’t start with the available data; think first). Monitoring change is much more challenging than a one-off species distribution model, and – despite published examples to the contrary – I don’t view presence-only data as suitable.

Collect relevant species observation data. If this is not your own data, it takes some time to understand it – to get to grips with the survey design, to understand survey effort, to get a feeling for whether species identification is reliable, etc. Check whether the data meet the requirements for the sort of modelling that is appropriate. If predictions are to be made across landscapes, do the samples cover the main environmental gradients likely to be important to the species? Check for any errors in the data (terrestrial records in the sea; mismatch between textual descriptions and lat/long coordinates; records for riverine fishes on land; etc.).

Assess the coverage of the samples across the environmental and geographic gradients in the region of interest – this is relevant for understanding whether predictions to unsampled sites are likely to be well informed. Ref: Cawsey, E.M., Austin, M.P. & Baker, B.L. (2002) Regional vegetation mapping in Australia: a case study in the practical use of statistical modelling. Biodiversity and Conservation, 11, 2239-2274.

Gather covariate data, for both the observation process (detection) and the state process (occupancy/abundance) – want variables relevant to the species at a grain that represents those environments properly (e.g. aspect is irrelevant at coarser grain but may be important to the temperatures experienced by the species). Evaluate correlations between covariates and decide whether to reduce the covariate set, and how.

Fit a model (what method is going to be used for model selection? – big issue), test its fit and predictive ability (the latter e.g. with cross-validation). At this stage need to decide what minimum predictive ability is acceptable for monitoring change.

If required, predict occupancy / abundance over whole region.

Repeat at time 2. Question: do you use the same set of candidate covariates, or the same set of covariates selected in the final model at time 1? (Intuitively: the first.)

Calculate change. Include uncertainty throughout.
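The final step – calculating change with uncertainty carried through – might be sketched as below. This is a hedged illustration only: it uses naive occupancy (no detection-probability correction, which the answer above warns about) and a site-level bootstrap; all names are hypothetical.

```python
import random

def occupancy_change(occ_t1, occ_t2, n_boot=1000, seed=42):
    """Estimate the change in the proportion of occupied sites between
    two surveys, with a bootstrap interval over sites. `occ_t1` and
    `occ_t2` map site -> 0/1 occupancy at times 1 and 2."""
    sites = sorted(set(occ_t1) & set(occ_t2))   # sites surveyed both times
    rng = random.Random(seed)
    point = sum(occ_t2[s] - occ_t1[s] for s in sites) / len(sites)
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(sites) for _ in sites]   # resample sites
        draws.append(sum(occ_t2[s] - occ_t1[s] for s in sample) / len(sample))
    draws.sort()
    lo, hi = draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)]
    return point, (lo, hi)
```

A production analysis would propagate model-based uncertainty from the fitted occupancy models instead of bootstrapping raw site states.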

ANSWER:

The main complexities are:

o Data accessibility, which includes people not wanting to share their data.


o Data harmonization, which is related to data and metadata standards.

ANSWER:

Selecting a species or group of species from a taxonomy.

Fetching, providing or getting automatic suggestions (e.g. possible misspellings) for alternative names for the species.

Retrieving what you call “raw data” from each possible data source using all names.

Manually and/or automatically filtering retrieved data based on data quality, geographical, and/or temporal parameters.

Calculating the EBV.

Storing results, preferably with all data used.

Organizing and presenting results.

There can be lots of complexities depending on the details of the workflow. For example, EBV calculation may be as simple as calculating the extent of occurrence based on point data, but it may also involve complex ecological niche modelling techniques with pre- and post-processing steps. Data quality filters can be numerous. Additionally, many steps could involve human interaction, which could seriously affect workflow efficiency. And if the level of interaction in certain steps is too high, it may be necessary to either create intermediary databases/services or change the original data repositories to accept some specific tagging mechanism through web services, so that further workflow runs do not require duplicate work to review the same data.
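As a minimal illustration of the simplest EBV calculation mentioned above – extent of occurrence from point data – a minimum convex polygon can be computed with no external dependencies (a sketch assuming projected, planar coordinates, not raw latitude/longitude):

```python
def convex_hull_area(points):
    """Area of the minimum convex polygon ('extent of occurrence')
    around occurrence points, via Andrew's monotone chain algorithm."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        # z-component of the cross product (OA x OB)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]   # counter-clockwise hull vertices
    # shoelace formula for polygon area
    area = 0.0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```

In practice an IUCN-style extent of occurrence would be computed on an equal-area projection and combined with the quality filters discussed above.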

ANSWER:

Identification of the taxonomic, spatial and temporal scales where the best available biodiversity data are sufficient to make reliable distribution and abundance predictions. Taxon- and location-specific verifications and calibrations.

ANSWER:

I think that the workflow should begin with protocols for data collection and verification. At this stage, it will be important to collect the metadata required to determine legal and policy interoperability (perhaps a generic standard like Dublin Core could be expanded by drawing on other standards and frameworks, including the European Interoperability Framework (http://ec.europa.eu/isa/documents/isa_annex_ii_eif_en.pdf). Of course, all raw data sets without sufficient metadata will need to be re-visited, and new metadata added, before these data sets can be used.

Having accurate metadata to document things like data policies, the amount and type of PII included in a data set (if any), and the legal jurisdiction of data collection is a first step towards supporting interoperability. Automated matchmaking could be sufficient to determine key aspects of legal interoperability, for example by using information including national provenance, data/database structure, and licensing to determine whether two data sets are compatible from an intellectual property perspective.
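Automated matchmaking of this kind could start as simply as a licence-compatibility lookup over metadata records. The sketch below is purely illustrative: the compatibility table is a placeholder (not legal advice), and the metadata field names are invented; a real system would follow a standard such as Dublin Core.

```python
# Illustrative compatibility table: a frozenset of one licence means
# "compatible with itself"; two licences means the pair may be combined.
COMPATIBLE = {
    frozenset({"CC0"}), frozenset({"CC-BY"}), frozenset({"CC-BY-NC"}),
    frozenset({"CC0", "CC-BY"}), frozenset({"CC0", "CC-BY-NC"}),
    frozenset({"CC-BY", "CC-BY-NC"}),
}

def can_combine(meta_a, meta_b):
    """Rough automated matchmaking: two data sets may be combined if
    their licences appear in the (assumed) compatibility table and
    neither is flagged as containing PII, which triggers human review."""
    pair = frozenset({meta_a["license"], meta_b["license"]})
    if pair not in COMPATIBLE:
        return False, "incompatible licences"
    if meta_a.get("contains_pii") or meta_b.get("contains_pii"):
        return False, "PII requires human review"
    return True, "ok"
```

The PII branch reflects the point made below: some decisions should be routed to a human rather than resolved automatically.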

But, automated metadata matchmaking between data sets is not a complete solution. For some data sets, including those collected by citizen scientists for a specific purpose, the initial goals of data collection may be compatible with some forms of re-use but not others. This could be thought of as a form of political or ethical interoperability, and isn’t always given the same weight as legal and policy concerns. There should be some way for the creators of a data set to indicate (upon uploading) their preferences for re-use (is automated matchmaking OK, or does the system need to send the request to a human researcher for review?).

While all necessary steps of the workflow should be uncovered through the design process (see Q6), a few features that may appeal to citizen scientists include the ability to save an in-progress or completed analysis/ visualization, the ability to export a visualization with references to source data and metadata, the ability to work collaboratively, and the ability to ask questions of experts or others through a discussion feature.


ANSWER:

Data preparation:

o Collect the raw data (potentially from a range of sources).

o Identify duplicate data.

o Quality-check the spatial and temporal reference if present; filter data which cannot be pinned down in time and space.

o Look for obvious outliers or possible misidentifications using modelled species ranges, and handle these according to a consistent strategy.

o Investigate and interpret any flags which may indicate breeding / migratory / etc. status.

o Ensure that taxonomic definitions, units etc. match or can be transformed between the different datasets.

o Optional: If working with abundances, transform observations to inferred abundance where sampling effort / strategy is known.

o Perform all necessary transformations and harmonise the data.

Compute species distributions or abundance values / maps for specific time steps. A particular complexity here is identifying where an absence of records actually indicates absence or loss of the species.

Combine the values / maps to get an idea of CHANGE between the epochs. Here, a complication is that the uncertainty at both steps may drown out any actual change signal.

Generate clear, usable maps / tables / charts (ideally accessible via the web, with links to full clear metadata on how the computation was carried out, and links to the source data). Raw results should also be made available (along with metadata) so that users can present the results in their own chosen way.

ANSWER:

Change of distribution: Download data from GBIF. Clean it by merging synonyms and duplicate records. Choose meaningful variables, run principal component analysis (PCA), choose the right calibration area, and calibrate the ecological niche model. The main difficulty is to get modern environmental data layers, because WorldClim is now rather outdated. There are data gaps, for instance in Eastern Europe. Data management is a challenge in general. OpenModeller does not offer PCA, etc. Lots of technical hurdles and dealing with data gaps.

Change in abundance: Download data from GBIF. Clean it as above. Cross-tabulate it into an OLAP hypercube, aiming at a reasonable data density in each cell. Compute trends at various spatiotemporal resolutions. The complexity is to maintain performance when using hundreds of millions of records. Data cleaning at that scale is a challenge, because it can only be done automatically.
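A toy version of the cross-tabulation and per-cell trend step is sketched below. This is illustrative only: a production system would use a real OLAP engine, account for recording effort, and apply far more careful statistics than an unweighted least-squares slope on raw counts.

```python
from collections import defaultdict

def yearly_counts(records, cell_size=1.0):
    """Cross-tabulate occurrence records into a (grid cell, year) cube.
    Each record is a (lat, lon, year) tuple; cells are degree squares."""
    cube = defaultdict(int)
    for lat, lon, year in records:
        cell = (int(lat // cell_size), int(lon // cell_size))
        cube[(cell, year)] += 1
    return cube

def trend(cube, cell):
    """Ordinary least-squares slope of counts over years for one cell."""
    years = sorted(year for (c, year) in cube if c == cell)
    if len(years) < 2:
        return None
    vals = [cube[(cell, y)] for y in years]
    n = len(years)
    my, mv = sum(years) / n, sum(vals) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, vals))
    den = sum((y - my) ** 2 for y in years)
    return num / den
```

The cube structure is what makes "trends at various spatiotemporal resolutions" cheap: coarser cells or time bins are just re-aggregations of the same keyed counts.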

ANSWER:

In Sparta, we start with a set of records from multiple species that we believe to be recorded as an assemblage, by which we mean that records of one species can be used to infer absences of others (see Isaac & Pocock 2015 for a UK-centric exposition of the data issues). Sparta interfaces with a package called rnbn that collects data from the British National Biodiversity Network (the UK node of GBIF), and it would be trivial to link it directly with rGBIF. The challenge is to identify the assemblage, i.e. which set of records can be considered to be co-recorded. The rest of the workflow is well-defined in Sparta, but involves converting these records into a ‘visit matrix’ where each row is a unique combination of site and date. Our estimate of sampling intensity is the list length (the number of species recorded, following Szabo et al 2010). We use 1 km² and day precision, but coarser resolutions are quite possible. Each species’ detection history is appended to this matrix: the modelling thereafter is described in the scientific literature (e.g. van Strien et al 2013, Powney et al 2015). There are issues of spatial coverage and spatial autocorrelation that we have yet to master.
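The visit-matrix construction described above can be sketched in a few lines. Sparta itself is an R package, so this Python version is only a schematic with illustrative field names:

```python
from collections import defaultdict

def visit_matrix(records, focal_species):
    """Collapse raw records into visits (unique site + date combinations),
    recording the list length (number of species on the visit) and
    whether the focal species was detected. `records` is an iterable of
    (site, date, species) tuples."""
    visits = defaultdict(set)
    for site, day, species in records:
        visits[(site, day)].add(species)
    return [
        {"site": site, "date": day,
         "list_length": len(spp),                 # sampling-intensity proxy
         "detected": int(focal_species in spp)}   # focal species on the list?
        for (site, day), spp in sorted(visits.items())
    ]
```

One such matrix per focal species is then the input to the occupancy-detection models cited in the answer.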


ANSWER:

Knowing the fitness of the data for a specific question is essential. Select the data based on proper documentation. Raw data can only be re-used for other applications if properly standardized and quality controlled. For EurOBIS and OBIS we use taxon matching tools and spatial and environmental outlier detection in an automated system to assign quality flags to the records.

Relevant publication: Vandepitte et al. 2015, Database.

ANSWER:

Drawing on experience from Essential Climate Variables (Bojinski et al 2014) it is clear the process and methodology for generating EBV data products is multi-stage.

o Assembling the relevant raw data is the first step. In some cases this can be quite straightforward; retrieving relevant occurrence records from GBIF, for example. In other cases, less so; perhaps requiring additional processing and transformation of satellite images (for example) or establishment of new observing protocols.

o A second step could be concerned with adjusting the assembled data to account for heterogeneities in it; perhaps in the observing methods or to fill gaps where data is absent.

o Depending on the nature of the EBV, a modelling and correlation step may come next, leading to the EBV product itself.

o Thereafter, comes post-production quality assurance - a check necessary to ensure the uniformity of the product when compared to the same product calculated for a different place or at a different time or using different data.

o All of this has to be carried out in a standard, repeatable, open and transparent manner with clear and accessible documentation of each step, such that it can be subject to expert scrutiny and peer review.

o And finally, the EBV product needs to be updated regularly, perhaps in near real-time so that the information can be used as the basis for monitoring change.

ANSWER:

Key steps:

o Knowledge of scientific question, sampling design, sampling procedure
o Data management, including protocols, standards, etc.
o Data standardization/normalization
o EBV calculation method assessment and knowledge of its statistical properties
o Identification of the appropriate workflow
o Identification of the statistical package
o Identification of the web service which provides access to the package
o Assessment of the computational limitations of the offered web service
o Data uploading and massaging
o Execution of the EBV calculation
o Visualization of results

Complexities:

o An infinite number of complexities, from the design and sampling procedure through to EBV calculation. Unless every single one of the above steps is crystal clear, bias may come in at any stage and may have a considerable effect on our calculation and estimation.

ANSWER:

I’m assuming this question addresses what to do after the raw data has been collected and isn’t asking about how to go about doing the monitoring. The workflow depends on the data being used and the intended output. I’ll describe one example for abundance data.

Raw abundance data are log-transformed in order to make trends comparable between populations and to discount the size of the population units (if appropriate). This is essential if different types of population counts have been used, e.g. sightings per km of transect versus biomass.


Use an appropriate model to process the time series data. This will depend on the type of data and whether it is structured or unstructured. Generalised additive models can be used for longer time series, linear models for shorter ones (Collen et al 2009; Buckland et al 2005). Other models are used when using structured survey data (Gregory et al 2005).

Use ancillary information attached to the population data to disaggregate the results in meaningful ways and answer more specific questions.

The geometric mean is the most common method of combining abundances into a multi-species indicator. Usually each species is weighted equally, but this can be altered if there is an imbalance in the data set.
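The combination step can be written directly from its definition – averaging on the log scale (consistent with the log-transform mentioned earlier) and back-transforming. This sketch uses equal species weights; adding per-species weights would be a small change.

```python
import math

def geometric_mean_index(indices):
    """Combine per-species abundance indices (each relative to a common
    baseline, e.g. 1.0 in the reference year) into a multi-species
    indicator: average on the log scale, then back-transform."""
    logs = [math.log(i) for i in indices]
    return math.exp(sum(logs) / len(logs))
```

Note how one species doubling and another halving cancel out exactly, which is the usual argument for the geometric rather than arithmetic mean in these indicators.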

Complexities:

o Addressing bias in the data. Weighting can be applied to address under-representation of species or regions.

o Other issues of representation include how much of a global species population should be monitored, in light of the fact that we will always be limited in how much we can monitor.

o Other issues when monitoring abundance, particularly for migratory species, can be in understanding if a population has moved or shifted its range rather than genuinely decreased in numbers.

o Understanding what makes up a population – is it defined in an ecological sense, or does it refer to the geographical extent of a study site and all the individuals of a species located there?

ANSWER:

See answer to Q8 (the first 2 priorities). Certainly reliable metadata is the key issue.

In more abstract and general terms, my answer is simple. I would test whether the data management life cycle analysis of GEO (if they are adaptable to biodiversity data –which I am not so sure about) and others are well articulated or they are still wishful thinking. Too many people have been thinking about it not to take their outputs seriously.

ANSWER:

Species distribution:

o Name resolution and reconciliation
o Data (occurrence) harvesting from across resources based on taxon concept queries
o Fit-for-purpose evaluation
o Data cleaning algorithms (confidence level thresholds)
o Data aggregation (occurrence records)
o Projection over multi-layered environments
o Ad-hoc geo-correlation services

ANSWER:

When dealing with data from different sources, integration and analysis mainly face two complexities: a) changing taxonomy, and b) changing methods in the estimation of abundance and distribution. Information on the underlying methods applied, and the related uncertainties in the quantification, needs to be taken into account for the calculation. The estimation of the overall uncertainty is one of the important issues.

The main issue is to get data on a national scale for threatened species. While monitoring for certain species works well e.g. also using tracking mechanisms (e.g. whales or birds), for others it is more complicated to get consistent figures.

Issues to be addressed:

o Taxonomic reference
o Method comparison for estimation of abundance and distribution
o Estimation of the single and overall uncertainty
o Dealing with data gaps (both in temporal and in spatial terms)
o Data availability


Using provenance and quality information in automated workflows is an issue for the integration of information from different sources.

ANSWER:

The workflow will most likely be ad-hoc depending on the specific question being addressed.

Any workflow will include at least the following steps:

o Facilitate access to (meta-)data from reliable and identified resources;

o Data gap analysis to identify potential holes affecting distributions (see e.g. http://www.gbif.org/resource/82566);

o Data cleaning (see e.g. http://www.gbif.org/resource/80528) to detect suitable, fit-for-use data (http://www.gbif.org/resource/80623, http://www.unav.es/unzyec/papers/ztp720_postprint.pdf);

o Provide the proper e-Tools to also guarantee their (semantic) interoperability;

o Niche modelling (http://press.princeton.edu/titles/9641.html);

o Some type of visualization, ranging from the most basic (e.g. GIS) to complex network/relationship graphs, and their further integration into proper Virtual Research Environments (VREs) (see also Question 6).

An effective approach based on COMMON tools, as simple as possible (for example Python-based) but flexible enough to allow users detailed analysis, could be encouraged.

ANSWER:

Data preparation: having the source data evaluated to see if it is fit for purpose (https://www.youtube.com/watch?feature=player_embedded&v=eVYwt86mC_4Q). Complexity is in agreeing how to calculate the measure of the EBV mathematically so that it can be implemented in software.

Once that is defined algorithmically, I would suggest that the data be processed into a multi-dimensional cube (OLAP – see https://en.wikipedia.org/wiki/OLAP_cube) to enable interactive dashboards at multiple scales.

ANSWER:

There are likely to be many variations of a particular workflow, depending on the question being asked. The general approach needs to be scientifically well-established but also sufficiently flexible.

The existing BioVeL environment implements many of these steps already:

o Accessing, preparing and cleaning the data to remove erroneous records are fundamental steps. Conveying changes in editing existing data back to the data providers is also very important. For example, we routinely utilise GBIF data but often edit existing co-ordinates more accurately or add co-ordinates where there is detailed locality information but no geo-reference.

o For species distributions, ecological niche modelling is a widely-used tool for individual species, or by making species richness surfaces by stacking model outputs. Models run against different time-stamped datasets can reveal changes in a species distribution EBV for multiple species, if compared against an appropriate baseline.

Again, the interesting questions are not so much changes in particular EBVs over time but in inferring mechanisms and predicting future patterns by integrating approaches for different EBVs into shared indicators based on a common mathematical framework.

ANSWER:

I think the main issue here is the broad deployment of the Extended Darwin Core with Event Core. This extended core allows us to manipulate the structured data coming from systematic monitoring efforts, something that was not possible to do with the original Darwin Core. The next step will be to combine data from various sources, including opportunistic and systematic sampling, to generate species populations and distribution EBVs. This may also include the use of proxies such as land-cover and climate to expand from the sample points to a continuous surface (wall-to-wall monitoring).

Question 6: What is a suitable technical (ICT) approach to perform this workflow(s) for calculating EBVs (any place, any time, using data anywhere, by anyone)? What special considerations have to be taken into account?

ANSWER:

The workflow for calculating EBVs would ideally be accessible at any time by anyone. For this purpose, any data provider should commit to a data sharing agreement (see Question 5). Any user, whether providing data or not, would need to commit to another agreement for accessing, downloading or using data sets available in the database. This user agreement would specify e.g. non-commercial use of the data, EBV calculations or any outcomes arising from the use of the database. Besides, the user should commit to citing all the data sources that have been used, analysed or downloaded whenever the outcomes are published or communicated (citing the name of each data provider and the name of the scheme from which the data have been obtained).

ANSWER:

Abundance and distribution EBVs are unlikely to be properly determined unless specific project teams are appropriately qualified.

Obviously a standard workflow is required, with each step justified as efficient and robust (Best Current Practice).

While anyone, anywhere could use the approach, it is unlikely that a totally ‘canned’ approach could be optimal given current state of knowledge.

ANSWER:

This question is a distraction. Depending on the repeatability and necessary parameterisation of the models in question, ICT solutions may be based on workflow engines (where this is a suitable rapid exploration/prototyping approach) or on robustly implemented algorithms. Parallel processing technologies like Hadoop may be important. However, all of this is frankly relatively trivial once we determine what we hope to model and how those models relate to the source data.

ANSWER:

The workflow can be performed under different conditions, and the technologies can be adapted to a well-defined software architecture. Establishing an appropriate solution architecture to perform this workflow means that we can address the principal concerns according to the requirements. That also means that requirements gathering should be an exercise done with the best judgement.

ANSWER:

Data must come from several sources, and it should be possible to dynamically interlink different databases, make corrections, point out discrepancies, etc. We need to mobilize all data and not focus only on open-access data. This brings copyright issues.

ANSWER:

The data, contextual information and algorithms used in workflows will be widely distributed and heterogeneous in nature. There will need to be agreement on a number of standards to enable these resources to be brought together. I would expect these to include wrapping the resources as web services, with standard vocabularies to describe them when interrogated. How these services are then marshalled into a workflow can be carried out in any workflow engine that adheres to the service standards.

ANSWER:

Implementation of analysis systems in a Workflow Management System such as Galaxy or Taverna.

The systems should be usable even by non-experts and include a user-friendly interface.

ANSWER:


TEAM offers an analytical engine to perform occupancy analysis of camera trap data with or without covariates (wpi.teamnetwork.org). The analysis can also be performed by running R scripts for pre-processing and model fitting. In the near future we will have the capability of doing abundance analyses as well.

ANSWER:

The Creative-B document “D3.1 Comparison of technical basis of biodiversity e-infrastructures” documents the conceptual architecture (figure 26) for how applications, service logic, and resources can be stacked and interfaced. Workflows are incorporated into the framework as part of the service logic. What’s missing from that conceptualization, and would be useful for the computation of EBVs, is the association of technical standards or implementation technologies that are “GLOBIS-B recommended”. For example, the “Data Resource” component needs to have associated with it standards and technologies (for metadata, for catalog services, for discovery services, for access services, etc.) that are not just a laundry list of standards and technologies, but a constrained, vetted set: too wide a set of options, and it becomes useless. Technology implementers / system integrators appreciate a “sanctioned”, controlled suite of options given to them, with perhaps a reference architecture showing how one such combination of standards and technologies is used to implement a solution.

I feel that developing solutions at the conceptual level (like the conceptual architecture above, but supplemented with a small suite of “sanctioned” standards and technologies) coupled with a documented instance of one such implementation of the concept would encourage other EBV projects to try to adopt a solution that would be interoperable with each other, at least at various spots in the different implementations.

An example of a constrained set of options for standards and technologies has very recently (late 2015) been proposed by the US Group on Earth Observations, US GEO, which is the US body to the worldwide GEO. The data management subcommittee of US GEO has a draft version of the “Common Framework for Earth-Observation Data” (https://www.whitehouse.gov/blog/2015/12/09/improving-access-earth-observations), which should be finalized sometime in 2016.

ANSWER:

I don’t think any workflow for this purpose is available now. The quality and standard of the raw data have to be taken into account.

ANSWER:

(I am unsure about the meaning of this question.) “By anyone”? I doubt that it’s safe/possible to automate this to the extent that someone with no modelling experience could do it. Maxent (the species distribution modelling software) is a good example of something made available to non-experts that is then used with poor choices by many users (because it’s hard for newcomers to understand the nuances of the choices, and most people go for defaults and don’t understand the implications of their choices). It needs a reasonable level of commitment for someone to develop the expertise to run the models properly. I believe the models needed for change detection are a step more difficult and need trained people to run them.

Presumably some steps in data prep can be automated. Tools can be developed to report on available covariates. Modelling: E-bird (USA) is a good example of sophisticated modelling analyses applied to vast quantities of data. Has taken a team with considerable statistical and computing skills to set it up and to continually evaluate model output (with input from species experts). Hasn’t been left for others to run it (i.e. the analytical team is always there).

ANSWER:

Assuming that the whole process needs to be replicable by anyone, the workflow would need to interact with publicly accessible data repositories and not depend on specific hidden/private data from certain researchers/institutions. Time and geographic range could easily be workflow parameters if the underlying raw data contain these dimensions. Things can get trickier if you also want to handle incomplete raw data, such as occurrence data without coordinates, having only a description of a place in natural language. Another critical step is to handle uncertainty. For example, occurrence records with high spatial uncertainty would probably need to be discarded in local/regional scale calculations, but could still be suitable for continental/global scale calculations.
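The scale-dependent uncertainty filter suggested above could be sketched as follows. The thresholds and field names are invented for illustration and are not a community standard; records without a stated coordinate uncertainty are treated conservatively and dropped.

```python
def usable_records(records, scale):
    """Keep occurrence records whose coordinate uncertainty fits the
    analysis scale. `records` is a list of dicts with 'lat', 'lon' and
    'uncertainty_m'; `scale` is 'local', 'regional' or 'global'."""
    limits = {"local": 1_000, "regional": 10_000, "global": 100_000}
    keep = []
    for r in records:
        if r.get("lat") is None or r.get("lon") is None:
            continue                       # no usable coordinates at all
        unc = r.get("uncertainty_m")
        if unc is not None and unc <= limits[scale]:
            keep.append(r)                 # precise enough for this scale
    return keep
```

The same record set thus yields different usable subsets per scale, which is exactly the point made in the answer.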

ANSWER:

I think this question can only be answered through the process of cooperative (or at least user-centered) design. The GLOBIS-B team could begin by brainstorming a set of personas representing relevant stakeholders, including scientists, policymakers, and different types of citizen science volunteers. Then, users from each stakeholder group could be recruited to inform the design and development of an EBV calculation platform.

ANSWER:

Data preparation:

o Collection of raw data requires the user to be able to easily discover all the possible sources of data related to a particular taxonomic group. To some degree this is possible through catalogue searches but is still challenging.

o Processing and formatting the data requires that it is originally available in an easily-transformed digital format where each dataset has at least some common tags, fields etc.

o QA requires the user to have a clear idea of what constitutes a reasonable or impossible value/observation.

o Transformation of datasets may be computationally intensive.

o Most importantly, even if there are shared open-source libraries (e.g., in python and R) for performing the above tasks, there will always be an element of user parameterization and tweaking, meaning that potentially huge amounts of effort could be expended to produce inconsistent EBVs from the same datasets. The ideal would be that aggregation, quality checking, harmonization, catalogue harvesting / metadata publication etc. were all carried out before the user gets to the data. The agency which comes closest to doing this job at the moment is GBIF.

Computation of EBVs including gap-filling / inference / interpolation: if uncertainty in terms of detection / misidentification is to be quantified, some Monte Carlo simulations / random permutations of the data would be necessary. If positional accuracy is likely to be a problem, this should also be acknowledged and the impact of problematic observations assessed.

Computing change between time steps – probably the simplest step – plain maths or map algebra, though if uncertainty is taken into account more calculations are necessary to derive lower / upper bounds or quantiles.

Presenting visual results – recommend open interoperable web services for maps / standard formatted data (see below for how this can be done).

What special considerations have to be taken into account?

o Technical capacity of users, necessary investment in training / effort, access to data (or unit tests) for verification, testing and validation – accessibility of software (i.e., open source / freeware vs. corporate). Legal / copyright restrictions on component data, and whether these percolate through to derived products. Necessity to aggregate or obfuscate sensitive records. One big question – how will ‘any user, anywhere’ be able to get good advice on when the available data are too sparse or inaccurate for use in their chosen context?
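The change-computation point above can be sketched as plain map algebra plus Monte Carlo quantiles for the uncertainty bounds (the independent Gaussian error model and all values here are illustrative assumptions, not a recommended error model):

```python
import random

def change_map(ebv_t1, ebv_t2):
    """Plain map algebra: per-cell change between two time steps."""
    return [[b - a for a, b in zip(row1, row2)]
            for row1, row2 in zip(ebv_t1, ebv_t2)]

def change_bounds(ebv_t1, ebv_t2, sd, n_sim=1000, seed=42):
    """Monte Carlo 2.5% / 97.5% quantiles for the mean change, assuming
    independent Gaussian observation error with std. dev. `sd` in every
    cell (a simplifying assumption for this sketch)."""
    rng = random.Random(seed)
    cells = [(a, b) for r1, r2 in zip(ebv_t1, ebv_t2)
             for a, b in zip(r1, r2)]
    sims = []
    for _ in range(n_sim):
        deltas = [(b + rng.gauss(0, sd)) - (a + rng.gauss(0, sd))
                  for a, b in cells]
        sims.append(sum(deltas) / len(deltas))
    sims.sort()
    return sims[int(0.025 * n_sim)], sims[int(0.975 * n_sim)]

t1 = [[0.4, 0.5], [0.6, 0.7]]   # e.g. occupancy per grid cell, time 1
t2 = [[0.3, 0.5], [0.5, 0.6]]   # same cells, time 2
delta = change_map(t1, t2)       # per-cell change, e.g. delta[0][0] ~ -0.1
lo, hi = change_bounds(t1, t2, sd=0.05)
```
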

ANSWER:

Computations need to be performed in portals using OLAP.

ANSWER:


We rely heavily on cluster computing. Some of our datasets are reaching the limits of available RAM. We are also finding that many invertebrate groups lack sufficient data to work at 1 km² and date precision.

ANSWER:

[Note: I don’t understand why everyone should be able to calculate EBVs. It does require some skills.]

Virtual labs that make data and algorithms available through web services offer many possibilities; this is the approach taken by LifeWatch, BioVeL, and the Biodiversity Catalogue.

Overview of virtual labs for the marine world: http://marine.lifewatch.eu/

Statistical packages can easily harvest data from web services; we have built several interfaces based on R, RStudio, and Shiny.

A good, scalable infrastructure using OGC standard web services (WMS/WFS/WCS/WPS) is GeoServer. EMODnet makes all data products available as OGC-compliant web services.

The different data sources can be queried simultaneously.

o http://www.emodnet.eu/dataservices/

ANSWER:

Behind the above explanation lies a significant issue that has to be addressed early on by scientists and the potential end-users of EBV products. It concerns a fundamental choice between calculating an EBV data product on-demand, on-the-fly, versus a more periodic, systematic production cycle where EBV data products are produced, updated and extended, for example annually, quarterly or monthly.

o Simplistically, on-demand, on-the-fly production requires ready access to relevant raw data, and to the workflow and processing capacity to transform this raw data to the selected EBV product for the indicated place or area of interest (local, regional, national) at the timestamp of interest. Processing capacity “at the touch of a button” is necessary to service the instantaneous demand of the request (and of simultaneous requests). Size and complexity of requests are not known in advance (although this can be controlled by limiting geographical area and resolution). Repeatability is a key requirement: if the EBV is again requested on-demand for the same place and time, the same answer has to be delivered. EBV data production is ad-hoc, responding to demands of the moment, with the quality assurance checks built into the procedure. Archiving of the EBV products is not required.

o In the cyclical approach, EBV data production is systematic, aggregated over large areas (potentially, the whole globe) and archived as an ever extending database(s) of information to be queried to provide the data for the indicated place or area of interest (local, regional, national) at the timestamp of interest. Processing capacity can be estimated in advance. Periodicity of the production cycle for an EBV can be tuned to the available processing capacity and to the expected temporal sensitivity of that EBV. The information is generated once, archived and then available forever (or a set period of time) to be used and re-used as needed. The any time, any place requirement is met not by on-demand computation but by querying previously computed data products that have undergone a post-production quality assurance assessment.
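One common way to meet the repeatability requirement of the on-demand approach is to key every computation to an immutable snapshot of the source data and cache the result; a minimal sketch (all names and values are hypothetical):

```python
import hashlib
import json

def request_key(ebv, area, timestamp, snapshot_id):
    """Deterministic key: the same EBV + place + time + data snapshot must
    always yield the same product, so the key identifies all four."""
    payload = json.dumps(
        {"ebv": ebv, "area": area, "time": timestamp, "snapshot": snapshot_id},
        sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

_cache = {}

def compute_on_demand(ebv, area, timestamp, snapshot_id, compute_fn):
    """Compute once per unique request; repeated identical requests are
    served from the cache, guaranteeing identical answers."""
    key = request_key(ebv, area, timestamp, snapshot_id)
    if key not in _cache:
        _cache[key] = compute_fn()
    return _cache[key]

calls = []
result1 = compute_on_demand("species_distribution", "NL", "2016-01",
                            "gbif-2016-01-15", lambda: calls.append(1) or 0.42)
# Identical request: served from cache, compute_fn is not invoked again.
result2 = compute_on_demand("species_distribution", "NL", "2016-01",
                            "gbif-2016-01-15", lambda: calls.append(1) or 0.99)
```
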

ANSWER:

The first part of the question is not entirely understood.

Special considerations: unlimited computational capacity; transparency (leads to adequate repeatability of any observation and analysis)

ANSWER:

Several techniques exist to process abundance data, for example the software package TRIM and the method behind the Living Planet Index (Collen et al. 2009), both of which could probably be developed into an online platform for use by anyone. The former is used for abundance data from a standardised monitoring protocol for a set of species, e.g. birds or butterflies. The latter approach can incorporate abundance data from any species, method and unit of data.
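The Living Planet Index approach can be sketched as a chained geometric mean of interannual population changes (a heavy simplification of Collen et al. 2009, which also handles missing years, smoothing and hierarchical averaging):

```python
from math import log10

def lpi_index(populations):
    """Living-Planet-Index-style aggregate: average the log10 interannual
    rates of change across populations, then chain them into an index
    starting at 1.0. `populations` is a list of equal-length abundance
    series (a simplification of the published method)."""
    n_years = len(populations[0])
    index = [1.0]
    for t in range(1, n_years):
        rates = [log10(p[t] / p[t - 1]) for p in populations]
        mean_rate = sum(rates) / len(rates)
        index.append(index[-1] * 10 ** mean_rate)
    return index

pops = [
    [100, 90, 80],   # declining population
    [50, 50, 50],    # stable population
]
idx = lpi_index(pops)  # index declines, driven by the first population
```
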

ANSWER:

Combination of existing “official” data management systems (e.g. digitized national biodiversity inventories) with quality-controlled citizen-science-based apps, and adequate VREs familiar to biodiversity science actors. I do not have enough informed knowledge to describe them (with the exception of very specific marine species estimations). What I am sure of is that a cluster of very sophisticated policies concerning data flows and reuse across the whole data life cycle is an unavoidable development that necessarily has to be in place.

ANSWER:

The implementation of web processing services (WPS) with defined input interfaces including quality information will be a possible solution. The implementation of these standardised WPS could be done on different platforms.

Implementation of data services for species distribution and abundance data, including a semantic taxonomy mapping tool. Enhancing service-based availability of data on species is a prerequisite for the modelling approaches. As stated earlier, information on the quality and uncertainty of certain methods needs to be provided with the data. There is a limitation in the current data services, which either focus on spatial data services (e.g. species distribution maps) or sensor-based observations (e.g. a single species observation). Further development of the services for these kinds of data is needed.
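A semantic taxonomy mapping step could, at its simplest, combine an explicit synonym table with fuzzy string matching for spelling variants; a sketch (reference list, synonym table and similarity cutoff are illustrative assumptions):

```python
import difflib

# Align incoming species names with a reference taxonomy.
REFERENCE_TAXA = ["Parus major", "Passer domesticus", "Sturnus vulgaris"]
SYNONYMS = {"Sturnus vulgaris vulgaris": "Sturnus vulgaris"}

def map_name(name, cutoff=0.85):
    """Return (accepted_name, method) or (None, 'unmatched')."""
    if name in REFERENCE_TAXA:
        return name, "exact"
    if name in SYNONYMS:
        return SYNONYMS[name], "synonym"
    # Fuzzy match catches spelling variants above the similarity cutoff.
    close = difflib.get_close_matches(name, REFERENCE_TAXA, n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unmatched"
```

In a real service the "fuzzy" matches would be flagged for curation rather than accepted silently, and the uncertainty of the mapping would travel with the data as the answer above recommends.
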

ANSWER:

This has already been approached with occurrence data:

o See the GBIF portal (http://www.gbif.org) and developments built around it: for example WALLACE (http://protea.eeb.uconn.edu:3838/wallace2/). In general, web services and REST APIs able to extract subsets of data from large databases, already condensed according to criteria supplied by the user, will be the preferred method, as most scientists or practitioners are likely to prefer experimentation on the data (as opposed to final products such as ready-made niche models).

o See also the LifeWatch Marine Virtual Research Environment (Virtual Lab) developments (www.lifewatch.eu), whose common construction blocks are also being used for the implementation of the LifeWatch Freshwater Virtual Research Environment.

In general terms, these developments could be integrated (after being adapted accordingly) into the LifeWatch ICT distributed e-Infrastructure, as the European reference platform (ESFRI), in order to compose e-Services that offer these EBV values in a visual way through the development of proper Virtual Research Environments (VREs). All of this involves analysing in detail the requirements of the final users (“customers”: researchers, decision makers, environmental managers).

Therefore, all of this should be performed through the design, establishment, deployment and maintenance of a conceptual framework based on openness and Big Data paradigms; the LifeWatch e-Infrastructure is offered for this purpose.

ANSWER:

Use data warehousing techniques through the ETL (Extract Transform Load) process to aggregate the data and build the OLAP cube.
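A minimal sketch of such an ETL step feeding an OLAP-style cube (field names, dimensions and the aggregation are illustrative assumptions):

```python
# Extract: raw records as they arrive from different providers.
raw_records = [
    {"sp": "Parus major", "region": "NL", "year": 2014, "count": 10},
    {"sp": "parus major", "region": "NL", "year": 2014, "count": 5},
    {"sp": "Parus major", "region": "NL", "year": 2015, "count": 7},
]

def transform(record):
    """Harmonize one record (here: just canonicalize the species name)."""
    out = dict(record)
    out["sp"] = out["sp"].capitalize()
    return out

def load_cube(records):
    """Load: aggregate the measure along the cube dimensions
    (species x region x year)."""
    cube = {}
    for r in map(transform, records):
        key = (r["sp"], r["region"], r["year"])
        cube[key] = cube.get(key, 0) + r["count"]
    return cube

cube = load_cube(raw_records)

# Roll-up query: total abundance per year across all species and regions.
per_year = {}
for (sp, region, year), n in cube.items():
    per_year[year] = per_year.get(year, 0) + n
```
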

ANSWER:

Again, many of the individual steps in carrying out such a workflow have been tackled already, with several running sequentially e.g. in the BioVeL environment. Having such workflows open source, or making use of repositories of pre-written code, so that analyses can be adapted to particular research questions is an important factor in their successful implementation.


ANSWER:

As stated above, a combination of structured data in Darwin Extended Core and some statistical inference (e.g. correction for sampling bias, trend detection) will be the first targets. SDMs or habitat suitability models with remote sensing of proxy variables (e.g. land cover) may also be used to expand the data from the monitoring points to continuous surfaces.


Question 7: What are the technical options available and what is possible to achieve today or within the next 12 months? What data and/or workflows, software etc. are available today? Where is it and how can it be used?

ANSWER:

Global analysis using freely available remote sensing (RS) data is central – postage-stamp approaches of joining many local studies result in inconsistent output.

ANSWER:

Data available: see examples in Question 1.

Software / methods available for calculating or visualizing EBVs: TRIM (software), PRESENCE (occupancy software for distribution EBVs), or R scripts of occupancy models, N-mixture models, or visualisation tools that could be made available from publications or any expert contributor. QGIS and GRASS are open-source software that can support the mapping and visualisation of the EBVs.

ANSWER:

There are Data Publishers that have a good foundation of distribution data, e.g., GBIF, ALA, CRIA, BISON, SANBI etc.

Methods such as MaxEnt and GDM are well known and robust.

Workflows for SDM are widely available, e.g., R, BCCVL, BioVeL.

Methods for estimating abundance are well established.

ANSWER:

For data, GBIF is probably the most complete occurrence data pool and we should all work together to aggregate all possible data on occurrence and sample events in one place, and collaborate in data quality improvements to the whole.

Significant existing GIS and remote-sensed environmental datasets exist and should be made accessible through a consistent discovery and access catalogue.

ANSWER:

There are many tools and technologies that can speed up the development of this workflow:

o Standards such as Plinian Core and Darwin Core.
o Publishing tools, such as the GBIF IPT.
o Message queue technologies: Apache Kafka, Amazon SQS.
o Indexing technologies: Solr, Elasticsearch.
o Powerful relational and non-relational databases: PostgreSQL, MongoDB, Hadoop.
o Map technologies: Mapbox, CartoDB.
o Stats visualization: Kibana.
o And several data quality tools: http://community.gbif.org/pg/pages/view/39746/list-of-data-quality-related-tools-in-the-gbif-catalogue

ANSWER:

We are working a lot with the OGC EF and O&M standards at present to bring together the description of the origins, configuration and accessibility of environmental monitoring data. This is being used to deliver SOS web services. There are reference implementations (e.g. 52°North) of these standards which many groups are working with. I would like to explore how these standards could be applied to biodiversity (e.g. from GBIF etc.) to deliver not only species data but also the environmental context around them. There would then be many ways to assemble these services into workflows, from simple python scripts to Taverna-style systems.

ANSWER:

Data:


o As concerns DNA-barcoding data, a lot of resources are available online, such as GenBank or BOLD. In BOLD each barcode sequence is associated with a well-curated taxonomic description, the collection site and date, a picture of the organism, etc.

o As concerns metagenomic data, among the most used reference databases are RDP, Greengenes, SILVA, ITSoneDB and PR2/HMaDB. Sequences from previous metagenomic projects can be explored through various online archives such as EBI Metagenomics, MeganDB, iMicrobe, etc.

Workflows, software:

o For taxonomic assignment of DNA-barcoding sequences, some of the phylogenetic pipelines available online are SAP and RAxML. Taxonomic assignment is also available on the BOLD site, but it requires a preliminary registration and no more than 100 sequences can be analysed at a time.

o For taxonomic assignment of metagenomic datasets, some of the pipelines available online are BioMaS, QIIME, Mothur, MetaShot, Kraken and Sparta.

o MetagenAssist, DESeq2, phyloseq and metagenomeSeq are among the commonly used packages for statistical and comparative analysis.

ANSWER:

We can do occupancy analysis of camera-trap data on a massive scale now using TEAM’s wildlife picture analytics system. This will calculate population trends and combine them into a flexible biodiversity index (the Wildlife Picture Index). Within the next months we could also accommodate other sources of data (acoustic) and perform abundance-based analysis.
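The core of such an occupancy analysis can be illustrated with the standard single-season occupancy likelihood, here maximized by a crude grid search (production systems use proper optimizers and covariates; the toy data and grid search are only a sketch of the principle):

```python
def likelihood(psi, p, histories):
    """Single-season occupancy likelihood: psi = occupancy probability,
    p = per-occasion detection probability (no covariates)."""
    L = 1.0
    for h in histories:
        k, n = sum(h), len(h)
        if k > 0:
            # Detected at least once: the site is certainly occupied.
            L *= psi * (p ** k) * ((1 - p) ** (n - k))
        else:
            # Never detected: either occupied-but-missed, or unoccupied.
            L *= psi * ((1 - p) ** n) + (1 - psi)
    return L

def fit(histories, steps=100):
    """Grid-search maximum likelihood estimates of (psi, p)."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    return max(((psi, p) for psi in grid for p in grid),
               key=lambda t: likelihood(t[0], t[1], histories))

# Toy camera-trap data: 1 = species photographed in that survey occasion.
histories = [[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
psi_hat, p_hat = fit(histories)
# psi_hat exceeds the naive occupancy of 3/5, correcting for missed sites.
```
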

ANSWER:

The analysis pipeline can be standardized in the next 12 months.

ANSWER:

I don’t think any workflow for this purpose is available now.

ANSWER:

Analysts currently run these models using specialised software packages such as R (the free statistical software).

A relevant issue: there is currently quite a push towards “reproducible science” – e.g. https://zoonproject.wordpress.com/ . The ability to trace analyses could be excellent.

ANSWER:

First of all you need to choose what EBVs should be calculated and how (in many cases the same EBV can be calculated in different ways). You could start by listing the possible EBVs, assigning each one a rank of “importance/impact” and a rank of associated data availability, then list the possible ways to calculate them, assigning each way a level of complexity, to finally decide what can be done in the given timeframe. Scientists will also need to define which kind of data will be used. For instance, if the whole GBIF database is to be used, you may consider a specific partnership with them to build the new application directly on top of their database. On the other hand, if only specific parts of it are to be used, you may create a separate application, still with significant web service interaction to retrieve data and a local database to store results. An interface on top of that database could be used to display results. There are many possibilities for implementation, including workflow management tools and other software frameworks – it’s hard to tell at this point which would be the best options.
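The prioritization step described here can be sketched as a simple scoring exercise (the candidate list, scores and the scoring rule are all illustrative):

```python
# Rank candidate EBVs by importance and data availability, dropping those
# whose calculation is too complex for the given timeframe.

candidates = [
    {"ebv": "species distribution", "importance": 5, "data": 4, "complexity": 2},
    {"ebv": "population abundance", "importance": 5, "data": 2, "complexity": 3},
    {"ebv": "taxonomic diversity", "importance": 3, "data": 4, "complexity": 1},
]

def priority(c, max_complexity=3):
    """Simple score: impact x availability, zeroed if too complex to deliver."""
    if c["complexity"] > max_complexity:
        return 0
    return c["importance"] * c["data"]

ranked = sorted(candidates, key=priority, reverse=True)
```
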

ANSWER:

In addition to time-stamped species occurrence and static environmental data, it would be good to involve species interactions, physiological adaptation, and dynamic habitat loss & quality layers – if there are models ready to consume such data.


ANSWER:

Within the next 12 months it is possible to a) write the specifications for an EBV platform by working with different user groups, and b) in parallel, conduct a survey of major existing software used by biodiversity experts and technical experts.

ANSWER:

As stated above, GBIF is the agency currently performing many of the identified data preparation tasks. Software libraries and tools exist for most of the steps (commercial GIS, Quantum GIS, PostGIS spatial queries, R / python / Matlab libraries…) but the question is whether it makes sense for this data preparation effort to be duplicated, or whether it would be possible to set up a toolbox / framework for this workflow which could be shared. If so, python could be a useful language since it has many statistical, data manipulation and spatial libraries, and can optionally interact with ArcMap / QGIS where those are installed on a user’s machine. Technically, I think that at least a prototype workflow for data preparation could be produced in the next 12 months, though the scraping and discovery of all relevant input data is a big challenge.

Computation – Open-source libraries such as Sparta (https://github.com/BiologicalRecordsCentre/sparta) may be useful for this process. For identification and handling of problematic positional referencing, see e.g. http://onlinelibrary.wiley.com/doi/10.1111/j.1600-0587.2013.00205.x/abstract.

Presenting visual results – many accessible and interoperable ICT solutions are available, e.g., OGC WMS / WFS / WCS (tools like GeoNode, CartoDB and Mapbox have lowered the entry barrier), and, for raw / tabular results, REST services which return JSON or other easily usable formats that don’t require corporate software to visualize / chart.

ANSWER:

The Swedish LifeWatch Analysis Portal (https://www.analysisportal.se/) is a good example of what we need. It just needs to be scaled up to national, (sub-)continental, and global scales. They do not speak of EBVs, although they are computing something similar. EU BON is working on this.

ANSWER:

We are beginning to explore supercomputing options, including through Microsoft Azure.

ANSWER:

The choice - on-demand production or periodic production - is fundamental because of its implications for the way that production processes are defined, and for how infrastructures are organised and optimised for calculating, archiving and serving EBVs data. The choice has to be feasible, efficient, and affordable. Global cooperation is needed to ensure consistency, serving comparable raw data sets and processing capabilities for production and maintaining appropriate archives. The workflows for producing EBVs data have to be capable of being executed in any infrastructure, and from anywhere in the world. The choice raises issues for permissions to use primary data, for secondary data, for citation and attribution and for provenance tracking.

Now and within 12 months, on-demand calculation is possible using the BioVeL infrastructure and, for example an adapted generic ENM workflow.

o generic ENM workflow on BioVeL portal: https://portal.biovel.eu/workflows/440
o myExperiment: http://www.myexperiment.org/workflows/3355.html
o Documentation: https://wiki.biovel.eu/x/ooSk

ANSWER:

Technical options:

o Much of the infrastructure is already in place: e-infrastructures such as LifeWatch, BioVeL, iMarine and ViBRANT could serve as building blocks of the infrastructure required.

Next twelve-month targets:

o Registry of the e-infrastructures in place.
o List their technical specs and features.
o Deliver a plan by which the services they provide can be interoperable.

Data and/or workflows:

o Most of the data needed are cited above.
o Workflows and statistical software operational in the context of several e-infrastructures: BioVeL, LifeWatch, ViBRANT, iMarine, AquaMaps, etc.

ANSWER:

I could only provide a wild guess; I’d rather decline to answer. I have the feeling, though, that open reusable data (with appropriate metadata) are simply not available yet. Certainly the departure point is some already existing systems such as GBIF or OBIS.

ANSWER:

Currently the main sources of information will be GBIF on the one side and data from the European FFH Directive on the other. An issue with the estimation of population status and trends is that the underlying data are not provided together with the estimation. In addition, due to lacking information, part of the estimations are based on expert judgement backed up by in-situ data.

In the short term (next 12 months), the integration of information on status and trends as a compilation of estimations will be the most suitable procedure for a wide range of species. For some species (e.g. certain whale and bird species) estimation models already exist on a global scale.

Options: implementation of WFS/WCS services for species observation data; extension of SOS services for species observations (issue of complex monitoring schemas).

The availability of data to estimate species populations, including changes over time, seems to be a greater issue than the technical limitations. Nevertheless, the automatic integration of uncertainties along the chain of methods is still an issue to be solved.

ANSWER:

Taking up the Question 6 storyline, GBIF is possibly the most advanced already-available portal and is seeding a growing community of developers producing services based on its API. Niche-calculating services are also very useful as entry points. Other global or theme-specific datasets (e.g. OBIS) are equally important, and they are generally becoming increasingly interoperable.

In this regard, an integration of some of these relevant developments into the LifeWatch ICT distributed e-Infrastructure should be feasible within the next 12 months.

ANSWER:

The issue is more the source data, evaluating if it is fit for purpose and expressing the measure algorithmically. Any number of off-the-shelf software providers and open source solutions exist for data warehousing.

ANSWER:

See above (e.g. BioVeL). For species distributions most of the steps are already in place, save perhaps explicitly visualising these results in the context of change in a particular EBV. Analyses of change in species abundance are more constrained by available data; abundance data is more spatially and taxonomically biased, the analyses more complex and varied, but also potentially more informative of the causes of genuine changes in population abundance.

ANSWER:

There are a lot of challenges in doing anything in 12 months. I think the main goal should be mobilizing population abundance and atlas datasets into the Darwin Extended Core, and deploying a couple of apps that can perform statistical analyses on those data.


GLOBIS-B (654003)

Question 8: What are the top 3-5 technical challenges of supporting interoperable EBV calculations on a global basis? How can these be addressed and in what time period? Who has to do something?

ANSWER:

Funding

Suitable high resolution imagery (hyperspectral, lidar, hypertemporal)

Link between policy and space agencies

ANSWER:

Some of the main challenges would be:

o To define the data format guidelines in a way that they can be applied to any kind of monitoring (standardized survey and / or opportunistic data) and any taxonomic group (see Question 5).

o Once the abundance and distribution EBVs are clearly defined, to agree on robust and suitable statistical methods for calculating both distribution and abundance EBVs with respect to the requested data format and the heterogeneity of the monitoring.

o To define critical steps of the workflow as well as key linkages in order to implement it and make it operational.

How it can be addressed:

o Organising workshops.
o Engaging participative contribution of data owners / statisticians / biodiversity experts / technical IT experts.
o Learning and getting inspired by previous successful endeavours, e.g. the excellent publication by Barker et al. (2015) detailing a very efficient workflow for large-scale ecological data: Barker et al. 2015. Ecological Monitoring Through Harmonizing Existing Data: Lessons from the Boreal Avian Modelling Project. Wildlife Society Bulletin 9999:1–8; DOI: 10.1002/wsb.567.

Who has to do something:

o Statistical and biodiversity experts need to define robust and suitable methods for EBV calculation together, as well as providing software or scripts.
o Technical experts in database management / IT need to support the implementation of the workflow.
o Biodiversity experts and technical experts need to work together on defining the guidelines and the data sharing / data use agreements.

Time period: within the next 2-5 years seems reasonable.

ANSWER:

What species? They can’t be consistent internationally.

Lack of systematic data. Lack of consistent, internationally agreed systematic surveys of key/indicator/target species.

While not technical, it is the ‘political’ that is likely to be the most limiting factor in achieving effective abundance and distribution change monitoring. Long-term, ongoing funding is required for regular surveys.

ANSWER:

The patchiness and sparseness of data, especially across continental scales.

Lack of clarity around priority species for organising and delivering EBVs.


Absence of consistent modelling approaches for delivering at least best-available EBV data (or even a clear and consistent vision for a global modelled EBV component in e.g. GEOSS)

Lack of clarity around GEO BON's place in delivering the modelled data layer to sit between e.g. GBIF on the one side and IPBES and onwards (to CBD, etc.) on the other.

ANSWER:

By default the computational capacity is a challenge, but it can be overcome easily. Data quality issues are very important to address in order to have more trustworthy results; that involves georeferences for historical data that can be hard to determine. In my opinion, the most difficult challenges are actually social, since the culture of making data open is not always well received, so an assertive approach to data holders will be key to enriching the system content.

ANSWER:

Skill and training – without people with the skills and knowledge to deal with interoperability issues at a global level, this cannot happen. Research funding bodies need to address this (see the Belmont Forum report http://www.bfe-inf.org/document/community-edition-place-stand-e-infrastructures-and-data-management-global-change-research).

Standards development and adoption – providing globally interoperable resources requires agreement on standards. The internet is the obvious place to start, and many science communities now use it as the basis of their research collaboration. This will have to run through to agreement on vocabularies and ontologies to link information together. Existing standards cover some of this, but there needs to be some mechanism / governance to drive their adoption within day-to-day work.

Data policy – there clearly needs to be openness in the availability of data (and algorithms) to support workflow operations. Legal frameworks for this exist but are not always enforced, as they conflict with researchers’ expectations of “ownership” of the data they have created. This is still a big cultural issue that needs to be addressed at research agency level within legal jurisdictions and through scientific rewards within scientific journals.

Long-term funding for informatics R&D and systems operation – researchers and decision makers will not trust systems that have uncertain lifespans. We cannot convince researchers to entrust data to systems which have no long-term funding. If decision support systems are seen as important for dealing with environmental challenges, they must be seen as part of national / international infrastructure. The stability required to establish these systems and ways of working cannot be achieved through a series of 3 to 5 year research grants. There needs to be rolling funding and review of what is essential infrastructure for the development of essential biodiversity indicators – international agreement required between funding bodies(??) – see Belmont or RDA??

ANSWER:

Data format: the informatics format of data to be correlated can be highly variable. First of all, unique formats must be selected for each type of data. Then the infrastructure should be designed to manage and integrate these formats.

Access to data: it would be necessary to define whether access to the data and tools available in the infrastructure is unrestricted, restricted to certain users, or subject to a simple registration.

The data transfer protocols must be safe: for example, a user who submits data does not want them to become public.

A storage system should be implemented and the following questions should be addressed: which data should be kept, and for how long?

It could be useful to investigate other bioinformatics infrastructures (ELIXIR, LifeWatch, BioVeL, etc.) and platforms already available for the storage and analysis of molecular biodiversity data.


ANSWER:

Ensure that data collected under different protocols are standardized in some way and weighted accordingly.

Platforms to share databases in a federated way, so that data can be easily accessible on one site (for example see wildlifeinsights.org for camera trap data).

Automation of analyses is a challenge; model building and construction require some degree of user input unless only a narrow class of models and covariates is run.

Willingness of researchers, institutions and governments to share data on a global scale.

ANSWER:

As part of ongoing quality assessments, even with an automated workflow that ingests data from a variety of sources, there should be a way to compute aggregated indicators of data quality for a given set of data sources. Suppose a workflow ingests three data streams:

o Dataset 1: Subset of data retrieved with constraints A from repository X
o Dataset 2: Subset of data retrieved with constraints B from repository X
o Dataset 3: Subset of data retrieved with constraints A from repository Y

Some suite of quality indicators should be computable against all datasets, to produce information like:

            Quality indicator 1   Quality indicator 2   Quality indicator 3
Dataset 1
Dataset 2
Dataset 3

The quality indicator could be of the type {High, Medium, Low} or a quantitative score. It may be a good idea to run audits of those quality indicators on some regular, or irregular, schedule, if you’re running a service to compute EBVs for policy use. This will enable some level of certification of the quality of the EBV for policy use, which may ease any concerns about decision-making based on computed EBVs that use data from various sources.
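The aggregation described above could be sketched as follows; the indicator functions and thresholds here are placeholders for real checks (e.g. coordinate validity, date completeness), not an agreed specification:

```python
# Sketch: aggregating per-record checks into dataset-level quality labels.
# The checks and the High/Medium/Low cut-offs are illustrative assumptions.
def frac_passing(records, check):
    return sum(1 for r in records if check(r)) / len(records)

def label(fraction):
    if fraction >= 0.9:
        return "High"
    if fraction >= 0.5:
        return "Medium"
    return "Low"

def quality_report(datasets, checks):
    """Return {dataset_name: {indicator_name: 'High'/'Medium'/'Low'}}."""
    return {name: {ind: label(frac_passing(recs, fn))
                   for ind, fn in checks.items()}
            for name, recs in datasets.items()}

checks = {
    "has_coordinates": lambda r: r.get("lat") is not None,
    "has_date": lambda r: r.get("date") is not None,
}
datasets = {
    "Dataset 1": [{"lat": 52.3, "date": "2015-06-01"}],
    "Dataset 2": [{"lat": None, "date": "2015-06-02"},
                  {"lat": 4.1, "date": None}],
}
report = quality_report(datasets, checks)
print(report["Dataset 1"]["has_coordinates"])  # High
```

Running such a report on a schedule, and archiving the results, would give the audit trail needed for certification.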

There is a recently NSF-funded project called MetaDIG (led by staff from DataONE and the US National Center for Ecological Analysis and Synthesis) looking into developing quality indicators for metadata and data. Its aim is to develop a suite of quality indicators that various communities of practice (e.g. the US Geological Survey, the Long-term Ecological Research network) can use.

Provenance metadata standard. I am not sure how widespread community acceptance of the W3C PROV standard for provenance capture is, but it is my hope that a body like GLOBIS-B plays a part in examining the applicability of the provenance schema that PROV recommends, and determines whether it is suitable for the computation of EBVs. As with quality indicators, making sure that workflows are accompanied by provenance metadata will be essential at some point down the road. This is especially true given the importance of reproducibility, which has been discussed within the Belmont Forum e-infrastructure and data management cooperative research action.
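As a rough illustration of what PROV-style provenance for an EBV run might look like, the snippet below builds a minimal PROV-JSON-like document. The identifiers and attributes are invented for this sketch; a real record would follow the W3C PROV-JSON serialization and agreed namespaces:

```python
# Sketch: minimal PROV-JSON-style provenance for an EBV computation.
# All identifiers ("ex:...") and attribute values are hypothetical.
import json

prov_doc = {
    "entity": {
        "ex:gbif_subset": {"prov:label": "GBIF occurrence subset"},
        "ex:ebv_layer": {"prov:label": "Computed species distribution EBV"},
    },
    "activity": {
        "ex:ebv_workflow_run": {"prov:startTime": "2016-03-01T09:00:00"},
    },
    # the workflow run consumed the data subset...
    "used": {
        "_:u1": {"prov:activity": "ex:ebv_workflow_run",
                 "prov:entity": "ex:gbif_subset"},
    },
    # ...and generated the EBV layer
    "wasGeneratedBy": {
        "_:g1": {"prov:activity": "ex:ebv_workflow_run",
                 "prov:entity": "ex:ebv_layer"},
    },
}
print(json.dumps(prov_doc, indent=2)[:40])
```

Even this small structure captures the chain from input dataset to derived EBV, which is exactly what reproducibility audits would need to traverse.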

ANSWER:

1. Implementation of the workflow(s) in a distributed environment, assigning tasks to nodes (servers/hubs) and linking them together
2. Efficiency of the workflow
3. Updating of the workflow and nodes
4. Visualization of results
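Step 1 above (assigning tasks to nodes and linking them) can be sketched in a few lines; the node names and task list are hypothetical:

```python
# Sketch: round-robin assignment of workflow tasks to compute nodes,
# plus explicit links between consecutive steps. Names are hypothetical.
from itertools import cycle

def assign(tasks, nodes):
    """Pair each task with a node in round-robin order."""
    return list(zip(tasks, cycle(nodes)))

tasks = ["ingest", "clean", "model", "visualize"]
nodes = ["hub-eu", "hub-us"]
plan = assign(tasks, nodes)
links = list(zip(tasks, tasks[1:]))   # step N feeds step N+1
print(plan[0])   # ('ingest', 'hub-eu')
```

A real scheduler would of course consider node load and data locality rather than simple rotation, but the plan/links split shows where steps 2 and 3 (efficiency, updating) would hook in.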

If institutions in biodiversity informatics around the world work together, this job can be done in 36 to 60 months.

The most important things are to secure sufficient funding and to design the project well.


ANSWER:

Substantial work is needed to assess available species data and whether they can be massaged into a form suitable for change modelling (e.g. a form that allows detection to be estimated).

Current online systems often treat data that actually come from structured surveys as presence-only (i.e. absences are not recorded; it can be hard to identify all sites from a single survey; information on survey methods can be lost). Can this be improved?

Geographical biases in collections data are already well known (e.g. Amano, T. & Sutherland, W.J. (2013) Proceedings of the Royal Society B 280). What are the priorities for monitoring change? Are priority regions the most poorly sampled? If so, can surveys be designed to satisfy several needs at once, so that decisions needing data now are served as well as longer-term aims for monitoring change?

Environmental predictors: are they adequate, and at fine enough resolution to be useful for monitoring? The same applies to detection covariates.

Modelling: how should the trade-off be managed between making it rigorous enough to enable reliable estimates of change, yet still widely available?

ANSWER:

The Living Planet Index from WWF-ZSL and Map of Life from Yale, but for both initiatives the underlying data used to derive the indices and models are not available.

ANSWER:

Again, this would depend on the EBVs that need to be calculated and how they will be calculated, but potential challenges include:

o Handling large volumes of data (fetching them remotely and processing them).
o Designing for efficient human interaction, if this will be needed.
o Depending on changes to be made to third-party systems (such as asking other initiatives to create new web services on top of their data or make other adjustments to their systems so that they can be integrated with GLOBIS-B).

ANSWER:

Identifying and supporting key non-technical aspects of interoperability, including semantic, legal, policy, and political or ethical considerations (timeframe: 1-3 years; could be accomplished by a handful of workshops followed by a period of comment and consultation).

Finding a balance between automated matchmaking and matchmaking that requires human input. This challenge will be continually revisited through the EBV and platform design process (timeframe: 3 years+, depending on funding).

Developing and documenting a system that is truly accessible to a range of stakeholder audiences, including professional researchers, amateur researchers, educators, and policymakers. Achieving this will require a clear statement of goals and purpose advanced by the GLOBIS-B team and collaborators followed by an inclusive design process (timeframe: 3-5 years+, depending on funding).

Making sure that this project reaches the widest possible audiences, within and beyond the biodiversity and larger scientific community (3-5 years+).

ANSWER:

Patchiness of the data: distinguishing gaps from absences. To tackle this, directed and systematic sampling is needed. High-quality citizen-science projects have some potential but are of restricted value in geographically or politically inaccessible areas.
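The gap-versus-absence distinction can be made operational once survey effort is recorded per cell; the field names and effort threshold below are illustrative assumptions:

```python
# Sketch: separating true absences from sampling gaps using recorded
# survey effort. Field names are illustrative assumptions.
def classify_cell(cell):
    """'gap' = never surveyed; 'absence' = surveyed, nothing found."""
    if cell["effort_hours"] == 0:
        return "gap"
    return "present" if cell["detections"] > 0 else "absence"

cells = [
    {"id": "A1", "effort_hours": 12, "detections": 3},
    {"id": "A2", "effort_hours": 8, "detections": 0},
    {"id": "A3", "effort_hours": 0, "detections": 0},
]
print([classify_cell(c) for c in cells])  # ['present', 'absence', 'gap']
```

This only works if effort metadata survives data mobilisation, which is precisely why the structured-sampling point above matters.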

Getting non-digitised / archived data into GBIF – already underway with a new task force, and some well-designed citizen-science projects based around naturalists' notebooks. Ensuring that these new observations also feed improved range modelling. Research councils and individual scientists may also be able to support this effort.

Reproducibility and robustness of the EBV calculations: ensuring that they scale correctly when computed at smaller scales, and that results are consistent and will be trusted by decision makers.

ANSWER:

For distribution modelling, getting environmental data layers beyond WorldClim.
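One item in this answer proposes downloading and cleaning the full GBIF data into an OLAP cube for abundance. A minimal dict-based roll-up might look like this; the field names and grid-cell codes are assumptions, and a real OLAP store would replace the plain dictionary:

```python
# Sketch: rolling cleaned occurrence records up into a small "cube"
# keyed by (species, year, grid cell), summing abundance.
# Field names and cell codes are illustrative assumptions.
from collections import defaultdict

def build_cube(records):
    cube = defaultdict(int)
    for r in records:
        if r.get("abundance") is None:      # drop records without counts
            continue
        key = (r["species"], r["year"], r["cell"])
        cube[key] += r["abundance"]
    return dict(cube)

records = [
    {"species": "Parus major", "year": 2014, "cell": "E32N55", "abundance": 3},
    {"species": "Parus major", "year": 2014, "cell": "E32N55", "abundance": 2},
    {"species": "Parus major", "year": 2015, "cell": "E32N55", "abundance": None},
]
cube = build_cube(records)
print(cube[("Parus major", 2014, "E32N55")])  # 5
```

The cube keys make slicing by any dimension (species, year, cell) trivial, which is the point of the OLAP framing.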

For abundance, download and cleaning of the full GBIF data into an OLAP cube.

ANSWER:

SGDR (sui generis database right): it is fundamental to keep in mind the distinction between data creation and data collection. Only data collection (or presentation/verification) can give rise to an SGDR (if all the other requirements are met); if data are created there is no SGDR. To make things "easier", the definitions of data creation and data collection are unclear (a distinction that does not necessarily correspond to the scientific/epistemological one).

Importance of correct labelling of data and metadata: all data/datasets must be properly labelled with the right tools: Public Domain Mark, CC0, or the CC Public Licenses (CCPL).

TDM, copyright, SGDR and licences: it is important to employ licences that address these issues properly (e.g. CCPL v4.0 yes; CCPL v3.0 it depends, but usually no; CCPL v2.0 no).
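A data platform could encode this rough ordering as a machine-readable suitability flag per dataset. The mapping below simply transcribes the yes/no judgements given in this answer; it is an illustration, not legal advice:

```python
# Sketch: flagging whether a dataset's licence label is suitable for
# text-and-data-mining style reuse, following the rough ordering above.
# Labels are illustrative; this is not legal advice.
TDM_SUITABLE = {
    "PDM": True,          # Public Domain Mark
    "CC0": True,
    "CC-BY-4.0": True,    # CCPL v4.0: yes
    "CC-BY-3.0": False,   # CCPL v3.0: depends, but usually no
    "CC-BY-2.0": False,   # CCPL v2.0: no
}

def tdm_ok(license_label):
    # Unknown labels default to "not suitable" to stay on the safe side.
    return TDM_SUITABLE.get(license_label, False)

print(tdm_ok("CC0"))        # True
print(tdm_ok("CC-BY-2.0"))  # False
```

Defaulting unknown licences to "no" reflects the precautionary stance the answer takes on ambiguous cases.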

To licence data properly it is important not only that the right legal tools are available (to some extent they already are, e.g. licences) but also the right set of incentives (i.e. if researchers need high Impact Factors to obtain grants or tenure, they will publish in high-IF journals that do not necessarily follow OA principles). Researchers therefore cannot be "left alone" in dealing with copyright and assessment issues; they need protective legislative interventions (like the German and Dutch ones, not like the Spanish or Italian; the UK solution is debatable) plus the right set of incentives from funding bodies and employers (e.g. only papers/datasets self-archived in OA, the "green road", will be used for evaluation purposes).

For OA to be successful, a new approach is required not only in the publication of science but also in its evaluation/assessment.

ANSWER:

Our contributions to EBVs come from two angles: copyright and data-sharing policies on the one hand, and the published record on the other.

What we can contribute is to look at data and data quality, and how these relate to open access to the data.

From the published record this only makes sense in two specific aspects: publishing datasets so that they can be cited, e.g. using either GBIF or Pensoft publishing facilities, which is relevant in the longer term for setting up monitoring schemes.

Another aspect of the published record is that it is often the only source for rare species beyond butterflies or plants. This might add a special layer of taxa that represent a large part of biodiversity and are in most cases completely underrepresented. At the same time, the question might be raised whether the known data are strong enough to contribute more than anecdotal evidence to EBVs.

For us, participating in GLOBIS-B is an opportunity to find weak points in the data, the EBV workflow and data publishing, and to learn how we can improve future data publishing.

ANSWER:


High-quality data on charismatic organisms in rich countries are often not comparable with sparse data from elsewhere. We need to avoid the lowest common denominator; I see this essentially as a modelling problem rather than a data availability problem.

Metadata (see above): more sophisticated observation models will be computationally intensive.

Multispecies models will be even more computationally-demanding.

Spatial scale, spatial resolution and temporal resolution of the outputs.

ANSWER:

Data generation is still the limiting factor: we need to measure faster, more cheaply, and in automated ways.

LifeWatch Belgium devotes a large part of its budget to installing biosensor networks for the measurement of phytoplankton, zooplankton, fish, birds and bats.

Some examples: http://rshiny.lifewatch.be/

The Jerico Next and Atlantos projects are examples at European and transatlantic scale.

ANSWER:

There are multiple technical challenges but the main challenge lies in getting research infrastructures operators to work together at the global level to pursue an agreed roadmap (e.g., based on that coming from the CReATIVE-B project) that ensures that the various research infrastructures are interlinked and interoperable in both technical and legal terms. This requires investment funding. The responsibility should be taken up by the Belmont Forum, perhaps?

ANSWER:

Challenges:
o Secure unlimited computational capacity
o Ensure transparency of the process
o Provide web services through which viewing and use of data and workflows/software will be tracked and reported back to the developers
o Mapping of EBVs at global scales

Ways to address the technical challenges:
o Engaging grid and cloud infrastructure
o Developing tools for traceability of the entire process in cyberspace
o Developing pipeline links between the available data and workflows/software, as well as with the available e-infrastructures

Who has to do this?
o The scientific community from around the world, mobilizing the large networks (e.g. MARS and WAMS for marine benthic biodiversity; there are many communities for other regimes)
o The ICT community relevant to biodiversity informatics
o The EU and other funding agencies at national, regional, continental and global scale, to create the appropriate funding instruments, at least for the coordination of the activities

ANSWER:

Agreed metadata standards (including rights-statement metadata, which do not exist at this moment) for all the existing datasets mentioned in the answer to Q1.

An agreed standard to incorporate abundance-based occurrence data (is there any in place with enough consensus and reliability?).

Agreed, open-source-based GIS standards to facilitate the chart outputs of Q2.


Assuming there is minimum agreement on the answers to the previous 7 questions (a minimal background need, at least for some species or taxa), the main need is to conduct real-life testing in which digitized national inventories, GBIF datasets, biodiversity species data mined from scientific publications, and citizen-science crowdsourced data are combined, using multiple (or at least two) VRE-based IT setups to check the reliability of results. This would need clear policy agreements and funding.

ANSWER:

Data mobilisation across multiple sources (incl. legacy literature and collections)

Use of common or interchangeable Standards that will allow data interoperability

Open and well-documented web services to serve data

Robust registries of data services and Standards

Development of end-user services tailored to specific audiences.

ANSWER:

Technical challenges:
o Data description, including the uncertainties, in a machine-readable manner. One of the issues is the provision of this information, which is not primarily a technical issue.
o Taxonomic references and mapping of species names and species groups – this should already be solved in the GBIF context, but it is still an issue in the FFH directive.
o Provision of time series of species observations including abundance information – with the open issue of how to upscale from regional data to a global perspective.

The establishment of consistent monitoring schemes for biodiversity at the national scale is an important prerequisite for further activities. Methodologically, the use of high-resolution EO data for habitat and species estimation needs to be evaluated; for example, EcoPotential will focus on the identification of whales in one of the test areas based on EO Sentinel data.
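The species-name mapping challenge mentioned in this answer amounts to resolving synonyms against an accepted backbone before any aggregation. A toy lookup table makes the mechanics clear (the table itself would come from a backbone such as GBIF's; the entries here are a small real synonymy used as an example):

```python
# Sketch: mapping species names to accepted names via a synonymy table
# before aggregating observations. The lookup table is a toy example.
ACCEPTED = {
    "Parus caeruleus": "Cyanistes caeruleus",      # synonym -> accepted
    "Cyanistes caeruleus": "Cyanistes caeruleus",  # accepted -> itself
}

def resolve(name):
    """Return the accepted name, or None for names the backbone lacks."""
    return ACCEPTED.get(name)

print(resolve("Parus caeruleus"))  # Cyanistes caeruleus
```

Returning None for unknown names, rather than passing them through, forces unresolved names to be handled explicitly instead of silently fragmenting counts across spellings.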

ANSWER:

The first and foremost challenge is to have a global database of occurrences that includes abundance data. That is a major step from the current presence-only datasets that make up the bulk of globally available biodiversity data. A challenge currently being addressed is incorporating sample data. For this to work properly, there are still unresolved challenges that might be tackled within the timeframe:

o An effective system of unique identifiers (GUIDs) for biodiversity occurrences/objects;
o An agreed-upon, proven standard to incorporate abundance-based and sample-based data into occurrence datasets;
o A reliable way to represent/identify/describe absence data;
o A clean, authoritative taxonomic backbone allowing easy identification of taxon concepts, duplications and synonyms, and overall deduplication of occurrence data.

Some of these challenges, as well as many others, were identified by a large number of practitioners through a content needs assessment carried out a few years ago (see https://journals.ku.edu/index.php/jbi/article/view/4126 and https://journals.ku.edu/index.php/jbi/article/view/4094).
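One candidate approach to the GUID and deduplication points above is deterministic identifiers: the same collection event always yields the same identifier, so duplicates held by different institutions collapse automatically. The key fields and namespace choice below are assumptions for illustration:

```python
# Sketch: deterministic GUIDs for occurrence records, so duplicates of
# the same collection event from different institutions map to one
# identifier. The key fields chosen here are illustrative assumptions.
import uuid

NS = uuid.NAMESPACE_DNS  # stand-in namespace; a real system would mint its own

def occurrence_guid(record):
    # uuid5 is deterministic: same key -> same GUID, every time.
    key = "|".join(str(record[f])
                   for f in ("collector", "date", "taxon", "locality"))
    return str(uuid.uuid5(NS, key))

a = {"collector": "Smith 123", "date": "1987-05-12",
     "taxon": "Quercus robur", "locality": "Meise"}
b = dict(a)  # duplicate of the same collection held elsewhere
print(occurrence_guid(a) == occurrence_guid(b))  # True
```

The obvious caveat is that retrospective geo-referencing or re-determination changes the key fields, so the scheme only works on the stable core of a record.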

Therefore, in order to achieve these goals, a proper Organizational Knowledge Management (OKM) methodology should be established and then refined and maintained by an OKM Committee. The OKM should be based on the following premises:

o How to identify some practical cases from the perspective of relevant biotic and abiotic EBV indicators to be computed. This would largely depend on the "quality" of the (meta-)data resources mentioned above. This analysis should be performed by a Scientific Committee.


o To this purpose, to further integrate and adapt existing workflow developments into the LifeWatch ICT distributed e-Infrastructure, so that essential "blocks" offered in the form of e-Services can be used to calculate EBV values and present them visually through the development of proper Virtual Research Environments (VREs). This should be coordinated by an ICT Technical Committee.

Therefore, we are talking not only about the creation and maintenance of an EBV "ontology-driven" system, but also about how to guarantee the "engineering" mechanisms associated with its integration into the LifeWatch (and similar) e-Infrastructures from the ICT perspective.

ANSWER:

It is obviously necessary to have a common data vocabulary and common standards and protocols for data exchange, but this also depends on which step of the processing is done by whom. There are at least two broad alternatives, and each of these has its own challenges for global assessment:

o 1. EBVs are calculated separately for each jurisdiction, region or continent and then aggregated or reported globally.

Advantages: much of the burden is distributed among countries/jurisdictions, so little needs to be done centrally. This would put greater onus on countries to coordinate national monitoring, assessment and reporting of biodiversity, and ensure a stronger link between biodiversity monitoring and management actions, policy and legislation.

Challenges: ensuring consistency in calculation, data quality standards, etc. among countries. Also, some countries/jurisdictions will simply not have the staff and resources to do these analyses, so some of this will have to be done centrally.

o 2. EBVs are calculated globally using data obtained from each jurisdiction.

Advantages: transparency and consistency of calculation.

Challenges: major resources may be required to chase up and acquire data. Errors may not be easily recognised because the data will be processed by people who have limited knowledge of the data.
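Under alternative 1, the global roll-up step could be as simple as an area-weighted mean of national EBV values, provided each report carries the area it covers. The field names and figures below are purely illustrative:

```python
# Sketch of alternative 1: jurisdictions report a national EBV value
# plus the area it covers; the global figure is an area-weighted mean.
# Field names and numbers are illustrative only.
def global_ebv(national_reports):
    total_area = sum(r["area_km2"] for r in national_reports)
    return sum(r["ebv"] * r["area_km2"]
               for r in national_reports) / total_area

reports = [
    {"country": "A", "ebv": 0.8, "area_km2": 100_000},
    {"country": "B", "ebv": 0.4, "area_km2": 300_000},
]
print(round(global_ebv(reports), 3))  # 0.5
```

The simplicity is deceptive: the consistency challenges listed above (common methods, common data quality standards) are exactly what makes the national inputs to this formula comparable in the first place.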

ANSWER:

Assuming questions on the definition of EBVs do not need to be further addressed, there needs to be broad recognition that most biodiversity is currently un- or under-represented in available data. The main technical challenges would then be:

o Accurate and widespread recording of abundance data with confirmed absences, rather than just presence-only data. This would sensibly build on the existing GBIF architecture.

o The taxonomic backbone needs to be improved and made explicit, i.e. synonymies made clear.

o Over time, changes to existing point datasets (e.g. GBIF) need to be (a) recorded and (b) explicitly presented, i.e. new specimen records, new geo-referencing of specimen localities, and edits to the taxonomy and location details of existing records, such as re-determinations and more precise geo-referencing. For plant specimens, duplicate records of the same collection held at different institutions need to be explicitly linked; if geo-referencing is undertaken retrospectively it may differ between duplicates of the same collection.

o An established but flexible workflow for species distribution modelling and stacking species extents would be imperative.

o One of the outstanding conceptual challenges for the development of EBVs is agreement on common scales/units/indices of measurement to allow data on e.g. species distributions to be integrated sensibly with data on e.g. habitat extent, and in a way that makes e.g. allelic diversity comparable with e.g. habitat extent. Combining different EBVs in a standardised way is the real power of the whole conceptual approach. Alternative metrics such as effective numbers may be worth exploring.
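The "effective numbers" mentioned in the last bullet can be illustrated with the Hill number of order 1, i.e. the exponential of Shannon entropy, which expresses diversity as an equivalent number of equally common species and so gives different EBVs a shared, intuitive unit:

```python
# Sketch: "effective number of species" (Hill number of order 1),
# the exponential of Shannon entropy, as one candidate common currency.
import math

def effective_species(abundances):
    total = sum(abundances)
    ps = [a / total for a in abundances if a > 0]
    shannon = -sum(p * math.log(p) for p in ps)
    return math.exp(shannon)

# Four equally common species behave like exactly four species:
print(round(effective_species([25, 25, 25, 25]), 2))  # 4.0
# One dominant species drags the effective number towards 1:
print(round(effective_species([97, 1, 1, 1]), 2))
```

Because the same transformation applies to allelic frequencies, the approach extends naturally to genetic diversity, which is what makes it attractive as a cross-EBV currency.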

ANSWER:

The main problem is collecting the data and publishing the data openly. At least we could make a lot of inroads on the latter.