Common Data Models and Protocols

Richard White, Cardiff University

Talk given at “Making Species Databases Interoperable”, Reading, 15 July 2004

SPICE for Species 2000
Funded in the UK by the BBSRC/EPSRC Bioinformatics Initiative
Universities of Cardiff & Reading
http://www.systematics.rdg.ac.uk/spice/



Species 2000

The story so far ...

Species 2000 is an international collaborative project to create and provide access to an authoritative and up-to-date checklist and index to all the world’s species.

How is it going to do this?

Species 2000 services to users

– Dynamic Checklist
– Annual Checklist
– Web site, including database links submitted by users or producers
– Distribution media, including downloaded data
– Index to species information (hyperlinks to SISs)
– Packaged functions providing services to other software

Species 2000 organisation

– Taxonomic hierarchy (or hierarchies)
– Species
– Global species databases (GSDs) and interim checklists: the species index
– Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc.

Merging & Linking

– Merging: the original databases are physically copied into a new combined database
– Linking: the original databases remain separate, but are accessed through a single system


Merging

1. The original databases are physically copied into a new combined database.

2. The user interacts with the new combined database.

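[Diagram: source databases “Plants of Europe” and “Plants of Africa”; combined database “Plants of the World”; the arrows numbered 1 and 2 correspond to the steps above.]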


Linking

1. The user interacts with an access system which does not itself contain data.

2. When the user requests data, it is fetched from the appropriate database.

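[Diagram: databases “Plants of Europe” and “Plants of Africa”; access system “Plants of the World”; the arrows numbered 1 and 2 correspond to the steps above.]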
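The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the dictionaries stand in for real databases, and the records and function names are invented for the example.

```python
# Hypothetical in-memory stand-ins for two source databases.
plants_of_europe = {"Cytisus scoparius": {"region": "Europe"}}
plants_of_africa = {"Caesalpinia bonduc": {"region": "Africa"}}
sources = {
    "Plants of Europe": plants_of_europe,
    "Plants of Africa": plants_of_africa,
}


def merge(sources):
    """Merging: physically copy every record into one combined database."""
    combined = {}
    for records in sources.values():
        combined.update(records)  # a copy is made now; later edits are not seen
    return combined


def linked_lookup(sources, name):
    """Linking: the access system holds no data and fetches on request."""
    for source_name, records in sources.items():
        if name in records:
            return source_name, records[name]
    return None


plants_of_the_world = merge(sources)

# A record added to a source after the merge (invented for the example).
plants_of_africa["Acacia nilotica"] = {"region": "Africa"}
```

After the last line, the new record is reachable through `linked_lookup` but is absent from the merged copy, which is the practical trade-off between the two designs.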

Architecture of Species 2000

[Diagram: layered architecture]
– User interface
– Data collector
– CAS (Common Access System), or “harness”
– Protocol
– Wrappers, one in front of each GSD
– GSDs: a distributed array of databases
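As a rough sketch of this arrangement, the CAS can be pictured as an object that holds no species data and simply forwards each request to every wrapper. The class and function names below are invented for illustration and are not taken from the Spice code.

```python
from typing import Callable, Dict, List

# A wrapper is modelled as a function from a search string to matching names.
Wrapper = Callable[[str], List[str]]


def make_wrapper(names: List[str]) -> Wrapper:
    """Build a toy wrapper around an in-memory list standing in for a GSD."""
    def search(text: str) -> List[str]:
        return [n for n in names if text.lower() in n.lower()]
    return search


class CommonAccessSystem:
    """Holds no species data itself; forwards each request to every wrapper."""

    def __init__(self, wrappers: Dict[str, Wrapper]):
        self.wrappers = wrappers

    def search(self, text: str) -> Dict[str, List[str]]:
        # The same request goes to every GSD; results are kept per GSD.
        return {gsd_id: wrapper(text) for gsd_id, wrapper in self.wrappers.items()}


cas = CommonAccessSystem({
    "legumes": make_wrapper(["Cytisus scoparius", "Caesalpinia crista"]),
    "fishes": make_wrapper(["Salmo salar"]),
})
results = cas.search("cytisus")
```

Because the CAS only sees the wrapper interface, a GSD can change its internal storage without the CAS or user interface needing to change.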

Need for communication

Different people are building the various components of the system:
– GSDs
– wrappers
– CAS
– user interface

We need to ensure they all have a common understanding of the data to avoid embarrassing mistakes

Database wrappers

Only the interface to the CAS needs to speak CORBA

Wrappers must:
– Translate CAS requests into a form suitable for the GSD (e.g. SQL) and translate responses back
– Deal with other kinds of heterogeneity, including schema heterogeneity
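A minimal sketch of both duties, using Python's built-in sqlite3 module as a stand-in GSD. The table, its column names and the shape of the returned records are assumptions made for the example; a real wrapper would follow the GSD's own schema and the CDM.

```python
import sqlite3

# Hypothetical GSD with its own local schema (table and column names invented).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE taxa (genus TEXT, epithet TEXT, auth TEXT)")
db.executemany("INSERT INTO taxa VALUES (?, ?, ?)", [
    ("Caesalpinia", "crista", "L."),
    ("Caesalpinia", "bonduc", "(L.) Roxb."),
    ("Cytisus", "scoparius", "(L.) Link"),
])


def search_names(connection, search_string, limit=10):
    """Translate a name search into SQL and map rows back to common fields."""
    rows = connection.execute(
        "SELECT genus, epithet, auth FROM taxa "
        "WHERE genus || ' ' || epithet LIKE ? "
        "ORDER BY genus, epithet LIMIT ?",
        (search_string + "%", limit),
    ).fetchall()
    # Schema heterogeneity: this GSD splits the name over two columns, while
    # the caller expects a single name string plus an authority.
    return [{"name": f"{g} {e}", "authority": a} for g, e, a in rows]


matches = search_names(db, "Caesalpinia")
```

The caller never sees the SQL or the column names, so a GSD with a different schema needs only a different wrapper.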

Data flow through a wrapper

[Diagram: a divided wrapper. Components: CAS, wrapper interface, external wrapper, GSD. The links are labelled “XML” and “Strings, e.g. CGI”.]

Common Data Model

We need a Common Data Model (CDM):
– A definition of the information being passed to and fro
– Human-readable, not machine-readable
– This is used as a reference when creating specific implementations for CGI/XML (DTD, XML Schema), Web Services, etc.

What does the CDM look like?

It defines the input (“request”) and output (“response”) for six fundamental operations which the system needs to be able to carry out

Request Types 0-5

– Type 0: Get version of the CDM with which the GSD complies
– Type 3: Get information about the GSD
– Type 1: Search for a name in the GSD
– Type 2: Fetch “standard data” about a chosen species
– Type 4: Move up the taxonomic hierarchy
– Type 5: Move down the taxonomic hierarchy
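One way an implementation might represent the six request types is as an enumeration. The numbering follows the list above; the member names are illustrative and are not defined by the CDM.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """Numbering follows the talk; the member names are illustrative."""
    GET_CDM_VERSION = 0
    SEARCH_NAME = 1
    GET_STANDARD_DATA = 2
    GET_GSD_INFO = 3
    MOVE_UP_HIERARCHY = 4
    MOVE_DOWN_HIERARCHY = 5


def describe(code: int) -> str:
    """Turn a numeric request type into a readable label.

    Raises ValueError for a code that is not one of the six types.
    """
    return RequestType(code).name.replace("_", " ").lower()
```

Rejecting unknown codes at this point means a wrapper fails loudly when it receives a request type it was not written for.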

Type 0 Request

Request:
– (nothing)

Response:
– CDMVersion
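For a CGI/XML implementation, a Type 0 exchange might look roughly like the sketch below. The wrapper address, the `requesttype` parameter and the XML element names are all assumptions made for illustration; the real names are fixed by the Spice DTD, which is not reproduced here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical wrapper address and parameter name; not taken from the Spice DTD.
WRAPPER_URL = "http://example.org/wrapper"


def build_type0_request() -> str:
    """Type 0 carries no data beyond the request type itself."""
    return WRAPPER_URL + "?" + urlencode({"requesttype": 0})


def parse_type0_response(xml_text: str) -> str:
    """Extract CDMVersion from an XML response (element names assumed)."""
    root = ET.fromstring(xml_text)
    version = root.findtext("CDMVersion")
    if version is None:
        raise ValueError("response has no CDMVersion element")
    return version.strip()


# Invented sample response, shaped only to exercise the parser.
sample = "<Response><CDMVersion>1.20</CDMVersion></Response>"
```

A CAS could use the returned version to decide whether it is able to talk to a given wrapper before sending any other request.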

Type 3 Request

Request:
– GSDIdentifier

Response:
– GSDInfo (a set of fields including its name, date of last editing, etc.)

Type 1 Request

Request:
– SearchString, SearchType (scientific name, common name, unknown), SearchLimit (including higher taxon, maximum number of names to return)

Response:
– Number, SpeciesName[0:N]
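The request and response for a name search could be modelled as below. This is a sketch under assumptions: the field names mirror the CDM terms, but the default limit, the prefix-matching rule and the reading of Number as the count of names returned are choices made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SearchType(Enum):
    SCIENTIFIC_NAME = "scientific"
    COMMON_NAME = "common"
    UNKNOWN = "unknown"


@dataclass
class SearchLimit:
    higher_taxon: Optional[str] = None  # restrict the search to this taxon
    max_names: int = 50                 # assumed default, not from the CDM


@dataclass
class Type1Request:
    search_string: str
    search_type: SearchType = SearchType.UNKNOWN
    search_limit: SearchLimit = field(default_factory=SearchLimit)


@dataclass
class Type1Response:
    species_names: List[str]

    @property
    def number(self) -> int:
        # Here "Number" is taken to be the count of names returned.
        return len(self.species_names)


def run_search(request: Type1Request, names: List[str]) -> Type1Response:
    """Toy prefix search over an in-memory name list, honouring max_names."""
    prefix = request.search_string.lower()
    hits = [n for n in names if n.lower().startswith(prefix)]
    return Type1Response(hits[: request.search_limit.max_names])


names = ["Caesalpinia bonduc", "Caesalpinia crista", "Cytisus scoparius"]
limited = run_search(
    Type1Request("caesalpinia", search_limit=SearchLimit(max_names=1)), names
)
```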

Type 2 Request

Request:
– Identifier, GSDIdentifier

Response:
– StandardData (approximately the same as the Standard Data defined by Species 2000 and seen by the user)

Type 4 Request

Request:
– Identifier, GSDIdentifier

Response:
– HigherTaxon[0:N]

Type 5 Request

Request:
– Identifier, SearchLimit

Response:
– Taxon[0:N]
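Types 4 and 5 are the two directions of travel through the hierarchy. A toy version over a parent table might look like this; the hierarchy fragment is invented, and returning the whole chain of higher taxa (rather than only the immediate parent) is an assumption about what HigherTaxon[0:N] holds.

```python
# Hypothetical fragment of a hierarchy: each taxon maps to its parent.
PARENT = {
    "Cytisus scoparius": "Cytisus",
    "Cytisus striatus": "Cytisus",
    "Cytisus": "Leguminosae",
    "Leguminosae": None,
}


def higher_taxa(identifier):
    """Type 4 style: walk upwards, nearest parent first."""
    chain = []
    parent = PARENT[identifier]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def lower_taxa(identifier, limit=None):
    """Type 5 style: list immediate children, optionally capped by a limit."""
    children = sorted(t for t, p in PARENT.items() if p == identifier)
    return children if limit is None else children[:limit]
```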

The “standard data”

This comprises the information about a species which Species 2000 wishes to provide:

– AVCNameWithRefs

– SynonymWithRefs

– CommonNameWithRefs

– Family (or other agreed higher taxon)

– Comment

– Scrutiny

– DataLink (links to the GSD’s or other web pages)

– Geography (list of places)
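As a sketch, the fields above map naturally onto a record type. The Python field names and types are guesses made for illustration; the CDM itself defines the authoritative structure, and the sample record is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NameWithRefs:
    name: str
    references: List[str] = field(default_factory=list)


@dataclass
class StandardData:
    avc_name: NameWithRefs                                   # AVCNameWithRefs
    synonyms: List[NameWithRefs] = field(default_factory=list)
    common_names: List[NameWithRefs] = field(default_factory=list)
    family: str = ""                                         # or other agreed higher taxon
    comment: str = ""
    scrutiny: str = ""
    data_links: List[str] = field(default_factory=list)      # DataLink
    geography: List[str] = field(default_factory=list)       # list of places

    def all_names(self) -> List[str]:
        """Every name under which this species might be searched for."""
        return (
            [self.avc_name.name]
            + [s.name for s in self.synonyms]
            + [c.name for c in self.common_names]
        )


# Invented sample record, for illustration only.
record = StandardData(
    avc_name=NameWithRefs("Cytisus scoparius"),
    synonyms=[NameWithRefs("Sarothamnus scoparius")],
    common_names=[NameWithRefs("Broom")],
    family="Leguminosae",
    geography=["Europe"],
)
```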

Where are we now?

Is the Spice Project finished?
– We have a fairly stable CDM (version 1.20 is about to be replaced with version 1.21)
– XML DTD exists
– Several CGI/XML implementations in Java and PHP, and a Web Service
– We have a working Spice system
– A few changes are anticipated:
  • geographical information
  • linking to further information sources
  • infraspecific taxa

“Intelligent” linking

Species 2000 is
– not just a catalogue (which lists things)
– it is an index (which points to things)

It plans to provide links to take a user
– from a species entry (from a GSD)
– to further sources of information about that particular species (Species Information Sources or SISs)

“Intelligent” linking

There are experimental “unintelligent” links already (as in the ILDIS GSD), which rely on exact name matching

But there are issues in making links more intelligent
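The weakness of exact name matching is easy to show. In the sketch below the page index, the URL and the synonym table are invented; the point is only that an exact-match link fails as soon as the two resources use different names for the same species.

```python
# Hypothetical index of an information source, keyed by the name it uses.
sis_pages = {
    "Sarothamnus scoparius": "http://example.org/sis/sarothamnus-scoparius",
}

# Hypothetical synonym table, as a GSD might supply it.
synonyms = {"Cytisus scoparius": ["Sarothamnus scoparius"]}


def exact_link(name, pages):
    """An "unintelligent" link: succeeds only if the strings are identical."""
    return pages.get(name)


def synonym_aware_link(name, synonyms, pages):
    """Try the name first, then each known synonym in turn."""
    for candidate in [name, *synonyms.get(name, [])]:
        if candidate in pages:
            return pages[candidate]
    return None
```

Following synonyms is only a first step: it still assumes both resources mean the same thing by each name, which is the data quality problem taken up next.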

Data quality (again!)

How do we know the information is reliable?

One problem is the differing interpretation of species names (species concepts) in different resources

LITCHI Project

A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases


Summary of Litchi project

We modelled the knowledge integrity rules in a taxonomic treatment.

The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon.

Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.

Version 2, now implemented, focuses on the creation of “cross-maps”.

Example 1

Checklist A
– Caesalpinia crista L. [accepted name]

Checklist B
– Caesalpinia crista L. [accepted name]
– Caesalpinia bonduc (L.) Roxb. [accepted name]
– Caesalpinia crista L., p.p. [synonym]

Example 2

In the case of the species Cytisus scoparius:
– Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
– Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)

Treatment A recognises one genus, Cytisus:
– Genus Cytisus: Cytisus scoparius, Cytisus striatus, Cytisus multiflorus, Cytisus praecox

Treatment B recognises two genera, Cytisus and Sarothamnus:
– Genus Sarothamnus: Sarothamnus scoparius, Sarothamnus striatus
– Genus Cytisus: Cytisus multiflorus, Cytisus praecox
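A conflict of this kind can be found mechanically. The check below is a simplified illustration and not LITCHI's actual rule set: it reports any name that one treatment accepts while the other lists it as a synonym. Each treatment is represented as a mapping from accepted name to its synonyms, which is an assumption made for the example.

```python
def status_conflicts(treatment_a, treatment_b):
    """Names accepted in one treatment but listed as a synonym in the other.

    Each treatment maps an accepted name to its list of synonyms. Each result
    is (name, treatment that accepts it, accepted name in the other treatment).
    """
    conflicts = []
    for first, second, label in (
        (treatment_a, treatment_b, "A"),
        (treatment_b, treatment_a, "B"),
    ):
        # Invert the other treatment: synonym -> the accepted name it belongs to.
        synonym_of = {s: acc for acc, syns in second.items() for s in syns}
        for accepted in first:
            if accepted in synonym_of:
                conflicts.append((accepted, label, synonym_of[accepted]))
    return sorted(conflicts)


treatment_a = {"Cytisus scoparius": ["Sarothamnus scoparius"]}
treatment_b = {"Sarothamnus scoparius": ["Cytisus scoparius"]}
conflicts = status_conflicts(treatment_a, treatment_b)
```

For Example 2 the check reports the conflict from both sides, since each treatment accepts the name that the other treats as a synonym.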

Cross-mapping

So how can we make intelligent links work?

One way to make links appear more intelligent is to create and maintain “cross-maps” which describe how one or more taxa in one resource (such as the Species 2000 index) relate to one or more taxa in another resource
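At its simplest, a cross-map is a many-to-many relation between the taxa of two resources. The class below is a minimal sketch with invented names; a real cross-map would also record the kind of relationship between the concepts, which is omitted here. The sample entries follow one possible reading of Example 1, in which the single Checklist A taxon corresponds to two taxa in Checklist B.

```python
from collections import defaultdict


class CrossMap:
    """Many-to-many mapping between taxa in resource A and resource B."""

    def __init__(self):
        self._a_to_b = defaultdict(set)
        self._b_to_a = defaultdict(set)

    def relate(self, taxon_a, taxon_b):
        """Record that a taxon in A corresponds, wholly or partly, to one in B."""
        self._a_to_b[taxon_a].add(taxon_b)
        self._b_to_a[taxon_b].add(taxon_a)

    def targets(self, taxon_a):
        """Taxa in B related to the given taxon in A."""
        return sorted(self._a_to_b.get(taxon_a, ()))

    def sources(self, taxon_b):
        """Taxa in A related to the given taxon in B."""
        return sorted(self._b_to_a.get(taxon_b, ()))


# One possible reading of Example 1: one taxon in A, two taxa in B.
cross_map = CrossMap()
cross_map.relate("Caesalpinia crista", "Caesalpinia crista")
cross_map.relate("Caesalpinia crista", "Caesalpinia bonduc")
```

A link that consults `targets` would offer the user both Checklist B entries instead of silently following the one whose name happens to match.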

Litchi 2.2 in use

[Diagram: Checklist A and Checklist B are read into the system; rules and heuristics (“taxonomic intelligence”) support conflict detection and inference of concept relationships; the concept relationships are written out as a cross-map.]

More about cross-maps

They may be created and maintained
– manually by experts
– automatically or semi-automatically by LITCHI (as above)
– by monitoring the behaviour of users following species links
– by analysing data sets describing the taxa, when sufficient such data is available, using the usual species taxonomy tools (phenetic and cladistic analyses)

More about cross-maps

They may be held
– by individual GSDs, describing how to link their species to selected related resources, as ILDIS has done for linking to the Northern Eurasia (aka USSR) database
– by Species 2000 as a repository and service to facilitate intelligent species links
– by an “intelligent linking engine”, as planned for Species 2000 Europa to link its two hubs

A dream

A system for managing intelligent species links using taxonomic concept relationships would maximise the potential of the plethora of species-based catalogues, indexes and rich species resources currently being assembled all over the world

Perhaps on the Web, as with the current Spice/Species 2000 prototype

Or ...

The Grid

Or maybe on the Grid
– One of the aims of which is to provide access to such knowledge sources as species checklists, synonymy servers, rich species data sets, and cross-maps, for example in the Biodiversity World project