technical coping strategies for resource discovery - paul walk

Paul Walk

@paulwalk

http://www.paulwalk.net

Technical challenges in resource discovery

Contents

1. a general consideration: open or closed

2. a particular challenge: synchronisation in an open world

3. the ‘nothing new’, but doing it better: APIs that work and can be trusted

a healthy(?) state of tension between open and closed

open and closed worlds

• I’m not talking about licensing or access to data

• open: unbounded - like the Web

• closed: bounded - like most collections management systems, aggregations etc.

• formally, much of what we do is underpinned by ‘open/closed world’ assumptions:

  • open world assumption: any statement not known to be true is unknown

  • closed world assumption: any statement not known to be true is false
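A minimal sketch of the difference between the two assumptions, using a tiny set of known statements. The facts, names and the `Truth` type are all hypothetical and only there for illustration.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical knowledge base: the only statements known to be true.
KNOWN_FACTS = {
    ("item-1", "heldBy", "Library A"),
}


def ask_closed_world(statement):
    """Closed world: anything not known to be true is treated as false."""
    return Truth.TRUE if statement in KNOWN_FACTS else Truth.FALSE


def ask_open_world(statement):
    """Open world: anything not known to be true is simply unknown."""
    return Truth.TRUE if statement in KNOWN_FACTS else Truth.UNKNOWN


query = ("item-1", "heldBy", "Library B")
print(ask_closed_world(query).value)
print(ask_open_world(query).value)
```

The same query gets a firm "false" from the bounded system and only an "unknown" from the unbounded one, which is why a missing record means different things in a collections management system and on the Web.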

characteristics of an open world

characteristics of a closed/bounded world

judging where to apply each

• we need our infrastructure (especially integration technology between systems) to be open and relatively unbounded

• the Web is still the best available foundation for this

• however, we still need to manage our resources, maintain quality and honour complex rights management commitments

• we probably need to recognise that users’ experience is often enhanced through the application of a more focussed, targeted and context-aware approach

a particular challenge

synchronisation

• how is the state of the resource maintained across an infrastructure of ‘federated’ repositories?

• if a resource is changed or deleted, how does the right-hand side aggregation know?

• note - this is based on our existing ‘harvesting’ or ‘pull’ approach

[Diagram: resource collections feeding aggregations; multiple harvest routes, multiple copies]

ResourceSync

• a joint project of NISO and OAI, led by Herbert Van de Sompel of Los Alamos

• a lightweight mechanism to allow the state of web resources to be communicated between web systems

• developing a spec which builds on the sitemap specification, allowing content providers to publish changesets

• draft: http://bit.ly/WYhTz2

• Jisc have funded UK participation in this
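A rough sketch of how an aggregation might apply a published changeset to its local copy. The sitemap namespace is the real one, but the `ex:change` element, its `type` values and the example URLs are assumptions made for illustration, not taken from the ResourceSync draft.

```python
import xml.etree.ElementTree as ET

# "sm" is the standard sitemap namespace; "ex" is a made-up extension
# namespace standing in for whatever the draft actually defines.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "ex": "http://example.org/changes",
}

# Illustrative changeset published by a content provider.
CHANGESET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:ex="http://example.org/changes">
  <url>
    <loc>http://example.org/res/1</loc>
    <lastmod>2013-02-01T10:00:00Z</lastmod>
    <ex:change type="updated"/>
  </url>
  <url>
    <loc>http://example.org/res/2</loc>
    <lastmod>2013-02-01T11:00:00Z</lastmod>
    <ex:change type="deleted"/>
  </url>
  <url>
    <loc>http://example.org/res/3</loc>
    <lastmod>2013-02-01T12:00:00Z</lastmod>
    <ex:change type="created"/>
  </url>
</urlset>
""".strip()


def apply_changes(local_copy, changeset_xml):
    """Bring a {url: lastmod} mapping in line with a published changeset."""
    root = ET.fromstring(changeset_xml)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        change = url.find("ex:change", NS)
        kind = change.get("type") if change is not None else "updated"
        if kind == "deleted":
            local_copy.pop(loc, None)   # drop our stale copy
        else:
            local_copy[loc] = lastmod   # created or updated
    return local_copy


aggregation = {
    "http://example.org/res/1": "2013-01-01T00:00:00Z",
    "http://example.org/res/2": "2013-01-01T00:00:00Z",
}
print(apply_changes(aggregation, CHANGESET))
```

The point of the sketch is that deletions become visible: a plain ‘pull’ harvest only ever sees what is still there, whereas a changeset says explicitly what has gone.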

The sun shone, having no alternative, on the nothing new.

Murphy, Samuel Beckett

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable

Leslie Lamport

a common ‘anti-pattern’

• as a developer, I have no reason to trust that these APIs are any good.

• after all, the service provider doesn’t seem to trust them for their own application....

[Diagram: some aggregated data of broad interest and potential usefulness, served to end-users through the provider’s own UI, with separate APIs offered to future 3rd-party developers and their UIs; legend: certainty, belief, speculation]

a better pattern

• As a developer, I’m more likely to trust this pattern.

• the content provider is using their own API to deliver their own application.

• they have a vested interest!

[Diagram: some aggregated data of broad interest and potential usefulness, exposed through a single API which serves both a 3rd-party app and the provider’s focussed app, each with its own UI and end-users; legend: certainty, belief, speculation]
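The better pattern could be sketched roughly as follows: the provider’s own application and a 3rd-party application both go through the same API function. The records and function names are hypothetical.

```python
# Hypothetical aggregated data held by the content provider.
RECORDS = [
    {"id": 1, "title": "Map of London", "year": 1850},
    {"id": 2, "title": "Map of Bath", "year": 1790},
]


def api_search(term):
    """The public API: the only route to the aggregated data."""
    term = term.lower()
    return [r for r in RECORDS if term in r["title"].lower()]


def provider_ui(term):
    """The provider's own application, built only on api_search()."""
    return "\n".join(f"{r['title']} ({r['year']})" for r in api_search(term))


def third_party_app(term):
    """A 3rd-party application using exactly the same entry point."""
    return sorted(r["year"] for r in api_search(term))


print(provider_ui("bath"))
print(third_party_app("map"))
```

Because `provider_ui` has no private route to `RECORDS`, any fault in `api_search` breaks the provider’s own service first, which is the vested interest a developer is looking for.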

APIs are not best thought of as machine-to-machine interfaces

APIs are interfaces for developers

messages from developers to content-providers

• These are from yesterday’s developer day held here at the BL in support of this summit:

• please don’t build elaborate APIs which do not allow us to see all of the data, or its extent. It’s not that we simply want to download all the data - but we do need to see what we’re dealing with

• if you give us access to incomplete data (perhaps because you’re worried about revealing poor data quality), then we will tend to either abandon our attempts to use it or we will ‘fill in the gaps’ with data from elsewhere. So offering an API which delivers incomplete data is usually self-defeating

• the implicit bargain, made explicit:

  • give us access to the data as soon as possible and we will do some of the work to process it so it is fit for some new purpose - and we will happily share this code with you
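One way an API can let developers ‘see what they’re dealing with’ is to report the full extent of the data alongside every page of results. This is only a sketch; the response fields (`total`, `next_offset`) and function names are assumptions, not a description of any existing service.

```python
def list_records(records, offset=0, limit=2):
    """Return one page of records plus the total extent of the collection."""
    page = records[offset:offset + limit]
    more = offset + limit < len(records)
    return {
        "total": len(records),                      # the extent of the data
        "offset": offset,
        "items": page,
        "next_offset": offset + limit if more else None,
    }


def fetch_all(records, limit=2):
    """Client side: walk every page until the API says there is no more."""
    items, offset = [], 0
    while offset is not None:
        response = list_records(records, offset, limit)
        items.extend(response["items"])
        offset = response["next_offset"]
    return items


data = ["a", "b", "c", "d", "e"]
first_page = list_records(data)
print(first_page["total"], first_page["items"])
print(fetch_all(data))
```

A developer can read `total` from the first response and decide how to proceed, without having to download everything just to find out how much there is.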

Questions for the parallel sessions

1. Which emerging technologies do we need to focus on in 2013?

2. Do we still need to aggregate?

3. What does data quality stop us doing?

Which emerging technologies do we need to focus on in 2013?

• Graphs: context, rather than content, is king

• both Facebook and Google are betting heavily on graph technologies

• closer to home - so are content providers like the BBC

• linking these is an interesting challenge

• databases based on a graph model give the potential for a richer understanding about entities (users!)

• instrumentation in personal devices makes more context available (e.g. geo-location).

Do we still need to aggregate?


yes.


• to address systems/network latency - provide a cache

• to showcase!

• for ‘Web Scale concentration’

• for network effects, if user-facing services are also developed

• to create middleman business opportunities

• as infrastructure to support locally developed services

• as an approach to preservation


What does data quality stop us doing?

• interpreted as: “what does a concern for data quality stop us doing?”

  • it stops us from releasing data early

• interpreted as: “what does poor/uncertain data quality stop us doing?”

  • it erodes trust, which impacts the likelihood of someone doing something worthwhile with our data

• reconciling these concerns is a major challenge for us.

thank you!
