technical coping strategies for resource discovery - paul walk
Technical Coping Strategies for Resource Discovery. Paul Walk's plenary presentation at the Jisc/British Library Discovery Summit 2013, February 2013, London.
Contents
1. a general consideration:
• open or closed
2. a particular challenge:
• synchronisation in an open world
3. the ‘nothing new’, but doing it better:
• APIs that work and can be trusted
a healthy(?) state of tension between open and closed
open and closed worlds
• I’m not talking about licensing or access to data
• open: unbounded, like the Web
• closed: bounded, like most collections management systems, aggregations etc.
• formally, much of what we do is underpinned by ‘open/closed worlds’ assumptions:
• open world assumption: any statement not known to be true is unknown
• closed world assumption: any statement not known to be true is false
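A minimal sketch of the two assumptions in Python. The statements and the names `known`, `ask_closed` and `ask_open` are hypothetical, chosen only for illustration. The point is that the same missing statement is answered "false" in a closed world and "unknown" in an open one.

```python
from typing import Optional, Set, Tuple

# A statement is a (subject, predicate, object) triple.
Statement = Tuple[str, str, str]

# Hypothetical facts held by a single, bounded collection.
known: Set[Statement] = {
    ("book:1", "heldBy", "library:A"),
}


def ask_closed(facts: Set[Statement], statement: Statement) -> bool:
    """Closed world: anything not known to be true is treated as false."""
    return statement in facts


def ask_open(facts: Set[Statement], statement: Statement) -> Optional[bool]:
    """Open world: anything not known to be true is unknown (None)."""
    return True if statement in facts else None


query = ("book:1", "heldBy", "library:B")
print(ask_closed(known, query))  # False
print(ask_open(known, query))    # None
```

A collections management system can reasonably answer with `ask_closed`; an aggregation that only sees part of the Web cannot.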
characteristics of an open world
characteristics of a closed/bounded world
judging where to apply each
• we need our infrastructure (especially integration technology between systems) to be open and relatively unbounded
• the Web is still the best available foundation for this
• however, we still need to manage our resources, maintain quality and honour complex rights management commitments
• we probably need to recognise that users’ experience is often enhanced through the application of a more focussed, targeted and context-aware approach
a particular challenge
synchronisation
• how is the state of the resource maintained across an infrastructure of ‘federated’ repositories?
• if a resource is changed or deleted, how does the right-hand side aggregation know?
• note - this is based on our existing ‘harvesting’ or ‘pull’ approach
[diagram: several Resource Collections and Aggregations; multiple harvest routes, multiple copies]
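To make the problem concrete, here is a sketch of what a ‘pull’ aggregation has to do today if the source publishes no change information: harvest everything again and compare. The function name `diff_harvests` and the id-to-timestamp dictionaries are assumptions for illustration, not part of any existing harvesting tool.

```python
from typing import Dict, NamedTuple, Set


class Changes(NamedTuple):
    created: Set[str]
    updated: Set[str]
    deleted: Set[str]


def diff_harvests(previous: Dict[str, str], current: Dict[str, str]) -> Changes:
    """Compare two full harvests, each mapping resource id -> last-modified stamp."""
    prev_ids, curr_ids = set(previous), set(current)
    created = curr_ids - prev_ids
    # A deletion is only visible as an absence in a complete re-harvest.
    deleted = prev_ids - curr_ids
    updated = {rid for rid in prev_ids & curr_ids if previous[rid] != current[rid]}
    return Changes(created, updated, deleted)


last_week = {"res:1": "2013-01-10", "res:2": "2013-01-11"}
today = {"res:1": "2013-02-01", "res:3": "2013-02-02"}
print(diff_harvests(last_week, today))
```

Every aggregation along every harvest route has to repeat this comparison, and a copy made further downstream only learns of a deletion after each intermediate aggregation has noticed it.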
ResourceSync
• a joint project of NISO and OAI, led by Herbert Van de Sompel of Los Alamos
• a light-weight mechanism to allow the state of web resources to be communicated between web systems
• developing a spec which builds on the sitemap specification, allowing content providers to publish changesets
• draft: http://bit.ly/WYhTz2
• Jisc have funded UK participation in this
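A sketch of how an aggregation might consume a sitemap-style changeset. The XML below is a hand-written example, not taken from the draft: the sitemap namespace is standard, but the `rs:md` element, its `change` attribute and the `rs` namespace URI follow the ResourceSync vocabulary as later published and should be treated as assumptions against the draft. The URLs and the helper names `parse_changes` and `apply_changes` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",  # assumed namespace URI
}

# Hand-written example changeset (assumed shape).
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.org/res/1</loc>
    <lastmod>2013-02-01T10:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.org/res/2</loc>
    <lastmod>2013-02-02T09:30:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
"""

Change = Tuple[Optional[str], Optional[str], Optional[str]]  # (loc, change, lastmod)


def parse_changes(xml_text: str) -> List[Change]:
    """Read each <url> entry and the kind of change reported for it."""
    root = ET.fromstring(xml_text)
    changes: List[Change] = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        changes.append((loc, change, lastmod))
    return changes


def apply_changes(mirror: Dict[str, str], changes: List[Change]) -> None:
    """Bring a local copy (loc -> lastmod) into line with the source."""
    for loc, change, lastmod in changes:
        if loc is None:
            continue
        if change == "deleted":
            mirror.pop(loc, None)
        elif lastmod is not None:
            mirror[loc] = lastmod
```

With a changeset like this, the aggregation learns about an update or a deletion directly, without re-harvesting and comparing everything.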
The sun shone, having no alternative, on the nothing new.
Murphy, Samuel Beckett
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable
Leslie Lamport
a common ‘anti-pattern’
• as a developer, I have no reason to trust that these APIs are any good.
• after all, the service provider doesn’t seem to trust them for their own application....
[diagram: ‘some aggregated data of broad interest and potential usefulness’ with a UI serving end-users, alongside APIs offered to future 3rd-party developers with their own UIs and end-users; legend: certainty, belief, speculation]
a better pattern
• As a developer, I’m more likely to trust this pattern.
• the content provider is using their own API to deliver their own application.
• they have a vested interest!
[diagram: ‘some aggregated data of broad interest and potential usefulness’ behind a single API, used by both a focussed app and a 3rd-party app, each with its own UI and end-users; legend: certainty, belief, speculation]
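The better pattern can be sketched in a few lines of Python. Everything here is hypothetical (`CollectionAPI`, `provider_ui`, `third_party_app` and the sample records); the only point is that the provider's own application has no private route to the data and goes through the same API as everyone else.

```python
from typing import Dict, List


class CollectionAPI:
    """Hypothetical API over the provider's aggregated data."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def search(self, term: str) -> List[Dict[str, str]]:
        """Case-insensitive title search."""
        term = term.lower()
        return [r for r in self._records if term in r["title"].lower()]


def provider_ui(api: CollectionAPI, term: str) -> str:
    """The provider's own focussed app, built only on the public API."""
    hits = api.search(term)
    return "\n".join(f"{r['id']}: {r['title']}" for r in hits)


def third_party_app(api: CollectionAPI, term: str) -> int:
    """A 3rd-party app using exactly the same call."""
    return len(api.search(term))


api = CollectionAPI([
    {"id": "1", "title": "Atlas of Wales"},
    {"id": "2", "title": "Map of London"},
])
print(provider_ui(api, "map"))
print(third_party_app(api, "map"))
```

If `search` breaks or returns poor results, the provider's own users notice first, which is the vested interest a developer is looking for.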
APIs are not best thought of as machine-to-machine interfaces
APIs are interfaces for developers
messages from developers to content-providers
• These are from yesterday’s developer day held here at the BL in support of this summit:
• please don’t build elaborate APIs which do not allow us to see all of the data, or its extent. It’s not that we simply want to download all the data - but we do need to see what we’re dealing with
• if you give us access to incomplete data (perhaps because you’re worried about revealing poor data quality), then we will tend to either abandon our attempts to use it or we will ‘fill in the gaps’ with data from elsewhere. So offering an API which delivers incomplete data is usually self-defeating
• the implicit bargain, made explicit:
• give us access to the data as soon as possible and we will do some of the work to process it so that it is fit for some new purpose - and we will happily share this code with you
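One way an API could answer the first two requests is to report the full extent of the data with every response and to deliver imperfect records rather than hide them. The sketch below assumes a simple offset/limit scheme; the function names, the response fields (`total`, `next_offset`) and the sample records are hypothetical.

```python
from typing import Any, Dict, List, Optional


def list_records(records: List[Dict[str, Any]], offset: int = 0, limit: int = 2) -> Dict[str, Any]:
    """Return one page of records plus the total, so the extent is always visible."""
    page = records[offset:offset + limit]
    more = offset + limit < len(records)
    next_offset: Optional[int] = offset + limit if more else None
    return {
        "total": len(records),
        "offset": offset,
        "items": page,
        "next_offset": next_offset,
    }


def fetch_all(records: List[Dict[str, Any]], limit: int = 2) -> List[Dict[str, Any]]:
    """Walk every page, as a developer would, to see the whole dataset."""
    items: List[Dict[str, Any]] = []
    offset: Optional[int] = 0
    while offset is not None:
        response = list_records(records, offset, limit)
        items.extend(response["items"])
        offset = response["next_offset"]
    return items


# Records with gaps are delivered as they are, not filtered out.
catalogue = [
    {"id": "a", "title": "Atlas of Wales"},
    {"id": "b", "title": None},
    {"id": "c", "title": "Map of London"},
]
print(list_records(catalogue)["total"])
print(len(fetch_all(catalogue)))
```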
Questions for the parallel sessions
1. Which emerging technologies do we need to focus on in 2013?
2. Do we still need to aggregate?
3. What does data quality stop us doing?
Which emerging technologies do we need to focus on in 2013?
• Graphs: Context (not Content) is king
• both Facebook and Google are betting heavily on graph technologies
• closer to home - so are content providers like the BBC
• linking these is an interesting challenge
• databases based on a graph model give the potential for a richer understanding about entities (users!)
• instrumentation in personal devices makes more context available (e.g. geo-location).
Do we still need to aggregate?
yes.
• to address systems/network latency - provide a cache
• to showcase!
• for ‘Web Scale concentration’
• network effects, if user-facing services are also developed
• to create middleman business opportunities
• as infrastructure to support locally developed services
• as an approach to preservation
What does data quality stop us doing?
• interpreted as: “what does a concern for data quality stop us doing?”
• it stops us from releasing data early
• interpreted as: “what does poor/uncertain data quality stop us doing?”
• it erodes trust, which impacts the likelihood of someone doing something worthwhile with our data
• reconciling these concerns is a major challenge for us.