
Toward Replication in Grids for Digital Libraries with Freshness and Correctness Guarantees*

Fuat Akal, Heiko Schuldt and Hans-Jörg Schek

<fuat.akal¦heiko.schuldt>@unibas.ch, [email protected]

University of Basel, Computer Science Department, Bernoullistr 16, CH-4056, Basel, Switzerland

3rd VLDB Workshop on Data Management in Grids, Wien, Austria, 23 September 2007

* The work has been partly supported by the EU in the 6th framework programme within the project DILIGENT (DIgital Library Infrastructure on Grid ENabled Technology, contract No. IST-2003-004260).


Example Scenario

• Satellite pictures of the Mediterranean Sea are continuously taken and stored as complex documents in a Digital Library (DL).

• A typical activity is to generate periodic reports.

[Figure: Earth Observation documents in the DL. Each document consists of an Image, Features and Storage Properties, with Metadata as XML Documents. The DL supports Simple Boolean Queries and Image Similarity Queries.]

Metadata as XML Documents (example):

<DIMAP_DOCUMENT>
  <DATASET>MER_RR__2P</DATASET>
  <INSTRUMENT>MER</INSTRUMENT>
  …
  <LONMIN_INT>17000</LONMIN_INT>
  <LATMIN_INT>12000</LATMIN_INT>
  <LONMAX_INT>22000</LONMAX_INT>
  <LATMAX_INT>13500</LATMAX_INT>
  <COVER_REGIONS>World</COVER_REGIONS>
  <OVERLAP_REGIONS>
    World Europe Bigger_Europe Smaller_Europe
    Mediterranean Iberia North_Atlantic Africa North_Africa Middle_East Portugal
  </OVERLAP_REGIONS>
  ...
</DIMAP_DOCUMENT>
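A simple Boolean query over metadata documents like the one above can be sketched as follows. This is only an illustration in Python: the sample record contains just the fields visible on the slide (the elided parts are left out), and the helper names `parse_metadata` and `matches` are hypothetical, not part of the Digital Library described in the talk.

```python
import xml.etree.ElementTree as ET

# Sample record with only the fields shown on the slide;
# the elided parts of the original document are omitted.
SAMPLE = """
<DIMAP_DOCUMENT>
  <DATASET>MER_RR__2P</DATASET>
  <INSTRUMENT>MER</INSTRUMENT>
  <LONMIN_INT>17000</LONMIN_INT>
  <LATMIN_INT>12000</LATMIN_INT>
  <LONMAX_INT>22000</LONMAX_INT>
  <LATMAX_INT>13500</LATMAX_INT>
  <COVER_REGIONS>World</COVER_REGIONS>
  <OVERLAP_REGIONS>
    World Europe Bigger_Europe Smaller_Europe
    Mediterranean Iberia North_Atlantic Africa North_Africa Middle_East Portugal
  </OVERLAP_REGIONS>
</DIMAP_DOCUMENT>
"""


def parse_metadata(xml_text):
    """Turn one metadata document into a flat dict (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    record = {child.tag: (child.text or "").strip() for child in root}
    # The region list is a run of whitespace-separated tokens.
    record["OVERLAP_REGIONS"] = set(record["OVERLAP_REGIONS"].split())
    return record


def matches(record, instrument, region):
    """Boolean query: INSTRUMENT equals X AND region is among the overlap regions."""
    return record["INSTRUMENT"] == instrument and region in record["OVERLAP_REGIONS"]


record = parse_metadata(SAMPLE)
print(matches(record, "MER", "Mediterranean"))  # True
print(matches(record, "MER", "Pacific"))        # False
```

Image similarity queries would additionally compare the extracted image features, which this sketch does not attempt.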


Watching the Environment Closely

• Monitoring of the Mediterranean Sea

• There are some busy oil terminals in the region
  – Oil tankers keep floating in the sea
  – Potential oil spill into the sea

[Figure: Earth Observation data (satellite images, metadata, image features...) is held in a Data Grid and accessed by two scientists. Both are extremely concerned about the environment!]

• Scientist 1 in Athens, Greece: „I am interested in Greek coasts as of last week“
• Scientist 2 in Antalya, Turkey: „Fresh Turkish water please“


Desired Replica Management in the Grid

[Figure: a Data Grid holding satellite images, metadata, image features... on storage node 0 and the storage nodes sn1, sn2 and sn3. The replicated collections are Entire Mediterranean, Turkish Coasts and Greek Coasts; Greek Coasts appears a second time next to a "Create Replica" action. Users: Scientist 1 in Athens, Greece; Scientist 2 in Antalya, Turkey; Scientist 3 in Thessaloniki, Greece.]

• Assumption: All data is collected at a single node, e.g. ESA in Italy
• Automatic selection of the best replica from the user's location
• Replication at a higher level, e.g. collections, subcollections
• Dynamic decision on when/where to create replicas, e.g. when sn1 becomes a hot spot
• Freshness and correctness guarantees on accessed data are ensured, e.g. „I want up-to-date data“
• Scientists may also 1) write back their reports and/or 2) create versions of documents or annotate them
• A sophisticated replication mechanism is required!


Outline

• Digital Library built atop a grid middleware
  – Rich variety, structure, volume of data, e.g. traditional documents, complex multimedia objects
    • Simple Boolean queries as well as sophisticated multi-feature similarity queries
  – Consistent access to up-to-date data may be essential

• Rest of the talk is...
  – Replication in a DB Cluster
  – Transition from a DB cluster to the Grid
  – DILIGENT Replication Architecture
  – Conclusions and Outlook


Replication in a DB Cluster (PDBREP)

• Available replication solutions for grid environments do not meet all of the desired properties just mentioned, e.g. freshness and correctness.

• In our previous work [VLDB2005], we devised a replication protocol for database clusters named PDBREP.
  – It already provides some properties of what we call desired replica management in the Grid, e.g. freshness, higher replication granularity.

• Our approach in this work is to start with this protocol and adapt it to the grid.

PDBREP stands for PowerDB Replication, which was a project conducted at ETH Zurich partially supported by Microsoft.


Replication in a DB Cluster (PDBREP)

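[Figure: PDBREP architecture. Updates (U: update(a)) and queries (Q: query(a, b, fr)) enter through the Coordination Middleware. Updates are executed at the Update Node(s) (w(a)) and recorded in the Global Log; queries run on the Read-only Nodes, possibly as distributed query execution (r(a), r(b)). The nodes in the figure hold different subsets of the data items a, b, c, d.]

• Continuous Update Broadcast to the Local Update Queue of each read-only node
• Continuous Update Propagation Transactions (only when the node is idle)
• Refresh Transactions (on-demand)
• fr: freshness requirement, e.g. „I am fine with 2 minutes old data“, „I want fresh data“ etc.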
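The interplay of local update queues, propagation and on-demand refresh can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDBREP implementation: it assumes that the freshness requirement is expressed as a maximum age in seconds (as in "2 minutes old data") and that updates carry commit timestamps; the class `ReadOnlyNode` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ReadOnlyNode:
    data: dict = field(default_factory=dict)
    # Local update queue holding (timestamp, key, value) in global-log order.
    queue: list = field(default_factory=list)

    def enqueue(self, update):
        """Receive one broadcast update; it is not applied yet."""
        self.queue.append(update)

    def propagate(self, up_to_ts):
        """Apply queued updates with timestamp <= up_to_ts, in log order.

        An idle node would call this in the background; a query calls it
        on demand (refresh) when the node is too stale.
        """
        while self.queue and self.queue[0][0] <= up_to_ts:
            _, key, value = self.queue.pop(0)
            self.data[key] = value

    def query(self, keys, max_age, now):
        """Read keys after making the node at most max_age seconds stale."""
        self.propagate(now - max_age)  # on-demand refresh, assumed policy
        return {k: self.data.get(k) for k in keys}


node = ReadOnlyNode()
for update in [(10.0, "a", 1), (50.0, "a", 2), (115.0, "a", 3)]:
    node.enqueue(update)

relaxed = node.query(["a"], max_age=60.0, now=120.0)
strict = node.query(["a"], max_age=0.0, now=120.0)
print(relaxed)  # {'a': 2}
print(strict)   # {'a': 3}
```

A query with a relaxed requirement only pays for the updates it needs, while a query asking for fresh data forces the remaining queue entries to be applied first.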


Transition to the Grid

[Figure: Updates and Queries enter through the Coordination Middleware, which sits in front of the Update Node(s) and the Read-only Nodes; updates are recorded in the Global Log.]

• We still distinguish update and read-only nodes
• Potentially several update nodes
  – We still assume that all updates are serialized into a global log
• Broadcast of updates not feasible, replicas subscribe for changes instead
• Service Oriented Architecture
• More nodes which are heterogeneous
• Failures are more likely to happen
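Replacing broadcast by subscription can be sketched as follows. This is a minimal in-memory illustration, assuming that replicas subscribe per data item and that the global log pushes each serialized update only to the subscribers of that item; `GlobalLog`, `Replica` and their methods are hypothetical names, and a real deployment would use service calls between grid nodes.

```python
from collections import defaultdict


class Replica:
    def __init__(self, name):
        self.name = name
        self.update_queue = []  # updates waiting to be applied locally


class GlobalLog:
    def __init__(self):
        self.entries = []                     # serialized order of all updates
        self.subscribers = defaultdict(list)  # data item -> subscribed replicas

    def subscribe(self, item, replica):
        self.subscribers[item].append(replica)

    def append(self, ts, item, write):
        """Serialize one update and forward it to interested replicas only."""
        entry = (ts, item, write)
        self.entries.append(entry)
        for replica in self.subscribers.get(item, ()):
            replica.update_queue.append(entry)


log = GlobalLog()
r1, r2 = Replica("r1"), Replica("r2")
log.subscribe("a", r1)
log.subscribe("b", r2)

log.append(1, "a", "w(a)")
log.append(2, "b", "w(b)")
log.append(3, "a", "w(a)")

print(len(r1.update_queue), len(r2.update_queue))  # 2 1
```

Each replica receives only the changes it asked for, in the order fixed by the global log, so the number of messages no longer grows with the total number of nodes.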


Replication Granularity

• The unit of replication is called a DataSet (DS)
  – A DataSet can be a collection of documents, a subcollection or as small as a single document.
  – Rule based definition: information on a specific region, documents not older than 30 days, created between date1 and date2, etc...

[Figure: a collection of satellite images and its metadata, divided into Subcollection 1 and Subcollection 2 and into the regions Entire Mediterranean, Turkish Coasts and Greek Coasts; DataSet1 and DS2 mark replication units within it.]
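A rule-based DataSet definition can be sketched as a conjunction of predicates over documents. The sketch below is illustrative only: the `Document` fields, the rule constructors and the sample documents are assumptions made for the example, not the DILIGENT data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Document:
    doc_id: str
    regions: frozenset
    created: date


# Each rule constructor returns a predicate taking (document, today).
def in_region(region):
    return lambda doc, today: region in doc.regions


def not_older_than(days):
    return lambda doc, today: today - doc.created <= timedelta(days=days)


def created_between(start, end):
    return lambda doc, today: start <= doc.created <= end


def dataset(docs, rules, today):
    """Members of a DataSet: the documents that satisfy every rule."""
    return [d.doc_id for d in docs if all(rule(d, today) for rule in rules)]


today = date(2007, 9, 23)
docs = [
    Document("img-1", frozenset({"Mediterranean", "Greek_Coasts"}), date(2007, 9, 10)),
    Document("img-2", frozenset({"Mediterranean", "Turkish_Coasts"}), date(2007, 9, 20)),
    Document("img-3", frozenset({"Mediterranean", "Greek_Coasts"}), date(2007, 6, 1)),
]

recent_greek = dataset(docs, [in_region("Greek_Coasts"), not_older_than(30)], today)
september = dataset(docs, [created_between(date(2007, 9, 1), date(2007, 9, 30))], today)
print(recent_greek)  # ['img-1']
print(september)     # ['img-1', 'img-2']
```

Because membership is computed from rules, a DataSet such as "not older than 30 days" changes over time without anyone editing it explicitly.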


DILIGENT Grid Replication Architecture

[Figure: a Client, the components RMS, RSS and FTS, and five storage nodes (sn 1 to sn 5) holding replicas of DS1 to DS4. Updates are kept in an Update Queue (entries of the form ..., TSx, Wx, DSy, ...) and reach the replicas by subscription and continuous propagation.]

Replica Catalog (DataSet : storage nodes)
  DS1 : 1
  DS2 : 2, 3
  DS3 : 5
  DS4 : 4

Freshness Repository (DataSet : <storage node, freshness>)
  DS1 : <1, 0.7>
  DS2 : <2, 0.6>, <3, 0.7>
  DS3 : <5, 0.6>
  DS4 : <4, 0.6>

Load Repository
  SN1 : 50%
  SN2 : 25%
  SN3 : 60%
  SN4 : 30%
  SN5 : 50%

Request flow:
  (1) Client: Read(DS2(x), DS4(y), 0.6)
  (2.1) Locate best Replicas
  (2.2), (2.3) [arrows without captions]
  (3) Read Data
  (4) Log (Access History)
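Locating the best replicas for the request Read(DS2(x), DS4(y), 0.6) can be sketched with the catalog, freshness and load values shown on the slide. The selection policy is an assumption made for illustration: a replica qualifies if its freshness value is at least the requested one (higher meaning fresher), and among the qualifying replicas the least loaded node wins. The function name `locate_best_replicas` is hypothetical.

```python
# Values copied from the slide's repositories.
REPLICA_CATALOG = {"DS1": [1], "DS2": [2, 3], "DS3": [5], "DS4": [4]}
FRESHNESS = {
    ("DS1", 1): 0.7,
    ("DS2", 2): 0.6,
    ("DS2", 3): 0.7,
    ("DS3", 5): 0.6,
    ("DS4", 4): 0.6,
}
LOAD = {1: 0.50, 2: 0.25, 3: 0.60, 4: 0.30, 5: 0.50}


def locate_best_replicas(datasets, required_freshness):
    """Pick one storage node per DataSet, or None if no replica is fresh enough."""
    plan = {}
    for ds in datasets:
        candidates = [
            sn for sn in REPLICA_CATALOG[ds]
            if FRESHNESS[(ds, sn)] >= required_freshness
        ]
        # Assumed policy: least loaded node among sufficiently fresh replicas.
        plan[ds] = min(candidates, key=LOAD.get) if candidates else None
    return plan


print(locate_best_replicas(["DS2", "DS4"], 0.6))  # {'DS2': 2, 'DS4': 4}
print(locate_best_replicas(["DS2", "DS4"], 0.7))  # {'DS2': 3, 'DS4': None}
```

In the second call no replica of DS4 is fresh enough; in a PDBREP-style system this is the case where a replica would first have to be refreshed on demand.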


Conclusions & Outlook

• We presented the first steps of our ongoing work, whose ultimate goal is a fully integrated and self-managing replication subsystem for the Grid

• We want to adapt an existing database replication mechanism, PDBREP, from database clusters to data grids

• This looks feasible:
  – Infrastructure-related assumptions, like broadcasting of changes to replicas, can easily be replaced by a subscription mechanism
  – Additional components presented in the envisioned architecture to facilitate scheduling of queries can be included in PDBREP without requiring major changes.

• Implementation of the DILIGENT replication on top of gLite is still ongoing


Thank you! Questions?


References

1. DILIGENT: A DIgital Library Infrastructure on Grid ENabled Technology. http://www.diligentproject.org/. IST-2003-004260

2. F. Akal, C. Türker, H.-J. Schek, Y. Breitbart, T. Grabs, and L. Veen. Fine-Grained Replication and Scheduling with Freshness and Correctness Guarantees. In VLDB, pages 565–576, 2005.